
Getting Started

Before You Start

The Getting Started guide that follows leverages the standalone punchplatform. Make sure you get the latest one from our download area.

In minutes you will be able to run, operate, test and develop various use cases. The standalone punchplatform is also very useful in operation, coupled with your production platform, to run prototype or test setups either in sandbox mode or directly on the production traffic.

Danger

What you cannot do is go to production with the standalone; that is not its purpose. For production, use the official deployment packages. They take care of properly installing the punchplatform components. The PunchPlatform team will provide production support only on versions installed using the official deployment packages.

Requirements

At least 8 GB of memory. The RAM requirement will depend on the number and complexity of the storm/spark topologies you will be running (e.g. input and output channels, tenants, ...).

The standalone runs on the following systems:

  • MacOS Sierra 10.12 or later,
  • Ubuntu 16.04 LTS or later (64bits),
  • Debian 7 or later (64bits),
  • CentOS/RHEL 7 or later (64bits)

In all cases check if you have the required dependencies:

  • bash
  • jq 1.5 (or above)
  • curl
  • python 2.7 with virtualenv and pip modules 'Jinja2' and 'PyYAML' installed
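
A quick way to verify these prerequisites is to ask each tool for its version (the python binary name may differ on your system, adjust accordingly):

bash --version
jq --version
curl --version
python2.7 --version
virtualenv --version
pip show Jinja2 PyYAML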

Mac OS

Note

We assume that Homebrew is already installed on your machine.

brew install jq curl
sudo easy_install pip
pip install virtualenv PyYAML Jinja2

Ubuntu/Debian

sudo apt install curl jq python-minimal openjdk-8-jdk unzip python-pip python-virtualenv
pip install PyYAML Jinja2

CentOS/RedHat

sudo yum install zip unzip wget
sudo curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py" && sudo python get-pip.py
sudo pip install virtualenv PyYAML Jinja2
sudo wget https://github.com/stedolan/jq/releases/download/jq-1.5/jq-linux64 -O /usr/bin/jq && sudo chmod +x /usr/bin/jq
sudo yum install java-1.8.0-openjdk-devel

Important

A troubleshooting procedure for the standalone is available at Troubleshooting your standalone

Congratulations, we can now get started using the PunchPlatform!

Tutorial Tracks

The standalone platform provides a number of ready-to-run scenarios. Before you start with one, here is a quick overview.

The examples are part of an example tenant called "mytenant". Remember the PunchPlatform is multi-tenant. Everything you do is defined as part of a well-defined and isolated tenant.

This tutorial is divided into three parts:

  • 10 minutes tour: install the standalone platform, and run some preconfigured punchlets, channels, storm and spark jobs
  • 15 minutes tour: see how you can plug Kafka in.
  • 20 minutes tour: a quick tour of platform administration.

Ten Minutes Tour

This chapter assumes you have already downloaded the standalone PunchPlatform archive and unzipped it: PunchPlatform Download Area. That is, you should have a punchplatform-standalone-x.y.z directory on your filesystem.

Important

You must install the Standalone as a non-root user. The PunchPlatform should be installed/run under your standard user account.

cd punchplatform-standalone-x.y.z
./install.sh -s

This unarchives all our friends (Kafka, Storm, Elasticsearch, Spark etc...), and patches their configuration files (cleanly) so that you have a simple local PunchPlatform setup. Any file related to the PunchPlatform will be installed under this directory, nowhere else on your machine. When a prompt asks you to patch your environment, if you answer "yes", it will only add a few lines at the end of your ~/.bashrc (on Linux) or ~/.bash_profile (on MacOS) to update your path variable. You then only need to start a new terminal.
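
If you answered "yes" and are curious about what was appended, simply look at the end of your shell profile:

tail -n 10 ~/.bashrc        # Linux
tail -n 10 ~/.bash_profile  # MacOS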

If you prefer not to impact your environment, just say no. To set up your environment you must then execute the following command every time you start a new terminal:

source ./activate.sh

Assuming you are happy with a local setup, you can start all of the standalone platform components (Zookeeper, Storm, Elasticsearch, Kibana, Spark).

punchplatform-standalone.sh --start

Check if your standalone version is running properly:

punchplatform-standalone.sh --status

One last step: load some Kibana resources provided with the punch:

punchplatform-setup-kibana.sh --import

Success

You now have an up-and-running PunchPlatform installed.

Enter Kibana

You may not be familiar yet with Elasticsearch and Kibana. Before even trying out the punch features, it is a good idea to simply visit your local Kibana at http://localhost:5601.
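
If you like the command line, you can also quickly check that both services answer. This is only a sketch: the Kibana port 5601 is the one given above, and Elasticsearch is assumed to listen on its default port 9200:

# should return a small JSON banner from Elasticsearch
curl -s http://localhost:9200
# should return 200 once Kibana is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5601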

System Monitoring

On the left-hand panel, select the Dashboard menu. You will see there a number of dashboards, ready to be visualised. Find and select the Metricbeat System Overview dashboard. You should see something like this:

image

The metricbeat dashboards let you visualise the metrics of your computer hardware: cpu usage, disk usage, memory usage, etc. These metrics are generated by the Metricbeat installed as part of your standalone. You can see it running by typing the following command:

punchplatform-standalone.sh --status

or even simpler:

punchplatform-metricbeat.sh --status

The metricbeat collects various system and monitoring metrics and forwards them to Elasticsearch. You then visualise these through a Kibana dashboard.
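
You can also check that these metrics are indeed landing in Elasticsearch. The index pattern below is an assumption (Metricbeat uses a metricbeat-* naming scheme by default), and Elasticsearch is assumed to listen on its default port 9200:

curl -s 'http://localhost:9200/_cat/indices/metricbeat-*?v'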

Tip

The so-called Beats are the Elastic agents in charge of collecting various events (windows, network, host, files, audit). What you see here in action is the Metricbeat. Metricbeats are extensively used in the punch. They are deployed as part of the punchplatform setup and provide you with a complete view of your servers.

Audit Data

Let us now explore yet another beat: the Auditbeat. It monitors user activity and processes. Auditbeat communicates directly with the Linux audit framework and sends the events to the Elastic Stack in real time.

Because the auditbeat requires root privilege, it is not started automatically. Here is how you can start it:

cd $PUNCHPLATFORM_CONF_DIR/../external/auditbeat-*/
sudo chown root auditbeat.yml

# load the auditbeat dashboards (you can skip this step if you don't want the auditbeat dashboards)
# this step may take up to 1 minute
sudo ./auditbeat setup -e

# On Linux, there is an extra step: you must choose your architecture
# For example, on a 64-bit computer, delete the unnecessary 32-bit configuration files
# Otherwise, delete the 64-bit files.
rm audit.rules.d/*-32bit.conf

# Go for it !
sudo ./auditbeat -c auditbeat.yml -e

You can now visit the "[Auditbeat] File Integrity" dashboard. Have fun discovering what you can learn from such a tool.

image

Tip

When you look for a dashboard, use the top-level search box. Simply type 'Aud' and it will automatically list the available auditbeat dashboards.

Running a Punchlet

Now that you have a sense of what Elasticsearch, Kibana and Beats can do, let us move on to punch features. First we will explore punchlets.

A punchlet is a small function in charge of transforming your data. A typical example is log parsing. If you are familiar with logstash, think of a punchlet as the filter part of a logstash configuration.

The standalone ships with simple examples. Run one as follows:

cd $PUNCHPLATFORM_CONF_DIR/examples/punch
punchplatform-puncher.sh operators_ipmatch.punch

You will get:

{
  "check": true,
  "logs": {
    "log": "172.16.0.2"
  }
}
{
  "check": false,
  "logs": {
    "log": "5.36.18.2"
  }
}

The code of that particular punchlet is quite simple. It checks if an IP address belongs to some defined range.

{
  Tuple ranges = getResourceTuple("ranges");
  [check] = ipmatch(getResourceTuple("ranges")).contains([logs][log]);
}

Have a look at that example file as well as the other examples; they are self-explanatory. The Punch language is powerful and is provided with a complete documentation. All in all, simply understand that you can write simple pieces of code using our punch language. You will see later on how to invoke it from various stream or batch applications.

Running a Topology

The next concept to understand is the topology. A topology is a data pipeline, configured to fetch or receive data, process it and push it downstream. Why not just run one?

cd $PUNCHPLATFORM_CONF_DIR/examples/topologies/files/csv_to_stdout_and_elasticsearch

Have a look at the csv_to_stdout_and_elasticsearch.json file. It is a topology file. It reads data from the local file AAPL.csv, then calls a small processing function AAPL.punch (a punchlet) to convert CSV into JSON, prints that JSON to standard output and inserts it into an elasticsearch cluster.
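
Since jq is one of the standalone prerequisites, a convenient way to get a nicely indented view of that topology file is:

jq '.' csv_to_stdout_and_elasticsearch.json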

Go for it !

punchplatform-topology.sh csv_to_stdout_and_elasticsearch.json

You will get:

# [...]
{
  "logs": {
    "log": {
      "High": 172.0,
      "Low": 170.059998,
      "Volume": 171.509995,
      "Adj Close": 37687500,
      "Close": 171.509995,
      "Date": "2018-01-26",
      "Open": 172.0
      }
    }
}
# Ctrl+C to quit

The data displayed on your terminal comes from the punchlet. Notice the print function in the code below. The punchlet in action simply transforms a CSV input string into a JSON document which in turn will be indexed into elasticsearch.

{
// Take care of the formatting line, not containing actual data
if ([logs][log].contains("Date")) {
  root.empty();
  return;
}

// Use the csv operator to do the job
if (!csv("Date","Open","High","Low","Close","Volume","Adj Close")
     .delim(",")
     .inferTypes()
     .on([logs][log])
     .into([logs][log])) {
     raise("unexpected format");
}

// Print the result. Do not do this in production!
print(root);
}

At this point, you should have access to your data within Kibana. We will take this opportunity to get familiar with what Kibana has to offer, i.e. visualising our data.

  1. Open the kibana GUI in your browser http://localhost:5601
  2. Navigate to the Dev Tools tab on the left side.
  3. Execute the first request shown there: GET /_cat/indices?v. You should see one line with a stock-YYYY-MM-DD index, the one that holds our stock price data.
  4. Now, navigate to the "Management" tab. Click on "index patterns" to create a new index pattern. Use the pattern stock-* and click on "next step". Here, select "@timestamp" as time filter field name.
  5. Navigate to the "Discover" tab. On the Discover menu, at the top, select the stock-* index using the drop-down menu and choose a large time scope.
  6. Now, you should see your data. Congratulations!
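
If you prefer the command line, step 3 above has a direct curl equivalent (assuming Elasticsearch listens on its default port 9200):

curl -s 'http://localhost:9200/_cat/indices/stock-*?v'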

Mastering Elasticsearch and Kibana is ultimately required to get the best out of your punchplatform. If you are a newcomer, you can consider that an index in elasticsearch is equivalent to a table in a traditional SQL database. Each document you see represents a row in that table, and each field a column. What Kibana does is allow you to explore and visualise your data easily.

As you see, topologies are quite simple to understand. They are (very) powerful. You can do all sorts of stream computing with them. Now that you have a good understanding of topologies, let's move on.

Running a Spark Job

Let's move on to the spark world with the spark job concept. First go to the following folder:

cd $PUNCHPLATFORM_CONF_DIR/examples/pml/files

There you will find a spark pipeline example that performs a very simple operation: it reads in a csv file and prints it to stdout.

Run it using the following command:

punchplatform-analytics.sh --job csv_files_to_stdout.pml

You should see the result below:

# [...]
component: input_default
+----------+----------+----------+----------+----------+----------+--------+
|       _c0|       _c1|       _c2|       _c3|       _c4|       _c5|     _c6|
+----------+----------+----------+----------+----------+----------+--------+
|      Date|      Open|      High|       Low|     Close| Adj Close|  Volume|
|2017-12-28|171.000000|171.850006|170.479996|171.080002|171.080002|16480200|
|2017-12-29|170.520004|170.589996|169.220001|169.229996|169.229996|25999900|
+----------+----------+----------+----------+----------+----------+--------+
only showing top 3 rows

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)

There you have it. What this simple example shows is how simple, concise and clear it is to design an arbitrary spark pipeline using the punch. Better: the punch ships with a graphical pipeline editor. Check out http://localhost:5601/app/punchplatform.

Running a Channel

Here is now the third concept, the channel. A channel groups several applications (called jobs) into a consistent and useful unit. Once you have defined a channel, you can start or stop it. All its jobs will be started or stopped accordingly.

Stream processing

A job can be a streaming or a batch application. Let us consider the simplest channel you can think of, composed of simple streaming jobs that continuously parse logs received on a TCP socket and index them into Elasticsearch once transformed into normalised and enriched JSON data. We will go through that usage now.

Start again from the configuration directory. You will find some ready-to-use log channels defined for the mytenant example tenant.

ls $PUNCHPLATFORM_CONF_DIR/tenants/mytenant/channels
admin  aggregation  apache_httpd  sourcefire  stormshield_networksecurity  universal  websense_web_security

As you probably guessed, each channel deals with the corresponding log type, except for the universal channel that we will explain next. To start channels, launch the punchctl command line tool.

punchctl

You have multiple tenants? Use these options: punchctl --tenant mytenant or punchctl -t mytenant

From there you have autocompletion. All the commands are documented. Try the start --channel one. If you prefer to start the channel directly you can type in

punchctl:mytenant> start --channel apache_httpd
job:storm:apache_httpd/main/single_topology  (mytenant_apache_httpd_single_topology-7-1557926010) ....................... ACTIVE
job:storm:apache_httpd/main/archiving_topology  (mytenant_apache_httpd_archiving_topology-8-1557926011) ................. ACTIVE

Check their running status using

punchctl:mytenant> status
channel:stormshield_networksecurity ..................................................................................... STOPPED
channel:admin ........................................................................................................... STOPPED
channel:sourcefire ...................................................................................................... STOPPED
channel:aggregation ..................................................................................................... STOPPED
channel:websense_web_security ........................................................................................... STOPPED
channel:universal ....................................................................................................... STOPPED
job:storm:apache_httpd/main/archiving_topology  (mytenant_apache_httpd_archiving_topology-8-1557926011) ................. ACTIVE
job:storm:apache_httpd/main/single_topology  (mytenant_apache_httpd_single_topology-7-1557926010) ....................... ACTIVE
channel:apache_httpd .................................................................................................... ACTIVE

Feel free to explore the various punchctl commands; autocompletion and inline documentation are your friends. In particular it is important to understand:

punchctl:mytenant> status --help

Good to know: you can also use non-interactive variants. Hit Ctrl-C or Ctrl-D to exit the punchctl shell. Then simply type:

punchctl status

Once your channel(s) are running you can inject logs. To do that the punch provides you with an injector tool. You can start it by executing the command below:

punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/resources/injector/mytenant/apache_httpd_injector.json

You are now sending generated (fake) apache httpd logs to your channel. They will be parsed, normalised and indexed into Elasticsearch. Check Kibana (http://localhost:5601) and the Storm UI (http://localhost:8080) to monitor what is going on.
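
You can also verify from the command line that parsed logs are being indexed. The mytenant-events-* index pattern is the one used later in this tour; Elasticsearch is assumed to listen on its default port 9200:

curl -s 'http://localhost:9200/_cat/indices/mytenant-events-*?v'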

If you want to generate more types of logs, simply type in the following command to start all the injector files found, at once:

punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/resources/injector/mytenant

This will inject all the logs defined in the $PUNCHPLATFORM_CONF_DIR/resources/injector/mytenant folder. When you are done, stop the injection with Ctrl-C and stop your channels. To do that you can again use the punchctl interactive tool, or simply type in:

punchctl stop

Batch processing

Now that you are comfortable with streaming, let's move on to batch processing. We will run a continuous aggregation channel based on a PML Plan (Spark). This aggregation is executed each minute and fetches all the logs stored in the mytenant-events-* Elasticsearch index over the last minute. Each minute, we want to compute:

  1. how many bytes have been written to this index
  2. what was the size (in bytes) of the biggest log

Before running the aggregation, we need to provide some data. To do so, let's start two channels with the punchctl and inject some logs.

punchctl:mytenant> start --channel sourcefire
job:storm:sourcefire/main/single_topology  (mytenant_sourcefire_single-9-1557930620) .................................... ACTIVE
punchctl:mytenant> start --channel websense_web_security
job:storm:websense_web_security/main/single_topology  (mytenant_websense_web_security_single-10-1557930646) ............. ACTIVE

Now that channels are running, let's inject some logs:

punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/resources/injector/mytenant

It is important to keep injecting logs in real time because the aggregation will only fetch the last minute's logs. Keep the log injector running and start a new terminal. From the new terminal, type in this command to start the aggregation:

# From another terminal
punchctl start --channel aggregation
job:shiva:aggregation/common/plan-example-1  (punchplatform/mytenant/channels/aggregation/plan-example-1) ............... ACTIVE

Wait about a minute, the time for the first aggregation to complete. Then, a new Elasticsearch index should show up with the name mytenant-aggregations-YYYY.MM.DD. Add this new index pattern to Kibana and see the results. The documents have the following fields:

{
  "_index": "mytenant-aggregations-2019.05.15",
  "_type": "_doc",
  "_id": "QmHvu2oBm9lH_e9QjytC",
  "_version": 1,
  "_score": null,
  "_source": {
    "total_size_value": 1013339,
    "max_size_value": 298,
    "key": "sourcefire",
    "timestamp": "2019-05-15T16:40:00.148+02:00"
  },
  "fields": {
    "timestamp": [
      "2019-05-15T14:40:00.148Z"
    ]
  },
  "sort": [
    1557931200148
  ]
}

As you can see, we get an overview of the total log size and the largest log size over the last minute, grouped by technology vendor. Note that one event is generated per vendor each minute. The vendor can be found in the field "key". In this example, the vendor is "sourcefire".
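
To peek at these aggregation documents without Kibana, you can also query Elasticsearch directly. This is just a sketch assuming Elasticsearch listens on its default port 9200; it fetches one document for the sourcefire vendor:

curl -s 'http://localhost:9200/mytenant-aggregations-*/_search?q=key:sourcefire&size=1&pretty'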

To stop everything, run the following commands:

# first, stop the aggregation channel
punchctl stop --channel aggregation

# then, stop any existing channel
punchctl stop

Congratulations! You are now ready for high performance stream or batch processing pipelines!

Fifteen Minutes Tour

Kafka

If you are not familiar with Kafka, follow this short tour. Kafka is a message broker in which you can produce (i.e. publish) and consume records. The standalone punch has a Kafka running already. All you have to do is create a kafka topic and try producing and consuming records. We will do that using the handy standalone tools. First let us create a topic:

punchplatform-kafka-topics.sh --create --kafkaCluster local --topic test_topic --replication-factor 1 --partitions 1
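
If you want to double-check that the topic exists, the wrapper script presumably forwards the standard kafka-topics options; assuming a --list option is supported (check the script's --help otherwise), you can type:

punchplatform-kafka-topics.sh --list --kafkaCluster local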

Each topic can be defined with some level of replication and a number of partitions. These are Kafka concepts. Next let us fill our topic with 1000 apache logs. To do that, go to the examples/injector folder and simply launch:

punchplatform-log-injector.sh -c kafka_apache_httpd_injector.hjson

Have a look at the kafka_apache_httpd_injector.hjson file. It is self-explanatory. Remember the punchplatform-log-injector.sh tool is extremely powerful and enables you to produce arbitrary data that you can in turn send to kafka, elasticsearch, topologies, etc.

Let us now check our messages are in our topic, as expected. You can again use the punch injector, but this time in consumer mode:

punchplatform-log-injector.sh --kafka-consumer -topic raw_apache_httpd -brokers local -earliest

It should show the expected number of records. Try it also using -v.

Templates

Now that you are familiar with some of the most important concepts used in the PunchPlatform, let's try to create a channel. To create new channels you have two options. First, you can refer to the spouts and bolts documentation and write your own. A second option is to work with templates to ease the generation of the channel configuration files.

Here is how this second option works. To generate channel configuration files you need

  1. a channel high-level configuration json file: in there you define only the most important properties of your channel. A typical example is the listening (tcp) port, the punchlets and the output elasticsearch cluster.
  2. template files to generate the detailed configuration files: these are .j2 jinja2 files, one for each required channel configuration file.

Trying Kafka

Have a look at the tenants/mytenant/etc/channel_config folder. There you will find the channel high level configuration json files.

Next, have a look at the tenants/mytenant/etc/templates folders. One (single) can be used to generate the example channels you just executed. The second (input_kafka_processing) is a variant to generate channels made of two topologies with a Kafka topic in between.
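
A quick way to see what is in there:

ls $PUNCHPLATFORM_CONF_DIR/tenants/mytenant/etc/channel_config
ls $PUNCHPLATFORM_CONF_DIR/tenants/mytenant/etc/templates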

# make sure you are at the top of your platform configuration folder
cd $PUNCHPLATFORM_CONF_DIR

# Stop your channels
punchplatform-channel.sh --stop mytenant

# Re-generate your apache channel using the input_kafka_processing template
punchplatform-channel.sh \
  --configure tenants/mytenant/etc/channel_config/apache_httpd_channel.json \
  --profile input_kafka_processing
# answer yes to override your current channel generation

# ready to go, restart your channel
punchplatform-channel.sh --start mytenant/apache_httpd

# inject some data
punchplatform-log-injector.sh -c resources/injector/mytenant/apache_httpd_injector.json

Go have a quick look at the generated channel files. You should easily find out that your channel is now composed of two topologies: the first one pushes the data to a Kafka topic, the second one consumes that topic to parse the logs and insert them into elasticsearch. An easy way to visualise this new setup is to visit the Storm UI on http://localhost:8080. You should see your two topologies.

Note

In the tenants/mytenant/channels/apache_httpd folder, have a quick look at the channel_structure.json file. This is the one that defines the overall structure of your channel. Compare it to the original sourcefire channel.
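
A simple way to do that comparison, assuming the sourcefire channel follows the same layout:

cd $PUNCHPLATFORM_CONF_DIR
diff tenants/mytenant/channels/apache_httpd/channel_structure.json \
     tenants/mytenant/channels/sourcefire/channel_structure.json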

This concludes our fifteen minutes tour. In order to come back to the original single channel layout, simply type in:

punchplatform-channel.sh --stop mytenant
# the -f option is to force the generation without asking you to confirm
punchplatform-channel.sh -f \
  --configure tenants/mytenant/etc/channel_config/apache_httpd_channel.json \
  --profile single

Twenty Minutes Tour

This short chapter explains how you, the platform operator, deal with the essential platform management issues. Operators act upon the platform using only four essential commands. This is depicted next:

image

The yellow server is the administration server. By now you already know the punchctl command. We will now have a look at the punchctl configuration command.

The principles are quite simple to understand. A first piece of good news: there is almost no difference between managing a standalone platform (such as the one you now have running on your machine) and a full-fledged distributed production platform.

To save or load configurations, you use respectively configuration push and configuration pull. The punch relies on zookeeper for maintaining configuration information. Saving or loading the configuration is as simple as executing a push or a pull:

image

It cannot be simpler. Let us see this in action on the standalone.

Configuration Save

The first thing you should do is to save your tenant configuration. The configuration is a set of files stored both in your $PUNCHPLATFORM_CONF_DIR folder and in ZooKeeper. Saving it is important for two reasons:

  1. it will be saved and replicated within ZooKeeper. ZooKeeper is used to ensure configuration high-availability. In other words, losing a server configured with some specific parameters does not matter anymore, as you will be able to easily start a new one with the exact same configuration.
  2. once the configuration is stored in ZooKeeper, it will be available to all platform servers and components that require it.

Go for it, simply type in:

punchctl configuration --push

This command flushes the content of the $PUNCHPLATFORM_CONF_DIR/tenants/mytenant folder to ZooKeeper.

Configuration Restore

In order to fetch the saved configuration you simply execute:

punchctl configuration --pull

That will simply fetch from ZooKeeper the saved configuration and write it back into $PUNCHPLATFORM_CONF_DIR.

What Next ?

Check out the examples folder of your standalone platform. There you will find several self-documented, ready-to-run examples: topologies or spark pipelines.

The punchplatform relies on a few simple but important concepts and commands. We suggest a quick read to better understand these; start here.

Troubleshooting

If an error happens, please have a look at the Troubleshooting section. There is a section dedicated to the common configuration errors.

Uninstall

Do this only if you want to uninstall your PunchPlatform !

First, stop the PunchPlatform.

punchplatform-standalone.sh --stop

Execute the uninstall.sh script.

cd $PUNCHPLATFORM_CONF_DIR/../bin
./uninstall.sh

Optionally, save any configuration you have in $PUNCHPLATFORM_CONF_DIR, and remove the punchplatform-standalone-x.y.z directory.
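
For instance (the backup destination is just an example, and $PUNCHPLATFORM_CONF_DIR is assumed to still be set in your shell):

cp -r $PUNCHPLATFORM_CONF_DIR ~/punchplatform-conf-backup
rm -rf punchplatform-standalone-x.y.z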

Patch

This procedure only applies to the standalone, not to a production environment.

First, download a patch by filling the form at https://punchplatform.com/download-area/patch-delivery-area/ . You will shortly receive an email containing the download link.

Then, apply the patch to your standalone:

cd $PUNCHPLATFORM_CONF_DIR/../external/punchplatform-operator-environment-*
mkdir patch

Copy the recently downloaded jar into the patch directory, then simply restart your channel:

punchplatform-channel.sh --stop mytenant/apache_httpd
punchplatform-channel.sh --start mytenant/apache_httpd

By performing this procedure, you have updated the punchplatform jars and restarted the apache channel on the fly. A similar procedure exists for production setups.