This brief chapter highlights the key concepts that will help you understand what the Punchplatform Machine Learning (PML) feature is, and why we built it.
The goal of PML is to let you leverage Spark, so a basic understanding of Spark is required first. This short section gives you the minimal set of concepts and explanations.
A Spark application is, in fact, a graph of operations. You can code that graph using the Scala or Java APIs.
The Punchplatform lets you do that using a configuration file where each operation is referred to as a node.
Here is one of the simplest examples of a Spark job (courtesy of the Spark documentation):
This job performs a simple word count. First, it performs a textFile operation to read an input file, then a flatMap operation to split each line into words, then a map operation to form (word, 1) pairs, then finally a reduceByKey operation to sum the counts for each word.
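The four steps above can be sketched in plain Python. This is only an illustration of the data flow, not Spark code: the helpers `text_file`, `flat_map` and `reduce_by_key` are hypothetical stand-ins named after the Spark operations, and they run on an in-memory string rather than a distributed file.

```python
from itertools import chain


def text_file(text):
    """Stand-in for textFile: return the lines of an in-memory text."""
    return text.splitlines()


def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge the values that share the same key using func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


def word_count(text):
    lines = text_file(text)                              # read the input
    words = flat_map(lambda line: line.split(), lines)   # split lines into words
    pairs = [(word, 1) for word in words]                # form (word, 1) pairs
    return reduce_by_key(lambda a, b: a + b, pairs)      # sum the counts per word


counts = word_count("to be or not\nto be")
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark each of these steps is distributed across executors; the sketch only shows how the output of one operation feeds the next.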
The blue boxes in the visualization refer to the Spark operations that the user calls in their code.
Using PML, these (blue) operations are defined in a JSON configuration file. Each operation is called a node.
Another particularly important feature of Spark is its machine learning library, MLlib. PML provides you with an mllib node so that you can benefit from it.
The MLlib APIs introduce an additional concept: the ML pipeline. A pipeline is (yet another) graph of operations, referred to as pipeline stages. Here is an example:
Typical stages are Transformers. Even if this is not yet clear to you, just assume for now that these are essential and useful operations that you need to combine in order to design machine learning processing.
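To make the notion of pipeline stages more concrete, here is a minimal sketch in plain Python. It only mimics the general shape of a Spark ML pipeline, where a Transformer exposes `transform` and an Estimator exposes `fit` and returns a fitted Transformer. The class names and the row format (a list of dictionaries) are assumptions made for this illustration; they are not the MLlib or PML APIs.

```python
class Tokenizer:
    """Transformer: adds a 'words' list computed from the 'text' field."""

    def transform(self, rows):
        return [dict(row, words=row["text"].lower().split()) for row in rows]


class VocabularyModel:
    """Transformer produced by VocabularyIndexer.fit: adds word-count features."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            dict(row, features=[row["words"].count(w) for w in self.vocabulary])
            for row in rows
        ]


class VocabularyIndexer:
    """Estimator: learns a vocabulary from the data, then yields a Transformer."""

    def fit(self, rows):
        vocabulary = sorted({word for row in rows for word in row["words"]})
        return VocabularyModel(vocabulary)


class PipelineModel:
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """An ordered list of stages, fitted one after the other."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an estimator becomes a transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "spark is fast"}, {"text": "punch uses spark"}]
model = Pipeline([Tokenizer(), VocabularyIndexer()]).fit(training)
result = model.transform([{"text": "Spark spark punch"}])
print(result[0]["features"])  # [0, 0, 1, 2, 0]
```

The learned vocabulary is `['fast', 'is', 'punch', 'spark', 'uses']`, so the feature vector counts one occurrence of "punch" and two of "spark".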
Such a pipeline can be embedded in a Spark node. To sum up, a typical machine learning workflow is as follows:
- Loading the data (aka data ingestion)
- Extracting features from that data (aka feature extraction)
- Training a model (aka model training)
- Evaluating the model (or predicting)
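The workflow above can be sketched end to end in a few lines of plain Python. Everything here is an assumption chosen for illustration: the tiny inline dataset, the `bytes` and `duration` columns, and the nearest-centroid classifier, which is not what MLlib or PML uses but is simple enough to read at a glance.

```python
import csv
import io

# Hypothetical labelled data, inlined so that the example is self-contained.
RAW = """bytes,duration,label
100,1,normal
120,2,normal
110,1,normal
900,9,suspect
950,8,suspect
870,9,suspect
105,2,normal
910,8,suspect
"""


def load(raw):
    """Data ingestion: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def extract_features(rows):
    """Feature extraction: turn each row into (numeric vector, label)."""
    return [([float(r["bytes"]), float(r["duration"])], r["label"]) for r in rows]


def train(samples):
    """Model training: the model is the mean vector (centroid) of each label."""
    sums, counts = {}, {}
    for vector, label in samples:
        total = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def predict(model, vector):
    """Prediction: return the label whose centroid is the closest."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))


def evaluate(model, samples):
    """Evaluation: fraction of samples whose label is predicted correctly."""
    hits = sum(predict(model, vector) == label for vector, label in samples)
    return hits / len(samples)


samples = extract_features(load(RAW))
train_set, test_set = samples[:6], samples[6:]  # keep the last two rows for evaluation
model = train(train_set)
accuracy = evaluate(model, test_set)
print(accuracy)  # 1.0
```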
PML introduces no new concept. It simply lets you define all that in a single configuration file. Conceptually a PML job looks like this:
The Punchplatform Machine Learning (PML) SDK lets you design and run arbitrary Spark applications on your platform data.
Why would you do that? A Punchplatform processes large quantities of data, which are parsed, normalized, then indexed in Elasticsearch in real time. As a result, you have powerful search capabilities through Kibana dashboards. In turn, Spark applications let you extract more business value from your data. You can compute statistical indicators, or run learn-then-predict or learn-then-detect applications. The scope of application is extremely wide; it all depends on your data.
Why Spark? For two essential reasons. First, it is a de facto standard framework for executing parallel and distributed tasks. Second, it provides many useful functions that dramatically simplify the coding of application logic. In particular, it provides machine learning libraries. Together, these two reasons make Spark offer the best simplicity/efficiency tradeoff, as illustrated next.
Doing that on your own requires some work. First, you have to code and build your Spark pipeline application, taking care of many configuration issues such as selecting the input and output data sources. Once it is ready, you have to deploy it and run it in cycles of train-then-detect or train-then-predict rounds, on enough real data to evaluate whether it outputs interesting findings. In many cases you operate on a production system where the real data resides, which is risky if you have not mastered the resources needed by your application.
In short: it is not that easy.
The goal of PML is to make all that much simpler and safer. In a nutshell, PML lets you configure arbitrary Spark pipelines. Instead of writing code, you define a few configuration files to select your input and output data, and to fully describe your Spark pipeline. You also specify the complete cycle of execution. For example: train every day on the last day of data, and detect on today's live data.
That is it. You submit that to the platform and it will be scheduled accordingly.
This is very similar to the Punchplatform way of exposing Storm topologies as plain configuration files, which are in turn combined as part of channels.
Compared to developing your own Spark applications, one for each of your machine learning use cases, working with PML configuration files has key benefits:
- the overall development-deployment-testing process is dramatically sped up.
- once ready, you go to production at no additional cost.
- all MLlib algorithms are available to your pipelines, plus the ones provided by Thales or third-party contributors.
- the new ML features from future Spark versions will be available to you as soon as they are available.
- everything is in configuration, hence safely stored in the PunchPlatform git-based configuration manager. There is no way to mess up or lose your working configurations.
- you use the robust, state-of-the-art and extensively used Spark MLlib architecture.
In the following we will go through the PML configuration files. Make sure you understand enough of the Spark concepts first.
Jobs versus Plans
Using PML you define and run jobs or plans.
A PML job is a Spark pipeline, defined by a JSON configuration. You can execute a job in one of two modes: cluster or local. In local mode, the job is started in a local JVM. Try that first. In cluster mode, the job is submitted to a Spark worker node. A first process, called the driver, is launched; that driver in turn submits so-called executor processes.
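As a rough illustration of what "a pipeline defined by a JSON configuration" means, the sketch below interprets a JSON list of nodes as a graph and runs it in plain Python. The schema is hypothetical: the field names `name`, `type`, `inputs` and `settings`, and the three node types, are invented for this example and are not the actual PML job format.

```python
import json

# Hypothetical job description: each node names its type and its input nodes.
JOB_JSON = """
{
  "job": [
    {"name": "input", "type": "static_lines", "inputs": [],
     "settings": {"lines": ["to be or not", "to be"]}},
    {"name": "words", "type": "split_words", "inputs": ["input"], "settings": {}},
    {"name": "count", "type": "count_values", "inputs": ["words"], "settings": {}}
  ]
}
"""


def static_lines(inputs, settings):
    """Source node: emits the lines given in its settings."""
    return list(settings["lines"])


def split_words(inputs, settings):
    """Splits every line of its single input into words."""
    return [word for line in inputs[0] for word in line.split()]


def count_values(inputs, settings):
    """Counts the occurrences of each value of its single input."""
    counts = {}
    for value in inputs[0]:
        counts[value] = counts.get(value, 0) + 1
    return counts


# Registry mapping a node type (from the configuration) to its implementation.
NODE_TYPES = {
    "static_lines": static_lines,
    "split_words": split_words,
    "count_values": count_values,
}


def run_job(job_json):
    """Run the nodes in declaration order; a node must follow its inputs."""
    outputs = {}
    for node in json.loads(job_json)["job"]:
        operation = NODE_TYPES[node["type"]]
        inputs = [outputs[name] for name in node["inputs"]]
        outputs[node["name"]] = operation(inputs, node["settings"])
    return outputs


results = run_job(JOB_JSON)
print(results["count"])  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The point of the design is that changing the job means editing the JSON document, not the code: the registry of node types stays fixed while the graph is data.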
A PML plan is basically a job iterator. It lets you run periodic jobs, each typically in charge of handling the data of a given time period. A plan is composed of a job template file and a configuration file. Together they define how the actual jobs are generated.
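The idea of generating jobs from a template plus a configuration can be sketched as follows. This is an illustration only: the template syntax (Python's `string.Template` placeholders), the plan keys `first_start`, `period_hours` and `iterations`, and the `time_range_input` node are all assumptions, not the real PML plan format, and a real plan would be scheduled over time rather than generated in a finite loop.

```python
import json
from datetime import datetime, timedelta
from string import Template

# Hypothetical job template: $start and $end are filled in for each iteration.
JOB_TEMPLATE = Template("""
{
  "job": [
    {"name": "input", "type": "time_range_input",
     "settings": {"from": "$start", "to": "$end"}}
  ]
}
""")

# Hypothetical plan configuration: where to start, how long each period is,
# and how many jobs to generate.
PLAN = {"first_start": "2018-01-01T00:00:00", "period_hours": 24, "iterations": 3}


def generate_jobs(template, plan):
    """Yield one concrete job per time period, in chronological order."""
    start = datetime.fromisoformat(plan["first_start"])
    period = timedelta(hours=plan["period_hours"])
    for _ in range(plan["iterations"]):
        end = start + period
        rendered = template.substitute(start=start.isoformat(), end=end.isoformat())
        yield json.loads(rendered)
        start = end  # the next job picks up where this one stopped


jobs = list(generate_jobs(JOB_TEMPLATE, PLAN))
for job in jobs:
    print(job["job"][0]["settings"])
```

Each generated job is an ordinary job description covering one day, so consecutive jobs handle consecutive, non-overlapping time periods.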
In order to quickly understand and manipulate the Punchplatform Machine Learning feature, here is a short description of the approach you have to adopt.
First, you have to build a standalone and install it.
Then, depending on the version of Punchplatform that you have, several options are possible.
Starting with the Craig version (version > 5.0.0), you can start the Punchplatform with the following commands:
cd $PUNCHPLATFORM_CONF_DIR
punchplatform-standalone.sh --start
Once it is started, open your browser and go to localhost:5601.
Go to the Punchplatform panel, Spark section, where you will find the list of the nodes and stages that are accessible through PML. Check the documentation if you want more details about each stage or node, and click on Run to test a stage. If you want to see what a stage does specifically, go into the $PUNCHPLATFORM_CONF_DIR directory and type the following commands:
cd $PUNCHPLATFORM_CONF_DIR
punchplatform-analytics.sh --spark-master local[*] --deploy-mode client --job examples/pml/nodes_unit_tests/batch_input.json
This will execute the batch_input.json PML job. Try the others as well.
Next, you can try to submit these example jobs to the Spark cluster:
cd $PUNCHPLATFORM_CONF_DIR
punchplatform-analytics.sh --spark-master local[*] --deploy-mode client --job examples/pml/nodes_unit_tests/<my_job>.json
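This time the job is submitted to the Spark master, and executed in turn by a Spark worker. For that to happen, the --spark-master value must point at your Spark master: in Spark, local[*] runs the job in a local JVM. Have fun!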