Pipelines are the best architectural and programming pattern for data centric applications. Machine learning, log management, stream processing, CEP are all built on top of pipelines. This short chapter explains how the punch helps you think, build and run pipeline.
To go straight, the Punch helps you inventing useful processing pipelines. And you do that using only a configuration file.
Some pipelines run best in the stream, i.e. continuously, while others run as batch. It all depends on the source and destination data sinks, and of course on the type of application you deal with. For example, a log parser runs in the stream, whereas a spark machine learning job might run as a periodic batch job. After all, it does not matter so much.
We could call these pipeline applications or jobs or extract-transform-load or .. there are many well-known excellent words. Pipeline looks like a good term. Now pay attention to the following statement:
The Punch lets you design arbitrary pipelines using graphs through simple configuration files.
More precisely you can choose between two types of DAGs (Direct Acyclic Graphs): Storm DAGs or Spark DAGS.
Storm DAGS are ideal to design streaming pipelines where you plug-in arbitrary modules and functions. These functions
are traversed by the data and you act on it in a number of ways.
In contrast Spark DAGs are better suited to design machine learning pipelines (stream or batch). The Spark DAG model is different than the Storm model, we will give some light on this below.
The key point is : because DAGs are very simple and generic,
you can define virtually every kind of useful applications. No need to code anything.
Execute a Pipeline¶
Say you have a useful pipeline. For example a stream processing application that you want to plugin between kafka and elasticsearch (that is one of the best-seller pipeline). How do you run it ?
Start Simple ...¶
The Punch is quite original here. It is up to you to choose ! Why not start simple and execute your pipeline in a single (unix) process, onto a single (unix) server. Something like this.
All you need to have is a linux server. If you need your pipeline to continuously run you can make it a cron or a linux service, or rely on some systemd-like system. Nothing complex.
Does it scale enough ? Maybe not. You might need to add a more CPU power. How to do that ? The simplest solution is to request more instance of the function that is the bottleneck. For example:
If you do that you still have a single process but more threads. That can be enough to do the job. Another Solution consists in adding a server:
Wait there are some magic here: how will these two processes share the input load ? I.e. how will they consume the data from Kafka ? Short answer: the punch pipelines do that automatically. Just like Kafka Streams applications. Let us quote the benefits of such a simple approach from the kafka stream documentation:
- Elastic, highly scalable, fault-tolerant
- Deploy to containers, VMs, bare metal, cloud
- Equally viable for small, medium, & large use cases
- Write standard Java and Scala applications
- Exactly-once processing semantics
- No separate processing cluster required
To sum up, these examples simply require you have the servers, and something to execute processes. You can use physical linux boxes, virtual servers, docker, kubernetes .. it is up to you.
But Do Not Miss the Point¶
The previous examples are simple to understand, but do they make sense ? Yes if you have small apps dealing with a few thousands or events per second. Or only a few Gb of data to deal with using batch processing. If however you start processing lots of data (stream or batch) it is wise to benefit from more powerful runtime engines: Spark or Storm.
If you take the same punch configuration file and submit it to a (Storm|Spark) cluster instead of running in processes, here is what happens:
The sharding, distribution and data transport issues are all handled by and optimized by (Storm|Spark). It becomes much simpler and efficient to scale. Why simpler ? because you can ask (Storm|Spark) for more processes or threads. They will do the wiring for you.
And why is it more efficient ? Take the Spark use case: data are processed in there as
DataFrame, the whole pipeline is highly optimized by Spark before even be submitted. There are countless good references on the web to understand this.
- databrick video on sql joins
- a focus on sql optimization
- Memory Cache and Code generation optimization
- etc ..
Why the Punch ?¶
Putting it all together, the Punch is quite unique. It is simple to understand, simple to learn and test, you can start small and end up big.
It helps you not spend lots of money and effort to try scalable pipelines, stream processing and machine learning platforms upfront. If you succeed on a simple setup, you are ready to go production.