The goal of this chapter is simple: make clear, in a five-minute read, what the punch is.
**punchline** is a data processing pipeline.
- Stream, batch, storm, spark, flink and go pipelines are all supported through a single declarative format. The range of applications you can implement with punchlines is extremely broad, from tiny embedded pipelines to large-scale distributed data processing pipelines.
- A punchline must be included in a channel in order to be executed.
**plan** is a periodic data processing pipeline.
- A plan basically is a punchline that is executed periodically, each time configured to consume specific ranges of data.
- A plan is the simplest and most common example of data processing workflow. It is so common and useful that the punch provides a dedicated concept to deal with it.
Plans are resilient: if something bad happens, they resume and keep processing the data from where they were interrupted.
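The resilience mechanics can be pictured with a small sketch. This is plain Python, not the punch API; the cursor file and range size are illustrative assumptions. A plan repeatedly runs its punchline over the next data range and commits a cursor only after success, so a restart resumes from the last committed range:

```python
import json
import os

CURSOR_FILE = "plan.cursor"  # hypothetical checkpoint location

def load_cursor() -> int:
    """Return the last committed range start (0 on a first run)."""
    if os.path.exists(CURSOR_FILE):
        with open(CURSOR_FILE) as f:
            return json.load(f)["next_start"]
    return 0

def commit_cursor(next_start: int) -> None:
    """Persist progress so a restarted plan knows where to resume."""
    with open(CURSOR_FILE, "w") as f:
        json.dump({"next_start": next_start}, f)

def run_punchline(start: int, end: int) -> None:
    """Stand-in for launching the punchline over the range [start, end)."""
    print(f"processing range [{start}, {end})")

def run_plan(range_size: int, until: int) -> None:
    start = load_cursor()           # resume from the last committed range
    while start < until:
        end = start + range_size
        run_punchline(start, end)   # a crash here leaves the cursor untouched...
        commit_cursor(end)          # ...so the interrupted range is replayed
        start = end
```

The key design point is the ordering: the cursor is committed only after the range is processed, trading an occasional replay for no data loss.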
**book** is an arbitrary workflow.
- Books leverage the power of the Kubernetes Argo workflow engine.
- Books can chain arbitrary punchlines, react to incoming events, etc.
**function** is not a punch concept per se. It is listed here to emphasize that you can code your own functions (java, python, soon go) and deploy them as part of a punchline.
**channel** is the punch execution unit.
- Channels support four lifecycle operations.
- Channels can include one or several punchlines, plans, books, or your own containerised applications.
- Grouping several applications in a channel is a mere facility to manage large-scale deployments with many (tens or hundreds) of applications.
**tenant** is a logical grouping of channels.
- Using the punch you always work as part of a tenant.
- Just like channels, tenants support the same lifecycle operations.
With these concepts you can design a variety of data processing applications in a matter of hours.
One family of applications is central to the Punch: log management. For it the punch provides an additional concept: log parsers. The punch ships with many default parsers, lets you benefit from a parser marketplace, and includes a complete development toolkit to code your own.
The point of using the punch is to design applications, in particular data-centric applications. Here is an application that processes data from (say) radar-related equipment. Central to that application are a few punchlines that take care of the various processing stages: ingesting, filtering, detecting, etc.
To define the simplest channel you need two configuration files: the first models the application itself, the second the channel that runs it. In this example, the application processes Kafka data with a third-party (python) node that computes (say) some predictions.
Here is the punchline configuration file:
```yaml
version: "7.0"
name: predicter
runtime: pyspark
settings:
  resources:
    - punch-pex:com.mycompany:my-python-prediction-library:3.0.0
dag:
  - type: kafka_input
    settings:
      topic: input
    publish:
      - stream: data
        fields:
          - temperature
          - radar-id
  - type: detection
    settings:
      parameter:
    subscribe:
      - component: kafka_input
        stream: data
    publish:
      - stream: data
        fields:
          - temperature
          - radar-id
          - prediction
  - type: elasticsearch_output
    settings:
      per_stream_settings:
        - stream: data
          index:
            type: daily
            prefix: mytenant-events-
    subscribe:
      - component: detection
        stream: data
```
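To make the `detection` node concrete, here is what such a third-party python node could look like in spirit. This is a plain-Python sketch: the class name, the `process` method and the threshold rule are invented for illustration and do not reflect the actual punch node API. The node consumes the `(temperature, radar-id)` records published by `kafka_input` and republishes them enriched with a `prediction` field:

```python
from typing import Dict

class DetectionNode:
    """Illustrative stand-in for the punchline's 'detection' node."""

    def __init__(self, threshold: float = 30.0):
        # hypothetical parameter, mirroring the node's 'settings.parameter' slot
        self.threshold = threshold

    def process(self, record: Dict) -> Dict:
        """Enrich one record with a 'prediction' field, leaving inputs intact."""
        prediction = "alert" if record["temperature"] > self.threshold else "normal"
        return {**record, "prediction": prediction}
```

The important part is the contract, not the logic: the node's output fields match the `publish` declaration of the punchline, so downstream nodes can subscribe to them.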
Let us now define the channel configuration file:
```yaml
version: "7.0"
name: predict
applications:
  - name: predicter
    runtime: kubernetes
    cluster: west
    command: punchlinectl
    args:
      - "start"
      - "--punchline"
      - "predicter.yml"
      - "--runtime"
      - "pyspark"
```
Hopefully that file is simple to understand. It tells the punch to submit the punchline to a given kubernetes cluster, along with the arguments required to start it. To start your channel, simply type:
```sh
channelctl --start predict
```
Check your kubernetes cluster: there will be one or more pods running, depending on your punchline runtime (spark, flink, storm, etc.). That is plumbing and should not concern you.
Note that although this looks simple and great, it is not the simplest or most convenient way to test, debug or design your punchline. In development you will prefer to start your punchline as a plain foreground application, from a terminal or directly from within your code editor. You can do that like this:
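Judging from the channel file above, the foreground equivalent is simply the same `punchlinectl` invocation run by hand (the exact flags below are taken from the channel's `args` list):

```sh
punchlinectl start --punchline predicter.yml --runtime pyspark
```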
It cannot be simpler. Simple to understand, simple to work with, simple to tune.
A log management platform is just another use case, one that typically adds archiving capabilities.
Note that implementing a log management platform on top of open source technologies poses serious difficulties. First, you need to assemble many components (elastic, kafka, etc.). Second, you must have log parsers, preferably ready-to-use, or at least a development kit to design your own. Last, you need (many) additional configurations and services: long-term archiving, log collection at the edge, site-to-site log transfer, etc.
The punch provides all of that. If you have access to the internal Thales gitlab inner-source area, check out our parser space.
To quickly grasp how this is achieved, here is a log punchline example that leverages the standard punch log parsers.
```yaml
version: "7.0"
name: sourcefire-parser
runtime: punch
settings:
  resources:
    - punch-parser:org.thales.punch:punch-core-parsers:1.0.0
dag:
  - type: syslog_input
    settings:
      listen:
        proto: tcp
        host: 0.0.0.0
        port: 9902
    publish:
      - stream: logs
        fields:
          - log
  - type: punchlet_node
    settings:
      punchlet:
        - common/syslog_header_parser.punch
        - sourcefire/parser.punch
    subscribe:
      - component: syslog_input
        stream: logs
    publish:
      - stream: logs
        fields:
          - log
  - type: elasticsearch_output
    settings:
      per_stream_settings:
        - stream: logs
          index:
            type: daily
            prefix: mytenant-events-
          document_json_field: log
    subscribe:
      - component: punchlet_node
        stream: logs
```
Note the import of a punch parser package (`punch-parser:org.thales.punch:punch-core-parsers:1.0.0`); you can of course provide your own. Such a package provides a number of parsing functions (i.e. punchlets), here for example a syslog header parser and a sourcefire parser. You can chain these in many ways.
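The chaining idea is easy to picture in plain Python. Punchlets are actually written in the punch language; the function names and the naive regex below are illustrative stand-ins only. Each parsing step enriches the same document, and the steps compose, just as the `punchlet_node` above chains `syslog_header_parser.punch` and `sourcefire/parser.punch`:

```python
import re
from typing import Callable, Dict

def syslog_header_parser(doc: Dict) -> Dict:
    """Illustrative stand-in for common/syslog_header_parser.punch:
    split a '<pri>host: message' line into fields."""
    m = re.match(r"<(\d+)>(\S+): (.*)", doc["log"])
    if m:
        doc.update(pri=int(m.group(1)), host=m.group(2), message=m.group(3))
    return doc

def sourcefire_parser(doc: Dict) -> Dict:
    """Illustrative stand-in for sourcefire/parser.punch:
    extract key=value pairs from the message body."""
    doc.update(dict(kv.split("=", 1) for kv in doc.get("message", "").split()))
    return doc

def chain(*steps: Callable[[Dict], Dict]) -> Callable[[Dict], Dict]:
    """Compose parsing steps, the way a punchline chains punchlets."""
    def run(doc: Dict) -> Dict:
        for step in steps:
            doc = step(doc)
        return doc
    return run
```

Because every step takes and returns the same document shape, reordering or inserting parsers is just a matter of editing the chain.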
## Developing on the Punch
One of the key characteristics of the punch is that it requires little coding. Equipped with a few punchlets (snippets of code written in the punch language) and SQL statements, punch users can design impressive and complete applications in a matter of hours. These applications are, in turn, easy to maintain and operate.
The punch also lets you implement your own functions. Using a function-as-a-service approach, you can provide your own business modules and make them part of punchlines, combining yours with the many provided by the punch.
Here is the punch design illustrated:
It takes something like an hour to understand the few punch concepts:
We tried our best not to invent unnecessary concepts; there are already too many: pods, workflows, jobs, applications, schedulers, pipelines, containers, images, CI/CD, and too many technologies: kubernetes, argo, kafka, clickhouse, elastic, S3, mlflow, kubeflow, airflow, jupyterhub, dockerhub, helm registries.
Punch concepts come from a simple motivation: ensure we can efficiently help our customers. When a punch forward-engineer rescues a customer on their platform, dealing only with channels and punchlines makes it a lot easier to help. These are robust and bug-free. If something is not working well, it will not take long to identify the issue, most often an infrastructure or misconfiguration issue.
In the end, the concepts that are useful for providing support are also useful for users and customers. Who wants to configure yaml files, struggle with CI/CD, define templates? In our view, all of that is hell. Hence the punch.