Spark Punchlines

Let us move on to the Spark world. On the punch, Spark applications are (again) simply punchlines. This may seem like magic, but the reason is quite simple: a Spark application can be represented as a graph, more precisely a DAG (directed acyclic graph). Punchlines are DAGs of input, processing and output nodes. If these nodes are Spark-compatible, your punchline will run on Spark.

A great characteristic of (Spark) punchlines is that they expose all Spark SQL and machine learning APIs, nodes and stages. In turn, you can use punchlines to design stream or batch applications, benefiting from Spark dataframes, SQL, machine learning and all Spark APIs. Punchlines are particularly well equipped with ready-to-use nodes to read or write data from/to Elasticsearch, Kafka and the other COTS integrated in the punch.

Besides machine learning, you can also build many simpler but useful applications: aggregating, extracting or filtering your data.

Note

Spark punchlines rely heavily on Spark paradigms. Although it is not necessary to be a Spark expert to use them, it helps to be familiar with Spark concepts, in particular before you start using the machine learning libraries.

In this getting started guide, we will walk through a very simple example.

Reading a CSV file

This chapter demonstrates how punchlines and Spark/PySpark work together. Go to the sample folder:

cd $PUNCHPLATFORM_CONF_DIR/samples/punchlines/spark/read_file

You have three files there:

tree .
├── AAPL.csv
├── csv_to_stdout_pyspark.yaml
└── csv_to_stdout_spark.yaml

There you will find a Spark punchline example that performs a very simple operation: it reads a CSV file and shows it on stdout. The data is stored in the AAPL.csv file:

Date,Open,High,Low,Close,Adj Close,Volume
2017-12-28,171.000000,171.850006,170.479996,171.080002,171.080002,16480200
2017-12-29,170.520004,170.589996,169.220001,169.229996,169.229996,25999900
2018-01-02,170.160004,172.300003,169.259995,172.259995,172.259995,25555900
2018-01-03,172.529999,174.550003,171.960007,172.229996,172.229996,29517900
...

Two versions of the example are provided, one for the Java Spark runtime and one for the Python (PySpark) runtime:

  • csv_to_stdout_spark.yaml
  • csv_to_stdout_pyspark.yaml

Let us have a look at the Java Spark punchline:

---
type: punchline
version: '6.0'
runtime: spark
dag:
- type: file_input
  settings:
    path: "./AAPL.csv"
    format: csv
    options:
      #inferSchema: true
      header: true
  component: input
  publish:
  - stream: data
- type: show
  component: show
  subscribe:
  - component: input
    stream: data
  publish: []

Now have a look at the PySpark file: only the runtime attribute differs.
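
The publish and subscribe entries of the punchline are what define the edges of the graph. As a rough illustration only (this is not how the punch itself is implemented), the following plain Python sketch mirrors the two nodes by hand and derives an order in which they could run. The dictionary layout and the execution_order helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-written mirror of the two nodes of csv_to_stdout_spark.yaml.
# Only the fields that define the graph are kept.
dag = [
    {
        "type": "file_input",
        "component": "input",
        "publish": [{"stream": "data"}],
        "subscribe": [],
    },
    {
        "type": "show",
        "component": "show",
        "publish": [],
        "subscribe": [{"component": "input", "stream": "data"}],
    },
]


def execution_order(nodes):
    """Hypothetical helper: order components so publishers come before subscribers."""
    sorter = TopologicalSorter()
    for node in nodes:
        upstream = [sub["component"] for sub in node.get("subscribe", [])]
        sorter.add(node["component"], *upstream)
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(sorter.static_order())


order = execution_order(dag)
print(order)  # ['input', 'show']
```

The input node has no upstream dependency, so it comes first; the show node subscribes to it and therefore comes second.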

This punchline contains two (so-called) nodes. A Spark application is in fact a directed graph of nodes, linked to each other through a publish/subscribe relationship. Run it using the following command:

punchlinectl start -p csv_to_stdout_spark.yaml

To run the PySpark punchline:

punchlinectl start -p csv_to_stdout_pyspark.yaml

The output looks like this:

# output

| Date       | Open       | High       | Low        | Close      | Adj Close  | Volume   |

| 2017-12-28 | 171.000000 | 171.850006 | 170.479996 | 171.080002 | 171.080002 | 16480200 |
| 2017-12-29 | 170.520004 | 170.589996 | 169.220001 | 169.229996 | 169.229996 | 25999900 |
| 2018-01-02 | 170.160004 | 172.300003 | 169.259995 | 172.259995 | 172.259995 | 25555900 |

root
 |-- Date: string (nullable = true)
 |-- Open: string (nullable = true)
 |-- High: string (nullable = true)
 |-- Low: string (nullable = true)
 |-- Close: string (nullable = true)
 |-- Adj Close: string (nullable = true)
 |-- Volume: string (nullable = true)

[
  {
    "name": "input_data",
    "title": "SHOW"
  }
]

There you have it. This simple example shows how simple, concise and clear it is to design a Spark pipeline using the punch.

Inferring Types

Let us improve our punchline so that it infers the types of the columns rather than reading everything as strings. The first line is already used as the header; we now also ask Spark to infer the schema of the dataframe dynamically upon reading.

Edit csv_to_stdout_spark.yaml and uncomment the inferSchema option:

...
    options:
      inferSchema: true
      header: true
...

Run your punchline again and look at the result:

Show node result: input_data

| Date          | Open       | High       | Low        | Close      | Adj Close  | Volume   |

| 2017-12-28... | 171.0      | 171.850006 | 170.479996 | 171.080002 | 171.080002 | 16480200 |
| 2017-12-29... | 170.520004 | 170.589996 | 169.220001 | 169.229996 | 169.229996 | 25999900 |
| 2018-01-02... | 170.160004 | 172.300003 | 169.259995 | 172.259995 | 172.259995 | 25555900 |

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Adj Close: double (nullable = true)
 |-- Volume: integer (nullable = true)

[
  {
    "name": "input_data",
    "title": "SHOW"
  }
]

This is better: the column types are now inferred automatically.
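
To give an intuition of what schema inference does, here is a minimal sketch in plain Python. It is not Spark's actual algorithm: it simply tries a few candidate types, narrowest first, and keeps the first one that parses every value of a column. The candidate list, the date format and the infer_schema helper are assumptions made for this example, and the sample holds only the rows of AAPL.csv shown earlier.

```python
import csv
import io
from datetime import datetime

# The first rows of AAPL.csv, as shown earlier in this guide.
SAMPLE = """Date,Open,High,Low,Close,Adj Close,Volume
2017-12-28,171.000000,171.850006,170.479996,171.080002,171.080002,16480200
2017-12-29,170.520004,170.589996,169.220001,169.229996,169.229996,25999900
2018-01-02,170.160004,172.300003,169.259995,172.259995,172.259995,25555900
2018-01-03,172.529999,174.550003,171.960007,172.229996,172.229996,29517900
"""

# Candidate types, narrowest first. The names mimic the printed schema.
# Unlike Spark's integer, Python's int is unbounded; this sketch ignores that.
CANDIDATES = [
    ("integer", int),
    ("double", float),
    ("timestamp", lambda text: datetime.strptime(text, "%Y-%m-%d")),
]


def parses(parser, text):
    """Return True if the parser accepts the text."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


def infer_schema(csv_text):
    """Hypothetical helper: map each column to the first type fitting all values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))  # first line is the header
    schema = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        schema[column] = "string"  # fallback when no candidate fits
        for name, parser in CANDIDATES:
            if all(parses(parser, value) for value in values):
                schema[column] = name
                break
    return schema


schema = infer_schema(SAMPLE)
for column, type_name in schema.items():
    print(f" |-- {column}: {type_name}")
```

On this sample the sketch reports Date as timestamp, the price columns as double and Volume as integer, which matches the schema printed by the show node.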

SQL

Let us now compute the average of the Volume column. To do so, add an sql node as follows. Note that the statement queries the input_data table, which corresponds to the data stream published by the input component:

...
- type: sql
  component: sql
  settings:
    statement: SELECT AVG(Volume) AS volume FROM input_data
  publish:
  - stream: data
  subscribe:
  - component: input
    stream: data
...

Execute the resulting punchline, named csv_to_stdout_spark_sql.yaml here:

punchlinectl start -p csv_to_stdout_spark_sql.yaml

The output is:

Show node result: sql_data

| volume |

| 2.8576825E7 |

root
 |-- volume: double (nullable = true)

[
  {
    "name": "sql_data",
    "title": "SHOW"
  }
]
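
If you want to check what the statement computes without starting Spark, the same query can be tried on a small in-memory table. The sketch below uses SQLite from the Python standard library, not Spark SQL, and loads only the four rows of AAPL.csv shown earlier, so its result differs from the 2.8576825E7 obtained on the complete file. The table layout is an assumption made for this example.

```python
import sqlite3

# Date and Volume of the four AAPL.csv rows shown earlier in this guide.
rows = [
    ("2017-12-28", 16480200),
    ("2017-12-29", 25999900),
    ("2018-01-02", 25555900),
    ("2018-01-03", 29517900),
]

conn = sqlite3.connect(":memory:")
# The table is named input_data to match the punchline statement.
conn.execute("CREATE TABLE input_data (Date TEXT, Volume INTEGER)")
conn.executemany("INSERT INTO input_data VALUES (?, ?)", rows)

# Same statement as in the sql node.
(volume,) = conn.execute(
    "SELECT AVG(Volume) AS volume FROM input_data"
).fetchone()
conn.close()

print(volume)  # 24388475.0
```

As in the punchline output, the average is returned as a floating-point value even though Volume is an integer column.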

Conclusions

Spark punchlines are extremely powerful. With a few nodes, you have stream and batch SQL power at hand. The punch team now uses only Spark punchlines to enrich production platforms with various extraction and aggregation jobs. There is no more coding involved, which in turn makes it a lot easier to upgrade our customer platforms every year.

Lastly, punchlines support both Python (PySpark) and Java (Spark) based machine learning. A dedicated getting started guide on this topic is planned. The punch standalone already contains various examples.