Example
This chapter shows the differences between writing a regular Spark job and configuring a PunchPlatform job. Both versions implement the same use case, Index Count: counting the documents in an Elasticsearch index.
Regular Java
First, implement a Java main class:
package com.thales.services.cloudomc.punchplatform.analytics.example;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class Main {
    public static void main(String[] args) {
        // Get or create the Spark session object. It is your link to the Spark cluster.
        SparkSession sparkSession = SparkSession.builder().appName("example").getOrCreate();
        // Create a Java Spark context from the Spark session. This is the older
        // RDD-based Spark API, which the Elasticsearch API uses.
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sparkSession.sparkContext());
        // Get the index name, given as the first argument
        String index = args[0];
        // Use the Elasticsearch API to read the documents as (id, JSON) pairs
        JavaPairRDD<String, String> data = JavaEsSpark.esJsonRDD(jsc, index);
        // Count the documents
        long count = data.count();
        // Print the count
        System.out.println("Number of documents in index: " + count);
    }
}
Next, package it into a jar and submit it to Spark:
./spark-submit --deploy-mode cluster --master spark://localhost:7077 --class com.thales.services.cloudomc.punchplatform.analytics.example.Main punchplatform-analytics-example-4.0.1-SNAPSHOT-jar-with-dependencies.jar metricbeat-2018.01.19
You can try this command yourself: the compiled jar is provided in the folder “punchplatform-standalone-###/external/spark-2.2.1-bin-hadoop2.7/punchplatform/analytics/job/example”.
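To make the structure of the submit command explicit, here is a minimal sketch that assembles the same argument list in plain Java. The class name `SubmitCommand` and its `build` helper are hypothetical and only illustrate the ordering: Spark options come first, then the application jar, and everything after the jar is passed to `Main` (the index name becomes `args[0]`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PunchPlatform or Spark.
final class SubmitCommand {

    // Builds the spark-submit argument list in the order spark-submit expects.
    static List<String> build(String master, String mainClass, String jar, String index) {
        List<String> cmd = new ArrayList<>();
        cmd.add("./spark-submit");
        cmd.add("--deploy-mode");
        cmd.add("cluster");
        cmd.add("--master");
        cmd.add(master);
        cmd.add("--class");
        cmd.add(mainClass);
        cmd.add(jar);   // application jar: arguments after it go to the main class
        cmd.add(index); // received as args[0] by Main
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build(
                "spark://localhost:7077",
                "com.thales.services.cloudomc.punchplatform.analytics.example.Main",
                "punchplatform-analytics-example-4.0.1-SNAPSHOT-jar-with-dependencies.jar",
                "metricbeat-2018.01.19");
        // Print the command; to run it, pass the list to new ProcessBuilder(cmd).
        System.out.println(String.join(" ", cmd));
    }
}
```

Keeping the index name as the last element mirrors how the job reads it, so counting another index only requires changing that one argument.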
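PunchPlatform Job
With PunchPlatform, the same Index Count job is described by a JSON configuration instead of code. Each node declares a type, a component name, optional settings, and the fields it publishes or subscribes to: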
[
  {
    "type": "elastic_batch_input",
    "component": "input",
    "settings": {
      "index": "metricbeat-2018.01.19",
      "cluster_name": "es_search",
      "nodes": [
        "localhost"
      ]
    },
    "publish": [
      {
        "field": "data"
      }
    ]
  },
  {
    "type": "count",
    "component": "count",
    "subscribe": [
      {
        "component": "input",
        "field": "data"
      }
    ],
    "publish": [
      {
        "field": "value"
      }
    ]
  },
  {
    "type": "print",
    "component": "print",
    "settings": {
      "title": "COUNT"
    },
    "subscribe": [
      {
        "component": "count",
        "field": "value"
      }
    ]
  }
]
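The publish/subscribe wiring in the configuration above can be pictured with a small plain-Java model. This is only an illustrative sketch, not PunchPlatform's implementation: the class `JobGraphSketch`, its methods, and the printed format are all assumptions, and an in-memory list stands in for the Elasticsearch index.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the three-node job: input -> count -> print.
final class JobGraphSketch {

    // Published values, keyed by "component.field" as named in the JSON.
    private final Map<String, Object> published = new HashMap<>();

    void publish(String component, String field, Object value) {
        published.put(component + "." + field, value);
    }

    Object subscribe(String component, String field) {
        Object value = published.get(component + "." + field);
        if (value == null) {
            throw new IllegalStateException("Nothing published as " + component + "." + field);
        }
        return value;
    }

    // Runs the three nodes in order and returns the count that "print" receives.
    static long run(List<String> documents) {
        JobGraphSketch job = new JobGraphSketch();

        // "input" node: publishes the documents under the field "data".
        job.publish("input", "data", documents);

        // "count" node: subscribes to input.data and publishes the field "value".
        List<?> data = (List<?>) job.subscribe("input", "data");
        job.publish("count", "value", (long) data.size());

        // "print" node: subscribes to count.value; the output format is made up.
        long value = (Long) job.subscribe("count", "value");
        System.out.println("COUNT: " + value);
        return value;
    }

    public static void main(String[] args) {
        run(List.of("{\"a\":1}", "{\"a\":2}", "{\"a\":3}"));
    }
}
```

The point of the model is that each node only names the component and field it reads from, so the data flow is defined entirely by the configuration rather than by hand-written glue code.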