Metrics/logs in Apache Spark¶
Description¶
The execution of a spark application (punchline) is traced through events of different types:
- spark.application.start
- spark.application.end
- spark.application.failed
- spark.job.start
- spark.job.end
- spark.stage.start
- spark.stage.end
- spark.task.start
- spark.task.end
These events are published at the level of a spark/pyspark punchline.
Hence, to activate the forwarding of this information to a database,
you must define one of our reporters (see the illustrative sketch below).
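The exact reporter settings depend on the reporter type and on your release; the snippet below is only a rough sketch, written as a Python dictionary so the assumed fields can be commented. The field names (type, brokers, topic, encoding) are assumptions, not the authoritative reporter schema; refer to the reporters documentation for the real settings.
# Rough sketch only: the field names below are assumptions, not the
# authoritative reporter schema; check the reporters documentation for
# the exact settings supported by your release.
kafka_reporter_sketch = {
    "type": "kafka",             # assumed: reporter kind
    "brokers": "local",          # assumed: kafka cluster to publish to
    "topic": "platform-events",  # assumed: destination topic for the events
    "encoding": "json",          # events are published as JSON by default
}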
Events are published as JSON by default and should be sent to a kafka topic. If you want to output a different format, for instance CSV, you can use a punchline that reads from the kafka topic and transforms each event into your desired format, as in the sketch below.
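As a minimal sketch of such a transformation, assuming the event has already been read from the kafka topic as a JSON string, the Python standard library is enough to flatten a few fields of interest into a CSV row. The selected columns are an arbitrary choice for illustration.
import csv
import io
import json

def event_to_csv_row(event_json: str) -> str:
    """Flatten one Spark event (JSON string) into a single CSV line.

    The selected columns are illustrative; pick whichever fields your
    dashboards need.
    """
    event = json.loads(event_json)
    labels = event.get("labels", {})
    row = [
        event.get("@timestamp", ""),
        event.get("message", ""),                        # e.g. spark.job.end
        labels.get("runtime", ""),
        labels.get("id", {}).get("application", ""),
        labels.get("application", {}).get("progression", ""),
        labels.get("event", {}).get("status", ""),
    ]
    buffer = io.StringIO()
    csv.writer(buffer).writerow(row)
    return buffer.getvalue().strip()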
Finally, these events are published using the ECS 1.5 format. They are designed for building user-friendly dashboards with multiple levels of filtering granularity, while ensuring that no visualization breaks when users investigate their data. As a consequence, some fields are intentionally redundant across events.
Events definition¶
Note 1: a spark application is divided into multiple jobs
Note 2: a spark job is divided into multiple stages
Note 3: a spark stage is divided into multiple tasks
Note 4: data locality for a given stage, hence for a set of tasks, should be on the same server
Note 5: multiple jobs are not executed in parallel, but the stages of a given job are!
Event | Definition |
---|---|
spark.application.start | This event always happens and is published when a punchline using a spark or pyspark runtime is executed. |
spark.application.end | This event always happens and is published when a punchline using a spark or pyspark runtime is about to exit (with code 0 or 1). |
spark.application.failed | This event is published when an exception occurs at the application level for a punchline using a spark or pyspark runtime. |
spark.job.start | This event is published when a job is about to start for a given punchline using a spark or pyspark runtime. |
spark.job.end | This event is published when a job is about to end for a given punchline using a spark or pyspark runtime. |
spark.stage.start | This event is published when a stage of a given job is about to start for a given punchline using a spark or pyspark runtime. |
spark.stage.end | This event is published when a stage of a given job is about to end for a given punchline using a spark or pyspark runtime. |
spark.task.start | This event is published when a task of a given stage is about to start for a given punchline using a spark or pyspark runtime. |
spark.task.end | This event is published when a task of a given stage is about to end for a given punchline using a spark or pyspark runtime. |
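The identifiers carried in labels.id reflect the application/job/stage/task hierarchy described in the notes above. Judging from the examples below, a task id such as local-1599119321655_3_3_29 appears to be the application id followed by the job, stage and task indexes; the following sketch, based only on that observed format, splits such an id back into its components.
def split_task_id(task_id: str) -> dict:
    """Split a task id such as 'local-1599119321655_3_3_29' into its
    hierarchy levels, assuming the '<application>_<job>_<stage>_<task>'
    format observed in the example events below."""
    application, job, stage, _task = task_id.rsplit("_", 3)
    return {
        "application": application,               # local-1599119321655
        "job": f"{application}_{job}",            # local-1599119321655_3
        "stage": f"{application}_{job}_{stage}",  # local-1599119321655_3_3
        "task": task_id,                          # local-1599119321655_3_3_29
    }

print(split_task_id("local-1599119321655_3_3_29"))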
Event examples¶
Note 1: Due to how spark works internally, labels.application.progression is calculated per job. An application may be composed of several jobs.
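Because labels.application.progression is computed per job, it should not be read as a single application-wide value. A minimal sketch for keeping track of the latest progression reported for each job, assuming you iterate over decoded event dictionaries like the examples below:
def latest_progression_per_job(events):
    """Return the latest reported progression for each job id.

    `events` is an iterable of decoded event dictionaries such as the
    examples below; events without a job id (e.g. application-level
    events) are ignored.
    """
    progression = {}
    for event in events:
        labels = event.get("labels", {})
        job_id = labels.get("id", {}).get("job")
        if job_id is None:
            continue
        value = labels.get("application", {}).get("progression")
        if value is not None:
            progression[job_id] = value
    return progression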
Application End¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
"progression": 100
},
"id": {
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "application",
"id": "local-1599119321655",
"failure_reason": "None"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.application.end",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Job End¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 2000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
"progression": 100
},
"id": {
"job": "local-1599119321655_3",
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "job",
"parent": "local-1599119321655",
"id": "local-1599119321655_3",
"status": "JobSucceeded",
"failure_reason": "None"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.job.end",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Stage End¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
"progression": 50
},
"id": {
"stage": "local-1599119321655_3_3",
"job": "local-1599119321655_3",
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "stage",
"parent": "local-1599119321655_3",
"id": "local-1599119321655_3_3",
"status": "succeeded",
"num_tasks": 11,
"failure_reason": "None"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.stage.end",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Task End¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
"progression": 50
},
"id": {
"task": "local-1599119321655_3_3_29",
"stage": "local-1599119321655_3_3",
"job": "local-1599119321655_3",
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "task",
"parent": "local-1599119321655_3_3",
"id": "local-1599119321655_3_3_29",
"status": "SUCCESS",
"failure_reason": "None",
"duration": 60,
"cpu_time": 10000,
"disk_spilled_time": 0,
"attempt": 0,
"gc_time": 0,
"host": "localhost",
"index": 8,
"mem_spilled_time": 0,
"num": 8,
"reason": "Success",
"size": 1723,
"action": "ResultTask"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.task.end",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Task Start¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
"progression": 50
},
"id": {
"task": "local-1599119321655_3_3_29",
"stage": "local-1599119321655_3_3",
"job": "local-1599119321655_3",
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "task",
"parent": "local-1599119321655_3_3",
"id": "local-1599119321655_3_3_29",
"status": "SUCCESS",
"failure_reason": "None",
"attempt": 0,
"executor_id": "driver",
"host": "localhost",
"index": 8,
"mem_spilled_time": 0,
"num": 8,
"size": 1723
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.task.end",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Stage Start¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
},
"id": {
"stage": "local-1599119321655_3_3",
"job": "local-1599119321655_3",
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "stage",
"parent": "local-1599119321655_3",
"id": "local-1599119321655_3_3",
"status": "SUCCESS",
"failure_reason": "None"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.task.submitted",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}
Application Start¶
{
"labels": {
"level": "info",
"runtime": "pyspark",
"duration": {
"application_ms": 5000
},
"application": {
"deploy_mode": "client",
"driver_host": "localhost",
},
"id": {
"application": "local-1599119321655"
},
"event": {
"timestamp": "2020-09-03T07:48:47.827Z",
"type": "application",
"id": "local-1599119321655",
"failure_reason": "None"
}
},
"platform": {
"tenant": "mytenant",
"id": "standalone",
"channel": "default",
"version": "6.0",
"application": {
"name": "punchline.template",
"id": "mytenant_default_punchline.template"
}
},
"@timestamp": "2020-09-03T07:48:47.827Z",
"message": "spark.task.submitted",
"tags": [],
"host": {
"hostname": "PUNCH",
"os": "Linux",
"domain": "PUNCH",
"ip": "127.0.1.1",
"name": "PUNCH",
"user": "jonathan",
"architecture": "amd64"
},
"ecs": {
"version": "1.5.0"
}
}