Executing Jobs

This section explains how to launch Spark jobs in practice and lists the useful commands to keep in mind.

It is important to have a minimal understanding of Spark concepts. If you haven't already, please go through the Jobs concepts section first.

The basics: foreground mode

The quickest and easiest way to run a Spark job is to launch it in foreground mode, which is the default. Use either of these commands:

# These two commands are strictly equivalent
$ punchplatform-analytics.sh --job <job_path>
$ punchplatform-analytics.sh --job <job_path> --deploy-mode foreground

In this mode, everything is executed as part of the same process. This mode is useful when you develop PML stages and nodes: you can easily debug your PML applications from your favorite IDE.
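Since everything runs in a single JVM, standard Java remote debugging applies. As a minimal sketch, assuming your installation forwards JVM options through an environment variable such as JAVA_OPTS (this is an assumption, check your setup), you could expose a debug port and attach your IDE to it:

# Hypothetical: pass standard JDWP flags so an IDE can attach on port 5005
$ JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
    punchplatform-analytics.sh --job count.pml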

When using this mode, you will see no information in the Spark UI.

The command output looks like the following; everything shows up directly in your terminal.

$ punchplatform-analytics.sh --job count.pml 

/home/punch/punchplatform/standalone/punchplatform-standalone-5.1.3-SNAPSHOT/conf/examples/pml/nodes_unit_tests/count.pml            
    launching .................................................................................. START
        spark launcher logging: 

SHOW:
+--------------------------+
|source                    |
+--------------------------+
|{"A":{"B":4},"C":"banana"}|
|{"A":{"B":2},"C":"orange"}|
|{"A":{"B":1},"C":"lemon"} |
+--------------------------+

root
 |-- source: string (nullable = true)

COUNT: 3

Local mode

In this mode, the client fully embeds both the driver and the executors in a single JVM. As in foreground mode, the client stays alive for the whole job lifetime.

$ punchplatform-analytics.sh --job <job_path> --deploy-mode client --spark-master local[*] 

When using this mode, you will see no information in the Spark UI.
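The standard Spark master URL syntax applies here: local[*] uses one worker thread per available core, while local[N] pins the thread count. For instance, to run a job on exactly two threads:

# Run the job locally on exactly 2 worker threads instead of all cores
$ punchplatform-analytics.sh --job <job_path> --deploy-mode client --spark-master local[2]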

Launch it on a cluster (cluster)

In this mode, the job is submitted to the Spark cluster. The client merely submits the job to the Spark master and returns. The work is then distributed across the Spark workers. This is the production setup.

To submit a job in cluster mode, execute the following command:

$ punchplatform-analytics.sh --job <job_path> --deploy-mode cluster --spark-master <master_url>
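For example, with a Spark standalone master (the host below is a placeholder; 7077 is the usual standalone master port):

# Submit the sample job to a standalone Spark master (placeholder host)
$ punchplatform-analytics.sh --job count.pml --deploy-mode cluster --spark-master spark://spark-master.example.com:7077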

Launch it on a cluster (client)

In this mode, the client starts the driver internally, i.e. without spawning a child process. The driver in turn requests executors from the master.

The client stays alive for the whole job lifetime and collects metrics and logs, which are redirected to its standard console.

To submit a job in client mode, execute the following command:

$ punchplatform-analytics.sh --job <job_path> --deploy-mode client --spark-master <master_url>
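Because the client stays attached, the driver logs stream to your terminal for the whole run. Assuming a standard shell, a simple way to keep a copy is to tee them to a file (the master URL below is a placeholder):

# Watch the streamed logs live while keeping a copy on disk
$ punchplatform-analytics.sh --job count.pml --deploy-mode client --spark-master spark://spark-master.example.com:7077 | tee count-job.log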