Random Split


The random_split node randomly splits an input dataset into two output streams. It wraps the org.apache.spark.sql.Dataset.randomSplit method.

This is typically used to separate a training dataset from a test dataset when trying out a pipeline. By default, a split ratio of 0.9/0.1 is used.
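Under the hood, the node amounts to a single call on the Spark API. Below is a minimal Scala sketch of the equivalent call; the SparkSession setup and the input DataFrame are illustrative, since in a punchline the input dataset comes from the subscribed stream.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative setup; a real punchline node receives its dataset
// from the subscribed stream instead of building one here.
val spark = SparkSession.builder().appName("random-split-sketch").getOrCreate()
val input: DataFrame = spark.range(1000).toDF("id")

// Dataset.randomSplit normalizes the weights and returns one dataset per weight.
val Array(left, right) = input.randomSplit(Array(0.9, 0.1), seed = 12345L)

println(s"left: ${left.count()} rows, right: ${right.count()} rows")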


Examples

Use-cases

Our "hello world" punchline configuration.

beginner_use_case.punchline

{
    type: punchline
    version: "6.0"
    runtime: spark
    tenant: default
    dag: [
        {
            type: random_split
            component: random_split
            settings: {
                // Optional seed for the random split
                random_seed: 12345

                // Optional split ratio. Default is [0.9, 0.1]
                weights: [ "0.9", "0.1" ]
            }
            publish: [
                {
                    // stream corresponding to 90%
                    stream: left
                }
                {
                    // stream corresponding to 10%
                    stream: right
                }
            ]
            subscribe: [
                {
                    component: input
                    stream: data
                }
            ]
        }
    ]
}
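Downstream nodes consume the two outputs by subscribing to the published streams. A hedged sketch of a downstream node reading the 90% split is shown below; the show node type is assumed here purely for illustration, and any node consuming a dataset would subscribe the same way.

{
    // Hypothetical downstream node; subscribes to the 90% output.
    type: show
    component: show_left
    subscribe: [
        {
            component: random_split
            stream: left
        }
    ]
}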

Run beginner_use_case.punchline using the command below:

CONF=beginner_use_case.punchline
punchlinectl start -p $CONF


Parameters

Common Settings

Name        | Type           | Mandatory | Default value     | Description
random_seed | Integer        | false     | none              | Optional seed for the random split.
weights     | List of String | false     | [ "0.9", "0.1" ]  | Weights for the splits; they are normalized if they do not sum to 1. The exact number of entries in each output dataset varies slightly due to the random nature of the randomSplit() transformation.
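As a worked example of the normalization (a sketch against the Spark API directly, reusing the input and names from the earlier snippet): weights of [ "3", "1" ] are normalized to 0.75 and 0.25, so roughly 75% of the rows land in the left stream.

// Because weights are normalized, these two calls produce identical
// splits for the same seed: 3/(3+1) = 0.75 and 1/(3+1) = 0.25.
val a = input.randomSplit(Array(3.0, 1.0), seed = 42L)
val b = input.randomSplit(Array(0.75, 0.25), seed = 42L)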

Advanced Settings

No advanced settings