Random Split

Overview

The random_split node randomly splits an input dataset into two output streams. It wraps the org.apache.spark.sql.Dataset.randomSplit method.

This is typically used to separate a training dataset from a test dataset when trying out a pipeline. By default, a split ratio of 0.9/0.1 is used.
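To make the behaviour concrete, here is a minimal PySpark sketch of the call the node wraps. This is an illustration, not the node's implementation; the 1000-row dataframe is a hypothetical input, and the weights and seed mirror the node's defaults shown in the example below.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical 1000-row input dataset.
df = spark.range(1000)

# Two outputs with a 0.9/0.1 ratio and a fixed seed
# (the node's weights and random_seed settings).
left, right = df.randomSplit([0.9, 0.1], seed=1)

print(left.count(), right.count())  # roughly 900 and 100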

Runtime Compatibility

  • PySpark
  • Spark

Example

---
type: punchline
version: '6.0'
runtime: spark
dag:
- type: random_split
  component: random_split
  settings:
    random_seed: 1   # fixed seed for reproducible splits
    weights:         # split ratios, normalized if they do not sum to 1
    - '0.9'
    - '0.1'
  publish:
  - stream: left     # receives roughly 90% of the input rows
  - stream: right    # receives roughly 10% of the input rows
  subscribe:
  - component: input
    stream: data

Parameters

| Name | Type | Mandatory | Default value | Description |
|------|------|-----------|---------------|-------------|
| random_seed | Integer | false | NONE | Seed used by the random split, for reproducible results. |
| weights | List of String | false | ["0.9", "0.1"] | Weights for the splits; they are normalized if they do not sum to 1. The exact number of entries in each output dataset varies slightly due to the random nature of the randomSplit() transformation. |
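The normalization of weights can be checked directly against the wrapped Spark call. The sketch below, again with a hypothetical 1000-row dataframe, shows that weights which do not sum to 1 produce the same proportions as their normalized form.

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(1000)

# Weights that do not sum to 1 are normalized by randomSplit:
# [9.0, 1.0] behaves like [0.9, 0.1].
big, small = df.randomSplit([9.0, 1.0], seed=42)
print(big.count(), small.count())  # approximately 900 and 100; exact counts depend on the seed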