# Random Split

## Overview
The random_split node randomly splits an input dataset into two output streams. It wraps the org.apache.spark.sql.Dataset.randomSplit method. This node is typically used to separate a training dataset from a test dataset when trying out a pipeline. By default, a split ratio of 0.9/0.1 is used.
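As an illustration, here is a minimal Scala sketch of the underlying Spark call; the local SparkSession setup and the toy input dataset are assumptions for the example, not part of the node's API:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative setup: a local SparkSession and a toy 1000-row dataset.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("random-split-sketch")
  .getOrCreate()
val data = spark.range(0, 1000).toDF("id")

// randomSplit takes the (normalized) weights and a seed, and returns
// one Dataset per weight; a fixed seed makes the split reproducible.
val Array(left, right) = data.randomSplit(Array(0.9, 0.1), seed = 1L)

println(s"left: ${left.count()} rows, right: ${right.count()} rows")
```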
## Runtime Compatibility

- PySpark: ❌
- Spark: ✅
## Example
```yaml
---
type: punchline
version: '6.0'
runtime: spark
dag:
- type: random_split
  component: random_split
  settings:
    random_seed: 1
    weights:
    - '0.9'
    - '0.1'
  publish:
  - stream: left
  - stream: right
  subscribe:
  - component: input
    stream: data
```
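Here the node subscribes to the data stream of an upstream input component and publishes the two splits on the left and right streams; given the weight order, the left stream presumably receives the larger (0.9) split and the right stream the smaller (0.1) one.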
## Parameters
Name | Type | Mandatory | Default value | Description |
---|---|---|---|---|
random_seed | Integer | false | None | Random seed used by the split node. |
weights | List of String | false | [ "0.9" , "0.1" ] | Weights for the splits; they are normalized if they don't sum to 1. The exact number of entries in each split varies slightly due to the random nature of the randomSplit() transformation. |
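For example, weights of "3" and "1" would be normalized to 0.75 and 0.25 before the split is applied.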