Random Split¶
The random_split
node randomly splits an input dataset
into two output streams. It wraps the org.apache.spark.sql.Dataset.randomSplit transformation.
This is typically used to separate a training dataset from a test dataset when trying out a pipeline. By default, a split ratio of 0.9/0.1 is used.
{
  type: random_split
  component: random_split
  settings: {
    // Optional seed for the split node
    random_seed: 12345
    // Optional split ratio. Default is [0.9, 0.1]
    weights: [ "0.9", "0.1" ]
  }
  publish: [
    {
      // stream corresponding to 90%
      stream: left
    }
    {
      // stream corresponding to 10%
      stream: right
    }
  ]
  subscribe: [
    {
      component: input
      stream: data
    }
  ]
}
Settings¶
random_seed: Number
Optional. Seed for the random split.
weights: Double[]
Optional. Split weights. Default: [ "0.9", "0.1" ]
Weights for the splits; they will be normalized if they don't sum to 1.
The exact number of entries in each dataset varies slightly due to the random nature of the randomSplit() transformation.
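The behavior described above can be sketched in plain Python (this is a hypothetical illustration of the semantics, not Spark's implementation): the weights are normalized to sum to 1, and each row is independently assigned to a split by drawing a random number, which is why the exact split sizes vary around the target ratio.

```python
import random

def random_split(rows, weights, seed=None):
    """Illustrative sketch of randomSplit semantics (hypothetical helper,
    not Spark's API): normalize the weights, then assign each row to a
    split at random."""
    # Normalize weights so they sum to 1, as the node does.
    total = sum(weights)
    norm = [w / total for w in weights]
    # Build cumulative boundaries, e.g. [0.9, 1.0] for weights [0.9, 0.1].
    bounds = []
    acc = 0.0
    for w in norm:
        acc += w
        bounds.append(acc)
    rng = random.Random(seed)  # seeded for reproducible splits
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        # Place the row in the first split whose boundary exceeds x.
        for i, bound in enumerate(bounds):
            if x <= bound:
                splits[i].append(row)
                break
    return splits

left, right = random_split(range(1000), [0.9, 0.1], seed=12345)
```

With a fixed seed the split is reproducible, and because the weights are normalized, `[9, 1]` produces the same assignment as `[0.9, 0.1]`. The `left` list holds roughly 90% of the rows, but its exact size fluctuates from one seed to the next.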