Dataset Generator¶
Overview¶
The dataset_generator node outputs a Spark Dataset built from its input_data argument. Column types are inferred automatically from your JSON data.
Runtime Compatibility¶
- PySpark : ✅
- Spark : ✅
Examples¶
To Begin¶
---
type: punchline
version: '6.0'
runtime: spark
tenant: default
dag:
- type: dataset_generator
  component: input
  settings:
    input_data:
    - age: 21
      name: phil
      musician: false
    - age: 23
      name: alice
      musician: true
    - age: 53
      name: dimi
      musician: true
  publish:
  - stream: data
What did we achieve? Starting this punchline displays the following result:
punchlinectl start --punchline file.yaml
+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
|-- musician: boolean (nullable = true)
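The inference visible in the schema above can be approximated with a short Python sketch. This is not the node's or Spark's actual implementation: the helper names `infer_type` and `infer_columns` are hypothetical, the type labels are chosen only to mirror the printed schema, and falling back to string on conflicting types is an assumption made for illustration.

```python
from typing import Any, Dict, List


def infer_type(value: Any) -> str:
    """Map a Python value to an illustrative column type label (hypothetical helper)."""
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"


def infer_columns(records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Infer one type label per column across all records."""
    columns: Dict[str, str] = {}
    for record in records:
        for name, value in record.items():
            inferred = infer_type(value)
            # Keep the first type seen; assumption: conflicting types fall back to string
            if columns.setdefault(name, inferred) != inferred:
                columns[name] = "string"
    return columns


# Same rows as the input_data of the punchline above
records = [
    {"age": 21, "name": "phil", "musician": False},
    {"age": 23, "name": "alice", "musician": True},
    {"age": 53, "name": "dimi", "musician": True},
]

columns = infer_columns(records)
for name, type_label in columns.items():
    print(f"{name}: {type_label}")
```

The point of the sketch is only that types are derived from the values themselves, so no schema has to be declared in the punchline.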
Nested Objects¶
You can provide nested objects or arrays as well. These will be converted to a string representation.
---
description: |-
  The batch_input node simply generates some data.
  You simply write your data inline; it is output as a Dataset<Row>.
type: dataset_generator
component: input
settings:
  input_data:
  - age: 21
    name: phil
    address:
      street: clemenceau
    friends:
    - alice
  - age: 23
    name: alice
    address:
      street: clemenceau
    friends:
    - dimi
    - phil
  - age: 53
    name: dimi
    address:
      street: clemenceau
    friends:
    - alice
    - phil
publish:
- stream: default
What did we achieve?
| address | age | friends | name |
| [clemenceau] | 21 | WrappedArray(alice) | phil |
| [clemenceau] | 23 | WrappedArray(dimi, phil) | alice |
| [clemenceau] | 53 | WrappedArray(alice, phil) | dimi |
root
|-- address: struct (nullable = true)
| |-- street: string (nullable = true)
|-- age: long (nullable = true)
|-- friends: array (nullable = true)
| |-- element: string (containsNull = true)
|-- name: string (nullable = true)
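To make the nested schema above easier to read, here is a minimal Python sketch that walks one record and lists its fields as a tree. It is an illustration only, not how the node or Spark builds the schema: `type_label` and `schema_lines` are hypothetical helpers, the type labels simply mirror the schema printed above, and arrays are described by their first element only.

```python
from typing import Any, Dict, List


def type_label(value: Any) -> str:
    """Illustrative type label for a value (hypothetical helper)."""
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "unknown"


def schema_lines(record: Dict[str, Any], depth: int = 0) -> List[str]:
    """Return one line per field, recursing into nested objects."""
    prefix = " |   " * depth + " |-- "
    lines: List[str] = []
    for name in sorted(record):
        value = record[name]
        lines.append(f"{prefix}{name}: {type_label(value)}")
        if isinstance(value, dict):
            # Nested object: describe its fields one level deeper
            lines.extend(schema_lines(value, depth + 1))
        elif isinstance(value, list) and value:
            # Simplification: only the first element is inspected, without recursion
            child_prefix = " |   " * (depth + 1) + " |-- "
            lines.append(f"{child_prefix}element: {type_label(value[0])}")
    return lines


# First row of the nested input_data shown above
record = {
    "age": 21,
    "name": "phil",
    "address": {"street": "clemenceau"},
    "friends": ["alice"],
}

lines = schema_lines(record)
print("root")
for line in lines:
    print(line)
```

Fields are sorted by name here, which matches the alphabetical column order in the output above.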
Parameters¶
Name | Type | Mandatory | Default value | Description |
---|---|---|---|---|
input_data | List of JSON | true | NONE | A list of JSON objects (one per row) used to create your dataset. |
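Because input_data is mandatory and must be a list, it can be convenient to check a payload before writing it into a punchline. The following is a minimal sketch of such a pre-flight check. `load_input_data` is a hypothetical helper, not part of the punchline tooling, and the node's own validation behaviour is not described here.

```python
import json
from typing import Any, Dict, List, Union


def load_input_data(raw: Union[str, List[Any]]) -> List[Dict[str, Any]]:
    """Parse (if given as a JSON string) and validate a candidate input_data value."""
    data = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(data, list):
        raise ValueError("input_data must be a list")
    for index, row in enumerate(data):
        # Each entry is expected to be a JSON object, i.e. one row of the dataset
        if not isinstance(row, dict):
            raise ValueError(f"input_data[{index}] must be a JSON object")
    return data


rows = load_input_data('[{"age": 21, "name": "phil"}, {"age": 23, "name": "alice"}]')
print(f"{len(rows)} rows ready for input_data")
```

A single object or a list of scalars raises a ValueError here, which surfaces the mistake before the punchline is started.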