Dataset Generator

Overview

The dataset_generator node outputs a Spark Dataset built from the JSON provided in the input_data parameter. Column types are inferred automatically from your JSON data.

Runtime Compatibility

  • PySpark :
  • Spark :

Examples

To Begin

---
type: punchline
version: '6.0'
runtime: spark
tenant: default
dag:
- type: dataset_generator
  component: input
  settings:
    input_data:
    - age: 21
      name: phil
      musician: false
    - age: 23
      name: alice
      musician: true
    - age: 53
      name: dimi
      musician: true
  publish:
  - stream: data

What did we achieve? Running this punchline displays the following result:

punchlinectl start --punchline file.yaml

+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
|-- musician: boolean (nullable = true)
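
The inference shown in the schema above can be imitated in plain Python. This is only a sketch and not the node's actual implementation: the helper names `infer_type` and `infer_schema` are hypothetical, the type labels are copied from the printed schema, and the `double` label for floating-point values is an assumption since the example contains none.

```python
# Sketch only: mimics column type inference in plain Python.
# infer_type and infer_schema are hypothetical helpers, not part of the node.

def infer_type(value):
    """Return a Spark-like type label for a single JSON scalar."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"  # assumption: not exercised by the example above
    return "string"


def infer_schema(rows):
    """Map each column name to the label inferred from its first value."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema.setdefault(column, infer_type(value))
    return schema


rows = [
    {"age": 21, "name": "phil", "musician": False},
    {"age": 23, "name": "alice", "musician": True},
    {"age": 53, "name": "dimi", "musician": True},
]

for column, label in infer_schema(rows).items():
    print(f"|-- {column}: {label} (nullable = true)")
```

Testing for `bool` before `int` matters here, otherwise the musician column would be labelled as an integer.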

Nested Objects

You can provide nested objects or arrays as well. These are shown using a string representation in the displayed result.

---
description: |-
  The batch_input node simply generates some data.
  You simply write your data inline, and it is emitted as a Dataset<Row>.
type: dataset_generator
component: input
settings:
  input_data:
  - age: 21
    name: phil
    address:
      street: clemenceau
    friends:
    - alice
  - age: 23
    name: alice
    address:
      street: clemenceau
    friends:
    - dimi
    - phil
  - age: 53
    name: dimi
    address:
      street: clemenceau
    friends:
    - alice
    - phil
publish:
- stream: default

What did we achieve?

| address      | age | friends                   | name  |
| [clemenceau] | 21  | WrappedArray(alice)       | phil  |
| [clemenceau] | 23  | WrappedArray(dimi, phil)  | alice |
| [clemenceau] | 53  | WrappedArray(alice, phil) | dimi  |
root
|-- address: struct (nullable = true)
|    |-- street: string (nullable = true)
|-- age: long (nullable = true)
|-- friends: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- name: string (nullable = true)
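
The way nested values appear in the result cells above (a bracketed list for an object, `WrappedArray(...)` for an array) can be sketched in plain Python. The `render` helper is hypothetical and only imitates the displayed strings; the `, ` separator between elements is an assumption.

```python
# Sketch only: imitates how nested values are displayed in the result table.
# render is a hypothetical helper, not part of the node.

def render(value):
    """Return a display string for a JSON value, nested or scalar."""
    if isinstance(value, dict):
        # An object is shown as its field values between square brackets.
        return "[" + ", ".join(render(v) for v in value.values()) + "]"
    if isinstance(value, list):
        # An array is shown with the WrappedArray(...) wrapper.
        return "WrappedArray(" + ", ".join(render(v) for v in value) + ")"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


row = {
    "address": {"street": "clemenceau"},
    "age": 23,
    "friends": ["dimi", "phil"],
    "name": "alice",
}

print(" | ".join(render(value) for value in row.values()))
```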

Parameters

| Name       | Type         | Mandatory | Default value | Description                                         |
| input_data | List of Json | true      | NONE          | A list of JSON strings used to create your dataset. |
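
As a rough illustration of the expected shape of input_data, the sketch below checks that a JSON payload is a list whose items are all objects. The `check_input_data` helper is hypothetical, and the exact validation performed by the node is not documented here, so treat the rules as assumptions.

```python
import json

# Sketch only: check_input_data is a hypothetical helper, and the rules it
# enforces (a list whose items are all JSON objects) are assumptions.

def check_input_data(raw):
    """Parse a JSON string and return its rows if it is a list of objects."""
    rows = json.loads(raw)
    if not isinstance(rows, list):
        raise ValueError("input_data must be a list")
    for index, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"item {index} is not a JSON object")
    return rows


payload = '[{"age": 21, "name": "phil"}, {"age": 23, "name": "alice"}]'
print(check_input_data(payload))
```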