Dataset Generator

Overview

The dataset_generator node outputs a Spark Dataset built from the JSON provided in the input_data parameter. Column types are inferred automatically from your JSON data.
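
To make the inference rule concrete, here is a minimal Python sketch of how a column type can be derived from JSON values. It is an illustration only, not the node's implementation: the function names `infer_type` and `infer_columns` are hypothetical, and the type names are a simplified mapping onto Spark SQL type names. The actual inference is done by Spark and may differ, for example on integer width.

```python
import json


def infer_type(value):
    """Map one JSON value to a Spark-like type name (simplified assumption)."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "array"
    return "null"


def infer_columns(records):
    """Return {column name: type name}, using the first non-null value per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            if value is not None and name not in columns:
                columns[name] = infer_type(value)
    return columns


# Two records written as strict JSON, mirroring the kind of data passed to input_data.
input_data = json.loads("""
[
  {"age": 21, "name": "phil", "musician": false},
  {"age": 23, "name": "alice", "musician": true}
]
""")

print(infer_columns(input_data))
```

The sketch stops at the first non-null value for each column; a real engine would also reconcile conflicting types across records.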

Runtime Compatibility

  • PySpark :
  • Spark :

Examples

To Begin

{
    type: punchline
    version: "6.0"
    runtime: spark
    tenant: default
    dag: [
        {
            type: dataset_generator
            component: input
            settings: {
                input_data: [
                    {
                        age: 21
                        name: phil
                        musician: false
                    }
                    {
                        age: 23
                        name: alice
                        musician: true
                    }
                    {
                        age: 53
                        name: dimi
                        musician: true
                    }
                ]
            }
            publish: [
                {
                    stream: data
                }
            ]
        }
    ]
}

What did we achieve? This punchline displays the following result:

$ punchlinectl --punchline pml.json
+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+
root
|-- age: integer (nullable = true)
|-- name: string (nullable = true)
|-- musician: boolean (nullable = true)

Nested Objects

You can provide nested objects or arrays as well. When the dataset is displayed, these are rendered using a string representation.

{
    description:
        '''
        The batch_input node simply generates some data.
        You simply write your data inline; it is published as a Dataset<Row>.
        '''
    type: dataset_generator
    component: input
    settings: {
        input_data: [
            {
                age: 21
                name: phil
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                ]
            }
            {
                age: 23
                name: alice
                address: {
                    street: clemenceau
                }
                friends: [
                    dimi
                    phil
                ]
            }
            {
                age: 53
                name: dimi
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                    phil
                ]
            }
        ]
    }
    // Each node publishes its value (here a Dataset) onto a stream.
    // This particular batch_input node publishes a Dataset.
    publish: [
        {
            stream: default
        }
    ]
}

What did we achieve?

| address      | age | friends                   | name  |
| [clemenceau] | 21  | WrappedArray(alice)       | phil  |
| [clemenceau] | 23  | WrappedArray(dimi, phil)  | alice |
| [clemenceau] | 53  | WrappedArray(alice, phil) | dimi  |
root
|-- address: struct (nullable = true)
|    |-- street: string (nullable = true)
|-- age: long (nullable = true)
|-- friends: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- name: string (nullable = true)
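
As a rough illustration of how nested objects and arrays map onto the schema tree above, the following Python sketch walks one record and prints a similar tree. It is a simplification under stated assumptions, not how Spark or the node works internally: the helpers `scalar_type` and `schema_lines` are hypothetical, arrays are assumed to hold scalars only, and fields are sorted alphabetically to match the column order shown above.

```python
def scalar_type(value):
    """Map a scalar JSON value to a Spark-like type name (simplified assumption)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def schema_lines(value, name, indent=""):
    """Return schema tree lines for one named value (hypothetical helper)."""
    if isinstance(value, dict):
        # A nested object becomes a struct whose fields are indented one level.
        lines = [f"{indent}|-- {name}: struct (nullable = true)"]
        for key in sorted(value):
            lines += schema_lines(value[key], key, indent + "|    ")
        return lines
    if isinstance(value, list):
        # Assumption: arrays contain scalars; an empty array defaults to string.
        element = scalar_type(value[0]) if value else "string"
        return [
            f"{indent}|-- {name}: array (nullable = true)",
            f"{indent}|    |-- element: {element} (containsNull = true)",
        ]
    return [f"{indent}|-- {name}: {scalar_type(value)} (nullable = true)"]


record = {
    "age": 21,
    "name": "phil",
    "address": {"street": "clemenceau"},
    "friends": ["alice"],
}

lines = ["root"]
for key in sorted(record):
    lines += schema_lines(record[key], key)

print("\n".join(lines))
```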

Parameters

Name       | Type         | Mandatory | Default value | Description
input_data | List of Json | true      | NONE          | A list of JSON objects used to create your dataset.