
# Dataset Generator


The `dataset_generator` node outputs a Spark Dataset built from the JSON documents provided in the `input_data` setting. Column types are inferred automatically from your JSON data.
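As a rough illustration of that inference, here is a small Python sketch (hypothetical, not the node's actual code) that maps each JSON column to the type of its values:

```python
import json

# Hypothetical sketch of JSON-based type inference, mimicking the idea of
# deriving a column type from the JSON values in input_data.
def infer_column_types(records):
    """Map each column name to the Python type name of its values."""
    types = {}
    for record in records:
        for name, value in record.items():
            types[name] = type(value).__name__
    return types

input_data = json.loads("""
[
  {"age": 21, "name": "phil", "musician": false},
  {"age": 23, "name": "alice", "musician": true}
]
""")

print(infer_column_types(input_data))
# {'age': 'int', 'name': 'str', 'musician': 'bool'}
```

Spark then maps these JSON value kinds onto its own SQL types (integer, string, boolean), as the schema printed further below shows.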


## Examples

### Use cases

Our "hello world" punchline configuration:

beginner_use_case.punchline

```hjson
{
    type: punchline
    version: "6.0"
    runtime: spark
    tenant: default
    dag: [
        {
            type: dataset_generator
            component: input
            settings: {
                input_data: [
                    {
                        age: 21
                        name: phil
                        musician: false
                    }
                    {
                        age: 23
                        name: alice
                        musician: true
                    }
                    {
                        age: 53
                        name: dimi
                        musician: true
                    }
                ]
            }
            publish: [
                {
                    stream: data
                }
            ]
        }
    ]
}
```

Run beginner_use_case.punchline using the command below:

```sh
CONF=beginner_use_case.punchline
punchlinectl start -p $CONF
```

### What did we achieve?

Executing this punchline displays the following result:

```text
$ punchlinectl start -p beginner_use_case.punchline
+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- musician: boolean (nullable = true)
```

You can provide nested objects and arrays as well. Nested objects become struct columns and arrays become array columns; note that `show()` renders array values as `WrappedArray(...)` strings.

intermediate_use_case.punchline

```hjson
{
    description:
        '''
        The batch_input node simply generates some data.
        You write your data inline, and it is converted to a Dataset<Row>.
        '''
    type: dataset_generator
    component: input
    settings: {
        input_data: [
            {
                age: 21
                name: phil
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                ]
            }
            {
                age: 23
                name: alice
                address: {
                    street: clemenceau
                }
                friends: [
                    dimi
                    phil
                ]
            }
            {
                age: 53
                name: dimi
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                    phil
                ]
            }
        ]
    }
    // Each node publishes its value (here a Dataset) onto a stream.
    // This particular batch_input node publishes a Dataset.
    publish: [
        {
            stream: default
        }
    ]
}
```

Let's execute it with the command below:

```sh
CONF=intermediate_use_case.punchline
punchlinectl start -p $CONF
```

### What did we achieve?

Executing this punchline displays the following result:

| address      | age | friends                   | name  |
|--------------|-----|---------------------------|-------|
| [clemenceau] | 21  | WrappedArray(alice)       | phil  |
| [clemenceau] | 23  | WrappedArray(dimi, phil)  | alice |
| [clemenceau] | 53  | WrappedArray(alice, phil) | dimi  |

```text
root
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)
```
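The mapping from JSON value kinds to the Spark column types shown in this schema can be sketched in plain Python (an illustrative approximation, not what the runtime actually does):

```python
import json

# Hypothetical sketch: how JSON value kinds map onto Spark column kinds.
# JSON objects become struct columns and JSON arrays become array columns.
SPARK_KIND = {"dict": "struct", "list": "array",
              "int": "long", "str": "string", "bool": "boolean"}

record = json.loads("""
{"age": 21, "name": "phil",
 "address": {"street": "clemenceau"},
 "friends": ["alice"]}
""")

schema = {name: SPARK_KIND[type(value).__name__] for name, value in record.items()}
print(schema)
# {'age': 'long', 'name': 'string', 'address': 'struct', 'friends': 'array'}
```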

Coming soon

## Parameters

### Common Settings

| Name       | Type         | Mandatory | Default value | Description                                                        |
|------------|--------------|-----------|---------------|--------------------------------------------------------------------|
| input_data | List of Json | true      | NONE          | A list of Json documents that will be used to create your dataset. |
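For reference, the smallest valid use of `input_data` (field names and values here are purely illustrative) looks like:

```hjson
settings: {
    input_data: [
        {
            age: 21
            name: phil
        }
    ]
}
```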

### Advanced Settings

No advanced settings.