Dataset Generator

Compatible Spark/Pyspark

The dataset_generator node outputs a Spark Dataset built from the JSON records provided in its input_data setting. Each column's type is inferred automatically from the JSON data.

Below is an example:

[
    {
        type: dataset_generator
        component: input
        settings: {
            input_data: [
                {
                    age: 21
                    name: phil
                    musician: false
                }
                {
                    age: 23
                    name: alice
                    musician: true
                }
                {
                    age: 53
                    name: dimi
                    musician: true
                }
            ]
        }
        publish: [
            {
                stream: data
            }
        ]
    }
]

Executing this pml will display the following result:

$ punchplatform-analytics.sh --job pml.json
+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- musician: boolean (nullable = true)
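
The type inference described above can be illustrated with a rough Python sketch. This is not the node's actual implementation (which relies on Spark); the `infer_column_types` helper and its type-name mapping are hypothetical and only mimic the schema printed above for flat records.

```python
import json

# Hypothetical mapping from Python JSON types to the type names shown in the
# schema above. bool must be tested before int: in Python, bool subclasses int.
_TYPE_NAMES = [(bool, "boolean"), (int, "integer"), (float, "double"), (str, "string")]


def infer_column_types(records):
    """Return {column: type_name} for a list of flat JSON objects."""
    schema = {}
    for record in records:
        for column, value in record.items():
            for py_type, name in _TYPE_NAMES:
                if isinstance(value, py_type):
                    break
            else:
                raise TypeError(f"unsupported value for {column!r}: {value!r}")
            # Fail loudly if two records disagree on a column's type.
            if schema.setdefault(column, name) != name:
                raise ValueError(f"conflicting types for column {column!r}")
    return schema


# Same records as the input_data example above, written as strict JSON.
input_data = json.loads("""
[
    {"age": 21, "name": "phil", "musician": false},
    {"age": 23, "name": "alice", "musician": true},
    {"age": 53, "name": "dimi", "musician": true}
]
""")

schema = infer_column_types(input_data)
for column, name in schema.items():
    print(f" |-- {column}: {name} (nullable = true)")
```

The sketch rejects records whose types disagree for the same column; how the real node handles such conflicts is not documented here.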

Using Nested Types

You can provide nested objects or arrays as well. As the schema printed further down shows, a nested object is inferred as a struct column and an array as an array column. Here is an example:

{
    description:
        '''
        The dataset_generator node simply generates some data.
        You write your data inline, and it converts it to a Dataset<Row>.
        '''
    type: dataset_generator
    component: input
    settings: {
        input_data: [
            {
                age: 21
                name: phil
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                ]
            }
            {
                age: 23
                name: alice
                address: {
                    street: clemenceau
                }
                friends: [
                    dimi
                    phil
                ]
            }
            {
                age: 53
                name: dimi
                address: {
                    street: clemenceau
                }
                friends: [
                    alice
                    phil
                ]
            }
        ]
    }
    // Each node publishes its value (here a Dataset) onto a stream.
    // This particular dataset_generator node publishes a Dataset.
    publish: [
        {
            stream: default
        }
    ]
}

Here is the resulting dataset:

SHOW:

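+------------+---+-------------------------+-----+
|address     |age|friends                  |name |
+------------+---+-------------------------+-----+
|[clemenceau]|21 |WrappedArray(alice)      |phil |
|[clemenceau]|23 |WrappedArray(dimi, phil) |alice|
|[clemenceau]|53 |WrappedArray(alice, phil)|dimi |
+------------+---+-------------------------+-----+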

root
 |-- address: struct (nullable = true)
 |    |-- street: string (nullable = true)
 |-- age: long (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- name: string (nullable = true)

Configuration(s)

  • input_data: List(JSON(String))

    Description: [Required] A list of JSON strings used to create your dataset.
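
When the records come from a script rather than being typed inline, the input_data list can be generated programmatically. The following is a minimal Python sketch, assuming the job file is parsed as Hjson-style configuration (so that strict JSON is accepted too); the `build_job` helper is hypothetical and only mirrors the node layout of the first example on this page.

```python
import json


def build_job(records, stream="data"):
    """Wrap records in a single dataset_generator node (hypothetical helper)."""
    node = {
        "type": "dataset_generator",
        "component": "input",
        "settings": {"input_data": records},
        "publish": [{"stream": stream}],
    }
    return [node]


records = [
    {"age": 21, "name": "phil", "musician": False},
    {"age": 23, "name": "alice", "musician": True},
    {"age": 53, "name": "dimi", "musician": True},
]

job = build_job(records)
# json.dumps turns Python's True/False into JSON's true/false.
job_text = json.dumps(job, indent=4)
print(job_text)
```

The printed text could then be saved to a file such as pml.json and passed to the --job option shown earlier.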