Dataset Generator

The dataset_generator node makes it easy to generate a Dataset that you can then use to test your pipelines. You define the generated rows using a simple inline JSON notation.

Here is an example:

[
    {
        type: dataset_generator
        component: input
        settings: {
            input_data: [
                {
                    age: 21
                    name: phil
                    musician: false
                }
                {
                    age: 23
                    name: alice
                    musician: true
                }
                {
                    age: 53
                    name: dimi
                    musician: true
                }
            ]
        }
        publish: [
            {
                stream: data
            }
        ]
    }
]

Executing this PML produces the following result:

$ punchplatform-analytics.sh --job pml.json
+---+-----+--------+
|age|name |musician|
+---+-----+--------+
|21 |phil |false   |
|23 |alice|true    |
|53 |dimi |true    |
+---+-----+--------+

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- musician: boolean (nullable = true)

Types are automatically inferred from the provided documents.
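To make the inference rule concrete, here is a minimal Python sketch that maps the rows of the example to the type names printed in the schema above. It is only an illustration: the function names `infer_type` and `infer_schema` are hypothetical, the node's actual inference code is not shown in this page, and the assumption that every row must agree on a column's type is ours.

```python
# Illustrative sketch only: this is NOT the dataset_generator implementation.
# It maps plain Python values to the type names shown in the printed schema.

def infer_type(value):
    """Return a schema type name for a single scalar value."""
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported value: {value!r}")


def infer_schema(rows):
    """Infer one type per column, assuming all rows agree on it."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            inferred = infer_type(value)
            # setdefault records the first type seen for the column;
            # any later disagreement is reported as an error.
            if schema.setdefault(column, inferred) != inferred:
                raise ValueError(f"conflicting types for column {column!r}")
    return schema


rows = [
    {"age": 21, "name": "phil", "musician": False},
    {"age": 23, "name": "alice", "musician": True},
    {"age": 53, "name": "dimi", "musician": True},
]

print(infer_schema(rows))
# prints {'age': 'integer', 'name': 'string', 'musician': 'boolean'}
```

The column order follows the order in which fields first appear in the rows, which matches the order of the columns in the example output.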

Using Nested Types

You can provide nested objects or arrays as well. These will be converted to a string representation. Here is an example:

        settings: {
            input_data: [
                {
                    age: 21
                    name: phil
                    address: {
                        street: clemenceau
                    }
                    friends: [
                        alice
                    ]
                }
                {
                    age: 23
                    name: alice
                    address: {
                        street: clemenceau
                    }
                    friends: [
                        dimi
                        phil
                    ]
                }
                {
                    age: 53
                    name: dimi
                    address: {
                        street: clemenceau
                    }
                    friends: [
                        alice
                        phil
                    ]
                }
            ]

Here is the resulting dataset:

SHOW:
+---+-----+-----------------------+----------------+--------+
|age|name |address                |friends         |musician|
+---+-----+-----------------------+----------------+--------+
|21 |phil |{"street":"clemenceau"}|["alice"]       |true    |
|23 |alice|{"street":"clemenceau"}|["dimi","phil"] |true    |
|53 |dimi |{"street":"clemenceau"}|["alice","phil"]|true    |
+---+-----+-----------------------+----------------+--------+

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- friends: string (nullable = true)
 |-- musician: boolean (nullable = true)
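The string columns above look like compact JSON. As a rough sketch of that conversion, the following Python snippet serializes nested objects and arrays to compact JSON strings and leaves scalar values untouched. The helper name `flatten_row` is hypothetical, and the choice of compact JSON is an assumption based on the output shown here, not a description of the node's internals.

```python
import json

# Illustrative sketch only: mimics the string representation seen in the
# resulting dataset. The real node may serialize values differently.

def flatten_row(row):
    """Replace nested dicts and lists with compact JSON strings."""
    flat = {}
    for column, value in row.items():
        if isinstance(value, (dict, list)):
            # separators=(",", ":") removes the spaces json.dumps adds by default
            flat[column] = json.dumps(value, separators=(",", ":"))
        else:
            flat[column] = value
    return flat


row = {
    "age": 23,
    "name": "alice",
    "address": {"street": "clemenceau"},
    "friends": ["dimi", "phil"],
}

flat = flatten_row(row)
print(flat["address"])  # prints {"street":"clemenceau"}
print(flat["friends"])  # prints ["dimi","phil"]
```

Because the nested values end up as plain strings, a downstream node that needs the individual fields would have to parse them again.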

Forcing the use of Double or Long

If you need to generate longs or doubles instead of integers or floats, append an L or D suffix to the value:

            input_data: [
                {
                    age: 21L
                    name: phil
                    musician: false
                }

or

            input_data: [
                {
                    temperature: 18.8D
                }
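To illustrate how the `L` and `D` suffixes can be read, here is a small Python sketch that classifies a literal such as `21L` or `18.8D`. The function `parse_literal` and the two regular expressions are hypothetical and only cover the forms shown in the examples above; the parsing rules actually applied by the node may be broader or stricter.

```python
import re

# Illustrative sketch only: the patterns below are assumptions that cover
# the two literal forms used in this page, nothing more.
LONG_LITERAL = re.compile(r"^(-?\d+)L$")
DOUBLE_LITERAL = re.compile(r"^(-?\d+(?:\.\d+)?)D$")


def parse_literal(text):
    """Return a (type name, value) pair for a suffixed numeric literal."""
    match = LONG_LITERAL.match(text)
    if match:
        return ("long", int(match.group(1)))
    match = DOUBLE_LITERAL.match(text)
    if match:
        return ("double", float(match.group(1)))
    # Anything else is returned unchanged and left to the default inference.
    return ("unsuffixed", text)


print(parse_literal("21L"))    # prints ('long', 21)
print(parse_literal("18.8D"))  # prints ('double', 18.8)
print(parse_literal("phil"))   # prints ('unsuffixed', 'phil')
```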