
Data Simulation

Abstract

The Punch ships with a high-performance injector tool. It lets you test Kafka, Elasticsearch, and your pipelines. It also generates arbitrary data to make it easy to develop your parser, machine learning, and aggregation use cases. In short, it is tremendously valuable.

Overview

The PunchPlatform provides a data simulation tool to generate, send and/or consume arbitrary data: punchplatform-log-injector.sh, simply referred to as the injector hereafter. This tool is the swiss-army knife of data/log simulation.

You can use it to inject test messages into any message processing platform. To do that you define injection json configuration files containing the load characteristics, the data message format, and the destination input point.

The injector is capable of writing to sockets (udp, tcp), to Kafka, to elasticsearch and to lumberjack endpoints.

The injector can also play a server role. You can use it to bench your networking plane, to stress test an elasticsearch server, and of course to test a punchplatform. It can read from tcp, udp, lumberjack and Kafka.

Online Parameters

Tip

Launch it without options to display the online help.

  • --brokers <arg>: the kafka brokers to read from
  • -c,--conf <arg>: your load configuration file
  • -check,--dump: dump to stdout instead of injecting
  • -cp,--compression: set the protocol compression to true
  • -d,--delay <arg>: overrides injection file throughput or inter-message delay, in milliseconds
  • --dump: print the injected log
  • --earliest: start consuming kafka messages from the earliest
  • -h,--help: print this message
  • -H,--host <arg>: overrides injection file destination host
  • -k,--kafka-consumer: act as a kafka consumer, to count the number of received messages by reading kafka directly. You must set a kafka broker and topic
  • --latest: start consuming kafka messages from the latest
  • -lc,--lumberjack-client: check the proper functioning of a lumberjack connection
  • -ls,--lumberjack-server: act as a lumberjack server to count the number of received logs. You must set the port number
  • -n,--number <arg>: overrides injection file total message number
  • -p,--port <arg>: overrides injection file destination port
  • -punchlets,--punchlets <arg>: stress a chain of punchlets (comma separated)
  • -q,--silent: reduce verbosity to error messages
  • -resources,--resources <arg>: adds punchlet resource files (comma separated)
  • -stream,--stream <arg>: storm stream for injected logs
  • -sustain: find out the maximum sustainable rate. This works with the lumberjack client only
  • -t,--throughput <arg>: overrides injection file message throughput, in messages per second
  • -thread,--thread <arg>: set the number of threads used per configuration file
  • --topic <arg>: the kafka topic
  • -ts,--tcp-server: act as a tcp server to count the number of received logs. You must set the port number
  • -u,--udp: use udp
  • -us,--udp-server: act as a udp server to count the number of received logs. You must set the port number
  • -v,--verbose: prints out the read data. It only works with some senders and receivers.
  • -w,--connection-timeout <arg>: defines the maximum wait time in ms for the receiver port to be available (not in udp mode) - 0 (the default) means infinite wait. Also applies on reconnection after connection loss.

Examples

Injecting Apache logs

To inject Apache HTTPD traffic, check out the injector file shipped with the standalone platform. You also have ready-to-use examples for other log formats.

punchplatform-log-injector.sh -c apache_injection.json

The same, but changing the rate to 1500 messages per second:

punchplatform-log-injector.sh -c apache_injection.json --throughput 1500

Injecting Lumberjack Data

You can set a lumberjack destination in your injector files. You can also use the injector in server mode should you need a quick test server. Here is how to run a lumberjack server listening on tcp/lumberjack port 21212:

punchplatform-log-injector.sh --lumberjack-server --port 21212

To send lumberjack traffic to the server we just started:

punchplatform-log-injector.sh -c lumberjack_injector.json -t 1000

If you need an example just check in the $PUNCHPLATFORM_CONF_DIR/conf/resources/injector/examples/ folder. Refer to the injection file documentation below for details about the lumberjack settings.
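As a sketch of what such a file contains (the exact files shipped in the examples folder may differ; the field names here follow the injection file sections documented below), a minimal lumberjack injection file could look like:

```hjson
{
    "destination" : {
        "proto" : "lumberjack",
        "host" : "127.0.0.1",
        "port" : 21212
    },
    "load" : {
        "message_throughput" : 1000,
        "total_messages" : 10000
    },
    "message" : {
        "payloads" : [ "test message number %{count}" ],
        "fields" : {
            "count" : { "type" : "counter", "min" : 0, "max" : 100000 }
        }
    }
}
```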

Kafka

You can write to Kafka using the kafka destination. What is very handy on the punchplatform is to generate a lumberjack-encoded record directly, taken from your injector json defined payload. This lets you fill Kafka with records that can be immediately consumed by punch topologies or pml.

Here is the principle of the injector file:

    destination : {
        proto : kafka

        // this must be a kafka broker declared in your platform property file.
        brokers : local

        // encoding can be bytes, in which case the Kafka record will
        // contain the payload value as a byte array, or lumberjack.
        encoding : lumberjack
        topic: mytopic
    }
    message : {
        // In this example you generate json data, each with two fields.
        // These will be encoded as a lumberjack frame with two key-values
        // automatically deduced from the json, i.e. a user key and an age key.
        //
        payloads : [
            {
                "user" : "%{user}"
                "age" : %{age}
            }
        ]
        ...
    }

You can check it works as expected by reading directly from Kafka as follows:

punchplatform-log-injector.sh --kafka-consumer --topic mytopic --brokers local -v

Testing Punchlet(s)

Test or stress a pipeline of punchlets to see overall performance:

punchplatform-log-injector.sh -c [injection_file].json --punchlets p1.punch,p2.punch,... --resources r1.json,r2.json,...

Test Lumberjack Connections

Tip

In Thales, LTR stands for Log Transporter and LMR for Log Receiver. Agreed, these are bad names, but they are well known inside Thales so we keep them. An LTR is a small log shipper installed on a source (customer) site; the LMR receives these logs on a central site. The lumberjack protocol is used in between to ensure end-to-end acknowledgement, rate limiting, etc.

When deploying a PunchPlatform that involves an LTR to LMR setup, ensuring that the network connection between the hosts is working as expected is a must. You can use the injector to quickly test that. Once it works, with no more worries about firewalls, proxies or other network issues, you can move on to setting up your final configuration.

First, start a Lumberjack server on the LMR host:

punchplatform-log-injector.sh --lumberjack-server --port 9900

If you need to display what is received use:

# also output the received message
punchplatform-log-injector.sh --lumberjack-server --port 9900 --verbose

Now, on the LTR side, start a client that will send messages to the server:

punchplatform-log-injector.sh --lumberjack-client --host localhost --port 9900

Unless stopped, the client will keep forwarding logs to the server. Watch the console output on the LMR side to see if the messages are correctly received.

Injection Configuration File

The injection file is a hjson file. You can add '#' or '//' prefixed comments, as in the examples below. The various sections are described next.

Destination Section

This section defines where you want to send your generated data. It is optional: you can define the destination using command-line parameters instead, and if you set one in the file you can still override it from the command line. Here is an example sending your generated data to a TCP server.

{
    "destination" : { 
        "proto" : "tcp", 
        "host" : "127.0.0.1", 
        "port" : 9999 
    },
...
}

The supported protocols are :

  • tcp: send the data to a TCP server
  • udp: send the data to a UDP server
  • lumberjack: send the data to a lumberjack server
  • http: performs POST REST requests to an http server
  • stdout: just print out the generated data. Use it for debugging and copy-pasting.
  • kafka: act as a kafka producer, toward a given topic
  • elasticsearch: inject data directly into an Elasticsearch cluster.

Here are example configurations for the "destination" section:

# all these require plain host port parameters
{ "proto" : "tcp", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "udp", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "lumberjack", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "http",
    "host" : "127.0.0.1",
    "port" : 9999,
    "http_method": "POST",
    "http_root_url": "/",
    "bulk_size": 1
}

# Elasticsearch configuration ('port' is optional, 'bulk_size' default is 1)
{ 
    "proto" : "elasticsearch", "host": "127.0.0.1", "port": 9300, 
    "cluster_name" : "es_search", 
    "index": "test", 
    "type": "doc", 
    "bulk_size": 1000
}

# Kafka only accepts a "brokers" name that must be defined in your 
# punchplatform.properties file. That is : this option only works
# (as of today) on an installed punchplatform.
{ 
    "proto": "kafka", 
    "brokers": "local", 
    "topic": "mytenant_bluecoat_proxysg"
}

Load Section

This section lets you control the injector's throughput. It is also optional if you prefer using online parameters.

"load" :{

    # "message_throughput" indicates the number of message per second.
    # Sometimes you want to inject fewer message than 1 per second, 
    # you can then use the alternative property : "inter_message_delay" 
    # For example to inject one message every 30 seconds :
    #   "inter_message_delay" : 30
    # 
    "message_throughput" : 1000,

    # Optional : control how often you have a recap stdout message. 
    "stats_publish_interval" : "2s",

    # The total number of message. Use -1 for almost infinite (2³¹-1 messages). 
    "total_messages" : 1000000,

    # Optional : make you throughput fixed or variable. By default fixed.
    # Using "variable" makes your load vary between 50 and 150 % of your 
    # set throughput.
    "type" : "fixed"
}

Punchlets Performance Test

The injector is great for stressing one punchlet or a chain of punchlets under a high data load. Using the --punchlets argument you basically make a chain of punchlets traversed by tons of (representative) data.

To check everything runs fine before stressing the punchlets, use the -v option to dump the punchlet results. Again, the -t option is your friend here to do that slowly:

punchplatform-log-injector.sh -c <json-injection-file> --punchlets punchlet1,punchlet2,.. -t 1 -v

If you need to include punchlet resources, use the --resources option:

punchplatform-log-injector.sh -c <json-injection-file> \
    --punchlets standard/common/input.punch,standard/common/parsing_syslog_header.punch,... \
    --resources standard/apache_httpd/taxonomy.json,standard/apache_httpd/http_codes.json \
    --dump

Note, on punchlet performance, you should expect on an Intel Core i7 2.5 GHz:

  • running the injection without doing anything : 730 Keps
  • running the injection with the input tuple creation only : 670 Keps
  • running the injection with the punchlets : 30 Keps

Message Section

This mandatory section contains the payload sent by the log injector.

"message" : {

    # the payloads are templates of what you inject. In there you 
    # can insert %{} variable fields that will be replaced by the 
    # corresponding element you define in the "fields" section 
    # described right next. For example here, %{src} will be replaced 
    # by the "src" field.
    #
    # You can define a single payload. Should you define several 
    # ones like illustrated here, the injector will simply round-robin 
    # on each one.
    #
    # Every time a message is generated, each %{} variable field is 
    # replaced by a new value.
    #
    # You can thus finely control what your output data will look like.
    #
    "payloads" : [
        "%{timestamp}: New session from IP %{src_ip} UUID %{uuid}.",
        "%{timestamp}: %{owner} visited URL %{url} %{nb_visits} times.",
        "%{timestamp}: %{owner} also uploaded %{outbytes}kb and downloaded %{inbytes}kb."
    ],

    # The fields sections lets you define various kind of generated 
    # values. In the following all the supported injector fields are 
    # described.
    #
    "fields" : {

        "src_ip" : {
            # Generate IPV4 addresses. 
            "type" : "ipv4",

            # You use brackets to control what part of the address 
            # you want to make variable. Here all of them. 
            "format" : "[0-255].[0-255].[0-255].[0-255]"
        },
        "url" : {

            # Take the values from a list. Every time a value is 
            # generated you get the next element of your list.
            "type" : "list",

            # Here is your list. 
            "content" : [
                "GET /ref/index.html HTTP/1.1", 
                "GET /yet/another.html.css HTTP/1.1"
                ]
        },

        "owner" : {
            "type" : "list",
            "content" : ["frank", "bob", "alice", "ted", 
                            "dimi", "ced", "phil", "julien"]

            # This time we want to iterate differently. We want to 
            # send "frank" 3 times then "bob" 3 times and so on. 
            "update_every_loop": false,
            "update_every": 3
        },

        "uuid": {
            # Generate a valid unique string identifier
            "type": "session_id"
        },

        "nb_visits" : {
            "type" : "counter",
            "min" : 0,
            "max" : 12
        },
        "inbytes" : {
            "type" : "random",
            "min" : 1000,
            "max" : 30000
        },
        "outbytes" : {
            "type" : "gaussian",
            "mean": 200.0,
            "deviation" : 30.0,
            "mantissa_precision": 2,
            "always_positive": true
        },
        "timestamp" : {
            "type" : "timestamp",
            "format" :  "dd/MMM/yyyy:HH:mm Z",
            "start_time" : "2012.12.31",
            "start_time_format" : "yyyy.MM.dd",
            "tick_interval" : "1h"
        }
    }
}
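The templating behaviour described above can be emulated in a few lines (a simplified sketch handling only plain %{name} substitution and list fields; the real injector supports much more):

```python
import itertools
import re

# Sketch of the injector's templating: each %{name} in a payload is
# replaced by the next value of the corresponding field generator,
# and payloads themselves are used in round-robin order.

def make_list_field(content):
    """A 'list' field: loops over its content forever."""
    return itertools.cycle(content).__next__

def render(payload, fields):
    """Replace every %{name} with a freshly generated value."""
    return re.sub(r"%\{(\w+)\}", lambda m: str(fields[m.group(1)]()), payload)

fields = {"owner": make_list_field(["frank", "bob"])}
payloads = itertools.cycle(["%{owner} logged in", "%{owner} logged out"])

messages = [render(next(payloads), fields) for _ in range(4)]
print(messages)
# → ['frank logged in', 'bob logged out', 'frank logged in', 'bob logged out']
```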

In many cases you want to send JSON payloads. You can use embedded JSON to make that easier. An example explains it all:

"message" : {
    "payloads" : [
        { 
            "time" : "%{timestamp}", 
            "aNumber" : %{number}
        }
    ],
    "fields" : {
        ...
    }
}

Note

The resulting file is not valid JSON anymore, because %{number} would need to be enclosed in quotes. The log injector deals with it, but that supposes you generate a numerical or boolean value.
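The point of this note can be checked directly: the raw template is not parseable JSON, but the rendered message is, provided the substituted value is numeric (a sketch, not the injector's actual code):

```python
import json

template = '{ "time" : "%{timestamp}", "aNumber" : %{number} }'

# The raw template is not valid JSON: %{number} is unquoted.
try:
    json.loads(template)
    parseable = True
except json.JSONDecodeError:
    parseable = False
print(parseable)  # False

# After substituting a numeric value, the message is valid JSON.
rendered = (template
            .replace("%{timestamp}", "01/Jan/2024:00:00 +0000")
            .replace("%{number}", "42"))
doc = json.loads(rendered)
print(doc["aNumber"])  # 42
```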

Here are the supported template types:

  • ipv4 : to generate ipv4 addresses
  • list : to loop over a set of items
  • counter : an iterating numeric value
  • random : a random numeric value following uniform probability density.
  • gaussian : a random value following a gaussian probability density.
  • session_id : generates a UUID
  • timestamp : a timestamp, for which you fully control the format, the start time, and the tick interval.

Loop control

Whatever the type, you can control the value generation using the following optional parameters:

  • update_every_loop (boolean)

    controls the way the field is updated: either every time you refer to it, or once every update_every loops. Note that if set to false, the update_every parameter is mandatory.

    default: true

  • update_every (integer)

    the number of loop iterations before the generated value is changed.

    default: 1
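These semantics (as with the "owner" field above, sending "frank" 3 times, then "bob" 3 times, and so on) can be sketched as follows (an illustrative reimplementation, not the tool's code):

```python
import itertools

def make_field(content, update_every=1):
    """Sketch of a 'list' field with "update_every_loop": false.
    The value only advances once every `update_every` references."""
    cycler = itertools.cycle(content)
    current = next(cycler)
    count = 0
    def generate():
        nonlocal current, count
        value = current
        count += 1
        if count % update_every == 0:
            current = next(cycler)
        return value
    return generate

owner = make_field(["frank", "bob", "alice"], update_every=3)
values = [owner() for _ in range(7)]
print(values)
# → ['frank', 'frank', 'frank', 'bob', 'bob', 'bob', 'alice']
```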

list

  • content (array)

    an array of values the injector will loop over. example: [1, 2, 3], ["hello", "world"]

counter

  • min (integer)

    the (inclusive) min value

  • max (integer)

    the (inclusive) max value
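A counter field thus iterates from min to max, both values included. As a sketch (the wrap-around behaviour after max is an assumption, not documented above):

```python
import itertools

def make_counter(min_value, max_value):
    """Sketch of a 'counter' field: iterates from min to max, both
    inclusive. Wrapping back to min afterwards is an assumption."""
    return itertools.cycle(range(min_value, max_value + 1)).__next__

nb_visits = make_counter(0, 3)
values = [nb_visits() for _ in range(6)]
print(values)  # → [0, 1, 2, 3, 0, 1]
```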

random

  • min (integer)

    the (exclusive) min value

  • max (integer)

    the (exclusive) max value
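Note that, unlike counter, the random bounds are exclusive: generated values lie strictly between min and max. A sketch under that reading:

```python
import random

def make_random(min_value, max_value):
    """Sketch of a 'random' field: uniform integers drawn strictly
    between the exclusive min and max bounds."""
    return lambda: random.randint(min_value + 1, max_value - 1)

random.seed(1)  # deterministic for the example
inbytes = make_random(1000, 30000)
samples = [inbytes() for _ in range(100)]
print(all(1000 < s < 30000 for s in samples))  # True
```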

gaussian

  • mean (integer)

    the average value of the distribution.

    default: 0

  • deviation (integer)

    the standard deviation. Note: this means that 68% of the values will fall within mean ± deviation.

    default: 1

  • mantissa_precision (integer)

    the number of digits after the decimal point. If set to 0, the decimal point '.' itself is removed.

    default: 0

  • always_positive (boolean)

    only generate positive values. Note that the gaussian is also cropped at 2 × mean, to keep the mean value intact.

    default: true
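Putting the gaussian parameters together, here is a sketch of how such a field could behave, with always_positive cropping values to [0, 2 × mean] and mantissa_precision rounding (an illustrative reimplementation, not the tool's code):

```python
import random

def gaussian_field(mean, deviation, mantissa_precision=0, always_positive=True):
    """Draw a gaussian value; cropping symmetrically to [0, 2*mean]
    keeps the distribution centred on the configured mean."""
    value = random.gauss(mean, deviation)
    if always_positive:
        value = min(max(value, 0.0), 2.0 * mean)
    return round(value, mantissa_precision)

random.seed(0)  # deterministic for the example
# the "outbytes" field from the message section above
samples = [gaussian_field(200.0, 30.0, mantissa_precision=2) for _ in range(1000)]
avg = sum(samples) / len(samples)
print(round(avg))  # close to the configured mean of 200
```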