Data Simulation¶

Abstract

The punch ships with a high-performance injector tool. It allows you to test Kafka, Elasticsearch, and your pipelines. It also lets you generate arbitrary data, which makes it easier to write your parsers, machine learning, and aggregation use cases.

Overview¶

The PunchPlatform brings a data simulation tool to generate, send, and/or consume arbitrary data: the punchplatform-log-injector.sh, simply referred to as the injector hereafter. This tool is the Swiss Army knife of data/log simulation.

You can use it to inject test messages into any message processing platform. To do that you define injection json configuration files containing the load characteristics, the data message format, and the destination input point.

The injector is capable of writing to sockets (udp, tcp), to Kafka, to elasticsearch and to lumberjack endpoints.

The injector can also play a server role. You can use it to bench your networking plane, to stress test an elasticsearch server, and of course to test a punchplatform. It can read from tcp, udp, lumberjack and Kafka.

Online Parameters¶

Tip

Launch it without any option to display the online help.

--brokers <arg>: the kafka broker where to read from
-c,--conf <arg>: your load configuration file
-check,--dump: dump to stdout instead of injecting
-cp,--compression: set the protocol compression to true
-d,--delay <arg>: overrides injection file throughput or inter message delay, in milliseconds
--dump: print the injected log
--earliest: start consuming kafka message from the earliest
-h,--help: print this message
-H,--host <arg>: overrides injection file destination host
-k,--kafka-consumer: act as kafka consumer, to count the number of received messages by reading kafka directly. You must set a kafka broker and topic
--latest: start consuming kafka message from the latest
-lc,--lumberjack-client: check the proper functioning of a lumberjack connection
-ls,--lumberjack-server: act as lumberjack server to count the number of received logs. You must set the port number
-n,--number <arg>: overrides injection file total message number
-p,--port <arg>: overrides injection file destination port
-punchlets,--punchlets <arg>: stress a chain of punchlets (comma separated)
-q,--silent: reduce verbosity to error messages
-resources,--resources <arg>: adds punchlets resources files (comma separated)
-stream,--stream <arg>: storm stream for injected logs
-sustain: find out the maximum sustainable rate. This works with the lumberjack client only
-t,--throughput <arg>: overrides injection file message throughput, in message per second
-thread,--thread <arg>: set the number of thread used per configuration file.
--topic <arg>: the kafka topic
-ts,--tcp-server: act as tcp server to count the number of received logs. You must set the port number
-u,--udp: use udp
-us,--udp-server: act as udp server to count the number of received logs. You must set the port number
-v,--verbose: prints out the read data. It only works with some senders or receivers.
-w,--connection-timeout <arg>: defines maximum wait time in ms for the receiver port to be available (not in udp mode) - 0 (default value) means infinite wait. Also applies on reconnection after connection loss.

Examples¶

Injecting Apache logs¶

To inject Apache HTTPD traffic, check out the injector file shipped with the standalone platform. Ready-to-use examples are also provided for other log formats.

punchplatform-log-injector.sh -c apache_injection.json

Same command, but changing the rate to 1500 messages per second:

punchplatform-log-injector.sh -c apache_injection.json --throughput 1500

Injecting Lumberjack Data¶

You can set a lumberjack destination in your injector files. You can also use the injector in server mode should you need a quick test server. Here is how to run a lumberjack server listening on tcp/lumberjack port 21212:

punchplatform-log-injector.sh --lumberjack-server --port 21212

To send lumberjack traffic to the server we just started:

punchplatform-log-injector.sh -c lumberjack_injector.json -t 1000

If you need an example, look in the $PUNCHPLATFORM_CONF_DIR/conf/resources/injector/examples/ folder. Refer to the injection file documentation below for details about the lumberjack settings.

Kafka¶

You can write to Kafka using the kafka destination. A handy feature on the punchplatform is to directly generate a lumberjack-encoded record from the payload defined in your injector json file. This lets you fill Kafka with records that can be immediately consumed by punch topologies or pml.

Here is the principle of the injector file:

    destination : {
        proto : kafka

        // this must be a kafka broker declared in your platform property file.
        brokers : local

        // encoding can be bytes, in which case the Kafka record will
        // contain the payload value as a byte array, or lumberjack.
        encoding : lumberjack
        topic: mytopic
    }
    message : {
        // In this example you generate json data, each with two fields.
        // These will be encoded as a lumberjack frame with two key-values
        // automatically deduced from the json, i.e. a user key and an age key.
        //
        payloads : [
            {
                "user" : "%{user}"
                "age" : %{age}
            }
        ]
        ...
    }

You can also inject custom headers in each kafka record sent by the log injector with the following configuration:

{
    destination : {
        proto : kafka,
        headers : [
            {
                key : myheader1
                value : hello world
                type : string
            },
            {
                key : myheader2
                value : 12345
                type : integer
            }
        ] 
    }
}

You can check that it works as expected by reading directly from Kafka as follows:

punchplatform-log-injector.sh --kafka-consumer --topic mytopic --brokers local -v

Testing Punchlet(s)¶

Test or stress a pipeline of punchlets to measure its overall performance:

punchplatform-log-injector.sh -c [injection_file].json --punchlets p1.punch,p2.punch,... --resources r1.json,r2.json,...

Test Lumberjack Connections¶

Tip

At Thales, LTR stands for Log TRansporter and LMR stands for Log Receiver. Agreed, these are bad names, but they are well known inside Thales so we keep them. An LTR is a small log shipper installed on a source (customer) site; the LMR receives these logs on a central site. The lumberjack protocol is used in between to ensure end-to-end acknowledgement, rate limiting, etc.

When deploying a PunchPlatform that involves an LTR-to-LMR setup, you must ensure that the network connection between the hosts works as expected. You can use the injector to quickly test that. Once it works, you no longer need to worry about firewalls, proxies, or other network issues, and you can move on to set up your final configuration.

First, start a Lumberjack server on the LMR host:

punchplatform-log-injector.sh --lumberjack-server --port 9900

If you need to display what is received use:

# also output the received message
punchplatform-log-injector.sh --lumberjack-server --port 9900 --verbose

Now, on the LTR side, start a client that will send messages to the server:

punchplatform-log-injector.sh --lumberjack-client --host localhost --port 9900

Unless stopped, the client will try to forward logs to the server. Watch the console output from the LMR to see if the messages are correctly received.
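
If the lumberjack client cannot reach the server, it can help to first check plain TCP reachability of the LMR port from the LTR host. The following is a minimal sketch in Python, not part of the injector: the function name `port_is_reachable` and the timeout value are assumptions made for illustration. It only tells you whether a TCP connection can be opened; it does not speak the lumberjack protocol.

```python
import socket


def port_is_reachable(host, port, timeout_s=3.0):
    """Return True if a plain TCP connection to host:port can be opened.

    Hypothetical helper: it checks firewall/routing reachability only,
    not that a lumberjack server is actually answering on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

For instance, `port_is_reachable("localhost", 9900)` should return True while the lumberjack server started above is listening, and False once it is stopped.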

Injection Configuration File¶

The injection file is an Hjson file. You can add '#' prefixed comments. The various sections are described below.

Destination Section¶

This section defines where you want to send your generated data. It is optional: you can define the destination using the command-line parameters instead. If you set one, you can still override it using online parameters. Here is an example to send your generated data to a TCP server.

{
    "destination" : { 
        "proto" : "tcp", 
        "host" : "127.0.0.1", 
        "port" : 9999 
    },
...
}
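
As an illustration of how the file and the command-line overrides fit together, here is a hedged Python sketch that writes a small injection file and builds the matching command line. The file name `tcp_injection.json`, the helper names and the field values are assumptions; only the configuration keys and the `-c`, `--host` and `--port` options come from this page. Plain JSON is used because it is also valid Hjson. The command is only built, not executed.

```python
import json
from pathlib import Path

# Illustrative content: the keys are the ones documented on this page,
# the values are arbitrary.
CONFIG = {
    "destination": {"proto": "tcp", "host": "127.0.0.1", "port": 9999},
    "load": {"message_throughput": 10, "total_messages": 100},
    "message": {
        "payloads": ["%{owner} logged in"],
        "fields": {"owner": {"type": "list", "content": ["frank", "bob"]}},
    },
}


def write_injection_file(directory, name="tcp_injection.json"):
    """Write CONFIG as JSON into `directory` and return the file path."""
    path = Path(directory) / name
    path.write_text(json.dumps(CONFIG, indent=4))
    return path


def injector_command(path, host=None, port=None):
    """Build the injector argv, optionally overriding the file destination."""
    argv = ["punchplatform-log-injector.sh", "-c", str(path)]
    if host is not None:
        argv += ["--host", host]
    if port is not None:
        argv += ["--port", str(port)]
    return argv
```

The returned list can be passed to `subprocess.run` on a host where the injector is installed.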

The supported protocols are :

tcp: send the data to TCP server
udp: send the data to UDP server
lumberjack: send the data to a lumberjack server
http: performs POST REST requests to an http server
stdout: just print out the generated data. Use it for debugging purposes and copy-paste.
kafka: act as a kafka producer, toward a given topic
elasticsearch: inject data directly to an Elasticsearch cluster.

Here are example configurations for the "destination" section:

# all these require plain host port parameters
{ "proto" : "tcp", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "udp", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "lumberjack", "host" : "127.0.0.1", "port" : 9999 }
{ "proto" : "http",
    "host" : "127.0.0.1",
    "port" : 9999,
    "http_method": "POST",
    "http_root_url": "/",
    "bulk_size": 1
}

# Elasticsearch configuration ('port' is optional, 'bulk_size' default is 1)
{ 
    "proto" : "elasticsearch", "host": "127.0.0.1", "port": 9300, 
    "cluster_name" : "es_search", 
    "index": "test", 
    "type": "doc", 
    "bulk_size": 1000
}

# Kafka only accepts a "brokers" name that must be defined in your 
# punchplatform.properties file. That is : this option only works
# (as of today) on an installed punchplatform.
{ 
    "proto": "kafka", 
    "brokers": "local", 
    "topic": "mytenant_bluecoat_proxysg"
}

Load Section¶

This section lets you control the injector's throughput. It is also optional if you prefer using online parameters.

"load" :{

    # "message_throughput" indicates the number of message per second.
    # Sometimes you want to inject fewer message than 1 per second, 
    # you can then use the alternative property : "inter_message_delay" 
    # For example to inject one message every 30 seconds :
    #   "inter_message_delay" : 30
    # 
    "message_throughput" : 1000,

    # Optional : control how often you have a recap stdout message. 
    "stats_publish_interval" : "2s",

    # The total number of message. Use -1 for almost infinite (2³¹-1 messages). 
    "total_messages" : 1000000,

    # Optional : make your throughput fixed or variable. By default fixed.
    # Using "variable" makes your load vary between 50 and 150 % of your 
    # set throughput.
    "type" : "fixed"
}
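
The arithmetic behind these settings can be sketched as follows. This is only an illustration of the relationships described in the comments above, not the injector's code, and the function names are hypothetical.

```python
def run_duration_seconds(total_messages, message_throughput):
    """Time needed to send total_messages at a fixed rate (messages/second)."""
    return total_messages / message_throughput


def variable_load_bounds(message_throughput):
    """Bounds of a "variable" load: 50 % to 150 % of the set throughput."""
    return 0.5 * message_throughput, 1.5 * message_throughput


def delay_between_messages_seconds(message_throughput):
    """Average gap between two messages at the given rate."""
    return 1.0 / message_throughput
```

With the values of the example above (1000000 messages at 1000 messages per second), a fixed load runs for 1000 seconds, and a variable load oscillates between 500 and 1500 messages per second.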

Punchlets Performance Test¶

The injector is well suited to stressing one punchlet or a chain of punchlets under a high load of data. Using the --punchlets argument, you make a chain of punchlets traversed by tons of (representative) data.

To check that everything runs fine before stressing the punchlets, use the -v option to dump the punchlet result. Again, the -t option is your friend here to do that slowly:

punchplatform-log-injector.sh -c <json-injection-file> --punchlets punchlet1,punchlet2,.. -t 1 -v

If you need to include punchlet resources, use the --resources option:

punchplatform-log-injector.sh -c <json-injection-file> \
    --punchlets standard/common/input.punch,standard/common/parsing_syslog_header.punch,... \
    --resources standard/apache_httpd/taxonomy.json,standard/apache_httpd/http_codes.json \
    --dump

Note, on punchlet performance, you should expect on an Intel Core i7 2.5 GHz:

running the injection without doing anything : 730 Keps
running the injection with the input tuple creation only : 670 Keps
running the injection with the punchlets : 30 Keps

Message Section¶

This mandatory section contains the payload sent by the log injector.

"message" : {

    # the payloads are templates of what you inject. In there you 
    # can insert %{} variable fields that will be replaced by the 
    # corresponding element you define in the "fields" section
    # described right next. For example here, %{src_ip} will be replaced
    # by the "src_ip" field.
    #
    # You can define a single payload. Should you define several 
    # ones like illustrated here, the injector will simply round-robin 
    # on each one.
    #
    # Every time a message is generated, each %{} variable field is 
    # replaced by a new value.
    #
    # You can thus finely control what your output data will look like.
    #
    "payloads" : [
        "%{timestamp}: New session from IP %{src_ip} UUID %{uuid}.",
        "%{timestamp}: %{owner} visited URL %{url} %{nb_visits} times.",
        "%{timestamp}: %{owner} also uploaded %{outbytes}kb and downloaded %{inbytes}kb."
    ],

    # The fields section lets you define various kinds of generated
    # values. All the supported injector field types are illustrated
    # below.
    #
    "fields" : {

        "src_ip" : {
            # Generate IPV4 addresses. 
            "type" : "ipv4",

            # You use brackets to control what part of the address 
            # you want to make variable. Here all of them. 
            "format" : "[0-255].[0-255].[0-255].[0-255]"
        },
        "url" : {

            # Take the values from a list. Every time a value is 
            # generated you get the next element of your list.
            "type" : "list",

            # Here is your list. 
            "content" : [
                "GET /ref/index.html HTTP/1.1", 
                "GET /yet/another.html.css HTTP/1.1"
                ]
        },

        "owner" : {
            "type" : "list",
            "content" : ["frank", "bob", "alice", "ted", 
                            "dimi", "ced", "phil", "julien"],

            # This time we want to iterate differently. We want to 
            # send "frank" 3 times then "bob" 3 times and so on. 
            "update_every_loop": false,
            "update_every": 3
        },

        "uuid": {
            # Generate a valid unique string identifier
            "type": "session_id"
        },

        "nb_visits" : {
            "type" : "counter",
            "min" : 0,
            "max" : 12
        },
        "inbytes" : {
            "type" : "random",
            "min" : 1000,
            "max" : 30000
        },
        "outbytes" : {
            "type" : "gaussian",
            "mean": 200.0,
            "deviation" : 30.0,
            "mantissa_precision": 2,
            "always_positive": true
        },
        "timestamp" : {
            "type" : "timestamp",
            "format" :  "dd/MMM/yyyy:HH:mm Z",
            "start_time" : "2012.12.31",
            "start_time_format" : "yyyy.MM.dd",
            "tick_interval" : "1h"
        }
    }
}
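
To make the templating behaviour concrete, here is a minimal Python sketch of round-robin payload selection and %{} substitution. It mimics the behaviour described in the comments above and is not the injector's implementation. The `render` function, the regular expression and the use of plain iterators as field generators are assumptions made for illustration; in particular, the counter is assumed to wrap around when it reaches its max.

```python
import itertools
import re

# Matches %{name} placeholders in a payload template.
FIELD = re.compile(r"%\{(\w+)\}")


def render(payloads, generators, count):
    """Yield `count` messages, cycling over `payloads` and drawing a new
    value from `generators[name]` for every %{name} placeholder."""
    templates = itertools.cycle(payloads)
    for _ in range(count):
        template = next(templates)
        yield FIELD.sub(lambda m: str(next(generators[m.group(1)])), template)


# Stand-ins for a "list" field and a "counter" field (min 0, max 12).
generators = {
    "owner": itertools.cycle(["frank", "bob"]),
    "nb_visits": itertools.cycle(range(0, 13)),
}
payloads = ["%{owner} visited %{nb_visits} times.", "%{owner} left."]
messages = list(render(payloads, generators, 3))
# messages == ["frank visited 0 times.", "bob left.", "frank visited 1 times."]
```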

In many cases you want to send json payloads. You can use embedded json to make it easier, as this example shows:

"message" : {
    "payloads" : [
        { 
            "time" : "%{timestamp}", 
            "aNumber" : %{number}
        }
    ],
    "fields" : {
        ...
    }
}

Note

The resulting file is no longer valid JSON, because %{number} would need to be enclosed in quotes. The log injector deals with it, but this assumes that you generate a numerical or boolean value.
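
The point of this note can be checked with a few lines of Python. This is only an illustration using the standard `json` module; the substituted values are made up, and the plain string replacement stands in for the injector's own templating.

```python
import json

# The embedded payload from the example, written as a raw template string.
template = '{ "time": "%{timestamp}", "aNumber": %{number} }'


def is_valid_json(text):
    """Return True if `text` parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def substitute(text, timestamp, number):
    """Naive stand-in for the injector's %{} replacement."""
    return text.replace("%{timestamp}", timestamp).replace("%{number}", number)


numeric = substitute(template, "31/Dec/2012:00:00 +0000", "42")
textual = substitute(template, "31/Dec/2012:00:00 +0000", "frank")
```

Here `is_valid_json(template)` and `is_valid_json(textual)` are False, while `is_valid_json(numeric)` is True: an unquoted placeholder only yields valid JSON once it is replaced by a number or a boolean.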

Here are the supported template types:

ipv4 : to generate ipv4 addresses
list : to loop over a set of items
counter : an iterating numeric value
random : a random numeric value following uniform probability density.
gaussian : a random value following a gaussian probability density.
session_id : generates a UUID
timestamp : a timestamp, for which you fully control the format, the start time, and the tick interval.

Input file option¶

You may want to input data from a file containing logs.

The configuration file allows you to set a file path to load. Each line of this file is then sent as data to the destination. When the end of the file is reached, the log injector resets to the first line to ensure continuous data injection.
You can thus use either a big log file or a small one and loop over it.

{
    "message" : {
        "file" : "/path/to/log_input.csv"
    }
}
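
The looping behaviour can be pictured with this short Python sketch. It is an illustration of the described semantics under the assumption that every line is one message; the function name `replay_lines` is hypothetical.

```python
import itertools


def replay_lines(path, count):
    """Return `count` messages read from `path`, one per line, restarting
    at the first line whenever the end of the file is reached."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.rstrip("\n") for line in handle]
    return list(itertools.islice(itertools.cycle(lines), count))
```

With a two-line file, asking for five messages yields line 1, line 2, line 1, line 2, line 1.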

Loop control¶

Whatever the type, you can control the value generation using the following optional parameters:

update_every_loop (boolean)

controls how the field is updated: either every time you refer to it (true), or once every update_every loops (false). Note that if set to false, the update_every parameter is mandatory.

default: true

update_every (integer)

the number of loop iterations before the generated value is changed.

default: 1
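
As a sketch of what update_every does for a list field, assuming one loop iteration per generated message (the generator below is illustrative, not the injector's code):

```python
import itertools


def list_field(content, update_every=1):
    """Yield values from `content`, moving on to the next element only
    after `update_every` draws, and wrapping around at the end."""
    draws = 0
    while True:
        yield content[(draws // update_every) % len(content)]
        draws += 1


owners = list(itertools.islice(list_field(["frank", "bob", "alice"], 3), 7))
# owners == ["frank", "frank", "frank", "bob", "bob", "bob", "alice"]
```

This reproduces the "frank 3 times, then bob 3 times" behaviour of the owner field shown in the message section.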

list¶

content (array)

an array of values the injector will loop over. example: [1, 2, 3], ["hello", "world"]

counter¶

min (integer)

the (inclusive) min value

max (integer)

the (inclusive) max value

random¶

min (integer)

the (exclusive) min value

max (integer)

the (exclusive) max value

gaussian¶

mean (integer)

the average value of the distribution.

default: 0

deviation (integer)

the standard deviation. Note: this means that about 68% of the values fall between "mean" - "deviation" and "mean" + "deviation".

default: 1

mantissa_precision (integer)

number of digits after the decimal point. If set to 0, the decimal point '.' itself is removed.

default: 0

always_positive (boolean)

only generate positive values. Note that the gaussian is then also cropped at 2 * mean, so that the mean value stays intact.

default: true
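
The following Python sketch illustrates how these gaussian parameters interact. It is an approximation written for this page, not the injector's implementation. In particular, it assumes that cropping is done by redrawing values falling outside [0, 2 * mean], and that cropping is skipped when the mean is not strictly positive; neither detail is confirmed by the source.

```python
import random


def gaussian_field(mean, deviation, mantissa_precision=0,
                   always_positive=True, rng=None):
    """Yield formatted gaussian values as strings.

    Assumption: out-of-range values are redrawn, which keeps the
    distribution symmetric around `mean` and so leaves the mean intact.
    """
    if rng is None:
        rng = random.Random()
    crop = always_positive and mean > 0
    while True:
        value = rng.gauss(mean, deviation)
        if crop and not (0.0 <= value <= 2.0 * mean):
            continue  # redraw instead of clamping
        # With a precision of 0 the output has no decimal point at all.
        yield f"{value:.{mantissa_precision}f}"


gen = gaussian_field(200.0, 30.0, mantissa_precision=2, rng=random.Random(42))
samples = [next(gen) for _ in range(1000)]
```

With the outbytes settings from the message section (mean 200.0, deviation 30.0, precision 2), every sample is a string with two decimals, lies between 0 and 400, and the sample average stays close to 200.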