Performance

Introduction

In this chapter we give some useful performance numbers. These numbers are only indicative. They are not meant to be used to size a production platform.

Note

The PunchPlatform team provides professional services to help you correctly size your solution. Do not hesitate to get in touch with us.

Before going through the numbers, it is important to understand the nature of some important PunchPlatform components. Most ready-to-use components (i.e. the ones you are likely to use in production first) are Storm topologies. In a topology you design a processing graph made of spouts and bolts. To put it simply, a spout is in charge of receiving or fetching the data from the outside world, while a bolt either runs some processing on the data or forwards it further down your channel.

The following picture highlights this simplistic but useful view of a topology.

../../../_images/StormRamCpu.png

As depicted, input and output components mostly require RAM to buffer incoming and outgoing data. Of course they also require some CPU to perform their IO, but you are more likely to be concerned with setting their buffering strategy.

Processing bolts may require CPU and/or RAM, depending on their role. A parser bolt (i.e. the PunchBolt) mostly requires CPU. In contrast, the ArchiverBolt requires RAM and little CPU.

Last, output bolts typically require some RAM to batch and possibly compress their output data.

As a general rule of thumb, PunchPlatform topologies deliver excellent performance with low CPU and RAM requirements. The following is a simple bench running 1000 eps with 8 KB logs. As you can see, CPU usage is low and about 30 MB of RAM is required. It drops to less than 8 MB with no traffic.

../../../_images/LowCpu.png ../../../_images/LowRam.png

Memory usage

The PunchPlatform spouts and bolts are designed to:

  1. be fully configurable, so that they can be deployed in processes with a small memory footprint
  2. have no complex dependencies on heavy frameworks. You can actually run a topology in a standalone process without using the Storm runtime.
  3. follow (whenever applicable) a single-threaded asynchronous pattern, i.e. no heavy thread pool is required to run a (say) TCP server spout.

That said, sizing a topology in terms of required memory really depends on its type. You configure the amount of RAM in each topology file using the following parameter:

"storm_settings" : {
    "topology.worker.childopts" : "-server -Xms128m -Xmx128m"
    ...
}

Here the topology will be granted 128 MB of heap. If you end up needing more, you will hit an OutOfMemory (OOM) runtime error. In the rest of this chapter we cover a few important components.

Storm Engine

The PunchPlatform runs topologies either in a Storm runtime, or in a lightweight proprietary engine that only supports single-process topologies. The latter requires less CPU and memory and is well suited for small setups.

In both cases the important parameter is the "topology.max.spout.pending" topology property, which defines the maximum number of data items (i.e. logs) that can be accumulated in flight. That number has of course a direct impact on the memory you need, although it usually accounts for only a few MB.
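As an illustration, with the 8 KB logs used in the bench above, capping the pending items at 2000 bounds the in-flight data at roughly 16 MB. Here is a minimal sketch, assuming you simply set the property next to the heap options shown earlier (the values are illustrative, not recommendations):

"storm_settings" : {
    "topology.max.spout.pending" : 2000,
    "topology.worker.childopts" : "-server -Xms128m -Xmx128m"
    ...
}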

Socket Spouts

A socket spout is in fact a socket server (TCP or Lumberjack, with or without SSL). It reads and buffers the data to in turn feed the downstream bolts. The Lumberjack spout must in addition reply to its clients, so it also requires some outbound buffers.

All socket operations are implemented on top of Netty, a well-known asynchronous IO library. By default Netty allocates native memory instead of regular on-heap memory. For that reason the RAM used by your topology may end up being 2 to 3 times bigger than what you express in your configuration: native memory is not part of the GC-visible heap.

In general this does not pose any problem. If you want to limit memory to a more deterministic amount you can use the following parameters.

"storm_settings" : {
    "topology.worker.childopts" : "-server -Xms100m -Xmx100m -Dio.netty.noPreferDirect=true -Dio.netty.allocator.type=unpooled"
    ...
}

Kafka Spouts

The KafkaSpout reads data from Kafka. By default it uses a 10 MB input buffer, i.e. every read from Kafka will fetch up to 10 MB of data per partition. If you have many spouts reading many topics with many partitions, this may be an issue. For a log management solution, 10 MB is too much; a buffer of 64 KB to 1 MB is enough. You can tune this with the fetch_size spout parameter.
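As a sketch, here is how lowering the fetch size to 1 MB might look. The enclosing "spouts"/"spout_settings" structure and the byte unit are assumptions for illustration; check the KafkaSpout documentation of your release for the authoritative syntax:

"spouts" : [ {
    "type" : "kafka_spout",
    "spout_settings" : {
        "fetch_size" : 1048576
        ...
    }
} ]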

Kafka Bolt

The KafkaBolt writes data to Kafka. By default it uses a 32 MB outbound buffer. You can tune this with the producer.buffer.size bolt parameter. Refer to the KafkaBolt documentation.
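Similarly, a sketch of reducing the producer buffer to 4 MB. Again the enclosing "bolts"/"bolt_settings" structure and the byte unit are assumptions for illustration; the KafkaBolt documentation is authoritative:

"bolts" : [ {
    "type" : "kafka_bolt",
    "bolt_settings" : {
        "producer.buffer.size" : 4194304
        ...
    }
} ]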

CPU

CPU measures given below focus on IO intensive components and punchlets.

Measurement Methodology

The test tool used to bench the PunchPlatform is punchplatform-log-injector.sh. That tool injects fake logs whose content can be made arbitrarily variable. It works with configuration files defining the content, the load to send, the destination endpoint and the protocol (TCP, Lumberjack, with or without compression, etc.). It can be used to load arbitrarily complex channels, as illustrated next:

../../../_images/InjectorDesign.png
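For illustration only, an injection file broadly groups a destination, a load and a message definition. The field names below are hypothetical placeholders, not the actual syntax; refer to the files shipped under $PUNCHPLATFORM_CONF_DIR/resources/injector/perf for real examples.

// hypothetical field names, for illustration only
{
    "destination" : { "proto" : "tcp", "host" : "127.0.0.1", "port" : 9999 },
    "load" : { "message_throughput" : 1000 },
    "message" : { "payloads" : [ "..." ] }
}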

That tool is capable of sending a high load with low CPU usage. It also provides a load mode to automatically find the maximum load a PunchPlatform topology can sustain. It does that only when working both as a client (to inject logs) and as a server (to receive them after they have traversed the PunchPlatform topology(ies)). It then adjusts the sent load according to the received load. If you let it run for a few seconds in that mode, it quickly stabilizes at the maximum load. This is illustrated next:

../../../_images/InjectorPerfLoop.png

The injector is also capable of consuming/injecting messages directly from/into Kafka. This makes it possible to test a Kafka producer or consumer topology in isolation.

../../../_images/InjectorPerfKafkaLoop.png

Socket Forwarder Performance

All the tests described in this chapter can be played on a PunchPlatform standalone version. The purpose of the forwarder tests is to measure the capacity to forward logs to a remote peer using:

  1. plain syslog over TCP
  2. plain syslog over TCP with compression
  3. lumberjack encoding
  4. lumberjack encoding with compression

The sample and template files to play these tests are located under:

> $PUNCHPLATFORM_CONF_DIR/samples/perf
> $PUNCHPLATFORM_CONF_DIR/templates/perf
> $PUNCHPLATFORM_CONF_DIR/resources/injector/perf

Each test relies on a single topology configured with four parallel spout-bolt pipes, each listening on a TCP port. That makes it possible to send a load on a single spout-bolt chain configured with a single thread, or to load four different chains (again, each with a single executor). The former illustrates the performance of a single-executor configuration while the latter illustrates the gain of leveraging all the threads of a server.

The payload used is an Apache log. Check the injection file for the details. Topologies are configured as on a production platform. In particular they are configured to enrich the logs with the following information:

  • a unique identifier: most often required to provide downstream idempotent data replay
  • a timestamp: useful to keep track of a reference timestamp, possibly used for real-time correlation
  • local and remote socket addresses and ports: possibly used downstream for filtering or enrichment

This is probably more enrichment than what you will use on a production setup, but it gives a fair setup. To play these tests do the following:

# generate the performance channel.
> cd $PUNCHPLATFORM_CONF_DIR/samples/perf
> punchplatform-channel.sh --configure tcp_2_tcp

# start the channel to test
> punchplatform-channel.sh --start perf/tcp_2_tcp

# start loading a single executor
> punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/resources/injector/perf/tcp_2_tcp/perf1.json

# start loading all executors
> punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/resources/injector/perf/tcp_2_tcp

The following results were obtained on a MacBook Pro with the following characteristics:

  • MacBook Pro (Retina, 15-inch, Mid 2014)
  • Processor: 2.5 GHz Intel Core i7

Warning

On average, the injector took 10 % of the CPU. The compression ratio is high here because the logs are similar to each other. That level of compression cannot be guaranteed, of course.

Scenario            Compression   Threads   Rate      CPU load   Compression ratio
tcp -> tcp          off           1         17 Keps   33u 22s    -
tcp -> tcp          off           4         30 Keps   64u 34s    -
tcp -> lumberjack   off           1         22 Keps   35u 25s    -
tcp -> lumberjack   off           4         40 Keps   70u 26s    -
tcp -> tcp          on            1         16 Keps   42u 32s    12.8
tcp -> tcp          on            4         28 Keps   70u 30s    12.8
tcp -> lumberjack   on            1         22 Keps   36u 26s    17
tcp -> lumberjack   on            4         37 Keps   74u 24s    17
tcp -> kafka        n/a           1         13 Keps   35u 30s    -
tcp -> kafka        n/a           4         20 Keps   60u 30s    -

Kafka Producers and Consumers

The purpose of these tests is to measure performance of a single topology consuming or producing data from/into Kafka.

The sample and template files to play these tests are located under:

> $PUNCHPLATFORM_CONF_DIR/samples/perf/tcp_2_kafka

The payload used is an Apache log. Check the injection file for the details. Topologies are configured as on a production platform, with the same enrichment as for the socket tests.

The following results were obtained on a MacBook Pro with the following characteristics:

  • MacBook Pro (Retina, 15-inch, Mid 2014)
  • Processor: 2.5 GHz Intel Core i7

Warning

On average, the injector took 10 % of the CPU. The compression ratio is high here because the logs are similar to each other. That level of compression cannot be guaranteed, of course.

Scenario       Threads   Rate      CPU load
tcp -> kafka   1         13 Keps   35u 30s
tcp -> kafka   4         20 Keps   60u 30s

Parsers

The PunchPlatform ships with a tool to measure the maximum sustainable rate a parser can achieve on a single thread. This feature is provided by (once again) the punchplatform-log-injector tool. The principle is to run a chain of punchlets (possibly only one) under a high load of logs, so as to consume 100 % of a single CPU. To obtain useful numbers the injector chains the punchlets directly itself, and does not require any socket, file or Kafka input or output connectors that would add IO load to the measurement.

Here is an example of a typical command:

$ punchplatform-log-injector.sh -c <json-injection-file> --punchlets punchlet1,punchlet2,..

Note

If you have a platform at hand, run punchplatform-log-injector.sh with no arguments; all the options are fully described.

The interesting number is the achieved rate of logs (eps, or events per second) per CPU. Here are the results for a few standard parsers. The “injector rate” corresponds to the method described above. The “end-to-end rate” corresponds to a production-like test with this topology: kafka_spout -> punch_bolt (parser processing) -> kafka_bolt

What is interesting is that there is a correlation between what you get theoretically (with the injector running locally) and what you can expect in practice. Across all the standard parsers, a 52 % ratio was observed between the injector rate and proper end-to-end processing on the same machine. Moreover, assuming the topologies are properly configured, performance scales proportionally with the number of workers and executors on the PunchBolt.

You can find the indicative theoretical performance of each parser (measured on a typical virtual machine) in the Standard parsers page.