AbstractKafkaOutput (Punch Storm Spouts and Bolts 6.4.4 API)

java.lang.Object
- org.thales.punch.libraries.storm.api.BaseProcessingNode
- - org.thales.punch.libraries.storm.bolt.AbstractKafkaOutput

All Implemented Interfaces: Serializable, org.apache.storm.task.IBolt, org.apache.storm.topology.IComponent, org.apache.storm.topology.IRichBolt

Direct Known Subclasses: JsonKafkaOutput, LumberjackKafkaOutput

public abstract class AbstractKafkaOutput
extends org.thales.punch.libraries.storm.api.BaseProcessingNode

The KafkaBolt writes the content of Storm tuples to Kafka topics.

This bolt can subscribe to one or several Storm streams. Upon receiving a Storm tuple on a stream, it takes the corresponding key-value pairs from the tuple, encodes them (see below for the encoding format) as a Kafka message, and writes that message to the configured topic.

The Kafka bolt produces key-value messages, encoded using the Lumberjack or JSON format. In either case, the keys correspond to the Storm tuple fields and the values to the Storm tuple values. As an example, suppose the input Storm tuple is received on the stream "logs" and contains two fields, "es_id" and "data". Using JSON encoding, the Kafka message will be encoded as follows:

 
   {
     "es_id" : "ejKld9iwoo",
     "data" : "hello world"
   }

Note that the Storm stream ("logs") does not appear in the message: only the fields and values are sent to Kafka. With the Lumberjack encoding, the message is a Lumberjack binary frame instead, but the principle is the same, since a Lumberjack frame is a binary-encoded key-value map.
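The field-to-key mapping described above can be sketched in a few lines of Java. This is an illustration only, not the bolt's encoder: the class `TupleJsonSketch` and its methods are hypothetical names, the tuple is modeled as a plain `Map` of string fields, and the exact whitespace of the real output may differ.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Illustration only: renders tuple fields and values as a flat JSON object. */
final class TupleJsonSketch {

    /** Wraps a string in double quotes, escaping what JSON requires. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Other control characters use the 4-digit hex escape form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /**
     * Field names become JSON keys and field values become JSON values.
     * The stream id is accepted only to show that it is not part of the output.
     */
    static String encode(String streamId, Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            json.add(quote(e.getKey()) + ":" + quote(e.getValue()));
        }
        return json.toString();
    }
}
```

Passing the stream "logs" with the fields "es_id" and "data" yields an object with exactly those two keys; the stream name is dropped, as in the example above.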

The Kafka consumer that eventually reads your messages must be prepared to handle the JSON or Lumberjack format, and must expect key-value content. In the likely case where you use the punchplatform Kafka spout in a Storm topology to consume the messages, it handles either format automatically. You still have to declare, in the Kafka spout configuration, the list of fields to emit as part of the Storm tuple. Please refer to the KafkaSpout documentation.

If you use a third party Kafka consumer, you are on your own.

Examples

With a unique output Kafka topic:

 
   "bolts": [
     {
       "type": "kafka_bolt",
       "bolt_settings": {
         "brokers": "local",
         "topic": "my_test_topic",
         "encoding": "json"
       },
       "storm_settings": {
         ...
       }
     },
     ...
   ]

With multiple output topics based on Storm streams:

 
   "bolts": [
     {
       "type": "kafka_bolt",
       "bolt_settings": {
         "brokers": "local",
         "encoding": "lumberjack",
         "topics": [
           {
             "stream": "log",
             "topic": "mytenant_apache_httpd"
           },
           {
             "stream": "events",
             "topic": "mytenant_events"
           }
         ]
       },
       "storm_settings": {
         ...
       }
     },
     ...
   ]
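The choice between "topic" and "topics" amounts to a lookup from stream identifier to topic name. The sketch below shows one plausible way to model it; `TopicRoutingSketch` and `targetTopic` are hypothetical names, and what the real bolt does with a stream that has no mapping is not stated here (this sketch simply returns null).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustration only: resolves the output topic for a Storm stream. */
final class TopicRoutingSketch {
    private final String defaultTopic;               // value of "topic", may be null
    private final Map<String, String> topicByStream; // built from the "topics" list

    TopicRoutingSketch(String defaultTopic, Map<String, String> topicByStream) {
        // Mirrors the documented rule: one of 'topic' or 'topics' must be set.
        if (defaultTopic == null && topicByStream.isEmpty()) {
            throw new IllegalArgumentException("either 'topic' or 'topics' must be provided");
        }
        this.defaultTopic = defaultTopic;
        this.topicByStream = new HashMap<>(topicByStream);
    }

    /** Returns the per-stream topic if one is mapped, else the single topic (possibly null). */
    String targetTopic(String streamId) {
        return topicByStream.getOrDefault(streamId, defaultTopic);
    }
}
```

With the configuration above, the stream "log" would resolve to "mytenant_apache_httpd" and the stream "events" to "mytenant_events".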

Multiple-partition load-balancing strategies

Let's say you have a topic with multiple partitions. By default, the Kafka Bolt will use the default Kafka producer strategy and fill these partitions using a round-robin strategy.

If you need to split your data across partitions according to a field value, use the 'partition_key' option. For example, if your stream has the field "group_id" and you want messages with the same "group_id" to land in the same partition, use the following configuration:

 
   "bolts": [
     {
       "type": "kafka_bolt",
       "bolt_settings": {
         "brokers": "local",
         "topic": "user_topic",
         "encoding": "json",
         "partition_key": "group_id"
       },
       "storm_settings": {
         ...
       }
     },
     ...
   ]
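To illustrate why a partition key keeps related messages together, here is a minimal sketch of keyed partitioning. It is not the bolt's code: the class `PartitionKeySketch` and the method `partitionFor` are hypothetical, and `Arrays.hashCode` is a stand-in for the hash function the Kafka producer really applies to key bytes. The only property it aims to show is that a given key value always maps to the same partition index.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustration only: maps a partition key value to a partition index. */
final class PartitionKeySketch {

    /**
     * Hashes the key bytes and reduces the hash modulo the partition count.
     * floorMod keeps the result in [0, numPartitions) even for negative hashes.
     */
    static int partitionFor(String keyValue, int numPartitions) {
        byte[] keyBytes = keyValue.getBytes(StandardCharsets.UTF_8);
        // Stand-in hash; the real producer uses its own hash of the key bytes.
        return Math.floorMod(Arrays.hashCode(keyBytes), numPartitions);
    }
}
```

Because the index depends only on the key bytes and the partition count, every tuple carrying the same "group_id" value is sent to the same partition for as long as the number of partitions does not change.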

Properties table

Note: all properties starting with "producer." refer to native Kafka producer properties.

**Bolt Settings**
property	mandatory	type	default	description
brokers	yes	String	-	Name of the Kafka cluster. The corresponding cluster must be defined in the punchplatform.properties file.
topic	yes/no	String	-	Define the Kafka output topic to use. If not set, 'topics' must be provided.
topics	yes/no	Map List	-	Define a Kafka output topic per Storm stream. If not set, 'topic' must be provided.
encoding	yes	String	-	Valid values are "lumberjack" or "json"; the Kafka message is encoded accordingly.
partition_key	no	String	-	On a topic with multiple partitions, the value of this Storm tuple field is used to group messages with the same value in the same partition.
producer.bootstrap.servers	no	String	-	This is only required if you do not use the 'brokers' property. A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
producer.key.serializer	no	String	"org.apache.kafka.common.serialization.ByteArraySerializer"	Serializer class for key that implements the Serializer interface.
producer.value.serializer	no	String	"org.apache.kafka.common.serialization.ByteArraySerializer"	Serializer class for value that implements the Serializer interface.
producer.partitioner.class	no	String	"org.apache.kafka.clients.producer.internals.DefaultPartitioner"	Partitioner class used to load-balance messages across the partitions of a topic.
producer.acks	no	String	"all"	The number of acknowledgments the producer requires the leader to have received before considering a request complete. This controls the durability of records that are sent. The following settings are allowed: acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1. acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost. acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.
producer.batch.size	no	int	16384	The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This helps performance on both the client and the server. This configuration controls the default batch size in bytes. No attempt will be made to batch records larger than this size. Requests sent to brokers will contain multiple batches, one for each partition with data available to be sent. A small batch size will make batching less common and may reduce throughput (a batch size of zero will disable batching entirely). A very large batch size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional records.
producer.linger.ms	no	long	0	The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay—that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.
producer.buffer.memory	no	int	33554432	The total bytes of memory the producer can use to buffer records waiting to be sent to the server. If records are sent faster than they can be delivered to the server the producer will block for max.block.ms after which it will throw an exception. This setting should correspond roughly to the total memory the producer will use, but is not a hard bound since not all memory the producer uses is used for buffering. Some additional memory will be used for compression (if compression is enabled) as well as for maintaining in-flight requests.
producer.compression.type	no	String	"none"	The compression type for all data generated by the producer. The default is none (i.e. no compression). Valid values are none, gzip, snappy, or lz4. Compression is of full batches of data, so the efficacy of batching will also impact the compression ratio (more batching means better compression).
producer.retries	no	int	0	Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error. Allowing retries without setting max.in.flight.requests.per.connection to 1 will potentially change the ordering of records because if two batches are sent to a single partition, and the first fails and is retried but the second succeeds, then the records in the second batch may appear first.
producer.connections.max.idle.ms	no	int	540000	Close idle connections after the number of milliseconds specified by this configuration.
producer.max.block.ms	no	int	60000	The configuration controls how long KafkaProducer.send() and KafkaProducer.partitionsFor() will block. These methods can be blocked either because the buffer is full or because metadata is unavailable. Blocking in the user-supplied serializers or partitioner will not be counted against this timeout.
producer.max.request.size	no	int	1048576	The maximum size of a request in bytes. This is also effectively a cap on the maximum record size. Note that the server has its own cap on record size which may be different from this. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests.
producer.receive.buffer.bytes	no	int	32768	The size of the TCP receive buffer (SO_RCVBUF) to use when reading data. If the value is -1, the OS default will be used.
producer.request.timeout.ms	no	int	30000	The configuration controls the maximum amount of time the client will wait for the response of a request. If the response is not received before the timeout elapses the client will resend the request if necessary or fail the request if retries are exhausted. This should be larger than replica.lag.time.max.ms (a broker configuration) to reduce the possibility of message duplication due to unnecessary producer retries.
producer.timeout.ms	no	int	30000	The configuration controls the maximum amount of time the server will wait for acknowledgments from followers to meet the acknowledgment requirements the producer has specified with the acks configuration. If the requested number of acknowledgments are not met when the timeout elapses an error will be returned. This timeout is measured on the server side and does not include the network latency of the request.
producer.max.in.flight.requests.per.connection	no	int	5	The maximum number of unacknowledged requests the client will send on a single connection before blocking. Note that if this setting is set to be greater than 1 and there are failed sends, there is a risk of message re-ordering due to retries (i.e., if retries are enabled).
producer.metadata.fetch.timeout.ms	no	int	60000	The first time data is sent to a topic we must fetch metadata about that topic to know which servers host the topic's partitions. This config specifies the maximum time, in milliseconds, for this fetch to succeed before throwing an exception back to the client.
producer.metadata.max.age.ms	no	int	300000	The period of time in milliseconds after which we force a refresh of metadata even if we haven't seen any partition leadership changes to proactively discover any new brokers or partitions.
producer.reconnect.backoff.ms	no	int	50	The amount of time to wait before attempting to reconnect to a given host. This avoids repeatedly connecting to a host in a tight loop. This backoff applies to all requests sent by the producer to the broker.
producer.retry.backoff.ms	no	int	100	The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.
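As a rough illustration of how the "producer."-prefixed settings in the table relate to native Kafka producer properties, the sketch below copies them into a `java.util.Properties` object with the prefix removed. That the bolt strips the prefix in exactly this way is an assumption; `ProducerSettingsSketch` and `extract` are hypothetical names, and the bolt settings are modeled as a plain `Map`.

```java
import java.util.Map;
import java.util.Properties;

/** Illustration only: derives native producer properties from bolt settings. */
final class ProducerSettingsSketch {
    static final String PREFIX = "producer.";

    /**
     * Keeps only the entries whose key starts with "producer.", drops the
     * prefix, and stores every value as a string.
     */
    static Properties extract(Map<String, Object> boltSettings) {
        Properties props = new Properties();
        for (Map.Entry<String, Object> e : boltSettings.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                String nativeName = e.getKey().substring(PREFIX.length());
                props.setProperty(nativeName, String.valueOf(e.getValue()));
            }
        }
        return props;
    }
}
```

Under that assumption, a setting such as "producer.linger.ms" with value 5 would reach the producer as "linger.ms" with the string value "5", while settings such as "brokers" or "topic" would be left out.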

Storm level settings

You can add a "storm_settings" section to your bolt configuration.

property	mandatory	type	default	comment
executors	no	long	1	The number of bolt thread(s) that will be launched by Storm.

See Also: Serialized Form

Nested Class Summary

Nested Classes
Modifier and Type	Class and Description
`protected static class`	`AbstractKafkaOutput.Topic` Gather any information related to a topic.

Field Summary

Fields
Modifier and Type	Field and Description
`protected AsyncKafkaProducer`	`boltProducer` The Kafka producer used to send the message/records
`protected boolean`	`exitOnFailure` Make the process exit if the writing to Kafka fails
`protected Properties`	`producerProperties` The Kafka producer properties
`protected org.thales.punch.settings.api.ISettingsMap`	`settings` The bolt settings, as revamped by the factory

Fields inherited from class org.thales.punch.libraries.storm.api.BaseProcessingNode
ackRate, collector, componentId, every, failRate, metricContext, nodeSettings, topologyContext, traversalTime

Constructor Summary

Constructors
Constructor and Description
`AbstractKafkaOutput(org.thales.punch.libraries.storm.api.NodeSettings boltConfig, Properties producerProperties)` The properties must contain only strings.

Method Summary

Modifier and Type	Method and Description
`void`	`declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)`
`protected AbstractKafkaOutput.Topic`	`getTargetTopic(String streamId)`
`void`	`prepare(Map conf, org.apache.storm.task.TopologyContext context, org.apache.storm.task.OutputCollector collector)`
`void`	`sendKafkaMessage(org.apache.storm.tuple.Tuple tuple, AbstractKafkaOutput.Topic topic, byte[] kafkaMessage)` Call the Kafka producer to send the message

Methods inherited from class org.thales.punch.libraries.storm.api.BaseProcessingNode
ack, cleanup, enrichAndForwardMonitoringTuple, execute, fail, getComponentConfiguration, getNewPoint, process

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - producerProperties
```
protected Properties producerProperties
```
    The Kafka producer properties
  - boltProducer
```
protected transient AsyncKafkaProducer boltProducer
```
    The Kafka producer used to send the message/records
  - settings
```
protected org.thales.punch.settings.api.ISettingsMap settings
```
    The bolt settings, as revamped by the factory
  - exitOnFailure
```
protected boolean exitOnFailure
```
    Make the process exit if the writing to Kafka fails
- Constructor Detail
  - AbstractKafkaOutput
```
public AbstractKafkaOutput(org.thales.punch.libraries.storm.api.NodeSettings boltConfig,
                           Properties producerProperties)
                    throws org.thales.punch.exceptions.ConfigurationException
```
    The properties must contain only strings. They are serialized and sent over the wire when the bolt is deployed in the Storm cluster.
    
    Parameters:
    
    boltConfig - the bolt configuration
    
    producerProperties - the bolt properties. You can include in these the native Kafka producer properties
    
    Throws:
    
    org.thales.punch.exceptions.ConfigurationException
- Method Detail
  - prepare
```
public void prepare(Map conf,
                    org.apache.storm.task.TopologyContext context,
                    org.apache.storm.task.OutputCollector collector)
```
    Specified by:
    
    prepare in interface org.apache.storm.task.IBolt
    
    Overrides:
    
    prepare in class org.thales.punch.libraries.storm.api.BaseProcessingNode
  - declareOutputFields
```
public void declareOutputFields(org.apache.storm.topology.OutputFieldsDeclarer declarer)
```
    Specified by:
    
    declareOutputFields in interface org.apache.storm.topology.IComponent
    
    Overrides:
    
    declareOutputFields in class org.thales.punch.libraries.storm.api.BaseProcessingNode
  - sendKafkaMessage
```
public void sendKafkaMessage(org.apache.storm.tuple.Tuple tuple,
                             AbstractKafkaOutput.Topic topic,
                             byte[] kafkaMessage)
```
    Call the Kafka producer to send the message
    
    Parameters:
    
    tuple - the Storm tuple being written
    
    topic - the target Kafka topic
    
    kafkaMessage - the encoded message bytes to send
  - getTargetTopic
```
protected AbstractKafkaOutput.Topic getTargetTopic(String streamId)
```
    Parameters:
    
    streamId - the originator stream identifier
    
    Returns:
    
    the target topic
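The constructor documentation above requires the producer properties to contain only strings. A caller could check that before constructing a subclass, along the lines of the sketch below. `StringOnlyPropertiesSketch` and `containsOnlyStrings` are hypothetical names; whether the constructor performs a similar check itself is not stated in this page.

```java
import java.util.Map;
import java.util.Properties;

/** Illustration only: verifies that a Properties object holds only strings. */
final class StringOnlyPropertiesSketch {

    /**
     * Properties extends Hashtable<Object, Object>, so put() accepts any object.
     * This walks every entry and rejects non-string keys or values.
     */
    static boolean containsOnlyStrings(Properties props) {
        for (Map.Entry<Object, Object> e : props.entrySet()) {
            if (!(e.getKey() instanceof String) || !(e.getValue() instanceof String)) {
                return false;
            }
        }
        return true;
    }
}
```

Using `setProperty` rather than `put` when building the properties avoids the problem in the first place, since `setProperty` only accepts string keys and values.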
