ArchiveSpout

The ArchiveSpout is a simple spout to extract data from archives. In this context, an archive means data saved using the punchplatform File Bolt.

Understanding the basic design of punchplatform archiving helps understand how this spout works. In a nutshell, the archive spout:

  1. first reads metadata from elasticsearch to determine which data batches to load from the archive,
  2. then reads each batch, each in turn containing many single-line logs or events,
  3. finally emits each batch into the storm topology as a single (big) tuple.

The ArchiveSpout configuration is straightforward:

{
    "type" : "archive_spout",
    "spout_settings" : {
        # where to read the archived data batches
        "cluster" : "indexed_filestore:///tmp/archiving",
        "pool" : "mytenant-data",
        "topic" : "apache_httpd_parsed",

        # what is the date range to load
        "from" : "2018-01-01T00:00:00+01:00",
        "to" : "2018-01-01T23:59:59+01:00",

        # where to read the archive batches metadata
        "elasticsearch_cluster" : "es_search"
    },
    "storm_settings" : {
        "executors": 1,
        "component": "archive_spout",
        "publish" : [ 
           { 
             "stream" : "logs", 
             "fields" : ["meta", "data"] 
           } 
         ] 
    }
}

Streams and Fields

The Archive Spout emits two-field tuples into the topology. It must be configured to emit these fields:

  • the meta field contains metadata about the extracted data.
  • the data field contains a set of uncompressed and deciphered (when applicable) events as a single multi-line string.

As just explained, the data field contains a complete set of events. Its size depends on how the data was archived, which in turn typically depends on the Kafka Spout and File Bolt configuration of the archiving topology.
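
A downstream bolt consumes these tuples by subscribing to the published stream. Here is a minimal sketch, assuming the usual punchplatform subscribe syntax in the downstream bolt's storm_settings; the component and stream names match the example above:

    "subscribe" : [
        {
            "component" : "archive_spout",
            "stream" : "logs"
        }
    ]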

Info

Each emitted tuple can be big, typically 20,000 events for example, and hence several megabytes depending on the event size. Be careful when dealing with these: carefully tune the number of pending tuples in your topology and the size of the worker JVM.
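
For illustration, here is a minimal sketch of that tuning. Both [topology.max.spout.pending] and [topology.worker.childopts] are standard Storm settings; this sketch assumes they can be set at the topology level of your topology file, and their exact placement may vary with your punchplatform version:

{
    # topology-level settings, alongside the spouts and bolts sections
    "storm_settings" : {
        # cap the number of in-flight (batched) tuples
        "topology.max.spout.pending" : 2,
        # give each worker JVM enough heap for a few multi-megabyte batches
        "topology.worker.childopts" : "-Xmx2g"
    }
}

With batches of a few megabytes each, a pending cap of 2 and a 2 GB heap leave comfortable headroom; adjust both to your actual batch sizes.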

Mandatory Parameters

  • cluster : String

    the storage endpoint, for example "file:///tmp/archiving" or "ceph:main-client".

  • pool : String

    the pool name, e.g. mytenant-data

  • topic : String

    the topic name, e.g. apache_httpd_parsed

  • from : String

    An ISO-format date indicating the oldest data to load, e.g. "2018-01-01T00:00:00+01:00".

  • to : String

    An ISO-format date indicating the latest data to load, e.g. "2018-01-01T23:59:59+01:00".

Optional Parameters

Warning

The following optional parameters are for advanced users. Their default values are production-ready.

  • queue_size: Integer : 10

    Like most spouts, the Archive Spout avoids reading too much data at once and flooding the topology. Should the topology be slower at processing the (batched) tuples than the spout is at reading them from the archive, the spout slows down. This backpressure behavior relies on two settings: first, the storm-level [topology.max.spout.pending], which makes Storm stop emitting new tuples into the topology; second, the [queue_size] parameter, which limits the number of batches preloaded in the spout internal memory. A combined example follows this list.

  • es_request_size: Integer : 1000

    By default the archive spout fetches 1000 batch descriptors from elasticsearch per request. Each descriptor gives the location of the corresponding batch in the target archiving backend. You can change the number of descriptors loaded per request using this parameter.

  • es_timeout: String : "60s"

    The timeout before considering an elasticsearch request as failed, expressed using the elasticsearch time unit notation (e.g. "60s").

  • credentials

    If you need basic auth, use a credentials dictionary to provide the user and password to use. For example: "credentials" : { "user" : "bob", "password" : "bob's password" }

    This setting can be combined with ssl. A token parameter can be specified as follows: "credentials" : { "token" : "mytoken", "token_type" : "ApiKey" }. Note that if user and password are specified, they are ignored in favor of the token parameter. With "token_type" : "Basic", the token is the base64-encoded string "user:password".
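
Putting it together, here is a sketch of a spout_settings section combining the mandatory parameters with the optional ones above. The queue_size, es_request_size and es_timeout values shown are simply the defaults, and the credentials are of course illustrative:

    "spout_settings" : {
        "cluster" : "indexed_filestore:///tmp/archiving",
        "pool" : "mytenant-data",
        "topic" : "apache_httpd_parsed",
        "from" : "2018-01-01T00:00:00+01:00",
        "to" : "2018-01-01T23:59:59+01:00",
        "elasticsearch_cluster" : "es_search",

        # advanced settings, shown with their default values
        "queue_size" : 10,
        "es_request_size" : 1000,
        "es_timeout" : "60s",

        # illustrative basic auth credentials
        "credentials" : { "user" : "bob", "password" : "bob's password" }
    }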

Metrics

The Archive Spout emits the usual storm metrics, plus the two additional metrics described below:

  • blocked_by_queue_full_ns:

    the duration in nanoseconds spent waiting on the internal spout queue. A high value indicates that your topology processes the loaded data more slowly than the spout fetches it from the archive.

  • file_size:

    the size of the extracted batch.

Also refer to the Archiving Service.