Archiving and Extracting

Abstract

Archiving is simple to set up, but offers several subtle options. This guide walks you through the various operations, from archiving data to extracting it.

Requirements

Make sure you have successfully followed the Punchlets Getting Started guide first. This guide assumes you are familiar with topologies and punch injection files.

Generate Logs

All the configuration files used in this guide are located under the conf/guides/archiving folder. Start by injecting logs into a Kafka topic. Simply use the provided injector file.

punchplatform-log-injector.sh -c archiving_guide_injector.hjson

This will generate 1024 records into a mytenant_archiving_guide topic. Each record looks like this:

{"ip":"128.78.0.8","raw":"20-06-2019 01:18:00 host 128.78.0.8 bob","user":"bob","timestamp":"20-06-2019 01:18:00"}
{"ip":"128.78.3.42","raw":"20-06-2019 01:19:00 host 128.78.3.42 alice","user":"alice","timestamp":"20-06-2019 01:19:00"}
{"ip":"128.78.0.30","raw":"20-06-2019 01:20:00 host 128.78.0.30 ted","user":"ted","timestamp":"20-06-2019 01:20:00"}
...

To produce interesting data, the injector file generates 1024 records, with ip addresses taken from 16K possible values. This will be useful to illustrate how to efficiently locate a given ip address in a batch file. The traffic is also spread over three days, which is easily achieved using the punch injector's simulated timestamps.

Tip

Check out the injection file, it is self-explanatory.
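
To double check that the records actually landed in Kafka, you can consume a few of them. Here is a sketch using the standard Kafka console consumer; it assumes a broker listening on localhost:9092, so adapt the address to your setup:

kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic mytenant_archiving_guide \
    --from-beginning --max-messages 3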

Archive the Logs

Now that your 1024 logs are in Kafka, archive them into an indexed file store. Before executing the archiving topology, let us have a look at its FileBolt settings:

    {
      "type": "file_bolt",
      "bolt_settings": {

        # this is where you set the destination archive
        "cluster" : "indexed_filestore:///tmp/indexed",

        # the elasticsearch cluster in charge of storing our index
        "es_cluster_id": "es_search",

        # these are the default settings. Most often you have one pool per tenant
        "pool" : "mytenant",

        # .. and one topic per channel.
        "topic" : "demo",

        # this is to generate a folder hierarchy using a partition number
        # and a (day) date. You can choose what you need here, this is
        # only illustrative.
        "folders_hierachy_pattern" : "%{partition}/%{date}",

        # Could be GZIP or SNAPPY
        "compression_format" : "NONE",

        # In this tutorial we only want to archive the raw log, not the
        # other fields.
        "fields": [
          "raw"
        ],

        # We want to use the incoming `timestamp` field as the time
        # indication for future searches. You may not have such a timestamp
        # at hand, but using the punch you typically have one. It is
        # extremely useful to perform searches based on the real event
        # timestamp.
        "timestamp_field" : "timestamp",

        # Bloom filtering is activated on the field "ip"
        "bloom_filter_fields": [
          "ip"
        ],
        # estimate of the number of ip addresses
        "bloom_filter_expected_insertions" : 10000,
        # we accept 10% of false positives
        "bloom_filter_false_positive_probability" : 0.1
      }
    }
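
Note that the two bloom filter settings drive the size of the filter stored with each batch. As a rough, generic estimate (standard bloom filter sizing formulas, not a punch-specific computation), the number of bits m and hash functions k for n = 10000 expected insertions and p = 0.1 false positive probability are:

m = -n * ln(p) / (ln 2)^2  ~ 48 kbits, i.e. about 6 kB per batch
k = (m / n) * ln 2         ~ 3 hash functions

This gives you an idea of the per-batch indexing overhead.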

Simply run the following command:

punchplatform-topology.sh --start-foreground -m light -t archiving_guide_topology.hjson

For the sake of the example it is configured to generate batch files containing 8 records. This is of course a very small number, but it will help us illustrate the various settings without filling your laptop with huge files. Since we processed 1024 logs, we thus generated 128 batch files. Check out the resulting archive folder. You will see the following structure:

/tmp/indexed/
└── mytenant
    └── 0
        ├── 2019.05.20
        │   ├── demo-0-1559404473771.data
        │   ├── demo-0-1559404473771.metadata
        │   ├── ...
        │   ├── demo-0-1559404473806.data
        │   └── demo-0-1559404473806.metadata
        ├── 2019.05.21
        │   ├── demo-0-1559404473807.data
        │   ├── demo-0-1559404473807.metadata
        │   ├── ...
        │   ├── demo-0-1559404473842.data
        │   └── demo-0-1559404473842.metadata
        ├── 2019.05.22
        │   ├── demo-0-1559404473843.data
        │   ├── demo-0-1559404473843.metadata
        │   ├── ...
        │   ├── demo-0-1559404473878.data
        │   └── demo-0-1559404473878.metadata
        └── 2019.05.23
            ├── demo-0-1559404473879.data
            ├── demo-0-1559404473879.metadata
            ├── ...
            ├── demo-0-1559404473898.data
            └── demo-0-1559404473898.metadata

6 directories, 256 files
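
You can cross check these numbers: each batch produces one .data and one .metadata file, so the 128 batches account for the 256 files. For instance:

find /tmp/indexed -name '*.data' | wc -l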

The "%{partition}/%{date}" settings is the one responsible for that structure. Here we have only a single Kafka partition (0).

Query the Archive

In the following sections we go through the essential commands provided by the punchplatform-objects-storage.sh CLI tool. It allows you to query and extract your archived data.

Pool Status

Start by querying the status of your pool.

punchplatform-objects-storage.sh pool-status \
    --cluster indexed_filestore:///tmp/indexed  \
    --pool mytenant
Pool                          :mytenant
Batch Count                   :128
Pool usage                    :153.6 kB
Cluster usage                 :182.7 GB
Cluster free                  :50.8 GB
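
You can cross check the pool usage against the archive folder itself with plain unix tooling (only the path comes from this guide):

du -sh /tmp/indexed/mytenant

The cluster usage and free figures likely refer to the whole filesystem hosting the store, hence the much larger values.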

Topic Status

Print a topic status for the last 20 days. Notice the short date notation. You can provide ISO dates as well.

punchplatform-objects-storage.sh topic-status \
    --cluster indexed_filestore:///tmp/indexed  \
    --from-date -20d --to-date now \
    --pool mytenant --topic demo
Topic                         : demo
Batch Size                    : 128
Tuple Count                   : 1024
Earliest                      : 2019-05-20T00:05+02:00[Europe/Paris]
Latest                        : 2019-05-23T13:20+02:00[Europe/Paris]
Uncompressed Size             : 56.7 kB
Effective Stored Size         : 56.7 kB
Compression Ratio             : 1.000000
1/Compression Ratio           : 1.000000
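
Both ratio lines are 1.0 simply because the FileBolt was configured with "compression_format" : "NONE". Rerunning the archiving topology with, for instance:

"compression_format" : "GZIP"

would shrink the effective stored size and make the ratio lines diverge accordingly.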

List Topics

List the topics of the last 20 days, together with some detailed info:

punchplatform-objects-storage.sh list-topics \
    --cluster indexed_filestore:///tmp/indexed  \
    --from-date -20d --to-date now \
    --pool mytenant --details
Pool                          : mytenant
Topics Number                 : 1

Topic                         : demo
Batch Size                    : 128
Tuple Count                   : 1024
Earliest                      : 2019-05-20T00:05+02:00[Europe/Paris]
Latest                        : 2019-05-23T13:20+02:00[Europe/Paris]
Uncompressed Size             : 56.7 kB
Effective Stored Size         : 56.7 kB
Compression Ratio             : 1.000000
1/Compression Ratio           : 1.000000

List Batch Files

List the batch files of the last 20 days for a given topic. Note how each object id is made of the folder hierarchy (partition and date) followed by the file basename:

punchplatform-objects-storage.sh list-objects \
    --cluster indexed_filestore:///tmp/indexed \
    --from-date -20d --to-date now \
    --pool mytenant --topic demo
0/2019.05.20/demo-0-1559468932653
0/2019.05.20/demo-0-1559468932654
0/2019.05.20/demo-0-1559468932655
0/2019.05.20/demo-0-1559468932656
...
0/2019.05.23/demo-0-1559468932780

Using Bloom Filters

If you activated bloom filters on one or several fields, you can efficiently select only those batches that contain a target value. In our example the bloom filter was set on the incoming tuple ip field. Here is how you can request the batches that contain a given ip:

punchplatform-objects-storage.sh list-objects  \
    --cluster indexed_filestore:///tmp/indexed \
    --from-date -30d --to-date now \
    --pool mytenant --topic demo \
    --match 128.78.18.47
0/2019.05.21/demo-0-1559468932689

In this case only a single batch contains the target ip. This in turn allows you to perform efficient data extractions.

Tip

If the result is empty, it is because the ip addresses are randomly generated in this example. List the ip addresses actually present in your archive with the command below, then pass one of them to the --match argument:
grep -r 128.78.18 /tmp/indexed/

Extract Tuples

From a Given Batch

Here is a first simple method to retrieve the content of a given batch file. Since only the raw field was archived, each returned line is the original raw log:

punchplatform-objects-storage.sh dump-object \
    --cluster indexed_filestore:///tmp/indexed  \
    --pool mytenant \
    --object-id 0/2019.05.20/demo-0-1559408745249
20-05-2019 00:05:00 host 128.78.37.98 frank
20-05-2019 00:10:00 host 128.78.45.127 bob
20-05-2019 00:15:00 host 128.78.48.152 alice
20-05-2019 00:20:00 host 128.78.61.23 ted
20-05-2019 00:25:00 host 128.78.8.250 dimi
20-05-2019 00:30:00 host 128.78.1.160 ced
20-05-2019 00:35:00 host 128.78.43.122 phil
20-05-2019 00:40:00 host 128.78.9.62 julien

Tip

You can list the valid object-id values with:
tree /tmp/indexed/
The object id is the path of a batch file relative to the pool folder, without its .data or .metadata extension.

Using a Date Range

Here is how you can extract the data using a time scope selection.

punchplatform-objects-storage.sh extract-scope \
    --cluster indexed_filestore:///tmp/indexed  \
    --from-date -20d --to-date now \
    --pool mytenant --topic demo
<many lines>
...
23-05-2019 13:05:00 host 128.78.12.101 parker
23-05-2019 13:10:00 host 128.78.2.123 evans
23-05-2019 13:15:00 host 128.78.2.95 edward
23-05-2019 13:20:00 host 128.78.2.66 ross
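
For bulk extractions, plain shell redirection is enough to capture the result into a file (standard shell, nothing punch-specific):

punchplatform-objects-storage.sh extract-scope \
    --cluster indexed_filestore:///tmp/indexed \
    --from-date -20d --to-date now \
    --pool mytenant --topic demo > extracted.log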

Using Bloom Filters

Here is how to leverage the bloom filter we computed when generating our archive. The following command returns the lines that match the provided ip address.

punchplatform-objects-storage.sh extract-scope \
    --cluster indexed_filestore:///tmp/indexed  \
    --from-date -20d --to-date now \
    --pool mytenant --topic demo \
    --match 128.78.18.47

The result is:

21-05-2019 00:05:00 host 128.78.18.47 ted

This is particularly efficient, as only the batch file(s) whose bloom filter matched the '128.78.18.47' ip address were read from the archive. Keep in mind that bloom filters allow false positives: with the configured 0.1 probability, roughly 10% of non-matching batches may still be read, but no matching batch is ever missed.

Tip

As above, if the result is empty it is because the ip addresses are randomly generated. Pick an existing ip address first, then pass it to the --match argument:
grep -r 128.78.18 /tmp/indexed/