Archiving and Extracting¶
Abstract
Archiving is simple to set up, but offers a number of subtle options. This guide walks you through the various operations, from archiving data to extracting it.
Requirements¶
Make sure you successfully followed the Punchlets Getting Started guide first. This guide assumes you are familiar with punchlines and punch injection files.
Generate Logs¶
All the configuration files used in this guide are located under the conf/samples/archiving
folder.
Start by injecting logs into a Kafka topic. Simply use the provided injector file.
punchplatform-kafka-topics.sh --create mytenant_archiving_guide
punchplatform-log-injector.sh -c $PUNCHPLATFORM_CONF_DIR/samples/archiving/archiving_guide_injector.hjson
This will generate 1024 records into the mytenant_archiving_guide topic. Each record looks like this:
{"ip":"128.78.0.8","raw":"20-06-2019 01:18:00 host 128.78.0.8 bob","user":"bob","timestamp":"20-06-2019 01:18:00"}
{"ip":"128.78.3.42","raw":"20-06-2019 01:19:00 host 128.78.3.42 alice","user":"alice","timestamp":"20-06-2019 01:19:00"}
{"ip":"128.78.0.30","raw":"20-06-2019 01:20:00 host 128.78.0.30 ted","user":"ted","timestamp":"20-06-2019 01:20:00"}
...
In order to produce interesting data, the injector file generates 1K records, with ip addresses drawn from 16K possible values. This will be useful to illustrate how efficiently you can locate a given ip address in the archived batch files.
Tip
Check out the injection file; it is self-explanatory.
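Before moving on, you can optionally double-check that the records actually landed in Kafka by consuming a few of them. The command below uses the standard Kafka console consumer and assumes a broker listening on localhost:9092; adapt the script name and bootstrap server to your own deployment:
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic mytenant_archiving_guide --from-beginning --max-messages 3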
Archive the Logs¶
Now that your logs are in Kafka, archive them into an indexed file store. Before executing the archiving, let us have a look at the punchline. The Kafka input simply reads our topic. The file output is in charge of the archiving; it is configured with an additional bloom filtering capability.
Last, we catch the indexing metadata generated by the file output and index it into Elasticsearch. That last step will be used later to efficiently query our archived data. Here is the punchline content:
version: '6.0'
name: archiving_guide_topology
type: punchline
runtime: storm
tenant: sample
channel: sample
dag:
  - component: input
    type: kafka_input
    settings:
      fail_action: exit
      topic: mytenant_archiving_guide
      start_offset_strategy: earliest
    publish:
      - stream: logs
        fields:
          - raw
          - ip
          - user
          - timestamp
          - _ppf_partition_id
          - _ppf_partition_offset
  - component: output
    type: file_output
    settings:
      strategy: at_least_once
      batch_size: 8
      destination: file:///tmp/archives
      pool: mytenant
      topic: demo
      file_prefix_pattern: '%{partition}/%{date}/puncharchive-%{topic}-%{partition}-%{offset}'
      create_root: true
      compression_format: NONE
      timestamp_field: timestamp
      fields:
        - raw
      bloom_filter_fields:
        - ip
      bloom_filter_expected_insertions: 10000
      bloom_filter_false_positive_probability: 0.1
    subscribe:
      - component: input
        stream: logs
    publish:
      - stream: logs
        fields:
          - metadata
  - component: elastic
    type: elasticsearch_output
    settings:
      per_stream_settings:
        - stream: logs
          index:
            prefix: mytenant-archive-
            type: daily
          document_json_field: metadata
    subscribe:
      - component: output
        stream: logs
settings:
  topology.worker.childopts: -server -Xms256m -Xmx256m
In particular, note the following settings:
- timestamp_field: we use the incoming timestamp field as the time reference for future searches. You may not have such a timestamp at hand, but using the punch you typically do. It is extremely useful to be able to search the archive based on the real event timestamp.
- bloom_filter_fields: bloom filtering is activated on the ip field.
- bloom_filter_expected_insertions: an estimate of the number of distinct ip addresses (see the sizing note below).
- bloom_filter_false_positive_probability: we accept a 10 % false positive rate.
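To get an intuition of what these bloom filter settings cost, you can apply the standard bloom filter sizing formulas (this is generic bloom filter math, not a punch-specific guarantee): with n = 10000 expected insertions and a false positive probability p = 0.1, an optimally sized filter needs about m = -n·ln(p) / (ln 2)² ≈ 48000 bits, that is roughly 6 kB per batch, using k = (m / n)·ln 2 ≈ 3 hash functions. Lowering the false positive probability or raising the expected insertions increases that per-batch indexing overhead accordingly.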
For the sake of simplicity, the create_root option has been activated in this punchline. This option makes the root directory (here /tmp/archives/) be created automatically. We do not recommend activating this option for anything other than tests: setting it to true may create unwanted directories if the destination is mistyped. Furthermore, not creating the root directory automatically is a way to check that the destination cluster is actually reachable.
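If you prefer to leave create_root deactivated, simply create the root directory yourself before starting the punchline:
mkdir -p /tmp/archives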
To start the punchline, simply run the following command:
punchlinectl archiving_guide_punchline.hjson
For the sake of the example, it is configured to generate batch files containing 8 records. Of course this is a very small number, but it will help us illustrate the various settings without filling your laptop with huge files. Since we processed 1024 logs, we thus generated 128 batch files. Check out the resulting archive folder. You will have the following structure:
/tmp/archives/
└── mytenant
    └── 0
        └── 2020.10.13
            ├── puncharchive-demo-0-0-bZJgInUBSqxavLe4BCX8.csv
            ├── ...
            └── puncharchive-demo-0-992-athgInUBSqxavLe4Bfu9.csv

3 directories, 128 files
The file_prefix_pattern setting is the one responsible for that structure. Here we only have a single Kafka partition (0).
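Because compression_format is set to NONE, each batch file is a plain CSV containing the archived fields, here only the raw field. You can peek at one of them directly; the exact file names will differ on your machine since the suffix is generated at runtime, hence the find:
head -3 "$(find /tmp/archives -name '*.csv' | head -1)"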
Topic Status¶
Start by querying the status of your topic. The time filter here selects data from the last 20 days. Notice the short date notation; you can provide ISO dates as well.
topic-status \
--from-date -20d --to-date now \
--pool mytenant --topic demo \
--es-index-pattern mytenant-archive
demo:
  uncompressed_size: 44.3 kB
  tuples_count: 1024
  compression_factor: 1.0
  size: 44.3 kB
  latest_date: 2020-02-20T10:24:02+01:00[Europe/Paris]
  batches_count: 128
  compression_ratio: 1.0
  earliest_date: 2020-02-20T10:24:01+01:00[Europe/Paris]
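These figures are computed from the metadata documents indexed into Elasticsearch by the file output (the mytenant-archive-* daily indices). If you are curious, you can inspect a raw metadata document yourself; the command below assumes Elasticsearch listens on localhost:9200:
curl -s 'localhost:9200/mytenant-archive-*/_search?size=1&pretty'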
List Topics¶
List the topics of the last 20 days, together with some detailed info:
list-topics \
--from-date -20d --to-date now \
--pool mytenant --details \
--es-index-pattern mytenant-archive
topic_number: 1
topics:
  demo:
    uncompressed_size: 44.3 kB
    tuples_count: 1024
    compression_factor: 1.0
    size: 44.3 kB
    latest_date: 2020-02-20T10:24:02+01:00[Europe/Paris]
    batches_count: 128
    compression_ratio: 1.0
    earliest_date: 2020-02-20T10:24:01+01:00[Europe/Paris]
    pool: mytenant
List Batch Files¶
List the batch files of the last 20 days for a given topic:
list-objects \
--from-date -20d --to-date now \
--pool mytenant --topic demo \
--es-index-pattern mytenant-archive
0/2020.10.13/puncharchive-demo-0-0-bZJgInUBSqxavLe4BCX8.csv
0/2020.10.13/puncharchive-demo-0-8-DHZgInUBSqxavLe4BRIN.csv
0/2020.10.13/puncharchive-demo-0-16-7RtgInUBSqxavLe4BQER.csv
0/2020.10.13/puncharchive-demo-0-24-unhgInUBSqxavLe4BbcT.csv
0/2020.10.13/puncharchive-demo-0-32-II1gInUBSqxavLe4BbAV.csv
0/2020.10.13/puncharchive-demo-0-40-tL5gInUBSqxavLe4BdAY.csv
...
0/2020.10.13/puncharchive-demo-0-1016-t59gInUBSqxavLe4BYXB.csv
Using Bloom Filters¶
If you activated bloom filters on one or several fields, you can efficiently select only those batches that contain a target value.
In our example the bloom filter was set on the ip incoming tuple field. Here is how you can request the batches that contain a given ip:
list-objects \
--from-date -30d --to-date now \
--pool mytenant --topic demo \
--match 128.78.18.47 \
--es-index-pattern mytenant-archive
0/2019.12.05/puncharchive-demo-fYi41W4Btd7z87AblKC6-0-1575543674595.csv
In this case only a single batch contains the target ip. This in turn allows you to perform efficient data extractions.
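The returned path is relative to the pool root directory, /tmp/archives/mytenant in our example, so you can inspect that single batch file directly (use the file name returned on your own machine):
head /tmp/archives/mytenant/0/2019.12.05/puncharchive-demo-fYi41W4Btd7z87AblKC6-0-1575543674595.csv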
Tip
If the result is empty, it is because the ip addresses are randomly generated for this example. You can list the generated ips using the command below, then copy one of them into the --match argument:
grep -r 128.78.18 /tmp/archives/
Extracting Using a Punchline¶
Refer to the Archive Reader Node. Using that node you can design punchlines to extract small or large volumes of data and save them to various destinations.
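As a very rough illustration only, such an extraction punchline follows the same skeleton as the archiving punchline above: an archive reader component that publishes the records read back from the archive, followed by whatever output node you need. The setting names below are assumptions used for illustration; take the actual node name and its mandatory settings from the Archive Reader Node documentation.
# Illustrative skeleton only: setting names are assumptions,
# refer to the Archive Reader Node documentation for the real ones.
version: '6.0'
name: extraction_topology
type: punchline
runtime: storm
tenant: sample
channel: sample
dag:
  - component: reader
    type: archive_reader
    settings:
      # these typically mirror the file_output settings used for archiving
      destination: file:///tmp/archives
      pool: mytenant
      topic: demo
    publish:
      - stream: logs
        fields:
          - raw
  # ... followed by any output component (elasticsearch_output, kafka_output, file_output, ...)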