Archive Reader¶
Overview¶
The `archive_reader` node extracts archived data from S3, Ceph, or a filesystem, using metadata stored in Elasticsearch.
This node must follow an Elastic Input node providing metadata about the archived data.
Runtime Compatibility¶
- PySpark : ✅
- Spark : ✅
Filters¶
The Archive Reader reads one batch for every corresponding metadata document received as input. Therefore, filters and selections on dates, pool, tags, etc. must be applied in the Elasticsearch query.
Warning
When extracting from S3 using avro/parquet encoding, you must set the device address using the full IP address and not the hostname (e.g. 127.0.0.1 instead of localhost).
```yaml
---
type: elastic_input
component: input
settings:
  index: mytenant-archive-*
  nodes:
  - localhost
publish:
- stream: metadata
```
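For example, you can restrict the metadata to a given pool and time range directly in the Elasticsearch query. This sketch reuses the query shape from the full example at the end of this page:

```yaml
---
type: elastic_input
component: input
settings:
  index: mytenant-archive-*
  nodes:
  - localhost
  # Only metadata matching this query will be sent to the reader
  query:
    bool:
      must:
      - term:
          archive.pool: mytenant
      - range:
          batch.earliest_ts:
            gte: now-1h
            lt: now
publish:
- stream: metadata
```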
Once you have filtered your metadata, you can also apply filters on the data itself.
Two options are available: `match_bloom` and `match_string`.

`match_bloom` lets you keep only the archives whose bloom filter matches a given value. Of course, you must have archived your data using a bloom filter to use this option.
This option filters out entire files, not the lines in those files.

`match_string` lets you keep only the lines containing a specific string. This filter is applied after the bloom filter:
each line of the opened archives is processed, and only the lines containing the provided string are kept
in the output dataset. A sketch combining both options follows.
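As an illustrative sketch (the filter values here are hypothetical), both options are set in the reader's settings:

```yaml
- type: archive_reader
  component: reader
  settings:
    # Skip archive files whose bloom filter does not match this value
    # (only works if the data was archived with bloom filtering enabled)
    match_bloom: 127.0.0.1
    # Then keep only the lines containing this string
    match_string: GET
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```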
Credentials¶
When using the archive reader, most of the settings needed for the extraction are contained in the metadata. However,
the credentials are obviously not provided in the metadata. Therefore, you need to provide them in your node settings
(`user` for Ceph, `access_key` and `secret_key` for S3). You can also provide some `hadoop_settings` if you're using
avro/parquet encoding, as sketched below.
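For instance, a minimal sketch for an S3/MinIO extraction (the key values are hypothetical placeholders):

```yaml
- type: archive_reader
  component: reader
  settings:
    # Hypothetical S3 credentials; never commit real keys
    access_key: AKIAIOSFODNN7EXAMPLE
    secret_key: wJalrXUtnFEMI/K7MDENGbPxRcYEXAMPLEKEY
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```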
Note
The device address used is the first one found in the metadata. You can force a `device_address` by providing it in the node settings, but it is your responsibility to ensure the archives you're reading are present at the provided address.
Avro/Parquet¶
The Avro schema of your archives is provided in the metadata. Therefore, no additional settings are required for
avro/parquet encoding. However, you can provide a `schema` in the node settings to force its use. You can also set some
`hadoop_settings` for your extraction, as sketched below.
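For example, a sketch forcing a schema (this particular schema is hypothetical; it is written as a string, since the `schema` parameter listed below is typed String):

```yaml
- type: archive_reader
  component: reader
  settings:
    # Hypothetical Avro schema overriding the one found in the metadata
    schema: |
      {
        "type": "record",
        "name": "apache_httpd",
        "fields": [
          {"name": "raw_log", "type": "string"}
        ]
      }
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```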
Example¶
```yaml
---
version: "6.0"
type: punchline
runtime: spark
meta:
  tenant: mytenant
  channel: archive_reader
  vendor: archive_reader
dag:
- type: elastic_input
  component: input
  settings:
    index: mytenant-archive-*
    nodes:
    - localhost
    query:
      bool:
        must:
        - term:
            archive.pool: mytenant
        - term:
            tags.topic: apache_httpd
        - range:
            batch.earliest_ts:
              gte: now-1h
              lt: now
  publish:
  - stream: metadata
- type: archive_reader
  component: reader
  settings:
    device_address: file:///tmp/archive-logs/storage
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
- type: show
  component: show
  settings:
    title: RESULT
    num_rows: 1
    truncate: false
  subscribe:
  - component: reader
    stream: data
settings: {}
```
This example can be used to read files archived by the apache_httpd channel. It is composed of 3 elements:

- An Elasticsearch input sending a query to Elasticsearch to get the metadata.
- An Archive Reader reading files from the provided device, using the metadata information.
- A Show node displaying the resulting Dataset. This Dataset should contain one row per line read from the files.
Parameters¶
`device_address`: String
> Description: [Optional] Address of the device where data is stored. Starts with file://, ceph_configuration://, http://...

`user`: String
> Description: [Optional] The user used to access the Ceph cluster.

`access-key`: String
> Description: [Optional] Access key for the MinIO cluster.

`secret-key`: String
> Description: [Optional] Secret key for the MinIO cluster.

`match_string`: String
> Description: [Optional] String to be looked for in the extracted lines.

`match_bloom`: String
> Description: [Optional] Value to be looked for in the bloom filter fields set when archiving.

`schema`: String
> Description: [Optional] Avro schema to be used instead of the one provided in the metadata.

`hadoop_settings`: String
> Description: [Optional] Additional Hadoop settings used when extracting avro/parquet.

`charset`: String
> Description: [Optional] Charset to use when extracting bytes from avro/parquet.