Archive Reader¶
Overview¶
The `archive_reader` node extracts archived data from S3, Ceph, or a filesystem, using metadata stored in Elasticsearch.
This node must follow an Elastic Input node providing metadata about the archived data.
Runtime Compatibility¶
- PySpark : ✅
- Spark : ✅
Filters¶
The Archive Reader reads one batch for every corresponding metadata document received as input. Therefore, filters and selections on dates, pool, tags, etc. must be applied in the Elasticsearch query.
Warning
When extracting from S3 using avro/parquet encoding, you must set the device address using the full IP address and not the hostname (e.g. 127.0.0.1 instead of localhost).
```yaml
---
type: elastic_input
component: input
settings:
  index: mytenant-archive-*
  nodes:
  - localhost
publish:
- stream: metadata
```
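For example, you can restrict the metadata to a given pool and time range directly in the Elasticsearch query. This sketch reuses the query shape from the full example at the end of this page:

```yaml
---
type: elastic_input
component: input
settings:
  index: mytenant-archive-*
  nodes:
  - localhost
  # Only metadata matching this query will be sent to the reader
  query:
    bool:
      must:
      - term:
          archive.pool: mytenant
      - range:
          batch.earliest_ts:
            gte: now-1h
            lt: now
publish:
- stream: metadata
```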
Once you have filtered your metadata, you can also apply filters on the data itself.
Two options are available: `match_bloom` and `match_string`.

`match_bloom` lets you keep only the archives whose bloom filter matches a given value. Of course, you must have archived your data using a bloom filter to use this option.
This option filters out entire files, not the lines in those files.

`match_string` lets you keep only the lines containing a specific string. This filter is applied after the bloom filter:
each line of the opened archives is processed, and only the lines containing the provided string are kept
in the output dataset. A sketch combining both options follows.
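As an illustrative sketch (the filter values here are hypothetical), both options are set in the reader's settings:

```yaml
- type: archive_reader
  component: reader
  settings:
    # Skip archive files whose bloom filter does not match this value
    # (only works if the data was archived with bloom filtering enabled)
    match_bloom: 127.0.0.1
    # Then keep only the lines containing this string
    match_string: GET
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```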
Credentials¶
When using the archive reader, most of the settings needed for the extraction are contained in the metadata. However,
the credentials are obviously not provided in the metadata. Therefore, you need to provide them in your node settings
(`user` for Ceph, `access_key` and `secret_key` for S3). You can also provide some `hadoop_settings` if you're using
avro/parquet encoding, as sketched below.
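For instance, a minimal sketch for an S3/MinIO extraction (the key values are hypothetical placeholders):

```yaml
- type: archive_reader
  component: reader
  settings:
    # Hypothetical S3 credentials; never commit real keys
    access_key: AKIAIOSFODNN7EXAMPLE
    secret_key: wJalrXUtnFEMI/K7MDENGbPxRcYEXAMPLEKEY
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```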
Note
The device address used is the first one found in the metadata. You can force a `device_address` by providing it in the node settings, but it is your responsibility to ensure the archives you're reading are present at the provided address.
Avro/Parquet¶
The Avro schema of your archives is provided in the metadata. Therefore, no additional settings are required for
avro/parquet encoding. However, you can provide a `schema` in the node settings to force its use. You can also set some
`hadoop_settings` for your extraction, as sketched below.
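For example, a sketch forcing a schema (this particular schema is hypothetical; it is written as a string, since the `schema` parameter listed below is typed String):

```yaml
- type: archive_reader
  component: reader
  settings:
    # Hypothetical Avro schema overriding the one found in the metadata
    schema: |
      {
        "type": "record",
        "name": "apache_httpd",
        "fields": [
          {"name": "raw_log", "type": "string"}
        ]
      }
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
```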
Example¶
```yaml
---
version: "6.0"
type: punchline
runtime: spark
meta:
  tenant: mytenant
  channel: archive_reader
  vendor: archive_reader
dag:
- type: elastic_input
  component: input
  settings:
    index: mytenant-archive-*
    nodes:
    - localhost
    query:
      bool:
        must:
        - term:
            archive.pool: mytenant
        - term:
            tags.topic: apache_httpd
        - range:
            batch.earliest_ts:
              gte: now-1h
              lt: now
  publish:
  - stream: metadata
- type: archive_reader
  component: reader
  settings:
    device_address: file:///tmp/archive-logs/storage
  subscribe:
  - component: input
    stream: metadata
  publish:
  - stream: data
- type: show
  component: show
  settings:
    title: RESULT
    num_rows: 1
    truncate: false
  subscribe:
  - component: reader
    stream: data
settings: {}
```
This example can be used to read files archived by the apache_httpd channel. It is composed of 3 elements:

- An Elasticsearch input sending a query to Elasticsearch to get the metadata.
- An Archive Reader reading files from the provided device, using the metadata information.
- A Show node displaying the resulting Dataset. This Dataset should contain one row per line read from the files.
Parameters¶
`device_address`: String
> Description: [Optional] Address of the device where data is stored. Starts with file://, ceph_configuration://, http://...

`user`: String
> Description: [Optional] The user used to access the Ceph cluster.

`access-key`: String
> Description: [Optional] Access key for the MinIO cluster.

`secret-key`: String
> Description: [Optional] Secret key for the MinIO cluster.

`match_string`: String
> Description: [Optional] String to be looked for in the extracted lines.

`match_bloom`: String
> Description: [Optional] Value to be looked for in the bloom filter fields set when archiving.

`schema`: String
> Description: [Optional] Avro schema to be used instead of the one provided in the metadata.

`hadoop_settings`: String
> Description: [Optional] Additional Hadoop settings used when extracting avro/parquet.

`charset`: String
> Description: [Optional] Charset to use when extracting bytes from avro/parquet.