Archives Housekeeping

Overview

If you use archives to store data on long-term object storage, you need a strategy to delete old data.

The Punch provides a ready-to-use archives-housekeeping application. You can include it in one of your channels to delete old data periodically.

Prerequisites

Elastic mapping

The archives-housekeeping application reads archive metadata from Elasticsearch. When archiving, this metadata should be written to an index that uses the mapping_archive.json mapping.

Here is the command to load this mapping on the Standalone:

curl -X POST localhost:9200/_template/mapping_archive \
      -H "Content-Type: application/json" \
      -d @$PUNCHPLATFORM_CONF_DIR/resources/elasticsearch/mapping_archive.json

Permissions

When the archive is stored on a file system, do not forget to give read and write permissions on the archive folder. When this application runs in Shiva, the Punch daemon user must have those permissions on the archive data.
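The permission requirement above can be checked before scheduling the application. The following is a minimal sketch, not part of the Punch tooling: the helper name can_read_and_write is hypothetical, and it only inspects the folder itself, not every file below it. Run it as the user that will execute the application (the Punch daemon user under Shiva).

```python
import os


def can_read_and_write(folder: str) -> bool:
    """Return True if the current user can traverse, read and write `folder`."""
    # X_OK on a directory means it can be traversed, which is needed to reach its files.
    return os.path.isdir(folder) and os.access(folder, os.R_OK | os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Archive root taken from the standalone configuration example; adapt to your setup.
    print(can_read_and_write("/tmp/archive-logs/storage"))
```

Note that os.access evaluates permissions for the real user id of the process, so the result is only meaningful when the script runs as the same user as the housekeeping application.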

Configuration

Here is a configuration example from the standalone that cleans:

- data older than 1 hour on the file system
- data older than 3 days on Minio

archiving_pools:
  # File system
  - devices_addresses:
      - file:///tmp/archive-logs/storage
    pool: mytenant
    retention: 1h
    max_deletion_percentage: 100
    es_cluster_id: common
    es_index: mytenant-archive
    es_bulk_size: 1000
    es_timeout: 10s
    delete_metadata: all_devices

  # Minio
  - devices_addresses:
      - http://localhost:9000
    pool: mytenant
    retention: 3d
    max_deletion_percentage: 100
    es_cluster_id: common
    es_index: mytenant-archive
    es_bulk_size: 1000
    es_timeout: 10s
    delete_metadata: all_devices
    access_key: minioadmin
    secret_key: minioadmin

Parameters

Mandatory Parameters

devices_addresses (array of string)

Array of device addresses. The format depends on the storage backend:
- For Ceph, the address is the absolute path to the Ceph cluster configuration file, with format ceph_configuration://<path>.
- For a file system, the address is the absolute path of the archive root directory, with format file:///<path>.
- For Minio, the address is the URL of the Minio cluster, with format http://<url>.
pool (string)

Archiving pool name. It is the tenant name by default.
retention (string)

Retention time. All data whose batch.latest_ts is older than the specified retention will be deleted. Available formats:
- Seconds : 30s, 30secs, 30seconds.
- Minutes : 1m, 10mins, 2minutes.
- Hours : 1h, 1hrs, 1hours.
- Days : 1d, 1day, 1days.
max_deletion_percentage (decimal)

Maximum percentage of data that can be deleted, relative to all the data in the cluster. This feature exists to prevent accidental deletion: the operation fails if it would delete more than x% of the data.
Set to 100.0 to disable this protection.
es_cluster_id (string)

Elasticsearch cluster used to store object metadata.
es_index (string)

Elasticsearch index containing object metadata. The suffix -* is appended to this name.
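To make the retention and max_deletion_percentage semantics concrete, here is a minimal sketch of how such values could be interpreted. It is an illustration based only on the descriptions above, not the application's actual code: the function names are hypothetical, and measuring the percentage as a count of items (rather than, say, bytes) is an assumption.

```python
import re
from datetime import timedelta

# Unit spellings listed in the retention parameter description.
_UNITS = {
    "s": "seconds", "secs": "seconds", "seconds": "seconds",
    "m": "minutes", "mins": "minutes", "minutes": "minutes",
    "h": "hours", "hrs": "hours", "hours": "hours",
    "d": "days", "day": "days", "days": "days",
}


def parse_retention(value: str) -> timedelta:
    """Convert a retention string such as '3d' or '10mins' into a timedelta."""
    match = re.fullmatch(r"(\d+)([a-z]+)", value.strip())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported retention: {value!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def check_deletion_allowed(to_delete: int, total: int, max_percentage: float) -> None:
    """Raise if deleting `to_delete` items out of `total` exceeds the limit.

    Assumption: the percentage is computed on item counts.
    """
    if total == 0:
        return  # nothing stored, nothing to protect
    percentage = 100.0 * to_delete / total
    if percentage > max_percentage:
        raise RuntimeError(
            f"refusing to delete {percentage:.1f}% of the data "
            f"(limit: {max_percentage}%)"
        )
```

With this reading, a limit of 100.0 can never be exceeded, which matches the documented way to disable the protection.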

Security Parameters

credentials.user (string)

Username used to authenticate to the Elasticsearch cluster. Requires credentials.password to be set.
credentials.password (string)

Password used to authenticate to the Elasticsearch cluster. Requires credentials.user to be set.
credentials.token (string)

Token string used to authenticate to the Elasticsearch cluster. Requires credentials.token_type to be set.
credentials.token_type (string)

Token type used to authenticate to the Elasticsearch cluster. Requires credentials.token to be set.
ssl (Boolean)

If true, encrypt the connection to the ES cluster with TLS
ssl_private_key (String)

Path to the client's private key for TLS connection
ssl_private_key_password (String)

Password for client's private key for TLS connection
ssl_private_key_alias (String)

Alias for client's private key for TLS connection
ssl_certificate (String)

Path to the client's certificate for TLS connection
ssl_trusted_certificate (String)

Path to the client's CA file for TLS connection
ssl_keystore_location (String)

Path to the client's keystore for TLS connection
ssl_keystore_password (String)

Password for client's keystore for TLS connection
ssl_truststore_location (String)

Path to the client's truststore for TLS connection
ssl_truststore_password (String)

Password for client's truststore for TLS connection
user (string)

Only for Ceph. Ceph user name.
access_key (string)

Only for Minio. Minio access key.
secret_key (string)

Only for Minio. Minio secret key.
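The backend-specific keys above depend on the address format used in devices_addresses. The sketch below shows one way to derive, for a pool entry, which of those keys apply. It is illustrative only: the function names and the BACKEND_KEYS table are hypothetical, and only the three address formats documented for devices_addresses are recognized.

```python
# Backend-specific keys named in the parameter descriptions.
BACKEND_KEYS = {
    "ceph": ["user"],
    "filesystem": [],
    "minio": ["access_key", "secret_key"],
}


def backend_of(address: str) -> str:
    """Map a device address to its storage backend, based on its prefix."""
    if address.startswith("ceph_configuration://"):
        return "ceph"
    if address.startswith("file://"):
        return "filesystem"
    if address.startswith("http://"):
        return "minio"
    raise ValueError(f"unrecognized device address: {address!r}")


def applicable_backend_keys(pool: dict) -> list:
    """List the backend-specific keys that apply to a pool entry."""
    keys = []
    for address in pool.get("devices_addresses", []):
        for key in BACKEND_KEYS[backend_of(address)]:
            if key not in keys:
                keys.append(key)
    return keys
```

For the Minio pool of the configuration example, this returns access_key and secret_key; for the file system pool it returns an empty list.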

Optional Parameters

delete_metadata (string)
Deletion strategy regarding metadata. Default: "always". Possible values:
- "always": metadata is always deleted after processing devices.
- "never": metadata is never deleted after processing devices.
- "all_devices": metadata is deleted only when all devices have been cleaned by housekeeping. If some devices were not cleaned (because they were not specified or because the application failed to clean them), the application updates the metadata so that it lists only the devices where the archive is still present. This is the recommended behavior for production.
es_bulk_size (int)

Size of the Elasticsearch bulk requests containing the metadata delete and update actions. Default: 1000
es_timeout (String)

Elasticsearch request timeout. Default: 10s
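The three delete_metadata strategies can be summarized as a small decision function. This is a sketch of the behavior as described above, not the application's implementation: the function name, the action labels and the use of sets of device addresses are assumptions made for illustration.

```python
def metadata_action(strategy: str, archive_devices: set, cleaned_devices: set):
    """Decide what happens to one metadata document after processing devices.

    Returns a pair (action, devices_kept_in_metadata) where action is
    'delete', 'keep' or 'update'.
    """
    if strategy == "always":
        return ("delete", set())
    if strategy == "never":
        return ("keep", set(archive_devices))
    if strategy == "all_devices":
        # Devices that still hold the archive after this housekeeping run.
        remaining = set(archive_devices) - set(cleaned_devices)
        if not remaining:
            return ("delete", set())
        return ("update", remaining)
    raise ValueError(f"unknown delete_metadata strategy: {strategy!r}")
```

For example, with "all_devices", an archive stored on two devices of which only one was cleaned yields an update that keeps the other device in the metadata, so the remaining copy stays discoverable.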

Running in foreground

Run the application in the foreground on a Punch Operator:

archives-housekeeping /path/to/your/archives-housekeeping.yaml

Running in Shiva

Include the application in a Shiva channel. Example from the standalone:

version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:

  - name: elasticsearch-housekeeping
    runtime: shiva
    cluster: common
    command: elasticsearch-housekeeping
    args:
      - --tenant-configuration-path
      - elasticsearch-housekeeping.json
    apply_resolver_on:
      - elasticsearch-housekeeping.json
    quartzcron_schedule: 0 * * ? * * *

  - name: archives-housekeeping
    runtime: shiva
    cluster: common
    command: archives-housekeeping
    args:
      - archives-housekeeping.yaml
      - --childopts
      - -Xms100m -Xmx500m
    quartzcron_schedule: 0 * * ? * * *
resources: [ ]