Archives Housekeeping

Overview

If you archive data on long-term object storage, you need a strategy to delete old data.

The Punch provides a ready-to-use archives-housekeeping application. You can include it in one of your channels to delete old data periodically.

Prerequisites

Elastic mapping

The archives-housekeeping application reads archive metadata from Elasticsearch. When archiving, metadata should be written to an index configured with the mapping_archive.json mapping.

Here is the command to load this mapping on the Standalone:

curl -X POST localhost:9200/_template/mapping_archive \
      -H "Content-Type: application/json" \
      -d @$PUNCHPLATFORM_CONF_DIR/resources/elasticsearch/mapping_archive.json
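To confirm that the template was registered, you can read it back from Elasticsearch. The sketch below assumes Elasticsearch answers on localhost:9200, as in the command above. The names ES_URL, template_url and template_status are illustrative and are not part of the Punch tooling.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (override with ES_URL).
ES_URL="${ES_URL:-localhost:9200}"

# Build the template endpoint for a given template name.
template_url() {
  echo "$ES_URL/_template/$1"
}

# Print the HTTP status code returned by a GET on that template.
template_status() {
  curl -s -o /dev/null -w '%{http_code}' "$(template_url "$1")"
}
```

Running `template_status mapping_archive` should print 200 once the template is loaded, and 404 if it is missing.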

Permissions

When archiving to a file system, make sure the archive folder grants read and write permissions. When this application runs in Shiva, the Punch daemon user must have those permissions on the archive data.
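A quick way to check those permissions is to test the archive root as the user that will run the application. This is a minimal sketch: check_archive_dir is a hypothetical helper, and the path in the usage note is the one from the standalone example on this page.

```shell
# Succeeds only if the given path is a directory that the current user
# can read, write and traverse.
check_archive_dir() {
  [ -d "$1" ] && [ -r "$1" ] && [ -w "$1" ] && [ -x "$1" ]
}
```

For example, run `check_archive_dir /tmp/archive-logs/storage || echo "missing permissions"` as the Punch daemon user. Note that the check only covers the root folder, not every file below it.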

Configuration

Here is a configuration example from the standalone that deletes:

  • data older than 1 hour on the file system
  • data older than 3 days on Minio

    archiving_pools:
      # File system
      - devices_addresses:
          - file:///tmp/archive-logs/storage
        pool: mytenant
        retention: 1h
        max_deletion_percentage: 100
        es_cluster_id: common
        es_index: mytenant-archive
        es_bulk_size: 1000
        es_timeout: 10s
        delete_metadata: all_devices
    
      # Minio
      - devices_addresses:
          - http://localhost:9000
        pool: mytenant
        retention: 3d
        max_deletion_percentage: 100
        es_cluster_id: common
        es_index: mytenant-archive
        es_bulk_size: 1000
        es_timeout: 10s
        delete_metadata: all_devices
        access_key: minioadmin
        secret_key: minioadmin
    

Parameters

Mandatory Parameters

  • devices_addresses (array of string)

    Array of devices addresses.
    For Ceph, the address is the absolute path to the Ceph cluster configuration file, with format ceph_configuration://<path>.
    For File-System, the address is the absolute path of the archive root directory with format file:///<path>.
    For Minio, the address is the URL to the Minio cluster with format http://<url>.

  • pool (string)

    Archiving pool name. It is the tenant name by default.

  • retention (string)

    Retention time. All data with batch.latest_ts older than specified retention will be deleted. Available formats:
    - Seconds : 30s, 30secs, 30seconds.
    - Minutes : 1m, 10mins, 2minutes.
    - Hours : 1h, 1hrs, 1hours.
    - Days : 1d, 1day, 1days.

  • max_deletion_percentage (decimal)

    Maximum percentage of the whole cluster data that may be deleted. This protects against accidental deletion: the operation fails if it would delete more than this percentage of the data.
    Set to 100.0 to disable this protection.

  • es_cluster_id (string)

    Elasticsearch cluster used to store the objects' metadata.

  • es_index (string)

    Elasticsearch index containing the objects' metadata. The suffix -* is appended to this name.
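To illustrate how retention and max_deletion_percentage are meant to be read, the sketch below converts a retention string into seconds and checks a planned deletion against the limit. It is not the application's code: both function names are hypothetical, percentages are restricted to integers, and the unit in which data is counted (plain numbers here) is an assumption.

```shell
# Convert a retention string such as "3d" or "10mins" into seconds.
# Accepts the suffixes listed for the retention parameter.
retention_to_seconds() {
  r_value="${1%%[!0-9]*}"     # leading digits
  r_unit="${1#"$r_value"}"    # remaining unit suffix
  if [ -z "$r_value" ]; then
    echo "unsupported retention: $1" >&2
    return 1
  fi
  case "$r_unit" in
    s|secs|seconds) r_factor=1 ;;
    m|mins|minutes) r_factor=60 ;;
    h|hrs|hours)    r_factor=3600 ;;
    d|day|days)     r_factor=86400 ;;
    *) echo "unsupported retention: $1" >&2; return 1 ;;
  esac
  echo $(( r_value * r_factor ))
}

# deletion_allowed <amount_to_delete> <total_amount> <max_percentage>
# Succeeds when the deletion does not exceed max_percentage of the total.
# Cross-multiplication keeps the comparison in integer arithmetic.
deletion_allowed() {
  [ $(( $1 * 100 )) -le $(( $2 * $3 )) ]
}
```

With retention: 3d, the cutoff is the current time minus the 259200 seconds printed by `retention_to_seconds 3d`. With a limit of 50, `deletion_allowed 40 100 50` succeeds and `deletion_allowed 51 100 50` fails, which corresponds to the case where the operation is refused.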

Security Parameters

  • credentials.user (string)

    Username to authenticate to ES cluster. Needs credentials.password configuration.

  • credentials.password (string)

    Password to authenticate to ES cluster. Needs credentials.user configuration.

  • credentials.token (string)

    Token string to authenticate to ES cluster. Needs credentials.token_type configuration.

  • credentials.token_type (string)

    Token type used to authenticate to ES cluster. Needs credentials.token configuration.

  • ssl (boolean)

    If true, encrypt the connection to the ES cluster with TLS.

  • ssl_private_key (string)

    Path to the client's private key for the TLS connection.

  • ssl_private_key_password (string)

    Password for the client's private key for the TLS connection.

  • ssl_private_key_alias (string)

    Alias for the client's private key for the TLS connection.

  • ssl_certificate (string)

    Path to the client's certificate for the TLS connection.

  • ssl_trusted_certificate (string)

    Path to the client's CA file for the TLS connection.

  • ssl_keystore_location (string)

    Path to the client's keystore for the TLS connection.

  • ssl_keystore_password (string)

    Password for the client's keystore for the TLS connection.

  • ssl_truststore_location (string)

    Path to the client's truststore for the TLS connection.

  • ssl_truststore_password (string)

    Password for the client's truststore for the TLS connection.

  • user (string)

    Only for Ceph. Ceph user name.

  • access_key (string)

    Only for Minio. Minio access key.

  • secret_key (string)

    Only for Minio. Minio secret key.

Optional Parameters

  • delete_metadata (string)

    Deletion strategy regarding metadata. Default: "always". Possible values:

    • "always": metadata is always deleted after processing devices.
    • "never": metadata is never deleted after processing devices.
    • "all_devices": metadata is deleted only when all devices are cleaned by housekeeping. If some devices were not cleaned (because they were not specified or because the application failed to clean them), the application updates the metadata so that it keeps only the devices where the archive is still present. This is the recommended behavior for production.

  • es_bulk_size (int)

    Size of the Elasticsearch bulk requests containing the delete and update metadata actions. Default: 1000

  • es_timeout (string)

    Elasticsearch request timeout. Default: 10s
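The three delete_metadata strategies can be summarized as a small decision function. This is a sketch of the behavior described above, not the application's implementation: metadata_action is a hypothetical name, and devices are passed as space-separated labels purely for illustration.

```shell
# metadata_action <strategy> <devices_in_metadata> <devices_cleaned>
# Prints what happens to the metadata entry: "delete", "keep" or "update".
metadata_action() {
  case "$1" in
    always) echo delete ;;
    never)  echo keep ;;
    all_devices)
      for dev in $2; do
        case " $3 " in
          *" $dev "*) ;;               # this device was cleaned
          *) echo update; return 0 ;;  # archive still present on this device
        esac
      done
      echo delete ;;                   # every device was cleaned
    *) echo "unknown strategy: $1" >&2; return 1 ;;
  esac
}
```

For instance, `metadata_action all_devices "fs minio" "fs"` prints update, because the archive is still present on the second device, while `metadata_action all_devices "fs minio" "fs minio"` prints delete.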

Running in foreground

Run the application in the foreground on a Punch Operator:

archives-housekeeping /path/to/your/archives-housekeeping.yaml

Running in Shiva

Include the application in a Shiva channel. Example from the standalone:

version: '6.0'
start_by_tenant: true
stop_by_tenant: true
applications:

  - name: elasticsearch-housekeeping
    runtime: shiva
    cluster: common
    command: elasticsearch-housekeeping
    args:
      - --tenant-configuration-path
      - elasticsearch-housekeeping.json
    apply_resolver_on:
      - elasticsearch-housekeeping.json
    quartzcron_schedule: 0 * * ? * * *

  - name: archives-housekeeping
    runtime: shiva
    cluster: common
    command: archives-housekeeping
    args:
      - archives-housekeeping.yaml
      - --childopts
      - -Xms100m -Xmx500m
    quartzcron_schedule: 0 * * ? * * *
resources: [ ]