# Retention and Expiration ## About A data retention and expiration management subsystem for CrateDB, implementing multiple strategies. ### Details The application manages the life-cycle of data stored in CrateDB, handling concerns of data expiry, size reduction, and archival. Within a system storing and processing large amounts of data, it is crucial to manage data flows between "hot", "warm", and "cold" storage types better than using ad hoc solutions. Data retention policies can be flexibly configured by adding records to the retention policy database table, which is also stored within CrateDB. ### Background With other databases, this technique, or variants thereof, are known as [rolling up historical data], [downsampling a time series data stream], [downsampling and data retention], or just [downsampling]. They are useful to reduce the storage size of historical data by decreasing its resolution, where this is acceptable related to your needs. > ES: The Rollup functionality summarizes old, high-granularity data into a reduced > granularity format for long-term storage. By "rolling" the data up, historical > data can be compressed greatly compared to the raw data. ### Side looks In classical OLAP operations, > a [roll-up] involves summarizing the data along a dimension. The summarization > rule might be an aggregate function, such as computing totals along a hierarchy > or by applying a set of formulas. In classical dashboarding applications, reducing the amount of timeseries-data _on-demand_ is also applicable. > When rendering data to your screen, and its density is larger than the amount > of pixels available to display, data is reduced by using [time bucketing] for > grouping records into equal-sized time ranges, before applying a resampling > function on it. The width of those time ranges is usually automatically derived > from the zoom level, i.e. the total time range the user is looking at. Contrary to other techniques which may compute data _on-demand_, the operations performed by this application **permanently** reduce the amount, size, or resolution of data. ### Details > The retention policy database table is also stored within CrateDB. By default, the `ext` schema is used for that, so the effective full-qualified database table name is `"ext"."retention_policy"`. The schema is configurable by using the `--schema` command-line option, or the `CRATEDB_EXT_SCHEMA` environment variable. ## Strategies This section enumerates the available data retention and expiration strategies. More strategies can be added, and corresponding contributions are welcome. ### DELETE A retention policy algorithm that deletes records from expired partitions. ```shell cratedb-retention create-policy --strategy=delete \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=1 \ "${CRATEDB_URI}" ``` This retention policy implements the following directive. > **Delete** all data from the `"doc"."raw_metrics"` table, on partitions defined by > the column `ts_day`, which is older than **1** day at the given cutoff date when > running the retention job. [implementation][strategy-delete] | [tutorial](https://community.crate.io/t/cratedb-and-apache-airflow-implementation-of-data-retention-policy/913) ### REALLOCATE A retention policy algorithm that reallocates expired partitions from "hot" nodes to "warm" nodes. Because each cluster member is assigned a designated node type by using the `-Cnode.attr.storage=hot|warm` parameter, this strategy is only applicable in cluster/multi-node scenarios. On the data expiration run, corresponding partitions will get physically moved to cluster nodes of the `warm` type, which are mostly designated archive nodes, with large amounts of storage space. ```shell cratedb-retention create-policy --strategy=reallocate \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=60 \ --reallocation-attribute-name=storage --reallocation-attribute-value=warm \ "${CRATEDB_URI}" ``` This retention policy implements the following directive. > **Reallocate** data from the `"doc"."raw_metrics"` table, on partitions defined by > the column `ts_day`, which is older than **60** days at the given cutoff date, to > nodes tagged with the `storage=warm` attribute. [implementation][strategy-reallocate] | [tutorial](https://community.crate.io/t/cratedb-and-apache-airflow-building-a-hot-cold-storage-data-retention-policy/934) ### SNAPSHOT A retention policy algorithm that snapshots expired partitions to a repository, and prunes them from the database afterwards. It is suitable for long-term data archival purposes. In CrateDB jargon, a repository is a storage unit, where data in form of snapshots can be exported to, and imported from. A repository can be located on a filesystem, or on a supported object storage backend. CrateDB is able to use buckets on S3-compatible storage backends, or on Azure blob storage, using the `CREATE REPOSITORY ... TYPE = s3|azure|fs` SQL statement. ```sql CREATE REPOSITORY export_cold TYPE s3 WITH ( protocol = 'https', endpoint = 's3-store.example.org:443', access_key = '', secret_key = '', bucket = 'cratedb-cold-storage' ); ``` ```shell cratedb-retention create-policy --strategy=snapshot \ --table-schema=doc --table-name=sensor_readings \ --partition-column=time_month --retention-period=365 \ --target-repository-name=export_cold \ "${CRATEDB_URI}" ``` This retention policy implements the following directive. > Run a **snapshot** on the data within the `"doc"."sensor_readings"` table, on partitions > defined by the column `time_month`, which is older than **365** days at the given cutoff > date, to a previously created repository called `export_cold`. Delete the corresponding > data afterwards. [implementation][strategy-snapshot] | [tutorial](https://community.crate.io/t/building-a-data-retention-policy-for-cratedb-with-apache-airflow/1001) ## Data model ### SQL DDL schema The database schema for the retention policy table is defined like this. Each record represents a rule how to expire and retain data per table. At runtime, each record will correspond to a dedicated retention job task. ```sql -- Set up the retention policy database table schema. CREATE TABLE IF NOT EXISTS "ext"."retention_policy" ( -- Strategy to apply for data retention. "strategy" TEXT NOT NULL, -- Tags: For grouping, multi-tenancy, and more. "tags" OBJECT(DYNAMIC), -- Source: The database table operated upon. "table_schema" TEXT, -- The source table schema. "table_name" TEXT, -- The source table name. "partition_column" TEXT NOT NULL, -- The source table column name used for partitioning. -- Retention parameters. "retention_period" INTEGER NOT NULL, -- Retention period in days. The number of days data gets -- retained before applying the retention policy. -- Target: Where data is moved/relocated to. -- Targeting specific nodes. -- You may want to designate dedicated nodes to be responsible for "hot" or "warm" storage types. -- To do that, you can assign attributes to specific nodes, effectively tagging them. -- https://crate.io/docs/crate/reference/en/latest/config/node.html#custom-attributes "reallocation_attribute_name" TEXT, -- Name of the node-specific custom attribute. "reallocation_attribute_value" TEXT, -- Value of the node-specific custom attribute. -- Targeting a repository. "target_repository_name" TEXT, -- The name of a repository created with `CREATE REPOSITORY ...`. PRIMARY KEY ("table_schema", "table_name", "strategy") ) CLUSTERED INTO 1 SHARDS; ``` ### Record examples You can add a retention policy by using an SQL statement like this. ```sql -- Add a retention policy using the DELETE strategy. INSERT INTO "ext"."retention_policy" (strategy, table_schema, table_name, partition_column, retention_period) VALUES ('delete', 'doc', 'raw_metrics', 'ts_day', 1); ``` Using the corresponding CLI command is equivalent to the SQL statement. ```shell cratedb-retention create-policy --strategy=delete \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=1 \ "${CRATEDB_URI}" ``` After running that command, and creating a record, the program reports about its identifier. ``` Created new retention policy: adb90608-c20f-4de7-be98-93588d8358dc ``` You can use it to delete the rule again. ```shell cratedb-retention delete-policy --id=adb90608-c20f-4de7-be98-93588d8358dc \ "${CRATEDB_URI}" ``` For enumerating all policies, use the `list-policies` subcommand. ```shell cratedb-retention list-policies "${CRATEDB_URI}" ``` ### Tags By using tags, you can conveniently define groups of retention policies. This policy is being tagged with both `foo`, and `bar`. ```shell cratedb-retention create-policy --strategy=delete \ --tags=foo,bar \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=1 \ "${CRATEDB_URI}" ``` In order to select retention policies tagged with `foo`, use such an SQL statement. ```sql SET error_on_unknown_object_key=false; SELECT * FROM "ext"."retention_policy" WHERE tags['foo'] IS NOT NULL; ``` Delete all retention policy records tagged with `foo` and `bar`. ```sql DELETE FROM "ext"."retention_policy" WHERE tags['foo'] IS NOT NULL AND tags['bar'] IS NOT NULL; ``` ```shell cratedb-retention delete-policy --tags=foo,bar "${CRATEDB_URI}" ``` In order to use tags when running retention jobs, the program accepts an optional command-line option like `--tags=foo,bar`. It will limit the retention policies to all records which contain both tags, i.e. they are combined using `AND`. Currently, there is no interface to obtain multiple tag values, and combine them using `OR`. Running retention jobs for multiple tags needs multiple invocations. ```shell cratedb-retention run --strategy=delete --tags=foo,bar "${CRATEDB_URI}" ``` ## Install The CrateDB data retention and expiration toolkit program `cratedb-retention` can be installed by using the corresponding Python package, or it can alternatively be invoked on the command line using Podman or Docker. For installing the package, please refer to the [install documentation section](#install). ```shell cratedb-retention --version ``` ## Usage The toolkit can be used both as a standalone program, and as a library. ### Command-line use This section outlines how to connect to, and run data retention jobs, both on a database cluster running on your premises, and on [CrateDB Cloud]. The steps are the same, only the connection parameters are different. ```{todo} Note that, currently, the Python driver and the crash database shell need to obtain slightly different parameters. ``` By using the `--verbose` option, directly after the main command, before the subcommand, you can inspect the SQL statements issued to the database. ```shell cratedb-retention --verbose setup "${CRATEDB_URI}" ``` By using the `--dry-run` option, after the subcommand, the program will not issue altering or modifying SQL statements to the database. You can use it to learn about which actions would be performed. ```shell cratedb-retention setup --dry-run "${CRATEDB_URI}" ``` The `DBURI` command-line argument can be omitted when defining the `CRATEDB_URI` environment variable. ```shell export CRATEDB_URI='crate://localhost/' ``` There are also a few shortcuts for subcommands. For example, the shortest form of incantation for deleting a retention policy would be: ```shell cratedb-retention rm --tags=baz ``` #### Workstation / on-premise ```shell export CRATEDB_URI='crate://localhost/' export CRATEDB_HOST='http://localhost:4200/' ``` #### CrateDB Cloud ```shell export CRATEDB_URI='crate://admin:@.aks1.eastus2.azure.cratedb.net:4200?ssl=true' export CRATEDB_HOST='https://admin:@.aks1.eastus2.azure.cratedb.net:4200/' ``` #### General Install retention policy bookkeeping tables. ```shell cratedb-retention setup "${CRATEDB_URI}" ``` Add a retention policy rule. ```shell cratedb-retention create-policy --strategy=delete \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=1 \ "${CRATEDB_URI}" ``` Invoke the data retention job, expiring all data older than one day. ```shell cratedb-retention run --strategy=delete "${CRATEDB_URI}" ``` Invoke the data retention job, expiring all data older than one day before a specific cutoff date. ```shell cratedb-retention run --cutoff-day=2023-06-27 --strategy=delete "${CRATEDB_URI}" ``` Simulate the data retention job, display SQL statements only. ```shell cratedb-retention run --dry-run --strategy=delete "${CRATEDB_URI}" ``` #### Podman or Docker This section offers a complete walkthrough, how work with the `cratedb-retention` program interactively using Podman or Docker, step by step. It starts a single- node CrateDB instance on your workstation for demonstration purposes. Please note that some retention policy strategies will not work on single-node installations. ```shell export CRATEDB_URI='crate://localhost/' export CRATEDB_HOST='http://localhost:4200/' export OCI_IMAGE='ghcr.io/crate/cratedb-retention:nightly' ``` Display version number. ```shell docker run --rm -i --network=host "${OCI_IMAGE}" \ cratedb-retention --version ``` Start CrateDB. ```shell docker run --rm -it \ --name=cratedb \ --publish=4200:4200 --publish=5432:5432 \ --env=CRATE_HEAP_SIZE=4g crate \ -Cdiscovery.type=single-node ``` Install retention policy bookkeeping tables. ```shell docker run --rm -i --network=host "${OCI_IMAGE}" \ cratedb-retention setup "${CRATEDB_URI}" ``` Add a retention policy rule using the DELETE strategy. ```shell docker run --rm -i --network=host "${OCI_IMAGE}" \ cratedb-retention create-policy --strategy=delete \ --table-schema=doc --table-name=raw_metrics \ --partition-column=ts_day --retention-period=1 \ "${CRATEDB_URI}" ``` Invoke the data retention job. ```shell docker run --rm -i --network=host "${OCI_IMAGE}" \ cratedb-retention run --strategy=delete "${CRATEDB_URI}" ``` ### Library use This section outlines how to use the toolkit as a library within your own applications. The code displayed below is a stripped-down version of the runnable example program [`examples/retention_retire_cutoff.py`], located within this repository. ```python from cratedb_toolkit.retention.core import RetentionJob from cratedb_toolkit.retention import JobSettings, RetentionPolicy, RetentionStrategy from cratedb_toolkit.retention.setup import setup_schema from cratedb_toolkit.retention.store import RetentionPolicyStore from cratedb_toolkit.model import DatabaseAddress # Define the database URI to connect to. DBURI = "crate://localhost/" # A. Set up adapter to retention policy store. settings = JobSettings( database=DatabaseAddress.from_string(DBURI), ) store = RetentionPolicyStore(settings=settings) # B. Initialize the subsystem, and create a data retention policy. # TODO: Refactor to `RetentionPolicyStore`. setup_schema(settings=settings) # Add a basic retention policy. policy = RetentionPolicy( strategy=RetentionStrategy.DELETE, table_schema="doc", table_name="raw_metrics", partition_column="ts_day", retention_period=1, ) identifier = store.create(policy, ignore="DuplicateKeyException") # C. Define job settings, and invoke the data retention job. settings = JobSettings( database=DatabaseAddress.from_string(DBURI), strategy=RetentionStrategy.DELETE, ) job = RetentionJob(settings=settings) job.start() ``` [CrateDB]: https://crate.io/products/cratedb [CrateDB Cloud]: https://console.cratedb.cloud/ [downsampling]: https://docs.victoriametrics.com/#downsampling [downsampling a time series data stream]: https://www.elastic.co/guide/en/elasticsearch/reference/current/downsampling.html [downsampling and data retention]: https://docs.influxdata.com/influxdb/v1.8/guides/downsample_and_retain/ [`examples/retention_retire_cutoff.py`]: https://github.com/crate/cratedb-toolkit/blob/main/examples/retention_retire_cutoff.py [rolling up historical data]: https://www.elastic.co/guide/en/elasticsearch/reference/current/rollup-overview.html [roll-up]: https://en.wikipedia.org/wiki/OLAP_cube#Operations [strategy-delete]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/strategy/delete.py [strategy-reallocate]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/strategy/reallocate.py [strategy-snapshot]: https://github.com/crate/cratedb-toolkit/blob/main/cratedb_toolkit/retention/strategy/snapshot.py [time bucketing]: https://community.crate.io/t/resampling-time-series-data-with-date-bin/1009