migr8 migration utility

About

A utility program, called migr8, supporting data migrations between MongoDB and CrateDB.

Tip

Please also see the documentation for the MongoDB Table Loader to learn about a higher-level interface.

Details

This tool iterates over one or multiple MongoDB collections, and iteratively builds up a description of the schema of those collections.

In a second step, the tool uses this description to derive a best-fit CrateDB table definition for those collections.

Consequently, the tool works best on collections of similarly structured and typed data.

Supported MongoDB versions

The application supports the following versions of MongoDB.

If you need support for MongoDB 2.x, you will need to downgrade the pymongo client driver library to version 3, for example with pip install 'pymongo<4'.

Installation

Use pip to install the package from PyPI.

pip install --upgrade 'cratedb-toolkit[mongodb]'

To verify that the installation worked, invoke:

ctk --version

Usage

ctk load table is your one-stop command to populate a CrateDB table from a MongoDB collection.

export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/testdrive/demo
ctk load table mongodb://localhost:27017/testdrive/demo

It will run extract and translate to gather the SQL DDL schema, and will invoke export and cr8 to actually transfer data.

Usage for migr8

The program migr8 offers three subcommands, extract, translate, and export, to carry out data transfers from MongoDB to CrateDB. Please read this section carefully to learn how to use them successfully.

If you intend to evaluate migr8 on a small portion of your MongoDB data, the --limit command-line option of the migr8 extract and migr8 export subcommands might be useful. Using --limit 10000 is usually sufficient, and fast enough, to assess whether the schema translation and data transfer work well.

migr8 --version
migr8 --help

Schema Extraction

To extract a description of the schema of a collection, use the extract subcommand. For example:

migr8 extract --host localhost --port 27017 --database test_db --out mongodb_schema.json

After connecting to the designated MongoDB server, it will look at the collections within that database, and will prompt you to choose which collections to exclude from analysis.

You can then do a full or partial scan of each remaining collection.

A partial scan will only look at the first entry in a collection, and thus may produce an ambiguous schema definition. It is still useful if you already know the collection is systematically and regularly structured.

A full scan will iterate over the entire collection and build up the schema description. Cancelling the scan will cause the tool to output the schema description it has built up thus far.

For example, scanning a collection of payloads including a ts field, a sensor field, and a payload object, may yield this outcome:

{
  "test": {
    "count": 100000,
    "document": {
      "_id": {
        "count": 100000,
        "types": {
          "OID": {
            "count": 100000
          }
        }
      },
      "ts": {
        "count": 100000,
        "types": {
          "DATETIME": {
            "count": 100000
          }
        }
      },
      "sensor": {
        "count": 100000,
        "types": {
          "STRING": {
            "count": 100000
          }
        }
      },
      "payload": {
        "count": 100000,
        "types": {
          "OBJECT": {
            "count": 100000,
            "document": {
              "temp": {
                "count": 100000,
                "types": {
                  "FLOAT": {
                    "count": 1
                  },
                  "INTEGER": {
                    "count": 99999
                  }
                }
              },
              "humidity": {
                "count": 100000,
                "types": {
                  "FLOAT": {
                    "count": 1
                  },
                  "INTEGER": {
                    "count": 99999
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

This description indicates that the data is well-structured and has mostly consistent data types: in both temp and humidity, only one value out of 100,000 is a FLOAT rather than an INTEGER.
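To illustrate how such a description can be accumulated one document at a time, here is a minimal Python sketch. It is not migr8's actual implementation: the function names (type_name, update_description, describe) and the set of type labels are assumptions made for illustration, chosen to produce the same nested "count" / "types" / "document" shape as the output shown above.

```python
from datetime import datetime


def type_name(value):
    """Map a Python value to a coarse type label (labels are illustrative)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, datetime):
        return "DATETIME"
    if isinstance(value, dict):
        return "OBJECT"
    if isinstance(value, list):
        return "ARRAY"
    return "UNKNOWN"


def update_description(description, document):
    """Fold one document into the running per-field type counts."""
    for field, value in document.items():
        entry = description.setdefault(field, {"count": 0, "types": {}})
        entry["count"] += 1
        label = type_name(value)
        type_entry = entry["types"].setdefault(label, {"count": 0})
        type_entry["count"] += 1
        if label == "OBJECT":
            # Nested objects get their own sub-description.
            update_description(type_entry.setdefault("document", {}), value)
    return description


def describe(documents):
    """Scan an iterable of documents and return a schema description."""
    description = {}
    count = 0
    for document in documents:
        update_description(description, document)
        count += 1
    return {"count": count, "document": description}


if __name__ == "__main__":
    sample = [
        {"sensor": "a", "payload": {"temp": 21}},
        {"sensor": "b", "payload": {"temp": 21.5}},
    ]
    print(describe(sample))
```

Because the counts are updated per document, stopping the loop early still leaves a usable description of everything seen so far.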

Schema Translation

Once a schema description has been extracted, it can be translated into a CrateDB schema definition using the translate subcommand:

migr8 translate --infile mongodb_schema.json

This will attempt to translate the description into a best-fit CrateDB table definition. Where data types are ambiguous, it will choose the most common one. For example, the previous schema definition would be translated into this SQL DDL statement:

CREATE TABLE IF NOT EXISTS "doc"."test" (
    "ts" TIMESTAMP WITH TIME ZONE,
    "sensor" TEXT,
    "payload" OBJECT (STRICT) AS (
        -- ⬇️ Types: FLOAT: 0.0%, INTEGER: 100.0%
        "temp" INTEGER,
        -- ⬇️ Types: FLOAT: 0.0%, INTEGER: 100.0%
        "humidity" INTEGER
    )
);
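The "most common data type" rule and the percentage comments in the DDL can be sketched in a few lines of Python. This is an illustrative approximation, not the tool's own code: pick_type and format_comment are hypothetical helper names, and the input is the "types" mapping of a single field from the schema description.

```python
def pick_type(types):
    """Return (most common type label, {label: percentage}) for one field."""
    total = sum(info["count"] for info in types.values())
    if total == 0:
        raise ValueError("field has no observed values")
    shares = {label: 100.0 * info["count"] / total for label, info in types.items()}
    winner = max(types, key=lambda label: types[label]["count"])
    return winner, shares


def format_comment(shares):
    """Render the type distribution as an SQL comment, one decimal place."""
    parts = ", ".join(
        f"{label}: {share:.1f}%" for label, share in sorted(shares.items())
    )
    return f"-- Types: {parts}"


if __name__ == "__main__":
    # The "temp" field from the example description.
    temp_types = {"FLOAT": {"count": 1}, "INTEGER": {"count": 99999}}
    winner, shares = pick_type(temp_types)
    print(format_comment(shares))
    print(f'"temp" {winner}')
```

With one FLOAT among 100,000 values, the shares are 0.001% and 99.999%, which round to the 0.0% and 100.0% shown in the DDL comments.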

You can also pipe the output of the extract subcommand into the translate subcommand to run both steps at once.

migr8 extract ... | migr8 translate

MongoDB Collection Export

To export a MongoDB collection to a JSON stream, use the export subcommand:

migr8 export --host localhost --port 27017 --database test_db --collection test

This will convert the collection’s records into JSON, and output the JSON to stdout. For example, to redirect the output to a file, run:

migr8 export --host localhost --port 27017 --database test_db --collection test > test.json
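If you want to inspect the exported file before loading it, a small Python reader can help. The sketch below assumes the export contains one JSON document per line; that layout is an assumption and is not confirmed here, so verify it against your own output first. The helper name iter_records is hypothetical.

```python
import json


def iter_records(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    # Stand-in for the lines of an exported file (assumed layout).
    sample_lines = [
        '{"sensor": "a", "payload": {"temp": 21}}\n',
        '{"sensor": "b", "payload": {"temp": 22}}\n',
    ]
    for record in iter_records(sample_lines):
        print(sorted(record.keys()))
```

An open file object can be passed in place of sample_lines, since iterating over a text file also yields one line at a time.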

Alternatively, use cr8 to directly write the MongoDB collection into a CrateDB table:

migr8 export --host localhost --port 27017 --database test_db --collection test | \
    cr8 insert-json --hosts localhost:4200 --table test

Using Tikray transformations

You can use Tikray transformations to change the shape of the data while it is being transferred. To add one to the pipeline, use the --transformation command-line option on the migr8 extract and migr8 export commands.

You can find an example file at examples/tikray/tikray-transformation.yaml.