(mongodb-loader)=
# MongoDB Table Loader

## About

Load data from MongoDB and its file formats into CrateDB using the one-stop
command `ctk load table`, facilitating convenient data transfers within data
pipelines or for ad hoc operations.

## Coverage

CrateDB Toolkit supports several variants for loading MongoDB data from
server instances and filesystems.

- `mongodb://`: Connect to MongoDB Community or Enterprise Edition.
- `mongodb+srv://`: Connect to MongoDB Atlas.
- `file+bson://`: Read files in [MongoDB Extended JSON] or [BSON] format from the filesystem.
- `http+bson://`: Read files in [MongoDB Extended JSON] or [BSON] format from an HTTP resource.

## Install

```shell
pip install --upgrade 'cratedb-toolkit[mongodb]'
```

## Usage

The MongoDB I/O adapter can process MongoDB data from different sources.
This section enumerates the relevant connectivity options, accompanied by
concrete usage examples.

### MongoDB Atlas

Transfer a single collection from MongoDB Atlas.

```shell
export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/ticker/stocks
ctk load table "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker/stocks?batch-size=5000"
```

Transfer all collections in a database from MongoDB Atlas.

```shell
export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/ticker
ctk load table "mongodb+srv://john:EeY6OocooL8rungu@testdrive.ahnaik1.mongodb.net/ticker?batch-size=5000"
```

:::{important}
When transferring **multiple collections**, make sure to use a CrateDB
database address which DOES NOT reference an individual table. It MUST stop
at the **schema** label, here `ticker`. Likewise, the MongoDB database
address MUST also reference a **database**, NOT a specific collection.
:::

### MongoDB Community and Enterprise

Transfer data from a MongoDB database/collection into a CrateDB schema/table.

```shell
export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/testdrive/demo
ctk load table "mongodb://localhost:27017/testdrive/demo"
```

Query data in CrateDB.
```shell
export CRATEDB_CLUSTER_URL=crate://crate@localhost:4200/testdrive/demo
ctk shell --command "SELECT * FROM testdrive.demo;"
ctk show table "testdrive.demo"
```

### MongoDB JSON/BSON files

Load data from MongoDB JSON/BSON files, for example produced by the
`mongoexport` or `mongodump` programs.

```shell
# Extended JSON, filesystem, full path.
ctk load table "file+bson:///path/to/mongodb-json-files/datasets/books.json"

# Extended JSON, HTTP resource.
ctk load table "https+bson://github.com/ozlerhakan/mongodb-json-files/raw/master/datasets/books.json"

# BSON, filesystem, relative path, compressed.
ctk load table "file+bson:./var/data/testdrive/books.bson.gz"

# Extended JSON, filesystem, multiple files.
ctk load table \
  "file+bson:///path/to/mongodb-json-files/datasets/*.json?batch-size=2500" \
  --cluster-url="crate://crate@localhost:4200/datasets"
```

:::{important}
When transferring **multiple collections**, make sure to use a CrateDB
database address which DOES NOT reference an individual table. It MUST stop
at the **schema** label, here `datasets`. Likewise, the path to the MongoDB
JSON files MUST also reference the **parent folder**, NOT a specific JSON or
BSON file.
:::

To exercise a full example importing multiple MongoDB Extended JSON files,
see [](#file-import-tutorial).

## Options

### Batch Size

The default batch size is 100, but for many datasets a much larger batch
size makes data transfers more efficient. You can adjust the value by
appending the HTTP URL query parameter `batch-size` to the source URL, like
`mongodb+srv://managed.mongodb.net/ticker/stocks?batch-size=5000`.

### Filter

Use the HTTP URL query parameter `filter` on the source URL, like
`&filter={"exchange":{"$eq":"NASDAQ"}}` or
`&filter={"_id":"66f0002e98c00fb8261d87c8"}`, in order to provide a MongoDB
query filter as a JSON string. It works in the same way as `mongoexport`'s
`--query` option.
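As a sketch of how such a filter might be prepared, the snippet below
percent-encodes the JSON expression with Python's standard library before
appending it to the source URL. The connection details are placeholders,
not part of the original examples.

```shell
# Percent-encode a MongoDB query filter so it can safely be appended
# to a source URL. The MongoDB address below is a placeholder.
filter=$(python3 -c 'import urllib.parse, sys; print(urllib.parse.quote(sys.argv[1]))' \
  '{"exchange":{"$eq":"NASDAQ"}}')
echo "mongodb://localhost:27017/ticker/stocks?filter=${filter}"
```

This prints the source URL with the filter encoded as
`%7B%22exchange%22%3A%7B%22%24eq%22%3A%22NASDAQ%22%7D%7D`.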
For more complex query expressions, make sure to properly encode the filter
value using URL/percent encoding.

### Limit

Use the HTTP URL query parameter `limit` on the source URL, like
`&limit=100`, in order to limit processing to a total number of records.

### Offset

Use the HTTP URL query parameter `offset` on the source URL, like
`&offset=42`, in order to start processing at the given record offset,
skipping the records before it.

## Transformations

You can use [Tikray Transformations] to change the shape of the data while
it is being transferred. To add a transformation to the pipeline, use the
`--transformation` command line option. It is also available for the
`migr8 extract` and `migr8 export` commands. Example transformation files
in YAML format can be explored at [examples/tikray].

## Appendix

### Insert Exercise

Import two data points into the MongoDB database `testdrive` and collection
`demo`, using the `mongosh` CLI program.

```shell
mongosh mongodb://localhost:27017/testdrive <