Partitioned Databases - Data Migration

March 05, 2019 | Glynn Bird | Partitioned Databases Migration

This is the third part of a series of posts on Partitioned Databases in Cloudant. Part One, Part Two and Part Four may also be of interest.

Cloudant’s new Partitioned Databases feature allows a Cloudant database to be organised into partitions (blocks of data guaranteed to reside on the same database shard) by specifying a two-part _id field consisting of the partition key and document id e.g.

{
  "_id": "US:52412",
  "name": "Boston",
  "state": "Massachusetts",
  "country": "US"
  ...
}

A partitioned database allows queries to be limited to a single partition - such queries can be performed much more efficiently than whole-database (global) queries.
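Once populated, such a database can be queried one partition at a time via the partition-specific endpoints. As a sketch, assuming a partitioned database called cities2 (created below), all of the documents in the US partition can be fetched with:

curl -X GET "$URL/cities2/_partition/US/_all_docs?include_docs=true"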

Photo by Timothy Muza on Unsplash

Migrating to a partitioned database

Migrating existing data over to a partitioned database will require creating a new database with the partitioned=true flag:

curl -X PUT "$URL/cities2?partitioned=true"
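If the database is created successfully, Cloudant replies with:

{"ok":true}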

The new database will need to be populated with a copy of the original data, but with the new partitionid:documentid format.

i.e. we need to transform documents of this form:

{"_id":"52412","country":"US","name":"Boston"}

into this form:

{"_id":"US:52412","country":"US","name":"Boston"}

to ensure that each city is placed in a per-country partition.
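The transformation itself is easy to try out in isolation. As a minimal sketch with a single document (the jq expression here is the same one used in the full pipeline below):

echo '{"_id":"52412","country":"US","name":"Boston"}' | jq -c '._id = .country + ":" + ._id'

which outputs:

{"_id":"US:52412","country":"US","name":"Boston"}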

Cloudant’s Replication API allows data to be copied or synced from a source database to a target database. Filters can be used to decide whether or not a document should be replicated, but replication doesn’t allow the data to be transformed in transit.

There is a neat trick that allows data to be moved from one database to another, while modifying the _id field (or any other field for that matter) in the process. To do this we are going to use two command-line tools:

  1. couchbackup - allows CouchDB/Cloudant data to be backed up and stored as text files on your machine. It also comes with a companion couchrestore tool to restore that data to the database, or to a new empty database.
  2. jq - a JSON processor used to format and modify JSON data structures.
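Neither tool is likely to be pre-installed. As a sketch of the setup, couchbackup (which bundles couchrestore) is distributed via npm, and both tools read the Cloudant service URL from the COUCH_URL environment variable; jq is available from most package managers. The URL shown here is a placeholder:

npm install -g @cloudant/couchbackup
export COUCH_URL="https://username:password@myhost.cloudant.com"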

Our process is this:

  1. Back up the cities database with couchbackup.
  2. Pipe each batch of documents through jq, prefixing every _id with its country code and a colon.
  3. Pipe the transformed data to couchrestore to populate the new cities2 database.

The three actions can be achieved in a single command:

couchbackup --db cities | jq -c '. | map(._id = .country + ":" + ._id)' | couchrestore --db cities2

Let’s break that down:

couchbackup --db cities simply initiates a backup of the “cities” database. The data is output in batches of several hundred documents, with one batch per line, e.g.

[{"_id":"52412","country":"US",name:"Boston"},{"_id":"781","country":"UK",name:"Oxford"}]
[{"_id":"152","country":"IN",name:"Malaut"},{"_id":"782","country":"PK",name:"Nārang"}]

The jq line jq -c '. | map(._id = .country + ":" + ._id)' means:

  - -c - produce compact, single-line output, preserving the one-batch-per-line format.
  - . - take each incoming line, which is an array of documents.
  - map(...) - apply the enclosed expression to every element of the array.
  - ._id = .country + ":" + ._id - replace each document’s _id with the country code, a colon and the original _id.

The result is a transformed array of cities:

[{"_id":"US:52412","country":"US",name:"Boston"},{"_id":"UK:781","country":"UK",name:"Oxford"}]
[{"_id":"IN:152","country":"IN",name:"Malaut"},{"_id":"PK:782","country":"PK",name:"Nārang"}]

Piping this data to couchrestore populates the new database.
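It’s worth sanity-checking the result. As a sketch, fetch one migrated document by its new two-part _id (reusing the US:52412 example from above) and compare the doc_count reported in each database’s metadata:

curl "$URL/cities2/US:52412"
curl "$URL/cities" | jq .doc_count
curl "$URL/cities2" | jq .doc_count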

Other considerations

Further information