Partitioned Databases - Data Migration
This is the third part of a series of posts on Partitioned Databases in Cloudant. Part One, Part Two and Part Four may also be of interest.
Cloudant’s new Partitioned Databases feature allows a Cloudant database to be organised into partitions (blocks of data guaranteed to reside on the same database shard) by specifying a two-part `_id` field consisting of the partition key and document id e.g.
{
"_id": "US:52412",
"name": "Boston",
"state": "Massachusetts",
"country": "US"
...
}
- “US” identifies the partition - all documents whose `_id` begins with “US:” will be stored in the same physical database shard.
- “52412” is a unique identifier for the document. It must be unique within the partition.
- the partition and document identifiers are separated by a “:” character.
A partitioned database allows queries limited to a single partition - such queries can be performed much more efficiently than whole-database (global) queries.
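For example, once a partitioned database is in place (such as the cities2 database created below), a query can be scoped to one partition via the partition's own endpoint. A minimal sketch, assuming `$URL` holds your authenticated Cloudant base URL:
# query only the "US" partition - the request touches a single shard
curl -X POST "$URL/cities2/_partition/US/_find" \
     -H "Content-Type: application/json" \
     -d '{"selector": {"state": "Massachusetts"}}'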
Migrating to a partitioned database
Migrating existing data over to a partitioned database requires creating a new database with the `partitioned=true` flag:
curl -X PUT "$URL/cities2?partitioned=true"
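If the request succeeds, Cloudant replies with a simple JSON confirmation:
{"ok":true}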
The new database will need to be populated with a copy of the original data, but with the new `partitionid:documentid` format, i.e. we need to transform documents of this form:
{"_id":"52412","country":"US",name:"Boston"}
into this form:
{"_id":"US:52412","country":"US",name:"Boston"}
to ensure that each city is placed in a per-country partition.
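To see the transformation in isolation, here is a one-document sketch of the jq assignment used later in this post:
echo '{"_id":"52412","country":"US","name":"Boston"}' | jq -c '._id = .country + ":" + ._id'
which prints:
{"_id":"US:52412","country":"US","name":"Boston"}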
Cloudant’s Replication API allows data to be copied or synced from a source database to a target database. Filters can be used to decide whether a document should be replicated or not but replication doesn’t allow you to transform the data as it is replicated.
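For comparison, a replication is normally triggered by writing a document like this to the `_replicator` database (the hostname and credentials here are placeholders) - note that there is no field for transforming documents in flight:
{
  "source": "https://username:password@host.cloudant.com/cities",
  "target": "https://username:password@host.cloudant.com/cities2"
}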
There is a neat trick that allows data to be moved from one database to another, while modifying the `_id` field (or any other field for that matter) in the process. To do this we are going to use two command-line tools:
- couchbackup - allows CouchDB/Cloudant data to be backed up and stored as text files on your machine. It also comes with a companion tool, couchrestore, to restore that data back to the database, or to a new empty database (see the installation sketch after this list).
- jq is a JSON processor used to format and modify JSON data structures.
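Both tools are easy to install. A sketch, assuming Node.js/npm are present and, for jq, a package manager such as Homebrew (package names correct at the time of writing):
# couchbackup and couchrestore ship in a single npm package
npm install -g @cloudant/couchbackup
# jq is available from most package managers, e.g. on macOS
brew install jq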
Our process is this:
- export the source data using couchbackup.
- transform the backed-up data using jq.
- restore the transformed data to a new database using couchrestore.
The three actions can be achieved in a single command:
couchbackup --db cities | jq -c '. | map(._id = .country + ":" + ._id)' | couchrestore --db cities2
Let’s break that down:
couchbackup --db cities
simply initiates a backup of the “cities” database. The data is output in batches of several hundred documents with one batch per line e.g.
[{"_id":"52412","country":"US",name:"Boston"},{"_id":"781","country":"UK",name:"Oxford"}]
[{"_id":"152","country":"IN",name:"Malaut"},{"_id":"782","country":"PK",name:"Nārang"}]
The jq line `jq -c '. | map(._id = .country + ":" + ._id)'` means:
- `-c` - compact output (one array per line).
- `.` - process the top-level JSON input, in this case our array of cities.
- `map()` - iterate over every item in the array.
- `._id = .country + ":" + ._id` - set the `_id` field to the document's `country` attribute, followed by a colon, followed by the existing `_id`.
The result is a transformed array of cities:
[{"_id":"US:52412","country":"US",name:"Boston"},{"_id":"UK:781","country":"UK",name:"Oxford"}]
[{"_id":"IN:152","country":"IN",name:"Malaut"},{"_id":"PK:782","country":"PK",name:"Nārang"}]
Piping this data to couchrestore populates the new database.
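Once the restore completes, a quick sanity check is to list a few documents from a single partition of the new database, using the partition-scoped `_all_docs` endpoint:
curl "$URL/cities2/_partition/US/_all_docs?limit=5"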
Other considerations
- the choice of a partition key is very important. Consult the documentation to pick a partition key that has many values, no hot spots and repeats throughout the data set.
- Design documents that contain index definitions will need to be thought through carefully. An index in a partitioned database is itself partitioned by default. Audit your indexes and make sure your most common access patterns are serviced by partitioned queries backed by an index, as these are the cheapest and most performant.
- Global queries can still be used on a partitioned database, but ensure they are backed by a matching `partitioned: false` index (see the sketch after this list).
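As an illustration, a global (unpartitioned) index can be created by setting `partitioned: false` in the body of an `_index` request - a sketch, with the indexed field chosen purely for this example:
# create a global JSON index on the "name" field of the new database
curl -X POST "$URL/cities2/_index" \
     -H "Content-Type: application/json" \
     -d '{"index": {"fields": ["name"]}, "partitioned": false, "type": "json"}'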