Fast Data Transfer

January 17, 2020 | Glynn Bird | Replication | Transfer

Cloudant replication is the database’s method of choice for transferring data from a source to a target database, whether that’s taking a one-off copy of a database or keeping a target continuously in sync with its source.

Replication is easy to set up, can be run as a one-off or continuous operation and can be resumed from its last position. For all of its good points, replication does have some drawbacks, chiefly speed: it faithfully transfers everything, including conflicted revisions, deletions and attachments, whether you need them all or not.

[Photo of a fire hose by Juliana Kozoski on Unsplash]

How fast is replication?

It depends on many factors:

- how many documents you have and how big they are
- how many of them are conflicted
- how many attachments there are and how big they are
- how many deletions you have
- the reads-per-second capacity at the source and the writes-per-second capacity at the target
- network bandwidth and how geographically close the source and target are

As an indicative example, I was able to transfer 500,000 documents (around 700 bytes each) in around 300 seconds. This number is highly dependent on the provisioned capacity of the Cloudant service you have. A free “Lite” account, for example, can only transfer 20 documents per second because it is rate-limited to 20 reads per second.
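To put those numbers in context: 500,000 documents in 300 seconds is roughly 1,700 documents per second, whereas a Lite account limited to 20 reads per second would need 500,000 / 20 = 25,000 seconds, or around seven hours, for the same data set.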

If we’re prepared to make some assumptions and drop some of replication’s features, we can achieve a faster data transfer than that using couchfirehose, a command-line utility that transfers data from a source to a target database without using replication.

How do I install couchfirehose?

Simply run:

> npm install -g couchfirehose

How do I transfer data using couchfirehose?

Let’s set up our Cloudant URL, including authentication credentials, in an environment variable:

> export URL="https://myusername:mypassword@mycloudantservice.cloudant.com"

We can then use the URL variable in our next command:

> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb"

As well as --source and --target there are other parameters we can use to customise the data transfer:

- --fd: whether to filter out deleted documents
- -b: the batch size, the number of documents per bulk write
- -m: the maximum number of writes per second
- -c: the number of writes allowed to be in flight at any one time
- --selector: a Cloudant Query selector used to filter the source data (see below)
- --transform: the path of a JavaScript file used to transform each document (see below)

e.g.

> # filter out deleted documents
> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb" --fd true
> # larger batch size
> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb" -b 2000
> # ensure that only five writes are made per second
> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb" -m 5
> # allow eight writes to be in flight at any one time
> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb" -c 8
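These parameters can be combined in a single command. As a sketch, reusing only the flags shown above:

> # bigger batches, eight concurrent writes, deletions filtered out
> couchfirehose --source "$URL/mysourcedb" --target "$URL/mytargetdb" -b 2000 -c 8 --fd true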

By experimenting with these parameters, it’s possible to see a four-fold increase in speed compared with replication, although it’s worth remembering that this is not replication:

- only winning revisions are transferred, so conflicting revisions are left behind
- deletions can optionally be left behind
- attachments are not transferred
- the transfer cannot be resumed from its last position

Advanced features

We don’t have to transfer all of the source data to the target if we don’t want to. If we supply a Cloudant Query selector as the --selector parameter, the source data will be filtered according to the query, e.g.

> # only transfer completed orders
> couchfirehose --source "$URL/s" --target "$URL/t" --selector '{"status":"completed"}'
> # only transfer last year's data that is not null
> couchfirehose --source "$URL/s" --target "$URL/t" --selector '{"year":2018,"value":{"$ne":null}}'
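Any Cloudant Query operator can be used in the selector. As a further sketch (the price field here is hypothetical), a numeric comparison looks like this:

> # only transfer documents priced above zero
> couchfirehose --source "$URL/s" --target "$URL/t" --selector '{"price":{"$gt":0}}'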

We can also supply a custom JavaScript transform function via the --transform parameter; the function transforms each source document prior to it being written to the target. Create your function in a file:

module.exports = (doc) => {
  // delete an unwanted field
  delete doc.deprecatedAttribute
  // add a new field
  doc.newAttribute = 'new'
  // coerce the type of a field
  if (typeof doc.price === 'string') {
    doc.price = parseFloat(doc.price)
  }
  // modify the _id as we're moving to a partitioned database
  doc._id = doc.userid + ':' + doc._id
  // ignore zero-value transactions
  if (doc.price === 0) {
    // if we return null, the document is not written to the target
    return null
  }
  return doc
}

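Assuming the function is saved in a file called transform.js (the filename is arbitrary), we pass its path as the --transform parameter:

> couchfirehose --source "$URL/s" --target "$URL/t" --transform ./transform.js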

The transform feature can be used to correct data, add new attributes, remove unwanted attributes or, when migrating from a non-partitioned database to a partitioned database, to modify the _id field to contain a partition key.