Filtered Replication

December 13, 2019 | Glynn Bird | Replication Filter

Cloudant’s replication protocol allows data to flow from one Cloudant database to another, either on the same Cloudant service or on an entirely separate service on the other side of the world. The replication protocol is also understood by Apache CouchDB and PouchDB, allowing hybrid and mobile apps to be created with Cloudant acting as the cloud-based source of truth. The changes feed itself is also used to stream data to external services such as couchwarehouse.

When not all the data in a Cloudant database needs to be replicated, a JavaScript filter or Cloudant Query selector can be defined to act as a gatekeeper that decides which documents are replicated. In this post we’ll see how such selectors are set up and look at some common use cases.

Setting up replication

Initiating a replication is as simple as sending a JSON document to your Cloudant service’s _replicator database. The document defines the source database and the target database together with any authentication credentials required.

{
  "_id": "myfirstreplication",
  "source" : "http://<username1>:<password1>@<account1>.cloudant.com/<sourcedb>",
  "target" : "http://<username2>:<password2>@<account2>.cloudant.com/<targetdb>"
}
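
As a rough illustration, the same document can be assembled and wrapped in an HTTP request using Python's standard library. The helper name build_replication_request and the example.com URLs are placeholders invented for this sketch, and authentication is left out entirely; the request is built but not sent.

```python
import json
import urllib.request


def build_replication_request(service_url, doc):
    """Return an unsent POST request that would create `doc` in _replicator."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/_replicator",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


replication_doc = {
    "_id": "myfirstreplication",
    "source": "https://source.example.com/sourcedb",  # placeholder URL
    "target": "https://target.example.com/targetdb",  # placeholder URL
}

request = build_replication_request("https://account.example.com", replication_doc)

# To actually send it (needs credentials and a reachable service):
#   urllib.request.urlopen(request)
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what would be sent before anything reaches the service.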

Instead of running replications as your Cloudant account’s admin user, it’s better to create API Keys which can be used as the username/password for your replication process. It’s always best to run with the minimum permissions needed to do the job.

The replication will start in due course and you can watch the progress by pulling the document (GET /_replicator/myfirstreplication) and examining the extra attributes. The replication can be stopped by deleting the replication document (DELETE /_replicator/myfirstreplication?rev=<rev>).
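
The two calls mentioned above might be sketched in Python as follows. The function names, the example.com address and the revision string "1-abc" are made up for illustration; the requests are only constructed here, not sent, and credentials are omitted.

```python
import urllib.parse
import urllib.request


def _doc_url(service_url, doc_id):
    # Percent-encode the id so it is safe to use as a URL path segment.
    return service_url.rstrip("/") + "/_replicator/" + urllib.parse.quote(doc_id, safe="")


def status_request(service_url, doc_id):
    """GET /_replicator/<doc_id>: fetch the document to inspect its progress."""
    return urllib.request.Request(_doc_url(service_url, doc_id), method="GET")


def stop_request(service_url, doc_id, rev):
    """DELETE /_replicator/<doc_id>?rev=<rev>: stop the replication."""
    query = urllib.parse.urlencode({"rev": rev})
    return urllib.request.Request(
        _doc_url(service_url, doc_id) + "?" + query, method="DELETE"
    )


check = status_request("https://account.example.com", "myfirstreplication")
stop = stop_request("https://account.example.com", "myfirstreplication", "1-abc")
```

The rev passed to stop_request would normally be the _rev value read from the response to the status request.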

Adding a replication selector

To replicate only a subset of documents, add a selector object to the replication document when you create it:

{
  "_id": "myfirstreplication",
  "source" : "http://<username1>:<password1>@<account1>.cloudant.com/<sourcedb>",
  "target" : "http://<username2>:<password2>@<account2>.cloudant.com/<targetdb>",
  "selector": {
    "$or": [
      { "author": "Virginia Woolf" },
      { "year": { "$lt": 1900 } }
    ]
  }
}

In this case only documents that have an ‘author’ attribute of Virginia Woolf or a ‘year’ attribute less than 1900 will be replicated. The selector can contain any valid Cloudant Query syntax and, because it is evaluated against every document in the changes feed, it doesn’t have to be backed by a suitable index.
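
To make the "author OR year" logic concrete, here is a plain-Python mirror of that condition applied to three hand-written sample documents. This is not Cloudant's selector engine, only an illustration of the intended outcome; the function name and sample documents are invented, and the sketch assumes that year, when present, is stored as a number.

```python
def passes_selector(doc):
    """Plain-Python mirror of: author is Virginia Woolf OR year < 1900."""
    is_woolf = doc.get("author") == "Virginia Woolf"
    year = doc.get("year")
    # Assumption: year, when present, is stored as a number.
    is_old = isinstance(year, (int, float)) and year < 1900
    return is_woolf or is_old


# Hand-written sample documents, for illustration only.
sample_docs = [
    {"_id": "book1", "author": "Virginia Woolf", "year": 1925},
    {"_id": "book2", "author": "Charles Dickens", "year": 1853},
    {"_id": "book3", "author": "George Orwell", "year": 1949},
]

replicated_ids = [doc["_id"] for doc in sample_docs if passes_selector(doc)]
print(replicated_ids)  # ['book1', 'book2']
```

book1 passes on the author clause, book2 passes on the year clause, and book3 satisfies neither, so it would stay behind.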

Note: A selector-based replication filter is more efficient than the JavaScript-based filter functions as Cloudant can evaluate whether a document passes a selector without having to spin up a JavaScript process.

Ignoring deletions

Cloudant stores a deletion as an additional revision of an existing document. This means that a tombstone document remains after a document is deleted. To clean up tombstones, a database can be replicated to a new, empty database while ignoring deleted documents. This leaves the target database free of tombstones, so it uses less disk space and has a de-cluttered primary index.

A selector to filter out tombstones is:

"selector": {
  "_deleted": {
    "$exists": false
  }
}

which translates as “only replicate documents where the attribute _deleted is not present”.
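
A small sketch of attaching this selector to a replication document in Python, assuming a helper called with_selector (a name invented here) and placeholder database URLs:

```python
import copy
import json

# The tombstone filter from above, written as a Python dict.
NO_TOMBSTONES = {"_deleted": {"$exists": False}}


def with_selector(replication_doc, selector):
    """Return a copy of replication_doc with the given selector attached."""
    doc = copy.deepcopy(replication_doc)
    doc["selector"] = copy.deepcopy(selector)
    return doc


base_doc = {
    "_id": "tombstone_cleanup",  # hypothetical replication id
    "source": "https://account.example.com/sourcedb",  # placeholder URL
    "target": "https://account.example.com/cleandb",  # placeholder URL
}

cleanup_doc = with_selector(base_doc, NO_TOMBSTONES)
print(json.dumps(cleanup_doc, indent=2))
```

Working on a copy leaves base_doc untouched, so the same base can be reused with other selectors.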

Ignoring design documents

Sometimes the purpose of replication is to take a backup of the data in the database, but the design documents need to be filtered out so that they don’t trigger the building of indexes on the target service.

You could use a selector to filter out design documents:

"selector": {
  "$not": {
    "_id": {
      "$regex": "^_design"
    }
  }
}

which translates as “only replicate documents whose _id field does NOT begin with _design”. A more common approach relies on the fact that writing design documents at the target end requires a user or API key with the _admin role. So one way of ensuring that design documents ARE NOT written is to run the replication as a non-admin user, by creating an API Key with only _reader/_writer roles.
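
The effect of the ^_design pattern can be previewed locally with Python's re module. Cloudant evaluates the regular expression with its own engine, so this is only an approximation, although a simple start-of-string anchor like this one should behave the same way. The sample ids are invented.

```python
import re

# Same pattern as in the selector: matches ids that START with "_design".
DESIGN_DOC_PATTERN = re.compile(r"^_design")


def is_design_doc(doc_id):
    """True when the id begins with _design, i.e. the selector would skip it."""
    return DESIGN_DOC_PATTERN.search(doc_id) is not None


ids = ["_design/orders", "order:1001", "my_design_notes"]
kept = [doc_id for doc_id in ids if not is_design_doc(doc_id)]
print(kept)  # ['order:1001', 'my_design_notes']
```

Note that my_design_notes is kept: the anchor restricts the match to the start of the id, so ordinary documents that merely contain the text _design are unaffected.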

Custom selectors

If you only need to replicate a subset of data to the target, then you can devise any Cloudant Query selector you need, e.g.

"selector": {
  "type": "order",
  "status": "complete",
  "date": { "$gte": "2019-01-01" }
}

You can combine your custom selector with an off-the-shelf selector to filter out deletions:

"selector": {
  "$and": [
    { "_deleted": { "$exists": false } },
    { "type": "order" },
    { "status": "complete" },
    { "date": { "$gte": "2019-01-01" } }
  ]
}

This can be run as a non-admin user to filter out design documents too.
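
If you build selectors programmatically, the combination step might look like the sketch below. The helper all_of is an invented name; it simply wraps each selector as one entry of an $and array. Here the three order conditions stay together in a single object, which Cloudant Query treats as an implicit "and" of its fields.

```python
def all_of(*selectors):
    """Combine selectors so a document must satisfy every one of them."""
    return {"$and": [dict(selector) for selector in selectors]}


no_tombstones = {"_deleted": {"$exists": False}}
completed_orders = {
    "type": "order",
    "status": "complete",
    "date": {"$gte": "2019-01-01"},
}

combined = all_of(no_tombstones, completed_orders)
```

Each entry in the $and array is a complete selector object in its own right, which is what lets reusable filters be mixed with custom ones.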

Measure twice, cut once

Before running a filtered replication on your production data, it’s worth making sure your logic works on a smaller data set.
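
One possible way to do such a dry run, sketched in Python, is to ask a small test database for its changes feed filtered by the same selector and look at which document ids come back. The sketch assumes the CouchDB-style POST <db>/_changes?filter=_selector endpoint is available on your service, which is worth confirming in the documentation. The function names, the testdb URL and the stand-in response are invented for illustration, and the request is built but not sent.

```python
import json
import urllib.request


def selector_changes_request(db_url, selector):
    """Build (but do not send) POST <db>/_changes?filter=_selector."""
    body = json.dumps({"selector": selector}).encode("utf-8")
    return urllib.request.Request(
        url=db_url.rstrip("/") + "/_changes?filter=_selector",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def matching_ids(changes_response):
    """Extract document ids from a parsed _changes response body."""
    return [row["id"] for row in changes_response.get("results", [])]


request = selector_changes_request(
    "https://account.example.com/testdb",  # placeholder test database
    {"type": "order", "status": "complete"},
)

# Hand-written stand-in for a response, used only to show the parsing step.
fake_response = {
    "results": [{"id": "order:1001"}, {"id": "order:1002"}],
    "last_seq": "2-xyz",
}
print(matching_ids(fake_response))  # ['order:1001', 'order:1002']
```

Comparing the returned ids with the documents you expect to replicate gives a quick check of the selector before any data is copied.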

If all is well, you can move on to your live replication.