Best Practice - Cloudant TxE

July 02, 2020 | Kruger & Bird | Best-practice TxE

Cloudant on Transaction Engine is a new Cloudant plan where the database runs on top of a consistent, distributed key-value store. As such, its best practice guidance differs from that given for Cloudant Classic. If you are using Cloudant on any plan other than Cloudant on Transaction Engine, see the separate best practice post for that platform instead.

Note: this best practice post may change over time as features are added to Cloudant on Transaction Engine.

Let’s get started.

Rule 0: Understand the API you are targeting

Understand the API from the ground up before diving into the supported client libraries, which offer an abstraction on top of the API. Start with some training materials that explain how Cloudant works, then try out the service using the web-based dashboard and from the command line.

As Cloudant on Transaction Engine only offers IAM-based authentication, it is recommended you try a command line tool like ccurl which will handle the IAM authentication for you:

export COUCH_URL="https://my.cloudant.service.com"
export IAM_API_KEY="my_iam_api_key"
# ping the Cloudant service
ccurl /
# get a list of databases
ccurl /_all_dbs
# create a database
ccurl -X PUT /mydb
# write a document
ccurl -X POST -d '{"a":1,"b":2,"c":true}' /mydb
# fetch documents back
ccurl '/mydb/_all_docs?include_docs=true'

At an early stage, set up logging on your IBM Cloud platform - this is your window into API calls that you are making against your Cloudant service. It contains a wealth of information, especially when looking to measure latencies.

By understanding the API better, you also gain experience in how Cloudant behaves, especially in terms of performance. If you’re using a client library, you should aim to at least know how to find out which HTTP requests are generated by a given function call, by marrying the function calls you make in your code to the logs that appear in LogDNA.

Rule 1: Documents should group data that mostly change together

When you start to model your data, sooner or later you’ll run into the issue of how your documents should be structured. You’ve gleaned that Cloudant doesn’t enforce any kind of normalisation and that it has no joins, transactions, or stored procedures of the type you’re used to from, say, Postgres, so the temptation can be to cram as much as possible into each document, given that this would also save on HTTP overhead.

This is often a bad idea.

If your model groups information together that doesn’t change together, you’re more likely to suffer from document write failures.

Consider a situation where you have users, each having a set of orders associated with them. One way might be to represent the orders as an array in the user document:

{ // DON'T DO THIS
    "customer_id": 65522389,
    "orders": [ {
      "order_id": 887865,
      "items": [ {
          "item_id": 9982,
          "item_name": "Iron sprocket",
          "cost": 53.0
        }, {
          "item_id": 2932,
          "item_name": "Rubber wedge",
          "cost": 3.0
        }
      ]
    }
  ]
}

To add a new order, I need to fetch the complete document, unmarshal the JSON, add the order, marshal the new JSON, and send it back as an update. If I’m the only one doing so, it may work for a while. If the document is being updated concurrently, or being replicated, I’ll likely see conflicting writes.

Instead, keep orders separate as their own document type, referencing the customer id. Now the model is immutable. To add a new order, I simply create a new order document in the database, which is contention-free.

To be able to retrieve all orders for a given customer, we can employ a view, which we’ll cover later.
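
As a rough sketch of what this could look like, each order becomes its own document carrying a customer_id, and a MapReduce view keyed on that field retrieves a customer’s orders. The type field, the _id values and the emit stub (used here only so the map function can run outside Cloudant) are assumptions for illustration:

```javascript
// Hypothetical order documents: one per order, each referencing its customer.
const orderDocs = [
  { _id: "order:887865", type: "order", customer_id: 65522389, items: [{ item_id: 9982, cost: 53.0 }] },
  { _id: "order:887866", type: "order", customer_id: 65522389, items: [{ item_id: 2932, cost: 3.0 }] },
  { _id: "order:887867", type: "order", customer_id: 11111111, items: [{ item_id: 9982, cost: 53.0 }] }
];

// Stand-in for the emit() that Cloudant provides to map functions.
const viewRows = [];
let currentId = null;
function emit(key, value) {
  viewRows.push({ id: currentId, key, value });
}

// The map function as it might appear in a design document.
function map(doc) {
  if (doc.type === "order") {
    emit(doc.customer_id, null);
  }
}

// Simulate indexing, then a ?key=65522389 query against the view.
for (const doc of orderDocs) {
  currentId = doc._id;
  map(doc);
}
const customerOrders = viewRows
  .filter((row) => row.key === 65522389)
  .map((row) => row.id);
console.log(customerOrders);
```

Adding an order for this customer is then a single document creation; nothing that already exists needs to be fetched or rewritten.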

Avoid constructs that rely on updates to parts of existing documents, where possible. Bad data models are often extremely hard to change once you’re in production.

The pattern above can be solved efficiently using separate documents for each line item in the order.

Rule 2: Keep documents small

Cloudant on Transaction Engine imposes a max document size of 1 MB. This does not mean that a close-to-1-MB document size is a good idea. On the contrary, if you find you are creating documents that exceed single-digit KB in size, you should probably revisit your model. Several things in Cloudant become less performant as documents grow. JSON decoding is costly, for example.

Given Rules 1 and 2, it’s worth stressing that models that rely on updates have an upper volume limit of 1MB – the cut-off for document size. This isn’t what you want.

Rule 3: Put attachments in object storage

If you need to store binary blobs of data, you can store, say, a base64 blob of data in the document body itself.

{
  "name": "Frank",
  "profilePicture": "LS0tCnR5cGU6IGJsb2cKdGl0bGU6IEZhc3QgRGF0YSBUcmFuc2ZlcgpkZXNjcmlwdGlvbjogQ29weWluZyBkYXRhIGZhc3RlciB0aGFuIHJlcGxpY2F0aW9uCmxheW91dDogZGVmYXVsdApkYXRlOiAyMDIwLTAxLTE3IDA2OjAwOjAwIDAwMDAKYXV0aG9yOiBHbHlubiBCaXJkCmF1dGhvckxpbms6IGh0dHBzOi8vZ2x5bm5iaXJkLmNvbS8KaW1hZ2U6IGFzc2V0cy9pAKCg==",
  "date": "2019-02-23",
  "verified": true
}

But this isn’t a great use of Cloudant: there is a 1 MB document size limit and there are cheaper places to store binary files. It is much better to store your unstructured data directly in Cloud Object Storage and store a reference to the object in your Cloudant document.

{
  "name": "Frank",
  "profilePicture": {
    "bucket": "profile_pics",
    "key": "a/653/222/6685.jpg"
  },
  "date": "2019-02-23",
  "verified": true
}

This keeps documents smaller, keeping the metadata in Cloudant and the binary assets in Object Storage.

Rule 4: Understand the trade-offs in emitting data or not into a view

When using MapReduce to create secondary indexes keyed on data you choose, there is also the opportunity to emit something into the value of each index entry. This value is returned to you when querying the resultant materialized view.

A common trick is to emit a sub-set of your document (sometimes called projection) e.g.

function(doc) {
  if (doc.verified) {
    emit(doc.date, { name: doc.name, profilePicture: doc.profilePicture })
  }
}

This view will allow us to query our documents by date or range of dates (because doc.date forms the key of the index) and each return value will contain enough of the document itself for our needs.

The alternative approach is to emit nothing and ask Cloudant to return the entire document at query time by adding ?include_docs=true:

function(doc) {
  if (doc.verified) {
    emit(doc.date, null)
  }
}

The first index is larger as it contains a copy of part of the document body, but fetching keys and values from an index is very fast, much faster than adding ?include_docs=true at query time. As such, the number of read units consumed by the first technique is far lower.

The second index is smaller, so you save on data storage costs, but requires ?include_docs=true at query time which is expensive computationally and in consumed read units.

Rule 5: Never rely on the default behaviour of Cloudant Query’s no-indexing

It’s tempting to rely on Cloudant Query’s ability to query without creating explicit indexes. This is extremely costly in terms of performance, as every lookup is an incremental scan of the database in _id order rather than an indexed lookup. If the number of documents in the database is small this won’t matter, but as the dataset grows this will become a problem for you, and for the cluster as a whole.

Creating indexes and crafting queries that take advantage of them requires some flair. To identify which index is being used by a particular query, send a POST to the _explain endpoint for the database, with the query as data.

Add execution_stats to a Cloudant Query JSON object to find out how expensive it was to execute i.e. how many documents had to be scanned to yield the search results returned.
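
To make this concrete, here is a minimal sketch of a query body with execution_stats switched on, plus a helper that turns the returned statistics into a "documents scanned per result" figure. The selector, the sample response and its numbers are invented for illustration. The field names total_docs_examined and results_returned are those documented for CouchDB’s _find endpoint, and it is an assumption that Cloudant on Transaction Engine reports them in the same way:

```javascript
// A Cloudant Query body; "execution_stats": true asks for statistics alongside results.
const query = {
  selector: { verified: true, date: { $gte: "2019-01-01" } },
  execution_stats: true
};

// Hypothetical response, modelled on CouchDB's documented execution_stats fields.
const sampleResponse = {
  docs: [{ _id: "a" }, { _id: "b" }],
  execution_stats: { total_docs_examined: 5000, results_returned: 2 }
};

// Documents scanned per result returned. A value close to 1 suggests the query
// is well served by an index; a large value suggests a costly scan.
function docsExaminedPerResult(response) {
  const stats = response.execution_stats;
  if (!stats) {
    return null; // statistics were not requested or not returned
  }
  if (stats.results_returned === 0) {
    return Infinity;
  }
  return stats.total_docs_examined / stats.results_returned;
}

console.log(JSON.stringify(query));
console.log(docsExaminedPerResult(sampleResponse)); // 2500
```

In this made-up case, 5,000 documents were scanned to return two results, which is the kind of ratio that would justify creating an index.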

Rule 6: Deleting documents won’t delete them

Deleting a document from a Cloudant database doesn’t actually purge it. Deletion is implemented by writing a new revision of the document under deletion, with an added field _deleted: true. This special revision is called a tombstone. Tombstones still take up space and are also passed around by the replicator.

Models that rely on frequent deletions of documents are not suitable for Cloudant.

Rule 7: Avoid concurrent writes to the same document

Cloudant on Transaction Engine is conflict-free for in-region writes, but that doesn’t mean that all write operations will succeed, or that your writes will never get a 409 Conflict response. Let’s say two processes are trying to mutate a document in different ways at the same time. One process will succeed and get an HTTP 20x response, the other will get a 409 response because it was beaten to it by the other.

Cloudant will prevent the generation of conflicts in this case, but it’s probably not a good idea to use Cloudant to store data that is updated over and over in a short time window - at the very least the failing writer process would have to retry its request later (with a newer revision token).

It is more expensive in the longer run to mutate existing documents than to create new ones, as Cloudant will always need to keep the document tree structure around, even if internal nodes in the tree will be stripped of their payloads. If you find that you create long revision trees, your replication performance will suffer. Moreover, if your update frequency goes above, say, once or twice every few seconds, you’re more likely to produce update conflicts.

The 409 response can be avoided altogether by adopting a write-only pattern, where documents are only ever added to databases.
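
A minimal sketch of this idea, assuming a hypothetical stock-level use case: rather than updating a single stock document in place, each change is written as its own small document and the current level is derived by summing them. In Cloudant, a view emitting (product_id, change) with the built-in _sum reducer could do the summing; the local function below only mimics that, and all field names are invented:

```javascript
// Each stock change is a new, never-updated document (fields are hypothetical).
function stockEvent(productId, change, timestamp) {
  return { type: "stock_event", product_id: productId, change, ts: timestamp };
}

const stockEvents = [
  stockEvent("sprocket", 100, "2020-07-01T09:00:00Z"),
  stockEvent("sprocket", -3, "2020-07-01T10:15:00Z"),
  stockEvent("sprocket", -1, "2020-07-01T10:16:00Z"),
  stockEvent("wedge", 40, "2020-07-01T11:00:00Z")
];

// Local equivalent of querying a view that emits (product_id, change)
// and reduces with _sum.
function stockLevel(docs, productId) {
  return docs
    .filter((d) => d.type === "stock_event" && d.product_id === productId)
    .reduce((total, d) => total + d.change, 0);
}

console.log(stockLevel(stockEvents, "sprocket")); // 96
```

Two processes recording sales at the same moment each create their own document, so neither write can be rejected because of the other.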

Rule 8: Use the bulk API

Cloudant has nice API endpoints for bulk loading (and reading) many documents at once. This can be much more efficient and cheaper per document than reading/writing many documents one at a time. The write endpoint is:

POST /${database}/_bulk_docs

Its main purpose is to be a central part in the replicator algorithm, but it’s available for your use too, and it’s pretty awesome.

With _bulk_docs, in addition to creating new docs you can also update and delete. Some client libraries, including PouchDB, implement create, update and delete even for single documents this way for fewer code paths.

Here is an example that creates one new document, updates a second, and deletes a third (updating or deleting an existing document requires its current _rev):

curl -X POST 'https://ACCT.cloudant.com/DB/_bulk_docs' \
     -H "Content-Type: application/json" \
     -d '{"docs":[{"baz":"boo"},
         {"_id":"463bd...","_rev":"...","foo":"bar"},
         {"_id":"ae52d...","_rev":"1-8147...","_deleted": true}]}'

You can also fetch many documents at once by issuing a POST to _all_docs. (There is also a newish endpoint called _bulk_get, but this is probably not what you want; it’s there for a specific internal purpose.)

To fetch a fixed set of docs using _all_docs, POST with a keys body:

curl -XPOST 'https://ACCT.cloudant.com/DB/_all_docs' \
     -H "Content-Type: application/json" \
     -d '{"keys":["ab234....","87addef...","76ccad..."]}'

In terms of pricing, the bulk APIs are much cheaper per document: reading data in bulk costs nearly half as much as firing off an individual fetch for each document.

Note the service limits on request size, the number of items in a bulk request and the number of documents returned by a bulk read.
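
One way to stay within such limits is to split a large set of documents into batches before posting them. The sketch below builds the JSON bodies for successive _bulk_docs requests; the batch size of 1,000 is an assumed figure rather than a documented limit, so check the current service limits before choosing one:

```javascript
// Split an array of documents into batches of at most batchSize.
function chunk(docs, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    batches.push(docs.slice(i, i + batchSize));
  }
  return batches;
}

// Build the JSON bodies for successive POSTs to /DB/_bulk_docs.
function bulkDocsBodies(docs, batchSize) {
  return chunk(docs, batchSize).map((batch) => JSON.stringify({ docs: batch }));
}

// 2,500 hypothetical documents, sent 1,000 at a time.
const docsToWrite = Array.from({ length: 2500 }, (_, i) => ({ n: i }));
const bulkBodies = bulkDocsBodies(docsToWrite, 1000);
console.log(bulkBodies.length); // 3
```

Each body would then be sent as its own POST, which also gives a natural point at which to pause if the service starts rejecting requests.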

Rule 9: Design document (ddoc) management requires some flair

As your data set grows, and your number of views goes up, sooner or later you will want to ponder how you organise your views across ddocs. A single ddoc can be used to form a so-called view group: a set of views that belong together by some metric that makes sense for your use case. If your views are pretty static, that makes your view query URLs semantically similar for related queries. It’s also more performant at index time because the index loads the document once and generates multiple indexes from it.

Ddocs themselves are read and written using the same read/write endpoints as any other document. This means that you can create, inspect, modify and delete ddocs from within your application. However, even small changes to ddocs can have big effects on your database. When you update a ddoc, all views in it become unavailable until indexing is complete. This can be problematic in production. To avoid it you have to do a crazy ddoc-swapping dance (see couchmigrate).

In most cases, this is probably not what you want to have to deal with. As you start out, it is most likely more convenient to have a one-view-per-ddoc policy.

Also, in case it isn’t obvious, views are code and should be subject to the same processes you use in terms of source code version management for the rest of your application code. How to achieve this may not be immediately obvious. You could version the JS snippets and then cut & paste the code into the Cloudant dashboard to deploy whenever there is a change, and yes, we all resort to this from time to time.

There are better ways to do this, and this is one reason to use some of the tools surrounding the couchapp concept. A couchapp is a self-contained CouchDB web application that nowadays doesn’t see much use. Several couchapp tools exist that are there to make the deployment of a couchapp — including its views, crucially — easier.

Using a couchapp tool means that you can automate deployment of views as needed, even when not using the couchapp concept itself.

Rule 10: Cloudant is rate limited — let this inform your code

Cloudant-the-service (unlike vanilla CouchDB) is sold on a “reserved throughput capacity” model. That means that you pay for the right to use up to a certain throughput, rather than the throughput you actually end up consuming. This takes a while to sink in. One somewhat flaky comparison might be that of a cell phone contract where you pay for a set number of minutes regardless of whether you end up using them or not.

The cell phone contract comparison doesn’t really capture the whole situation, though. There is no constraint on the sum of requests you can make to Cloudant in a month; the constraint is on how fast you make requests.

It’s really a promise that you make to Cloudant, not one that Cloudant makes to you: you promise to not make more requests per second than what you said you would up front. A top speed limit, if you like. If you transgress, Cloudant will fail your requests with a status of 429: Too Many Requests. It’s your responsibility to look out for this, and deal with it appropriately, which can be difficult when you’ve got multiple app servers. How can they coordinate to ensure that they collectively stay below the requests-per-second limit?

Cloudant’s official client libraries have some built-in provision for this that can be enabled (note: this is switched off by default to force you to think about it), following a “back-off & retry” strategy. However, if you rely on this facility alone you will eventually be disappointed. Back-off & retry only helps in cases of temporary transgression, not a persistent butting up against your provisioned throughput capacity limits.

Your business logic must be able to handle this condition. Another way to look at it is that you get the allocation you pay for. If that allocation isn’t sufficient, the only solution is to pay for a higher allocation.
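
For the temporary-transgression case, a "back-off & retry" wrapper might look roughly like the sketch below. This is not the official client libraries’ implementation: doRequest is a hypothetical caller-supplied function, and the delay figures are arbitrary. Note that it gives up and throws after a fixed number of attempts, so persistent rate limiting still has to be handled by the calling code:

```javascript
// Delay before retry number `attempt` (0-based): exponential growth with a cap.
function backoffDelayMs(attempt, baseMs = 100, maxMs = 5000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// doRequest is a caller-supplied async function resolving to an object with a
// numeric `status` property (a hypothetical shape, not a Cloudant library API).
async function withRetryOn429(doRequest, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await doRequest();
    if (response.status !== 429) {
      return response;
    }
    if (attempt < maxAttempts - 1) {
      await sleep(backoffDelayMs(attempt));
    }
  }
  // Persistent 429s: surface the condition so business logic can handle it.
  throw new Error("Still rate limited after " + maxAttempts + " attempts");
}

// Delays for the first seven retries, in milliseconds.
console.log([0, 1, 2, 3, 4, 5, 6].map((n) => backoffDelayMs(n)).join(", "));
// 100, 200, 400, 800, 1600, 3200, 5000
```

With several app servers retrying at once, adding a random jitter to each delay would help stop them retrying in lockstep; that is left out here to keep the sketch short.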

Provisioned throughput capacity is split into two different buckets: reads and writes. Pretty much every Cloudant operation will consume one of these units to “open a transaction” with the underlying data store, so fetching a single document by its id costs 2 read units: one to open the transaction, the other to read the document.

To this end, we heavily incentivise use of the bulk API calls which only open one transaction and return or operate on many documents. It’s as cheap to fetch several dozen key/value pairs from a MapReduce index as it is to fetch one document on its own.
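
The arithmetic behind this can be sketched as follows. The cost model is an assumption inferred from the description above (one read unit to open a transaction, plus one per document read) and is not a statement of how Cloudant actually bills; check the pricing documentation for the real figures:

```javascript
// Assumed cost model, inferred from the text: one read unit to open a
// transaction plus one per document read. Real billing may differ.
function readUnitsIndividually(docCount) {
  return docCount * 2; // every fetch opens its own transaction
}

function readUnitsInBulk(docCount) {
  return 1 + docCount; // one transaction shared by all documents
}

console.log(readUnitsIndividually(100)); // 200
console.log(readUnitsInBulk(100)); // 101
```

Under this assumed model, fetching 100 documents in one bulk request consumes roughly half the read units of 100 separate fetches.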

Rule 11: Use timeboxed databases for ever-growing data sets

It’s generally not a good idea to have an ever-growing database in Cloudant. Very large databases can be difficult to back up and suffer from ever-increasing index build times.

One way of mitigating this problem is to have several smaller databases instead, with a very common pattern being timeboxed databases: a large data set is split into smaller databases, each representing a time window e.g. a month.

New data is written to this month’s database and queries for historical data can be directed to previous months’ databases. When a month’s data is no longer of interest, it can be archived to Object Storage, the monthly Cloudant database deleted and the disk space recovered.
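
A small helper can decide which monthly database a given write or query belongs to. The prefix and naming scheme below are hypothetical; UTC is used so that app servers in different time zones agree on where the month boundary falls:

```javascript
// Name of the monthly database for a given date, e.g. "events_2020_07".
// The prefix and naming scheme are assumptions for illustration.
function monthlyDbName(prefix, date) {
  const year = date.getUTCFullYear();
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return `${prefix}_${year}_${month}`;
}

console.log(monthlyDbName("events", new Date("2020-07-02T12:00:00Z"))); // events_2020_07
console.log(monthlyDbName("events", new Date("2019-12-31T23:59:59Z"))); // events_2019_12
```

Zero-padding the month keeps the database names in chronological order when sorted alphabetically.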

Rule 12: Logging helps you see what’s going on

Cloudant’s logs, which record each API call made, what was requested and how long it took to respond, can be automatically spooled to LogDNA for analysis and reporting for IBM Cloud-based services. This data is useful for keeping an eye on request volumes, performance and whether your application is exceeding your Cloudant service’s provisioned capacity.

The logging service is easy to set up and free to get started. Paid-for plans allow data to be parsed, retained and archived to Object Storage. Slices and aggregations of your data can be built up into visual dashboards to give you an at-a-glance view of your Cloudant traffic.

Rule 13: Compress your HTTP traffic

Cloudant will compress its JSON responses to you if you supply an HTTP header in the request indicating that your code can handle data in this format:

Request:
> GET /cars/_all_docs?limit=5&include_docs=true HTTP/2
> Host: myhost.cloudant.com
> Accept: */*
> Accept-Encoding: deflate, gzip

Response:

< HTTP/2 200 
< content-type: application/json
< content-encoding: gzip

Compressed content occupies a fraction of the size of the uncompressed equivalent, meaning that it takes a shorter time to transport the data from Cloudant’s servers to your application.

Note that you may also choose to compress HTTP request bodies by using the Content-Encoding header. This helps lower data transfer times when writing documents to Cloudant.
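
As an illustration of compressing a request body, the sketch below gzips a JSON payload with Node.js’s built-in zlib module and checks that it round-trips. The payload is invented, and actually sending it (with a Content-Encoding: gzip header) is left to whichever HTTP client you use:

```javascript
const zlib = require("zlib");

// A repetitive, hypothetical _bulk_docs payload.
const payload = JSON.stringify({
  docs: Array.from({ length: 500 }, (_, i) => ({ type: "reading", sensor: "s1", value: i }))
});

// gzip the body; it would be sent with the header "Content-Encoding: gzip".
const compressedBody = zlib.gzipSync(payload);

// Decompress again to confirm nothing was lost.
const restored = zlib.gunzipSync(compressedBody).toString("utf8");

console.log(Buffer.byteLength(payload), compressedBody.length);
```

JSON with repeated attribute names, as in most bulk writes, tends to compress particularly well.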

Rule 14: Treat the primary index as a free search index

A default Cloudant document _id is a 32 character string, encoding 128 bits of random data. The _id attribute is used to construct the database’s primary index, which is used by Cloudant to retrieve documents by _id or by ranges of keys when the user supplies a startkey/endkey pair. We can leverage this fact to pack our data into the _id field and use it as a “free” index which we can query for ranges of values.

Here are some examples: