Time-series Data Storage

April 08, 2019 | Glynn Bird | Time-series Partitioned Databases

Time-series data is simply the recording of data points in time order such as:

In this post we’ll examine options for storing time-series data in Cloudant.

clock

Photo by Sonja Langford on Unsplash

The ever-growing data set 🔗

It’s tempting to create a single, ever-growing Cloudant database to store your time-series data but it’s worth taking time to think about how much data you’ll be storing and to estimate the data size growth.

Let’s say we’re storing IoT data in documents like this:

{
  "_id": "e30bff4b286dd88c1cc178b7dbacaded",
  "_rev": "1-8a10be69dc1e81531fee2c9ca2688508",
  "type": "reading",
  "device": "GE5521965B",
  "reading": 25.3,
  "units": "celcius",
  "timestamp": "2019-03-29T09:28:31.229Z"
}

At approximately 200 bytes per reading, each sensor would generate 6GB of raw data per year. Bear in mind that Cloudant stores data in triplicate and additional storage will be required for any secondary indexes you define. With only fifty or so sensors, your application could easily generate a terabyte of data per year.

In a lot of applications the most recent data, say the last year’s, is of the most interest and older data can be archived to slower and cheaper storage methods such as IBM Cloud Object Storage. Keeping your time-series store scalable and cost effective can become striking a balance between fast, primary data in a database and slower, cheaper data access in archive storage.

Putting all of our data in a single, ever-growing Cloudant database is impractical because:

A common work-around is to adopt the “timeboxed database” pattern, as outlined in the next section.

The timeboxed database pattern 🔗

This pattern is pretty straightforward.

  1. Create a database per time-period (e.g month).
  2. Always write new data into this month’s database.
  3. Ahead of time, create a new empty database ready for next month’s data. Don’t forget to create any Design Documents needed to service queries in the new database.

ts1

This month’s data can be queried from the current database. Historical data can be queried by directing requests to the older databases that cover the time-range you are interested in.

ts2

To remove unwanted data, simply delete older monthly databases that are no longer needed. Deleting a whole database recovers all the data occupied by the database (JSON documents and associated indexes) cleanly and quickly.

ts3

This write only approach combined with the deletion of older data by removing databases plays to Cloudant strengths.

Storing time/date in a time-series database 🔗

The choice of time and date format in your JSON document is important because JSON has no native date/time data type. This is explained in more detail here but the gist is:

Time-sortable ids 🔗

If we’re using 32 characters to store a “random” _id field, then it would be handy if it sorted in chronological order in a time-series database. This technique is outlined here.

In brief, the front of the _id is a time-sortable string and rest is random data. The _id field then sorts in approximate date/time order (to a precision of one second).

An example time-sortable _id value

Using timeboxed databases that are also partitioned 🔗

Depending on your use-case, it may also be useful to make your timeboxed databases partitioned too, that is generated with the ?partitioned=true flag to enable Cloudant’s new Partitioned Databases feature.

The choice of partition key is beyond the scope of this post, but in a IoT application a device id may be a good choice, as long as there are many devices generating data at a similar rate.

Here’s an example document that uses the full gamut of Cloudant time-series tricks: monthly databases, time-ordered ids and partitioned databases:

{
  "_id": "GE5521965B:001h9p4Z0XeJzT0372TQ0hk2q83CLdqg",
  "type": "reading",
  "device": "GE5521965B",
  "reading": 25.3,
  "units": "celcius",
  "timestamp": "2019-03-29T10:49:30.980Z"
}

Queries for a single timebox and a single device (e.g. “get me the latest 50 readings for device X”) can be directed at a single partition and will use only a fraction of the resources of a global query.

// Fetch the 50 latest readings for device GE5521965B
// As the data is partitioned by device id, we can direct the query to a single partition.
// The data within the partition is in date/time order so the last 50 records are the
// ones we are interested in. descending=true reverses the order.
curl "$URL/logs_2019_03/_partition/GE5521965B/_all_docs?limit=50&descending=true&include_docs=true"

Aggregating time-series data with MapReduce 🔗

A very common use-case for a time-series database is to produce aggregations of the recorded data grouped by time. Cloudant’s MapReduce allows the generation of materialized views of data by grouped, complex keys.

Cloudant supports the following reducers:

Before we can reduce any data we need to generate the keys and values in the MapReduce index using a JavaScript function:

function(doc) {

  // convert timestamp to JavaScript Date object
  var d = new Date(doc.timestamp)
  
  // calculate time components
  var year = d.getFullYear()
  var month = d.getMonth() + 1
  var day = d.getDate()
  var hour = d.getHours()
  var minute = d.getMinutes()
  
  emit([year,month,day,hour,minute], doc.reading);
}

This JavaScript function is encoded into a Design Document and saved in the database:

{
  "_id": "_design/aggregate",
  "views": {
    "byYMDHM": {
      "reduce": "_stats",
      "map": "function(doc) {\n\n  // convert timestamp to JavaScript Date object\n  var d = new Date(doc.timestamp)\n  \n  // calculate time components\n  var year = d.getFullYear()\n  var month = d.getMonth() + 1\n  var day = d.getDate()\n  var hour = d.getHours()\n  var minute = d.getMinutes()\n  \n  emit([year,month,day,hour,minute], doc.reading);\n}"
    }
  },
  "language": "javascript",
  "options": {
    "partitioned": false
  }
}

Once the Design Document is saved in the database, the view builds asynchronously - for a small database it will appear that the view is ready “immediately”, but for larger datasets it may take some time before the view is ready to query.

The view is queried using the design document name and view name used earlier:

curl "$URL/logs_2019_03/_design/aggregate/_view/byYMDHM?group_level=3"

Note that when using monthly timeboxed databases, you may not need to emit the year and month values as all documents would emit the same first two keys, although it does help when combining aggregated results from multiple databases.

Out-of-the box time-series solutions 🔗

If you don’t want to build your own time-series solution, then the Watson Internet Of Things platform can run the whole thing as-a-service including device registration, data storage, analytics, monitoring and archival.

Further reading 🔗