Partitioned Databases - Data Design

March 05, 2019 | Glynn Bird | Partitioned Databases Indexing Query

This is the second part of a series of posts on Partitioned Databases in Cloudant. Part One, Part Three and Part Four may also be of interest.

Modelling data with a JSON document store is very different from modelling data in a relational database system. Generations of computer scientists have been taught how to normalize data into tables, that is organising data into their own collections so that information is not repeated and relationships between collections are modelled with foreign keys.

Organising data in a “NoSQL” database like Cloudant makes “joins” prohibitively expensive and as a result the data design may involve the repetition of data. As with any data modelling exercise, there should be careful consideration of how the data is to be consumed - it is these “access patterns” that, together with database’s best practise documentation, will influence how data is represented in the JSON documents.

Let’s imagine we are building an e-commerce store with the Cloudant service. To do so we are going to use three Cloudant databases to store the following objects:

We are going to design our data with the following goals:

cake

Photo by Toa Heftiba on Unsplash

The products database 🔗

The products database contains one document per sellable product. Products are categorised into a taxonomy of categories e.g. “Home > Kitchen > Small Appliances”, a hierarchy which is surfaced in the website as a clickable menu of category names.

Each product has a simple list of attributes, some of which are searchable (e.g. keywords) others (e.g. image) are used to render the front-end web page or mobile app.

The document’s _id contains:

A sample document looks like this:

{
  "_id": "Home#Kitchen#Small Appliances:1000042",
  "type": "product",
  "taxonomy": ["Home","Kitchen","Small Appliances"],
  "keywords": ["Salter","Scales","Weight","Digital","Kitchen"],
  "productid": "1000042",
  "brand": "Salter",
  "name": "Digital Kitchen Scales",
  "description": "Slim Colourful Design Electronic Cooking Appliance for Home / Kitchen, Weigh up to 5kg + Aquatronic for Liquids ml + fl. oz. 15Yr Guarantee - Green",
  "colours": ["red","green","black","blue"],
  "price": 14.99,
  "delivery"; 0,
  "image": "assets/img/0gmsnghhew.jpg"
}

Users database 🔗

The users database stores details about the registered users of our store. We store:

The document’s _id fields are organised so that for a known userid (i.e. if a user is logged in and we know their userid), we can fetch everything we need to know about that user in one Cloudant Query because the data resides in the same partition. If we need to store other data about the user in the future e.g. payment methods, user preferences etc, further documents can be added following the same pattern.

Here’s some example documents:

{
  "_id": "user19952622:auth",
  "type": "user",
  "userid": "user19952622",
  "name": "Bob Smith",
  "email": "bob.smith@aol.com",
  "password": "1f6b5d0e151388786d3820cded9408e2",
  "salt": "43614d9b1dec23da34a5b6f4eb71fb59",
  "active": true,
  "email_verified": true
}
{
  "_id": "user19952622:delivery1",
  "type": "userdelivery",
  "userid": "user19952622",
  "name": "home",
  "address": "19 Front Street, Darlington, DL5 1TY",
  "default": true
}
{
  "_id": "user19952622:delivery2",
  "type": "userdelivery",
  "userid": "user19952622",
  "name": "work",
  "address": "22 Central Tower, Newcastle, NE1 4JD",
  "default": false
}

Orders database 🔗

An order consists of a number of documents written to the database when a shopping basket of items is paid for:

As all of the documents reside in the same partition and can be fetched with a single query directed to the order’s partition. In most cases, data is only written to Cloudant - the same document is not updated over and over as the order progresses.

{
  "_id": "order555:order",
  "type": "order",
  "user": "Bob Smith",
  "orderid": "order555",
  "userid": "user19952622",
  "basket": ["Salter - Digital Kitchen Scales", "Kenwood - Stand Mixer"],
  "total": "214.98",
  "deliveryAddress": "19 Front Street, Darlington, DL5 1TY",
  "date": "2019-01-28T10:44:22.000Z"
}
{
  "_id": "order555:item1",
  "type": "orderlineitem",
  "orderid": "order555",
  "userid": "user19952622",
  "productid": "1000042",
  "name": "Salter - Digital Kitchen Scales",
  "quantity": 1,
  "unitPrice": 14.99,
  "delivery": 0
}
{
  "_id": "order555:item2",
  "type": "orderlineitem",
  "orderid": "order555",
  "userid": "user19952622",
  "productid": "88752",
  "name": "Kenwood - Stand Mixer",
  "quantity": 1,
  "unitPrice": 199.99,
  "delivery": 0
}
{
  "_id": "order555:payment",
  "type": "orderpayment",
  "orderid": "order555",
  "userid": "user19952622",
  "paid": "true",
  "provider": "PayPal",
  "provider_ref": "PayPal161619885998772",
  "date": "2019-01-28T10:45:27.000Z",
  "total": "214.98"
}
{
  "_id": "order555:dispatch",
  "type": "orderdispatch",
  "orderid": "order555",
  "userid": "user19952622",
  "dispatched": "true",
  "date": "2019-01-28T16:02:00.000Z",
  "courier": "UPS",
  "courierid": "15125425151261289"
}
{
  "_id": "order555:delivery",
  "type": "orderdelivery",
  "orderid": "order555",
  "userid": "user19952622",
  "delivered": "true",
  "courier": "UPS",
  "courierid": "15125425151261289"
}

Querying the partitions 🔗

The database already indexes each document’s _id field and with partitioned databases, documents belonging to the same partition reside on the same shard, making querying data in a single partition very efficient. We’ve made use of partitions to keep:

Here’s how we can use Cloudant Query to fetch documents from these partitions.

Fetch products belonging a category 🔗

To fetch the first one hundred products from the Home#Kitchen#Small Appliances category, we can simply send a blank selector to the partition’s _find endpoint:

POST /products/_partition/Home%23Kitchen%23Small%20Appliances/_find
{
  "selector": {},
  "limit": 100
}

Alternatively, the _all_docs endpoint can be used to fetch all the data from a partition

GET /products/_partition/Home%23Kitchen%23Small%20Appliances/_all_docs

Searching within a known category 🔗

If we know the product category (perhaps the user has navigated to the “Home#Kitchen#Small Appliances” page), we can search for products within that partition by first defining a partitioned index and then querying it. A query aimed at a single partition and serviced by a pre-defined index constitutes best practice for a Cloudant database.

We can define a partial Cloudant Search index with a JavaScript function

function(doc) {
  if (doc.type == 'product') {
    var words = doc.taxonomy.join(' ') + 
                doc.keywords.join(' ') + 
                doc.brand + ' ' +
                doc.name + ' ' +
                doc.description + ' ' +
                doc.colours.join(' ');
    index('default', words, { store: false, index: true });
  }
}

Queries can be directed to a single partition with:

curl $URL/products/_partition/Home%23Kitchen%23Small%20Appliances/_design/mydesigndoc/_search/mysearchindex?q=salter+scales+red

Note that an index defined on a partitioned database is itself partitioned by default, although this behaviour can be overridden by supplying a partitioned: false flag at query-time to create a global index.

Fetch order data 🔗

Similarly, all of an order’s details can be fetched by pulling all of the data from the order’s partition:

GET /orders/_partition/order555/_all_docs

If we only need the line items from the order we can be more specific:

POST /products/_partition/order555/_find
{
  "selector": {
    "type": "orderlineitem"
  },
  "limit": 100
}

If we only need the top-level meta data we can be even more selective: in fact, we don’t even need to perform a query - we can simply fetch the document by its _id:

GET /orders/order555:order

Fetch user data 🔗

The same technique can be used for the users database:

GET /users/_partition/user19952622/_all_docs

or we could fetch a user’s default postal address with this query:

POST /users/_partition/user19952622/_find
{
  "selector": {
    "type": "userdelivery",
    "default": "true"
  },
  "fields": ["address"]
}

Querying the whole data set - Indexing 🔗

In addition to the default primary index, we can define secondary indicies to instruct the database to create additional data structures, ordering the documents by different attributes. The secondary indicies can service additional access patterns we need for our application. Here’s some examples:

Let’s dive into the detail of how we would achieve each of these use-cases using different features of Cloudant.

In order to allow the user to search products with a string of words e.g. “Salter Scales Red” or “Digital Aquatronic”, we need to index each document’s searchable words and employ a “free text” search engine. Cloudant has a free text search engine built in in the form of its Cloudant Search API. An index is defined with a JavaScript index definition function inside a design document. The function calls index to instruct the database to index selected data items. Here’s an example:

function(doc) {
  if (doc.type == 'product') {
    var words = doc.taxonomy.join(' ') + 
                doc.keywords.join(' ') + 
                doc.brand + ' ' +
                doc.name + ' ' +
                doc.description + ' ' +
                doc.colours.join(' ');
    index('default', words, { store: false, index: true });
  }
}

In this case we are concatenating all the strings we want to be searchable and indexing them as the default text index, which provides a general-purpose search facility. We can query the index with a simple HTTP API call:

curl $URL/_design/mydesigndoc/_search/mysearchindex?q=salter+scales+red

With a little extra work we could:

Note that in order to create an index that spans all the partitions, we need to supply partitioned: false in the design document that defines the index. See documentation

Indexing - Orders by Customer 🔗

In our website’s dashboard, we need to display a single user’s order history. To service this access pattern, the orders database needs a global secondary index on the userid and date fields.

To create an index that only works on documents where type="order" and is indexed by userid and date we POST the following JSON to the /orders/_index endpoint:

{
   "index": {
     "partial_filter_selector": {
       "type": "order"
     },
     "fields": [ "userid", "date" ]
   },
   "ddoc": "orders-by-customer-index",
   "type": "json",
   "partitioned": false
}

We can then query the index by posting JSON to the /orders/_find endpoint:

{
   "selector": {
      "type": "order",
      "userid": "user19952622"
   },
   "use_index": "orders-by-customer-index"
}

Indexing - Orders by Time 🔗

In order to get all orders ordered by time, we need to create another global Cloudant Query index, again only including the core order documents and ordering by the date field:

{
   "index": {
     "partial_filter_selector": {
       "type": "order"
     },
     "fields": ["date" ]
   },
   "ddoc": "orders-by-date",
   "type": "json",
   "partitioned": false
}

We can then query the index by posting JSON to the /orders/_find endpoint, in this case to find orders occuring after in January 2019:

{
   "selector": {
      "type": "order",
      "date": {
         "$gte": "2019-01-01",
         "$lt": "2019-01-02"
      }
   },
   "use_index": "orders-by-date"
}

Indexing - Sales Report 🔗

Performing aggregations for reporting purposes requires the use of Cloudant’s MapReduce feature. A JavaScript function, embedded in a design document, is called by Cloudant for every document in the database. The function emits the keys/value pairs that define the ordering and grouping of the data and which fields are to be aggregated. Here’s an example function that emits orders by year/month/day:

function(doc) {
  if (doc.type == "order") {
    // turn the date string into a Date object
    var date = new Date(doc.date);
    // extract the date parts
    var year = date.getFullYear();
    var month = date.getMonth() + 1; // months go 0-11
    var day = date.getDate();
    // emit the key and value
    emit([year,month,day], doc.total);
  }
}

By choosing the _sum reducer, we instruct Cloudant to totalise the emitted value (doc.total). The MapReduce query engine allows data to be grouped by year/month/day, year/month, year or to group all the data to produce a grand total. In addition, the data can be filtered between startkey and endkey key values and the reducer can be switched off at query-time to just extract data between two dates.

Note that in order to create an index that spans all the partitions, we need to supply partitioned: false in the design document that defines the index. See documentation

Indexing - Authentication 🔗

When a user is logging in, we can’t do partitioned query on our users database because we don’t yet know the user’s userid. We have to create a secondary index on the user’s email address and select a user record by the email field. We can reduce the index size by using a partial_filter_selector to only include the main user records that are active and have been verified:

{
   "index": {
     "partial_filter_selector": {
       "type": "user",
       "active": true,
       "email_verified": true
     },
     "fields": ["email" ]
   },
   "ddoc": "users-by-email",
   "type": "json",
   "partitioned": false
}

To find a document matching a supplied email address we can use the following query, only asking for the fields we need:

{
   "selector": {
      "type": "user",
      "active": true,
      "email_verified": true,
      "email": "joe@aol.com"
   },
   "use_index": "users-by-email",
   fields: ["userid", "password", "salt"]
}

Summary 🔗

The Cloudant partial databases feature provides a step change in query performance for data designs that use the _id field to store a partition key and a document key. Careful data design can ensure that data that belongs together, and that you application expects to query in isolation, is stored in its own partition. Queries directed at a single partition only use fraction of the computing resource of a “whole database” query, resulting in faster performance and lower per-query pricing.

Other use-cases and access patterns can be serviced by secondary indexes that span all the partitions.

Further reading 🔗