In part 7 of this series, we saw a warning in the search results:

“no matching index found, create an index to optimize query time”

This is Cloudant’s polite way of saying that your query is expensive and the database is having to walk through the whole data set to calculate the answer. In small databases this is not a problem but in a production system, with the data growing all the time, an index is essential.

indexing

What is an index?

An database index is just like an index in a book or a biblical concordance. It is a sorted data structure that allows quick access to a portion of the data.

Our data looks like this:

{
  "_id": "0d960fedfb62499abe35557f2d2c7c5e",
  "name": "Ferris Bueller", 
  "actor": "Matthew Broderick", 
  "dob": "1962-03-21"
}

Cloudant automatically creates an index on the _id field so that it can retrieve data by _id. If we are going to be making lots of queries on the dob field, then it might make sense to instruct Cloudant to create a secondary index on that field.

This is acheived by writing to the POST /db/_index endpoint:

curl -X POST \
     -H 'Content-type:application/json' \
     -d'{"index":{"fields":["dob"]}}' \
     "$URL/newdb/_index"

In the body of the JSON supplied to the _index endpoint, we specify the array of fields that are to be indexed. Cloudant then creates the index on disk (ordered by dob in this case), so that it can be used by future queries.

If we repeat our query, we should see the same results, but without the warning:

$ curl -X POST \
    -H'Content-type:application/json' \
    -d@query.json \
    "$URL/newdb/_find"
# { "docs":[ ], "bookmark": "" }

Our query is now using the index and you should see a performance improvement, especially with larger data sets.

Index types - json & text

Cloudant Query has two types of index - json and text.

Both index types have their subtleties, so make sure you’ve read the documentation and understand how it works before going into production!

Indexing strategies

The job of an index is to help the database to reduce the volume of data it’s working with to a managable size. Imagine you have a query like this for a database of movies:

"selector": {
    "$and": [
        {
            "actor": {
                "$eq": "Al Pacino"
            }
        },
        {
            "year": {
                "$eq": 2010
            }
        }
    ]
}

Which field or fields should you create the index on to best serve this query? It depends on whether which is the smaller:

In this case, I’d be inclined to create an index on actor because there may be thousands of movies in a year, but an actor might be credited with only a few dozen movies in their entire career. So an index on actor winows the database to a smaller dataset than an index on year.

Our index on actor reduces a database of tens of thousands of records down to a few dozen. It is then very simple for the database to apply the second clause of the $and to this smaller data set.

Other tips:

Next time

In the final part of this series, we’ll look at performing aggregations using Cloudant.