Search Analyzers

October 19, 2018 | Glynn Bird | search

Cloudant Search is the free-text search technology built into the Cloudant database; it is powered by Apache Lucene. Lucene-based indexes are used for:

When creating a Cloudant Search index, you need to decide which fields from your documents should be indexed and how they should be indexed.

One aspect of the indexing process is the choice of analyzer. An analyzer is code that may:

At indexing time, source data is processed by the analyzer before it is sorted and stored in the index. At query time, the search terms are processed by the same analyzer before the index is interrogated.
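To make the index-time/query-time symmetry concrete, here is a toy sketch in Python. It is not Lucene and not how Cloudant is implemented: the `analyze` function, its stopword list and the tiny in-memory index are all made up for illustration. It only shows why running the same analyzer over both the documents and the search terms lets a query such as "CHRIS" find text that was written as "Chris".

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption); real analyzers ship their own.
STOPWORDS = {"is", "at", "the", "a", "an", "and"}


def analyze(text):
    """Toy analyzer: lowercase, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Analyze the query with the same analyzer, then intersect the matches."""
    tokens = analyze(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


docs = {
    "doc1": "My name is Chris Wright-Smith.",
    "doc2": "I live at 21a Front Street, Durham.",
}
index = build_index(docs)
print(search(index, "CHRIS"))         # finds doc1 despite the different case
print(search(index, "Front Street"))  # finds doc2
```

If the query were not passed through `analyze`, the upper-case term "CHRIS" would never match the lower-cased token stored in the index.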

Testing the analyzer

There is a Cloudant Search API call that will apply one of the built-in Lucene analyzers to a supplied string to allow you to see the effect of each analyzer.

To look at each analyzer in turn, I'm going to pass the same string to each one and compare the results:

“My name is Chris Wright-Smith. I live at 21a Front Street, Durham, UK - my email is chris7767@aol.com.”
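As a sketch of what such a call might look like from Python, the snippet below builds a request for the `_search_analyze` endpoint using only the standard library. The account URL is a placeholder and authentication is left out, so treat both as assumptions to replace with your own details. The request is only constructed and printed here; the lines that would actually send it are commented out.

```python
import json
import urllib.request

# Placeholder account URL (an assumption); replace with your own endpoint.
CLOUDANT_URL = "https://myaccount.cloudant.com"

SAMPLE_TEXT = (
    "My name is Chris Wright-Smith. I live at 21a Front Street, "
    "Durham, UK - my email is chris7767@aol.com."
)


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST request for the _search_analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_search_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(CLOUDANT_URL, "standard", SAMPLE_TEXT)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send the request (credentials not shown, assumed to be handled elsewhere):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Swapping "standard" for another analyzer name such as "keyword", "simple" or "whitespace" changes which analyzer is applied to the text.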

Standard analyzer

{"tokens":["my", "name", "chris", "wright", "smith", "i", "live", "21a", "front", "street", "durham", "uk", "my", "email", "chris7767", "aol.com"]}

Keyword analyzer

{"tokens":["My name is Chris Wright-Smith. I live at 21a Front Street, Durham, UK - my email is chris7767@aol.com."]}

Simple analyzer

{"tokens":["my", "name", "is", "chris", "wright", "smith", "i", "live", "at", "a", "front", "street", "durham", "uk", "my", "email", "is", "chris", "aol", "com"]}

Whitespace analyzer

{"tokens":["My", "name", "is", "Chris", "Wright-Smith.", "I", "live", "at", "21a", "Front", "Street,", "Durham,", "UK", "-", "my", "email", "is", "chris7767@aol.com."]}

Classic analyzer

{"tokens":["my", "name", "chris", "wright", "smith", "i", "live", "21a", "front", "street", "durham", "uk", "my", "email", "chris7767@aol.com"]}

English analyzer

{"tokens":["my", "name", "chri", "wright", "smith", "i", "live", "21a", "front", "street", "durham", "uk", "my", "email", "chris7767", "aol.com"]}

The language-specific analyzers make the most changes to the source data, removing stopwords and reducing words to their stems, as these two further examples show:

The quick brown fox jumped over the lazy dog.
{"tokens":["quick","brown","fox","jump","over","lazi","dog"]}

Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
{"tokens":["four","score","seven","year","ago","our","father","brought","forth","contin","new","nation","conceiv","liberti","dedic","proposit","all","men","creat","equal"]}
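The three simplest analyzers shown earlier can be approximated in a few lines of Python, which may help in building intuition for what each one does. These are rough stand-ins written for this post, not the Lucene implementations: the Keyword version returns the whole string as one token, the Whitespace version splits on whitespace only, and the Simple version splits at anything that is not a letter and then lowercases.

```python
import re

SAMPLE_TEXT = (
    "My name is Chris Wright-Smith. I live at 21a Front Street, "
    "Durham, UK - my email is chris7767@aol.com."
)


def keyword_analyzer(text):
    """Approximation: the entire input is a single, unmodified token."""
    return [text]


def whitespace_analyzer(text):
    """Approximation: split on whitespace, keeping case and punctuation."""
    return text.split()


def simple_analyzer(text):
    """Approximation: split at non-letters (digits included), then lowercase."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


for analyzer in (keyword_analyzer, whitespace_analyzer, simple_analyzer):
    print(analyzer.__name__, analyzer(SAMPLE_TEXT))
```

Run against the test string, these approximations produce the same tokens as the Keyword, Whitespace and Simple outputs listed above. The Standard, Classic and language-specific analyzers involve stopword lists, stemming and more elaborate tokenization rules, so they are not attempted here.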

Which analyzer should I pick?

It depends on your data. If you have structured data (email addresses, zip codes, names etc. in separate fields), then it's worth picking an analyzer for each field that keeps intact the values your users will search for.

Only index the fields that you need. Keeping the index small helps to improve performance.

Let's go through some common types of data and look at the best analyzer choice for each.

Names

It's likely that name fields should use an analyzer that doesn't stem words. The Whitespace analyzer retains the words' case (meaning the search terms would have to be a full, case-sensitive match) and leaves double-barrelled names intact. If you want to split up double-barrelled names, then the Standard analyzer would do the job.

Email addresses

There is a built-in Email analyzer for just this purpose which lowercases everything and then behaves like the Keyword analyzer.

Unique id

Order numbers, payment references and UUIDs such as “A1324S”, “PayPal0000445” and “ABC-1412-BBG” should be retained without any pre-processing, so the Keyword analyzer is preferred.

Country codes

Country codes such as "UK" should also use the Keyword analyzer, to prevent the removal of country codes that happen to match stopwords, e.g. "IN" for India. Note that the Keyword analyzer is case-sensitive.

Text

A block of free-form text is best processed with a language-specific analyzer such as the English analyzer or, in the more general case, the Standard analyzer.
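Pulling these recommendations together, a search index can assign a different analyzer to each field by using the "perfield" analyzer in its design document. The sketch below builds such a design document as a Python dictionary. The field names (name, email, order_id, country, description), the design document id and the index name are all hypothetical, and the index function is JavaScript held in a Python string, because that is the language Cloudant index functions are written in.

```python
import json

# JavaScript index function held as a string; the field names are hypothetical.
INDEX_FUNCTION = """function (doc) {
  if (doc.name) { index("name", doc.name); }
  if (doc.email) { index("email", doc.email); }
  if (doc.order_id) { index("order_id", doc.order_id); }
  if (doc.country) { index("country", doc.country); }
  if (doc.description) { index("description", doc.description); }
}"""

design_doc = {
    "_id": "_design/search",  # hypothetical design document name
    "indexes": {
        "customers": {  # hypothetical index name
            "analyzer": {
                "name": "perfield",
                "default": "standard",  # used for any field not listed below
                "fields": {
                    "name": "whitespace",
                    "email": "email",
                    "order_id": "keyword",
                    "country": "keyword",
                    "description": "english",
                },
            },
            "index": INDEX_FUNCTION,
        }
    },
}

print(json.dumps(design_doc, indent=2))
```

The printed JSON is what would be saved to the database as the design document; only the fields named in the index function end up in the index, which keeps it small.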

Store: true or include_docs=true?

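When returning data from a search there are two options: store the fields you need in the index itself by indexing them with store: true, or ask Cloudant to return the full document body with each result by adding include_docs=true at query time.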

The former option means having a larger index but is the fastest way of retrieving data. The latter option keeps the index small but adds extra query-time work for Cloudant as it has to fetch document bodies after the search result set is calculated. This can be slower to execute and add a further burden to a Cloudant cluster.

If possible, choose the former option:
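A minimal sketch of the two approaches follows. The index function (JavaScript in a Python string) passes the store option so the field's value is kept in the index, and the helper builds a search URL with or without include_docs=true. The account URL, database, design document and index names are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# JavaScript index function: {"store": true} keeps the value in the index
# so it can be returned with search results without fetching the document.
STORED_INDEX_FUNCTION = """function (doc) {
  if (doc.name) { index("name", doc.name, {"store": true}); }
}"""


def search_url(base_url, db, ddoc, index_name, query, include_docs=False):
    """Build a Cloudant Search query URL, optionally asking for document bodies."""
    params = {"q": query}
    if include_docs:
        params["include_docs"] = "true"
    return f"{base_url}/{db}/_design/{ddoc}/_search/{index_name}?{urlencode(params)}"


BASE = "https://myaccount.cloudant.com"  # placeholder account URL

# Former option: rely on stored fields, no document fetch at query time.
print(search_url(BASE, "customers", "search", "byname", "name:chris"))

# Latter option: smaller index, but Cloudant fetches each matching document.
print(search_url(BASE, "customers", "search", "byname", "name:chris", include_docs=True))
```

With the first URL, each result carries only the stored fields; with the second, each result also carries the whole document.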

Entity extraction

Providing a good search experience depends on the alignment of your users’ search needs with structure in the data. Throwing lots of unstructured data at an indexing engine gets you only so far; if you can add further structure to unstructured data, then the search experience will benefit as fewer “false positives” will be returned. Let’s take an example:

“Edinson Cavani scored two superb goals as Uruguay beat Portugal to set up a World Cup quarter-final meeting with France. Defeat for the European champions finished Cristiano Ronaldo’s hopes of success in Russia just hours after Lionel Messi and Argentina were knocked out, beaten 4-3 by Les Bleus.”

Source: BBC News https://www.bbc.co.uk/sport/football/44439361

From this snippet, I would manually extract the following “entities”:

Entity extraction is the process of locating known entities (given a database of such entities) and storing the entities in the search engine instead of, or as well as, the source text. The Watson Natural Language Understanding API can be fed raw text and will return the entities it knows about (you can provide your own entity model for your domain-specific application):

As well as entities, the API can also place the article in a hierarchy of categories. In this case, Watson suggested:

Pre-processing your raw data by calling the Watson API for each document, and storing the resulting list of entities, concepts and categories in your Cloudant document, gives you automatic metadata about your free-text information and can make your app easier to search and navigate.
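The pre-processing step can be sketched without any external service by using a small lookup table of known entities. This is a deliberately naive stand-in for a real entity extraction API such as Watson's: the `KNOWN_ENTITIES` table, the entity types and the document field names (body, entities) are all invented for this example, and matching is plain substring search.

```python
# Hypothetical lookup table of known entities; a real system would use a far
# larger database or a dedicated entity extraction service.
KNOWN_ENTITIES = {
    "Edinson Cavani": "Person",
    "Cristiano Ronaldo": "Person",
    "Lionel Messi": "Person",
    "Uruguay": "Country",
    "Portugal": "Country",
    "France": "Country",
    "Argentina": "Country",
    "Russia": "Country",
}


def extract_entities(text, known_entities):
    """Return the known entities that appear in the text, sorted by name."""
    found = [
        {"text": name, "type": entity_type}
        for name, entity_type in known_entities.items()
        if name in text
    ]
    return sorted(found, key=lambda entity: entity["text"])


def enrich_document(doc, known_entities):
    """Return a copy of the document with an added 'entities' list."""
    enriched = dict(doc)
    enriched["entities"] = extract_entities(doc["body"], known_entities)
    return enriched


article = {
    "_id": "article1",  # hypothetical document id
    "body": (
        "Edinson Cavani scored two superb goals as Uruguay beat Portugal "
        "to set up a World Cup quarter-final meeting with France."
    ),
}

enriched = enrich_document(article, KNOWN_ENTITIES)
for entity in enriched["entities"]:
    print(entity["type"], entity["text"])
```

The enriched document can then be saved to Cloudant, and the entities indexed as their own search field (for example with the Keyword analyzer) so that users can search or filter by entity rather than by raw text.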