Elasticsearch – analyzers and tokenizers – Part 1: Theory


Hi, this article is the first part of a series devoted to a rather interesting, but at the same time quite complicated, Elasticsearch topic – analyzers and tokenizers. We will start with the boring theory. Unfortunately, that is the best way to understand the subject. Then, in the next parts, we will move on to practical examples. So let’s start.

First we have to answer a small question: how does Elasticsearch deal with full-text search?

Here is the answer from official documentation: “Elasticsearch first analyzes the text, and then uses the results to build an inverted index. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.”

Sounds complicated, doesn’t it? But don’t worry, it is much easier to understand with a simple example. Let’s assume that we have two phrases:

  • The small black cat (this phrase will be indexed into the 1st Elasticsearch document)
  • Small black cats (this phrase will be indexed into the 2nd Elasticsearch document)
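
Just to make the example concrete, here is a minimal sketch of how these two phrases could be indexed into a recent Elasticsearch version (the index name “phrases” and the field name “phrase” are invented here purely for illustration; the requests can be sent from Kibana Dev Tools or any REST client):

PUT /phrases/_doc/1
{
  "phrase": "The small black cat"
}

PUT /phrases/_doc/2
{
  "phrase": "Small black cats"
}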

A simplified inverted index for these phrases could look like this:

Term      Doc_1  Doc_2
-------------------------
The     |   X   |  
small   |   X   |
black   |   X   |  X
cat     |   X   |
cats    |       |  X
Small   |       |  X

So, we have a table of the unique terms that appear in the two phrases, plus a mark showing exactly which documents each term appears in. Now, imagine that you want to search for “small black cat” using a naive similarity algorithm, i.e. we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
small   |   X   |  
black   |   X   |  X
cat     |   X   |
------------------------
Total   |   3   |  1
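
In real life you would not count matching terms by hand, of course – you would run a full-text query and let Elasticsearch consult the inverted index for you. Sticking with the hypothetical “phrases” index from above, such a search could look roughly like this:

GET /phrases/_search
{
  "query": {
    "match": {
      "phrase": "small black cat"
    }
  }
}

Under the hood, the match query analyzes the search text and then looks up the resulting terms in the inverted index, very much like our manual count above.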

So, both documents matched – great :). But the 1st one matched much better than the 2nd one. Is that good? From a human perspective both phrases have the same meaning. Can a machine recognize that successfully? Hard to say whether a “computer” can do it as well as a person ;). But let’s assume that we perform some small transformations on our phrases:

  • we will lowercase all words – it does not matter to us whether we have “small” or “Small”
  • we will replace plural forms with their root forms – e.g. “cat” and “cats” have the same root “cat” (this is called stemming)
  • we will also throw away some service words, e.g. “a”, “the” (so-called stop words)

After that our phrases would look like “small black cat” and “small black cat”. They become completely equal or, to put it another way, their search relevance is the same. And if we apply the same transformations to both the indexed documents and the search terms, we get much better search results. Such transformation of text is called analysis. Of course, this is a very simplified example – in practice it is much more complicated – but it conveys the idea 🙂
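
If you have a running Elasticsearch cluster at hand, you can try these transformations yourself with the _analyze API. A minimal sketch, using the built-in standard tokenizer and the built-in lowercase, stop and stemmer token filters, which roughly correspond to the three steps above:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop", "stemmer"],
  "text": "The small black cats"
}

The response should contain the tokens “small”, “black” and “cat” – exactly the normalized form we arrived at manually.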

From the official Elasticsearch documentation: analysis is the process of tokenizing a block of text into individual terms suitable for use in an inverted index, and normalizing these terms into a standard form to improve their searchability. In Elasticsearch this job is performed by analyzers.

In simpler human language: an analyzer is a tool that splits a phrase into words (this is called tokenizing) and then applies some filtering to each word, or token (these are called token filters). Elasticsearch has a lot of built-in analyzers, tokenizers and token filters.
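
These building blocks – one tokenizer plus a chain of token filters – can also be combined into an analyzer of your own. A minimal sketch of what such a combination could look like in the index settings (the index name “my_index” and the analyzer name “my_custom_analyzer” are arbitrary names used here just for illustration):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "stemmer"]
        }
      }
    }
  }
}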

Example: the standard analyzer (which is used by default) splits text on word boundaries (roughly, on whitespace and punctuation), removes punctuation and then lowercases every token. So after applying the standard analyzer to the phrase “Hello, Dolly, I have (5) cents” we get the tokens “hello”, “dolly”, “i”, “have”, “5”, “cents”, which are then used for building the inverted index.
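
This is easy to verify with the same _analyze API – a minimal sketch, assuming a running cluster:

POST _analyze
{
  "analyzer": "standard",
  "text": "Hello, Dolly, I have (5) cents"
}

The response lists the produced tokens together with their positions and offsets in the original text.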

Human language is very complicated, so analyzers can be correspondingly complicated as well. Elasticsearch already has a lot of useful language-specific analyzers and plugins, e.g. the icu_tokenizer for dealing with Unicode and Asian languages. I use analyzers extensively. From my commercial experience, using the correct analyzers can boost the search experience a lot. Moreover, sometimes it is even worth creating a custom analyzer – that is also possible and not as complicated as it probably seems at first glance. In the next article I am going to show you, with a practical example, how much the search experience depends on analyzers. I will also show you how to build your own custom analyzer from scratch and how useful it can be. Thank you for staying with me all this time, and welcome to my course if you want to learn more.

