ELASTICSEARCH – ANALYZERS AND TOKENIZERS – PART 2 Practice

Hi! In the previous article we covered the theory behind analyzers and tokenizers. Now let's put that theory into practice. Let's assume that an Elasticsearch instance is already available at localhost:9200. We will create a test index with the following mappings and settings:

{
    "settings": {
      "number_of_shards": 1,
      "analysis": {
        "filter": {
          "email_filter": {
            "type": "pattern_capture",
            "preserve_original": true,
            "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)"
            ]
          }
        },
        "analyzer": {
          "email": {
            "tokenizer": "uax_url_email",
            "filter": [
                "email_filter",
                "lowercase",
                "unique"
            ]
          },
          "lowercased_string": {
            "filter": [
                "lowercase"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "email1": {
          "analyzer": "email",
          "type": "text",
          "fields": {
            "lowercased": {
              "analyzer": "lowercased_string",
              "type": "text"
            }
          }
        },
        "email2": {
          "type": "text"
        }
      }
    }
}
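As a minimal sketch of how this body could be sent, the snippet below prepares a PUT request with Python's standard library. The index name emails_test is a hypothetical choice (the article does not name the index), and the trimmed example_body is only a placeholder for the full JSON above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # host and port mentioned in the article
INDEX_NAME = "emails_test"        # hypothetical name; the article does not give one


def build_create_index_request(body, index=INDEX_NAME):
    """Prepare (but do not send) a PUT /<index> request with settings and mappings."""
    return urllib.request.Request(
        url=f"{ES_URL}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Trimmed body for illustration only; in practice pass the full JSON shown above.
example_body = {"settings": {"number_of_shards": 1}}
request = build_create_index_request(example_body)
print(request.get_method(), request.full_url)

# To actually create the index against a running node:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

The request is only built here, not sent, so the snippet runs without a live cluster; uncomment the last lines to execute it.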

In our test index we defined two text fields: email1 and email2. The email2 field uses the built-in standard analyzer, which Elasticsearch applies by default, while email1 has a multi-field mapping that lets us index the same value in different ways for different purposes. The main email1 field uses the custom “email” analyzer, and its email1.lowercased sub-field uses the custom “lowercased_string” analyzer. Let's go through the analyzer definitions one by one.

The custom “email” analyzer consists of:

  • the uax_url_email tokenizer: a built-in Elasticsearch tokenizer described in the official documentation. In short, it works like the standard tokenizer but recognizes URLs and email addresses and keeps each of them as a single token, so our email is not split into parts
  • three token filters applied after tokenizing: email_filter, lowercase and unique. The most interesting one is email_filter, a custom pattern_capture filter defined in the index settings. It preserves the original email value (preserve_original is true) and emits an additional token for every capture group matched by its regular expressions. After that, the built-in lowercase filter lowercases the tokens and the built-in unique filter drops duplicates. Sounds complicated, doesn't it? Don't worry, it will become clear how it works in a moment, please be patient 🙂
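To make the filter chain more tangible, here is a rough Python emulation of the three steps for a single email address. It is only an approximation for illustration, not how Elasticsearch implements them: Python's re module has no \p{L}, so [^\W\d_]+ is used as a stand-in, ordering captures by start offset is an assumption, and the function names are made up for this sketch.

```python
import re

# The four patterns from email_filter. Python's re has no \p{L}, so
# [^\W\d_]+ (word characters minus digits and underscore) stands in for it.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)"]


def pattern_capture(token, patterns=PATTERNS, preserve_original=True):
    """Return the token plus every capture-group match, ordered by start offset."""
    captures = []
    for pattern_index, pattern in enumerate(patterns):
        for match in re.finditer(pattern, token):
            captures.append((match.start(1), pattern_index, match.group(1)))
    captures.sort()  # assumption: earlier captures first, then by pattern order
    result = [token] if preserve_original else []
    result.extend(text for _, _, text in captures)
    return result


def unique(tokens):
    """Drop repeated tokens while keeping the first occurrence of each."""
    seen = set()
    kept = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            kept.append(token)
    return kept


def emulate_email_analyzer(address):
    """Emulate tokenizer -> email_filter -> lowercase -> unique for one address."""
    tokens = [address]  # assumed: uax_url_email emits the address as a single token
    tokens = [capture for token in tokens for capture in pattern_capture(token)]
    tokens = [token.lower() for token in tokens]
    return unique(tokens)


tokens = emulate_email_analyzer("John.Snow@gmail.com")
print(tokens)
```

Without the unique step the token “gmail.com” would appear twice, because both "([^@]+)" and "@(.+)" capture it.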

The custom “lowercased_string” analyzer is simpler. It consists of:

  • the keyword tokenizer: a built-in Elasticsearch tokenizer that emits the whole input as a single token
  • lowercase filter

Now let’s see what this gives us in practice. It is easy to check how all of this works using the Elasticsearch _analyze API. Because our analyzers are defined in the index settings, the request has to go to the _analyze endpoint of the test index. Let’s try it first for the lowercased_string analyzer with the following JSON body:

{
  "analyzer": "lowercased_string",
  "text": "john.snow@gmail.com"
}

The result is:

{
    "tokens": [
        {
            "token": "john.snow@gmail.com",
            "start_offset": 0,
            "end_offset": 19,
            "type": "word",
            "position": 0
        }
    ]
}
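For reference, a request like the one above could be prepared in Python as sketched below. The index name emails_test is again a hypothetical placeholder; the index has to appear in the path because custom analyzers such as lowercased_string exist only inside the index that defines them.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # host and port mentioned in the article
INDEX_NAME = "emails_test"        # hypothetical name; the article does not give one


def build_analyze_request(analyzer, text, index=INDEX_NAME):
    """Prepare (but do not send) a POST /<index>/_analyze request."""
    body = {"analyzer": analyzer, "text": text}
    return urllib.request.Request(
        url=f"{ES_URL}/{index}/_analyze",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


analyze_request = build_analyze_request("lowercased_string", "john.snow@gmail.com")
print(analyze_request.get_method(), analyze_request.full_url)

# To run it against a live node and list the returned tokens:
# with urllib.request.urlopen(analyze_request) as response:
#     for item in json.loads(response.read())["tokens"]:
#         print(item["token"], item["type"], item["position"])
```

Swapping the analyzer argument for "standard" or "email" covers the other two requests in this article.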

As you can see, this is a very simple case: we just get the original email value, lowercased, as a single token. In practice it means that when searching a field that uses the lowercased_string analyzer, Elasticsearch returns a result only on an exact (case-insensitive) match of the whole value: john.snow@gmail.com. It returns nothing if we search by the name “john” or the surname “snow”. In real life we would like such full-text searches to work, wouldn't we? Now let’s check how it looks with the standard analyzer. The JSON body is:

{
  "analyzer": "standard",
  "text": "john.snow@gmail.com"
}

The result is:

{
    "tokens": [
        {
            "token": "john.snow",
            "start_offset": 0,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "gmail.com",
            "start_offset": 10,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}

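What does this give us in practice? The address is split only at the “@” sign, so the index contains just two tokens: “john.snow” and “gmail.com”. Elasticsearch will return a result when we search by “john.snow” or “gmail.com”, but it will return nothing when we search by the name “john” or the surname “snow” alone. Searching by part of the domain, e.g. “gmail”, will not work either. A match query with the full email does find the document, but only because the query text is split into the same two tokens, so with the default OR operator it also matches every other address at gmail.com. Again, a poor search experience. So how can we make our full-text search really good? How can we make the email field searchable by name, surname, domain and full email value? It is not complicated with the right analyzer. Have a look at the email analyzer now: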

{
  "analyzer": "email",
  "text": "john.snow@gmail.com"
}

We get the following interesting result:

{
    "tokens": [
        {
            "token": "john.snow@gmail.com",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "john.snow",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "john",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "snow",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "gmail.com",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "gmail",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        },
        {
            "token": "com",
            "start_offset": 0,
            "end_offset": 19,
            "type": "<EMAIL>",
            "position": 0
        }
    ]
}

Wow, cool, isn't it? We get a combination of tokens that allows really good full-text search on the email1 field, while the search experience on the email2 field would be poor.
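The difference between the three analyzers can be summarized with a small, simplified model: take the tokens from the three outputs shown in this article and check whether a single search term is among them. This is only a sketch of a term-style lookup. A real match query analyzes the query text first, so its results can differ, and the helper name analyzers_matching is made up for this example.

```python
# Tokens copied from the three _analyze outputs shown in this article.
INDEXED_TOKENS = {
    "lowercased_string": ["john.snow@gmail.com"],
    "standard": ["john.snow", "gmail.com"],
    "email": [
        "john.snow@gmail.com", "john.snow", "john", "snow",
        "gmail.com", "gmail", "com",
    ],
}


def analyzers_matching(term):
    """Return the analyzers whose indexed tokens contain the lowercased term."""
    term = term.lower()
    return [name for name, tokens in INDEXED_TOKENS.items() if term in tokens]


for term in ("john", "gmail", "gmail.com", "john.snow@gmail.com"):
    print(f"{term!r:24} -> {analyzers_matching(term)}")
```

Only the email analyzer produces a hit for every one of these terms.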

I hope you now see how powerful analyzers are in practice. If you run into problems with your search, look at the analyzers first and check what you get under the hood. I hope you liked this article. Thank you for staying with me all this time, and welcome to my course if you want to learn more.

