Inside Elasticsearch cluster – monitoring and hardware – Part 3


Hi, Elasticsearch fans.

In this third article we will continue our deep dive into the Elasticsearch cluster (the second part can be found here). This time we will discuss in detail how to choose hardware for an Elasticsearch cluster and which parameters we should monitor to reach an optimal configuration. Let me start with the bad news: there is no golden rule or method that makes choosing hardware for an Elasticsearch cluster easy. Everything depends on how exactly you are going to use Elasticsearch. Still, here are some essential notes:

  • If you are going to run complex filtered queries or do intensive indexing, you will need more powerful CPU resources.
  • As for memory, with Elasticsearch it is always a painful topic. Elasticsearch is written in Java, so it is essential to set the JVM heap correctly. The more heap is available to Elasticsearch, the more memory it can use for its internal caches, but the less memory is left for the OS file system cache. Larger heaps also cause longer garbage collection pauses. Set Xmx and Xms to no more than 50% of your physical RAM (see the heap-check sketch after this list).
  • Elasticsearch performs a lot of network-heavy operations, from transferring data during queries to relocating shards, so networking matters. A 1 Gbit network is good, 10 Gbit is better 🙂
  • As for storage, consider only recent-generation SSD disks; the file system is also important. In my experience, Linux ext4 is a good choice.
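To make the 50% heap rule concrete, here is a minimal sketch that compares the configured JVM heap with the physical RAM reported by each node. It assumes a cluster reachable at http://localhost:9200 without authentication (a placeholder, adjust the URL and credentials to your setup); the field names come from the standard /_nodes/stats response.

```python
import requests

# Ask only for the JVM and OS sections of the node stats.
# http://localhost:9200 is a placeholder for your own cluster address.
resp = requests.get("http://localhost:9200/_nodes/stats/jvm,os")
resp.raise_for_status()

for node_id, node in resp.json()["nodes"].items():
    heap_max = node["jvm"]["mem"]["heap_max_in_bytes"]
    ram_total = node["os"]["mem"]["total_in_bytes"]
    ratio = heap_max / ram_total
    flag = "OK" if ratio <= 0.5 else "heap larger than 50% of RAM!"
    print(f"{node['name']}: heap {heap_max >> 20} MB / RAM {ram_total >> 20} MB "
          f"({ratio:.0%}) -> {flag}")
```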

In my opinion, it is better to start from a somewhat over-provisioned hardware configuration and then run tests and experiments with your real data and the queries you actually plan to use. While running such tests, the cluster should be under constant monitoring. If you run Elasticsearch as a managed service in a cloud, you get monitoring parameters out of the box. Unfortunately, Elasticsearch as a service is an expensive solution, so many companies keep running Elasticsearch on VMs inside clouds or on their own private servers. How do you handle monitoring in that case? I can definitely recommend a Zabbix + Elasticsearch setup. If you are interested in how to install and configure monitoring with Zabbix, please write to me via email or LinkedIn; if I see enough interest in this topic, new articles about it may appear. Now let's return to the monitoring parameters. What exactly should we pay attention to?

  • First of all, the classics: CPU utilization, RAM usage and disk read/write operations on the VMs/servers where the Elasticsearch nodes are running.
  • The next parameters are related to search itself: query rate, fetch rate and search latency.
  • Then there are things related to insert/update operations and segments: refresh, merge and flush statistics (see the index-stats sketch after this list).
  • Rejections in the Java thread pools are also essential to watch.
  • We also have to pay attention to the field data cache and the node query cache.
  • And finally, simply check JVM health and the Elasticsearch slow query logs.
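As a small preview of the next section, here is a minimal sketch that pulls the search, refresh, merge and flush counters for a single index from the index stats endpoint. The index name my_index and the http://localhost:9200 address are placeholders for your own setup; average latency is simply the accumulated query time divided by the number of queries.

```python
import requests

INDEX = "my_index"  # placeholder, use one of your own indices
stats = requests.get(f"http://localhost:9200/{INDEX}/_stats").json()

total = stats["_all"]["total"]
search = total["search"]

# Average search latency = accumulated query time / number of queries
if search["query_total"]:
    avg_query_ms = search["query_time_in_millis"] / search["query_total"]
    print(f"queries: {search['query_total']}, avg latency: {avg_query_ms:.1f} ms")

print("fetches:  ", search["fetch_total"])
print("refreshes:", total["refresh"]["total"])
print("merges:   ", total["merges"]["total"])
print("flushes:  ", total["flush"]["total"])
```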

Yes, you are right: there are a lot of things to monitor. But let me show you how to check most of the above-mentioned metrics using the built-in statistics APIs.

For that purpose we can use the /index_name/_stats and /_nodes/stats API endpoints. Here is what the results look like in a short video cut from one of my Udemy course lectures:
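For those who prefer text to video, here is a minimal sketch of the same idea in code: it walks over /_nodes/stats and prints the thread pool rejections, field data cache, node query cache and JVM heap usage mentioned above. Again, http://localhost:9200 is a placeholder; the field names follow the standard node stats layout.

```python
import requests

resp = requests.get("http://localhost:9200/_nodes/stats")
resp.raise_for_status()

for node in resp.json()["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    fielddata = node["indices"]["fielddata"]
    query_cache = node["indices"]["query_cache"]

    # Rejections per thread pool: a growing number means the node cannot keep up
    rejected = {pool: s["rejected"]
                for pool, s in node["thread_pool"].items() if s["rejected"]}

    print(f"{node['name']}: heap {heap_pct}%, "
          f"fielddata {fielddata['memory_size_in_bytes'] >> 10} KB "
          f"(evictions {fielddata['evictions']}), "
          f"query cache {query_cache['memory_size_in_bytes'] >> 10} KB, "
          f"rejected {rejected or 'none'}")
```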

That’s all I wanted to tell you about Elasticsearch cluster monitoring. It is, for sure, only a short overview, but I hope it can be a good starting point for people who are beginning their adventure with Elasticsearch 🙂

Below are the links to my Udemy courses, where you can find a lot of useful and, above all, practical information about Elasticsearch:
