2014/02/15

ElasticSearch internals monitoring by Zabbix

NOTE: A newer version of this article, using zabbix_traps, is available here

There are quite a lot of Zabbix agent extensions for ElasticSearch monitoring, but they are limited and provide only a set of predefined counters. What if you need to collect internal data?


This detailed data is displayed in the ElasticSearch management interface and is provided as JSON by the backend. You can preview it in any browser; just open:
http://servername:9200/_cluster/nodes/stats?all=true
Here is the JSON viewer from Firefox's Firebug:

As you can see, there are indeed a lot of detailed internal counters. It is now easy to write a Zabbix agent extension that loads the JSON URL, parses it, extracts the needed counter and hands it to Zabbix. There is just one little issue: if you need to check 70 different counters, the page will be downloaded 70 times. So let's cache the downloaded data in tmpfs for 30 seconds.

Save this as /etc/zabbix/elasticsearch.py
#!/usr/bin/env python

import urllib
import json
import sys, os, time

def main():
    # Usage: %s [url] path.counter.name
    # [url] ='all' by default
    # NB: on newer ElasticSearch versions the stats endpoint
    # is /_nodes/stats instead (see comments below)
    urls = { 'all'    : "http://localhost:9200/_cluster/nodes/stats?all=true",
             'health' : "http://localhost:9200/_cluster/health" }
    if len(sys.argv) < 2:
      sys.exit('Usage: %s [url] path.counter.name' % sys.argv[0])

    #parse command line
    if len(sys.argv) > 2 and sys.argv[1] in urls:
      ty  = sys.argv[1]
      url = urls[ty]
      cnt = sys.argv[2]
    else:
      ty  = 'all'
      url = urls[ty]
      cnt = sys.argv[1]

    #download url with caching
    tmp = '/tmp/es_stats_'+ty
    try:
      if os.path.isfile(tmp) and (os.path.getmtime(tmp) + 30) > time.time():
        f = open(tmp, 'r')
        body = json.load(f)
        f.close()
      else:
        f = urllib.urlopen(url)
        body = f.read()
        f = open(tmp, 'w')
        f.write(body)
        f.close()
        body = json.loads(body)

    except:
      #on any download/parse error report 0 so Zabbix still gets a number
      out = '0'

    else:
      #get results for the current node from the cluster results
      #(assumes the ES node name equals the hostname; see comments below)
      if ty == 'all':
        for node_id in body['nodes'].keys():
          if body['nodes'][node_id]['name'] == os.uname()[1]:
            stats = body['nodes'][node_id]
      else:
        stats = body

      #JVM counters calculations
      if cnt == 'jvm_heap_p_of_RAM':
        out = str(100*float(stats['jvm']['mem']['heap_committed_in_bytes'])/(stats['os']['mem']['actual_used_in_bytes'] + stats['os']['mem']['actual_free_in_bytes']))
      elif cnt == 'jvm_p_heap_used':
        out = str(100*float(stats['jvm']['mem']['heap_used_in_bytes'])/stats['jvm']['mem']['heap_committed_in_bytes'])

      #direct value
      else:
        c=cnt.split('.')
        while len(c):
          stats=stats[c.pop(0)]
        out = str(stats)

    print out

if __name__ == "__main__":
    main()

And create the agent extension /etc/zabbix/zabbix_agentd.d/elasticsearch.conf
# Key syntax is es.json[page, counter].
# Page may be "all" or "health"; it is optional and defaults to "all".
# Counter is the JSON path to the counter value, e.g. indices.search.query_total, and is mandatory.
UserParameter=es.json[*],/etc/zabbix/elasticsearch.py $1 $2
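
With this UserParameter in place, Zabbix item keys map directly to the script's arguments. For example (active_shards is just one of the fields the health page returns; pick whatever field you need):

es.json[indices.search.query_total]  ->  /etc/zabbix/elasticsearch.py indices.search.query_total
es.json[health,active_shards]        ->  /etc/zabbix/elasticsearch.py health active_shards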

Some preparations:
# chmod +x /etc/zabbix/elasticsearch.py
# /etc/init.d/zabbix-agent restart

And we are good to go: just provide the JSON path to any counter you need, with the path components separated by a dot (.).
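The dotted path is simply walked key by key over the parsed JSON, exactly as the while loop in the script does. A minimal standalone sketch of the same lookup (the stats fragment below is made up for illustration; real responses are much larger):

```python
import json
from functools import reduce

# Tiny made-up fragment of the stats JSON, for illustration only.
body = json.loads('{"indices": {"search": {"query_total": 42}}}')

def lookup(stats, counter):
    # Walk nested dicts following the dotted counter path.
    return reduce(lambda node, key: node[key], counter.split('.'), stats)

print(lookup(body, 'indices.search.query_total'))  # -> 42
```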


The script also provides some additional JVM counters calculated on the fly: jvm_heap_p_of_RAM (committed heap as a percentage of total RAM) and jvm_p_heap_used (used heap as a percentage of committed heap).

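These two derived counters are simple ratios over values that already exist in the stats JSON. A sketch with made-up numbers (the values below are illustrative; the field names in the comments are the ones the script actually reads):

```python
# Made-up sample values in bytes, standing in for the real stats fields.
heap_committed = 2 * 1024**3  # jvm.mem.heap_committed_in_bytes
heap_used      = 1 * 1024**3  # jvm.mem.heap_used_in_bytes
ram_used       = 3 * 1024**3  # os.mem.actual_used_in_bytes
ram_free       = 5 * 1024**3  # os.mem.actual_free_in_bytes

# jvm_heap_p_of_RAM: committed heap as a percentage of total RAM
jvm_heap_p_of_ram = 100.0 * heap_committed / (ram_used + ram_free)

# jvm_p_heap_used: used heap as a percentage of committed heap
jvm_p_heap_used = 100.0 * heap_used / heap_committed

print(jvm_heap_p_of_ram)  # -> 25.0
print(jvm_p_heap_used)    # -> 50.0
```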

11 comments:

  1. Do you have a Zabbix template also?

  2. Great script! FYI, your script assumes that your node name matches your host name; in our environment it does not, and I was getting the following error:

    Traceback (most recent call last):
    File "/etc/zabbix/bin/elasticsearch.py", line 68, in <module>
    main()
    File "/etc/zabbix/bin/elasticsearch.py", line 62, in main
    stats = stats[c.pop(0)]
    UnboundLocalError: local variable 'stats' referenced before assignment

    I updated the line:
    if body['nodes'][node_id]['name'] == os.uname()[1]:
    to
    if body['nodes'][node_id]['hostname'] == os.uname()[1]:

    so that it would work for me. Thanks again!

  3. I was getting this error:
    Traceback (most recent call last):
    File "./elasticsearch_stats.py", line 68, in <module>
    main()
    File "./elasticsearch_stats.py", line 46, in main
    for node_id in body['nodes'].keys():
    KeyError: 'nodes'

    It looks like elasticsearch's node stats endpoints have changed. I had to modify the following line to get the script working again.

    from:
    urls = { 'all' : "http://localhost:9200/_cluster/nodes/stats?all=true",

    to:
    urls = { 'all' : "http://localhost:9200/_nodes/stats?all=true",

  4. Can you provide the Zabbix template please? An XML export of the template that you use and have these graphs from?

  5. Great script! Can you also export the XML and post it please?

  6. Can't generate data. Error: "Received value [] is not suitable for value type [Numeric (unsigned)] and data type [Decimal]"

  7. Thank you all for the feedback!
    I've published a new version of this script:
    http://blog.sepa.spb.ru/2014/12/elasticsearch-internals-monitoring-by.html
    Some bugs were fixed, and a basic template is included.

  8. Where do I need to set this up, on the Zabbix client or the Zabbix server?

    Replies
    1. It is the client part.
      For the server-side template, see the new version of the article:
      http://blog.sepa.spb.ru/2014/12/elasticsearch-internals-monitoring-by.html

  9. Can you export all of your code to github or another public VCS?

    Replies
    1. Have you carefully read the article? Especially the very first line ;)
