2014/12/06

ElasticSearch internals monitoring by Zabbix (v2 traps)

Here is a more resource-oriented version of the ElasticSearch monitoring from the previous article, this time using Zabbix traps. It also comes with a very basic template, which was often requested in the comments:



Graphs included:
  • Shards and nodes status
  • Indices tasks speed
  • Indices tasks time spent


Installation:
  1. Save this as /etc/zabbix/zabbix_agentd.d/elasticsearch.conf
    #Key jvm.uptime_in_millis used to trigger trap sending
    UserParameter=es[*],/etc/zabbix/elasticsearch.py $1
    
    
  2. And here is the data getter:
    /etc/zabbix/elasticsearch.py

  3. Then import template
    template_app_elasticsearch.xml
How to add a new counter:
  • Browse the JSON output of your server's
    http://localhost:9200/_nodes/_local/stats?all=true
  • Write the path to the value of interest using dots as separators, for example
    indices.docs.count
  • Create new counter in zabbix with
    key name = es[path.you.found]
    and type = zabbix_trap
  • And here is the difference from the previous version: you also need to add the path of your new counter to the traps2 section of the elasticsearch.py file. Then execute elasticsearch.py without any parameters; the zabbix_sender debug output will be printed to the console. In the top section you should find your new counter key (if you don't, the key was not found or is empty in the JSON output), and in the bottom section the number of failed items should be zero (if it isn't, no such key is configured for this server in the Zabbix web interface)
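The dot-separated path from the steps above is resolved by walking the nested JSON dicts, roughly like this (a sketch, not the exact code from elasticsearch.py):

```python
import json

def get_by_path(stats, path):
    """Follow a dot-separated path like 'indices.docs.count' through nested dicts."""
    for key in path.split('.'):
        stats = stats[key]
    return stats

# same shape as a fragment of the /_nodes/_local/stats?all=true output
sample = json.loads('{"indices": {"docs": {"count": 573692, "deleted": 0}}}')
print(get_by_path(sample, "indices.docs.count"))  # -> 573692
```

A KeyError while walking the path corresponds to the "key is not found or empty in JSON output" case above.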

49 comments:

  1. Hi, can you please add the steps to monitor another host instead of localhost?
    If I want to monitor a host like 10.0.1.10 from my Zabbix server 10.0.1.5, what steps do I need to take on the remote host and on the Zabbix server?

    Replies
    1. Just the usual monitoring of a remote host by zabbix-agent. Something like this:
      - install zabbix-agent on the remote host
      - add this host to the zabbix-server, attach some "generic linux template" and check that basic data starts being collected (the zabbix-agent must report the same "hostname" as is configured for this host on the zabbix-server side)
      - then proceed with adding the elasticsearch checks: import the template from this article on the zabbix-server side and assign it to your host
      - on host side configure
      /etc/zabbix/zabbix_agentd.d/elasticsearch.conf and
      /etc/zabbix/elasticsearch.py

      So, the template should be imported only once on the server side and then assigned to the hosts that need it. But the config for the agent check and the data-collection script must be installed on each elasticsearch host (and they send data about localhost back to the zabbix-server)
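For the addresses in the question, the agent-side config would look roughly like this (a sketch; the Hostname value is a placeholder and must match the host's name in the Zabbix web interface):

```
# /etc/zabbix/zabbix_agentd.conf on the elasticsearch host 10.0.1.10
Server=10.0.1.5
ServerActive=10.0.1.5
Hostname=es-node-01
```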

  2. when I run script I get this error:

    12539:20150107:102815.007 item "nw-esclient-201.prod.pcln.com:es[jvm.uptime_in_millis]" became not supported: Received value [File "/usr/local/zabbix/bin/elasticsearch.py", line 22 "status", ^SyntaxError: invalid syntax] is not suitable for value type [Numeric (float)]

    Replies
    1. Maybe there is some EOL issue from when you downloaded the script?
      What is shown when you try:
      # python elasticsearch.py
      ?

    2. bash-4.1$ python elasticsearch.py
      File "elasticsearch.py", line 22
      "status",
      ^
      SyntaxError: invalid syntax

    3. Dunno, the syntax looks valid. At least for the python2 I have installed on my home box (there is no elasticsearch there):
      # wget https://github.com/sepich/zabbix/raw/master/elasticsearch.py &>/dev/null
      # python elasticsearch.py
      Unable to load JSON data!

      So, no syntax errors. What's your OS and python?

    4. Thanks for helping out, OS=Red Hat Enterprise Linux Server release 6.5
      with Python 2.6.6

    5. Hi Tom,
      add None value to all traps1 and traps2 keys

      i.e. (line 22): "status": None,

      Worked for me in the same environment

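The fix above most likely works because set literals ({"status", "unassigned_shards", ...}) were only added in Python 2.7, so on RHEL 6's Python 2.6 the bare comma-separated keys are a SyntaxError, while a dict with None placeholder values parses on both. A sketch (assuming this is how traps1 is declared; the real contents are in elasticsearch.py):

```python
# Python 2.7+ set-literal form -- a SyntaxError on Python 2.6:
#   traps1 = {
#       "status",
#       "unassigned_shards",
#   }
# Portable form -- a dict whose None values are just placeholders:
traps1 = {
    "status": None,
    "unassigned_shards": None,
}
print(sorted(traps1))  # -> ['status', 'unassigned_shards']
```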
  4. Thanks Federico, that helped:

    bash-4.1$ ./elasticsearch.py jvm.uptime_in_millis
    10116541216

    But this errors out:

    bash-4.1$ ./elasticsearch.py jvm_heap_p_used
    zabbix_sender [18783]: Warning: [line 1] '-' encountered as 'Hostname', but no default hostname was specified
    Sending failed.

    Replies
    1. I hardcoded my hostname
      line 71
      out += "YOURHOSTNAMEHERE es.{0} {1}\n".format(t,s)
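The '-' warning comes from zabbix_sender's input format, which expects one "<hostname> <key> <value>" line per item; '-' means "use the default hostname", which fails when none is configured. Instead of hardcoding, the local hostname can be filled in (a sketch; sender_line is a hypothetical helper, not a function from the script):

```python
import os

def sender_line(key, value, hostname=None):
    # zabbix_sender reads lines of "<hostname> <key> <value>";
    # using the OS hostname avoids the '-' placeholder entirely
    host = hostname or os.uname()[1]
    return "{0} es.{1} {2}\n".format(host, key, value)

print(sender_line("jvm.uptime_in_millis", 10116541216, hostname="es-node-01"))
# -> es-node-01 es.jvm.uptime_in_millis 10116541216
```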

  5. Hi,
    Can you help me:

    Traceback (most recent call last):
      File "./elasticsearch.py", line 70, in <module>
        main()
      File "./elasticsearch.py", line 64, in main
        stats=stats[c.pop(0)]
    UnboundLocalError: local variable 'stats' referenced before assignment

    Replies
    1. There is no such code (stats=stats[c.pop(0)]) anywhere in the file:
      https://github.com/sepich/zabbix/blob/master/elasticsearch.py
      Maybe you are trying with some older version?
      Could you try with the latest?

  6. Hi There,

    This is a great solution, thank you for sharing. I have a small problem with it: the elasticsearch.py script runs OK on my other nodes, but on my master node it returns the "Unable to load JSON data!" error. I don't know Python, so I cannot really figure out what the code does, but I assume something fails here:

    for node_id in all['nodes']:
        if all['nodes'][node_id]['host'].startswith(os.uname()[1]):
            node = all['nodes'][node_id]
            if len(sys.argv) == 1:
                print "node found"
    except:
        print "Unable to load JSON data!"
        sys.exit(1)

    Any ideas?

  7. Looks like the cluster node name differs from the hostname on that server. Tell me what they are and I'll think about how the script should be modified to handle this

    Replies
    1. I don't know if you got the reply, if not, let me know and I'll send it again.

  8. Hmm, that's strange :) I have the same naming convention on all nodes.
    hostnames: eslog001.abc.local to eslog005.abc.local (first one is the master and the problem is there)
    Node names: ES_ONE, ES_TWO, ES_THREE, ES_FOUR, ES_FIVE

    Replies
    1. In this case it should not work on the other nodes either ;)
      if all['nodes'][node_id]['host'].startswith(os.uname()[1]):
      Basically, this code searches all cluster nodes for the one whose name starts with the current hostname. So, for eslog001.abc.local the hostname would be eslog001.
      If your cluster node names are not equal to the hostnames, you can just hardcode it in the script like this:
      if all['nodes'][node_id]['host'].startswith('ES_ONE'):

  9. This is really strange. I have created a script as you advised with the hard coded node address and hard coded node name. It still does not work on the problematic server (unable to load JSON) but if I run the exact same script on others with the problematic server's address and name hardcoded, it works :)

    Replies
    1. Actually, python scripts are not so hard to debug.
      Just open a python prompt:
      # python
      And then copy-paste everything from the script that is above the
      def main():
      line (actually only the imports and getKeys are needed here). Then copy-paste the beginning of main():
      f = requests.get("http://localhost:9200/_cluster/health")
      health = f.json()
      f = requests.get("http://localhost:9200/_nodes/_local/stats?all=true")
      all = f.json()
      Now we are coming to the problem place. Let's check which nodes exist in the report:
      >>> for node_id in all['nodes']: print all['nodes'][node_id]['host']
      All the node names should be printed out. And we are looking for the current server, with this name:
      >>> print os.uname()[1]
      Do you see a node in the previous output whose name starts with exactly the same characters?

      Oh, I remember the error was "Unable to load JSON data!", so it is an exception during the execution of one of those commands. When you copy-pasted the lines from the script, where did you see errors?

  10. Hi

    I have tried using this script but keep getting this error when I run it:

    File "/etc/zabbix/elasticsearch.py", line 22
    "status",
    ^
    SyntaxError: invalid syntax

  11. Hi all,

    If you receive the error "Unable to load JSON data!" try running the command:

    curl -XGET 'http://localhost:9200/_cluster/health'

    In my case, the elasticsearch process was not listening on localhost (127.0.0.1) and I got the error:

    curl: (7) couldn't connect to host

    Then running "sudo netstat -ntlp" revealed the IP address and port elasticsearch is listening on, and replacing "localhost" with this IP address in the script solved the problem.

  12. zabbix_sender [6743]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 33; total: 33; seconds spent: 0.000243"}]
    info from server: "processed: 0; failed: 33; total: 33; seconds spent: 0.000243"
    sent: 33; skipped: 0; total: 33

  13. Hello.
    I performed the process but do not have the /etc/zabbix/zabbix_agentd.d directory, just /etc/zabbix/agentd_conf.d.
    I use version 2.4 of zabbix, but monitoring is not bringing any results.

    Can you help me?

    Replies
    1. Sure, you need to use the '/etc/zabbix/agentd_conf.d' folder if that is what the 'Include' statement in your /etc/zabbix/zabbix_agentd.conf points to

    2. Tks for your fast reply.

      Look at this:
      /etc/zabbix/zabbix_agentd.conf is all right.

      ### Option: Include
      # You may include individual files or all files in a directory in the configuration file.
      # Installing Zabbix will create include directory in /etc/zabbix, unless modified during the compile time.
      #
      # Mandatory: no
      # Default:
      # Include=
      # Include=/etc/zabbix/zabbix_agentd.userparams.conf
      # Include=/etc/zabbix/zabbix_agentd.conf.d/
      Include=/etc/zabbix/zabbix_agentd.conf.d/




      I also changed the /etc/zabbix/elasticsearch.py

      cfg = '/etc/zabbix/zabbix_agentd.conf.d' # path to zabbix-agent config


      Nothing occurs

    3. Did you try running:
      # /etc/zabbix/elasticsearch.py
      Did you try running:
      # zabbix_agentd -k 'es[indices.docs.count]'
      What are the results?

    4. Running /etc/zabbix/elasticsearch.py, nothing occurs.

      The other command returns this result:

      aryel@backend-elasticsearch-dev-data-igm-pri-instance-l7d5:/usr/sbin$ ./zabbix_agentd -t 'es[indices.docs.count]'
      es[indices.docs.count] [m|ZBX_NOTSUPPORTED]

    5. Retrying the command:

      - es.status 503
      - es.indices.search.fetch_total 0
      - es.indices.merges.total 11950
      - es.indices.refresh.total_time_in_millis 1394850
      - es.indices.indexing.index_time_in_millis 372322
      - es.os.mem.actual_used_in_bytes 3064516608
      - es.indices.merges.total_time_in_millis 3231557
      - es.indices.flush.total 253
      - es.indices.indexing.index_total 328279
      - es.indices.indexing.delete_total 0
      - es.indices.docs.deleted 0
      - es.jvm.mem.heap_committed_in_bytes 3203792896
      - es.indices.indexing.delete_time_in_millis 0
      - es.indices.get.missing_total 0
      - es.indices.search.query_time_in_millis 0
      - es.indices.get.exists_time_in_millis 0
      - es.indices.search.fetch_time_in_millis 0
      - es.indices.search.query_total 0
      - es.indices.docs.count 573692
      - es.jvm.mem.heap_used_in_bytes 756227784
      - es.indices.refresh.total 108393
      - es.indices.store.throttle_time_in_millis 25133
      - es.indices.warmer.total_time_in_millis 1250
      - es.indices.get.exists_total 0
      - es.indices.get.missing_time_in_millis 0
      - es.indices.flush.total_time_in_millis 15132
      - es.os.mem.actual_free_in_bytes 4799635456
      - es.indices.warmer.total 5491

    6. ZBX_NOTSUPPORTED means the zabbix agent does not know what the 'es[]' item key means. The agent should be restarted after you put the config file in place.
      - es.indices.merges.total 11950 ... means the data is being collected OK.

    7. I reinstalled the agent, but something is still wrong.
      Could you help me?
      aryel@backend-elasticsearch-dev-data-igm-pri-instance-l7d5:/etc/zabbix$ ./elasticsearch.py
      node found
      - es.status yellow
      - es.unassigned_shards 0
      - es.active_primary_shards 10
      - es.initializing_shards 1
      - es.active_shards 19
      - es.relocating_shards 0
      - es.indices.search.fetch_total 0
      - es.indices.merges.total 13889
      - es.indices.refresh.total_time_in_millis 1618108
      - es.indices.indexing.index_time_in_millis 441191
      - es.os.mem.actual_used_in_bytes 3084935168
      - es.indices.merges.total_time_in_millis 3596862
      - es.indices.flush.total 311
      - es.indices.indexing.index_total 395003
      - es.indices.indexing.delete_total 0
      - es.indices.docs.deleted 0
      - es.jvm.mem.heap_committed_in_bytes 3203792896
      - es.indices.indexing.delete_time_in_millis 0
      - es.indices.get.missing_total 0
      - es.indices.search.query_time_in_millis 0
      - es.indices.get.exists_time_in_millis 0
      - es.indices.search.fetch_time_in_millis 0
      - es.indices.search.query_total 0
      - es.indices.docs.count 401643
      - es.jvm.mem.heap_used_in_bytes 1063793304
      - es.indices.refresh.total 124231
      - es.indices.store.throttle_time_in_millis 94
      - es.indices.warmer.total_time_in_millis 2
      - es.indices.get.exists_total 0
      - es.indices.get.missing_time_in_millis 0
      - es.indices.flush.total_time_in_millis 17301
      - es.os.mem.actual_free_in_bytes 4779216896
      - es.indices.warmer.total 19
      zabbix_sender [2106]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 33; total: 33; seconds spent: 0.000331"}]
      info from server: "processed: 0; failed: 33; total: 33; seconds spent: 0.000331"
      sent: 33; skipped: 0; total: 33

  15. One bug: it doesn't find the node if the case differs between the hostname and the node name; inserting a couple of .lower() calls into the comparison fixes that

    I also think it's a good idea to modify line 72 to be:

    out += "{0} es.{1} {2}\n".format(os.uname()[1],t,s)
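The case-insensitive comparison described above could look like this (a sketch; is_local_node is a hypothetical helper, the real check is inline in the script):

```python
def is_local_node(node_host, local_hostname):
    """Case-insensitive prefix match between a cluster node's host and the local hostname."""
    return node_host.lower().startswith(local_hostname.lower())

print(is_local_node("ESLOG001.abc.local", "eslog001"))  # -> True
```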

  16. Hi there, I had elasticsearch 1.5 and this script was working like a charm, but after updating to elasticsearch 2.0 it stopped working, giving this error:

    Traceback (most recent call last):
      File "/etc/zabbix/elasticsearch.py", line 117, in <module>
        main()
      File "/etc/zabbix/elasticsearch.py", line 94, in main
        out += getKeys(node,traps2) #getting stats values
    UnboundLocalError: local variable 'node' referenced before assignment

    I don't see any difference in the output of http://localhost:9200/_nodes/_local/stats?all=true
    Any idea?

    Replies
    1. I've found the issue. Previously I was using hostnames now I am using IP addresses. I forgot to change this in the script. :)

  17. Having a problem.

    If I run the script directly as root or zabbix (with a shell):

    ./elasticsearch.py `hostname`
    : No such file or directory

    or ./elasticsearch.py
    : No such file or directory

    If I run it as python ./elasticsearch.py it produces the output fine.

    Suggestions?

    Replies
    1. If I run it manually just once, the graphs build without issue, so I added a cronjob to give it a kick.

      I still don't understand exactly why this is.

  18. Please consider Elasticsearch 2.0 compatibility patch:
    https://github.com/islepnev/zabbix/commit/f413717ce5c3a4b9ead0c8f417a2ba2a536b78d9

  20. Dear Sepa, please explain: how do you make your script send traps periodically? It works fine once, but then nothing happens. Do you use cron to schedule it?

    Replies
    1. There is one 'active check' which triggers trap sending when it is requested by the agent. There is even a note about it in elasticsearch.conf:
      #Key jvm.uptime_in_millis used to trigger trap sending
      So you can tune the time period of this active check to also change how often the traps are sent.
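In template terms, the setup looks roughly like this (a sketch; item types follow the article, and the 60s interval is illustrative, not taken from the template):

```
Trigger item (drives the sending):
  Key:              es[jvm.uptime_in_millis]
  Type:             Zabbix agent (active)
  Update interval:  60s   <- change this to change how often traps are sent
All other counters:
  Key:              es[path.in.json]   (e.g. es[indices.docs.count])
  Type:             Zabbix trapper     (values are pushed by zabbix_sender)
```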

  21. Sorry, I didn't get it.
    I put
    sudo zabbix_agentd -t 'es[jvm.uptime_in_millis]'
    and get
    elasticsearch% sudo zabbix_agentd -t 'es[jvm.uptime_in_millis]'
    es[jvm.uptime_in_millis] [t|8242898]

    and that's all; zabbix_agentd doesn't start with this parameter

  22. Well, finally I got it. I added the jvm.uptime_in_millis param to the trap list that is sent to the server in the .py file and set a time period for this trap, which is actually a 'Zabbix agent (active)' item

    Replies
    1. So I have tried, let's call it option 1

      "jvm.mem.heap_used_in_bytes",
      "os.mem.actual_free_in_bytes",
      "os.mem.actual_used_in_bytes",
      "jvm.uptime_in_millis",


      and this option 2

      "jvm.mem.heap_used_in_bytes",
      "os.mem.actual_free_in_bytes",
      "os.mem.actual_used_in_bytes",
      "es[jvm.uptime_in_millis]",

      Changed the key on the Zabbix server to match both the above.

      I get this if I use the first option:
      zabbix_sender [21571]: DEBUG: answer [{"response":"success","info":"processed: 33; failed: 1; total: 34; seconds spent: 0.000328"}]
      info from server: "processed: 33; failed: 1; total: 34; seconds spent: 0.000328"
      sent: 34; skipped: 0; total: 34

      I get this result with option 2:

      zabbix_sender [19248]: DEBUG: answer [{"response":"success","info":"processed: 33; failed: 0; total: 33; seconds spent: 0.000305"}]
      info from server: "processed: 33; failed: 0; total: 33; seconds spent: 0.000305"
      sent: 33; skipped: 0; total: 33

      For the life of me I can't get it to update the trap stats automatically; it does a great job if I run the script manually.

      Is there a bug somewhere?


  23. Hi, can you give me some more hints about this, please? I added this to the traps:

    out += "- {0} {1}\n".format("es[jvm.uptime_in_millis]","5000")


    but I still don't get how this is supposed to work.

  24. Hi,
    I am struggling with elasticsearch.py.
    Everything has worked well for months with RHEL 6.4, zabbix 2.4 client, Java 1.7 and Elasticsearch 1.2.4.
    Now we upgraded and have RHEL 6.7, zabbix client 3.0.4, Java 1.8 and ES 2.1.2.
    In this new environment elasticsearch.py does not return any info when started from the command line. It does in the old environment. Do you have any idea what I can do to get this working?
