2014/11/29

Network socket state statistics monitoring by Zabbix

It's strange that zabbix-agent lacks for information about network socket states. At least it would be nice to monitor number of ESTAB, TIME_WAIT and CLOSE_WAIT connections.
Good that we can extend zabbix-agent - so I made this:




Installation:
  1. Save this as /etc/zabbix/zabbix_agentd.conf.d/ss.conf
    # Only UDP count returned on any sock[*] key query
    # All other items send to trap keys at the same time
    UserParameter=sock[*], /bin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i - >/dev/null; /bin/ss -anu | wc -l
    
  2. Import template with counters and graph
    template_app_sockets.xml

How it works:
  • all counters have type 'zabbix trapper', but the 'UDP' counter (which is active)
  • when zabbix-agent asks for UDP connections count - all other counters will be sent to zabbix-server as traps at the same time
  • you can adjust query interval which is by default 5 min

Quick theory:




LISTEN - the local end-point is waiting for a connection request from a remote end-point i.e. a passive open was performed.
SYN-SENT - the first step of the three-way connection handshake was performed. A connection request has been sent to a remote end-point i.e. an active open was performed.
SYN-RECEIVED - the second step of the three-way connection handshake was performed. An acknowledgement for the received connection request as well as a connection request has been sent to the remote end-point.
ESTABLISHED - the third step of the three-way connection handshake was performed. The connection is open.
FIN-WAIT-1 - the first step of an active close (four-way handshake) was performed. The local end-point has sent a connection termination request to the remote end-point.
CLOSE-WAIT - the local end-point has received a connection termination request and acknowledged it e.g. OS knows that the remote application has closed the connection and waits for the local application to also do so.
FIN-WAIT-2 - the remote end-point has sent an acknowledgement for the previously sent connection termination request. The local end-point waits for second FIN packet from the remote end-point.
LAST-ACK - represents waiting for an acknowledgment of the connection termination request previously sent to the remote end-point
CLOSING - the local end-point is waiting for an acknowledgement for a connection termination request before going to the TIME-WAIT state.
TIME-WAIT - from the local end-point point of view, the connection is closed but we’re still waiting before accepting a new connection in order to prevent delayed duplicate packets from the previous connection from being accepted by the new connection. [According to RFC 793 a connection can stay in TIME-WAIT for a maximum of four minutes known as a 2xMSL (maximum segment lifetime).]

Due to the way TCP/IP works, connections can not be closed immediately. When connection in one direction is terminated, the other party can continue sending data in the other direction. Packets may arrive out of order or be retransmitted after the connection has been closed. CLOSE_WAIT indicates that the remote endpoint (other side of the connection) has closed the connection. TIME_WAIT indicates that local endpoint (this side) has closed the connection. The connection is being kept around so that any delayed packets can be matched to the connection and handled appropriately. The connections will be removed when they time out within some time.
So, basically, we can tune sysctl timeouts to influence upon TIME_WAIT, but CLOSE_WAIT is an application decision (until socket.close() is not executed)

22 comments:

  1. i've got an error by importing the template in Zabbix 2.4
    " XML file contains fatal error 41:Specification mandate value for attribute data-pjax-transient [ Line: 32 | Column: 40 ]"

    ReplyDelete
    Replies
    1. Unfortunately link for xml in post was pointing to conf file.
      Fixed it, and checked that it still importing successfully to zabbix 2.4.3

      Delete
  2. Where do you change the interval? When item is set as trapper it has no interval - meaning this has to be set on the monitored server -- where exactly? Thx

    ReplyDelete
    Replies
    1. You set interval on the only one item of type 'monitoring agent' which is uptime.
      This check returns uptime and triggers sending of all the traps.

      Delete
  3. sorry but what is /bin/ss? on my server there isn't any such command. do i have to install any support packages?

    ReplyDelete
    Replies
    1. $ dpkg -S /bin/ss
      iproute: /bin/ss
      It is 'iproute' package, i believed it was included in default debian

      Delete
    2. hello, it's me again. im on centos so the ss command is located in /usr/sbin/ss. therefore i've modify the path in your script and it works well

      but another issue is that i couldn't make it work on my debian server, there's no data on the graph and i don't know y

      im sending u the log from zabbix_agent

      12360:20150601:234115.676 active checks #1 [processing active checks]
      12360:20150601:234115.676 In process_active_checks() server:'172.16.8.130' port:10051)
      12360:20150601:234115.676 In zbx_popen() command:' /bin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i - >/dev/null; /bin/ss -anu | wc -l'
      12360:20150601:234115.676 End of zbx_popen():6
      12362:20150601:234115.678 zbx_popen(): executing script
      12360:20150601:234115.693 In zbx_waitpid()
      12360:20150601:234115.693 zbx_waitpid() exited, status:0
      12360:20150601:234115.693 End of zbx_waitpid():12362
      12360:20150601:234115.693 EXECUTE_STR() command:' /bin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i - >/dev/null; /bin/ss -anu | wc -l' len:2 cmd_result:'17'
      12360:20150601:234115.693 for key [sock[UDP]] received value [17]
      12360:20150601:234115.694 In process_value() key:'Zabbix server:sock[UDP]' value:'17'
      12360:20150601:234115.694 In send_buffer() host:'172.16.8.130' port:10051 values:0/100
      12360:20150601:234115.694 End of send_buffer():SUCCEED
      12360:20150601:234115.694 buffer: new element 0
      12360:20150601:234115.694 End of process_value():SUCCEED
      12360:20150601:234115.694 End of process_active_checks()
      12360:20150601:234115.694 In get_min_nextcheck()
      12360:20150601:234115.694 active checks #1 [idle 1 sec]
      12356:20150601:234116.675 collector [processing data]

      Delete
    3. So, it is successfully collecting active checks (UDP).
      Everything else is sent as Traps.
      What is the output of:
      /bin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}'

      And what is the result of putting the previous output to:
      /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i -

      when you execute it on the server to be monitored? Maybe you don't have zabbix_sender installed or your zabbix_agent.conf is in another place.

      Delete
  4. data appear to be sent to zabbix server, but not processed on server side.

    output of collecting command when run manually looks like this:
    - sock[TIME-WAIT 0 0 ::ffff:XX.XX.XX.XX:80 ::ffff:176.106.193.50:52534 ] 1
    - sock[ESTAB 0 0 ::ffff:XX.XX.XX.XX:80 ::ffff:77.120.168.85:59480 ] 1
    - sock[TIME-WAIT 0 0 ::ffff:XX.XX.XX.XX:80 ::ffff:80.77.34.239:50050 ] 1
    etc etc.

    output from sending collected data looks like this:
    /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -vv -z ZZ.ZZ.ZZ.ZZ -s XX.XX.XX.XX -i -
    zabbix_sender [21811]: DEBUG: answer [{
    "response":"success",
    "info":"processed: 0; failed: 250; total: 250; seconds spent: 0.002858"}]
    info from server: "processed: 0; failed: 250; total: 250; seconds spent: 0.002858"
    zabbix_sender [21814]: DEBUG: answer [{
    "response":"success",
    "info":"processed: 0; failed: 250; total: 250; seconds spent: 0.002877"}]
    info from server: "processed: 0; failed: 250; total: 250; seconds spent: 0.002877"
    etc etc

    any ideas?

    ReplyDelete
    Replies
    1. Please note that $ char is doubled in zabbix conf and if you want to run this command in shell you need to use single $. Check out my previous comment (s[$1]+). So correct output should be something like:
      - sock[FIN-WAIT-2] 1
      - sock[SYN-RECV] 0
      - sock[CLOSE-WAIT] 1
      - sock[LAST-ACK] 0
      - sock[ESTAB] 2
      - sock[TIME-WAIT] 3
      - sock[LISTEN] 18
      - sock[SYN-SENT] 0
      - sock[FIN-WAIT-1] 0

      Delete
    2. thanks. so, i fixed the 1st issue with double $, and now the output of collecting command run in shell looks exactly as it should:

      - sock[SYN-SENT] 0
      - sock[LAST-ACK] 17
      - sock[SYN-RECV] 8
      - sock[ESTAB] 314
      - sock[FIN-WAIT-1] 35
      - sock[CLOSING] 6
      - sock[FIN-WAIT-2] 162
      - sock[TIME-WAIT] 2660
      - sock[CLOSE-WAIT] 14
      - sock[LISTEN] 21

      but it's still not processed on the server side:
      ${collecting_string_snipped} | /usr/bin/zabbix_sender -vv -z XX.XX.XX.XX -s YY.YY.YY.YY -i -
      zabbix_sender [11169]: DEBUG: answer [{
      "response":"success",
      "info":"processed: 0; failed: 10; total: 10; seconds spent: 0.034376"}]
      info from server: "processed: 0; failed: 10; total: 10; seconds spent: 0.034376"
      sent: 10; skipped: 0; total: 10

      Delete
    3. can i check things from the zabbix server side somehow?
      i'm new to zabbix and still don't know how to check receiving multiple parameters.

      Delete
    4. You've already checked that by using zabbix_sender debug output. This line "processed: 0; failed: 10; total: 10;" means that on zabbix server side nobody listens for these values.
      It is because you used "-z XX.XX.XX.XX -s YY.YY.YY.YY" and there is no server with explicit name 'YY.YY.YY.YY' configured on zabbix side.
      Usually you installing zabbix agent and configuring it using config file. You setting servername there, not IP. And then configuring the same servername (or autodiscovery) on zabbix server side, and adding some items to check for this servername.

      Delete
    5. thanks for explanation.
      now i've fixed that, and getting the following:

      # /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -vvv -z XX.XX.XX.XX -s ServerName -i -
      zabbix_sender [19025]: DEBUG: answer [{
      "response":"success",
      "info":"processed: 9; failed: 1; total: 10; seconds spent: 0.000271"}]
      info from server: "processed: 9; failed: 1; total: 10; seconds spent: 0.000271"
      sent: 10; skipped: 0; total: 10

      how could i check what's failed?

      Delete
    6. sorry, the previous output is wrong, since it did not send any data from the second command (/usr/sbin/ss -anu | wc -l).
      now, when i send output of the whole string, i get:

      /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' ; /usr/sbin/ss -anu | wc -l | /usr/bin/zabbix_sender -vvv -z XX.XX.XX.XX -s ServerName -i -
      - sock[SYN-SENT] 3
      - sock[LAST-ACK] 19
      - sock[SYN-RECV] 131
      - sock[ESTAB] 299
      - sock[FIN-WAIT-1] 16
      - sock[CLOSING] 2
      - sock[FIN-WAIT-2] 141
      - sock[TIME-WAIT] 2145
      - sock[CLOSE-WAIT] 5
      - sock[LISTEN] 25
      zabbix_sender [12145]: Warning: [line 1] 'Key' required
      Sending failed.

      Delete
  5. 1. To check this kind of issue:
    "processed: 9; failed: 1; total: 10;"
    Just send output to zabbix_sender line by line

    2. You don't need to include second command output for passing to zabbix_sender.
    Idea is to use one active check to send multiple trap counters in bunch. So zabbix-agent asks script only for UDP connections count ant this value is returned in plain form. But in the background all other counters are send in bunch to zabbix-server as traps using zabbix_sender. This way load on server and agent is less than asking for each counter as active checks.

    ReplyDelete
  6. "Please note that $ char is doubled in zabbix conf and if you want to run this command in shell you need to use single $."

    I spent about 3 hours for debugging this issue, and then i read this comment=) Now i have only one word - FUUU (aka Блин, да как так-то? все уже написано до нас..) =)

    ReplyDelete
  7. Hmmm no data on server side and any related entry on zabbix_agent.log

    ReplyDelete
  8. I don't have such thing on my host /usr/bin/zabbix_sender :/usr/bin/zabbix_sender: No such file or directory

    What should I use on the ss.conf file?
    Thanks

    ReplyDelete
  9. On the Zabbix UI I can see under latest data /graph the UDP ports populating the graph min max average , the others are at 0.

    /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -vvv -z 10.10.20.190 -s ServerName -i -
    zabbix_sender [17561]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 9; total: 9; seconds spent: 0.000259"}]
    info from server: "processed: 0; failed: 9; total: 9; seconds spent: 0.000259"
    sent: 9; skipped: 0; total: 9


    local the info it is checked and correct

    /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[$1]++} END {for (i in s) {print "- sock["i"]", s[i]}}'
    - sock[SYN-SENT] 0
    - sock[LAST-ACK] 0
    - sock[SYN-RECV] 0
    - sock[ESTAB] 7
    - sock[FIN-WAIT-1] 0
    - sock[FIN-WAIT-2] 1
    - sock[TIME-WAIT] 29
    - sock[CLOSE-WAIT] 1
    - sock[LISTEN] 9

    What will be the next step for troubleshooting

    ReplyDelete
  10. Got this in the zabbix server logs -elevated the log level to debug-

    17077:20161214:161100.799 In zbx_popen() command:' /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[21161]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i - >/dev/null; /usr/sbin/ss -anu | wc -l'
    17077:20161214:161100.800 End of zbx_popen():5
    18704:20161214:161100.800 zbx_popen(): executing script
    17077:20161214:161100.833 In zbx_waitpid()
    17077:20161214:161100.833 zbx_waitpid() exited, status:0
    17077:20161214:161100.833 End of zbx_waitpid():18704
    17077:20161214:161100.833 EXECUTE_STR() command:' /usr/sbin/ss -ant | grep -v State | awk 'BEGIN {s["CLOSE-WAIT"]=0;s["ESTAB"]=0;s["FIN-WAIT-1"]=0;s["FIN-WAIT-2"]=0;s["LAST-ACK"]=0;s["SYN-RECV"]=0;s["SYN-SENT"]=0;s["TIME-WAIT"]=0} {s[21161]++} END {for (i in s) {print "- sock["i"]", s[i]}}' | /usr/bin/zabbix_sender -c /etc/zabbix/zabbix_agentd.conf -i - >/dev/null; /usr/sbin/ss -anu | wc -l' len:2 cmd_result:'22'
    17077:20161214:161100.833 for key [sock[UDP]] received value [22]
    17077:20161214:161100.834 In process_value() key:'rtpdl-ph-cmhs01:sock[UDP]' value:'22'
    17077:20161214:161100.834 In send_buffer() host:'10.10.20.190' port:10051 values:0/100
    17077:20161214:161100.834 End of send_buffer():SUCCEED
    17077:20161214:161100.834 buffer: new element 0
    17077:20161214:161100.834 End of process_value():SUCCEED
    17077:20161214:161100.834 End of process_active_checks()
    17077:20161214:161100.834 In get_min_nextcheck()
    17077:20161214:161100.834 active checks #1 [idle 1 sec]
    17073:20161214:161101.792 collector [processing data]
    17073:20161214:161101.793 In update_cpustats()
    17073:20161214:161101.804 End of update_cpustats()
    17073:20161214:161101.804 collector [idle 1 sec]
    17077:20161214:161101.834 In send_buffer() host:'10.10.20.190' port:10051 values:1/100
    17077:20161214:161101.835 JSON before sending [{"request":"agent data","data":[{"host":"rtpdl-ph-cmhs01","key":"sock[UDP]","value":"22","clock":1481749860,"ns":834310726}],"clock":1481749861,"ns":834980670}]
    17077:20161214:161101.836 JSON back [{"response":"success","info":"processed: 1; failed: 0; total: 1; seconds spent: 0.000170"}]
    17077:20161214:161101.836 In check_response() response:'{"response":"success","info":"processed: 1; failed: 0; total: 1; seconds spent: 0.000170"}'
    17077:20161214:161101.836 info from server: 'processed: 1; failed: 0; total: 1; seconds spent: 0.000170'
    17077:20161214:161101.837 End of check_response():SUCCEED
    17077:20161214:161101.837 OK
    17077:20161214:161101.837 End of send_buffer():SUCCEED




    Still why the close_wait or FIN_wait is not populated in graph

    ReplyDelete