Friday, March 23, 2018

Elasticsearch: setting up watcher for FS monitoring

In this post, we'll go through the components used in this scenario and the process of setting up a watcher within Elasticsearch.

Our case scenario: we have three machines running different operating systems (all Linux-based) and we'd like to be notified when we're running low on disk space. For this we've taken advantage of a running Elasticsearch instance with a Gold license and enabled the Watcher feature.




Components

Our structure is rather basic:

Metricbeat -> Logstash -> Elasticsearch -> Kibana

  • metricbeat is a lightweight shipper used to collect system metrics like CPU, processes and memory and, among others, filesystem metrics, which we'll use here
  • logstash serves as an intermediate data-flow link between the shipper and the Elasticsearch instance
  • elasticsearch holds all the data in the default metricbeat* indices
  • kibana is in this case used as a tool to create the watcher job

Metricbeat data structure

In the picture, you can see a sample entry from the filesystem metricset. We are interested in monitoring the system.filesystem.used.pct metric.
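For reference, a trimmed filesystem metricset event contains roughly these fields (the values below are made up for illustration; the exact layout depends on your metricbeat version and logstash pipeline):

```json
{
  "@timestamp": "2018-03-23T10:15:00.000Z",
  "host": "webNode1",
  "metricset": { "module": "system", "name": "filesystem" },
  "system": {
    "filesystem": {
      "mount_point": "/home",
      "total": 52710469632,
      "used": {
        "bytes": 40587061248,
        "pct": 0.77
      }
    }
  }
}
```

The three fields we'll build the watcher around are host, system.filesystem.mount_point and system.filesystem.used.pct.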


Preparing the query for watcher


The first thing to consider was the query we should execute to get the right data. Basically, we wanted to monitor how much of the filesystem is used (in %), which means the system.filesystem.used.pct field is the right one for us, but...
While querying this field will tell us that we're running out of space, we won't know which filesystem or which machine triggered the alert. Luckily for us, all the data is in the metricbeat entry, so we just need to add the fields system.filesystem.mount_point and host.

Now the real question comes to mind... how do we get the query? Well, there are a few ways, and perhaps the simplest one is with the help of Kibana.
Because Kibana does all the querying for us, we just need to create a new table visualization, visualize the MAX value of system.filesystem.used.pct and split the rows on system.filesystem.mount_point and host.

That's it! We have the table with all the data we're interested in, so let's just copy the request from Kibana.
It will look something like this:
    {
      "size": 0,
      "_source": {
        "excludes": []
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "host",
            "size": 20,
            "order": {
              "1": "desc"
            }
          },
          "aggs": {
            "1": {
              "max": {
                "field": "system.filesystem.used.pct"
              }
            },
            "2": {
              "terms": {
                "field": "system.filesystem.mount_point",
                "size": 5,
                "order": {
                  "1": "desc"
                }
              },
              "aggs": {
                "1": {
                  "max": {
                    "field": "system.filesystem.used.pct"
                  }
                }
              }
            }
          }
        }
      },
      "query": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "system.filesystem.used.pct"
              }
            },
            {
              "range": {
                "@timestamp": {
                  "gte": 1521427888000,
                  "lte": 1521727888000,
                  "format": "epoch_millis"
                }
              }
            }
          ]
        }
      }
    }
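Before moving on, you can sanity-check the copied request in Dev Tools by running it against the metricbeat indices yourself (adjust the timestamp range to something recent first):

```json
GET metricbeat*/_search
{
  "size": 0,
  "aggs": { ... },
  "query": { ... }
}
```

If the aggregation buckets come back with your hosts and mount points, the query is good to reuse.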
Let's not worry about the details of this JSON for now, just keep it for later. We'll use parts of it for the watcher.


Constructing the Watcher


The documentation about the watcher is well written on elastic.co, so just briefly...
A watcher consists of four main parts:
  • trigger: specifies when the watch should run
  • input: the query and filters used to get our data
  • condition: determines whether the actions should run
  • actions: if the condition is met, the actions are triggered
    {  
       "trigger": { ... },  
       "input": { ... },  
       "condition": { ... },  
       "actions": { ... }  
    }  
Let's take a look at the parts of the watcher...

Trigger

This one is simple. We can use different schedulers; in this case we use an interval of 5m, meaning the watch will run every 5 minutes after it's created.

  "trigger": {  
     "schedule": {  
       "interval": "5m"  
     }  
   },
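As a side note, if you'd rather run the watch at fixed times instead of on a rolling interval, the schedule also supports cron expressions (seconds first); a sketch that would run at the top of every hour:

```json
"trigger": {
  "schedule": {
    "cron": "0 0 * * * ?"
  }
},
```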

Input

In this part we'll specify:
  1. indices on which we'll search data
  2. the query
  3. aggregations

NOTE: see the comments in the JSON sample below; they are not valid JSON, so remove them before using the snippet.

    "input": {
      "search": {
        "request": {
          "search_type": "query_then_fetch",
          "indices": [
    # as our data is in the metricbeat indices, we want to query only them
            "metricbeat*"
          ],
          "types": [],
          "body": {
            "size": 0,
            "query": {
    # this is the query we extracted earlier from the visualization
              "bool": {
                "must": [
                  {
                    "exists": {
                      "field": "system.filesystem.used.pct"
                    }
                  },
                  {
                    "range": {
                      "@timestamp": {
    # we've changed the timestamp range from static values to dynamic;
    # in this case we search data from the last 2 minutes
                        "gte": "now-2m",
                        "lte": "now",
                        "format": "epoch_millis"
                      }
                    }
                  }
                ]
              }
            },
            "aggs": {
    # here are the aggs from the visualization query;
    # we've renamed the numbered aggs to something meaningful (host, pct, mpoint)
              "host": {
                "terms": {
                  "field": "host",
                  "size": 20,
                  "order": {
                    "pct": "desc"
                  }
                },
                "aggs": {
                  "pct": {
                    "max": {
                      "field": "system.filesystem.used.pct",
                      "script": {
    # as the original value is a ratio between 0 and 1, we multiply it by 100
                        "source": "doc['system.filesystem.used.pct'].value * 100",
                        "lang": "painless"
                      }
                    }
                  },
                  "mpoint": {
                    "terms": {
                      "field": "system.filesystem.mount_point",
                      "size": 5,
                      "order": {
                        "pct": "desc"
                      }
                    },
                    "aggs": {
                      "pct": {
                        "max": {
                          "field": "system.filesystem.used.pct"
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    },

 

Condition

The condition is simple: test whether the filesystem usage is greater than 75%.

      "condition": {
        "compare": {
    # our query is more complex, so it'll return buckets with data;
    # the MAX value is in the first bucket and we can refer to it like this
          "ctx.payload.aggregations.host.buckets.0.pct.value": {
            "gt": 75
          }
        }
      },
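Note that the compare condition only checks the first (highest) host bucket. If you'd rather alert when any bucket crosses the threshold, a script condition can loop over all of them; a sketch, assuming the same aggregation names as in the input above:

```json
"condition": {
  "script": {
    "lang": "painless",
    "source": "for (b in ctx.payload.aggregations.host.buckets) { if (b.pct.value > 75) { return true; } } return false;"
  }
},
```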

 

Actions

In this case, we want to be notified by email:
      "actions": {
        "mail@krisko": {
    # the watcher will run every 5 minutes, but a notification will be sent at most once every 20 minutes
          "throttle_period": "20m",
          "email": {
            "profile": "standard",
            "to": [
              "email@t-systems.com"
            ],
            "subject": "High FS Usage on {{ctx.payload.aggregations.host.buckets.0.key}}",
            "body": {
              "html": "<b>{{ctx.payload.aggregations.host.buckets.0.mpoint.buckets.0.key}}</b> reached <b>{{ctx.payload.aggregations.host.buckets.0.pct.value}}%</b>"
            }
          }
        }
      }
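Keep in mind that the email action requires an SMTP account to be configured in elasticsearch.yml. While testing, it can be handy to swap in a logging action instead, which just writes a line to the Elasticsearch log; a sketch (the action name log_fs_usage is arbitrary):

```json
"actions": {
  "log_fs_usage": {
    "logging": {
      "text": "High FS usage on {{ctx.payload.aggregations.host.buckets.0.key}}: {{ctx.payload.aggregations.host.buckets.0.pct.value}}%"
    }
  }
}
```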

The last step is to PUT the watch to Elasticsearch and test whether it's running as designed. This can be handled from Dev Tools in Kibana:
    PUT _xpack/watcher/watch/CheckFS  
    {  
    # place all the watcher parts here  
       "trigger": { ... },  
       "input": { ... },  
       "condition": { ... },  
       "actions": { ... }  
    }  
      
    POST _xpack/watcher/watch/CheckFS/_execute  
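Once it's in, you can fetch the watch back to inspect its status (last execution, ack state), or remove it again if something's off:

```json
GET _xpack/watcher/watch/CheckFS

DELETE _xpack/watcher/watch/CheckFS
```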

Sample email:
    High FS Usage on webNode1  
      
    /home reached 77.0%  

2 comments:

  1. hey thanks for this post but this "ctx.payload.aggregations.host.buckets.0.mpoint.buckets.0.key" is giving me the first key result of the bucket (I know you mentioned it as buckets.0.key) but how can I list all the remaining disks?

    here I have like this

    "buckets": [
      {
        "pct": {
          "value": 0.903
        },
        "doc_count": 11,
        "key": "/opt/data2"
      },
      {
        "pct": {
          "value": 0.891
        },
        "doc_count": 11,
        "key": "/opt/data"
      },
      {
        "pct": {
          "value": 0.709
        },
        "doc_count": 11,
        "key": "/opt/data3"
      },
      {
        "pct": {
          "value": 0.618
        },
        "doc_count": 11,
        "key": "/"
      },
      {
        "pct": {
          "value": 0.225
        },
        "doc_count": 11,
        "key": "/opt/data4"
      }
    ]

    1. Hello,
      you can maybe reference the next bucket index (buckets.1, buckets.2, ...) to return more results, but with Kibana you cannot monitor more than one metric.
      In my case this is sufficient, because the FS usage metrics are sorted from high to low, so if any FS reaches the threshold, you'll be notified about the highest one.
