Monitoring is an essential part of running infrastructure: making sure everything is running as it should, and getting alerts when it is not.
This usually means setting alerts based on triggers, for example: if free disk space drops below 20%, send me an email.
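A fixed-threshold trigger like this can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the 20% figure comes from the sentence above.

```python
import shutil


def free_disk_fraction(path="/"):
    """Return the fraction of disk space that is free at `path` (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def should_alert(free_fraction, threshold=0.20):
    """Fixed-threshold trigger: fire when free space drops below `threshold`."""
    return free_fraction < threshold


if __name__ == "__main__":
    if should_alert(free_disk_fraction()):
        # A real setup would send an email here; printing keeps the sketch simple.
        print("ALERT: free disk space is below 20%")
```

The trigger only works because 20% is a sensible limit for every disk, regardless of its size.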
But what if you don't know what the "normal" range looks like? Take our own monitoring infrastructure as an example: we fire up hundreds of servers per day and keep track of thousands every second. Those servers belong to our customers and live in 30+ different data centres across 5 continents. How do we know if one data centre is having issues?
With the number of servers in each data centre growing, there is no fixed number of servers that must show a fault before we realise something is wrong there. Moreover, every server under our management sends monitoring signals and heartbeats back to Cloud 66 central. How do we differentiate between missing heartbeats caused by data centre issues and those caused by connectivity issues on our end?
Also, our customers might shut down servers, disable heartbeats or reroute the network traffic on their servers. These are legitimate cases that we shouldn't be notified about.
Take heartbeats as an example: each deployed server sends a heartbeat back home once every minute. We monitor those individual heartbeats, and if a server has not sent a heartbeat for more than 10 minutes, we alert our customers. Every heartbeat is also sent to our Graphite servers. Graphite charts the count of heartbeats received per second from across the network, broken down by data centre.
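The per-server check can be pictured roughly as follows. This is a minimal sketch, not our production code: the `last_seen` mapping and the `silent_servers` function are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta


def silent_servers(last_seen, now, max_silence=timedelta(minutes=10)):
    """Return the names of servers whose last heartbeat is older than `max_silence`.

    `last_seen` maps a server name to the time of its most recent heartbeat.
    """
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    last_seen = {
        "web-1": now - timedelta(minutes=1),   # healthy
        "web-2": now - timedelta(minutes=11),  # silent for too long
    }
    print(silent_servers(last_seen, now))
```

This works per server because the expected interval is known. It says nothing about whether a whole data centre is misbehaving.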
This is great for seeing any issues across the network or specific to a single data centre. But we can't set fixed thresholds on those counts because they are constantly growing. Instead, we take a rolling standard deviation of the heartbeat count and monitor that in Seyren. This way we only get alerts when there is a sudden change in the counts (up or down, both are unusual) of more than, say, 2% within a rolling window of 5 minutes.
[Figure: Rolling standard deviation on pulses]
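To make the idea concrete, here is a small Python sketch of a rolling standard deviation check. It is an illustration under assumptions, not the Graphite or Seyren configuration itself: it assumes one heartbeat count per minute, a window of 5 samples, and it reads "2%" as the standard deviation relative to the window's mean. The function name `rolling_alerts` is made up for this example.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_alerts(counts, window=5, threshold=0.02):
    """Return the indexes at which the rolling relative standard deviation
    of `counts` exceeds `threshold`.

    Assumption: the deviation is divided by the window mean, so the same
    threshold keeps working as the absolute counts grow.
    """
    recent = deque(maxlen=window)  # holds only the last `window` samples
    alerts = []
    for i, count in enumerate(counts):
        recent.append(count)
        if len(recent) < window:
            continue  # not enough data for a full window yet
        avg = mean(recent)
        if avg == 0:
            continue  # avoid dividing by zero when nothing was received
        if pstdev(recent) / avg > threshold:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    steady_growth = [1000, 1002, 1004, 1006, 1008, 1010]
    sudden_drop = [1000, 1002, 1004, 1006, 1008, 700]
    print(rolling_alerts(steady_growth))  # slow growth stays under the threshold
    print(rolling_alerts(sudden_drop))    # the drop at the end is flagged
```

Because the check compares each window against itself, slow growth in the number of servers never trips it, while a sudden jump or drop does.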
This combination has worked very well for us. It can also be used to monitor an abnormal rise in activity, such as hits on certain URLs or failed login attempts, without fixing the threshold at an arbitrary number.
Of course, we send a lot more than heartbeats to Graphite: the number of new signups, deployment successes and failures, IP address changes, disk, CPU and memory metrics, IO and OS metrics, and much more. By combining Graphite's rolling average and standard deviation with the alerting capabilities of Seyren, we can keep the infrastructure healthy for our customers without needing to know what each individual customer is doing.