
System Alerting & Monitoring in 13 Steps

Om Vikram Thapa

--

My father used to tell me:

It is always better to beef up than to repent later.

In the software industry, if you are working as a product engineer, there are two terms you hear most of the time: Reliability & Scalability.

While Reliability makes your system stable and its foundation strong, Scalability opens up your system’s aspiration to reach a larger audience.

To maintain both of the above, we need to beef up our system with the help of Alerting & Monitoring, which means keeping eyes and ears on what the system is trying to tell you about its health.

Both Alerting & Monitoring can be proactive as well as reactive in nature.

“This blog is not about how to master Alerting & Monitoring but about how to establish a foolproof plan to set up Alerting & Monitoring, which will help us attain a Reliable & Scalable system”

Below are 13 steps I have jotted down, based on my experience, to set up Alerting & Monitoring so that your system gets regular checkups. Here we go:

1) Know your System first

If you don’t know your system then you won’t be able to set up health checks for it. Please understand the architecture, the dependent systems and the handshake between them before we even start with alerts.

Your application may have various microservices or layers which might require individual-level alerts & monitors, while you should also think about system/application-level alerts & monitors to cover the system as a whole.

Make sure your application is sending the right data points to the centralised system (your APM or Logging system) where you can build the health checks.

know-your-system
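
As a rough illustration, here is a minimal Python sketch of emitting a structured data point that a log shipper or APM agent could pick up; the metric names, fields and logger setup are placeholders, not a fixed schema your tool requires.

    import json
    import logging
    import time

    # Minimal sketch: emit structured data points that a log shipper / APM agent can collect.
    # The metric names and tags below are illustrative placeholders.
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("healthcheck")

    def emit_metric(name, value, **tags):
        """Log one data point as JSON so the centralised system can build health checks on it."""
        payload = {"metric": name, "value": value, "ts": int(time.time()), **tags}
        logger.info(json.dumps(payload))

    emit_metric("payment.success.count", 42, service="payments", env="prod")
    emit_metric("booking.latency.ms", 310, service="bookings", env="prod")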

2) Understand Your Infrastructure

Just like the application, you should be aware of your infrastructure too, be it AWS, Google Cloud, Azure or an on-premises setup.

Many a time this infra gives you signals about the health of the system, which in turn impact the application and your product.

Thankfully, all of these cloud services provide you instruments to capture infra logs and monitoring, e.g. AWS CloudWatch and CloudWatch Logs. The signals can be anything like:

  • CPU Utilisation
  • DB IOPS (Input/Output operations per second)
  • Network Packets In/Out
  • Physical disk space etc.
know-your-infrastructure
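
For example, a CPU-utilisation alarm in AWS CloudWatch can be created in a few lines of boto3. This is only a minimal sketch; the alarm name, region, instance ID and SNS topic ARN are placeholders you would swap for your own.

    import boto3

    # Minimal sketch: alarm when average CPU of one EC2 instance stays above 80% for 10 minutes.
    # Region, instance ID and the SNS topic ARN are placeholders.
    cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

    cloudwatch.put_metric_alarm(
        AlarmName="high-cpu-utilisation",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        Statistic="Average",
        Period=300,                 # 5-minute datapoints
        EvaluationPeriods=2,        # two consecutive breaches before the alarm fires
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:ap-south-1:123456789012:ops-alerts"],  # notify this SNS topic
    )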

3) Select your Notification Types

There are various notification types available, but you need to choose which one to use for each alert priority, e.g.

  • Email
  • SMS
  • Call
  • Slack
  • Push Notifications etc.

I remember that on the Payment Platform, for certain cases, we used to send an SMS if there was no payment transaction in the last 5 minutes (except during the graveyard shift).

select-your-notification-type
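
To make that concrete, here is a hypothetical sketch of such a check. count_recent_transactions() and send_sms() are stand-ins for whatever your transaction store and SMS gateway actually expose; the phone number and graveyard window are placeholders too.

    from datetime import datetime

    # Hypothetical sketch of the "no payment in the last 5 minutes" SMS alert.
    GRAVEYARD_HOURS = range(1, 6)   # assumed graveyard window: 01:00-05:59

    def check_payments(count_recent_transactions, send_sms):
        now = datetime.now()
        if now.hour in GRAVEYARD_HOURS:
            return  # expected lull, don't page anyone
        if count_recent_transactions(minutes=5) == 0:
            send_sms("+91XXXXXXXXXX", "[CRITICAL] No payment transaction in the last 5 minutes")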

4) Set your Priorities “Right”

Each alert and notification type should have a clear-cut segregation of its priority or attention level, i.e. Notice, Warning & Fatal Error.

For New Relic and Sentry alerts we generally add a prefix to the email subject:

  • [WARNING]
  • [CRITICAL]
set-your-priorities-right
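
As a trivial illustration of the idea (the mapping below is just an assumed encoding, not our actual tooling), the prefix can come from a small severity map:

    # Minimal sketch: map alert severity to the subject prefix added to New Relic / Sentry mails.
    SEVERITY_PREFIX = {
        "notice": "[NOTICE]",
        "warning": "[WARNING]",
        "critical": "[CRITICAL]",
    }

    def alert_subject(severity, title):
        return f"{SEVERITY_PREFIX[severity]} {title}"

    print(alert_subject("critical", "Payment success rate dropped below 90%"))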

5) Target the Right Audience

This is one of the easiest and most ignored steps: we tend to send notifications to a mass group audience, and it generally doesn’t work. WHY?

Because NOBODY takes ownership of the alert then; it gets filtered out and later treated as SPAM.

know-your-audience

At Goibibo Tech we generally have a pod structure (each pod mostly consists of 5 members), so we have separate pod email IDs and group email IDs too. This makes life easier for everyone.

For example, To: pod1@yourcompany.com, Cc: team@yourcompany.com

In the same way, you can set up teams in VictorOps for call notifications too.

segregate-the-relevant-audience
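
Here is a small standard-library sketch of that To/Cc routing; the addresses, subject and SMTP host are placeholders.

    import smtplib
    from email.message import EmailMessage

    # Minimal sketch: the pod owns the alert (To), the wider team stays informed (Cc).
    msg = EmailMessage()
    msg["Subject"] = "[WARNING] Error rate above 2% on bookings service"
    msg["From"] = "alerts@yourcompany.com"
    msg["To"] = "pod1@yourcompany.com"
    msg["Cc"] = "team@yourcompany.com"
    msg.set_content("Error rate crossed the warning threshold. Dashboard: <link>")

    with smtplib.SMTP("smtp.yourcompany.com") as smtp:
        smtp.send_message(msg)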

6) Set up Escalation Matrix

For all FATAL or CRITICAL alerts you need to set up a process where you know who will be responsible if an alert goes unattended. This is called the chain of responsibility, or in other terms, the Escalation Matrix.

All the notification aggregators/tools provide you an easy way to set up the escalation matrix; for example, we use VictorOps and PagerDuty for calls & app push notifications. It works like this:

  • Add your team member to the portal
  • Set up the rotation cycle (who will go first, second..)
  • Set up rotation period (for how long a member will be the owner)
  • Decide who will be the next-level owner, all the way up to top-level management.
set-escalation-matrix
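
Just to illustrate the idea (this is not VictorOps/PagerDuty configuration, only a hypothetical model of it), a rotation plus an escalation chain can be thought of like this:

    from datetime import date

    # Hypothetical model of an escalation matrix; names and periods are placeholders.
    ROTATION = ["alice", "bob", "carol", "dave", "eve"]   # pod members, in rotation order
    ROTATION_PERIOD_DAYS = 7                              # each member owns alerts for a week
    ESCALATION_CHAIN = ["on-call", "pod-lead", "engineering-manager"]  # next-level owners

    def current_on_call(today=None):
        today = today or date.today()
        week_index = today.toordinal() // ROTATION_PERIOD_DAYS
        return ROTATION[week_index % len(ROTATION)]

    print("On call this week:", current_on_call())
    print("If unattended, escalate via:", " -> ".join(ESCALATION_CHAIN))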

7) Test before you Rest

Always make sure whatever alert policies you have written and integrated are tested well, be it Slack, SMS, email, push notifications or even calls.

Most of the tools provide a way to do this; New Relic and AWS, for instance, let you send test notifications to your configured channels. Be doubly sure of what you are going to face in the future…

test-alert-policies-with-violation
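
For example, a Slack channel hooked up as a notification target can be smoke-tested with a one-off message. This sketch assumes the requests library and a placeholder incoming-webhook URL.

    import requests

    # Minimal sketch: fire a test message at a Slack incoming webhook to confirm the channel works.
    WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

    resp = requests.post(WEBHOOK_URL, json={"text": "[TEST] Alerting pipeline check, please ignore"})
    resp.raise_for_status()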

8) PUSH vs PULL Notifications

I always believe in a hybrid model of PUSH and PULL notifications.

What that means is: it is humanly impossible for us to keep watching the monitoring tools and capture every signal ourselves. Big organisations have a dedicated NOC team to do exactly that.

Alerting can be manual or automated, but monitoring is a tedious job if done manually. Thus we also require pushed health-check notifications, e.g. a daily transactions mail.

dashboard-monitoring

9) Everybody loves Dashboards

I believe so! Dashboards are an effective way to manage the monitoring of the system. There are a lot of such tools available in the market, and some teams have built their own in-house tools for the same.

Learn how to read the data, read the existing dashboards and draw inferences from them. Copy and contribute to the dashboards. You might need to learn some query languages, but nowadays these are more or less similar to SQL, e.g. NRQL.

we-love-dashboards
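
As a taste of NRQL, here is a rough sketch of the kind of queries you might pin as dashboard widgets; the queries assume New Relic’s standard Transaction events and are illustrative only.

    # Rough sketch: NRQL queries (as strings) of the kind you might put behind dashboard widgets.
    DASHBOARD_WIDGETS = {
        "throughput":  "SELECT count(*) FROM Transaction TIMESERIES 5 minutes SINCE 1 hour ago",
        "error_rate":  "SELECT percentage(count(*), WHERE error IS true) FROM Transaction SINCE 30 minutes ago",
        "p95_latency": "SELECT percentile(duration, 95) FROM Transaction SINCE 1 hour ago TIMESERIES",
    }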

10) Approach to Resolution

Once the dashboards, monitoring, escalation policy and alerts are in place, you need to set up guidelines to perform an RCA (Root Cause Analysis), i.e.

  • Define the problem
  • Collect more info regarding the problem
  • Identify the usual suspects
  • Narrow down the issue to reach RCA
  • Resolution
root-cause-analysis-process

11) Always Document it (Please)

I personally believe you shouldn’t just follow your instincts and set up a process; document it, so that it is helpful for all the existing and new members of your team. It’s a practice, and it comes with patience :)

We at Goibibo Tech have documented almost all the RCAs performed over the past few years. We have documented all the dashboard URLs to look into, along with all the alert policies, alert notification channels and the escalation matrix, in Confluence. What?! It’s an honest thing to do.

documentation-is-must

12) Sharing is Caring :)

We believe in “Each One Teach One”, so if you know the process, the setup, the rules and the quickest way possible to perform an RCA, you haven’t won this race yet; you are not the master yet. If you are able to pass on the same message to at least 2 new members in the next few weeks, then you are moving in the right direction. Pass the message on to more and you will have the whole team educated and contributing to Alerting & Monitoring.

Once you have more members with the same mindset, tools and expertise, your system will have more eyes and ears looking after the so-called Reliability & Scalability of the system.

sharing-is-caring

13) There is no step here… it’s YOU

It is YOU, who else? You need to be self-motivated, always learning and optimistic enough to believe that the system will be kept safe by a few hands like yours. You need to understand the application & system, as well as the alerting and monitoring tools, to be able to pass on the knowledge to others.

It’s you who is at the centre of Alerting & Monitoring. It’s people like you who read this blog to get some ideas to safeguard their systems…

its-you-yes-you

It’s you who still believes that a small drop can cause a tsunami.

Peace Out. Cheers.

Happy Reliability
