General Monitoring and Alerting Requirements

Introduction

Monitoring and Alerting requirements are often bolted onto an application as afterthoughts, which can extend the life-cycle of a project by weeks or even months.

Object Partners, Inc. aims to make Monitoring and Alerting into first class citizens. With a predefined set of expectations around these subjects we are able to rapidly design and deploy new applications and platforms with a pre-included framework for Monitoring and Alerting.

These requirements are intended to be rapidly transformed according to the needs of our clients. Specific implementations are not outlined, only the requirements of monitoring, alerting, and responses to each.

Requirements will be broken down based on hardware and software monitoring.

Monitoring

Implementing monitoring without a well defined plan can quickly result in an overload of largely useless information. Picking a specific set of useful information to monitor will improve the usability of the platform.

Four Golden Signals

The SRE book (see the Bibliography) champions use of the Four Golden Signals. These metrics, taken together, should give an accurate representation of the state of a platform, infrastructure, and all application components. They are as follows:

  • Latency
    • The time it takes to serve a request, whether web or some other application.
    • Latency should always be measured for both successful and failed requests, with the two cases kept separate.
  • Traffic
    • The measure of demand on a system.
    • This value can vary in definition. For a web site, it may be requests per second; for a database, it may be queries performed per second.
    • Individual services should all have meaningful measurements of Traffic defined.
  • Errors
    • How often does the system or application fail?
    • What are the causes/nature of the errors?
    • Failures can be either:
      • Explicit: such as a Core Dump or a 500 error
      • Implicit: a query returns the wrong data, or a website returns a 200 OK but with an error result.
  • Saturation
    • Measures both:
      • Current utilization: such as CPU load or Memory Usage
      • Predicted utilization: your battery will die in 20 minutes
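Taken together, the four signals amount to a small record per service. A minimal sketch in Python follows; the field names and sample values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    """One monitoring snapshot for a service (field names are illustrative)."""
    latency_ok_ms: float   # time to serve a successful request
    latency_err_ms: float  # time to serve a failed request, tracked separately
    traffic_rps: float     # demand on the system, e.g. requests per second
    error_rate: float      # fraction of requests that failed, 0.0 to 1.0
    saturation: float      # current utilization, e.g. CPU load, 0.0 to 1.0

snapshot = GoldenSignals(latency_ok_ms=42.0, latency_err_ms=120.0,
                         traffic_rps=350.0, error_rate=0.002, saturation=0.61)
print(snapshot.error_rate)
```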

Service Level Agreements, Objectives, and Indicators

It’s important to take a moment to define these Service Level targets. In the industry, we have a tendency to inaccurately throw around the term “SLA”. We often use it as follows:

  • The site went down and we missed our SLA.
  • DB queries are failing, we’re missing our SLA.

These examples may sound accurate, but they rely on a loose definition of the term “SLA”. To be fair, both items may cause a breach of the Service Level Agreement, but both problems are better solved by looking specifically at the issues involved than at the overall SLA.

Confusing? Understandably. Let’s look at specific definitions, then revise these statements.

  • Service Level Indicators
    • A SLI is a specific value to be monitored. For example:
      • “CPU Load” is a SLI of a Saturation metric
      • “Percent of Successful HTTP Requests” is a SLI for a web page.
  • Service Level Objectives
    • A SLO is an objective that we expect our SLIs to meet. For the two examples above, we might set SLOs as such:
      • “CPU Load” is under 70% for 95% of all running time.
      • “Percent of Successful HTTP Requests” is greater than 95% on a weekly basis.
  • Service Level Agreements
    • A SLA is usually arrived at as a collection of SLOs to meet. In the case of our two example SLIs and SLOs, we might roll them into one agreement with the customer:
      • The SLA would say that SLOs for “CPU Load” and “Percent of Successful HTTP Requests” are both met for 99% of the year.
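An SLO of the “X is under some limit for Y% of the time” form lends itself to a mechanical check. A minimal sketch, assuming minute-by-minute CPU load samples (the sample data is made up):

```python
def slo_met(samples, threshold, required_fraction):
    """Check an SLO of the form: the SLI is under `threshold`
    for at least `required_fraction` of the observed samples."""
    within = sum(1 for s in samples if s < threshold)
    return within / len(samples) >= required_fraction

# 100 hypothetical CPU-load samples: 96 healthy, 4 above the limit.
loads = [0.50] * 96 + [0.85] * 4

# 'CPU Load is under 70% for 95% of all running time'
print(slo_met(loads, threshold=0.70, required_fraction=0.95))  # True
```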

Stringent use of this nomenclature facilitates a more rapid response from technical personnel, each in charge of their own SLOs. It also helps in writing more meaningful monitoring solutions and SLAs.

We can now revise the SLA statements made at the beginning of this section:

  • The site went down and we missed the SLO for 95% daily uptime.
  • The DB went down and our daily SLO of 99% Successful Queries was missed.

The distinction is subtle, but the extra definition around SLI, SLO, and SLA will allow us to better serve our clients and create monitoring solutions as a pre-arranged facet of the project.

Indicators as Percentile Values

Whenever possible, transform SLIs into percentile values. As an indicator, a percentile value can provide very meaningful data about a system. It also allows a finer grained grading of the system once translated to Service Level Objectives.

This method requires the transformation of raw metrics followed by the aggregation of that data. For example:

  • With every HTTP Request, we measure the latency.
  • Latency, per request, is then transformed to tell us what percentile the single request belongs in.
  • This will provide us a table, as in the following example:
  Request number   Latency   Percentile
  1                20 ms     30
  2                1 s       60
  3                4 s       95

Given a similar dataset, a team may set multiple Service Level Objectives, as follows:

  • SLO: 95% of requests should see latency under 3 seconds
  • SLO: 60% of requests should see latency under 1.5 seconds
  • SLO: 30% of requests should see latency under 50 milliseconds

Comparing the above Objectives to the table of requests, we can see that requests 1 and 2 meet the Objectives (20 ms when we want <50 ms, 1 second when we want <1.5), but request number 3 fails (4 seconds, rather than <3 seconds).

Knowing this, we can focus on the group of failing requests to ensure that we can meet SLO 1 in the future.

Aggregating Indicator data into percentiles whenever possible allows the stratification of Objectives, in turn allowing tighter direction for troubleshooting and future engineering.
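The stratified check described above can be sketched as follows; the latency values are hypothetical:

```python
def fraction_under(latencies_ms, limit_ms):
    """Fraction of requests whose latency is under limit_ms."""
    return sum(1 for l in latencies_ms if l < limit_ms) / len(latencies_ms)

# Hypothetical request latencies, in milliseconds.
latencies = [20, 1000, 4000, 30, 900, 40, 1200, 25, 2800, 35]

# The three stratified SLOs from the text: (limit in ms, required fraction).
slos = [(3000, 0.95), (1500, 0.60), (50, 0.30)]
for limit, required in slos:
    ok = fraction_under(latencies, limit) >= required
    print(f"under {limit} ms: required {required:.0%}, met={ok}")
# Here only 9/10 = 90% of requests beat 3 s, so the 95% Objective is missed,
# while the 60% and 30% Objectives are both met.
```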

How Many 9’s Would You Like?

Think hard before promising 5 9s of reliability on a service (that would be 99.999%: 5 9s, see?). It’s usually an unrealistic target, and a couple of things need to be taken into account:

  • There are 31,536,000 seconds in a year.

Five 9s of reliability allows roughly 315 seconds of failure in a year: a little over 5 minutes. This may sound great to an end user, but it can be extremely difficult to achieve. Error Budgets (a subject for another paper) should allow for more failures than this if a rapid response to business needs is desired of developers.

  • What kind of service is promised by your:
    • Cloud provider?
    • Internet provider?
    • Other service providers?

These questions should be carefully considered. It’s unreasonable to promise better reliability than the platform, especially after Error Budgets are accounted for.

Consider what a user experiences on error and how disruptive it may be. How quickly can support personnel respond? Does the user merely need to resubmit their web-form in order to resolve the issue or see a different response?

Take these questions into account and try to resist promising too much uptime. Google, for example, tends towards 99.5% or 99.95% (two and a half or three and a half 9s) for most of their services (exceptions exist).
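The arithmetic behind “how many 9s” is simple enough to automate. A small sketch (the function name is our own):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def allowed_downtime_seconds(nines: float) -> float:
    """Seconds of downtime per year permitted by an availability target
    expressed in 'nines' (e.g. 5 -> 99.999%)."""
    availability = 1 - 10 ** (-nines)
    return SECONDS_PER_YEAR * (1 - availability)

print(round(allowed_downtime_seconds(5)))  # 315 seconds: a little over 5 minutes
print(round(allowed_downtime_seconds(3)))  # 31536 seconds: nearly 9 hours
```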

Alerting

With the collection of indicator data, it becomes possible to find values for automated responses and alerting. Alerting services are legion, with the simplest being “Let’s email an on-call group.” In fact, alerting should remain simple and to the point.

Two Types of Alerts

There are two common events that may be sent as alerts: Warnings and Breaches:

  • Warning
    • A Warning may be issued when values for a SLI begin to encroach on the limits of a SLO. Sending a Warning allows support personnel to respond proactively.
  • Breach
    • Once an Objective is missed, a Breach Alert should be sent to the supporting personnel. Breach Alerts indicate that something has gone very wrong and that the situation needs immediate rectification.

Of the two types of Alerts, it’s much preferable to receive a Warning. Warnings allow proactive solutions to be implemented, while a Breach is strictly reactive.
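The Warning/Breach distinction can be sketched as a simple threshold check. The 90% warning margin here is an assumed knob, not a prescribed value:

```python
def classify(sli_value, slo_limit, warning_margin=0.9):
    """Classify a lower-is-better SLI reading against its SLO limit.
    Readings past warning_margin * limit trigger a Warning so support
    personnel can act before the Objective is actually missed."""
    if sli_value >= slo_limit:
        return "breach"
    if sli_value >= slo_limit * warning_margin:
        return "warning"
    return "ok"

print(classify(250, slo_limit=300))  # ok
print(classify(280, slo_limit=300))  # warning: past 90% of the limit
print(classify(320, slo_limit=300))  # breach
```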

Contents of an Alert

In order to be useful, an alert needs to have the correct amount and type of information in it. A short guide follows, but also consider rolling a Toil Report (see: another paper) into the alert. Proper monitoring should allow us to immediately identify toilsome tasks that drain the energy and hours from support teams.

Be sure to include the following information, but be careful about including too much information (see Over-Alerting, below; the same principles apply):

  • Warning or Error
    • It should be immediately evident what kind of event is being alerted on.
  • A unique identifier for the alert event
    • A UUID will suffice.
  • Objective that is failing
    • Tell the supporting personnel which SLO is failing or in danger of failing.
  • Indicator metrics in support of the event.
    • Include metrics that show the failure, broken down as finely as possible.
  • Log snippets
    • If log snippets are available, they can be included or attached.
  • The IDs of past alerting events affecting the same SLO
    • If the SLO has a history of prior failures, provide the IDs of each event.
  • Past solutions to similar issues
    • Objectives, Indicators, and Log Snippets can be compared, via automation, to past alerts. If a common solution has been used, include it here.

If an alert is determined to have past solutions or a history of similar alerts, a toil report should also be filed by the alerting system.
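The checklist above can be sketched as a builder function; the dictionary layout is illustrative, not a prescribed schema:

```python
import uuid

def build_alert(kind, slo, metrics,
                log_snippets=(), prior_event_ids=(), past_solutions=()):
    """Assemble the alert fields listed above into one payload."""
    assert kind in ("warning", "breach")
    return {
        "kind": kind,                    # Warning or Breach, immediately evident
        "event_id": str(uuid.uuid4()),   # unique identifier for this event
        "failing_slo": slo,              # which Objective is (nearly) missed
        "metrics": dict(metrics),        # indicator data in support of the event
        "log_snippets": list(log_snippets),
        "prior_event_ids": list(prior_event_ids),
        "past_solutions": list(past_solutions),
    }

alert = build_alert("warning",
                    slo="95% of requests < 450 ms",
                    metrics={"p95_latency_ms": 430})
print(alert["kind"], alert["failing_slo"])
```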

Over-Alerting

At all costs, avoid over-alerting. An inbox or pager that fills up with useless noise will make it extremely difficult for diligent on-call personnel to distinguish between actionable signals and pointless noise.

In fact, it’s not uncommon for on-call personnel, in the face of over-alerting, to miss important signals. This is a sure-fire way to introduce instability into a product.

Common Services, Suggested Indicators and Objectives

What follows is a list of common services. For each, suggested indicators and objectives allow the rapid definition of monitoring requirements at the inception of a project.

Note that not all the golden signals are available in all cases. In particular, Saturation metrics tend to indicate the health of the hardware or VM rather than that of the software.

Furthermore, transforming SLIs into SLOs often requires manipulation of the raw SLI data. Whenever possible, try to store data as a Time-Series to allow a deep analysis of these records.

A Note About Ping Times and Latency

These values are based on requests passing through a Cable Modem and travelling 1500 physical miles, end to end (see Bibliography [3]). All Pings and Latency SLOs should be reconsidered based on the actual circumstances of the service.

HTTP Based Web Service

This could be a web site, an API service, etc.: anything that receives an HTTP request and returns an HTTP response. Capture the logs of the HTTP servers and process them through log aggregators and analyzers.

Although it’s possible for Content Distribution Networks to serve non-HTTP traffic, the CDN still gets lumped under this heading.

Indicators

  • Latency
    • Individual OK Request Latency
      • This value represents the latency of a single request.
    • Aggregate OK Request Latency Percentiles
      • Establish one of these per SLO. For example, if a SLO states ‘Latency of successful requests should be less than 3 seconds 70% of the time’, then the Aggregate OK Request Latency table should have an entry for the 70th percentile on it for easy visibility:

  Latency Percentile (200 OK)   Time value
  70th                          2.7 seconds
  • Individual Failed Request Latency
  • Aggregate Failed Request Latency Percentiles
  • Traffic
    • Total number of requests served
      • Per Minute
      • Per Hour
      • Per Day
    • Total number of requests ok
      • Per minute
      • Per Hour
      • Per Day
    • Percentage of requests ok
    • Total number of requests failed
    • Percentage of requests failed
  • Errors
    • Individual Errors, by HTTP Code (500, 404, etc)
      • These should be linkable to specific log snippets where possible
    • Individual Errors, other than HTTP error codes
      • This may happen when a user reports misbehavior of a web page or service that otherwise reports itself to be okay.
      • Logged information about that request should also be available, when possible.
    • Downtime, in seconds
    • Downtime, percentage by hour
    • Downtime, percentage by day
    • Downtime, percentage by week
    • Downtime, percentage by month
    • Downtime, percentage by year

Objectives

  • Aggregate OK Request Latency Percentiles
    • 95% < 450 ms
    • 90% < 300 ms
    • 50% < 100 ms
  • Aggregate failed Request Latency Percentiles
    • 95% < 475 ms
    • 90% < 325 ms
    • 50% < 125 ms
  • Total Percentage of Requests Failed < .1
    • This assumes 3 9’s of uptime and service
  • Percentage downtime (all values assume 3 9’s of uptime):
    • Hourly: < .1
    • Daily: < .1
    • Weekly: < .1
    • Monthly: < .1
    • Yearly: < .1
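Checking the failed-request Objective is a one-liner worth spelling out; the traffic numbers below are made up:

```python
def failed_percentage(total, failed):
    """Percentage of requests that failed over a window."""
    return 100.0 * failed / total

# A hypothetical hour of traffic: 120,000 requests, 90 failures.
pct = failed_percentage(120_000, 90)
print(f"{pct:.3f}%", "meets < .1%" if pct < 0.1 else "misses the Objective")
# 90 / 120,000 = 0.075%, which meets the < .1% Objective.
```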

Bare Metal Servers (General)

Bare-metal servers possess a set of common Saturation SLIs to watch. In fact, the bulk of these SLIs are Saturation metrics.

To measure Latency, a central monitoring machine can be made to ping and record the response time to each server.

Errors at the OS level are usually written to a local system log file, which can be aggregated and scraped to surface errors.

In many cases, specifically targetable statistics are available from the equipment manufacturer (such as target CPU temperature).
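The central ping recorder described above might be sketched as follows. The probe is stubbed out so the example stays self-contained (a real deployment would shell out to ping or use an ICMP library), and the hostname is made up:

```python
import time

def record_ping(server, probe=None):
    """Record one latency sample for `server` as (timestamp, ms or None).
    `probe` is an injected round-trip-time function; the default below is
    a stand-in measurement, not a real API."""
    probe = probe or (lambda host: 23.0)   # stubbed RTT in milliseconds
    try:
        return (time.time(), probe(server))
    except OSError:
        return (time.time(), None)         # a None sample counts toward downtime

ts, ms = record_ping("db-01.example.internal")
print(ms)
```

Run once per minute per server, the None samples double as the per-minute Downtime indicator mentioned above.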

Indicators

  • Latency
    • Ping time, 1/minute/server
      • This value can also provide Downtime by the minute
  • Errors: typically these are fixed as identified.
    • Identified System Error Frequency
      • Per minute
      • Per hour
      • Per day
      • Per week
      • Per month
      • Per year
  • Saturation
    • CPU Load
      • Reported by minute
    • Memory Usage
      • Reported by minute
    • CPU Temp
      • Reported by minute
    • Disk I/O
      • Reported by minute
    • Disk Availability Percentage
      • Reported by minute
    • Network Utilization (bandwidth) (see Bibliography: [5])
      • Reported by minute

Objectives

  • Ping time (See: Bibliography, [3].)
    • 50th percentile < 50 ms
    • 90th percentile < 100 ms
    • 99th percentile < 300 ms
  • Total Server Downtime
    • Less than .1% of an hour
    • Less than .01% of a day
    • Less than .01% of a week
    • Less than .01% of a month
    • Less than .01% of a year
  • CPU Load
    • 95th Percentile < 90% Load
    • 75th Percentile < 70% Load
    • 50th Percentile < 60% Load
  • Memory Usage
    • 95th Percentile < 90% Usage
    • 75th Percentile < 80% Usage
    • 50th Percentile < 70% Usage
  • CPU Temp (check hardware specs and recommendations from the manufacturer)
    • 95th Percentile < z degrees C
    • 75th Percentile < y degrees C
    • 50th Percentile < x degrees C
  • Disk I/O
    • 95th Percentile < 30 ms
    • 75th Percentile < 15 ms
    • 50th Percentile < 8 ms
  • Disk Availability Percentage > 20% at all times
  • Network Utilization < 75% at all times
  • System Errors should be fixed with a response time according to frequency of occurrence (once identified)
    • Frequency: daily or less – respond within 1 day
    • Frequency: weekly – respond within 1 week
    • Frequency: monthly or more – respond within 1 month

VM Instances (General)

Monitoring a VM is very similar to monitoring bare metal, with the exception of CPU Temperature. See the notes on Bare-Metal Servers, particularly regarding ping time.

Indicators

  • Latency
    • Ping time, 1/minute/server
      • This value can also provide Downtime by the minute
  • Errors: typically these are fixed as identified.
    • Identified System Error Frequency
      • Per minute
      • Per hour
      • Per day
      • Per week
      • Per month
      • Per year
  • Saturation
    • CPU Load
      • Reported by minute
    • Memory Usage
      • Reported by minute
    • Disk I/O
      • Reported by minute
    • Disk Availability Percentage
      • Reported by minute
    • Network I/O
      • Reported by minute
    • Network bandwidth usage (See Bibliography: [5])
      • Reported by minute

Objectives

  • Ping time
    • 50th percentile < 50 ms
    • 90th percentile < 200 ms
    • 99th percentile < 500 ms
  • Total Server Downtime
    • Less than .1% of an hour
    • Less than .01% of a day
    • Less than .01% of a week
    • Less than .01% of a month
    • Less than .01% of a year
  • CPU Load
    • 95th Percentile < 90% Load
    • 75th Percentile < 70% Load
    • 50th Percentile < 60% Load
  • Memory Usage
    • 95th Percentile < 90% Usage
    • 75th Percentile < 80% Usage
    • 50th Percentile < 70% Usage
  • Disk I/O
    • 95th Percentile < 30 ms
    • 75th Percentile < 15 ms
    • 50th Percentile < 8 ms
  • Disk Availability Percentage > 20% at all times
  • Network bandwidth usage < 75% at all times
  • System Errors should be fixed with a response time according to frequency of occurrence (once identified)
    • Frequency: daily or less – respond within 1 day
    • Frequency: weekly – respond within 1 week
    • Frequency: monthly or more – respond within 1 month

Database Services

These recommendations are generic for a broad array of Databases, both NoSQL and SQL. Saturation Indicators and Objectives will come from the underlying VM or Bare Metal server.

Indicators

  • Latency
    • Individual OK Query Latency
    • Aggregate OK Query Latency Percentiles
      • Establish one of these per Objective. For example, if an Objective states ‘Latency of successful Queries should be less than 3 seconds 70% of the time’, then the Aggregate OK Query Latency table should have an entry for the 70th percentile on it for easy visibility:

  Latency Percentile (OK)   Time value
  70th                      2.7 seconds
  • Individual Error Query Latency
  • Aggregate Error Query Latency Percentiles
  • Individual OK CRUD Latency
  • Aggregate OK CRUD Latency Percentiles
    • Establish one of these per Objective. For example, if an Objective states ‘Latency of successful CRUD statements should be less than 3 seconds 70% of the time’, then the Aggregate OK CRUD Latency table should have an entry for the 70th percentile on it for easy visibility:

  Latency Percentile (OK)   Time value
  70th                      2.7 seconds
  • Individual Error CRUD Latency
  • Aggregate Error CRUD Latency Percentiles
  • Traffic
    • Total number of Queries served
      • Per Minute
      • Per Hour
      • Per Day
    • Total number of Queries ok
      • Per minute
      • Per Hour
      • Per Day
    • Percentage of Queries ok
    • Total number of Queries failed
    • Percentage of Queries failed
    • Total number of CRUD served
      • Per Minute
      • Per Hour
      • Per Day
    • Total number of CRUD ok
      • Per minute
      • Per Hour
      • Per Day
    • Percentage of CRUD ok
    • Total number of CRUD failed
    • Percentage of CRUD failed
    • Downtime, in seconds
    • Downtime, percentage by hour
    • Downtime, percentage by day
    • Downtime, percentage by week
    • Downtime, percentage by month
    • Downtime, percentage by year
  • Errors
    • Individual Error Query Statement
      • For example: select * from foo ;
      • Group these by the Error Code from the db where possible.
    • Individual Error CRUD Statement
      • For example: insert into foo values (a, b, c);
      • Group these by the Error Code from the db where possible

Objectives

  • Aggregate OK Query Latency Percentiles
    • 95% < 450 ms
    • 90% < 300 ms
    • 50% < 100 ms
  • Aggregate failed Query Latency Percentiles
    • 95% < 475 ms
    • 90% < 325 ms
    • 50% < 125 ms
  • Total Percentage of Queries Failed < .1
    • This assumes 3 9’s of uptime and service
  • Aggregate OK CRUD Latency Percentiles
    • 95% < 450 ms
    • 90% < 300 ms
    • 50% < 100 ms
  • Aggregate failed CRUD Latency Percentiles
    • 95% < 475 ms
    • 90% < 325 ms
    • 50% < 125 ms
  • Total Percentage of CRUD Failed < .1
    • This assumes 3 9’s of uptime and service
  • Percentage downtime (all values assume 3 9’s of uptime):
    • Hourly: < .1
    • Daily: < .1
    • Weekly: < .1
    • Monthly: < .1
    • Yearly: < .1

Messaging Queues

Messaging Queues provide the backbone for Service Oriented Architecture and asynchronous software suites (such as OpenStack). One of the largest gotchas witnessed with messaging queues is the tendency for subscribers to pick up a new queue while publishers continue publishing to the old one (or vice-versa).

Saturation metrics will usually come from the underlying OS (whether hardware or VM).

Indicators

  • Latency
    • Individual OK Request Latency
    • Aggregate OK Request Latency Percentiles
      • Establish one of these per Objective. For example, if an Objective states ‘Latency of successful requests should be less than 3 seconds 70% of the time’, then the Aggregate OK Request Latency table should have an entry for the 70th percentile on it for easy visibility:

  Latency Percentile (OK)   Time value
  70th                      2.7 seconds
  • Individual Error Request Latency
  • Aggregate Error Request Latency Percentiles
  • Traffic
    • Requests served per second
    • Failed Requests per second
    • Length of individual queues, taken once per minute
    • Individual Queues with active publishers (when possible)
    • Individual Queues with active subscribers (when possible)
  • Errors
    • Individual Failed Request details
    • Percentage of Errors by total number of requests
      • Turn these into percentile tables

Objectives

  • Aggregate OK Request Latency Percentiles
    • 95% < 450 ms
    • 90% < 300 ms
    • 50% < 100 ms
  • Aggregate failed Request Latency Percentiles
    • 95% < 475 ms
    • 90% < 325 ms
    • 50% < 125 ms
  • Percentage of successful requests
    • 99.9% succeed
  • Growth rate of queues (percentile)
    • 99% < 50% growth
    • 70% < 25% growth
    • 50% <= 0% growth
  • Any queue that grows for X hours+ needs to throw an alert, based on the run times of supporting programs:
    • Daily: X = 24
    • Hourly: X = 1
    • Continually: X = 1
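The queue-growth alert condition above can be sketched as a check over hourly depth samples; the depths below are hypothetical:

```python
def growing_for(lengths, hours):
    """True if the queue length strictly increased across the last `hours`
    consecutive hourly samples: the condition that should throw an alert."""
    if len(lengths) < hours + 1:
        return False
    tail = lengths[-(hours + 1):]
    return all(b > a for a, b in zip(tail, tail[1:]))

hourly_depth = [10, 12, 9, 14, 20, 31, 55]  # hypothetical queue depths
print(growing_for(hourly_depth, hours=3))   # True: grew 14 -> 20 -> 31 -> 55
```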

Networking

This category includes SDNs, Load Balancers, Switches, Routers, other networking hardware, and more.

Simple Network Management Protocol (SNMP) (see Bibliography: [6]) is important in network monitoring and can assist in providing details about throughput, latency, errors, and saturation.

Hardware and SDN capacity can vary, so OEM specifications should always be consulted when fleshing out values for objectives.

Strive to include a method of visualization. Having the ability to create a visual map of your network is, itself, a viable SLO.

Indicators

  • Latency – can be measured across the system with a small amount of infrastructure designed and implemented to span network resources with a ping. Use a traceroute to verify, if necessary.
    • Equipment Spanning Ping Times, recorded every minute
  • Traffic
    • Network Tomography: traffic across key links, in bandwidth (bits per second)
  • Errors
    • Dropped packets per minute
    • Dropped packets per hour
    • Dropped packets per day
    • Timeouts, by location, per minute
    • Timeouts, by location, per hour
    • Timeouts, by location, per day
    • Link Downtime, in seconds
  • Saturation
    • Link %Used (by percent of OEM stated capacity)

Objectives

  • Latency: (essentially just ping times)
    • 95th Percentile < 250 ms
    • 75th Percentile < 100 ms
    • 50th Percentile < 50 ms
  • Uptime:
    • Hourly >= 99.9%
    • Daily >= 99.9%
    • Monthly >= 99.9%
    • Yearly >= 99.9%
  • Link Capacity:
    • 95th Percentile < 95%
    • 75th Percentile < 80%
    • 50th Percentile < 60%
  • Dropped Packets
    • 95th Percentile < 4%
    • 60th Percentile < 2%
  • Have a network map

Storage Solutions (Hardware and Virtual)

Monitoring storage devices requires knowledge of the environment and the stored data.

There are storage solutions that advertise lofty levels of IOPS, but in order to utilize all those IOPS, the local hardware needs fast access to the data. In other words, if we have an HTTP service that stores large amounts of data, disk-write IOPS are likely to be dictated by the latency of the HTTP request to the storage device rather than by the hardware itself.

How we mitigate this problem is a topic beyond the scope of this paper. However, monitoring can be configured to catch the problem whether or not we witness it directly. Be sure to measure the latency of all network traffic to the storage endpoint. In the case of an HTTP data-writing service, our standard HTTP monitoring should also be considered an important part of storage solution monitoring.

Advertised OEM specifications should state the expected IOPS, as well as the minimums and maximums.
(See also: Bibliography [4])

Indicators

  • Latency
    • Disk IOPS, sampled at the minimum measurable interval or 20 seconds, whichever is longer
    • Average IOPs
      • Hourly
      • Daily
      • Weekly
      • Monthly
      • Yearly
  • Traffic
    • Amount of data written (people will ask)
      • By minute
      • By hour
      • By day
      • By week
      • By month
      • By year
    • Amount of data read
      • By minute
      • By hour
      • By day
      • By week
      • By month
      • By year
  • Errors
    • All data write errors should be reported
    • Frequency of error-types should be reported
    • High volumes of errors, whether individual error types or multiple, must be alerted
  • Saturation
    • Negative or positive sigma (σ): the number of standard deviations from expected IOPS to actual IOPS.
    • Network device percentage of capacity

Objectives

  • σ ≈ 0
    • High positive or negative values of σ tell us the actual IOPS are not tracking the target IOPS.
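One way to compute σ is against a baseline. Note that this sketch estimates the expected IOPS and their spread from recent historical samples, which is an assumption on our part; an OEM-stated target could be substituted:

```python
from statistics import mean, stdev

def iops_sigma(history, current):
    """Number of standard deviations the current IOPS reading sits from
    the historical mean: the sigma Saturation indicator described above."""
    return (current - mean(history)) / stdev(history)

history = [5000, 5100, 4900, 5050, 4950]   # hypothetical steady-state IOPS
print(round(iops_sigma(history, 5020), 2))  # near 0: tracking expectations
print(round(iops_sigma(history, 3500), 2))  # large negative value: investigate
```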

Cloud, Container Farm, or PaaS

There should be no components of a Cloud, Container Farm, or PaaS that are not monitored under other services, but a network map and visualization of the overall Platform should be available at all times.

Objectives

  • Always have data related to:
    • VM or Container inventory
    • Service inventory
    • Network equipment inventory.
    • Network map

Nested Services (General)

Nested services get their own segment solely to stress the importance of Traceability. When a nested service is monitored, it will almost always be monitored according to what type of service it is. However, if an API function makes calls to other API functions, which in turn call databases or devices or whatever, it’s important to be able to trace each call from beginning to end.

The advice below is a minimum set of requirements. There are a number of products that will assist in gathering useful information about nested calls which go well above and beyond these simple requirements.
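At its simplest, traceability means generating a GUID at the edge and propagating it unchanged through every nested call. A minimal sketch, with illustrative names:

```python
import uuid

def handle(request, downstream_calls):
    """Attach a trace id at the edge and pass it to every nested call so
    the full chain can be reconstructed from logs."""
    trace_id = request.get("trace_id") or str(uuid.uuid4())
    request["trace_id"] = trace_id
    for call in downstream_calls:
        call({"trace_id": trace_id})   # propagate; never regenerate downstream
    return trace_id

seen = []
tid = handle({}, [lambda req: seen.append(req["trace_id"])] * 2)
print(all(s == tid for s in seen))  # True: one id spans the whole chain
```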

Indicators

  • Traffic
    • Every call receives a GUID to enable tracing from beginning to end.

Objectives

  • Every nested call must be Traceable.

Everything

This section breaks the pattern but provides some non-functional requirements which should be considered immediately upon the inception of each project.

  • All metrics must be archived according to the retention policies of the company and project being supported.
  • Metrics should always be gathered at a central location before final processing. All metrics should be decoupled from the system they are monitoring. In other words, changes to the monitoring system should not interfere with the supported project any more than absolutely necessary. Metrics collection, aggregation, and processing should always occur in separate locations.
  • Error reports should be sent out daily. Many of the captured Error metrics have no clear SLO to accompany them, but all errors should be prioritized and dealt with.

 

Bibliography

While OPI consultants represent an enormous body of knowledge, other sources of authority should be recognized.


Also published on Medium.
