Implementing Observability Part I: SLOs

That which you cannot measure, you cannot improve.

This is a simplified guide to get you started with observability. In this first post I explain what SLOs are. The post is essentially a cherry-picked outline of various SLO chapters from the SRE books by Google.

Service level objectives (SLOs) specify a target level for the reliability of your service. SLOs depend on well-implemented service level indicators (SLIs), which tell you the current state of your service.

A good example of an SLI is the number of good requests divided by the total number of requests, for example successful HTTP requests / total HTTP requests.
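As a minimal sketch of that ratio in Python (the function name, the sample counts, and the choice to treat zero traffic as 100% are my own illustrative assumptions, not something from the SRE books):

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were good, between 0.0 and 1.0."""
    if total_requests == 0:
        # Assumption: with no traffic nothing failed, so report 1.0.
        return 1.0
    return good_requests / total_requests


sli = availability_sli(good_requests=9_985, total_requests=10_000)
print(f"Availability SLI: {sli:.2%}")
```

Keeping the SLI as a plain good/total ratio makes it easy to compare directly against an SLO target later on.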

Once you have implemented SLIs for the relevant services and metrics, you need to ascertain what a good SLO would be for that service.

The SLO determines how many failed or degraded requests your users are willing to tolerate. This tolerance is the error budget: whatever is left between your SLO target and 100%.

Having an error budget that’s agreed by all stakeholders is important because it helps you determine the priorities of your engineering team. For example, should they spend their time improving the reliability of the system, or on new features?

The closer you get to running out of your error budget, the more you should prioritise reliability work so that you keep meeting your SLOs and keep your users happy.
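To make the error budget concrete, here is a small sketch of how you might track how much of it is left. The function name and the numbers are hypothetical, and it assumes a simple request-based SLI:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Return the unspent share of the error budget (negative once overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    allowed_failures = (1 - slo_target) * total  # the error budget, in requests
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures


remaining = error_budget_remaining(slo_target=0.999, good=999_400, total=1_000_000)
print(f"{remaining:.0%} of the error budget is left")
```

With a 99.9% target, a million requests allow about 1,000 failures, so 600 failures leave roughly 40% of the budget. A negative result means the budget is overspent.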

SLOs and SLIs are also instrumental in measuring all the meaningful parts of your system: anything that cannot be measured cannot be improved. SLOs are also essential for defining meaningful alerts, which is why I start by defining them here.

Feasible Reliability Target and Error Budget

As long as your service performs at or above its SLO targets, your customers should be happy with its responsiveness.

That sounds great, but happiness is an elusive concept to measure, especially at the beginning. So why not aim for a 100% SLO?

  • Given all the components between you and the user and your lack of control over all of them, reaching 100% availability is impossible.
  • The cost of maintaining a near 100% availability is often extremely high and the added value of an extra 0.0001% will go unnoticed.
  • 100% availability means that you do not have an error budget to make improvements and release new features comfortably, and as a consequence your creative output will stagnate. Ironically, systems that try hard to aim for 100% end up being less reliable due to lack of innovation.
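A quick back-of-the-envelope sketch shows how fast the room for error shrinks as you add nines. The 30-day window and the function name are assumptions I picked for illustration:

```python
WINDOW_MINUTES = 30 * 24 * 60  # assumption: a 30-day measurement window


def allowed_downtime_minutes(target: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Return how many minutes of full downtime an availability target permits."""
    return (1 - target) * window_minutes


for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} target -> {minutes:.1f} minutes of downtime per 30 days")
```

A 99.9% target leaves about 43 minutes of downtime in 30 days, while 99.999% leaves under half a minute, which is rarely enough time for a human to even react to a page.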

Implementing SLIs

Before you can think of implementing SLOs, you need to ascertain what’s best to measure by implementing SLIs.

SLI Sources

  • Application server logs
  • Load balancer monitoring
  • Black-box monitoring
  • Client-side instrumentation
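As a rough sketch of the first two sources, here is how you might derive an availability SLI from access log lines. The log format (method, path, status, latency in ms) is entirely hypothetical, and treating only 5xx responses as bad is an assumption you should adapt to your service:

```python
def sli_from_access_log(lines):
    """Return good/total for lines shaped like 'GET /cart 200 35', or None if empty."""
    good = total = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 4 or not parts[2].isdigit():
            continue  # skip malformed lines rather than guess their meaning
        total += 1
        if int(parts[2]) < 500:  # assumption: only 5xx counts against the service
            good += 1
    return good / total if total else None


sample = [
    "GET /cart 200 35",
    "POST /checkout 503 1200",
    "GET /cart 404 12",
    "garbage line",
]
print(sli_from_access_log(sample))
```

Note that server-side logs never see requests that failed before reaching you, which is one reason black-box monitoring and client-side instrumentation are on the list too.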

Defining SLIs

  • Choose one application you want to define SLOs for. You can add more later.
  • For your first SLIs, choose something that requires minimal engineering work.
  • Follow a consistent style when implementing SLIs. Whether you use ratios or percentages, it’s better if all your metrics share the same approach.
  • Define clearly who the users are in this situation, “users” can be the next step in the value stream — the dependent microservice or team.
  • Consider the common ways your users interact with your application and measure those.
  • Draw a high-level architecture diagram of your system; show the key components, the request flow, the data flow, and the critical dependencies

Your SLO policy

Once your SLIs are implemented and you have enough data to define some provisional SLOs, you should begin formalising this procedure with all stakeholders.

Getting Stakeholder Agreement

If you do not have the business on your side when you implement SLOs, it will be very difficult to define and maintain them accurately.

  • The product managers have to agree that this threshold is good enough for users — performance below this value is unacceptably low and worth spending engineering time to fix.
  • The product developers need to agree that if the error budget has been exhausted, they will take some steps to reduce risk to users until the service is back in budget.
  • The team responsible for the production environment, who are tasked with defending this SLO, have to agree that it is defensible without Herculean effort, excessive toil, and burnout, all of which are damaging to the long-term health of the team and service.

SLO Prerequisites

  • If the SREs feel that the SLO is not defensible without undue amounts of toil, they can make a case for relaxing some of the objectives.
  • If the development team and product manager feel that the increased resources they’ll have to devote to fixing reliability will cause feature release velocity to fall below acceptable levels, then they can also argue for relaxing objectives. Remember that lowering the SLOs also lowers the number of situations to which the SREs will respond; the product manager needs to understand this tradeoff.
  • If the product manager feels that the SLO will result in a bad experience for a significant number of users before the error budget policy prompts anyone to address an issue, the SLOs are likely not tight enough.

Prerequisites for the SLO Policy

  • There are SLOs that all stakeholders in the company have approved as being fit for the product.
  • The people responsible for achieving this SLO have agreed that it is possible to meet this SLO under normal circumstances.
  • The organisation has committed to using the error budget for decision making and prioritising. This is formalised with an error budget policy.
  • There is a process in place for refining the SLO.

The SLO Policy Document

  • The authors of the SLO, the reviewers (who checked it for technical accuracy), and the approvers (who made the business decision about whether it is the right SLO).
  • The date on which it was approved, and the date when it should next be reviewed.
  • A brief description of the service to give the reader context.
  • The details of the SLO: the objectives and the SLI implementations.
  • The details of how the error budget is calculated and consumed.
  • The rationale behind the numbers, and whether they were derived from experimental or observational data. Even if the SLOs are totally ad hoc, this fact should be documented so that future engineers reading the document don’t make bad decisions based upon ad hoc data.

While the product is still young, review this document every month so you can adjust and improve it. Once the product becomes more mature, reviews can be quarterly or so.

Here is an example of an SLO policy document.

The Error Budget Policy Document

  • The policy authors, reviewers, and approvers
  • The date on which it was approved, and the date when it should next be reviewed
  • A brief description of the service to give the reader context
  • The actions to be taken in response to budget exhaustion. For example, stopping feature launches until your service is again within your SLOs while devoting the majority of the team’s time to reliability-related tasks.
  • A clear escalation path to follow if there is disagreement on the calculation or whether the agreed-upon actions are appropriate in the circumstances
  • Depending upon the audience’s level of error budget experience and expertise, it may be beneficial to include an overview of error budgets.

Here is an example of an error budget policy document.
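The "actions to be taken" part of the policy can also be written down as a simple rule, which helps avoid arguments when the budget runs out. This is only a sketch: the 25% warning threshold and the wording of the actions are assumptions of mine, and your own policy should define them.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the unspent share of the error budget to an agreed action."""
    if budget_remaining <= 0:
        # Budget exhausted: the example action described in the policy above.
        return "freeze feature launches and prioritise reliability work"
    if budget_remaining < 0.25:  # assumption: warn when under a quarter is left
        return "warn stakeholders and review risky launches"
    return "ship as normal"


for remaining in (0.6, 0.1, -0.2):
    print(f"{remaining:+.0%} budget left -> {policy_action(remaining)}")
```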

Continuous Improvement of SLO targets

Before improving SLO targets you need to learn about user satisfaction:

  • You can count outages that were discovered manually, posts on public forums, support tickets, and calls to customer service.
  • You can attempt to measure user sentiment on social media.
  • You can add code to your system to periodically sample user happiness.
  • You can conduct face-to-face user surveys and samples.

Improving the Quality of your SLOs

Count manually detected outages and outage-related support tickets. These can be used to correlate current SLO breaches with historical data, which makes it easier to find the root cause.

If some of your outages and ticket spikes are not captured in any SLI or SLO, or if you have SLI dips and SLO misses that don’t map to user-facing issues, this is a strong sign that your SLO lacks coverage. If this is the case, there are some things you can do:

  • Change your SLO. The SLO decision matrix can help with this.
  • Change your SLI implementation — either move the measurement closer to the user to improve the quality of the metric, or improve coverage so you capture a higher percentage of user interactions
  • Institute an aspirational SLO — This is a parallel SLO that does not trigger alerts but you can use as a guide to reach your ideal state.
  • Iterate — remember that this is about starting small and continuously improving.
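One simple way to spot a coverage gap is to compare the days on which users reported problems with the days on which an SLO was missed. The sketch below assumes you can reduce both to sets of dates; the function name and the dates are made up for illustration:

```python
def coverage_gaps(outage_days: set[str], slo_miss_days: set[str]):
    """Return (outages the SLO did not catch, SLO misses users never noticed)."""
    undetected_outages = sorted(outage_days - slo_miss_days)
    misses_without_impact = sorted(slo_miss_days - outage_days)
    return undetected_outages, misses_without_impact


# Hypothetical data: days with ticket spikes vs. days with SLO misses.
outages = {"2024-03-02", "2024-03-09", "2024-03-17"}
misses = {"2024-03-09", "2024-03-21"}

undetected, no_impact = coverage_gaps(outages, misses)
print("Outages not reflected in any SLO:", undetected)
print("SLO misses with no user-facing issue:", no_impact)
```

Entries in the first list suggest the SLI is measuring the wrong thing or too far from the user. Entries in the second suggest the SLO may be tighter than users actually need.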

Going Forward

As your SLOs mature you may want to think of implementing the following as well:

Dashboards and Reports

It’s very useful to have a dashboard that shows how well you are currently meeting your SLOs, and to be able to create reports for specific time periods. These can be instrumental in spotting problematic areas and discussing solutions with the various teams.

Choosing an Appropriate Time Window

Your SLOs may change depending on the time of the day or the period of the week or month. You may need to adjust them accordingly and set time windows where they vary.

Establishing windows is also helpful to know where it’s best to spend your error budget.

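A rolling window is usually better than a calendar window, because it reflects what your users experienced recently instead of resetting the error budget on an arbitrary date.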
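Here is a minimal sketch of a rolling-window SLI computed from daily request counts. The (good, total) tuple shape and the 28-day default are assumptions for the example:

```python
from collections import deque


def rolling_sli(daily_counts, window_days=28):
    """Yield the SLI over the trailing window after each day's (good, total) pair."""
    window = deque(maxlen=window_days)  # old days fall out automatically
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        yield good_sum / total_sum if total_sum else None


# A tiny example with a 2-day window so the effect is easy to follow.
days = [(9, 10), (10, 10), (5, 10)]
print(list(rolling_sli(days, window_days=2)))
```

With a calendar window the bad third day would simply be forgotten when the month rolls over, whereas here it keeps weighing on the SLI until it ages out of the window.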

Modelling User Journeys

SLOs should centre around improving user experience.

You must identify user-centric events, that is, functions that may span more than one SLI or log source. Join those sources together so you can measure the journey as a whole and establish SLIs and SLOs in aggregate. This maps your SLIs more accurately to real-world scenarios.
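As a sketch of what joining sources could look like, the snippet below groups events from different services by a shared journey ID and counts a journey as good only if every recorded step succeeded. The event shape, the step names, and that definition of a good journey are all assumptions:

```python
def journey_sli(events):
    """events: iterable of (journey_id, step, ok). Returns good journeys / all journeys."""
    journey_ok = {}
    for journey_id, _step, ok in events:
        # A journey stays good only while every one of its steps has succeeded.
        journey_ok[journey_id] = journey_ok.get(journey_id, True) and ok
    if not journey_ok:
        return None
    return sum(journey_ok.values()) / len(journey_ok)


# Hypothetical events collected from two different log sources.
events = [
    ("a", "search", True),
    ("a", "checkout", True),
    ("b", "search", True),
    ("b", "checkout", False),
]
print(journey_sli(events))
```

Measured per request, this sample looks 75% healthy, but from the user's point of view only half of the journeys worked.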

Grading Interaction Importance

You may want to bucket your requests. For example, you may want to prioritise your premium users and give them better service, in which case you would set different SLOs for different buckets.
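A bucketed setup might look like the following sketch. The tier names and targets are hypothetical:

```python
SLO_BY_TIER = {"premium": 0.999, "free": 0.99}  # hypothetical targets per bucket


def tiers_missing_slo(counts):
    """counts: {tier: (good, total)}. Return the tiers whose SLI is below target."""
    missing = []
    for tier, (good, total) in counts.items():
        if total and good / total < SLO_BY_TIER[tier]:
            missing.append(tier)
    return missing


counts = {"premium": (9_985, 10_000), "free": (9_950, 10_000)}
print(tiers_missing_slo(counts))
```

Here the premium bucket misses its stricter target even though its success ratio is higher than the free bucket's, which is exactly the distinction a single blended SLO would hide.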

Modelling Dependencies

You should map and understand all of your dependencies to determine where the bottlenecks are. If a service absolutely cannot run any faster because of a limitation, you should engineer around it. Good SLIs and SLOs help by giving you an overview of where you can sort these problems out.

If one of your dependencies is a constant cause of outages, engineering time should be devoted to resolving this permanently.
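A rough way to see why dependencies matter is to multiply their availabilities. This sketch assumes every dependency is required for a request to succeed and that they fail independently, which is a simplification; the numbers are made up:

```python
import math


def serial_availability(dependency_availabilities):
    """Upper bound on availability when every listed dependency must be up."""
    return math.prod(dependency_availabilities)


# Three hypothetical hard dependencies, each offering 99.9%.
bound = serial_availability([0.999, 0.999, 0.999])
print(f"Best achievable availability: {bound:.4%}")
```

Three dependencies at 99.9% each already cap you at roughly 99.7%, so an SLO stricter than that is not defensible without caching, fallbacks, or removing a dependency from the critical path.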

Experimenting with Relaxing SLOs

This is quite similar to chaos engineering in a way. Here you deliberately slow down the service to observe how customers react and see how slow you can go before you start losing business.

These experiments need to be performed with extreme care and on a small group of users.
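One common way to keep such an experiment small and stable is to pick the affected users by hashing their ID, so the same users stay in the group for the whole experiment. This is a sketch under that assumption; the function name, the salt, and the 1% default are hypothetical:

```python
import hashlib


def in_experiment(user_id: str, percent: float = 1.0, salt: str = "slo-relax-1") -> bool:
    """Deterministically place roughly `percent`% of users in the experiment group."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


users = [f"user-{i}" for i in range(20_000)]
selected = sum(in_experiment(u) for u in users)
print(f"{selected} of {len(users)} users would see the slower service")
```

Changing the salt reshuffles the group for the next experiment, so the same users are not always the ones who get the degraded experience.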

Sources

I have based this post heavily on the following chapters from the SRE books, which I recommend reading if you want to extend your knowledge of SLOs: