A Guide to Understanding your SLOs

Getting Started

So you have built your User Journeys, SLOs, Error Budget Policies, and SLIs, and you are now looking at a window of charts.

Results window with graphs

As you mouse over the chart, you will see dots representing data points from the 28-day data digestion. So what is the Blameless SLO Manager telling you?

Understanding your Data

If you click on a point, Blameless opens a comment modal on the right side of the screen where you can add comments regarding that point in time within the data digestion.

SLO Results window with data point

Comments such as “There was an outage” or “There was a maintenance window” help users understand why the SLO was violated.

Header info

The Header information is where you start to figure out your query.

SLO Results data point comment window

The following table defines the components located in the Header.

Title | Value | Description
Reliability Target | 90% | The minimum percentage of requests (e.g. 95.90%), over a specific time window(*), that the team has decided must meet the service level objective (SLO) on its SLI. The entered value is typically set somewhere between 0% and 100%.
Service Level | 98.489% | The current measured value of the Service Level Indicator, as a percentage, sampled over a specific time window(*) from time series data of core metrics that are injected continuously.
Comparison Operator | < | Logical operator
Objective Value Latency | 400 | Threshold defining what a good vs a valid event is.
Latency Unit (of measure) | rps | Requests per second
SLI | SLI name | SLI against which this SLO applies.
Error Budget Policy | Error Budget Policy name | Breakdown of thresholds and notifications.
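
To make the Comparison Operator and Objective Value rows more concrete, here is a minimal Python sketch of how a measurement could be classified as a good event. The function name, the operator mapping, and the sample measurements are hypothetical and are not part of the Blameless product; only the "<" operator and the value 400 come from the table above.

```python
import operator

# Hypothetical mapping from the header's Comparison Operator symbol
# to a Python comparison function.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def is_good_event(measured_value, comparison_operator, objective_value):
    """Return True when the measurement satisfies the objective."""
    return COMPARATORS[comparison_operator](measured_value, objective_value)


# Illustrative measurements, in the same unit as the objective value.
measurements = [120, 380, 400, 950]

# Every measurement counts as a valid event; only those that pass the
# comparison count as good events.
good = [m for m in measurements if is_good_event(m, "<", 400)]
print(len(good), "good out of", len(measurements), "valid events")
```

With the "<" operator, a measurement exactly equal to the objective value (400 here) does not count as good.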

So basically, with these parameters you want at least 90% of all valid requests for this service to pass as good requests. The service is currently running above that target at 98.489%, which means it is staying within its Error Budget.
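
The comparison described above can be sketched in a few lines of Python. The event counts are invented so that the result matches the 98.489% in the header, and the function names are hypothetical; the formula (100 x good / valid) is the one this guide gives for the service level.

```python
def service_level(good_events, valid_events):
    """Service level as a percentage: 100 x good / valid."""
    if valid_events == 0:
        raise ValueError("no valid events in the window")
    return 100.0 * good_events / valid_events


def meets_target(level_pct, reliability_target_pct):
    """True when the measured service level is at or above the target."""
    return level_pct >= reliability_target_pct


# Illustrative counts only, chosen to reproduce the header values.
level = service_level(good_events=98_489, valid_events=100_000)
print(f"Service Level: {level:.3f}%")
print("Meets 90% target:", meets_target(level, 90.0))
```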

If you configured your Error Budget policy accordingly and you depleted 25% of your Error Budget (only 75% remaining), you can be notified by e-mail and also start a new Blameless Incident.
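
As a rough illustration of how such thresholds behave, the sketch below models a policy as a list of consumption thresholds, each with an action. The `Threshold` class, the action labels, and the 50% incident threshold are assumptions made for this example; they do not describe how Blameless stores or evaluates an Error Budget Policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    consumed_pct: float  # fire once this much of the budget is consumed
    action: str          # hypothetical label, e.g. "email" or "incident"


def triggered_actions(remaining_pct, thresholds):
    """Return the actions whose consumption threshold has been reached."""
    consumed = 100.0 - remaining_pct
    ordered = sorted(thresholds, key=lambda t: t.consumed_pct)
    return [t.action for t in ordered if consumed >= t.consumed_pct]


# Illustrative policy: e-mail at 25% consumed, incident at 50% consumed.
policy = [Threshold(25.0, "email"), Threshold(50.0, "incident")]

# 75% remaining means 25% consumed, so only the e-mail threshold fires.
print(triggered_actions(75.0, policy))
```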

note

You can find the SLI and Error Budget Policy (EBP) information by opening the specific SLI and EBP and reviewing the data and descriptions provided.

Error Budget

In the Error Budget graph, the vertical axis shows the history of the remaining Error Budget over a 28-day sliding window (x% of 100%), while the horizontal axis shows the measured date and time.

SLO Results data point comment window

You will note that the remaining error budget is declining: more budget is being consumed, which reduces the wiggle room you have over that 28-day sliding window to address growing reliability issues or push more changes to the corresponding service.

note

You will also see two speech-bubble icons at different days and times on the graph. Click an icon to see the comments that were added to the chart. A comment might describe, for example, an external event or activity that occurred at that time and was entered manually by a user (see “Insert a comment on a data point”).

note

If the error budget depletes completely to 0% in this graph, and your Service Level continues to be below your Reliability Target, the calculated error budget will become negative and continue to deplete further. In that situation, the portion of the graph below the horizontal axis will become red.
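
The following Python sketch shows one common way to express the remaining error budget, and why the value can drop below zero. It assumes the budget is the number of bad events the Reliability Target allows within the window; this guide does not state the exact formula Blameless uses, so treat the function and the counts as illustrative only.

```python
def error_budget_remaining_pct(good_events, valid_events, reliability_target_pct):
    """Remaining error budget as a percentage of the full budget.

    Assumption: the full budget is the number of bad events the target
    tolerates over the window, i.e. (100 - target)% of valid events.
    """
    bad_events = valid_events - good_events
    allowed_bad = valid_events * (100.0 - reliability_target_pct) / 100.0
    if allowed_bad == 0:
        raise ValueError("a 100% target or an empty window leaves no budget")
    return 100.0 * (1.0 - bad_events / allowed_bad)


# With a 90% target and 1,000 valid events, 100 bad events are allowed.
for good in (950, 900, 850):
    remaining = error_budget_remaining_pct(good, 1000, 90.0)
    print(f"{good} good events: {remaining:.1f}% of the budget remaining")
```

Under this assumption, 950 good events leave half the budget, 900 good events leave none, and 850 good events give a negative value, which corresponds to the part of the graph drawn below the horizontal axis.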

SLO Events

The SLO Events chart tracks the daily totals of Good vs. Valid events over a 24-hour period that starts arbitrarily at 5 am (local time zone). When you hover your mouse over a specific day, a tooltip shows detailed information such as the date, the count of good events, the count of valid events, and the service level calculated for that day.
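
To illustrate the 5 am day boundary, here is a minimal Python sketch that groups events into 24-hour periods starting at 5 am and counts good and valid events per period. The helper name and the sample events are hypothetical, the timestamps are assumed to already be in the local time zone, and daylight saving transitions are ignored.

```python
from collections import Counter
from datetime import datetime, timedelta

DAY_START_HOUR = 5  # from the text: each 24-hour period starts at 5 am


def reporting_day(local_timestamp):
    """Return the date of the 5 am-to-5 am period containing the timestamp."""
    return (local_timestamp - timedelta(hours=DAY_START_HOUR)).date()


# Illustrative events: (local timestamp, was the event good?)
events = [
    (datetime(2024, 5, 10, 4, 59), True),    # before 5 am: belongs to May 9
    (datetime(2024, 5, 10, 5, 0), True),
    (datetime(2024, 5, 10, 23, 30), False),
    (datetime(2024, 5, 11, 2, 15), True),    # still in the May 10 period
]

valid_per_day = Counter()
good_per_day = Counter()
for timestamp, is_good in events:
    day = reporting_day(timestamp)
    valid_per_day[day] += 1
    if is_good:
        good_per_day[day] += 1

# One line per period: date, good count, valid count, service level.
for day in sorted(valid_per_day):
    level = 100.0 * good_per_day[day] / valid_per_day[day]
    print(day, good_per_day[day], valid_per_day[day], f"{level:.2f}%")
```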

SLO Results data point comment window

The vertical axis shows the total number of events logged, and the horizontal axis again shows the date and time.

SLO Objective

The SLO Objective chart tracks the history of the daily service level (100 x good/valid events). It also displays a horizontal line representing the reliability target set for this SLO.
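
As a small sketch of what this chart plots, the Python below turns daily good and valid counts into daily service levels and lists the days that fall under the target line. The day labels, the counts, and the choice to leave a day with no valid events undefined are all assumptions made for the example.

```python
def daily_service_levels(daily_counts):
    """Map each day to 100 x good / valid, or None when there is no data."""
    levels = {}
    for day, (good, valid) in daily_counts.items():
        levels[day] = None if valid == 0 else 100.0 * good / valid
    return levels


def days_below_target(levels, reliability_target_pct):
    """Return the days whose service level is under the target line."""
    return [
        day
        for day, level in levels.items()
        if level is not None and level < reliability_target_pct
    ]


# Illustrative counts: day label -> (good events, valid events)
daily_counts = {
    "day 1": (990, 1000),
    "day 2": (870, 1000),
    "day 3": (0, 0),  # no valid events, so no service level for this day
}

levels = daily_service_levels(daily_counts)
print(levels)
print("Below the 90% target:", days_below_target(levels, 90.0))
```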

SLO Results data point comment window

The vertical axis shows the Service Level achieved, as a percentage, for each day (and time) on the horizontal axis.

Enhancing the Data points

Now that you understand more about your graphs, you can enhance the information by adding custom comments and details that highlight events at critical points in the process.

Insert a comment on a data point

  1. Click on a data point in the chart. The Comment modal opens.
  2. Enter your comment in the text field and click on POST.
SLO Results data point comment window

Action options regarding your comment

If you select the ellipsis (three dots), the following options are displayed:

  • Resolve (the comment)
  • Reply (to the comment)
  • Delete (the comment)
note

The ellipsis (...) appears to the right of a comment after it has been posted.

When you open a comment, the ellipsis (three dots) remains, but the options are reduced to only deleting the post.

SLO Results data point comment posted

Recalculate the Error Budget

If you change your Reliability Target, Blameless automatically recalculates the error budget values to match the new target. You can also use the “Recalculate Error Budget” option in the drop-down to manually refresh the error budget graph data.

To do so, complete the following steps:

  1. Go back to the User Journey SLO Card window.
  2. Click on the desired SLO Card.
  3. Locate the ellipsis (three dots) in the upper right-hand corner and click it. The drop-down options are:
    • Edit (pencil icon)
    • Recalculate Error Budget (recycle icon)
    • Delete (trash can icon)
SLO Results refresh

Select the Recalculate Error Budget option. Blameless refreshes the data and, upon completion, displays the recalculated error budget graph.

For More Information

For instructions regarding the creation, configuration, and use of User Journeys, Error Budgets, SLOs, and SLIs, refer to the following SLO references:

Blameless SLO Definitions

An Introductory Guide to Blameless SLOs

A Guide to Getting started with Blameless SLOs

A Guide to Building a New SLO

A Guide to Error Budget Policies

A Guide to Managing Blameless SLOs

A Guide to Understanding your SLOs (this document)

Refer to the Google SRE Handbook for more information regarding Site Reliability Engineering.