An Introductory Guide to Blameless SLOs

Getting Started

Service levels have become the common language that cross-functional teams use to set guardrails and incentives that drive high levels of service reliability. A Service Level Indicator (SLI) is a quantitative measure, typically provided through your APM platform. Traditionally, SLIs measure either latency (e.g. response time) or availability (e.g. good versus all requests) at points on a digital user journey that contribute to customer experience and satisfaction.
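As a minimal sketch of the two traditional SLI types, the Python below computes an availability percentage and a latency percentage from a list of requests. The `Request` record, the function names, and the 300 ms threshold are hypothetical and chosen only for illustration; they are not part of the Blameless product or of any APM platform.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape, for illustration)."""
    succeeded: bool     # True if the request returned a non-error response
    latency_ms: float   # response time in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Availability: good requests as a percentage of all requests."""
    if not requests:
        raise ValueError("no requests observed")
    good = sum(1 for r in requests if r.succeeded)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Latency: percentage of requests answered within threshold_ms."""
    if not requests:
        raise ValueError("no requests observed")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


sample = [
    Request(True, 120.0),
    Request(True, 480.0),
    Request(False, 950.0),
    Request(True, 210.0),
]
print(availability_sli(sample))     # 3 of 4 requests succeeded
print(latency_sli(sample, 300.0))   # 2 of 4 requests finished within 300 ms
```

Both functions return a percentage so that the result can be compared directly against a reliability target expressed in the same unit.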

Service Levels

To track the reliability performance of those SLIs, users set service levels, such as a Service Level Objective (SLO) target, against each SLI. A reliability target is the minimum percentage of requests (e.g. 95.90%) that must meet the performance objective on the SLI over a specific time window (e.g. a 28-day sliding window) to keep customers or users happy. The target automatically translates into a maximum amount of time over that window during which the objective can be missed, which is the wiggle room (or buffer) a team has to accelerate feature velocity or concentrate on improving reliability. This amount of time is defined as the Error Budget, or simply “budget”, that teams can spend or consume before the corresponding service no longer meets its reliability objectives within that time window.
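The translation from a reliability target to an Error Budget is simple arithmetic, sketched below in Python. The function name `error_budget` is an assumption made for this example. With the values used above, a 95.90% target over a 28-day window leaves 4.10% of 28 days, or roughly 27.5 hours, of budget.

```python
from datetime import timedelta


def error_budget(target_pct: float, window: timedelta) -> timedelta:
    """Maximum time a service may miss its objective within the window."""
    if not 0.0 <= target_pct <= 100.0:
        raise ValueError("target_pct must be between 0 and 100")
    # The budget is the share of the window NOT covered by the target.
    return window * (1.0 - target_pct / 100.0)


budget = error_budget(95.90, timedelta(days=28))
print(budget)                          # the budget as a duration
print(budget.total_seconds() / 3600)   # the same budget in hours
```

A stricter target shrinks the budget quickly: the same function shows that a 99.90% target over 28 days leaves well under one hour.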

As a New User of SLOs

Blameless provides the SLO Wizard to guide you, as a new user, through the process. You start with the User Journey.

Start by launching the SLO Manager. Blameless opens to the User Journey Landing page. Next, click on “+New Journey”. The SLO Wizard will walk you through the process, and you can follow that process via the guide icon at the top of the page or by clicking the “Next” button.

SLO Feature Nav Bar
note

You can create a User Journey and leave it blank as a placeholder for future population.

You can continue to the section “Working via the SLO Wizard” for a high-level description of the feature.

For detailed instructions regarding the New User Journey and the SLO Wizard, refer to A Guide to Building a New SLO.

As an Experienced User of SLOs

As an experienced user, you are probably familiar enough with the process that you do not need the SLO Wizard to create more SLIs, but it remains available for creating new User Journeys and adding new SLOs to them. You can continue to the section “Launching the SLO Manager”.

Working via the SLO Wizard

Building an SLO involves the following steps:

  • Create the User Journey
  • Create the SLI
  • Create the Error Budget Policy
  • Create the SLO
  • Set the Thresholds
note

The best practice for User Journey analysis is collaboration across teams and groups to collect the journey information.
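One way to picture how the pieces created in the steps above relate to each other is a small data model. The sketch below is purely illustrative: the class names, field names, and the choice to store thresholds on the Error Budget Policy are assumptions made for this example, not the Blameless schema.

```python
from dataclasses import dataclass, field


@dataclass
class SLI:
    name: str
    sli_type: str                  # e.g. "Availability" or "Latency"


@dataclass
class ErrorBudgetPolicy:
    name: str
    thresholds_pct: list[float]    # remaining-budget levels that trigger alerts


@dataclass
class SLO:
    name: str
    sli: SLI
    policy: ErrorBudgetPolicy
    reliability_target_pct: float
    window_days: int = 28          # 28-day sliding window


@dataclass
class UserJourney:
    name: str
    slos: list[SLO] = field(default_factory=list)   # may start empty


# A journey can exist as a blank placeholder and be populated later.
checkout = UserJourney("Checkout")
sli = SLI("checkout-availability", "Availability")
policy = ErrorBudgetPolicy("notify-on-low-budget", [50.0, 25.0])
checkout.slos.append(SLO("Checkout availability", sli, policy, 95.90))
print(len(checkout.slos))
```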

Creating a New User Journey

The SLO Wizard simplifies creating a new User Journey. It guides you through the process and shows your progress at the top of the window.

note

You can also create a User Journey and leave it blank as a placeholder for future population.

For detailed instructions regarding the New User Journey, refer to A Guide to Building a New SLO.

Manage your User Journeys

  1. Click on the SLO Manager icon.
SLO Feature Nav Bar

The User Journey window opens and displays three options on the left side of the SLO window:

  • User Journey
  • Error Budget Policies
  • Service Level Indicators

When it opens, the default landing page is the existing User Journey list (if any).

Existing User Journey List

User Journey

The User Journey is composed of SLOs, SLIs, and Error Budgets, which allow the user to examine the state of their reliability conformance.

Displaying the SLOs

If you want to examine the contents of an existing User Journey, there are two ways to do this:

  1. Click on the down arrow to the left of the User Journey name. A drop-down appears, providing a "Quick View" of the associated SLO names.
  2. Click on the User Journey name itself, which opens the User Journey window and displays a list of associated SLOs in either a Card or Table view with more details (see the definitions of key terms used in both views below).
User Journey associated SLOs

Key SLO Terms

Here are some common, but key, SLO terms you will see throughout our documentation.

| Term | Description | Source |
|---|---|---|
| SLO | Arbitrary name given to the SLO (alphanumeric string) | User defined |
| Reliability Target | The minimum percentage of requests (e.g. 95.90%) over a specific time window(*) that teams have decided must meet the service level objective (SLO) on their SLI. The entered value is set between 0% and 100%. | User defined |
| SLI Type | Type of SLI against which the user-defined reliability target is measured (e.g. Availability, Latency, etc.) | User defined |
| Service Level | Current measured value of the Service Level Indicator, as a percentage, sampled over a specific time window(*) from time series data of core metrics that are ingested continuously. | Calculated |
| Remaining | Error Budget remaining, expressed as a percentage of the total budget given to the SLO. The budget represents the maximum amount of time over a specific time window(*) during which the objective can be missed, providing “wiggle room” for a team to accelerate feature velocity or concentrate on improving reliability. | Calculated |
| Burn Rate | Error Budget Burn Rate is a number relative to the reliability target set for an SLO. It describes how fast you are burning your error budget. | Calculated |
| Depleted In | Number of days left before the remaining Error Budget goes to 0%. It is automatically updated, based on the Error Budget Burn Rate. | Calculated |
note

The burn rate reflects recent changes more rapidly than the remaining error budget value.

(*) Supported time windows:

  • 28-day sliding window
  • Customizable sliding window length (future)
  • Calendar window (future)

note

Calculated values are updated every 7 minutes.
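To make the calculated terms above more concrete, the sketch below shows one common way to derive Remaining, Burn Rate, and Depleted In from request counts. These formulas follow general SRE practice and are assumptions for illustration; the source does not state the exact calculations Blameless uses, and all function names are hypothetical.

```python
def remaining_budget_pct(good: int, valid: int, target_pct: float) -> float:
    """Error budget left, as a percentage of the window's total budget.

    Assumes valid > 0 and a target below 100%.
    """
    allowed_bad = valid * (1.0 - target_pct / 100.0)
    actual_bad = valid - good
    return 100.0 * (1.0 - actual_bad / allowed_bad)


def burn_rate(recent_good: int, recent_valid: int, target_pct: float) -> float:
    """Recent error rate divided by the error rate the target allows.

    A value of 1.0 spends the whole budget in exactly one window.
    """
    recent_error_rate = (recent_valid - recent_good) / recent_valid
    return recent_error_rate / (1.0 - target_pct / 100.0)


def depleted_in_days(remaining_pct: float, rate: float,
                     window_days: int = 28) -> float:
    """Days until the budget reaches 0% if the current burn rate holds."""
    if rate <= 0:
        return float("inf")   # nothing is being burned
    return max(remaining_pct, 0.0) / 100.0 * window_days / rate


# Whole window: 1,000,000 valid requests, 20,500 of them bad.
remaining = remaining_budget_pct(979_500, 1_000_000, 95.90)
# Recent period only: 10,000 valid requests, 820 of them bad.
rate = burn_rate(9_180, 10_000, 95.90)
print(remaining)                           # roughly 50 (% of budget left)
print(rate)                                # roughly 2 (twice the sustainable pace)
print(depleted_in_days(remaining, rate))   # roughly 7 (days)
```

Because the burn rate here is computed from a recent period only, it reacts to a change in error rate sooner than the remaining budget, which is accumulated over the whole window.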

  1. Click on an SLO row title, which opens a new Details window. The new window contains the definition of the SLO (reliability target, service level, etc.), the associated SLI, and the associated Error Budget Policy. The User Journey Summary remains visible on the far right side of the window.
Existing User Journey List
Existing SLO Details window

The Details window contains several icons for actions you can apply to its elements. The following table identifies each icon and the action it performs.

| Icon | Type | Action |
|---|---|---|
| “...” | Drop-down | Edit SLO; Refresh Error Budget; Delete SLO |
| “+” | Action | Add SLO / SLI |
| Pencil icon | Action | Edit the associated field |
| “X” | Action | Close Details Window |
Existing SLO components
  1. If there are existing SLOs: Click on the desired SLO.
Existing SLO components
  1. Click on an SLO Card. This launches the same SLO Details window as the drop-down option above.
note

A green or red sliver on the left of a card indicates the status of the SLO: green when the remaining Error Budget is higher than 0%, red if below 0%.

Modifying an Existing User Journey

  1. If you wish to modify an existing User Journey:
Existing SLO components

Within the SLO, you can see the associated SLI, the attached Error Budget alert policy, as well as a User Journey Summary.

For detailed instructions regarding existing SLOs, refer to A Guide to Managing Blameless SLOs.

Actionable SLOs

The Blameless advantage is that our SLOs are "Actionable". Teams responsible for the reliability of one or more services can be automatically notified via email or Slack when Error Budgets deplete below certain thresholds (e.g. 25%). Additionally, an Error Budget alert policy can automatically start a Blameless incident based on those same thresholds.

For example, an error budget policy can act as a rule that triggers one or more notifications to one or more teams via email and Slack channels.
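The threshold-based behavior described above can be sketched as a small rule evaluator. This is an illustration under stated assumptions, not the Blameless implementation: the `Rule` class, its fields, and the 50% and 25% thresholds are hypothetical, and the sketch prints messages instead of calling email or Slack.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    threshold_pct: float          # fire when remaining budget falls below this
    channels: list[str]           # e.g. ["email", "slack"]
    start_incident: bool = False  # optionally open an incident as well


def triggered_rules(remaining_pct: float, rules: list[Rule]) -> list[Rule]:
    """Return every rule whose threshold the remaining budget has dropped below."""
    return [r for r in rules if remaining_pct < r.threshold_pct]


policy = [
    Rule(50.0, ["email"]),
    Rule(25.0, ["email", "slack"], start_incident=True),
]

# With 20% of the budget left, both rules fire.
for rule in triggered_rules(20.0, policy):
    for channel in rule.channels:
        print(f"notify via {channel}: budget below {rule.threshold_pct}%")
    if rule.start_incident:
        print("start incident")
```

Keeping the thresholds in data rather than in code means one policy can be attached to several SLOs and adjusted without touching the evaluation logic.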

Existing SLO components

Ingesting metrics for SLIs

After the SLI has been created, SLO Manager immediately starts ingesting time series data from one or more metrics that compose the SLI (e.g. good, valid) from the selected data source (e.g. Datadog, New Relic, Prometheus, etc.).
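As an illustration of how ingested good and valid metrics can turn into a service level, the sketch below sums time series points that fall inside a sliding window. The `Point` type and the `service_level` function are hypothetical names, and the aggregation shown is an assumption rather than a description of the SLO Manager's internals.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Point(NamedTuple):
    timestamp: datetime
    good: int    # requests that met the SLI criterion in this interval
    valid: int   # all requests counted in this interval


def service_level(points: list[Point], now: datetime,
                  window: timedelta = timedelta(days=28)) -> Optional[float]:
    """Percentage of good over valid requests inside the sliding window."""
    start = now - window
    in_window = [p for p in points if start < p.timestamp <= now]
    valid = sum(p.valid for p in in_window)
    if valid == 0:
        return None   # no incoming data for this window
    good = sum(p.good for p in in_window)
    return 100.0 * good / valid


now = datetime(2024, 3, 1)
points = [
    Point(now - timedelta(days=40), 0, 100),    # too old, outside the window
    Point(now - timedelta(days=10), 90, 100),
    Point(now - timedelta(days=1), 100, 100),
]
print(service_level(points, now))   # 190 good out of 200 valid in the window
```

Returning `None` when no valid requests fall inside the window keeps "no data" distinct from a genuine 0% service level.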

The SLO Manager reports the ingestion status of each SLI with an icon. For example:

| SLI Status | Icon | Description |
|---|---|---|
| In Progress | Spinning wheel | This SLI is currently fetching the latest data from your APM. |
| Backfill completed | Green circle checkmark | Successfully fetched latest data from your APM. |
| Error | Red circle | Exclamation message “Error while fetching…”. |
| No incoming data | TBD | Future Feature |

Examples of these status icons appear in the following image:

Existing SLO components
note

The error message will be similar to the sample image and varies with the type of error and the explanation available.

Blameless SLO API Endpoints

Blameless offers a set of SLO APIs to support the client’s needs.

For detailed information on our SLO API endpoints, refer to the SLO API docs and SLO Timeseries API docs.

For More Information

For instructions regarding the creation, configuration, and use of User Journeys, Error Budgets, SLOs, and SLIs, refer to the following SLO references:

Blameless SLO Definitions

An Introductory Guide to Blameless SLOs (this document)

A Guide to Getting started with Blameless SLOs

A Guide to Building a New SLO

A Guide to Error Budget Policies

A Guide to Managing Blameless SLOs

A Guide to Understanding your SLOs

Refer to the Google SRE Handbook for more information regarding Site Reliability Engineering.