A Guide to Blameless SLOs

Definitions

Service level components give cross-functional teams a common language for setting guardrails and incentives that drive high levels of service reliability. Within an SLO, SLIs are quantitative measures, typically provided through your APM platform. Traditionally, they measure either latency or availability at points on a digital user journey that contribute to customer experience and satisfaction.

Service Level Indicators

A Service Level Indicator (SLI) is “a carefully defined quantitative measure of some aspect of the level of service that is provided.”

SLIs are quantitative measures, typically provided through your APM platform. Traditionally, they refer to either latency or availability. Latency is defined as response time, including queue/wait time, in milliseconds; availability compares good requests against valid requests.

A composite SLI is a group of SLIs attributed to a larger SLO. These indicators are points on a digital user journey that contribute to customer experience and satisfaction. Once your SLIs are set up, you connect them to your SLOs, which are targets set against your SLIs.

Service Level Objectives

A Service Level Objective (SLO) is “a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.”
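The "SLI ≤ target, or lower bound ≤ SLI ≤ upper bound" structure can be sketched in a few lines of Python. This is only an illustration: the function name `meets_slo` and the 300 ms target are hypothetical and are not part of Blameless.

```python
from typing import Optional


def meets_slo(sli_value: float, upper: float, lower: Optional[float] = None) -> bool:
    """Return True when lower <= sli_value <= upper (the lower bound is optional)."""
    if lower is not None and sli_value < lower:
        return False
    return sli_value <= upper


# Hypothetical latency SLI, in milliseconds, checked against a 300 ms target.
print(meets_slo(250.0, upper=300.0))  # True
print(meets_slo(320.0, upper=300.0))  # False
```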

Service level objectives become the common language for cross-functional teams to set guardrails and incentives to drive high levels of service reliability. SLOs give you the objective language and measure of how to prioritize reliability work for proactive service health.

Error Budget Policy

An error budget is the amount of "wiggle room" your SLO leaves you, usually expressed as the percentage still remaining. Generally, you’ll measure it over a rolling window rather than a fixed historical view of your data. This keeps the SLO fresh and constantly moving forward as something that you can monitor. It’s not enough to know what your error budget is; you also need to know what you’ll do in the event of error budget violations. You can do this through an error budget policy, which determines alerting thresholds and the actions to take so that error budget depletion is addressed accordingly.
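As a rough illustration of the arithmetic, the sketch below computes the share of an error budget that is still unspent. It assumes a request-based SLI where "good" and "valid" events have already been counted for the current rolling window; the function name and the sample numbers are hypothetical and do not describe how Blameless computes the value internally.

```python
def error_budget_remaining(slo_target: float, good: int, valid: int) -> float:
    """Fraction of the error budget still unspent in the current rolling window.

    slo_target is a fraction such as 0.999. A negative result means the
    budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if valid == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * valid  # total budget, in events
    actual_bad = valid - good                 # budget spent, in events
    return 1.0 - actual_bad / allowed_bad


# 99.9% target, 1,000,000 valid requests, 600 of them bad:
# the budget is about 1,000 bad requests, so roughly 40% remains.
print(round(error_budget_remaining(0.999, good=999_400, valid=1_000_000), 3))  # 0.4
```

Because the counts come from a rolling window, the result changes as old events age out, which is what keeps the budget "moving forward".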

Service Registry

The service registry is a centralized catalog of services that ties together context across the service lifecycle, such as change events, SLOs, observability insights, incidents, and more.

Service Level Agreements

A Service Level Agreement is “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

Building an SLO

Building your SLO requires the following:

  • Create the User Journey

  • Create the SLI

  • Create the SLO

  • Set the Thresholds

note

The best practice for User Journey analysis is collaboration across teams and groups to collect the journey information.

Create the User Journey

The user journey is composed of Services, SLIs, and SLOs which allow the user to examine the state of their profile conformance.
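One way to picture how these pieces relate is the data-model sketch below. It is purely illustrative: the class and field names are assumptions chosen for readability and are not the Blameless schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SLI:
    name: str
    sli_type: str  # "availability" or "latency"
    service: str   # name of the owning service in the Service Registry


@dataclass
class SLO:
    name: str
    sli: SLI
    error_budget_policy: str  # name of the associated policy


@dataclass
class UserJourney:
    name: str
    definition: str
    status: str  # e.g. "Production" or "Development"
    owner: str
    slos: List[SLO] = field(default_factory=list)  # may start empty


login_latency = SLI("Login latency p95", "latency", service="Login service")
journey = UserJourney("Login", "User signs in", "Development", "unassigned")
journey.slos.append(SLO("Login Latency for 95th percentile", login_latency, "Default policy"))
print(len(journey.slos))  # 1
```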

  1. Click on the SLO Manager icon.
SLO Feature Nav Bar

The User Journey window will open, displaying three options:

  • User Journey

  • Error Budget Policy

  • Service Registry

note

You can create a User Journey and leave it blank as a placeholder for future population.

Creating a New Journey

  1. Click on the “New Journey” Button.
New User Journey opening window
  1. Enter a User Journey Name.

  2. Enter a User Journey Definition.

  3. Select a Status from the drop-down. Your options are:

    • Production
    • Development
  4. Click on the “Unassigned” button to assign an owner.

    When all of the required fields have been defined, the “Save” button becomes active.

  5. Click the “Save” button.

    When you return to the Service Registry, on the right-hand side of the window you will see an ellipsis (three dots: “…”). The icon contains the following actions:

    • Edit (the current SLI)

    • Delete (the current SLI)

Creating an SLI

  1. Select the Service Registry option.

    When the Registry window opens, a list of services, if created, will appear.

note

If there is no SLI associated with a service under the Service Registry Title, the SLI title in the field will be blank.

  1. Open the desired Service window or create a new one.
note

You must create a service or services and at least one SLI prior to creating the SLO.

Using an existing SLI (service)

  1. Click on the desired SLI. A new window opens.
RESTO Service Registry SLI List
note

This will include any existing SLIs, Notes (regarding the service), and a Summary of the Service metadata (Name, Description, Creation and Modification dates).

Creating a new SLI (service)

  1. Click on the “New Service” button.

    A new modal opens containing the following required (*) fields:

    • Service Name

    • Description

  2. Enter the name for the new service and a description.

  3. Click the “Save” button.

    The new service will appear on the “Service Registry” landing screen the next time you open it.

note

The SLI list will remain blank until you create an SLI and save it.

New Service Registry opening window
  1. Define a new or select an existing SLI.

    When you open the desired Service Registry window, you will find the following elements:

    • SLI

      A list of SLIs (if any exist) under an SLI tab

    • Notes

      A Notes tab containing any added information regarding the service

    • Service Summary

      • A Service Description

      • Creation date

      • Last updated

      • Team

note

Both the Description and the Team (members) sections have a pencil icon, signifying these fields can be edited.

Using an existing SLI (service)

  1. Select an existing Service Registry.

  2. Select the desired Service. A details window will open.

Creating a new SLI (service)

  1. Click on the “New Service” button.

  2. Enter a Service Name.

  3. Enter a Service Definition.

    When all of the required fields have been defined, the “Save” button becomes active.

  4. Click the “Save” button.

    When you return to the Service Registry, on the right-hand side of the window you will see an ellipsis (three dots: “…”). The icon contains the following actions:

    • Edit (the current SLI)

    • Delete (the current SLI)

Define an SLI

  1. Select and open an existing SLI (service).

  2. Click on the “Define SLI” button.

  3. Assign an SLI Name and enter a description.

    For example: "This SLI measures the latency of the login request for the 95th percentile of login requests hitting the API and Login service".

SLI Latency Configuration window
  1. Select the SLI Type. Currently supported options are:

    • Availability measures good metrics vs. valid metrics.

    • Latency measures how long it takes to complete the task.

  2. Select the Data source, based on the integration(s) you activated.

  3. Copy and paste the metric shown in the example field, based on the Data source selected.

  4. Click the “Save” button.
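To make the two SLI types concrete, the sketch below computes an availability ratio (good vs. valid) and a latency percentile from raw samples. In practice the values come from the query you paste for your Data source; the function names, the nearest-rank percentile method, and the sample data here are illustrative assumptions only.

```python
import math
from typing import Sequence


def availability_sli(good: int, valid: int) -> float:
    """Availability as the ratio of good events to valid events."""
    if valid <= 0:
        raise ValueError("valid must be positive")
    return good / valid


def latency_percentile(latencies_ms: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile of the observed latencies, in milliseconds."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


samples = [120, 80, 95, 300, 110, 105, 90, 100, 85, 1000]
print(availability_sli(good=9990, valid=10000))  # 0.999
print(latency_percentile(samples, 50))           # 100
print(latency_percentile(samples, 95))           # 1000
```

With only ten samples, the 95th percentile lands on the slowest request, which shows why percentile-based latency SLIs need a reasonable volume of data.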

Return to the User Journey top level. You can now start to set up SLOs.

Actionable SLOs

The Blameless advantage is that our SLOs are "Actionable". That means you can use an SLO event to activate a threshold point.

For example:

Using an error budget policy as a rule to make something occur when the SLO threshold value is reached.

note

While the SLO and Error Budget Policy thresholds are NOT a real-time event tracking capability, they can be used to observe activity and events, helping you build a breadcrumb trail of how things are functioning before the events become customer-affecting issues.

Create an Error Budget Policy

  1. Select the Error Budget Policy option on the left side of the window.

  2. Click on the "New Policy" button on the right side of the window. A new modal opens.

  3. Enter a Policy name and description.

  4. Click the “Save” button.

New Error Budget Policy Creation window
note

As with the other windows, the ellipsis (three dots) at the end of each line gives you the following action options regarding that item:

  • Edit

  • Delete

If you click on Delete, you will receive a warning that you are about to remove the item.

Setting Policy Thresholds

  1. Open the new Policy. Within the new policy window, you will see a list of thresholds that can be set.

  2. Set an automated Threshold. The following thresholds are available:

note

These are the currently available automated thresholds. More may be added as features are added.

  • Notify via E-mail

  • Notify via Slack

  • Create a Blameless ticket

  • Create a ServiceNow or JIRA ticket

note

As stated earlier, you can integrate EITHER JIRA or ServiceNow, but not both. Refer to the “Integrations” guides for more information.

Error Budget Sample window
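Conceptually, a policy is a list of thresholds, each paired with an automated action. The sketch below models that idea under the assumption that a threshold is expressed as the percentage of error budget consumed; the class, the action labels, and the 50/75/100 values are hypothetical and do not reflect the product's internal representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Threshold:
    budget_consumed_pct: float  # fire once this much of the budget is spent
    action: str                 # hypothetical label for the automated action


POLICY = [
    Threshold(50.0, "notify_email"),
    Threshold(75.0, "notify_slack"),
    Threshold(100.0, "create_ticket"),
]


def triggered_actions(policy: List[Threshold], consumed_pct: float) -> List[str]:
    """Return the actions whose thresholds have been reached, lowest first."""
    ordered = sorted(policy, key=lambda t: t.budget_consumed_pct)
    return [t.action for t in ordered if consumed_pct >= t.budget_consumed_pct]


print(triggered_actions(POLICY, 80.0))  # ['notify_email', 'notify_slack']
```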

Create an SLO

note

For this example, we are using a Latency SLI to create the following SLO.

  1. Click on the User Journey Option in the upper left corner of the window.
Existing User Journey List window
  1. Click on the desired User Journey. A new window opens. The User Journey window will contain a number of options:

    • A status for the User Journey

    • A list of any existing SLOs associated with the User Journey

    • A summary of the User Journey

    • Buttons allowing you to edit the existing User Journey and Add a new SLO if desired.

  2. Click on the “Add SLO” button. A new "Create New SLO" window opens.

Create a New SLO window
  1. Enter a name for the new SLO. Best practices suggest something that reflects the User Journey it is associated with.

    For example: Login Latency for 95th percentile.

  2. Associate it with an Error Budget Policy within the drop down field beneath the SLO Name field.

  3. Click the “Save” button.

  4. Choose an SLI to associate with the SLO from the list.

  5. Click on the SLI option on the left side of the window.

  6. Select the SLI you wish to associate with this SLO.

  7. Click on the “Next” button. Blameless now opens the “Specify your SLO threshold target” window.

Specify SLO Target window
  1. Select the Percentage of Time threshold for the SLI.

  2. Select the Latency value (the threshold value, in milliseconds or seconds, that determines whether there is a triggering violation).

  3. Select a Violation comparison operator (i.e., less than, or less than or equal to).

  4. Click the “Next” button. The SLO Status window opens.
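The three inputs above (percentage, latency value, comparison operator) combine into a single pass/fail check. The sketch below shows one plausible reading, treating the percentage as the share of latency samples that satisfy the comparison; that interpretation, the function name, and the sample data are assumptions for illustration, not a description of the Blameless implementation.

```python
import operator
from typing import Sequence, Tuple

COMPARATORS = {"<": operator.lt, "<=": operator.le}


def slo_compliance(
    latencies_ms: Sequence[float],
    latency_threshold_ms: float,
    comparator: str,
    target_pct: float,
) -> Tuple[float, bool]:
    """Return (observed percentage of good samples, whether the target is met)."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    compare = COMPARATORS[comparator]
    good = sum(1 for value in latencies_ms if compare(value, latency_threshold_ms))
    observed_pct = 100.0 * good / len(latencies_ms)
    return observed_pct, observed_pct >= target_pct


# 19 fast requests and 1 slow one, checked against "95% of requests < 300 ms".
samples = [100] * 19 + [500]
print(slo_compliance(samples, 300, "<", 95.0))  # (95.0, True)
```

The choice between "less than" and "less than or equal to" only matters for samples that sit exactly on the threshold value.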

Select SLO Environment status window
  1. Select the desired environment (e.g., Testing) for the threshold monitoring.
note

The Development, Testing, and Active statuses differ in how they treat error budget policies. “Development” does not fire error budget policies, while “Testing” and “Active” do. The difference between “Testing” and “Active” is more a matter of organization once we add some filtering.

  1. Click the “Create” button. Blameless returns you to the User Journey landing page. The new SLO should now appear in the User Journey window as an option.

When the SLO kicks off, it will (currently) connect to the selected Data source and start digesting data for the previous 28-day window to measure against the activated SLO(s).

note

This may take some time to "crunch the numbers".

When it is done, Blameless will generate a chart of the data it has digested, based on the parameters set and the Error Budget policy values, and display it below the SLO list on the landing page.

SLO Results window with charts

Operationalize the SLO Service

As you mouse over the chart, you will see dots representing data points from the 28-day data digestion.

SLO Results window with data point

If you click on a point, Blameless opens a comment modal on the right side of the screen where you can add comments regarding that point in time within the data digestion.

Comments like “There was an outage” or “There was a maintenance window” help users understand why there was a violation of the SLO.

Insert a comment on a data point

  1. Click on a data point in the chart. The Comment modal opens.

  2. Enter your comment in the text field and click on POST.

SLO Results data point comment window
note

The ellipsis (...) appears to the right of a comment after it has been posted.

SLO Results data point comment posted

Action options regarding your comment

If you select the ellipsis, the following options are displayed:

  • Resolve (the comment)

  • Reply (to the comment)

  • Delete (the comment)

Restore Error Budget

Blameless also allows the user to restore error budget. To do so, complete the following steps:

  1. Locate the chart icon (squiggly arrow) in the right corner of the SLO Title window and click on it. A new modal opens on the right side of the SLO window.

  2. Enter a description of the restore event (e.g., Maintenance window).

  3. Enter the amount of time involved in the event. That entered value will appear in the Restored Error Budget Minutes chart below the Error Budget chart.
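To see what restoring budget means in minutes, the sketch below works through the arithmetic for a time-based SLO over the 28-day window. It assumes that restored minutes simply offset the minutes already consumed and never add budget beyond the original total; that rule and the function names are illustrative assumptions, not documented Blameless behavior.

```python
WINDOW_DAYS = 28  # the rolling window used when the SLO digests data


def budget_minutes(slo_target: float, window_days: int = WINDOW_DAYS) -> float:
    """Total error budget, in minutes, for a time-based SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def remaining_minutes(
    slo_target: float, bad_minutes: float, restored_minutes: float = 0.0
) -> float:
    """Budget left after subtracting bad minutes, net of any restored minutes."""
    consumed = max(0.0, bad_minutes - restored_minutes)
    return budget_minutes(slo_target) - consumed


# A 99.9% target over 28 days allows about 40.32 minutes of violation.
print(round(budget_minutes(0.999), 2))                                               # 40.32
# 30 bad minutes, 20 of which fell inside a maintenance window and are restored.
print(round(remaining_minutes(0.999, bad_minutes=30.0, restored_minutes=20.0), 2))   # 30.32
```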

SLO Results data point comment posted

Blameless SLO API Endpoints

Blameless offers a set of SLO APIs to support the client’s needs.

For detailed information on our SLO API endpoints, please refer to the SLO API docs and SLO Timeseries API docs.