Service levels have become the common language that cross-functional teams use to set guardrails and incentives that drive high service reliability. A Service Level Indicator (SLI) is a quantitative measure, typically provided through your APM platform. Traditionally, SLIs measure either latency (e.g. response time) or availability (e.g. good versus all requests) at points on a digital user journey that contribute to customer experience and satisfaction.
To track the reliability of those SLIs, users set a service level, such as a Service Level Objective (SLO) target, against each SLI. A reliability target is expressed as the minimum percentage of requests (e.g. 95.90%) that must meet the performance objective (SLO) on the SLI over a specific time window (e.g. a 28-day sliding window) to keep customers or users happy. The target automatically translates into a maximum amount of time over that window during which the objective may be missed, which is the wiggle room (or buffer) a team has to accelerate feature velocity or concentrate on improving reliability. This amount of time is defined as the Error Budget, or simply “budget”, that teams can spend or consume before the corresponding service no longer meets its reliability objectives within that time window.
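As a rough illustration of how a reliability target translates into an error budget, the following Python sketch converts a target percentage and a window length into the amount of time the objective may be missed. The function name `error_budget` and the purely time-based interpretation are assumptions made for this example; they are not the Blameless implementation.

```python
from datetime import timedelta


def error_budget(target_percent: float, window_days: int = 28) -> timedelta:
    """Return the time within the window during which the objective may be missed.

    Hypothetical helper for illustration only.
    """
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    window = timedelta(days=window_days)
    # The budget is the share of the window that the target does not cover.
    return window * ((100 - target_percent) / 100)


budget = error_budget(95.90, window_days=28)
print(f"Error budget: about {budget.total_seconds() / 3600:.1f} hours")
```

With a 95.90% target over a 28-day window, the budget comes to roughly 1.15 days. A stricter target shrinks the budget proportionally.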
As a New User of SLOs
Blameless provides the SLO Wizard to guide new users through the process. You start with the User Journey.
Start by launching the SLO Manager. Blameless opens to the User Journey landing page. Next, click on “+New Journey”. The SLO Wizard walks you through the process, and you can follow your progress via the guide icon at the top of the page or by clicking the “Next” button.
You can create a User Journey and leave it blank as a placeholder for future population.
You can continue to the section “Working with the SLO Wizard” for a high-level description of the feature.
For detailed instructions regarding the New User Journey and the SLO Wizard, refer to Building a New SLO.
As an Experienced User of SLOs
As an experienced user, you are probably familiar enough with the process that you do not need the SLO Wizard to create more SLIs, but it remains available for creating new user journeys and adding new SLOs to them. You can continue with the section “Launching the SLO Manager”.
Working with the SLO Wizard
Building an SLO involves the following steps:
- Create the User Journey
- Create the SLI
- Create the Error Budget Policy
- Create the SLO
- Set the Thresholds
The best practice for User Journey analysis is collaboration across teams and groups to collect the journey information.
Creating a New User Journey
You create a new User Journey through the SLO Wizard, which simplifies the process. The SLO Wizard guides you through each step and shows your progress at the top of the window.
You can also create a User Journey and leave it blank as a placeholder for future population.
For detailed instructions regarding the New User Journey, refer to Building a New SLO.
Manage your User Journeys
- Click on the SLO Manager icon.
The User Journey window opens and displays three options on the left side of the SLO window:
- User Journey
- Error Budget Policies
- Service Level Indicators
When it opens, the default landing page is the existing User Journey list (if any).
A User Journey is composed of SLOs, SLIs, and Error Budgets, which together allow the user to examine the state of their reliability conformance.
Displaying the SLOs
If you want to examine the contents of an existing User Journey, there are two ways to do this:
- Click on the down arrow to the left of the User Journey name, which opens a "Quick View" of the associated SLOs.
- Click on the User Journey name itself, which opens the User Journey window and displays a list of associated SLOs in either Card or Table view with more details (see below for the definitions of key terms used in both the Card and Table views).
- Click on the down arrow to the LEFT of the User Journey name. A drop-down appears, providing a Quick View of the associated SLO names.
Key SLO Terms
Here are some common, but key, SLO terms you will see throughout our documentation.
|SLO||Arbitrary name given to the SLO (alphanumeric string)||User defined|
|Reliability Target||The minimum percentage of requests (e.g. 95.90%) that must meet the service level objective (SLO) on the SLI over a specific time window(*), as decided by the team. The value entered is a percentage between 0% and 100%.||User defined|
|SLI Type||Type of SLI against which the user-defined reliability target is measured (e.g. Availability, Latency, etc.)||User defined|
|Service Level||Current measured value of the Service Level Indicator, expressed as a percentage and computed over a specific time window(*) from time series data of core metrics that are ingested continuously.||Calculated|
|Remaining||Error Budget remaining, expressed as a percentage of the total budget given to the SLO. The budget represents the maximum amount of time (“unreliability”) allowed over a specific time window(*), providing “wiggle room” to a team that has to accelerate feature velocity or concentrate on improving reliability.||Calculated|
|Burn Rate||Error Budget Burn Rate is a number relative to the reliability target set for an SLO. This describes how fast you are burning your error budget.||Calculated|
|Depleted In||Number of days left before the remaining Error Budget goes to 0%. It is automatically updated, based on the Error Budget Burn Rate.||Calculated|
The burn rate reflects recent changes more rapidly than the remaining error budget value.
(*) Supported time windows:
- 28-day sliding window
- Customizable sliding window length (future)
- Calendar window (future)
Calculated values are updated every 7 minutes.
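To make the calculated columns more concrete, here is a minimal Python sketch of how Remaining, Burn Rate, and Depleted In could be derived from request counts. The formulas follow a commonly used convention (burn rate as the observed error rate divided by the error rate the target allows) and are assumptions for illustration; the source does not specify the exact calculations Blameless uses, and all function names are hypothetical.

```python
def remaining_budget_percent(good: int, valid: int, target_percent: float) -> float:
    """Share of the error budget still unspent, from request counts in the window."""
    allowed_bad = valid * (1 - target_percent / 100)
    if allowed_bad <= 0:
        raise ValueError("no error budget: check the target and the request count")
    actual_bad = valid - good
    # The result goes negative once more budget is spent than was granted.
    return 100 * (1 - actual_bad / allowed_bad)


def burn_rate(error_rate: float, target_percent: float) -> float:
    """Observed error rate relative to the error rate the target allows (assumed)."""
    allowed_rate = 1 - target_percent / 100
    if allowed_rate <= 0:
        raise ValueError("a 100% target leaves no error budget to burn")
    return error_rate / allowed_rate


def depleted_in_days(remaining_percent: float, rate: float, window_days: int = 28):
    """Days until the budget reaches 0% at the current burn rate, or None if never."""
    if remaining_percent <= 0:
        return 0.0
    if rate <= 0:
        return None  # nothing is being burned
    return (remaining_percent / 100) * window_days / rate


remaining = remaining_budget_percent(good=995_000, valid=1_000_000, target_percent=99.0)
rate = burn_rate(error_rate=0.02, target_percent=99.0)
print(round(remaining, 2), round(rate, 2), depleted_in_days(remaining, rate))
```

Under these assumptions, a burn rate of 1 spends the whole budget in exactly one window, so a burn rate of 2 with half the budget left exhausts it in about a quarter of the window.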
- Click on an SLO row title, which opens a new Details window. The new window contains the definition of the SLO (reliability target, service level, etc.), the associated SLI, and the associated Error Budget Policy. The User Journey Summary remains visible on the far right side of the window.
The Details window includes several icons for actions you can apply to the elements in the window. The following table identifies each icon and its action.
|Refresh Error Budget|
|“+”||Action||Add SLO / SLI|
|Pencil icon||Action||Edit the associated field|
|“X”||Action||Close Details Window|
- If there are existing SLOs, click on the desired SLO.
- Click on an SLO Card. This launches the same SLO Details window as the drop-down option described above.
A green or red sliver on the left of a card indicates the status of the SLO: green when the remaining Error Budget is higher than 0%, red when it is below 0%.
Modifying an Existing User Journey
- If you wish to modify an existing User Journey:
Within the SLO, you can see the associated SLI, the attached Error Budget alert policy, as well as a User Journey Summary.
For detailed instructions regarding existing SLOs, refer to A Guide to Managing Blameless SLOs.
The Blameless advantage is that our SLOs are "Actionable". Teams responsible for the reliability of one or more services can be automatically notified via email or Slack when Error Budgets deplete below certain thresholds (e.g. 25%). Additionally, an Error Budget alert policy can automatically start a Blameless incident based on those same thresholds.
For example, an error budget policy can act as a rule that triggers one or more notifications to one or more teams via email and Slack channels.
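The threshold behavior described above can be sketched in a few lines of Python. The `PolicyRule` structure, its field names, and the sample thresholds are hypothetical and exist only to illustrate the idea of threshold-based rules; they do not reflect the Blameless policy schema or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PolicyRule:
    """Hypothetical rule: act when remaining budget drops below the threshold."""
    threshold_percent: float
    channels: Tuple[str, ...]      # e.g. ("email", "slack")
    start_incident: bool = False   # optionally open an incident as well


def triggered_rules(remaining_percent: float, rules: List[PolicyRule]) -> List[PolicyRule]:
    """Return every rule whose threshold the remaining budget has fallen below."""
    return [rule for rule in rules if remaining_percent < rule.threshold_percent]


policy = [
    PolicyRule(threshold_percent=25.0, channels=("email", "slack")),
    PolicyRule(threshold_percent=10.0, channels=("slack",), start_incident=True),
]

for rule in triggered_rules(remaining_percent=20.0, rules=policy):
    action = "notify and start incident" if rule.start_incident else "notify"
    print(f"Below {rule.threshold_percent}%: {action} via {', '.join(rule.channels)}")
```

In this sketch, a remaining budget of 20% triggers only the 25% rule; the 10% rule, which also starts an incident, fires once the budget drops further.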
Ingesting metrics for SLIs
After the SLI has been created, SLO Manager immediately starts ingesting time series data for the one or more metrics that compose the SLI (e.g. good, valid) from the selected data source (e.g. DataDog, New Relic, Prometheus, etc.).
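As a simple illustration of how two ingested metrics can combine into a service level, the sketch below aggregates (good, valid) samples into a percentage. The sample format and the function name are assumptions for this example; real data would come from the selected data source, and the actual aggregation Blameless performs is not described here.

```python
from typing import Iterable, Optional, Tuple


def service_level_percent(samples: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Aggregate (good, valid) request counts into a service level percentage.

    Returns None when no valid requests were observed, since the ratio
    is undefined in that case.
    """
    good_total = 0
    valid_total = 0
    for good, valid in samples:
        good_total += good
        valid_total += valid
    if valid_total == 0:
        return None
    return 100 * good_total / valid_total


# Hypothetical time series: one (good, valid) pair per sampling interval.
series = [(980, 1000), (995, 1000), (1000, 1000)]
level = service_level_percent(series)
print(f"Service level: {level:.2f}%")
```

Summing the counts before dividing weights each interval by its traffic, which avoids letting a quiet interval distort the overall percentage.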
The SLI status is reported with an icon that reflects the state of data ingestion. For example:
|SLI Status||Icon||Description|
|In Progress||Spinning wheel||This SLI is currently fetching the latest data from your APM.|
|Backfill completed||Green circle checkmark||Successfully fetched latest data from your APM.|
|Error||Red circle||Exclamation message “Error while fetching…”.|
|No incoming data||TBD||Future Feature|
Examples of these status icons appear in the following:
The Error message will be similar to the sample image, based on the type of error and explanation available.
Blameless SLO API Endpoints
Blameless offers a set of SLO APIs to support the client’s needs.
For more Information
For instructions regarding the creation, configuration, and use of User Journeys, Error Budgets, SLOs, and SLIs, refer to the following SLO references:
An Introductory Guide to Blameless SLOs (this document)
Refer to the Google SRE Handbook for more information regarding Site Reliability Engineering.