So you built your User Journeys, SLOs, Error Budget Policies, and SLI(s) and you are now looking at a window of charts.
As you mouse over a chart, you will see dots marking the individual data points within the 28-day window. So what is the Blameless SLO Manager telling you?
Understanding your Data
If you click on a point, Blameless opens a comment modal on the right side of the screen where you can add comments about that point in time.
Comments like “There was an outage” or “There was a maintenance window” help users understand why the SLO was violated.
The Header information is the starting point for understanding what your SLO is measuring.
The following table defines the components located in the Header.
| Component | Example | Definition |
|---|---|---|
| Reliability Target | 90% | The minimum percentage of requests (e.g. 95.90%), over a specific time window(*), that the team has decided the SLI must meet; this is the service level objective (SLO). The entered value will typically be set somewhere between 0% and 100%. |
| Service Level | 98.489% | The current measured value of the Service Level Indicator, as a percentage, sampled over a specific time window(*) from time series data that is ingested continuously from core metrics. |
| Comparison Operator | < | Logical operator used to compare a measured value against the objective value. |
| Objective Value Latency | 400 | Threshold that distinguishes a good event from a valid event. |
| Latency Unit (of measure) | rps | Requests per second |
| SLI | SLI name | The SLI to which this SLO applies. |
| Error Budget Policy | Error Budget Policy name | Breakdown of thresholds and notifications. |
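To make the Comparison Operator and Objective Value rows more concrete, here is a minimal sketch of how measurements could be split into good and valid events. The function name, the operator mapping, and the sample values are assumptions made for illustration; this is not how Blameless itself is implemented.

```python
import operator

# Hypothetical mapping from the Comparison Operator shown in the Header
# to a Python comparison function (an assumption for this sketch).
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def count_good_and_valid(measurements, comparison_operator, objective_value):
    """Return (good, valid) counts for a list of measured values.

    Every measurement counts as a valid event. A measurement is also a good
    event when `measurement <operator> objective_value` holds.
    """
    is_good = COMPARATORS[comparison_operator]
    valid = len(measurements)
    good = sum(1 for m in measurements if is_good(m, objective_value))
    return good, valid


# Made-up measurements, checked against the example Header values "<" and 400.
good, valid = count_good_and_valid([120, 380, 400, 950], "<", 400)
print(good, valid)  # 2 4
```

Note that with the `<` operator a measurement of exactly 400 is valid but not good.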
So basically, the parameters you have chosen mean that, out of all valid requests for this service, 90% (or better) should pass successfully. We are currently running above that target at 98.489%, which means we are staying within our Error Budget.
If you configured your Error Budget policy accordingly and you depleted 25% of your Error Budget (only 75% remaining), you can be notified by e-mail and also start a new Blameless Incident.
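As a rough illustration of how such a threshold works, the sketch below checks which thresholds have been reached for a given level of budget consumption. The function name and the list of thresholds are hypothetical; the actual thresholds and notifications are whatever you configured in your Error Budget Policy.

```python
def crossed_thresholds(consumed_percent, thresholds):
    """Return the thresholds (in % of error budget consumed) already reached."""
    return [t for t in sorted(thresholds) if consumed_percent >= t]


# Hypothetical policy: act when 25%, 50%, 75% and 100% of the budget is used.
policy_thresholds = [25, 50, 75, 100]

print(crossed_thresholds(15.11, policy_thresholds))  # []
print(crossed_thresholds(25.0, policy_thresholds))   # [25]
print(crossed_thresholds(80.0, policy_thresholds))   # [25, 50, 75]
```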
To see the SLI and Error Budget Policy (EBP) information, open the specific SLI and EBP and look at the data and descriptions provided.
In the Error Budget graph, the vertical axis shows the history of the remaining Error Budget (x% of 100%) over a 28-day sliding window, while the horizontal axis shows the date and time of each measurement.
You will note that the remaining error budget is actually declining, reflecting that more budget is being consumed, and reducing the wiggle room you have over that 28-day sliding time window to address potential growing reliability issues or push more changes to the corresponding service.
You will also see two speech-bubble icons at different days and times on the graph. You can click on these icons to see the comments that were added to the chart. A comment could describe, for example, an external event or activity that occurred at that time and was entered manually by a user (see how to “Insert a comment on a data point”).
If the error budget depletes completely to 0% in this graph, and your Service Level continues to be below your Reliability Target, the calculated error budget will become negative and continue to deplete further. In that situation, the portion of the graph below the horizontal axis will become red.
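The arithmetic behind a negative error budget can be sketched as follows. This uses a common definition (the allowed failure rate is 100% minus the Reliability Target) and is an assumption, not necessarily the exact calculation Blameless performs; the function name is hypothetical.

```python
def error_budget_remaining(service_level, reliability_target):
    """Return the remaining error budget as a percentage of the full budget.

    Both arguments are percentages (e.g. 98.489 and 90). The result is 100
    when nothing has failed, 0 when the service level equals the target,
    and negative when the service level is below the target.
    """
    allowed_failure = 100.0 - reliability_target
    if allowed_failure <= 0:
        raise ValueError("a 100% target leaves no error budget to spend")
    actual_failure = 100.0 - service_level
    consumed = actual_failure / allowed_failure * 100.0
    return 100.0 - consumed


# Header example: target 90%, service level 98.489%.
print(round(error_budget_remaining(98.489, 90.0), 2))  # 84.89

# Service level below target: the budget is overspent and goes negative.
print(round(error_budget_remaining(85.0, 90.0), 2))    # -50.0
```

Under this assumption, the example service from the Header has consumed about 15% of its budget, while a service level of 85% against a 90% target would sit in the red area below the horizontal axis.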
The SLO Events chart tracks the daily totals of Good vs Valid events over a 24-hour period that starts arbitrarily at 5 am (local time zone). When you hover your mouse over a specific day, a tooltip shows up with detailed information such as the date, the count of good events, the count of valid events, and the actual service level calculated on that day.
The vertical axis shows the total number of events logged, and the horizontal axis again shows the date and time.
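To picture how a day that starts at 5 am groups events, here is a minimal sketch. It assumes timestamps are already in local time and that each event carries a good/not-good flag; the function names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DAY_START_HOUR = 5  # the 24-hour period starts at 5 am local time


def reporting_day(timestamp):
    """Return the calendar date of the 5 am to 5 am period holding timestamp."""
    return (timestamp - timedelta(hours=DAY_START_HOUR)).date()


def daily_counts(events):
    """Group (timestamp, is_good) pairs into per-day good and valid counts."""
    counts = defaultdict(lambda: {"good": 0, "valid": 0})
    for timestamp, is_good in events:
        day = reporting_day(timestamp)
        counts[day]["valid"] += 1  # every event is a valid event
        if is_good:
            counts[day]["good"] += 1
    return dict(counts)


events = [
    (datetime(2024, 3, 2, 4, 59), True),    # before 5 am: counted on March 1
    (datetime(2024, 3, 2, 5, 0), True),     # 5 am sharp: counted on March 2
    (datetime(2024, 3, 2, 23, 30), False),  # counted on March 2
]
for day, c in sorted(daily_counts(events).items()):
    print(day, c["good"], c["valid"])
# 2024-03-01 1 1
# 2024-03-02 1 2
```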
The SLO Objective chart tracks the history of the daily service level (100 x good events / valid events). It also displays a horizontal line representing the reliability target set for this SLO.
The vertical axis shows the Service Level achieved, as a percentage, for each day (and time) on the horizontal axis.
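The daily service level formula above can be written out as a short sketch. The function names and the event counts are made up for illustration, and returning `None` for a day without valid events is an assumption of this sketch.

```python
def daily_service_level(good_events, valid_events):
    """Return 100 x good / valid, or None when there were no valid events."""
    if valid_events == 0:
        return None
    return 100.0 * good_events / valid_events


def meets_target(service_level, reliability_target):
    """True when the day's service level is on or above the target line."""
    return service_level is not None and service_level >= reliability_target


# Made-up day: 9,849 good events out of 10,000 valid events, target 90%.
level = daily_service_level(9849, 10000)
print(level)                      # 98.49
print(meets_target(level, 90.0))  # True
```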
Enhancing the Data points
Now that you understand more about your graphs, you can enhance the information by adding custom comments and details that highlight events at critical points in the process.
Insert a comment on a data point
- Click on a data point in the chart. The Comment modal opens.
- Enter your comment in the text field and click on POST.
Action options regarding your comment
If you select the ellipsis (three dots), the following options are displayed:
- Resolve (the comment)
- Reply (to the comment)
- Delete (the comment)
The ellipsis (...) appears to the right of a comment after it has been posted.
When you open a comment, the ellipsis (three dots) remains, but the options are reduced to only deleting the post.
Recalculate the Error Budget
If you change your Reliability Target, Blameless automatically recalculates the error budget values to match the new target. You can also use the “Recalculate Error Budget” option in the drop-down to manually refresh the error budget graph data if you like.
To do so, complete the following steps:
- Go back to the User Journey SLO Card window.
- Click on the desired SLO Card.
- Locate the ellipsis (three dots) in the upper right-hand corner and click on it. The dropdown options are:
- Edit (pencil icon)
- Recalculate Error Budget (recycle icon)
- Delete (trash can icon)
Select the Recalculate option. Blameless will proceed to refresh. Upon completion, the recalculated error budget graph will appear.
For More Information
For instructions regarding the creation, configuration, and use of User Journeys, Error Budgets, SLOs, and SLIs, refer to the following SLO references:
A Guide to Understanding your SLOs (this document)
Refer to the Google SRE Handbook for more information regarding Site Reliability Engineering.