Ideally, the SRE team and its partner developer teams should detect new bugs before they even make it into production. In reality, automated testing misses many bugs, which are then launched to production. Software testing is a large topic well covered elsewhere. However, software testing techniques are particularly useful in reducing the number of bugs that reach production, and the amount of time they remain in production. This kind of rollback strategy requires predictable and frequent releases so that the cost of rolling back any one release is small.
Minimizing the number of bugs in production not only reduces pager load, it also makes identifying and classifying new bugs easier. Therefore, it is critical to remove production bugs from your systems as quickly as possible. Prioritize fixing existing bugs above delivering new features; if this requires cross-team collaboration, see SRE Engagement Model. Architectural or procedural problems, such as automated health checking, self-healing, and load shedding, may need significant engineering work to resolve.
Chapter 3 of Site Reliability Engineering describes how error budgets are a useful way to manage the rate at which new bugs are released to production. The Connection team from our example adopted a strict policy requiring every outage to have a tracking bug. This data revealed that human error was the second most common cause of new bugs in production. Before you make a change to production, automation can perform additional testing that humans cannot.
The Connection team was making complex changes to production semimanually. Automated systems making the same changes would have determined that the changes were not safe before they entered production and became paging events. The technical program manager took this data to the team and convinced them to prioritize automation projects.
For example, given a page that manifests only under high load, say at daily peak, if the problematic code or configuration is not identified before the next daily peak, it is likely that the problem will happen again. There are several techniques you might use to reduce identification delays. Consider these techniques for reducing mitigation delays.
At this point, systems may need self-healing properties, which is out of scope for this chapter. Strictly observing these guidelines is critical to maintaining a healthy on-call rotation. Receiving a page creates a negative psychological impact. To minimize that impact, only introduce new paging alerts when you really need them. Anyone on the team can write a new alert, but the whole team reviews proposed alert additions and can suggest alternatives.
Thoroughly test new alerts in production to vet false positives before they are upgraded to paging alerts. After you address these production bugs, alerting will only page on new bugs, effectively functioning like regression tests. Be sure to run the new alerts in test mode long enough to experience typical periodic production conditions, such as regular software rollouts, maintenance events by your Cloud provider, weekly load peaks, and so on.
A week of testing is probably about right. However, this appropriate window depends on the alert and the system. Explicitly approve or disallow the new alert as a team. If introducing a new paging alert causes your service to exceed its paging budget, the stability of the system needs additional attention. Aim to identify the root cause of every page. Was an outage caused by a bug that would have been caught by a unit test? If you know the root cause, you can fix and prevent it from ever bothering you or your colleagues again.
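The paging-budget check described above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the `AlertTrial` record, the function names, and the budget value are all hypothetical, and the projection simply assumes the alert keeps firing at the rate seen during its test window.

```python
from dataclasses import dataclass


@dataclass
class AlertTrial:
    """Outcome of running a candidate alert in test (non-paging) mode."""
    name: str
    days_observed: int  # length of the test window, e.g. 7 for one week
    times_fired: int    # how often the alert would have paged


def projected_daily_pages(current_daily_pages: float, trial: AlertTrial) -> float:
    """Estimate daily pages if the trial alert were promoted to paging."""
    if trial.days_observed <= 0:
        raise ValueError("trial must cover at least one day")
    return current_daily_pages + trial.times_fired / trial.days_observed


def exceeds_paging_budget(current_daily_pages: float, trial: AlertTrial,
                          daily_budget: float) -> bool:
    """True if promoting the alert would push the team over its budget."""
    return projected_daily_pages(current_daily_pages, trial) > daily_budget
```

A team could run a check like this during the alert review, treating a `True` result as a prompt to invest in system stability rather than as an automatic veto.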
Use the paging alert as a chance to surface engineering work that improves the system and obviates an entire class of possible future bugs. For example, imagine a situation where there are too many servers on the same failure domain, such as a switch in a datacenter, causing regular multiple simultaneous failures. Make sure resource planning for both the SRE and developer teams considers the effort required to respond to bugs.
We recommend reserving a fraction of SRE and developer team time for responding to production bugs as they arise. Instead, they work on bugs that improve the health of the system. Make sure that your team routinely prioritizes production bugs above other project work. SRE managers and tech leads should make sure that production bugs are promptly dealt with, and escalate to the developer team decision makers when necessary.
See Postmortem Culture: Learning from Failure for more details. Once you identify bugs in your system that caused pages, a number of questions naturally arise. You will end up with a list of as-yet-not-understood bugs in one column, and a list of all of the pages that each bug is believed to have caused in the next. Once you have structured data about the causes of the pages, you can begin to analyze that data and produce reports.
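As a rough illustration of the structured data described above, the sketch below groups pages by the bug believed to have caused them and ranks the bugs by page count. The `(page_id, bug_id)` tuple shape, the `UNKNOWN` bucket, and the function names are assumptions made for this example only.

```python
from collections import defaultdict


def pages_by_bug(pages):
    """Group page IDs under the bug believed to have caused them.

    `pages` is an iterable of (page_id, bug_id) pairs; bug_id is None
    when the cause is not yet understood.
    """
    grouped = defaultdict(list)
    for page_id, bug_id in pages:
        grouped[bug_id or "UNKNOWN"].append(page_id)
    return dict(grouped)


def top_causes(grouped, n=3):
    """Return the n bugs with the most pages, ties broken by bug ID."""
    ranked = sorted(grouped.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return ranked[:n]
```

Keeping an explicit bucket for pages without a known cause makes the size of the not-yet-understood pile visible in the same report.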
Those reports can answer many such questions. The quality of the data you collect will determine the quality of the decisions either humans or automata can make. To ensure high-quality data, consider the following techniques. All too often, teams fall into operational overload by a thousand cuts.
To avoid boiling the frog, it is important to pay attention to the health of on-call engineers over time, and ensure that production health is consistently and continuously prioritized by both SRE and developer teams.
The arrangement works well for you. All is well…until one day you realize that the on-call schedule and the demands of your personal life are starting to clash. There are many potential reasons why—for example, becoming a parent, needing to travel on short notice and take a leave from work, or illness.
Many teams and organizations face this challenge as they mature. The key to keeping a healthy, fair, and equitable balance of on-call work and personal life is flexibility. There are a number of ways that you can apply flexibility to on-call rotations to meet the needs of team members while still ensuring coverage for your services or products.
It is impossible to write down a comprehensive, one-size-fits-all set of guidelines.
We encourage embracing flexibility as a principle rather than simply adopting the examples listed here. As teams grow, accounting for scheduling constraints—vacation plans, distribution of on-call weekdays versus weekends, individual preferences, religious requirements, and so on—becomes increasingly difficult. Different people have different needs and different preferences. Team composition and preferences dictate whether your team prefers a uniform distribution, or a more customized way of meeting scheduling preferences.
Using an automated tool to schedule on-call shifts makes this task much easier. This tool should have a few basic characteristics. Schedule generation can be either fully automated or scheduled by a human. Likewise, some teams prefer to have members explicitly sign off on the schedule, while others are comfortable with a fully automated process.
You might opt to develop your own tool in-house if your needs are complex, but there are a number of commercial and open source software packages that can aid in automating on-call scheduling. Requests for short-term changes in the on-call schedule happen frequently. Or you might need to run an unforeseen urgent errand in the middle of your on-call shift. You may also want to facilitate on-call swaps for nonurgent reasons (for example, to allow on-callers to attend sports training sessions). In this situation, team members can swap a subset of the on-call day (for example, half of Sunday).
Nonurgent swaps are typically best-effort. Teams with a strict pager response SLO need to take commute coverage into account.
If your pager response SLO is 5 minutes, and your commute is 30 minutes, you need to make sure that someone else can respond to emergencies while you get to work. To achieve these goals in flexibility, we recommend giving team members the ability to update the on-call rotation. Also, have a documented policy in place describing how swaps should work.
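The commute example above reduces to a simple comparison, sketched here. The function name and the idea of expressing both values in minutes are assumptions for illustration; a real scheduling tool would likely track this per shift.

```python
def needs_secondary_cover(unreachable_minutes: float,
                          response_slo_minutes: float) -> bool:
    """True if a planned absence is longer than the pager response SLO.

    When the on-caller cannot respond for longer than the SLO allows,
    someone else has to be able to pick up pages during that window.
    """
    return unreachable_minutes > response_slo_minutes
```

With a 5-minute SLO, a 30-minute commute needs cover; with a 30-minute SLO, the same commute does not.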
Decentralization options range from a fully centralized policy, where only the manager can change the schedule, to a fully decentralized one, where any team member can change the policy independently. In our experience, instituting peer review of changes provides a good tradeoff between safety and flexibility. Sometimes team members need to stop serving in the on-call rotation because of changes in personal circumstances or burnout. Ideally, team size should allow for a temporary staff reduction without causing the rest of the team to suffer too much operational load.
Therefore, it is safe to assume each site will need one extra engineer as protection against staff reduction, bringing the minimum staffing to six engineers per site (multisite) or nine per site (single-site). Both models are compatible with on-call work, but require different adjustments to on-call scheduling. The first model easily coexists with on-call work, especially if the nonworking day(s) are constant over time. As mentioned in Chapter 11 of Site Reliability Engineering, Google SRE compensates support outside of regular hours with a reduced hourly rate of pay or time off, according to local labor law and regulations.
In order to maintain a proper balance between project time and on-call time, engineers working reduced hours should receive a proportionately smaller amount of on-call work. Larger teams absorb this additional on-call load more easily than smaller teams. Working from this discussion of team psychology, how do you go about building a team with positive dynamics? Consider an on-call team with the following set of hypothetical problems. A company begins with a couple of founders and a handful of employees, all feature developers. Everyone knows everyone else, and everyone takes pagers. The company grows bigger.
On-call duty is limited to a smaller set of more experienced feature developers who know the system better. The company grows even bigger. They add an ops role to tackle reliability. This team is responsible for production health, and the job role is focused on operations, not coding.
The on-call becomes a joint rotation between feature developers and ops. Feature developers have the final say in maintaining the service, and ops input is limited to operational tasks. By this time, there are 30 engineers in the on-call rotation: 25 feature developers and 5 ops, all located at the same site. The team is plagued by high pager volume. Despite following the recommendations described earlier in this chapter to minimize pager load, the team is suffering from low morale. Because the feature developers prioritize developing new features, on-call follow-up takes a long time to implement.
One feature developer insists on paging by error rate rather than error ratio for their mission-critical module, despite complaints from others on the team. These alerts are noisy, and return many false positives or unactionable pages. So I just ignore them. Some Google teams experienced similar problems during their earlier days of maturity. If not handled carefully, these problems have the potential to tear the feature developer and ops teams apart and hinder on-call operation.
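To make the difference between paging on error rate and paging on error ratio concrete, here is a small sketch. The function names and the traffic figures in the usage note are invented for illustration.

```python
def error_rate(errors: int, window_seconds: float) -> float:
    """Errors per second. Grows with traffic even when the failing
    fraction of requests stays the same."""
    return errors / window_seconds


def error_ratio(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0
```

Suppose 1% of requests fail all day. Off-peak, 10 errors out of 1,000 requests in a minute is about 0.17 errors per second; at peak, 1,000 errors out of 100,000 requests is about 16.7 errors per second. A fixed rate threshold between those values pages at every daily peak, while the ratio stays at 0.01 in both cases.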
While your methodology may differ, your overall goal should be the same: build positive team dynamics, and carefully avoid tailspin. You can remodel the operations organization according to the guidelines outlined in this book and Site Reliability Engineering, perhaps even including a change of name (SRE, or similar) to indicate the change of role. Simply retitling your ops organization is not a panacea, but it can be helpful in communicating an actual change in responsibilities away from the old ops-centric model. Make it clear to the team and the entire company that SREs own the site operation.
This includes defining a shared roadmap for reliability, driving the full resolution of issues, maintaining monitoring rules, and so on. To return to our hypothetical team, this announcement ushered in a number of operational changes. With this arrangement, feature developers are explicit collaborators on reliability features, and SREs are given the responsibility to own and improve the site.
Another possible solution is to build stronger team bonds between team members. As a result, engineers are more likely to fix bugs, finish action items, and help out their colleagues. For example, say you turned off a nightly pipeline job, but forgot to turn off the monitoring that checked if the pipeline ran successfully. As a result, you accidentally page a colleague at 3 a.m. Encourage teams to eat lunch with each other.
It plays directly into team dynamics. SRE on-call is different than traditional ops roles. Rather than focusing solely on day-to-day operations, SRE fully owns the production environment, and seeks to better it through defining appropriate reliability thresholds, developing automation, and undertaking strategic engineering projects. On-call is a source of much tension, both individually and collectively. This chapter illustrates some of the lessons about on-call that we learned the hard way; we hope that our experience can help others avoid or tackle similar issues.
If your on-call team is drowning in endless alerts, we recommend taking a step back to observe the situation from a higher level. Compare notes with other SRE and partner teams. Thoughtfully structuring on-call is time well spent for on-call engineers, on-call teams, and the whole company.
Published by O'Reilly Media, Inc. Chapter 8 - On-Call.
What you described in your first book is irrelevant to me. Split the responsibilities? A lot of pages get ignored, while the real problems are buried under the pile. Where should we start? How do you address the knowledge gap within the team? Please be specific, because the DevOps team is very concerned about this.
Example On-Call Setups Within Google and Outside Google
This section describes real-world examples of on-call setups at Google and Evernote, a California company that develops a cross-platform app that helps individuals and teams create, assemble, and share information.
For example: At the start of each shift, the on-call engineer reads the handoff from the previous shift. The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed. At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.
Adjusting our on-call policies and processes
The move to the cloud unleashed the potential for our infrastructure to grow rapidly, but our on-call policies and processes were not yet set up to handle such growth.
Restructuring our monitoring and metrics
Our primary on-call rotation is staffed by a small but scrappy team of engineers who are responsible for our production infrastructure and a handful of other business systems (for example, staging and build pipeline infrastructure).
We classify any event generated by our metrics or monitoring infrastructure into three categories:
- P1: Deal with immediately
  - Should be immediately actionable
  - Pages the on-call
  - Leads to event triage
  - Is SLO-impacting
- P2: Deal with the next business day
  - Generally is not customer-facing, or is very limited in scope
  - Sends an email to team and notifies event stream channel
- P3: Event is informational only
  - Information is gathered in dashboards, passive email, and the like
  - Includes capacity planning-related information
Any P1 or P2 event has an incident ticket attached to it.
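The three categories above could be encoded roughly as follows. This is a sketch under assumptions, not Evernote's implementation: the `Event` fields, the classification rule, and the action names are hypothetical, chosen only to mirror the properties listed for each priority.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Priority(Enum):
    P1 = "deal with immediately"
    P2 = "deal with the next business day"
    P3 = "informational only"


@dataclass
class Event:
    name: str
    slo_impacting: bool
    immediately_actionable: bool
    needs_follow_up: bool  # hypothetical: someone must act, but not right now


def classify(event: Event) -> Priority:
    """Assumed rule: SLO-impacting and actionable events page; events that
    need action later become P2; everything else is informational."""
    if event.slo_impacting and event.immediately_actionable:
        return Priority.P1
    if event.needs_follow_up:
        return Priority.P2
    return Priority.P3


# Hypothetical action names; P1 and P2 both get an incident ticket.
ACTIONS = {
    Priority.P1: ["page_oncall", "open_incident_ticket"],
    Priority.P2: ["email_team", "notify_event_stream", "open_incident_ticket"],
    Priority.P3: ["record_on_dashboard"],
}


def actions_for(event: Event) -> List[str]:
    """Return the follow-up actions for an event's priority."""
    return list(ACTIONS[classify(event)])
```

Keeping the routing in one table makes it easy to review which events are allowed to page a human.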
Tracking our performance over time
With the introduction of SLOs, we wanted to track performance over time, and share that information with stakeholders within the company.
Table. Examples of realistic response times
Incident description | Response time | SRE impact
Revenue-impacting network outage | 5 minutes | SRE needs to be within arm's reach of a charged and authenticated laptop with network access at all times; cannot travel; must heavily coordinate with secondary at all times
Customer order batch processing system stuck | 30 minutes | SRE can leave their home for a quick errand or short commute; secondary does not need to provide coverage during this time
Backups of a database for a pre-launch service are failing | Ticket response during work hours | None
Scenario: A team in overload
The hypothetical Connection SRE Team, responsible for frontend load balancing and terminating end-user connections, found itself in a position of high pager load.
Pager load inputs
The first step in tackling high pager load is to determine what is causing it. Regularly update the software or libraries that your system is built upon to take advantage of bug fixes (however, see the next section about new bugs). Perform regular load testing in addition to integration and unit testing.
New bugs
Ideally, the SRE team and its partner developer teams should detect new bugs before they even make it into production.
However, software testing techniques are particularly useful in reducing the number of bugs that reach production, and the amount of time they remain in production:
- Improve testing over time. Many bugs manifest only under particular load conditions or with a particular mix of requests.
- Run staging (testing with production-like but synthetic traffic) in a production-like environment. We briefly discuss generating synthetic traffic in Alerting on SLOs of this book.
- Perform canarying (see Canarying Releases) in a production environment.
- Have a low tolerance to new bugs.
See Mitigation delay for more details. Some bugs may manifest only as the result of changing client behavior. For example: Bugs that manifest only under specific levels of load—for example, September back-to-school traffic, Black Friday, Cyber Monday, or that week of the year when Daylight Saving Time means Europe and North America are one hour closer, meaning more of your users are awake and online simultaneously. Bugs that manifest only with a particular mix of requests—for example, servers closer to Asia experiencing a more expensive traffic mix due to language encodings for Asian character sets.
Bugs that manifest only when users exercise the system in unexpected ways—for example, Calendar being used by an airline reservation system!
Therefore, it is important to expand your testing regimen to test behaviors that do not occur every day. There are several techniques you might use to reduce identification delays:
Use good alerts and consoles: Ensure pages link to relevant monitoring consoles, and that consoles highlight where the system is operating out of specification. In the console, correlate black-box and white-box paging alerts together, and do the same with their associated graphs. Make sure playbooks are up to date with advice on responding to each type of alert. On-call engineers should update the playbook with fresh information when the corresponding page fires.
Perform small releases: If you perform frequent, smaller releases instead of infrequent monolithic changes, correlating bugs with the corresponding change that introduced them is easier. Canarying releases, described in Canarying Releases, gives a strong signal about whether a new bug is due to a new release.
Log changes: Aggregating change information into a searchable timeline makes it simpler and hopefully quicker to correlate new bugs with the corresponding change that introduced them.
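A searchable change timeline can be as simple as a list of timestamped records that you filter around the time an alert fired. The sketch below assumes a hypothetical `Change` record and a six-hour default lookback; both are illustrative choices, not recommendations from the text.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    when: datetime
    description: str


def changes_before(timeline: List[Change], alert_time: datetime,
                   lookback: timedelta = timedelta(hours=6)) -> List[Change]:
    """Return changes made within `lookback` before the alert fired,
    most recent first, as candidates for having introduced the bug."""
    start = alert_time - lookback
    hits = [c for c in timeline if start <= c.when <= alert_time]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

Listing the most recent change first reflects the usual starting assumption that the latest rollout is the most likely suspect.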
Tools like the Slack plug-in for Jenkins can be helpful. The on-call engineer is never alone; encourage your team to feel safe when asking for help. Consider these techniques for reducing mitigation delays:
Roll back changes: If the bug was introduced in a recent code or configuration rollout, promptly remove the bug from production with a rollback, if safe and appropriate (a rollback alone may be necessary but is not sufficient if the bug caused data corruption, for example).
The build step of rolling forward may take much longer than 15 minutes, so rolling back impacts your users much less.
Use feature isolation: Design your system so that if feature X goes wrong, you can disable it (via, for example, a feature flag) without affecting feature Y.
Drain requests away: Drain requests away from the affected parts of your infrastructure. For example, if the bug is the result of a code or config rollout, and you roll out to production gradually, you may have the opportunity to drain the elements of your infrastructure that have received the update. This allows you to mitigate the customer impact in seconds, rather than rolling back, which may take minutes or longer.
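The feature isolation technique mentioned above can be illustrated with a very small in-memory flag store. The `FeatureFlags` class, the flag names, and `handle_request` are hypothetical; production systems typically read flags from a configuration service rather than a dictionary.

```python
class FeatureFlags:
    """Minimal in-memory flag store; unknown flags count as disabled."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def disable(self, name):
        self._flags[name] = False

    def is_enabled(self, name):
        return self._flags.get(name, False)


def handle_request(flags):
    """Run each feature only if its flag is on, so that turning off
    feature X leaves feature Y untouched."""
    served = []
    if flags.is_enabled("feature_x"):
        served.append("x")
    if flags.is_enabled("feature_y"):
        served.append("y")
    return served
```

Because each feature is guarded independently, the on-call engineer can disable the misbehaving one without a rollback and without affecting the rest of the request path.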
It is worth highlighting some key elements discussed in that chapter: All alerts should be immediately actionable. There should be an action we expect a human to take immediately after they receive the page that the system is unable to take itself. The signal-to-noise ratio should be high to ensure few false positives; a low signal-to-noise ratio raises the risk for on-call engineers to develop alert fatigue.
If a team fully subscribes to SLO-based and symptom-based alerting, relaxing alert thresholds is rarely an appropriate response to being paged. Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook entry.
Rigor of follow-up
Aim to identify the root cause of every page. For example, imagine a situation where there are too many servers on the same failure domain, such as a switch in a datacenter, causing regular multiple simultaneous failures:
Point fix: Rebalance your current footprint across more failure domains and stop there.
Systemic fix: Use automation to ensure that this type of server, and all other similar servers, are always spread across sufficient failure domains, and that they rebalance automatically when necessary.
Monitoring or prevention fix: Alert preemptively when the failure domain diversity is below the expected level, but not yet service-impacting.
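A preemptive check on failure domain diversity, as in the monitoring fix above, might look like the sketch below. The thresholds (`min_domains`, `max_share`) and the server-to-domain mapping are assumptions for illustration; appropriate values depend on the service.

```python
from collections import Counter


def largest_domain_share(placement):
    """Fraction of servers that sit in the most crowded failure domain.

    `placement` maps each server name to its failure domain (for example,
    the switch it is attached to).
    """
    counts = Counter(placement.values())
    return max(counts.values()) / len(placement)


def diversity_alert(placement, min_domains=3, max_share=0.5):
    """True if servers span too few domains or one domain holds too many."""
    if not placement:
        return False  # nothing deployed, nothing to rebalance
    too_few = len(set(placement.values())) < min_domains
    too_crowded = largest_domain_share(placement) > max_share
    return too_few or too_crowded
```

An alert like this would normally be routed as a ticket rather than a page, since it fires before the service is impacted.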