Failure Is Not an Option: Workshop Notes

Failure Is Not an Option; It’s Required

To be presented at the Cascadia IT Conference, Seattle, WA, March 7-8, 2014  (55 minutes)

Abstract

Google SREs (site reliability engineers) spend almost 90% of their time handling or anticipating failure.[1] Managing failure is at the core of keeping Google services as reliable as they are. In this workshop, we will explore some of the principles Google employs to make services reliable and how you might use them in your own work.

Everything fails. With a little planning, you can fail well.

Shorter Summary

A workshop to explore some ways that Google SRE makes Google as reliable as it is, and how you can make your services more reliable, too, with a focus on failing well.

More About Me

Brian Haney has been a system administrator and SRE systems engineer for seven years. When not writing, speaking or teaching, he helps maintain some of Google’s internal storage infrastructure.

Notes:

[1] I posit this without comprehensive data, but will make the case for it in the workshop.

[The above proposal was emailed to the Cascadia IT Conference Program Committee in December 2013]


Outline and Notes:

Failure is not an option; it’s required.

A few cautionary tales from disaster recovery tests at Google and elsewhere

Objective:

My goal is to send you out of this room with way more questions than you brought in. Take these questions home and ask them of your team, your boss, your internal customers, your CEO. Get everyone thinking about failure.

Rough Notes:

Failure is normal.

We live a life of failure.

Systems should be built with failures as a part of normal operations.

Q: What was the last failure (past or anticipated) that you have dealt with?
What was the most memorable one?

Most disasters are caused by people mishandling simple failures.

  • Comms fail --> backup comms fail --> chaos ensues

Q: Do any of you have examples of poor emergency communications, outdated emergency procedures, stale documentation, or siloed expertise?

Test the plan.

SSF intern backup story:

When I was a new systems admin back in the ‘90s, I heard a story. Some company in South San Francisco hired a summer intern to implement their tape backup system. Every time it would run, it reported a list of the files it had backed up, so the boss was happy, and the intern went back to school in the fall. The backup reports kept coming in. Everything looked great.

Then there was a building fire. The server room was destroyed. They quickly got the insurance money, reopened in a new office just down the street, pulled the backup tapes from offsite storage, and restored them. The tapes contained only the lists of files, not the files themselves. All of the customer data was gone. The company went out of business within six months.

Moral of the story: Do a COMPLETE disaster recovery test. Test the plan.
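
One concrete way to act on that moral, short of a full drill, is to verify restores automatically rather than trusting backup reports. Below is a minimal Python sketch; the source path and the my-backup-tool restore command are hypothetical placeholders for whatever backup tooling you actually use. It restores a random sample of files from backup media and compares checksums against the live copies.

```python
import hashlib
import random
import subprocess
from pathlib import Path

SOURCE_ROOT = Path("/srv/data")           # live data (assumes it is still readable)
RESTORE_ROOT = Path("/tmp/restore-test")  # scratch area for the test restore
SAMPLE_SIZE = 25                          # number of files to spot-check

def sha256(path: Path) -> str:
    """Checksum a file in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def restore_from_backup(relative_path: Path) -> Path:
    """Hypothetical wrapper around your real restore tooling.

    The command below is a placeholder; the point is that the file must come
    back from the backup media, not from the live server.
    """
    target = RESTORE_ROOT / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["my-backup-tool", "restore", str(relative_path), str(target)],
        check=True,
    )
    return target

def main() -> None:
    all_files = [p for p in SOURCE_ROOT.rglob("*") if p.is_file()]
    sample = random.sample(all_files, min(SAMPLE_SIZE, len(all_files)))
    mismatches = 0
    for src in sample:
        restored = restore_from_backup(src.relative_to(SOURCE_ROOT))
        if sha256(src) != sha256(restored):
            mismatches += 1
            print(f"MISMATCH: {src}")
    print(f"{len(sample) - mismatches}/{len(sample)} sampled files restored correctly")

if __name__ == "__main__":
    main()
```

A spot check like this would have exposed the intern's setup immediately: the "restored" files would have been lists of names, not data. A complete DR test goes further and restores everything onto replacement hardware.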

DiRT Objectives

Readiness, Response, Recovery

DiRT Participation

All SRE teams. Non-SRE (and non-tech) teams encouraged.

What is SRE?

Site Reliability Engineering: engineers focused on reliably delivering services

Roles in DiRT

  • Proctor: designs theoretical or practical (preferred) tests.
  • Target Service Team: responds as able, as directed by the service oncall and the team tech lead.
  • DiRT Team: coordinates proctors and the various tests; collects test reports, curates the results, and follows up with teams to fix identified problems.

Success metric: FAILURE. Test for “Readiness, Response, and Recovery.” A good test identifies a weakness that needs to be fixed before the disaster that is not a drill. If everything goes hunky-dory, you probably need to revisit your test design and assumptions.

Rules of engagement

  • Real emergencies trump simulated ones
  • Label comms clearly
  • NO impact to external users
  • Treat simulated emergencies as you would real ones

DiRT Week

A war room coordinates the tests, providing a controlled, benign environment in which to simulate emergencies.

Systems are people, too. “Cafe DoS” case study.

A couple of years ago, a DiRT exercise to test Google’s reliance on the source code version control system took it offline at 6:00 PM. Google engineers like to work late, and many were trying to end their day on a high note, but couldn’t. Thousands of engineers with nothing better to do descended upon the Google cafes promptly when they opened for dinner at 6:30 PM. The result: one of the worst DoS attacks Google has ever suffered, and it was inadvertent and self-inflicted.

Moral: Under-capacity is a failure, too.

The element of surprise.

Recently, a major server room (think: mini data center) at Google HQ was scheduled to go offline for a weekend of network maintenance. This was a planned outage. My team, Corp Storage, had over a month to prepare to keep Google engineers productive over the weekend (yes, many Googlers like to work on weekends). We spent weeks preparing to fail over all file servers to our DR site in another state. Each volume of storage hiccuped for a few minutes while we switched to the DR server. It took five guys nine hours to execute the failovers and test each one. (But few people even noticed the failover, except for higher latency than normal.) And we had weeks to get ready! What would this look like if we had less than an hour to prepare?

Moral of the story: plan ahead.
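
If those nine hours of manual failovers were on your plate, the failover-and-verify loop is the obvious thing to script ahead of time. Here is a rough sketch; the volume map and the promote_dr_replica / check_volume_readable helpers are hypothetical placeholders for whatever your storage system actually provides.

```python
import time

# Hypothetical mapping of volumes to their DR replicas, maintained alongside the DR plan.
FAILOVER_MAP = {
    "projects": "dr-filer-01:/projects",
    "homes": "dr-filer-02:/homes",
}

def promote_dr_replica(volume: str, replica: str) -> None:
    """Placeholder: break the mirror and start serving the volume from the DR site.
    Replace the print with your storage system's real failover command."""
    print(f"  promoting {replica} to serve {volume}")

def check_volume_readable(volume: str) -> bool:
    """Placeholder: in a real drill, mount the volume from a client machine and
    read a canary file; never verify from the filer itself."""
    return True

def fail_over_all() -> None:
    for volume, replica in FAILOVER_MAP.items():
        start = time.monotonic()
        promote_dr_replica(volume, replica)
        healthy = check_volume_readable(volume)
        elapsed = time.monotonic() - start
        status = "OK" if healthy else "NEEDS ATTENTION"
        # Per-volume timings give the next drill (or the real thing) numbers to plan around.
        print(f"{volume}: {status} ({elapsed:.1f}s)")

if __name__ == "__main__":
    fail_over_all()
```

Even a skeleton like this forces the preparation the story is about: you have to know the volume-to-replica mapping, the failover command, and the verification step before the clock is running.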

Sometimes it is best to just wait it out -- a case study.

Your HQ file servers are accessed from all over the company. As a DR strategy, you mirror the file servers to a data center in another state. They synchronize every hour.

The network egress ports leaving HQ go down. The file servers are fine and accessible from the HQ campus, but not from the remote offices. A European office has a critical legal compliance deadline and MUST have access to that HQ data NOW. What do you do?

  1. Do you do a “dirty” failover to the DR copies? (“Dirty” meaning the DR copies may be up to an hour behind; recent changes on the masters will not have been synced over.)
  2. Do you simply wait out the disaster?
  3. Do you point out the read-only access to the DR copies?
  4. This is a controlled test. Do you immediately whitelist access to the HQ file server in question?

Moral of the story: Sometimes the best action is to do nothing.

Lose No Data

Apparently, Gmail had a significant outage in 2011. I missed it. But you can watch the recently posted Google NYC Tech Talk about it on YouTube (1h 14m). The gist: they were able to recover ALL of the data that had been damaged by a bad software deployment. The recovery took three days, but users LOST NO DATA.

Moral of the story: Don’t focus only on backing up; prove you can restore.

DiRT Revisited:

Events

  • Design and review the test
  • Conduct test
  • Review the results and follow-up (DiRT bugs and post mortems)

Post Mortems

  • Not just a DiRT thing; any engineer can call for a post mortem review of any outage
  • Expected to be complete within 72 hours of a request
  • All about prevention, not about blame. How can we improve our operations so this doesn’t happen again?
  • Focus:
    • What happened? (summary of events)
    • What worked?
    • What can be improved? (lessons learned, bugs, action items)
    • Appendices: details (logs and supporting data)

Not Just Hardware and Software -- Revisited

  • For non-technical teams, what would happen to their business operations if a site or service “went away”? For example, what would your accounting team do if the accounting server went down? Or, would your sales team still be able to execute sales while you were bringing up a spare server and restoring data? Have you tested that?
  • Who are your key vendors? What is the fallback plan if they suffer a disaster and cease operations? Yes, this can quickly leave the realm of IT, but the question has to be asked and the procedure tested. “Systems” go WAY beyond IT.
  • For all teams, who can approve emergency expenditures, such as to buy more fuel for the power generators or a roomful of new servers to replace those damaged by a fire? What if they are on vacation during the disaster?
  • PeopleOps and PR teams get tested, too. They draft internal announcements, create response forms, and have to ask themselves, “How would we communicate this publicly?” Even English grammar gets tested.

Classic problems

  • Who ya gonna call?
    • Who is your PANIC Team? Do they know how to check in to the disaster-management communication channel? Who will be the lead coordinator?
  • What is your primary disaster communications channel? Secondary? Tertiary?
    • In a real disaster, phones might be offline, cell networks swamped, etc.
    • Phone bridge? IRC? Google Hangouts? Skype? Smoke signals?
  • Who is the central incident manager? Does everyone know how to reach him/her?


Questions as Exercises

  1. Think of the three most likely disasters that might befall your server room. Consider that most disasters are caused by people: bad software updates, bad config changes, restoring data to the wrong place, poor planning, ...
    Prizes for the most interesting and imaginative scenarios.
  2. What is your most valuable IT resource? It’s not services. Did you say “Data”?
  3. How do you partition your failure domains?
    • Physical location
    • Application software
    • System software
    • Storage infrastructure
    • Storage media
    • Why do you use those failure domains?
    • What other failure domains might be relevant?
      • People (physical security)
      • Governments run amok
      • Vendors
    • What “attack” vectors might befall you?
  4. How do you verify your DR plan?
  5. How much should you spend preparing for disaster? How much will it cost your organization if you are NOT prepared for disaster?
    • Consider how much it would cost your organization if one of your most likely disasters befell it. Consider the likelihood that such a scenario, or one like it, would occur. Multiply those together. That product is your maximum disaster-readiness budget (see the sketch after this list).
  6. When can you say, “Test passed. Bring on disasters.”? NEVER.
    Systems are never static.
    The Law of Entropy implies that, by default, your systems are more vulnerable today than they were last year.
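
To make exercise 5 concrete, here is the expected-cost arithmetic as a tiny Python sketch. The scenarios, costs, and likelihoods below are made-up illustrations, not data; substitute your own estimates.

```python
# Expected cost of a failure = (cost if it happens) x (likelihood it happens this year).
# Each product bounds what that scenario alone is worth spending to prevent; the sum
# gives one rough ceiling for an annual disaster-readiness budget.

scenarios = {
    # name: (estimated cost in dollars, estimated annual likelihood) -- illustrative only
    "bad config change takes storage offline for a day": (250_000, 0.30),
    "building fire destroys the server room": (2_000_000, 0.01),
    "key vendor ceases operations": (500_000, 0.05),
}

budget_ceiling = 0.0
for name, (cost, likelihood) in scenarios.items():
    expected = cost * likelihood
    budget_ceiling += expected
    print(f"{name}: expected cost ${expected:,.0f}/year")

print(f"Rough maximum disaster-readiness budget: ${budget_ceiling:,.0f}/year")
```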



Scratch Notes (here to bottom are not part of the talk)

Anecdotes from DiRT ‘14 (and DiRTs past)

  • non-eng teams get tested, too (BizOps team overseas)
  • how is your corporate communications structured?
  • is your food service staff ready? (food-borne illness or Cafe DoS attacks)
  • executive backup communications (lost phones, dead batteries, etc.)
  • What does your DiRTy laundry look like?

80-90% of SRE planning is about anticipating, preventing, or avoiding failure.

Causes of failures:

  • moving parts
  • hot parts
  • dependencies/infrastructure
  • people

Economics of failure

  • (Cost of a given failure) X (Likelihood of that failure) = Expected cost of that failure (which maps into your DR budget)
  • Note that the budget can reduce risks, but not eliminate them.

Plan for failure

  • resilience by design
  • capacity
  • redundancy
    • N+1? N+2? 2(1+1)? (see the sketch below)
  • sharding
  • replication
  • hot/cold/warm spares
  • failover
    • overhead
    • risk
  • Procedure
  • People
  • Practice
---------------

What does it mean "to fail"?

When is failure a Good Thing? Failure in the abstract. Teachable moments, opportunities for growth, "building character".

What does it mean to "fail well"? Are you flying or are you "falling with style."

Failure domains. Isolating failure. N+2. Hidden dependencies.

Do you understand the modes of failure of your systems? All modes? All systems? How do you test for interaction among systems in sympathetic modes of failure?

What is a "disaster"? Is sub-par performance at revenue of $2000/minute a disaster?

What is "recovery"?

DiRT. Testing: walk-through, talk-through, kick-through. Simulating failure. Testing recovery systems, not primary systems.

Mini-DiRTs. Wheel of misfortune.

Who are your stakeholders? Are they testing also? Are they prepared for your systems to fail?

Simulating disasters. Test designers. Proctors. War room. Whitelist.

Post mortems & lessons learned.

Team organization. The role of oncall in a disaster.

----------

bbeck & mharo:

DiRTy stories to illustrate lessons learned and best practices

DiRT overview

test design, conduct, evaluation, DiRT bug hotlist
