Failure Is Not an Option: Workshop Notes

Image credit: Google Gemini

Failure Is Not an Option; It's Required

To be presented at the Cascadia IT Conference, Seattle, WA, March 7-8, 2014 (55 minutes)

Abstract

Google SREs (site reliability engineers) spend almost 90% of their time handling or anticipating failure [1]. Managing failure is at the core of keeping Google services as reliable as they are. In this workshop, we will explore some of the principles Google employs to make services reliable and how you might use them in your work. Everything fails. With a little planning, you can fail well.

Shorter Summary

A workshop exploring some of the ways Google SRE makes Google as reliable as it is, and how you can make your own services more reliable, with a focus on failing well.

More About Me

Brian Haney has been a system administrator and SRE systems engineer for seven years. When not writing, speaking, or teaching, he helps maintain some of Google's internal storage infrastructure.

Notes:

[1] I posit this without comprehens...