PWL NY: Simple Testing Can Prevent Most Critical Failures
Caitie McCaffrey
June 14, 2016
Transcript
Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems
Papers We Love New York - June 2016
Caitie McCaffrey
@caitie
Distributed Systems Engineer
CaitieM.com
Analyzed Failures in Real World Systems
Complexity of Failures
“A majority (77%) of failures require more than one input event to manifest, but most of the failures (90%) require no more than 3”
Complexity of Failures
“The specific order of events is important in 88% of the failures that require multiple events”
Complexity of Failures
“3 Nodes or less can reproduce 98% of Failures”
Unit Tests
“A majority of production failures (77%) can be reproduced by a unit test”
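
To make the unit-test claim concrete, here is a minimal sketch of a single-process test that reproduces an order-dependent failure with three input events. `BlockTracker` and its deliberate bug are hypothetical and written only for illustration; they are not taken from any of the systems analyzed in the paper.

```python
import unittest


class BlockTracker:
    """Toy stand-in for a storage master that tracks block replicas (hypothetical)."""

    def __init__(self):
        self.replicas = {}  # block id -> set of node names holding a copy

    def add_replica(self, block, node):
        self.replicas.setdefault(block, set()).add(node)

    def node_failed(self, node):
        # Drop the failed node from every block's replica set.
        for nodes in self.replicas.values():
            nodes.discard(node)

    def read(self, block):
        nodes = self.replicas[block]
        # Deliberate bug: assumes at least one replica always remains.
        return sorted(nodes)[0]


class ReproduceFailureTest(unittest.TestCase):
    def test_read_before_node_failure_succeeds(self):
        tracker = BlockTracker()
        tracker.add_replica("b1", "n1")
        self.assertEqual(tracker.read("b1"), "n1")

    def test_read_after_last_replica_fails(self):
        # Three input events, in this specific order, trigger the bug.
        tracker = BlockTracker()
        tracker.add_replica("b1", "n1")  # event 1
        tracker.node_failed("n1")        # event 2
        with self.assertRaises(IndexError):
            tracker.read("b1")           # event 3
```

Saved to a file, the tests can be run with `python -m unittest <file>`. No cluster is needed: the failure depends only on the order of a few events, which matches the slides' point about small event counts and node counts.
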
Top Down Fault Injection & State Space Exploration is Expensive
Logging
• 76% of the failures print explicit failure-related error messages
• For 84% of the failures, all of the triggering events are logged
• Logs are noisy: each failure prints 824 log messages (median)
Catastrophic Failures
Error Handling
• 92% of catastrophic failures were the result of incorrect handling of non-fatal errors
• 58% of faults could have been detected via simple testing
• 35% of failures caused by bad practices in error handling code
Bad Practices
• Error Handling Code is simply empty or only contains a Log statement
• Error Handler aborts cluster on an overly general exception
• Error Handler contains comments like FIXME or TODO
Aspirator
Performs static analysis of Java bytecode to detect:
• error handler is empty
• error handler over-catches exceptions and aborts
• error handler contains phrases like “TODO” or “FIXME”
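
The three checks listed above can be sketched for Python source using the standard `ast` module. This is only an analogy: Aspirator itself works on Java bytecode, and the function names, the set of logging method names, and the set of "abort" calls below are assumptions chosen for illustration.

```python
import ast
import re

MARKER = re.compile(r"\b(TODO|FIXME)\b")
LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}
ABORT_ATTRS = {"exit", "_exit", "abort"}  # e.g. sys.exit, os._exit, os.abort


def _is_log_call(stmt):
    """True for a bare statement such as log.warning(...) or print(...)."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return False
    func = stmt.value.func
    if isinstance(func, ast.Name):
        return func.id == "print"
    return isinstance(func, ast.Attribute) and func.attr in LOG_METHODS


def _is_abort_call(node):
    """True for calls that terminate the process, e.g. sys.exit(1)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    if isinstance(func, ast.Name):
        return func.id in {"exit", "quit"}
    return isinstance(func, ast.Attribute) and func.attr in ABORT_ATTRS


def _catches_everything(handler):
    """True for a bare except, except Exception, or except BaseException."""
    if handler.type is None:
        return True
    return (isinstance(handler.type, ast.Name)
            and handler.type.id in {"Exception", "BaseException"})


def check_handlers(source):
    """Return sorted (line, message) findings for except blocks in source."""
    findings = []
    lines = source.splitlines()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        if all(isinstance(s, ast.Pass) or _is_log_call(s) for s in node.body):
            findings.append((node.lineno, "handler is empty or only logs"))
        if _catches_everything(node) and any(
                _is_abort_call(n) for n in ast.walk(node)):
            findings.append((node.lineno, "over-broad catch aborts the process"))
        # Comments are not in the AST, so scan the handler's source lines.
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if MARKER.search(text):
            findings.append((node.lineno, "handler contains TODO/FIXME"))
    return sorted(findings)


SAMPLE = '''
try:
    flush()
except IOError:
    pass
try:
    replicate()
except Exception:
    # FIXME: decide what to do here
    sys.exit(1)
'''

for line, message in check_handlers(SAMPLE):
    print(f"line {line}: {message}")
```

The sketch only parses the code and never runs it, so names such as `flush` and `replicate` in `SAMPLE` do not need to exist. It requires Python 3.8 or later for `end_lineno`.
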
Aspirator Results
• 500 New Bugs & Bad Practices
• 115 False Positives
• 171 bugs reported
• 143 bugs confirmed or fixed
Developer Reactions
“I fail to see the reason to handle every exception” - developer
“It is often much harder to reason about the correctness of a system’s abnormal path than its normal execution path”
Moving Forward
• Use a tool like Aspirator that is capable of identifying trivial bugs
• Enforce code reviews of error handling code
• High code coverage on error handling code
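
One way to get coverage on error handling code is to inject the error in a test rather than wait for it in production. The sketch below uses `unittest.mock` from the Python standard library. `save_with_retry`, `WriteFailed`, and the `store.put` interface are hypothetical names invented for this example.

```python
from unittest import mock


class WriteFailed(Exception):
    """Raised when a write still fails after all retries (hypothetical)."""


def save_with_retry(store, key, value, attempts=3):
    """Write to the store, retrying only on ConnectionError."""
    last_error = None
    for _ in range(attempts):
        try:
            store.put(key, value)
            return
        except ConnectionError as err:  # narrow catch: only what we can handle
            last_error = err
    # Surface the failure instead of swallowing it or aborting the process.
    raise WriteFailed(f"gave up on {key!r} after {attempts} attempts") from last_error


def test_retries_after_transient_error():
    store = mock.Mock()
    # First call raises, second call succeeds.
    store.put.side_effect = [ConnectionError("timeout"), None]
    save_with_retry(store, "k", "v")
    assert store.put.call_count == 2


def test_surfaces_persistent_error():
    store = mock.Mock()
    # Every call raises, so the handler must give up and report.
    store.put.side_effect = ConnectionError("down")
    try:
        save_with_retry(store, "k", "v", attempts=2)
    except WriteFailed as err:
        assert isinstance(err.__cause__, ConnectionError)
    else:
        raise AssertionError("expected WriteFailed")
    assert store.put.call_count == 2
```

Both tests exercise the `except` branch directly, which is the path the slides report as under-tested. The handler catches one specific exception type and reports a persistent failure to the caller, avoiding the bad practices listed earlier.
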
Questions @caitie