Abstract

Today's cloud systems heavily rely on redundancy for reliability. Nevertheless, as cloud systems become ever more structurally complex, independent infrastructure components may unwittingly share deep dependencies. These unexpected common dependencies may result in causal failures that undermine redundancy efforts. The state-of-the-art efforts, e.g., post-failure forensics, not only lead to prolonged failure recovery time in the face of structurally complex systems, but also fail to avoid expensive service downtime. We describe RepAudit, an auditing system which builds on the so-called fault graph, to identify causal failures in the cloud. To ensure the practicality, efficiency, and accuracy of our approach, we further equip RepAudit with a domain-specific auditing language framework, a set of high-performance auditing primitives. We leverage SAT solvers to construct efficient fault graph analysis algorithms. To empirically evaluate this claim, we run RepAudit on a real-world cloud storage dataset and we observed that RepAudit is 300 times more efficient than other state-of-the-art auditing tools. 

RepAudit technique has been used in Alibaba Group---the system is named CloudCanary---to perform real-time audits on service update scenarios to identify the root causes of correlated failure risks, and generate improvement plans with increased reliability.  CloudCanary has successfully detected various correlated failure risks such as single-point microservice, common power source, and shared network components, preventing service outages ahead of time. 

The talk will conclude with an overview of the state of the arts on network verification community which widely relies on using SAT/SMT solver to check the correctness of network behaviors.

References:
[1] Ennan Zhai, Ruichuan Chen, David Isaac Wolinsky, Bryan Ford:
Heading Off Correlated Failures through Independence-as-a-Service. OSDI 2014: 317-334

[2] Ennan Zhai, Ruzica Piskac, Ronghui Gu, Xun Lao, Xi Wang:
An auditing language for preventing correlated failures in the cloud. PACMPL 1(OOPSLA): 97:1-97:28 (2017)

This talk will be given in collaboration with Ennan Zhai, a researcher at Alibaba Group.

Video Recording