Reliable Distributed Systems Foundations… in Practice

Abstract

Distributed systems are difficult to build, algorithms are subtle and bug-prone, and performance is often at odds with consistency. It is therefore crucial to have solid algorithmic foundations that are succinct and simple to follow. This talk examines how some of the most well known foundations in distributed computing evolved through the drive of practical impact. For example, the past decade exposed interesting anomalies in fundamental replication schemes such as Paxos. It saw the emergence of a new algorithm called Vertical Paxos for managing dynamic systems, that bridges between the Paxos theory and engineering practices. Yet, despite increasing insight into distributed systems, there is still a strong deficiency in the set of tools available for building scale-out data platforms. State-of-the-art tools for scaling systems, for example NOSQL systems, typically partition state to achieve scalability. With such systems, obtaining a consistent view of global state often requires complicated and expensive distributed protocols. This led a team of researchers at Microsoft to design a new method, called Corfu, that provides high-throughput scale-out replication with strong consistency guarantees. Corfu spreads updates over a large cluster, each replicated over a small group of nodes. All groups are “stitched” into a global log using a soft-state sequencer, which is neither an IO bottleneck, nor a single point of failures. Fast forward to today, we describe how Corfu is being used by a team of researchers at VMware Research to build a distributed control plane for VMware’s software defined networking (SDN) technology.