Building robust distributed systems
Reducing connections and building buffers
I have written before on this blog about what distributed systems are and how they can give us tremendous scalability at the cost of a more complicated system design. Let’s discuss how we can make a distributed system resilient to random failures, which become more common as the system gets larger.
Systems theory tells us that the more interconnected the parts of a system are, the greater the likelihood of large failures. So to build a resilient system, we need to reduce the number of connections. Where this cannot be done, we need ways to “temporarily” sever connections to failing parts so that errors do not cascade to other parts.
Every component has to assume that every other component will fail at some point and decide what it will do when such failures happen.
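One common way to temporarily sever a connection to a failing part is the circuit breaker pattern: after a few consecutive failures, the caller stops contacting the dependency for a cool-down period and falls back to a default. The sketch below is a minimal, single-threaded illustration, not a production implementation. The names `CircuitBreaker`, `CircuitOpenError`, and `call_with_fallback`, and the default thresholds, are all hypothetical choices made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to keep the connection severed
        self.clock = clock                # injectable so the sketch can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: fail fast without touching the dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: let one trial call through. A single
            # failure of that trial call re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result


def call_with_fallback(breaker, func, fallback):
    """Return func()'s result, or the fallback if the call fails or is skipped."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

The fallback is where each component decides what it will do when a dependency fails: serve a cached value, return a degraded response, or skip an optional feature.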
Lastly, we need to build some buffers into the system: ways to relax, if not remove, the demands placed on it so that there is slack to handle unexpected conditions.
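One possible reading of such a buffer is a bounded queue between the part that produces work and the part that handles it. The queue absorbs short bursts, and when it is full the system sheds the extra work instead of falling over. The sketch below assumes that reading. `WorkBuffer`, its capacity, and the drop-on-full policy are hypothetical choices for illustration, and other policies (blocking the producer, dropping the oldest item) may suit a given system better.

```python
import queue


class WorkBuffer:
    """Bounded buffer that absorbs bursts and sheds load when full."""

    def __init__(self, capacity=100):
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0  # how many items were rejected because the buffer was full

    def submit(self, item):
        """Try to enqueue an item; return False instead of blocking when full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, handler, max_items=None):
        """Pass up to max_items buffered items to handler; return the count handled."""
        handled = 0
        while max_items is None or handled < max_items:
            try:
                item = self._queue.get_nowait()
            except queue.Empty:
                break
            handler(item)
            handled += 1
        return handled
```

Because `submit` never blocks, a slow consumer cannot stall its producers. The `dropped` counter makes the lost work visible so the capacity can be tuned.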
Minimize inter-component dependencies
Components of a distributed system communicate with each other for data or functionality. In both cases, we can reduce the requirement of connectivity by pushing the…