[CSE231 Paper Reading] Why Do Computers Stop and What Can Be Done About It?

Summary

  1. Motivation: When designing a highly available system, we repeatedly run into the fail-fast question: which errors should a module report to its caller, and which should simply make the module stop? As the paper notes, von Neumann's redundancy scheme would need on the order of 20K copies of a component to reach a 100-year MTBF. A system that follows the fail-fast principle should be well modularized, with each module choosing only between providing its normal function and ceasing to function. In general, errors the caller is expected to handle should be returned as structured information, while all other errors should simply raise an exception.
  2. The proposed solution is redundancy plus modularity, in both software and hardware. Thanks to modularity, a module's failure only affects that module. Decompose the system into modules hierarchically, then: 1. build each module with an MTBF of more than a year; 2. make each module fail fast; 3. give each module a heartbeat message so its failure is detected promptly; 4. keep spare modules that can replace a faulty one. For software there are two primary checking methods: 1. static checking, performed before the code is ever executed, which when done conservatively may produce numerous false positives; 2. dynamic checking, which verifies the running code and has a low false-positive rate, but may not catch every bug, particularly on rarely exercised code paths. Another technique is process pairs, where a second process takes over after the first fails. Variants include: 1. Lockstep, where both processes execute every instruction; 2. Checkpointing, where the primary periodically saves its state and ships it to the backup; 3. alternatives such as kernel checkpointing and delta checkpointing; 4. Persistence, where the backup relies entirely on persistent storage for its information, in which case an inconsistent persistent store must be prevented.
  3. The evaluation in the paper consists of two examples of achieving availability: 1. Transactions, with their atomicity, consistency, isolation, and durability (ACID) properties; Jim Gray promotes combining transactions with persistent process pairs. 2. Fault-tolerant communication in the Encompass system, implemented with sessions and sequence numbers; TCP uses the same principle, with sequence numbers identifying duplicate and missing messages.
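
The session/sequence-number mechanism from the second evaluation example can be sketched as a minimal receiver. This is a toy illustration with hypothetical names, not the Encompass or TCP implementation:

```python
class SessionReceiver:
    """Receiver side of a session. Each message carries a sequence number;
    comparing it to the next expected number detects duplicates
    (retransmissions) and gaps (lost messages) -- the same idea TCP uses."""

    def __init__(self):
        self.expected = 0    # next sequence number we should deliver
        self.delivered = []  # payloads handed to the application

    def receive(self, seq, payload):
        if seq < self.expected:
            return "duplicate"  # already delivered once; safe to discard
        if seq > self.expected:
            return "gap"        # an earlier message was lost; request resend
        self.delivered.append(payload)
        self.expected += 1
        return "ok"
```

Because duplicates are discarded rather than re-applied, the application sees each message at most once even when the sender retransmits.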

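The checkpointing flavor of process pairs described in point 2 of the summary can be sketched as follows; the class and method names are hypothetical, and real systems would ship checkpoints over a network rather than a method call:

```python
import copy

class BackupProcess:
    """Holds the most recent checkpoint and takes over if the primary dies."""

    def __init__(self):
        self.state = None

    def on_checkpoint(self, state):
        self.state = copy.deepcopy(state)  # snapshot, not a shared reference

    def take_over(self):
        # Resume from the last checkpoint; work done after it is lost
        # and must be redone (or recovered from a log).
        return copy.deepcopy(self.state)

class PrimaryProcess:
    """Does the real work and periodically checkpoints to the backup."""

    def __init__(self, backup, every=3):
        self.state = {"count": 0}
        self.backup = backup
        self.every = every  # checkpoint once every N operations
        self.ops = 0

    def do_work(self):
        self.state["count"] += 1
        self.ops += 1
        if self.ops % self.every == 0:
            self.backup.on_checkpoint(self.state)
```

With `every=3`, after 7 operations the backup holds the checkpoint taken at operation 6, so a takeover loses at most `every - 1` operations; this is the basic trade-off between checkpoint frequency and lost work.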
Critique

  1. The evaluation and the solution are tightly coupled. The paper's operational advice is: 1. maintain hardware on a regular basis; 2. delay installing software updates as long as you can, so they have time to mature; 3. only apply a bug fix if the bug is actually causing outages. But this analysis does not transfer directly to modern system development. For example, suppose we develop a project using the fat-client model: the client caches the user's balance and the item's price, and a business interface deducts the balance to purchase items. In this model a well-behaved client will not issue a purchase request when the balance is too low, so the developer of the server-side business interface has no need to handle and return an insufficient-balance error; the server can simply disconnect the client when the balance is insufficient. The opposite holds in the thin-client model: the client does not know the balance or the price before issuing the request, so the server must return an insufficient-balance business error so the client can prompt the user. Another example: suppose the project keeps a file per user holding that user's balance, created at registration. Under this design, the business interface from the previous example should not try to handle errors such as read/write failures, a missing file, or missing permissions; it should refuse to serve the affected user, throw an exception, and log it, because at that point it is no longer possible to provide correct service. Serving reluctantly anyway would likely cause more problems and even corrupt data.
  2. I think the idea in the paper is elegant: the authors propose making systems more reliable through redundancy and modularity, and they suggest failing fast (panicking) so that the system is easier to debug.
  3. It is arguably the start of a line of systems research that still guides work today. Linux 6.0 added Runtime Verification, which is lightweight yet rigorous, and more tractable for complex systems than classical exhaustive verification techniques such as model checking and theorem proving. Likewise, eADR for persistent memory is a hardware-software co-design answer to the problem of storing state persistently.
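
The server-side policy argued for in the first critique point, return expected business errors but fail fast on infrastructure errors, can be sketched like this. The exception name, function signature, and file layout are all hypothetical:

```python
class InsufficientBalance(Exception):
    """Expected business error: returned to a thin client so it can prompt the user."""

def purchase(user_id, price, balance_dir="/var/lib/app/balances"):
    # Infrastructure errors (missing file, no permission, corrupt content)
    # are deliberately NOT caught here: we fail fast, let the exception
    # propagate, and refuse to serve this user rather than risk
    # contaminating data by limping along.
    path = f"{balance_dir}/{user_id}"
    with open(path, "r+") as f:
        balance = int(f.read())
        if balance < price:
            # Expected error: report it as structured information.
            raise InsufficientBalance(f"balance {balance} < price {price}")
        f.seek(0)
        f.write(str(balance - price))
        f.truncate()
    return balance - price
```

A caller catches only `InsufficientBalance` and turns it into a business response; anything else (for example `FileNotFoundError` for an unregistered user) propagates, gets logged, and stops service for that request, which is exactly the fail-fast module behavior the paper advocates.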

Reference

  1. https://dl.acm.org/doi/pdf/10.1145/265924.265927
  2. https://personal.utdallas.edu/~hamlen/Papers/necula96safe.pdf
  3. https://docs.kernel.org/trace/rv/index.html