[CSE231 Paper Reading] Eraser: A Dynamic Data Race Detector for Multithreaded Programs

Summary

  1. Motivation: prior dynamic race detectors were based on Lamport's happened-before relation, which is expensive to track and sensitive to the particular thread interleaving observed. In practice, multithreaded programs protect shared variables with mutex locks so that other threads cannot access them concurrently, so Eraser instead checks that this locking discipline is actually followed.

  2. They use the Lockset algorithm, which observes whether the same locks are held whenever a piece of shared data is accessed and from that infers which locks protect which data items. Since Eraser has no way to know the intended purpose of each lock, it must infer the lock-protection relationship from the access history. For each shared variable v, Eraser maintains a candidate lock set C(v), which holds all locks that have so far protected v. When the variable is initialized, C(v) is the set of all possible locks; on each access, C(v) <- C(v) $\cap$ the set of locks held by the current thread. Obviously, this set can only shrink or stay the same. At the fixed point, either all threads protect v with a consistent, non-empty set of locks (refinement succeeded), or C(v) is empty and the locking discipline is violated. To weaken this overly strict rule, Eraser assumes that initialization needs no lock, that read-only data needs no lock, and that read-write locks allow a single writer with multiple readers. If no other thread can refer to a variable, there is no need to lock it; this is true for newly allocated data. To avoid false alarms, Eraser delays refinement of the candidate set until after initialization is complete. However, detecting that initialization is complete is difficult, so refinement is deferred until the shared variable is first accessed by a second thread. If a variable is only ever read and written by one thread, refinement never runs. Likewise, read-only objects need no protection; this is supported by warning only about shared variables that are written by multiple threads. (A minimal sketch of this state machine follows this list.)

    For read-write locks, a write access refines C(v) using only the locks held in write mode, since a lock held in read mode cannot protect the variable against a concurrent writer.

    Eraser is implemented on Digital Unix for the Alpha processor using the ATOM [Srivastava and Eustace 1994] binary modification system. Eraser takes an unmodified binary as input and produces a functionally identical binary with the Eraser runtime inserted. It instruments each global/heap load and store to maintain C(v) (distinguishing them via the address and its relation to the stack pointer); each lock acquire/release call and thread creation/termination to maintain the set of locks held by each thread; and each memory allocation to initialize C(v), since every word can potentially become a shared variable. (Much of this trouble comes from operating on the binary rather than the source.) It also tracks stack addresses that are reachable through general registers rather than the stack pointer. When a race is detected, Eraser logs the file and line number, a backtrace of the stack frames, the thread ID, the memory address, the access type, and the PC and stack-pointer values, to help locate the offending code. If that is not enough, Eraser can be told to log every access.

  3. The authors found that the bulk of the overhead comes from ATOM making a procedure call for every instrumented access; upgrading to the 1996 version of ATOM allows the instrumentation to be inlined. They conclude that performance was not greatly affected and was not the main goal of the tool anyway: the slowdown by a factor of 10 to 30x is unavoidable but tolerable for a debugging tool.
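
To make the refinement concrete, here is a minimal Python sketch of the lockset state machine as the paper describes it (the structure and names are mine, not Eraser's actual implementation):

```python
from enum import Enum, auto

class State(Enum):
    VIRGIN = auto()           # freshly allocated, never accessed
    EXCLUSIVE = auto()        # accessed by a single thread so far
    SHARED = auto()           # read by multiple threads, no second-thread write
    SHARED_MODIFIED = auto()  # written by multiple threads: races are reported

class Shadow:
    """Per-variable shadow word: state plus candidate lock set C(v)."""
    def __init__(self):
        self.state = State.VIRGIN
        self.owner = None
        self.lockset = None   # conceptually "all locks" until the first refinement

def access(v: Shadow, thread_id, locks_held: set, is_write: bool):
    if v.state is State.VIRGIN:                  # first access: record the owner
        v.state, v.owner = State.EXCLUSIVE, thread_id
        return
    if v.state is State.EXCLUSIVE:
        if thread_id == v.owner:                 # still one thread: no refinement
            return
        v.state = State.SHARED_MODIFIED if is_write else State.SHARED
        v.lockset = set(locks_held)              # refinement starts at 2nd thread
        return
    v.lockset &= locks_held                      # C(v) <- C(v) ∩ locks_held(t)
    if is_write:
        v.state = State.SHARED_MODIFIED
    if v.state is State.SHARED_MODIFIED and not v.lockset:
        print(f"potential data race: empty C(v), thread {thread_id}")
```

Note how the SHARED state still refines C(v) but never reports: this is exactly the read-only relaxation, since only variables written by a second thread can reach SHARED_MODIFIED.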

Critique

  1. They monitor C/C++ allocators and locks, which requires binary rewriting and a runtime function call per event. Eraser still cannot observe kernel-level or microarchitecture-level races: for example, a softirq can change memory state outside the observed locking discipline, so the simple observation model no longer holds. To cover memory safety and race conditions inside the kernel, the tool would need to model all possible state changes, which would make it much more sound.
  2. This is a combination of art and engineering. It is art because it adopts a deliberately weaker formalization and simply observes thread state changes instead of tracking happened-before; it is engineering because binary rewriting is dirty work.
  3. It is the start of the whole debugging and software-engineering line of work for the kernel. Tracing bugs inside the kernel is a long-discussed topic; newer eBPF tools can even live-patch the kernel, reducing the overhead of kernel instrumentation and staying portable, without writing a kernel module or hacking LSM to do so.

Reference

  1. https://kexiang.me/2021/11/30/1759-NOV-25/
  2. https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/

[CSE231 Paper Reading] Safe Kernel Extensions Without Run-Time Checking

Summary

  1. Motivation: run-time checking inside the kernel is too time-consuming. The naive formal-verification alternative is for the code consumer to verify the safety of the code itself, but that verification takes a great amount of time: it typically attaches pre/postconditions to the assembly code, defines transition rules for each kind of instruction, and finally uses a theorem prover to check that the code satisfies the conditions.

  2. The technique lets the kernel check extension safety cheaply. The code receiver defines a set of safety rules that guarantee the safe behavior of programs; the code producer then creates a formal safety proof that its code complies with those rules; and the receiver uses a simple, fast validator to check that the code is safe. The proof-carrying code is the message that gets passed: the producer attaches the proof to the code, which offloads the consumer's work. Most of the validation effort is done by the code producer, and how the proof was constructed is irrelevant to the code consumer. PCC programs are tamper-evident: changes to the code render the proof invalid. There are no trusted third parties, and errors are found before any code is executed. PCC contains several basic components: 1. a formal specification language to convey the safety policy; 2. a formal semantics of the untrusted code's language; 3. a language in which to express the proofs; 4. a validation algorithm for the proofs; 5. a process for producing the safety proofs. Concretely, they pair a simple LF type checker (the validator) with a theorem prover that constructs the safety proofs. (A toy sketch of the protocol follows this list.)

  3. They evaluated PCC on BPF-style filters by implementing four packet filters in assembly language and certifying their safety with respect to the packet filter safety policy, which essentially requires proving that every memory access stays within the input packet and the filter's scratch memory. They measured the average per-packet runtime of the four PCC packet filters against the BSD packet filter interpreter (which is far slower), against software fault isolation, and against a safe subset of Modula-3 plus the VIEW extension for safe pointer casting.
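
As a toy illustration of the division of labor (not Necula and Lee's actual LF machinery; the names and the "proof" format here are invented), the producer runs the expensive analysis once and ships a per-instruction witness, while the consumer only runs a cheap check against its own policy:

```python
MEM_SIZE = 64  # the receiver's safety policy: every load address must be < MEM_SIZE

def produce(program):
    """Code producer: does the heavy lifting and emits a proof alongside the code.
    Here the 'proof' is simply a claimed in-bounds fact for every load."""
    proof = [("load_in_bounds", addr) for (op, addr) in program if op == "load"]
    return program, proof

def validate(program, proof):
    """Code consumer: a simple, fast validator. It never reconstructs the proof;
    it only checks that the proof covers the code and satisfies the policy."""
    loads = [addr for (op, addr) in program if op == "load"]
    claimed = [addr for (fact, addr) in proof if fact == "load_in_bounds"]
    # any tampering with the code breaks the correspondence below
    return claimed == loads and all(addr < MEM_SIZE for addr in claimed)

code, proof = produce([("load", 3), ("load", 63)])
assert validate(code, proof)   # errors are caught before the code ever runs
```

The real system expresses the policy as logical predicates, the proofs as LF terms, and validation as LF type checking, but the asymmetry is the same: proving is expensive and done once by the producer; checking is cheap and done by the receiver.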

Critique

  1. The experimental results show that PCC lets an operating-system kernel interact safely with untrusted code while imposing no runtime overhead on the receiver; moreover, the safety policies are defined by the receiver, which makes the scheme simpler and more flexible.
  2. This work is an engineering effort, because they actually implemented much of the prover. Notably, the verification requires neither zero-knowledge proofs nor cryptography, just pure proof offloading. The verifier framework can be applied to any runtime system once the PCC protocol is defined, which is great for systems that require high robustness. That said, I think the pragmatic path is rewriting the code, especially driver code, in Rust, which would eliminate the vast majority of memory and type issues.
  3. A type checker in the style of Algorithm W, which OCaml's type system builds on, is a stronger static checker, but it is not applicable inside a runtime system because it is too time-consuming. The current eBPF verifier is an offline verifier combined with online checks, so this work reads as guidance for today's systems that require memory safety and semantic safety, although the eBPF verifier settles for a weaker notion of proof.

Reference

  1. https://www.slideserve.com/keaton/ensuring-code-safety-without-run-time-checks-for-real-time-control-systems/
  2. https://www.slideserve.com/pamelia/safe-kernel-extensions-without-run-time-checking-powerpoint-ppt-presentation
  3. https://stackoverflow.com/questions/29806121/how-to-type-check-recursive-definitions-using-algorithm-w

[CSE231 Paper Reading] Why Do Computers Stop and What Can Be Done About It?

Summary

  1. Motivation: when designing a system for high availability, we must decide how failures surface, which forces us to pin down the expected semantics of the code. As studied in the paper, von Neumann-style redundancy without modularity would need on the order of 20,000 copies of a component to reach a 100-year MTBF. A system that practices fail-fast should instead be well modularized, with each module choosing only between providing its normal function and ceasing to function. Generally speaking, errors the caller is expected to handle should be returned as information, and everything else should simply raise an exception.
  2. The solution is redundancy plus modularity, in software or hardware. With modularity, a module's failure only impacts that module. Decompose the system into modules hierarchically: 1. design each module with an MTBF greater than a year; 2. make each module fail fast; 3. give each module a heartbeat message so its failure is noticed; 4. keep backup modules that can replace a faulty one. There are two primary checking methods: 1. static checking, performed before the code ever executes, which if done cautiously may produce numerous false positives; 2. dynamic checking of running code, which has a lower false-positive rate but may not catch every bug, particularly on rarely exercised code paths. Another solution is process pairs, in which the second process takes over after the first fails: 1. lockstep, where both execute every instruction; 2. checkpointing, where the primary periodically saves its state and ships it to the backup (a minimal sketch follows this list); 3. variants such as kernel checkpointing and delta checkpointing; 4. persistence, where the backup relies entirely on permanent storage for its state, which requires preventing an inconsistent persistent store.
  3. The evaluation in the paper is basically two examples of availability in practice: 1. transactions with the atomicity, consistency, isolation, and durability (ACID) properties; Jim Gray promotes combining transactions with persistent process pairs. 2. fault-tolerant communication in the Encompass system, implemented with sessions and sequence numbers; TCP uses sequence numbers to identify duplicate and missing messages on the same principle.
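
A minimal sketch of the checkpointing style of process pairs (the class names are invented; real systems such as Tandem's ship checkpoints as messages between processors):

```python
import time

class Backup:
    def __init__(self):
        self.state = None                        # last state received from primary
        self.last_beat = time.monotonic()

    def on_checkpoint(self, state):
        self.state = state                       # primary periodically ships state
        self.last_beat = time.monotonic()        # checkpoint doubles as a heartbeat

    def primary_dead(self, timeout=2.0):
        return time.monotonic() - self.last_beat > timeout

class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.counter = 0

    def step(self):
        self.counter += 1                        # do a unit of work
        if self.counter % 10 == 0:               # periodic, not per-operation
            self.backup.on_checkpoint(self.counter)

backup = Backup()
primary = Primary(backup)
for _ in range(100):
    primary.step()
# If the primary stops stepping, backup.primary_dead() eventually becomes True
# and the backup resumes from the last checkpointed state, losing at most the
# work since that checkpoint; checkpoint frequency is the tuning knob.
```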

Critique

  1. The evaluation and the proposed practice are tightly coupled: 1. maintain hardware on a regular basis; 2. delay software updates as long as you can to give them time to mature; 3. fix a bug only if it is causing outages. But the analysis does not transfer unchanged to current development practice. For example, suppose we build a project on the fat-client model, where the client caches the user's balance and item prices, and a business interface consumes the balance to purchase items. In this model a well-behaved client never issues the purchase request when the balance is too low, so the developer of the server-side business interface does not need to return an insufficient-balance error; the server can simply drop the connection when the balance is insufficient. The opposite case: in the thin-client model, the client does not know the balance or the price before issuing the request, so the server must return an insufficient-balance business error that the client can use to prompt the user. Another example: suppose each user's balance is kept in a file created at registration time. Under this design, the business interface from the previous example should not try to handle errors such as read/write failures or a missing or permission-denied file; it should refuse to serve the affected user, throw an exception, and log it, because at that point correct service is no longer possible, and serving reluctantly would likely cause more problems and even contaminate data (see the sketch after this list).
  2. I think the idea brought by the paper is art: it proposes how to make systems more reliable using redundancy and modularity, and it also suggests how panics should be raised so the system is easier to debug.
  3. It is definitely a starting point that still guides today's systems research. Linux 6.0 added Runtime Verification, which is lightweight yet rigorous and more tractable for complex systems than classical exhaustive techniques such as model checking and theorem proving. Likewise, eADR for persistent memory is a hardware-software co-design answer to the problem of storing state persistently and consistently.
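
To make the fat/thin-client example concrete, here is a sketch of the distinction (the names are invented): errors the caller is expected to handle are surfaced as business errors, while infrastructure failures fail fast:

```python
class InsufficientBalance(Exception):
    """Business error: a thin client needs this so it can prompt the user."""

def purchase(user_id, item_price, balance_store):
    try:
        balance = balance_store.read(user_id)    # per-user balance file
    except OSError:
        # File unreadable, missing, or permission-denied: correct service is
        # impossible. Fail fast: log and refuse, rather than guess and risk
        # contaminating data. (A fat client that caches the balance would not
        # even issue this request when the balance is low.)
        raise
    if balance < item_price:
        raise InsufficientBalance(user_id)       # expected, caller-visible error
    balance_store.write(user_id, balance - item_price)
```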

Reference

  1. https://dl.acm.org/doi/pdf/10.1145/265924.265927
  2. https://personal.utdallas.edu/~hamlen/Papers/necula96safe.pdf
  3. https://docs.kernel.org/trace/rv/index.html

[CSE231 Paper Reading] ESXi

Summary

  1. Motivation: to reduce the overhead of memory overcommitment for guests by exploiting content-based page sharing and hot I/O page remapping, VMware proposed ESX Server. For contrast, when a guest issues a large I/O through DMA (Direct Memory Access), the QEMU-style emulation path does not place the result in a shared page but writes it directly into guest memory via memory mapping; that approach can emulate a wide variety of hardware devices, but every I/O operation takes a long path with multiple context switches and data copies, so performance is poor. ESX Server instead ships a set of mechanisms and policies for proportional-share memory management, since extended by research to GPU devices.

    1. Ballooning is ESX Server's method for reclaiming memory from a guest OS. A balloon driver inside the guest allocates and pins pages using the guest OS's own memory-management algorithms, and the ESX server then reclaims the machine pages backing them. The concept rests on the idea that the guest operating system knows best which pages should be given up.
    2. Content-based transparent page sharing: contrary to Disco, which required particular interfaces for shared page creation, the ESX server finds identical pages across VMs by scanning them and hashing their contents. A background process scans pages and records hash values in a hash table; on a hash hit, a full byte-wise comparison confirms the match, the pages are tagged copy-on-write (COW), and the duplicate machine memory is freed. (A minimal sketch follows this list.)
    3. The idle memory tax resolves a long-standing issue in share-based resource management: a VM is charged more for idle pages than for actively used ones, and the idle pages are reclaimed using the ballooning technique described above.
    4. I/O Page remapping is also used in large-memory systems to lower I/O copying overheads.
  2. Solution: a thin software layer that efficiently multiplexes hardware resources among several virtual machines. What makes this unique? Traditional hosted virtual machine platforms run on top of a standard operating system and intercept I/O device requests from VMs, handling them as host OS system calls. Because ESX Server operates directly on the system hardware, it offers faster I/O performance and greater resource-management flexibility, and it can run several commodity operating systems without requiring OS modifications. When a virtual machine is allocated, the hypervisor presents it with a contiguous addressable "physical" memory space. A pmap structure maps guest "physical" pages to machine pages, with shadow page tables translating guest virtual addresses to machine addresses quickly, and it also manages overcommitment. To inflate the memory balloon in a VM, the balloon driver pins guest pages so the guest cannot page them out to disk, and the hypervisor then unmaps the corresponding machine pages from the guest and reclaims them.

  3. The numerous strategies discussed in the paper are each evaluated precisely, with comparisons against Disco at numerous points. Among the significant findings: the idle memory tax improves overall throughput; content-based page sharing frees memory that grows roughly linearly with the number of VMs; and ballooning adds only a very small overhead (1.4-4.4%), with inflated VMs performing nearly identically to unballooned VMs given the same amount of memory, which compares favorably with Disco.
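
A minimal sketch of the scan-and-share loop (structure and names are mine; ESX additionally keeps hint frames and per-page reference counts, and uses a cheaper 64-bit hash):

```python
import hashlib

def page_hash(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

def scan_and_share(pages):
    """pages: machine page number -> page contents (bytes).
    Returns a map from each duplicate page to the page it now shares."""
    hash_table = {}    # content hash -> representative machine page
    shared = {}
    for mpn, data in pages.items():
        h = page_hash(data)
        if h in hash_table:
            rep = hash_table[h]
            if pages[rep] == data:     # full comparison guards against collisions
                shared[mpn] = rep      # mark both copy-on-write, free mpn's frame
                continue
        hash_table[h] = mpn
    return shared

pages = {0: b"\x00" * 4096, 1: b"\x00" * 4096, 2: b"\x07" * 4096}
print(scan_and_share(pages))           # {1: 0}: the two zero pages are shared
```

A later write to a shared page takes a COW fault and gets a private copy, so sharing stays transparent to the guest.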

Critique

  1. Every technique is clearly evaluated with at least one example. For ballooning, a Linux virtual machine running dbench with 40 clients showed nearly identical performance to a server of the same size without ballooning. To evaluate content-based sharing, three real-world guest OS instances were launched, and the page-sharing method was able to reclaim more than a third of the VMs' memory. For the idle memory tax, the memory shares of two VMs were tracked while the tax rate was adjusted.
  2. This is definitely the start of a line of scientific research, but ESX Server itself is pure engineering work that took a great amount of time. The idea of exploiting idle pages is widely used today, and the memory-disaggregation boom traces back to this kind of memory resource management. Ballooning reappears in every architecture that wants memory-virtualization performance, e.g., GPU virtualization such as XenGT and gScale, or CXL-based virtualization.
  3. This approach can be applied to recent systems such as MemTrade, but I think the latest work on identifying idle or cold pages is still not very efficient. For virtualization we can simply operate at page granularity, yet things do not degrade smoothly as memory becomes the bottleneck: for example, once the host's memory pool is 80% full, we need a mechanism that proactively swaps/zswaps/disaggregates instead of waiting until memory runs out to do so.

[CSE231 Paper Reading] EROS

Summary

  1. The motivation for the paper is that the authors want abstractions for checkpointing and an object-based storage and memory system in a microkernel, achieving efficiency by fusing a small set of carefully selected abstract objects with object caching. All code with direct hardware access or the capacity to directly change the security state of the system is gathered into the kernel to make security assurance simpler; thus the single-level store and the bottom halves of device drivers are implemented in the kernel.

  2. Via capability-protected abstractions, the kernel offers a very direct virtualization of the underlying hardware. Pages hold data and have a size determined by the hardware; capabilities are stored in nodes, each of which holds 32 capabilities and serves a role similar to metadata in traditional systems. Pages and nodes together hold all the state that applications can see. Every EROS process has a capability naming the root of its address-space tree; this tree's mapping is converted on demand into the representation needed by the underlying hardware. When a page fault occurs, the EROS kernel traverses the address-space tree to construct a valid mapping, issuing object faults as needed to load pages and nodes from disk. If a valid mapping is found, it is added to the hardware mapping database and the access is retried; otherwise, the fault is reported via an upcall to a user-level fault handler supplied by the process or, failing that, by the address space. There is also a virtual copy mechanism for page migration and cached address translation for faster page walks. (A sketch of the tree walk follows this list.)

  3. The results are excellent: EROS performs admirably on 6 of the 7 benchmarks. This offers strong, if clearly not conclusive, evidence that the presence of protected subsystems (some of which were employed in the EROS benchmarks) need not adversely affect system performance, and that capability systems can operate effectively without specialized hardware assistance.
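
A minimal sketch of the fault-time address-space tree walk (Python structures invented for illustration; real EROS capabilities also carry permissions, and missing objects trigger object faults against the single-level store):

```python
class Node:
    """Capability node: 32 slots, each naming another Node or a Page."""
    def __init__(self):
        self.slots = [None] * 32

class Page:
    def __init__(self, frame):
        self.frame = frame             # backing machine frame

def translate(root: Node, vaddr: int, levels=3, bits=5, page_bits=12):
    """Walk the tree named by the process's address-space capability.
    Returns a frame to install in the hardware mapping, or None so the
    kernel can upcall the user-level fault handler."""
    node = root
    for level in range(levels):
        shift = page_bits + bits * (levels - 1 - level)
        cap = node.slots[(vaddr >> shift) & 0x1F]   # 5 bits index 32 slots
        if cap is None:
            return None                # invalid mapping: upcall fault handler
        if isinstance(cap, Page):
            return cap.frame           # valid: add to hardware mapping, retry
        node = cap                     # descend to the next node
    return None

root = Node()
inner = Node()
root.slots[0] = inner
inner.slots[0] = Page(frame=42)
print(translate(root, 0x0))            # 42
```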

Critique

  1. I generally believe the experiments, since the benchmarks directly exercise the mechanisms whose cost the paper claims to have eliminated.
  2. It's art and science. The system has abstractions for inter-process communication (IPC), virtual address spaces, and threads, and, unlike most L4 kernels, capabilities for authorization. Virtual address spaces have no kernel-defined structure; instead, page faults are conveyed via IPC to the pager threads in charge of building the virtual address space by mapping frames into it. To assist virtualization, exceptions and non-native system calls are also transmitted via IPC. IPC uses synchronous and asynchronous endpoints (port-like destinations without in-kernel buffering) for inter-thread communication and facilitates RPC through reply capabilities. Capabilities are stored and isolated in capability address spaces made up of capability container objects known as cnodes.
  3. It is the beginning of this line of research: by abstracting IPC and circumscribing the machine state, it makes the kernel well suited to verification, e.g., with LTL-based techniques.

[CSE231 Paper Reading] Log-Structured File System

Motivation

Computer hardware performance began to explode: CPUs were getting faster and RAM was getting bigger. However, while sequential read and write speeds of hard disks were increasing, random read and write speeds, limited by physical seek times, could hardly drop below 10 ms. Meanwhile, the file systems of the time, whether the Unix File System or FFS, issued a large number of random reads and writes (at least 5 random writes are required to create a new file in FFS), and thus became the performance bottleneck of the whole system. At the same time, because of the page cache, the authors argue that random reads are not the main problem: with more and more memory, most reads can be served from cache. The main problem for LFS is therefore reducing random writes to the disk.

Implementation

  1. File system as a log: all updates are buffered in memory and appended to the disk sequentially in large segments, so the disk only ever sees bulk sequential writes.
  2. Segment-based bulk writing solves the random-write problem, but how does LFS implement reads? Like UFS/FFS, LFS stores a file's contents in fixed-size data blocks together with an index for the file. For example, if segment 0 stores two data blocks of file 2, an inode for file 2 written after them stores the addresses of those two blocks. Unlike UFS/FFS, however, LFS inodes are dynamically allocated, so LFS stores an index of inode locations, called the inode map, at the end of each segment; in LFS, the entire inode map is cached in memory, which speeds up reads. (A sketch of this write/read path follows this list.)
  3. Garbage collection: as mentioned earlier, LFS requires a garbage-collection mechanism to remove stale data. The cleaner copies the live blocks out of several partly obsolete segments, compacts them into new segments, and then frees the old segments.
  4. Failure recovery is obviously important for any file system. Unlike UFS, which uses the fsck command to recover from a failure, LFS stores a checkpoint for the entire drive: because each segment of LFS stores the address of the next segment, the entire file system is organized like a chain. The checkpoint records the addresses of the first and last segments of this chain, so the whole file system can be recovered by reading the checkpoint. LFS updates the checkpoint every 30 seconds.
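
A minimal sketch of the write path with an in-memory inode map (toy structure; real LFS also writes inode blocks and inode-map blocks into segments, and the checkpoint records where the inode map lives):

```python
SEGMENT_BLOCKS = 8                     # toy segment size, in blocks

class LFS:
    def __init__(self):
        self.log = []                  # the disk: an append-only list of segments
        self.buffer = []               # current partial segment, buffered in RAM
        self.inode_map = {}            # inode number -> {file offset -> address}

    def write(self, inum, offset, data):
        self.buffer.append((inum, offset, data))
        if len(self.buffer) == SEGMENT_BLOCKS:
            self.flush_segment()       # one large sequential write, no seeks

    def flush_segment(self):
        seg_no = len(self.log)
        self.log.append(self.buffer)
        for blk, (inum, offset, _) in enumerate(self.buffer):
            # the newest address wins; stale copies become garbage to clean
            self.inode_map.setdefault(inum, {})[offset] = (seg_no, blk)
        self.buffer = []

    def read(self, inum, offset):
        seg_no, blk = self.inode_map[inum][offset]   # map is cached in memory
        return self.log[seg_no][blk][2]

fs = LFS()
for i in range(8):
    fs.write(inum=2, offset=i, data=f"block{i}")
print(fs.read(2, 3))                   # "block3", located via the inode map
```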

Experiments

Figures 3 and 4 plot write cost against disk capacity utilization and against the fraction of live blocks in cleaned segments, showing LFS beating today's FFS both for uniform access and for 10% hot / 90% cold data. Figure 5 shows that, in terms of segment utilization, the hot-and-cold workload surprisingly behaves no better than uniform under the greedy cleaning policy. Figure 6 shows the bimodal segment utilization distribution that emerges when the cost-benefit policy is used to select segments to clean and live blocks are grouped by age before being rewritten.
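
The cost-benefit policy behind Figure 6 is compact enough to state directly: for a segment with live fraction $u$ and age (time since its youngest block was modified), clean the segments with the highest $\frac{(1-u)\,\mathrm{age}}{1+u}$, since reading the segment costs $1$ and writing its live data back costs $u$. A sketch:

```python
def cleaning_order(segments):
    """segments: list of (live_fraction u, age) pairs. Returns segment
    indices ordered by LFS's cost-benefit ratio, best candidates first."""
    def ratio(seg):
        u, age = seg
        return (1 - u) * age / (1 + u)
    return sorted(range(len(segments)),
                  key=lambda i: ratio(segments[i]), reverse=True)

# A cold segment is worth cleaning even at high utilization (0.75), while a
# hot, emptier segment (0.30) is better left to decay further before cleaning:
print(cleaning_order([(0.75, 100), (0.30, 2)]))   # [0, 1]
```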

Evaluation effectiveness

This evaluation is sophisticated and carefully designed.

Novelty

Rosenblum is a co-founder of VMware, and Ousterhout is one of the authors of Raft. I have no comments on their novelty, except that they were far ahead of their time.

SoTA Connection

The segment-based design of LFS matches the physical characteristics of SSDs and is therefore widely used in SSD firmware; the memtable/compaction design of LSM-trees mirrors the memory buffer and GC of LFS; and newer file systems such as btrfs implement copy-on-write and multi-version features on top of the same append-only idea. EXT3/4 likewise has redo logging with journal/ordered/writeback modes.

Also, the Parallel Log-structured File System (PLFS) follows LFS in building a parallel file system that remaps concurrent writes into an N-N pattern, which speeds up MPI writes.

Reference

  1. http://www.eecs.harvard.edu/~cs161/notes/lfs.pdf
  2. https://lwn.net/Articles/353411/
  3. https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf
  4. https://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
  5. https://www.usenix.org/legacy/publications/library/proceedings/sd93/seltzer.pdf