[CSE220 Project] Signature Path Prefetcher

All the modified code and raw results are at https://github.com/masc-ucsc/esesc/commit/2763f4c52351e9c9233b67beb245dd4c8d39c5ed; the modifications reside in the last three commits.

Basic Idea of Improving the IPC

Since the workloads I'm using are a RISC-V simulator and an in-memory database, they strongly benefit from a prefetcher that is aware of page-level access patterns.

Comparison between different prefetchers

The Stride Prefetcher and TAGE Prefetcher are already implemented in esesc and achieve good performance, but they still fail to prefetch when the access pattern is confined to a page. For example, if the application mallocs a buffer and then walks it in a pattern within a page, the SPP Prefetcher wins.

Best-Offset Prefetcher (covers Stride/Next-Line behavior)

  • Degree-1 prefetcher (it only prefetches one address)
    • Prefetching two offsets results in many useless prefetches
    • Normally it fetches the next line or follows stride-like semantics
  • The prefetcher turns itself off if its score is too low (a stride or next-line prefetcher will not do so)
  • The MSHR threshold varies depending on the BO score and the L2 access rate

SPP Prefetcher

SPP basically maintains three tables: a Signature Table that records, per page, the access history; a Pattern Table that records which delta pattern follows a given signature; and a Global History Register to enable possible cross-page references.
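
As a concrete picture of these structures, here is a minimal C++ sketch; the field widths and container choices are illustrative assumptions (the SPP paper uses 12-bit signatures and small saturating counters), not the actual esesc code.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical SPP state; names and widths are illustrative only.
struct SigEntry {          // Signature Table: one entry per tracked page
  uint16_t signature = 0;  // compressed history of deltas seen in this page
  uint8_t  lastOffset = 0; // last accessed cache-line offset within the page
};

struct PatternEntry {                          // Pattern Table: indexed by signature
  std::unordered_map<int8_t, uint32_t> cDelta; // per-delta counters C_delta
  uint32_t cSig = 0;                           // total counter C_sig for this signature
};

struct GHREntry {          // Global History Register: supports cross-page prefetching
  uint16_t signature = 0;  // signature of the page that ran past its boundary
  int8_t   lastDelta = 0;  // delta that crossed the page boundary
  uint8_t  lastOffset = 0;
};

struct SPPTables {
  std::unordered_map<uint64_t, SigEntry>     signatureTable; // keyed by page number
  std::unordered_map<uint16_t, PatternEntry> patternTable;
  std::vector<GHREntry>                      ghr;
};
```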

Design Doc

The prefetcher implemented in esesc only operates within a single level of the hierarchy, so I made SPP a member variable of GMemorySystem. If the CCache is named L2(0) or DL1(0), I pass the SPP in. That way I can record all the downward requests at DL1(0) and compute the SPP result, and once a request enters L2(0), I consume all the addresses stored in SPP and schedule the prefetches; a rough wiring sketch follows.
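
The sketch below assumes simplified placeholder shapes for GMemorySystem and CCache; the names DL1(0)/L2(0) come from the text, but the method names are hypothetical, not the actual esesc API.

```cpp
#include <string>

struct SPP { /* signature/pattern/GHR tables as sketched above */ };

struct CCache {
  std::string name;
  SPP *spp = nullptr;
  const std::string &getName() const { return name; }
  void setSPP(SPP *p) { spp = p; }  // DL1(0) trains SPP, L2(0) drains the prefetch queue
};

struct GMemorySystem {
  SPP spp;                          // one shared instance for the whole hierarchy
  void attach(CCache &cache) {
    if (cache.getName() == "DL1(0)" || cache.getName() == "L2(0)")
      cache.setSPP(&spp);           // only these two levels see the prefetcher
  }
};
```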

The replacement policy is bounded by an associativity size over a vector<Entry*> plus an unordered_map<Entry*, int>. I use an AssocSet built from a vector and an unordered_map keyed by address to implement the 1-way, one-entry-per-set replacement policy; a sketch is below.
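
The sketch assumes a simple LRU policy bounded by the associativity; the text pairs a vector with an unordered_map, whereas this sketch uses a recency list for brevity, and with assoc == 1 it degenerates to the one-entry-per-set policy described above.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

// Associativity-bounded table: a map for O(1) lookup plus a recency list for
// eviction. Entry must be default-constructible.
template <typename Entry>
class AssocSet {
  size_t assoc;
  std::list<uint64_t> lru;  // front = most recently used key
  std::unordered_map<uint64_t, std::pair<Entry, std::list<uint64_t>::iterator>> map;

public:
  explicit AssocSet(size_t a) : assoc(a) {}

  // Look up (or allocate) the entry for `key`, evicting the LRU victim if full.
  Entry &touch(uint64_t key) {
    auto it = map.find(key);
    if (it != map.end()) {
      lru.erase(it->second.second);       // hit: refresh recency
    } else if (map.size() >= assoc) {
      map.erase(lru.back());              // miss and full: evict the LRU key
      lru.pop_back();
    }
    lru.push_front(key);
    auto &slot = map[key];
    slot.second = lru.begin();
    return slot.first;
  }
};
```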

  • Table update

    • The offset and signature recorded for a page are updated whenever DL1(0) accesses that page.
      • The previous signature is used to update the Pattern Table, while the offset difference (delta) is used to construct the new signature.
  • The signature format matches the SPP design.

    • This shortens training and reduces Pattern Table accesses.
  • Prefetching

    • When L2(0) accesses a cacheline, look up the signature of the currently accessed page.
    • Choose the delta with the highest probability $P_i = C_{delta}/C_{sig}$ at the $i$-th prefetch depth.
    • If the product of all $P_i$ along the path is larger than the threshold:
      • Prefetch the current address + delta.
      • Use the delta to update the signature and access the Pattern Table again.
    • If the accumulated probability falls below the threshold (0.95), the procedure ends; a sketch of this lookahead walk follows this list.
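
The sketch reuses the hypothetical SPPTables from the earlier sketch; the signature-update hash, the 4 KiB page / 64 B line geometry, and the function name are assumptions, while the 0.95 threshold and the $P_i = C_{delta}/C_{sig}$ confidence come from the text.

```cpp
#include <cstdint>
#include <vector>

// Walk the Pattern Table starting from the page's current signature,
// multiplying per-step confidences P_i = C_delta / C_sig and collecting
// prefetch addresses until the accumulated product drops below the threshold.
std::vector<uint64_t> sppLookahead(SPPTables &t, uint64_t pageBase,
                                   uint8_t offset, double threshold = 0.95) {
  std::vector<uint64_t> prefetches;
  auto st = t.signatureTable.find(pageBase >> 12);  // assuming 4 KiB pages
  if (st == t.signatureTable.end()) return prefetches;

  uint16_t sig = st->second.signature;
  double pathConfidence = 1.0;
  int curOffset = offset;

  while (true) {
    auto pt = t.patternTable.find(sig);
    if (pt == t.patternTable.end() || pt->second.cSig == 0) break;

    // Pick the delta with the highest count under this signature.
    int8_t bestDelta = 0;
    uint32_t bestCount = 0;
    for (const auto &kv : pt->second.cDelta)
      if (kv.second > bestCount) { bestCount = kv.second; bestDelta = kv.first; }

    pathConfidence *= double(bestCount) / double(pt->second.cSig);
    if (bestCount == 0 || pathConfidence < threshold) break;

    curOffset += bestDelta;
    if (curOffset < 0 || curOffset >= 64) break;  // stop at the page boundary
    prefetches.push_back(pageBase + uint64_t(curOffset) * 64);

    // Advance the signature with the chosen delta (the hash is illustrative).
    sig = uint16_t(((sig << 3) ^ uint8_t(bestDelta)) & 0xFFF);
  }
  return prefetches;
}
```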

Result

CKB-VM

For most of the workloads with page-level access patterns, the SPP Prefetcher achieves good performance, but some of the workloads are too deterministic because they simulate naive and straightforward test cases.

RisingWave

We see a big jump, a >1.5x speedup, for the in-memory database tests. This is because the hash joins and sort-merge joins rely heavily on page-level prefetching to reduce the memory wall.

Conclusion

  • For a short running time
    • The next-line prefetcher, or no prefetcher at all, performs better on benchmarks that have a more regular access pattern and a higher overall miss rate
    • The performance gain on random access patterns is negligible

Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs @ASPLOS21

The current RPC layer between nodes can be scaled by using NUMA variants like NEBULA [1] or soNUMA, which provide on-chip buffer locality awareness for data movement. Many RDMA RPC accelerators / SmartNICs exploit this kind of locality information when placing data in the LLC or on the RDMA Shared Receive Queue (to enable inter-endpoint buffer sharing).

For tail latency, newly arriving RPCs that would inevitably violate the SLO are eagerly NACKed, informing the client early about increased load on the server; NEBULA rejects or retries such predicted requests (a fail-fast approach). The NIC-to-core policy is implemented naturally using DDIO plus fast packet reassembly, but this dispatch is not cached into L1 directly.

Dagger implements the RPC stack for microservices on a reconfigurable FPGA NIC. Their eventual goal is to enable NUMA (and in the future CXL) over the FPGA RPC path. Their premise is that if a microservice like NGINX/MongoDB is neither latency bound nor memory bound, it can simply be disaggregated through their NIC into the same memory space. The RPC interface is the prevailing IDL, gRPC. The CPU/NIC communication is very similar to the approach used in the HFT industry: DDIO, a cache hash function, AVX streaming loads of cachelines, and WQE-by-MMIO doorbells.

Their load balancer plus flow scheduler on the FPGA is very similar to a CHA with CXL.cache device bias; a CHA normally distributes requests on-chip or across sockets through UPI.

Once CXL FPGA NICs arrive, I think they can cache cachelines into L1 and use hardware to replace the software cache-coherence path. I still think there is more OS-level synchronization that can leverage the cache coherence of CXL.cache.

Reference

  1. https://ieeexplore.ieee.org/abstract/document/9138953
  2. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9180035

Rethinking Communication in Multiple-kernel OSes and for New Shared Memory Interconnects @PLOS'19

The paper basically asks: now that memory pooling and sharing are available, can we design typed memory allocation that can be shared across kernels with different ABIs and ISAs? The programming model follows PGAS/SHMEM.

Communication goes over shared memory. With message passing, a message has to be placed in a buffer, a notification sent (e.g. an IPI that may cost tens of µs), the receiver's interrupt handler has to process the message, and a reply may have to travel back the same way. Thus, for performance, shared memory is preferable to message passing.

The data format is a fixed, typed, executable code segment. Updating code rather than data promises a significant reduction in runtime overhead, especially in scenarios where the amount of data to process is large.


The accessor for such a snippet of typed code reads an offset field that serves as the type hint for one data object; a hypothetical sketch of what that might look like follows.
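
The sketch assumes a header word that stores the type-hint offset; the names and layout are illustrative, not the paper's.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical layout: every typed object starts with a header whose
// typeOffset field points at the type descriptor (the "type hint").
struct TypedHeader {
  uint32_t typeOffset;   // offset of the type hint, relative to the object base
  uint32_t payloadSize;  // size of the data that follows the header
};

// Accessor: read the offset from the header and return the type hint,
// which the receiving kernel uses to interpret the payload.
inline const void *typeHintOf(const std::byte *objectBase) {
  auto *hdr = reinterpret_cast<const TypedHeader *>(objectBase);
  return objectBase + hdr->typeOffset;
}

inline const std::byte *payloadOf(const std::byte *objectBase) {
  return objectBase + sizeof(TypedHeader);
}
```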

The implementation consists of:

  1. an extended slab allocator, modified to handle typed memory in addition to normal memory;
  2. static compiler support (sizeof/__typed);
  3. JIT support for inter-kernel data sharing, adding data conversion and type morphing.

Optimizing Data-intensive Systems in Disaggregated DataCenters with TELEPORT @Sigmod22

  1. The paper claims that, as introduced in [2], a MESI-style virtual cache is a better way of managing disaggregated memory, so they design an RDMA daemon that plays the role of that virtual cache. [2] leverages a DRAM TLB whose tags can be either VA or PA, which is better for communication between cores. In this paper, they implement a syscall called pushdown that emulates memory sharing between different memory spaces, with memory ordering guaranteed, a coherence protocol inside, and fast PA/VA translation between the two spaces.
  2. The system is built upon LegoOS; there would be little novelty had they implemented it on Linux, since there are already so many daemon implementations out there.
  3. The daemon's RPC messages are received and handled in FIFO order: no transactions, no crash consistency, no atomicity.
  4. Concurrent page fault handling

Reference

  1. https://www.cis.upenn.edu/~sga001/papers/teleport-sigmod22.pdf
  2. https://research.cs.wisc.edu/multifacet/papers/isca13_direct_segment.pdf

[CSE231 Paper Reading] Filesystem Reflection

SSD and HDD design differences

We described two approaches to building file systems, clustering, and logging, for Hard Disk Drives (HDDs). How should we design a file system for a Solid State Drive (SSD)? Your answer should identify two hardware properties of SSDs that contrast with the properties of HDDs and describe how these properties inform your design decisions.

In FFS, the early UNIX fast file system, inodes and dirents interpret the file abstraction. Those designs are aware of the physical medium of HDDs: which addressing patterns are faster, and what page size balances swapping physical memory onto the disk. This type of file system is not designed for SSDs. SSDs use NAND (QLC/SLC/MLC) as the physical medium, and the on-chip controller handles LBA-to-page map translation, channel management, write-behind, allocation, and wear leveling (considered harmful in [1]). Nowadays the firmware-level layout is designed much like a log-structured FS, because it can recover quickly from physical media failure at low latency cost.

EXT4 keeps evolving with new media and smarter designs. Remzi's group at the University of Wisconsin still publishes papers touching EXT4 core design, such as leveraging device-mapper deduplication [2] and how the data path can become the critical part [3].

Starting from PMFS, people have been exploring how an FS should be designed when persistent memory is attached as memory or, more recently, as CXL.mem.

  1. With hardware speeding up, we need the software to be low latency as well, e.g. tracers, hypervisors, and multi-tenancy isolation management.
  2. The ordering and durability of a single cacheline, a single page, or a single pointer should be dealt with at the firmware level, resolved either by locks, lock-free hardware mechanisms, or other smart API support.
  3. Workload-aware optimization should be offered through performance hints and an advice interface like madvise() in Linux; the storage layout for an ML model serialization is completely different from a plain memory dump (no pointers or hierarchical logic).
  4. Consistency should be optimized using a combination of atomic in-place updates, logging at cacheline granularity (fine-grained journaling), and copy-on-write (CoW); a minimal sketch of the first two follows.
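
The sketch below shows an 8-byte atomic in-place update next to a tiny cacheline-granularity redo log; the flush intrinsics are the usual x86 ones, but the log layout and function names are assumptions for illustration (C++20 for std::atomic_ref, and -mclwb for the flush).

```cpp
#include <immintrin.h>   // _mm_clwb / _mm_sfence
#include <atomic>
#include <cstdint>
#include <cstring>

// Flush the cacheline containing addr and order the flush.
inline void persist(const void *addr) {
  _mm_clwb(const_cast<void *>(addr));
  _mm_sfence();
}

// Updates of 8 bytes or less: a single atomic store that is either entirely
// visible after a crash or not at all, then flushed.
inline void atomicUpdate8(uint64_t *pmField, uint64_t value) {
  std::atomic_ref<uint64_t>(*pmField).store(value, std::memory_order_release);
  persist(pmField);
}

// Larger updates go through a small redo-log entry sized to one cacheline:
// persist the log, set the commit flag, then apply and persist the data.
struct alignas(64) LogEntry {
  uint64_t dst;         // destination address in persistent memory
  uint32_t len;         // payload length, at most sizeof(data)
  uint8_t  data[48];
  uint8_t  valid;       // commit flag: a torn entry is ignored on recovery
};

inline void loggedUpdate(LogEntry *log, void *pmDst, const void *src, uint32_t len) {
  log->dst = reinterpret_cast<uint64_t>(pmDst);
  log->len = len;
  std::memcpy(log->data, src, len);
  persist(log);                 // log contents are durable before the commit flag
  log->valid = 1;
  persist(&log->valid);         // commit point
  std::memcpy(pmDst, src, len);
  persist(pmDst);
  log->valid = 0;               // retire the entry
  persist(&log->valid);
}
```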

Replace POSIX?

In our discussion of single-level stores, we identified a few problems with using this design for all interactions with persistent storage. In this question, describe one potential issue that arises when using a single-level store for persistent storage. How does using a POSIX/file system interface simplify/eliminate the issue that you described?

The VFS is a long-discussed source of overhead inside the kernel: it adds latency through scheduler noise and loses bandwidth through the block layer and other kernel mechanisms occupying the queue. But we don't currently have a proposal to replace it, since the kernel must still rely on blk pread/pwrite and bdi writeback for all of this (see [4]).

POSIX semantics are too strong for a distributed FS. If you could stage data that is already ready on the fly in the CXL cache layer, with a bound on when it becomes visible, that would be a job for a software-defined SmartSSD / CXL-SSD.

Reference

  1. https://dl.acm.org/doi/pdf/10.1145/3538643.3539750
  2. https://elinux.org/images/0/02/Filesystem_Considerations_for_Embedded_Devices.pdf
  3. https://github.com/Harry-Chen/HERMES
  4. https://dl.acm.org/doi/10.1145/3538643.3539751

Pond: CXL-Based Memory Pooling Systems for Cloud Platforms @ASPLOS23

A NUMA emulator (bandwidth is an easy argument to attack) and a brainless XGBoost allocator; what garbage. The experiments have a strong whiff of Bay Area slacker-style A/B testing.

The allocator is Hyper-V's allocator, so it is not open source. Microsoft started doing this on Azure eight years ago; later came SmartNIC-based address translation in the middle for offloading, so it is only natural for Microsoft to build Pond.

When the memory prediction is wrong, they keep an allocation bit at the hypervisor layer to record where the memory currently lives, plus a separate piece of logic to reduce the penalty (the so-called QoS monitoring).


Their future work

  1. Microsoft will surely build some CPU co-processor to push the above process into hardware, though a software solution at the hypervisor layer, which they control, also works; but then the ML model has to co-exist within the L3 cache, and more parameters is actually quite harmful. They did not open-source the ML part of the implementation; looking closely, history and untouched memory are actually not the same thing.
  2. Page migration at the guest-OS layer also has to be learned, so it may be better to hand it straight to TPP (double prediction), but how do you decouple the two? For FaaS, where the guest OS controls it itself, this problem does not arise.
  3. For memory virtualization, why not use a balloon driver?
  4. There is no clear path to the optimum.

[CSE231 Paper Reading] Eraser: A Dynamic Data Race Detector for Multithreaded Programs

Summary

  1. Motivation for the paper: current systems follow Lamport's happened-before ordering and use mutex locks on shared variables to prevent data races with other threads accessing them.

  2. They use the lockset algorithm, which observes whether the same locks are held whenever shared data is accessed and infers which locks protect which data items. Since Eraser has no way to know the intent behind a lock, it must derive the lock-protection relationship from history. For each shared variable v, Eraser maintains a candidate lock set C(v), which holds all locks that have so far protected v. When the variable is initialized, C(v) is the set of all possible locks. On each access, C(v) <- C(v) $\cap$ (set of locks held by the current thread), so this set can only shrink or stay the same. Termination: if all threads protect the same resource with consistent locks, the set is refined to a fixed point; if C(v) becomes empty, the locking is inconsistent and a race is reported (see the sketch after this list). To weaken this overly strong rule, we can assume that initialization needs no lock, that read-only data does not require a lock, and that a read-write lock allows a single writer with multiple readers. If no other thread can refer to a variable, there is no need to lock it; this is true for newly allocated data. To avoid false alarms, Eraser delays the refinement of candidate sets until after initialization is complete. However, confirming that initialization is complete is difficult, so refinement is further delayed until the shared variable is first accessed by a second thread. If the shared variable is only ever read or written by one thread, refinement is never executed. Furthermore, read-only objects again need no protection; this is supported by warning only about shared variables that are written by multiple threads.

    When the variable is written, only locks held in write mode should be kept, since locks held purely in read mode cannot protect against a concurrent writer.

    Eraser is implemented on Digital Unix on the Alpha processor using the ATOM [Srivastava and Eustace 1994] binary modification system. Eraser takes unmodified binary code as input and produces new binary code that is functionally identical but with the Eraser runtime inserted. It monitors each global/heap load and store to maintain C(v) (distinguished via the address and its relationship to the stack pointer), each lock acquire/release call and each thread creation/termination to maintain the locks held by each thread, and each memory allocation to initialize C(v), since every word has the potential to become a shared variable. (This is a lot of trouble, mostly because the changes are made to the binary, not the source code.) It also maintains the set of stack addresses that are accessible through registers rather than the stack pointer. When a race occurs, Eraser logs it to a file and reports the line number, a backtrace of the stack frames, the thread ID, the memory address, the access mode, and the PC/SP values to locate the offending code. If that is not enough to find the bug, Eraser can log every access.

  3. The authors found that most of the overhead comes from ATOM requiring a function call at every instrumentation point, although upgrading to the 1996 version allows inlining. They therefore concluded that performance was not greatly affected and was not the main goal of the tool; a slowdown of 10x to 30x is unavoidable.
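
A minimal sketch of the lockset refinement described in point 2 above, ignoring Eraser's state machine for initialization and read sharing; the types and the lazy representation of the initial "all locks" set are assumptions.

```cpp
#include <cstdint>
#include <set>
#include <unordered_map>

using LockId = int;
using VarId  = uint64_t;

struct LocksetChecker {
  std::unordered_map<VarId, std::set<LockId>> candidate;  // C(v)

  // Called on every access to shared variable v by a thread currently holding
  // `held`. Returns false when C(v) becomes empty, i.e. a potential race.
  bool onAccess(VarId v, const std::set<LockId> &held) {
    auto it = candidate.find(v);
    if (it == candidate.end()) {    // first access: C(v) starts as "all locks",
      candidate[v] = held;          // represented lazily by the first held set
      return true;
    }
    std::set<LockId> refined;
    for (LockId l : it->second)
      if (held.count(l)) refined.insert(l);  // C(v) <- C(v) ∩ locks_held(t)
    it->second = refined;
    return !refined.empty();
  }
};
```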

Critique

  1. They monitor C/C++ allocators and locks, which requires binary rewriting and runtime function calls. The tool still cannot observe kernel-level or microarchitecture-level races: if, say, a softirq changes the memory state, the simple model of observing from the side is no longer valid. To model memory safety and race conditions inside the kernel, one would have to model all possible state changes, which would make the tool much sounder.
  2. This is a combination of art and engineering. It is art because it uses a weaker formalization and observes thread state changes as they happen; it is engineering because binary rewriting is dirty work.
  3. It is the start of the whole debugging and software-engineering world around the kernel. Tracing bugs inside the kernel is a long-discussed topic; new eBPF tools can be used to live-patch the kernel, reducing the overhead of instrumentation and staying portable without writing a module or hacking the LSM to do so.

Reference

  1. https://kexiang.me/2021/11/30/1759-NOV-25/
  2. https://blog.cloudflare.com/live-patch-security-vulnerabilities-with-ebpf-lsm/

[CSE231 Paper Reading] Safe Kernel Extensions Without Run-Time Checking

Summary

  1. Motivation: runtime checking in the kernel is too time-consuming. In naive formal verification of a runtime system, the code consumer itself verifies the safety of the code, and that verification takes a great amount of time: it normally takes the pre/postconditions of the assembly code, defines transition rules for the different instructions, and eventually uses an SMT solver to check that the running kernel satisfies the conditions.

  2. The technique allows the kernel to check the safety of an extension. The code receiver defines a set of safety rules that guarantee safe behavior of programs. The code producer then creates a formal safety proof showing that its code complies with those safety rules. The code receiver uses a simple and fast validator to check that the code is safe. Proof-carrying code is the message passed: the producer attaches the proof to the code, which offloads the consumer's work. Most of the validation effort is done by the code producer, and how the proof was constructed is irrelevant to the code consumer. PCC programs are tamper-evident: any change to the code renders the proof invalid. No trusted third parties are needed, and errors are found before any code is executed. The approach contains several basic components: 1. a formal specification language to convey the safety policy; 2. a formal semantics of the untrusted code's language; 3. a language in which to express the proofs; 4. a validation algorithm for the proofs; 5. a process for producing the safety proofs. Concretely, they have a simple LF type-checker and a theorem prover that perform the safety check (a schematic of the consumer-side flow appears after this list).

  3. They evaluate packet filters by implementing four filter programs in assembly language and certifying their safety with respect to the packet-filter safety policy; essentially, they prove that the input and output packets stay within the filter definition and do not violate the memory checks. The reported running time is the average per-packet runtime of the four PCC packet filters, compared against the BSD packet filter interpreter (which is very slow), against software fault isolation, and against a safe subset of Modula-3 plus the VIEW extension for safe pointer casting.
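
A schematic of the consumer-side flow described in point 2 above; none of these types or functions come from the paper's implementation, they only mirror the steps in the text.

```cpp
#include <cstdint>
#include <vector>

struct SafetyPolicy {};                                // published by the code consumer
struct NativeCode   { std::vector<uint8_t> bytes; };   // untrusted extension code
struct SafetyProof  { std::vector<uint8_t> lfTerm; };  // proof shipped with the code

// Fast, simple validator. Placeholder body: a real checker type-checks the LF
// proof term against the verification condition derived from code and policy.
bool checkProof(const SafetyPolicy &, const NativeCode &, const SafetyProof &proof) {
  return !proof.lfTerm.empty();
}

// The kernel installs the extension only if the attached proof checks out;
// no run-time checks are needed afterwards.
bool installExtension(const SafetyPolicy &policy,
                      const NativeCode &code, const SafetyProof &proof) {
  if (!checkProof(policy, code, proof))
    return false;                    // invalid proof: reject before running anything
  // ... map code.bytes into the kernel and register its entry point ...
  return true;
}
```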

Critique

  1. The experimental results show that PCC lets the kernel interact safely with untrusted code and that PCC imposes no runtime overhead on the receiver; lastly, the safety policies are defined by the receiver, which makes the scheme easier and more flexible.
  2. This work is an engineering effort because they actually implemented much of the prover. The verification requires neither zero-knowledge proofs nor cryptography, just pure proof offloading. However, the verifier framework can be applied to any runtime system once the PCC protocol is defined, which is great for systems that require high robustness. I think the real fix is rewriting all the code, especially driver code, in Rust, so that at least the vast majority of memory and type issues go away.
  3. A type-system checker like Algorithm W, on which OCaml's type inference is based, is a better static and dynamic checker, but it is not applicable to a runtime system because it is too time-consuming. The current eBPF verifier is an offline verifier with online checks, so this work serves as guidance for today's systems that require memory safety and semantic safety, though it definitely works with a weaker proof condition.

Reference

  1. https://www.slideserve.com/keaton/ensuring-code-safety-without-run-time-checks-for-real-time-control-systems/?utm_source=slideserve&utm_medium=website&utm_campaign=auto+related+load
  2. https://www.slideserve.com/pamelia/safe-kernel-extensions-without-run-time-checking-powerpoint-ppt-presentation
  3. https://stackoverflow.com/questions/29806121/how-to-type-check-recursive-definitions-using-algorithm-w

[CSE231 Paper Reading] Why do Computers stop and What Can Be Done About it?

Summary

  1. Motivation: when designing a system for high availability, we normally run into the fail-fast question: which errors should the code surface, and how, so that its behavior matches the process we have in mind? As studied in the paper, about 20,000-fold redundancy of von Neumann's design would be needed to reach a 100-year MTBF. A system that practices fail-fast should be well modularized, with each module choosing only between providing normal functionality and ceasing to function. Generally speaking, errors that the caller is expected to handle should be returned as information, and all others should simply raise an exception.
  2. The solution is redundancy and modularity, in software or hardware. Thanks to modularity, a module's failure only impacts that module; the system is decomposed into modules hierarchically. 1. Build each module with an MTBF greater than one year. 2. Each module should fail fast. 3. Each module should emit a heartbeat message so you know when it fails. 4. Keep backup modules that can replace a faulty module. There are two primary checking methods: 1. static checking, performed before the code ever executes, which when done cautiously may produce numerous false positives; and 2. dynamic checking, which verifies running code and has a lower false-positive rate, but may not catch every bug, particularly on rarely used code paths. Another solution is process pairs, where the other process takes over after the first fails: 1. Lockstep, where both execute every instruction. 2. Checkpointing, where the primary periodically saves its state and transfers it to the backup. 3. Alternatives such as kernel checkpointing and delta checkpointing. 4. Designs where the backup relies entirely on persistent storage for its information, in which case an inconsistent persistent store must be prevented. A toy sketch of a heartbeat-driven checkpointing pair appears after this list.
  3. The evaluation in the paper is basically two examples of availability: 1. transactions with atomicity, consistency, isolation, and durability (ACID) properties; Jim Gray promotes the use of transactions and persistent process pairs. 2. fault-tolerant communication in the Encompass system, where sessions and sequence numbers are important; TCP sequence numbers identify duplicate and missing messages using the same principle.
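
The sketch simulates the pair with shared state in one address space; the names and the 500 ms timeout are illustrative assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Shared state between the primary and the backup of a process pair.
struct SharedState {
  std::atomic<uint64_t> checkpointSeq{0};    // last state the primary persisted
  std::atomic<int64_t>  lastHeartbeatMs{0};  // timestamp of the last heartbeat
};

inline int64_t nowMs() {
  using namespace std::chrono;
  return duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count();
}

// Primary: after each unit of work, publish a checkpoint and bump the heartbeat.
void primaryStep(SharedState &s, uint64_t newCheckpoint) {
  s.checkpointSeq.store(newCheckpoint, std::memory_order_release);
  s.lastHeartbeatMs.store(nowMs(), std::memory_order_release);
}

// Backup: take over once the heartbeat goes stale, resuming from the last
// published checkpoint instead of starting from scratch.
bool backupShouldTakeOver(const SharedState &s, int64_t timeoutMs = 500) {
  return nowMs() - s.lastHeartbeatMs.load(std::memory_order_acquire) > timeoutMs;
}
```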

Critique

  1. The evaluation and the solution are highly correlated: 1. maintain hardware on a regular basis; 2. delay software updates as long as you can, to give them time to mature; 3. only fix a bug if it is actually causing outages. But the analysis does not transfer directly to the latest system development. For example, suppose we build a project using the fat-client model, where the client caches the user balance and the item price, and there is a business interface that spends the user's balance to purchase items. In this model a well-behaved client will not issue the purchase request when the balance is low, so the developer of the server-side interface does not need to handle and return an insufficient-balance error; the server can simply drop the connection when the balance is insufficient. The opposite holds in the thin-client model: the client does not know the balance or the price before issuing the request, so the server side must return an insufficient-balance business error so the client can prompt the user. Another example: suppose we maintain one file per user holding the user's balance, created at registration time. Under this design, the business interface described above should not try to handle errors such as read/write failures or a missing or permission-less file; it should refuse to serve the affected user, throw an exception, and log it, because at that point it can no longer provide correct service. Serving reluctantly would likely lead to more problems and even corrupted data.
  2. I think the idea brought by the paper is art: it proposes how to make a system more reliable using redundancy and modularity, and it also suggests how panics should be raised so the system is easier to debug.
  3. It is definitely an early piece of systems research that still guides today's work. Linux 6.0 adds Runtime Verification, which is lightweight yet rigorous and more tractable for complex systems than classical exhaustive verification techniques such as model checking and theorem proving. Besides, eADR for persistent memory is a hardware-software co-design approach to the durability problem.

Reference

  1. https://dl.acm.org/doi/pdf/10.1145/265924.265927
  2. https://personal.utdallas.edu/~hamlen/Papers/necula96safe.pdf
  3. https://docs.kernel.org/trace/rv/index.html