[CSE231 Paper Reading] ESXi

Summary

  1. Motivation: To reduce the overhead of overcommitting guest memory, and to exploit content-based page sharing and hot I/O page remapping, VMware proposed ESX Server. In a hosted setup, when a guest accesses a large block of I/O through DMA (Direct Memory Access), the QEMU emulator does not put the result into a shared page but writes it directly into the guest's memory through memory mapping. The advantage of this approach is that it can emulate a wide variety of hardware devices; the disadvantage is that the path of each I/O operation is long, requiring multiple context switches and multiple data copies, so performance is poor. ESX Server comes with various mechanisms and policies that manage proportional sharing of memory, and these ideas carry over to upcoming research on GPU devices.

    1. Ballooning is the method ESX Server uses to reclaim memory from the guest OS. A balloon driver inside the guest pins pages using the guest OS's own memory-management algorithms, and ESX Server then reclaims the machine pages backing those pinned pages. The idea is that the guest operating system knows best which pages should be given up.
    2. Content-Based Transparent Page Sharing: contrary to Disco, which required special guest interfaces for shared-page creation, ESX Server finds identical pages across VMs by scanning them and using a hash table. A background process scans pages and generates hash values, which are then used to locate candidate matches. If a full comparison confirms a match, the page is marked copy-on-write (COW) and the duplicate machine memory is freed.
    3. The idle memory tax resolves a long-standing issue in pure share-based resource management: a VM is charged more for idle pages than for pages it is actively using, and the ballooning technique described above is used to reclaim the idle pages. (The adjusted shares-per-page formula is sketched after this summary.)
    4. I/O page remapping is also used on large-memory systems to lower I/O copying overheads.
  2. Solution: a thin software layer that efficiently multiplexes hardware resources among several virtual machines. What makes this unique? Traditional hosted virtual machine platforms run a hypervisor on top of a standard operating system and intercept I/O requests coming from VMs, handling them as host OS system calls. Because ESX Server operates directly on top of the system hardware, it offers faster I/O performance and greater flexibility in resource management, and it can run several operating systems without requiring guest OS modifications. When a virtual machine is created, the hypervisor presents it with a contiguous, addressable "physical" memory space. A pmap data structure in the hypervisor maps each guest "physical" page to a machine page, and shadow page tables map guest virtual addresses directly to machine addresses, keeping translation fast while still supporting oversubscription. To inflate the memory balloon in a VM, the balloon driver pins guest pages so they cannot be paged out to disk, and the hypervisor then unmaps the corresponding machine pages from the guest and reclaims them for the host.

  3. The numerous techniques discussed in the paper are carefully evaluated, and the improvement over Disco is contrasted at several points. Among the significant findings: the idle memory tax improves overall throughput, and content-based page sharing reclaims memory that grows roughly linearly with the number of VMs. Ballooning adds only a small overhead (1.4%–4.4%): inflated VMs performed nearly identically to unballooned VMs configured with the same amount of memory, which suggests only a modest performance cost relative to Disco.
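For reference, the idle memory tax adjusts each VM's shares-per-page ratio so that idle pages are charged more than active ones (my transcription of the paper's formula; $S$ is the VM's shares, $P$ its allocated pages, $f$ the fraction of pages recently active, and $\tau$ the tax rate):

$\rho = S / \big(P \cdot (f + k(1-f))\big)$, with $k = 1/(1-\tau)$

Pages are reclaimed, via ballooning or paging, from the VM with the lowest adjusted ratio $\rho$, so a VM hoarding idle memory loses it first.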

Critique

  1. Each technique is evaluated with at least one example. For ballooning, a Linux virtual machine running dbench with 40 clients showed nearly identical performance to a server without ballooning. To evaluate content-based sharing, instances of three real-world guest OSes were launched, and the page-sharing mechanism was able to reclaim more than a third of the VMs' memory. To evaluate the idle memory tax, the memory shares of 2 VMs were tracked while the tax rate was adjusted.
  2. This is definitely the start of a line of scientific research, but ESXi itself is pure engineering work that took a great amount of time. The idea of exploiting idle pages is widely used today, and the memory-disaggregation work arguably starts from the boom triggered by this memory resource management paper. Ballooning has become the standard trick for any architecture that wants memory virtualization performance: GPU virtualization such as "XenGT" and "gScale", as well as "CXL"-based virtualization, all implement a variant of it.
  3. This approach can be applied to the recent MemTrade, but I think the latest work on idle-page or cold-page identification is still not very efficient. For virtualization we can simply work at the page level, yet we do not make everything run smoothly before hitting the memory bottleneck: for example, if the host's memory pool is 80% full, we need a mechanism that swaps/zswaps/disaggregates proactively instead of waiting until memory is exhausted to do so.

Stronger ownership check in nightly-1.66.0

I happened to compile RisingWave with the latest Rust nightly compiler because of random segfaults on RISC-V. Thankfully Alex Chi has approved the solution for the Promise trait bug where the trait bound was not calculated correctly.

Then I ran into two lifetime issues on nightly-1.66.0.

error: lifetime may not live long enough
   --> src/storage/src/hummock/file_cache/store.rs:112:17
    |
64  | impl<'a, K, V> StoreBatchWriter<'a, K, V>
    |      -- lifetime `'a` defined here
...
93  |     pub fn append(&mut self, key: K, value: &V) {
    |                   - let's call the lifetime of this reference `'1`
...
112 |                 rotate_last_mut(&mut self.buffers)
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'1` must outlive `'a`

The "may not live long enough" error can usually be worked around by manually renaming the lifetimes along the dataflow, for example by giving the helper its own fresh lifetime instead of reusing the struct's lifetime.
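A minimal sketch of that workaround, with hypothetical types standing in for the real store.rs code (the actual fix depends on rotate_last_mut's real signature):

// Give the helper its own lifetime 'b instead of tying its result to the
// struct's 'a, so the &mut self borrow ('1) no longer has to outlive 'a.
fn rotate_last_mut<'b, T>(v: &'b mut Vec<T>) -> Option<&'b mut T> {
    if v.is_empty() {
        return None;
    }
    v.rotate_left(1);
    v.last_mut()
}

struct StoreBatchWriter<'a, T> {
    buffers: &'a mut Vec<Vec<T>>,
}

impl<'a, T> StoreBatchWriter<'a, T> {
    pub fn append(&mut self, value: T) {
        // The reborrow of self.buffers now only lives for this call.
        if let Some(last) = rotate_last_mut(&mut *self.buffers) {
            last.push(value);
        }
    }
}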

error: lifetime `'_` is part of concrete type but not used in parameter list of the `impl Trait` type alias
impl ChanSender for FifoSender {
    type SendFuture<'a> = impl Future<Output = BatchResult<()>>;

    fn send(&mut self, chunk: Option<DataChunk>) -> Self::SendFuture<'_> {
        async {
            self.sender
                .send(chunk.map(DataChunkInChannel::new))
                .await
                .map_err(|_| SenderError)
        }
    }
}

You will get the `'_` error for the opaque types inside an async trait implementation; you just need to specify the lifetime explicitly: type SendFuture<'a> = impl Future<Output = BatchResult<()>> + 'a;
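Applied to the snippet above, only the alias line changes (a sketch of the fixed impl; everything else is unchanged):

impl ChanSender for FifoSender {
    // `+ 'a` names the lifetime captured by the async block's borrow of
    // &mut self, which is what the error above was complaining about.
    type SendFuture<'a> = impl Future<Output = BatchResult<()>> + 'a;

    fn send(&mut self, chunk: Option<DataChunk>) -> Self::SendFuture<'_> {
        async {
            self.sender
                .send(chunk.map(DataChunkInChannel::new))
                .await
                .map_err(|_| SenderError)
        }
    }
}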

Azure Mapper vs. Graph DB vs. RDBMS

The motivation dataset

CompuCache Design of VM

The comparison

| | Azure Mapper | Graph DBs | RDBMS |
| --- | --- | --- | --- |
| Pros | Parallel computing; recursive pointer dereference | Chase edges across servers; various neighborhood queries | Extended buffer pool; semantic cache; temporary-data store |
| Benefit by | Avoiding strictly serializing the round-trip RPC | Avoiding strictly serialized full round-trip RPC overheads | Index lookups; predicates on MVs |

Sproc (Stored Procedure) and eRPC abstractions

| Sproc | CompuCache networking (eRPC) |
| --- | --- |
| An application specifies a sproc by calling the Register function and executes it by calling the Execute function. | CompuCache uses eRPC, a user-space RPC library that runs on DPDK or RDMA. |
| Sproc code is a parameter to Register and is broadcast to all CompuCache servers. | DPDK avoids OS kernel overhead by executing the networking in user space. |
| On each server, the code is compiled locally as a dynamic library and loaded into the runtime of the server. | RDMA offers the further benefit of offloading the CPU by running most of the networking protocol on the network interface card (NIC). |
| A CompuCache server might not have all the data needed to execute a sproc, which requires coordination with other CompuCache servers. | To leverage the full bandwidth of high-speed networks, CompuCache batches small operations in a single network transfer, including small I/O requests and responses and all sproc execution requests. |

CompuCache Server

Server-side pointer chasing is provided by the LocalTranslator data structure, which exposes a Translate function.

A sproc invokes this function to map virtual cache addresses into physical server locations.

Cross-server pointer chasing is handled by catching out-of-bounds (OOB) exceptions when a translated address does not reside on the local server.
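A hypothetical sketch (my own illustration, not CompuCache's API) of what a pointer-chasing sproc looks like on top of a Translate-style lookup; chase, translate, and the zero terminator are all assumptions:

fn chase(mut addr: u64, translate: impl Fn(u64) -> Option<*const u64>) -> Result<u64, u64> {
    loop {
        match translate(addr) {
            // The address is resident on this server: follow the pointer.
            Some(ptr) => {
                let next = unsafe { *ptr };
                if next == 0 {
                    return Ok(addr); // end of the chain
                }
                addr = next;
            }
            // Out of bounds locally: surface the unresolved address so the
            // chase can continue wherever that range is cached.
            None => return Err(addr),
        }
    }
}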

Result

Possible Migration to CXL

  1. Highly correlated with far-memory setups: a CXL Type 3 device could provide better memory access without needing pointer chasing to exploit information in far memory.
  2. It could also leverage near-data processing on the memory node in a memory pooling setup.

Reference

  1. https://www.youtube.com/watch?v=XXJj1nJuLbo

[CSE231 Paper Reading] Formal Requirements for Virtualizable Third-Generation Architectures

Summary

  1. The motivation for this paper is that architectures such as x86 are not automatically virtualizable because of sensitive but unprivileged instructions like popf, which motivates the authors to compose a formal definition of the requirements a computer must satisfy to support virtualization.
    1. Equivalence: the VMM must provide the virtual machine with an environment that is essentially identical to the physical machine. A program runs in this environment the same way it runs on the physical machine, except that it may perform somewhat differently due to resource competition or VMM intervention; for example, the virtual machine's I/O and network may be slower than when running exclusively on the physical machine, because of the host's speed limits or because multiple virtual machines share resources.
    2. Efficiency: the performance of instruction execution in the virtual machine must not be significantly degraded compared with running on the physical machine. The standard requires that the vast majority of instructions run directly on the physical CPU without VMM intervention; an ARM system run on the x86 architecture through QEMU, for example, is emulated rather than virtualized.
    3. Resource control: the VMM completely controls system resources. The VMM coordinates the allocation of host resources to individual virtual machines, and a virtual machine must not be able to take control of the host's resources itself.
  2. The solution is the trap-and-emulate model. Privileged instructions can only run in system mode and trigger a processor exception if run in user mode. A normal operating system runs its kernel in system mode, because the kernel manages system resources and needs privileged instructions, while ordinary user programs run in user mode. Under trap-and-emulate, the virtual machine's user programs still run in user mode, but the virtual machine's kernel also runs in user mode, a technique called privilege compression. The unprivileged instructions of the virtual machine run directly on the processor, satisfying the virtualization standard's efficiency requirement that most instructions execute without VMM intervention. When the VM executes a privileged instruction, however, it traps because it is running in user mode, and control falls into the VMM, which completes the access to system resources on the VM's behalf, the so-called emulation. This in turn satisfies the requirement that the VMM control system resources: the virtual machine cannot run privileged instructions directly, modify the host's resources, and thereby destroy the host's environment.
  3. They prove this using the definition of the machine state S = (E, M, P, R) and a virtual machine map (VM map) f: Cr -> Co, a one-one homomorphism with respect to all the operators $e_i$ in the instruction sequence set L, showing that an instruction either executes equivalently or traps, without leaking information between the two state spaces. If this proof is correct, I think it's great.
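Roughly, the homomorphism condition can be stated as follows (my paraphrase of the paper's notation): for every state $S \in C_r$ and every instruction sequence $e_i \in L$ there is a corresponding sequence $e_i'$ on the virtual system such that

$f(e_i(S)) = e_i'(f(S))$

i.e., executing on the real machine and then mapping through $f$ reaches the same state as mapping first and then executing under the VMM, which is why a program cannot tell whether it ran virtualized.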

Critique

  1. I generally agree with the authors' use of a state machine to formally prove how sensitive instructions affect information leakage. Today, if we want to prove something like this, we may have better tools available, such as the K framework.
  2. The novelty lies in classifying the sensitive instructions:
    1. Instructions that attempt to access or modify the virtual machine mode or machine state.
    2. Instructions that attempt to access or modify sensitive registers or storage units, such as clock registers, interrupt registers, etc.
    3. Instructions that attempt to access the memory protection system or memory or address allocation system.
    4. All I/O instructions.
      The essence of trap-and-emulate is to ensure that instructions that may affect the correct operation of the VMM are emulated by the VMM, while most non-sensitive instructions run as usual. The CPU needs to support a protection mechanism, such as an MMU, to isolate the physical system and other VMs from the currently active VM. Since the execution of control-sensitive instructions may change the state of the system (processor and devices), ensuring the VMM's absolute control over resources while keeping the VM running normally requires such instructions to trap into the VMM, which handles the emulation. The result of a behavior-sensitive instruction depends on the CPU's highest privilege level, and the guest OS runs at a non-highest privilege level, so to guarantee correct results it also needs to trap into the VMM and be emulated there.
  3. It's obviously the start of the whole virtualization and cloud industry. The full-virtualization camp, represented by VMware, is committed to the idea of running guests without modification. Xen's excellent performance with appropriate modifications to the guest OS made paravirtualization a big hit. Later, Intel and AMD entered the fray and extended the hardware to address the difficulties of traditional x86 virtualization and to improve performance. In the end, hardware extensions were adopted by both the full and para camps; since then the performance advantage of the para camp has become less obvious, and the user-friendliness of the full camp, which runs guests directly without modification, has gradually gained the upper hand.
  4. Although various software-level solutions solved the problems of virtualizing the x86 architecture, they introduced extra overhead as well as significant complexity to the VMM implementation. Intel did not simply turn the unprivileged sensitive instructions into privileged ones, because not all privileged instructions need to be intercepted. As a typical example, whenever the OS kernel switches processes it switches the cr3 register so that it points to the page table of the current process. When shadow page tables are used for the GVA-to-HPA mapping, the VMM must capture every guest write to cr3 so that it points to the shadow page table instead. When hardware-level EPT support is enabled, cr3 no longer needs to point to the shadow page table; it keeps pointing to the guest process's own page table. Intel developed VT to support virtualization by adding Virtual-Machine Extensions (VMX) to the CPU; the VMM runs in VMX root mode, which is not fundamentally different from normal mode except that it supports the VMX operations.

[CSE231 Paper Reading] EROS

Summary

  1. The motivation of the paper is that the authors want an abstraction for checkpointing and an object-based storage/memory system for a microkernel, achieving efficiency by fusing carefully selected abstract objects with object caching. All programs with direct hardware access, or with the capacity to directly change the security state of the system, are gathered in the kernel to make security assurance simpler. Thus the single-level store and the bottom halves of device drivers are implemented in the kernel.

  2. Via capability-protected abstractions, the kernel offers a very direct virtualization of the underlying hardware. The page size is determined by the underlying hardware; pages hold data, while capabilities are stored in nodes. Nodes hold 32 capabilities and play a role similar to metadata in traditional systems. Pages and nodes contain all the state that applications can see. Every EROS process holds a capability naming the root of its address-space tree, and this tree's mapping is converted on demand into the representation required by the underlying hardware. When a page fault occurs, the EROS kernel traverses the address-space tree to construct a valid mapping, issuing object faults as necessary to load pages and nodes from disk. If a valid mapping is found, it is added to the hardware mapping table and the faulting operation is retried; if not, the fault is reported via an upcall to a user-level fault handler supplied by the process or, failing that, by the address space. They also have a virtual copy mechanism for page migration and address translation caching for faster page walks. (A toy sketch of pages and nodes follows this summary.)

  3. The authors present a strong result for EROS: it performs admirably on 6 of the 7 benchmarks. In their view, this offers strong, if clearly not conclusive, evidence that the presence of protected subsystems (some of which were employed in the EROS benchmarks) need not adversely affect system performance. Furthermore, it seems that capability systems can operate effectively without specialized hardware assistance.
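A toy Rust sketch (my own illustration, not EROS code) of the two primitive object types mentioned above: data pages whose size comes from the hardware, and nodes holding 32 capabilities, which together form the address-space tree.

const PAGE_SIZE: usize = 4096; // fixed by the underlying hardware

// A capability names either a data page, a node of further capabilities,
// or nothing at all.
enum Capability {
    Page(Box<[u8; PAGE_SIZE]>), // leaf: a hardware-sized data page
    Node(Box<Node>),            // interior: 32 further capabilities
    Void,                       // empty slot
}

// Nodes act like metadata in a traditional system; a process's address
// space is the tree rooted at one such capability.
struct Node {
    slots: [Capability; 32],
}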

Critique

  1. I generally believe the experiments, since they directly exercise the claimed strengths of the system.
  2. It's art and science. It has abstractions for inter-process communication (IPC), virtual address spaces, threads, and, unlike most L4 kernels, authorization capabilities. Virtual address spaces have no kernel-defined structure; instead, page faults are conveyed via IPC to the pager threads in charge of building the virtual address space by mapping frames into it. To assist virtualization, exceptions and non-native system calls are also transmitted via IPC. IPC supports RPC through reply capabilities and uses synchronous and asynchronous endpoints (port-like destinations without in-kernel buffering) for inter-thread communication. Capabilities are stored and isolated in capability address spaces made up of capability container objects known as cnodes.
  3. It's the beginning of this line of research, because it provides the abstraction for IPC and circumscribes the machine state, which makes it well suited to verification using LTL-based verification techniques.

FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs @ ATC'22

I heard this talk before, but because the speaker's spoken English was not great, I didn't follow it carefully at the time. Same for this paper: once CXL.cache is available, the implementation approach will be a completely different one, but we could implement the same logic on other devices and use CXL.cache to do the offload. Hardware hacks such as the TLB acceleration are still necessary, though.

The GPU-side SmartNIC is designed to:

  1. Use GPU virtual addresses for direct PCIe P2P communication with the local GPU, and provide remote GPUs with reliable 100Gb network access. (Going through the DPA or the IOMMU is a relatively slow operation.)
  2. Offer programmability, so the datapath can be optimized with C++ HLS.

    Compared with NVIDIA's DPU this is a real advantage; perhaps the people building that hardware never considered why putting in an Arm core, rather than an FPGA with better software-defined capabilities, was the right choice. (I used a DPU when competing at ISC at the beginning of the year, and you can really only call their OpenMPI optimizations for a few primitives such as alltoallv.)

The approach is to hook the GPU's send/recv requests, which now become GPU-to-NIC P2P communication. This requires offloading the control plane onto the GPU and the data plane onto the FPGA: the GPU polls the FPGA's status registers, and FpgaNIC can directly access all of the GPU's data.

A GTLB optimizes the VA->PA translation, with the following parameters.



The evaluation is quite successful. Offloading effectiveness is tied to the so-called overlap ratio, as with DPUs, but that only fits the specific needs of HFT firms; the DPU already makes good use of the bandwidth.

Record OS Buffer Cache and shared buffer in PostgreSQL to prevent double caching

When running an experiment that limits memory for TPC-H query 6 on PostgreSQL, we found that local memory was roughly 50% application-occupied and 50% OS buffer cache. I desperately needed a tool that records why PG double-caches, whether it uses the sysv read/write scheme or the mmap scheme.

Two plugins to install

  1. pg_buffercache: already in the PG contrib extension folder. If it is not installed, go to the pg_buffercache folder and make sure pg_config points to the right environment before building.
  2. pgfincore: make sure you check out the right branch; 1.2.2 is the one compatible with PG 9.6.

cat the metrics to csv

I need to record pg_buffered, pgbuffer_percent, percent_of_relation, os_cache_mb.

in case of double caching

You'll see how the OS cache coalesces with the data in the buffer cache; you can follow [2] to flush it.

Reference

  1. https://sites.google.com/site/itmyshare/database-tips-and-examples/postgres/postgresql-buffering
  2. https://unix.stackexchange.com/questions/253816/restrict-size-of-buffer-cache-in-linux/255000#255000

Achieving 100Gbps Intrusion Prevention on a Single Server @ OSDI'20

FPGA-offloaded intrusion detection, balancing performance and power efficiency. The first insight of this paper is that above-100Gbps on-chip computation over TCP, such as load balancing, power management, and security processing, is too slow if done on the CPU or with PIM, hence this near-NIC computation. There are similar scenarios in exchanges, where simple logic must be applied while packets are being sent, and a SmartNIC can greatly reduce latency while keeping volatility under control. The second insight is to use Intel Hyperscan to accelerate IDS/IPS pattern matching in software as much as possible.

The third insight is a hardware scheduler that coordinates three operations: regex rule matching, TCP reassembly, and other tasks.

The full matcher polls a ring buffer filled by the DMA engine. Each packet carries metadata, including the rule IDs that the MSPM identified as partial matches (the second half of Hyperscan). For each rule ID, the full matcher retrieves the complete rule (including its regular expression) and checks whether it matches fully.

The TCP reassembler is an out-of-order (OOO) design. Packets first go through the fast path and then to a BRAM-based cuckoo hash table (the flow table); an insertion table compensates for the differing execution times of the OOO engines.

To reduce hash-table lookups in the FPSM, they also built in a SIMD shift-or matcher (which obviously should not be faster than Intel's commercial CPU implementation, but it is impressive to cram this much logic onto an FPGA).

täko: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement

(Rust-like naming: PhantomData is used to label an otherwise-unused type parameter.) Here the idea is to capture data movement at the object level and turn it into software-defined data movement for different workloads.


They have callback functions for how the cache hierarchy communicates with the dataflow fabric: the hardware scheduler lets the engines invoke onMiss(), onEviction(), and writeback(). They simply manifest each operation as SHARED and PRIVATE state changes, and I don't think these three callbacks alone can make a memory-order-correct Morph.
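A rough Rust rendering (my own sketch, not the paper's interface) of what those three callbacks look like as an engine-facing trait:

// Hypothetical trait: each software-defined data-movement engine registers
// handlers that the hardware scheduler invokes on cache events.
trait DataMovementEngine {
    // A lookup missed in the cache hierarchy; decide what to fetch next.
    fn on_miss(&mut self, addr: u64);
    // A line was chosen for eviction; the engine may repack or prefetch.
    fn on_eviction(&mut self, addr: u64);
    // Dirty data is flowing back toward memory.
    fn writeback(&mut self, addr: u64, data: &[u8]);
}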

In terms of power, my view is that saving energy with PIM or a modified iMC means you avoid heavy communication between the cores and the memory controller, while the dataflow-based handling inside the cache hierarchy and NoC here may intrinsically reduce traffic and thus provide an energy-efficient solution.

However, this type of design fully exposes the memory to attackers via speculation and Rowhammer, which means it would have to be presented to users as a black box if it were ever made commercially available.

Reference

  1. https://mir.cs.illinois.edu/marinov/publications/ZhaoETAL20InvarSpec.pdf

[CSE231 Paper Reading] Log-Structured Filesystem

Motivation

Computer hardware performance began to explode: CPUs were getting faster and RAM was getting bigger. However, while sequential read and write speeds of hard disks were increasing, random read and write speeds, limited by physical seek times, remained around 10 ms per access. Meanwhile, the file systems of the time, whether the Unix File System or FFS, issued a large number of random reads and writes (at least 5 random writes are required to create a new file in FFS) and thus became the performance bottleneck of the whole system. At the same time, because of the page cache, the authors argue that random reads are not the main problem: with more and more memory, most reads can be served from cache, so the main goal of LFS is to reduce random writes to the hard disk.

Implementation

  1. File System as a Log
  2. Segment-based bulk writing solves the random-write problem, but how does LFS implement reads? Similar to UFS/FFS, LFS stores the contents of a file within a segment and also stores an index for the file. Specifically, within a segment the file contents are stored in fixed-size data blocks: Segment 0 stores the two data blocks of file2, and the following inode2 stores the indexes of these two blocks. Unlike UFS/FFS, however, LFS inodes are dynamically allocated, so LFS stores an index to the inodes, called the inode map, at the end of each segment. All inode map contents are cached in memory, which speeds up reads. (A toy read-path sketch follows this list.)
  3. Garbage collection: as mentioned earlier, LFS needs a garbage-collection mechanism to remove stale data. In LFS, several segments containing obsolete data blocks are compacted into new segments, and the old data in them is discarded.
  4. Failure recovery is obviously important for any file system. Unlike UFS, which uses the fsck command to recover from a failure, LFS stores a checkpoint for the entire drive: because each segment of LFS stores the address of the next segment, the whole file system is organized like a chain. In the checkpoint, LFS stores the addresses of the first and last segments of this chain, so the entire file system can be recovered by reading the checkpoint. LFS updates the checkpoint every 30 seconds.
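A toy Rust sketch (my own illustration, not the paper's code; all names are stand-ins) of the read path implied above: the in-memory inode map yields the inode's location in the log, and the inode yields the data block addresses.

use std::collections::HashMap;

struct Inode {
    block_addrs: Vec<u64>, // on-log addresses of the file's data blocks
}

struct Lfs {
    inode_map: HashMap<u64, u64>, // inode number -> inode address (cached in RAM)
    inodes: HashMap<u64, Inode>,  // stand-in for reading an inode from the log
    log: HashMap<u64, Vec<u8>>,   // stand-in for the on-disk segments
}

impl Lfs {
    fn read_block(&self, ino: u64, nth: usize) -> Option<&Vec<u8>> {
        let inode_addr = self.inode_map.get(&ino)?;   // 1. inode map lookup (in memory)
        let inode = self.inodes.get(inode_addr)?;     // 2. fetch the inode from the log
        let block_addr = inode.block_addrs.get(nth)?; // 3. index the nth block
        self.log.get(block_addr)                      // 4. read the data block
    }
}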

Experiments

Figures 3 and 4 show disk capacity utilization, the fraction of live data in cleaned segments, and the resulting write cost, which beat "FFS today" both for uniform data and for a workload with 10% hot and 90% cold data. Figure 5 shows that, in terms of segment utilization, the hot-and-cold workload does better than the uniform one. Figure 6 shows the segment-utilization distribution that results when the cost-benefit policy is used to select segments to clean and live blocks are grouped by age before being rewritten.
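For reference, the cleaning cost model behind these figures (my transcription from the paper, with $u$ the utilization of the segments being cleaned): the write cost is $2/(1-u)$ for $0 < u < 1$, since every live byte in a cleaned segment must be read and rewritten before new data can be written, and the cost-benefit policy cleans the segment that maximizes

benefit/cost $= \big((1-u) \cdot \text{age}\big) / (1+u)$

which is why cold segments are allowed to reach higher utilization before being cleaned.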

Evaluation effectiveness

The evaluation is sophisticated and well designed.

Novelty

Rosenblum is a co-founder of VMware and Ousterhout is one of the authors of Raft. I don't have any comments on their novelty; they were just way ahead of their time.

SoTA Connection

The segment-based design of LFS coincides with the physical characteristics of SSDs and is therefore widely used in SSD firmware; the memtable/compaction of LSM-trees mirrors the memory buffer and garbage collection of LFS; and newer file systems such as btrfs implement copy-on-write and multi-version features based on the same append-only idea. EXT3/4 also have redo logging (journaling) with journal/ordered/writeback modes.

Also, the Parallel Log-structured File System (PLFS) follows LFS to build a parallel file system tailored to N-N writes, which speeds up MPI writes.

Reference

  1. http://www.eecs.harvard.edu/~cs161/notes/lfs.pdf
  2. https://lwn.net/Articles/353411/
  3. https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf
  4. https://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
  5. https://www.usenix.org/legacy/publications/library/proceedings/sd93/seltzer.pdf