Recent multiverse debugging journey

My colleague made more progress in the past week than in the previous six months, because yyw was there to point him in the right direction.

First, the static instrumentation sets up the start location and offsets: every basic block of the instrumented code gets an offset, and the jmp map metadata is placed at the end. The first problem was that the 6.2.0-36 and 5.4.0 kernels behave differently: rtld.c:1306 segfaults because the physical-to-virtual address mapping of p_header differs between the two kernels.

Second, SPEC CPU2017 gcc contains self-modifying code, which we do not support.

Third, perl blows up in its destructors, because libc's handling of dtors is clumsy; it is also implemented with self-modifying code.

How to create a rootfs for kernel debugging

Compulsory kernel options

CONFIG_E1000=y
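
One way to set it, sketched assuming you run from the root of the kernel source tree (scripts/config ships with the kernel):

./scripts/config --enable CONFIG_E1000   # e1000 is QEMU's default emulated NIC
make olddefconfig                        # resolve any newly exposed options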

Create the rootfs

readonly BASEDIR=$(readlink -f $(dirname $0))/

QEMU_PATH=${BASEDIR}/
QEMU_BUILD_PATH=${QEMU_PATH}/build/
IMAGE_NAME=qemu-image.img


qemu-img create ${QEMU_PATH}/${IMAGE_NAME} 100g
mkfs.ext4 ${QEMU_PATH}/${IMAGE_NAME}
mkdir -p ${QEMU_PATH}/mount-point.dir
sudo mount -o loop ${QEMU_PATH}/${IMAGE_NAME} ${QEMU_PATH}/mount-point.dir/
sudo apt install debootstrap debian-keyring debian-archive-keyring
# If wget errors occur during the debootstrap install,
# add proxy configuration to /etc/wgetrc (http_proxy, https_proxy, ftp_proxy)
sudo debootstrap --no-check-certificate --arch amd64 lunar ${QEMU_PATH}/mount-point.dir/
cd ${QEMU_PATH}/mount-point.dir
sudo chroot .

# After creating rootfs, 
# 1) Change root password
# $ passwd

# 2) (Optional) Add proxy configuration
# Open /etc/profile, and add proxy configuration
# export HTTP_PROXY="http://:/"
# export HTTPS_PROXY="http://:/"
# $ . /etc/profile

# 3) (Optional) Add local Ubuntu repository mirrors
# Open /etc/apt/sources.list, and add repository mirrors

# 4) Package update
# $ apt update
# $ apt upgrade
# $ apt install nano pciutils debian-keyring debian-archive-keyring openssh-server net-tools ifupdown

# 5) Modify sshd config
# $ vi /etc/ssh/sshd_config
# ...
# PermitRootLogin yes
# ...

# 6) Modify network config
# $ vi /etc/network/interfaces
# # add below lines
# auto ens3
# iface ens3 inet dhcp
# # for q35 machine
# auto enp0s2
# iface enp0s2 inet dhcp
# alternative network config
# $ vi /etc/netplan/00-installer-config.yaml
# # add below lines
# network:
#  ethernets:
#    ens3:
#      dhcp4: true
#    enp0s2:
#      dhcp4: true
#  version: 2
# 7) Quit
# $ exit

After creating the rootfs and compiling the kernel:

SMDK_KERNEL_PATH=${BASEDIR}/Bede-linux/
ROOTFS_PATH=${BASEDIR}
MONITOR_PORT=45454

QEMU_SYSTEM_BINARY=`which qemu-system-x86_64`
BZIMAGE_PATH=${SMDK_KERNEL_PATH}/arch/x86_64/boot/bzImage
INITRD_PATH=/boot/initrd.img-6.4.0+
IMAGE_PATH=${ROOTFS_PATH}/qemu-image.img

function print_usage(){
	echo ""
	echo "Usage:"
	echo " $0 [-x vm_index(0-9)]"
	echo ""
}

# log_error is used below but was never defined in this snippet; define a
# minimal fallback so the script runs standalone.
function log_error(){
	echo "Error: $*" >&2
}

while getopts "x:" opt; do
	case "$opt" in
		x)
			if [ $OPTARG -lt 0 ] || [ $OPTARG -gt 9 ]; then
				echo "Error: VM count should be 0-9"
				exit 2
			fi
			VMIDX=$OPTARG
			;;
		*)
			print_usage
			exit 2
			;;
	esac
done

if [ -z "${VMIDX}" ]; then
	NET_OPTION="-net user,hostfwd=tcp::2242-:22,hostfwd=tcp::6379-:6379,hostfwd=tcp::11211-:11211 -net nic"
else
	echo "Info: Running VM #${VMIDX}..."
	MONITOR_PORT="4545${VMIDX}"
	IMAGE_PATH=$(echo ${IMAGE_PATH} | sed 's/\.img/-'"${VMIDX}"'.img/')
	MACADDR="52:54:00:12:34:${VMIDX}${VMIDX}"
	TAPNAME="tap${VMIDX}"
	NET_OPTION="-net nic,macaddr=${MACADDR} -net tap,ifname=${TAPNAME},script=no"

	IFCONFIG_TAPINFO=`ifconfig | grep ${TAPNAME}`
	if [ -z "${IFCONFIG_TAPINFO}" ]; then
		log_error "${TAPNAME} SHOULD be up for using network in VM. Run 'setup_bridge.sh' in /path/to/SMDK/lib/qemu/"
		exit 2
	fi
fi

if [ ! -f "${QEMU_SYSTEM_BINARY}" ]; then
	log_error "qemu-system-x86_64 binary does not exist. Run 'build_lib.sh qemu' in /path/to/SMDK/lib/"
	exit 2
fi

if [ ! -f "${BZIMAGE_PATH}" ]; then
	log_error "SMDK kernel image does not exist. Run 'build_lib.sh kernel' in /path/to/SMDK/lib/"
	exit 2
fi

if [ ! -f "${IMAGE_PATH}" ]; then
	log_error "QEMU rootfs ${IMAGE_PATH} does not exist. Run 'create_rootfs.sh' in /path/to/SMDK/lib/qemu/"
	exit 2
fi
#  echo sudo ${QEMU_SYSTEM_BINARY} \
#     -smp 3 \
#     -numa node,cpus=0-2,memdev=mem0,nodeid=0 \
#     -object memory-backend-ram,id=mem0,size=8G \
#     -kernel ${BZIMAGE_PATH} \
# 	-initrd ${INITRD_PATH} \
#     -drive file=${IMAGE_PATH},index=0,media=disk,format=raw \
#     -drive file=${IMAGE1_PATH},index=1,media=disk,format=raw \
#     -enable-kvm \
#     -monitor telnet::${MONITOR_PORT},server,nowait \
#     -serial mon:stdio \
#     -nographic \
#     -append "root=/dev/sda rw console=ttyS0 nokaslr memblock=debug loglevel=7" \
#     -m 8G,slots=4,maxmem=32G \
# 	-device virtio-crypto-pci,id=crypto0,cryptodev=cryptodev0 \
#     -object cryptodev-backend-builtin,id=cryptodev0 \
# 	-object secret,id=sec0,file=./Drywall/passwd.txt \
#     ${NET_OPTION}

${QEMU_SYSTEM_BINARY} \
    -S -s -smp 4 \
    -numa node,cpus=0,memdev=mem0,nodeid=0 \
    -object memory-backend-ram,id=mem0,size=8G \
    -numa node,cpus=1,memdev=mem1,nodeid=1 \
    -object memory-backend-ram,id=mem1,size=8G \
    -numa node,cpus=2,memdev=mem2,nodeid=2 \
    -object memory-backend-ram,id=mem2,size=8G \
    -numa node,cpus=3,memdev=mem3,nodeid=3 \
    -object memory-backend-ram,id=mem3,size=8G \
    -kernel ${BZIMAGE_PATH} \
    -initrd ${INITRD_PATH} \
    -drive file=${IMAGE_PATH},index=0,media=disk,format=raw \
    -serial mon:stdio \
    -nographic \
    -append "root=/dev/sda rw console=ttyS0 memblock=debug loglevel=7 cgroup_no_v1=1" \
    -m 32G,slots=4,maxmem=36G \
    -nic bridge,br=virbr0,model=virtio-net-pci,mac=02:76:7d:d7:1e:3f
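
The VM starts frozen (-S) with a gdb stub listening on tcp::1234 (-s), so you can attach gdb from the host and continue the boot. A minimal sketch (the vmlinux path assumes the kernel tree configured above):

gdb ${SMDK_KERNEL_PATH}/vmlinux \
    -ex 'target remote :1234' \
    -ex 'hbreak start_kernel' \
    -ex 'continue'
# If breakpoints don't hit, add nokaslr to -append (as in the commented-out
# invocation above).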

Design of per-cgroup memory disaggregation

This post will be integrated into yyw's knowledge base.

For an orchestration system, resource management needs to consider at least the following aspects:

  1. An abstraction of the resource model, including:
  • What kinds of resources there are, for example, CPU, memory (local vs remote, which can be transparent to the user), etc.;
  • How to represent these resources with data structures;

  2. Resource scheduling
  • How to describe the resource request (spec) of a workload, for example, "this container requires 4 cores and 12GB~16GB (4GB local / 8GB~12GB remote) of memory";
  • How to describe the current resource allocation state of a node, such as the amount of allocated/unallocated resources and whether it supports overcommitting;
  • The scheduling algorithm: how to select the most suitable node for a workload according to its spec;

  3. Resource quota
  • How to ensure that the amount of resources used by a workload does not exceed the preset range (so as not to affect other workloads);
  • How to ensure quotas for workloads and for system/basic services, so that the two do not affect each other.

k8s is currently the most popular container orchestration system, so how does it solve these problems?

k8s resource model

Against the questions above, let's see how k8s is designed:

  1. Resource model:
    • Abstract resource types such as cpu/memory/device/hugepage;
    • Abstract the concept of node;
  2. Resource scheduling:
    • Two concepts, request and limit, are abstracted, representing the minimum (request) and maximum (limit) resources a container requires;
    • The scheduling algorithm selects an appropriate node for the container according to the amount of resources currently available for allocation (Allocatable) on each node; note that k8s scheduling only looks at requests, not limits.
  3. Resource enforcement:
    • Use cgroups, at multiple levels, to ensure that the amount of resources used by a workload does not exceed the specified limits.

An example of a resource application (container):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: busybox
    image: busybox
    resources:
      limits:
        cpu: 500m
        memory: "400Mi"
      requests:
        cpu: 250m
        memory: "300Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Here requests and limits represent the minimum and maximum values of required resources, respectively.

  • The unit m of CPU resources stands for millicores, one-thousandth of a core, so cpu: 500m means 0.5 cores are required;
  • The units of memory are the familiar ones such as MB and GB.
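
To see the enforcement side (the focus of the rest of this post), apply the pod and inspect the cgroup files the kubelet writes. A sketch, assuming cgroup v2 with the systemd cgroup driver and a hypothetical busybox-pod.yaml holding the spec above; the exact path varies by QoS class and pod UID:

kubectl apply -f busybox-pod.yaml
# On the node: requests < limits, so this pod lands in the burstable slice
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/memory.max  # 400Mi -> 419430400
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/*/cpu.max     # 500m  -> 50000 100000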

Node resource abstraction

$ k describe node <node>
...
Capacity:
  cpu:                          48
  mem-hard-eviction-threshold:  500Mi
  mem-soft-eviction-threshold:  1536Mi
  memory:                       263192560Ki
  pods:                         256
Allocatable:
  cpu:                 46
  memory:              258486256Ki
  pods:                256
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 800m (1%)    7200m (15%)
  memory              1000Mi (0%)  7324Mi (2%)
  hugepages-1Gi       0 (0%)       0 (0%)
...

Let's look at these parts separately.

Capacity

The total resources of this node (which can be simply understood as its physical configuration); for example, the output above shows that this node has 48 CPUs, about 256GB of memory, and so on.

Allocatable

The total amount of resources that k8s can allocate. Obviously, Allocatable will not exceed Capacity; for example, 2 CPUs are held back above, leaving only 46.

Allocated

The amount of resources this node has allocated so far. Note the message also says the node may be oversubscribed, so the sum may exceed Allocatable, but it will not exceed Capacity.

That Allocatable does not exceed Capacity is easy to understand; but which resources are set aside, causing Allocatable < Capacity?

Node resource segmentation (reservation)

Because k8s-related basic services such as kubelet/docker/containerd and other operating system processes such as systemd/journald run on each node, not all resources of a node can be used to create pods for k8s. Therefore, when k8s manages and schedules resources, it needs to separate out the resource usage and enforcement of these basic services.

To this end, k8s proposed the Node Allocatable Resources[1] proposal, from which the terms above, such as Capacity and Allocatable, come. A few notes:

  • If Allocatable is available, the scheduler will use Allocatable, otherwise it will use Capacity;
  • Using Allocatable is not overcommit, using Capacity is overcommit;

Calculation formula: [Allocatable] = [NodeCapacity] - [KubeReserved] - [SystemReserved] - [HardEvictionThreshold]
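
For the node shown above, the gap is 263192560Ki - 258486256Ki ≈ 4.5Gi of memory and 2 CPUs. One split that adds up (the per-component numbers below are assumed for illustration, not taken from a real config):

Allocatable(cpu) = 48 - 1 (kube) - 1 (system) = 46
Allocatable(mem) ≈ 251Gi - 2Gi (kube) - 2Gi (system) - 0.5Gi (hard eviction) ≈ 246.5Gi (≈ 258486256Ki)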

Let’s look at these types separately.

System Reserved

Basic services of the operating system, such as systemd, journald, etc., are outside k8s management. k8s cannot manage the allocation of these resources, but it can manage their enforcement, as we will see later.

Kube Reserved

k8s infrastructure services, including kubelet/docker/containerd, etc. Similar to the system services above, k8s cannot manage the allocation of these resources, but it can manage the enforcement of these resources, as we will see later.

EvictionThreshold (eviction threshold)

When resources such as node memory or disk are about to be exhausted, kubelet starts evicting pods according to QoS priority (best-effort/burstable/guaranteed); eviction resources are reserved for this purpose.
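
These thresholds come from kubelet flags; a sketch that would produce the 500Mi hard / 1536Mi soft values shown in the node output above (the grace period is assumed):

kubelet \
  --eviction-hard='memory.available<500Mi' \
  --eviction-soft='memory.available<1536Mi' \
  --eviction-soft-grace-period='memory.available=1m30s' \
  ...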

Allocatable

Resources available for k8s to create pods.

The above is the basic resource model of k8s. Let's look at a few related configuration parameters.

Kubelet related configuration parameters

kubelet command parameters related to resource reservation (segmentation):

  • --system-reserved=""
  • --kube-reserved=""
  • --qos-reserved=""
  • --reserved-cpus=""

It can also be configured via the kubelet config file, for example:

$ cat /etc/kubernetes/kubelet/config
...
systemReserved:
  cpu: "2"  
  memory: "4Gi"

Whether you need a dedicated cgroup for the resource quota of these reserved resources, to ensure they and the workloads do not affect each other:

  • --kube-reserved-cgroup=""
  • --system-reserved-cgroup=""

The default is not enabled. In fact, complete isolation is difficult to achieve; the consequence is that system processes and pod processes may affect each other. For example, as of v1.26, k8s does not support IO isolation, so when the IO of a host process (such as log rotation) spikes, or when a pod process performs a java dump, it affects all pods on the node.

That covers the k8s resource model; now we get to the focus of this article: how k8s uses cgroups to limit the resource usage of workloads such as containers, pods, and basic services (enforcement).

k8s cgroup design

cgroup base

cgroups are a Linux kernel infrastructure that can limit, account for, and isolate the amount of resources (CPU, memory, IO, etc.) used by a group of processes.

There are two versions of cgroup, v1 and v2; for the differences between them, please refer to Control Group v2. Since it is already 2023, we focus on v2. cgroup v1 exposes more memory stats, like swappiness, and all of its control is flat, while v2 exposes only controllers such as cpuset and memory and presents a hierarchical view.

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

root@banana:~/CXLMemSim/microbench# ls /sys/fs/cgroup
cgroup.controllers      cpuset.mems.effective  memory.reclaim
cgroup.max.depth        dev-hugepages.mount    memory.stat
cgroup.max.descendants  dev-mqueue.mount       misc.capacity
cgroup.pressure         init.scope             misc.current
cgroup.procs            io.cost.model          sys-fs-fuse-connections.mount
cgroup.stat             io.cost.qos            sys-kernel-config.mount
cgroup.subtree_control  io.pressure            sys-kernel-debug.mount
cgroup.threads          io.prio.class          sys-kernel-tracing.mount
cpu.pressure            io.stat                system.slice
cpu.stat                memory.numa_stat       user.slice
cpuset.cpus.effective   memory.pressure        yyw

root@banana:~/CXLMemSim/microbench# ls /sys/fs/cgroup/yyw
cgroup.controllers      cpu.uclamp.max       memory.oom.group
cgroup.events           cpu.uclamp.min       memory.peak
cgroup.freeze           cpu.weight           memory.pressure
cgroup.kill             cpu.weight.nice      memory.reclaim
cgroup.max.depth        io.pressure          memory.stat
cgroup.max.descendants  memory.current       memory.swap.current
cgroup.pressure         memory.events        memory.swap.events
cgroup.procs            memory.events.local  memory.swap.high
cgroup.stat             memory.high          memory.swap.max
cgroup.subtree_control  memory.low           memory.swap.peak
cgroup.threads          memory.max           memory.zswap.current
cgroup.type             memory.min           memory.zswap.max
cpu.idle                memory.node_limit1   pids.current
cpu.max                 memory.node_limit2   pids.events
cpu.max.burst           memory.node_limit3   pids.max
cpu.pressure            memory.node_limit4   pids.peak
cpu.stat                memory.numa_stat

The procfs is registered in

Multi-Generation LRU

HeMem's critique is that access-bit-based sampling is slow, so it uses PEBS instead, while TPP leverages autoNUMA, relying on the kernel's LRU-list approach. Then I found the MGLRU approach, which can additionally select aging pages with the better spatial locality of access-bit scanning (a rmap walk targets a single page and does not try to profit from discovering a young PTE).

It focuses both on memory-backed files, for which detailed results are given, and on more general cases such as anon pages accessed through page tables, for which workloads with and without temporal locality are assumed.

Overhead Evaluation through eBPF

Does it match the LRU performance?

According to the DynamoRIO results, getting within 5% of perfect LRU locally gets you to 95% of the performance.

Carbink

A comparison between RDMA-based memory disaggregation and CXL.mem-based memory disaggregation.

The span + coherency state in Carbink is just like cacheline coherency in CXL.mem, except that if two threads contend on one span, it bounces back and forth; that is the charm of cacheability: CXL does not need the cacheline to be transmitted, since it is registered in the window of the local LLC.

A lot of the software optimization is built around the penalty of transmitting small chunks over RDMA being too high; if we replace RDMA with CXL, we no longer need to care about pointer serialization and relinking, because everything lives in the same memory space. Maintaining metadata for pages is still a huge overhead. The local page map is a two-level radix tree, and the lookup process is similar to a page table walk: the first 20 bits of the object's virtual address index the first-level radix tree table, and the next 15 bits index the second-level table. The same mapping method allows Carbink to map the virtual address of a locally-resident span to its metadata. Thus, in the era of CXL, this paper is of little use; there is nothing to draw on.
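
The index split can be sanity-checked with shell arithmetic; a sketch assuming 4KiB pages (12 offset bits, so 20 + 15 + 12 = 47 address bits):

va=0x7f1234567000
l1=$(( (va >> 27) & ((1 << 20) - 1) ))   # first 20 bits -> level-1 table index
l2=$(( (va >> 12) & ((1 << 15) - 1) ))   # next 15 bits  -> level-2 table index
echo "L1=$l1 L2=$l2"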

The difference between EC-Split (their implementation of Hydra) and EC-Batch is the critical path of the memory transaction. To reconstruct a single span, a compute node must contact multiple memory nodes to pull in all the required fragments; this requirement makes the swap operation vulnerable to stragglers, increasing tail latency. Their compaction and defragmentation approach saves remote data usage but actually gains no performance when the local:remote ratio is above 50%; they only gain about 10% more on the local side by hiding the span swap operations.

Reference

  1. https://www.usenix.org/conference/osdi22/presentation/zhou-yang
  2. https://www.google.com/search?q=hydra+fast+21&oq=hydra+fast+21&aqs=chrome..69i57j33i299l3j33i22i29i30i625l6.4597j1j4&sourceid=chrome&ie=UTF-8

Dagger: Efficient and Fast RPCs in Cloud Microservices with Near-Memory Reconfigurable NICs @ASPLOS21

The current RPC layer between nodes can be scaled by utilizing NUMA variants like NEBULA [1] or soNUMA, which provide on-chip buffer locality awareness of data movement. A lot of RDMA RPC accelerator / SmartNIC designs implement this kind of locality information when sending to the LLC or to an RDMA Shared Receive Queue (to enable inter-endpoint buffer sharing).

For tail latency, newly arriving RPCs that will inevitably violate the SLO are eagerly NACKed, informing the client early about increased load conditions on the server; NEBULA rejects or retries the requests it predicts will fail (a fail-fast approach). The NIC-to-core policy is implemented naturally using DDIO plus fast packet reassembly, but this dispatch is not cached into L1 directly.

Dagger implements microservices integrated with an FPGA-reconfigurable RPC stack. Their eventual goal is to enable NUMA (with future CXL support) over FPGA RPC. Their premise is that if a microservice like NGINX/MongoDB is not latency-bound and memory-bound, it can simply be disaggregated through their NIC into the same memory space. The RPC IDL is the prevailing gRPC. The CPU/NIC communication is very similar to approaches in the HFT industry: DDIO + cache hash function + AVX streaming cacheline loads, WQE-by-MMIO.

Their load balancer + flow scheduler on the FPGA is very similar to a CXL.cache device-bias CHA; a CHA normally distributes requests on-chip or across sockets through UPI.

When CXL FPGA NICs arrive, I think they can cache cachelines into L1 and use hardware to replace the coherence protocol. I still think there is more OS-level synchronization that could leverage the cache coherence of CXL.cache.

Reference

  1. https://ieeexplore.ieee.org/abstract/document/9138953
  2. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9180035

[CSE231 Paper Reading] Filesystem Reflection

SSD vs. HDD design differences.

We described two approaches to building file systems, clustering, and logging, for Hard Disk Drives (HDDs). How should we design a file system for a Solid State Drive (SSD)? Your answer should identify two hardware properties of SSDs that contrast with the properties of HDDs and describe how these properties inform your design decisions.

In early UNIX file systems such as FFS, inodes and dirents interpret the file abstraction; those designs are aware of the physical media of HDDs: how addressing a space can be made faster, and what page size balances swapping physical memory onto disk. This type of file system is not designed for SSDs. SSDs use NAND QLC/SLC/MLC as the physical media storing files, and the on-chip controller handles LBA-to-page map translation, driving channels, write-behind, allocation, and wear leveling (considered harmful in [1]). Nowadays, SSD firmware file systems are designed much like log-structured file systems, because they can recover quickly from physical media failures at low latency cost.

EXT4 keeps evolving with new media and smarter designs. Remzi's group at the University of Wisconsin still publishes papers on EXT4 core design, such as leveraging device-mapper deduplication [2] and making the data path more critical [3].

Starting from PMFS, people have been getting closer to how a file system can be designed when persistent memory is attached as memory, or more recently as CXL.mem:

  1. With the hardware speeding up, we require the software to be low latency as well, e.g. tracers, hypervisors, and multi-tenancy isolation management.
  2. The ordering and durability of a single cacheline, a single page, or a single pointer should be dealt with at the firmware level, whether resolved by locks, lock-free hardware mechanisms, or other smart API support.
  3. Workload-aware optimization, by providing performance hints and advice interfaces like madvise() in Linux; the storage of an ML model serialization is completely different from one configured as a plain memory dump (no pointers or hierarchical logic).
  4. Optimize consistency using a combination of atomic in-place updates, logging at cacheline granularity (fine-grained journaling), and copy-on-write (CoW).

Replace POSIX?

In our discussion of single-level stores, we identified a few problems with using this design for all interactions with persistent storage. In this question, describe one potential issue that arises when using a single-level store for persistent storage. How does using a POSIX/file system interface simplify/eliminate the issue that you described?

The VFS is a long-discussed overhead inside the kernel, causing higher latency through random scheduler overhead, or lower bandwidth through blk and other kernel mechanisms occupying the queue. But we don't currently have a proposal to replace it, since the kernel must rely on blk pread/pwrite and bdi writeback for all of this. [4] provides

The semantics of POSIX are too strong for a distributed FS. If you could put something on the fly into the CXL cache layer that bounds when data appears as ready, that would be

Software-defined SmartSSD / CXL-SSD

Reference

  1. https://dl.acm.org/doi/pdf/10.1145/3538643.3539750#:~:text=We%20argue%20that%20wear%20leveling,produce%20a%20high%20write%20amplification.
  2. https://elinux.org/images/0/02/Filesystem_Considerations_for_Embedded_Devices.pdf
  3. https://github.com/Harry-Chen/HERMES
  4. https://dl.acm.org/doi/10.1145/3538643.3539751

Achieving 100Gbps Intrusion Prevention on a Single Server @OSDI'20

FPGA-offloaded intrusion detection, balancing performance and power efficiency. The insight of this paper is that computation above 100Gbps TCP, such as load balancing, power budgeting, and security, is too slow on a CPU or with PIM, so they built this near-NIC computation. A similar scenario, where simple logic on the packet path must be changed, occurs when exchanges send network packets; applying a SmartNIC there can greatly reduce latency while coping with volatility. The second insight is using Intel Hyperscan to speed up IDS/IPS matching in software as much as possible.

The third insight is hardware scheduling that optimizes three operations: regex rule matching, TCP reassembly, and other tasks.

The full matcher polls a ring buffer filled by the DMA engine. Each packet carries metadata, including the rule IDs that the MSPM determined to be partial matches (the back half of Hyperscan). For each rule ID, the full matcher retrieves the complete rule (including the regex) and checks whether it is a full match.

The TCP reassembler is an out-of-order design. Packets first take a fast path, then go to a BRAM-based cuckoo hash table (the flow table); an insertion table compensates for the different execution times of the out-of-order engines.

To reduce hash-table lookups in the FPSM, they also implemented a SIMD shift-or matcher (which presumably is not faster than Intel's commercial FPU implementation; still, it is impressive they packed that much logic onto the FPGA).

täko: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement

(Rust-like naming: PhantomData is used to label an undeclared type.) Here it is used to lift object-level data movement into software-defined data movement for different workloads.


They have callback functions for how it communicates with the dataflow fabric. The hardware scheduler allows the engine to invoke onMiss(), onEviction(), and writeback(). They simply manifest each operation with SHARED and PRIVATE state changes, and I don't think these three callbacks alone can make a memory-order-correct Morph.

In terms of power saving, my view is that saving energy using PIM or a modified iMC means you don't need to communicate as much between the core and the MC, while dataflow-based analysis inside the iMC or NoC may intrinsically reduce traffic and thus provide an energy-efficient solution.

However, this type of design fully exposes memory to attackers via speculation and rowhammer, which will definitely push it toward being a black box to the user if it is ever made commercially available.

Reference

  1. https://mir.cs.illinois.edu/marinov/publications/ZhaoETAL20InvarSpec.pdf

[CSE231 Paper Reading] Log Structure Filesystem

Motivation

Computer hardware performance began to explode: CPUs were getting faster and RAM was getting bigger; however, while the sequential read and write speeds of hard disks were increasing, random read and write speeds, limited by physical seek times, were hardly getting below 10 ms. Meanwhile, the file systems of the time, whether the Unix File System or FFS, issued a large number of random reads and writes (at least 5 random writes are required to create a new file in FFS), and thus became the performance bottleneck of the whole system. At the same time, because of the page cache, the authors argue that random reads are not the main problem: with more and more memory, most reads can be served from the cache, so the main problem for LFS is reducing random writes to the hard disk.

Implementation

  1. File System as a Log
  2. Segment-based bulk writing solves the random-write problem, but how does LFS implement reads? Similar to UFS/FFS, LFS stores the contents of a file within a segment, and also stores the file's index. Specifically: within a Segment, file contents are stored in fixed-size data blocks; for example, Segment 0 stores the two data blocks of file2, and the inode2 that follows stores the indexes of these two data blocks. However, unlike UFS/FFS, LFS inodes are dynamically allocated, so LFS stores an index to the inodes at the end of each Segment, called the inode map. In LFS, all inode map contents are cached in memory, which speeds up reads.
  3. Garbage collection: As mentioned earlier, LFS requires a garbage collection mechanism designed to remove old data. In LFS, multiple segments containing obsolete data blocks are compacted into new data segments, and the old data in them is deleted.
  4. Failure recovery is obviously important for any file system. Unlike UFS, which uses the fsck command to recover from a failure, LFS stores a checkpoint for the entire drive: because each Segment of LFS stores the address of the next Segment, the entire file system is organized like a chain. In the Checkpoint, LFS stores the addresses of the first and last Segments of this chain, so the entire file system can be recovered by reading the Checkpoint. LFS updates the Checkpoint every 30 seconds.

Experiments

Figures 3 and 4 show that disk capacity utilization, and the fraction of live data in cleaned segments versus write cost, are better than FFS today, for both uniform data and 10% hot / 90% cold data. Figure 5 shows that, in terms of segment utilization, hot-and-cold data does better than uniform. Figure 6 shows the segment distribution that occurs when the cost-benefit policy is used to select segments to clean and live blocks are grouped by age before being rewritten.

Evaluation effectiveness

This evaluation is sophisticated in its design.

Novelty

Rosenblum is a co-founder of VMware, and Ousterhout is one of the authors of Raft. I have no comments on their novelty; they were just way ahead of their time.

SoTA Connection

The segment-based design of LFS coincides with the physical characteristics of SSDs and is therefore widely used in SSD firmware; the memtable/compaction of LSM trees matches the memory buffer and GC of LFS; and newer file systems such as btrfs implement copy-on-write and multi-version features based on the LSM append-only idea. EXT3/4 also has redo logging with journal/ordered/writeback modes.

Also, the Parallel Log-structured File System follows LFS to build a parallel FS tailored to N-to-N writes, which speeds up MPI writes.

Reference

  1. http://www.eecs.harvard.edu/~cs161/notes/lfs.pdf
  2. https://lwn.net/Articles/353411/
  3. https://web.stanford.edu/~ouster/cgi-bin/papers/lfs.pdf
  4. https://pages.cs.wisc.edu/~remzi/OSTEP/file-lfs.pdf
  5. https://www.usenix.org/legacy/publications/library/proceedings/sd93/seltzer.pdf