Design of per-cgroup memory disaggregation

This post will be integrated into yyw's knowledge base.

For an orchestration system, resource management needs to consider at least the following aspects:

  1. Resource model abstraction, including:
  • What kinds of resources there are, for example CPU and memory (local vs. remote, which can be made transparent to the user);
  • How to represent these resources with data structures.

  2. Resource scheduling

  • How to describe the resource request (spec) of a workload, for example, "this container requires 4 cores and 12GB~16GB of memory (4GB local / 8GB~12GB remote)";
  • How to describe the current resource allocation status of a node, such as the amount of allocated/unallocated resources and whether it supports overcommitment;
  • The scheduling algorithm: how to select the most suitable node according to the workload spec.

  3. Resource quota

  • How to ensure that the amount of resources used by a workload does not exceed the preset range (so as not to affect other workloads);
  • How to ensure quotas for workloads and for system/basic services so that the two do not affect each other.

k8s is currently the most popular container orchestration system, so how does it solve these problems?

k8s resource model

Compared with the above questions, let's see how k8s is designed:

  1. Resource model :
    • Abstract resource types such as cpu/memory/device/hugepage;
    • Abstract the concept of node;
  2. Resource Scheduling :
    • Two concepts, request and limit, are abstracted, representing respectively the minimum (request) and maximum (limit) resources required by a container;
    • The scheduling algorithm selects an appropriate node for the container according to each node's currently allocatable amount of resources (Allocatable); note that k8s scheduling only looks at requests, not limits.
  3. Resource enforcement :
    • Use cgroups to ensure that the maximum amount of resources used by a workload does not exceed the specified limits at multiple levels.

An example of a resource application (container):

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: busybox
    image: busybox
    resources:
      limits:
        cpu: 500m
        memory: "400Mi"
      requests:
        cpu: 250m
        memory: "300Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Here requests and limits represent the minimum and maximum values of the required resources, respectively.

  • The CPU resource unit m is an abbreviation of millicores, meaning one-thousandth of a core, so cpu: 500m means that 0.5 core is required;
  • The unit of memory is well understood, that is, common units such as MB and GB.
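
As a preview of the enforcement part covered later: these values ultimately become cgroup knobs on the node. A minimal sketch using cgroup v2 file names (values are illustrative; the actual conversion is done by the kubelet and the container runtime):

# cpu: 500m  -> CFS quota of 50ms for every 100ms period
echo "50000 100000" > cpu.max
# memory: 400Mi -> hard limit in bytes
echo $((400 * 1024 * 1024)) > memory.max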

Node resource abstraction

$ k describe node <node>
...
Capacity:
  cpu:                          48
  mem-hard-eviction-threshold:  500Mi
  mem-soft-eviction-threshold:  1536Mi
  memory:                       263192560Ki
  pods:                         256
Allocatable:
  cpu:                 46
  memory:              258486256Ki
  pods:                256
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource            Requests     Limits
  --------            --------     ------
  cpu                 800m (1%)    7200m (15%)
  memory              1000Mi (0%)  7324Mi (2%)
  hugepages-1Gi       0 (0%)       0 (0%)
...

Let's look at these parts separately.

Capacity

The total resources of this node (which can be simply understood as its physical configuration); for example, the output above shows that this node has 48 CPUs, about 256GB of memory, and so on.

Allocatable

The total amount of resources that k8s can allocate. Obviously, Allocatable will not exceed Capacity; for example, as seen above, 2 CPUs are held back and only 46 are left.

Allocated

The amount of resources this node has allocated so far. Note the message also says that limits may be overcommitted, so the sum of limits may exceed Allocatable, while the sum of requests will not.

That Allocatable does not exceed Capacity is also easy to understand; but what exactly are those resources reserved for, causing Allocatable < Capacity?

Node resource segmentation (reserved)

Because k8s-related basic services such as kubelet/docker/containerd and other operating system processes such as systemd/journald run on each node, not all resources of a node can be used to create pods for k8s. Therefore, when k8s manages and schedules resources, it needs to separate out the resource usage and enforcement of these basic services.

To this end, k8s proposed the Node Allocatable Resources[1] proposal, from which the above terms such as Capacity and Allocatable come. A few notes:

  • If Allocatable is available, the scheduler will use Allocatable, otherwise it will use Capacity;
  • Scheduling against Allocatable does not overcommit the node; scheduling against Capacity does;

Calculation formula: [Allocatable] = [NodeCapacity] - [KubeReserved] - [SystemReserved] - [HardEvictionThreshold]
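
A worked example with the node shown above, assuming for illustration that kubeReserved is zero and all reservation comes from systemReserved (cpu: 2, memory: 4Gi, as in the kubelet config further below) plus the 500Mi hard-eviction threshold:

# memory: 263192560Ki - 4Gi(systemReserved) - 0(kubeReserved) - 500Mi(hard eviction)
echo $(( 263192560 - 4194304 - 0 - 512000 ))   # = 258486256 (Ki), the Allocatable above
# cpu: 48 - 2(systemReserved)
echo $(( 48 - 2 ))                             # = 46, the Allocatable above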

Let’s look at these types separately.

System Reserved

Basic services of the operating system, such as systemd and journald, are outside k8s management. k8s cannot manage the allocation of these resources, but it can manage their enforcement, as we will see later.

Kube Reserved

k8s infrastructure services, including kubelet/docker/containerd, etc. Similar to the system services above, k8s cannot manage the allocation of these resources, but it can manage the enforcement of these resources, as we will see later.

EvictionThreshold (eviction threshold)

When node resources such as memory or disk are about to be exhausted, kubelet starts evicting pods according to QoS priority (best-effort/burstable/guaranteed); the eviction thresholds reserve resources for this purpose.
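
These thresholds are set on the kubelet; a hedged example that mirrors the 500Mi/1536Mi values from the node output above (flag values are illustrative):

--eviction-hard=memory.available<500Mi
--eviction-soft=memory.available<1536Mi
--eviction-soft-grace-period=memory.available=1m30s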

Allocatable

Resources available for k8s to create pods.

The above is the basic resource model of k8s. Let's look at a few related configuration parameters.

Kubelet related configuration parameters

kubelet command parameters related to resource reservation (segmentation):

  • --system-reserved=""
  • --kube-reserved=""
  • --qos-reserved=""
  • --reserved-cpus=""

These can also be set via the kubelet configuration file, for example:

$ cat /etc/kubernetes/kubelet/config
...
systemReserved:
  cpu: "2"  
  memory: "4Gi"

Whether to use dedicated cgroups to enforce quotas on these reserved resources, so that they and the pods do not affect each other:

  • --kube-reserved-cgroup=""
  • --system-reserved-cgroup=""

These are not enabled by default. In fact, it is difficult to achieve complete isolation, and the consequence is that system processes and pod processes may affect each other. For example, as of v1.26, k8s does not support IO isolation, so if the IO of a host process (such as log rotation) spikes, or a pod process performs a Java heap dump, it will affect all pods on this node.
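
If you do want separate enforcement for the reserved resources, a sketch of the relevant kubelet flags (cgroup paths and amounts are illustrative examples, and the named cgroups must already exist):

kubelet \
  --enforce-node-allocatable=pods,kube-reserved,system-reserved \
  --kube-reserved=cpu=1,memory=2Gi   --kube-reserved-cgroup=/kube.slice \
  --system-reserved=cpu=2,memory=4Gi --system-reserved-cgroup=/system.slice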

That concludes the introduction to the k8s resource model; next we move to the focus of this article: how k8s uses cgroups to limit (enforce) the resource usage of workloads such as containers, pods, and basic services.

k8s cgroup design

cgroup base

cgroups are a Linux kernel facility that can limit, account for, and isolate the amount of resources (CPU, memory, IO, etc.) used by groups of processes.

There are two versions of cgroup, v1 and v2; for the differences between them, refer to Control Group v2. Since it's already 2023, we focus on v2. cgroup v1 exposes more memory knobs (such as swappiness) and mounts its controllers as separate, flat hierarchies, while v2 exposes a single unified hierarchy in which controllers such as cpuset and memory are enabled per subtree.

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

root@banana:~/CXLMemSim/microbench# ls /sys/fs/cgroup
cgroup.controllers      cpuset.mems.effective  memory.reclaim
cgroup.max.depth        dev-hugepages.mount    memory.stat
cgroup.max.descendants  dev-mqueue.mount       misc.capacity
cgroup.pressure         init.scope             misc.current
cgroup.procs            io.cost.model          sys-fs-fuse-connections.mount
cgroup.stat             io.cost.qos            sys-kernel-config.mount
cgroup.subtree_control  io.pressure            sys-kernel-debug.mount
cgroup.threads          io.prio.class          sys-kernel-tracing.mount
cpu.pressure            io.stat                system.slice
cpu.stat                memory.numa_stat       user.slice
cpuset.cpus.effective   memory.pressure        yyw

root@banana:~/CXLMemSim/microbench# ls /sys/fs/cgroup/yyw
cgroup.controllers      cpu.uclamp.max       memory.oom.group
cgroup.events           cpu.uclamp.min       memory.peak
cgroup.freeze           cpu.weight           memory.pressure
cgroup.kill             cpu.weight.nice      memory.reclaim
cgroup.max.depth        io.pressure          memory.stat
cgroup.max.descendants  memory.current       memory.swap.current
cgroup.pressure         memory.events        memory.swap.events
cgroup.procs            memory.events.local  memory.swap.high
cgroup.stat             memory.high          memory.swap.max
cgroup.subtree_control  memory.low           memory.swap.peak
cgroup.threads          memory.max           memory.zswap.current
cgroup.type             memory.min           memory.zswap.max
cpu.idle                memory.node_limit1   pids.current
cpu.max                 memory.node_limit2   pids.events
cpu.max.burst           memory.node_limit3   pids.max
cpu.pressure            memory.node_limit4   pids.peak
cpu.stat                memory.numa_stat
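
As a minimal sketch of how a child group like yyw above can be created and limited (assuming a cgroup v2 mount at /sys/fs/cgroup and the memory controller enabled in cgroup.subtree_control):

# create a child cgroup and move the current shell into it
mkdir /sys/fs/cgroup/yyw
echo $$ > /sys/fs/cgroup/yyw/cgroup.procs

# throttle (memory.high) and hard-limit (memory.max) the group's memory
echo 512M > /sys/fs/cgroup/yyw/memory.high
echo 1G > /sys/fs/cgroup/yyw/memory.max

# proactively reclaim 256M from this group (e.g., push it to swap or a slower tier)
echo 256M > /sys/fs/cgroup/yyw/memory.reclaim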

The procfs is registered in

Reliable and Fast DWARF-Based Stack Unwinding @OOPSLA19

DWARF is a bytecode format for recording runtime debugging info in terms of symbolic register and memory locations, which makes the last instruction and the call-frame info recoverable. Given that current unwinding is slow and Google's production traces use frame pointers for fast unwinding, the authors provide fixed-point, control-flow-analysis-based validation and synthesis.

When running each line of code, the symbolic expressions are evaluated to locate the stack frame, and the unwinder recursively walks the stack for every call frame.

By architectural advantage, we can leverage offsets based on variables that are not updated during the computation, like %rip or %


Address Generation Unit operation offloading.

CXL.mem does not require ATS, since maintaining that coherence traffic may be too crowded; type 3 devices stay only within the DCOH of the endpoint.

ATS info is recorded at the firmware level as PMU counters; it sounds like other logic is needed to obtain these metrics.

Reference

  1. https://en.wikipedia.org/wiki/Address_generation_unit
  2. https://indico.cern.ch/event/1106990/contributions/5041334/attachments/2533446/4359546/20221024_Suarez_ACAT_fin.pdf

WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection @Eurosys23

  1. WAFFLE is about cheap ways to detect expensive bugs; it is thus concerned with the design tradeoffs of concurrency-bug detection tools (active delay injection in particular) compared to TSVD.
  2. In breaking down the design space of active delay injection, it distills the essence of delay injection for the reader, which is useful.
  3. I'm interested in systems that exploit physical time to avoid more expensive analysis when tackling hard concurrency problems (e.g., Google's Spanner).

Comments

  1. The delay-injection oracle does not find ABA data-structure bugs; one would need to record timestamps, not necessarily for happens-before logic but for other oracles, to hunt those.

Multi-Generation LRU

HeMem's critique is that access-bit-based sampling is slow, so it uses PEBS, while TPP leverages autoNUMA and relies on the kernel's LRU-list approach for hotness. Then I found the MGLRU approach, which can additionally age pages (a rmap walk targets a single page and does not try to profit from discovering a young PTE) while keeping the better spatial locality of the access-bit scanning approach.

It focuses both on memory-backed files, which give detailed results, and on more general cases like anonymous pages accessed through page tables, for which they make assumptions with and without temporal locality.

Overhead Evaluation through eBPF

Does it match the LRU performance?

According to the DynamoRIO results, keeping 5% of a perfect LRU locally gets to 95% of the performance.

[CSE290S] Secured filesystem

Efficient reconstruction techniques for disaster recovery in secret-split

It uses secret splitting. The assumption is that shards protected by different authentication mechanisms are safe without encryption, relying on matching alone, and the guard will identify attacks once the attacker's activity is large enough.

Approximate pointers; use whole-shard recovery to defend against the adversary.

64 - 64 - 64
   \ 64 \ 64
   ...    ...

There are different approaches to the secret split: one is derived from a list of keys, the other uses a 128-bit field that does not leak information about the key.

Plutus: Scalable, secure file sharing on untrusted storage

The experiments in the paper only reach a few KB per second. It is sobering that 20 years have passed since that era; by today's standards, the software overhead of this kind of authorization system is already quite large.

Lethe: Secure Deletion by Addition

Designs secure deletion via copy-on-write over ZFS using a keyed hash forest.

TMTS Talk @CXL SIG

Scheduling asynchronous page migration based on the access pattern.

Hardware support is crucial

  • Page access scans alone have high latency
  • PMU address sampling drastically reduces promotion latency (access to promotion time)
  • Earlier promotion improves performance

Per application policy is crucial

  • The ufard they run in userspace implements per-process control flow
  • Each application's page-migration policy should be kept separate; conflicts are handled using PGO

Thoughts

  1. PGO rather than online PEBS? Because PEBS's overhead is huge, even if you sample in a separate thread or lower the sample period to 10k or 2M (see the sketch after this list).
  2. The TLB cost could be hidden by a CXL.cache atomic cacheline exchange, with no need to update the page table. The page-table reuse distance should also be considered, since either way of updating the page table (1. mark the page read-only and migrate, or 2. atomic exchange) costs time the next time this page is used.
  3. Would using eBPF to control all the policies be a better choice? E.g., offloading policy to the RC/EP.
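
For context on the first point, a hedged sketch of how PEBS address sampling is commonly gathered with perf (the period mirrors the 10k figure above; exact events vary by CPU):

perf mem record -c 10000 -- ./app   # sample load/store addresses, ~10k period
perf mem report --sort=mem,symbol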

Reference

  1. contiguitas_isca23.pdf
  2. https://dl.acm.org/doi/pdf/10.1145/3503222.3507745
  3. https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

LLVM JIT/AOT checkpoint and restore on a new architecture.

We focus on the classic interpreter for the functional PoC and on AOT for the performance PoC. In the picture above, we consider the LLVM view before machine-related optimization, together with the wasm view, to be cross-platform. For the latter, we need to find stable points such as function calls, branches, and jumps, so that instructions are not architecturally reordered or semantically hazardous. For turning the view back to wasm, we originally thought DWARF would help, but the WAMR team did not implement the mapping between the wasm and native stacks. They did, however, implement an AOT GC that periodically commits the native stack to the wasm view at those stable points.

Record and replay files, sockets, IPC, and locks. In the VM there are two implementations of WASI: one is POSIX-based but by definition only uses a subset of POSIX, and the other is uvwasi, a message-passing library that also has a Windows implementation. Because we don't really know which implementation is the target, we only record the operation log for files, sockets, IPC, and locks.

Specifically for the open syscall: it does not call into WAMR's libwasi but is merely a chain of function calls, fopen->__wasilibc_open_nomode->find_relpath->__wasilibc_nocwd_openat_nomode->__fdopen. So we simply instrument fopen and capture the __fdopen input to obtain the {fd, path, option} three-element tuple. This also needs instrumentation in AOT mode.

Snapshotting the WebAssembly view of memory and frames. For the interpreter, we defined a C++ struct for better snapshotting of the memory and frames and put them into C++ struct snapshots. The interpreter frame is set up linearly for every function call. For JIT/AOT we need to rely on the calling convention of the source machine and symbolically reconstruct the call frame from the wasm stack on recovery. For big/little endian, you just transform if they differ; the JIT/AOT phase should take care of the memory.

Re-architecting AOT for snapshotting. The current implementation of the native stack frame is incremental, which is not necessarily good for recovery; we should do something like FastJIT, where all function calls and basic blocks are just jmps with auxiliary stack operations (convenient for committing regs?). Then on every function call we need to commit the CPU state to the wasm stack, relying on LLVM infra to generate 1. labels that will not be optimized out on either side (a research problem: the frequently accessed points), and 2. a mapping from registers and the native stack to the wasm stack at stable points (or we need DWARF and stronger information if we only get one shot at checkpoint time). On recovery, we can just jmp to the label and resume.

ReJIT or ReAOT on the target machine. To recover, we first re-JIT or re-AOT the wasm binary and do a translation that performs only the function-call-specific operations of the generated native code on the target machine. Then the native call frame is set up, and we just set the native PC to the entry of the last called function.

File descriptor recovery. We call the target machine's WASI implementation to recover the file descriptors, and we need to make sure the order is the same.

Socket recovery. For the same IP, we could simply follow CRIU's implementation, which uses the kernel's TCP repair support (the TCP_REPAIR socket option), but the problem is that it is platform-specific. So we set up a gateway that updates the NAT after migration and implemented socket recovery ourselves, referring to this. In the MVP implementation, on migration we first notify the gateway (in the graph below, a Mac running Docker with virtual IP 192.168.1.1), then do the socket migration; in the meantime the gateway sends keepalive ACKs to the server VM2. After migration, VM3 first starts, re-inits socket/bind/accept, and notifies the gateway to redirect all requests from VM2 to VM3.

Lock recovery. The order of lock recovery is very important, since some of the kernel state would be canceled out if we only recorded and replayed as in the graph above. We need to track the order in which locks were taken and who is blocked on what, in order to make the recovered order semantically correct.

How to do C++ reflection with pointers.

There can be many serialization structs, and the CRIU way of interpreting C structs is tedious: you have to dump and restore the Linux state by hand. Instead, I want to write flat structs that can be reconstructed, and use a concept/trait to require the dump and restore operations.

In CRIU, it looks like this:

// SPDX-License-Identifier: MIT

syntax = "proto2";

message bpfmap_data_entry {
	required uint32	map_id			= 1;
	required uint32	keys_bytes		= 2;	/* Bytes required to store keys */
	required uint32	values_bytes		= 3;	/* Bytes required to store values */
	required uint32 count			= 4;	/* Number of key-value pairs stored */
}

In the dump and restore process:

int do_collect_bpfmap_data(struct bpfmap_data_rst *r, ProtobufCMessage *msg, struct cr_img *img,
			   struct bpfmap_data_rst **bpf_hash_table)
{
	int ret;
	int table_index;

	r->bde = pb_msg(msg, BpfmapDataEntry);
	ret = bpfmap_data_read(img, r);
	if (ret < 0)
		return ret;

	table_index = r->bde->map_id & BPFMAP_DATA_HASH_MASK;
	r->next = bpf_hash_table[table_index];
	bpf_hash_table[table_index] = r;

	pr_info("Collected bpfmap data for %#x\n", r->bde->map_id);
	return 0;
}
int restore_bpfmap_data(int map_fd, uint32_t map_id, struct bpfmap_data_rst **bpf_hash_table)
{
	struct bpfmap_data_rst *map_data;
	BpfmapDataEntry *bde;
	void *keys = NULL;
	void *values = NULL;
	unsigned int count;
	LIBBPF_OPTS(bpf_map_batch_opts, opts);

	for (map_data = bpf_hash_table[map_id & BPFMAP_DATA_HASH_MASK]; map_data != NULL; map_data = map_data->next) {
		if (map_data->bde->map_id == map_id)
			break;
	}

	if (!map_data || map_data->bde->count == 0) {
		pr_info("No data for BPF map %#x\n", map_id);
		return 0;
	}

	bde = map_data->bde;
	count = bde->count;

	keys = mmap(NULL, bde->keys_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (keys == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map keys");
		goto err;
	}
	memcpy(keys, map_data->data, bde->keys_bytes);

	values = mmap(NULL, bde->values_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (values == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map values");
		goto err;
	}
	memcpy(values, map_data->data + bde->keys_bytes, bde->values_bytes);

	if (bpf_map_update_batch(map_fd, keys, values, &count, &opts)) {
		pr_perror("Can't load key-value pairs to BPF map");
		goto err;
	}
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return 0;

err:
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return -1;
}

The problem is that the above process does not leverage the C++ capability of using static compilation to automatically generate a visitor member function for every struct passed to protobuf. However, this is not perfect, because in C the state will contain types like this:

union {
    uint64 _make_it_8_byte_aligned_;
    WASMMemoryInstance memory_instances[1];
    uint8 bytes[1];
} global_table_data;

BlockAddr block_addr_cache[BLOCK_ADDR_CACHE_SIZE][BLOCK_ADDR_CONFLICT_SIZE];

/* Heap data base address */
DefPointer(uint8 *, heap_data);
/* Heap data end address */
DefPointer(uint8 *, heap_data_end);
/* The heap created */
DefPointer(void *, heap_handle);

It seems these cannot be automatically passed to struct_pack. So we still need a stub C++ struct (as with protobuf), but this time there is no need to write the protobuf definitions manually: all metadata generation happens at compile time, and the runtime only does the serialization. The above struct can be defined in C++ as below.

struct WAMRMemoryInstance {
    /* Module type */
    uint32 module_type;
    /* Shared memory flag */
    bool is_shared;
    /* Number bytes per page */
    uint32 num_bytes_per_page;
    /* Current page count */
    uint32 cur_page_count;
    /* Maximum page count */
    uint32 max_page_count;
    /*
     * Memory data begin address, Note:
     *   the app-heap might be inserted in to the linear memory,
     *   when memory is re-allocated, the heap data and memory data
     *   must be copied to new memory also
     */
    std::vector<uint8> memory_data;

    /* Heap data base address */
    std::vector<uint8> heap_data;
};
WAMRMemoryInstance memory_instances;

std::array<std::array<WAMRBlockAddr, BLOCK_ADDR_CACHE_SIZE>, BLOCK_ADDR_CONFLICT_SIZE> block_addr_cache;

/* Heap data base address */
std::vector<uint8> heap_data;

Then define the concept to illustrate the trait

template <typename T, typename K>
concept SerializerTrait = requires(T &t, K k) {
    { t->dump(k) } -> std::same_as<void>;
    { t->restore(k) } -> std::same_as<void>;
};

Implement it for every stub struct:

void dump(WASMMemoryInstance *env) {
    module_type = env->module_type;
    is_shared = env->is_shared;
    num_bytes_per_page = env->num_bytes_per_page;
    cur_page_count = env->cur_page_count;
    max_page_count = env->max_page_count;
    memory_data.resize(env->memory_data_size);
    memcpy(memory_data.data(), env->memory_data, env->memory_data_size);
    heap_data = std::vector<uint8>(env->heap_data, env->heap_data_end);
};
void restore(WASMMemoryInstance *env) {
    env->module_type = module_type;
    env->is_shared = is_shared;
    env->num_bytes_per_page = num_bytes_per_page;
    env->cur_page_count = cur_page_count;
    env->max_page_count = max_page_count;
    env->memory_data_size = memory_data.size();
    env->memory_data = (uint8 *)malloc(env->memory_data_size);
    memcpy(env->memory_data, memory_data.data(), env->memory_data_size);
    env->heap_data = (uint8 *)malloc(heap_data.size());
    memcpy(env->heap_data, heap_data.data(), heap_data.size());
    env->heap_data_end = env->heap_data + heap_data.size();
};

Reference

  1. https://github.com/alibaba/yalantinglibs/pull/122
  2. https://www.youtube.com/watch?v=myhB8ZlwOlE

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters @ISCA23

In the era of larger memory capacities the TLB is not big enough, and if the programmer wants to resolve page fragmentation, TLB invalidation is unavoidable. The paper designs a transparent migration layer to accelerate contiguous page access.

The unmovable region, e.g., pages pinned by NIC/storage/GPU, is fixed in place; this is not exactly true when we design new CXL devices. Other pages are marked movable with IOMMU support. Transparent, contiguous page movement relaxes TLB shootdowns. The mapping is migrate(ppn src, ppn dest). The difference of non-cacheable

Region resizing goes by the momentum of data movement to shrink or enlarge the unmovable region.

For a cycle simulator, results are regarded as reasonable as long as there is sufficient warm-up.

Reference

  1. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9882041