How to do c++ reflection with pointer.

The serialization struct can be a lot, but like CRIU ways of interpreting the c struct is tedious; you need to dump and restore the Linux state, which seems too tedious; rather, I need to write a flat struct that can be reconstructed and use concept trait to require the dump and restore things.

In CRIU, they performs like this:

// SPDX-License-Identifier: MIT

syntax = "proto2";

message bpfmap_data_entry {
	required uint32	map_id			= 1;
	required uint32	keys_bytes		= 2;	/* Bytes required to store keys */
	required uint32	values_bytes		= 3;	/* Bytes required to store values */
	required uint32 count			= 4;	/* Number of key-value pairs stored */
}

In dump and restore process:

int do_collect_bpfmap_data(struct bpfmap_data_rst *r, ProtobufCMessage *msg, struct cr_img *img,
			   struct bpfmap_data_rst **bpf_hash_table)
{
	int ret;
	int table_index;

	r->bde = pb_msg(msg, BpfmapDataEntry);
	ret = bpfmap_data_read(img, r);
	if (ret < 0)
		return ret;

	table_index = r->bde->map_id & BPFMAP_DATA_HASH_MASK;
	r->next = bpf_hash_table[table_index];
	bpf_hash_table[table_index] = r;

	pr_info("Collected bpfmap data for %#x\n", r->bde->map_id);
	return 0;
}
int restore_bpfmap_data(int map_fd, uint32_t map_id, struct bpfmap_data_rst **bpf_hash_table)
{
	struct bpfmap_data_rst *map_data;
	BpfmapDataEntry *bde;
	void *keys = NULL;
	void *values = NULL;
	unsigned int count;
	LIBBPF_OPTS(bpf_map_batch_opts, opts);

	for (map_data = bpf_hash_table[map_id & BPFMAP_DATA_HASH_MASK]; map_data != NULL; map_data = map_data->next) {
		if (map_data->bde->map_id == map_id)
			break;
	}

	if (!map_data || map_data->bde->count == 0) {
		pr_info("No data for BPF map %#x\n", map_id);
		return 0;
	}

	bde = map_data->bde;
	count = bde->count;

	keys = mmap(NULL, bde->keys_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (keys == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map keys");
		goto err;
	}
	memcpy(keys, map_data->data, bde->keys_bytes);

	values = mmap(NULL, bde->values_bytes, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
	if (values == MAP_FAILED) {
		pr_perror("Can't map memory for BPF map values");
		goto err;
	}
	memcpy(values, map_data->data + bde->keys_bytes, bde->values_bytes);

	if (bpf_map_update_batch(map_fd, keys, values, &count, &opts)) {
		pr_perror("Can't load key-value pairs to BPF map");
		goto err;
	}
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return 0;

err:
	munmap(keys, bde->keys_bytes);
	munmap(values, bde->values_bytes);
	return -1;
}

The problem is the above process does not leverage the cpp feature that uses the static compilation capability to generate visitor member function for every struct that passes to protobuf automatically. However, this is not perfect because in C, stack will have some type like this

union {
    uint64 _make_it_8_byte_aligned_;
    WASMMemoryInstance memory_instances[1];
    uint8 bytes[1];
} global_table_data;

BlockAddr block_addr_cache[BLOCK_ADDR_CACHE_SIZE][BLOCK_ADDR_CONFLICT_SIZE];

/* Heap data base address */
DefPointer(uint8 *, heap_data);
/* Heap data end address */
DefPointer(uint8 *, heap_data_end);
/* The heap created */
DefPointer(void *, heap_handle);

It sounds like can not automatically pass to struct_pack. So still need a stub cpp struct for protobuf, but this time, no need to manually write protobuf, every metadata generation is compile time, and runtime only do the serialization. The above struct can be defined below in c++.

struct WAMRMemoryInstance {
    /* Module type */
    uint32 module_type;
    /* Shared memory flag */
    bool is_shared;
    /* Number bytes per page */
    uint32 num_bytes_per_page;
    /* Current page count */
    uint32 cur_page_count;
    /* Maximum page count */
    uint32 max_page_count;
    /*
     * Memory data begin address, Note:
     *   the app-heap might be inserted in to the linear memory,
     *   when memory is re-allocated, the heap data and memory data
     *   must be copied to new memory also
     */
    std::vector<uint8> memory_data;

    /* Heap data base address */
    std::vector<uint8> heap_data;
}
WAMRMemoryInstance memory_instances;

std::array<std::array<WAMRBlockAddr, BLOCK_ADDR_CACHE_SIZE>, BLOCK_ADDR_CONFLICT_SIZE> block_addr_cache;

/* Heap data base address */
std::vector<uint8> heap_data;

Then define the concept to illustrate the trait

template <typename T, typename K>
concept SerializerTrait = requires(T &t, K k) {
    { t->dump(k) } -> std::same_as<void>;
    { t->restore(k) } -> std::same_as<void>;
};

impl for every stub struct.

void dump(WASMMemoryInstance *env) {
    module_type = env->module_type;
    is_shared = env->is_shared;
    num_bytes_per_page = env->num_bytes_per_page;
    cur_page_count = env->cur_page_count;
    max_page_count = env->max_page_count;
    memory_data.resize(env->memory_data_size);
    memcpy(memory_data.data(), env->memory_data, env->memory_data_size);
    heap_data = std::vector<uint8>(env->heap_data, env->heap_data_end);
};
void restore(WASMMemoryInstance *env) {
    env->module_type = module_type;
    env->is_shared = is_shared;
    env->num_bytes_per_page = num_bytes_per_page;
    env->cur_page_count = cur_page_count;
    env->max_page_count = max_page_count;
    env->memory_data_size = memory_data.size();
    env->memory_data = (uint8 *)malloc(env->memory_data_size);
    memcpy(env->memory_data, memory_data.data(), env->memory_data_size);
    env->heap_data = (uint8 *)malloc(heap_data.size());
    memcpy(env->heap_data, heap_data.data(), heap_data.size());
    env->heap_data_end = env->heap_data + heap_data.size();
};

Reference

  1. https://github.com/alibaba/yalantinglibs/pull/122
  2. https://www.youtube.com/watch?v=myhB8ZlwOlE

Contiguitas: The Pursuit of Physical Memory Contiguity in Datacenters @ISCA23

TLB in larger memory capacity era is not big enough and if the programmer want to resolve page fragmentation need to invalidate TLB for sure. The paper designed a transparent migration layer to accelerate contiguous page access.

The Unmovable Region like NIC/Storage/GPU is pinned to pages, this is not exactly true when we design new CXL devices. Other pages will be marked movable with IOMMU support. The transparent page movement contiguous will relax the TLB shootdown. The mapping will be migrate(ppn src,ppn dest). The difference of Non-cachable

Region resize will go by the momentum of data movement to shrink or enlarge the unmovable region.

For a cycle simulator, as long as the warm-up will regard as resonable.

Reference

  1. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9882041

Rethinking the design of CXL Fault Tolerant Distributed System

  1. Ray-like software replication will introduce 20% overhead compared with MPI for distributed PyTorch training. Hardware-assisted replicas and erasure code should be implemented in the remote memory. The distributed kernel should be aware of the data's presence, how much time it takes for reconstruction, and the reliability rate for deciding where to put the data.
  2. Page table way of memory mapping seems tedious for local hardware resources of MSHR/TLB/ROB for hiding latency. The bound check could happen in the remote CXL pool ACL part. At the same time, the language runtime support should also comply with the check.

[CSE290S] Erasure Code

This part is what Ethan researched on starting his RAID thesis in UCB. I would say VAST Data and deployed systems like carbink in Google somehow have been widely researched and deployed. Could be Core Competitiveness.

ECC vs erasure code. detection of errors and locating the errors.

LPC or Reed Solomon or RAID6 or better erasure code that mathematically performs better, for different locations of the identification of the error and still what Azure and google are spending time researching on.

Non-crypto/ crypto is what you want to fight against malicious manipulation of your bit without leaking. The former is just anti bit-flips.

uManyCore: ISCA23


Writing a microkernel for the village that can offload operators is definitely correct. The problem of how to define access to the memory node has long been discussed. Either a so-called semi-disaggregated MN like Yiying Zhang's Lego OS/ Cilo or purely one-sided RDMA based on prior distributed thoughts like FUSEE.

This work is leveraging hardware metrics in both the NIC and Mem to accelerate the request queue of the village accelerator access remote memory. This can either be memory dependent or a dry bulk load from the memory pool.

However, this work does not provide a distributed kernel paradigm that orchestrates all the CXL accelerator can get from remote memory and how hardware hint that can guide your distributed kernel and operator bytecode has been offloaded.

Reference

  1. https://github.com/dmemsys/FUSEE