C++ coroutine segfault when assigning into a shared object

#include <coroutine>

struct Task {
    struct promise_type;
    using handle_type = std::coroutine_handle<promise_type>;

    struct promise_type {
        auto get_return_object() { 
            return Task{handle_type::from_promise(*this)}; 
        }
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() { }
        void unhandled_exception() {}
    };

    handle_type handle;
    
    Task(handle_type h) : handle(h) {}
    ~Task() {
        if (handle) handle.destroy();
    }
    Task(const Task&) = delete;
    Task& operator=(const Task&) = delete;
    Task(Task&& other) : handle(other.handle) { other.handle = nullptr; }
    Task& operator=(Task&& other) {
        if (this != &other) {
            if (handle) handle.destroy();
            handle = other.handle;
            other.handle = nullptr;
        }
        return *this;
    }

    bool done() const { return handle.done(); }
    void resume() { handle.resume(); }
};
Task process_queue_item(int i) {
    if (!atomicQueue[i].valid) {
        co_await std::suspend_always{};
    }
    atomicQueue[i].res = remote1(atomicQueue[i].i, atomicQueue[i].a, atomicQueue[i].b);
}

Why does the line atomicQueue[i].res = ... cause a segfault?

Coroutine lifetime issues: If the coroutine is resumed after the atomicQueue or its elements have been destroyed, this would lead to accessing invalid memory.
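
For concreteness, here is a hedged sketch of how that lifetime problem typically materializes. The driver functions and the Item/atomicQueue definitions below are assumptions for illustration, not the asker's real code; the point is that the suspended coroutine touches atomicQueue[i] only after it is resumed, so anything that destroys the coroutine frame or the queue elements in between turns that assignment into a use-after-free.

#include <coroutine>
#include <vector>

// Assumed stand-ins for the question's globals (illustration only).
struct Item { bool valid; int i, a, b; long res; };
std::vector<Item> atomicQueue(4);

std::coroutine_handle<> leaked;

void bad_driver() {
    {
        Task t = process_queue_item(0);   // valid == false, so it suspends at the co_await
        leaked = t.handle;                // raw handle escapes the owning Task
    }                                     // ~Task() calls handle.destroy(): frame is freed

    leaked.resume();                      // (1) resuming a destroyed frame is UB; the first
                                          //     crash site is the write to atomicQueue[i].res
}

void bad_driver2(Task& t) {               // t is suspended at the co_await
    atomicQueue = {};                     // (2) queue elements destroyed/reallocated meanwhile
    t.resume();                           //     atomicQueue[i] is now dangling when the body
                                          //     performs the assignment
}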

Solution

Task process_queue_item(int i) {
    if (i < 0 || i >= atomicQueue.size()) {
        // Handle index out of bounds
        co_return;
    }
    
    if (!atomicQueue[i].valid) {
        co_await std::suspend_always{};
    }
    
    // Additional check after resuming
    if (!atomicQueue[i].valid) {
        // Handle unexpected invalid state
        co_return;
    }
    
    try {
        atomicQueue[i].res = remote1(atomicQueue[i].i, atomicQueue[i].a, atomicQueue[i].b);
    } catch (const std::exception& e) {
        // Handle any exceptions from remote1
        // Log error, set error state, etc.
    }
}
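
For completeness, a minimal sketch of a driver that keeps both the Task objects and the queue alive across the suspension; the pending container and the resumption policy are assumptions for illustration, not part of the original question.

#include <vector>

void safe_driver() {
    std::vector<Task> pending;                  // Tasks own their coroutine frames until done
    pending.push_back(process_queue_item(0));   // may suspend while atomicQueue[0].valid is false

    // ... later, once the producer has filled atomicQueue[0] and set valid = true:
    for (auto& t : pending) {
        if (!t.done()) t.resume();              // resume only coroutines that are still suspended
    }
    pending.clear();                            // ~Task() destroys the completed frames
}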

MOAT: Towards Safe BPF Kernel Extension

MPK only supports up to 16 domains, while the number of BPF programs can far exceed that. MOAT uses a 2-layer isolation scheme to support an unlimited number of BPF programs; the first layer deploys MPK to set up lightweight isolation between the kernel and BPF programs. In addition, BPF helper function calls are not protected by this and can be attacked.

  1. They use the 2-layer isolation together with PCID. In the first layer, the BPF domain has its protection-key permissions lifted by the kernel so it can do its work; the only exceptions are the GDT and IDT, which are always write-disabled (see the pkey sketch after this list). In the second layer, when a malicious BPF program tries to access the memory regions of another BPF program, a page fault occurs and the malicious BPF program is immediately terminated. To avoid TLB flushes, each BPF program gets its own PCID, and the 4096-entry PCID space rarely overflows.

  2. Helper protection: (1) protect sensitive objects: critical kernel objects get finer-grained protection; (2) ensure the validity of parameters: Dynamic Parameter Auditing (DPA) leverages information obtained from the BPF verifier to dynamically check whether parameters are within their legitimate ranges.
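
A hedged userspace sketch of the MPK primitive that the first layer builds on, using the glibc pkey_* wrappers; MOAT applies protection keys inside the kernel, so this only conveys the flavor of the mechanism, not MOAT's implementation.

// Userspace flavor of MPK: tag pages with a protection key, then toggle access
// with a cheap WRPKRU (pkey_set) instead of changing page tables.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <cstdio>

int main() {
    int pkey = pkey_alloc(0, 0);              // one of the (at most 16) pkey domains
    if (pkey < 0) { perror("pkey_alloc"); return 1; }

    void* region = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pkey_mprotect(region, 4096, PROT_READ | PROT_WRITE, pkey);

    pkey_set(pkey, PKEY_DISABLE_ACCESS);      // seal the domain: no page-table change, no TLB flush
    // *(volatile int*)region = 1;            // would fault here while sealed

    pkey_set(pkey, 0);                        // lift the restriction to do the real work
    *static_cast<int*>(region) = 1;           // now allowed

    pkey_free(pkey);
    return 0;
}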

LibPreemptible

uintr (user interrupts) arrive with Sapphire Rapids (RISC-V introduced the N extension in 2019); compared with signals, they need no context switches, providing the lowest IPC latency. Using the APIC directly would raise safety concerns.

uintr usage

  1. general purpose IPC
  2. userspace scheduler(This paper)
  3. userspace network
  4. libevent & liburing

Syscall interaction (eventfd-like): the sender initiates and notifies the event; the receiver gets the fd and calls into the kernel, which issues a SENDUIPI back to the sender.
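
A hedged sketch of that handshake, based on GCC's uintr intrinsics and the syscall names from Intel's (non-mainline) uintr RFC patch series; the registration calls are left as comments because their numbers and exact signatures are assumptions.

// Hedged sketch (compile with g++ -muintr). Only the intrinsics below are standard
// GCC; the uintr_* syscalls are from the RFC patch series, not mainline Linux.
#include <x86gprintrin.h>

// Receiver-side handler: runs entirely in user space, no kernel context switch.
void __attribute__((target("uintr"), interrupt))
ui_handler(struct __uintr_frame* frame, unsigned long long vector) {
    // mark the event as delivered; keep this as short as a signal handler
}

int main() {
    // Receiver: register the handler and export an fd for one interrupt vector.
    //   uintr_register_handler(ui_handler, 0);
    //   int uifd = uintr_create_fd(/*vector=*/0, 0);
    // Sender: connect to that fd once (via the kernel), then notify without syscalls:
    //   int idx = uintr_register_sender(uifd, 0);
    //   _senduipi(idx);   // user IPI delivered straight to the receiver core

    _stui();               // receiver enables user-interrupt delivery
    return 0;
}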

They wrote a lightweight runtime for libpreemptible.

  1. enable lightweight and fine grained preemption
  2. Separation of mechanism and policy
  3. Scalability
  4. Compatibility

They maintain fine-grained (3 µs), dynamic timers for scheduling rather than relying on kernel timers, which greatly improves the 99th-percentile tail latency. Overall it is a fairly straightforward use of SPR's hardware feature.
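
A hedged sketch of the dynamic-timer idea (not LibPreemptible's actual runtime): a dedicated core polls per-task deadlines at microsecond granularity and flags overrunning tasks, where LibPreemptible would instead deliver a user interrupt.

#include <atomic>
#include <chrono>

struct TaskCtl {
    std::chrono::steady_clock::time_point deadline;  // set before the task starts running
    std::atomic<bool> preempt{false};                // polled (or replaced by a uintr) on overrun
};

// Runs on a dedicated timer core; polling keeps the effective resolution at a few
// microseconds, far finer than a kernel timer tick.
void timer_core(TaskCtl* tasks, int n, const std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        auto now = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i)
            if (now >= tasks[i].deadline)
                tasks[i].preempt.store(true, std::memory_order_relaxed);
    }
}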

OMB-CXL: A Micro-Benchmark Suite for Evaluating MPI Communication Utilizing Compute Express Link Memory Devices

This paper discusses how Message Passing Interface (MPI) libraries can utilize CXL memory devices for inter-node communication.

In the HH (host-to-host) case, CXL has lower latency than Ethernet for the small-message range, with a 9.5x speedup. As the message size increases, the trend reverses, with Ethernet achieving better latency than CXL because the CXL channel has lower bandwidth than Ethernet in the emulated system (two compute nodes, each with its own memory expander).
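
The HH comparison boils down to an OSU-style ping-pong latency test; the sketch below shows only that measurement pattern and is not the OMB-CXL code, which additionally places the communication buffers in CXL-attached memory.

// Minimal MPI ping-pong latency loop (run with 2 ranks), in the spirit of the OSU
// micro-benchmarks; reports one-way latency per message size.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    for (int size = 1; size <= (1 << 20); size *= 2) {
        std::vector<char> buf(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < iters; ++i) {
            if (rank == 0) {
                MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("%d bytes: %.2f us one-way\n", size,
                   (MPI_Wtime() - start) * 1e6 / (2.0 * iters));
    }
    MPI_Finalize();
    return 0;
}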

Zero-NIC @OSDI24

Zero-NIC proactively separates the control path from the datapath; it splits and merges packet headers while handling reordering, retransmission, and packet drops.

It will send the payload to arbitrary devices with zero-copy data transfer.

It maps memory into an object list called a Memory Segment (MS) and manages the packet table using a Memory Region (MR) table, relying on the IOMMU for address translation into the host application buffer. Since the control stack is co-located with the transport protocol, it invokes the transport directly without system calls; the speedup is similar to how io_uring avoids syscalls. For scalability, an MR can reside at any endpoint.
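
A purely conceptual sketch of that MS/MR bookkeeping; the field names and layout below are assumptions for illustration, not ZeroNIC's actual data structures.

#include <cstdint>
#include <vector>
#include <unordered_map>

struct MemorySegment {            // one registered chunk: IOVA + length, translated by the IOMMU
    uint64_t iova;                // address the NIC DMAs to/from
    uint64_t len;
};

struct MemoryRegion {             // application buffer exposed for zero-copy payload delivery
    uint32_t mr_key;              // carried in headers so payloads can be steered to it
    std::vector<MemorySegment> segments;
};

using MRTable = std::unordered_map<uint32_t, MemoryRegion>;

// Control path: header processing decides where the payload lands, then hands the
// (IOVA, length) target to the data path for zero-copy placement.
const MemorySegment* steer(const MRTable& table, uint32_t mr_key, uint64_t offset) {
    auto it = table.find(mr_key);
    if (it == table.end()) return nullptr;
    for (const auto& seg : it->second.segments) {
        if (offset < seg.len) return &seg;    // simplistic: first segment covering the offset
        offset -= seg.len;
    }
    return nullptr;
}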

It has slightly worse performance than RoCE, while bringing TCP compatibility and support for a higher MTU.