Address Generation Unit operation offloading.

CXL.mem does not require ATS, since the coherency traffic would be too heavy to maintain; Type 3 devices keep coherency only within the DCOH of the endpoint.

ATS information is recorded at the firmware level, like a PMU; it sounds like additional logic is needed to extract these metrics.

Reference

  1. https://en.wikipedia.org/wiki/Address_generation_unit
  2. https://indico.cern.ch/event/1106990/contributions/5041334/attachments/2533446/4359546/20221024_Suarez_ACAT_fin.pdf

WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection @Eurosys23

  1. WAFFLE is about cheap ways to detect expensive bugs, so it is concerned with the design trade-offs of concurrency-bug detection tools (active delay injection in particular) compared with TSVD.
  2. In breaking down the design space for active delay injection, the paper distills the essence of delay injection for the reader, which is useful.
  3. I'm interested in systems that exploit physical time to avoid more expensive analysis when tackling hard concurrency problems (e.g., Google's Spanner).

Comments

  1. The oracle of injecting delays does not find ABA bugs in data structures; we would need to record a timestamp, not necessarily for happens-before logic but for other oracles, to hunt those (see the sketch below).
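To make the gap concrete, here is a minimal sketch (my own toy Treiber-stack pop, not from the paper) of an ABA case that a pure delay-injection oracle misses, because the head pointer comes back to the same value inside the delay window:

  /* aba_sketch.c -- hypothetical example, not from the WAFFLE paper.
   * A Treiber-stack pop whose CAS succeeds even though the stack was
   * popped and re-pushed inside the delay window, i.e. an ABA bug that
   * a pure delay-injection oracle cannot observe without a version tag. */
  #include <stdatomic.h>
  #include <stddef.h>

  struct node { struct node *next; int value; };

  _Atomic(struct node *) head;          /* shared stack top */

  int pop(int *out) {
      struct node *old, *next;
      do {
          old = atomic_load(&head);
          if (old == NULL) return 0;
          next = old->next;             /* <-- delay injected here: another
                                           thread pops `old`, pops/pushes
                                           others, then pushes `old` back */
      } while (!atomic_compare_exchange_weak(&head, &old, next));
      /* The CAS sees the same pointer value, so it succeeds, but `next`
       * may now point to a freed or re-linked node: the ABA bug.        */
      *out = old->value;
      return 1;
  }

  /* A versioned head (pointer + counter updated together) or a recorded
   * timestamp per update is the extra oracle needed to catch this.      */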

Multi-Generation LRU

HeMem critiques access-bit-based sampling as slow, so it uses PEBS instead, while TPP leverages AutoNUMA and relies on the kernel's LRU-list approach to identify hot pages. Then I found MGLRU, which can additionally age pages with the better spatial locality of access-bit scanning (a plain rmap walk targets a single page and does not try to profit from discovering a young PTE).

It focuses both on memory-backed files, which give detailed results, and on more general cases such as anonymous pages accessed through the page table, for which they make assumptions with and without temporal locality.
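As a concrete handle on MGLRU's aging, a minimal sketch assuming a kernel with CONFIG_LRU_GEN and the debugfs interface at /sys/kernel/debug/lru_gen; the memcg/node/max_gen values below are placeholders you would read from the file first:

  /* mglru_age.c -- sketch, assuming CONFIG_LRU_GEN and debugfs mounted.
   * Writes an aging command to /sys/kernel/debug/lru_gen:
   *   "+ <memcg_id> <node_id> <max_gen> [can_swap [force_scan]]"
   * which asks MGLRU to create a new generation, i.e. walk page tables
   * and scan access bits for that memcg/node.                          */
  #include <stdio.h>

  int main(void) {
      const char *path = "/sys/kernel/debug/lru_gen";
      FILE *f = fopen(path, "w");
      if (!f) { perror(path); return 1; }

      /* memcg 0, node 0, current max_gen 3 (placeholders), can_swap=1,
       * force_scan=1 to rescan even recently aged generations.         */
      if (fprintf(f, "+ 0 0 3 1 1\n") < 0) { perror("write"); return 1; }

      fclose(f);
      return 0;
  }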

Overhead Evaluation through eBPF

Does it match the LRU performance?

According to the DynamoRIO results, getting within 5% of perfect LRU locally gets you to 95% of the performance.
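A rough sketch of how I would measure this with eBPF (BPF side only, loaded with libbpf; attaching to shrink_lruvec is an assumption, and MGLRU's own entry points are static functions so they may not be probeable on every build):

  /* mglru_lat.bpf.c -- sketch of the BPF side only; load with libbpf/bpftool.
   * Times each invocation of shrink_lruvec (assumed visible in kallsyms)
   * to estimate reclaim / LRU maintenance overhead per call.             */
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char LICENSE[] SEC("license") = "GPL";

  struct {
      __uint(type, BPF_MAP_TYPE_HASH);
      __uint(max_entries, 10240);
      __type(key, u32);          /* tid */
      __type(value, u64);        /* entry timestamp, ns */
  } start SEC(".maps");

  SEC("kprobe/shrink_lruvec")
  int BPF_KPROBE(on_entry)
  {
      u32 tid = (u32)bpf_get_current_pid_tgid();
      u64 ts = bpf_ktime_get_ns();
      bpf_map_update_elem(&start, &tid, &ts, BPF_ANY);
      return 0;
  }

  SEC("kretprobe/shrink_lruvec")
  int BPF_KRETPROBE(on_exit)
  {
      u32 tid = (u32)bpf_get_current_pid_tgid();
      u64 *ts = bpf_map_lookup_elem(&start, &tid);
      if (!ts)
          return 0;
      u64 delta = bpf_ktime_get_ns() - *ts;
      bpf_map_delete_elem(&start, &tid);
      bpf_printk("shrink_lruvec took %llu ns", delta);
      return 0;
  }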

[CSE290S] Secure filesystems

Efficient reconstruction techniques for disaster recovery in secret-split

It uses secret splitting. The assumption is that shards guarded by different authentication methods are safe without encryption, protected by matching alone, and the guard identifies attacks once the attacker's activity grows large enough.

Approximate pointers: use the whole set of shards for recovery to thwart the adversary.

64 - 64 - 64
   \ 64 \ 64
   ...    ...

There are different approaches to the secret split: one diverges shares from a list of keys, the other uses a 128-bit field that leaks no information about the key.
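A minimal sketch of the second flavor as I read it (plain XOR splitting of a 128-bit secret, my own illustration rather than the paper's construction): any proper subset of the shares is uniformly random and leaks nothing about the key.

  /* xor_split.c -- illustration only, not the paper's construction.
   * Splits a 128-bit secret into N shares: N-1 are random, the last is
   * the XOR of the secret with all of them.  Any proper subset of the
   * shares is uniformly random, so it leaks nothing about the key.     */
  #include <stdint.h>
  #include <stdlib.h>
  #include <string.h>

  #define N 3                       /* number of shares */

  static void xor128(uint8_t out[16], const uint8_t in[16]) {
      for (int i = 0; i < 16; i++) out[i] ^= in[i];
  }

  void split(const uint8_t secret[16], uint8_t shares[N][16]) {
      uint8_t acc[16];
      memcpy(acc, secret, 16);
      for (int s = 0; s < N - 1; s++) {
          for (int i = 0; i < 16; i++)
              shares[s][i] = (uint8_t)rand();   /* use a real CSPRNG in practice */
          xor128(acc, shares[s]);
      }
      memcpy(shares[N - 1], acc, 16);           /* last share closes the XOR */
  }

  void reconstruct(uint8_t secret[16], uint8_t shares[N][16]) {
      memset(secret, 0, 16);
      for (int s = 0; s < N; s++) xor128(secret, shares[s]);
  }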

Plutus: Scalable, secure file sharing on untrusted storage

The experiments in the paper only reach a few KB per second. Sadly, that era was 20 years ago; by today's standards the software overhead of this kind of access-control system is already large enough.

Lethe: Secure Deletion by Addition

Designed secure copy-on-write over ZFS using a keyed hash forest.

TMTS Talk @CXL SIG

Scheduling asynchronous page migration based on the access pattern.

Hardware support is crucial

  • Page access scans alone have high latency
  • PMU address sampling drastically reduces promotion latency (access to promotion time)
  • Earlier promotion improves performance

Per application policy is crucial

  • The ufard daemon they run in userspace follows per-process control flow
  • Each application's page-migration policy should be kept separate; conflicting policies are handled using PGO

Thoughts

  1. PGO rather than online PEBS? PEBS's overhead is huge, even if you sample in a separate thread or set the sample period to 10k or 2M (see the sketch after this list).
  2. The TLB shootdown should be hidden by CXL.cache atomically exchanging the cacheline, with no need to update the page table. The page-table reuse distance should also be considered, since either way of updating the page table (1. mark the page read-only and migrate, or 2. atomic exchange) requires timing the next use of that page.
  3. Would letting eBPF control all the policies be a better choice, offloading policy to the RC/EP?
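For reference, a sketch of the PEBS address sampling in question, via perf_event_open with precise_ip and PERF_SAMPLE_ADDR; the raw event 0x20d1 (MEM_LOAD_RETIRED.L3_MISS) is an Intel-specific assumption, and sample_period is exactly the overhead knob thought 1 above complains about:

  /* pebs_sample.c -- sketch of PEBS address sampling via perf_event_open.
   * The raw event 0x20d1 (MEM_LOAD_RETIRED.L3_MISS, umask 0x20, event 0xd1)
   * is Intel-specific and assumed; sample_period is the overhead knob.   */
  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                              int cpu, int group_fd, unsigned long flags) {
      return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int open_pebs_sampler(pid_t pid) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size          = sizeof(attr);
      attr.type          = PERF_TYPE_RAW;
      attr.config        = 0x20d1;            /* MEM_LOAD_RETIRED.L3_MISS */
      attr.sample_period = 10007;             /* ~10k: more precise, more overhead */
      attr.sample_type   = PERF_SAMPLE_ADDR | PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
      attr.precise_ip    = 2;                 /* request PEBS */
      attr.exclude_kernel = 1;

      int fd = perf_event_open(&attr, pid, -1, -1, 0);
      if (fd < 0) perror("perf_event_open");
      /* The caller then mmap()s the fd and reads PERF_RECORD_SAMPLEs, each
       * carrying the faulting data address for hot-page detection.        */
      return fd;
  }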

Reference

  1. contiguitas_isca23.pdf
  2. https://dl.acm.org/doi/pdf/10.1145/3503222.3507745
  3. https://web.eecs.umich.edu/~takh/papers/jamilan-apt-get-eurosys-2022.pdf

LLVM JIT/AOT checkpoint and restore on a new architecture.

We focus on the classic interpreter for the functionality PoC and on AOT for the performance PoC. In the above picture, we think the LLVM view before machine-dependent optimization, together with the wasm view, is cross-platform. For the latter, we need to find stable points such as function calls, branches, and jumps, where instructions are neither architecturally reordered nor semantically hazardous. For turning the view back to wasm, we originally thought DWARF would help, but the WAMR team did not implement the mapping between the wasm and native stacks. However, they implemented an AOT GC that periodically commits the native stack to the wasm view at those stable points.

Record and replay files, sockets, IPC, and locks. In the VM there are two implementations of WASI: one is POSIX-based but by definition only uses a subset of POSIX, and the other is uvwasi, a message-passing library that also has a Windows implementation. Because we don't really know which implementation is the target, we only record the operation log for files, sockets, IPC, and locks.
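Since we don't know whether the POSIX-based wasi-libc or uvwasi sits underneath, the log sits at the WASI-call level; a rough sketch of what one record could look like (field layout is illustrative, not WAMR's):

  /* wasi_log.h -- sketch of a backend-agnostic operation log record.
   * Field layout is illustrative, not WAMR's actual format.            */
  #include <stdint.h>

  enum wasi_op_kind {
      OP_FILE_OPEN, OP_FILE_RW, OP_SOCKET, OP_IPC, OP_LOCK
  };

  struct wasi_op_record {
      uint64_t seq;              /* global order of the call            */
      uint32_t tid;              /* wasm thread issuing it              */
      enum wasi_op_kind kind;
      int32_t  wasi_errno;       /* result to replay deterministically  */
      uint32_t arg_len;
      uint8_t  args[];           /* serialized arguments (path, flags, ...) */
  };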

Specifically for the open syscall: it does not call into WAMR's libwasi; it is merely a chain of function calls, fopen -> __wasilibc_open_nomode -> find_relpath -> __wasilibc_nocwd_openat_nomode -> __fdopen. So we simply instrument fopen and capture the fdopen input to get the {fd, path, option} three-element tuple. This also needs instrumentation in AOT mode.
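A minimal sketch of that instrumentation as a link-time wrapper (ld/wasm-ld --wrap=fopen; log_open is a hypothetical helper of ours):

  /* fopen_wrap.c -- sketch: capture the {fd, path, mode} tuple by wrapping
   * fopen at link time (ld/wasm-ld --wrap=fopen).  log_open() is a
   * hypothetical helper that appends to the record/replay log.          */
  #include <stdio.h>

  extern FILE *__real_fopen(const char *path, const char *mode);
  void log_open(int fd, const char *path, const char *mode);   /* ours */

  FILE *__wrap_fopen(const char *path, const char *mode)
  {
      FILE *f = __real_fopen(path, mode);
      if (f != NULL)
          log_open(fileno(f), path, mode);   /* the {fd, path, option} tuple */
      return f;
  }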

Snapshotting the WebAssembly view of memory and frames. For the interpreter, we defined a C++ struct to make snapshotting the memory and frames easier and put them into those snapshot structs. The interpreter frame is laid out linearly for every function call. For JIT/AOT we need to rely on the calling convention of the source machine and symbolically reconstruct the call frame from the wasm stack on recovery. For big/little endian, you just convert if they differ; the JIT/AOT phase should take care of the memory.
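A rough shape of the interpreter-side snapshot (names and fields are illustrative; the real struct differs in detail):

  /* snapshot.h -- illustrative shape of the interpreter snapshot; the
   * actual struct in the implementation differs in detail.             */
  #include <stddef.h>
  #include <stdint.h>

  struct frame_snapshot {
      uint32_t func_idx;         /* wasm function of this frame          */
      uint32_t ip_offset;        /* bytecode offset to resume at         */
      uint32_t sp;               /* operand-stack depth                  */
      uint64_t *locals;          /* locals + operand stack, wasm view    */
      size_t    locals_len;
  };

  struct vm_snapshot {
      uint8_t  *linear_memory;   /* copy of wasm linear memory           */
      size_t    memory_len;
      struct frame_snapshot *frames;   /* call chain, oldest first       */
      size_t    frame_count;
      int       little_endian;   /* source endianness, converted on restore */
  };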

Re-architecting AOT for snapshotting. The current implementation of the native stack frame is incremental, which is not necessarily good for recovery. We should do something like a FastJIT, in which all function calls and basic blocks are just jmps with auxiliary stack operations (convenient for committing registers?). Then, on every function call, we need to commit the CPU state to the wasm stack, which relies on LLVM infrastructure for generating (1) labels that will not be optimized out on either side (a research problem: the frequently accessed points), and (2) a mapping from registers and the native stack to the wasm stack at stable points (or we need DWARF and stronger information if we only get one shot at checkpoint time). On recovery, we can just jmp to the label and resume.

ReJIT or ReAOT on the target machine. To recover, we first ReJIT or ReAOT the wasm binary and perform a translation that only does the function-call-specific operations of the generated native code on the target machine. Then the native call frame is set up, and we just set the native PC to the entry of the last called function.

File descriptor recovery. We call the target machine's WASI implementation to recover the file descriptors, and we need to make sure the order is the same.
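A sketch of why the order matters and how to pin the numbers: replay the logged opens in order and dup2 each fresh descriptor back to its recorded number (the open_record layout is hypothetical):

  /* fd_recover.c -- sketch: replay logged opens in order and pin each
   * descriptor to its recorded number so wasm code sees the same fds.  */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  struct open_record { int fd; const char *path; int flags; };

  int recover_fds(const struct open_record *log, int n)
  {
      for (int i = 0; i < n; i++) {            /* same order as recorded */
          int fd = open(log[i].path, log[i].flags);
          if (fd < 0) { perror(log[i].path); return -1; }
          if (fd != log[i].fd) {               /* pin to the old number  */
              if (dup2(fd, log[i].fd) < 0) { perror("dup2"); return -1; }
              close(fd);
          }
      }
      return 0;
  }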

Socket recovery. For the same IP, we could just follow CRIU's implementation, which uses the kernel's TCP_REPAIR socket option, but that would be platform-specific, so we set up a gateway for updating the NAT after migration and implemented socket recovery ourselves, referring to CRIU. In the MVP implementation, on migration we should first notify the gateway in the graph below, which is a Mac running Docker with virtual IP 192.168.1.1; then do the socket migration while the gateway sends keepalive ACKs to the server VM2; after migration, VM3 starts first, reinitializes socket/bind/accept, and notifies the gateway to redirect all requests from VM2 to VM3.
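For reference, the kernel mechanism CRIU relies on is the TCP_REPAIR socket option; a minimal Linux-only sketch of restoring the sequence numbers with it (needs CAP_NET_ADMIN, which is exactly the platform-specific baggage the gateway path avoids):

  /* tcp_repair_restore.c -- sketch of the CRIU-style, Linux-only path:
   * put a fresh socket into repair mode and restore the recorded send/recv
   * sequence numbers before connect().  Needs CAP_NET_ADMIN.             */
  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/socket.h>

  int restore_tcp(const struct sockaddr_in *peer, uint32_t snd_seq, uint32_t rcv_seq)
  {
      int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
      int on = 1, q;

      setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));   /* enter repair mode */

      q = TCP_SEND_QUEUE;
      setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
      setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &snd_seq, sizeof(snd_seq));

      q = TCP_RECV_QUEUE;
      setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
      setsockopt(fd, IPPROTO_TCP, TCP_QUEUE_SEQ, &rcv_seq, sizeof(rcv_seq));

      /* connect() in repair mode does not send SYN; it just binds the 4-tuple. */
      if (connect(fd, (const struct sockaddr *)peer, sizeof(*peer)) < 0) {
          perror("connect(repair)");
          return -1;
      }

      on = 0;
      setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));   /* leave repair mode */
      return fd;
  }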

Lock recovery. The order of recovering locks is very important, since some kernel state would be cancelled out if we only record and replay as in the graph above. We need to track the order in which locks are taken and who is blocked on which lock and why, so that we can reconstruct the order in a semantically correct way.
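A tiny sketch of the bookkeeping this implies (globally ordered lock events; structures are illustrative):

  /* lock_replay.c -- sketch: record lock events with a global sequence so
   * restore can re-acquire held locks in order before waiters run.       */
  #include <stdatomic.h>
  #include <stdint.h>

  enum lock_op { LOCK_ACQUIRED, LOCK_BLOCKED, LOCK_RELEASED };

  struct lock_event {
      uint64_t seq;            /* global order                         */
      uint32_t tid;            /* wasm thread id                       */
      uint32_t lock_id;        /* index into the VM's lock table       */
      enum lock_op op;
  };

  static _Atomic uint64_t next_seq;

  /* Called from the instrumented lock wrappers while recording. */
  static void log_lock_event(struct lock_event *ev, uint32_t tid,
                             uint32_t lock_id, enum lock_op op)
  {
      ev->seq = atomic_fetch_add(&next_seq, 1);
      ev->tid = tid;
      ev->lock_id = lock_id;
      ev->op = op;
  }

  /* On restore: walk the log in seq order; locks whose last event is
   * LOCK_ACQUIRED are re-taken on behalf of their owner thread, and only
   * then are threads whose last event is LOCK_BLOCKED allowed to retry,
   * so the blocked/held relationships come back in the right order.     */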