[CSE290S] Erasure Code

This part is what Ethan researched when starting his RAID thesis at UCB. Erasure coding has been widely researched and deployed, e.g., in VAST Data and in Google systems like Carbink; it could be a core competitive advantage.

ECC vs. erasure codes: the difference lies in detecting errors versus locating and recovering them.

LDPC, Reed-Solomon, RAID-6, and mathematically stronger erasure codes differ in how they identify the locations of errors; this is still an area Azure and Google spend time researching.

Non-cryptographic vs. cryptographic: the latter is what you want against malicious manipulation of your bits without leaking information; the former only guards against accidental bit flips.
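
As a concrete toy contrast, a single-parity erasure code in the RAID-4/5 style fits in a few lines of Python: one XOR parity block recovers any single erased block whose *location* is known, which is exactly the erasure setting, unlike ECC, which must also locate the error. The function names here are illustrative, not from any library.

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equal-length byte blocks.
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def encode(data_blocks):
    # RAID-4/5 style: append a single XOR parity block to the stripe.
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    # Because the erasure's location is known, XOR-ing the survivors
    # rebuilds the lost block -- no error detection step is needed.
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

stripe = encode([b"AAAA", b"BBBB", b"CCCC"])
assert recover(stripe, 1) == b"BBBB"
```

Reed-Solomon generalizes this from one parity block to k, tolerating any k erasures at the cost of Galois-field arithmetic.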

uManyCore: ISCA23


Writing a microkernel for the village that can offload operators is definitely the right direction. How to define access to the memory node has long been discussed: either a semi-disaggregated MN like Yiying Zhang's LegoOS/Clio, or purely one-sided RDMA following prior distributed-systems thinking like FUSEE.

This work leverages hardware metrics in both the NIC and the memory node to accelerate the request queue through which the village accelerator accesses remote memory. Accesses can either be dependent memory operations or bulk loads from the memory pool.

However, this work does not provide a distributed-kernel paradigm that orchestrates everything a CXL accelerator can get from remote memory, nor does it show how hardware hints can guide the distributed kernel and the offloaded operator bytecode.

Reference

  1. https://github.com/dmemsys/FUSEE

CoroBase: Coroutine-Oriented Main-Memory Database Engine @VLDB23

Problem: The prefetcher is always draining the bandwidth, and if you

Compare with hardware coroutine

[1] brought up hiding memory latency by asynchronously issuing all desired memory requests and placing the results as objects in the L2 scratchpad memory. They also map this semantic onto async C++ coroutines. Unlike CoroBase, the backend operation is assisted by

Software Prefetching via Coroutines

It assigns a prefetch coroutine and switches to another coroutine for other computation. The setup still resembles C++20 coroutines.
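
The pattern can be sketched with Python generators standing in for C++20 coroutines (the names and the hash-table layout are made up for illustration): each lookup yields where it would issue a prefetch and suspend, and a round-robin scheduler overlaps one lookup's computation with another's simulated memory stall.

```python
def lookup(table, key):
    # Worker coroutine: each `yield` marks a pointer chase where a real
    # implementation would issue a software prefetch and suspend.
    node = table[key % len(table)]
    while node is not None:
        yield                      # prefetch of `node` is "in flight"
        if node["key"] == key:
            return node["value"]
        node = node["next"]
    return None

def run_interleaved(coros):
    # Round-robin scheduler: resuming each suspended lookup in turn
    # overlaps the simulated memory stalls of different lookups.
    results, pending = {}, dict(coros)
    while pending:
        for name, co in list(pending.items()):
            try:
                next(co)
            except StopIteration as stop:
                results[name] = stop.value
                del pending[name]
    return results

chain = {"key": 2, "value": "c", "next": None}
table = [{"key": 0, "value": "a", "next": chain},
         {"key": 1, "value": "b", "next": None}]
assert run_interleaved({"q0": lookup(table, 2),
                        "q1": lookup(table, 1)}) == {"q0": "c", "q1": "b"}
```

In C++20, `yield` becomes `co_await` on a prefetch awaitable, and the scheduler is the database engine's worker loop.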

Thoughts

We need better hardware co-design, and a better VLIW-like compiler that takes feedback from hardware hints to hide the latency more effectively.

Reference

  1. https://carrv.github.io/2022/papers/CARRV2022_paper_9_Wang.pdf

ASPLOS'23 Attendance

I submitted a workshop paper this time, but because of visa issues the first half of the conference was once again online for me. Honestly, I only care about CXL and co-design. I had hoped to meet Liu, Jovan, and Yan, and in the end I did, along with various other big names.

My flight to Seattle on the 25th could not be refunded. I had renewed my passport, so I flew to LA on Monday, March 27 to pick it up, was told it had not arrived, and then got an email saying it would arrive the next morning; in the meantime I met up with a college classmate and WDY. On March 28 at the LA consulate it was not there at first; I was notified at 10:37 and had the passport in hand at 10:45. My flight was at 12:35; I took an Uber, arrived at 11:28, had already checked in online, and cleared security within 15 minutes. Luckily, departing for Canada was quick: nobody checked legal documents before security, but my documents for entering Canada were checked when I collected my boarding pass. The flight was delayed half an hour. After landing in Vancouver, immigration was fully electronic; the officer asked what I was there for, I said "I'm a Ph.D. student attending a conference," and was waved through.

I arrived during the last session, picked up my badge, and stayed through the poster session; the conference ended at 3 pm the next day, so I effectively attended one full day. The evening award ceremony at the aquarium was nice, and at the social I met a woman doing her Ph.D. at CMU after a US undergrad. (Trans people were not discriminated against at all.) Canada feels like a paradise for the rich: people who can actually work only get low-level jobs there, which is not worthwhile. Economically, politically, and technologically, the country is thoroughly suppressed and drained by the US, and even its most cutting-edge work is not as strong as America's. Since there were too many ASPLOS papers, I only include the most important ones below.

Firesim

Mainly an introduction to FireSim; I asked when they would update the F1 VU9P support. The tutorial covered how to use FireSim and Chipyard on F1 for agile RISC-V development. UCB's CS 252 already uses Chipyard for its architecture assignments; simulating a TAGE predictor in BOOM is now routine.

Integrating a high performance instruction set simulator with FireSim to cosimulate operating system boots, by Tenstorrent

Mainly about agile development on FireSim.

LATTE

The workshop talks were all industry-level optimizations across RTL/hardware/software.

Exploring Performance of Cache-Aware Tiling Strategies in MLIR Infrastructure

Intel oneDNN's approach on MLIR.

PyAIE: A Python-based Programming Framework for Versal ACAP AI Engines

Versal ACAP HLS

A Scalable Formal Approach for Correctness-Assured Hardware Design

By Jin Yang; previously presented at AHA.

Yarch

Formal Characterization of Hardware Transmitters for Secure Software and Hardware Repair

I chatted with the author, a woman who went from Academia Sinica (Taiwan) to Stanford, collaborating with Christopher (who is coming to UCB). Roughly: model the hardware state, use symbolic execution to resolve branches, then check for timing differences. Done at the RTL level.

Detecting Microarchitectural Vulnerabilities via Fuzz Testing of White-box CPUs

Uses fuzzing to find Store Bypass vulnerabilities.

SMAD: Efficiently Defending Against Transient Execution Attacks

By a student of my assigned mentor, who is well known for GPU side channels.

Session 1B: Shared Memory/Mem Consistency

The session chair was an admit — that man from VMware who is the best at mixing and matching Intel extensions.

Cohort: Software-Oriented Acceleration for Heterogeneous SoCs

This paper defines its own L1/L2 caches and a crypto accelerator on FPGA. How to put them together would no longer be a problem with CXL.cache.

Probabilistic Concurrency Testing for Weak Memory Programs

A PCT framework that asserts against the SC specification to find bugs.

Hits bugs faster.



The heuristic for h is good enough for data-structure tests; the assertion tests look great. When I was at ShanghaiTech, people used the same tool on persistent memory.
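
For reference, the core of PCT — the randomized priority scheduler this line of work builds on — fits in a few lines. This is a from-scratch sketch of the published algorithm, not this paper's framework; `threads` and its step lists are illustrative.

```python
import random

def pct_schedule(threads, d, seed=0):
    """PCT sketch: always run the highest-priority unfinished thread;
    at d-1 randomly chosen global steps, demote the running thread to
    a low priority, forcing a context switch.  With n threads and k
    total steps, a depth-d bug is hit with probability >= 1/(n*k**(d-1))."""
    rng = random.Random(seed)
    n = len(threads)
    k = sum(len(ops) for ops in threads.values())   # total steps
    # Initial high priorities d..d+n-1, randomly permuted over threads.
    prio = dict(zip(threads, rng.sample(range(d, d + n), n)))
    # d-1 change points; demoted threads receive low priorities d-1..1.
    changes = dict(zip(sorted(rng.sample(range(1, k), d - 1)),
                       range(d - 1, 0, -1)))
    cursor = {t: 0 for t in threads}
    schedule = []
    for step in range(k):
        runnable = [t for t in threads if cursor[t] < len(threads[t])]
        t = max(runnable, key=lambda t: prio[t])
        schedule.append((t, threads[t][cursor[t]]))
        cursor[t] += 1
        if step in changes:
            prio[t] = changes[step]                 # priority change point
    return schedule
```

Each call produces one random interleaving in per-thread program order; repeating with fresh seeds is what yields the probabilistic coverage guarantee.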


MC Mutants: Evaluating and Improving Testing for Memory Consistency Specifications






Transforms disallowed memory behaviors into weak-memory labels.

A binary translator.



Session 2A: Compiler Techniques & Optimization

SPLENDID: Supporting Parallel LLVM-IR Enhanced Natural Decompilation for Interactive Development

Makes decompilation smoother.

Beyond Static Parallel Loops: Supporting Dynamic Task Parallelism on Manycore Architectures with Software-Managed Scratchpad Memories

Graphene: An IR for Optimized Tensor Computations on GPUs

Coyote: A Compiler for Vectorizing Encrypted Arithmetic Circuits

How would one drink this?

NNSmith: Generating Diverse and Valid Test Cases for Deep Learning Compilers

By Liu: write a few Z3 rules to generate a fuzzer — basically Csmith for NNs.

Session 3B: Accelerators A

Mapping Very Large Scale Spiking Neuron Network to Neuromorphic Hardware



1d locality is 3d locality

CRLA mapping like traditional DNN? NO.

HuffDuff: Stealing Pruned DNNs from Sparse Accelerators

Observed that the hardware's boundary effect can be exploited.


  1. Can snoop the weight updates.
  2. Dense data are more easily observed.


    Not transferable to other models, but one can tell whether a layer is a convolution by checking for the boundary effect.

An NVIDIA engineer asked: GEMM/FC layers can also be reverse-engineered.

OCCAMY: Elastically Sharing a SIMD Co-processor across Multiple CPU Cores



There are two kinds of SIMD sharing.



Add two hints — a length hint and a load-time predicate — and dispatch instructions in a ROB-like manner.





This could simply be analyzed with a roofline model.



Motivation: why unmodified Arm? But with compiler-inserted MSR and MRS instructions.

Session 4B: Memory Mgmt. / Near Data Processing

Session 4C: Tensor Computation

Keynote 3: Language Models - The Most Important Computational Challenge of Our Time

An NVIDIA bragging fest.

Session 7A (Deep Learning Systems)

Session 7B: Security

Dekker

Instrumentation on control flow + linker + runtime checks for CFI, CPI, and indirect pointer accesses.

Finding Unstable Code via Compiler-driven Differential Testing

Uses CompDiff-AFL++ to fuzz for undefined behavior.

Going Beyond the Limits of SFI: Flexible Hardware-Assisted In-Process Isolation with HFI

WebAssembly for SFI + hardware assistance

Session 7C: Virtualization

Exit-less, Isolated, and Shared Access for Virtual Machines


Requires gate & sub-VM functions.




VDom: Fast and Unlimited Virtual Domains on Multiple Architectures




Uses PTEs for isolation.

ghOSt descendant

The idea is to abstract the scheduler out of the kernel.

Session 8B: Accelerators C

TPP

Transparent cacheline granularity for TPP is another question.

Talked with Hasan. A Toronto professor asked how the JVM could do better page placement; Hasan said this is best done at the OS level.
A second person asked whether PEBS and CPU PMU sampling waste CPU cycles; TPP's sampling is comparatively lightweight.
Someone else from UBC asked how the traffic of dereferenced pages is accounted for.
Hasan said the page-prefetch mechanism guarantees this; a multi-hierarchy LRU is also possible, but access latency would rise.
Then Joseph asked: will a PMU on the device side help investigate page warmth?
Apparently Hasan will work on CXL hardware-software co-design for page-promotion performance after joining AMD:
the hint comes from the PMU, and the OS provides an interface — not madvise, but per-memory-range granularity that the device can use to decide.
That would be best.
The UBC folks are working on the same thing I am...
I need to push harder when I get back.
But now they will definitely cite my simulator 😂 — I told them to hurry up and cite it.

Session 9C: Hardware Security

Stopping here; I'll write this up after the ASPLOS deadline on the 21st.

About the MSHR of LLC misses with CXL.mem devices

In [1], the authors propose an Asynchronous Memory Unit (AMU) that requires co-design support from the CPU and the memory controller.

The overhead of hardware consistency checking is one reason the capacity of traditional load/store queues and MSHRs is limited. The AMU leaves the consistency issue to software. They argue that software-hardware cooperation is the right way to exploit memory parallelism over large latencies with an AMU.

As shown in the sensitivity tests in [2], the latency decomposition of DirectCXL shows a completely different result: no software or data-copy overhead. As the payload grows, the main component of DirectCXL's latency is the LLC (CPU cache). This is because the Miss Status Holding Registers (MSHRs) in the CPU LLC can only track 16 concurrent misses, so with a large payload many 64B memory requests stall on the CPU; for a 4KB payload this accounts for 67% of the total latency.
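
The effect can be back-of-the-enveloped with Little's law. The 16-MSHR and 64B figures come from the text above; the ~600 ns CXL.mem round-trip latency is an assumed illustrative number, not a measurement from the paper.

```python
import math

def mshr_limited_bw(mshrs, line_bytes, latency_ns):
    # Little's law: sustained bandwidth <= outstanding bytes / latency.
    return mshrs * line_bytes / latency_ns          # bytes/ns == GB/s

# 16 LLC MSHRs x 64B lines, assumed ~600 ns CXL.mem round trip.
bw = mshr_limited_bw(16, 64, 600)                   # ~1.7 GB/s ceiling

# A 4 KiB payload is 4096/64 = 64 line-sized misses; only 16 fit in
# the MSHRs, so the transfer serializes into ceil(64/16) = 4 waves of
# full round-trip latency -- the stall the decomposition attributes
# to the "LLC" component.
waves = math.ceil((4096 // 64) / 16)
assert waves == 4
```

More MSHRs, or an AMU-style asynchronous interface, raises the numerator of this bound instead of waiting on each wave.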

The conclusion is that the MSHRs inside the CPU are not sufficient for memory loads in the CXL.mem world, where both latency and bandwidth vary widely across serial PCIe 5.0 lanes. Also, compared with the RDMA SRQ approach in the controller, we think the PMU and coherence semantics still matter; following Huawei's approach and the SRQ approach, the future path to persistence will fall back to ld/st, but with a smarter memory controller that loads/stores the data asynchronously.

Reference

  1. Asynchronous memory access unit for general purpose processors
  2. Direct Access, High-Performance Memory Disaggregation with DirectCXL

Qemu CXL type 1 emulation proposal

Introduction to CXL Type 1

Guided Usecase

[1] and [2] are just QEMU's implementation of dm-crypt for LUKS: every device mapper over a physical block device requires a key and a crypto accelerator (or a software crypto implementation) to decrypt the data. We implement CXL Type 1 semantics for a crypto accelerator over the virtio-crypto-pci framework. We want to emulate a bad state or an unplug of the crypto device; the kernel will then get the ATS-bit DMAed data and resume with the CPU's software crypto implementation.

Device emulation

DMA and access memory

Create a CacheMemRegion that maps a specific SPP region one-to-one onto a set of CXL.cache cachelines on the CXL device.

Crypto operations

When the kernel calls crypto operations, we actually offload the encrypt/decrypt operations to the Type 1 accelerator through CXL.io, which tells the device to operate on an arbitrary SPP. The accelerator first takes ownership of that SPP in the CacheMemRegion and notifies the host. Eventually, the host observes the shared state of the SPP's cachelines.

Cache coherency emulation

struct D2HDataReq {
    D2H data header          : 24 bits;
    opcode                   : 4 bits;
    CXL.cache channel crediting;
};
struct CXLCache {
    data                     : 64 bytes;
    MESI state               : 4 bits;
    ATS                      : 64 bits;
    [D2HDataReq; n]          : remaining bits;
};

Metadata uses Intel SPP (Sub-Page Protection) write-protection support. We mark accesses to arbitrary cachelines in the SPP. We perform all transaction description and queuing in the 64 bytes of residue data in the SPP. Operations consumed from the queue take effect as MESI-bit changes and update the write protection for the sub-page and the root complex, or cause other side effects such as switch state changes.

Host and device requests are not scheduled in a single FIFO; the host's view of the data has higher priority, so H2D requests are consumed first, then D2H requests in FIFO order. All operations go through the interface operations on CXLCache.
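
The scheduling rule above can be sketched as a toy model. The opcode names and MESI transitions here are simplified placeholders, not the spec's actual D2H/H2D opcodes.

```python
from collections import deque

class CXLCacheLine:
    """Toy model of the proposal's CXLCache entry: a MESI state plus
    separate H2D/D2H request queues, where every pending H2D request
    is consumed before the D2H queue drains in FIFO order."""
    def __init__(self):
        self.state = "I"                       # one of M / E / S / I
        self.h2d, self.d2h = deque(), deque()

    def submit(self, channel, op):
        (self.h2d if channel == "H2D" else self.d2h).append(op)

    def drain(self):
        applied = []
        while self.h2d:                        # host-visible data first
            applied.append(self._apply(self.h2d.popleft()))
        while self.d2h:                        # then device requests, FIFO
            applied.append(self._apply(self.d2h.popleft()))
        return applied

    def _apply(self, op):
        # Simplified MESI effect per (made-up) opcode.
        transition = {"RdShared": "S", "RdOwn": "E",
                      "Write": "M", "Invalidate": "I"}
        self.state = transition.get(op, self.state)
        return (op, self.state)

line = CXLCacheLine()
line.submit("D2H", "RdOwn")                    # device asks for ownership
line.submit("H2D", "RdShared")                 # host read arrives later...
assert line.drain() == [("RdShared", "S"), ("RdOwn", "E")]  # ...but wins
```

A real emulation would carry the 64-byte data and the ATS bits alongside the state, but the two-queue priority is the part the text above specifies.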

Taking exclusiveness

We mark in the transport ATS bit whether exclusiveness can be taken, and copy the cacheline into another map; once an unplug is emulated, the cacheline is copied back so the kernel can resume the software crypto computation.

How to emulate the eviction

We have two proposals

  1. Use QEMU with PEBS to watch cache evictions of the physical address of an SPP.
  2. Use sub-page pinning (page_get_fast) to pin a physical address within the last-level cache. [7]

Reference

  1. https://www.os.ecc.u-tokyo.ac.jp/papers/2021-cloud-ozawa.pdf
  2. https://people.redhat.com/berrange/kvm-forum-2016/kvm-forum-2016-security.pdf
  3. https://yhbt.net/lore/all/[email protected]/T/
  4. https://privatewiki.opnfv.org/_media/dpacc/a_new_framework_of_cryptography_virtio_driver.pdf
  5. https://github.com/youcan64/spp_patched_qemu
  6. https://github.com/youcan64/spp_patched_linux
  7. https://people.kth.se/~farshin/documents/slice-aware-eurosys19.pdf