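I recently attended Hongzheng's talk on alpa ("all-parallel"). Our supercomputing team also happens to be working on Yuan (a Transformer-based model), and I have been helping debug a few issues in Ray and Hoplite. So I am revisiting this paper by Ion Stoica.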
CODIC: A Low-Cost Substrate for Enabling Custom In-DRAM Functionalities and Optimizations
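This is a work that Haocong, a senior labmate, contributed to; the other authors are from ETH Zurich, UIUC, and NUDT. It takes ideas from PIM research and uses a low-overhead hardware approach so that ordinary DRAM can be modified, monitored, and optimized.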
CODIC design
CODIC-sig generates signature values that depend on process variation by sensing and amplifying a DRAM cell that we set to the precharge voltage (Vdd/2). Sense amplifiers detect minor voltage differences above or below Vdd/2.
CODIC-sig-opt is based on the key observation that CODIC-sig can set the voltage of the DRAM capacitor to Vdd/2 very quickly.
CODIC-det generates deterministic values. The key idea is to drive the cell to a deterministic value by activating the two signals that drive the SA (sense_n and sense_p) with a delay between them. Depending on which of the two signals triggers first, the generated value is 0 or 1.
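To make the difference between the two primitives concrete, here is a minimal behavioral sketch in Python. It is a toy model, not a circuit simulation: the normalized `VDD`, the Gaussian offset standing in for process variation, and the function names are all assumptions for illustration. In particular, which signal order yields 0 and which yields 1 is an assumed mapping; the text above only says that the order decides the value.

```python
import random

VDD = 1.0  # normalized supply voltage (assumed value for illustration)


def sense_amplify(cell_voltage, offset=0.0):
    """Resolve a voltage to 0 or 1 by comparing it against Vdd/2.

    `offset` models the sense amplifier's process-variation mismatch.
    """
    return 1 if cell_voltage + offset > VDD / 2 else 0


def codic_sig(sa_offsets):
    """Set every cell to Vdd/2, then sense it.

    With the cell exactly at Vdd/2, the outcome depends only on the
    per-amplifier offset, i.e. on process variation.
    """
    return [sense_amplify(VDD / 2, off) for off in sa_offsets]


def codic_det(n_cells, sense_n_delay, sense_p_delay):
    """Drive all cells to one value chosen by which SA signal fires first.

    Assumption: sense_n firing first yields 0, sense_p firing first yields 1.
    """
    if sense_n_delay == sense_p_delay:
        raise ValueError("the two signals must be separated by a delay")
    value = 0 if sense_n_delay < sense_p_delay else 1
    return [value] * n_cells


# One fixed set of offsets stands in for one physical chip.
rng = random.Random(42)
offsets = [rng.gauss(0.0, 0.01) for _ in range(16)]
signature = codic_sig(offsets)
zeros = codic_det(4, sense_n_delay=1, sense_p_delay=2)
ones = codic_det(4, sense_n_delay=2, sense_p_delay=1)
```

In this model the same `offsets` always give the same `signature`, while `codic_det` ignores the offsets entirely, which is the distinction between the signature and deterministic operations.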
Substrate
The substrate enables fine-grained control of the internal DRAM circuit timings that govern the key components of the DRAM array (wordline, sense amplifier, precharge logic).
Applications include cold boot attack prevention and Physical Unclonable Functions (PUFs).
Physical Unclonable Function
A PUF is a hardware primitive that maps a unique input (the challenge) to a unique response; together they form a challenge-response pair (CR pair). Here, the address and size of a memory segment are the only parameters that define a challenge.
- It has a fast evaluation time due to its ability to control internal DRAM timing signals.
- It does not require any filtering mechanisms because it provides highly stable output values.
- It has state-of-the-art resilience to temperature changes.
- Latency is low: all DRAM cells are always precharged to Vdd/2 before generating a PUF response, independently of their original value.
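The challenge-response idea can be sketched as follows. This is a hypothetical, idealized model: `SimulatedDram`, its `chip_seed`, and the per-cell bias are stand-ins for a real chip's process variation, and the model has no noise, so responses are perfectly stable by construction.

```python
import random


class SimulatedDram:
    """Hypothetical stand-in for a DRAM chip.

    Each cell gets a fixed, chip-specific bias that plays the role of
    process variation; a real device would expose this through sensing.
    """

    def __init__(self, chip_seed, n_cells=4096):
        rng = random.Random(chip_seed)
        self.bias = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    def evaluate_puf(self, address, size):
        """Challenge = (address, size); response = one bit per cell."""
        if address < 0 or size <= 0 or address + size > len(self.bias):
            raise ValueError("segment out of range")
        segment = self.bias[address:address + size]
        return tuple(1 if b > 0 else 0 for b in segment)


# Enrollment: record a CR pair for the genuine chip.
genuine = SimulatedDram(chip_seed=1)
challenge = (128, 256)  # (address, size)
enrolled_response = genuine.evaluate_puf(*challenge)

# Authentication: the same chip reproduces the response, another chip does not.
same_chip_ok = genuine.evaluate_puf(*challenge) == enrolled_response
other_chip_ok = SimulatedDram(chip_seed=2).evaluate_puf(*challenge) == enrolled_response
```

A verifier would store `(challenge, enrolled_response)` at enrollment time and later accept a device only if it reproduces the response for that challenge.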
Cold boot attack prevention
The attacker first disables power to the computer containing the victim DRAM and then transfers the DRAM to another system that can read its content.
Previous work: (1) using an enclave to encrypt memory; (2) scrambling the data in the memory controller; (3) the Trusted Computing Group approach of resetting the DRAM content upon power-off.
The CODIC solution is self-destruction: on boot, the DRAM first runs the sig or det operation so that all stored data is overwritten through the row buffer before anything can read it.
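The self-destruction step can be sketched as a boot-time pass over every row. This is a minimal software model under stated assumptions: `BootProtectedDram` and its methods are hypothetical names, and the overwrite uses a single deterministic fill value as a det-style operation would; in the real design the overwrite happens inside the DRAM, not in software.

```python
class BootProtectedDram:
    """Hypothetical model of a DRAM that wipes itself on every boot."""

    def __init__(self, n_rows, row_size):
        self.rows = [[0] * row_size for _ in range(n_rows)]
        self.ready = False  # reads are refused until the wipe completes

    def write(self, row, col, bit):
        self.rows[row][col] = bit

    def read(self, row, col):
        if not self.ready:
            raise RuntimeError("DRAM not ready: self-destruction pending")
        return self.rows[row][col]

    def power_cycle(self):
        """Model power loss; cell contents survive briefly (data remanence)."""
        self.ready = False

    def boot(self, fill_value=0):
        """Overwrite every row with a deterministic value, then allow reads."""
        for row in self.rows:
            for col in range(len(row)):
                row[col] = fill_value
        self.ready = True


dram = BootProtectedDram(n_rows=4, row_size=8)
dram.boot()
dram.write(2, 3, 1)          # a secret bit written by the victim
dram.power_cycle()           # attacker cuts power and moves the module
dram.boot()                  # the new host's boot triggers the wipe
leaked = dram.read(2, 3)     # the secret is gone
```

The point of the `ready` flag is ordering: no read is served between power-up and the end of the wipe, so remanent data is never observable.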
Cerebros: Evading the RPC Tax in Datacenters
Main Story
RPC plays a crucial role in distributed systems. At Bilibili, for example, a huge number of microservices pay the RPC tax on every data transmission. SmartNIC, PIM, and programmable-switch solutions basically offload the computation to external compute resources. Current hardware optimizations mainly focus on the transport layer and rarely consider the whole execution process. Instruction supply is a further problem: all the control paths end up injected into the binary that runs on the main OS.
Cerebros is an accelerator that attaches to the NIC, reads incoming RPC messages, and hides its sends and receives by overlapping the operations. The CPU affinity logic in the OS does not fit this design.
Problems
- The CAM table that maps each RPC type to the address of the called function can become congested.
- More control paths mean more possibilities for software failure.
- The reserved memory region has to be preallocated with RPC metadata in the NIC cache, which wastes space compared with the current DMA buffer.
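The first concern above can be illustrated with a small sketch of a fixed-capacity lookup table. This is not Cerebros's actual design: `CamTable`, its capacity, and the CPU fallback are hypothetical, and a Python dict only stands in for the associative lookup a hardware CAM performs.

```python
class CamTable:
    """Hypothetical fixed-capacity table: RPC type -> handler."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def register(self, rpc_type, handler):
        """Install a handler; return False if the table is already full."""
        if rpc_type not in self.entries and len(self.entries) >= self.capacity:
            return False
        self.entries[rpc_type] = handler
        return True

    def dispatch(self, rpc_type, payload, fallback):
        """Run the offloaded handler on a hit, the fallback on a miss."""
        handler = self.entries.get(rpc_type, fallback)
        return handler(payload)


def cpu_fallback(payload):
    # Assumed slow path: the host CPU handles RPC types that did not fit.
    return ("cpu", payload)


cam = CamTable(capacity=2)
ok_get = cam.register("get", lambda p: ("accelerator", p))
ok_put = cam.register("put", lambda p: ("accelerator", p))
ok_scan = cam.register("scan", lambda p: ("accelerator", p))  # table is full

hit = cam.dispatch("get", b"key", cpu_fallback)
miss = cam.dispatch("scan", b"range", cpu_fallback)
```

Once the number of distinct RPC types exceeds the table capacity, every extra type takes the fallback path, which is one way the table can turn into a bottleneck.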
RDMA pitfalls
For a commercial baseline implementation such as Mellanox, the NIC can already bypass the kernel and run the network stack itself; the OS just needs to use its thread register to wait for the queue pairs (QPs) to finish.
However, RDMA across datacenters, or iWARP over the internet, does not compare well with TCP. The latency is considerably small compared with the switch protocol computation. Still, the atomicity of RDMA's data transmission primitives can be leveraged between private domains, e.g. for Shanghai-to-Beijing datacenter data transmission.
Comparison with MiniOOO and wBPF
The recent talk by
For a control path of the same length, building a general-purpose hardware observation model makes a better story than this RPC tax one.