On the Transition from Learning to Research, and Research as Exploration and Livelihood

Learning is a broad process of acquiring knowledge. When it serves the gaokao or a university final exam, its purpose is for students to master whatever the teacher wants mastered, the basic theory and skills the major requires, but the teacher takes no responsibility for the validity or timeliness of that knowledge. Once the external constraints are gone, when one no longer needs a GPA to win honors or move somewhere better and the minimum bar for graduating is merely passing every course, the world ends up producing a great many wasted graduates. Higher education, as I see it, should cultivate independent thinking on top of basic survival skills: you need to discover what you are good at within the tide of social progress and learn how to advance together with the world. For computer science, practice is the best way to acquire this ability, and cultivating practical skill matters more than learning theory, because in engineering everything has exceptions; there is no purely speculative way to work out why some company or independent developer wrote the code the way they did, and trying traps you in a loop of perfectionism. Engineering study is incremental: practical ability grows as your horizons widen, until the day you can build the locomotive yourself. PhD-stage learning comes from research and returns to research. We need not attend to knowledge irrelevant to our own topic, which only eats into research time; broadening one's horizons should instead come from discussions with other researchers, from competitions and reproducing others' artifacts, and from internships, not from merely searching books for answers to one's topic. For topics you want to pursue later, you can consult books and industry best practice in advance and let your subconscious chew on the parts industry has not thought of; that is where your value shows. The transition from learning to research is the shift of one's drive from external to internal: asking what industry has not thought of, what stays locked up in a few head companies instead of delivering value to the whole industry, and how to serve industry better.

Talking with people who have already lost interest in research: most of them, over endless attempts, lost the search for the value they could provide as a PhD and gave up exploring, treating the PhD as a way to sit out the recession, scraping by on daily necessities and gaming funding for a living. Even people from the Big Four feel this way; even they believe the world has already been carved up by companies with economies of scale. So what is exploration in academia? Is it bad money driving out good under the so-called peer-review rules, or is it the good money, the ideas a PhD and their advisor conceive that can change how industry thinks, driving out the bad? Spark is a fine example: Matei saw that Facebook used Hadoop but was frustrated by its slow OLAP performance, so he built an in-memory Hadoop at Berkeley. I marvel at Berkeley's brute-force miracle: under peer review it had no novelty at all, yet after three rejections it took best paper on the strength of Facebook's wide adoption. Needs and solutions are the wellspring of progress in computer science; even for an idea without deep thought behind it, if it works, a true peer will tell you, "I buy it." Is Berkeley really a place suited to doing research? Sky computing, Alpa, foundation models, ORAM: which of these at UCB was not built on someone else's paper, wrapped in a Berkeley-defined layer, reimplemented, and then hyped as made in UC Berkeley? Is that innovation? Yes, because that is communism. The so-called elders will tell you the shortcuts into the Big Four or the top 20, but can they themselves publish at SOSP, or are they the bad money, handing you the advisor's leftover idea to implement so you can land a top-venue paper? A PhD position is a job that can guide what industry does; such a precious exploratory stage must not be wasted.

What guides my day-to-day research? I think the advisor only offers direction and advice on how to fill daily life so as to reach the goal of publishable papers; the real decisions are mine. I hope to get my EB-1A green card before any armed move on Taiwan, and I hope to land an 800k job at graduation. Before that, I want to understand what industry needs, offer what I can before they set their eyes on it, and capture the value in the process. OpenAI says this is the golden age of AI; David Patterson says this is the golden age of computer architecture; Chris Lattner says this is the golden age of compilers; I say this is the golden age of hardware-software co-design!

FAST '23 Attendance

FAST was held on my doorstep this year, so naturally I took some time off to go have a look. The papers are already accessible. Both Best Papers went to Chinese teams: one is Pengfei Zuo's ROLEX, the other the time-series prediction model for Fail-Slow detection.

Coding and Cloud Storage

Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)

  • MR-LRC: every group has its own local redundancy plus global redundancy (see the sketch after this list).
  • Uses Cauchy LRC to predict the distance.
  • 96-105 codes per stripe; evaluated on >=6-failure reliability tests, time to recover, and mean time to data loss.
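To make the local/global split concrete, here is a minimal sketch assuming a toy stripe where each group keeps an XOR local parity and one XOR global parity covers everything (real wide LRCs use Cauchy/MDS constructions; all names and sizes here are made up):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy wide LRC: k data blocks split into groups of g, one XOR local parity
// per group, one XOR global parity over everything. Real LRCs use Cauchy/MDS
// constructions and tolerate more failures; this only shows why local repair
// is cheap: one lost block is rebuilt from its g-1 group mates plus the local
// parity, instead of reading the rest of a 96-105-block stripe.
int main() {
    const size_t k = 8, g = 4;                     // 8 data blocks, groups of 4
    std::vector<uint8_t> data = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<uint8_t> local(k / g, 0);          // one local parity per group
    uint8_t global = 0;                            // single global parity
    for (size_t i = 0; i < k; ++i) {
        local[i / g] ^= data[i];
        global ^= data[i];
    }

    // Repair data[5] locally: read only its own group plus the group parity.
    const size_t lost = 5, grp = lost / g;
    uint8_t rebuilt = local[grp];
    for (size_t i = grp * g; i < (grp + 1) * g; ++i)
        if (i != lost) rebuilt ^= data[i];
    std::printf("rebuilt=%d expected=%d\n", rebuilt, data[lost]);
}
```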

VAST Data's approach? LDC: the workload matters.

ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage

Repair penalty of RS codes:

  • Repair bandwidth: the amount of traffic transferred over the network.
  • Maximum repair load (MRL).

For the SoTA (4,2) Clay code, the block size is 256 MB while repair bandwidth and MRL are both 384 MB.

Repair can be pipelined by chunking a block into sub-blocks, taking (bw, MRL) => (512, 256); the drawback is that the repair operation must be additively associative. But we can sub-packetize the block.

pECDAG to parallelize repair for Clay codes.

InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication

Cloud tiering and backup data require deduplication at the cloud-hypervisor level. They get the fingerprints and semantics from the cloud tier and send the requests to the local tier.

The GC algorithm is basically BatchDedup.

Fingerprint indexing lives in SSD.

Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Peer evaluation with a sliding window.
A time-series model predicts when fail-slow shows up: latency vs. throughput plus linear regression.
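My reconstruction of that idea, not Perseus's actual code: fit latency against throughput across peer drives with least squares in the current window, then flag a drive whose latency sits far above the fitted line.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

struct Sample { double tput, lat; };

// Least-squares fit lat = a*tput + b over a window of peer-drive samples,
// then flag a drive whose residual exceeds 3 standard deviations: a crude
// stand-in for the regression step of fail-slow detection.
int main() {
    std::vector<Sample> peers = {{100, 1.0}, {200, 1.9}, {300, 3.1}, {400, 4.0}};
    double sx = 0, sy = 0, sxx = 0, sxy = 0, n = peers.size();
    for (const auto& s : peers) {
        sx += s.tput; sy += s.lat;
        sxx += s.tput * s.tput; sxy += s.tput * s.lat;
    }
    double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double b = (sy - a * sx) / n;

    double var = 0;
    for (const auto& s : peers) {
        double r = s.lat - (a * s.tput + b);
        var += r * r;
    }
    double sigma = std::sqrt(var / n);

    Sample suspect{250, 9.0};                  // same load as peers, much slower
    double resid = suspect.lat - (a * suspect.tput + b);
    std::printf("fail-slow: %s (resid=%.2f, 3*sigma=%.2f)\n",
                resid > 3 * sigma ? "yes" : "no", resid, 3 * sigma);
}
```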

Key-Value Stores

ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance

Learns from historical data how RocksDB's L0-L1 compaction and data overflow behave on different SSDs.

Feels less well executed than Kill Two Birds with One Stone: Auto-tuning RocksDB for High Bandwidth and Low Latency.

Does RocksDB increase the memtable size by itself?

Does RDO control compaction? All-layer compaction.

RDO with an MMO encoder / a Transformer? Future work.

FUSEE: A Fully Memory-Disaggregated Key-Value Store

The idea here is basically the same as Ceph's, just moved onto disaggregated memory.

  • Client-centric index replication
  • Remote memory allocation: RACE hashing + SNAPSHOT
  • Metadata corruption: embedded operation log

RACE hashing: one-sided RDMA hashing.

Write-write conflicts on the primary are resolved by last-writer-wins (the majority writer), which then updates the remaining replicas accordingly.
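A minimal sketch of that majority/last-writer-wins flow, with local atomics standing in for FUSEE's one-sided RDMA CAS on remote replicas (structure and names are my guesses):

```cpp
#include <atomic>
#include <cstdint>

// Each replica of an index slot is a 64-bit word that clients CAS. On a
// write-write conflict, the writer that lands on a majority of replicas wins,
// and losers repair the minority replicas to the winner's value so all copies
// converge (local atomics stand in for one-sided RDMA CAS here).
struct SlotReplicas {
    std::atomic<uint64_t> r[3];

    bool try_write(uint64_t old_v, uint64_t new_v) {
        int won = 0;
        for (auto& s : r) {                     // phase 1: CAS every replica
            uint64_t expect = old_v;
            if (s.compare_exchange_strong(expect, new_v)) ++won;
        }
        return won >= 2;                        // majority reached: we win
    }

    void repair(uint64_t winner) {              // phase 2: converge stragglers
        for (auto& s : r) s.store(winner);
    }
};

int main() {
    SlotReplicas slot;
    for (auto& s : slot.r) s.store(0);
    if (slot.try_write(0, 42)) slot.repair(42); // winner writes back the rest
}
```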

Compared with an MDS, hashing the key gives direct access to the RDMA location: only 1 RTT for a read.


ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems


Learned data movement and allocation require decoupled learned indexes that can be recomputed. The data leaves are kept at a fixed size by the decoupled retraining algorithm (shown mathematically), which makes migration easier.
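A minimal sketch of what a lookup through such a fixed-size-leaf learned index could look like (my assumptions, not ROLEX's actual layout): a linear model predicts a position, and the trained error bound maps to a constant number of leaves to fetch over RDMA.

```cpp
#include <algorithm>
#include <cstdio>
#include <utility>

// A linear model predicts the position of a key; because leaves are fixed
// size and the trained error is bounded, position +/- max_err maps to a
// bounded set of leaves the client can fetch in one shot with no pointer
// chasing on the remote side.
struct LearnedSeg {
    double slope, bias;
    int    max_err;                     // guaranteed by (re)training
    int    leaf_size;                   // fixed-size leaves ease migration

    std::pair<int, int> leaf_range(long key, int n_leaves) const {
        int pos = static_cast<int>(slope * key + bias);
        int lo = std::max(0, (pos - max_err) / leaf_size);
        int hi = std::min(n_leaves - 1, (pos + max_err) / leaf_size);
        return {lo, hi};                // leaves to read over RDMA
    }
};

int main() {
    LearnedSeg seg{0.1, 0.0, 8, 16};    // hypothetical trained parameters
    auto [lo, hi] = seg.leaf_range(1000, 64);
    std::printf("fetch leaves %d..%d\n", lo, hi);
}
```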


Consistency guarantees: it feels like syncing the models takes quite a few RDMA operations.

A bit as the lock? How do you recover from a stuck lock? Future work.

Fetch a new model on every read? First check the first entry of the SLT, and update the model only if it changed.

AI and Storage

GL-Cache: Group-level learning for efficient and high-performance caching

XGBoost learns a utility-over-time metric, and eviction goes by utility. It breaks if two workloads share the cache. Also, with a very large cache I suspect the prediction is less efficient than Segcache's manual TTLs. And I think a Transformer would predict TTLs better, since XGBoost only captures signals and has no real predictive power. The metric itself also seems to have been found by trial and error.
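For reference, the skeleton of group-level utility eviction as I understand it; the utility function below is a hand-written placeholder where GL-Cache uses the learned XGBoost model:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Objects are grouped; a utility function scores each group and the lowest-
// utility group is evicted wholesale. The hits-per-byte-per-age formula below
// is a hand-written placeholder for the learned model.
struct Group { int hits; int bytes; int age; };

double utility(const Group& g) {
    return static_cast<double>(g.hits) / g.bytes / (g.age + 1);
}

int pick_victim(const std::vector<Group>& groups) {
    auto it = std::min_element(groups.begin(), groups.end(),
        [](const Group& a, const Group& b) { return utility(a) < utility(b); });
    return static_cast<int>(it - groups.begin());
}

int main() {
    std::vector<Group> gs = {{100, 4096, 10}, {5, 4096, 50}, {60, 8192, 5}};
    std::printf("evict group %d\n", pick_victim(gs));   // the cold, old group
}
```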

SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training

This one is by a student I met at SC.

Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach

File Systems

CJFS: Concurrent Journaling for Better Scalability

Multi-version shadow paging.

Compound flush: on transaction finish, the cache (DMA; could CXL hack this?) barrier sends the interrupt to the storage device.

Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities


Name collisions; there was a git CVE about this a while back.

ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit

Software-engineering-style bug hunting, really quite boring: mutate the configuration (mainly the allocator's memory format / type info / 64-bit?), then run taint analysis to find these three classes of bugs.

Compared with Hydra

HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers

Runs on the Wuxi Sunway supercomputer.

Consolidates the links from userspace.

Fisc: A Large-scale Cloud-native-oriented File System

By someone from the NJU networking group who went to Pangu. Alibaba's Double 11 now runs entirely on this.

Essentially they implemented a lightweight file system client to improve resource multiplexing, with a two-layer resource aggregation; a vRPC agent proxy on each compute/storage node does the load balancing.

RPC serialization? There are no pointers: docker/container metadata is all flat (offsets only), so their RPC amounts to RDMA over memcpy, very lightweight. The load balancer only works at file granularity. If the DPU hardware fails, the docker is migrated straight over RDMA and other requests fall back to the slow path; it is fast enough that users never notice. If users feel IO has gotten slow, they provide a full-chain trace.

Persistent Memory Systems

TENET: Memory Safe and Fault Tolerant Persistent Transactional Memory

Uses tags chained with an address B-tree to prevent buffer overflows and dangling pointers.

MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems

Rearranges file writes on DAX PMem as file + dummy (2 MB granularity), at the cost of write amplification. Concurrency control uses CAS. Concurrent metadata writes looked unhandled at first, but I later understood that the semantics of open guarantee writes do not contend.

On Stacking a Persistent Memory File System on Legacy File Systems

A stackable memory file system over a normal file system. The in-memory file system on top is fairly lightweight, an extent RB-tree. The sync-to-lower-file-system design looks a bit like dirty-page writeback.

Extent hashing stays in sync with the layer below.

Remote Memory

Citron: Distributed Range Lock Management with One-sided RDMA

A segment tree maps ranges to tree nodes.
Ext-CAS+Ext-TAA
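A minimal sketch of the segment-tree mapping (my reconstruction; Citron does the node updates remotely with Ext-CAS/Ext-TAA rather than local booleans):

```cpp
#include <cstdio>
#include <vector>

// Fixed-depth segment tree over [0, 16): locking a range locks the O(log n)
// canonical nodes that exactly cover it, so disjoint ranges touch disjoint
// nodes. Plain bools stand in for remote Ext-CAS, and `sub` counts locked
// nodes per subtree to catch ancestor/descendant conflicts. A real version
// must also roll back nodes acquired before a conflict is hit.
struct SegTreeLock {
    std::vector<bool> locked = std::vector<bool>(32, false);
    std::vector<int>  sub    = std::vector<int>(32, 0);

    bool lock(int node, int node_lo, int node_hi, int lo, int hi) {
        if (hi <= node_lo || node_hi <= lo) return true;   // no overlap
        if (locked[node]) return false;                    // ancestor conflict
        if (lo <= node_lo && node_hi <= hi) {              // canonical cover
            if (sub[node] > 0) return false;               // descendant conflict
            locked[node] = true;
            for (int n = node; n >= 1; n /= 2) ++sub[n];
            return true;
        }
        int mid = (node_lo + node_hi) / 2;
        return lock(2 * node, node_lo, mid, lo, hi) &&
               lock(2 * node + 1, mid, node_hi, lo, hi);
    }
};

int main() {
    SegTreeLock t;
    std::printf("lock [2,6): %d\n", t.lock(1, 0, 16, 2, 6));  // 1: acquired
    std::printf("lock [4,8): %d\n", t.lock(1, 0, 16, 4, 8));  // 0: conflicts
}
```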

Patronus: High-Performance and Protective Remote Memory

Uses multiple QPs per memory window and remote leases to protect the data.

More Than Capacity: Performance-oriented Evolution of Pangu in Alibaba

Pangu is awesome.

IO Stacks

λ-IO: A Unified IO Stack for Computational Storage

Prof. Lu and Zhe Yang's paper; the idea is to run one unified eBPF across the kernel and the SSD.

sBPF relaxes the eBPF verifier's checks; offloaded computation wants eBPF code free of the usual semantic restrictions.

It watches the page cache to see whether an in-memory database's data can be reused in the kernel, and uses that metric to decide whether to offload computation to the device. The host has fast CPUs and large memory; the device has little compute and memory but sits next to the storage and only needs to return the result.

Experiments run TPC-H at SF=40 over Java, decomposing the syscall into the form read_λ(read, percentage, λ), where λ itself is an sBPF program.
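A toy sketch of the dispatch decision as I understood the talk (threshold and names invented): run the λ on the host when the range is mostly page-cache resident, otherwise push it down and ship back only results.

```cpp
#include <cstdio>

// Hypothetical dispatcher for a read_lambda(fd, off, len, lambda) call: run
// the lambda where the data is cheapest to reach. Page-cache residency is the
// sampled signal; the 0.8/0.9 thresholds are invented for illustration.
enum class Where { Host, Device };

Where dispatch(double cached_fraction, double host_load) {
    if (cached_fraction > 0.8 && host_load < 0.9)
        return Where::Host;     // data already in DRAM, host CPUs are faster
    return Where::Device;       // cold data: scan flash in place, return results
}

int main() {
    std::printf("hot file  -> %s\n",
                dispatch(0.95, 0.3) == Where::Host ? "host" : "device");
    std::printf("cold file -> %s\n",
                dispatch(0.05, 0.3) == Where::Host ? "host" : "device");
}
```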

Results are very good: a 2.16x speedup.

Why not native? A unified framework. We find the mmap part is all computation time, so: load from the device, mmap + page fault, and trigger userspace WebAssembly computation? For mmap it is very fast.

CXL + eBPF, a small page table for transactions? Quickly pull metrics from the cacheline network and decide between load/store and DMA?

One concern is saturating the PCIe Root Complex bandwidth; merely watching the page cache size will run into trouble.

Revitalizing the Forgotten On-Chip DMA to Expedite Data Movement in NVM-based Storage Systems

NVMeVirt: A Versatile Software-defined Virtual NVMe Device

A more accurate SSD emulator; I asked why they do not emulate CXL SSDs.

SMRSTORE: A Storage Engine for Cloud Object Storage on HM-SMR Drives

Chunking by zone is the better layout for OSS, and for SMR a deterministic zone-allocation policy is a fairly good strategy.

15% reduction in time.

SSDs and Smartphones

Multi-view Feature-based SSD Failure Prediction: What, When, and Why

Uses time-series data to predict failures.

But it is not very sensitive to anomalous data.

Fast Application Launch on Personal Computing/Communication Devices

Did not really follow this one.

Integrated Host-SSD Mapping Table Management for Improving User Experience of Smartphones

Prefetches for UFS using L2P mapping data; the data structures on UFS are fairly deterministic.

yyw's 2022 Year-End Summary

This has been a surreal year. Before it ends, with the 22 Fall season behind me, a few words of year-end summary.

At the start of the year I wanted to intern at Alibaba Cloud doing operating systems, but I missed the deadline and that fell through. The courses I picked early in the year started fine; then the lockdown came, and my wife and I spent it cohabiting near her office by Yanggao Middle Road, a not-so-pleasant but tender stretch of mutual support, in a perpetual deadline-crazed state. Still, I finally scraped through four years of college. By April I knew I would go to UCSC to work with Andrew Quinn; I now feel it is a place where industry and academia are tightly coupled, where you do not build things behind closed doors that industry will never use. It matched every fantasy I had at the time. I made three videos back then; the last was on whether a PhD is worth it. My take: if you could not make 100 million 30 years ago, you might as well start at 30, and with the economy this weak, joining a unicorn is no guaranteed path to financial freedom either.

On May 25 I went to Guangzhou for my visa and ran into many high-school classmates. What high school taught me was a zero-sum game: whoever could score more on the gaokao was plainly taking resources out of your hands, and they were a proud, aloof crowd who would stop at nothing to grab resources and share nothing with you. I do not think that is healthy competition. But after their gauntlets at Tsinghua, the SJTU-Michigan JI, and Tongji, they too have learned that people choose different paths; I am indeed the only one from my high school doing a CS PhD. After Guangzhou I went to Zhuhai, Shenzhen, Xiangtan, Huaihua, and Zhangjiajie. Was it really my roots-seeking journey? It was the last time I would see my grandfather. I am glad I made the visit.

Going abroad has been a wondrous experience. I first met my current advisor on July 1, having just stepped out of a locked-down place into one where I could express myself, from an academic backwater to the promised land beside Silicon Valley; everything was fresh and good. The turn I have taken is a good one, just a late one. I gave up Shanghai's modern comforts and the petit-bourgeois life for a place I had written off as countryside when choosing colleges in high school; yet here I can freely explore architectures, operating systems, and HW-SW co-design that have never existed in this world, and that thrills me. What I do is not so different from college, but the decision was a life-sized gamble. I still choose to lead the UCSC supercomputing team, talking with people and with sponsors. I still choose to design coursework and TA. I still take the most cutting-edge courses.

This year I attended OSDI/ATC/PLDI/POPL/ISSTA/EuroSys/ISCA online. Some of what PhDs build is brilliant; some is exploratory, some is assembly-line. If I can produce three pieces of work I am proud of during my PhD, I will be satisfied.

täko: A Polymorphic Cache Hierarchy for General-Purpose Optimization of Data Movement

(Rust-like naming: PhantomData is used to mark an undeclared type.) Here it is used to lift object-level data movement into software-defined data movement for different workloads.


They expose callback functions for how the hierarchy communicates with the dataflow fabric. The hardware scheduler lets the engine invoke onMiss(), onEviction(), and writeback(). They simply manifest each operation as SHARED and PRIVATE state changes, and I do not think these three callbacks alone can make a memory-order-correct Morph.
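A rough sketch of what that callback surface might look like; the signatures are my guesses from the talk, not täko's actual API:

```cpp
#include <cstdint>

// A software-defined data-movement engine: workloads register callbacks that
// the cache hierarchy invokes on events instead of hardwiring one policy.
struct DataMovementHandlers {
    // On a miss, the handler may prefetch related objects (e.g., chase the
    // pointers of the object being fetched).
    void (*on_miss)(uint64_t addr);
    // On eviction, the handler may pin, compress, or demote the victim.
    void (*on_eviction)(uint64_t addr);
    // On writeback, the handler may batch or reorder drains; this is where I
    // suspect memory ordering gets hard to keep correct.
    void (*on_writeback)(uint64_t addr);
};

int main() {
    DataMovementHandlers h{
        [](uint64_t) { /* prefetch neighbors */ },
        [](uint64_t) { /* pin or demote      */ },
        [](uint64_t) { /* batch writebacks   */ },
    };
    h.on_miss(0x1000);   // the engine would invoke these on cache events
}
```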

In terms of power, my view is that saving energy with PIM or a modified iMC means the core and the MC no longer need to communicate as much, while dataflow-based analysis inside the iMC or NoC may intrinsically reduce traffic and thus yield an energy-efficient solution.

However, this kind of design fully exposes memory to attackers via speculation and Rowhammer, which will surely force a black box onto users if it is ever commercialized.


The Linux Scheduler: a Decade of Wasted Cores

The background: Linux uses CFS, a weighted-fair-queueing-style algorithm. CFS time-slices the CPU, using a red-black tree to manage runnable threads; without heavy I/O it can beat the old O(1) scheduler. Each CPU's run queue cfs_rq maintains a min_vruntime field recording the minimum vruntime across all processes in that queue. A new process's initial vruntime is set from the min_vruntime of the queue it joins, keeping it within a reasonable gap of the older processes.
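A minimal sketch of that vruntime bookkeeping (the weights follow the kernel's sched_prio_to_weight table, but the code is illustrative, not kernel code):

```cpp
#include <cstdio>

// vruntime advances inversely to the task's weight: heavier (lower-nice)
// tasks accumulate vruntime more slowly, so picking the minimum-vruntime
// (leftmost) task naturally gives them more CPU time.
struct Task { const char* name; unsigned weight; double vruntime; };

void account(Task& t, double delta_exec_ms) {
    const double NICE_0_LOAD = 1024.0;
    t.vruntime += delta_exec_ms * NICE_0_LOAD / t.weight;
}

int main() {
    Task a{"nice0", 1024, 0.0};     // weights from the kernel's table
    Task b{"nice5", 335, 0.0};
    for (int i = 0; i < 3; ++i) {
        Task& next = a.vruntime <= b.vruntime ? a : b;  // rbtree leftmost
        account(next, 10.0);                            // ran for 10 ms
        std::printf("ran %-5s  vruntime a=%.1f b=%.1f\n",
                    next.name, a.vruntime, b.vruntime);
    }
}
```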

Pros

A dormant process is compensated in vruntime when it wakes, so its grabbing the CPU on wakeup is a probable event. That is the intent of CFS: to guarantee the responsiveness of interactive processes, which sleep frequently while waiting for user input.

Imagine each interactive action, hitting the keyboard or moving the mouse: to the system a new task arrives -> its runtime is 0 -> its vruntime is minimal -> it is placed at the leftmost node of the scheduling queue's red-black tree -> and the leftmost node is reached through a special cached pointer.

Cons

  1. Priority failure. Say process A has nice 0 and an allocated slice of 100 ms, while process B has nice +20 and a slice of 5 ms. With a 5 ms scheduling granularity, A runs 5 ms and then switches to B, which also runs 5 ms; so in every 10 ms window A and B run the same amount, and likewise across 100 ms, which is obviously inconsistent with the slices we allocated. (That is how I understand it; I read it several times and am still not sure.)
  2. Relative nice values. Two processes with nice 0 and 1 get slices of 100 ms and 95 ms respectively; but with nice 18 and 19 they get 10 ms and 5 ms, which is much less efficient for the CPU at that end of the scale (see the sketch after this list).
  3. The absolute time slice has to land on a timer tick. Switching when a slice expires needs a timer interrupt, so the slices of two processes cannot be split finer than the tick. (This could be computed against a separate value decoupled from the timer tick.)
  4. The priority-adjustment problem: raising the priority of a newly woken process so it gets up and running faster.
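On point 2: mainline CFS addresses this by making each nice step a constant ~1.25x weight ratio instead of subtracting a fixed slice, so nice 0 vs. 1 and nice 18 vs. 19 yield the same CPU share. A toy computation with that 1.25 factor:

```cpp
#include <cmath>
#include <cstdio>

// weight(nice) ~ 1024 / 1.25^nice: every nice step changes the CPU share by
// the same ~25% weight ratio, no matter where on the scale it happens.
double weight(int nice) { return 1024.0 / std::pow(1.25, nice); }

double share(int n1, int n2) {          // CPU share of the first task
    return weight(n1) / (weight(n1) + weight(n2));
}

int main() {
    std::printf("nice 0 vs 1:   %.3f\n", share(0, 1));    // ~0.556
    std::printf("nice 18 vs 19: %.3f\n", share(18, 19));  // ~0.556, same ratio
}
```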

There are no crashes or out-of-memory situations, and tools like htop, sar, or perf cannot detect the missing short-term idle periods, making the issues reported in this research hard to spot. The authors' first tool is a "sanity checker": it checks that no core idles while waiting threads sit in another core's run queue, permitting the condition briefly but alerting if it persists. The second tool is a visualizer of scheduling activity over time, which can profile and plot the number of run queues, their total load, and the cores considered during periodic load balancing and thread wakeups.

Scheduling, as in allocating CPU time among threads, was once considered a solved problem. This work shows it is not: a straightforward scheduling principle turned into a highly intricate, bug-prone implementation in order to accommodate modern hardware. Scheduling waiting threads onto idle cores can break the fundamental work-conserving invariant: runnable threads may sit in run queues for seconds while cores idle, causing large application slowdowns. Bugs of this nature are hard to find with standard tools; the authors fix them, identify their root causes, and provide tools that make finding and correcting such bugs much easier.

As for the future: the Linux kernel is universal code, so it is hard for the community to focus on multi-core issues alone; the community cares most about maintainability, not performance. If a new Linux feature runs fine on a 128 MB i386 box, that is OK, and as long as most people do not hit a new problem the community never cares; for the same reason it keeps introducing bugs, which is lamentable. My view is that the community is just a community of programmers who treat the code as the criterion; it pays little attention to architecture and new features, which are vendor things. Likewise with TCP, which I have always criticized: people focus on the TCP implementation code itself, which makes it ever more complex and then ever more fragile. You might call that evolution, but is there not a chance to go back to the drawing board before all hell breaks loose? It has not evolved to the point where it must keep evolving, has it? With an outside view and mandatory measures, that garbage TCP would have been gone long ago.


Introduction to Compilers and Language Design

This book has been updated again; a good moment to relearn compiler construction.

When I designed the ChocoPy coursework I felt pretty powerless:

  1. The Semantic and LLVM parts duplicate the definition of Symbol.
  2. The semantic-analysis design borrows too much from the Java version of ChocoPy; it should use more enums instead of hollow base classes. (Though to keep it doable for the students we still need difficulty tiers; writing it for myself I would do C++-style optimized in-class memory layout. Part 3 is closer to normal C++.)
  3. I did not explain the mapping from LLVM to the backend well, and not just register allocation; with better-designed classes it would not be mere string concatenation. In short, the backend is not abstracted enough.

An Empirical Study of the Effect of Source-Level Loop Transformations on Compiler Stability @OOPSLA18

Defines a notion of loop-optimization stability: whether two semantically identical pieces of code obtain stably identical output after optimization. It first extracts the for loops and classifies them by RO/WAR&WO/RAW, then mutates the benchmark loops; the metric defines behavioral similarity (static code analysis plus dynamic PMU behavior) as well as stability across different compilers (the so-called z-consistency).

The use: it can guide changes to compiler semantics, and loop heuristics can be added so that the compiler performs vectorizing transforms it previously missed.

SMDK: Samsung's CXL Development Kit (and SK hynix's HMSDK)

What is this for?

Samsung and SK hynix have each shipped their own prototype. For now the rough idea is a PCIe-attached FPGA + DDR5 emulating CXL.mem logic as a proof of concept. Since CPUs with CXL 2.0 are not out yet and the corresponding CFMWS type-3 (PMem-style read/write) support is unimplemented, CXL.mem logic carried over PCIe 4.0 is easy to build and gets a performance PoC done quickly. I hear Samsung's Rambus IP latency is no better than what the Demystifying CXL paper reports.

Comparison with PMDK

ndctl

PMDK's hw/sw interface is implemented in ndctl (I played with it while doing PM reverse engineering): there are commands to the iMC telling it which mode to start in (FSdax/devdax/Memory Mode), and beneath PMDK, when you issue clflush and friends, you still need the FS to maintain the index and crash atomicity.

At kernel boot, SMDK passes one of SRAT (memory affinity) / CEDT (CXL Early Discovery Table) / DVSEC (Designated Vendor-Specific Extended Capability) as a BIOS parameter to the kernel, telling ACPI which address hosts the CXL device. Intel, pushing the standard, merged cxl-ctl into ndctl; the command fetches hw info and creates labels, groups, and so on. The main logic is all the same: on a new machine, as long as cxl.mem is implemented, the kernel provides this cxl command for similar tasks. Still unresolved is who manages the hardware-assisted hot-page inspection of CXL memory: the kernel's HMM, or io_uring managing cxl.cache / cxl.mem.

In-kernel zone memory management

At boot a memory channel type is configured (the process is visible in mm/kaslr.c; it lives in a different region from the pmem e820 devices), which calls the startup code under drivers/cxl/exmem.c and configures the range as exmem. All the PCIe/CXL logic is implemented on the hardware PoC.

Writing to a PCIe device: first a mov of a CPU address to the PCIe DMA-mapped address is issued; the mov retires once the DMA device reads the MMIO; the PCIe device then reads from the DMA address into its own BAR and copies into device RAM. What Samsung built fully emulates the CXL flow here, transport layer and transaction layer, while the physical layer still rides PCIe 5.0. A mov in the other direction is the reverse process. Whether the board implements an IOMMU does not matter; without one, plain DMA is enough.

If I write a range of addresses in exmem-mapped memory, Samsung's device receives the corresponding DMA request and starts the write on the PoC board. The kernel checks at page level whether a page is on exmem (a shift-and-compare on the address is enough). Since PCIe memory is still slow, a single PCIe 5.0 round trip costing at least ~300 ns, the emulation is mostly for show. Two things I find silly: 1. reusing mmap flags to pick the allocator makes the placement decision at allocation time, which is not how the NUMA allocator logic works; 2. mmap is unfriendly to deferrable writes to CXL memory.
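A minimal sketch of that page-level check under the assumption that the exmem window is one aligned physical range (the bounds below are hypothetical):

```cpp
#include <cstdint>
#include <cstdio>

// If the exmem window is one aligned physical range, "is this page on exmem"
// reduces to shifting down to a page frame number and comparing bounds.
constexpr uint64_t PAGE_SHIFT = 12;
constexpr uint64_t EXMEM_BASE = 0x100000000ULL;   // hypothetical window
constexpr uint64_t EXMEM_END  = 0x200000000ULL;

bool page_on_exmem(uint64_t paddr) {
    uint64_t pfn = paddr >> PAGE_SHIFT;
    return pfn >= (EXMEM_BASE >> PAGE_SHIFT) && pfn < (EXMEM_END >> PAGE_SHIFT);
}

int main() {
    std::printf("%d %d\n",
                page_on_exmem(0x123456000ULL),    // 1: inside the window
                page_on_exmem(0x080000000ULL));   // 0: ordinary DRAM
}
```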


Their roadmap supports configuring different zones across CXL nodes or via an expander on a single node.

libnuma integration

Samsung inserted a new zone into libnuma to expose an interface to their smalloc.

jemalloc integration

The interface offered is jemalloc's, because zone-wise userspace memory allocators are already a thoroughly studied area.

YYW's experiments

I tried measuring performance on NUMA without their kernel; the problem is lazy allocation: something like MonetDB mmap-ing a huge region is hard to deal with.
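The pitfall in miniature (my sketch, not their kernel code): an untouched mmap costs nothing, so a far-memory benchmark has to fault the pages in first, or pass MAP_POPULATE.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// mmap only reserves virtual address space; pages are allocated on first
// touch. A MonetDB-style giant mapping therefore measures nothing until it is
// faulted in, which skews naive NUMA/CXL latency benchmarks.
int main() {
    const size_t len = 1ull << 30;                       // 1 GiB
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  // lazy by default
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Force allocation up front: touch one byte per page (or use MAP_POPULATE).
    for (size_t off = 0; off < len; off += 4096)
        static_cast<char*>(p)[off] = 1;

    std::puts("pages faulted in; access-latency numbers are now meaningful");
    munmap(p, len);
}
```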

Reference

  1. https://www.youtube.com/watch?v=Uff2yvtzONc
  2. https://www.youtube.com/watch?v=dZXLDUpR6cU
  3. https://www.youtube.com/watch?v=b9APU03pJiU
  4. https://github.com/OpenMPDK/SMDK
  5. PCIe 体系结构 (PCIe Architecture)
  6. SMT: Software Defined Memory Tiering for Heterogeneous Computing Systems With CXL Memory Expander
  7. https://www.arxiv-vanity.com/papers/1905.01135/
  8. https://arxiv.org/pdf/2303.15375v1.pdf