Scheduler Activations

To manage concurrency with threads, there are two classical options: 1. a user-level library that does all management inside the application's address space (+ performant and flexible, - less functional); 2. kernel threads (+ high functionality, - poor performance and flexibility). Scheduler activations therefore propose a kernel interface combined with a user-level thread package, gaining the pros of both.

User-level threads are usually lightweight: a well-designed fiber library keeps scheduling knowledge inside the process and squeezes the most out of every context switch.
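
To make "lightweight" concrete, here is a minimal C++ sketch of a user-level context switch using POSIX ucontext; this is not the fiber library discussed above, just an illustration of switching execution contexts without involving the kernel scheduler.

    #include <ucontext.h>
    #include <cstdio>

    static ucontext_t main_ctx, fiber_ctx;
    static char fiber_stack[64 * 1024];  // the fiber's private stack

    static void fiber_fn() {
      printf("fiber: running in user space\n");
      // Yield back to main without asking the kernel to reschedule.
      swapcontext(&fiber_ctx, &main_ctx);
    }

    int main() {
      getcontext(&fiber_ctx);
      fiber_ctx.uc_stack.ss_sp = fiber_stack;
      fiber_ctx.uc_stack.ss_size = sizeof(fiber_stack);
      fiber_ctx.uc_link = &main_ctx;  // where to go if fiber_fn returns
      makecontext(&fiber_ctx, fiber_fn, 0);

      swapcontext(&main_ctx, &fiber_ctx);  // switch into the fiber
      printf("main: back after a user-level context switch\n");
      return 0;
    }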

Cons

Upcall overhead drags down overall performance. A better kernel messaging interface, e.g., eBPF (as XDP does for networking) or io_uring used to submit activations asynchronously, might resolve this problem.
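
As a rough sketch of that idea, the liburing pattern below shows asynchronous kernel/user communication through a submission/completion ring; using it for scheduler activations is my speculation, not an existing design. Build with -luring.

    #include <liburing.h>
    #include <cstdio>

    int main() {
      struct io_uring ring;
      if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

      // Submit a no-op request; a real design would submit an operation
      // whose completion stands in for an activation event.
      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      io_uring_prep_nop(sqe);
      io_uring_sqe_set_data(sqe, (void *)0x1);  // tag for this "activation"
      io_uring_submit(&ring);

      // A user-level scheduler could reap completions in batches instead
      // of taking a synchronous upcall on every kernel event.
      struct io_uring_cqe *cqe;
      io_uring_wait_cqe(&ring, &cqe);
      printf("completion for tag %p\n", io_uring_cqe_get_data(cqe));
      io_uring_cqe_seen(&ring, cqe);

      io_uring_queue_exit(&ring);
      return 0;
    }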

Reference

  1. https://www.cs.ucr.edu/~heng/teaching/cs202-sp18/lec7.pdf

Reflections on Starting the Ph.D.

I've been in Santa Cruz for two months now and want to share some impressions.

  1. My advisor is very kind and is open to any question I bring up. (Feels like an angelic short-term-marriage partner.)
  2. The group has roughly three directions: a Distributed Systems language direction (collaborating with Lindsey on the P system, a pile of programming-language features for guaranteeing distributed-system reliability), the one I'm working on now, and record and replay.
  3. Courses take quite a bit of time (I've seen all the material before, but it's a different angle on the same problems; Scott Beamer is Berkeley-bred through and through, and I get more out of his lectures than I used to, probably because having written code and held a job changes how you listen), yet the workload is still only about half of ShanghaiTech's.
  4. TA work is about 6 hours/week at 100 USD per hour.
  5. Wearing a little skirt is accepted by everyone, at least on the surface. GD is much less noticeable now. Even 🍬 can be reimbursed.
  6. Ph.D. friends interning in the Bay Area are around for meals, and I can meet interesting souls from 🐶's company.
  7. Research is roughly: you think hard about one thing without finding the answer, yet there is always some breakthrough, and then you circle back around again. (Probably my accumulated background is just insufficient; I can't match others' accumulation at all, but I can quickly understand 50-80% of something.)
  8. There isn't much time each week to write the code I actually want to write. Or maybe my efficiency just isn't absurdly high.
  9. In my spare time I can still go hiking and shoot videos for myself.
  10. My stamina may not be as great as I imagined; maybe I'm a normie with a full offline life. And it may have already peaked.
  11. There is no sense of rat-race involution, just like during my internship, because nobody in this world knows the exact answer to what I'm working on. (JM: you can slack off under a dinosaur. Me: if there is no dinosaur, I am the dinosaur.)
  12. The most important qualities of a researcher, in order: 1. methodology; 2. skepticism; 3. quickly building a PoC. Everything else can wait until AP. (What WJ calls the research skill tree versus the management skill tree. (Is an AP a startup whiz?)
  13. It would be perfect if my wife could work in the Bay Area and support me.
  14. Some things won't necessarily work out even after investing time; not investing much time is surely my body's stress response, yet things might still work out.
  15. Systematized work, like assembly-line fuzzing, can yield what a paper needs fairly quickly and methodically; everything else has to be thought through from start to finish. The era David Patterson described 30 years ago, when one Ph.D. student leading five undergrads could build something like RISC-V, is gone: today's integrated EDA already embodies the wisdom that untold people, money, and time have distilled. The same holds for compilers, FL systems, and operating systems. I cannot build such a system from scratch, but perhaps what a Ph.D. trains is finding the 200 lines that need changing among millions, or writing a 100k-line PoC and tossing it to industry (they tell me "try yourself," and then I start a company?). Either way, it all takes one's own time and experimentation.

Lottery Scheduler

Motivation

A conventional scheduler deterministically hands the next time tick to another process in the name of fairness. Current designs such as CFS and ULE have several drawbacks: 1. At best, priority schemes are ad hoc: the top-ranked task always prevails, and a feedback loop adjusts priorities to approximate a "fair share" only over the (very) long run. 2. Process priorities on Unix are dynamic and hard to reason about. 3. Lower-priority tasks can stall higher-priority ones through priority inversion. 4. Schedulers are intricate, and it is hard to tell why some other task is allowed to delay a high-priority one.

Solution

Lottery scheduling is a probabilistic process-scheduling algorithm. Each process holds a number of lottery tickets, and the scheduler draws a winning ticket at random; the process holding that ticket gets the CPU. The ticket distribution need not be uniform: a process with more tickets than its competitors has a proportionally greater chance of winning and being chosen for execution. Starvation is avoided and probabilistic fairness is assured, because any process holding at least one ticket will eventually win a drawing. The ticket distribution and allocation mechanism determines the system's throughput and responsiveness.
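
As a concrete illustration, here is a minimal C++ sketch of one drawing over a flat client list. This linear scan is O(n) for clarity; the implementation can use smarter structures (e.g., a tree over ticket ranges) to make the winner search sublinear.

    #include <cstdint>
    #include <random>
    #include <vector>

    struct Client { int id; uint32_t tickets; };

    // Hold one lottery: draw a ticket number in [1, total], then walk the
    // list until the cumulative ticket count covers the drawn number.
    int hold_lottery(const std::vector<Client> &clients, std::mt19937 &rng) {
      uint32_t total = 0;
      for (const auto &c : clients) total += c.tickets;
      if (total == 0) return -1;  // nobody is runnable

      std::uniform_int_distribution<uint32_t> draw(1, total);
      uint32_t winner = draw(rng);

      uint32_t seen = 0;
      for (const auto &c : clients) {
        seen += c.tickets;
        if (winner <= seen) return c.id;  // holds the winning ticket
      }
      return -1;  // unreachable when total > 0
    }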

Ticket transfers: this technique applies when one client blocks while holding a resource that another client is waiting on. The idea is to transfer the waiter's tickets to the blocked client, making it more likely to run, finish its task sooner, and release the resource earlier, which benefits both clients. A client must be able to transfer its tickets to whichever client(s) it depends on.

Ticket inflation: a client can mint new tickets for itself, which is seen as an alternative to explicit ticket transfers. On the one hand, this can cause problems, since a client could monopolize a resource by printing many tickets; on the other hand, inflation/deflation can adjust resource allocations without explicit client-to-client communication, yielding higher throughput.

Compensation tickets: if a client consumes only a fraction f of the resource unit allotted to it (for instance, its CPU quantum), it is granted compensation tickets that scale its ticket value by 1/f until it next wins. This preserves fairness and improves responsiveness in process scheduling; interactive processes commonly show this behavior.
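
A tiny sketch of that 1/f rule; the field names are mine, not the paper's.

    // A client that used only a fraction f of its quantum competes with
    // base_tickets / f effective tickets until it next wins, so in
    // expectation it still receives its fair share of the CPU.
    struct CompClient {
      double base_tickets;
      double used_fraction;  // f in (0, 1]: fraction of the quantum consumed
    };

    double effective_tickets(const CompClient &c) {
      // e.g., an interactive client with f = 0.25 competes with 4x tickets
      return c.base_tickets / c.used_fraction;
    }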

Experiment for Significance

The experiments aim to show fairness and flexible control when multiple clients contend for resources. The authors also discuss concurrency control, but they do not bring up the time spent searching for the winning ticket, which is O(log n) with a tree over ticket ranges.

Experiment Fallacy

The Monte Carlo experiment demonstrates the dynamic-control feature but gives little insight into how much each internal mechanism contributes. I think compensation tickets are unnecessary because, in the long term, the odds should balance out between the clients, and tracking how many compensation tickets each client holds carries a bookkeeping cost. On the other hand, I believe this is a good illustration of the fresh possibilities the system's dynamics present.
Also, the graphs were unreadable; normalized charts would have been better.

Novelty

The novelty is that the paper describes a different scheduling scheme for time-sharing systems. Round-robin is probably the most straightforward scheduling algorithm: it keeps the resource occupied and achieves high client throughput, but it is too simple and gives the system no control over how much of the resource each client receives. Multilevel queue scheduling addresses this: the user can group clients and assign each group a distinct priority. We gain control in exchange for new issues such as priority inversion and starvation, which in turn are patched with mechanisms like priority inheritance and aging. Scheduling algorithms have thus grown ever more complex and harder to comprehend.

Comment

It's a good approach, but it is not fast in today's operating systems because of its complexity. In a distributed system you would certainly use round-robin plus consistent hashing rather than a scheduler based on a nonconstant-time algorithm; in other user applications, though, this kind of scheduling has its place.
We also need to weigh other metrics, such as workload characteristics and context-switch time, when deciding whether to use it, e.g., implementing CFS/real-time/lottery policies and falling back to a different scheduler under different workloads.

Reference

  1. https://www.pdl.cmu.edu/PDL-FTP/Scheduling/lottery.pdf
  2. https://people.eecs.berkeley.edu/~brewer/cs262/Lec-scheduling.pdf

Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy @ASPLOS19

Work on memory compression has focused more or less on cache-line semantics at the computer-architecture level; this paper instead proposes a compression framework built on an object-based memory hierarchy with object-aware compression optimizations.

Regarding ziqiwang's three questions: 1. The ISA proposes store/load-pointer instructions tailored to the Twizzler FOT design. 2/3. I don't think the penalty of universal forward chunking, compared with storing extra bits and doing pointer remapping, is that large for overflow objects. I haven't seen the implementation, so I should really dig into the metadata-bit design and the runtime penalty of the different memory and pointer operations.

ESESC Project Proposal

The CSE220 project is to write a simulator. I originally wanted to build an RVV GPU SM subsystem in the style of 承影, but there's no time, so I dropped it. The biggest difference between this emulator and Sniper is that ESESC is time-sampling based, so its accuracy is much higher.

Project Organization

    // Entry point of the emulation thread: unpack the arguments handed in,
    // run the embedded QEMU to completion, then tear down.
    qemuStarted     = true;
    QEMUArgs *qdata = (struct QEMUArgs *)threadargs;

    int    qargc = qdata->qargc;
    char **qargv = qdata->qargv;

    MSG("Starting qemu with");
    for(int i = 0; i < qargc; i++)
      MSG("arg[%d] is: %s", i, qargv[i]);

    // Instrumented QEMU main loop; it emits the instruction/memory trace
    // that the timing model consumes.
    qemuesesc_main(qargc, qargv, NULL);

    MSG("qemu done");

    QEMUReader_finish(0);

    pthread_exit(0);

The top level first runs QEMU with a custom bootloader to obtain physical memory (trace hooks are planted inside QEMU to record memory accesses). QEMU runs to completion (so why no JIT here? probably a limitation of when it was developed), and then an instruction parser is attached on this side.

After the simulator enters its main loop, it collects timing-instrumentation information from QEMU (libemul).

  // 4 possible states:
  // rabbit  : no timing or warmup, go as fast as possible skipping stuff
  // warmup  : no timing but it has system warmup (caches/bpred)
  // detail  : detailed modeling, but without updating statistics
  // timing  : full timing modeling

Another thread runs the model below; when initialization reaches the memory system, it reads the cache configuration from the conf file:

-exec bt
#0  CCache::CCache (this=0x555558ae4360, gms=0x555558ae4280, section=0x555558ae42c0 "IL1_core", name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/CCache.cpp:78
#1  0x0000555555836ccd in MemorySystem::buildMemoryObj (this=0x555558ae4280, device_type=0x5555589fa820 "icache", dev_section=0x555558ae42c0 "IL1_core", dev_name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/MemorySystem.cpp:61
#2  0x000055555586c99e in GMemorySystem::finishDeclareMemoryObj (this=0x555558ae4280, vPars=..., name_suffix=0x0) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:320
#3  0x000055555586c497 in GMemorySystem::declareMemoryObj (this=0x555558ae4280, block=0x5555589ee260 "tradCORE", field=0x555555a7e573 "IL1") at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:242
#4  0x000055555586bb0e in GMemorySystem::buildMemorySystem (this=0x555558ae4280) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:148
#5  0x000055555581c764 in BootLoader::createSimuInterface (section=0x5555589ee260 "tradCORE", i=0) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:281
#6  0x000055555581c1c4 in BootLoader::plugSimuInterfaces () at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:219
#7  0x000055555581cba1 in BootLoader::plug (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:320
#8  0x000055555572274d in main (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/main/esesc.cpp:37

CPU Model

After QEMU issues an instruction, a task handler picks it up; the prefetcher and address predictor are also implemented in libcore. There are LSQ and storeSet implementations, i.e., the load-store forwarding queue for RMO. The OoO/prefetch implementation is the most complete part.

Memory Model

When a memory instruction such as ld/st/mv fires, it calls directly into the memory subsystem. Each in-core L1/L2 has a MemObj that records the current address; the state machine is abstracted into a MemRequest, which in the non-blocking case first queries the MSHR and updates state as it reaches each MemObj. On the memory side, a controller assigns the DIMM and a crossbar (Xbar) picks the path for each load.
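
These are not ESESC's real classes, but a toy C++ sketch of the non-blocking flow just described, where a miss first checks the MSHR so duplicate misses on the same line are coalesced rather than re-sent down the hierarchy.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    struct MemRequest { uint64_t addr; /* state for the MemObj automaton */ };

    class ToyMSHR {
      // line address -> requests already waiting on that outstanding miss
      std::unordered_map<uint64_t, std::vector<MemRequest *>> pending_;

    public:
      // Returns true if the miss was merged into an existing entry, so no
      // new request needs to be forwarded to the next MemObj.
      bool add(uint64_t line_addr, MemRequest *req) {
        auto it = pending_.find(line_addr);
        if (it != pending_.end()) { it->second.push_back(req); return true; }
        pending_[line_addr] = {req};
        return false;  // first miss on this line: send it downstream
      }

      // On fill, wake every request that was coalesced on this line.
      std::vector<MemRequest *> complete(uint64_t line_addr) {
        auto waiters = std::move(pending_[line_addr]);
        pending_.erase(line_addr);
        return waiters;
      }
    };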

GPU Model

The GPU follows the CUDA model: the compiler generates RISC-V host code and GPU device code, the SMs execute the GPU code, and the SM stream processors record the same metrics as the CPU side.

Power Emulation

A pwr+McPAT power model drives the libpeq simulation; component categories can model the L1I and so on, with McPAT and GStats counters updated in real time. For now there are only SRAM and cache models, with no row-buffer or leakage simulation, and everything is assumed to run correctly.

Reference

  1. https://masc.soe.ucsc.edu/esesc/resources/4-memory.pdf
  2. https://masc.soe.ucsc.edu/esesc/resources/5-powermodel%20.pdf
  3. https://masc.soe.ucsc.edu/esesc/resources/6-gpu.pdf

RecSSD: Near-Data Processing

This paper implements near-data-processing data placement for recommendation inference. Embedding-table operations typically involve ~10 GB of storage, irregular access patterns, and low compute intensity.

The model combines fully connected layers with embedding tables of non-uniform size. Each embedding table is organized so that every row is a unique embedding vector, typically with 16, 32, or 64 learned features (i.e., the number of columns in the table). For each inference, a set of embedding vectors, specified by a set of identifiers (e.g., multi-hot encoded categorical inputs), is aggregated. Common aggregation operations include summation, averaging, concatenation, and matrix multiplication [30, 445]; Figure 1 shows an example using summation. Inference requests are often processed in batches to amortize control overhead and make better use of computational resources, and a model usually contains many embedding tables. Production-scale data centers currently store embedding tables in DRAM, with the CPU performing the embedding-table operations plus optimizations such as vectorized instructions and software prefetching.
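
A short C++ sketch of the summation aggregation in Figure 1 (table shapes and names are illustrative): the gathers are one row per sparse ID, which is exactly the irregular, low-compute access pattern the paper targets.

    #include <cstddef>
    #include <vector>

    // Sum-pool the embedding rows selected by sparse IDs into one vector.
    std::vector<float> sum_pool(const std::vector<std::vector<float>> &table,
                                const std::vector<size_t> &ids) {
      const size_t dim = table.empty() ? 0 : table[0].size();  // e.g. 16/32/64
      std::vector<float> out(dim, 0.0f);
      for (size_t id : ids)               // irregular: one row gather per ID
        for (size_t j = 0; j < dim; ++j)  // low intensity: a single add each
          out[j] += table[id][j];
      return out;
    }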

The embedding-dominated models include DLRM-RMC1/2/3; the observation is that the hit rate rises drastically for a handful of hot pages. This calls for real-world deployments to use software-defined allocation, prefetching, and vectorization.

Implementation

Their solution runs on an OpenSSD over the micro UNVMe stack (which could simply be replaced with a SmartSSD).

Caching requires DRAM caches on both the host and the device side, and the multithreaded I/O queues are fused with the SSD.

Performance

ctFS: Replacing File Indexing with HW Memory Translation through Contiguous File Allocation @FAST22

I happen to be working through CMU's 745 lately and also found a pile of Korean-language lectures from SNU; since I also need to RIIR (rewrite in Rust) the kernel memory and FS code, this feels like a great chance to relearn file systems properly. (It also seems that much PM work can migrate directly to CXL (the MMU and DAX mappings look like they could be exactly identical); worth seeing what can be mined there.)

Background

Indexing accounts for a large share of the time in SplitFS and ext4-DAX, especially for SWE/append workloads.

ext4-DAX must first reach the index nodes (the directory extent tree) and then the leaf nodes; only at the final extent index does the MMU translation of the virtual address take place. For PM this is a comparatively slow path.

Insight

Observation 1: we can map the file linearly, treating the file as a byte-addressable inode, so a file read needs only start + offset (although isolation after handing this to the MMU seems somewhat problematic).
Observation 2: the front half, the directory extent tree (apparently a red-black tree), can be carried out by the MMU over PM.

We find that the process of actually mapping a file onto the PM physical medium resembles a page walk over PM. Moreover, partitioning the page table at every 4 KB lets us express the file abstraction completely.
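
A toy C++ illustration of observation 1, under the stated assumption that the file is mapped contiguously in virtual memory; the names are mine, not ctFS's API.

    #include <cstddef>
    #include <cstring>

    // With a contiguous mapping, a read at `offset` is pure pointer
    // arithmetic: the hardware page walk resolves start + offset, where
    // ext4-DAX would first walk the directory/extent trees in software.
    void ctfs_style_read(const char *file_base,  // VA of the file mapping
                         size_t offset, void *buf, size_t len) {
      std::memcpy(buf, file_base + offset, len);
    }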


In a non-flat file system, files are split across multiple levels, but placing a file only slightly larger than one level's size into the next, larger level wastes a great deal of space, so we still need to maintain a partition header containing the superblock and the next level's inode bitmap.


For migration, a single system call guarantees contiguous memory, with crash consistency guaranteed by a redo log.

Evidently, at 128 MB alignment, covering the region with 64 PMD entries is faster than using 32K PT entries.

A special case is atomic writes: the data to be written is first written to a shadow location and then copy-on-write pswapped back.
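
A sketch of that atomic-write path in C++, with a stand-in pswap(): in ctFS this is a syscall that atomically swaps the page-table entries of two ranges (its real interface may differ), while the stub below merely swaps bytes so the sketch runs.

    #include <cstddef>
    #include <cstring>
    #include <vector>

    // Stand-in only: the real pswap exchanges mappings, not bytes.
    int pswap(char *range_a, char *range_b, size_t len) {
      std::vector<char> tmp(range_a, range_a + len);
      std::memcpy(range_a, range_b, len);
      std::memcpy(range_b, tmp.data(), len);
      return 0;
    }

    // Write the new bytes into a shadow copy first, then publish them with
    // one atomic swap, so a crash never exposes a half-written range.
    int atomic_write(char *file_data, char *shadow, size_t off,
                     const void *src, size_t len) {
      std::memcpy(shadow + off, src, len);   // copy-on-write step
      return pswap(file_data + off, shadow + off, len);
    }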

Reference

  1. https://www.youtube.com/watch?v=pT46-RYOsmw&t=309s