[CSE231 Paper Reading] The UNIX fast file system

Motivation

  1. The original UNIX file system was FFS's forerunner; it was straightforward and easy to use, but it had some serious problems:
  • Poor performance was the central issue: the disk bandwidth utilization measured in this paper was only about 3% in some circumstances, and overall throughput was poor.
  • Low locality was the cause: inodes and data blocks were scattered around the disk because the original UNIX file system treated the disk as if it were random-access memory.
  • Fragmentation of the file system was another issue, caused by careless management of free space. As data came and went, the file system became fragmented; the free list eventually pointed to blocks scattered across the disk, so accessing a logically contiguous file required repeatedly seeking across the disk. Disk defragmentation tools initially provided a workaround for this issue.
  • The original block size (512 bytes) was too small. A small block size is good for minimizing internal fragmentation (waste within a block) but bad for transfers, because each block incurs positioning overhead.

Solution

FFS ensures that accessing related files together won't require long disk seeks by grouping related inodes and data blocks into cylinder groups.

  • The superblock is replicated in every cylinder group, and its location is rotated from group to group. This lessens the possibility of losing the file system to corruption of a single superblock copy.

Layout guidelines

The basic mantra is simple: keep related things close together and unrelated things apart.

  • Placing directories: FFS looks for the cylinder group with the most free inodes and the fewest allocated directories, and places the new directory there.
  • Placing files: a file's data blocks are allocated, when possible, in the same cylinder group as its inode, and file inodes are allocated in the same cylinder group as the other files in their directory.

Block Size

The file system block size was increased from 512 bytes to 4096 bytes (4 KB). This improves performance because more data is transferred with each disk I/O, and larger files can be described without visiting indirect blocks (since the direct block pointers now cover more data).
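
As a back-of-the-envelope illustration of this tradeoff (the 1 MB file size below is an arbitrary example, not a number from the paper), a few lines of C show how the block size changes both the transfer count and the waste in the last, partially filled block:

    #include <stdio.h>

    /* Toy illustration of the block-size tradeoff: larger blocks mean
     * fewer transfers per file, but more internal fragmentation in the
     * last block. All sizes are arbitrary examples. */
    int main(void) {
        long file_size = 1000000;            /* hypothetical ~1 MB file */
        long block_sizes[] = {512, 4096};

        for (int i = 0; i < 2; i++) {
            long bs = block_sizes[i];
            long blocks = (file_size + bs - 1) / bs;   /* round up */
            long waste  = blocks * bs - file_size;     /* unused tail bytes */
            printf("block=%ldB: %ld transfers, %ld bytes wasted\n",
                   bs, blocks, waste);
        }
        return 0;
    }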

Experiments

Table 2b shows that reads and writes now achieve up to roughly 47% of the disk bandwidth, compared with about 3% in the old file system.

Besides the performance work, FFS added specific functionality improvements that have since become standard in modern operating systems and helped FFS gain users:

  1. Long file names: file names are now essentially unlimited in length (previously 14 characters, now up to 255 characters).
  2. Advisory locks: programs can now lock files with advisory shared or exclusive locks at the granularity of an individual file (see the flock() sketch after this list).
  3. Symbolic links: users gain considerably more flexibility because they can create "aliases" to any other file or directory on the system.
  4. Renaming files: an atomic rename() operation.
  5. Quotas: limit the number of inodes and disk blocks a user may consume in the file system.
  6. Crash consistency: FFS relies on a clean unmount of the file system to avoid consistency issues. After a crash or power failure, users have to run and wait for the lengthy file system consistency checker, fsck, which detects inconsistencies and attempts repairs without any guarantee of success.
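
As a concrete illustration of the advisory locks in item 2, here is a minimal sketch using the BSD flock() interface; the file name is a placeholder and error handling is kept minimal:

    #include <sys/file.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Open some file; "data.log" is just a placeholder name. */
        int fd = open("data.log", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Advisory exclusive lock: only cooperating processes that also
         * call flock() are affected; the kernel does not stop others. */
        if (flock(fd, LOCK_EX) == 0) {
            /* ... critical section: update the file ... */
            flock(fd, LOCK_UN);            /* release the lock */
        }
        close(fd);
        return 0;
    }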

Evaluation effectiveness

I generally believe the paper, since it raises disk bandwidth utilization to as much as 47% compared with the previous file system. I have a few questions about this paper: 1. how advisory read-write lock performance compares with mandatory locks under different workloads; 2. block wear leveling; 3. slow traversal of directory inodes (EXT4's fast start resolves this; see also HERMES).

Novelty

Spreading file blocks across the disk can hurt performance, especially when a file is accessed sequentially. But this problem can be mitigated by choosing the block size carefully: if the block size is large enough, the file system spends most of its time transferring data from disk and only a (relatively) small amount of time seeking between blocks. This process of reducing overhead by increasing the amount of work per operation is called amortization and is a common technique in computer systems.
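
A small sketch of the amortization argument; the 10 ms positioning time and 40 MB/s peak rate below are assumptions for illustration only, not numbers from the paper:

    #include <stdio.h>

    /* Effective bandwidth = block_size / (positioning_time + transfer_time).
     * As the block grows, positioning cost is amortized over more bytes. */
    int main(void) {
        double seek_s  = 0.010;             /* assumed seek + rotation, 10 ms */
        double peak_bw = 40e6;              /* assumed peak transfer, 40 MB/s */

        for (long bs = 512; bs <= 4 * 1024 * 1024; bs *= 8) {
            double t = seek_s + bs / peak_bw;
            printf("block=%7ld B  effective bandwidth=%6.2f MB/s\n",
                   bs, (bs / t) / 1e6);
        }
        return 0;
    }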

The second clever mechanism is a disk layout optimized for performance. The problem occurs when file blocks are placed in consecutive sectors: suppose FFS issues a read for block 0, and by the time that read completes and the request for block 1 is issued, the head has already passed block 1, so the disk must wait almost a full extra rotation to read it. FFS solves this with a different layout: by spacing the blocks out, FFS leaves enough time to request the next block before the head passes over it. FFS can compute, for each disk, how many blocks to skip between consecutive allocations to avoid the extra rotations; this technique is called parameterization. Such a layout reaches only 50% of peak bandwidth, because each track must be passed over twice to read every block once. Fortunately, modern disks are smarter: they read the entire track into an internal track buffer, and subsequent reads of that track are served from the buffer, so the file system no longer has to worry about these low-level details.
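
A rough sketch of the parameterization computation under assumed disk numbers (rotation speed, blocks per track, and request-issue overhead are all made up): how many blocks should consecutive allocations skip so the head has not already passed the next block?

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double rpm = 3600.0;                 /* assumed rotation speed */
        int blocks_per_track = 16;           /* assumed geometry */
        double issue_overhead_ms = 2.0;      /* assumed time to issue next I/O */

        double rotation_ms  = 60000.0 / rpm;             /* one revolution */
        double per_block_ms = rotation_ms / blocks_per_track;

        /* Skip enough blocks that the next logical block is still ahead
         * of the head when the follow-up request reaches the disk. */
        int skip = (int)ceil(issue_overhead_ms / per_block_ms);
        printf("rotation=%.2f ms, per-block=%.2f ms, skip=%d blocks\n",
               rotation_ms, per_block_ms, skip);
        return 0;
    }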

SoTA connection

The recent paper HTMFS (FAST '22) still cites this paper when discussing clean unmounts and fsck: FFS has no consistency guarantee if a crash happens before a clean unmount.

EXT4, F2FS, and many persistent-memory file systems are clearly descendants of FFS. However, current file systems focus more on software performance on fast storage media such as CXL-attached persistent memory. We've seen hacks in the software stack such as pushing operations like write combining down to firmware, or wrapping them in fast data structures like ART. State-of-the-art file system stacks tend to refer to LFS or PLFS rather than FFS when it comes to journaling and firmware-level file systems.

Reference

  1. HTMFS: Strong Consistency Comes for Free with Hardware Transactional Memory in Persistent Memory File Systems
  2. https://ext4.wiki.kernel.org/index.php/Main_Page
  3. https://gavv.net/articles/file-locks/
  4. https://pages.cs.wisc.edu/~remzi/OSTEP/file-ffs.pdf

Scheduler Activations

To manage concurrency with threads, we have two options: 1. a user-level library that does the management inside the application's address space (+ performant and flexible, - less functional), or 2. kernel-level threads (+ more functionality, - poor performance and flexibility). So we want a kernel interface combined with a user-level thread package that can obtain both sets of advantages.

User-level threads are usually lightweight; current fiber-library designs can schedule within the process well and squeeze the most out of each context switch.

Cons

The upcall overhead drags down performance (a better message interface into the kernel, e.g. eBPF (as XDP does for networking) or io_uring-style asynchronous submission of activations, might resolve this).

Reference

  1. https://www.cs.ucr.edu/~heng/teaching/cs202-sp18/lec7.pdf

Reflections on Starting the Ph.D.

It has been two months since I came to Santa Cruz, and I want to write down a few impressions.

  1. My advisor is very nice and is open to any question I ask. (It feels like finding an angel for a short-term marriage.)
  2. The group has roughly three directions: a distributed-systems language direction (collaborating with Lindsey on the P system, a pile of programming-language features to guarantee the reliability of distributed systems), the one I am working on now, and record and replay.
  3. Courses take quite a bit of time (I have learned the material before, but it is a different angle on the same problems; Scott Beamer comes straight out of Berkeley, and I get more out of his lectures than I used to, probably because listening after having written code and had work experience gives a different understanding), yet the workload is still only about half of what it was at ShanghaiTech.
  4. TA work is about 6 hours/week, at 100 USD/hour.
  5. Wearing a little skirt is accepted by everyone, at least on the surface. The GD is no longer so obvious. Even 🍬 can be reimbursed.
  6. Ph.D. friends interning in the Bay Area come by for meals, and I get to meet interesting souls from 🐶.
  7. Research mostly means thinking about one thing without finding the answer, yet always making some breakthrough and then going around in circles again. (Probably my accumulated background is not enough; I cannot match others' accumulation, but I can quickly understand 50-80% of something.)
  8. There is not much time each week to write the code I actually want to write. Or maybe my efficiency just is not absurdly high.
  9. In my spare time I can still go hiking and shoot videos for myself.
  10. My stamina may not be as great as I imagined; maybe I am just enjoying life offline, and maybe it has already peaked.
  11. There is no feeling of an involuted rat race, same as during my internship, because nobody in the world knows the exact answer to what I am working on. (JM: you can slack off under the dinosaur. Me: if there is no dinosaur, then I am the dinosaur.)
  12. The most important qualities of a researcher, in order: 1. methodology; 2. skepticism; 3. quickly building a PoC. The rest can wait until becoming an AP. (This is WJ's research skill tree versus management skill tree. Is an AP basically a startup founder?)
  13. It would be perfect if my wife could work in the Bay Area and support me.
  14. Some things will not necessarily work out even after investing time; but if I did not spend much time on something, that is definitely my body's stress response, and it might still work out.
  15. Systematic work, like pipeline-style fuzzing, can yield what a paper needs relatively quickly and systematically; everything else has to be thought through from beginning to end. The era David Patterson described 30 years ago, when one Ph.D. student leading five undergraduates could build something like RISC-V, is over; today's integrated EDA already embodies the wisdom that countless people, money, and time have distilled. The same goes for compilers / FL systems / operating systems: I cannot build such a system from scratch, but perhaps what Ph.D. training teaches is how to find the 200 lines that need changing among millions of lines, or to write a 100k-line PoC and hand it to industry (who tell me "try it yourself," and then I go start a company?). Either way, it all takes my own time to try.

Lottery Scheduler

Motivation

Existing schedulers deterministically and "fairly" hand the next time tick to another process. Current scheduler designs like CFS and ULE have drawbacks: 1. At best, priority schemes are ad hoc: the top-ranked task always prevails, and a feedback loop is used to adjust priorities to attain a "fair share" only over the (very) long run. 2. Process priorities on Unix are dynamic and hard to reason about. 3. Priority inversion can let lower-priority tasks delay higher-priority ones. 4. Schedulers have become intricate, and it is debatable whether other tasks should be able to stall high-priority work.

Solution

Lottery scheduling is a probabilistic process-scheduling algorithm. Each process is given some number of lottery tickets, and the scheduler draws a winning ticket at random; the process holding that ticket gets the CPU. The ticket distribution need not be uniform: a process with more tickets than its competitors has a higher chance of winning and being chosen for execution. Starvation is avoided and probabilistic fairness is ensured, because any process holding at least one ticket will eventually win a lottery. The system's throughput and responsiveness depend on the distribution and allocation mechanism used.
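
A minimal sketch of a single lottery draw (just the idea, not the paper's list- or tree-based implementation): sum the tickets, pick a random winning number, and walk the clients until the running total passes it. The client names and ticket counts below are made up.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct client { const char *name; int tickets; };

    /* Pick the winner of one lottery among the given clients. */
    static const struct client *draw(const struct client *c, int n) {
        int total = 0;
        for (int i = 0; i < n; i++) total += c[i].tickets;

        int winner  = rand() % total;   /* winning ticket number */
        int counter = 0;
        for (int i = 0; i < n; i++) {
            counter += c[i].tickets;
            if (counter > winner) return &c[i];
        }
        return NULL;                    /* unreachable if total > 0 */
    }

    int main(void) {
        struct client clients[] = { {"A", 75}, {"B", 25} };  /* made-up shares */
        srand((unsigned)time(NULL));
        for (int i = 0; i < 5; i++)
            printf("tick %d: %s wins\n", i, draw(clients, 2)->name);
        return 0;
    }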

Ticket transfers: this technique is used when a client is blocked while holding a resource and another client is waiting for it to release that shared resource. The idea is to transfer the waiting client's tickets to the blocked client, making it more likely to run, finish its work faster, and release the resource sooner, which benefits both clients. A client must be able to transfer its tickets to one or more clients it depends on.

Ticket inflation: a client can create new tickets for itself, which is seen as an alternative to explicit ticket transfers. On the one hand, this can cause problems, because a client could monopolize a resource by printing itself many lottery tickets. On the other hand, inflation/deflation can be used to adjust resource allocations without explicit client-to-client communication, resulting in higher throughput.

Compensation tickets: if a client consumes only a fraction f of the resource unit allotted to it (for instance, its CPU quantum), it can be given compensation tickets that inflate its ticket value by 1/f until it next wins. This preserves fairness and improves responsiveness in process scheduling; interactive processes commonly show this behavior.
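
A tiny worked example of the compensation rule above: a client that used only a fraction f of its quantum competes with its tickets scaled by 1/f until it next wins. The numbers below are made up.

    #include <stdio.h>

    /* Effective tickets after compensation: base / f, where f is the
     * fraction of the quantum the client actually consumed (0 < f <= 1). */
    static double compensated(double base_tickets, double f) {
        return base_tickets / f;
    }

    int main(void) {
        /* e.g. an interactive client with 400 tickets that ran for only a
         * quarter of its quantum competes as if it held 1600 tickets. */
        printf("%.0f\n", compensated(400.0, 0.25));   /* prints 1600 */
        return 0;
    }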

Experiment for Significance

They try to demonstrate fairness and flexible control when multiple clients contend for resources. They also discuss concurrency control, but they do not bring up the time needed to search for the winning ticket, which is O(log n).

Experiment Fallacy

The Monte Carlo experiment demonstrates the dynamic control feature but does not give a good picture of how much each internal mechanism contributes. I think compensation tickets are unnecessary, because in the long run the odds should balance out between clients, and tracking how many compensation tickets each client holds has a cost. On the other hand, I believe this is a good illustration of the new possibilities the system's dynamics present.
Also, the graphs were hard to read; normalized charts would have been better.

Novelty

The novelty is that it describes a different scheduling approach from the time-sharing schedulers usually studied alongside this paper. Round-robin is probably the most straightforward scheduling algorithm: it keeps the resource occupied and has high client throughput, but it is too simple, offering the system no control over how much of the resource each client receives. Multilevel queue scheduling addresses this: the user can group clients and assign each group a distinct priority. We gain control in exchange for new issues such as priority inversion and starvation. To address those, mechanisms like priority inheritance and aging have been added, and the scheduling algorithms have become ever more complex and harder to understand.

Comment

It's a good approach, but not fast enough for today's operating systems because of its complexity. If you write a distributed system, you will almost certainly use round-robin plus consistent hashing and will not consider a scheduler with a non-constant-time algorithm. But in other user applications, this kind of scheduling can have its place.
Also, we need to consider other metrics, such as the workload and the context-switch time, to decide whether to use it; for example, implement CFS / real-time / lottery policies and fall back to different schedulers for different workloads.

Reference

  1. https://www.pdl.cmu.edu/PDL-FTP/Scheduling/lottery.pdf
  2. https://people.eecs.berkeley.edu/~brewer/cs262/Lec-scheduling.pdf

The Linux Scheduler: a Decade of Wasted Cores

The main novelty is that Linux's CFS is a weighted fair queuing algorithm. CFS time-slices the CPU using a red-black tree of runnable threads; even compared with the old O(1) scheduler, it may perform better in practice when there is no I/O. Each CPU's run queue cfs_rq maintains a min_vruntime field that records the minimum vruntime over all processes in that run queue. The initial vruntime of a new process is set based on the min_vruntime of the run queue it joins, keeping it within a reasonable gap of the existing processes.
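
A heavily simplified sketch of these two ideas (this is not the kernel's code; NICE_0_WEIGHT and the weights below are illustrative): runtime is scaled by weight into vruntime, and a newly arriving task is placed no earlier than the queue's min_vruntime instead of starting at zero.

    #include <stdio.h>

    /* Simplified weighted virtual runtime, loosely modeled on CFS:
     * vruntime advances more slowly for heavier (higher-weight) tasks. */
    struct task { double vruntime; double weight; };

    #define NICE_0_WEIGHT 1024.0

    static void account(struct task *t, double delta_exec_ns) {
        t->vruntime += delta_exec_ns * NICE_0_WEIGHT / t->weight;
    }

    /* Place a new/woken task relative to the queue's min_vruntime so it
     * neither starves nor monopolizes the CPU. */
    static void place(struct task *t, double min_vruntime) {
        if (t->vruntime < min_vruntime)
            t->vruntime = min_vruntime;
    }

    int main(void) {
        struct task a = {0, 1024}, b = {0, 2048};   /* b is "heavier" */
        account(&a, 1e6); account(&b, 1e6);         /* both ran 1 ms */
        printf("a=%.0f b=%.0f\n", a.vruntime, b.vruntime);

        struct task fresh = {0, 1024};
        place(&fresh, a.vruntime);                  /* assumed min_vruntime */
        printf("fresh=%.0f\n", fresh.vruntime);
        return 0;
    }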

Pros

A process that has been sleeping is compensated via vruntime when it wakes up, so it is likely to grab the CPU at that moment. This is the intent of the CFS design: to guarantee the responsiveness of interactive processes, which frequently sleep waiting for user input.

Imagine performing an interactive operation such as hitting the keyboard or moving the mouse. To the system this looks like a new task arriving -> its runtime is 0 -> its vruntime is (near) 0 -> it is placed at the leftmost node of the scheduler's red-black tree -> the leftmost node is tracked via a special pointer, and that pointer is cached.

Cons

  1. Priority failure. For example, take two processes: process A has a nice of 0 and an allocated time slice of 100 ms, and process B has a nice of +20 and an allocated time slice of 5 ms. If the scheduling tick is 5 ms, then after process A runs for 5 ms it switches to process B, which also runs for 5 ms; so within 10 ms, A and B run for the same amount of time, and so on: within 100 ms, A and B run for equal amounts of time, which is clearly inconsistent with the time slices we allocated. (This is how I understand it; I read it several times and am still not sure.)
  2. Relative nice values. For example, if two processes have nice values of 0 and 1, they are allocated time slices of 100 ms and 95 ms respectively. But if the two processes have nice values of 18 and 19, they are allocated 10 ms and 5 ms respectively, which is less efficient for the CPU as a whole.
  3. The absolute time slice has to align with the timer tick. When a time slice expires, the process must be switched, which requires a timer interrupt to trigger it; therefore the time slices of the two processes cannot be split arbitrarily finely, because they must fit the timer tick. (This can be avoided by accounting with a separate value decoupled from the timer tick.)
  4. Priority-adjustment problem: the priority of a newly woken process is raised so that it can get up and running faster.

There are no crashes or out-of-memory situations, and tools like htop, sar, or perf cannot detect the missing short-term idle periods, which makes the issues reported in this paper hard to find. The authors call their first tool a "sanity checker": it checks that no core is idling while waiting threads sit in the run queue of a different core. It permits this condition to exist briefly but raises an alert if it persists. The second tool is a visualizer that displays scheduling activity over time, so the number of run queues, their overall load, and the cores considered during routine load balancing and thread wakeups can all be profiled and plotted.

Scheduling, in the sense of allocating CPU time among threads, was once considered a solved problem; the paper demonstrates that this is untrue. A straightforward scheduling principle turned into a highly intricate, bug-prone implementation in order to accommodate the complexity of modern hardware. The authors found that the way waiting threads are placed onto idle cores breaks a fundamental work-conserving invariant: runnable threads can sit in run queues for seconds while there are idle cores in the system, causing a significant drop in application performance. These bugs are hard to find with standard tools because of their nature. The paper fixes the flaws, identifies their underlying causes, and provides tools that make finding and correcting such bugs much simpler.

As for the future: the Linux kernel, as a codebase, is meant to be universal, so it is hard for the community to focus on multicore issues alone; the community cares most about maintainability, not performance. If a new Linux feature runs fine on an i386 machine with 128 MB of memory, that is considered OK. As long as not too many people hit a new problem, the community does not care much, and for the same reason the community keeps introducing bugs, which is something to sigh about. My opinion is that the community is a community of programmers who take the code as the criterion; it does not pay much attention to architectural development or new features, which are left to vendors. Likewise with TCP, which I have always criticized: people focus on the TCP implementation code itself, which makes it more and more complex and then more and more fragile. You might call that evolution, but isn't there a chance to go back to the drawing board before all hell breaks loose? It has not evolved to the point where it must keep evolving, has it? Looking at it from the outside, with mandatory measures, the garbage parts of TCP would have been gone long ago.
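
A sketch of the work-conserving invariant the authors' sanity checker enforces, under an assumed toy model of per-core run queues (the struct and its fields are made up for illustration): no core should stay idle while another core has runnable threads waiting.

    #include <stdio.h>
    #include <stdbool.h>

    #define NCORES 4

    /* Simplified model: each core is either idle or busy and has some
     * number of runnable threads waiting in its local queue. */
    struct core { bool idle; int waiting; };

    /* Returns true if some core is idle while another core has waiting
     * work, i.e., the work-conserving invariant is (momentarily) broken.
     * The real checker only alerts if the condition persists. */
    static bool invariant_violated(const struct core *c, int n) {
        bool any_idle = false, any_waiting = false;
        for (int i = 0; i < n; i++) {
            if (c[i].idle)        any_idle = true;
            if (c[i].waiting > 0) any_waiting = true;
        }
        return any_idle && any_waiting;
    }

    int main(void) {
        struct core cores[NCORES] = { {true, 0}, {false, 3}, {false, 0}, {false, 1} };
        printf("violation: %s\n", invariant_violated(cores, NCORES) ? "yes" : "no");
        return 0;
    }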

Reference

  1. https://www.usenix.org/system/files/conference/atc18/atc18-bouron.pdf

Compress Objects, Not Cache Lines: An Object-Based Compressed Memory Hierarchy @ASPLOS19

In terms of memory compression, prior work focuses mostly on computer-architecture-level semantics (compressing cache lines); this paper instead proposes a compression framework built on an object-based memory hierarchy, with object-aware compression optimizations.

Regarding ziqiwang's three questions: 1. The ISA proposes store/load pointer instructions tailored to Twizzler's FOT design. 2/3. I don't think the penalty of universal forward chunking, compared with storing an extra bit and doing pointer remapping, is that large for overflow objects. I haven't seen the implementation, so I should really dig into the metadata-bit design and the runtime penalty of the different memory and pointer operations.

ESESC Project Proposal

The CSE220 project is to write a simulator; what I had in mind was a Chengying(承影)-like RVV GPU SM subsystem, but I ran out of time and won't write it. The biggest difference between this emulator and Sniper is that this simulator is time-sampling based, so its accuracy is much higher.

project organization

    // Thread entry: launch the embedded QEMU with the arguments passed in.
    qemuStarted     = true;
    QEMUArgs *qdata = (struct QEMUArgs *)threadargs;

    int    qargc = qdata->qargc;
    char **qargv = qdata->qargv;

    MSG("Starting qemu with");
    for(int i = 0; i < qargc; i++)
      MSG("arg[%d] is: %s", i, qargv[i]);

    // Run the emulated program under QEMU; returns when emulation ends.
    qemuesesc_main(qargc, qargv, NULL);

    MSG("qemu done");

    // Signal the ESESC reader side that trace production has finished.
    QEMUReader_finish(0);

    pthread_exit(0);

The upper layer goes through qemu first: a bootloader was written to obtain physical memory (a trace hook is inserted into qemu to record memory accesses). After qemu runs through (so why is no JIT used here? probably a limitation of development at the time), an instruction parser is attached on this side.

After the simulator enters its main loop, it collects timing instrumentation information from qemu (libemul).

  // 4 possible states:
  // rabbit  : no timing or warmup, go as fast as possible skipping stuff
  // warmup  : no timing but it has system warmup (caches/bpred)
  // detail  : detailed modeling, but without updating statistics
  // timing  : full timing modeling

Another thread runs the model below; when initialization reaches the memory system, the cache configuration is read from the conf file.

-exec bt
#0  CCache::CCache (this=0x555558ae4360, gms=0x555558ae4280, section=0x555558ae42c0 "IL1_core", name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/CCache.cpp:78
#1  0x0000555555836ccd in MemorySystem::buildMemoryObj (this=0x555558ae4280, device_type=0x5555589fa820 "icache", dev_section=0x555558ae42c0 "IL1_core", dev_name=0x555558ae4340 "IL1(0)") at /home/victoryang00/Documents/esesc/simu/libmem/MemorySystem.cpp:61
#2  0x000055555586c99e in GMemorySystem::finishDeclareMemoryObj (this=0x555558ae4280, vPars=..., name_suffix=0x0) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:320
#3  0x000055555586c497 in GMemorySystem::declareMemoryObj (this=0x555558ae4280, block=0x5555589ee260 "tradCORE", field=0x555555a7e573 "IL1") at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:242
#4  0x000055555586bb0e in GMemorySystem::buildMemorySystem (this=0x555558ae4280) at /home/victoryang00/Documents/esesc/simu/libcore/GMemorySystem.cpp:148
#5  0x000055555581c764 in BootLoader::createSimuInterface (section=0x5555589ee260 "tradCORE", i=0) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:281
#6  0x000055555581c1c4 in BootLoader::plugSimuInterfaces () at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:219
#7  0x000055555581cba1 in BootLoader::plug (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/simu/libsampler/BootLoader.cpp:320
#8  0x000055555572274d in main (argc=1, argv=0x7fffffffd888) at /home/victoryang00/Documents/esesc/main/esesc.cpp:37

CPU Model

After an instruction is issued by qemu, a task handler picks it up; the prefetcher and address predictor are also implemented in libcore. There are implementations of the LSQ and storeSet, i.e., the load/store forwarding queue for RMO. The OoO/prefetch implementation is also the most complete part.

Memory Model

When memory instructions such as ld/st/mv are triggered, the memory subsystem is called directly. The L1/L2 inside the core each have a MemObj that records the current address; the state machine is abstracted as a MemRequest, which is non-blocking: it first queries the MSHR, then updates state as it reaches each MemObj. On the memory side, a controller assigns DIMMs and a crossbar (Xbar) chooses the path for each load.

GPU Model

This follows the CUDA model: the compiler generates RISC-V host code and GPU device code, the SMs execute the GPU code, and the SM stream processors record the same metrics as the CPU side.

power emulation

A pwr + McPAT power model is used to implement the libpeq simulation. Component categories can model the L1I and so on, and there are McPAT and GStats counters updated in real time. For now only SRAM and cache models exist; there is no row-buffer or leakage modeling, and everything is assumed to run correctly.

Reference

  1. https://masc.soe.ucsc.edu/esesc/resources/4-memory.pdf
  2. https://masc.soe.ucsc.edu/esesc/resources/5-powermodel%20.pdf
  3. https://masc.soe.ucsc.edu/esesc/resources/6-gpu.pdf