Probabilistic Concurrency Testing for Weak Memory Programs
A PCT framework: assert against SC (sequential consistency) specifications to find bugs.
the World with BASTION
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
MR-LRC: every group has local redundancy plus global redundancy.
Use Cauchy LRCs to predict the distance.
96-105 code chunks per stripe; reliability evaluated for >= 6 concurrent failures via time-to-recover and mean-time-to-data-loss.
VAST Data's approach? LDC: workloads matter.
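To make "local redundancy per group plus global redundancy" concrete, here is a small sketch (mine, not from the paper) of the usual maximally recoverable LRC criterion, assuming `a` local parities per group and `h` global parities; the group sizes and numbers below are made up for illustration.

```python
# Hypothetical MR-LRC recoverability check; parameters are illustrative only.
def mr_lrc_recoverable(erased_per_group, a, h):
    """erased_per_group: erasure count per local group.
    a: local parities per group, h: global parities.
    An MR-LRC can correct any pattern where, after each group repairs up to
    `a` erasures locally, at most `h` erasures remain for the global parities."""
    leftover = sum(max(e - a, 0) for e in erased_per_group)
    return leftover <= h

# Example: 4 groups, 1 local parity each (a=1), 2 global parities (h=2).
print(mr_lrc_recoverable([1, 1, 1, 1], a=1, h=2))  # True: locals handle everything
print(mr_lrc_recoverable([2, 2, 0, 0], a=1, h=2))  # True: 2 leftovers <= h
print(mr_lrc_recoverable([3, 2, 1, 0], a=1, h=2))  # False: 3 leftovers > h
```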
ParaRC: Embracing Sub-Packetization for Repair Parallelization in MSR-Coded Storage
Repair penalty of RS Code
Reduce bandwidth: amount of traffic transferred in the network
Maximum repair load (MRL).
For the SoTA (4,2) Clay code, the block size is 256 MB while bandwidth and MRL are both 384 MB.
Repair can be pipelined by chunking the block into smaller pieces, giving (bw, MRL) => (512, 256); the drawback is that the repair operation must be additive/associative. Alternatively, we can sub-packetize the block.
pECDAG to parallelize repair for Clay codes.
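A quick back-of-the-envelope check of the numbers above, assuming "(4,2)" means n=4 total blocks with k=2 data blocks, B=256 MB per block, and d=3 repair helpers for the Clay/MSR code (this parameter reading is my assumption, not stated in the note):

```python
# Rough repair-cost arithmetic; the parameter interpretation is an assumption.
B, k, d = 256, 2, 3   # MB per block, data blocks, repair helpers

# RS: the replacement node reads k full blocks.
rs_bw = k * B                  # total network traffic
rs_mrl = k * B                 # all of it converges on the repair node
print("RS       bw=%d MB  MRL=%d MB" % (rs_bw, rs_mrl))    # 512, 512

# Pipelined RS repair: traffic is unchanged, but chunking the block and
# streaming partial sums spreads the load, so no node handles more than B.
print("RS+pipe  bw=%d MB  MRL=%d MB" % (rs_bw, B))          # 512, 256

# MSR (Clay): each of d helpers sends B/(d-k+1).
clay_bw = d * B // (d - k + 1)
print("Clay     bw=%d MB  MRL=%d MB" % (clay_bw, clay_bw))  # 384, 384
```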
InftyDedup: Scalable and Cost-Effective Cloud Tiering with Deduplication
Cloud tiering of backup data requires deduplication at the cloud hypervisor level. They get the fingerprints and semantics from the cloud tier and send the requests to the local tier.
The GC algorithm is basically BatchDedup.
Fingerprint indexing in SSD.
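For reference, a minimal sketch of the fingerprint-indexing idea (generic content-addressed dedup, not InftyDedup's actual on-SSD index or its BatchDedup-style GC):

```python
import hashlib

# Toy fingerprint index for dedup; the real system keeps index and GC in the
# cloud tier, with SSD-resident fingerprint indexing.
store = []        # stand-in for the backing chunk store
index = {}        # fingerprint -> offset in `store`

def dedup_write(chunk: bytes) -> int:
    fp = hashlib.sha256(chunk).hexdigest()
    if fp in index:              # duplicate: just reference the existing copy
        return index[fp]
    store.append(chunk)          # unique chunk: persist it
    index[fp] = len(store) - 1
    return index[fp]

print(dedup_write(b"hello"), dedup_write(b"hello"), dedup_write(b"world"))
# -> 0 0 1  (the second "hello" is deduplicated)
```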
Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems
Peer evaluation with a sliding window.
A time-series model predicts when fail-slow will occur: latency vs. throughput plus linear regression.
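A toy version of the latency-vs-throughput regression idea, as I read it from the note (Perseus's real model also uses polynomial fitting and outlier pruning; this is just the skeleton): fit latency against throughput over one sliding window of peer-drive samples and flag drives whose latency sits far above the fitted line.

```python
import numpy as np

def fail_slow_outliers(samples, k=3.0):
    """samples: list of (drive_id, throughput, latency) from one sliding window.
    Fit a linear latency-vs-throughput baseline across peer drives and flag
    drives whose mean residual exceeds k standard deviations."""
    tput = np.array([s[1] for s in samples], dtype=float)
    lat  = np.array([s[2] for s in samples], dtype=float)
    slope, intercept = np.polyfit(tput, lat, deg=1)    # peer-wide baseline
    resid = lat - (slope * tput + intercept)
    thresh = k * resid.std()
    suspects = set()
    for drive in {s[0] for s in samples}:
        r = resid[[i for i, s in enumerate(samples) if s[0] == drive]]
        if r.mean() > thresh:                          # consistently above peers
            suspects.add(drive)
    return suspects
```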
Key-Value Stores
ADOC: Automatically Harmonizing Dataflow Between Components in Log-Structured Key-Value Stores for Improved Performance
Learns RocksDB L0-L1 compaction behavior (data overflow) on different SSDs from historical data.
I feel it is not done as well as "Kill Two Birds with One Stone: Auto-tuning RocksDB for High Bandwidth and Low Latency".
Does RocksDB increase the memtable size on its own?
Does RDO control compaction? Compaction across all levels.
RDO with an MMO encoder / Transformer? Future work.
FUSEE: A Fully Memory-Disaggregated Key-Value Store
Primary replication and write-write conflict resolution: the last writer (majority writer) wins, and the other replicas are then updated accordingly.
Compared with an MDS-based design, the key is hashed for direct RDMA access to the result, so a read takes only 1 RTT.
Embedded operation log.
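My rough reading of the conflict-resolution note as code (a sketch, not FUSEE's actual protocol): on a write-write conflict, the value held by the majority of replicas wins, and the client then overwrites the disagreeing replicas so all copies converge.

```python
from collections import Counter

def resolve_write_conflict(replicas):
    """replicas: values read back from each metadata replica after concurrent
    writes. Pick the majority value as the winner and report which replicas
    need to be rewritten (toy model of 'last/majority writer wins')."""
    winner, _ = Counter(replicas).most_common(1)[0]
    losers = [i for i, v in enumerate(replicas) if v != winner]
    return winner, losers

winner, to_fix = resolve_write_conflict(["v2", "v2", "v1"])
print(winner, to_fix)   # v2 [2] -> client issues writes to bring replica 2 in line
```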
ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems
Learned-index data movement and allocation require recomputing the decoupled learned indexes. The data leaves are kept at a fixed size by the retraining-decoupled algorithm (mathematically), which makes migration easier.
Consistency guarantee: I feel synchronizing the model requires quite a few RDMA operations.
A bit used as a lock? How do you recover from a stuck lock? Future work.
Always update to the new model on every read? First check the first entry of the SLT, and only update the model if it has changed.
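For intuition, a minimal sketch of the fixed-size-leaf learned-index idea (a generic learned index, not ROLEX's actual layout or RDMA path; `LEAF_SIZE` and the single linear segment are assumptions): a linear model predicts a position, and because leaves are fixed-size and the model error is bounded, the lookup only scans a small range.

```python
import bisect

LEAF_SIZE = 8          # fixed-size data leaves (assumed for illustration)

class ToyLearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit one linear segment: position ~ slope * key + bias.
        self.slope = (n - 1) / (self.keys[-1] - self.keys[0])
        self.bias = -self.slope * self.keys[0]
        # Worst-case model error -> how far around the prediction to probe.
        self.err = max(abs(i - (self.slope * k + self.bias))
                       for i, k in enumerate(self.keys))

    def lookup(self, key):
        pos = int(self.slope * key + self.bias)
        lo = max(0, int(pos - self.err) // LEAF_SIZE * LEAF_SIZE)
        hi = min(len(self.keys), int(pos + self.err) + LEAF_SIZE)
        i = bisect.bisect_left(self.keys, key, lo, hi)   # bounded search
        return i < hi and self.keys[i] == key

idx = ToyLearnedIndex(range(0, 1000, 3))
print(idx.lookup(300), idx.lookup(301))   # True False
```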
AI and Storage
GL-Cache: Group-level learning for efficient and high-performance caching
XGBoost learns a utility-over-time metric, and groups are evicted by that utility. This breaks down when two workloads share one cache. Also, when the cache is very large, I think the prediction is less efficient than Segcache with manual TTLs. And I think a Transformer predicting TTLs would do better, because XGBoost only captures signals and has no real predictive power. The metric itself also seems to have been found by trial and error.
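Roughly, the group-level eviction loop looks like the sketch below; the utility formula here is a placeholder of mine (hits per byte, decayed by age), whereas GL-Cache predicts a learned utility from group features with gradient-boosted trees.

```python
import time

class Group:
    """A group of cached objects that is admitted and evicted as a unit."""
    def __init__(self, objs):
        self.objs = objs                      # {key: size_in_bytes}
        self.hits = 0
        self.created = time.time()

    def utility(self):
        # Placeholder utility: hit density per byte, decayed by age.
        # GL-Cache instead *predicts* group utility with a learned model.
        age = max(time.time() - self.created, 1e-6)
        size = sum(self.objs.values())
        return self.hits / (size * age)

def evict_one(groups):
    """Evict the group with the lowest (learned or heuristic) utility."""
    victim = min(groups, key=Group.utility)
    groups.remove(victim)
    return victim
```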
SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training
This one is by a fellow student I met at SC.
Intelligent Resource Scheduling for Co-located Latency-critical Services: A Multi-Model Collaborative Learning Approach
File Systems
CJFS: Concurrent Journaling for Better Scalability
Multi-version shadow paging.
Compound flush: when a transaction finishes, the cache (DMA, which could perhaps be hijacked via CXL?) barrier sends the interrupt to the storage device.
Unsafe at Any Copy: Name Collisions from Mixing Case Sensitivities
Name collisions; there was a Git CVE about this before.
ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit
Software-engineering-style bug hunting, which I find rather dull: mutate the configuration (mainly the allocator's memory layout / type information / 64-bit?), then run taint analysis to find these three classes of bugs.
Compared with Hydra.
HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers
MadFS, running on the Wuxi supercomputer.
Consolidates the links from userspace.
Fisc: A Large-scale Cloud-native-oriented File System
By someone from the NJU networking group who moved to Pangu. Alibaba's Double 11 now runs entirely on it.
They essentially implement a lightweight file system client to improve resource multiplexing with two-layer resource aggregation; each computation/storage vRPC agent proxy does load balancing.
(Rust-like naming: phantom data is used to label an undeclared type.) Here it is used to lift object-level data movement into software-defined data movement for different workloads.
They have callback functions for how it communicates with the dataflow fabric: the hardware scheduler lets the engine invoke onMiss(), onEviction(), and writeback(). They simply express each operation as SHARED and PRIVATE state changes, and I don't think these three callbacks alone can make memory ordering correct in Morph.
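For reference, the three callbacks from the note correspond roughly to an engine interface like the sketch below (a Python stand-in for what is really a hardware-dispatched mechanism; the comments about SHARED/PRIVATE visibility reflect my reading, not the paper's wording):

```python
# Sketch of the callback surface only; real engines run on the cache
# hierarchy's dataflow fabric, not in software.
class DataMovementEngine:
    def on_miss(self, addr):
        """Invoked by the hardware scheduler on a cache miss: decide what to
        fetch or prefetch, and at what granularity."""
        raise NotImplementedError

    def on_eviction(self, addr, dirty):
        """Invoked when a line is chosen for eviction: decide where it goes
        (compress, demote, drop, ...)."""
        raise NotImplementedError

    def writeback(self, addr, data):
        """Invoked on writeback. Each hook only observes SHARED/PRIVATE state
        transitions, which is why cross-core ordering is the hard part."""
        raise NotImplementedError
```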
In terms of power saving, my view is that saving energy with PIM or a modified iMC means the core and the MC no longer need to communicate as heavily, while dataflow-based analysis inside the iMC/NoC may intrinsically reduce traffic and thus provide an energy-efficient solution.
However, this type of design fully exposes the memory to attackers via speculation and Rowhammer, so it would definitely have to be presented to the user as a black box if it were ever made commercially available.
The main novelty is that Linux's CFS is essentially a weighted fair queuing algorithm: CFS time-shares the CPU by keeping runnable threads in a red-black tree ordered by vruntime. Without heavy I/O, it can perform better than the old O(1) scheduler. Each CPU's run queue cfs_rq maintains a min_vruntime field recording the minimum vruntime of all processes in that queue; the initial vruntime of a new process is set based on the min_vruntime of the run queue it joins, keeping it within a reasonable gap of the existing processes.
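A toy model of that vruntime bookkeeping (greatly simplified: a heap stands in for the red-black tree, and the kernel's exact min_vruntime clamping and preemption rules are omitted; weights use the real nice-0 value of 1024 but everything else is illustrative):

```python
import heapq

NICE_0_WEIGHT = 1024          # weight of a nice-0 task in the kernel

class RunQueue:
    def __init__(self):
        self.heap = []                      # entries: (vruntime, name, weight)
        self.min_vruntime = 0.0

    def enqueue(self, name, weight, vruntime=None):
        # New or woken tasks start near min_vruntime, so they neither starve
        # existing tasks nor get starved themselves.
        if vruntime is None:
            vruntime = self.min_vruntime
        heapq.heappush(self.heap, (vruntime, name, weight))

    def run_leftmost(self, delta_ms):
        # Pick the smallest-vruntime task (the "leftmost" rbtree node),
        # charge it wall time scaled inversely by its weight, and requeue it.
        vruntime, name, weight = heapq.heappop(self.heap)
        vruntime += delta_ms * NICE_0_WEIGHT / weight
        heapq.heappush(self.heap, (vruntime, name, weight))
        # min_vruntime tracks the leftmost entry and only moves forward.
        self.min_vruntime = max(self.min_vruntime, self.heap[0][0])
        return name

rq = RunQueue()
rq.enqueue("editor", weight=1024)     # nice 0
rq.enqueue("batch", weight=335)       # roughly nice +5
# The editor gets picked roughly in proportion to its weight (~3:1 over time).
print([rq.run_leftmost(4) for _ in range(6)])
```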
Pros
A dormant process has its vruntime compensated when it wakes up, so it is likely to grab the CPU soon after waking. This is the intent of CFS: guarantee the responsiveness of interactive processes, which sleep frequently while waiting for user input.
Imagine an interactive operation such as hitting the keyboard or moving the mouse: to the system this is a new task arriving -> runtime is 0 -> vruntime is (near) 0 -> it becomes the leftmost node of the run queue's red-black tree -> the leftmost node is tracked by a dedicated cached pointer, so it is picked next.
Cons
Priority failure. For example, take two processes: process A has nice 0 and an allocated time slice of 100 ms, process B has nice +20 and an allocated time slice of 5 ms. With a 5 ms scheduling granularity, after A runs for 5 ms it switches to B, which also runs for 5 ms, so within 10 ms A and B run for the same amount of time, and likewise over the full 100 ms. This is clearly inconsistent with the time slices we allocated. (At least that is my reading; I went over it several times before it clicked.)
Relative nice value. For example, if two processes have nice values 0 and 1, they are allocated time slices of 100 ms and 95 ms respectively; but if their nice values are 18 and 19, they get 10 ms and 5 ms, which is less efficient for the CPU as a whole (see the weight sketch after these cons).
The absolute time slice has to align with the timer tick. When a time slice expires the process must be switched out, and a timer interrupt is what triggers that, so the time slices of the two processes cannot be split arbitrarily finely because they must land on timer ticks. (This can be avoided by accounting with a separate value decoupled from the timer tick.)
The priority-adjustment problem: the scheduler wants to raise the priority of a newly woken process so that it gets up and running faster.
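These cons are largely what CFS's weight-based design fixes: instead of handing out absolute time slices, each nice level maps to a weight roughly 1.25x apart (the kernel's sched_prio_to_weight table, approximated below), so a one-level nice difference always yields about the same relative CPU share, whether it is 0 vs 1 or 18 vs 19. A quick check of that claim:

```python
# Approximate the kernel's sched_prio_to_weight table: each nice step changes
# the weight by ~1.25x, with nice 0 pinned to 1024. (Approximation, not the
# exact kernel table values.)
def weight(nice):
    return 1024 / (1.25 ** nice)

def cpu_share(nice_a, nice_b):
    wa, wb = weight(nice_a), weight(nice_b)
    return wa / (wa + wb)

# Only the relative nice distance matters, not the absolute values:
print(round(cpu_share(0, 1), 3))    # ~0.556
print(round(cpu_share(18, 19), 3))  # ~0.556  (same split as 0 vs 1)
print(round(cpu_share(0, 5), 3))    # ~0.753  (bigger gap -> bigger share)
```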
There are no crashes or out-of-memory situations, and tools like htop, sar, or perf cannot detect the missing short idle periods, which makes the issues reported in this work hard to spot. The authors' first tool is a "sanity checker": it verifies that no core sits idle while waiting threads sit in the run queue of a different core, tolerating the condition briefly but raising an alert if it persists. The second tool is a visualizer that displays scheduling activity over time, so the number of run queues, their total load, and the cores considered during periodic load balancing and thread wakeups can all be profiled and plotted.

Scheduling, in the sense of dividing CPU time among threads, was once considered a solved problem. The paper shows this is untrue: a straightforward scheduling principle produced a highly intricate, bug-prone implementation in order to accommodate modern hardware, and scheduling waiting threads onto idle cores turns out to break a fundamental work-conserving invariant. Runnable threads can sit in run queues for seconds while cores idle, causing significant drops in application performance. By their nature these bugs are difficult to find with standard tools. The authors fix the flaws, identify their underlying causes, and provide tools that make finding and correcting such bugs much simpler.

As for the future: the Linux kernel is a universal codebase, so it is hard for the community to focus on multi-core issues alone; the community cares most about maintainability, not performance. If a new feature runs fine on a 128 MB i386 machine, that is considered OK, and as long as no more than 80% of people hit a new problem, the community never cares; for the same reason the community keeps introducing bugs, which is rather lamentable. My opinion is that the community is a community of programmers who take the code as the criterion and pay little attention to architecture or genuinely new features, which are left to vendors. The same goes for TCP, which I have always criticized: people focus on the TCP implementation itself, which makes it more and more complex and then more and more fragile. You might call that evolution, but isn't there a chance to go back to the drawing board before all hell breaks loose? It has not evolved to the point where it must keep evolving, has it? Viewed from the outside, with some mandatory measures in place, the garbage in TCP would likely have been gone long ago.
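As a concrete picture of the invariant the sanity checker enforces (my userspace approximation, not the authors' tool): a core may be idle momentarily while another core has waiting threads, but if that state persists past a grace period it is a work-conservation violation worth flagging.

```python
def check_work_conserving(samples, grace_periods=2):
    """samples: list of snapshots, each a list of per-core runnable-thread
    counts (including the currently running one). Flag a violation if some
    core stays idle while another core has waiting threads for more than
    `grace_periods` consecutive snapshots."""
    streak = 0
    for counts in samples:
        idle = any(c == 0 for c in counts)       # a core with nothing to run
        backlog = any(c >= 2 for c in counts)    # a core with waiting threads
        streak = streak + 1 if (idle and backlog) else 0
        if streak > grace_periods:
            return True
    return False

# Core 0 idles for 4 snapshots while core 1 keeps 3 runnable threads queued.
print(check_work_conserving([[0, 3]] * 4))   # True  -> alert
print(check_work_conserving([[1, 1]] * 4))   # False -> balanced
```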