Using the Lustre File System

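Recently I have been helping a senior labmate run experiments, which also serve as the experiments for my graduation thesis, and they use Lustre. I also reread the old PLFS and PMFS papers. The runs are on an AMD supercomputing cluster that can scale up to 512 nodes.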

[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public1
Disk quotas for usr scb5090 (uid 6171):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
       /public1  19.13G    450G    500G       -   50865       0       0       -
uid 6171 is using default file quota setting
[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public2
lfs quota: cannot resolve path '/public2': No such file or directory (2)
[scb5090@ln131%bscc-a6 ~]$ lfs quota -h -u scb5090 /public3
Disk quotas for usr scb5090 (uid 6171):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
       /public3      0k      0k      0k       -       0       0       0       -
uid 6171 is using default block quota setting
uid 6171 is using default file quota setting

Roughly, the goal is to study the scalability of a program that reads HDF5 on a parallel file system.

Phosphor - My Pitfalls writing dependency

Currently, I'm busy writing emails for my Ph.D., taking the TOEFL, and working on the Quantum ESPRESSO library changes and the MadFS optimization, so this may take some time. For now, I have to apply Phosphor's DTA (dynamic taint analysis) tool to the Java order-dependency project.

About Surefire integration into normal tests.

  • Maven extension
    • Integrate into Maven and add the redirector
      • Insert the Phosphor plugin into the classes one by one.
      • Configuration for Phosphor
      • Class Visitor, Method Visitor, Adaptor Mode Visitor
    • Mutable field in the Dependency Tainter
      • Start tainting at some point and attach the taint check after the test
      • Assert the JUnit stuff in the check comparison.
      • Brittle assertions in check(Taint) recursively.
    • Output the tainted version into the Surefire executable folder
  • Debug
    • mvn install -Dmaven.surefire.debug -f /Volumes/DataCorrupted/project/UIUC/bramble/integration-tests/pom.xml and attach the trace point.
      • Start from the maven compilation.

Brittle Assertion

This outputs only the dependency for one test, as introduced in Oracle Polish on JPF. For the dependency between test1 and test2,

For NPEs, get the test pair with iDFlakies first.

JVM ASM

Reference

  1. https://www.kingkk.com/2020/08/ASM%E5%8E%86%E9%99%A9%E8%AE%B0/

NVOverlay: Enabling Efficient and Scalable High-Frequency Snapshotting to NVM

NVOverlay is a technique for taking fast snapshots of DRAM or cache contents and making them persistent. It relies on a tracking technique similar to the one that commercially available VMware or VirtualBox use on storage. It also uses NVM mapping to reduce write amplification compared with state-of-the-art log-based snapshotting. (Undo logging, which writes data to NVM before it is updated, or redo logging can add write amplification. To be specific, this is not the XPBuffer write amplification; the log itself adds more data to write.)

So-called high-frequency snapshotting copies all data that may have changed at millisecond intervals while the CPU loads from and stores to DRAM. A microservice thread may need many random accesses to MVCC data, especially time-series data. To better debug the loads and stores of such threads, the copying process should be fast and scalable.


Here, OMC means overlay memory controller.

Cache coherency is considered in depth. To scale to 4U or 8U chassis, they add a tag walk that stores the local LLC tags. We know that all LLC slices are VIPT because they are shared. For the same reason, the tags can be shared while staying unique within one shared space.

For the distributed, system-wide problem of synchronizing epoch counters between VDs, they use a Lamport clock to maintain the integrity of the dirty cache.
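As a rough illustration of the Lamport clock idea mentioned here, the following is a minimal software sketch of the textbook rule, not NVOverlay's hardware mechanism. The class name `LamportClock`, the variables `vd0` and `vd1`, and the mapping of "epoch advance" to a local tick are assumptions made only for this example.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # Local event (assumed here to stand for a local epoch advance).
        self.time += 1
        return self.time

    def send(self) -> int:
        # Advance the local time and use it to stamp an outgoing message.
        return self.tick()

    def receive(self, remote_time: int) -> int:
        # Merge rule: move past whichever is larger, local or remote time.
        self.time = max(self.time, remote_time) + 1
        return self.time


# Two hypothetical VDs: vd0 runs ahead, then a message pulls vd1 forward.
vd0, vd1 = LamportClock(), LamportClock()
for _ in range(3):
    vd0.tick()
stamp = vd0.send()
vd1.receive(stamp)
```

The merge rule guarantees that a receiver's counter is always larger than the stamp it received, so a slower counter can never fall behind an event it has already observed.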


Proposal for *An online systematic scheduling algorithm over Distributed IO Systems.*

In resource allocation for distributed systems on high-performance computers, we do not really know which device, such as a disk or a NIC (network interface card), is more likely to be worn out, or is currently off duty and may need some delay before its data is ready. Current solutions use random or round-robin scheduling to avoid wear, and dynamic routing for the fastest speed. We can use the collected data to automate this.

An experienced system administrator may know the patterns of the parameters to tweak, such as the stride on distributed file systems, the network MTU for InfiniBand cards, and the route used to fetch data. Today, eBPF (extended Berkeley Packet Filter) can record information such as the I/O latency on storage nodes and the network latency across the topology as time-series data. We can use this data to predict which topology, stride, and other parameters give the best way to fetch data.

The data arrives online, so the prediction function can be online reinforcement learning. As in a k-armed bandit, the reward can be a function of latency gains and a device-wear parameter. The updates can come from real-time latency for disks and networks. The information given to the RL agent can include where the data is located on disk, which data is sought more frequently (DBMS queries or random small files), and how often each disk fails.
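To make the bandit framing concrete, here is a minimal epsilon-greedy sketch. Everything in it is an assumption for illustration: the arms stand for hypothetical candidate configurations (for example a stride and route combination), the `reward` weighting is made up, and the latency and wear numbers are synthetic stand-ins for what eBPF would measure, not results.

```python
import random


class EpsilonGreedyBandit:
    """k-armed bandit with epsilon-greedy selection and running-mean values."""

    def __init__(self, n_arms: int, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self) -> int:
        # Explore with probability epsilon, otherwise pick the best arm so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=self.values.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def reward(latency_ms: float, wear: float, wear_weight: float = 0.5) -> float:
    # Hypothetical reward: lower latency and lower wear are both better.
    return -latency_ms - wear_weight * wear


# Synthetic environment (placeholder numbers, not measurements).
mean_latency_ms = [5.0, 2.0, 8.0]
wear_level = [1.0, 1.0, 1.0]
noise = random.Random(1)

bandit = EpsilonGreedyBandit(n_arms=3)
for _ in range(200):
    arm = bandit.select()
    latency = mean_latency_ms[arm] + noise.gauss(0.0, 0.2)
    bandit.update(arm, reward(latency, wear_level[arm]))

best = max(range(3), key=bandit.values.__getitem__)
```

Because all rewards are negative and the value estimates start at zero, untried arms look optimistic, so every arm gets pulled early before the policy settles on the lowest-latency one.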

Benchmarks and evaluation can report the statistical gain in our system's latency and the overall disk wear after stress tests.

A Probability Bound Problem

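Last night I discussed an upper-bound problem with a friend from a previous internship. It would be great to have this kind of atmosphere during my future Ph.D. as well.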

It mainly comes down to one probability problem.

Given \(A \sim \mathrm{Binom}(n, p)\) and \(B \sim \mathrm{Binom}\left(\frac{A(A-1)}{2}, q\right)\), find \(H(B)\), the entropy of \(B\).

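The difficulty is handling a binomial distribution whose number of trials is itself binomially distributed. Intuitively, the entropy of the latter feels like something of the form \(\log(\log(\cdot))\) of the former, but expanding over A is too tedious. In Mathematica it can be written as \(P(B=i) = Sum[P(B=i|A=j)*P(A=j),{j,0,n}]\); for now this is the only approach I have come up with.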
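The marginalization \(P(B=i) = \sum_{j=0}^{n} P(B=i \mid A=j)\,P(A=j)\) can also be evaluated numerically. The sketch below is a brute-force check rather than a closed form; the function names `binom_pmf`, `pmf_B`, and `entropy` are hypothetical, and entropy is measured in bits.

```python
from math import comb, log2


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binom(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pmf_B(n: int, p: float, q: float) -> list:
    """Marginal pmf of B, where A ~ Binom(n, p) and B | A ~ Binom(A(A-1)/2, q)."""
    max_b = n * (n - 1) // 2
    pmf = [0.0] * (max_b + 1)
    for j in range(n + 1):
        pa = binom_pmf(j, n, p)   # P(A = j)
        m = j * (j - 1) // 2      # number of trials for B given A = j
        for i in range(m + 1):
            pmf[i] += pa * binom_pmf(i, m, q)
    return pmf


def entropy(pmf: list) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(x * log2(x) for x in pmf if x > 0)


# Sanity check: with p = 1 and n = 2, A is always 2, so B ~ Bernoulli(q).
h = entropy(pmf_B(2, 1.0, 0.5))
```

The double loop costs on the order of \(n^3\) pmf evaluations, which is fine for exploring how \(H(B)\) grows with small \(n\) but says nothing about the asymptotic form.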

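For this problem I looked through the book recommended in the first two lectures of an online information theory course. It covers similar properties of the binomial distribution, but the only thing it mentions is what happens as \(p\) varies: with n fixed, H(A) reaches its maximum at \(p = \frac12\). That is not much help here, though.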

The Algorithm Behind the Probability and Graph Theory

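The point of computing this is to design an algorithm that recovers this ER random graph, given that each query can only ask whether an edge exists within a small part of the graph.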

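This random graph is well known, and many probabilistic graph models build on it. It is also one way to derive lower bounds in TCS, a direction many people dream of working in.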
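A small sketch of the setting, under assumptions: the graph is sampled as an Erdős–Rényi \(G(n, q)\) graph, and a query takes a vertex subset and answers only whether at least one edge lies inside it. The exact query model is not spelled out above, so `has_edge_within` is one plausible reading, and both function names are hypothetical.

```python
import random
from itertools import combinations


def sample_er_graph(n: int, q: float, seed: int = 0) -> set:
    """Sample G(n, q): each of the n(n-1)/2 pairs is an edge with probability q."""
    rng = random.Random(seed)
    return {
        frozenset(pair)
        for pair in combinations(range(n), 2)
        if rng.random() < q
    }


def has_edge_within(edges: set, subset) -> bool:
    """Query oracle: is there at least one edge with both endpoints in subset?"""
    return any(
        frozenset(pair) in edges
        for pair in combinations(sorted(subset), 2)
    )


graph = sample_er_graph(n=8, q=0.3)
answer = has_edge_within(graph, {0, 1, 2})
```

A subset of \(a\) vertices covers \(a(a-1)/2\) candidate pairs, which matches the number of trials in the binomial for \(B\) in the problem above.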

On How Advances in Computer Architecture Affect Networking

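QbitAI (量子位) recently published another piece on an advance in computer architecture: TCAM, the so-called ternary content-addressable memory. From the Turing machine point of view, much of the foundation beneath the superstructure is still unsolved. Judging from how many startups are now doing truly business-oriented database and network-link optimization, computer architecture still has a lot left to explore. At the same time, are the new architectures really secure? For example, does the TPU data path have undetected parts that can be attacked? Most attacks come from the combination of software and hardware, and they pool the ingenuity of many engineers.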

Simplified Structure of a Switch


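This is a von Neumann architecture diagram with two elements removed. The outbound and input throughput of the switch are the obvious bottlenecks. Beyond that there is latency, which calls for innovation in main-memory performance or in packet transport protocols.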

Ternary Content-Addressable Memory (TCAM)

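Back when I wrote VB, I remember a slide parameter that used a ternary number to represent no movement, sliding up, and sliding down. This kind of 0, -1, 1 ternary was tried on Soviet computing equipment back in the day, but it ultimately failed. 0.5 might be a better encoding for an intermediate or metastable state, usable for fuzzy matching or for Not Set.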

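CAM is essentially a hardware method for data lookup: it reads and writes data at the same speed as RAM, and its lookups can match data relatively fuzzily.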
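The "fuzzy" match can be pictured as a third, don't-care state per bit. The sketch below models that in software with a value and a mask per entry. The class `TernaryCam` and its pattern syntax (`0`, `1`, `x`) are assumptions for illustration, and a real TCAM compares all entries in parallel in hardware rather than looping.

```python
class TernaryCam:
    """Software model of a TCAM: each bit of an entry is 0, 1, or don't care."""

    def __init__(self) -> None:
        self.entries = []  # (value, mask, result), in priority order

    def add(self, pattern: str, result) -> None:
        # Pattern uses '0', '1', and 'x' (don't care), for example "10xx".
        value = int(pattern.replace("x", "0"), 2)
        mask = int("".join("0" if c == "x" else "1" for c in pattern), 2)
        self.entries.append((value, mask, result))

    def lookup(self, key: int):
        # First matching entry wins; only the bits set in the mask are compared.
        for value, mask, result in self.entries:
            if (key & mask) == value:
                return result
        return None


cam = TernaryCam()
cam.add("10xx", "port A")      # matches any 4-bit key starting with 10
cam.add("xxxx", "default")     # matches everything, lowest priority
hit = cam.lookup(0b1011)
fallback = cam.lookup(0b0111)
```

Ordering the entries from most to least specific gives longest-prefix-style behavior, which is the usual reason switches and routers use this kind of memory.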

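Here the ARP protocol, which verifies data correctness from the header or the CRC, plays a big role. The idea is the same: whatever happens, once the data arrives, right or wrong, send it out as fast as possible and verify it after it gets there. (This is a bit like the gateway in a high-frequency trading architecture.)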

Reference

  1. Constant-time Alteration Ternary CAM with Scalable In-Memory Architecture
  2. How Ternary Content-Addressable Memory (TCAM) Works (original title: 三态内容寻址存储器(TCAM)工作原理)