Micro-architectural Analysis of OLAP: Limitations and Opportunities

A good way to understand the runtime PMU metrics of OLAP systems is this paper, published in PVLDB. There is also a PM evaluation by ByteDance.

In-memory OLTP has been studied extensively for cache misses, especially HITM. My previous blog post describes my TPC-H experiments on MonetDB and PostgreSQL, which were not DRAM-bandwidth bound. It amazed me that references to remote NUMA memory could cause such a bound. According to the paper, only scan-intensive queries running on multiple cores can hit the memory bandwidth bound, while join-intensive queries suffer from latency-bound data cache stalls.

OLAP engines differ from transactional databases: most of them have a vectorized query planner and codegen for online analysis. Velox and arrowdb are worth a look.

The paper breaks down CPU cycles for both single-threaded and multi-threaded execution. The authors bind memory to a single NUMA socket and fully disable the prefetcher. DBMS C scales well, while the other databases deteriorate under multi-threading.
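A cycle breakdown of this kind can be sketched with the level-1 top-down formulas commonly used on Intel cores. This is only a minimal sketch: it assumes a 4-wide core and the usual top-down event categories, the `Counters` fields and sample numbers are hypothetical, and the paper's own tooling may compute its breakdown differently.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw PMU counts for one run. Field names are illustrative only."""
    cycles: int              # unhalted core cycles
    uops_issued: int         # uops issued by the front end
    uops_retired_slots: int  # issue slots that ended in a retired uop
    uops_not_delivered: int  # issue slots the front end left empty
    recovery_cycles: int     # cycles spent recovering from mis-speculation


def top_down_level1(c: Counters, width: int = 4) -> dict:
    """Split all issue slots into the four level-1 top-down categories."""
    # Assumption: the core can issue `width` uops per cycle.
    slots = width * c.cycles
    frontend = c.uops_not_delivered / slots
    bad_spec = (c.uops_issued - c.uops_retired_slots
                + width * c.recovery_cycles) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left is attributed to the back end (memory and core stalls).
    backend = 1.0 - frontend - bad_spec - retiring
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
sample = Counters(cycles=1000, uops_issued=1200, uops_retired_slots=1000,
                  uops_not_delivered=400, recovery_cycles=50)
breakdown = top_down_level1(sample)
```

A large `backend_bound` share is where memory-latency and memory-bandwidth stalls show up, which is the part the paper's discussion focuses on.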

Normalized response time breakdowns for Quickstep running the large join micro-benchmark query single-threaded, with and without Filter Join.

Only multi-threaded execution hits the bandwidth bound.
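A back-of-the-envelope way to see why a single thread stays below the bound: a scan saturates memory only once the combined demand of all threads reaches the socket's peak bandwidth. The function names and the bandwidth figures below are hypothetical, and the model ignores contention effects, so treat it as a rough sketch.

```python
import math


def threads_to_saturate(per_thread_gbps: float, socket_peak_gbps: float) -> int:
    """Smallest thread count whose combined demand reaches the peak bandwidth."""
    return math.ceil(socket_peak_gbps / per_thread_gbps)


def is_bandwidth_bound(threads: int, per_thread_gbps: float,
                       socket_peak_gbps: float) -> bool:
    """True when the threads together ask for at least the peak bandwidth."""
    return threads * per_thread_gbps >= socket_peak_gbps


# Hypothetical numbers: one scan thread pulls 8 GB/s, the socket peaks at 60 GB/s.
needed = threads_to_saturate(8.0, 60.0)
```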

TCUDB: Accelerating Database with Tensor Processors

Running a database on the GPU's tensor computing units (TCUs).

Claim

  1. The partitioned hash join algorithm is not matrix-friendly, so it is hard to rewrite for the TCU.
  2. The underlying data movement requires a different data organization.
  3. TCUs mostly operate on int8 or fp16, which is not accurate enough.

For key-value hash map storage, cuckoo hashing already exists on the GPU, and the data storage can borrow that kind of memory management. The insight is how to use the optimizer and codegen to turn every operator into matmul so that it can make use of the GPU.
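To make the "operator to matmul" idea concrete, here is a general illustration of how an equi-join can be phrased as a matrix product: one-hot encode the join keys of each table, multiply one matrix by the transpose of the other, and read the non-zero cells as matching row pairs. This is not TCUDB's actual kernel; all function names are made up for the example, and plain Python lists stand in for the int8/fp16 matrices a TCU would consume.

```python
def one_hot(keys, domain):
    """One row per tuple, one column per distinct key value."""
    index = {k: j for j, k in enumerate(domain)}
    return [[1 if index[k] == j else 0 for j in range(len(domain))]
            for k in keys]


def matmul_transposed(a, b):
    """Compute a @ b^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b]
            for row_a in a]


def join_via_matmul(left_keys, right_keys):
    """Return (left_row, right_row) index pairs whose join keys are equal."""
    domain = sorted(set(left_keys) | set(right_keys))
    product = matmul_transposed(one_hot(left_keys, domain),
                                one_hot(right_keys, domain))
    # Cell (i, j) is 1 exactly when left row i and right row j share a key.
    return [(i, j) for i, row in enumerate(product)
            for j, v in enumerate(row) if v]


pairs = join_via_matmul([1, 2, 2], [2, 3, 1])
```

Because the encoded matrices contain only 0 and 1, low-precision types are enough for the match itself; precision becomes a concern once aggregates over real values are folded into the product.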

Also, because a single GPU's VRAM is typically smaller than the CPU's DRAM, a working set size (WSS) estimate is needed to choose between the CPU plan and the GPU plan. They use MSplitGEMM to test the working set size, which is the upper bound of VRAM occupancy.
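The CPU-versus-GPU decision can be sketched as a static upper bound on the bytes a matrix multiplication needs resident at once. This is an assumption-laden sketch, not how MSplitGEMM measures it: the function names are hypothetical, and fp16 inputs with fp32 outputs are assumed.

```python
def matmul_working_set_bytes(m: int, k: int, n: int,
                             in_bytes: int = 2, out_bytes: int = 4) -> int:
    """Bytes for A (m x k), B (k x n) and the result C (m x n) held together.

    Assumption: fp16 inputs (2 bytes) and fp32 outputs (4 bytes).
    """
    return (m * k + k * n) * in_bytes + m * n * out_bytes


def choose_device(m: int, k: int, n: int, vram_bytes: int) -> str:
    """Pick the GPU plan only if the whole working set fits in VRAM."""
    return "gpu" if matmul_working_set_bytes(m, k, n) <= vram_bytes else "cpu"


VRAM_16_GIB = 16 * 2**30  # hypothetical card size
small = choose_device(1024, 1024, 1024, VRAM_16_GIB)
large = choose_device(10**6, 10**3, 10**6, VRAM_16_GIB)
```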


Supported query planner



The query planner distinguishes UNCOMPRESSED/COMPRESSED MEM/PINNED/MMAP, and assesses data movement to decide whether to compress the data or migrate the work to the CPU.

Here, "compressed" data means data stored in cuckoo-hashed form.
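For readers unfamiliar with cuckoo hashing, here is a minimal CPU-side sketch of the idea: every key has one candidate slot in each of two tables, and an insert evicts whatever occupies its slot and moves that entry to its alternate table. This is a generic illustration, not TCUDB's GPU layout; the class name and parameters are made up.

```python
class CuckooHash:
    """Two-table cuckoo hash map: a key lives in one of two candidate slots."""

    def __init__(self, capacity: int = 8, max_kicks: int = 32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.tables = [[None] * capacity, [None] * capacity]

    def _slot(self, which: int, key) -> int:
        # Two hash functions, derived by salting Python's hash() with the table id.
        return hash((which, key)) % self.capacity

    def get(self, key, default=None):
        # A lookup probes at most two slots.
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value) -> None:
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            idx = self._slot(which, key)
            entry = self.tables[which][idx]
            if entry is not None and entry[0] == key:
                self.tables[which][idx] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            idx = self._slot(which, entry[0])
            # Place the entry and pick up whatever was in the slot before.
            entry, self.tables[which][idx] = self.tables[which][idx], entry
            if entry is None:
                return
            which = 1 - which  # the evicted entry tries its other table
        self._grow(entry)  # too many evictions: rebuild with more room

    def _grow(self, pending) -> None:
        old = [e for table in self.tables for e in table if e is not None]
        old.append(pending)
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.put(k, v)


table = CuckooHash()
for i in range(100):
    table.put(i, i * i)
```

The fixed two-probe lookup is the property that suits GPUs, since every thread does the same bounded amount of work per key.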

Matrix Multiplication, Entity Matching, and PageRank perform better because they leverage data kept online in GPU VRAM.

The fault tolerance of data on the GPU cannot be guaranteed; for more functionality, I think it still requires a DPU for storage, or disaggregating GPU VRAM to a memory expander.

Possible solutions to 443 and 80 hijack

While I was updating my CentOS VPS to the latest nginx and kernel, the server weirdly stopped accepting connections on 443 and 80.

Using Wireshark, I found connections to my SOCKS server on port 1082 from multiple IPs, which looked like scanning and a SYN attack.

69	1.379581	104.149.139.86	82.102.27.93	TCP	54	1082 → 42168 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
122	2.078443	104.149.139.86	185.174.159.18	TCP	54	1082 → 57925 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
763	13.859168	104.149.139.86	185.174.159.18	TCP	54	1082 → 52535 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0
1412	23.098894	82.102.27.93	104.149.139.86	TCP	74	SuperMic_42:b7:48
198	5.122016	104.149.139.86	212.109.221.254	TCP	54	JuniperN_bb:05:01
1352	29.871004	99.84.252.117	104.149.139.86	TLSv1.3	212	SuperMic_42:b7:48
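To spot which remote hosts are hammering the port, the exported rows can be tallied per peer. The sketch below assumes whitespace-separated rows in the column order No., Time, Source, Destination, Protocol, Length, Info, and counts the resets the server sent back; the function name is made up, and the sample rows are the first three from the capture above.

```python
from collections import Counter


def rst_peers(rows, server_ip):
    """Count RST packets sent by server_ip, grouped by destination peer."""
    counts = Counter()
    for row in rows:
        fields = row.split()
        if len(fields) < 7:
            continue  # not a full packet row
        src, dst, info = fields[2], fields[3], " ".join(fields[6:])
        if src == server_ip and "RST" in info:
            counts[dst] += 1
    return counts


capture = [
    "69 1.379581 104.149.139.86 82.102.27.93 TCP 54 "
    "1082 → 42168 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0",
    "122 2.078443 104.149.139.86 185.174.159.18 TCP 54 "
    "1082 → 57925 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0",
    "763 13.859168 104.149.139.86 185.174.159.18 TCP 54 "
    "1082 → 52535 [RST, ACK] Seq=1 Ack=1 Win=0 Len=0",
]
peers = rst_peers(capture, "104.149.139.86")
```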

I realized that the number of open files is capped by ulimit -n (the maximum number of files open at once), which also limits the accept4 syscall. The limit had been reset by the kernel update. Some nginx upload file limits may also be a result of this. After setting it to 65534, the 443 and 80 hijack no longer occurred.
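The limit can also be inspected and raised programmatically. The sketch below uses Python's standard `resource` module on a Unix-like system; the helper name is made up, and it only affects the current process and its children. For nginx itself the limit has to be raised for the service, for example with the `worker_rlimit_nofile` directive.

```python
import resource


def raise_nofile_limit(target: int = 65534):
    """Raise the soft open-file limit towards target, never above the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    wanted = target if hard == unlimited else min(target, hard)
    # Only ever raise the limit; leave it alone if it is already high enough.
    if soft != unlimited and soft < wanted:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
        except (ValueError, OSError):
            pass  # the platform refused; keep the current limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)


before = resource.getrlimit(resource.RLIMIT_NOFILE)
after = raise_nofile_limit()
```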

yyw's 2022 Year-End Summary

This has been a surreal year. With the Fall 22 exams done and the year about to end, here are a few words of year-end summary.

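At the start of the year I wanted to intern at Alibaba Cloud working on operating systems, but I missed the deadline and it fell through. The courses I picked at the beginning of the year went fine at first. Then the city locked down, and my wife and I lived together near her company on Yanggao Middle Road, a not-so-wonderful stretch of getting each other through hard times. Every day was a frenzy of chasing due dates, but I finally muddled through my four years of college. In April I learned that I would be going to UCSC to work with Andrew Quinn. I now feel this is a place where industry and academia are tightly connected, where you won't build behind closed doors something industry would never use. It really matched everything I had imagined back then. I made three videos at the time; the last one was about whether a PhD is worth it. My view is that if I can't make 100 million before 30, I may as well start at 30. Besides, the economy isn't great right now, and joining a unicorn doesn't casually bring financial freedom either.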

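On May 25 I went to Guangzhou for my visa and ran into many high school classmates. I feel that what my high school classmates taught me was a zero-sum game: someone who can clearly score higher on the gaokao gets to take more resources out of your hands, and they were people who would stop at nothing to grab resources, aloof and arrogant, unwilling to share with you. I don't think that is healthy competition. But after fighting their way through Tsinghua, the UM-SJTU Joint Institute, and the "Tongji brothel", they too came to see that everyone has their own aspirations. Indeed, I am the only one from my high school pursuing a PhD in CS. After Guangzhou I went to Zhuhai, Shenzhen, Xiangtan, Huaihua, and Zhangjiajie. Was it really a journey back to my roots? It was the last time I saw my maternal grandfather. I'm glad I got to visit him once.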

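Going abroad has been a wonderful experience. I first met my current advisor on July 1, having just gone from a place under lockdown to a place where I could express myself, from an academic backwater to the mecca of Silicon Valley; everything was fresh and beautiful. Of course the current change of course is a good one; it just came too late. I feel I gave up Shanghai's modern civilization and a petit-bourgeois life to come to a place I considered countryside back in high school, which is why I didn't choose to come here for college. Yet at the same time I am free to explore architectures, operating systems, and HW-SW codesign that have never existed in this world, and that excites me enormously. What I do is not much different from college, but this is an adventurous life decision. I still chose to lead the UCSC supercomputing team and to talk with people and with sponsors. I still chose to design coursework and work as a TA. I still chose the most cutting-edge courses.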

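This year I attended OSDI/ATC/PLDI/POPL/ISSTA/EuroSys/ISCA online. I find some of the work PhD students do very impressive; some of it is exploratory, and some is assembly-line style. If I can publish three really strong pieces of work during my PhD, I will be very satisfied.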