This document will be updated in the wiki.
- CCD interconnect Metrics
- Since the CCDs are interconnected through EMIB and cache coherency is still directory-based, routed from the LLC to HBM and memory, the PMU provides opcodes like DRd/CRd and the TOR (Table of Requests) and IA (requests from cores) abstractions to record these metrics.
- All the CHA uncore PMU events start with UNC_CHA_TOR_INSERTS. The CHA tiles occupy the MSR blocks 0x2000-0x2180 on a 23-core machine; within each tile's block, offsets 2/3/4/5 are control registers and 8/9/a/b are result (counter) registers. It's easy to recover the CHA-id-to-OS-id mapping through an uncore memory-bandwidth write; see the first sketch after this list.
- Every local CCD mesh has 3 UPI links, 3 memory ports, and 1 PCIe & CXL port, so there are three types of traffic: IIO, PCIe & CXL, and UPI. In the MSR range 0x3000-0x30b0 (PCIe 1-4 / IIO 1-4), only one block has values on my machine (because there's no PCIe device attached); these blocks have 4 control registers, again with offsets 2/3/4/5 as control registers and 8/9/a/b as result registers. In the range 0x3400-0x34b0, only 6 blocks have values on my machine, which I think should be the UPI and iMC ports; those have only 2 control registers, with offsets 2/3 as control and 8/9 as result registers. A probing sketch for both ranges follows the CHA example below.
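Below is a minimal sketch of programming one CHA tile's TOR counter through /dev/cpu/*/msr, assuming the 0x2000 base, 0x10 stride, and 2/3/4/5-control / 8/9/a/b-counter layout observed above. Event code 0x35 is UNC_CHA_TOR_INSERTS, but the umask value here is only a placeholder; the real filter encoding for e.g. IA DRd misses must come from Intel's uncore performance monitoring guide.

```c
/* Minimal sketch: program one CHA tile's TOR_INSERTS counter via /dev/cpu/0/msr.
 * Assumptions beyond the notes above: msr kernel module loaded, run as root,
 * event code 0x35 = UNC_CHA_TOR_INSERTS, umask 0x21 is a PLACEHOLDER. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define CHA_BASE   0x2000  /* first CHA tile block, per the note above      */
#define CHA_STRIDE 0x10    /* one MSR block per tile                        */
#define CTL0       0x2     /* offsets 2..5 are control registers            */
#define CTR0       0x8     /* offsets 8..b are result (counter) registers   */

static uint64_t rdmsr(int fd, uint32_t msr) {
    uint64_t v = 0;
    pread(fd, &v, sizeof v, msr);  /* /dev/cpu/N/msr uses the MSR as offset */
    return v;
}

static void wrmsr(int fd, uint32_t msr, uint64_t v) {
    pwrite(fd, &v, sizeof v, msr);
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open msr"); return 1; }

    uint32_t tile = 0;  /* CHA tile 0 */
    uint32_t base = CHA_BASE + tile * CHA_STRIDE;

    /* event 0x35 = TOR_INSERTS; bit 22 enables the counter; the umask/filter
     * for a real selection (e.g. IA DRd misses) comes from the uncore guide */
    uint64_t ctl = 0x35 | (0x21ull << 8) | (1ull << 22);
    wrmsr(fd, base + CTL0, ctl);

    uint64_t before = rdmsr(fd, base + CTR0);
    sleep(1);  /* measurement window */
    uint64_t after = rdmsr(fd, base + CTR0);
    printf("CHA%u TOR inserts in 1s: %llu\n",
           tile, (unsigned long long)(after - before));
    close(fd);
    return 0;
}
```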
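And a companion probe for the 0x3000-0x30b0 (IIO/PCIe & CXL) and 0x3400-0x34b0 (UPI/iMC, per my guess above) blocks: it just reads counter offset 8 of each block and reports the non-zero ones, which is how the "only 1 has a value" / "only 6 have values" observations can be reproduced. The 0x10 stride and offset layout are assumptions carried over from the CHA blocks.

```c
/* Probe sketch: which MSR blocks in the two ranges hold non-zero counters?
 * Run as root with the msr module loaded; addresses are from the notes above. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static void probe_range(int fd, uint32_t lo, uint32_t hi, const char *name) {
    for (uint32_t base = lo; base <= hi; base += 0x10) {
        uint64_t v = 0;
        /* pread returns -1 if the MSR does not exist (#GP in the kernel) */
        if (pread(fd, &v, sizeof v, base + 0x8) == sizeof v && v != 0)
            printf("%s block 0x%04x: counter0 = %llu\n",
                   name, base, (unsigned long long)v);
    }
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }
    probe_range(fd, 0x3000, 0x30b0, "IIO/PCIe&CXL");
    probe_range(fd, 0x3400, 0x34b0, "UPI/iMC");
    close(fd);
    return 0;
}
```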
- CXL Traffic Metrics
- I think the current PMU events are not fully renamed for CXL.cache but are instead embedded in the core's metrics, like hitm, miss, snoop_none, etc.
- OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET: this metric counts all the code reads and RFOs that miss the L3 but are serviced by local CXL Type 2 memory or private DRAM, as in the sketch below. Remote requests count toward the remote counterpart metric.
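A hedged perf_event_open sketch for counting this event around a workload. OCR events are raw event 0x2A/0x2B with umask 0x01, and the request/response filter goes into config1 (MSR_OFFCORE_RSP); the filter value below is a placeholder that must be looked up in Intel's perfmon event files for your CPU. On a perf build that ships the matching event list, `perf stat -e OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET` resolves the encoding for you.

```c
/* Sketch: count OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET around a code region.
 * config1 is a PLACEHOLDER; fill in the OFFCORE_RSP value from the perfmon
 * event files for your CPU model. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config  = 0x012A;  /* event 0x2A, umask 0x01 (OCR family)         */
    attr.config1 = 0x0;     /* PLACEHOLDER: OFFCORE_RSP filter value       */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0 /*self*/, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... workload that misses L3 and hits local CXL/DRAM goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    read(fd, &count, sizeof count);
    printf("OCR.READS_TO_CORE.L3_MISS_LOCAL_SOCKET: %llu\n",
           (unsigned long long)count);
    close(fd);
    return 0;
}
```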
- Some guesses
- HBM2E serves as a non-inclusive, directory-based, direct-mapped L4 cache (manifested as the HBM M2M (mesh-to-memory) blocks and EDC/HBM channels), much like persistent memory in caching mode, using the extra tag bits for caching. It can absorb some of the CXL.cache requests if the LLC of each CCD is not big enough; see the toy mapping sketch after this list.
- AMD supports CXL 2.0's RAS and GPF, while Intel gives them up because (1) GPF is useless, and (2) it's the outer fabric manager's job, and Intel leaves that effort to the vendors.
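To make the direct-mapped L4 guess concrete, here is a toy index/tag computation: each DRAM physical address would map to exactly one HBM line, and the extra tag records which DRAM alias currently occupies it. The 64 GiB HBM size and 64-byte line are illustrative assumptions, not measured values.

```c
/* Toy sketch of the guessed direct-mapped HBM L4 placement. All sizes are
 * illustrative assumptions, not measured values. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES (64ull)                       /* cache line             */
#define HBM_BYTES  (64ull * 1024 * 1024 * 1024)  /* assumed 64 GiB HBM2E   */
#define NUM_SETS   (HBM_BYTES / LINE_BYTES)

int main(void) {
    uint64_t pa  = 0x123456789abcull;             /* some DRAM physical addr */
    uint64_t set = (pa / LINE_BYTES) % NUM_SETS;  /* the one HBM line it maps to */
    uint64_t tag = (pa / LINE_BYTES) / NUM_SETS;  /* alias id kept as extra tag  */
    printf("pa=0x%llx -> HBM set %llu, tag %llu\n",
           (unsigned long long)pa, (unsigned long long)set,
           (unsigned long long)tag);
    return 0;
}
```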