Recently, I was testing a bunch of memory-bound benchmarks with VTune and got some weird results. I'm running a set of micro-benchmarks to see how the memory bound metric breaks down into bandwidth and latency.
Manual
According to the configuration files shipped with VTune, a slot means an execution port of the pipeline, and a stall can mean either "not retired" or "not dispatched for execution". Memory Bound is calculated as follows:
Memory_Bound = Memory_Bound_Fraction * Backend_Bound
Memory_Bound_Fraction is the fraction of slots mentioned in the documentation. However, according to the top-down method described in the optimization manual, the memory-bound metric is relative to the backend-bound metric, which is why it is multiplied by Backend_Bound.
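For example, with made-up numbers: if Backend_Bound is 0.6 (60% of pipeline slots are backend bound) and Memory_Bound_Fraction is 0.5, then Memory_Bound = 0.5 * 0.6 = 0.3, i.e., 30% of slots are counted as memory bound.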
The Memory_Bound_Fraction formula is listed below:
Memory_Bound_Fraction =
(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) * NUM_OF_PORTS /
(Backend_Bound_Cycles * NUM_OF_PORTS)
NUM_OF_PORTS is the number of execution ports of the target CPU's microarchitecture. Since NUM_OF_PORTS appears in both the numerator and the denominator, it cancels out, and the formula simplifies to:
Memory_Bound_Fraction =
(CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) /
Backend_Bound_Cycles
CYCLE_ACTIVITY.STALLS_MEM_ANY and RESOURCE_STALLS.SB are performance events. Backend_Bound_Cycles is calculated as follows:
Backend_Bound_Cycles = CYCLE_ACTIVITY.STALLS_TOTAL +
UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC -
Few_Uops_Executed_Threshold -
Frontend_RS_Empty_Cycles + RESOURCE_STALLS.SB
Few_Uops_Executed_Threshold is either UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC or UOPS_EXECUTED.CYCLES_GE_3_UOP_EXEC, depending on another metric. Frontend_RS_Empty_Cycles is either RS_EVENTS.EMPTY_CYCLES or zero, again depending on another metric.
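To make the formula chain concrete, the sketch below plugs hypothetical raw event counts into the formulas above. The event counts, the Backend_Bound value, and the choice of threshold and RS-empty correction are all made up for illustration; this is not VTune's own implementation.
#include <stdio.h>

/* Hypothetical raw event counts, made up purely for illustration. */
#define CYCLE_ACTIVITY_STALLS_MEM_ANY        6.0e8
#define RESOURCE_STALLS_SB                   1.0e8
#define CYCLE_ACTIVITY_STALLS_TOTAL          9.0e8
#define UOPS_EXECUTED_CYCLES_GE_1_UOP_EXEC   7.0e8
#define UOPS_EXECUTED_CYCLES_GE_2_UOP_EXEC   4.0e8   /* assumed to be the chosen threshold */
#define RS_EVENTS_EMPTY_CYCLES               0.5e8
#define BACKEND_BOUND                        0.6     /* level-1 top-down Backend_Bound, assumed */

int main(void) {
    /* Assume the threshold resolves to CYCLES_GE_2_UOP_EXEC and that the
       RS-empty correction applies; the real choice depends on other metrics,
       as described above. */
    double few_uops_executed_threshold = UOPS_EXECUTED_CYCLES_GE_2_UOP_EXEC;
    double frontend_rs_empty_cycles    = RS_EVENTS_EMPTY_CYCLES;

    double backend_bound_cycles =
        CYCLE_ACTIVITY_STALLS_TOTAL +
        UOPS_EXECUTED_CYCLES_GE_1_UOP_EXEC -
        few_uops_executed_threshold -
        frontend_rs_empty_cycles +
        RESOURCE_STALLS_SB;

    /* NUM_OF_PORTS cancels, so the simplified fraction is used. */
    double memory_bound_fraction =
        (CYCLE_ACTIVITY_STALLS_MEM_ANY + RESOURCE_STALLS_SB) / backend_bound_cycles;

    double memory_bound = memory_bound_fraction * BACKEND_BOUND;

    printf("Backend_Bound_Cycles  = %.3e\n", backend_bound_cycles);
    printf("Memory_Bound_Fraction = %.3f\n", memory_bound_fraction);
    printf("Memory_Bound          = %.3f\n", memory_bound);
    return 0;
}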
Setup
I used cgroups to limit the memory of an in-memory database running TPC-H; all of the queries came out memory latency bound. I think this may be an effect of the Linux software limit, although even with the cgroup limit in place the result could still point to an architecture-level problem.
Then I ran three micro-benchmarks.
#include <stdlib.h>
#include <stdio.h>

int main() {
    #pragma omp parallel for
    for (int i = 0; i < 100 * 1024 * 1024; i++) {
        int inc = 1024 * sizeof(char);
        /* p is declared inside the loop so each thread gets its own copy. */
        int *p = (int *) calloc(1, inc);   /* allocate and zero 1 KiB per iteration */
        if (!p) printf("error");
    }
    return 0;
}
No matter what I set OMP_NUM_THREADS to, I always get a memory bandwidth bound result and no latency bound. I guess the memory channel handles sequential reads so well that it saturates on bandwidth instead of stalling on latency: the prefetcher can easily predict the calloc access pattern.
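For contrast (this is not one of the three micro-benchmarks), the usual way to make loads latency bound rather than bandwidth bound is a dependent pointer-chasing loop: every load address comes from the previous load, so the prefetcher cannot run ahead and each miss pays the full memory latency. A minimal sketch, with an arbitrary working-set size and iteration count:
#include <stdlib.h>
#include <stdio.h>

#define N     (64 * 1024 * 1024)    /* 64M size_t entries, far larger than the last-level cache */
#define STEPS (100 * 1024 * 1024)

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    if (!next) { printf("error\n"); return 1; }

    /* Build a single random cycle (Sattolo's algorithm) so that following
       next[] visits every slot in an unpredictable order. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    /* Dependent load chain: each load must finish before the next address is known. */
    size_t p = 0;
    for (size_t s = 0; s < STEPS; s++) p = next[p];

    printf("%zu\n", p);    /* keep the chain from being optimized away */
    free(next);
    return 0;
}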
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* SIZE was defined at compile time in the original; this default is only an assumption. */
#ifndef SIZE
#define SIZE (100 * 1024 * 1024)
#endif

int *array;

int main(int argc, char *argv[])
{
    array = (int *)malloc(SIZE * sizeof(int));
    memset(array, 1, SIZE * sizeof(int));      /* fill the whole array */
    for (int i = 0; i < 100000 + 1; ++i) {
        for (int j = 1; j < 100; ++j) {
            array[rand() % 100] += 1;          /* random read-modify-write among the first 100 elements */
        }
    }
    return 0;
}
We didn't get a memory bandwidth result for random access to the 2D matrix.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* SIZE was defined at compile time in the original; this default is only an assumption. */
#ifndef SIZE
#define SIZE (100 * 1024 * 1024)
#endif

int *array;

int main(int argc, char *argv[])
{
    array = (int *)malloc(SIZE * sizeof(int));
    memset(array, 1, SIZE * sizeof(int));      /* fill the whole array */
    for (int i = 0; i < 100000 + 1; ++i) {
        for (int j = 1; j < 100; ++j) {
            array[j] += 1;                     /* sequential read-modify-write over the first 100 elements */
        }
    }
    return 0;
}
For sequential access to the 2D matrix, we get 49% memory bandwidth bound and 40% memory latency bound.
Therefore, only sequential reads and writes over a 2D (or larger) matrix make the bandwidth stall. Note also that the memory bandwidth and memory latency bound metrics are measured in clockticks, i.e., the percentage of cycles in which the corresponding stall condition is detected, while the Memory Bound metric is measured in slots, which means execution ports of the pipeline.