QEMU CXL Type 1 emulation proposal

Introduction to CXL Type 1

Guided Usecase

[1] and [2] are QEMU's implementation of dm-crypt for LUKS; every device mapper over a physical block device requires a key and either a crypto accelerator or a software crypto implementation to decrypt the data. We implement a crypto accelerator with CXL Type 1 semantics on top of the virtio-crypto-pci framework. We want to emulate a malfunctioning or unplugged crypto device; the kernel will then receive the ATS-bit-marked DMAed data and fall back to the CPU's software crypto implementation.

Device emulation

DMA and access memory

Create a CacheMemRegion that maps a specific sub-page protection (SPP) region as one mapping for the set of CXL.cache cachelines on a CXL device.
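
A minimal sketch of how such a region could be modeled (hypothetical CacheMemRegion/CacheLine types, not QEMU's actual memory API):

#include <cstdint>
#include <unordered_map>

// Hypothetical model: device-owned 64B cachelines keyed by their
// guest-physical address inside the SPP-protected range.
struct CacheLine {
    uint8_t data[64];
    uint8_t mesi;      // MESI state of this line
};

struct CacheMemRegion {
    uint64_t spp_base;  // guest-physical base of the SPP region
    uint64_t spp_size;  // region size in bytes
    std::unordered_map<uint64_t, CacheLine> lines;

    // Return the device-cache state for the line containing gpa.
    CacheLine &line_for(uint64_t gpa) {
        uint64_t key = gpa & ~uint64_t(63);  // align down to 64B
        return lines[key];                   // created on first touch
    }
};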

Crypto operations

When the kernel calls crypto operations, we offload the encrypt/decrypt work to the Type 1 accelerator through CXL.io, which tells the device to operate on an arbitrary SPP region. The accelerator first takes ownership of that SPP region in the CacheMemRegion and notifies the host. Eventually, the host observes the shared state of the SPP's cachelines.

Cache coherency emulation

struct D2HDataReq {
    uint32_t header : 24;   // D2H data header (24 bits)
    uint32_t opcode : 4;    // opcode (4 bits)
    // plus CXL.cache channel crediting
};
struct CXLCache {
    uint8_t    data[64];    // 64-byte cacheline data
    uint8_t    mesi;        // MESI state (4 bits used)
    uint64_t   ats;         // ATS bits (64 bits)
    D2HDataReq reqs[1];     // [D2HDataReq; n] packed into the remaining bits
};

Use the metadata together with Intel SPP write-protection support and mark accesses to arbitrary cachelines in the SPP region. We perform all the transaction descriptions and queuing in the 64-byte residue data of the SPP. An operation taken from the queue has the effect of changing the MESI bits and updating the write protection for the sub-page and the root complex, plus other side effects such as switch state changes.

Host and device requests are not scheduled in a single FIFO: the host's view of the data has higher priority, so H2D requests are consumed first, and D2H requests are then completed in FIFO order. All operations go through the CXLCache interface.
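
A minimal sketch of that scheduling rule (hypothetical H2DReq/D2HReq types; the real queues would carry the channel crediting from the structs above):

#include <deque>

struct H2DReq { /* snoop or data from host */ };
struct D2HReq { /* request/response toward host */ };

struct CXLCacheScheduler {
    std::deque<H2DReq> h2d;  // host-to-device: higher priority
    std::deque<D2HReq> d2h;  // device-to-host: FIFO among themselves

    // One scheduling step: drain every pending H2D request before
    // completing the next D2H request.
    void step() {
        while (!h2d.empty()) {
            // apply h2d.front() to the CXLCache state
            h2d.pop_front();
        }
        if (!d2h.empty()) {
            // complete d2h.front() in FIFO order
            d2h.pop_front();
        }
    }
};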

Taking exclusive ownership

We mark the transaction's ATS bit as exclusive-capable and copy the cacheline into another map; once the device is emulated as unplugged, the cacheline is copied back so that the kernel can resume the software crypto calculation.

How to emulate the eviction

We have two proposals:

  1. Use PEBS to watch the cache eviction of the physical address of an SPP region.
  2. Use sub-page pinning (page_get_fast) to pin a physical address within the last-level cache. [7]

The code is being developed at https://github.com/SlugLab/Drywall/; notes on pitfalls are at https://asplos.dev/wordpress/2023/11/27/intel-sub-page-write-protection-cai-keng/.

Alternative implementation

Alternatively, we could limit which addresses are reachable from CXL.cache, write a memory-sanitization-assisted JIT compiler, and use LAM instead.

Reference

  1. https://www.os.ecc.u-tokyo.ac.jp/papers/2021-cloud-ozawa.pdf
  2. https://people.redhat.com/berrange/kvm-forum-2016/kvm-forum-2016-security.pdf
  3. https://yhbt.net/lore/all/[email protected]/T/
  4. https://privatewiki.opnfv.org/_media/dpacc/a_new_framework_of_cryptography_virtio_driver.pdf
  5. https://github.com/youcan64/spp_patched_qemu
  6. https://github.com/youcan64/spp_patched_linux
  7. https://people.kth.se/~farshin/documents/slice-aware-eurosys19.pdf

[Computer Architecture] Sniper lab2

The code is at http://victoryang00.xyz:5012/victoryang/sniper_test/tree/lab2.

The lab was to implement the hashed perceptron predictor following the paper "Dynamic Branch Prediction with Perceptrons".

The predictor is registered in common/performance_model/branch_predictor.cc:

#include "perceptron_branch_predictor.h"
...
else if (type == "perceptron")
{
   return new PerceptronBranchPredictor("branch_predictor", core_id);
}

The modification in the cfg

[perf_model/branch_predictor]
type = perceptron    # add Perceptron
mispredict_penalty=8 # Reflects just the front-end portion (approx) of the penalty for Interval Simulation

The constructor is declared in perceptron_branch_predictor.h; we maintain the member variables below:

struct PHT {
    bool taken;     // jump or not
    IntPtr target;  // history target address
    PHT(bool bt, IntPtr it) : taken(bt), target(it) {}
};  // the pattern history table, kept as a circular list to avoid memory copies

// to store the history result
SInt32 history_sum;
UInt32 history_bias_index;
std::vector<UInt32> history_indexes;
std::vector<PHT> history_path;
UInt32 history_path_index;

std::vector<PHT> path;
UInt32 path_index;

UInt32 path_length;
UInt32 num_entries;
UInt32 num_bias_entries;                  // 1024
UInt32 weight_bits;
std::vector<std::vector<SInt32>> weights;
std::vector<SInt32> perceptron;           // perceptron table, 1024 entries
UInt32 theta;                             // threshold deciding whether to train
SInt64 count;                             // running count
UInt32 block_size;                        // block size used by the hashed perceptron

    // initialize the branch history register and perceptron table to 0
PerceptronBranchPredictor::PerceptronBranchPredictor(String name, core_id_t core_id)
    : BranchPredictor(name, core_id), history_sum(0), history_bias_index(0), history_indexes(64 / 8, 0), history_path(64, PHT(false, 0)), history_path_index(0), path(64, PHT(false, 0)), path_index(0), path_length(64), num_entries(256), num_bias_entries(1024), weight_bits(7), weights(256, std::vector<SInt32>(64, 0)), perceptron(1024, 0), theta(296.64), count(0), block_size(8), coefficients(64, 0)

The prediction method

The base neural predictor is based on the equation $y=w_{0}+\sum_{i=1}^{n} x_{i} w_{i}$. We have:

// if the prediction is wrong, or |sum| is not larger than the threshold THETA,
// then update the weights to train the perceptron
theta = floor(1.93 * history_length + 14)
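
Evaluating those constants with the values used in this lab:

// With history length h = 64: floor(1.93 * 64 + 14) = floor(137.52) = 137
// (the constructor above passes theta = 296.64 instead).
// With weight_bits = 7, weights saturate at [-(1 << 7), (1 << 7) - 1] = [-128, 127],
// which is exactly the clamp applied in the training loop below.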

Hashed perceptron

  1. A set of branches is assigned to the same weights.
  2. More than one kind of information can be used to hash into the weight table.
  3. Different hashing functions can be used, such as XOR, concatenation, etc.
  4. Different indices can be assigned to different weights.

Algorithm

The input is ip, which first goes through (ip >> 4) % num_bias_entries;

This gives the index into the bias table and seeds the weight-table lines.

The perceptron table contributes bias_coefficient * perceptron[bias_index], and each weight block contributes weights[index[i]][i*8...(i+1)*8] * coefficients[i*8...(i+1)*8] * h, where h is +1 for a taken history entry and -1 otherwise.

If the resulting sum is >= 0, the prediction is taken; otherwise, not taken.

Then the history is updated and the new entry is inserted into the circular list.

bool PerceptronBranchPredictor::predict(IntPtr ip, IntPtr target)
{
    double sum = 0;
  //hash method
    UInt32 bias_index = (ip >> 4) % num_bias_entries;

    sum += bias_coefficient * perceptron[bias_index];
    history_bias_index = bias_index;
  // walk the history blocks and accumulate the weighted sum
    for (UInt32 i = 0; i < path_length; i += block_size)
    {
        IntPtr z = ip >> 4;
        for (UInt32 j = 0; j < block_size; j++)
        {
            z ^= (path[(path_index - i - j + path_length - 1) % path_length].target >> 4);
        }
        UInt32 index = z % num_entries;
        history_indexes[i / block_size] = index;
        for (UInt32 j = 0; j < block_size; j++)
        {
            SInt32 h = path[(path_index - i - j + path_length - 1) % path_length].taken ? 1 : -1;
            sum += h * weights[index][i + j] * coefficients[i + j];
        }
    }
    bool result = ((SInt32)sum) >= 0;
    history_sum = (SInt32)sum;

    history_path.assign(path.begin(), path.end());
    history_path_index = path_index;
    path[path_index] = PHT(result, target);
    path_index = (path_index + 1) % path_length;

    return result;
}

The train process

Algorithm

Input parameters: predicted, actual, ip.
Auxiliary state: history_sum (the predicted value), history_bias_index, history_indexes (indices), history_path (history), history_path_index (history subscript).

for each weight i in parallel:
    if t == x_i then
        w_i = w_i + 1
    else
        w_i = w_i - 1
    end if

Realization

for (UInt32 i = 0; i < path_length; i += block_size)
{
    index = history_indexes[i / block_size];
    for (UInt32 j = 0; j < block_size; j++)
    {
        bool taken = history_path[(history_path_index - i - j + path_length - 1) % path_length].taken;
        if (taken == actual)
        {
            if (weights[index][i + j] < (1 << weight_bits) - 1)
                weights[index][i + j]++;
        }
        else
        {
            if (weights[index][i + j] > -(1 << weight_bits))
                weights[index][i + j]--;
        }
    }
}

The result

FFT with Blocking Transpose
   1024 Complex Doubles
   1 Processors
   65536 Cache lines
   16 Byte line size
   4096 Bytes per page
                      pentium_m   perceptron
num correct           63112       65612
num incorrect         4342        2112
IPC                   1.36        1.37
MPKI                  2.71        1.39
misprediction rate    6.436%      3.118%
elapsed time          4.46 s      5.23 s
cycles                1.2M        1.2M
instructions          1.6M        1.6M


[Computer Architecture] Sniper lab3

The code can be viewed at http://victoryang00.xyz:5012/victoryang/sniper_test/tree/lab3.

false sharing

For the false-sharing test cases, the given lab3 cfg file sets the cache line to 64B, so we just need to place the falsely shared variables within one cache line.

In false_sharing_bad.c, we spawn two threads that accumulate into the global array results: th1 visits the first half of arr and th2 the second half.

void* myfunc(void *args){
    int i;
    MY_ARGS* my_args=(MY_ARGS*)args;
    int first = my_args->first;
    int last = my_args->last;
    int id = my_args->id;
    // int s=0;
    for (i=first;i<last;i++){
        results[id]=results[id]+arr[i];
    }

    return NULL;
}

The result of this run in Sniper:

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:1] -- DONE --
[TRACE:2] -- DONE --
s1 = 5003015
s2= 5005373
s1+s2= 10008388
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 627.88 seconds
[SNIPER] Simulated 445.0M instructions, 192.9M cycles, 2.31 IPC
[SNIPER] Simulation speed 708.8 KIPS (177.2 KIPS / target core - 5643.3ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 627.79 seconds

In false_sharing.c, we spawn two threads that each accumulate into a local variable s (th1 and th2 visit disjoint halves of arr).

void* myfunc(void *args){
    int i;
    MY_ARGS* my_args=(MY_ARGS*)args;
    int first = my_args->first;
    int last = my_args->last;
    // int id = my_args->id;
    int s=0;
    for (i=first;i<last;i++){
        s=s+arr[i];
    }

    my_args->result=s;
    return NULL;
}

The result of this run in Sniper:

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:2] -- DONE --
[TRACE:1] -- DONE --
s1 = 5003015
s2= 5003015
s1+s2= 10006030
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 533.95 seconds
[SNIPER] Simulated 415.1M instructions, 182.1M cycles, 2.28 IPC
[SNIPER] Simulation speed 777.3 KIPS (194.3 KIPS / target core - 5145.9ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 533.99 seconds

The reason for the false sharing:

Both threads repeatedly write results[id] entries that live on the same 64B cache line.

Each write forces the coherence protocol to invalidate the other core's copy of the line, so the line ping-pongs between the two private caches; that invalidation and write-back traffic is the false-sharing penalty.

The solution to the false sharing:

Keep adjacent data more than one cache line (64B) apart; e.g., set FALSE_ARR to 200000 so the two result slots land on different lines.
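
An explicit padding sketch (C++ alignas shown for illustration; the lab's C code instead spaces the indices FALSE_ARR apart):

#include <cstdio>

// Each accumulator gets its own 64B-aligned slot, so the two threads'
// writes land on different cache lines and never invalidate each other.
struct alignas(64) PaddedResult {
    long value;
};

static PaddedResult results[2];

int main() {
    results[0].value = 1;
    results[1].value = 2;  // different cache line from results[0]
    std::printf("%ld %ld\n", results[0].value, results[1].value);
    return 0;
}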

The result changed to:

[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
[TRACE:1] -- DONE --
[TRACE:2] -- DONE --
s1 = 5003015
s2= 5005373
s1+s2= 10008388
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 512.28 seconds
[SNIPER] Simulated 445.1M instructions, 158.1M cycles, 2.82 IPC
[SNIPER] Simulation speed 868.8 KIPS (217.2 KIPS / target core - 4604.2ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 512.22 seconds

multi-level cache

Making the cache non-inclusive removes the back-invalidation; the inclusive configuration also needed a bug fix.

Then I found https://groups.google.com/g/snipersim/c/_NJu8DXCVVs/m/uL3Vo24OAAAJ, which explains that a non-inclusive cache intends to write back directly to memory. For the inclusive cache we had to fix a bug: during an L1 eviction, L2 must evict too, but the line may not be found in L2; in that case the L1 should write back to memory. A sketch of the fixed path follows.
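
A minimal sketch of the fixed eviction path (hypothetical helper names; the real logic lives in cache_cntlr.cc):

#include <cstdint>

struct CacheLine { uint64_t addr; /* data, state, ... */ };

// Hypothetical stand-ins for the real L2/DRAM write-back paths.
bool l2_contains(uint64_t addr);
void l2_writeback(const CacheLine &line);
void memory_writeback(const CacheLine &line);

// On an L1 eviction under an inclusive hierarchy, the line normally merges
// into L2; if L2 has already evicted it, write back to memory instead.
void on_l1_evict(const CacheLine &line) {
    if (l2_contains(line.addr))
        l2_writeback(line);
    else
        memory_writeback(line);
}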

I added a new option in the cfg to make L1 non-inclusive or not, and deployed the different protocols via the cfg.

[caching_protocol]
type = parametric_dram_directory_msi
variant = mesif                           # msi, mesi or mesif

I didn't measure at a small granularity, but with lock_add, lock_fill_bucket, reader_writer, and bfs I collected the number of write-backs and some write-back tests for inclusive and non-inclusive caches.

Test

Lock Add: in this test, all the threads try to add 1 to a global counter using locks. We see a lower number of memory write-backs in MOSI because of the presence of the Owned state.

Lock Fill Bucket: this test builds buckets for numbers so it can count how many times each number appears in the array, again using locks. The Dragon protocol performs much worse here than the others, probably because updates to the buckets are not always needed by the other processors, so the updates on writes do not help.

Result

Protocol

Protocol \ IPC   lock_add   lock_fill_bucket   reader_writer   bfs
MSI              1.31       1.32               1.27            1.39
MESI             1.35       1.36               1.29            1.39
MESIF            1.35       1.36               1.30            1.39

The MESIF protocol enhances performance for multicore systems; it may be aware of the microarchitecture.

CPI stack

The CPI stack quantifies the cycles lost to memory, branches, or synchronization. All of the protocols show a similar CPI-stack breakdown; MESIF shows the lowest time in mem and sync.

Protocol \ L2 miss rate        lock_add   lock_fill_bucket   reader_writer   bfs
MSI                            49.18      20.13              27.12           42.24
MSI (non-inclusive cache)      47.98      20.01              29.12           42.13
MESI                           46.21      21.13              31.29           41.31
MESI (non-inclusive cache)     45.13      21.41              26.41           42.15
MESIF                          45.71      20.12              25.14           41.39
MESIF (non-inclusive cache)    46.35      23.14              24.14           41.13

In L2 miss rate, the non-inclusive caches generally score better (lower) than the inclusive ones, and MESIF and MESI are better than MSI.

Summary

Interesting Conclusions

Adding ‘E’ state:

To measure the performance differences caused by adding the Exclusive state to the protocols, we can look at the differences in metrics between MSI and MESI/MESIF. The main benefit of the Exclusive state is in reducing the number of snooping bus transactions required.

Consider a parallel program where each thread works on its own chunk of an array and updates only that chunk, or a sequential program with a single thread: in these cases, a program frequently reads a cacheline and then updates it. In MSI, this translates to first loading the cacheline using a BusRd, moving to the S state, and then performing a BusRdX and moving to the M state. This requires two snooping bus transactions. In MESI, this can be done in a single transaction: the cache performs a BusRd and moves to the E state, and since no other cache has the cacheline, there is no need for a BusRdX transaction to move to the M state; it can silently change the status of the cacheline to Modified.

This gives a significant boost in programs that access and modify unique memory addresses.
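
A concrete read-then-write trace on a privately used cacheline makes the transaction count explicit:

MSI:   load  -> BusRd  (I -> S)
       store -> BusRdX (S -> M)           = two snooping bus transactions
MESI:  load  -> BusRd, no sharers (I -> E)
       store -> silent upgrade   (E -> M) = one snooping bus transaction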

Write-Invalidation vs Write-Update

Since we have implemented both write-invalidation and write-update protocols, our simulator can also tell whether, for a given program or memory trace, a write-invalidation or a write-update protocol is better.
For a write-invalidation protocol, when a processor writes to a memory location, the other processors' caches must invalidate that cacheline. A write-update protocol instead sends the updated cacheline to the other caches. Therefore, in cases where the other processors will read those values in the future, write-update performs well; but if the other processors are not going to need those values, the updates are of no use and just cause extra bus transactions. The effect of the protocol is therefore completely workload-dependent.
From our tests, we saw a smaller number of bus transactions under the write-update protocol, which would explain why updating rather than invalidating reduced the number of bus transactions.

MOESI (Unchecked)

For the state diagram, I referred to the material in the references below.

The code is in the dir_moesi directory, with one state added.

I made it runnable but have not tested it yet.

Reference

  1. https://www.youtube.com/watch?v=3DoBCs7Lyv4
  2. https://kshitizdange.github.io/

[Computer Architecture] Sniper Intro

Info

The code is available at http://victoryang00.xyz:5012/victoryang/sniper_test.

The raw result is

admin@ubuntu_1604:~/sniper/test/lab0$ make
../../run-sniper -c ./config-lab0.cfg -- ./toy-lab0
[SNIPER] Warning: Unable to use physical addresses for shared memory simulation.
[SNIPER] Start
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Sniper using SIFT/trace-driven frontend
[SNIPER] Running full application in DETAILED mode
[SNIPER] --------------------------------------------------------------------------------
[SNIPER] Enabling performance models
[SNIPER] Setting instrumentation mode to DETAILED
[RECORD-TRACE] Using the Pin frontend (sift/recorder)
User program begins
<toy-lab0.c, clflush, 21> clflush to be run
[performance_model.cc, iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[performance_model.cc, iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[performance_model.cc, iterate, 311] CLFLUSH instruction executed
<toy-lab0.c, clflush, 21> clflush to be run
[performance_model.cc, iterate, 311] CLFLUSH instruction executed
User program ends
[TRACE:0] -- DONE --
[SNIPER] Disabling performance models
[SNIPER] Leaving ROI after 2.83 seconds
[SNIPER] Simulated 0.0M instructions, 0.1M cycles, 0.36 IPC
[SNIPER] Simulation speed 11.9 KIPS (11.9 KIPS / target core - 84229.9ns/instr)
[SNIPER] Setting instrumentation mode to FAST_FORWARD
[SNIPER] End
[SNIPER] Elapsed time: 3.06 seconds


Optional: Run '../../tools/cpistack.py' in this directory to generate cpi-stack output for this run
Optional: Run '../../tools/mcpat.py' in this directory to generate power output for this run
Optional: Run '../../tools/dumpstats.py' in this directory to view detailed statistics for this run
Optional: Run '../../tools/gen_topology.py' in this directory to view the system topology for this run

The modified code is http://victoryang00.xyz:5012/victoryang/sniper_test/blob/master/common/performance_model/performance_model.cc#L310

if (ins->opcode == 4542892)
   fprintf(stderr, "[performance_model.cc, %s, %d] CLFLUSH instruction executed\n", __func__, __LINE__);

Introduction

The Sniper simulator allows one to perform timing simulations of both multi-program workloads and multi-threaded, shared-memory applications with tens to 100+ cores. The maintainers are researchers at NUS, Cambridge, Intel, and Ghent University.

Cache implementation

We have the cfg for the cache, so I consulted its dispatch process in the source code.

# Configuration file for the Sniper simulator

# This file is organized into sections defined in [] brackets as in [section].
# Sections may be hierarchical with sub-sections split by the '/' character as
# in [section/sub_section].
#
# values can be "strings", numbers, or true/false; existing values
# should indicate the type

# This section controls various high-level simulation parameters.
[general]
magic = false # Enable performance simulation straight away (false), or wait for Roi{Begin,End} magic instruction (true)
roi_script = false # Allow ROI to be set by a script, and ignore Roi{Begin,End} magic instructions
inst_mode_init = cache_only
inst_mode_roi = detailed
inst_mode_end = fast_forward
inst_mode_output = true
syntax = intel # Disassembly syntax (intel, att or xed)
issue_memops_at_functional = false # Issue memory operations to the memory hierarchy as they are executed functionally (Pin front-end only)
num_host_cores = 0 # Number of host cores to use (approximately). 0 = autodetect based on available cores and cpu mask. -1 = no limit (oversubscribe)
enable_signals = false
enable_smc_support = false # Support self-modifying code
enable_pinplay = false # Run with a pinball instead of an application (requires a Pin kit with PinPlay support)
enable_syscall_emulation = true # Emulate system calls, cpuid, rdtsc, etc. (disable when replaying Pinballs)
suppress_stdout = false # Suppress the application's output to stdout
suppress_stderr = false # Suppress the application's output to stderr

# Total number of cores in the simulation
total_cores = 64

enable_icache_modeling = false

# This section is used to fine-tune the logging information. The logging may
# be disabled for performance runs or enabled for debugging.
[log]
enabled = false
stack_trace = false
disabled_modules = ""
enabled_modules = ""
mutex_trace = false
pin_codecache_trace = false
circular_log = false

[progress_trace]
enabled = false
interval = 5000
filename = ""

[clock_skew_minimization]
scheme = barrier
report = false

[clock_skew_minimization/barrier]
quantum = 100                         # Synchronize after every quantum (ns)

# This section describes parameters for the core model
[perf_model/core]
frequency = 1        # In GHz
type = oneipc        # Valid models are oneipc, interval, rob
logical_cpus = 1     # Number of SMT threads per core

[perf_model/core/interval_timer]
#dispatch_width = 4
#window_size = 96
issue_contention = true
num_outstanding_loadstores = 8
memory_dependency_granularity = 8 # In bytes
lll_dependency_granularity = 64 # In bytes. Model the MSHR for overlapping misses by adding additional dependencies on long-latency loads using cache-line granularity
lll_cutoff = 30
issue_memops_at_dispatch = false # Issue memory operations to the cache hierarchy at dispatch (true) or at fetch (false)

# This section describes the number of cycles for
# various arithmetic instructions.
[perf_model/core/static_instruction_costs]
add=1
sub=1
mul=3
div=18
fadd=3
fsub=3
fmul=5
fdiv=6
generic=1
jmp=1
string=1
branch=1
dynamic_misc=1
recv=1
sync=0
spawn=0
tlb_miss=0
mem_access=0
delay=0
unknown=0

[perf_model/branch_predictor]
type=one_bit
mispredict_penalty=14 # A guess based on Penryn pipeline depth
size=1024

[perf_model/tlb]
# Penalty of a page walk (in cycles)
penalty = 0
# Page walk is done by separate hardware in parallel to other core activity (true),
# or by the core itself using a serializing instruction (false, e.g. microcode or OS)
penalty_parallel = true

[perf_model/itlb]
size = 0              # Number of I-TLB entries
associativity = 1     # I-TLB associativity

[perf_model/dtlb]
size = 0              # Number of D-TLB entries
associativity = 1     # D-TLB associativity

[perf_model/stlb]
size = 0              # Number of second-level TLB entries
associativity = 1     # S-TLB associativity

[perf_model/l1_icache]
perfect = false
passthrough = false
coherent = true
cache_block_size = 64
cache_size = 32 # in KB
associativity = 4
address_hash = mask
replacement_policy = lru
data_access_time = 3
tags_access_time = 1
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite
prefetcher = none

[perf_model/l1_dcache]
perfect = false
passthrough = false
cache_block_size = 64
cache_size = 32 # in KB
associativity = 4
address_hash = mask
replacement_policy = lru
data_access_time = 3
tags_access_time = 1
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
outstanding_misses = 0
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite
prefetcher = none

[perf_model/l2_cache]
perfect = false
passthrough = false
cache_block_size = 64 # in bytes
cache_size = 512 # in KB
associativity = 8
address_hash = mask
replacement_policy = lru
data_access_time = 9
tags_access_time = 3  # This is just a guess for Penryn
perf_model_type = parallel
writeback_time = 0    # Extra time required to write back data to a higher cache level
dvfs_domain = core    # Clock domain: core or global
shared_cores = 1      # Number of cores sharing this cache
prefetcher = none     # Prefetcher type
next_level_read_bandwidth = 0 # Read bandwidth to next-level cache, in bits/cycle, 0 = infinite

[perf_model/l3_cache]
perfect = false
passthrough = false

[perf_model/l4_cache]
perfect = false
passthrough = false

[perf_model/llc]
evict_buffers = 8

[perf_model/fast_forward]
model = oneipc        # Performance model during fast-forward (none, oneipc)

[perf_model/fast_forward/oneipc]
interval = 100000     # Barrier quantum in fast-forward, in ns
include_memory_latency = false # Increment time by memory latency
include_branch_misprediction = false # Increment time on branch misprediction

[core]
spin_loop_detection = false

[core/light_cache]
num = 0

[core/cheetah]
enabled = false
min_size_bits = 10
max_size_bits_local = 30
max_size_bits_global = 36

[core/hook_periodic_ins]
ins_per_core = 10000  # After how many instructions should each core increment the global HPI counter
ins_global = 1000000  # Aggregate number of instructions between HOOK_PERIODIC_INS callbacks

[caching_protocol]
type = parametric_dram_directory_msi
variant = mesi                            # msi, mesi or mesif

[perf_model/dram_directory]
total_entries = 16384
associativity = 16
max_hw_sharers = 64                       # number of sharers supported in hardware (ignored if directory_type = full_map)
directory_type = full_map                 # Supported (full_map, limited_no_broadcast, limitless)
home_lookup_param = 6                     # Granularity at which the directory is stripped across different cores
directory_cache_access_time = 10          # Tag directory lookup time (in cycles)
locations = dram                          # dram: at each DRAM controller, llc: at master cache locations, interleaved: every N cores (see below)
interleaving = 1                          # N when locations=interleaved

[perf_model/dram_directory/limitless]
software_trap_penalty = 200               # number of cycles added to clock when trapping into software (pulled number from Chaiken papers, which explores 25-150 cycle penalties)

[perf_model/dram]
type = constant                           # DRAM performance model type: "constant" or a "normal" distribution
latency = 100                             # In nanoseconds
per_controller_bandwidth = 5              # In GB/s
num_controllers = -1                      # Total Bandwidth = per_controller_bandwidth * num_controllers
controllers_interleaving = 0              # If num_controllers == -1, place a DRAM controller every N cores
controller_positions = ""
direct_access = false                     # Access DRAM controller directly from last-level cache (only when there is a single LLC)

[perf_model/dram/normal]
standard_deviation = 0                    # The standard deviation, in nanoseconds, of the normal distribution

[perf_model/dram/cache]
enabled = false

[perf_model/dram/queue_model]
enabled = true
type = history_list

[perf_model/nuca]
enabled = false

[perf_model/sync]
reschedule_cost = 0 # In nanoseconds

# This describes the various models used for the different networks on the core
[network]
# Valid Networks :
# 1) magic
# 2) emesh_hop_counter, emesh_hop_by_hop
# 3) bus
memory_model_1 = emesh_hop_counter
system_model = magic
collect_traffic_matrix = false

[network/emesh_hop_counter]
link_bandwidth = 64 # In bits/cycles
hop_latency = 2

[network/emesh_hop_by_hop]
link_bandwidth = 64   # In bits/cycle
hop_latency = 2       # In cycles
concentration = 1     # Number of cores per network stop
dimensions = 2        # Dimensions (1 for line/ring, 2 for 2-D mesh/torus)
wrap_around = false   # Use wrap-around links (false for line/mesh, true for ring/torus)
size = ""             # ":"-separated list of size for each dimension, default = auto

[network/emesh_hop_by_hop/queue_model]
enabled = true
type = history_list
[network/emesh_hop_by_hop/broadcast_tree]
enabled = false

[network/bus]
ignore_local_traffic = true # Do not count traffic between core and directory on the same tile

[network/bus/queue_model]
type=contention

[queue_model/basic]
moving_avg_enabled = true
moving_avg_window_size = 1024
moving_avg_type = arithmetic_mean

[queue_model/history_list]
# Uses the analytical model (if enabled) to calculate delay if cannot be calculated using the history list
max_list_size = 100
analytical_model_enabled = true

[queue_model/windowed_mg1]
window_size = 1000        # In ns. A few times the barrier quantum should be a good choice

[dvfs]
type = simple
transition_latency = 0 # In nanoseconds

[dvfs/simple]
cores_per_socket = 1

[bbv]
sampling = 0 # Defines N to skip X samples with X uniformely distributed between 0..2*N, so on average 1/N samples

[loop_tracer]
#base_address = 0 # Start address in hex (without 0x)
iter_start = 0
iter_count = 36

[osemu]
pthread_replace = false   # Emulate pthread_{mutex|cond|barrier} functions (false: user-space code is simulated, SYS_futex is emulated)
nprocs = 0                # Overwrite emulated get_nprocs() call (default: return simulated number of cores)
clock_replace = true      # Whether to replace gettimeofday() and friends to return simulated time rather than host wall time
time_start = 1337000000   # Simulator startup time ("time zero") for emulated gettimeofday()

[traceinput]
enabled = false
address_randomization = false # Randomize upper address bits on a per-application basis to avoid cache set contention when running multiple copies of the same trace
stop_with_first_app = true    # Simulation ends when first application ends (else: when last application ends)
restart_apps = false          # When stop_with_first_app=false, whether to restart applications until the longest-running app completes for the first time
mirror_output = false
trace_prefix = ""             # Disable trace file prefixes (for trace and response fifos) by default
num_runs = 1                  # Add 1 for warmup, etc

[scheduler]
type = pinned

[scheduler/pinned]
quantum = 1000000         # Scheduler quantum (round-robin for active threads on each core), in nanoseconds
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)
interleaving = 1          # Interleaving of round-robin initial assignment (e.g. 2 => 0,2,4,6,1,3,5,7)

[scheduler/roaming]
quantum = 1000000         # Scheduler quantum (round-robin for active threads on each core), in nanoseconds
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)

[scheduler/static]
core_mask = 1             # Mask of cores on which threads can be scheduled (default: 1, all cores)

[scheduler/big_small]
quantum = 1000000         # Scheduler quantum, in nanoseconds
debug = false

[hooks]
numscripts = 0

[fault_injection]
type = none
injector = none

[routine_tracer]
type = none

[instruction_tracer]
type = none

[sampling]
enabled = false

Cache source code evaluation

Files related to cache in Sniper
config folder

gainestown.cfg contains the configuration of the L3 cache and nests (includes) the nehalem.cfg file.

The nehalem.cfg file contains the configuration of the L2 and L1 caches.

The default Sniper configuration is gainestown.cfg.

typedef int64_t SInt64;
typedef int32_t SInt32;
typedef int16_t SInt16;
typedef int8_t  SInt8;
typedef UInt8 Byte;
typedef UInt8 Boolean;
typedef uintptr_t IntPtr;
extern UInt64 PC;

sniper/common/core/memory_subsystem contains the definition and concrete implementation of the memory subsystem in Sniper.

  • parametric_dram_directory_msi/cache_cntlr.cc

Determines whether the current access hits or misses in the cache: on a hit it accesses the cache (read or write back); on a miss it inserts a cache block.

HitWhere::where_t
CacheCntlr::processMemOpFromCore(Core::lock_signal_t lock_signal,
      Core::mem_op_t mem_op_type, IntPtr ca_address, UInt32 offset,
      Byte* data_buf, UInt32 data_length, bool modeled, bool count);
/* Accepts a load or store request, determines whether the current cache
   access hits or misses, and then calls the corresponding handler. */

SharedCacheBlockInfo* CacheCntlr::insertCacheBlock(IntPtr address,
      CacheState::cstate_t cstate, Byte* data_buf,
      core_id_t requester, ShmemPerfModel::Thread_t thread_num);
/* Called by the previous method on a cache miss. Its main job is to find
   a cache block to replace. */

void CacheCntlr::accessCache(
      Core::mem_op_t mem_op_type, IntPtr ca_address, UInt32 offset,
      Byte* data_buf, UInt32 data_length, bool update_replacement);
/* The operation on a cache hit, also called by processMemOpFromCore.
   It covers two main functions: read cache / write cache. */
  • cache/cache.cc & cache.h

Each actual cache is defined as an object of the Cache class (e.g., L1-icache), which contains the basic information about the cache, including size, type, and associativity, plus operations to query that information. The Cache class also includes two methods to access and insert into the cache: accessSingleLine and insertSingleLine, both of which are called from CacheCntlr.

/* cache attributes */
// Cache counters
UInt64 m_num_accesses;
UInt64 m_num_hits;
// Generic Cache Info
cache_t m_cache_type;
CacheSet** m_sets;
CacheSetInfo* m_set_info;

/* cache constructor */
Cache(String name, String cfgname, core_id_t core_id, UInt32 num_sets,
      UInt32 associativity, UInt32 cache_block_size,
      String replacement_policy, cache_t cache_type,
      hash_t hash = CacheBase::HASH_MASK,
      FaultInjector *fault_injector = NULL,
      AddressHomeLookup *ahl = NULL);

/* accessSingleLine: on a cache hit, the cache controller calls accessCache,
   which in turn calls this method in the Cache class. This method reads
   from and writes to the cache. */
CacheBlockInfo* accessSingleLine(IntPtr addr,
      access_t access_type, Byte* buff, UInt32 bytes,
      SubsecondTime now, bool update_replacement);

/* insertSingleLine: on a cache miss, the cache controller calls
   insertCacheBlock, which further calls this method in the Cache class. */
void insertSingleLine(IntPtr addr, Byte* fill_buff,
      bool* eviction, IntPtr* evict_addr,
      CacheBlockInfo* evict_block_info, Byte* evict_buff,
      SubsecondTime now, CacheCntlr *cntlr = NULL);
  • cache/cache_base.h

The CacheBase class includes some basic information about the cache, such as associativity and cache size. It also includes some type definitions, such as the replacement policy, which needs to be extended if a replacement algorithm is added.

enum ReplacementPolicy
{
   ROUND_ROBIN = 0, LRU, LRU_QBS,
   NRU, MRU, NMRU, PLRU,
   SRRIP, SHCT_SRRIP,
   SRRIP_QBS, RANDOM,
   NUM_REPLACEMENT_POLICIES, SHCT_LRU
}; // the replacement-policy enum; extend it to add a policy
  • cache/cache_set.cc and cache_set.h

A cache set is the group of cache lines in one set; the number of lines equals the associativity. The replacement algorithm selects an appropriate cacheline in the set to evict. Each set is defined as an object of the CacheSet class, which includes the lower-level operations on the cache: the accessSingleLine method calls read_line and write_line, and insertCacheBlock calls insert.

/* cache hit: used for data reading */
void read_line(UInt32 line_index, UInt32 offset, Byte *out_buff,
      UInt32 bytes, bool update_replacement);

/* cache hit: used for writing data back */
void write_line(UInt32 line_index, UInt32 offset, Byte *in_buff,
      UInt32 bytes, bool update_replacement);

/* cache miss: apply the replacement algorithm to choose the victim */
void insert(CacheBlockInfo* cache_block_info, Byte* fill_buff,
      bool* eviction, CacheBlockInfo* evict_block_info,
      Byte* evict_buff, CacheCntlr *cntlr = NULL);

In addition to the access methods of CacheSet, the following two methods need to be changed if you want to add your own replacement algorithm (a sketch follows the listing below).

/* Creates the corresponding CacheSet object for the replacement algorithm. */
CacheSet* CacheSet::createCacheSet(String cfgname, core_id_t core_id,
      String replacement_policy,
      CacheBase::cache_t cache_type,
      UInt32 associativity, UInt32 blocksize,
      CacheSetInfo* set_info);

/* Creates the corresponding CacheSetInfo object for the replacement algorithm. */
CacheSetInfo* CacheSet::createCacheSetInfo(String name,
      String cfgname, core_id_t core_id,
      String replacement_policy, UInt32 associativity);

/* Determines the replacement-algorithm type from its input string. */
CacheBase::ReplacementPolicy
CacheSet::parsePolicyType(String policy);
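
As a sketch of where a new policy hooks in (hypothetical stand-in types; the real switch lives in cache_set.cc):

// Simplified stand-ins to show the dispatch shape of createCacheSet.
enum Policy { POLICY_LRU, POLICY_MINE };

struct CacheSetBase { virtual ~CacheSetBase() {} };
struct CacheSetLRUStub  : CacheSetBase {};
struct CacheSetMyPolicy : CacheSetBase {};  // your new algorithm

CacheSetBase *createCacheSetSketch(Policy policy) {
    switch (policy) {
        case POLICY_LRU:  return new CacheSetLRUStub();
        case POLICY_MINE: return new CacheSetMyPolicy();  // the branch you add
    }
    return nullptr;
}
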
  • cache/cache_block_info.cc & cache_block_info.h

Each cacheline has an object created from the CacheBlockInfo class to hold additional information about the cache line, such as the tag bits and used bits. If your replacement algorithm requires extra per-line information, consider adding it here or one layer up in CacheSet.

IntPtr m_tag;
CacheState::cstate_t m_cstate;
UInt64 m_owner;
BitsUsedType m_used;
UInt8 m_options;  // large enough to hold a bitfield for all available option_t's
  • cache/cache_set_lru.cc and cache_set_lru.h

The LRU algorithm that ships with Sniper derives from the CacheSet class and implements the base class's getReplacementIndex and updateReplacementIndex methods. The former selects the appropriate cache line to evict according to the replacement policy; the latter performs the bookkeeping the policy needs whenever a cache line is accessed (read, write back, insert), e.g. updating the LRU access record.
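
A minimal sketch of that pair of methods for plain LRU (assuming an age array like Sniper's m_lru_bits; no query-based selection):

#include <cstdint>

struct CacheSetLRUSketch {
    static const uint32_t kAssoc = 4;
    uint8_t m_lru_bits[kAssoc] = {0, 1, 2, 3};  // larger value = older

    // getReplacementIndex: evict the oldest way.
    uint32_t getReplacementIndex() const {
        uint32_t victim = 0;
        for (uint32_t i = 1; i < kAssoc; ++i)
            if (m_lru_bits[i] > m_lru_bits[victim])
                victim = i;
        return victim;
    }

    // updateReplacementIndex: the touched way becomes youngest;
    // every way that was younger ages by one.
    void updateReplacementIndex(uint32_t accessed) {
        for (uint32_t i = 0; i < kAssoc; ++i)
            if (m_lru_bits[i] < m_lru_bits[accessed])
                ++m_lru_bits[i];
        m_lru_bits[accessed] = 0;
    }
};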

clflush

clflush is often executed when an attacker carries out a Spectre-style attack. Per the Intel SDM: it invalidates, from every level of the cache hierarchy in the cache-coherence domain, the cache line that contains the linear address specified by the memory operand; if that cache line contains modified data at any level of the hierarchy, that data is written back to memory. The source operand is a byte memory location. The semantics are defined below:

Flush Cache Line

Opcode / Instruction     Op/En   64-bit Mode   Compat/Leg Mode   Description
NP 0F AE /7 CLFLUSH m8   M       Valid         Valid             Flushes cache line containing m8.

Instruction Operand Encoding

Op/En   Operand 1       Operand 2   Operand 3   Operand 4
M       ModRM:r/m (w)   NA          NA          NA

CLFLUSH operation is the same in non-64-bit modes and 64-bit modes.

Operation

Flush_Cache_Line(SRC);

Intel C/C++ Compiler Intrinsic Equivalents

void _mm_clflush(void const *p)
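
A minimal usage example (the intrinsic is declared in <immintrin.h>):

#include <immintrin.h>

int main() {
    static int x = 42;
    x = 43;           // dirty the cache line holding x
    _mm_clflush(&x);  // write the line back and invalidate it at every level
    return 0;
}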

Protected Mode Exceptions

#GP(0)            For an illegal memory operand effective address in the CS, DS, ES, FS or GS segments.
#SS(0)            For an illegal address in the SS segment.
#PF(fault-code)   For a page fault.
#UD               If CPUID.01H:EDX.CLFSH[bit 19] = 0.
                  If the LOCK prefix is used.

Real-Address Mode Exceptions

#GP               If any part of the operand lies outside the effective address space from 0 to FFFFH.
#UD               If CPUID.01H:EDX.CLFSH[bit 19] = 0.
                  If the LOCK prefix is used.

The Process of finding the execution point

It is hard to locate the execution point with static analysis alone, so it is natural to use gdb via run-sniper's --gdb option. However, SIFT requires thread synchronization, and gdb makes it hard to keep all the threads synchronized, so I gave up on that. Browsing the code, I found the function void PerformanceModel::queueInstruction(DynamicInstruction *ins), which pushes instructions into a queue that is then simulated by the iterator.

void PerformanceModel::iterate()
{
   while (m_instruction_queue.size() > 0)
   {
      // While the functional thread is waiting because of clock skew minimization, wait here as well
      #ifdef ENABLE_PERF_MODEL_OWN_THREAD
      while(m_hold)
         sched_yield();
      #endif
      DynamicInstruction *ins = m_instruction_queue.front();
      LOG_ASSERT_ERROR(!ins->instruction->isIdle(), "Idle instructions should not make it here!");
      if (!m_fastforward && m_enabled){
         handleInstruction(ins);
      }
      delete ins;
      m_instruction_queue.pop();
   }
   synchronize();
}

The instruction is dispatched there. So I first added a printf("sb"); to test whether it interleaves with the instruction stream. It does, but there were around a thousand sbs, both before the program begins and after it ends. My guess is that the emulator runs startup code (init.S for OS booting plus some C runtime) first, so clflush has to be distinguished from the other instructions.

So our problem becomes how to identify the 4 clflush instructions among everything else. At first I didn't find an identifier like the one in my RISC-V simulator code. Then I found that ins carries a lot of identifiers:

class DynamicInstruction {
private:
    // Private constructor: alloc() should be used
    DynamicInstruction(Instruction *ins, IntPtr _eip) {
        instruction = ins;
        eip = _eip;
        branch_info.is_branch = false;
        num_memory = 0;
    }

public:
    struct BranchInfo {
        bool is_branch;
        bool taken;
        IntPtr target;
    };
    struct MemoryInfo {
        bool executed; // For CMOV: true if executed
        Operand::Direction dir;
        IntPtr addr;
        UInt32 size;
        UInt32 num_misses;
        SubsecondTime latency;
        HitWhere::where_t hit_where;
    };
    static const UInt8 MAX_MEMORY = 2;
    Instruction *instruction;
    IntPtr eip; // Can be physical address, so different from instruction->getAddress() which is always virtual
    BranchInfo branch_info;
    UInt8 num_memory;
    MemoryInfo memory_info[MAX_MEMORY];

    static Allocator *createAllocator();

    ~DynamicInstruction();

    static DynamicInstruction *alloc(Allocator *alloc, Instruction *ins, IntPtr eip) {
        void *ptr = alloc->alloc(sizeof(DynamicInstruction));
        DynamicInstruction *i = new(ptr) DynamicInstruction(ins, eip);
        return i;
    }

    static void operator delete(void *ptr) { Allocator::dealloc(ptr); }

    SubsecondTime getCost(Core *core);

    bool isBranch() const { return branch_info.is_branch; }

    bool isMemory() const { return num_memory > 0; }

    void addMemory(bool e, SubsecondTime l, IntPtr a, UInt32 s, Operand::Direction dir, UInt32 num_misses,
                   HitWhere::where_t hit_where) {
        LOG_ASSERT_ERROR(num_memory < MAX_MEMORY, "Got more than MAX_MEMORY(%d) memory operands", MAX_MEMORY);
        memory_info[num_memory].dir = dir;
        memory_info[num_memory].executed = e;
        memory_info[num_memory].latency = l;
        memory_info[num_memory].addr = a;
        memory_info[num_memory].size = s;
        memory_info[num_memory].num_misses = num_misses;
        memory_info[num_memory].hit_where = hit_where;
        num_memory++;
    }

    void addBranch(bool taken, IntPtr target) {
        branch_info.is_branch = true;
        branch_info.taken = taken;
        branch_info.target = target;
    }

    SubsecondTime getBranchCost(Core *core, bool *p_is_mispredict = NULL);

    void accessMemory(Core *core);
};

We first tried the instruction size as the identifier, but instructions other than clflush also have size 9. Then we took memory_info's addr, but it changes on every run. Finally we found something that identifies it well: the opcode, which is unique for every instruction in the ISA; their encodings are quite different.
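
A slightly more explicit version of the opcode check from above (a sketch to drop into the same spot in performance_model.cc; the constant 4542892 is the value observed for clflush in this build, and the file tag is reconstructed from the log):

// Inside the instruction-handling path (e.g. next to handleInstruction(ins)):
if (ins->opcode == 4542892)  // opcode observed for CLFLUSH in this build
{
    static int clflush_seen = 0;
    fprintf(stderr, "[performance_model.cc, %s, %d] CLFLUSH #%d executed\n",
            __func__, __LINE__, ++clflush_seen);
}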