[Parallel Computing] CUDA intrinsics

Reference: CUDA developer blog

why we introduce intrinsics

  • Race condition: Different outcomes depending on execution order.

  • Race conditions can occur in any concurrent system, including GPUs.

    • The code should print a = 1,000,000.
    • But it actually printed a = 88.
  • GPUs have atomic intrinsics for simple atomic operations.

  • Hard to implement general mutex in CUDA.

    • Code using critical sections doesn’t perform well on GPUs anyway.

Implementing a critical section with atomics performs poorly, because threads end up serialized on the lock. The running example simply increments a single variable d_a from many threads.
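A minimal sketch of that experiment (the kernel names and the launch configuration are assumptions, not the original code):

// Racy version: the read-modify-write of *d_a is three separate steps,
// so concurrent threads overwrite each other's updates.
__global__ void increment_racy(int *d_a)
{
    *d_a = *d_a + 1;
}

// Atomic version: the hardware performs the read-add-write without interruption.
__global__ void increment_atomic(int *d_a)
{
    atomicAdd(d_a, 1);
}

// Host side (sketch): 1,000 blocks x 1,000 threads = 1,000,000 increments.
// increment_racy<<<1000, 1000>>>(d_a);    // prints far less than 1,000,000
// increment_atomic<<<1000, 1000>>>(d_a);  // prints exactly 1,000,000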

CUDA atomics

Can perform atomics on global or shared memory variables.

  • unsigned int atomicInc(unsigned int *addr, unsigned int val)
    • Reads value at addr, increments it (wrapping back to 0 once the old value reaches val), returns old value.
    • Hardware ensures all 3 instructions happen without interruption from any other thread.
  • int atomicAdd(int *addr, int val)
    • Reads value at addr, adds val to it, returns old value.
  • int atomicMax(int *addr, int val)
    • Reads value at addr, sets it to max of current value and val, returns old value.
  • int atomicExch(int *addr, int val)
    • Sets the value at addr to val, returns the old value at addr.
  • int atomicCAS(int *addr, int old, int new)
    • “Compare and swap”, a conditional atomic.
    • Reads value at addr. If value equals old, sets value to new. Else does nothing.
    • Returns the old value, so the caller can tell whether the swap happened, i.e. whether its view was up to date.
    • Universal operation, i.e. it can be used to build any other atomic operation or synchronization primitive, as sketched below.
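For example, atomicCAS can emulate atomics the hardware does not provide directly. A minimal sketch (the function name atomicMaxFloat is mine, not part of the CUDA API) of an atomic max for float built from atomicCAS:

__device__ float atomicMaxFloat(float *addr, float val)
{
    int *addr_as_int = (int *)addr;
    int old = *addr_as_int;
    // Retry until our view of the value was up to date at the moment of the swap.
    while (val > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addr_as_int, assumed, __float_as_int(val));
        if (old == assumed)
            break;  // the swap succeeded
    }
    return __int_as_float(old);
}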

examples

Finding max of array

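A minimal sketch of the straightforward version (kernel and variable names are assumptions): every thread folds its element into one global variable with atomicMax, so all threads contend for the same address.

__global__ void max_naive(const int *data, int n, int *global_max)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(global_max, data[i]);  // every thread hits the same location
}

*global_max has to be initialized (e.g. to INT_MIN) before the launch.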

improve it

We can improve it by splitting the single global max into num_locals local max values.

Thread i atomically maxes its element into its own slot, local_max[locali] (e.g. locali = i % num_locals); the num_locals slots are combined into the final max at the end.

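A sketch of that idea (NUM_LOCALS and the index mapping are assumptions): the atomics are spread over NUM_LOCALS slots, so contention on any single address drops accordingly.

#define NUM_LOCALS 32

__global__ void max_striped(const int *data, int n, int *local_max)  // local_max has NUM_LOCALS entries
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicMax(&local_max[i % NUM_LOCALS], data[i]);  // thread i maxes into its own slot
}

// Afterwards, a single thread (or the host) maxes the NUM_LOCALS slots together.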

A better solution is to organize the combining step as a tree-structured reduction (divide and conquer), as sketched below.
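A sketch of a tree-shaped, per-block reduction in shared memory (the block size of 256 and the names are assumptions); each block then contributes a single atomicMax instead of one per element.

#include <climits>

__global__ void max_tree(const int *data, int n, int *global_max)
{
    __shared__ int smax[256];                      // one slot per thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smax[threadIdx.x] = (i < n) ? data[i] : INT_MIN;
    __syncthreads();

    // Halve the number of active threads each step: a log2(blockDim.x)-deep tree.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            smax[threadIdx.x] = max(smax[threadIdx.x], smax[threadIdx.x + stride]);
        __syncthreads();
    }

    if (threadIdx.x == 0)
        atomicMax(global_max, smax[0]);            // one global atomic per block
}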

Compute the histogram


! Whether the histogram bins are kept in shared or global memory is up to the programmer.
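A sketch of a histogram kernel with per-block bins in shared memory (NUM_BINS, the byte-valued input, and the names are assumptions); the cheap shared-memory atomics absorb the per-element traffic, and only NUM_BINS global atomics per block are needed to merge.

#define NUM_BINS 256

__global__ void histogram(const unsigned char *data, int n, unsigned int *global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Clear the per-block bins.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Cheap shared-memory atomics while counting.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block to merge.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}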

Shuffle intrinsics

! This kind of intrinsic is not limited to CUDA; shuffle/permute operations exist on SIMD architectures in general.

For example, with AVX 256-bit vectors we can rotate the eight float elements of a register:

#include <immintrin.h>

/* Load 8 floats from src and rotate them by one element (element 0 ends up in the last lane). */
__m256 load_rotr(float *src)
{
#ifdef __AVX2__
    __m256 orig = _mm256_loadu_ps(src);
    /* Lane i of the result takes element (i+1) mod 8 of the input. */
    __m256 rotated_right = _mm256_permutevar8x32_ps(orig, _mm256_set_epi32(0,7,6,5,4,3,2,1));
    return rotated_right;
#else
    /* AVX1 fallback: load src[1..8], then blend src[0] into the top lane. */
    __m256 shifted = _mm256_loadu_ps(src + 1);
    __m256 bcast = _mm256_set1_ps(*src);
    return _mm256_blend_ps(shifted, bcast, 0b10000000);
#endif
}

For the Kepler architecture (and later), CUDA provides 4 shuffle intrinsics: __shfl(), __shfl_up(), __shfl_down(), and __shfl_xor() (from CUDA 9 on, the synchronized variants __shfl_sync(), __shfl_up_sync(), __shfl_down_sync(), __shfl_xor_sync()).


The goal of the shuffle intrinsics is to optimize the memory-fetch model: threads within a warp can exchange register values directly, instead of staging them through shared memory (a store, a synchronization, and a load).

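A sketch of what this buys (the reduction and the name warp_sum are assumptions): summing a value across the 32 lanes of a warp with __shfl_down_sync, with no shared memory and no __syncthreads.

__device__ int warp_sum(int val)
{
    // Each step adds the value held by the lane `offset` positions above.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the sum of all 32 lanes
}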

butterfly operations


Assume we use shuffles instead of shared memory for the exchange:

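A sketch of the butterfly version (the name is an assumption): with __shfl_xor_sync each lane exchanges with the partner whose lane index differs in one bit, so after log2(32) = 5 steps every lane, not just lane 0, holds the result.

__device__ int warp_sum_butterfly(int val)
{
    // Butterfly exchange: partner lane = lane ^ mask.
    for (int mask = 16; mask > 0; mask /= 2)
        val += __shfl_xor_sync(0xffffffff, val, mask);
    return val;  // every lane now holds the full sum
}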