[Parallel Computing] Implementing collective communication

Collective communication

  • Important as the basis for many parallel algorithms.
  • Implemented using multiple point-to-point communications, possibly in parallel (see the sketch after this list).
    • Assume communicating m words (without contention) between a source and a destination takes \(t_{s} + m t_{w}\) time, where \(t_{s}\) is the startup latency and \(t_{w}\) is the per-word transfer time.
    • Ignore the distance of the message, since the per-hop latency \(t_{h}\) is usually small.
  • With contention c (c messages sharing the same link), the time becomes \(t_{s} + c m t_{w}\).
  • Implementations depend on communication hardware architecture.
  • Operations:
    • Broadcast and reduction.
    • All-to-all broadcast and reduction.
    • All-reduce and prefix sum.
    • Scatter and gather.
    • All-to-all scatter and gather (all-to-all personalized communication).
    • Circular shift.
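
Every collective below composes point-to-point transfers, each modeled as costing \(t_{s} + m t_{w}\). A minimal sketch of that building block, assuming an MPI environment (the message size m is illustrative):

```c
#include <mpi.h>
#include <stdlib.h>

/* One point-to-point transfer of m words, modeled as t_s + m*t_w.
 * The collectives below compose many of these, possibly in parallel. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int m = 1024;                 /* message size in words (illustrative) */
    int *buf = calloc(m, sizeof(int));

    if (rank == 0)
        MPI_Send(buf, m, MPI_INT, 1, 0, MPI_COMM_WORLD);    /* source */
    else if (rank == 1)
        MPI_Recv(buf, m, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                        /* destination */

    free(buf);
    MPI_Finalize();
    return 0;
}
```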

Broadcast and reduction on ring

(figure: broadcast and reduction on a ring)

Recursive halving: in each step, every node that already holds the message forwards it to the node halfway across its remaining segment, doubling the number of holders (sketched in code after the cost list below). Reduction runs the same pattern in reverse, combining partial results at each step.

cost

  • With p processors, \(\log p\) steps.
  • In each step, a processor sends a message of size m.
  • Total time \((t_{s} + m t_{w}) \log p\), taking \((t_{s} + m t_{w})\) as a constant cost per step.
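
A minimal sketch of the recursive-halving broadcast, assuming p is a power of two, rank 0 is the root, and the MPI setup above (`ring_bcast` is a hypothetical helper name):

```c
#include <mpi.h>

/* Recursive-halving broadcast from rank 0 on p = 2^k nodes:
 * log p steps, each costing t_s + m*t_w. */
void ring_bcast(int *buf, int m, int rank, int p) {
    for (int step = p / 2; step >= 1; step /= 2) {
        if (rank % (2 * step) == 0) {
            /* This node already holds the message; forward it
             * halfway across its segment of the ring. */
            MPI_Send(buf, m, MPI_INT, rank + step, 0, MPI_COMM_WORLD);
        } else if (rank % step == 0) {
            /* The holder halfway back sends to us in this step. */
            MPI_Recv(buf, m, MPI_INT, rank - step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
}
```

Reduction runs the same schedule in reverse (step = 1, 2, ..., p/2) with the sends and receives swapped, combining the incoming value with the local one at each step.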

Broadcast on mesh

(figure: broadcast on a 2-D mesh)

algorithm

  • The root first broadcasts along its row using the ring algorithm.
  • The nodes of the root’s row then broadcast along their columns using the ring algorithm (see the sketch below).
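
A sketch of the two-phase structure, assuming a \(\sqrt{p} \times \sqrt{p}\) mesh with row-major rank numbering; for brevity each phase calls MPI_Bcast on a sub-communicator, though either phase could equally run the ring algorithm sketched above:

```c
#include <mpi.h>
#include <math.h>

/* Two-phase mesh broadcast: the root's row first, then all columns.
 * Assumes p is a perfect square and ranks are laid out row-major. */
void mesh_bcast(int *buf, int m, int root, MPI_Comm comm) {
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    int q = (int)lround(sqrt((double)p));        /* mesh side length */
    int row = rank / q, col = rank % q;
    int root_row = root / q, root_col = root % q;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(comm, row, col, &row_comm);   /* one communicator per row */
    MPI_Comm_split(comm, col, row, &col_comm);   /* one communicator per column */

    /* Phase 1: broadcast along the root's row (log sqrt(p) steps). */
    if (row == root_row)
        MPI_Bcast(buf, m, MPI_INT, root_col, row_comm);
    /* Phase 2: every column broadcasts from the node in the root's row. */
    MPI_Bcast(buf, m, MPI_INT, root_row, col_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}
```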

cost

  • \(\log \sqrt{p} = (\log p) / 2\) steps along the row, and the same along the columns.
  • Total time \((t_{s} + m t_{w}) \log p\).

Broadcast on hypercube

algorithm

(figure: broadcast on a hypercube)

In each of the \(\log p\) steps, every node that holds the message sends it across one hypercube dimension, i.e. to the partner whose address differs in one bit (sketched below). The ring algorithm can be projected onto the hypercube, but with more congestion.
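
A minimal sketch of the dimension-by-dimension broadcast, assuming p = 2^d nodes and rank 0 as the root (`cube_bcast` is a hypothetical helper name):

```c
#include <mpi.h>

/* Hypercube broadcast from rank 0 on p = 2^d nodes: in each step the
 * holders send across one dimension (partner = rank XOR mask), so each
 * of the log p transfers uses its own link, with no congestion. */
void cube_bcast(int *buf, int m, int rank, int p) {
    for (int mask = p / 2; mask >= 1; mask /= 2) {
        int partner = rank ^ mask;
        if ((rank & (2 * mask - 1)) == 0) {
            /* This node already holds the message; send across the link. */
            MPI_Send(buf, m, MPI_INT, partner, 0, MPI_COMM_WORLD);
        } else if ((rank & (mask - 1)) == 0) {
            MPI_Recv(buf, m, MPI_INT, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
}
```

The schedule is the same as in the ring sketch (for the senders, rank ^ mask equals rank + step), but on the hypercube every partner is a single hop away.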

Prefix sum on hypercube

(figure: prefix sum on a hypercube)
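
A minimal sketch of the hypercube prefix sum (inclusive scan over one value per node), assuming p = 2^d nodes: in each step a node swaps its current subcube total with its partner across one dimension, and adds the partner's total to its own prefix only when the partner's subcube precedes it in rank order.

```c
#include <mpi.h>

/* Hypercube inclusive prefix sum: after log p exchange steps,
 * each rank r holds the sum of the values on ranks 0..r. */
int cube_prefix_sum(int value, int rank, int p) {
    int result = value;    /* running prefix over ranks <= mine */
    int total  = value;    /* running sum over my current subcube */
    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;
        int recv;
        /* Exchange subcube totals across this dimension. */
        MPI_Sendrecv(&total, 1, MPI_INT, partner, 0,
                     &recv,  1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        total += recv;                 /* sum over the merged subcube */
        if (partner < rank)
            result += recv;            /* lower subcube precedes mine */
    }
    return result;
}
```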