NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
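As a minimal sketch of this programming model (assuming an NVSHMEM installation with one GPU per PE; the launcher invocation and error handling are omitted, and `sym` is an illustrative name), a program can allocate a symmetric integer and have each PE write its ID into its right neighbor's memory with a fine-grained, GPU-initiated put:

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>
#include <stdio.h>

// GPU-initiated, fine-grained put: each PE writes its own ID into the
// symmetric variable on its right neighbor.
__global__ void write_to_neighbor(int *sym, int my_pe, int n_pes) {
    int peer = (my_pe + 1) % n_pes;
    nvshmem_int_p(sym, my_pe, peer);  // one-sided put of a single int
}

int main(void) {
    nvshmem_init();
    int my_pe = nvshmem_my_pe();
    int n_pes = nvshmem_n_pes();

    // Symmetric allocation: the returned address is valid on every PE.
    int *sym = (int *)nvshmem_malloc(sizeof(int));

    write_to_neighbor<<<1, 1>>>(sym, my_pe, n_pes);
    nvshmemx_barrier_all_on_stream(0);  // order the put before the read below
    cudaDeviceSynchronize();

    int received;
    cudaMemcpy(&received, sym, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", my_pe, received);

    nvshmem_free(sym);
    nvshmem_finalize();
    return 0;
}
```

A program like this is launched with one process per GPU, for example via the bundled `nvshmrun` launcher or any of the PMI-compatible launchers mentioned below.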

What's new in NVSHMEM 2.0.3

  • Added the teams and team-based collectives APIs from OpenSHMEM 1.5.

  • Added support to use the NVIDIA® Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.

  • Added support for RDMA over Converged Ethernet (RoCE) networks.

  • Added support for PMI-2 to enable an NVSHMEM job launch with srun/Slurm.

  • Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.

  • Uniformly reformatted the perftest benchmark output.

  • Added support for the putmem_signal and signal_wait_until APIs.

  • Improved support for single-node environments without InfiniBand.

  • Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.

  • Improved topology awareness in NIC-to-GPU assignments for NVIDIA® DGX™ A100 systems.

  • Added the NVSHMEM_CUDA_LIMIT_STACK_SIZE environment variable to set the GPU thread stack size on Power systems.

  • Updated the threading level support that was reported for host and stream-based APIs to NVSHMEM_THREAD_SERIALIZED.

  • Device-side APIs support NVSHMEM_THREAD_MULTIPLE.
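The new put-with-signal APIs combine a data transfer with a flag update on the target PE, which lets a consumer wait on a single location instead of polling the payload. A hedged sketch of the pairing (the exact argument list has varied across NVSHMEM releases, so treat the signatures as illustrative; `dst`, `src`, and `sig` are assumed to be symmetric allocations):

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Producer: transfer the payload to `peer` and set *sig there to 1 once
// the data is visible on the target, in a single combined operation.
__global__ void producer(float *dst, const float *src, size_t n,
                         uint64_t *sig, int peer) {
    nvshmemx_putmem_signal(dst, src, n * sizeof(float), sig, 1, peer);
}

// Consumer: block until the producer's signal lands; the payload is then
// safe to read without any further synchronization.
__global__ void consumer(uint64_t *sig) {
    nvshmemx_signal_wait_until(sig, NVSHMEM_CMP_EQ, 1);
}
```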

Key Features

  • Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs

  • Includes a low-overhead, in-kernel communication API for use by GPU threads

  • Includes stream-based and CPU-initiated communication APIs

  • Supports x86 and POWER9 processors

  • Is interoperable with MPI and other OpenSHMEM implementations
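The stream-based APIs enqueue communication on a CUDA stream so it executes in order with surrounding kernels without blocking the CPU. A hedged sketch (assumes `nvshmem_init()` has already run and that `send`/`recv` are symmetric allocations; the commented-out kernels are placeholders):

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Stream-ordered exchange: the put and barrier run on stream `s` after any
// previously enqueued producer kernel and before any consumer kernel.
void exchange_on_stream(float *recv, const float *send, size_t n,
                        int peer, cudaStream_t s) {
    // compute_kernel<<<grid, block, 0, s>>>(send, n);   // produce data
    nvshmemx_putmem_on_stream(recv, send, n * sizeof(float), peer, s);
    nvshmemx_barrier_all_on_stream(s);  // completes the put across all PEs
    // consume_kernel<<<grid, block, 0, s>>>(recv, n);   // use data
}
```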

NVSHMEM Advantages

Increase Performance

Convolution is a compute-intensive kernel that’s used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges. In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put, yielding significant performance improvements on Lawrence Livermore National Laboratory’s Sierra supercomputer.
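A one-sided halo exchange of this kind can be sketched with non-blocking puts (a hedged illustration, not LBANN's actual code; the 1-D partitioning, neighbor ranks, and buffer names are assumptions):

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Hedged sketch of a 1-D halo exchange. Each PE writes its boundary rows
// directly into its neighbors' symmetric halo buffers; no matching receive
// is posted, which removes the send/recv handshake of the MPI version.
__global__ void halo_exchange(float *nbr_halo_up, float *nbr_halo_down,
                              const float *first_row, const float *last_row,
                              size_t row_elems, int up, int down) {
    // Non-blocking puts let both directions proceed concurrently.
    if (up >= 0)
        nvshmem_float_put_nbi(nbr_halo_down, first_row, row_elems, up);
    if (down >= 0)
        nvshmem_float_put_nbi(nbr_halo_up, last_row, row_elems, down);
    nvshmem_quiet();  // wait for local completion of the non-blocking puts
    // A barrier or put-with-signal is still needed before neighbors read.
}
```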
