NVSHMEM™ is a parallel programming interface based on OpenSHMEM that provides efficient and scalable communication for NVIDIA GPU clusters. NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA® streams.
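The GPU-initiated model described above can be sketched in a few lines. The following is a minimal, hedged example (assuming a working NVSHMEM installation and a multi-PE launcher such as the bundled nvshmrun or srun): each PE allocates one symmetric integer, and a kernel writes the PE's own ID into the buffer of the next PE in a ring with a single device-side put.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Each PE's thread writes its own ID into the symmetric buffer of the
// next PE in a ring, using a fine-grained, GPU-initiated put.
__global__ void ring_put(int *dest) {
    int mype = nvshmem_my_pe();
    int peer = (mype + 1) % nvshmem_n_pes();
    nvshmem_int_p(dest, mype, peer);  // single-element remote write
}

int main() {
    nvshmem_init();
    int *dest = (int *)nvshmem_malloc(sizeof(int));  // symmetric allocation

    ring_put<<<1, 1>>>(dest);
    cudaDeviceSynchronize();
    nvshmem_barrier_all();  // ensure all remote writes have completed

    int received;
    cudaMemcpy(&received, dest, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d received %d\n", nvshmem_my_pe(), received);

    nvshmem_free(dest);
    nvshmem_finalize();
    return 0;
}
```

Because `dest` is a symmetric allocation, the same pointer value is a valid target on every PE, which is what lets the kernel address remote memory directly.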
What's new in NVSHMEM 2.0.3
Added the teams and team-based collectives APIs from OpenSHMEM 1.5.
Added support to use the NVIDIA® Collective Communication Library (NCCL) for optimized NVSHMEM host and on-stream collectives.
Added support for RDMA over Converged Ethernet (RoCE) networks.
Added support for PMI-2 to enable NVSHMEM job launches with srun (Slurm).
Added support for PMIx to enable an NVSHMEM job launch with PMIx-compatible launchers, such as Slurm and Open MPI.
Uniformly reformatted the perftest benchmark output.
Added support for the putmem_signal and signal_wait_until APIs.
Improved support for single-node environments without InfiniBand.
Fixed a bug that occurred when large numbers of fetch atomic operations were performed on InfiniBand.
Improved topology awareness in NIC-to-GPU assignments for NVIDIA® DGX™ A100 systems.
Added the NVSHMEM_CUDA_LIMIT_STACK_SIZE environment variable to set the GPU thread stack size on Power systems.
Updated the reported threading support level for the host and stream-based APIs to NVSHMEM_THREAD_SERIALIZED; the device-side APIs continue to support NVSHMEM_THREAD_MULTIPLE.
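The teams API added from OpenSHMEM 1.5 can be exercised from the host as in this sketch (assuming at least two PEs): NVSHMEM_TEAM_WORLD is split by stride into even and odd teams, and each team runs its own sum reduction over PE IDs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <nvshmem.h>

int main() {
    nvshmem_init();
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();

    // Split WORLD by stride 2: even PEs form one team, odd PEs the other.
    int start = mype % 2;
    int size  = (npes - start + 1) / 2;
    nvshmem_team_t team;
    nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, start, 2, size,
                               NULL, 0, &team);

    int *src = (int *)nvshmem_malloc(sizeof(int));
    int *dst = (int *)nvshmem_malloc(sizeof(int));
    cudaMemcpy(src, &mype, sizeof(int), cudaMemcpyHostToDevice);

    // Team-based host collective: sums PE IDs within this team only.
    nvshmem_int_sum_reduce(team, dst, src, 1);

    int sum;
    cudaMemcpy(&sum, dst, sizeof(int), cudaMemcpyDeviceToHost);
    printf("PE %d: team sum = %d\n", mype, sum);

    nvshmem_team_destroy(team);
    nvshmem_free(src);
    nvshmem_free(dst);
    nvshmem_finalize();
    return 0;
}
```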
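The putmem_signal and signal_wait_until APIs combine a data transfer with a notification, which avoids a separate flag write. The sketch below shows the intended producer/consumer pattern; note that the exact prefix (`nvshmem_` vs. `nvshmemx_`) and argument list of these calls have varied across NVSHMEM releases, so treat the signatures here as illustrative and check the headers of the release you are using.

```cuda
#include <nvshmem.h>
#include <nvshmemx.h>

// Producer PE: one call delivers 128 floats to the peer and then sets
// the peer's signal word, ordered after the payload.
__global__ void producer(float *dest, const float *src,
                         uint64_t *flag, int peer) {
    nvshmemx_putmem_signal(dest, src, 128 * sizeof(float),
                           flag, 1 /* signal value */, peer);
}

// Consumer PE: block until the signal arrives; once the wait returns,
// the payload written by the matching putmem_signal is visible.
__global__ void consumer(uint64_t *flag) {
    nvshmem_signal_wait_until(flag, NVSHMEM_CMP_EQ, 1);
}
```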
Combines the memory of multiple GPUs into a partitioned global address space that’s accessed through NVSHMEM APIs
Includes a low-overhead, in-kernel communication API for use by GPU threads
Includes stream-based and CPU-initiated communication APIs
Supports x86 and POWER9 processors
Is interoperable with MPI and other OpenSHMEM implementations
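The stream-based communication path in the feature list above enqueues NVSHMEM operations on a CUDA stream, so they execute in stream order with the surrounding kernels and no host round-trip is needed between them. A hedged fragment (assuming `dest`/`src` are symmetric allocations and `peer` is a valid PE):

```cuda
cudaStream_t s;
cudaStreamCreate(&s);

// Enqueue the put on the stream; it runs after prior work on s.
nvshmemx_float_put_on_stream(dest, src, n, peer, s);

// Stream-ordered barrier: completes outstanding puts and synchronizes PEs.
nvshmemx_barrier_all_on_stream(s);

cudaStreamSynchronize(s);
```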
Increase Performance

Convolution is a compute-intensive kernel that's used in a wide variety of applications, including image processing, machine learning, and scientific computing. Spatial parallelization decomposes the domain into sub-partitions that are distributed over multiple GPUs with nearest-neighbor communications, often referred to as halo exchanges. In the Livermore Big Artificial Neural Network (LBANN) deep learning framework, spatial-parallel convolution is implemented using several communication methods, including MPI and NVSHMEM. The MPI-based halo exchange uses the standard send and receive primitives, whereas the NVSHMEM-based implementation uses one-sided put operations, yielding significant performance improvements on Lawrence Livermore National Laboratory's Sierra supercomputer.
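The one-sided halo exchange described above can be sketched for a 1-D domain decomposition. This is an illustrative shape, not LBANN's actual code: each PE owns a `rows x cols` slab (with one ghost row at the top and bottom) in symmetric memory and pushes its boundary rows directly into its neighbors' ghost rows with stream-ordered puts, with no matching receive on the other side.

```cuda
#include <cuda_runtime.h>
#include <nvshmem.h>
#include <nvshmemx.h>

// Hypothetical halo exchange: 'slab' is a symmetric allocation, row 0 and
// row rows-1 are ghost rows, rows 1..rows-2 are owned interior rows.
void halo_exchange(float *slab, int rows, int cols, cudaStream_t s) {
    int mype = nvshmem_my_pe();
    int npes = nvshmem_n_pes();
    int up   = mype - 1;
    int down = mype + 1;

    // Push my first interior row into the upper neighbor's bottom ghost row.
    if (up >= 0)
        nvshmemx_float_put_on_stream(slab + (rows - 1) * cols,  // their ghost
                                     slab + 1 * cols,           // my boundary
                                     cols, up, s);

    // Push my last interior row into the lower neighbor's top ghost row.
    if (down < npes)
        nvshmemx_float_put_on_stream(slab,                       // their ghost
                                     slab + (rows - 2) * cols,   // my boundary
                                     cols, down, s);

    // One-sided communication: a barrier (not a receive) makes the
    // incoming ghost rows visible before the next convolution step.
    nvshmemx_barrier_all_on_stream(s);
}
```

Because the puts are one-sided, the receiving GPU spends no cycles on communication, which is the source of the speedup over send/receive reported for Sierra.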