NVIDIA HPC Fortran, C++ and C Compilers with OpenACC
Using NVIDIA HPC compilers for NVIDIA data center GPUs and x86-64, OpenPOWER and Arm Server multicore CPUs, programmers can accelerate science and engineering applications using Standard C++ and Fortran parallel constructs, OpenACC directives and CUDA Fortran.
The NVIDIA HPC compilers split execution of an application across multicore CPUs and NVIDIA GPUs using standard language constructs, directives, or CUDA.
Key Advantages
The NVIDIA Fortran, C++ and C compilers enable cross-platform HPC programming for NVIDIA GPUs and multicore CPUs. They are fully interoperable with NVIDIA optimized math libraries, communication libraries, and performance tuning and debugging tools. Commercial support is available with NVIDIA HPC Compiler Support Services (HCSS).
Full C++17 including Parallel Algorithms
The NVC++ compiler supports all features of C++17 including automatic acceleration of the C++17 Parallel Algorithms on NVIDIA GPUs.
OpenACC for Fortran, C++ and C Applications
Accelerate HPC applications with OpenACC directives, the proven solution used by over 200 HPC applications for performance-portable GPU programming.
Easy Access to NVIDIA Tensor Cores
The NVFORTRAN compiler can automatically accelerate standard Fortran array intrinsics and array syntax on NVIDIA Tensor Core GPUs.
For x86-64, Arm and OpenPOWER CPUs
Develop HPC applications on servers containing any mainstream CPU. The NVIDIA HPC Compilers are supported on over 99% of Top 500 systems.
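As a rough illustration of the C++17 Parallel Algorithms feature listed above, the sketch below computes a dot product with std::transform_reduce, where the execution policy is the only thing that requests parallelism. The helper name `dot` is hypothetical and not taken from NVIDIA's documentation. With NVC++, GPU offload of such calls is typically enabled with the `-stdpar` compiler option; treat that as an assumption to check against the compiler documentation for your release.

```cpp
#include <cassert>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: dot product of two equally sized vectors.
// The caller chooses the execution policy; the algorithm call is otherwise
// identical for sequential and parallel execution.
template <class Policy>
double dot(Policy&& policy, const std::vector<double>& x,
           const std::vector<double>& y) {
    assert(x.size() == y.size());
    // This overload multiplies element pairs and sums the products,
    // starting from the initial value 0.0.
    return std::transform_reduce(std::forward<Policy>(policy),
                                 x.begin(), x.end(), y.begin(), 0.0);
}
```

A call such as `dot(std::execution::par, x, y)` requests parallel execution, while `dot(std::execution::seq, x, y)` runs the same code sequentially. Keeping the data in std::vector (heap storage) is a deliberate choice here, since heap data is the simplest case for a compiler that moves data to the GPU automatically.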
C++ Parallel Algorithms: Accelerated
The C++17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms by adding an execution policy as the first parameter to any algorithm that supports one. Most of the existing Standard C++ algorithms now support execution policies, and C++17 defined several new parallel algorithms, including the useful std::reduce and std::transform_reduce.
The NVIDIA NVC++ compiler offers a comprehensive and high-performance implementation of the Parallel Algorithms for NVIDIA V100 and A100 data center GPUs, so you can get started with GPU programming using standard C++ that is portable to most C++ implementations for Linux, Windows, and macOS. The NVIDIA C++ Parallel Algorithms implementation is fully interoperable with OpenACC and CUDA for use in the same application.
Leverage NVIDIA Tensor Cores
NVIDIA A100 and V100 data center GPU Tensor Cores enable fast FP16 matrix multiplication and accumulation into FP16 or FP32 results with performance 8x to 16x faster than pure FP32 or FP64 in the same power envelope. NVIDIA A100 GPUs add Tensor Core support for TF32 and FP64 data types, enabling scientists and engineers to dramatically accelerate suitable math library routines and applications using mixed-precision, single-precision or full double-precision arithmetic.
With the NVIDIA HPC Fortran compiler, you can leverage Tensor Cores in your CUDA Fortran and OpenACC applications through automatic mapping of Fortran array intrinsics to cuTENSOR library calls, or by using the CUDA API interface to Tensor Core programming in a pre-defined CUDA Fortran module.
OpenACC for GPUs and CPUs
NVIDIA HPC compilers support full OpenACC 2.6 and many OpenACC 2.7 features on both NVIDIA data center GPUs and multicore CPUs. Use OpenACC directives to incrementally parallelize and accelerate applications, starting with your most time-intensive loops and routines and gradually accelerating all appropriate parts of your application while retaining full portability to other compilers and systems.
NVIDIA compilers leverage CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64, Arm and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed, simplifying GPU acceleration of applications and allowing you to focus on parallelization and scalability of your algorithms.
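To make the incremental approach described above concrete, here is a minimal sketch of a single hot loop annotated with one OpenACC directive. The function name `saxpy` and the choice of explicit data clauses are illustrative assumptions, not NVIDIA's example. A compiler without OpenACC support ignores the pragma and runs the loop serially, which is what keeps the source portable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: y[i] = a * x[i] + y[i] for equally sized vectors.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = y.size();
    const float* xp = x.data();
    float* yp = y.data();
    // Explicit data clauses: x is only read on the device, y is read and
    // written. If the compiler has no OpenACC support, this line is ignored.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The data clauses are written out here so the example does not depend on any memory mode. As the text above notes, when allocatable data is placed in CUDA Unified Memory, explicit data directives of this kind are not needed.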
OpenMP for Multicore CPUs
NVIDIA HPC Fortran, C++ and C compilers include support for OpenMP 4.5 syntax and features. You can compile OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads.
Multicore CPU performance remains one of the key strengths of the NVIDIA compilers, which now support all three major CPU families used in HPC systems: x86-64, 64-bit Arm and OpenPOWER. NVIDIA compilers deliver state-of-the-art SIMD vectorization and benefit from optimized single- and double-precision numerical intrinsic functions that use a uniform implementation across all types of CPUs to deliver consistent results across systems for both scalar and SIMD execution.
Debug Programs with PCAST
NVIDIA Parallelizing Compiler Assisted Software Testing (PCAST) detects where and why results diverge between CPU and GPU-accelerated versions of code, between successive versions of a program you are optimizing incrementally, or between the same program executing on two different processor architectures. OpenACC auto-compare runs compute regions redundantly on both the CPU and GPU, and compares the GPU and CPU results. Difference reports are controlled by environment variables, letting you pinpoint where results diverge. The PCAST API lets you capture selected data and compare it against a separate execution of the program, and NVIDIA HPC compilers include a directive-based interface for the PCAST API, maintaining portability to other compilers and platforms.
Developer Blog: Detecting Divergence Using PCAST to Compare GPU to CPU Results
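Returning to the OpenMP support for multicore CPUs described above, the following sketch shows an OpenMP 4.5 style TARGET region with a combined DISTRIBUTE and PARALLEL loop. The function name `sum_of_squares` is hypothetical. Under the behavior described above, such a region would run on the multicore host by default; a compiler without OpenMP enabled ignores the pragma and runs the loop serially.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sum of the squares of all elements of v.
double sum_of_squares(const std::vector<double>& v) {
    const std::size_t n = v.size();
    const double* p = v.data();
    double sum = 0.0;
    // The reduction clause gives each thread a private partial sum and
    // combines them at the end; sum is mapped tofrom so the result is
    // visible after the target region.
    #pragma omp target teams distribute parallel for \
        reduction(+:sum) map(to: p[0:n]) map(tofrom: sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += p[i] * p[i];
    }
    return sum;
}
```

Because the loop body is unchanged from its serial form, the same source can be built with or without OpenMP, which is the portability property the directive-based approach relies on.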
Who’s Using NVIDIA HPC Compilers
Over 200 GPU-accelerated application ports have been initiated or are in production using OpenACC and the NVIDIA HPC compilers, including three of the top five HPC applications as reported in a 2016 Intersect360 site census survey: ANSYS Fluent, Gaussian and VASP.
What Users are Saying
We have with delight discovered the NVIDIA “stdpar” implementation of C++17 Parallel Algorithms. … We believe that the result produces state-of-the-art performance, is highly didactical, and introduces a paradigm shift in cross-platform CPU/GPU programming in the community.
Professor Jonas Latt, University of Geneva