Data Analytics on Matrix Data Science Routers
Given the parallel nature of many data processing tasks, it’s only natural that the massively parallel architecture of a GPU should be able to parallelize and accelerate Apache Spark data processing queries, in the same way that a GPU accelerates deep learning (DL) in artificial intelligence (AI). NVIDIA has worked with the Apache Spark community to implement GPU acceleration through the release of Spark 3.0 and the open source RAPIDS Accelerator for Spark. In this post, we dive into how the RAPIDS Accelerator for Apache Spark uses GPUs to:
Accelerate end-to-end data preparation and model training on the same Spark cluster.
Accelerate Spark SQL and DataFrame operations without requiring any code changes.
Accelerate data transfer performance across nodes (Spark shuffles).
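To make the "no code changes" point concrete, here is a minimal sketch of how the accelerator might be switched on purely through configuration at submit time. It assumes the plugin class `com.nvidia.spark.SQLPlugin` and the `spark.rapids.sql.enabled` setting described in the RAPIDS Accelerator documentation; check both against the version you deploy. The jar path, the `build_submit_command` helper, and the `etl_job.py` application name are hypothetical and exist only for illustration.

```python
import shlex

# Assumed settings taken from the RAPIDS Accelerator documentation;
# verify them against the plugin version you actually deploy.
RAPIDS_CONF = {
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    "spark.rapids.sql.enabled": "true",
}


def build_submit_command(app_path,
                         jar_path="/opt/sparkRapidsPlugin/rapids-4-spark.jar",
                         extra_conf=None):
    """Return a spark-submit argument list with the accelerator enabled.

    The application itself (app_path) is left untouched: the GPU plugin
    is requested only through --jars and --conf flags.
    jar_path is a hypothetical placeholder location.
    """
    conf = dict(RAPIDS_CONF)
    conf.update(extra_conf or {})  # caller-supplied settings override the defaults

    cmd = ["spark-submit", "--jars", jar_path]
    for key in sorted(conf):  # sorted so the output is deterministic
        cmd += ["--conf", f"{key}={conf[key]}"]
    cmd.append(app_path)
    return cmd


if __name__ == "__main__":
    # Print a shell-safe command line for a hypothetical ETL job.
    print(shlex.join(build_submit_command("etl_job.py")))
```

Passing `extra_conf={"spark.rapids.sql.enabled": "false"}` to the same helper produces a command that runs the unchanged job on CPUs, which is a convenient way to compare the two modes.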
Data preparation and model training on Spark 2.x
GPUs have been responsible for much of the advancement of DL and machine learning (ML) model training in the past several years. However, by a commonly cited estimate, as much as 80% of a data scientist’s time is spent on data preprocessing. Preparing a data set for ML requires understanding the data set, cleaning and manipulating data types and formats, and extracting features for the learning algorithm. These tasks are grouped under the term ETL (extract, transform, load). ETL is often an iterative, exploratory process. As ML and DL are increasingly applied to larger data sets, Spark has become a commonly used vehicle for the data preprocessing and feature engineering needed to prepare raw input data for the learning phase. Because Spark 2.x has no knowledge of GPUs, data scientists and engineers perform the ETL on CPUs, then send the data over to GPUs for model training, which is the only stage in this workflow that is GPU-accelerated. As data sets grow, the CPU-bound ETL stage takes longer and the interactivity of this process suffers.