CSR: Small: CONCERT: Designing Scalable Communication Runtimes with On-the-fly Compression for HPC and AI Applications on Heterogeneous Architectures
The Ohio State University, Columbus, OH
Abstract
Artificial Intelligence (AI) and High-Performance Computing (HPC) hold immense potential for transformative advancements. This research aims to develop CONCERT, a communication and compression stack that unlocks the full power of heterogeneous architectures to deliver high performance and scalability. By leveraging emerging accelerators and networking hardware, CONCERT seeks to address fundamental challenges in utilizing heterogeneous architectures, scaling communication, and integrating application-agnostic on-the-fly data compression. The project's significance lies in its potential to advance the field of AI and HPC by enabling efficient utilization of heterogeneous resources, resulting in enhanced performance and scalability. CONCERT's impact extends beyond scientific advancements. The project will provide guidelines for designing and deploying next-generation HPC systems, benefiting users in academia and industry. By actively promoting diversity and inclusion, particularly among underrepresented minorities and female students, the project fosters a more inclusive STEM environment. The research outcomes will contribute to curriculum advancements, supporting education and research in HPC, Deep/Machine Learning, and Data Analytics. Additionally, the dissemination of results to collaborating organizations will positively impact their HPC software applications, benefiting society as a whole.
Over the last few years, AI and HPC applications have been continuously enhanced for performance by exploiting the latest trends in the highly heterogeneous hardware of modern HPC systems. These applications have high communication requirements and exchange massive amounts of data over a cluster's limited bandwidth. However, it is challenging for an application to efficiently use all the resources available in the system to scale up communication with the emerging on-the-fly compression support.
For this reason, an adaptive communication/compression stack called CONCERT (sCalable cOmmunicatioN Runtimes with On-the-fly Compression for HPC and AI Applications on hEterogenous aRchiTectures) is proposed. CONCERT dynamically employs dedicated resources through load- and architecture-aware Functional Partitioning (FP). It enhances the Message Passing Interface (MPI), the existing de facto standard for programming large-scale applications. The research focuses on the following issues: 1) efficient support for MPI/hybrid programming models on heterogeneous hardware to scale up communication and on-the-fly compression, 2) designing FP-based schemes to offload communication/compression tasks, 3) designing a communication/compression FP scheme to support scale-up requests from thousands of endpoints, and 4) studying the benefits of these schemes in terms of performance and scalability. The proposed research enables a broad range of AI and HPC applications to efficiently and transparently leverage the emerging accelerators and networking hardware from multiple vendors. A strong software distribution and data dissemination plan is also proposed to have a broader impact on academic and industrial HPC/AI communities. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.