CDS&E: An Effective Thermal Simulation Methodology for GPGPUs Enabled by Data-Driven Model Reduction
Clarkson University, Potsdam NY
Investigators
Abstract
Demand for general-purpose graphics processing units (GPGPUs) has increased rapidly in recent years, driven by the needs of scientific, engineering and statistical computing. Meanwhile, GPGPUs are quickly becoming an essential part of data centers around the globe. The number of data centers is growing drastically due to the recent explosion of social networking, movie streaming, online shopping, big data, the internet of things, etc. With hundreds or thousands of cores running in each GPGPU, severe heating is a serious challenge that can significantly degrade GPGPU performance, reliability and energy efficiency unless effective cooling is employed. However, effective cooling of data centers requires an enormous expenditure of energy. To ease these problems, effective thermal management and thermal-aware task scheduling for GPGPU operation are needed; these in turn require an accurate simulation tool that offers efficient dynamic thermal prediction with reasonable spatial resolution. Currently, there is a lack of thermal simulation tools that offer high efficiency and accuracy with reasonable resolution. The proposed work aims to develop an efficient simulation methodology, based on a reduced learning algorithm, that is capable of predicting accurate dynamic temperature distributions with high resolution in GPGPUs. With this novel approach implemented in GPGPUs, effective thermal management and task scheduling will become possible and will improve GPGPU performance and reliability. This will also improve energy savings in cooling, computing and streaming and minimize the earth’s environmental stress. This project will also contribute to interdisciplinary workforce training and prepare students for the emerging challenge of heating problems in GPGPU computing. Research related to the proposed work will be integrated into several courses taught by the PIs. Course projects will be developed by the Ph.D. and undergraduate students working on the proposed work. This will offer undergraduate and graduate students a useful learning experience beyond textbooks and lectures. The PIs will also expand and integrate several ongoing activities to broaden participation of underrepresented groups in STEM, e.g. through the Co-PI's NSF REU site. A special effort will be made to recruit and mentor Native Americans from an Indian Reservation near the PI’s university to join STEM activities and to pursue careers in STEM.
The goal of this project is to develop a multi-block simulation methodology, derived from a reduced learning algorithm, for efficient and accurate prediction of dynamic thermal profiles of GPGPUs. To reduce the simulation space, and thus the computational time, while maintaining an accurate thermal solution, the domain structure of a GPGPU is projected onto a functional space described by a set of basis functions obtained from the reduced learning method. This projection learning process, however, requires the collection of massive amounts of thermal data for the entire GPGPU and is computationally prohibitive. Domain decomposition is therefore applied to partition the GPGPU domain into hundreds of smaller generic building blocks. This building-block approach enables more efficient training of the basis functions used to develop the multi-block thermal model. The methodology reduces the computational time for thermal simulation of semiconductor chips by several orders of magnitude compared with direct numerical simulation. Currently, thermal simulations of GPGPUs rely on the compact resistance-capacitance (RC) thermal model, which is efficient but provides poor resolution and inaccurate thermal profiles. The developed thermal simulation model is expected to be even more efficient than the compact RC model. The multi-block approach also lends itself naturally to effective parallel computing. This project will implement the developed multi-block model on hundreds of cores in a GPGPU to perform parallel GPGPU computing, which will further speed up the thermal simulation of GPGPUs.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
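The projection idea in the technical summary above (training basis functions from thermal data, then describing a temperature field by a few coefficients in that basis) can be illustrated with a minimal sketch. The abstract does not name the reduced learning algorithm, so this example assumes a snapshot-based singular value decomposition, as used in proper orthogonal decomposition. The function names (`train_basis`, `project`, `reconstruct`) and the synthetic one-dimensional "block" data are hypothetical and are not taken from the project.

```python
import numpy as np

def train_basis(snapshots, n_modes):
    """Extract orthonormal basis functions from temperature snapshots.

    snapshots: array of shape (n_nodes, n_snapshots), one column per sample.
    Assumption: an SVD stands in for the unnamed "reduced learning" step.
    """
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    return U[:, :n_modes], energy[n_modes - 1]

def project(basis, field):
    """Reduce a full temperature field to a few modal coefficients."""
    return basis.T @ field

def reconstruct(basis, coeffs):
    """Recover the full-resolution field from the modal coefficients."""
    return basis @ coeffs

# Synthetic training data for one hypothetical building block:
# 200 grid nodes, 50 snapshots mixed from three spatial patterns.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
patterns = np.stack(
    [np.sin(np.pi * x), np.sin(2 * np.pi * x), np.exp(-((x - 0.5) / 0.1) ** 2)],
    axis=1,
)
snapshots = patterns @ rng.uniform(0.0, 1.0, size=(3, 50))

basis, captured = train_basis(snapshots, n_modes=3)

# A field not in the training set, described by 3 numbers instead of 200.
new_field = patterns @ np.array([0.3, 0.7, 0.2])
coeffs = project(basis, new_field)
error = np.max(np.abs(reconstruct(basis, coeffs) - new_field))
print(coeffs.shape, captured, error)
```

In this sketch the reduced model stores only three coefficients per block rather than one value per grid node. A full simulation would additionally evolve those coefficients in time by projecting the heat equation onto the basis, and would couple neighboring blocks at their interfaces; neither step is shown here.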