SHF: Small: Toward True Heterogeneous Computing: Concurrent Data Structure Design and Optimization

$591,167 · FY2019 · CSE · NSF

University of Mississippi, University, MS

Abstract

The introduction of Graphics Processing Units (GPUs) for handling intensive computing tasks in modern-day computers has changed the landscape for parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel computer architectures with impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. While the GPU can handle computation-intensive tasks and provide high throughput, the computer's brain, the Central Processing Unit (CPU), is good at handling latency-oriented tasks. The CPU and GPU communicate with each other through shared data structures, which need to be designed efficiently to avoid a mismatch in how the CPU and GPU are used when handling heterogeneous applications that tax the two processors at different levels. This project seeks to ensure such efficiency regardless of workload, through the design and implementation of highly scalable Concurrent Data Structures (CDSs). The developed CDSs are tested in practice by developing a library and real-world true heterogeneous workloads. The results, findings, and outcomes of the project are contributed to the open-source community and disseminated through scholarly publications, online materials, and websites.

The recent support for fine-grained data sharing and thread communication on GPU-powered computing platforms allows applications to benefit from easy, cheap, and diverse CPU and GPU collaboration models through data structures in shared virtual memory. In these newly enabled CPU-GPU collaboration models, latency-oriented CPU threads and throughput-oriented GPU threads synchronize and communicate with one another through shared data structures. Therefore, the efficiency of these data structures is crucial for the success of true heterogeneous computing. Designing a CDS that scales well across different concurrency levels is known to be a very challenging task. The increased concurrency on heterogeneous platforms makes it even more challenging with respect to both performance and correctness. While a significant amount of research has been done in the context of traditional CPUs, very little is known in the context of heterogeneous platforms. The full potential of a highly scalable CDS design can be achieved only when it is co-designed with the hardware; current hardware consists of first-generation designs that have not benefited from a good understanding of inter-processor fine-grained data sharing and thread communication. While the primary objective of the project is to design efficient CDSs whose performance scales well across a spectrum of concurrency demands on heterogeneous platforms, the other important target is to improve and optimize the hardware support crucial for efficient fine-grained data sharing and thread communication, such as inter-processor cache coherence, GPU thread scheduling, and SIMD-aware optimizations. To that end, this research takes a software-hardware co-design approach, using six specific research aims (reducing loads, developing contention management schemes, optimizing the CPU-GPU cache coherence protocol, SIMD-aware optimization, architecture-aware operation mapping, and library and real-world application development) to address three fundamental challenges (sequential bottlenecks, memory contention, and architectural heterogeneity).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
