Investigation of Reliability-Constrained On-Chip Networks
University of Southern California, Los Angeles, CA
Investigators
Abstract
Abstract 0541417, Alexander Sawchuk, Los Angeles, CA
Among the many challenges computer architects will face over the next decade and beyond is the growing demand for reliable on-chip communication between the functional domains of the system microarchitecture. Continued increases in scaling and integration of transistor and wiring resources allow more system functions to be implemented on chip, but they also bring more circuit defects and variability. Recent trends toward partitioning the system microarchitecture into multiple on-chip compute domains, in the form of functional unit blocks, tiles, and processor cores, mitigate chip-crossing delays and facilitate chip survivability. That is, partitioning helps to prevent system performance and cost from being encumbered by deep submicron technology scaling. With these developments, support for low-latency, high-throughput, and fault-tolerant communication is becoming increasingly critical within the on-chip network that interconnects the compute domains. Much recent research is directed toward designing on-chip networks to meet cost/performance goals (chip area, latency, and throughput), but very little architecture research explores on-chip network reliability issues specific to hard faults, which are recognized as a growing problem. In this research, we investigate reliability challenges and techniques for on-chip networks that will meet manufacturing yield and chip reliability targets as technology scales into the deep submicron regime. The goal is to understand the problem more fully and to develop on-chip network techniques for efficient resource and reliability management, fault isolation, dynamic reconfiguration, and fault recovery, so that fault-stricken microarchitectures partitioned across a chip have increased usability and prolonged life.
We endeavor to increase understanding of chip failure mechanisms (their causes and impact); to model them appropriately as they relate specifically to on-chip networks; to develop approaches and techniques that will allow on-chip networks (in cooperation with techniques for other components of the chip microarchitecture) to be resilient to hard faults; to evaluate and assess the benefit of the proposed techniques under expected workloads and common-case operational conditions; and, furthermore, to understand the tradeoffs in using the proposed fault-resilient on-chip network techniques, that is, to identify the situations in which the various techniques can be most usefully applied given other possible constraints.
The Intellectual Merit of this research is substantial. The research is timely, as it addresses an important issue that will only worsen with continuing advancements in technology scaling. The research will culminate in key contributions in (1) increasing our understanding of the fundamental design, process, and operational mechanisms most responsible for on-chip interconnect failures and (2) producing original and promising techniques for increasing on-chip interconnect reliability and chip reliability as a whole. Beyond the specific results produced by the models and simulation environments we will develop through this project, these tool artifacts will likely have a profound impact on future research infrastructure and education for years to come. They will be invaluable assets to researchers, students, and practitioners for understanding, developing, evaluating, and trading off alternative reliability techniques as demanded by advanced technologies and systems. The tools we develop will be made publicly available and are expected to have widespread use. The results of this research will also be widely disseminated through publications.
The Broader Impact of this research is significant and far-reaching. This research can have a profound impact on the success of near-future nanoscale technologies (molecular, quantum, etc.) used to implement integrated circuits beyond the CMOS era, as ICs implemented in these technologies are expected to have orders of magnitude more hard faults than CMOS ICs. Reliability techniques such as those that will be derived from this research will be critical to systems implemented in these technologies, as well as to those implemented in future deep submicron technology. In the nearer term, many of the ideas coming from this research may be transferable to system-level networks, where form-factor constraints are often less rigid than they are on chip.