SHF: Small: Collaborative Research: A Holistic Design Methodology for Fault-Tolerant and Robust Network-on-Chips (NoCs) Architectures
Ohio University, Athens OH
Abstract
Technology scaling into the nanometer regime has increased transistor counts, making multi-core architectures a power-efficient approach to harnessing parallelism and improving performance. Consequently, the design of low-latency, high-bandwidth, power-efficient, and reliable Networks-on-Chip (NoCs) is proving to be one of the most critical challenges to achieving the performance potential of future chips. While multicores provide an enormous integration capacity, aggressive transistor scaling has also led to a steady degradation of device and circuit reliability. Increased device wear-out (due to negative-bias temperature instability (NBTI), electromigration (EM), and hot carrier injection (HCI)) has exacerbated the waning reliability of transistors, resulting in a significant increase in faults (both permanent and transient) and hardware failures. As faults manifest within the NoC substrate, multicore chips face excessive delays and increased power consumption while recovering from them. While NoC reliability research has made significant strides at the inter- and intra-router levels, there is still no holistic design approach that cohesively covers the reliability of the entire NoC architecture, from device wear-out, to links and routers, to routing protocols, to applications. This project will develop a holistic design methodology that addresses the reliability of the entire NoC communication infrastructure (device, links, routers, routing algorithms, and topology) while minimizing the energy footprint, reducing the area overhead, and only marginally impacting performance. To improve link fault recovery, this project will develop techniques that maximize the utilization of the inter-router links with minimum power and area overhead.
For the router, this project will propose intra-router reliability techniques with the goals of maximizing hardware utilization, reducing redundancy and area overhead, and minimizing router pipeline latency. Further, wear-leveling techniques developed by this project will improve the reliability of NoCs and the lifetime of the chip. Finally, the proposed techniques will be evaluated by developing fault models, injecting them into the NoC, and measuring fault coverage, performance degradation, and energy efficiency through extensive modeling and simulation. The holistic design methodology spanning the entire NoC architecture and the reliability techniques developed in this project will positively impact next-generation multi-core and System-on-Chip (SoC) architectures with improvements in energy efficiency, performance, and robustness to hard faults and soft errors. This project will play a major role in education by integrating discovery with teaching and training, and by attracting and training minority students in this field.