SHF: Small: Light-weight Architectural Schemes for Resilient High-performance Microprocessors
Purdue University, West Lafayette IN
Investigators
Abstract
In future technology generations, smaller and more numerous transistors operating at low supply voltages and high clock speeds will be increasingly susceptible to many different resiliency problems, such as soft errors, wear-out, hard errors, and bit errors on off-chip and on-chip buses. These errors may cause silent data corruption, application aborts, or system crashes in high-performance microprocessors and computer systems. Previous techniques for addressing these errors incur significant performance and power overheads despite optimizations, and often require invasive changes that add considerable implementation complexity. In this research project, the investigators propose a novel, light-weight, yet highly effective architectural approach to processor reliability that incurs much lower overheads than existing approaches by leveraging key architectural observations about these problems. The project's approach to detecting soft errors, wear-out, and hard errors is based on detecting the execution anomalies that errors trigger, without using redundant execution. By exploiting the notion of value locality, the project generalizes anomalies to include unexpected values in addition to unexpected conditions (e.g., memory access exceptions), and provides significant coverage that includes the most problematic cases of silent data corruption. For recovery from soft errors, the investigators propose a retry-based scheme that achieves recovery without added hardware overhead by reusing spare speculative resources already present in the processor. For off-chip bus bit errors, the investigators propose a novel bit interleaving scheme that reduces the chance that multiple bits in a single error correcting code (ECC)-protected data unit are corrupted undetectably or uncorrectably. Like the other schemes, this interleaving imposes minimal power, performance, and complexity overhead.
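The general idea behind bit interleaving can be illustrated with a minimal sketch. The abstract does not describe the project's actual scheme, so everything below is an assumption made for illustration: the function names, the 8-bit word width, and the simple round-robin layout are all hypothetical. The sketch compares a contiguous layout, where a burst of adjacent bit flips lands in one ECC-protected word, with an interleaved layout, where the same burst is spread across words at one bit each (a pattern a single-error-correcting code could handle).

```python
# Illustrative sketch only: a generic round-robin bit interleaving,
# not the scheme proposed by the project (which the abstract does not detail).

def contiguous(words, width):
    """Place each word's bits next to each other on the bus."""
    return [(w >> i) & 1 for w in words for i in range(width)]

def from_contiguous(bits, n_words, width):
    """Rebuild words from the contiguous layout."""
    return [sum(bits[k * width + i] << i for i in range(width))
            for k in range(n_words)]

def interleave(words, width):
    """Place bit i of every word next to each other on the bus."""
    return [(w >> i) & 1 for i in range(width) for w in words]

def deinterleave(bits, n_words, width):
    """Rebuild words from the interleaved layout."""
    words = [0] * n_words
    for pos, bit in enumerate(bits):
        i, k = divmod(pos, n_words)  # bit index i of word k
        words[k] |= bit << i
    return words

def flip_burst(bits, start, length):
    """Model a burst error: flip `length` adjacent bus bits."""
    out = list(bits)
    for p in range(start, start + length):
        out[p] ^= 1
    return out

def flips_per_word(sent, received):
    """Count corrupted bits in each word."""
    return [bin(a ^ b).count("1") for a, b in zip(sent, received)]

words = [0xA5, 0x3C, 0xFF, 0x00]  # four hypothetical 8-bit ECC data units
width = 8

# The same 4-bit burst applied to each layout.
plain = from_contiguous(flip_burst(contiguous(words, width), 8, 4),
                        len(words), width)
spread = deinterleave(flip_burst(interleave(words, width), 8, 4),
                      len(words), width)

print(flips_per_word(words, plain))   # all four flips land in one word
print(flips_per_word(words, spread))  # one flip per word
```

In this toy setting, the burst corrupts four bits of a single word under the contiguous layout but only one bit in each of four words under the interleaved layout, as long as the burst is no longer than the number of interleaved words.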
This project targets achieving reliability while keeping power, performance, and hardware overheads low, an important goal for the U.S. microprocessor and computer hardware industry. The project's investigators are committed to releasing the research artifacts as open-source software to be used by the research community. The graduate students working on this project will be trained in architecture and reliability issues and will be well-positioned to join the U.S. computer hardware industry. This project will also support educational activities such as homework and term projects in undergraduate and graduate courses as well as outreach activities of various centers at Purdue with which the investigators are involved. With a woman as one of the investigators, the project will act as a basis for encouraging women to join graduate programs in electrical and computer engineering.