ITR: Scalable Non-Stop Blade-Based Servers
Carnegie Mellon University, Pittsburgh PA
Investigators
Abstract
This project proposes the Scalable Non-stOP Server (SNOPS) architecture: a reliable, available, and serviceable (RAS) hardware platform. SNOPS offers cost and performance scalability unparalleled by conventional RAS-oriented servers by using commodity blade components interconnected through a scalable network and hardware distributed shared memory (DSM). SNOPS offers non-stop service in two ways: fast, application-transparent detection of and recovery from soft (transient) errors and single processor or memory component failures, and hot-swapping of a module after a hardware component failure without loss of state. The project proposes the MEMory BaRrier Across the NEtwork (MEMBRANE), an abstraction layer that regulates memory data transfer between processor and memory modules. MEMBRANE serves as an impenetrable barrier to faulty data from both the processor and the memory sides, allowing only error-free data to cross. On the processor side, the MEMBRANE coherence protocols compare the requests issued by the redundant processors of a group, detecting errors that originate in those processors and triggering recovery. On the memory side, the MEMBRANE memory redundancy protocols detect and recover from errors originating in the memory components via a RAID-like distributed parity scheme. The project's prototype will support a commodity OS (such as Linux) and deliver the performance necessary for full-scale evaluation of these ideas against commercial-grade server applications. A scalable server simulation infrastructure will also be available for distribution, allowing fast, accurate, full-system simulation of servers. Proof-of-concept prototypes built in industrial and academic settings are an invaluable experience, especially for students, and also have high potential to influence industrial research and development.
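The two MEMBRANE checks described in the abstract can be illustrated with a minimal software sketch. This is not the project's protocol: the abstract says only that processor requests are compared and that memory uses a "RAID-like distributed parity scheme". The sketch assumes simple bytewise XOR parity (as in RAID 5), and every name in it (`xor_blocks`, `make_parity`, `recover_block`, `compare_requests`) is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block for a group of memory blocks (assumed XOR parity)."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


def compare_requests(requests):
    """Processor-side check: return the request if all redundant
    processors issued the same one, else None to signal recovery."""
    first = requests[0]
    return first if all(r == first for r in requests[1:]) else None


# Memory side: lose one block, rebuild it from the others plus parity.
blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = make_parity(blocks)
rebuilt = recover_block([blocks[0], blocks[2]], parity)

# Processor side: matching requests pass, mismatched requests are blocked.
agreed = compare_requests([("read", 0x40), ("read", 0x40)])
mismatch = compare_requests([("read", 0x40), ("read", 0x44)])
```

Here `rebuilt` equals the dropped block `blocks[1]`, and `mismatch` is `None`, which stands in for the point at which a real system would trigger recovery instead of letting the request cross the barrier.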