Publications / Conference

Towards a specification for measuring Red Storm reliability, availability, and serviceability (RAS)

Stearley, Jon S.

The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved, hinders their solution, and increases total system cost. Seeking to foster a common basis for communication about supercomputer RAS, [1] proposed a general system state model, definitions, and measurements based on the SEMI-E10 specification [2] used in the semiconductor manufacturing industry. This document enumerates the platform-specific details necessary to apply that general framework to the Red Storm system at Sandia National Laboratories. Familiarity with [1] is a strong prerequisite for understanding this document, as is, to a much lesser degree, familiarity with the Red Storm RAS subsystem. Because Red Storm is currently in pre-production, this document does not specify actual policy or practice; rather, it proposes a framework for measuring RAS performance on Red Storm.