Publications

Results 26–34 of 34

Towards a specification for measuring Red Storm reliability, availability, and serviceability (RAS)

Stearley, Jon S.

The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved, hinders their solution, and increases total system cost. Seeking to foster a common basis for communication about supercomputer RAS, [1] proposed a general system state model, definitions, and measurements based on the SEMI-E10 specification [2] used in the semiconductor manufacturing industry. This document enumerates the platform-specific details necessary to apply that general framework to the Red Storm system at Sandia National Laboratories. Familiarity with [1] is a strong prerequisite for understanding this document; familiarity with the Red Storm RAS subsystem also helps, though to a much lesser degree. Given the current pre-production status of Red Storm, this document does not specify actual policy or practice, but rather proposes a framework by which to measure RAS performance on Red Storm.

Defining and measuring supercomputer Reliability, Availability, and Serviceability (RAS)

Stearley, Jon S.

The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper seeks to foster a common basis for communication about supercomputer RAS by proposing a system state model, definitions, and measurements. These are modeled after the SEMI-E10 specification, which is widely used in the semiconductor manufacturing industry.
