The High Performance Conjugate Gradient (HPCG) benchmark [cite SNL, UTK reports] is a tool for ranking computer systems based on a simple additive Schwarz, symmetric Gauss-Seidel preconditioned conjugate gradient solver. HPCG is similar to the High Performance Linpack (HPL), or Top 500, benchmark [1] in its purpose, but HPCG is intended to better represent how today’s applications perform. In this paper we describe the technical details of HPCG: how it is designed and implemented, what code transformations are permitted and how to interpret and report results.
The High Performance Linpack (HPL), or Top 500, benchmark [1] is the most widely recognized and discussed metric for ranking high performance computing systems. However, HPL is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications. In this paper we describe a new high performance conjugate gradient (HPCG) benchmark. HPCG is composed of computations and data access patterns more commonly found in applications. Using HPCG we strive for a better correlation to real scientific application performance and expect to drive computer system design and implementation in directions that will better impact performance improvement.
The Trilinos Project is an effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems. A new software capability is introduced into Trilinos as a package. A Trilinos package is an integral unit and, although there are exceptions such as utility packages, each package is typically developed by a small team of experts in a particular algorithms area such as algebraic preconditioners, nonlinear solvers, etc. The Trilinos Developers SQE Guide is a resource for Trilinos package developers who are working under Advanced Simulation and Computing (ASC) and are therefore subject to the ASC Software Quality Engineering Practices as described in the Sandia National Laboratories Advanced Simulation and Computing (ASC) Software Quality Plan: ASC Software Quality Engineering Practices Version 3.0 document [1]. The Trilinos Developer Policies webpage [2] contains a lot of detailed information that is essential for all Trilinos developers. The Trilinos Software Lifecycle Model [3] defines the default lifecycle model for Trilinos packages and provides a context for many of the practices listed in this document.
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.