Sandia LabNews

Sandia's off-the-shelf 'home-grown' supercomputer may become 20th fastest in world


1,300 new Cplant nodes arrive at Sandia

One thousand three hundred new computers from Compaq Computer Corporation have arrived at Sandia to increase the power of a “home-grown” Sandia computational cluster that, linking only 600 desktop computers, already ranks 44th among the world’s fastest supercomputers.

NOT A MAGIC LANTERN, BUT CLOSE — Ron Brightwell (9223) examines the motherboard of one of Cplant’s computers. The board is from a Digital Personal Workstation 500a. Digital is now owned by Compaq Computer Corp. (Photo by Randy Montoya)

Researchers expect the latest version of Cplant, the upwardly mobile Antarctica cluster at Sandia, to become approximately the 20th fastest computer in the world.

The existing cluster was already the largest “production” Linux cluster, meaning one that produces technical results to aid ongoing science projects.

“This is another kind of revolution going on,” says lead Cplant™ software developer Rolf Riesen of Scalable Computing Systems Dept. 9223, “that a major government laboratory like Sandia is willing to spend $9.6 million plus a significant amount of in-house development to make a supercomputer out of a supply of off-the-shelf parts.”

The new Cplant will include 1,600 computers, also called nodes: the 1,300 new machines plus most of the existing 600. (Three hundred older nodes will be used for other purposes.) The additional units are expected to be up and running early this fall.

“Supercomputers for the past decade have traditionally been purchased as turnkey machines from the world’s largest computer makers,” says Neil Pundit (9223), manager of Cplant software development at Sandia. “Such machines have cables, connection boxes, as well as monitors and testing equipment, already built in place. In Cplant, we are following a new path, assembling a supercomputer out of parts, open-source software, and our own developments.”

The fastest supercomputers in the world are an integral part of DOE’s science-based stockpile stewardship program, which requires extremely high computational speeds to simulate nuclear explosions and to make sense of the torrent of data obtained from those simulations. ASCI Red, Sandia’s Intel-built supercomputer, was the fastest machine in the world for several years until bested in early July by another DOE supercomputer — ASCI White, an IBM-built supercomputer at Lawrence Livermore National Laboratory. The factory-built machines are still far superior to any off-the-shelf products.

The poor man’s supercomputer

However, Sandia researchers David Greenberg (who has since left Sandia), Art Hale (9220), George Davidson (9201), and Bill Camp (9200) decided they could create a “poor man’s” ASCI Red architecture by combining high-performance commodity parts with Sandia software to be developed by Rolf and his colleagues at Sandia’s sites in New Mexico and California. They called this idea Cplant™. Because they had helped develop the system software that made Red into the fastest computer in the world, they believed they could succeed with an off-the-shelf version.

Sandia took up the task of physically linking the highest-performance commodity computers in the world, driven by Compaq Alpha processors, into a tightly knit cluster — really a virtual supercomputer. The researchers then developed the software to make this work.

Bill Blake, vice president of Compaq’s High Performance Technical Computing Group, says, “Sandia is doing pioneering work in building truly large Linux systems, using a combination of open source software along with their researchers’ own development, along with hardware, tools and compilers from Compaq.”

Better than Beowulf?

Cplant differs from better-publicized clusters like Beowulf, developed to run very specific programs for small groups of users, or the University of California’s Millennium Project, which attempts to link clusters of computers so that, when unused by specific owners, they can be tapped to contribute to the overall power of the system.

Cplant is a true, multipurpose supercomputer, says Bill Camp, Director of Center 9200. Scientists can run any program in exactly the same fashion as though they were using ASCI Red. Cplant’s current use is to provide backup for the over-subscribed Red machine, also known as the teraflops computer. With its new capabilities, Cplant should run at one-half to two-thirds of Red’s speed.

The term Cplant™, for Computational Plant, has a double meaning: physical computational hardware (as in industrial plant), and an organic plant that grows, evolves, and is pruned.

Ahead of the pack — somewhat

“Most researchers have a hard time convincing their sponsors that this approach is feasible; the software out there doesn’t scale to such numbers of nodes,” says Rolf. “Our software, on the other hand, already ran. So Sandia jumped out ahead of the pack.

“But not that ahead — only a year or two. Eventually, other people will get there too. People all over the world are already using Beowulf. We are hoping to release our software to the general public soon. Then everyone in the world will help us improve it, and kids who try to hack new capabilities into it will become a work force we can later on employ. Otherwise we will have this proprietary code that no one knows about, and something else that may not be as good will become the standard to be improved; and when we hire people, they won’t have experience with the systems we’re running.”

The Compaq AlphaServer systems run a modified version of Red Hat Linux plus the parallel systems software developed in the Cplant project. The DS10L is less than two inches tall, allowing up to 42 DS10L systems to be packaged in a standard rack. The Sandia design packages 33 systems in a rack, leaving room for other required components such as high-performance interconnects, networking, and system management. This is a significant expansion over current Cplant designs, which allow only eight systems in a rack with little space for other components. The new racks are designed to require as few external connections as possible, allowing the major functional units of the system to be integrated and tested in manufacturing at Compaq. This greatly simplifies installation and maintenance of so large a system.
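For a rough sense of scale, a back-of-the-envelope calculation (not from the article, just illustrative arithmetic based on the figures above) shows how many racks the 1,600 new nodes imply at 33 systems per rack:

    import math

    # Figures quoted above: 1,600 new nodes, packaged 33 to a rack
    # (the 1U form factor would allow 42, but space is reserved for
    # interconnect, networking, and system-management gear).
    nodes = 1600
    nodes_per_rack = 33

    racks = math.ceil(nodes / nodes_per_rack)
    print(f"{nodes} nodes at {nodes_per_rack} per rack -> roughly {racks} racks")
    # roughly 49 racks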

Internal communications among processors are carried out over a series of links and switches called Myrinet, developed by Myricom Corp. The several internal communications networks in Cplant are critical to managing the computer as a single resource and to carrying large parallel jobs. The newest Myrinet switches and links arrived in July. “The machines should be up and running as production resources in their new configuration within a few months,” says Art Hale, Sandia manager of the Cplant project.
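The article does not detail Cplant’s own message-passing software, but as a generic illustration of the kind of node-to-node traffic a parallel job sends over such an interconnect, here is a minimal sketch using the standard MPI interface through the mpi4py package (an assumption for illustration only, not Cplant’s actual software stack):

    # Illustrative only: mpi4py/MPI are assumed here for demonstration;
    # the article does not say what message-passing layer Cplant uses.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()   # this process's number within the parallel job
    size = comm.Get_size()   # total number of processes in the job

    if rank == 0:
        # Process 0 sends a short message to every other process
        # over the cluster's internal network.
        for dest in range(1, size):
            comm.send(f"hello from rank 0 of {size}", dest=dest, tag=0)
    else:
        msg = comm.recv(source=0, tag=0)
        print(f"rank {rank} received: {msg}")

Launched with a command such as mpirun, one copy of this program would run on each node, with the sends and receives traveling over the internal links and switches described above.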

Sandia now has 2,600 Compaq computers as nodes in Cplant clusters of various configurations, with 512 at Sandia’s California site. The Antarctica subcluster, in New Mexico, is the largest and has 1,632 processors. This system is really three systems, with 256 processors always in a classified partition, 256 always in a secure but unclassified partition, and 64 always in an “open” partition. The open partition is available to uncleared staff and partners from industry and academia. The other 1,056 processors will be switched among the three partitions as demand for the types of calculations warrants.
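A quick tally (illustrative arithmetic only, using the figures just quoted) confirms how the Antarctica processors divide among the fixed and switchable pools:

    # Partition figures quoted for the Antarctica subcluster.
    fixed = {"classified": 256, "secure-unclassified": 256, "open": 64}
    switchable = 1056  # moved among the three partitions as demand warrants

    total = sum(fixed.values()) + switchable
    print(fixed)
    print("switchable:", switchable, "total:", total)  # total: 1632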

Last modified: August 18, 2000