Sandia National Laboratories and DataDirect Networks (DDN) have partnered for more than 3.5 years to design a next-generation parallel storage system that meets the needs of current and future scientific workloads executing on large-scale High Performance Computing (HPC) systems. The partnership has produced extensions for DDN’s new product, Infinia, which targets HPC storage workloads with simplicity and strong data management features as its core principles. This will bring the world’s fastest object store to bear on HPC’s hardest storage problems.
This partnership is the result of the National Nuclear Security Administration’s (NNSA) post-Exascale-Computing-Initiative-Portfolio, which is intended to have a demonstrable impact on NNSA’s Advanced Simulation and Computing (ASC) mission capabilities. The project has enabled Sandia and its partner Department of Energy (DOE) laboratories to influence the Infinia product during its critical design phase.
“This is an exciting example of the innovation that can take place when national laboratories collaborate with industry. Working with industry experts, such as DDN, allows ASC to not only tackle today’s challenges, but better anticipate tomorrow’s as well,” said ASC Program Director Thuc Hoang.
This level of collaboration allows the laboratories to better support and influence early decisions that will enhance the product’s capabilities for complex HPC workloads.
“DDN’s mission has always been to deliver powerful storage solutions that will meet the needs of the world’s most data intensive and challenging workloads. Working with the NNSA laboratories has allowed us to develop functionality within a new product that meets their complex requirements today and in projects the NNSA foresees in the future,” said Sven Oehme, DDN’s CTO.
Historically, the most advanced storage systems have included state-of-the-art techniques for fault tolerance, high-speed networking, and consistency management, while maintaining high data rates in the face of extreme resilience requirements. Other aspects of HPC have the leeway to fail from time to time, but data storage must be reliable 100% of the time. If any of these concerns is not addressed, the system becomes unusable.
“DDN and DOE embarked on this collaboration to ensure that traditional and emerging HPC workloads can be serviced well by Infinia systems. In addition to the traditional POSIX-centric modeling and simulation jobs we have run for decades, DOE has an emerging crop of workflows that use new access methods and patterns including machine learning and cloud-centric tasks,” said Matthew Curry, Principal Member of Technical Staff at Sandia. “We have found that by allowing DDN and a diverse representation of subject matter experts from DOE laboratories to jointly participate in design activities, Infinia’s architecture can be broadened to handle the gamut of DOE workloads and beyond, contributing to mission success for DOE and a wider market for DDN.”
It is critically important that any new storage system accommodate not only the current simulation mission requirements, which include traditional scalable checkpoint-restart capabilities, but also the use cases that are expected to become important to the DOE in the very near future. This collaboration has focused on ensuring that the historic niche requirements are built into the foundation of the Infinia product, since history has shown that attempting to add them later is not feasible.
“As our mission workloads transition to the digital engineering realm we need storage technologies such as Infinia which allow us to move seamlessly between HPC, cloud and enterprise computing resources,” said Robert Hoekstra, Senior Manager of Extreme Scale Computing at Sandia.
Satisfying our traditional needs while adding the storage access methods available in Infinia gives us the flexibility to support existing and future use cases.
“DDN has been working on a next generation data services platform for data that is dynamic, simple to manage and ensures high performance. It has been extremely valuable to implement the insight from technology leadership within the NNSA laboratories as well as some DOE Office of Science labs, which is critically important regarding implementation for this project,” said James McKenna, DDN’s Account Executive.
This collaboration is part of Sandia’s Vanguard program, which is tasked with investigating and developing advanced technologies that can be applied to ASC’s core national security mission.
“The collaboration with DDN is a perfect example of a project that fits well under Sandia’s Vanguard program. Taking on our most challenging problems and providing solutions for not only Sandia but the DOE and wider HPC community,” said James H. Laros III, Distinguished Member of Technical Staff and program lead at Sandia. “This partnership with DDN serves as an exemplar for collaborations between DOE labs and industry.”
About NNSA: Established by Congress in 2000, NNSA is a semi-autonomous agency within the U.S. Department of Energy responsible for enhancing national security through the military application of nuclear science. NNSA maintains and enhances the safety, security, and effectiveness of the U.S. nuclear weapons stockpile; works to reduce the global danger from weapons of mass destruction; provides the U.S. Navy with safe and militarily effective nuclear propulsion; and responds to nuclear and radiological emergencies in the United States and abroad.
May 8, 2024