2010
Ron Brightwell, Kurt Ferreira, Suzanne Kelly, Michael Levenhagen, Courtenay Vaughan,
Kitten Operating System Virtualization Team, Sandia National Laboratories,
March 23, 2010
Kurt Brian Ferreira
Scalable System Software
(505) 844-0433
Sandia National Laboratories, New Mexico
P.O. Box 5800
Albuquerque, NM 87185-1319
Biography
Principal Member of Technical Staff
My area of expertise is system software and resilience/fault-tolerance methods for large-scale, massively parallel, distributed-memory, scientific computing systems. I have designed and developed a number of innovative, high-performance, and resilient implementations of low-level system software for several HPC platforms including the Cray Red Storm (XT3) machine at Sandia National Laboratories. My research interests include the design and construction of operating systems for massively parallel processing machines and innovative application and system-level fault-tolerance mechanisms for HPC.
Education
I received a BS in mathematics and a BS in computer science from New Mexico Tech in 2000, followed by an MS in computer science in 2008 and a PhD in computer science in 2011, both from the University of New Mexico.
Publications
Kurt Ferreira, Scott Levy, (2022). Characterizing Memory Failures Using Benford’s Law Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1887503 Publication ID: 75682
Kurt Ferreira, Scott Levy, (2021). Evaluating MPI resource usage summary statistics Parallel Computing https://doi.org/10.1016/j.parco.2021.102825 Publication ID: 75299
Keira Haskins, Patrick Bridges, Kurt Ferreira, Scott Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications https://www.osti.gov/servlets/purl/1899492 Publication ID: 76415
Keira Haskins, Patrick Bridges, Kurt Ferreira, Scott Levy, (2021). A Benchmark to Understand Communication Performance in Hybrid MPI and GPU Applications https://www.osti.gov/servlets/purl/1899493 Publication ID: 76416
Kurt Ferreira, Scott Levy, (2021). Characterizing Per-node Memory Failures Using Benford’s Law https://www.osti.gov/servlets/purl/1886179 Publication ID: 75504
Scott Levy, Kurt Ferreira, (2021). An Initial Examination of the Effect of Container Resource Constraints on Application Perturbation https://doi.org/10.2172/1869756 Publication ID: 78565
Stephen Olivier, Ronald Brightwell, Kurt Ferreira, Ryan Grant, Scott Levy, Kevin Pedretti, Andrew Younge, (2021). SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime https://www.osti.gov/servlets/purl/1861479 Publication ID: 77902
Kurt Ferreira, Scott Levy, Victor Kuhns, Nathan DeBardeleben, Sean Blanchard, (2021). Understanding the Effects of DRAM Correctable Error Logging at Scale Proceedings – IEEE International Conference on Cluster Computing, ICCC https://doi.org/10.1109/Cluster48925.2021.00060 Publication ID: 79606
Kurt Ferreira, Scott Levy, (2020). Evaluating MPI Message Size Summary Statistics ACM International Conference Proceeding Series https://www.osti.gov/servlets/purl/1825984 Publication ID: 71238
Ronald Brightwell, Kurt Ferreira, Ryan Grant, Scott Levy, Gerald Lofstead, Stephen Olivier, Kevin Pedretti, Andrew Younge, Ann Gentile, (2020). ALAMO: Autonomous Lightweight Allocation Management and Optimization https://www.osti.gov/servlets/purl/1818044 Publication ID: 74680
Kurt Ferreira, Ryan Grant, Michael Levenhagen, Scott Levy, Taylor Groves, (2020). Hardware MPI message matching: Insights into MPI matching behavior to inform design Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.5150 Publication ID: 64546
Scott Levy, Kurt Ferreira, Patrick Widener, (2020). The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints Concurrency and Computation: Practice and Experience https://doi.org/10.1002/cpe.4890 Publication ID: 54218
Scott Levy, Kurt Ferreira, (2020). Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1641289 Publication ID: 69979
Scott Levy, Kurt Ferreira, (2019). Evaluating tradeoffs between MPI message matching offload hardware capacity and performance ACM International Conference Proceeding Series https://doi.org/10.1145/3343211.3343223 Publication ID: 70063
Scott Levy, Kurt Ferreira, Whit Schonbein, Ryan Grant, Matthew Dosanjh, (2019). Using simulation to examine the effect of MPI message matching costs on application performance Parallel Computing https://doi.org/10.1016/j.parco.2019.02.008 Publication ID: 67578
Scott Levy, Kurt Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman, (2019). Lessons learned from memory errors observed over the lifetime of Cielo Proceedings – International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2018 https://doi.org/10.1109/SC.2018.00046 Publication ID: 67575
Elisabeth Baseman, Nathan DeBardeleben, Sean Blanchard, Juston Moore, Olena Tkachenko, Kurt Ferreira, Taniya Siddiqua, Vilas Sridharan, (2019). Physics-Informed Machine Learning for DRAM Error Modeling 2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018 https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 62156
Kurt Ferreira, (2019). Checkpointing Strategies for Shared High-Performance Computing Platforms International Journal of Networking and Computing https://doi.org/10.15803/ijnc.9.1_28 Publication ID: 60074
Stephen Olivier, Ronald Brightwell, Kevin Pedretti, Andrew Younge, Noah Evans, Scott Levy, Kurt Ferreira, Ryan Grant, (2019). SNL ATDM Software Ecosystem https://www.osti.gov/servlets/purl/1583026 Publication ID: 64200
Scott Levy, Kurt Ferreira, (2018). Using simulation to examine the effect of MPI message matching costs on application performance ACM International Conference Proceeding Series https://doi.org/10.1145/3236367.3236375 Publication ID: 63034
Scott Levy, Kevin Pedretti, Kurt Ferreira, (2018). Open science on Trinity’s Knights Landing partition: An analysis of user job data ACM International Conference Proceeding Series https://doi.org/10.1145/3229710.3229753 Publication ID: 62662
Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, Jack Dongarra, (2018). Optimal cooperative checkpointing for shared high-performance computing platforms Proceedings – 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018 https://www.osti.gov/servlets/purl/1480217 Publication ID: 53793
Scott Levy, Kurt Ferreira, Nathan DeBardeleben, Taniya Siddiqua, Vilas Sridharan, Elisabeth Baseman, (2018). Lessons Learned from Errors Observed over the Lifetime of Cielo https://doi.org/10.1109/SC.2018.00046 Publication ID: 63939
Elisabeth Baseman, Nathan DeBardeleben, Sean Blanchard, Juston Moore, Olena Tkachenko, Kurt Ferreira, Taniya Siddiqua, Vilas Sridharan, (2018). Physics-Informed Machine Learning for DRAM Error Modeling https://doi.org/10.1109/DFT.2018.8602983 Publication ID: 63390
Thomas Herault, Yves Robert, Aurelien Bouteiller, Dorian Arnold, Kurt Ferreira, George Bosilca, Jack Dongarra, (2018). Optimal Cooperative Checkpointing for Shared High-Performance Computing Platforms https://doi.org/10.1109/IPDPSW.2018.00127 Publication ID: 61598
Kurt Ferreira, Ryan Grant, Michael Levenhagen, Scott Levy, Taylor Groves, (2017). Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design https://doi.org/10.1002/cpe.5150 Publication ID: 54225
Rebecca Kreitinger, Scott Levy, Kurt Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis https://www.osti.gov/servlets/purl/1478158 Publication ID: 53562
Rebecca Kreitinger, Scott Levy, Kurt Ferreira, Patrick Widener, (2017). Spacehog: Evaluating the costs of dedicating resources to in situ analysis https://www.osti.gov/servlets/purl/1573776 Publication ID: 53563
Kurt Ferreira, Scott Levy, Kevin Pedretti, Ryan Grant, (2017). Characterizing MPI matching via trace-based simulation ACM International Conference Proceeding Series https://www.osti.gov/servlets/purl/1462518 Publication ID: 57396
Scott Levy, Kurt Ferreira, Patrick Bridges, (2017). Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data Proceedings – IEEE International Conference on Cluster Computing, ICCC https://doi.org/10.1109/CLUSTER.2017.99 Publication ID: 57799
Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Vilas Sridharan, Taniya Siddiqua, Olena Tkachenko, (2017). Automating DRAM Fault Mitigation by Learning from Experience Proceedings – 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2017 https://doi.org/10.1109/DSN-W.2017.39 Publication ID: 55872
Taniya Siddiqua, Vilas Sridharan, Steven Raasch, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Elisabeth Baseman, Qiang Guan, (2017). Lifetime memory reliability data from the field 2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017 https://doi.org/10.1109/DFT.2017.8244428 Publication ID: 57295
Patrick Widener, Kurt Ferreira, Scott Levy, (2017). It’s not the heat it’s the humidity: scheduling resilience activity at scale https://www.osti.gov/servlets/purl/1367189 Publication ID: 56360
Marc Gammel, Keita Teranishi, Samuel Knight, Gregory Sjaardema, Hemanth Kolla, Jason Wilke, Nicole Slattengren, Kurt Ferreira, Janine Bennett, Nikhil Jain, Laxmikant Kale, (2017). Evaluating the Charm++ Runtime’s Ability to Cope with Performance Heterogeneity https://www.osti.gov/servlets/purl/1456562 Publication ID: 55874
Patrick Widener, Kurt Ferreira, Scott Levy, (2017). Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://doi.org/10.1007/978-3-319-58943-5_50 Publication ID: 50229
Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms ACM International Conference Proceeding Series https://doi.org/10.1145/2966884.2966920 Publication ID: 52299
Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, Qiang Guan, (2016). Improving DRAM Fault Characterization through Machine Learning Proceedings – 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2016 https://doi.org/10.1109/DSN-W.2016.13 Publication ID: 49553
Oscar Mondragon, Patrick Bridges, Scott Levy, Kurt Ferreira, Patrick Widener, (2016). Scheduling In-Situ Analytics in Next-Generation Applications Proceedings – 2016 16th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2016 https://www.osti.gov/servlets/purl/1333466 Publication ID: 41676
Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression International Conference for High Performance Computing, Networking, Storage and Analysis, SC https://doi.org/10.1109/SC.2016.27 Publication ID: 47905
Oscar Mondragon, Patrick Bridges, Scott Levy, Kurt Ferreira, Patrick Widener, (2016). Understanding Performance Interference in Next-Generation HPC Systems International Conference for High Performance Computing, Networking, Storage and Analysis, SC https://www.osti.gov/servlets/purl/1372149 Publication ID: 51068
Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Improving Application Resilience to Memory Errors with Lightweight Compression https://doi.org/10.1109/SC.2016.27 Publication ID: 51067
David Fiala, Frank Mueller, Kurt Ferreira, Christian Engelmann, (2016). Mini-Ckpts: Surviving OS failures in persistent memory Proceedings of the International Conference on Supercomputing https://doi.org/10.1145/2925426.2926295 Publication ID: 49177
Scott Levy, Kurt Ferreira, (2016). An examination of the impact of failure distribution on coordinated checkpoint/restart FTXS 2016 – Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale https://doi.org/10.1145/2909428.2909430 Publication ID: 50259
Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). Using Simulation to Evaluate the Performance of Resilience Strategies at Scale https://doi.org/10.1007/978-3-319-10214-6_5 Publication ID: 50027
Scott Levy, Kurt Ferreira, Patrick Widener, Patrick Bridges, Oscar Mondragon, (2016). How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms https://www.osti.gov/servlets/purl/1364728 Publication ID: 50139
Galen Shipman, Patrick McCormick, Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Ramanan Sankaran, Sean Treichler, Alex Aiken, Michael Bauer, (2016). Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime https://www.osti.gov/servlets/purl/1365384 Publication ID: 49758
Kurt Ferreira, (2016). An Examination of the Impact of the Failure Distribution on Coordinated Checkpoint/Restart https://www.osti.gov/servlets/purl/1345094 Publication ID: 48501
Elisabeth Baseman, Nathan DeBardeleben, Kurt Ferreira, Scott Levy, Steven Raasch, Vilas Sridharan, Taniya Siddiqua, Qiang Guan, (2016). A Machine Learning Approach for Automatic Characterization of Memory Faults https://www.osti.gov/servlets/purl/1346523 Publication ID: 48579
Patrick Widener, Scott Levy, Kurt Ferreira, Torsten Hoefler, (2016). On noise and the performance benefit of nonblocking collectives International Journal of High Performance Computing Applications https://doi.org/10.1177/1094342015611952 Publication ID: 39411
Scott Levy, Kurt Ferreira, Patrick Bridges, (2016). Similarity Engine: Using Content Similarity to Improve Memory Resilience https://www.osti.gov/servlets/purl/1239385 Publication ID: 46804
Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Early experiences with node-level power capping on the Cray XC40 platform Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing – Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis https://doi.org/10.1145/2834800.2834801 Publication ID: 41617
Dewan Ibtesham, Kurt Ferreira, Dorian Arnold, (2015). A checkpoint compression study for high-performance computing systems International Journal of High Performance Computing Applications https://doi.org/10.1177/1094342015570921 Publication ID: 37407
Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Early Experiences with Node-Level Power Capping on the Cray XC40 Platform https://doi.org/10.1145/2834800.2834801 Publication ID: 46036
Alireza Goudarzi, Dorian Arnold, Darko Stefanovic, Kurt Ferreira, Guy Feldman, (2015). A principled approach to HPC event monitoring FTXS 2015 – Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015 https://www.osti.gov/servlets/purl/1239260 Publication ID: 41943
Rolf Riesen, Barney Maccabe, Balazs Gerofi, David Lombard, John Lange, Kevin Pedretti, Kurt Ferreira, Mike Lang, Pardo Keppel, Robert Wisniewski, Ronald Brightwell, Todd Inglett, Yoonho Park, Yutaka Ishikawa, (2015). Panel: What is a Lightweight Kernel? https://www.osti.gov/servlets/purl/1258200 Publication ID: 43556
Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Galen Shipman, Wei Shu, (2015). Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform https://www.osti.gov/servlets/purl/1258232 Publication ID: 43466
Scott Levy, Kurt Ferreira, Patrick Bridges, (2015). Similarity Engine: Using Content Similarity to Improve Memory Resilience https://www.osti.gov/servlets/purl/1530987 Publication ID: 43098
Galen Shipman, Patrick McCormick, Kevin Pedretti, Stephen Olivier, Kurt Ferreira, Jacqueline Chen, Ramanan Sankaran, Sean Treichler, Alex Aiken, Michael Bauer, (2015). Dynamic Task Scheduling to Mitigate System Performance Variability https://www.osti.gov/servlets/purl/1249032 Publication ID: 43099
Kurt Ferreira, (2015). Revisiting Checkpointing for Exascale-Class Systems https://www.osti.gov/servlets/purl/1251139 Publication ID: 43249
Vilas Sridharan, Nathan DeBardeleben, Sean Blanchard, Kurt Ferreira, Jon Stearley, John Shalf, Sudhanva Gurumurthi, (2015). Memory errors in modern systems: The good, the bad, and the ugly International Conference on Architectural Support for Programming Languages and Operating Systems – ASPLOS https://doi.org/10.1145/2694344.2694348 Publication ID: 38008
Patrick Widener, Kurt Ferreira, Scott Levy, Nathan Fabian, (2015). Canaries in a coal mine: Using application-level checkpoints to detect memory failures Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) https://www.osti.gov/servlets/purl/1256569 Publication ID: 43835
Kurt Ferreira, Scott Levy, Patrick Widener, Dorian Arnold, (2014). Using Machine Learning to Optimize Uncoordinated Checkpointing Performance https://www.osti.gov/servlets/purl/1319751 Publication ID: 39111
Kurt Ferreira, (2014). Fault Survivability of Lightweight Operating Systems for exascale https://doi.org/10.2172/1459775 Publication ID: 38559
Awards & Recognition
2009
Ronald Brightwell, Kurt Brian Ferreira, Suzanne M. Kelly, James H. Laros, Kevin Pedretti, James Tomkins, John P. Vandyke, Courtenay T. Vaughan, Robert Ballance, Trammell Hudson,
R&D 100 Award, R&D Magazine, One of the 100 Most Technologically Significant New Products of the Year,
June 1, 2009