Publications

Publications / Conference

Combining virtualization, resource characterization, and resource management to enable efficient high performance compute platforms through intelligent dynamic resource allocation

Brandt, James M.; Chen, F.; De Sapio, Vincent D.; Gentile, Ann C.; Mayo, Jackson M.; Pébay, P.; Roe, D.; Thompson, D.; Wong, M.

Improved resource utilization and fault tolerance of large-scale HPC systems can be achieved through fine grained, intelligent, and dynamic resource (re)allocation. We explore components and enabling technologies applicable to creating a system to provide this capability: specifically 1) Scalable fine-grained monitoring and analysis to inform resource allocation decisions, 2) Virtualization to enable dynamic reconfiguration, 3) Resource management for the combined physical and virtual resources and 4) Orchestration of the allocation, evaluation, and balancing of resources in a dynamic environment. We discuss both general and HPC-centric issues that impact the design of such a system. Finally, we present our prototype system, giving both design details and examples of its application in real-world scenarios.