Publications / Conference

Ten million and one penguins, or, lessons learned from booting millions of virtual machines on HPC systems

Minnich, Ronald G.; Rudish, Don W.

In this paper we describe Megatux, a set of tools we are developing for rapidly provisioning, controlling, and monitoring millions of virtual machines, as well as what we learned from booting one million Linux virtual machines on the Thunderbird cluster (4660 nodes) and 550,000 Linux virtual machines on the Hyperion cluster (1024 nodes). As might be expected, our tools use hierarchical structures. In contrast to existing HPC tools, ours do not require perfect hardware, do not require that all systems boot at the same time, and do not depend on static configuration files that define the role of each node. While we believe these tools will be useful for future HPC systems, we are using them today to construct botnets.

Botnets have been in the news recently, as their scale (millions of infected machines for even a single botnet), their reach (global), and their impact on organizations (devastating in financial cost and time lost to recovery) have become more apparent. A distinguishing feature of botnets is their emergent behavior: fairly simple operational rule sets can produce behavior that cannot be predicted. In general, there is no reducible understanding of how a large network will behave ahead of 'running it', i.e., observing the actual network in operation or simulating/emulating it. Unfortunately, this behavior appears only at scale, when at minimum tens of thousands of machines are infected. To add to the problem, botnets typically replace at least 11% of the machines they are using in any given week, and this changing population is an integral part of their behavior.

The use of virtual machines to assist in malware forensics is not new to the cyber security world; reverse-engineering techniques often combine virtual machines with code debuggers. Nevertheless, getting past code obfuscation largely remains a manual, and therefore inherently slow, process.
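The abstract states that botnets replace at least 11% of their machines in any given week. A minimal sketch (ours, not from the paper) of what that churn rate implies for the surviving fraction of an initial population, assuming the rate stays constant:

```python
def surviving_fraction(weeks, churn=0.11):
    """Fraction of an initial botnet population still present after
    `weeks` weeks, assuming a constant weekly replacement rate."""
    return (1.0 - churn) ** weeks

# At 11% weekly churn, fewer than half of the original machines
# remain after about six weeks: 0.89**6 ~= 0.497.
```

This is one reason snapshots of a botnet's membership go stale quickly, and why the changing population itself has to be part of any faithful emulation.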
As part of our cyber security work at Sandia National Laboratories, we are striving to understand the global network behavior of botnets. We plan to take existing botnets, as found in the wild, and run them on HPC systems, which support the creation and operation of millions of Linux virtual machines and thereby let us observe the interaction of the botnet with non-infected hosts. We started out with traditional HPC tools, but those tools are designed for a much smaller scale, typically topping out at one to ten thousand machines. HPC programming libraries and tools also assume complete connectivity between all nodes, with the attendant configuration files and data structures to match; this assumption holds up very poorly on systems with millions of nodes.
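The abstract says the tools use hierarchical structures rather than assuming complete connectivity. A back-of-the-envelope sketch (ours, not the paper's) of why a control tree scales where a flat controller does not: each level multiplies the number of reachable machines by the fan-out, so a modest fan-out reaches a million virtual machines in a handful of hops, while a flat design would need a million direct connections from one controller.

```python
def tree_depth(n_nodes, fanout):
    """Number of levels a control tree with the given fan-out needs
    so that its leaves cover at least n_nodes machines."""
    depth, reach = 0, 1
    while reach < n_nodes:
        reach *= fanout
        depth += 1
    return depth

# With a fan-out of 32, one million VMs are reachable in 4 levels,
# and no single controller holds more than 32 connections.
```

The fan-out of 32 here is an illustrative assumption, not a figure from the paper; the point is only the logarithmic depth versus the linear connection count of a flat scheme.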