LinuxCon Europe – Day 2 (Part 4 of 4)
The second day of LinuxCon covered the history of Linux: its past, its present (mainly what large companies are doing with it), and what its future may look like. We’ve seen Linux grow to support multicore systems, but beyond that, how far can it actually scale? That question was answered in the following presentation.
Experiences booting 100s of thousands to millions of Linux VMs
The title alone gives a good idea of the level of scalability achievable with Linux, but it also raises the question “why would anyone want to do that?” Thoughts of a world ruled by computers and references to Skynet and The Matrix immediately jump into our Science-Fiction-loving minds, but we should get our facts straight before anything else. In this case, Andrew John Sweeney from Sandia National Labs explained in his LinuxCon presentation that the goal behind such a massive project was to create a “realistic living Internet environment”. That should put your mind at ease, shouldn’t it?
The early beginnings of this project date back to 2007, as Andrew Sweeney explained, when they started by building a 100-node cluster on a laptop. Then came the question of whether they could run Beowulf on it. Their first large-scale experiment was to boot approximately 50,000 VMs. The question became: “what would it take to get to a million?”
They explored multiple virtualization platforms (lguest, QEMU, KVM, NOVA) and started with a very small Linux guest OS (~10 MB). The guest (VM) configuration was computed at runtime and everything was stored in RAM. Each VM was treated as an application process, only doing work when its payload was active. Among the tools used was VMatic, a deployment tool for virtual machines that can also create system images and configure and patch both Linux kernels and system images. On a modern server, VMatic was able to generate an image and boot 1,000 VMs in less than 3 minutes.
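Computing the guest configuration at runtime, instead of storing a file per VM, is what keeps the per-guest overhead near zero at these scales. The talk did not show VMatic’s internals, so the following is only a hypothetical sketch of the idea: derive each guest’s MAC and IP deterministically from its index, so no configuration ever needs to be stored.

```python
# Hypothetical sketch: derive per-VM network configuration at runtime
# from the VM's index instead of storing a config file per guest.
# (Illustrative only -- not VMatic's actual addressing scheme.)

def vm_config(index):
    """Return a (mac, ip) pair for guest number `index` (0-based)."""
    if not 0 <= index < 2 ** 24:
        raise ValueError("index out of range for this 24-bit scheme")
    # Locally administered MAC: 02:00:00:xx:yy:zz from the 24-bit index.
    mac = "02:00:00:%02x:%02x:%02x" % (
        (index >> 16) & 0xFF, (index >> 8) & 0xFF, index & 0xFF)
    # Private 10.x.y.z address from the same index (a real scheme would
    # also have to skip the .0 network and .255 broadcast host parts).
    ip = "10.%d.%d.%d" % (
        (index >> 16) & 0xFF, (index >> 8) & 0xFF, index & 0xFF)
    return mac, ip
```

With a scheme like this, a million distinct guest identities cost nothing to store and can be recomputed on any host at boot time.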
Another tool used was gproc, a large-scale cluster management system written in Go. Gproc scales beyond 200,000 instances, has a web interface to control and monitor commands, and provides an efficient mechanism for pushing out large files. The boot process starts by PXE-booting a small Linux kernel on each physical host. Once up, the host starts the guest VMs: a script performs the automatic network configuration of each guest, and the guests are booted via lguest with the extra information passed on the kernel command line.
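Pushing commands and large files to 200,000+ instances from a single master doesn’t scale; cluster tools in this class typically fan the work out through a tree, where each node relays to a handful of children. The sketch below illustrates that general technique only; gproc itself is written in Go and its actual implementation was not shown.

```python
# Sketch of tree-based fan-out: the first node of each group relays the
# command to the heads of its subtrees, and so on recursively. This is
# the generic technique, not gproc's actual code.

def fanout(nodes, branching=32):
    """Split `nodes` into (head, subtrees): `head` relays the command to
    the head of each subtree, which relays it further down."""
    if not nodes:
        return None
    head, rest = nodes[0], nodes[1:]
    if not rest:
        return head, []
    chunk = -(-len(rest) // branching)  # ceiling division
    subtrees = [fanout(rest[i:i + chunk], branching)
                for i in range(0, len(rest), chunk)]
    return head, subtrees

def depth(tree):
    """Number of relay hops needed to reach every node in the tree."""
    head, subtrees = tree
    return 1 + max((depth(t) for t in subtrees), default=0)
```

With a branching factor of 32, reaching 200,000 nodes takes only about five relay hops, which is why this structure keeps working long after a flat “master pushes to everyone” design has fallen over.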
This led to booting 1 million VMs using lguest in July 2009. The workload ran on a cluster of 4,600 physical nodes, each with two Intel processors at 2.6 GHz, and each node ran 256 lguest sessions (4,600 × 256 ≈ 1.18 million guests). The achievement had its problems, too: not everything was smooth, they crashed the production HPC system several times, they filled the switches’ CAM tables, and there were many other “horror stories”, as Andrew Sweeney called them. These experiments also showed that this type of work did not fit the traditional HPC model, so a new cluster, called KANE, was built: 520 nodes, each with an Intel Core i7 processor, 12 GB of RAM and no hard disk, packaged in a desktop-sized computer case. The second cluster is called Strongbox, a.k.a. the “cluster on your desk”: a see-through plexiglas case holding 490 Overo Tide boards, each with an ARM Cortex-A8 CPU running at 720 MHz and a 100 Mbit NIC, with a very low energy footprint.
As I mentioned, part of the initial goal was to model large-scale networks. The focus then shifted to higher-fidelity networks: support for multiple operating systems, virtual network topologies independent of the physical one, virtual routers (Quagga, among others), virtual switches, and more. On the multiple-operating-systems front, they also booted Windows VMs, both Windows 7 and Windows XP, though not without issues. One of the main problems Andrew Sweeney mentioned was that they couldn’t fully start Windows without its graphical user interface. Even so, they booted over 65,000 Windows instances, about 200 Windows 7 instances and roughly 230 Windows XP instances per system.
To reduce the VM footprint, unneeded services, applications and, where possible, the GUI were removed; VMs were also started from a shared state, and memory was de-duplicated with KSM (kernel same-page merging). To improve performance they used the KVM tool instead of QEMU/KVM, which lowers the initial memory footprint, as well as AXFS, which reduces the duplication of the space taken up in memory by processes.
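KSM only de-duplicates pages that an application has explicitly opted in via madvise(MADV_MERGEABLE); a hypervisor marks its guest-memory regions this way so identical pages across hundreds of near-identical VMs collapse into one. A minimal sketch of the opt-in, assuming Linux with CONFIG_KSM (Python 3.8+ exposes madvise on mmap objects):

```python
# Minimal sketch: mark an anonymous mapping as mergeable so the kernel's
# KSM daemon may de-duplicate identical pages across processes/VMs.
# Assumes Linux with CONFIG_KSM; the advise call fails gracefully elsewhere.
import mmap

def make_mergeable(num_pages):
    """Map `num_pages` anonymous pages and advise the kernel to merge them.
    Returns (mapping, True) if the KSM hint was accepted, else (mapping, False)."""
    buf = mmap.mmap(-1, num_pages * mmap.PAGESIZE)  # anonymous, page-aligned
    try:
        buf.madvise(mmap.MADV_MERGEABLE)  # opt this range into KSM scanning
        return buf, True
    except (OSError, AttributeError):     # kernel without KSM, or non-Linux
        return buf, False

buf, merged = make_mergeable(4)
```

Merging still has to be enabled system-wide (writing 1 to /sys/kernel/mm/ksm/run) before the ksmd daemon actually scans and collapses the marked pages.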
But there is also a large market for handheld devices, and Sandia National Laboratories took that into account too, having booted 300,000 Android VMs thus far. Here the Dalvik virtual machine helped them a great deal with memory sharing, according to Andrew Sweeney. With that detail, the presentation and the second day of LinuxCon ended. Stay tuned for the following articles covering day 3.