
Guide to Virtualization

From buzz to reality

In 2003, Intel announced that it was working on a technology called "Vanderpool" that was aimed at providing hardware-level support for something called "virtualization." With that announcement, the decades-old concept of virtualization had officially arrived on the technology press radar. In spite of its long history in computing, however, as a new buzzword, "virtualization" at first smelled ominously similar to terms like "trusted computing" and "convergence." In other words, many folks had a vague notion of what virtualization was, and from what they could tell it sounded like a decent enough idea, but you got the impression that nobody outside of a few vendors and CIO types was really too excited.

Fast-forward to 2008, and virtualization has gone from a solution in search of a problem, to an explosive market with an array of real implementations on offer, to a word that's often mentioned in the same sentence with terms like "shakeout" and "consolidation." But whatever the state of "virtualization" as a buzzword, virtualization as a technology is definitely here to stay.

Virtualization implementations are so widespread that some are even popular in the consumer market, and some (the really popular ones) even involve gaming. Anyone who uses an emulator like MAME uses virtualization, as does anyone who uses either the Xbox 360 or the Playstation 3. From the server closet to the living room, virtualization is subtly, but radically, changing the relationship between software applications and hardware.

In the present article I'll take a close look at virtualization—what it is, what it does, and how it does what it does.

Abstraction, and the big shifts in computing

Most of the biggest tectonic shifts in computing have been fundamentally about remixing the relationship between hardware and software by inserting a new abstraction layer in between programmers and the processor. The first of these shifts was the instruction set architecture (ISA) revolution, which was kicked off by IBM's invention of the microcode engine. By putting a stable interface—the programming model and the instruction set—in between the programmer and the hardware, IBM and its imitators were able to cut down on software development costs by letting programmers reuse binary code from previous generations of a product, an idea that was novel at the time.

Another major shift in computing came with the introduction of the reduced instruction set computing (RISC) concept, a concept that put compilers and high-level languages in between programmers and the ISA, leading to better performance.

Virtualization is the latest in this progression of moving software further away from hardware, and this time, the benefits have less to do with reducing development costs and increasing raw performance than they do with reducing infrastructure costs by allowing software to take better advantage of existing hardware.

Right now, there are two different technologies being pushed by vendors under the name of "virtualization": OS virtualization, and application virtualization. This article will cover only OS virtualization, but application virtualization is definitely important and deserves its own article.

The hardware/software stack

Figure 1 below shows a typical hardware/software stack. In a typical stack, the operating system runs directly on top of the hardware, while application software runs on top of the operating system. The operating system, then, is accustomed to having exclusive, privileged control of the underlying hardware, hardware that it exposes selectively to applications. To use client/server terminology, the operating system is a server that provides its client applications with access to a multitude of hardware and software services, while hiding from those clients the complexity of the underlying hardware/software stack.


Because of its special, intermediary position in the hardware/software stack, two of the operating system's most important jobs are isolating the various running applications from one another so that they don't overwrite each other's data, and arbitrating among the applications for the use of shared resources (memory, storage, networking, etc.). In order to carry out these isolation and arbitration duties, the OS must have free and uninterrupted rein to manage every corner of the machine as it sees fit... or, rather, it must think that it has such exclusive latitude. There are a number of situations (described below) where it's helpful to limit the OS's access to the underlying hardware, and that's where virtualization comes in.

Virtualization basics

The basic idea behind virtualization is to slip a relatively thin layer of software, called a virtual machine monitor (VMM), directly underneath the OS, and then to let this new software layer run multiple copies of the OS, or multiple different OSes, or both. There are two main ways that this is accomplished: 1) by running a VMM on top of a host OS and letting it host multiple virtual machines, or 2) by wedging the VMM between the hardware and the guest OSes, in which case the VMM is called a hypervisor. Let's look at the second, hypervisor-based method first.

The hypervisor

In a virtualized system like the one shown in Figure 2, each operating system that runs on top of the hypervisor is typically called a guest operating system. These guest operating systems don't "know" that they're running on top of another software layer. Each one believes that it has the kind of exclusive and privileged access to the hardware that it needs in order to carry out its isolation and arbitration duties. Much of the challenge of virtualization on an x86 platform lies in maintaining this illusion of supreme privilege for each guest OS. The x86 ISA is particularly uncooperative in this regard, which is why Intel's virtualization technology (VT-x, formerly known as Vanderpool) is so important. But more on VT-x later.


In order to create the illusion that each OS has exclusive access to the hardware, the hypervisor (also called the virtual machine monitor, or VMM) presents to each guest OS a software-created image or simulation of an idealized computer—processor, peripherals, the works. These software-created images are called virtual machines (VMs), and the VM is what the OS runs on top of and interacts with.

In the end, the virtualized software stack is arranged as follows: at the lowest level, the hypervisor runs multiple VMs; each VM hosts an OS; and each OS runs multiple applications. So the hypervisor swaps virtual machines on and off of the actual system hardware, in a coarse-grained form of time sharing.
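
To make the time-sharing picture concrete, here's a minimal C sketch of a hypervisor's outermost loop. Everything in it is a hypothetical stand-in (the context structure, the VM count, the run_until_trap_or_timer stub); real world-switch code is ISA-specific and far more involved.

    /* Minimal sketch of a hypervisor's outer scheduling loop: keep a context
     * per virtual machine and round-robin them onto the one physical CPU.
     * All names are hypothetical; real world-switch code is ISA-specific. */
    #include <stdio.h>

    #define NUM_VMS 4

    typedef struct {
        unsigned long regs[16];   /* saved general-purpose registers  */
        unsigned long pc;         /* saved program counter            */
        unsigned long page_table; /* root of this VM's memory mapping */
    } vm_context;

    static vm_context vms[NUM_VMS];

    /* Stand-in for "restore this VM's state and let its guest run until
     * the timer expires or it traps." */
    static void run_until_trap_or_timer(int vm_id, vm_context *vm)
    {
        (void)vm;
        printf("time slice given to VM %d\n", vm_id);
    }

    int main(void)
    {
        /* A real hypervisor loops forever; eight slices suffice for a demo. */
        for (int slice = 0; slice < 8; slice++) {
            int current = slice % NUM_VMS;   /* coarse-grained time sharing */
            run_until_trap_or_timer(current, &vms[current]);
        }
        return 0;
    }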

I'll go into much more technical detail on exactly how the hypervisor does its thing in a bit, but now that we've got the basics out of the way let's move the discussion back out to the practical level for a moment.

The host/guest model

Another, very popular method for implementing virtualization is to run virtual machines as part of a user-level process on a regular OS. This model is depicted in Figure 3, where an application like VMware runs on top of a host OS, just like any other user-level app, but it contains a VMM that hosts one or more virtual machines. Each of these VMs, in turn, hosts a guest operating system.

As you might imagine, this virtualization method is typically slower than the hypervisor-based approach, since there's much more software sitting between the guest OS and the actual hardware. But virtualization packages that are based on this approach are relatively painless to deploy, since you can install them and run them like any other application, without requiring a reboot.

Why virtualization?

Virtualization is finding a growing number of uses, in both the enterprise and the home. Here are a few places where you'll see virtualization at work.

Server consolidation

A common enterprise use of virtualization is server consolidation: replacing multiple real but underutilized machines with multiple virtual machines running on a single physical system. Taking those underutilized servers offline and consolidating them onto one server machine saves on space, power, cooling, and maintenance costs.

Live migration for load balancing and fault tolerance

Load balancing and fault tolerance are closely related enterprise uses of virtualization. Both of these uses involve a technique called live migration, in which an entire virtual machine that's running an OS and application stack is seamlessly moved from one physical server to another, all without any apparent interruption in the OS/application stack's execution. So a server farm can load-balance by moving a VM from an over-utilized system to an under-utilized system; and if the hardware in a particular server starts to fail, then that server's VMs can be live migrated to other servers on the network and the original server shut down for maintenance, all without a service interruption.

Performance isolation and security

Sometimes, multi-user OSes don't do a good enough job of isolating users from one another; this is especially true when a user or program is a resource hog or is actively hostile, as is the case with an intruder or a virus. By implementing a more robust and coarse-grained form of hardware sharing that swaps entire OS/application stacks on and off the hardware, a VMM can more effectively isolate users and applications from one another for both performance and security reasons.

Note that security is not just an enterprise use of virtualization. Both the Xbox 360 and the Playstation 3 use virtual machines to limit the kinds of software that can be run on the console hardware and to control users' access to protected content.

Software development and legacy system support

For individual users, virtualization provides a number of work- and entertainment-related benefits. On the work side, software developers make extensive use of virtualization to write and debug programs. A program with a bug that crashes an entire OS can be a huge pain to debug if you have to reboot every time you run it; with virtualization, you can do your test runs in a virtual machine and just reboot the VM whenever it goes down.

Developers also use virtualization to write programs for one OS or ISA on another. So a Windows user who wants to write software for Linux using Windows-based development tools can easily do test runs by running Linux in a VM on the Windows machine.

A popular entertainment use for virtualization is the emulation of obsolete hardware, especially older game consoles. Users of popular game system emulators like MAME can enjoy games written for hardware that's no longer in production.

Types of virtualization

For virtualization to work, the VMM must give each guest OS the illusion of exclusive access to the following parts of the machine:

CPU
Main memory
Mass storage (typically a hard disk)
I/O (typically a network interface)

Virtualization software accomplishes this bit of magic by virtualizing each of the four components to some degree or another. In other words, the software presents a carefully crafted and controlled model of the whole computer—called a virtual machine—to each guest OS. This virtual machine consists of the four main parts listed above, with each part being abstracted from the actual hardware to a greater or lesser degree, depending on the needs of the guest OS and the capabilities of the hardware.
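
One way to picture what each guest is handed is a per-VM record bundling software models of those four components. The C sketch below is purely illustrative; the type names and fields are invented, and a real VMM's data structures are far richer.

    /* Illustrative sketch: the four virtualized resources the VMM models for
     * each guest, bundled into one per-VM record. All names are invented. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint64_t regs[16]; uint64_t pc; uint64_t flags; } vcpu_state;
    typedef struct { uint8_t *base; size_t size; } vmem_region;      /* guest "physical" RAM */
    typedef struct { int backing_fd; uint64_t num_sectors; } vdisk;  /* virtual hard disk    */
    typedef struct { uint8_t mac[6]; int tap_fd; } vnic;             /* virtual network card */

    typedef struct {
        vcpu_state  cpu;     /* CPU: registers, program counter, flags */
        vmem_region memory;  /* main memory as seen by the guest       */
        vdisk       disk;    /* mass storage                           */
        vnic        nic;     /* I/O (network interface)                */
    } virtual_machine;

    int main(void)
    {
        virtual_machine vm = {0};
        vm.memory.size = 512u * 1024 * 1024;   /* pretend the guest sees 512 MB */
        printf("guest RAM: %zu bytes\n", vm.memory.size);
        return 0;
    }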

Depending on certain features of the hardware and guest operating system, each of these four parts can be easier or harder to virtualize. Problems with virtualizing one or more of the four components listed above have resulted in the development of three primary types of virtualization, each of which is distinguished by the manner in which the VMM interposes itself in between the hardware and the guest OS:

Emulation (including binary translation)
Classical virtualization
Paravirtualization

Except for this list's omission of OS-level virtualization, which I won't cover here, it superficially resembles the standard list of virtualization types that you'll see in most articles on the topic. In this article, however, these categories work in a slightly different, but hopefully more useful, manner than is common. (For those who are already familiar with some virtualization terminology, you'll notice that I've opted for the more strictly defined "classical virtualization" category instead of the "full virtualization" category. This was done for reasons that will become clear later.)

Emulation

Emulation is the flavor of virtualization that places the largest amount of software in between the hardware and the guest OS, and because of that, it can also be the slowest of the three types. With emulation, the VMM presents to each guest OS a software-based model of the entire computer, including the microprocessor. All of the instructions in the instruction streams of both the guest OS and application programs must first pass through the VMM before being passed on to the processor, often so that they can be translated into the processor's native ISA and executed.

Even the parts of the OS that interface with the I/O and mass storage hardware (i.e. the drivers) must also pass through the virtual machine, with the result that no part of the OS really touches the hardware directly without going through the VMM first.

Because of all of the software that sits between the guest OS and the hardware, emulation can reduce OS and application performance by orders of magnitude versus native execution. This is certainly the case for virtualized systems where the processor has an ISA that's different from the one the OS was written for (e.g., the version of VirtualPC that ran x86-based Windows on the PowerPC-based Mac platform). However, some modern binary-translation-based approaches, like VMware's products, where both the guest and host operating systems have the same ISA, boast speeds approaching native execution for certain kinds of workloads. (This is because VMware's binary translation kernel only emulates the small fraction of x86 instructions that present problems for virtualization, while passing the rest directly on to the hardware. But more on this in Part II.)
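
The core of a pure emulator (and the starting point for a binary translator) is a software fetch/decode/execute loop over the guest's instruction stream. The toy C sketch below runs a made-up two-instruction ISA just to show the shape of that loop; it isn't how any real emulator or translator is organized internally.

    /* Toy fetch/decode/execute loop over an invented two-instruction ISA,
     * just to show the shape of software emulation. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    enum { OP_ADDI = 1, OP_HALT = 2 };   /* invented opcodes */

    int main(void)
    {
        /* "Guest" program: add 5, add 7, halt. Each instruction is an
         * opcode byte followed by an operand byte. */
        uint8_t program[] = { OP_ADDI, 5, OP_ADDI, 7, OP_HALT, 0 };
        uint32_t acc = 0;   /* emulated accumulator register */
        size_t pc = 0;      /* emulated program counter      */

        for (;;) {
            uint8_t op  = program[pc];      /* fetch                        */
            uint8_t arg = program[pc + 1];
            pc += 2;
            switch (op) {                   /* decode + execute in software */
            case OP_ADDI:
                acc += arg;
                break;
            case OP_HALT:
                printf("emulated accumulator = %u\n", (unsigned)acc);
                return 0;
            default:
                return 1;                   /* unknown instruction */
            }
        }
    }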

Classical virtualization

When a guest OS and host processor share the same ISA, and when that ISA is amenable to the trap-and-emulate technique (more on this term in Part II), the VMM can forgo the costly binary translation step and pass the OS and application instruction streams directly on to the processor. The result is that each guest OS and its attendant applications run faster than they would under emulation, but not quite as fast as they would if the OS had exclusive control of the hardware.

With classical virtualization, the processor traps instructions that might accidentally clue the OS in to the fact that there's something odd and unexpected going on behind its back. These problem instructions have to be emulated by the VMM, so that the VMM can keep the guest OS in the dark about what's going on.

We'll talk more about these problem instructions and how the VMM handles them in Part II.

Paravirtualization

Because classical virtualization requires that the VMM trap and emulate a handful of common problem instructions, guest OSes and their applications can sometimes run more slowly than they do when running natively. A technique called paravirtualization remedies this problem by modifying the guest OS so that these instructions don't pose a problem. With a cooperative guest OS that has been properly modified, the VMM can trust the OS to run with less oversight—and less costly overhead.

The main drawback to paravirtualization is that the OS must be modified in order to support the technique. These modifications are typically minimal, but they require access to the OS source code. For this reason, Linux is the most popular paravirtualized OS.
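
Conceptually, the paravirtualization change amounts to replacing a problem privileged operation in the guest kernel with an explicit call into the hypervisor (a "hypercall"). The C sketch below is a toy illustration of that substitution; the call numbers and function names are invented and don't correspond to any real hypervisor's interface.

    /* Toy illustration of paravirtualization: the modified guest kernel calls
     * the hypervisor explicitly instead of executing a privileged operation.
     * The hypercall numbers and names are invented. */
    #include <stdio.h>
    #include <stdint.h>

    enum { HCALL_SET_PAGE_TABLE = 1, HCALL_MASK_INTERRUPTS = 2 };

    /* Stand-in for the real entry mechanism (a trapping instruction or a
     * call into a hypervisor-provided entry point). */
    static long hypercall(int number, uint64_t arg)
    {
        printf("hypercall %d(0x%llx) handled by the VMM\n",
               number, (unsigned long long)arg);
        return 0;
    }

    /* An unmodified kernel would load the page-table base register directly;
     * the paravirtualized kernel asks the hypervisor to do it instead. */
    static void guest_switch_address_space(uint64_t new_root)
    {
        hypercall(HCALL_SET_PAGE_TABLE, new_root);
    }

    int main(void)
    {
        guest_switch_address_space(0x1000);   /* invented page-table root */
        return 0;
    }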

Though I don't often link to Wikipedia, this table provides an excellent overview of virtualization packages and techniques on different platforms. At this point in the article, you should be well equipped to understand most of what you'll find there, so go check it out before proceeding with Part II.

Ultimately, there are a number of factors that play into a decision of which type of virtualization is best for a given implementation. The nature of the hardware and of the guest OS may rule out one or more of the three options, and for hardware/OS combinations where multiple options are possible, performance, stability, or ease of remote management may be among the deciding factors.

In an ideal world, binary translation and paravirtualization wouldn't be necessary, and full virtualization would enable the VMM to run guest OSes at near-native speeds. Historically, the main barrier to making this happen on commodity hardware has been the presence of certain problems in the x86 ISA, problems that Intel has fixed with VT-x but that are nonetheless worth taking a look at in order to understand how virtualization is actually implemented in hardware and software.

Privilege levels, rings, and fooling the guest OS

In the previous installment of the Virtualization Guide, I talked in general ways about the exclusive hardware access privileges that the OS reserves for itself. Now it's time to nuance that picture a bit, so you can see exactly how the OS retains the upper hand over applications and users. This brief installment sets the stage for Part III, which will talk in some detail about Intel VT.

A microprocessor doesn't just blindly run whatever instructions are loaded into its front end, without regard for where those instructions came from. Microprocessors are in fact "aware" of the OS, and they provide direct hardware support for enforcing the divisions between components of the hardware/software stack that I described in the previous article.

In order to keep applications from usurping any part of the OS's privileged access to system hardware, processors provide a mechanism that allows different programs to run at different privilege levels. These privilege levels are called rings, and they're arranged in a hierarchy that starts with Ring 0 (the lowest, most trusted level) and extends upwards through one or more progressively less-trusted Rings (e.g., Ring 1, Ring 2, and so on).

On any given processor, Ring 0 is the most privileged level, and any software that runs in Ring 0 is running in the most privileged state that the hardware supports. Such trusted software has complete command of the processor and of the rest of the system, which is why Ring 0 is typically reserved exclusively for the OS. Rings 1 and higher are less privileged, and they're home to less sensitive parts of the OS and to user-level application software.

Many processors have only two rings, Ring 0 for the OS and Ring 1 for all the other software in the stack. The x86 ISA, in contrast, has four rings (Rings 0 through 3), presumably because x86's designers thought more was better. But it turns out that virtually all operating systems (with the notable exception of the erstwhile OS/2) use only two of x86's privilege levels: Ring 0 for the OS and Ring 3 for everything else. Rings 1 and 2 go essentially unused.

Because programs running in the higher rings have restrictions on what parts of the system they can touch, it's harder for these de-privileged programs to do any real damage to the system, like crash it, or overwrite another user's data either through accident or malice. Conversely, an accidental or malicious error in a Ring 0 program (typically the OS kernel) often has catastrophic consequences for the entire software stack. The general rule is that programs are vulnerable to interference from programs that are running in the same ring or in a lower ring, but not in a higher ring. This rule means that the program at the very lowest ring is untouchable, while the programs in the higher rings are at the mercy of programs running below them.
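
In hardware, this general rule comes down to a comparison between the privilege level the current code is running at and the level an operation requires. The C sketch below models that check in miniature, using x86-style ring numbers; it's a simplification of what the processor does for every privileged operation.

    /* Miniature model of the ring check a processor performs before letting
     * code touch something privileged. Purely illustrative. */
    #include <stdio.h>

    typedef int ring;   /* 0 = most privileged, 3 = least (x86-style) */

    static int access_allowed(ring current, ring required)
    {
        /* Lower-numbered rings are more privileged, so the operation is
         * allowed only when the current ring is at or below the ring the
         * operation requires. */
        return current <= required;
    }

    int main(void)
    {
        printf("kernel in Ring 0 touching Ring-0 state: %s\n",
               access_allowed(0, 0) ? "allowed" : "fault");
        printf("application in Ring 3 touching Ring-0 state: %s\n",
               access_allowed(3, 0) ? "allowed" : "fault");
        return 0;
    }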

The introduction of a hypervisor into this ring structure complicates the software stack picture to a greater or lesser degree, depending on the nature of the hardware's ISA and the exact type of virtualization being used. Specifically, the hypervisor must be the most privileged program in the stack that it hosts, which means that it must run in a lower ring than the guest OS. Clearly, this means that the guest OS must be de-privileged by being booted out of Ring 0 and forced to run in a higher ring. Most of the challenge of virtualization lies in keeping the guest OS in the dark about the fact that it's no longer running in Ring 0, and "classical" virtualization solutions meet this challenge with two tricks: trap-and-emulate, and shadow structures.

Trap and emulate

Whenever a program attempts to execute an instruction for which it lacks sufficient privileges (i.e., it needs to be in a lower ring to execute that instruction), the attempted instruction fails and (ideally) triggers a special alert called a fault.

A hypervisor takes advantage of faults to implement virtualization by running the OS in a higher ring and then listening for it to trigger a fault. When the OS executes an instruction for which it needs the Ring 0 privileges to which it's accustomed, the resulting instruction fault alerts the hypervisor. The hypervisor then steps in and takes control of the processor (or, it traps the fault), so that it can emulate the execution of that instruction. By trapping instructions that fault because they require Ring 0 privileges, and then running those instructions in emulation in order to produce the expected result for the guest OS, the hypervisor can keep the OS from detecting that it's running in a Ring other than zero.

The number of instructions in an ISA that must be trapped and emulated by the VMM may be very small, but all it takes is one to make the trap-and-emulate technique absolutely necessary.
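
Put together, the trap-and-emulate cycle looks roughly like the C sketch below: the guest runs directly on the CPU until a privileged instruction faults, the hypervisor decodes the faulting instruction, emulates its effect against the guest's virtual state, and then resumes the guest as if nothing had happened. The opcodes and structures here are invented for illustration.

    /* Sketch of the trap-and-emulate cycle over an invented "privileged"
     * instruction encoding. A real hypervisor decodes actual machine code
     * and resumes the guest afterwards. */
    #include <stdio.h>
    #include <stdint.h>

    enum { INSN_DISABLE_INTERRUPTS = 0x10, INSN_LOAD_PAGE_TABLE = 0x11 };

    typedef struct {
        int      interrupts_enabled;   /* the guest's virtual interrupt flag  */
        uint64_t page_table_root;      /* the guest's virtual page-table base */
    } guest_state;

    /* Called when the de-privileged guest faults on a privileged instruction. */
    static void emulate_faulting_insn(guest_state *g, uint8_t opcode, uint64_t arg)
    {
        switch (opcode) {
        case INSN_DISABLE_INTERRUPTS:
            g->interrupts_enabled = 0;   /* update the virtual state only */
            break;
        case INSN_LOAD_PAGE_TABLE:
            g->page_table_root = arg;    /* validate, then apply on the guest's behalf */
            break;
        default:
            printf("unexpected fault: opcode 0x%x\n", (unsigned)opcode);
        }
        /* ...then resume the guest at the next instruction. */
    }

    int main(void)
    {
        guest_state g = { 1, 0 };
        emulate_faulting_insn(&g, INSN_DISABLE_INTERRUPTS, 0);
        printf("the guest now believes interrupts are %s\n",
               g.interrupts_enabled ? "on" : "off");
        return 0;
    }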

Shadow structures

You might think of trap-and-emulate as the more passive of the two methods that the hypervisor has of fooling a guest OS. It's passive in the sense that the hypervisor waits on the guest OS to do something that it normally should be able to do but can't, before kicking into action with its deception. But virtualization involves a more active form of deception as well: the constant presentation of certain artificial stage props to the guest OS, props that enable the OS to serve its own running applications without catching wind of the fact that it doesn't have exclusive access to the hardware.

As part of its isolation and arbitration duties, there are a number of special-purpose, hardware-based data structures that an OS must maintain and constantly reference. Some of these structures, which we'll call primary structures, are special-purpose registers on the CPU, while others are tables that are stored in memory. Because a normal microprocessor only supports one copy of each of these primary structures, the hypervisor must have a way to let all the guest OSes share that one copy.

The solution is for the hypervisor to show each guest OS its own private copy of each primary structure. These private, VM-specific copies are called shadow structures, and the hypervisor uses these shadow structures in conjunction with their corresponding primary structures to keep guest OSes from interfering with one another.
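
A minimal way to picture this: the hardware has exactly one copy of, say, the register that points to the interrupt table, so the hypervisor keeps a private shadow value per VM and loads the right one into the real register whenever it switches VMs. The C sketch below is a deliberately simplified model of that bookkeeping.

    /* Simplified model of shadow structures: one real (primary) copy in the
     * "hardware," one shadow copy per VM, swapped in on every VM switch. */
    #include <stdio.h>
    #include <stdint.h>

    #define NUM_VMS 2

    /* Pretend this is the single hardware register that holds the address
     * of the interrupt table (the primary structure). */
    static uint64_t hw_interrupt_table_base;

    /* Per-VM shadow copies maintained by the hypervisor. */
    static uint64_t shadow_interrupt_table_base[NUM_VMS];

    static void switch_to_vm(int vm)
    {
        /* Load that VM's shadow value into the one real register. */
        hw_interrupt_table_base = shadow_interrupt_table_base[vm];
        printf("running VM %d, hardware table base = 0x%llx\n",
               vm, (unsigned long long)hw_interrupt_table_base);
    }

    int main(void)
    {
        shadow_interrupt_table_base[0] = 0xA000;   /* invented addresses */
        shadow_interrupt_table_base[1] = 0xB000;
        switch_to_vm(0);
        switch_to_vm(1);
        return 0;
    }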

Privileged state

Trap-and-emulate and shadow structures work together to keep the OS from figuring out that it's not running in Ring 0 and that it's actually sharing the hardware with more than one operating system. Behind both of these techniques is the necessity that the hypervisor, instead of the OS, have exclusive write access to privileged state. Now, let's unpack this phrase, because the concept that it conveys is critical to understanding how virtualization works.

"State" is a term that programmers use to refer to the small but essential collection of variable values and tables, held both on the processor and in main memory, that make up any program's "short-term memory." A better way to define a program's state would be to say that state encompasses all of the information you would need to save somewhere if you were going to stop a running program, and then restart it later at the same point in its execution.

Privileged state, then, is private data that the OS needs about currently running applications, data such as which pages of memory they've been allocated and what flags they've set on the processor. This privileged state should only be altered by the most trusted program in the system, which typically means the OS. However, when a hypervisor takes over the OS's management of privileged state, the hypervisor has to monitor and manage the OS's access to this data—data that the guest OS still needs in order to do its job. So the hypervisor must provide each guest OS some sort of access to privileged state, but no guest OS can be allowed to alter—or write to—privileged state without the hypervisor's intervention. In many cases, guest OSes may get read access to privileged state, but unmediated write access is always forbidden.
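
You can think of this mediation as a pair of accessors that the hypervisor effectively interposes on privileged state: reads can often be satisfied from the guest's shadow copy, while every write is validated and applied by the VMM itself. The C sketch below is only a conceptual model, not any real VMM's interface.

    /* Conceptual model of mediated access to privileged state: the guest may
     * read (from its shadow copy), but every write goes through the VMM. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint64_t page_table_root;   /* which memory pages each program owns */
        uint64_t cpu_flags;         /* processor flags the guest has "set"  */
    } priv_state;

    static priv_state shadow;       /* this guest's shadow copy */

    static uint64_t guest_read_flags(void)
    {
        return shadow.cpu_flags;    /* reads can often be satisfied directly */
    }

    static void guest_write_flags(uint64_t value)
    {
        /* Writes never reach the hardware directly: the VMM validates the
         * change and applies only what is safe for this guest. */
        shadow.cpu_flags = value & 0xFF;   /* invented "safe subset" of bits */
    }

    int main(void)
    {
        guest_write_flags(0x1FF);
        printf("flags as the guest sees them: 0x%llx\n",
               (unsigned long long)guest_read_flags());
        return 0;
    }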

When it comes to controlling write access to privileged state, the hardware's ISA can be either a huge help or a huge hindrance. The x86 ISA is the latter, a fact that makes virtualization on x86 hardware especially challenging.

Classical virtualization vs. x86-based virtualization

The trap-and-emulate technique that I've described above is an essential part of what's often called "classical virtualization." Classical virtualization is so called in order to distinguish it from the kind of virtualization that has been done on x86 systems prior to the recent introduction of Intel's VT-x technology. For reasons I'll discuss in the next installment, the trap-and-emulate technique just doesn't work on the x86 ISA, a fact that means that all virtualization on x86 hardware (pre-VT, of course) is either binary translation or paravirtualization. Even a software package like VMware, which most articles on virtualization place in the "full virtualization" category, still uses binary translation (BT) to control the OS's access to the CPU.

Why would a virtualization solution use binary translation—a technology typically reserved for translating between two separate ISAs—to run an x86-based OS on x86 processor hardware? The answer is that x86 unfortunately allows non-faulting write access to privileged state. In other words, the execution of some x86 instructions can have the side-effect of altering privileged state without triggering a fault that would alert a hypervisor to the fact that it needs to intervene and emulate the instruction. This feature of x86 makes classical, trap-and-emulate-based virtualization impossible to implement on x86 hardware prior to the introduction of VT-x.
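
Without reliable faults to lean on, a binary translator has to find the problem instructions itself: it scans each block of guest code before running it, passes harmless instructions through unchanged, and rewrites the troublesome ones into calls into the VMM. The C sketch below gestures at that technique over an invented instruction encoding; real translators work on actual x86 machine code, cache translated blocks, and handle far more cases.

    /* Toy sketch of same-ISA binary translation: scan a block of guest code,
     * copy safe instructions verbatim, and rewrite problem instructions into
     * calls to the VMM. The one-byte encoding is invented. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    enum { OP_ADD = 1, OP_POPFLAGS = 2, OP_CALL_VMM = 3 };   /* invented opcodes */

    /* OP_POPFLAGS stands in for an instruction that touches privileged state
     * without faulting, so it must never be allowed to run unmodified. */
    static size_t translate(const uint8_t *in, size_t n, uint8_t *out)
    {
        size_t o = 0;
        for (size_t i = 0; i < n; i++) {
            if (in[i] == OP_POPFLAGS)
                out[o++] = OP_CALL_VMM;   /* rewrite: let the VMM emulate it */
            else
                out[o++] = in[i];         /* safe: pass through unchanged    */
        }
        return o;
    }

    int main(void)
    {
        uint8_t guest_code[] = { OP_ADD, OP_POPFLAGS, OP_ADD };
        uint8_t translated[8];
        size_t  len = translate(guest_code, 3, translated);
        for (size_t i = 0; i < len; i++)
            printf("%d ", translated[i]);
        printf("\n");
        return 0;
    }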
