
Monday, February 22, 2010

Guide to Virtualization

From buzz to reality

In 2003, Intel announced that it was working on a technology called "Vanderpool" that was aimed at providing hardware-level support for something called "virtualization." With that announcement, the decades-old concept of virtualization had officially arrived on the technology press radar. In spite of its long history in computing, however, as a new buzzword, "virtualization" at first smelled ominously similar to terms like "trusted computing" and "convergence." In other words, many folks had a vague notion of what virtualization was, and from what they could tell it sounded like a decent enough idea, but you got the impression that nobody outside of a few vendors and CIO types was really too excited.

Fast-forward to 2008, and virtualization has gone from a solution in search of a problem, to an explosive market with an array of real implementations on offer, to a word that's often mentioned in the same sentence with terms like "shakeout" and "consolidation." But whatever the state of "virtualization" as a buzzword, virtualization as a technology is definitely here to stay.

Virtualization implementations are so widespread that some are even popular in the consumer market, and some (the really popular ones) even involve gaming. Anyone who uses an emulator like MAME uses virtualization, as does anyone who uses either the Xbox 360 or the Playstation 3. From the server closet to the living room, virtualization is subtly, but radically, changing the relationship between software applications and hardware.

In the present article I'll take a close look at virtualization—what it is, what it does, and how it does what it does.

Abstraction, and the big shifts in computing

Most of the biggest tectonic shifts in computing have been fundamentally about remixing the relationship between hardware and software by inserting a new abstraction layer in between programmers and the processor. The first of these shifts was the instruction set architecture (ISA) revolution, which was kicked off by IBM's invention of the microcode engine. By putting a stable interface—the programming model and the instruction set—in between the programmer and the hardware, IBM and its imitators were able to cut down on software development costs by letting programmers reuse binary code from previous generations of a product, an idea that was novel at the time.

Another major shift in computing came with the introduction of the reduced instruction set computing (RISC) concept, a concept that put compilers and high-level languages in between programmers and the ISA, leading to better performance.

Virtualization is the latest in this progression of moving software further away from hardware, and this time, the benefits have less to do with reducing development costs and increasing raw performance than they do with reducing infrastructure costs by allowing software to take better advantage of existing hardware.

Right now, there are two different technologies being pushed by vendors under the name of "virtualization": OS virtualization, and application virtualization. This article will cover only OS virtualization, but application virtualization is definitely important and deserves its own article.

The hardware/software stack

Figure 1 below shows a typical hardware/software stack. In a typical stack, the operating system runs directly on top of the hardware, while application software runs on top of the operating system. The operating system, then, is accustomed to having exclusive, privileged control of the underlying hardware, hardware that it exposes selectively to applications. To use client/server terminology, the operating system is a server that provides its client applications with access to a multitude of hardware and software services, while hiding from those clients the complexity of the underlying hardware/software stack.


Because of its special, intermediary position in the hardware/software stack, two of the operating system's most important jobs are isolating the various running applications from one another so that they don't overwrite each other's data, and arbitrating among the applications for the use of shared resources (memory, storage, networking, etc.). In order to carry out these isolation and arbitration duties, the OS must have free and uninterrupted rein to manage every corner of the machine as it sees fit... or, rather, it must think that it has such exclusive latitude. There are a number of situations (described below) where it's helpful to limit the OS's access to the underlying hardware, and that's where virtualization comes in.

Virtualization basics

The basic idea behind virtualization is to slip a relatively thin layer of software, called a virtual machine monitor (VMM), directly underneath the OS, and then to let this new software layer run multiple copies of the OS, or multiple different OSes, or both. There are two main ways that this is accomplished: 1) by running a VMM on top of a host OS and letting it host multiple virtual machines, or 2) by wedging the VMM between the hardware and the guest OSes, in which case the VMM is called a hypervisor. Let's look at the second, hypervisor-based method first.

The hypervisor

In a virtualized system like the one shown in Figure 2, each operating system that runs on top of the hypervisor is typically called a guest operating system. These guest operating systems don't "know" that they're running on top of another software layer. Each one believes that it has the kind of exclusive and privileged access to the hardware that it needs in order to carry out its isolation and arbitration duties. Much of the challenge of virtualization on an x86 platform lies in maintaining this illusion of supreme privilege for each guest OS. The x86 ISA is particularly uncooperative in this regard, which is why Intel's virtualization technology (VT-x, formerly known as Vanderpool) is so important. But more on VT-x later.


In order to create the illusion that each OS has exclusive access to the hardware, the hypervisor (also called the virtual machine monitor, or VMM) presents to each guest OS a software-created image or simulation of an idealized computer—processor, peripherals, the works. These software-created images are called virtual machines (VMs), and the VM is what the OS runs on top of and interacts with.

In the end, the virtualized software stack is arranged as follows: at the lowest level, the hypervisor runs multiple VMs; each VM hosts an OS; and each OS runs multiple applications. So the hypervisor swaps virtual machines on and off of the actual system hardware, in a very low-granularity form of time sharing.
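
To make that time-sharing idea concrete, here is a toy sketch in C of a hypervisor's outer loop swapping whole VMs on and off the hardware. The structure fields and switch functions are invented for illustration; they are not any real hypervisor's interface.

    #include <stdio.h>
    #include <stddef.h>

    /* Toy stand-in for the per-VM hardware state a real hypervisor would
     * save and restore (registers, control registers, and so on). */
    struct vm {
        const char *name;
        unsigned long saved_pc;   /* where this guest left off */
    };

    /* Stubs standing in for the world switch a real hypervisor performs,
     * typically with hardware assistance such as Intel's VT-x. */
    static void switch_to_vm(struct vm *v)   { printf("running %s\n", v->name); }
    static void switch_from_vm(struct vm *v) { v->saved_pc++; /* pretend progress */ }

    int main(void)
    {
        struct vm vms[] = { { "guest-A", 0 }, { "guest-B", 0 } };
        /* The coarse-grained time sharing described above: each VM gets the
         * whole machine for a slice, then is swapped off for the next one. */
        for (int round = 0; round < 3; round++)
            for (size_t i = 0; i < sizeof vms / sizeof vms[0]; i++) {
                switch_to_vm(&vms[i]);
                switch_from_vm(&vms[i]);
            }
        return 0;
    }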

I'll go into much more technical detail on exactly how the hypervisor does its thing in a bit, but now that we've got the basics out of the way let's move the discussion back out to the practical level for a moment.

The host/guest model

Another very popular method for implementing virtualization is to run virtual machines as part of a user-level process on a regular OS. This model is depicted in Figure 3, where an application like VMware runs on top of a host OS, just like any other user-level app, but it contains a VMM that hosts one or more virtual machines. Each of these VMs, in turn, hosts a guest operating system.

As you might imagine, this virtualization method is typically slower than the hypervisor-based approach, since there's much more software sitting between the guest OS and the actual hardware. But virtualization packages that are based on this approach are relatively painless to deploy, since you can install them and run them like any other application, without requiring a reboot.

Why virtualization?

Virtualization is finding a growing number of uses, in both the enterprise and the home. Here are a few places where you'll see virtualization at work.

Server consolidation

A common enterprise use of virtualization is server consolidation: replacing multiple real but underutilized machines with virtual machines running on a single system. Taking those underutilized servers offline and consolidating them onto one physical server saves on space, power, cooling, and maintenance costs.

Live migration for load balancing and fault tolerance

Load balancing and fault tolerance are closely related enterprise uses of virtualization. Both of these uses involve a technique called live migration, in which an entire virtual machine that's running an OS and application stack is seamlessly moved from one physical server to another, all without any apparent interruption in the OS/application stack's execution. So a server farm can load-balance by moving a VM from an over-utilized system to an under-utilized system; and if the hardware in a particular server starts to fail, then that server's VMs can be live migrated to other servers on the network and the original server shut down for maintenance, all without a service interruption.

Performance isolation and security

Sometimes, multi-user OSes don't do a good enough job of isolating users from one another; this is especially true when a user or program is a resource hog or is actively hostile, as is the case with an intruder or a virus. By implementing a more robust and coarse-grained form of hardware sharing that swaps entire OS/application stacks on and off the hardware, a VMM can more effectively isolate users and applications from one another for both performance and security reasons.

Note that security is more than an enterprise use of virtualization. Both the Xbox 360 and the Playstation 3 use virtual machines to limit the kinds of software that can be run on the console hardware and to control users' access to protected content.

Software development and legacy system support

For individual users, virtualization provides a number of work- and entertainment-related benefits. On the work side, software developers make extensive use of virtualization to write and debug programs. A program with a bug that crashes an entire OS can be a huge pain to debug if you have to reboot every time you run it; with virtualization, you can do your test runs in a virtual machine and just reboot the VM whenever it goes down.

Developers also use virtualization to write programs for one OS or ISA on another. So a Windows user who wants to write software for Linux using Windows-based development tools can easily do test runs by running Linux in a VM on the Windows machine.

A popular entertainment use for virtualization is the emulation of obsolete hardware, especially older game consoles. Users of popular game system emulators like MAME can enjoy games written for hardware that's no longer in production.

Types of virtualization

For virtualization to work, the VMM must give each guest OS the illusion of exclusive access to the following parts of the machine:

CPU
Main memory
Mass Storage (typically a hard disk)
I/O (typically a network interface)

Virtualization software accomplishes this bit of magic by virtualizing each of the four components to some degree or another. In other words, the software presents a carefully crafted and controlled model of the whole computer—called a virtual machine—to each guest OS. This virtual machine consists of the four main parts listed above, with each part being abstracted from the actual hardware to a greater or lesser degree, depending on the needs of the guest OS and the capabilities of the hardware.
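
As a rough sketch, you can picture the virtual machine handed to each guest OS as a bundle of four software models, one per component listed above. The types and fields here are invented for illustration; real VMMs model far more detail.

    /* One software model per hardware component listed above. */
    struct virtual_cpu    { unsigned long regs[16], rip, rflags; };
    struct virtual_memory { unsigned char *guest_ram; unsigned long size; };
    struct virtual_disk   { int backing_file_fd; };  /* often just a file on the host */
    struct virtual_nic    { unsigned char mac[6]; }; /* bridged or NATed by the host */

    /* The "virtual machine" the guest OS runs on and interacts with. */
    struct virtual_machine {
        struct virtual_cpu    cpu;
        struct virtual_memory mem;
        struct virtual_disk   disk;
        struct virtual_nic    nic;
    };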

Depending on certain features of the hardware and guest operating system, each of these four parts can be easier or harder to virtualize. Problems with virtualizing one or more of the four components listed above have resulted in the development of three primary types of virtualization, each of which is distinguished by the manner in which the VMM interposes itself in between the hardware and the guest OS:

Emulation (including binary translation)
Classical virtualization
Paravirtualization

Except for this list's omission of OS virtualization, which I won't cover here, it superficially resembles the standard list of virtualization types that you'll see in most articles on the topic. In this article, however, these categories work in a slightly different, but hopefully more useful, manner than is common. (For those who are already familiar with some virtualization terminology, you'll notice that I've opted for the more strictly defined "classical virtualization" category instead of the "full virtualization" category. This was done for reasons that will become clear later.)

Emulation

Emulation is the flavor of virtualization that places the largest amount of software in between the hardware and the guest OS, and because of that, it can also be the slowest of the three types. With emulation, the VMM presents to each guest OS a software-based model of the entire computer, including the microprocessor. All of the instructions in the instruction streams of both the guest OS and application programs must first pass through the VMM before being passed on to the processor, often so that they can be translated into the processor's native ISA and executed.

Even the parts of the OS that interface with the I/O and mass storage hardware (i.e. the drivers) must also pass through the virtual machine, with the result that no part of the OS really touches the hardware directly without going through the VMM first.

Because of all of the software that sits between the guest OS and the hardware, emulation can reduce OS and application performance by orders of magnitude versus native execution. This is certainly the case for virtualized systems where the processor has an ISA that's different from the one for which the OS was written (e.g., the version of VirtualPC that ran x86-based Windows on the PowerPC-based Mac platform). However, some modern binary-translation-based approaches, like VMware's products, where guest and host share the same ISA, boast speeds approaching native execution for certain kinds of workloads. (This is because VMware's binary translation kernel emulates only the small fraction of x86 instructions that present problems for virtualization, while passing the rest directly on to the hardware. But more on this in Part II.)
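
The core of any emulator is an interpreter loop: fetch a guest instruction, decode it in software, and only then produce its effect. The toy two-instruction "ISA" below is invented purely to illustrate that loop; a real emulator decodes a full instruction set and models registers, memory, and devices.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    enum { OP_ADD = 0, OP_HALT = 1 };

    int main(void)
    {
        uint8_t  guest_code[] = { OP_ADD, 5, OP_ADD, 7, OP_HALT };
        uint32_t acc = 0;

        for (size_t pc = 0; ; ) {
            uint8_t op = guest_code[pc++];                  /* fetch */
            switch (op) {                                   /* decode */
            case OP_ADD:  acc += guest_code[pc++]; break;   /* execute in software */
            case OP_HALT: printf("acc = %u\n", acc); return 0;
            }
        }
    }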

Classical virtualization

When a guest OS and host processor share the same ISA, and when that ISA is amenable to the trap-and-emulate technique (more on this term in Part II), the VMM can forgo the costly binary translation step and pass the OS and application instruction streams directly on to the processor. The result is that each guest OS and its attendant applications run faster than they would under emulation, but not quite as fast as they would if the OS had exclusive control of the hardware.

With classical virtualization, the processor traps instructions that might accidentally clue the OS in to the fact that there's something odd and unexpected going on behind its back. These problem instructions have to be emulated by the VMM, so that the VMM can keep the guest OS in the dark about what's going on.

We'll talk more about these problem instructions and how the VMM handles them in Part II.

Paravirtualization

Because classical virtualization requires that the VMM trap and emulate a handful of common problem instructions, guest OSes and their applications can sometimes run more slowly than they do when running natively. A technique called paravirtualization remedies this problem by modifying the guest OS so that these instructions don't pose a problem. With a cooperative guest OS that has been properly modified, the VMM can trust the OS to run with less oversight—and less costly overhead.

The main drawback to paravirtualization is that the OS must be modified in order to support the technique. These modifications are typically minimal, but they require access to the OS source code. For this reason, Linux is the most popular paravirtualized OS.

Though I don't often link to Wikipedia, this table provides an excellent overview of virtualization packages and techniques on different platforms. At this point in the article, you should be well equipped to understand most of what you'll find there, so go check it out before proceeding with Part II.

Ultimately, there are a number of factors that play into a decision of which type of virtualization is best for a given implementation. The nature of the hardware and of the guest OS may rule out one or more of the three options, and for hardware/OS combinations where multiple options are possible, performance, stability, or ease of remote management may be among the deciding factors.

In an ideal world, binary translation and paravirtualization wouldn't be necessary, and full virtualization would enable the VMM to run guest OSes at near-native speeds. Historically, the main barrier to making this happen on commodity hardware has been the presence of certain problems in the x86 ISA, problems that Intel has fixed with VT-x but that are nonetheless worth taking a look at in order to understand how virtualization is actually implemented in hardware and software.

Privilege levels, rings, and fooling the guest OS

In the previous installment of the Virtualization Guide, I talked in general ways about the exclusive hardware access privileges that the OS reserves for itself. Now it's time to nuance that picture a bit, so you can see exactly how the OS retains the upper hand over applications and users. This brief installment sets the stage for Part III, which will talk in some detail about Intel VT.

A microprocessor does more than just blindly run whatever instructions are loaded into its front end, without regard for where those instructions came from. Microprocessors are in fact "aware" of the OS, and they provide direct hardware support for enforcing divisions between components of the hardware/software stack that I described in the previous article.

In order to keep applications from usurping any part of the OS's privileged access to system hardware, processors provide a mechanism that allows different programs to run at different privilege levels. These privilege levels are called rings, and they're arranged in a hierarchy that starts with Ring 0 (the lowest, most trusted level) and extends upwards through one or more progressively less-trusted Rings (e.g., Ring 1, Ring 2, and so on).

On any given processor, Ring 0 is the most privileged level, and any software that runs in Ring 0 is running in the most privileged state that the hardware supports. Such trusted software has complete command of the processor and of the rest of the system, which is why Ring 0 is typically reserved exclusively for the OS. Rings 1 and higher are less privileged, and they're home to less sensitive parts of the OS and to user-level application software.

Many processors have only two rings, Ring 0 for the OS and Ring 1 for all the other software in the stack. The x86 ISA, in contrast, has four rings (Rings 0 through 3), presumably because x86's designers thought more was better. But it turns out that virtually all x86 operating systems (with the exception of the erstwhile OS/2) use only two of x86's privilege levels: Ring 0 for the OS and Ring 3 for everything else. Rings 1 and 2 go completely unused.
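
You can actually observe which ring your own code runs in: on x86, the low two bits of the CS segment selector hold the current privilege level. Here is a minimal check (x86-specific, and assuming GCC or Clang inline assembly); a normal user program should report ring 3.

    #include <stdio.h>

    int main(void)
    {
        unsigned short cs;
        /* The low two bits of CS are the current privilege level (CPL). */
        __asm__ volatile ("mov %%cs, %0" : "=r" (cs));
        printf("current ring: %u\n", cs & 3);
        return 0;
    }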

Because programs running in the higher rings have restrictions on what parts of the system they can touch, it's harder for these de-privileged programs to do any real damage to the system, like crash it, or overwrite another user's data either through accident or malice. Conversely, an accidental or malicious error in a Ring 0 program (typically the OS kernel) often has catastrophic consequences for the entire software stack. The general rule is that programs are vulnerable to interference from programs that are running in the same ring or in a lower ring, but not in a higher ring. This rule means that the program at the very lowest ring is untouchable, while the programs in the higher rings are at the mercy of programs running below them.

The introduction of a hypervisor into this ring structure complicates the software stack picture to a greater or lesser degree, depending on the nature of the hardware's ISA and the exact type of virtualization being used. Specifically, the hypervisor must be the most privileged program in the stack that it hosts, which means that it must run in a lower ring than the guest OS. Clearly, this means that the guest OS must be de-privileged by being booted out of Ring 0 and forced to run in a higher ring. Most of the challenge of virtualization lies in keeping the guest OS in the dark about the fact that it's no longer running in Ring 0, and "classical" virtualization solutions meet this challenge with two tricks: trap-and-emulate, and shadow structures.

Trap and emulate

Whenever a program attempts to execute an instruction for which it lacks sufficient privileges (i.e., it needs to be in a lower ring to execute that instruction), the attempted instruction fails and (ideally) triggers a special alert called a fault.

A hypervisor takes advantage of faults to implement virtualization by running the OS in a higher ring and then listening for it to trigger a fault. When the OS executes an instruction for which it needs the Ring 0 privileges to which it's accustomed, the resulting instruction fault alerts the hypervisor. The hypervisor then steps in and takes control of the processor (that is, it traps the fault), so that it can emulate the execution of that instruction. By trapping instructions that fault because they require Ring 0 privileges, and then running those instructions in emulation in order to produce the expected result for the guest OS, the hypervisor can keep the OS from detecting that it's running in a ring other than Ring 0.
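
In rough C, the hypervisor's side of trap-and-emulate looks something like the sketch below. The guest-state structure and helper functions are hypothetical; the point is just the shape of the control flow: fault, decode, emulate, resume.

    #include <stdbool.h>

    struct guest_state { unsigned long rip; unsigned long shadow_cr0; /* ... */ };

    /* Hypothetical helpers; a real hypervisor's decoder and emulator are
     * far more involved. */
    static bool insn_needs_ring0(struct guest_state *g)         { (void)g; return true; }
    static void emulate_privileged_insn(struct guest_state *g)  { g->rip += 3; /* skip it */ }
    static void reflect_fault_into_guest(struct guest_state *g) { (void)g; }

    /* Entered when a de-privileged guest OS faults on an instruction that
     * would have needed Ring 0. */
    void vmm_handle_fault(struct guest_state *guest)
    {
        if (insn_needs_ring0(guest)) {
            /* Produce the effect the guest expects against its shadow state,
             * then resume it as if the instruction had run normally. */
            emulate_privileged_insn(guest);
        } else {
            /* A genuine guest error: hand the fault back to the guest OS. */
            reflect_fault_into_guest(guest);
        }
    }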

The number of instructions in an ISA that must be trapped and emulated by the virtual machine may be very small, but all it takes is one to make the trap-and-emulate technique absolutely necessary.

Shadow structures

You might think of trap-and-emulate as the more passive of the two methods that the hypervisor has of fooling a guest OS. It's passive in the sense that the hypervisor waits on the guest OS to do something that it normally should be able to do but can't, before kicking into action with its deception. But virtualization involves a more active form of deception as well: the constant presentation of certain artificial stage props to the guest OS, props that enable the OS to serve its own running applications without catching wind of the fact that it doesn't have exclusive access to the hardware.

As part of its isolation and arbitration duties, there are a number of special-purpose, hardware-based data structures that an OS must maintain and constantly reference. Some of these structures, which we'll call primary structures, are special-purpose registers on the CPU, while others are tables that are stored in memory. Because a normal microprocessor only supports one copy of each of these primary structures, the hypervisor must have a way to let all the guest OSes share that one copy.

The solution is for the virtual machine to show each guest OS its own private copy of each primary structure. These private, VM-specific copies are called shadow structures, and the hypervisor uses these shadow structures in conjunction with their corresponding primary structures to keep guest OSes from interfering with one another.
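
Conceptually, a shadow structure is just a private, per-guest copy of something the hardware has only one of. The sketch below (the field names and register-loading stub are invented for illustration) shows the idea: the hypervisor keeps each guest's believed values and installs values derived from them into the single real set of registers whenever that guest is scheduled onto the hardware.

    struct shadow_regs {
        unsigned long gdt_base;    /* what this guest believes the GDT base is */
        unsigned long page_table;  /* what this guest believes CR3 holds */
    };

    struct guest {
        struct shadow_regs shadow; /* the per-VM copy the guest reads and "writes" */
    };

    /* Stub standing in for programming the one real set of registers. */
    static void load_hardware_regs(unsigned long gdt_base, unsigned long page_table)
    {
        (void)gdt_base; (void)page_table;
    }

    void vmm_switch_to(struct guest *g)
    {
        /* Install values derived from this guest's shadow copies into the
         * primary structures before letting the guest run. */
        load_hardware_regs(g->shadow.gdt_base, g->shadow.page_table);
    }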

Privileged state

Trap-and-emulate and shadow structures work together to keep the OS from figuring out that it's not running in Ring 0 and that it's actually sharing the hardware with more than one operating system. Behind both of these techniques is the necessity that the hypervisor, instead of the OS, have exclusive write access to privileged state. Now, let's unpack this phrase, because the concept that it conveys is critical to understanding how virtualization works.

"State" is a term that programmers use to refer to the small but essential collection of variable values and tables, held both on the processor and in main memory, that make up any program's "short-term memory." A better way to define a program's state would be to say that state encompasses all of the information you would need to save somewhere if you were going to stop a running program, and then restart it later at the same point in its execution.

Privileged state, then, is private data that the OS needs about currently running applications, data such as which pages of memory they've been allocated and what flags they've set on the processor. This privileged state should only be altered by the most trusted program in the system, and this typically means the OS. However, when a hypervisor takes over the OS's management of privileged state, then the hypervisor has to monitor and manage the OS's access to this data—data that the guest OS still needs in order to do its job. So the hypervisor must provide each guest OS some sort of access to privileged state, but no guest OS must be allowed to alter—or write to—privileged state without the hypervisor's intervention. In many cases, guest OSes may get read access to privileged state, but write access is always forbidden.

When it comes to controlling write access to privileged state, the hardware's ISA can be either a huge help or a huge hindrance. The x86 ISA is the latter, a fact that makes virtualization on x86 hardware especially challenging.

Classical virtualization vs. x86-based virtualization

The trap-and-emulate technique that I've described above is an essential part of what's often called "classical virtualization." Classical virtualization is so called in order to distinguish it from the kind of virtualization that has been done on x86 systems prior to the recent introduction of Intel's VT-x technology. For reasons I'll discuss in the next installment, the trap-and-emulate technique just doesn't work on the x86 ISA, a fact that means that all virtualization on x86 hardware (pre-VT, of course) is either binary translation or paravirtualization. Even a software package like VMware, which most articles on virtualization place in the "full virtualization" category, still uses binary translation (BT) to control the OS's access to the CPU.

Why would a virtualization solution use binary translation—a technology typically reserved for translating between two separate ISAs—to run an x86-based OS on x86 processor hardware? The answer is that x86 unfortunately allows non-faulting write access to privileged state. In other words, the execution of some x86 instructions can have the side-effect of altering privileged state without triggering a fault that would alert a hypervisor to the fact that it needs to intervene and emulate the instruction. This feature of x86 makes classical, trap-and-emulate-based virtualization impossible to implement on x86 hardware prior to the introduction of VT-x.
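
POPF is the textbook example of such a problem instruction: executed in Ring 3, it does not fault, it simply ignores any attempt to change the privileged interrupt-enable flag, so a trap-and-emulate hypervisor never gets a chance to step in. The snippet below (x86-specific, assuming GCC or Clang inline assembly) tries to clear the flag from user mode and shows that nothing happens.

    #include <stdio.h>

    int main(void)
    {
        unsigned long flags;
        __asm__ volatile (
            "pushf\n\t"
            "pop  %0\n\t"      /* read the current flags */
            "btr  $9, %0\n\t"  /* try to clear IF (bit 9), a privileged flag */
            "push %0\n\t"
            "popf\n\t"         /* executes without faulting; the change is dropped */
            "pushf\n\t"
            "pop  %0"          /* read the flags back */
            : "=r" (flags) : : "cc");
        printf("IF still set after POPF: %s\n", (flags & 0x200) ? "yes" : "no");
        return 0;
    }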

Guide to I/O Virtualization


Virtualization is a key enabling technology for the modern datacenter. Without virtualization, tricks like load balancing and multitenancy wouldn't be available from datacenters that use commodity x86 hardware to supply the on-demand compute cycles and networked storage that powers the current generation of cloud-based Web applications.

Even though it has been used pervasively in datacenters for the past few years, virtualization isn't standing still. Rather, the technology is still evolving, and with the launch of I/O virtualization support from Intel and AMD it's poised to reach new levels of performance and flexibility. Our past virtualization coverage looked at the basics of what virtualization is, and how processors are virtualized. The current installment will take a close look at how I/O virtualization is used to boost the performance of individual servers by better virtualizing parts of the machine besides the CPU.

Part 1 described three ways in which a component might be virtualized: emulation, "classic" virtualization, and paravirtualization. Part 2 described in more detail how each of these methods is used in CPU virtualization. But the CPU is not the only part of a computer that can use these techniques; although hardware devices are quite different from a CPU, similar approaches are equally useful.

I/O basics: the case of PCI and PCIe

Before looking at how I/O devices are virtualized, it's important to know in broad terms how they work. These days most PC hardware is, from an electronic and software perspective, PCI or PCI Express (PCIe); although many devices (disk controllers, integrated graphics, on-board networking) are not physically PCI or PCIe—they don't plug into a slot on the motherboard—the way in which they are detected, identified, and communicated with is still via PCI or PCIe.

In PCI, each device is identified by a bus number, a device number, and a device function. A given computer might have several PCI buses which might be linked (one bus used to extend another bus, joined through a PCI bridge) or independent (several buses all attached to the CPU), or some combination of the two. Generally, large high-end machines with lots of I/O expansion have more complicated PCI topologies than smaller or cheaper systems. Each device on a bus is assigned a device number by the PCI controller, and each device exposes one or more numbered functions. For example, many graphics cards offer integrated sound hardware for use with HDMI; typically the graphics capability will be function zero, the sound will be function 1. Only one device can use the bus at any given moment, which is why high-end machines often have multiple independent buses—this allows multiple devices to be active simultaneously.
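
To make the bus/device/function addressing concrete, here is a minimal sketch of the legacy mechanism for reading PCI configuration space from the CPU, using the standard 0xCF8/0xCFC address and data ports. It is x86/Linux-specific, needs root privileges (for iopl), and is meant as an illustration rather than a tool.

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/io.h>   /* Linux-specific: iopl(), outl(), inl() */

    static uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
    {
        uint32_t address = (1u << 31)              /* enable bit */
                         | ((uint32_t)bus << 16)   /* bus number */
                         | ((uint32_t)dev << 11)   /* device number */
                         | ((uint32_t)fn  << 8)    /* function number */
                         | (off & 0xFC);           /* dword-aligned register */
        outl(address, 0xCF8);                      /* select the register... */
        return inl(0xCFC);                         /* ...and read it */
    }

    int main(void)
    {
        if (iopl(3) != 0) { perror("iopl"); return 1; }
        /* Offset 0 of bus 0, device 0, function 0 holds the vendor and device IDs. */
        uint32_t id = pci_config_read(0, 0, 0, 0);
        printf("vendor %04x device %04x\n", (unsigned)(id & 0xFFFF), (unsigned)(id >> 16));
        return 0;
    }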

PCIe operates along similar lines, but it is a point-to-point architecture rather than a bus architecture; rather than all devices (and all hardware slots) on the same bus being electrically connected, in PCIe there are no connections between devices. Instead, each device is connected solely to the controller. Each connection between device and controller is regarded as its own bus; devices are still assigned numbers, but because there can only be one device on each "bus," this number will always be zero. This approach allows software to treat PCIe as if it were PCI, allowing for easier migration from PCI to PCIe. This point-to-point topology alleviates the bus contention problem in PCI—since there is no bus sharing, there are fewer restrictions on concurrent device activity.

Actual data transfer to and from the device can use three mechanisms—system memory, x86 I/O ports, and PCI configuration space. x86 I/O ports are there to provide legacy compatibility, and PCI configuration space is used primarily for configuration. The main way that the OS communicates with PCIe devices is through system memory; this is the only mechanism that allows for large, general-purpose transfers. (With I/O ports, reads and writes are limited to 32 bits, and the CPU must take action after every single read or write, making communication slow and processor-intensive. And PCI configuration space is limited to 256 bytes, and used only for device configuration). Each device is assigned a block of system memory to which it can read and write directly ("DMA," direct memory access). For I/O devices requiring bulk transfers—disk controllers, network adaptors, video cards—this is the primary communication mechanism, as each of these devices performs regular large transfers.

When software wants to tell a PCI device to do something, the host delivers a command to the bus. Each device inspects the command and acts on it if necessary. When the device wants to tell the CPU to do something—either because it has completed a command or received some data—it interrupts the CPU, which in turn executes the device driver. PCI interrupts are generally delivered using four physical interrupt lines. These lines are shared between all devices on the same bus, so when an interrupt arrives, each driver sharing the line must check whether its own device raised it. PCIe interrupts do not use dedicated physical wiring; instead, the device signals an interrupt by performing a special memory write (a message)—PCIe uses the same mechanism for interrupts as it does for data transfer. This avoids the need to share interrupt lines, since each interrupt message identifies the device that raised it.

Virtualizing PCI and PCIe

So, how do these things get virtualized? The first approach is emulation. Just as CPU emulation requires an entire virtual CPU to be run "in software," the same is true of device emulation. Generally, the approach taken is for the virtualization software to emulate well-known real-world devices. All the PCI infrastructure—device enumeration and identification, interrupts, DMA—is replicated in software. These software models respond to the same commands, and do the same thing as their hardware counterparts. The guest OS will write to its virtualized device memory (whether it be system memory, x86 I/O, or PCI configuration space), and trigger interrupts, and the VMM software will respond as if it were real hardware. Even this interrupt signalling uses emulation; one of the emulated devices is an interrupt controller.

This "response" generally means making an equivalent call to the host OS. So, for example, to write some data to disk, the guest OS will use its driver to write that data to the disk controller's device memory, which sits inside a device model—a kind of virtual controller—along with the PCI configuration space and a virtual version of the controller chip. The guest OS's command tells the VMM's virtual disk controller to write the data to a particular location on the (virtual) disk; when the transfer is done, the device model reports completion back to the guest through the VM's virtual interrupt controller. In turn, the VMM's disk controller tells the host OS to write the data to a particular spot in a file (or, when used with so-called raw disks, to a particular spot on disk). The host OS then does the same thing as the guest OS—it copies the data to the real disk controller's device memory via its driver, and the controller signals an interrupt when it's done.
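
Inside the VMM, the device model for something like that virtual disk controller boils down to "turn trapped register writes into host file I/O." The register layout below is invented for illustration (no real controller works exactly like this), but the shape is representative.

    #include <stdint.h>
    #include <unistd.h>

    struct virtual_disk {
        int      backing_fd;     /* host file standing in for the "disk" */
        uint64_t sector;         /* emulated register: target sector */
        void    *guest_buffer;   /* emulated register: guest buffer (already translated) */
        uint32_t byte_count;     /* emulated register: transfer length */
    };

    /* Called by the VMM when a trapped guest access hits this device's
     * emulated register at 'offset'. */
    void virtual_disk_reg_write(struct virtual_disk *d, uint32_t offset, uint64_t value)
    {
        switch (offset) {
        case 0x00: d->sector     = value;           break;
        case 0x08: d->byte_count = (uint32_t)value; break;
        case 0x10:                                  /* "go" register: start the write */
            pwrite(d->backing_fd, d->guest_buffer, d->byte_count, d->sector * 512);
            /* A real device model would now raise the VM's virtual interrupt
             * to tell the guest that the transfer completed. */
            break;
        }
    }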

In the diagram above, you can see that there's an entire virtual device and a virtual interrupt controller in the VM, and then another pair of these in the VMM. That's two layers of emulation before you get to the hardware. (The one element of the diagram above that's probably not at all self-explanatory is the little tab with gears on it beneath the OS. That's the device driver, and the device model in the VMM uses it to interface with the hardware.)

By emulating real-world hardware, pre-existing guest OS drivers can be used, providing greater compatibility and ease of configuration. This is not without some risks; for example, support for the brand of network card that VirtualBox emulated was dropped in Windows Vista, meaning that VirtualBox lost its built-in networking support (this was ultimately addressed by VirtualBox being updated to emulate a second kind of network card, one that was still supported). Overall, however, it provides a simple solution that works with a broad range of guest OSes, and for basic, low bandwidth hardware—PCI controllers, mouse and keyboard controllers, etc.—the performance is acceptable, too.

This approach also provides decoupling. The guest OS might think that it is using, say, an IDE hard disk, but the host might be using SCSI, SATA, or even some future as-yet-uninvented interface. The virtualized hardware is "frozen;" regardless of the host technology, the virtual hardware is always the same. This is important for some use-cases, like Windows 7's Windows XP Mode, which is designed to run legacy software in a legacy OS. Windows XP lacks built-in support for SATA, for example, but since Virtual PC emulates IDE, XP's lack of support does not cause any compatibility issue. The guest OS only has to be compatible with the virtual machine.

The other major advantage of this technique is that it allows multiplexing: many guest OSes can run on the same host OS, all sharing the host's I/O capabilities. That makes it possible to run more guest OSes than one has physical network interfaces or hard disks, which is a great benefit in server consolidation scenarios.

The big problem with this approach is with hardware that performs high bandwidth transfers (such as disk controllers or network interfaces), and with hardware that is very complex (such as graphics cards). For the former, the problem is that every time an I/O operation occurs, the VMM has to trap it and emulate it. Worse, it then has to call into the host OS to actually do the real work (write to disk, send a network packet, etc.), in turn causing additional data copying and interrupts. For a mouse or a keyboard, this overhead is small and not a big issue, but for a hard disk or network interface, which might perform hundreds of megabytes of I/O per second, the overhead is substantial. The result? Higher processor usage and lower throughput.

For the latter case, complex hardware, the problem is simply that emulating the hardware on the CPU is slow; GPUs are extremely fast at certain kinds of number crunching, and emulating this on a CPU is much, much slower. The most common way of avoiding this problem is for the VM software to simply not bother; instead of emulating a complex GPU, it emulates instead a simple 2D device, with no OpenGL or Direct3D capabilities. That's increasingly becoming unattractive, however, as mainstream OSes (including both Windows Vista and Windows 7) are demanding 3D hardware even for regular desktop usage.

One thing that's substantially missing here is any equivalent to CPU virtualization's "binary translation." A key performance feature of virtualization software is that the entire CPU doesn't have to be emulated. Most of the time, it can just run the guest OS's instructions directly. It's only certain unsafe instructions that have to be detected somehow (whether by binary translation or the trap-and-emulate approach) and performed in software. Everything else runs at full hardware speed. I/O devices typically aren't amenable to this kind of approach, because I/O devices, unlike CPUs, don't contain the machinery to be shared among multiple applications and/or users.

The performance problem with emulation can only be avoided by avoiding emulation entirely, which brings us neatly to paravirtualization.

Paravirtualization

Paravirtualization for the CPU requires modifications to the guest OS. Wherever the guest OS would do something that would normally require the VMM to step in, the guest OS either avoids the operation entirely, or tells the host OS what to do in a high-level way. For example, OSes typically disable processor interrupts for brief periods while performing critical operations to ensure that data integrity is maintained. Disabling interrupts requires using a privileged CPU instruction, so this must either be translated or trapped and emulated. With paravirtualization, the guest OS would simply tell the VMM to "disable interrupts." Communicating with the hypervisor can be done without the overhead of binary translation, so this approach can offer improved performance.
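
In code, the difference is small but telling. The sketch below shows how a paravirtualized guest kernel's interrupt-disable routine might look; the hypercall numbers and mechanism are invented for illustration, and real interfaces (Xen's hypercalls, for instance) differ in detail.

    #define HCALL_DISABLE_INTERRUPTS 1
    #define HCALL_ENABLE_INTERRUPTS  2

    /* Hypothetical: hand a request to the hypervisor. In a real
     * paravirtualized kernel this is a single trapping or vectoring
     * instruction agreed upon with the VMM. */
    static long hypercall(long number)
    {
        (void)number;
        return 0;
    }

    /* The guest kernel's interrupt-disable routine, rewritten for the VMM;
     * on bare metal this would simply execute the privileged CLI instruction. */
    void guest_local_irq_disable(void)
    {
        hypercall(HCALL_DISABLE_INTERRUPTS);
    }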

Paravirtualization for the CPU can be problematic because modifications have to be made to the OS core. Such modifications are not an issue for Linux or FreeBSD, but they're not an option for Windows or Mac OS X.

Paravirtualization for I/O takes a similar approach to paravirtualization for the CPU. The VMM exposes a relatively high-level API to guest OSes enabling, say, network operations or disk operations, and the guest OS is modified to use this API. Because I/O devices use drivers and are not part of the core OS, paravirtualization doesn't pose the same problems for I/O as it does for the CPU.
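
The I/O version of the same idea replaces long sequences of emulated register pokes with one high-level request handed to the VMM. The structure below is invented for illustration; real paravirtual interfaces such as virtio are considerably more elaborate (descriptor rings, feature negotiation, and so on).

    #include <stdint.h>

    struct pv_disk_request {
        uint64_t sector;        /* where the data goes */
        uint32_t length;        /* how many bytes */
        uint64_t guest_addr;    /* guest-physical address of the data buffer */
        uint8_t  is_write;      /* 1 = write, 0 = read */
    };

    /* Stub: a real driver would place the request on a ring shared with the
     * VMM and then notify the VMM (a "kick"). */
    void pv_disk_submit(struct pv_disk_request *req)
    {
        (void)req;
    }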



This approach is beginning to gain traction; Xen, VMware, and Microsoft's Hyper-V all provide paravirtualization APIs in addition to their emulated devices, so they can offer accelerated performance to any guests that have suitable paravirtual drivers. Though paravirtualization forfeits the driver compatibility of emulation, it retains the advantages of decoupling the guest from the specifics of the host hardware, and of multiplexing multiple guests onto a single set of physical hardware.

As well as providing improved performance for high-bandwidth devices, there are some efforts underway to use this approach to provide graphical acceleration to VMs. VirtualBox, for example, has experimental support for accelerated 3D within a VM. As with other paravirtualization systems, it requires the use of a special VirtualBox graphics driver within the guest OS. This driver passes 3D commands to the host system, where they are executed on the host's GPU. The results are then passed back to the guest. This use of paravirtualization greatly expands the range of tasks that virtual machines can be used for; robust support for accelerated 3D within a VM might one day make gaming, CAD, and visualization possible within a VM.

There is, however, one widely used kind of hardware whose communication uses neat, encapsulated packets rather than reads and writes to system memory: USB. Though the USB controller is a regular PCI device that uses system memory to communicate, USB itself communicates using packets sent down the USB bus. An increasingly common feature of virtualization software is to continue to emulate the USB controller, but to pass the actual USB packets to the host's USB controller (and vice versa), enabling USB devices attached to the host to be passed through to the guest. The guest then uses its own USB device drivers to communicate with the device on the host.

In this way, there is direct communication between the guest OS and the actual device, allowing the full performance and range of capabilities of the device to be leveraged by the guest. This allows the wide range of USB devices to be used within the guest, without having to emulate each kind of device individually. It even allows the guest to use devices that the host has no driver for—again, this has particular advantages when virtualization is being used for legacy compatibility.

Talking to the hardware directly

Although paravirtualization improves performance, it's still not as good as native performance. To gain native performance, you need to cut out the emulated middle-man. Just as CPU virtualization gets a huge boost by direct execution of code, I/O virtualization would be improved by allowing virtual machines to talk to hardware directly.

This direct approach has an obvious pitfall—the ability to multiplex is lost. If a device is assigned to one guest, it can't be assigned to any other guests. But for many applications, that might not be such a big deal. It's relatively cheap to add a load of network interfaces to a machine (allowing one interface per guest), for example, so cost and management savings can still be achieved over and above dedicated hardware. Direct assignment also requires the guest OS to have an appropriate driver for the hardware, making the approach useless for legacy compatibility.

Hypervisor-based systems like Hyper-V and Xen already perform direct assignment, in a sense. With these hypervisors, all operating systems are run as guests. The first guest—the one used to boot and manage the machine—is special, though, because it has the system's physical hardware available to it. The other guests use paravirtualized drivers to send I/O requests to this special first guest, and it uses its device drivers to communicate with the hardware. A more generalized direct assignment system would extend this capability to any guest.

Direct assignment is not without its problems, however. The big issue is the interrupts and shared memory used by devices to communicate with the CPU. The shared memory that the devices use for communication is all based on physical memory addresses. This is a problem, because each guest has its own virtualized physical memory addressing. The physical addresses used by the real hardware don't correlate to the virtual physical addresses visible to each guest, which means that whenever the guest's driver directs the device to perform DMA, it will end up using the wrong memory addresses. Interrupts pose another problem; they have to be serviced by the host, because only the host has access to the rest of the machine's hardware.

There are probably ways in which this could be worked around; perhaps a special driver in the host to handle the interrupt, translate any physical addresses, and pass it on to the guest, but such a driver would have to be tailored to the physical device to ensure that commands were properly translated.

The IOMMU

To support direct assignment robustly and in a device-independent manner requires support from the hardware. And so that's exactly what's happened, with Intel's VT-d and AMD's AMD IOMMU/AMD-Vi. These extensions add an I/O memory management unit (IOMMU) to the platform. An IOMMU allows the device memory addresses used in DMA to be mapped to physical memory addresses in a manner transparent to the device hardware, in much the same way that a processor's MMU allows virtual memory addresses to be mapped to physical memory addresses.

With an IOMMU, the translation between the guest's physical addresses and the host's physical addresses can be handled completely transparently; the VMM will have to configure the IOMMU in the first place, but after that, everything else will happen automatically.
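
Conceptually, the tables the VMM programs into the IOMMU behave like the sketch below: a per-device (or per-guest) mapping from guest-physical pages to host-physical pages, consulted on every DMA and able to reject anything that isn't mapped. Real IOMMUs use multi-level page tables; the flat array here is purely illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT  12
    #define GUEST_PAGES 1024               /* toy: 4 MiB of guest-physical space */

    struct iommu_domain {
        uint64_t host_page[GUEST_PAGES];   /* guest page number -> host page number */
        bool     present[GUEST_PAGES];     /* has the VMM installed a mapping? */
    };

    /* Translate a device DMA address; returning false blocks the access. */
    bool iommu_translate(const struct iommu_domain *dom, uint64_t guest_pa, uint64_t *host_pa)
    {
        uint64_t page = guest_pa >> PAGE_SHIFT;
        if (page >= GUEST_PAGES || !dom->present[page])
            return false;                  /* DMA outside the allowed ranges */
        *host_pa = (dom->host_page[page] << PAGE_SHIFT)
                 | (guest_pa & ((1u << PAGE_SHIFT) - 1));
        return true;
    }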

IOMMUs have been a feature of some platforms for many years, but x86 has always done without. During the AGP era, a similar (but more limited) device was found in x86 systems, the AGP GART (graphics aperture remapping table). The GART allowed AGP devices to "see" a contiguous view of system memory, even if the underlying memory was not actually contiguous. PCIe has a similar capability, but the PCIe GART is built into the PCIe graphics hardware itself. The AGP GART, in contrast, was a system feature provided by the chipset. The GART was limited, though, as it performed the same mapping for any request (whether by the CPU or the graphics card). A general-purpose IOMMU, one that handles requests from different devices differently, has only recently become available.

Using the IOMMU not only allows the remapping to be performed automatically, it also provides a kind of memory protection. Without an IOMMU, a device can perform DMA to physical addresses that it should not be able to touch; with the IOMMU, such DMA requests can be blocked. The IOMMU can be configured such that a request from a particular device (identified by the bus/device/function triple) can only have access to particular memory ranges, with any accesses outside those ranges being trapped as an error.

The Intel and AMD IOMMUs also support interrupt remapping. Both PCI interrupts and PCIe interrupts are understood by the IOMMU, and redirected and remapped as appropriate.

By using an IOMMU, the hypervisor can safely assign physical hardware directly to guest OSes, ending the need for them to funnel all their I/O through the host, and removing the layers of emulation that are normally needed for virtualized I/O, achieving native-level I/O performance for virtual machines.

The IOMMU is useful in non-virtualization scenarios, too. Many PCI devices can only use 32-bit physical addresses. This means that their buffers must all fit within the first 4 GiB of physical memory. This can make that first 4 GiB of physical memory cramped, especially when some devices, like video cards, create enormous buffers occupying many gigabytes of that memory. The IOMMU solves this problem by allowing the devices to stick with their 32-bit physical addresses, and transparently remapping them to any memory location.

The use of VT-d and AMD IOMMU in this way does sacrifice one of the benefits of emulation and paravirtualization systems: multiplexing. Direct assignment is 1:1; the device can be assigned to exactly one guest. This might not be an issue with some devices, such as multiport network cards, but it stands in the way of, say, robust native-performance virtualization of a graphics card.

The solution to this multiplexing problem has been to extend PCIe so that devices can offer multiple virtualized functions. Devices can already support multiple functions, but those functions are used to expose different hardware capabilities; the virtualized functions are used to expose the same hardware capability several times over. Each bus/device/virtual function triple can be assigned to a different VM, thereby allowing the device to be shared while still being used with directly assigned I/O through the IOMMU.

Widespread support for this is still a ways off. Devices that support the PCIe Single Root I/O Virtualization specification (SR-IOV) are on the market, but are unusual; a few high-end networking controllers support it (e.g. Intel's 82576 Gigabit Ethernet controller and Neterion's X3100 series). Because these devices have to support virtualization in hardware, meaning that any internal buffers have to be replicated for each associated VM, they do not offer the near-unlimited sharing of emulated devices. Nonetheless, Intel's ethernet controller supports 8 virtual functions per port, giving 8 VMs native access to the same physical hardware.

CPU virtualization has been near-native for many years, but the I/O performance of virtual machines has long left something to be desired. If and when PCIe SR-IOV devices become widespread, near-native virtualization of both processor and I/O alike will be a practical reality. When this happens, it will increase performance and reduce costs and overhead in the datacenter, as individual servers will have far less virtualization-related overhead.