Monday, February 22, 2010

Guide to I/O Virtualization


Virtualization is a key enabling technology for the modern datacenter. Without virtualization, tricks like load balancing and multitenancy wouldn't be available from datacenters that use commodity x86 hardware to supply the on-demand compute cycles and networked storage that power the current generation of cloud-based Web applications.

Even though it has been used pervasively in datacenters for the past few years, virtualization isn't standing still. Rather, the technology is still evolving, and with the launch of I/O virtualization support from Intel and AMD it's poised to reach new levels of performance and flexibility. Our past virtualization coverage looked at the basics of what virtualization is, and how processors are virtualized. The current installment will take a close look at how I/O virtualization is used to boost the performance of individual servers by better virtualizing parts of the machine besides the CPU.

Part 1 described three ways in which a component might be virtualized: emulation, "classic" virtualization, and paravirtualization. Part 2 described in more detail how each of these methods was used in CPU virtualization. But the CPU is not the only part of a computer that can use these techniques; although hardware devices are quite different from a CPU, similar approaches are equally useful.

I/O basics: the case of PCI and PCIe

Before looking at how I/O devices are virtualized, it's important to know in broad terms how they work. These days most PC hardware is, from an electronic and software perspective, PCI or PCI Express (PCIe); although many devices (disk controllers, integrated graphics, on-board networking) are not physically PCI or PCIe—they don't plug into a slot on the motherboard—the way in which they are detected, identified, and communicated with is still via PCI or PCIe.

In PCI, each device is identified by a bus number, a device number, and a device function. A given computer might have several PCI buses, which might be linked (one bus used to extend another, joined through a PCI bridge) or independent (several buses all attached to the CPU), or some combination of the two. Generally, large high-end machines with lots of I/O expansion have more complicated PCI topologies than smaller or cheaper systems. Each device on a bus is assigned a device number by the PCI controller, and each device exposes one or more numbered functions. For example, many graphics cards offer integrated sound hardware for use with HDMI; typically the graphics capability will be function 0 and the sound will be function 1. Only one device can use the bus at any given moment, which is why high-end machines often have multiple independent buses—this allows multiple devices to be active simultaneously.
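
To make this bus/device/function addressing concrete, here is a minimal sketch in C of the legacy PCI configuration mechanism, which uses a pair of x86 I/O ports (CONFIG_ADDRESS at 0xCF8 and CONFIG_DATA at 0xCFC) to select and read a register. The outl/inl port-access helpers are assumed to be supplied by the environment; on real hardware they require ring-0 privileges.

    #include <stdint.h>

    /* Assumed to be provided by the environment; on real hardware these
       wrap the x86 OUT/IN instructions and require ring-0 privileges. */
    extern void outl(uint16_t port, uint32_t value);
    extern uint32_t inl(uint16_t port);

    #define PCI_CONFIG_ADDRESS 0xCF8
    #define PCI_CONFIG_DATA    0xCFC

    /* Read a 32-bit register from the configuration space of the device
       identified by the bus/device/function triple. */
    uint32_t pci_config_read(uint8_t bus, uint8_t dev, uint8_t func,
                             uint8_t offset)
    {
        uint32_t address = (1u << 31)              /* enable bit */
                         | ((uint32_t)bus  << 16)
                         | ((uint32_t)dev  << 11)
                         | ((uint32_t)func << 8)
                         | (offset & 0xFC);        /* dword-aligned register */
        outl(PCI_CONFIG_ADDRESS, address);
        return inl(PCI_CONFIG_DATA);
    }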

PCIe operates similarly, but it is a point-to-point architecture rather than a bus architecture; rather than all devices (and all hardware slots) on the same bus being electrically connected, in PCIe there are no connections between devices. Instead, each device is connected solely to the controller. Each connection between device and controller is regarded as its own bus; devices are still assigned numbers, but because there can be only one device on each "bus," this number is always zero. This approach allows software to treat PCIe as if it were PCI, allowing for easier migration from PCI to PCIe. The point-to-point topology also alleviates the bus contention problem in PCI—since there is no bus sharing, there are fewer restrictions on concurrent device activity.
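
PCIe also widens configuration space from 256 bytes to 4096 bytes per function and exposes it through a memory-mapped region instead of I/O ports. Here is a hedged sketch of how the same bus/device/function triple indexes into that region; the ecam_base pointer is an assumption standing in for wherever the platform firmware says the region lives.

    #include <stdint.h>

    /* Assumed: base of the memory-mapped PCIe configuration region,
       reported by the platform firmware and mapped by the OS. */
    extern volatile uint8_t *ecam_base;

    /* Each function gets its own 4 KiB configuration page, selected by
       the bus/device/function triple—the PCI addressing model survives
       intact even though there is no shared bus underneath. */
    static volatile uint32_t *pcie_config_reg(uint8_t bus, uint8_t dev,
                                              uint8_t func, uint16_t offset)
    {
        uintptr_t addr = ((uintptr_t)bus  << 20)
                       | ((uintptr_t)dev  << 15)
                       | ((uintptr_t)func << 12)
                       | (offset & 0xFFC);
        return (volatile uint32_t *)(ecam_base + addr);
    }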

Actual data transfer to and from the device can use three mechanisms—system memory, x86 I/O ports, and PCI configuration space. x86 I/O ports exist to provide legacy compatibility, and PCI configuration space is used primarily for configuration. The main way that the OS communicates with PCIe devices is through system memory; this is the only mechanism that allows for large, general-purpose transfers. (With I/O ports, reads and writes are limited to 32 bits at a time, and the CPU must take action after every single read or write, making communication slow and processor-intensive; PCI configuration space is limited to 256 bytes per function and is used only for device configuration.) Each device is assigned a block of system memory to which it can read and write directly ("DMA," direct memory access). For I/O devices requiring bulk transfers—disk controllers, network adaptors, video cards—this is the primary communication mechanism, as each of these devices performs regular large transfers.
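
To make the DMA mechanism concrete, here is a hypothetical sketch of the kind of descriptor ring a network adaptor driver might set up in that block of system memory. Real controllers each define their own layouts, so the struct and field names here are invented for illustration.

    #include <stdint.h>

    /* Hypothetical DMA descriptor: the driver fills these in, in a ring
       that lives in system memory, and the device reads the descriptors
       (and the buffers they point to) directly, with no per-byte CPU
       involvement. */
    struct dma_descriptor {
        uint64_t buffer_addr;   /* physical address of the data buffer */
        uint16_t length;        /* number of bytes to transfer */
        uint16_t flags;         /* e.g. "ready" or "end of packet" */
        uint32_t status;        /* written back by the device when done */
    };

    #define RING_SIZE 256

    struct dma_ring {
        struct dma_descriptor desc[RING_SIZE];
        uint32_t head;          /* next descriptor the device consumes */
        uint32_t tail;          /* next descriptor the driver fills */
    };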

When software wants to tell a PCI device to do something, the host delivers a command to the bus. Each device inspects the command and acts on it if necessary. When the device wants to tell the CPU something—either because it has completed a command or because it has received some data—it interrupts the CPU, which in turn executes the device driver. PCI interrupts are generally delivered using four physical interrupt connections. These connections are shared between all devices on the same bus, so each device driver must check its device to determine whether the interrupt was meant for it. PCIe interrupts do not use physical interrupt lines; instead, the device signals an interrupt by writing a message to a special address in system memory—PCIe uses the same memory-write mechanism for interrupts as it does for data transfer. This avoids the need to share interrupt lines by enabling interrupts to be directed specifically and solely to the driver that needs them.
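
Here is a hedged sketch of the host-side setup for message-signaled interrupts: the OS tells the device what address to write to, and what value to write, whenever it wants to raise an interrupt. The offsets follow the standard layout of the 32-bit MSI capability, but msi_cap_offset and the pci_config_write helper are assumptions carried over from the earlier configuration-space sketch.

    #include <stdint.h>

    /* Assumed helper: write a 32-bit value into a device's PCI
       configuration space (the counterpart of pci_config_read above). */
    extern void pci_config_write(uint8_t bus, uint8_t dev, uint8_t func,
                                 uint8_t offset, uint32_t value);

    /* Program the device's MSI capability (32-bit variant): from then
       on, the device raises an interrupt by performing an ordinary
       memory write of 'data' to 'address'—no physical interrupt line
       is involved. Setting the enable bit in the message control
       register is omitted for brevity. */
    void msi_setup(uint8_t bus, uint8_t dev, uint8_t func,
                   uint8_t msi_cap_offset,
                   uint32_t address, uint16_t data)
    {
        pci_config_write(bus, dev, func, msi_cap_offset + 0x4, address);
        pci_config_write(bus, dev, func, msi_cap_offset + 0x8, data);
    }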

Virtualizing PCI and PCIe

So, how do these things get virtualized? The first approach is emulation. Just as CPU emulation requires an entire virtual CPU to be run "in software," the same is true of device emulation. Generally, the approach taken is for the virtualization software to emulate well-known real-world devices. All the PCI infrastructure—device enumeration and identification, interrupts, DMA—is replicated in software. These software models respond to the same commands, and do the same things, as their hardware counterparts. The guest OS writes to its virtualized device memory (whether system memory, x86 I/O ports, or PCI configuration space), and the VMM software responds as if it were real hardware, raising interrupts in the guest as needed. Even this interrupt signalling uses emulation; one of the emulated devices is an interrupt controller.
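
A minimal sketch of the core of such a device model, assuming a hypothetical VMM in which guest writes to the emulated device's memory region trap into the handler below. The register layout and names are invented for illustration.

    #include <stdint.h>

    /* Invented register layout for a toy emulated device. */
    #define REG_COMMAND  0x00
    #define REG_DMA_ADDR 0x08

    struct device_model {
        uint32_t status;
        uint64_t dma_addr;
    };

    /* Assumed VMM helper: raise an interrupt in the guest via the
       emulated interrupt controller. */
    extern void vmm_raise_guest_irq(int irq);

    /* Called by the VMM whenever the guest writes to the emulated
       device's memory region; offset is relative to that region. */
    void device_mmio_write(struct device_model *dm,
                           uint32_t offset, uint32_t value)
    {
        switch (offset) {
        case REG_DMA_ADDR:
            dm->dma_addr = value;       /* guest sets up a transfer */
            break;
        case REG_COMMAND:
            /* Emulate the command (typically by calling into the host
               OS), then signal completion as real hardware would. */
            dm->status = 1;             /* "done" */
            vmm_raise_guest_irq(10);    /* arbitrary IRQ for the sketch */
            break;
        }
    }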

This "response" generally means making an equivalent call to the host OS. So, for example, to write some data to disk, the guest OS will use its driver to write the data and the command to the disk controller's device memory, which sits inside a device model—a kind of virtual controller—along with the PCI configuration space and a virtual version of the controller chip. The VMM's virtual disk controller then tells the host OS to write the data to a particular spot in a file (or, when used with so-called raw disks, to a particular spot on disk), and signals completion back to the guest via the VM's virtual interrupt controller. The host OS, in turn, does the same thing the guest OS did—it copies the data to the real disk controller's device memory via its driver and waits for the hardware's interrupt.
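
A hedged sketch of that last hop, assuming a hypothetical virtual disk controller whose backing store is a plain file on the host; a handler like this would be invoked from the device model when the guest issues a write command.

    #include <stdint.h>
    #include <unistd.h>

    #define SECTOR_SIZE 512

    /* Hypothetical: the virtual disk controller is backed by a disk
       image file on the host, opened ahead of time. */
    struct virtual_disk {
        int backing_fd;     /* host file descriptor for the disk image */
    };

    /* Translate the guest's "write these sectors" command into the
       equivalent host OS call: a positioned write into the image file. */
    int handle_guest_disk_write(struct virtual_disk *vd, uint64_t sector,
                                const void *data, uint32_t num_sectors)
    {
        off_t offset = (off_t)sector * SECTOR_SIZE;
        ssize_t n = pwrite(vd->backing_fd, data,
                           (size_t)num_sectors * SECTOR_SIZE, offset);
        return (n < 0) ? -1 : 0;    /* on success, the device model then
                                       raises the guest's completion
                                       interrupt */
    }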

In the diagram above, you can see that there's an entire virtual device and a virtual interrupt controller in the VM, and then another pair of these in the VMM. That's two layers of emulation before you get to the hardware. (The one element of the diagram that's probably not at all self-explanatory is the little tab with gears on it beneath the OS. That's the device driver; the device model in the VMM uses it to interface with the hardware.)

Because real-world hardware is emulated, pre-existing guest OS drivers can be used, providing greater compatibility and ease of configuration. This is not without some risks; for example, support for the brand of network card that VirtualBox emulated was dropped in Windows Vista, meaning that VirtualBox lost its built-in networking support (this was ultimately addressed by updating VirtualBox to emulate a second kind of network card, one that was still supported). Overall, however, emulation provides a simple solution that works with a broad range of guest OSes, and for basic, low-bandwidth hardware—PCI controllers, mouse and keyboard controllers, etc.—the performance is acceptable, too.

This approach also provides decoupling. The guest OS might think that it is using, say, an IDE hard disk, but the host might be using SCSI, SATA, or even some future, as-yet-uninvented interface. The virtualized hardware is "frozen"; regardless of the host technology, the virtual hardware is always the same. This is important for some use cases, like Windows 7's Windows XP Mode, which is designed to run legacy software in a legacy OS. Windows XP lacks built-in support for SATA, for example, but since Virtual PC emulates IDE, XP's lack of support causes no compatibility problems. The guest OS only has to be compatible with the virtual machine.

The other major advantage of this technique is that it allows multiplexing: many guest OSes can run on the same host OS, all sharing the host's I/O capabilities. This makes it possible to run more guest OSes than one has physical network interfaces or hard disks, which is greatly beneficial in system consolidation situations.

The big problem with this approach arises with hardware that performs high-bandwidth transfers (such as disk controllers or network interfaces), and with hardware that is very complex (such as graphics cards). For the former, the problem is that every time an I/O operation occurs, the VMM has to trap it and emulate it. Worse, it then has to call into the host OS to actually do the real work (write to disk, send a network packet, etc.), in turn causing additional data copying and interrupts. For a mouse or a keyboard, this overhead is small and not a big issue, but for a hard disk or network interface, which might perform hundreds of megabytes of I/O per second, the overhead is substantial. The result? Higher processor usage and lower throughput.

For the latter case, complex hardware, the problem is simply that emulating the hardware on the CPU is slow; GPUs are extremely fast at certain kinds of number crunching, and emulating this on a CPU is much, much slower. The most common way of avoiding this problem is for the VM software to simply not bother; instead of emulating a complex GPU, it emulates a simple 2D device with no OpenGL or Direct3D capabilities. That's becoming increasingly unattractive, however, as mainstream OSes (including both Windows Vista and Windows 7) demand 3D hardware even for regular desktop usage.

One thing that's conspicuously missing here is any equivalent to CPU virtualization's "binary translation." A key performance feature of virtualization software is that the entire CPU doesn't have to be emulated. Most of the time, it can just run the guest OS's instructions directly. Only certain unsafe instructions have to be detected somehow (whether by binary translation or the trap-and-emulate approach) and performed in software; everything else runs at full hardware speed. I/O devices typically aren't amenable to this kind of approach because, unlike CPUs, they don't contain the machinery needed to be shared safely among multiple applications and users.

The performance problem with emulation can only be avoided by avoiding emulation entirely, which brings us neatly to paravirtualization.

Paravirtualization

Paravirtualization for the CPU requires modifications to the guest OS. Wherever the guest OS would do something that would normally require the VMM to step in, the guest OS either avoids the operation entirely or tells the VMM what to do in a high-level way. For example, OSes typically disable processor interrupts for brief periods while performing critical operations to ensure that data integrity is maintained. Disabling interrupts requires a privileged CPU instruction, so this must either be translated or trapped and emulated. With paravirtualization, the guest OS instead simply tells the VMM to "disable interrupts." Communicating with the hypervisor in this way avoids the overhead of binary translation, so this approach can offer improved performance.
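
A minimal sketch of the idea, assuming a hypothetical hypercall interface; the names vmm_hypercall and HCALL_DISABLE_INTERRUPTS are invented, and real hypervisors such as Xen define their own hypercall tables.

    /* Hypothetical hypercall numbers. */
    enum hypercall {
        HCALL_DISABLE_INTERRUPTS,
        HCALL_ENABLE_INTERRUPTS,
    };

    /* Assumed to be a thin stub that traps into the VMM, for example
       via a dedicated instruction. */
    extern long vmm_hypercall(enum hypercall number);

    /* In a paravirtualized guest kernel, this replaces the privileged
       CLI instruction: rather than the VMM trapping and emulating the
       instruction, the guest states its intent directly. */
    static inline void arch_disable_interrupts(void)
    {
        vmm_hypercall(HCALL_DISABLE_INTERRUPTS);
    }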

Paravirtualization for the CPU can be problematic because modifications have to be made to the OS core. Such modifications are feasible for open-source OSes like Linux and FreeBSD, but they're not an option for closed-source OSes like Windows or Mac OS X.

Paravirtualization for I/O takes a similar approach to paravirtualization for the CPU. The VMM exposes a relatively high-level API to guest OSes enabling, say, network operations or disk operations, and the guest OS is modified to use this API. Because I/O support lives in drivers, which can be installed into an otherwise unmodified OS, rather than in the OS core, paravirtualization doesn't pose the same problems for I/O as it does for the CPU.
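
A hedged sketch of what such an API might look like from the guest driver's side; the names are invented, but the shape of the interface—"send this packet" rather than "poke these device registers"—is the essence of the approach. (Production systems such as virtio achieve the same thing with shared rings of buffers.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical high-level paravirtual I/O interface exposed by the
       VMM: each call is one request, not a sequence of emulated
       register accesses to be trapped one by one. */
    extern int pv_net_send(const void *frame, size_t len);
    extern int pv_block_write(uint64_t sector, const void *data,
                              size_t num_sectors);

    /* A paravirtual network driver's transmit path collapses into a
       single high-level operation. */
    int pv_driver_transmit(const uint8_t *frame, size_t len)
    {
        return pv_net_send(frame, len);
    }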



This approach is beginning to gain traction; Xen, VMware, and Microsoft's Hyper-V all provide paravirtualization APIs in addition to their emulated devices, so they can offer accelerated performance to any guests that have suitable paravirtual drivers. Though paravirtualization forfeits the driver compatibility of emulation, it retains the advantages of decoupling the guest from the specifics of the host hardware, and of multiplexing multiple guests onto a single set of physical hardware.

As well as providing improved performance for high-bandwidth devices, there are some efforts underway to use this approach to provide graphical acceleration to VMs. VirtualBox, for example, has experimental support for accelerated 3D within a VM. As with other paravirtualization systems, it requires the use of a special VirtualBox graphics driver within the guest OS. This driver passes 3D commands to the host system, where they are executed on the host's GPU, and the results are then passed back to the guest. This use of paravirtualization greatly expands the range of tasks that virtual machines can be used for; robust support for accelerated 3D might one day make gaming, CAD, and visualization practical within a VM.

There is, however, a kind of widely used hardware where communication uses neat, encapsulated packets rather than reads and writes to system memory: USB. Though the USB controller is a regular PCI device that uses system memory to communicate, USB devices themselves communicate using packets sent over the USB bus. An increasingly common feature of virtualization software is to continue to emulate the USB controller, but to pass the actual USB packets through to the host's USB controller (and vice versa), enabling USB devices attached to the host to be passed through to the guest. The guest then uses its own USB device drivers to communicate with the device on the host.
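
A hedged sketch of that forwarding step, with invented names: the emulated controller hands each guest-originated packet to the host's USB stack unchanged and queues the reply back to the guest. The packets themselves are opaque to the VMM.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical handle to a real USB device, opened through the
       host OS's USB stack. */
    struct host_usb_device;

    /* Assumed helpers: submit a packet to the real device, and deliver
       a completed reply back into the guest's emulated controller. */
    extern int host_usb_submit(struct host_usb_device *dev,
                               const uint8_t *packet, size_t len,
                               uint8_t *reply, size_t *reply_len);
    extern void guest_usb_complete(const uint8_t *reply, size_t len);

    /* Called when the guest's emulated USB controller has a packet to
       send: forward it verbatim and hand the reply back. */
    int usb_passthrough(struct host_usb_device *dev,
                        const uint8_t *packet, size_t len)
    {
        uint8_t reply[4096];
        size_t reply_len = sizeof(reply);
        if (host_usb_submit(dev, packet, len, reply, &reply_len) < 0)
            return -1;
        guest_usb_complete(reply, reply_len);
        return 0;
    }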

In this way, there is direct communication between the guest OS and the actual device, allowing the full performance and range of capabilities of the device to be leveraged by the guest. This allows the wide range of USB devices to be used within the guest, without having to emulate each kind of device individually. It even allows the guest to use devices that the host has no driver for—again, this has particular advantages when virtualization is being used for legacy compatibility.

Talking to the hardware directly

Although paravirtualization improves performance, it's still not as good as native performance. To gain native performance, you need to cut out the emulated middle-man. Just as CPU virtualization gets a huge boost by direct execution of code, I/O virtualization would be improved by allowing virtual machines to talk to hardware directly.

This direct approach has an obvious pitfall—the ability to multiplex is lost. If a device is assigned to one guest, it can't be assigned to any other guests. But for many applications, that might not be such a big deal. It's relatively cheap to add a load of network interfaces to a machine (allowing one interface per guest), for example, so cost and management savings can still be achieved relative to using dedicated physical servers. Direct assignment also requires the guest OS to have an appropriate driver for the hardware, making the approach useless for legacy compatibility.

Hypervisor-based systems like Hyper-V and Xen already perform direct assignment, in a sense. With these hypervisors, all operating systems run as guests. The first guest—the one used to boot the machine—is special, though, because it has the system's physical hardware available to it. The other guests use paravirtualized drivers to send I/O requests to this special first guest, which uses its device drivers to communicate with the hardware. A more generalized direct assignment system would extend this capability to any guest.

Direct assignment is not without its problems, however. The big issue is the interrupts and shared memory used by devices to communicate with the CPU. The shared memory that devices use for communication is addressed by physical memory addresses, but each guest has its own virtualized physical address space. The physical addresses used by the real hardware don't correspond to the virtual physical addresses visible to each guest, which means that whenever the guest's driver directs the device to perform DMA, the device will end up using the wrong memory addresses. Interrupts pose another problem; they have to be serviced by the host, because only the host has access to the rest of the machine's hardware.
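
To make the addressing problem concrete, here is a small sketch with invented names. The guest driver programs the device with what it believes is a physical address; unless something translates it, the device's DMA lands at that raw address in host memory, clobbering whatever happens to live there.

    #include <stdint.h>

    /* Hypothetical VMM table mapping one guest's "physical" pages to
       real host-physical pages—in software, the job an IOMMU does in
       hardware. */
    extern uint64_t gpa_to_hpa(uint64_t guest_phys_addr);

    /* The guest writes a DMA target into the device. The value is a
       guest-physical address; the device deals only in host-physical
       addresses, so this translation is the missing step that direct
       assignment needs somebody to perform. */
    uint64_t fix_dma_target(uint64_t guest_phys_addr)
    {
        return gpa_to_hpa(guest_phys_addr);
    }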

There are probably ways in which this could be worked around; perhaps a special driver in the host could handle the interrupt, translate any physical addresses, and pass the result on to the guest. But such a driver would have to be tailored to each physical device to ensure that commands were properly translated.

The IOMMU

To support direct assignment robustly and in a device-independent manner requires support from the hardware. And that's exactly what's happened, with Intel's VT-d and AMD's IOMMU technology (AMD-Vi). These extensions add an I/O memory management unit (IOMMU) to the platform. An IOMMU allows the device memory addresses used in DMA to be mapped to physical memory addresses in a manner transparent to the device hardware, in much the same way as a processor's MMU allows virtual memory addresses to be mapped to physical memory addresses.

With an IOMMU, the translation between the guest's physical addresses and the host's physical addresses can be handled completely transparently; the VMM will have to configure the IOMMU in the first place, but after that, everything else will happen automatically.

IOMMUs have been a feature of some platforms for many years, but x86 has always done without. During the AGP era, a similar (but more limited) device was found in x86 systems: the AGP GART (graphics address remapping table). The GART allowed AGP devices to "see" a contiguous view of system memory, even if the underlying memory was not actually contiguous. PCIe has a similar capability, but the PCIe GART is built into the PCIe graphics hardware itself; the AGP GART, in contrast, was a system feature provided by the chipset. The GART was limited, though, as it performed the same mapping for any request (whether from the CPU or the graphics card). A general-purpose IOMMU, one that handles requests from different devices differently, has only recently become available on x86.

Using the IOMMU not only allows the remapping to be performed automatically, it also provides a kind of memory protection. Without an IOMMU, a device can perform DMA to physical addresses that it should not be able to touch; with the IOMMU, such DMA requests can be blocked. The IOMMU can be configured such that a request from a particular device (identified by the bus/device/function triple) can only have access to particular memory ranges, with any accesses outside those ranges being trapped as an error.
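
A hedged sketch of both halves of this, with invented structures: one translation table per assigned device, consulted—in real hardware, by the IOMMU itself—on every DMA request, with unmapped addresses rejected as errors.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT    12
    #define TABLE_ENTRIES 4096      /* illustrative size, not a real limit */

    /* Hypothetical per-device translation table: indexed by the page
       number the device uses, yielding a host-physical page number,
       or 0 for "not mapped". */
    struct iommu_domain {
        uint64_t page_map[TABLE_ENTRIES];
    };

    /* VMM setup: grant the device access to one page of guest memory. */
    void iommu_map(struct iommu_domain *dom,
                   uint64_t device_pfn, uint64_t host_pfn)
    {
        dom->page_map[device_pfn % TABLE_ENTRIES] = host_pfn;
    }

    /* What the hardware does per DMA request: translate the address,
       or block the access if the page was never mapped. */
    bool iommu_translate(const struct iommu_domain *dom,
                         uint64_t dev_addr, uint64_t *host_addr)
    {
        uint64_t pfn = dom->page_map[(dev_addr >> PAGE_SHIFT) % TABLE_ENTRIES];
        if (pfn == 0)
            return false;   /* DMA outside permitted ranges: trapped */
        *host_addr = (pfn << PAGE_SHIFT)
                   | (dev_addr & ((1u << PAGE_SHIFT) - 1));
        return true;
    }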

The Intel and AMD IOMMUs also support interrupt remapping. Both legacy PCI interrupts and PCIe message-signaled interrupts are understood by the IOMMU and remapped as appropriate.

By using an IOMMU, the hypervisor can safely assign physical hardware directly to guest OSes, ending the need for them to funnel all their I/O through the host, and removing the layers of emulation that are normally needed for virtualized I/O, achieving native-level I/O performance for virtual machines.

The IOMMU is useful in non-virtualization scenarios, too. Many PCI devices can only use 32-bit physical addresses, which means that their buffers must all fit within the first 4 GiB of physical memory. This can make that first 4 GiB cramped, especially when devices like video cards need large buffers occupying a substantial chunk of that memory. The IOMMU solves this problem by allowing such devices to stick with their 32-bit physical addresses, transparently remapping them to any memory location.

The use of VT-d and AMD IOMMU in this way does sacrifice one of the benefits of emulation and paravirtualization systems: multiplexing. Direct assignment is 1:1; the device can be assigned to exactly one guest. This might not be an issue with some devices, such as multiport network cards, but it stands in the way of, say, robust native-performance virtualization of a graphics card.

The solution to this multiplexing problem has been to extend PCIe so that devices can offer multiple virtual functions. Devices can already support multiple functions, but those functions are used to expose different hardware capabilities; virtual functions instead expose the same hardware capability several times over. Each bus/device/virtual function triple can be assigned to a different VM, thereby allowing the device to be shared while still being used with directly assigned I/O through the IOMMU.
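
A hedged sketch of how software might discover this capability, reusing an assumed configuration-space read helper in the spirit of the earlier sketches. The extended capability list layout and the SR-IOV capability ID (0x0010) are standard PCIe; everything else here is illustrative.

    #include <stdint.h>

    /* Assumed helper: read a 32-bit register from a function's PCIe
       extended configuration space. */
    extern uint32_t pcie_config_read(uint8_t bus, uint8_t dev,
                                     uint8_t func, uint16_t offset);

    #define PCIE_EXT_CAP_START 0x100
    #define EXT_CAP_ID_SRIOV   0x0010

    /* Walk the PCIe extended capability list looking for the SR-IOV
       capability; returns its offset, or 0 if the device doesn't
       support SR-IOV. Each 32-bit capability header holds a 16-bit
       capability ID and a 12-bit offset to the next capability. */
    uint16_t find_sriov_cap(uint8_t bus, uint8_t dev, uint8_t func)
    {
        uint16_t offset = PCIE_EXT_CAP_START;
        while (offset != 0) {
            uint32_t header = pcie_config_read(bus, dev, func, offset);
            if ((header & 0xFFFF) == EXT_CAP_ID_SRIOV)
                return offset;
            offset = (uint16_t)(header >> 20);  /* next capability */
        }
        return 0;
    }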

Widespread support for this is still a ways off. Devices that support the PCIe Single Root I/O Virtualization (SR-IOV) specification are on the market, but they are unusual; a few high-end networking controllers support it (e.g., Intel's 82576 Gigabit Ethernet controller and Neterion's X3100 series). Because these devices have to support virtualization in hardware—meaning that any internal buffers have to be replicated for each associated VM—they do not offer the near-unlimited sharing of emulated devices. Nonetheless, Intel's Ethernet controller supports eight virtual functions per port, giving eight VMs native access to the same physical hardware.

CPU virtualization has been near-native for many years, but the I/O performance of virtual machines has long left something to be desired. If and when PCIe SR-IOV devices become widespread, near-native virtualization of both processor and I/O alike will be a practical reality. When this happens, it will increase performance and reduce costs and overhead in the datacenter, as individual servers will have far less virtualization-related overhead.
