Wednesday, December 12, 2007

Inside Barcelona: AMD's Next Generation

In 2002, AMD’s K7 was lagging behind the high frequency 130nm Pentium 4 (Northwood). The 2003 launch of the K8 was a bet-the-company move for AMD. But by 2005, they had clearly hit the jackpot with the K8 microprocessor. They had three of the four major OEMs pitching their server products, and Intel’s products barely kept pace in the UP (1 socket) and DP (2 socket) server markets. In the MP (4 socket) server market, AMD was clearly the best choice by a wide margin.

This success was due to a confluence of favorable factors, both political and technical. Technically, the K8 was a solid, conservative design that built on the previous generation K7, and was ideally suited for the server market. In contrast, Intel’s competing Pentium 4 was designed for consumer and media workloads, and was decidedly sub-optimal for servers and notebooks. Worse yet, the P4 failed to deliver the frequency scaling that was the key idea behind the microarchitecture, due to power and thermal issues at 90nm and beyond. These design goals had been dictated by internal Intel politics at the highest level. Intel’s corporate strategy was to push Itanium processors for the server market, and x86 for the desktop and notebook space. Unfortunately, the Itanium project alienated Sun and IBM, and did not get as much traction as many had been expecting because of price, compatibility and availability issues. This left a hole in Intel’s plans, which is exactly where the K8 was aimed; the K8 was enthusiastically embraced by the Linux community and Microsoft. By any measure, AMD’s bet-the-company strategy had worked out quite well, finally bringing them a measure of financial success.

Unfortunately, this got Intel’s attention rather quickly. Intel scrapped several P4 follow-on projects in various states of completion and changed direction rather rapidly. One strength of Intel’s culture and employees is the ability to not only deliver, but to excel and exceed expectations under pressure. Intel’s Israeli design team delivered in spades with the Core microarchitecture (based on the P6), which shipped in the first half of last year and easily took the lead in performance and efficiency across almost every market. The effects were dramatic, and certainly felt at AMD, which experienced an erosion of profitability, average selling prices and market share in the first quarter of the year, a trend that will continue in the second quarter. Of course, this is a familiar position for AMD; precisely where they were in late 2002 and early 2003 before the Opteron launch. Yet again, AMD’s fate depends on a product that will be released in the near future.


Barcelona - The Proverbial Ace in the Hole

Over the course of the last year, AMD has slowly been revealing more and more details on their next generation processor, codenamed Barcelona. The first information came out in a keynote address from Senior Fellow Chuck Moore at the Spring Processor Forum in 2006. At the following Fall Processor Forum, Ben Sander gave a much more detailed outline of the microarchitecture for Barcelona. More recently, Shawn Searles gave a presentation at ISSCC ‘07 which described the physical implementation challenges of Barcelona and some of the design choices.

Barcelona is the first major architectural alteration to the K8 since it debuted in 2003. The K8 built on the very capable microarchitecture of the K7, and added 64 bit operation, two integrated DDR memory controllers and three HyperTransport links. These features were not novel; AMD’s architects followed in the footsteps of the Alpha EV7, which was the first MPU to integrate memory controllers (8 channels of DRDRAM), on-die routing (4 inter-processor links) and directories. However, the K8 advanced the state of the art by bringing 64 bits and higher levels of integration to x86, the mainstream instruction set architecture. The K8 was the first AMD product to meet with any success in the server world, a clear testament to the wisdom of evolutionary and conservative design choices.

In many ways, Barcelona continues down this conservative path of evolution. There are no radical changes. In fact, Barcelona has the same basic 12 stage pipeline as the K8, and many of the microarchitectural improvements in Barcelona have been successfully demonstrated elsewhere. This in no way detracts from the efforts of AMD’s architects and engineers – high risk features are inappropriate for a company that cannot afford a product failure.

This article will bring together all the existing information on Barcelona into a single place, discussing the system aspects and microarchitecture of Barcelona as well as the circuit design challenges and performance. This also presents a wonderful opportunity to examine what areas AMD focused on, in comparison to where Intel spent much of their effort enhancing the P6 core to produce the Pentium M and Core 2 line.

Barcelona is a 283mm2 design that uses 463M transistors to implement four cores and a shared 2MB L3 cache in AMD’s 65nm process. The SOI process uses 11 layers of copper interconnect with a low-k dielectric, dual stress liners and embedded SiGe for PMOS transistors. The device described at ISSCC was targeted at 2.2-2.8GHz at 1.15V, while operating within a 95W maximum thermal envelope. AMD claims that their 65nm process has a 15ps FO4 inversion delay, which suggests that each of Barcelona’s pipeline stages is just a little less than 24 FO4 delays (at 2.8GHz, the cycle time is roughly 357ps, and 357ps / 15ps ≈ 23.8 FO4). Later sections of this article will delve into seven major areas: the system architecture, the five major sections of the microarchitecture, and lastly circuit level improvements and other features.

System Architecture

The K8 system architecture is already quite good. For four socket servers, it clearly rules the x86 roost. The two major issues for AMD systems were the lack of quad core processors, and poor eight socket server performance.

The single most emphasized selling point for Barcelona is that it integrates four processor cores, bringing them to parity with Intel’s Xeon 53xx series. The Xeon 53xx, codenamed Clovertown, is actually a pair of dual core Woodcrest processors in a multi-chip package (MCP). These processors communicate over the front-side bus, rather than through an on-chip bus or caches. In contrast, AMD has opted for a shared cache approach, where the last level of cache, the L3 is used by all four cores. Figure 1 below compares the Revision F Opteron, Barcelona and Intel’s upcoming 3GHz Clovertown.


Figure 1 – System Architecture Comparison

The architects that designed Barcelona opted for a fully integrated MPU. A monolithic device ultimately provides higher performance, especially for bandwidth sensitive workloads that don’t benefit from caching, such as HPC or data mining. However, like any engineering decision, it does not come without trade-offs. First of all, fully integrating everything is a decision that must be made at the beginning of the project. An MCP approach is far less time consuming and can use a slightly modified existing product; most importantly, these changes can be made late in the design cycle. Monolithic devices also have lower yields, because the larger die size means fewer candidate dice per wafer, and hence random defects have a larger impact. Monolithic MPUs are also more difficult to bin for frequency, since to run at a given speed, all four cores must exceed that target with appropriate power dissipation. However, there are design techniques that will let an MPU with a slow core and a fast core run at the slow speed, but with lower power.

While AMD’s marketing department likes to bill their approach as a ‘native’ or ‘true’ quad core design, the truth is that both approaches are equally valid; a fact belatedly recognized by some of AMD’s own executives. Intel’s Clovertown is a quad core device. Operating systems recognize Clovertown as four processors, and it certainly offers higher performance for many applications than a dual core MPU. However, it is equally true that in most situations performance favors fully integrated quad cores.

In the case of Barcelona, the advantages of greater integration have been augmented by careful attention to I/O bandwidth. The memory controllers in Barcelona received a major overhaul. The most visible change is that each controller supports independent 64B transactions, rather than a single 128B transaction across both controllers (memory mirroring is also supported now). Since DDR2 bursts stay at 32B, this improves command efficiency. However, when using DDR3, the command efficiency will drop because the burst length will double to 64B. Each controller also supports a separate set of open pages in DRAM, which is controlled by a new history based pattern predictor (which is somewhat analogous to a simple branch predictor). The predictor uses both per-bank access history and page accesses across banks to decide whether to keep a page open to improve performance, or close the page to reduce power. Lastly, Barcelona introduces data poisoning, which ensures that if a double bit error is detected by ECC, it is contained and only impacts the process which first accesses it, rather than crashing or corrupting the whole system.
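
AMD has not disclosed the predictor’s actual structure, but a minimal history-based page-close predictor might look like the following C sketch; the table size, counter width and thresholds are all assumptions for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_BANKS 8

    /* One 2-bit saturating counter per bank: high values mean recent
     * accesses re-used the open page (keep it open), low values mean
     * accesses kept conflicting (close the page to save power). */
    static uint8_t page_history[NUM_BANKS];

    void train_predictor(int bank, bool hit_open_page)
    {
        if (hit_open_page) {
            if (page_history[bank] < 3) page_history[bank]++;
        } else {
            if (page_history[bank] > 0) page_history[bank]--;
        }
    }

    /* Called after a DRAM access completes: leave the row open or
     * issue a precharge, based on this bank's recent behavior. */
    bool keep_page_open(int bank)
    {
        return page_history[bank] >= 2;
    }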

While revision F Opteron processors supported DDR2, there was little performance advantage, if any. To actually take advantage of the available bandwidth for DDR2, deeper request and response queues are needed; these changes were not made in revision F, but are present in Barcelona. AMD also introduced a 16-20 entry write buffer in the memory controller, so that writes can be deferred, avoiding costly bus turn-arounds. Lastly, the memory controllers now support DRAM prefetchers that share the write buffer and can detect positive and negative strides. Server versions of Barcelona will support registered DIMMs at up to 667MHz, and desktop versions will work with slightly faster 800MHz DDR2.

Barcelona also adds a fourth HyperTransport link for interprocessor communications and I/O devices. With four links, system vendors can build fully connected four socket systems; this reduces transaction latency substantially, since all processors can be reached with a single hop. Each node within the system could even have an attached I/O hub (see our preview of Barcelona). However, the current socket infrastructure only supports three HT1.1 links, so these innovative system designs will have to wait for a new socket interface. Initially, each link will run at 2GT/s, but they are compatible with HyperTransport 3.0 and future parts may operate at up to 5.2GT/s in newer systems. HT3.0 can also modulate link width and frequency to save power. Coherent HyperTransport also features a slight change that will improve latency for some transactions. When a K8 fetches a cache line into its L1D or L2, it has to snoop the system and wait for the results. In particular, the K8 will snoop memory and every other cache in the system; once it gets all of these responses, it can use the cache line it fetched. However, in Barcelona, if a requested cache line is in the M or O state (meaning that a cache holds the dirty data and memory’s copy is stale), the CPU does not wait for the snoop response from memory, improving the transaction latency. The newer protocol also adds a retry mechanism to survive transient errors at higher clock rates.
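
In protocol terms the shortcut is small; the sketch below illustrates it using MOESI state names, with an interface that is purely illustrative.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, OWNED, MODIFIED } moesi_t;

    /* Barcelona's shortcut: if any peer cache reports the requested line
     * as Modified or Owned, that cache will source the data and memory's
     * copy is stale, so the requester need not wait for the memory
     * controller's (slower) response. */
    bool must_wait_for_memory(moesi_t peer_state)
    {
        return peer_state != MODIFIED && peer_state != OWNED;
    }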

HyperTransport 3.0 also adds a feature called ‘unganging’ or lane-splitting. The HT3.0 links are actually composed of 16 bit lanes running in each direction. These lanes can be split up into a pair of independent 8-bit wide links. This is fairly useful for connecting to I/O devices, as few systems have enough I/O devices to saturate a full 8GB/s interface; even 8 SAS hard drives and a pair of 10GbE cards would not require that much bandwidth. However, AMD has also pitched unganging as a way to build fully interconnected 8 socket servers. Previous generation Opteron-based systems supported up to 8 sockets, but the performance is positively underwhelming on the few benchmarks that have been published (mainly SAP 2 tier and SPECjbb2005). While Barcelona will offer higher performance 8 socket implementations, it isn’t clear how much demand there is from end-users. Sun and Fujitsu currently sell 8 socket servers, but both HP and Dell shelved earlier efforts in 2003.


The Fetch Phase

The front end of Barcelona is fairly complex and has been substantially improved over the K8, as shown in Figure 2 below. Each cycle, Barcelona fetches 32B of instructions from the L1I cache into the predecode/pick buffer. The previous generation K8 fetched 16B each cycle, as does Intel’s Core 2. The instruction fetch was widened because many of the SIMD and 64 bit instructions are longer, and as these become more common, larger fetches are required to keep the rest of the core busy. Consequently, the pre-decode and pick buffer for Barcelona has been enlarged, to at least 32B, although it could be somewhat larger - the K8's predecode buffer was 1.5x the fetch size, so a 48B buffer might not be out of the question.

This makes sense as Barcelona is targeted first and foremost at servers, where 64-bit mode is common. Core 2, on the other hand, was designed with more focus on consumers, who purchase the majority of computer systems. The reality is that even now, 64-bit operating systems are extraordinarily rare for desktops, and especially notebooks; in those market segments, the additional benefit is more limited and may not be worth the resources.


Figure 2 – Comparison of Front-End Microarchitecture

Branch prediction also received a serious overhaul; to understand the changes, it helps to review the K8’s scheme first. The K8 uses a branch selector to choose between a bimodal predictor and a global predictor. The bimodal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information. The global predictor combines the relative instruction pointer (RIP) of a conditional branch with a global history register that tracks the last 8 branches to index into a 16K entry prediction table of 2 bit saturating counters. If the branch is predicted as taken, then the destination must be predicted by the 2K entry target array. Indirect branches use a single target in the array, while CALLs use a target and also update the return address stack. The branch target address calculator (BTAC) checks the targets for relative branches and can correct predictions from the target array, with a two cycle penalty. Returns are predicted with the 12 entry return address stack.

Barcelona does not fundamentally alter the branch prediction, but improves the accuracy. The global history register now tracks the last 12 branches, instead of the last 8. Barcelona also adds a new indirect predictor, which is specifically designed to handle branches with multiple targets (such as switch or case statements). Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M. Indirect branches with a single target still use the existing 2K entry branch target buffer. The 512 entry indirect predictor allocates an entry when an indirect target is mispredicted; the target addresses are indexed by the global branch history register and branch RIP, thus taking into account the path that was used to access the indirect branch and the address of the branch itself. Lastly, the return address stack is doubled to 24 entries.
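
As a concrete illustration, the global predictor’s indexing might look like the following C sketch; the hash combining the RIP and history is an assumption (AMD has not disclosed it), as are the structure and function names.

    #include <stdint.h>
    #include <stdbool.h>

    #define PHT_ENTRIES  16384   /* 16K 2-bit saturating counters          */
    #define HISTORY_BITS 12      /* Barcelona tracks the last 12 branches  */

    static uint8_t  pht[PHT_ENTRIES];   /* 0-1 = not taken, 2-3 = taken */
    static uint16_t ghr;                /* global history register      */

    static unsigned index_pht(uint64_t rip)
    {
        /* Assumed hash: XOR the branch address with the global history. */
        return (unsigned)((rip ^ ghr) & (PHT_ENTRIES - 1));
    }

    bool predict_taken(uint64_t rip)
    {
        return pht[index_pht(rip)] >= 2;
    }

    void update_predictor(uint64_t rip, bool taken)
    {
        uint8_t *ctr = &pht[index_pht(rip)];
        if (taken && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        ghr = (uint16_t)(((ghr << 1) | taken) & ((1u << HISTORY_BITS) - 1));
    }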

According to our own measurements for several PC games, between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch prediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch culprits include virtual functions (used in C++) and calls through function pointers. For the same set of games, we measured that between 0.5-5% (1.5% on average) of all stack references resulted in overflow, but overflow may be more prevalent in server workloads.
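
The toy bytecode interpreter below shows why interpreters stress indirect prediction: every dispatch is an indirect call through a table, a single branch site whose target changes from one operation to the next.

    #include <stdio.h>

    typedef void (*op_fn)(int *acc, int arg);

    static void op_add(int *acc, int arg) { *acc += arg; }
    static void op_mul(int *acc, int arg) { *acc *= arg; }
    static void op_neg(int *acc, int arg) { (void)arg; *acc = -*acc; }

    /* Dispatch table: the call through ops[code] is one indirect branch
     * with three possible targets, determined by the bytecode stream. */
    static const op_fn ops[] = { op_add, op_mul, op_neg };

    int main(void)
    {
        const int program[][2] = { {0, 5}, {1, 3}, {2, 0} }; /* add 5, mul 3, neg */
        int acc = 1;
        for (unsigned i = 0; i < sizeof(program) / sizeof(program[0]); i++)
            ops[program[i][0]](&acc, program[i][1]);
        printf("%d\n", acc);   /* prints -18 */
        return 0;
    }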


The Decode Phase

x86 instructions are fairly complicated to decode; they are variable length, and because of prefixes, the position of the opcode cannot be known ahead of time. To simplify decoding, the K8 and Barcelona both use pre-decode information that marks the end of an instruction (and hence the start of the next instruction). However, the first time an instruction is fetched into cache there is no pre-decode information. The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle and inserts pre-decode information, which is stored in the ECC bits of the L1I, L2 and L3 caches, along with each line of instructions. Since there are almost no writes to the instruction stream, parity and refetching from memory are sufficient protection against errors in the instruction cache, so ECC is not really required for code. As noted previously, this pre-decode information also includes branch selection and other related information.
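
A heavily simplified sketch of boundary marking follows; the insn_length helper is a toy that recognizes only a couple of opcodes, since a real x86 length decoder must parse prefixes, ModRM, SIB bytes and immediates.

    #include <stdint.h>
    #include <stddef.h>

    /* Toy length decoder: 0xB8-0xBF (mov r32, imm32) are 5 bytes,
     * everything else is treated as 1 byte. Illustration only. */
    static size_t insn_length(const uint8_t *p)
    {
        if (*p >= 0xB8 && *p <= 0xBF) return 5;
        return 1;
    }

    /* Walk a fetched line and set an end-of-instruction mark on the last
     * byte of each instruction; the real hardware scans 4 bytes per cycle
     * and stores the marks in the cache's ECC bits. */
    void predecode(const uint8_t *line, size_t len, uint8_t *end_bits)
    {
        size_t i = 0;
        while (i < len) {
            size_t n = insn_length(&line[i]);
            if (i + n > len)
                break;               /* instruction spills into the next line */
            end_bits[i + n - 1] = 1;
            i += n;
        }
    }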

Like the Pentium Pro, the K7/8 has an internal instruction set which is fairly RISC-like, composed of micro-ops. Each micro-op is fairly complex, and can include one load, a computation and a store. Any instruction which decodes into 3 or more micro-ops (called a VectorPath instruction) is sent from the pick buffer to the microcode engine. For example, any string manipulation instruction is likely to be micro-coded. The microcode unit can emit 3 micro-ops a cycle until it has fully decoded the x86 instruction. While the microcode engine is decoding, the regular decoders will idle; the two cannot operate simultaneously. The vast majority of x86 instructions decode into 1-2 micro-ops and are referred to as DirectPath instructions (singles or doubles).

In Barcelona, 128 bit SSE computations now decode into a single micro-op, rather than two; this makes the rest of the out-of-order machinery, such as the re-order buffer and the reservation stations, more effective. The same goes for integer and FP conversions, and 128 bit load instructions, which are needed to complement the new SIMD capabilities. Note that 128 bit stores still create 2 micro-ops. Another tweak that AMD added is support for unaligned SSE memory accesses, which packs code more densely and so makes instruction fetch more efficient.

At some point during the decode stages, instructions are passed through a new piece of hardware in Barcelona, the sideband stack optimizer. The x86 instruction set supports stacks in hardware, and can directly manipulate the stack of each thread using PUSH, POP, CALL and RET instructions. These instructions modify the stack pointer (ESP), which in the K8 would generate a micro-op; worse yet, these instructions usually come in long dependent chains, which is a pain for the out-of-order machine.

AMD introduced a side-band stack optimizer to remove these stack manipulations from the instruction stream, similar to the dedicated stack engine in the Pentium M. Both MPUs use two registers, ESPO and ESPD (this is Intel’s terminology). ESPO is the original value for the stack pointer and is held in a register in the out-of-order machine, while ESPD, the delta register, tracks changes made to ESP and is in the front-end. Since ESP is an architectural register, a special micro-op is provided to recover ESP from ESPO and ESPD, although the use of this ‘fix up’ operation is minimized in Barcelona. When a stack modifying instruction is detected, it is removed and resolved by a dedicated ALU which modifies ESPD. This means that many stack operations can be processed in parallel, and frees up the reservation stations, re-order buffers and regular ALUs for other work. The benefits of this technique are highly workload dependent, but AMD and Intel agree that roughly 5% of all micro-ops can be eliminated.
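
A minimal software model of the mechanism, borrowing Intel’s register names; the 64 bit stack width and the fix-up trigger are illustrative assumptions.

    #include <stdint.h>

    /* ESPO lives in the out-of-order core; ESPD is a small adder in the
     * front end that absorbs PUSH/POP/CALL/RET adjustments without
     * generating micro-ops. */
    static int64_t espo;   /* last architecturally synced stack pointer */
    static int64_t espd;   /* accumulated delta, tracked in the decoder */

    void decode_push(void) { espd -= 8; }   /* 64-bit push */
    void decode_pop(void)  { espd += 8; }   /* 64-bit pop  */

    /* When an instruction reads ESP directly (e.g. mov rax, rsp), a
     * fix-up micro-op folds the accumulated delta back into the real
     * register and resets it. */
    int64_t fixup_esp(void)
    {
        espo += espd;
        espd = 0;
        return espo;
    }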

The last part of the decoding is the pack buffer (which is probably still 6 entries, like the K8).

The Out-of-Order Engines - Renaming and Scheduling

The first thing to notice is that the out-of-order control logic is vastly more complicated for the K8 and Barcelona than for the Core 2. Unlike Intel’s microarchitecture, the K8 and Barcelona have split integer and floating point clusters, with distributed schedulers/reservation stations. The Core microarchitecture has a single execution cluster, with a unified scheduler/reservation station and multiple issue ports. These choices, which date back to the P6 and the Athlon, are one of the major factors that account for the strength of AMD microprocessors in floating point workloads.


Figure 3 – Comparison of Out-Of-Order Resources

The pack buffer, which is part of the decoding phase, is responsible for sending groups of exactly 3 micro-ops to the re-order buffer (ROB). However, the 72-instruction re-order buffer is not actually 72 independent entries. It contains 24 entries, with 3 lanes for instructions in each entry. The re-order buffer contains a rename register for the result of each operation in flight (or, in the case of an FP operation, a pointer to the FP register file).

To ensure that the ROB is fully utilized, only a single group of exactly three instructions can be sent to the ROB each cycle. The function of the pack buffer is therefore twofold: first, it coalesces instructions into groups of three so that they can enter the ROB. Just as importantly, the pack buffer can move instructions between lanes to avoid a congested reservation station downstream or to observe issue restrictions. For example, floating point or integer multiplies must be in the first lane, while LZCOUNT must be in the third. Each lane corresponds to a specific reservation station further down in the pipeline, and once an instruction enters a specific lane in the ROB, it cannot be moved. Thus the pack buffer is also the last chance for an instruction to switch lanes.
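
In rough pseudo-C, the lane assignment problem looks like the sketch below; the instruction classes and the brute-force permutation search are simplifications (real hardware resolves this with dedicated muxes in a single cycle).

    #include <stdbool.h>

    typedef enum { CLASS_ANY, CLASS_MUL, CLASS_LZCNT } insn_class;

    typedef struct { insn_class cls; } micro_op;

    /* Multiplies must sit in lane 0, LZCOUNT in lane 2,
     * everything else can go anywhere. */
    static bool lane_ok(const micro_op *op, int lane)
    {
        switch (op->cls) {
        case CLASS_MUL:   return lane == 0;
        case CLASS_LZCNT: return lane == 2;
        default:          return true;
        }
    }

    /* Try every arrangement of the 3-op group until all lane
     * restrictions are satisfied. */
    static bool assign_lanes(micro_op *g[3])
    {
        static const int perm[6][3] = { {0,1,2}, {0,2,1}, {1,0,2},
                                        {1,2,0}, {2,0,1}, {2,1,0} };
        for (int p = 0; p < 6; p++) {
            if (lane_ok(g[perm[p][0]], 0) && lane_ok(g[perm[p][1]], 1) &&
                lane_ok(g[perm[p][2]], 2)) {
                micro_op *t[3] = { g[perm[p][0]], g[perm[p][1]], g[perm[p][2]] };
                for (int i = 0; i < 3; i++) g[i] = t[i];
                return true;
            }
        }
        return false;   /* no legal arrangement: split the group */
    }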

At this point, the paths for floating point and integer/memory instructions diverge. The next stop on the integer side is the Integer Future File and Register File (IFFRF). The IFFRF contains 40 registers broken up into three distinct sets. First, the Architectural Register File, which contains 16x64 bit non-speculative registers specified by the x86-64 instruction set. Instructions can only modify the Architectural Register File once they have retired, with no exceptions. Speculative instructions instead read from and write to the Future File, which contains the most recent speculative state of the 16 architectural registers. The last 8 registers are scratchpad registers used by the microcode. In the case of a branch misprediction or exception, the pipeline must roll back, and the Architectural Register File overwrites the contents of the Future File.

From the ROB, instructions issue to the appropriate scheduler. The integer cluster contains three reservation stations (or schedulers). Each one is tied to a specific lane in the ROB and holds 8 instructions, with the source operands. The source operands come from either the Future File, or the result forwarding bus (which is not shown because it is too complicated to draw).

The Floating Point Cluster

Floating point instructions are handled quite differently. Instead of being sent directly to the reservation stations, they first head to the FP Mapper and Renamer. One of the nasty aspects of the x86 instruction set is that FP operations are stack based; the FP mapper converts these stack operations to use a flat register file instead so that renaming can occur.

In the renamer, up to 3 FP instructions each cycle are assigned a destination register from the 120 entry FP register file. The file is large enough to rename up to the maximum of 72 instructions in flight. Along with the FP register file, there are two arrays, the architectural and future file arrays. In Barcelona, the architectural file array contains pointers to 44 of the 120 FP registers, which contain the non-speculative state: 8 for x87/MMX, 8 scratchpad registers for the microcode and 8x128 bit XMM registers. Previously, the K8 treated the XMM registers as 16x64 bit registers, but that changed once 128 bits became a 'native' data format. Similarly, the future file contains pointers to 44 renamed registers that contain the latest speculative values within the FP register file.

Once the micro-ops have been renamed, they may be issued to the three FP schedulers. Each reservation station holds up to 12 instructions, with the source operands. Like the integer schedulers, the operands can either come from the FP register file, or the forwarding network and each scheduler is tied to a specific lane in the ROB.


The Out-of-Order Engines - Execution Units

Once operations enter the schedulers, they wait until the source operands are ready. Then the scheduler will dispatch the oldest instruction and operands to the appropriate functional unit. The integer functional units in Barcelona are mostly unchanged from the K8. The three integer ALUs in K8 and Barcelona can execute most instructions and are largely symmetric. The two exceptions are that only the first ALU has an integer multiplier, and the third is used for POPCOUNT and other similar instructions. Note that the forwarding network for Barcelona has been omitted because it is far too complex to display in an organized manner.


Figure 4 – Comparison of Execution Units

The first substantial change in Barcelona’s integer units is that integer division is now variable latency, depending on the operands. IDIV instructions are handled through an iterative algorithm. In the K8, each IDIV would go through a fixed number of iterations – regardless of how many were required to achieve the final result. 32 bit divides took 42 cycles, while a full 64 bit divide required 74 cycles to calculate. In contrast, Barcelona only iterates the minimum number of times to produce an accurate answer. The latency for Barcelona is generally 23 cycles, plus the number of significant bits in the absolute value of the dividend (unsigned divides are roughly 10 cycles faster). Additionally, the third ALU pipeline now handles the new LZCOUNT/POPCOUNT instructions.
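
The latency model is simple enough to write down directly; the sketch below follows the description above, with the caveat that AMD gives the constants only approximately.

    #include <stdint.h>

    /* Estimated Barcelona IDIV latency: ~23 cycles plus one cycle per
     * significant bit in the absolute value of the dividend; unsigned
     * divides save roughly 10 cycles. Approximate constants; ignores
     * the INT64_MIN edge case. */
    int idiv_latency(int64_t dividend, int is_unsigned)
    {
        uint64_t mag = (uint64_t)(dividend < 0 ? -dividend : dividend);
        int sig_bits = 0;
        while (mag) { sig_bits++; mag >>= 1; }
        return (is_unsigned ? 13 : 23) + sig_bits;
    }

For example, a signed divide with an 8 bit dividend would cost roughly 31 cycles on Barcelona, versus a fixed 74 cycles for a 64 bit IDIV on the K8.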

The FPUs in Barcelona did change a bit. They were widened to 128 bits so that SSE instructions can execute in a single pass (previously they went through the 64 bit FPU twice, just as in Intel’s Pentium M). Similarly, the load-store units and the FMISC unit now handle 128 bit wide data, to improve SSE performance.

One important difference between AMD and Intel’s microarchitectures is that AMD has their address generation units (AGUs) separate from the load store units (LSUs). This is because, as we noted earlier, AMD’s micro-ops can contain a load, an operation and a store, so there must be at least as many AGUs as ALUs. In contrast, Intel uops totally decouple calculations from memory accesses, so the AGUs are integrated into the load and store pipelines. The difference in the underlying uops and micro-ops result in the different AGU arrangements.

Another distinction between the Barcelona and Core microarchitectures is that AMD’s ALUs are symmetric and can execute almost any integer instruction, while the ALUs for Core 2 are not symmetric and are slightly more restrictive. Each of the lanes must be nearly identical for AMD’s distributed schedulers and instruction grouping to work optimally. This is a clear architectural trade-off of performance and decreased control complexity versus power and increased execution complexity. Replicating three full featured ALUs uses more die area and power, but provides higher performance for certain corner cases, and enables a simpler design for the ROB and schedulers.

The Memory System

The memory pipelines and caches in Barcelona have been substantially reworked; they now have some limited out-of-order capabilities, and each pipe can perform a 128 bit load or a 64 bit store every cycle. Memory operations in both the K8 and Barcelona start in the integer schedulers, and are dispatched to both the AGU and the 12 entry LSU1. The address generation takes one cycle, and the result is forwarded to LSU1, where the data access waits.


Figure 5 – Comparison of Memory Pipelines

At this point, the behavior of Barcelona and the K8 diverge. In the K8, memory accesses were issued in-order, so if a load could not issue, it also stalled every subsequent load or store operation. Barcelona offers non-speculative memory access re-ordering. What this really means is that some memory operations can issue out-of-order.

During the issue phase, the lower 12 bits of the load operation’s address are tested against prior store addresses; if they are different, then the load may proceed ahead of the store, and if they are the same, there may be an opportunity for load-store forwarding. This is equivalent to the memory re-ordering capabilities of the P6 – a load may move ahead of another load, and a load may move ahead of a store if and only if they are accessing different addresses. Unlike the Core 2, there are no prediction and recovery mechanisms and no loads may pass a store with an unknown address.
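
The check amounts to comparing page offsets; a minimal sketch follows (the 12 bit width comes from the description above, everything else is illustrative).

    #include <stdint.h>
    #include <stdbool.h>

    /* A load may issue ahead of an older store only if the low 12 bits
     * (the page offset) of their addresses differ; matching offsets mean
     * a potential alias, and a candidate for load-store forwarding. */
    bool load_may_bypass_store(uint64_t load_addr, uint64_t store_addr)
    {
        return (load_addr & 0xFFF) != (store_addr & 0xFFF);
    }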

In the 12 entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB. The L1 DTLB now includes 8 entries for 1GB pages, which is useful for databases and HPC applications with large working sets. Any miss in the L1 DTLB will check the L2 DTLB. Once the physical address has been found, two micro-ops can probe (in case of a store) or read from (in case of a load) the cache each cycle, in any combination of load and store. The ability to do two 128 bit loads a cycle is beneficial primarily for HPC, where the bandwidth from the second port can come in handy. Once the load or store has probed the cache, it will move on to LSU2.

LSU2 holds up to 32 memory accesses, where they stay until retirement. LSU2 handles most of the complexity in the memory pipeline. It resolves any cache or TLB misses by scheduling and probing the necessary structures. In the case of a cache miss, it will escalate up to the L2, L3 or memory, while TLB misses go to the L2 TLB, or to main memory, where the page tables reside. LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement, to ensure correctness. Since all the stores are held in LSU2, it also does the load-store forwarding. Note that stores are still 64 bits wide, hence two entries are used to track a full 128 bit SSE write. This is a slight disadvantage, as some instruction sequences, particularly those that involve copying data in memory, have equal numbers of reads and writes. However, the general trend is that there are twice as many (or more) loads than stores in an application.

The 64KB L1D cache is 2 way associative, with 64 byte lines and a 3 cycle access time. It uses a write-back policy to the L2 cache, which is exclusive of the L1. The data paths into and out of the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive), so a 64 byte line is transmitted in 4 cycles. As in the K8, the L2 cache is private to each core. The L2 capacity has been halved to 512KB, but the line size and associativity were kept at 64B and 16 ways respectively.
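
The geometry implies a simple address breakdown: 64KB across 2 ways of 64B lines yields 512 sets, so bits [5:0] select the byte within a line, bits [14:6] select the set, and the remaining bits form the tag. A sketch:

    #include <stdint.h>

    #define LINE_BITS 6   /* 64B lines                      */
    #define SET_BITS  9   /* 64KB / 2 ways / 64B = 512 sets */

    static uint64_t l1d_offset(uint64_t addr) { return addr & 0x3F; }
    static uint64_t l1d_set(uint64_t addr)    { return (addr >> LINE_BITS) & 0x1FF; }
    static uint64_t l1d_tag(uint64_t addr)    { return addr >> (LINE_BITS + SET_BITS); }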

The L3 cache in Barcelona is an entirely new feature for AMD. The shared 2MB L3 cache is 32 way associative and uses 64B lines, but did not fit in Figure 5. The cache controller is flexible; various AMD documents indicate that it can support up to 8MB of L3 cache. The L3 cache is specifically designed with data sharing in mind. This entails three particular changes from AMD’s traditional cache hierarchy. First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. In Barcelona’s L3, the replacement algorithm has been changed to also take sharing into account, and it prefers evicting unshared lines. Lastly, since the L3 is shared between four different cores, access to the L3 must be arbitrated. A round-robin algorithm is used to give access to one of the four cores each cycle. The latency to the L3 cache has not been disclosed, but it depends on the relative northbridge and core frequencies – for reasons which we will see later.
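
A sketch of sharing-aware victim selection follows; the exact tie-breaking order is an assumption, since AMD states only that unshared lines are preferred for eviction.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     valid;
        bool     shared;   /* sharing history tracked by the cache */
        uint32_t age;      /* stand-in for the pseudo-LRU state    */
    } l3_line;

    /* Pick a victim among the 32 ways of a set: prefer the oldest
     * unshared line, falling back to the oldest line overall. */
    int l3_pick_victim(const l3_line set[32])
    {
        int victim = 0, unshared = -1;
        for (int w = 0; w < 32; w++) {
            if (!set[w].valid)
                return w;                     /* free way, use it */
            if (set[w].age > set[victim].age)
                victim = w;
            if (!set[w].shared &&
                (unshared < 0 || set[w].age > set[unshared].age))
                unshared = w;
        }
        return unshared >= 0 ? unshared : victim;
    }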

The last improvements to Barcelona in the memory pipeline are the prefetchers. Each core has 8 data prefetchers (a total of 32 per device), which now fill to the L1D cache in Barcelona. In the K8, prefetched results were held in the L2 cache. The instruction prefetcher for Barcelona can have up to 2 outstanding fetches to any address, whereas the K8 was restricted to one fetch to an odd address and one fetch to an even address.

Circuit Techniques, Power Savings and More

From a circuit level perspective, the changes between the K8 and Barcelona were extremely significant. Barcelona is specified to operate at a wide range of voltages, from 0.8-1.4V. However, unlike its predecessor, each core in Barcelona has a dedicated clock distribution system (including PLL) and power grid. The frequency for each core is independent of both the other cores and the various non-core regions; the voltage for all four cores is shared, but separate from the non-core. As a result, power can be aggressively managed by lowering frequency and voltage whenever possible. To support independent clocking and modular design, asynchronous dynamic FIFO buffers are used to communicate between the different cores and the northbridge/L3 cache. These FIFOs absorb any global skew or clock rate variation, but the latency for passing through depends on the skew and frequency variance – which is why the L3 cache latency is variable. The northbridge and L3 cache make up roughly 20% of the die and share a voltage and clock domain that is independent of the four cores, which is essential for mobile applications. Previously, the northbridge clock and voltage were tied to the processors, so systems with integrated graphics could not reduce the processor voltage or frequency to deep power saving states. Separate sleep states, voltages and frequencies for the northbridge and processors should lower AMD’s average power dissipation, which will help in the mobile market.


Figure 6 – Barcelona Die Micrograph

Barcelona also features a dedicated temperature sensor circuit for each core, and a separate one for the northbridge. Each core has 8 sensors on the circuit, while the northbridge contains 6. All the circuits are connected to and controlled by a global thermal control circuit. The global thermal controller uses the results to select power saving modes to reduce the temperature of the device.

One of the trickier areas for AMD’s design team was the SRAM cells for the caches. The L1 caches share a common 1.06um2 cell design. The 6T SRAM cells read during the first half of the cycle, and then perform a self-timed write and precharge in the latter part of the cycle. The timing for the write is based on extensive Monte Carlo analysis, incorporating lot-to-lot and local process variation and can be modified post-production with programmable fuses.

The L2 and L3 cache share many design elements, including the SRAM cells. The L2/3 cells are 0.81um2 and are also single ended for stability, which is unusual. One of the difficulties that AMD’s SRAM designers faced is that because they use the same die across all product lines, the likelihood of a read disturbance (i.e. reading the wrong data) must be very small. Specifically, a 5 sigma margin across the entire 0.7-1.3V range is required. Unfortunately, the floating body effect of SOI silicon precluded a more efficient small swing read design. According to AMD’s presentation, using a small swing read cell, they were only able to achieve a 4.53 sigma margin. The single ended design which was chosen had larger margins that were sufficient for actual product use.

Shifting to more software oriented matters, Barcelona also adds support for a variety of new instructions. Fortunately, these coincide with the supplemental SSE3 instructions that Intel added to the Core 2. Generally, these instructions are not terribly significant, except for the POPCOUNT instruction, a perennial favorite of intelligence agencies, which counts the number of bits set to 1 in a given register. AMD also added support for unaligned SSE loads, as previously mentioned, and it will be interesting to see when or if Intel chooses to follow their lead.
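
For reference, the sketch below is the loop that POPCOUNT collapses into a single instruction; compilers can also reach the instruction through intrinsics such as GCC’s __builtin_popcountll when the target supports it.

    #include <stdint.h>

    /* Software population count: the operation the new instruction
     * performs in hardware in a single operation. */
    int popcount64(uint64_t x)
    {
        int n = 0;
        while (x) {
            x &= x - 1;   /* clear the lowest set bit */
            n++;
        }
        return n;
    }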

More significant to server users are the nested page tables, which improve virtualization performance. One of the drawbacks of Shadow Page Tables is that page faults become very expensive, since the VMM is invoked to manage any changes to the SPT. The alternative used in Barcelona, Nested (or Extended) Page Tables, is to virtualize the memory management unit. With nested paging, the hypervisor maintains a hardware walked table for each guest that maps guest physical to host physical addresses, while each guest manages its own page tables. Unfortunately, walking these tables can be extraordinarily expensive, so parts of the mappings can be cached as well. While this reduces the performance overhead of virtualization, customers waiting for I/O virtualization will have to wait till 2008 for that particular feature.
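
The cost of a nested walk compounds multiplicatively, because every guest page table reference must itself be translated through the nested tables. With 4-level tables on both sides, a worst case walk touches (4+1)x(4+1)-1 = 24 memory locations, versus 4 for native translation, which is why caching parts of the mappings matters. The arithmetic in C:

    /* Worst-case memory references for a TLB miss under nested paging:
     * each of the g guest table levels, plus the final guest-physical
     * access, must be translated through h host levels, and the host
     * walk itself costs h references each time. */
    int nested_walk_refs(int g, int h)
    {
        return (g + 1) * (h + 1) - 1;   /* 4-level guest and host: 24 */
    }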


Barcelona Summary

AMD’s current competitive situation is rather difficult. For high-end MP servers and some HPC workloads, AMD is still regarded as the king of the hill. However, in almost every other segment, Intel is the performance and often power efficiency leader. This translates into significant financial problems for AMD, as the high-end server market is small (but lucrative) and HPC is generally very competitive for pricing. While this situation is reminiscent of the eve of the Opteron launch, in many ways AMD is better positioned than they were in early 2003.

In early 2003, AMD had no presence in the server market, and was not a particularly credible player since they had no track record. Moreover, AMD did not even have particularly strong ties with any of the server vendors. Today, AMD is an acknowledged participant in the server world, and is not perceived as a ‘follower’. AMD has also made significant sales and marketing in-roads with the major OEMs, including the last major holdout, Dell. At the same time, AMD has badly blundered with their channel partners – some of the earliest adopters and risk takers who pushed Opteron, and Intel has recently moved aggressively to court channel partners.

To date, AMD has only given a few hints on the frequencies and the expected performance for Barcelona. AMD has publicly predicted a 50% advantage over a Xeon 5355 (2.66GHz quad core) in SPECfp_rate2006, and a 20% advantage for SPECint_rate2006. Of course, AMD’s competition when Barcelona arrives will be a faster 3GHz processor from Intel, and later a 3.2GHz Penryn based design, using a 1.6GHz bus. Perhaps more importantly, SPECint and SPECfp only address a portion of the workloads that AMD and Intel target.

While AMD has not disclosed frequencies or TDP yet, rumors point to 1.9-2.6GHz in 100MHz steps with thermal envelopes of 68, 95 and 120W. Additionally, AMD is very likely to increase the frequency of Barcelona over the lifetime of the processor because of their approach to manufacturing. AMD tends to continuously improve their process, and these improvements should translate into incremental speed bumps along the way for their processors.

A comparison of the three microarchitectures (the K8, Core 2 and Barcelona) below shows some performance hints. For many of the most important features, AMD and Intel should be matched microarchitecturally, because AMD has incorporated quite a few of the techniques that Intel used to boost the per clock efficiency of the Core 2. While it does appear that the Core 2 is 33% wider than Barcelona, in reality neither processor comes close to peak capabilities on real code, so the performance will be much closer than the block diagrams imply. Barcelona’s 3-wide issue, execute and retire capabilities are not a performance problem.


Figure 7 – Microarchitecture Comparison

Given all this information, and existing knowledge about AMD and Intel’s technical strengths and weaknesses it is possible to estimate how the competitive landscape will look towards the latter part of this year. At a high level Barcelona should provide an edge in multithreaded performance, but not an insurmountable one. Depending on clock speed, Intel may retain the performance crown for single threaded performance - which is essential for client systems.


Performance

For desktops, Barcelona will probably lead multithreaded performance and applications that strongly depend on high bandwidth, but single threaded workloads may slightly favor Intel’s designs. Note that the dual core desktop processors are likely to remain a large portion of the product mix, until the marginal cost for the additional cores is low, or most applications can use 4 threads. A mobile variant of Barcelona will not be introduced till 2009, after Griffin. This is because many of the improvements in Barcelona are focused on server performance, and may not have the right power/performance balance for notebooks.

Dual processor servers will be a mixed bag for AMD. Expect extremely strong performance for almost all HPC-style workloads – that is and will continue to be a strong suit for AMD’s architecture because of the highly integrated system design and copious bandwidth. However, for commercial workloads like file or web serving or transaction processing, which don’t require as much bandwidth, any performance gaps will be much smaller. For these workloads Barcelona will certainly be competitive and will exceed Intel’s performance on some benchmarks, but Intel will likely retain a lead for other benchmarks. Performance for single processor servers will generally be similar - but the advantages from AMD's system architecture will be smaller, hence the multithreaded performance will probably be even closer than for dual processor servers.

One area where AMD should have a slight edge is on dual processor platform power consumption, due to differences in the memory systems. AMD uses DDR2 DIMMs, which consume 3-5W each, while the FB-DIMMs that Intel’s systems use consume roughly 5W more than a comparable DDR2 DIMM. The power advantage for AMD will depend on the configuration of each individual server. As more memory is added, AMD’s advantage will grow; however, as other components are added, AMD’s relative advantage will diminish – for example, in systems with 8 or more disks, the differences in memory systems may be lost in the noise. This difference in memory architectures applies only to FB-DIMM based dual processor servers, since Intel’s single processor systems use DDR2 DIMMs and some dual processor servers may use regular DDR2 DIMMs. For those servers where Intel uses regular DDR2, AMD will not have any significant power advantages.

At the high-end, MP servers should be a bright spot for AMD. The higher level of integration and the additional HyperTransport link in Barcelona will improve an already formidable system architecture that has Intel on the defensive. One open question is whether Barcelona will truly commoditize the market for large MP (8 socket) servers. The capability is there, but it is unclear whether OEMs will aggressively push a solution that does not address the inherent limits of a snooping cache coherency policy, and lacks some of the RAS features that typical mid-range servers offer.

Conclusions

Barcelona is the first revision to AMD’s microarchitecture since 2003. Rather than starting from scratch, Barcelona builds on the previous generation and subtly improves almost every aspect of the design. In many ways, this mirrors the evolution of computer architecture – there are very few techniques that can give a large boost on a wide spectrum of applications. Instead, architects are turning to many smaller improvements, just as AMD has done with Barcelona. The only obvious trick left in the bag for AMD is multithreading, which could provide a big boost in a future microarchitecture, such as the K10. However, this style of conservative, consistent design has worked very well for AMD in the past.

Barcelona is a solid improvement across the board and should give AMD momentum across several key markets. The performance advantages will be decisive for HPC applications and MP servers; other areas will be close in performance. Hence, there is quite a bit to look forward to in the near future with the debut and performance numbers for Barcelona. No matter what, AMD’s engineering teams deserve kudos for a solidly executed product.



