Showing posts with label AMD. Show all posts

Tuesday, March 02, 2010

AMD 890GX Unveiled: Three Motherboards Compared

AMD’s chipsets have long provided great features for the money, especially compared to high-end platforms like X48 and X58 from its chief rival, Intel. Everything from the mid-priced (still high-end) 790FX to its more commonplace integrated-graphics products can be attractive, depending on your usage model.

The entire range provides expanded PCIe 2.0 pathways for multi-card configurations, and its integrated-graphics parts actually deliver reasonable 3D performance and an option for multi-monitor support. If you love building productivity-oriented machines at an affordable price or need the ultimate in configurability, AMD might be your best choice. After all, we've yet to be bowled over by Intel's CPU efforts between $100 and $200, while AMD continues to offer a number of compelling quad-core models.



Today’s launch focuses on two components, the 890GX northbridge with its revised Radeon HD 4290 graphics engine and the SB850 southbridge. Upgrades include DX10.1 graphics, SATA 6Gb/s, two additional USB 2.0 ports, and integrated gigabit networking.

But our emphasis here is on a trio of motherboards emerging alongside the new core logic from Asus, Gigabyte, and MSI. Note that you'll see USB 3.0 support in the pages to come. However, the 890GX platform does not natively support USB 3.0; rather, it's added via an on-board controller.



AMD Launches 6 Core CPU-ready 890GX Mobo

This board is prepped for the hexacore AMD Phenom II X6.



AMD and its motherboard partners today released the AMD 890GX chipset, which integrates the ATI Radeon HD 4290 graphics core and is designed to be compatible with the upcoming AMD Phenom II X6 six-core processor.

The AMD 890GX chipset supports the SATA 6Gb/s hard drive interface, and many AMD 890GX-based motherboards feature SuperSpeed USB 3.0 support.

We've got our hands on the Gigabyte GA-890GPA-UD3H, the Asus M4A89GTD Pro/USB3, and the MSI 890GXM-G65. After putting them through a barrage of tests, our reviews department found that AMD’s SB850 southbridge is probably the best reason to select an 890GX motherboard over the products it replaces thanks to the new integrated SATA 6Gb/s controller.

Monday, February 22, 2010

AMD reveals Fusion CPU+GPU, to challenge Intel in laptops



SAN FRANCISCO—The "Llano" processor that AMD described today in an ISSCC session is not a CPU, and it's not a GPU—instead, it's a hybrid design that the chipmaker is calling an "accelerated processor unit," or APU. Whatever you call it, it could well give Intel a run for its money in the laptop market, by combining a full DX11-compatible GPU with four out-of-order CPU cores on a single, 32nm processor die.

Details on the highly parallel vector hardware—the "GPU" part of the device—have yet to be disclosed, but AMD is focusing today's revelations on the CPU part of the design. In a nutshell, AMD has taken the "STARS" core that's used in their current 45nm offerings, shrunk it to a new 32nm SOI high-K process, and added new power gating and dynamic power optimization capabilities to it. Each out-of-order core has a bit under 35 million transistors, and a 1MB L2 cache that's not included in that number. AMD is targeting sub-3GHz operation, and a power consumption range of 2.5 to 25 watts.

The chipmaker will put down four such cores, shown in the micrograph below, along with enough vector hardware to power a DX11 GPU. Overall, most of the work on the x86 side of Llano went into dynamic power optimization and into fitting the design to the 32nm process. In this respect, Llano differs from the upcoming "Bobcat" mobile part in that the latter is more portable across a range of processes and configurations, and features less custom work. AMD has announced that Llano will be sampling in the second half of this year, and will be available from OEMs sometime in 2011.

Power optimization goes digital

It's not often that I say this, but perhaps the most interesting and novel part of the Llano core is its unique approach to dynamic power optimization. AMD fellow Sam Naffziger walked me through the approach in a briefing this morning, and it departs from traditional power management approaches in that it relies on digital, not analog, data.

A normal processor power module takes analog input from a set of diodes placed throughout the die, and these diodes act as thermal sensors, informing the module when the die heats up in an area due to increased compute activity. In this model, then, die temperature is monitored as a proxy for power consumption, and the power module uses this temperature/power data to make on-the-fly adjustments to parameters like clockspeed. The blessing and curse of this method is that these analog sensors respond to every change in thermals, whether it's driven by an actual, compute-related boost in power consumption or by external, environmental factors, like a sudden rise in ambient temperature.

Llano's approach, in contrast, uses a set of 95 digital signals from different parts of the chip that AMD has empirically identified as having a strong correlation to power consumption. So signals like integer traffic, cache misses, or branch mispredicts are monitored via low-frequency sampling, and these signals give the power module a picture of the chip's power consumption that AMD claims "is accurate to within 2 percent across a broad range of application types."
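As a rough illustration of the idea (not AMD's actual model), such a digital estimator can be thought of as a weighted sum of sampled activity counters. The signal names, weights, and base power below are purely hypothetical placeholders:

```python
# Hypothetical sketch of a digital power estimator: a weighted sum of
# low-frequency activity-counter samples approximates core power draw.
# Signal names and weights are illustrative, not AMD's actual coefficients.

# Assumed per-event energy costs, in joules per event (illustrative only).
WEIGHTS = {
    "int_ops": 0.8e-9,
    "fp_ops": 1.5e-9,
    "l2_misses": 4.0e-9,
    "branch_mispredicts": 2.5e-9,
}
BASE_POWER = 2.0  # assumed static/leakage floor, in watts

def estimate_power(samples: dict, sample_hz: float) -> float:
    """Estimate watts from per-sample event counts taken at sample_hz."""
    # events/sample * samples/sec * joules/event = watts
    dynamic = sum(WEIGHTS[sig] * samples.get(sig, 0) * sample_hz
                  for sig in WEIGHTS)
    return BASE_POWER + dynamic
```

The appeal of this scheme, as described above, is that the estimate tracks actual compute activity rather than ambient temperature, so the power module is not fooled by a warm room.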

High hopes for first genuine "Fusion" offspring of AMD + ATI

I personally have relatively high hopes for Llano as a notebook part that could well out-do whatever Intel has in 2011. Intel is infamous for the poor quality of its integrated graphics processors (IGPs), and, while the most recent Intel IGPs are much less embarrassing than their predecessors, it's not clear that the company has the ability or the will to compete with NVIDIA and AMD/ATI in this area. So when it comes to raw performance as a CPU and GPU, I expect Llano to do quite well. But for commercial success as a mobile part, the big question concerns Llano's platform-level power draw, and that will depend on the real-world success of the power management innovations that AMD has introduced today.

It's possible that, for gaming on-the-go, Llano's biggest competitor will be NVIDIA's upcoming x86 CPU + GPU combination. But right now, that device is still just a secret skunkworks project about which almost nothing is known. Still, if it's not public by 2011, I'm not sure what NVIDIA's mobile strategy will look like. With CPU/GPU fusion products like Llano and the DMI licensing dispute combining to kill NVIDIA's IGP business, the company needs new mobile ideas in a big way.

Tuesday, March 04, 2008

AMD's chipset game: Rien ne va plus

Markham (ON) – AMD has raised its bet and placed its chips in a cutthroat chipset market: To gain ground on Nvidia and Intel, the manufacturer decided to put a fully-fledged GPU into its next mainstream integrated chipset, the 780G, which could win the company lots of new customers. But it could cost AMD lots of discrete graphics card sales as well and cut deep into its profit margins. Will AMD win, and what does the 780 mean to you?


AMD's Markham offices, previously ATI headquarters

On a roulette table, you typically meet two types of players: those who simply try to stay in the game as long as possible while holding on to their budget, and those who play to win big. Simply betting the outer fields will give you nearly 50/50 or nearly 2/3 odds. You'll never win big, but you won't lose big either. However, if you want to catch up to the big boys, you'll have to take higher risks that could leave you bankrupt or propel you to the top of the table.

AMD’s chipset division is in such a game right now. Let’s watch.

AMD has lost touch with the other big players, Intel and Nvidia, and is behind. There's nothing particularly exciting about AMD's chipsets these days; the company could continue its current game and would probably be OK, if it pushes its platform message strongly. But the next bets in this game, the 780G and 780V, are now on the table, and we know AMD's strategy: the green team wants to join the high-rollers again and is taking an unexpected risk that could surprise the others or fail miserably.


780G chipset (left), SB700 Southbridge


The big picture: Upvalue the chipset, devalue cheap graphics cards

Integrated graphics chipsets are the commodity of the graphics chip industry. They sit in the very low end of PCs, but account for the lion's share of the market in unit terms. You don't talk about them; they simply do their job. And for some time now, they have even been good enough again to run today's standard Windows operating system. No one who buys a PC with a graphics chipset really cares (or can afford to care) about graphics performance. But this may be different with the new 780G chipset, which is aiming for cheap and mainstream PCs in the $399 to $499 price range.

Technically, from a performance view, the 780G isn’t just a chipset. It really is a $19 chipset that performs (we believe AMD on this one for a moment) like a $50 entry-level standalone graphics card. In the past, a graphics chipset was based on a recent graphics engine, but usually saw substantial downgrades to keep a clear performance and price distance to the discrete product. AMD claims that in the 780G there is a full R620 graphics chip, just like in its current entry-level graphics cards, offering a performance similar to that of, well, $50 graphics cards.

That means, of course that the 780G offers the R620 DirectX 10 core, which includes two independent display controllers (VGA and HDMI/DVI/DP with HDCP), a Hypertransport 3 interface and two PCIe Gen 2 interfaces. There’s also a new Displaycache, which cuts down power consumption. In terms of core data, there are two versions of the 780: The base 780V (codenamed RS780C) is clocked at 350 MHz and integrates a Radeon 3100 engine; the more interesting one is the 780G (codenamed RS780), which runs at 500 MHz, runs a Radeon 3200 engine, supports UVD as well as Hybrid Graphics, which allows users to combine the integrated chipset with a discrete graphics card to increase the system’s graphics performance. We will return to that further down.

The decision to put such a capable chip into the 780G really has two effects. From an application view, AMD increases the value of the chipset again, reversing the way Windows Vista devalued it: when Vista launched, your average chipsets, especially Intel's 915 and 945, were pretty much useless, since they couldn't run the software's fancy eye candy. So you had to go with a 256 MB discrete graphics card, and it is still a good idea to do so today. The 780G, however, brings chipset performance back up to pre-Vista expectations and lends chipsets new credibility.

On the other side, however, if that chipset is good enough for Vista, why would you or an OEM keep using a $50 graphics card if a $19 chipset does the job just as well? AMD's corporate vice president and general manager of the firm's chipset division, Phil Eisler, conceded to TG Daily that there is a certain danger the company could shoot itself in the foot with this chip: it could cannibalize discrete sales. “We have had an internal debate about that,” he said. We have no doubt about that, especially since this chipset is claimed to play back HD DVD and Blu-ray without problems (something you could only do with a high-end graphics card 18 months ago) and to run almost any mainstream PC game out there. AMD itself calls the 780G “by far the fastest motherboard GPU we have ever built.” So, if the 780G is really that good, it is a clear money-saving opportunity for OEMs, which could potentially drop discrete graphics cards from their systems – not just Nvidia cards, but ATI Radeon cards as well.

From that perspective, the decision to use a full R620 core for the 780G is a risky play, but it will also challenge Intel and Nvidia. Intel, of course, is the main target, and AMD claims that the 780G is more than twice as fast as Intel’s G35 under 3DMark06, almost three times as fast under 3DMark05, and achieves frame rates of 27 fps under Crysis (1024x768), 43 fps under Call of Duty 4, 40 fps under Half-Life 2 and 35 fps under Doom 3.

Hybrid Graphics

780G motherboards

To compensate for the risk of selling fewer discrete cards, Eisler believes OEMs and consumers will take advantage of the new Hybrid Graphics technology. The goal of every chip manufacturer is to sell more chips every quarter, so it shouldn’t be too surprising that AMD and Nvidia are linking integrated graphics chipsets with discrete graphics.

The concept itself is enticing: you can upgrade your $19 chipset with, for example, a Radeon HD 3450 graphics card, which currently sells for about $55 in U.S. retail. AMD promises that the addition of the graphics card will more than double the graphics performance of the system. Compared to a non-hybrid system with just an HD 3450 graphics card, the 780G will add about 70% of the 3450’s performance.

To illustrate the performance gain, AMD claims that the above mentioned frame rates will substantially increase in a hybrid graphics environment: Crysis will see 32 fps, Call of Duty 4 73 fps, Half-Life 2 68 fps and Doom 3 60 fps.
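Running the numbers the article quotes, the per-game gains work out like this:

```python
# Quick check of the claimed hybrid-graphics gains: integrated-only fps
# versus hybrid (780G + Radeon HD 3450) fps, using the article's numbers.
igp_fps    = {"Crysis": 27, "Call of Duty 4": 43, "Half-Life 2": 40, "Doom 3": 35}
hybrid_fps = {"Crysis": 32, "Call of Duty 4": 73, "Half-Life 2": 68, "Doom 3": 60}

for game in igp_fps:
    gain = hybrid_fps[game] / igp_fps[game] - 1.0
    print(f"{game}: {igp_fps[game]} -> {hybrid_fps[game]} fps (+{gain:.0%})")
```

Interestingly, Call of Duty 4, Half-Life 2 and Doom 3 all gain roughly 70%, matching AMD's claim, while the heavily GPU-bound Crysis gains only about 19%.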

The real problem in this scenario is that buyers of $399-$499 PCs don’t upgrade their graphics, which means that OEMs will have to install the graphics cards in the first place. Margins are extremely tight in this space anyway, so why would they spend an extra $30-$50 for a graphics card, especially if the 780G is already good enough to run most games and Vista?

A partial answer may be that graphics performance simply sells. And that $499 PC may not have a discrete card, but a $599 or $649 PC may have Hybrid graphics installed. Eisler believes that two out of three PCs using the 780 chipset will use the chipset only, whereas the remaining third of PCs will include an additional graphics card.


Conclusion

I'm not sure whether it is a smart move on AMD’s side to use a full R620 for the 780G chipset, which, by the way, is coupled with the SB700 southbridge (basically an SB600 southbridge with lower power consumption and improved connectivity). Vendors will have a close look at this one and if they can save a few bucks, they will – no matter how great hybrid graphics is.

From the consumer view, it may be worth your while looking at those reviews and see how well the chipset stacks up against other chipsets and entry-level discrete systems. It could save you a bundle of money on your next Vista home office PC.


Silverstone Prototype HTPC

An interesting market for the 780G certainly is the home entertainment center PC (HTPC). If you think about those noisy boxes we have today, the idea of an entirely passively cooled system that is still capable of playing your HD movies and running a few games would be fantastic. It isn’t really surprising that AMD is especially pitching this idea, even if the company has to concede that the success or failure of HTPCs will not be decided by AMD – but by companies such as Comcast and AT&T, which do not provide the bandwidth that would be necessary to support decent HTPCs, as well as most Hollywood studios, which apparently still believe consumers will pay $20 for a DRM-riddled movie download.

Technically, the 780G chipset is a great platform for a HTPC. Realistically, the HTPC will not become mainstream in 2008, no matter how badly AMD wants this to happen.

Wednesday, December 12, 2007

Inside Barcelona: AMD's Next Generation

In 2002, AMD’s K7 was lagging behind the high frequency 130nm Pentium 4 (Northwood). The 2003 launch of the K8 was a bet-the-company kind of move for AMD. But, by 2005, they had clearly hit the jackpot with the K8 microprocessor. They had three out of four of the major OEMs pitching their server products and Intel’s products barely kept pace in the UP (1 socket) and DP (2 socket) server markets. In the MP (4 socket) server market, AMD was clearly the best choice by a wide margin.

This success was due to a confluence of favorable factors, both political and technical. Technically, the K8 was a solid, conservative design that built off the previous-generation K7, and was ideally suited for the server market. In contrast, Intel’s competing Pentium 4 was designed for consumer and media workloads, and was decidedly sub-optimal for servers and notebooks. Worse yet, the P4 failed to scale up frequency, which was the key idea behind the microarchitecture, due to power and thermal issues at 90nm and beyond. These design goals had been dictated by internal Intel politics at the highest level. Intel’s corporate strategy was to push Itanium processors for the server market, and x86 for the desktop and notebook space. Unfortunately, the Itanium project alienated Sun and IBM, and did not get as much traction as many had been expecting because of price, compatibility and availability issues. This left a hole in Intel’s plans, which is exactly where the K8 was aimed; the chip was enthusiastically embraced by the Linux community and Microsoft. By any measure, AMD’s bet-the-company K8 strategy had worked out quite well, finally bringing the company a measure of financial success.

Unfortunately, this got Intel’s attention rather quickly. Intel scrapped several P4 follow-on projects in various states of completion and changed direction rather rapidly. One strength of Intel’s culture and employees is the ability to not only deliver, but to excel and exceed expectations under pressure. Intel’s Israeli design team delivered in spades with the Core microarchitecture (based on the P6), which shipped in the first half of last year and easily took a lead in performance and efficiency across almost every market. The effects were dramatic, and certainly felt at AMD, which experienced an erosion of profitability, average selling prices and market share in the first quarter of the year, a trend that will continue into the second quarter. Of course, this is a familiar position for AMD; precisely where they were in late 2002 and early 2003 before the Opteron launch. Yet again, AMD’s fate depends on a product that will be released in the near future.


Barcelona - The Proverbial Ace in the Hole

Over the course of the last year, AMD has slowly been revealing more and more details on their next generation processor, codenamed Barcelona. The first information came out in a keynote address from Senior Fellow Chuck Moore at the Spring Processor Forum in 2006. At the following Fall Processor Forum, Ben Sander gave a much more detailed outline of the microarchitecture for Barcelona. More recently, Shawn Searles gave a presentation at ISSCC ‘07 which described the physical implementation challenges of Barcelona and some of the design choices.

Barcelona is the first major architectural alteration to the K8 since it debuted in 2003. The K8 built on the very capable microarchitecture of the K7, and added 64 bit operation, two integrated DDR memory controllers and 3 HyperTransport lanes. These features were not novel; AMD’s architects followed in the footsteps of the Alpha EV7, which was the first MPU to integrate memory controllers (8 channels of DRDRAM), on-die routing (4 inter-processor links) and directories. However, the K8 advanced the state of the art by bringing 64 bits and higher levels of integration to x86, the mainstream instruction set architecture. The K8 was the first AMD product to meet with any success in the server world, a clear testament to the wisdom of evolutionary and conservative design choices.

In many ways, Barcelona continues down this conservative path of evolution. There are no radical changes. In fact Barcelona has the same basic 12 stage pipeline as the K8, and many of the microarchitectural improvements in Barcelona have been successfully demonstrated elsewhere. This in no way detracts from the efforts of AMD’s architects and engineers – high risk features are inappropriate for a company that cannot afford a product failure.

This article will bring together all the existing information on Barcelona into a single place, discussing the system aspects and microarchitecture of Barcelona as well as the circuit design challenges and performance. This also presents a wonderful opportunity to examine what areas AMD focused on, in comparison to where Intel spent much of their effort enhancing the P6 core to produce the Pentium M and Core 2 line.

Barcelona is a 283mm2 design that uses 463M transistors to implement four cores and a shared 2MB L3 cache in AMD’s 65nm process. The SOI process uses 11 layers of copper interconnect with a low-k dielectric and dual stress liners and embedded SiGe for PMOS transistors. The device described at ISSCC was targeted at 2.2-2.8GHz at 1.15V, while operating within a 95W maximum thermal envelope. AMD claims that their 65nm process has a 15ps FO4 inversion delay, which suggests that Barcelona’s pipeline is just a little less than 24 FO4 delays. Later sections of this article will delve into seven major areas: the system architecture, the five major sections of the microarchitecture, and lastly circuit-level improvements and other features.
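The FO4 estimate above is easy to reproduce with back-of-the-envelope arithmetic:

```python
# Reproducing the FO4 estimate: at the 2.8GHz top bin, one cycle is
# 1 / 2.8GHz ~ 357ps; divided by the claimed 15ps FO4 inversion delay,
# that works out to roughly 23.8 FO4 per cycle - "a little less than 24".
freq_hz = 2.8e9
fo4_delay_ps = 15.0

cycle_ps = 1e12 / freq_hz          # picoseconds per clock cycle
fo4_per_cycle = cycle_ps / fo4_delay_ps
print(f"{cycle_ps:.0f} ps/cycle -> {fo4_per_cycle:.1f} FO4 per cycle")
```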

System Architecture

The K8 system architecture is already quite good. For four socket servers, it clearly rules the x86 roost. The two major issues for AMD systems were the lack of quad core processors, and poor eight socket server performance.

The single most emphasized selling point for Barcelona is that it integrates four processor cores, bringing them to parity with Intel’s Xeon 53xx series. The Xeon 53xx, codenamed Clovertown, is actually a pair of dual core Woodcrest processors in a multi-chip package (MCP). These processors communicate over the front-side bus, rather than through an on-chip bus or caches. In contrast, AMD has opted for a shared cache approach, where the last level of cache, the L3 is used by all four cores. Figure 1 below compares the Revision F Opteron, Barcelona and Intel’s upcoming 3GHz Clovertown.


Figure 1 – System Architecture Comparison

The architects that designed Barcelona opted for a fully integrated MPU. A monolithic device ultimately provides higher performance, especially for bandwidth-sensitive workloads that don’t benefit from caching, such as HPC or data mining. However, like any engineering decision, it does not come without trade-offs. First of all, fully integrating everything is a decision that must be made at the beginning of the project. An MCP approach is far less time consuming and can use a slightly modified existing product; most importantly, these changes can be made late in the design cycle. Monolithic devices also have lower yields, because the larger die size means fewer candidate dice per wafer, and hence random defects have a larger impact. Monolithic MPUs are also more difficult to bin for frequency, since to run at a given speed, all four cores must exceed that target with appropriate power dissipation. However, there are design techniques that will let an MPU with a slow core and a fast core run at the slow speed, but with lower power.
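A toy Poisson yield model illustrates the die-size trade-off; the defect density below is an assumed round number, not AMD's actual 65nm figure:

```python
import math

# Illustrative Poisson yield model, Y = exp(-D * A): a single defect
# anywhere on a 283mm^2 monolithic die kills the whole quad-core, while
# an MCP pairs up independently tested half-size dice. The defect
# density is an assumed round number, not AMD's actual 65nm figure.
D = 0.005  # assumed defects per mm^2 (0.5 per cm^2)

def die_yield(area_mm2: float) -> float:
    """Fraction of dice of the given area that have zero defects."""
    return math.exp(-D * area_mm2)

monolithic = die_yield(283)     # one big quad-core die: ~24% good
half_die = die_yield(283 / 2)   # one dual-core die: ~49% good
# Because an MCP can pair up any two good half-size dice, ~49% of its
# candidate dice are usable, versus ~24% of the monolithic candidates.
print(f"monolithic: {monolithic:.0%}, half-size die: {half_die:.0%}")
```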

While AMD’s marketing department likes to bill their approach as a ‘native’ or ‘true’ quad core design, the truth is that both approaches are equally valid; a fact belatedly recognized by some of AMD’s own executives. Intel’s Clovertown is a quad core device. Operating systems recognize Clovertown as four processors, and it certainly offers higher performance for many applications than a dual core MPU. However, it is equally true that in most situations performance favors fully integrated quad cores.

In the case of Barcelona, the advantages of greater integration have been augmented by careful attention to I/O bandwidth. The memory controllers in Barcelona received a major overhaul. The most visible change is that each controller supports independent 64B transactions, rather than a single 128B transaction across both controllers (memory mirroring is also supported now). Since DDR2 bursts stay at 32B, this improves command efficiency. However, when using DDR3, the command efficiency will drop because the burst length will double to 64B. Each controller also supports a separate set of open pages in DRAM, which is controlled by a new history based pattern predictor (which is somewhat analogous to a simple branch predictor). The predictor uses both per-bank access history and page accesses across banks to decide whether to keep a page open to improve performance, or close the page to reduce power. Lastly, Barcelona introduces data poisoning, which ensures that if a double bit error is detected by ECC, it is contained and only impacts the process which first accesses it, rather than crashing or corrupting the whole system.
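A minimal sketch of such a history-based open/close predictor is shown below. The real predictor described above also uses page-access history across banks; this toy version keeps only a per-bank 2-bit saturating counter, analogous to a simple branch predictor:

```python
# Toy sketch of a DRAM page open/close predictor: a per-bank 2-bit
# saturating counter learns whether the next access to a bank tends to
# hit the currently open page (keep it open for performance) or not
# (close it to save power). Simplified relative to Barcelona's design.
class PagePredictor:
    def __init__(self, num_banks: int):
        # 2-bit counters (0..3), initialized weakly toward "keep open"
        self.counters = [2] * num_banks

    def keep_open(self, bank: int) -> bool:
        """Predict: keep the row open (True) or close it (False)."""
        return self.counters[bank] >= 2

    def update(self, bank: int, next_access_hit_page: bool):
        """Train on whether the next access to this bank hit the open page."""
        if next_access_hit_page:
            self.counters[bank] = min(3, self.counters[bank] + 1)
        else:
            self.counters[bank] = max(0, self.counters[bank] - 1)
```

After a run of page misses to a bank, the counter saturates low and the controller starts closing pages proactively for that bank, trading a little latency for lower DRAM power.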

While revision F Opteron processors supported DDR2, there was little performance advantage, if any. To actually take advantage of the available bandwidth for DDR2, deeper request and response queues are needed; these changes were not made in revision F, but are present in Barcelona. AMD also introduced a 16-20 entry write buffer in the memory controller, so that writes can be deferred, avoiding costly bus turn-arounds. Lastly, the memory controllers now support DRAM prefetchers that share the write buffer and can detect positive and negative strides. Server versions of Barcelona will support registered DIMMs at up to 667MHz, and desktop versions will work with slightly faster 800MHz DDR2.

Barcelona also adds a fourth HyperTransport lane for interprocessor communications and I/O devices. With four lanes, system vendors can build fully connected four socket systems; this reduces transaction latency substantially, since all processors can be reached with a single hop. Each node within the system could even have an attached I/O hub (see our preview of Barcelona). However, the current socket infrastructure only supports three HT1.1 lanes, so these innovative system designs will have to wait for a new socket interface. Initially, each link will run at 2GT/s, but they are compatible with HyperTransport 3.0 and future parts may operate at up to 5.2GT/s in newer systems. HT3.0 can also modulate link width and frequency to save power. Coherent HyperTransport also features a slight change that will improve latency for some transactions. When a K8 fetches a cache line into its L1D or L2, it has to snoop the system and wait for the results. In particular, the K8 will snoop memory and every other cache in the system; once it gets all of these responses, it can use the cache line it fetched. In Barcelona, however, if a requested cache line is in the M or O state (meaning that another cache holds a dirty copy and memory's copy is stale), the CPU does not wait for the snoop response from memory, improving transaction latency. The newer protocol also adds a retry mechanism to survive transient errors at higher clock rates.

HyperTransport 3.0 also adds a feature called ‘unganging’ or lane-splitting. The HT3.0 links are actually composed of 16 bit lanes running in each direction. These lanes can be split up into a pair of independent 8-bit wide links. This is fairly useful for connecting to I/O devices, as few systems have enough I/O devices to saturate a full 8GB/s interface; even 8 SAS hard drives and a pair of 10GBE cards would not require that much bandwidth. However, AMD has also pitched link-splitting as a way to build fully interconnected 8 socket servers. Previous generation Opteron-based systems supported up to 8 sockets, but the performance is positively underwhelming on the few benchmarks that have been published (mainly SAP 2 tier and SPECjbb2005). While Barcelona will offer higher performance 8 socket implementations, it isn’t clear how much demand there is from end-users. Sun and Fujitsu currently sell 8 socket servers, but both HP and Dell shelved earlier efforts in 2003.


The Fetch Phase

The front end of Barcelona is fairly complex and has been substantially improved over the K8, as shown in Figure 2 below. Each cycle, Barcelona fetches 32B of instructions from the L1I cache into the predecode/pick buffer. The previous generation K8 fetched 16B each cycle, as does Intel’s Core 2. The instruction fetch was widened because many of the SIMD and 64 bit instructions are longer, and as these become more common, larger fetches are required to keep the rest of the core busy. Consequently, the pre-decode and pick buffer for Barcelona has been enlarged, to at least 32B, although it could be somewhat larger - the K8's predecode buffer was 1.5x the fetch size, so a 48B buffer might not be out of the question.

This makes sense as Barcelona is targeted first and foremost at servers, where 64-bit mode is common. Core 2, on the other hand, was designed with more focus on consumers, who purchase the majority of computer systems. The reality is that even now, 64-bit operating systems are extraordinarily rare for desktops, and especially notebooks; in those market segments, the additional benefit is more limited and may not be worth the resources.


Figure 2 – Comparison of Front-End Microarchitecture

Branch prediction also received a serious overhaul in Barcelona; first, some background on the K8's scheme. The K8 uses a branch selector to choose between using a bi-modal predictor and a global predictor. The bi-modal predictor and branch selector are both stored in the ECC bits of the instruction cache, as pre-decode information. The global predictor combines the relative instruction pointer (RIP) for a conditional branch with a global history register that tracks the last 8 branches to index into a 16K entry prediction table that contains 2 bit saturating counters. If the branch is predicted as taken, then the destination must be predicted in the 2K entry target array. Indirect branches use a single target in the array, while CALLs use a target and also update the return address stack. The branch target address calculator (BTAC) checks the targets for relative branches, and can correct predictions from the target array, with a two cycle penalty. Returns are predicted with the 12 entry return address stack.
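The global predictor's direction scheme can be sketched as follows. Note that the XOR index hash is a gshare-style assumption for illustration; the exact index function is not disclosed:

```python
# Toy model of a K8-style global branch predictor: the branch address is
# combined with an 8-branch global history register to index a 16K-entry
# table of 2-bit saturating counters. The XOR hash is an assumed,
# gshare-style stand-in for AMD's undisclosed index function.
TABLE_SIZE = 16 * 1024
HISTORY_BITS = 8  # Barcelona widens the history to 12 branches

class GlobalPredictor:
    def __init__(self):
        self.table = [1] * TABLE_SIZE  # 2-bit counters, weakly not-taken
        self.history = 0               # last HISTORY_BITS branch outcomes

    def _index(self, rip: int) -> int:
        return (rip ^ self.history) % TABLE_SIZE

    def predict(self, rip: int) -> bool:
        """Predict taken (True) if the counter is in a 'taken' state."""
        return self.table[self._index(rip)] >= 2

    def update(self, rip: int, taken: bool):
        """Train the counter and shift the outcome into the history."""
        i = self._index(rip)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
```

Because the history is folded into the index, the same branch gets different counters along different paths, which is what lets a global predictor capture correlated branches.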

Barcelona does not fundamentally alter the branch prediction, but improves the accuracy. The global history register now tracks the last 12 branches, instead of the last 8. Barcelona also adds a new indirect predictor, which is specifically designed to handle branches with multiple targets (such as switch or case statements). Indirect branch prediction was first introduced with Intel’s Prescott microarchitecture and later the Pentium M. Indirect branches with a single target still use the existing 2K entry branch target buffer. The 512 entry indirect predictor allocates an entry when an indirect target is mispredicted; the target addresses are indexed by the global branch history register and branch RIP, thus taking into account the path that was used to access the indirect branch and the address of the branch itself. Lastly, the return address stack is doubled to 24 entries.

According to our own measurements for several PC games, between 16-50% of all branch mispredicts were indirect (29% on average). The real value of indirect branch prediction is for many of the newer scripting or high level languages, such as Ruby, Perl or Python, which use interpreters. Other common indirect branch culprits include virtual functions (used in C++) and calls through function pointers. For the same set of games, we measured that between 0.5-5% (1.5% on average) of all stack references resulted in overflow, but overflow may be more prevalent in server workloads.


The Decode Phase

x86 instructions are fairly complicated to decode; they are variable length and because of prefixes, the position of the op code cannot be known ahead of time. To simplify decoding, the K8 and Barcelona both use pre-decode information that marks the end of an instruction (and hence the start of the next instruction). However, the first time an instruction is fetched into cache there is no pre-decode information. The instruction cache contains a pre-decoder which scans 4B of the instruction stream each cycle, and inserts pre-decode information, which is stored in the ECC bits of the L1I, L2 and L3 caches, along with each line of instructions. Since there are almost no writes to the instruction stream, parity and refetching from memory is sufficient protection from errors in the instruction cache and ECC is not really required for code. As noted previously, this pre-decode information also includes branch selection and other related information.

Like the Pentium Pro, the K7/8 has an internal instruction set which is fairly RISC-like, composed of micro-ops. Each micro-op is fairly complex, and can include one load, a computation and a store. Any instruction which decodes into 3 or more micro-ops (called a VectorPath instruction) is sent from the pick buffer to the microcode engine. For example, any string manipulation instruction is likely to be micro-coded. The microcode unit can emit 3 micro-ops a cycle until it has fully decoded the x86 instruction. While the microcode engine is decoding, the regular decoders will idle; the two cannot operate simultaneously. The vast majority of x86 instructions decode into 1-2 micro-ops and are referred to as DirectPath instructions (singles or doubles).

In Barcelona, 128 bit SSE computations now decode into a single micro-op, rather than two; this makes the rest of the out-of-order machinery, such as the re-order buffer and the reservation stations, more effective. The same goes for integer and FP conversions, and for 128 bit load instructions, which are needed to complement the new SIMD capabilities. Note that 128 bit stores still create 2 micro-ops. Another tweak AMD added is support for unaligned SSE memory accesses, which eliminates extra alignment-handling instructions and thereby packs code more densely for more efficient fetching.

At some point during the decode stages, instructions are passed through a new piece of hardware in Barcelona, the sideband stack optimizer. The x86 instruction set supports stacks in hardware, and can directly manipulate the stack of each thread using PUSH, POP, CALL and RET instructions. These instructions modify the stack pointer (ESP), which in the K8 would generate a micro-op; worse yet, these instructions usually came in long dependent chains, which is a pain for the out-of-order machine.

AMD introduced a side-band stack optimizer to remove these stack manipulations from the instruction stream, similar to the dedicated stack engine in the Pentium M. Both MPUs use two registers, ESPO and ESPD (this is Intel’s terminology). ESPO is the original value for the stack pointer and is held in a register in the out-of-order machine, while ESPD, the delta register, tracks changes made to ESP and is in the front-end. Since ESP is an architecture register, a special micro-op is provided to recover ESP from ESPO and ESPD, although the use of this ‘fix up’ operation is minimized in Barcelona. When a stack modifying instruction is detected, it is removed and resolved by a dedicated ALU which modifies ESPD. This means that many stack operations can be processed in parallel, and frees up the reservation stations, re-order buffers and regular ALUs for other work. The benefits of this technique are highly workload dependent, but AMD and Intel agree that usually 5% of the micro-ops can be eliminated.
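The bookkeeping can be modeled in a few lines. This is our own simplified sketch of the ESPO/ESPD mechanism described above (word size and semantics reduced for illustration; not AMD's actual hardware):

```python
class StackOptimizer:
    """Model of the sideband stack optimizer: PUSH/POP become delta
    updates in the front-end instead of dependent ALU micro-ops."""
    WORD = 8  # 64-bit stack slots

    def __init__(self, esp):
        self.espo = esp   # baseline ESP, held in the out-of-order core
        self.espd = 0     # delta register, updated by a dedicated ALU

    def push(self):
        self.espd -= self.WORD  # no micro-op enters the schedulers

    def pop(self):
        self.espd += self.WORD

    def fix_up(self):
        # The special 'fix up' micro-op recovers the architectural ESP
        # from ESPO and ESPD when another instruction needs to read it.
        self.espo += self.espd
        self.espd = 0
        return self.espo
```

Because each PUSH or POP only touches the front-end delta, a long chain of stack operations no longer serializes through the regular ALUs.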

The last part of the decoding is the pack buffer (which is probably still 6 entries, like the K8).

The Out-of-Order Engines - Renaming and Scheduling

The first thing to notice is that the out-of-order control logic is vastly more complicated for the K8 and Barcelona than for the Core 2. Unlike Intel’s microarchitecture, the K8 and Barcelona have split integer and floating point clusters, with distributed schedulers/reservation stations. The Core microarchitecture has a single execution cluster, with a unified scheduler/reservation station and multiple issue ports. These choices, which date back to the P6 and the Athlon, are among the major factors that account for the strength of AMD microprocessors in floating point workloads.


Figure 3 – Comparison of Out-Of-Order Resources

The pack buffer, which is part of the decoding phase, is responsible for sending groups of exactly 3 micro-ops to the re-order buffer (ROB). However, the 72 instruction re-order buffer is not actually 72 independent entries. It contains 24 entries, with 3 lanes for instructions in each entry. The re-order buffer contains a rename register for the result of each operation in flight (or in the case of a FP operation, a pointer to the FP register file).

To ensure that the ROB is fully utilized, only a single group of exactly three instructions can be sent to the ROB each cycle. Therefore, the function of the pack buffer is twofold; it coalesces instructions into groups of three so that they can enter the ROB. Just as importantly, the pack buffer can move instructions between lanes to avoid a congested reservation station downstream or to observe issue restrictions. For example, floating point or integer multiplies must be in the first lane, while LZCOUNT must be in the third. Each lane corresponds to a specific reservation station further down in the pipeline, and once an instruction enters a specific lane in the ROB, it cannot be moved. Thus the pack buffer is also the last chance to switch lanes for an instruction.

At this point, the paths for floating point and integer/memory instructions diverge. The next stop on the integer side is the Integer Future File and Register File (IFFRF). The IFFRF contains 40 registers broken up into three distinct sets. First is the Architectural Register File, which contains the 16x64 bit non-speculative registers specified by the x86-64 instruction set. Instructions can only modify the Architectural Register File once they have retired, with no exceptions. Speculative instructions instead read from and write to the Future File, which contains the most recent speculative state of the 16 architectural registers. The last 8 registers are scratchpad registers used by the microcode. In the case of a branch misprediction or exception, the pipeline must roll back, and the Architectural Register File overwrites the contents of the Future File.
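The interplay between the Future File and the Architectural Register File can be sketched as follows (our own toy model, with the register count trimmed to 4 for readability):

```python
class IntRegs:
    """Sketch of Future File vs. Architectural Register File behavior."""
    def __init__(self, n=4):
        self.arch = [0] * n     # committed, non-speculative state
        self.future = [0] * n   # most recent speculative state

    def spec_write(self, reg, val):
        self.future[reg] = val  # speculative instructions write here

    def retire(self, reg):
        self.arch[reg] = self.future[reg]  # only retired results commit

    def rollback(self):
        # Misprediction or exception: the architectural state overwrites
        # the Future File, discarding all speculative values.
        self.future = list(self.arch)
```

Keeping the speculative state in a separate future file means recovery is a bulk copy rather than an instruction-by-instruction unwind.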

From the ROB, instructions issue to the appropriate scheduler. The integer cluster contains three reservation stations (or schedulers). Each one is tied to a specific lane in the ROB and holds 8 instructions, with the source operands. The source operands come from either the Future File, or the result forwarding bus (which is not shown because it is too complicated to draw).

The Floating Point Cluster

Floating point instructions are handled quite differently. Instead of being sent directly to the reservation stations, they first head to the FP Mapper and Renamer. One of the nasty aspects of the x86 instruction set is that FP operations are stack based; the FP mapper converts these stack operations to use a flat register file instead so that renaming can occur.

In the renamer, up to 3 FP instructions each cycle are assigned a destination register from the 120 entry FP register file. The file is large enough to rename up to the maximum of 72 instructions in flight. Along with the FP register file, there are two arrays, the architectural and future file arrays. In Barcelona, the architectural file array contains pointers to 44 of the 120 FP registers, which contain the non-speculative state: 8 for x87/MMX, 8 scratchpad registers for the microcode and 8x128 bit XMM registers. Previously, the K8 treated the XMM registers as 16x64 bit registers, but that changed once 128 bits became a 'native' data format. Similarly, the future file contains pointers to 44 renamed registers that contain the latest speculative values within the FP register file.

Once the micro-ops have been renamed, they may be issued to the three FP schedulers. Each reservation station holds up to 12 instructions, with the source operands. Like the integer schedulers, the operands can either come from the FP register file, or the forwarding network and each scheduler is tied to a specific lane in the ROB.


The Out-of-Order Engines - Execution Units

Once operations enter the schedulers, they wait until the source operands are ready. Then the scheduler will dispatch the oldest instruction and operands to the appropriate functional unit. The integer functional units in Barcelona are mostly unchanged from the K8. The three integer ALUs in K8 and Barcelona can execute most instructions and are largely symmetric. The two exceptions are that only the first ALU has an integer multiplier, and the third is used for POPCOUNT and other similar instructions. Note that the forwarding network for Barcelona has been omitted because it is far too complex to display in an organized manner.


Figure 4 – Comparison of Execution Units

The first substantial change in Barcelona’s integer units is that integer division is now variable latency, depending on the operands. IDIV instructions are handled through an iterative algorithm. In the K8, each IDIV would go through a fixed number of iterations – regardless of how many were required to achieve the final result. 32 bit divides took 42 cycles, while a full 64 bit divide required 74 cycles to calculate. In contrast, Barcelona only iterates the minimum number of times to produce an accurate answer. The latency for Barcelona is generally 23 cycles, plus the number of significant bits in the absolute value of the dividend (unsigned divides are roughly 10 cycles faster). Additionally, the third ALU pipeline now handles the new LZCOUNT/POPCOUNT instructions.
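The latency rule above can be expressed as a rough model; this is our own approximation of the figures quoted in the text, not a cycle-accurate simulation:

```python
def idiv_latency(dividend, signed=True):
    """Barcelona's variable-latency divide, per the text: ~23 cycles
    plus the significant bits of |dividend|; unsigned divides run
    roughly 10 cycles faster. Illustrative model only."""
    base = 23 if signed else 13
    return base + abs(dividend).bit_length()

def k8_idiv_latency(bits):
    """The K8 for comparison: a fixed iteration count regardless of
    operand values."""
    return 42 if bits == 32 else 74
```

For small dividends the gain is dramatic: dividing a byte-sized value costs around 31 cycles on Barcelona versus a fixed 74 cycles for a 64 bit divide on the K8.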

The FPUs in Barcelona did change a bit. They were widened to 128 bits so that SSE instructions can execute in a single pass (previously they went through the 64 bit FPU twice, just as in Intel’s Pentium M). Similarly, the load-store units, and the FMISC unit now load 128 bit wide data, to improve SSE performance.

One important difference between AMD and Intel’s microarchitectures is that AMD has their address generation units (AGUs) separate from the load store units (LSUs). This is because, as we noted earlier, AMD’s micro-ops can contain a load, an operation and a store, so there must be at least as many AGUs as ALUs. In contrast, Intel uops totally decouple calculations from memory accesses, so the AGUs are integrated into the load and store pipelines. The difference in the underlying uops and micro-ops result in the different AGU arrangements.

Another distinction between the Barcelona and Core microarchitectures is that AMD’s ALUs are symmetric and can execute almost any integer instruction, while the ALUs for Core 2 are not symmetric and are slightly more restrictive. Each of the lanes must be nearly identical for AMD’s distributed schedulers and instruction grouping to work optimally. This is a clear architectural trade-off of performance and decreased control complexity versus power and increased execution complexity. Replicating three full featured ALUs uses more die area and power, but provides higher performance for certain corner cases, and enables a simpler design for the ROB and schedulers.

The Memory System

The memory pipelines and caches in Barcelona have been substantially reworked; they now have some limited out-of-order capabilities and each pipe can perform a 128 bit load or a 64 bit store every cycle. Memory operations in both the K8 and Barcelona start in the integer schedulers, and are dispatched to both the AGU and the 12 entry LSU1. The address generation takes one cycle, and the result is forwarded to LSU1, where the data access waits.


Figure 5 – Comparison of Memory Pipelines

At this point, the behavior of Barcelona and the K8 diverge. In the K8, memory accesses were issued in-order, so if a load could not issue, it also stalled every subsequent load or store operation. Barcelona offers non-speculative memory access re-ordering. What this really means is that some memory operations can issue out-of-order.

During the issue phase, the lower 12 bits of the load operation’s address are tested against prior store addresses; if they are different, then the load may proceed ahead of the store, and if they are the same, there may be an opportunity for load-store forwarding. This is equivalent to the memory re-ordering capabilities of the P6 – a load may move ahead of another load, and a load may move ahead of a store if and only if they are accessing different addresses. Unlike the Core 2, there are no prediction and recovery mechanisms and no loads may pass a store with an unknown address.
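The check itself is simple enough to sketch. This is our own model of the conservative partial-address test described above; the lower 12 bits are the page offset, which is identical before and after translation, so the comparison can happen before the TLB is consulted:

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: invariant under 4KB paging

def load_may_pass_store(load_addr, store_addr):
    """A load may issue ahead of an older store only if the low 12
    address bits differ. Equal offsets mean either a real conflict or a
    store-forwarding opportunity; either way, the load must wait."""
    return (load_addr & PAGE_OFFSET_MASK) != (store_addr & PAGE_OFFSET_MASK)
```

Note the conservatism: two addresses in different pages but with the same offset will falsely block re-ordering, which is the price of avoiding the prediction and recovery hardware that the Core 2 carries.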

In the 12 entry LSU1, the oldest operations translate their addresses from the virtual address space to the physical address space using the L1 DTLB. The L1 DTLB now includes 8 entries for 1GB pages, which is useful for databases and HPC applications with large working sets. Any miss in the L1 DTLB will check the L2 DTLB. Once the physical address has been found, two micro-ops can probe (in case of a store) or read from (in case of a load) the cache each cycle, in any combination of load and store. The ability to do two 128 bit loads a cycle is beneficial primarily for HPC, where the bandwidth from the second port can come in handy. Once the load or store has probed the cache, it will move on to LSU2.

LSU2 holds up to 32 memory accesses, where they stay until retirement. LSU2 handles most of the complexity in the memory pipeline. It resolves any cache or TLB misses by scheduling and probing the necessary structures. In the case of a cache miss, it will escalate up to the L2, L3 or memory, while TLB misses go to the L2 TLB or main memory, where the page tables reside. LSU2 also holds store instructions, which are not allowed to actually modify the caches until retirement, to ensure correctness. Since all the stores are held in LSU2, it also does the load-store forwarding. Note that stores are still 64 bits wide, hence two entries are used to track a full 128 bit SSE write. This is a slight disadvantage for some instruction sequences, particularly those that copy data in memory, which have equal numbers of reads and writes. However, the general trend is that applications execute twice as many (or more) loads as stores.

The 64KB L1D cache is 2 way associative, with 64 byte lines and a 3 cycle access time. It uses a write-back policy to the L2 cache, which is exclusive of the L1. The data paths into and out of the L1D cache were also widened to 256 bits (128 bits transmit and 128 bits receive), so a 64 byte line is transmitted in 4 cycles. As in the K8, the L2 cache is private to each core. The L2 capacity has been halved to 512KB, but the line size and associativity were kept at 64B and 16 ways, respectively.
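The geometry implied by those parameters is worth working out. A quick sketch (standard cache arithmetic, assuming 64 bit addresses):

```python
def cache_geometry(size, ways, line):
    """Split a 64 bit address into offset/index/tag bits for a
    set-associative cache with the given size, ways, and line size."""
    sets = size // (ways * line)
    offset_bits = line.bit_length() - 1    # log2(line)
    index_bits = sets.bit_length() - 1     # log2(sets)
    tag_bits = 64 - offset_bits - index_bits
    return sets, offset_bits, index_bits, tag_bits
```

Interestingly, the 64KB 2-way L1D and the 512KB 16-way L2 both work out to 512 sets with identical index and offset splits, which simplifies moving lines between the two levels.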

The L3 cache in Barcelona is an entirely new feature for AMD. The shared 2MB L3 cache is 32 way associative and uses 64B lines, but did not fit in Figure 5. The cache controller is flexible; various AMD documents indicate that it can support up to 8MB of L3 cache. The L3 cache is specifically designed with data sharing in mind. This entails three particular changes from AMD’s traditional cache hierarchy. First, it is mostly exclusive, but not entirely so. When a line is sent from the L3 cache to an L1D cache, if the cache line is shared, or is likely to be shared, then it will remain in the L3 – leading to duplication which would never happen in a totally exclusive hierarchy. A fetched cache line is likely to be shared if it contains code, or if the data has been previously shared (sharing history is tracked). Second, the eviction policy for the L3 has been changed. In the K8, when a cache line is brought in from memory, a pseudo-least recently used algorithm would evict the oldest line in the cache. However, in Barcelona’s L3, the replacement algorithm has been changed to also take into account sharing, and it prefers evicting unshared lines. Lastly, since the L3 is shared between four different cores, access to the L3 must be arbitrated. A round-robin algorithm is used to give access to one of the four cores each cycle. The latency to the L3 cache has not been disclosed, but it depends on the relative northbridge and core frequencies – for reasons which we will see later.
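The sharing-aware eviction policy can be illustrated with a short sketch. The `age` and `shared` fields here are our own hypothetical bookkeeping, not AMD's actual metadata; the only property taken from the text is that unshared lines are preferred victims:

```python
def pick_victim(candidate_lines):
    """Among replacement candidates in an L3 set, prefer evicting a
    line never observed as shared; fall back to the oldest line when
    every candidate is shared (approximating pseudo-LRU with age)."""
    unshared = [line for line in candidate_lines if not line["shared"]]
    pool = unshared if unshared else candidate_lines
    return max(pool, key=lambda line: line["age"])
```

The effect is that code and actively shared data linger in the L3 even when they are old, which is the behavior you want from a cache whose main job is servicing inter-core sharing.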

The last improvements to Barcelona in the memory pipeline are the prefetchers. Each core has 8 data prefetchers (a total of 32 per device), which now fill to the L1D cache in Barcelona. In the K8, prefetched results were held in the L2 cache. The instruction prefetcher for Barcelona can have up to 2 outstanding fetches to any address, whereas the K8 was restricted to one fetch to an odd address and one fetch to an even address.

Circuit Techniques, Power Savings and More

From a circuit level perspective, the changes between the K8 and Barcelona were extremely significant. Barcelona is specified to operate at a wide range of voltages, from 0.8-1.4V. However, unlike its predecessor, each core in Barcelona has a dedicated clock distribution system (including PLL) and power grid. The frequency for each core is independent of both the other cores, and the various non-core regions; the voltage for all four cores is shared, but separate from the non-core. As a result, power can be aggressively managed by lowering frequency and voltage whenever possible. To support independent clocking and modular design, asynchronous dynamic FIFO buffers are used to communicate between different cores and the northbridge/L3 cache. These FIFOs absorb any global skew or clock rate variation, but the latency for passing through depends on the skew and frequency variance – which is why the L3 cache latency is variable. The northbridge and L3 cache compose roughly 20% of the die and share a voltage and clock domain that is independent of the four cores, which is essential for mobile applications. Previously, the northbridge clock and voltage was tied to the processors, so systems with integrated graphics could not reduce the processor voltage or frequency to deep power saving states. Separate sleep states, voltages and frequencies for the northbridge and processors should lower AMD’s average power dissipation which will help in the mobile market.


Figure 6 – Barcelona Die Micrograph

Barcelona also features a dedicated temperature sensor circuit for each core, and a separate one for the northbridge. Each core has 8 sensors on the circuit, while the northbridge contains 6. All the circuits are connected to and controlled by a global thermal control circuit. The global thermal controller uses the results to select power saving modes to reduce the temperature of the device.

One of the trickier areas for AMD’s design team was the SRAM cells for the caches. The L1 caches share a common 1.06um2 cell design. The 6T SRAM cells read during the first half of the cycle, and then perform a self-timed write and precharge in the latter part of the cycle. The timing for the write is based on extensive Monte Carlo analysis, incorporating lot-to-lot and local process variation and can be modified post-production with programmable fuses.

The L2 and L3 cache share many design elements, including the SRAM cells. The L2/3 cells are 0.81um2 and are also single ended for stability, which is unusual. One of the difficulties that AMD’s SRAM designers faced is that because they use the same die across all product lines, the likelihood of a read disturbance (i.e. reading the wrong data) must be very small. Specifically, a 5 sigma margin across the entire 0.7-1.3V range is required. Unfortunately, the floating body effect of SOI silicon precluded a more efficient small swing read design. According to AMD’s presentation, using a small swing read cell, they were only able to achieve a 4.53 sigma margin. The single ended design which was chosen had larger margins that were sufficient for actual product use.

Shifting to more software oriented matters, Barcelona also adds support for a variety of new instructions. Fortunately, these coincide with the supplemental SSE3 instructions that Intel added to the Core 2. Generally, these instructions are not terribly significant, with the exception of the POPCOUNT instruction, a perennial favorite of intelligence agencies, which counts the number of set bits (1s) in a given register. AMD also added support for unaligned SSE loads, as previously mentioned, and it will be interesting to see when or if Intel chooses to follow their lead.
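For readers unfamiliar with the operation, here is the software equivalent of what POPCOUNT does in a single instruction (a standard bit-counting loop, not AMD's hardware algorithm):

```python
def popcount(x):
    """Count set bits using Kernighan's trick: x & (x - 1) clears the
    lowest set bit, so the loop runs once per 1 bit in x."""
    n = 0
    while x:
        x &= x - 1
        n += 1
    return n
```

Doing this in software takes a loop or a chain of shift-and-mask steps, which is exactly why cryptanalysis and bit-manipulation heavy workloads appreciate a dedicated instruction.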

More significant to server users are the nested page tables, which improve virtualization performance. One of the drawbacks of Shadow Page Tables is that page faults become very expensive, since the VMM is invoked to manage any changes to the SPT. The alternative used in Barcelona, Nested (or Extended) Page Tables, is to virtualize the memory management unit. On Barcelona, each guest maintains its own page tables, while a hardware walked nested table maps guest physical to host physical addresses. Unfortunately, walking these tables can be extraordinarily expensive, so parts of the mappings can be cached as well. While this reduces the performance overhead of virtualization, customers waiting for I/O virtualization will have to wait till 2008 for that particular feature.
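Why the walk is so expensive is easy to see with a little arithmetic. This sketch assumes 4-level guest and host page tables and no walk caching; the 24-reference worst case is a commonly cited figure for this configuration, not a number disclosed in AMD's presentation:

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Worst-case memory references for a nested page walk: each of the
    guest table entries, plus the final guest physical address, must be
    translated by a full host walk; the guest entries are also read."""
    host_walks = (guest_levels + 1) * host_levels
    return host_walks + guest_levels

def shadow_walk_refs(levels=4):
    """Shadow page tables: a fault-free access is just an ordinary
    single-dimension walk, since the shadow maps guest virtual directly
    to host physical."""
    return levels
```

So a fully uncached nested walk can cost six times as many memory references as a native walk, which is why caching parts of the nested mappings matters so much.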


Barcelona Summary

AMD’s current competitive situation is rather difficult. For high-end MP servers and some HPC workloads, AMD is still regarded as the king of the hill. However, in almost every other segment, Intel is the performance and often power efficiency leader. This translates into significant financial problems for AMD, as the high-end server market is small (but lucrative) and HPC is generally very competitive for pricing. While this situation is reminiscent of the eve of the Opteron launch, in many ways AMD is better positioned than they were in early 2003.

In early 2003, AMD had no presence in the server market, and was not a particularly credible player since they had no track record. Moreover, AMD did not even have particularly strong ties with any of the server vendors. Today, AMD is an acknowledged participant in the server world, and is not perceived as a ‘follower’. AMD has also made significant sales and marketing in-roads with the major OEMs, including the last major holdout, Dell. At the same time, AMD has badly blundered with their channel partners – some of the earliest adopters and risk takers who pushed Opteron, and Intel has recently moved aggressively to court channel partners.

To date, AMD has only given a few hints on the frequencies and the expected performance for Barcelona. AMD has publicly predicted a 50% advantage over a Xeon 5355 (2.66GHz quad core) in SPECfp_rate2006, and a 20% advantage for SPECint_rate2006. Of course, AMD’s competition when Barcelona arrives will be a faster 3GHz processor from Intel, and later a 3.2GHz Penryn based design, using a 1.6GHz bus. Perhaps more importantly, SPECint and SPECfp only address a portion of the workloads that AMD and Intel target.

While AMD has not disclosed frequencies or TDP yet, rumors point to 1.9-2.6GHz in 100MHz steps with thermal envelopes of 68, 95 and 120W. Additionally, AMD is very likely to increase the frequency of Barcelona over the lifetime of the processor because of their approach to manufacturing. AMD tends to continuously improve their process, and these improvements should translate into incremental speed bumps along the way for their processors.

A comparison of the three microarchitectures below (the K8, Core 2 and Barcelona) offers some performance hints. For many of the most important features, AMD and Intel should be matched microarchitecturally, because AMD has incorporated quite a few of the techniques that Intel used to boost the per clock efficiency of the Core 2. While the Core 2 does appear to be 33% wider than Barcelona, in reality neither processor comes close to its peak capabilities on real code, so the performance will be much closer than the block diagrams imply. Barcelona's 3-wide issue, execute and retire capabilities are not a performance problem.


Figure 7 – Microarchitecture Comparison

Given all this information, and existing knowledge about AMD and Intel’s technical strengths and weaknesses it is possible to estimate how the competitive landscape will look towards the latter part of this year. At a high level Barcelona should provide an edge in multithreaded performance, but not an insurmountable one. Depending on clock speed, Intel may retain the performance crown for single threaded performance - which is essential for client systems.


Performance

For desktops, Barcelona will probably lead multithreaded performance and applications that strongly depend on high bandwidth, but single threaded workloads may slightly favor Intel’s designs. Note that the dual core desktop processors are likely to remain a large portion of the product mix, until the marginal cost for the additional cores is low, or most applications can use 4 threads. A mobile variant of Barcelona will not be introduced till 2009, after Griffin. This is because many of the improvements in Barcelona are focused on server performance, and may not have the right power/performance balance for notebooks.

Dual processor servers will be a mixed bag for AMD. Expect extremely strong performance for almost all HPC-style workloads – that is and will continue to be a strong suit for AMD’s architecture because of the highly integrated system design and copious bandwidth. However, for commercial workloads like file or web serving or transaction processing, which don’t require as much bandwidth, any performance gaps will be much smaller. For these workloads Barcelona will certainly be competitive and will exceed Intel’s performance on some benchmarks, but Intel will likely retain a lead for other benchmarks. Performance for single processor servers will generally be similar - but the advantages from AMD's system architecture will be smaller, hence the multithreaded performance will probably be even closer than for dual processor servers.

One area where AMD should have a slight edge is on dual processor platform power consumption, due to differences in the memory systems. AMD uses DDR2 DIMMs, which consume 3-5W each, while the FB-DIMMs that Intel’s systems use consume 5W above a normal DDR2 DIMM. The power advantage for AMD will depend on the configuration of each individual server. As more memory is added, AMD’s advantage will grow, however, as other components are added, AMD’s relative advantage will diminish – for example, in systems with 8 or more disks, the differences in memory systems may be lost in the noise. This difference in memory architectures applies for dual processor FB-DIMM based servers only, since Intel's single processor systems use DDR2 DIMMs and some dual processor servers may use regular DDR2 DIMMs. For those servers where Intel uses regular DDR2, AMD will not have any significant power advantages.
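The arithmetic behind that estimate is straightforward. A rough model using the figures quoted above (the ~5W per-DIMM FB-DIMM penalty is taken from the text; everything else here is illustrative):

```python
def memory_power_delta(dimms, fbdimm_extra_w=5.0):
    """Approximate platform power gap between an FB-DIMM system and a
    comparable DDR2 system: each FB-DIMM draws ~5W more than a DDR2
    DIMM, so the gap scales linearly with DIMM count."""
    return dimms * fbdimm_extra_w
```

With 8 DIMMs the gap is around 40W in AMD's favor, which is meaningful in a 2-socket server; in a box with a dozen disks drawing 10W each, the same 40W is much easier to lose in the noise.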

At the high-end, MP servers should be a bright spot for AMD. The higher level of integration and the additional HyperTransport link in Barcelona will improve an already formidable system architecture that has Intel on the defensive. One open question is whether Barcelona will truly commoditize the market for large MP (8 socket) servers. The capability is there, but it is unclear whether OEMs will aggressively push a solution that does not address the inherent limits of a snooping cache coherency policy, and lacks some of the RAS features that typical mid-range servers offer.

Conclusions

Barcelona is the first revision to AMD’s microarchitecture since 2003. Rather than starting from scratch, Barcelona builds on the previous generation and subtly improves almost every aspect of the design. In many ways, this mirrors the evolution of computer architecture – there are very few techniques that can give a large boost on a wide spectrum of applications. Instead, architects are turning to many smaller improvements, just as AMD has done with Barcelona. The only obvious trick left in the bag for AMD is multithreading, which could provide a big boost in a future microarchitecture, such as the K10. However, this style of conservative, consistent design has worked very well for AMD in the past.

Barcelona is a solid improvement across the board and should give AMD momentum across several key markets. The performance advantages will be decisive for HPC applications and MP servers, while other areas will be close in performance. Hence, there is quite a bit to look forward to in the near future with the debut and performance numbers for Barcelona. No matter what, AMD's engineering teams deserve kudos for a solidly executed product.




AMD's K8L and 4x4 Preview

Twice each year, AMD hosts an analyst day. The spring event, which took place at the Sunnyvale headquarters, is more technically oriented and tends to deal with the actual details of the company’s products and technology itself rather than financial performance and metrics. The event itself was relatively low key and included speakers from AMD and a few key partners, such as Sun, VMware, Rackable and video presentations from Microsoft and Alienware nee Dell. There was quite a bit of interesting information presented, but what really seemed to be worth dwelling on were the new revelations about the K8L and the 4x4 gaming systems.

Just as an aside, several architects at AMD expressed puzzlement at the origins of the K8L name. There is an internal engineering code name for the project, but the marketing team is slightly behind and has yet to provide something catchy for the rest of the world. At least one architect at AMD indicated a preference for the name K8++, but it seems unlikely that anyone in marketing or PR would share this point of view.

Do you Have Change for Some Cache?

Previously, Chuck Moore had described several incremental enhancements in the K8L at the Spring Processor Forum. The instruction fetch unit now includes an indirect branch predictor and fetches 32 bytes per cycle. The FP and SSE units have all been widened to 128 bits, as have the memory pipes. The load/store units also have somewhat more flexible execution; they can re-order loads with respect to other loads (although loads cannot move around stores). Physical and virtual addressing is expanded to 48 bits, and the page tables have been augmented as well. The page tables now support nesting for virtualization, and include 1GB pages. On the power side, the cores and system functionality will have separate power planes and independent C and P states. These are not all of the changes, but most of the key elements.


Figure 1 – Floor Plan of K8L

The first significant disclosures regarding the K8L had to do with the cache hierarchy within a single core. Despite an erroneous rumor to the contrary at Daily Tech, the L1D and L1I caches remain at 64KB each, according to a senior architect at AMD. The floor plan of the K8L also tends to confirm that the L1 caches have not decreased in size. The K8L did experience some L2 cache shrinkage and initial parts will feature a 2MB shared L3 cache. Based on the cache sizes, the L2 cache is still exclusive of the L1 contents, and the L3 cache is certainly not inclusive (although this does not mean it is exclusive). Additionally, it is easy to deduce, based on information about the load/store units that the bus between the L1 and L2 caches has been widened to 256 bits. The L3 cache is extensible, and it seems likely that 4MB parts will come out, perhaps as a way to differentiate between low-end parts intended for 1-2 sockets, and the higher-end parts for 4-8 sockets.

Scale up at Last?

Many industry insiders have commented that the K8 is eerily reminiscent of the ill-fated Alpha EV7. The EV7 augmented a high performance core with on-die directories for cache coherency and four interprocessor communication links operating at 6.4GB/s each. The EV7 also incorporated two memory controllers supporting eight channels of RDRAM, for a total memory bandwidth of 12.8GB/s per processor. Like the EV7, the K8 enhanced a prior generation design, adding a memory controller and three HyperTransport links. With three 8GB/s links, the K8 is an excellent choice for 1-8P servers. In theory, the K8 can scale up to 8 sockets; however, in practice it is extremely difficult. First, the only glueless 8 socket systems require multiple system boards; the Tyan Thunder K8QW uses 2 boards, while the Iwill H8502 uses 5 boards. Second, the snoop broadcast protocol used in the K8 ends up saturating the HyperTransport links. Third, using 8 sockets requires slightly more complicated system topologies that increase the number of hops between sockets, and hence average memory latency. As a result, performance projections for commercial server workloads (OLTP in particular) show very poor gains (10-40%, depending on the estimate) for glueless 8 socket systems over 4 socket systems.

AMD’s success with the Opteron in 4 socket systems, where it holds roughly half the market, has prompted the architects to extend the K8L’s scalability a step further. The K8L will add an additional 16-bit HyperTransport 3.0 link to each device, for 4 in total. Each link can run at up to 5.2GT/s and can be split into two separate 8-bit links, so a single device could be configured with eight 8-bit HyperTransport links instead of the regular four 16-bit links. Figure 2 below shows a fully connected system using split links.


Figure 2 – 4 and 8 Socket K8L System

Alternative configurations are also conceivable, but are beyond the scope of this article. Given these disclosures, the K8L will be somewhat better suited to 8 socket systems, since it solves the topology and latency issues, although no solution for the snooping problem has been disclosed.
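As a quick sanity check on the link arithmetic above, here is a small Python sketch of my own (the 5.2GT/s rate and link widths are from the text; the aggregation is my illustration):

```python
def ht_bandwidth_gbs(width_bits: int, transfer_rate_gts: float) -> float:
    """Unidirectional HyperTransport link bandwidth in GB/s.

    Each transfer moves width_bits/8 bytes, so bandwidth is simply
    (width in bytes) * (transfer rate in GT/s).
    """
    return width_bits / 8 * transfer_rate_gts

full_link = ht_bandwidth_gbs(16, 5.2)   # one full 16-bit link
half_link = ht_bandwidth_gbs(8, 5.2)    # one 8-bit half of a split link

print(full_link)   # 10.4 (GB/s each way)
print(half_link)   # 5.2  (GB/s each way)
# Four full links give roughly 41.6 GB/s of aggregate unidirectional bandwidth.
print(round(4 * full_link, 1))
```

Splitting a 16-bit link into two 8-bit links halves the per-link bandwidth but doubles the number of directly connected neighbors, which is the whole point of the fully connected topologies discussed here.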

While AMD did not discuss the matter, we had initially hypothesized that the limit was still 8 nodes per system. It turns out we were premature: AMD has increased the number of nodes, although we do not know by how much. Any limit is probably in the neighborhood of 16-64 nodes, due both to AMD’s partners and to technical constraints. AMD’s major partners (IBM, Sun, and HP) all have highly scalable systems that use other architectures (PPC, SPARC, and IPF, respectively). The notion of white-box vendors selling high processor count systems would not sit well with any of the three, since margins are far higher on their larger systems. Moreover, such a move could interfere with Newisys’ scalable Opteron servers. Lastly, scaling above 16 sockets would require a significant investment of technical resources that would not improve, and could even detract from, single-device performance.

Ultimately, 8 socket scaling on the K8L should improve substantially over the prior generation. Whether anyone will attempt glueless 16 sockets or more is unclear, but it seems safe to say that 8 socket K8L systems will be quite compelling.

Gaming the System

Anyone who has been following the gaming segment has probably noticed an increasing desperation on the part of the major vendors (AMD, Intel, Nvidia, and ATI) to retain leadership in their respective areas of expertise, no matter the cost. This started with the Pentium 4 Extreme Edition, which sold reasonably well despite its equally extreme pricing. While these ‘extreme’ parts carry ridiculous ASPs, the major goal seems to be PR rather than profitability. Sometimes these products also fall short on real performance because the software ecosystem is unprepared.

AMD introduced a new product line for the extreme gaming market that is, in essence, a dual socket system using Athlon FX processors and standard DDR2 (rather than the registered DIMMs used for Opteron). This announcement is obviously an attempt to bolster one of AMD’s core markets against future encroachment from Intel’s Conroe XE. The good news is that AMD resisted the temptation to do a quick hack for bragging rights: both MPUs attach to memory, and the system is outfitted to work with dual GPUs. Latency should be similar to or slightly better than existing two socket Opteron systems, although it is quite unclear to what extent the extra processors will improve performance for most games.

There is an added wrinkle: to some extent, 4x4 will compete with Opteron-based workstations and servers, such as the Tyan Thunder K8WE. Although product plan details have not been announced, the main differentiator between 4x4 and a 2 socket Opteron seems to be ECC memory protection. For some applications this could be enough to ensure that users pick the appropriately positioned product; however, some buyers will see the two as interchangeable and simply opt for the cheaper solution. In fact, if 4x4 works with regular Athlon parts, it might be just the thing for a low-cost scale-out system (load-balanced web servers, for instance). Since the system is not likely to appear until the latter part of the year, AMD has a while to devise a marketing strategy that avoids these undesirable crossovers.

Conclusion

AMD’s spring analyst day presented plenty of news on the company’s future plans, products, and focus areas. The most interesting part was a preview of the next-generation K8 microprocessor. The K8L, as it has been dubbed, is a strong incremental improvement over the 65nm shrink of the K8. There will be several changes to the microarchitecture, most notably in the memory hierarchy. The level of integration will also increase, enhancing the scalability of systems built around this next-generation part, which is due out in the middle of 2007. Naturally, once more details are available, full coverage of the subject will be in order. Preliminarily, the “K8L” looks to be a very solid MPU, elegantly integrating four cores. The other topic we covered was AMD’s 4x4 announcement, which is somewhat more niche. Fortunately for end users, 4x4 is a well planned design from the technical perspective and not a simple grab for the performance crown.

Unfortunately, discussing everything that went on would be nearly impossible. While we focused heavily on the K8L and the 4x4 platform, other topics are worth mentioning: AMD discussed its plans for fabs and manufacturing capacity, initiatives to serve the developing world, and a 65nm mobile Turion part. By far the most exciting topic we have not discussed is coprocessors, but that is an issue for another day.

Tuesday, November 27, 2007

AMD thumped by price war; ‘Barcelona’ may help

Can AMD's first quad-core processor, Barcelona, due in mid-2007, give it more ammo against Intel in a pricing war?

John Spooner thinks so and argues as much in the ChipLand blog. AMD issued a profit warning and has been widely panned by Wall Street. To wit:

"While we estimate AMD desktop and notebook units grew healthy double-digits, we also believe it is likely AMD saw ASP declines. As Intel has been ramping up its competitive dual-core Core 2 Duo, AMD likely saw the most pressure in its own dual-core products," says UBS Equities analyst Uche Orji, who adds that Intel isn't likely to see pricing pressure.

Bottom line: all AMD has to do is get its pricing up. Easier said than done; the best AMD can do is hold the fort for a few more months. Once Barcelona enables AMD to raise prices, the outlook should improve.

Saturday, November 24, 2007

AMD Phenom™ 9000 Series Quad-Core Processor







INCREDIBLE PERFORMANCE

The ultimate megatasking experience. Featuring true multi-core design and award-winning AMD64 technology with Direct Connect Architecture, AMD Phenom™ 9000 Series processors provide direct, rapid information flow between processor cores, main memory, and graphics and video accelerators, along with low-latency access to main memory for amazingly rapid response and phenomenal system performance. Designed for megatasking (running multiple, multi-threaded applications), they surge through the most demanding processing loads, including advanced multitasking, critical business productivity, advanced visual design and modeling, serious gaming, and visually stunning digital media and entertainment.

Phenomenal performance with advanced processor design. The AMD Phenom™ 9000 Series processors are the most advanced processors for true multitasking, with true quad-core design. Don’t get bogged down by non-native quad-core processors and obsolete front-side bus architectures. With an integrated memory controller and shared L3 cache, AMD Phenom™ 9000 Series processors have low-latency access to main memory for amazingly rapid system response and phenomenal system performance.

Blast through performance bottlenecks. All AMD Phenom™ 9000 Series processors feature AMD64 with Direct Connect Architecture to blast through performance bottlenecks. Award-winning HyperTransport™ 3.0 technology just got faster, providing support for full 1080p high-definition video and extreme total system bandwidth.

Shatter the memory barrier. Superior AMD64 architecture offers direct access to DDR2 memory. Enjoy virtually unlimited memory options with AMD64 technology and 64-bit Windows Vista®. Shatter the memory barrier with AMD Phenom™ 9000 Series processors and 64-bit Windows Vista®.

INTENSELY VISUAL

Experience Windows Vista®. Harness the power of Windows Vista® with the AMD Phenom™ 9000 Series quad-core processor, which divides and conquers the most complex tasks with true multi-core design. Enjoy the ultimate megatasking experience and virtually unlimited memory options, and shatter the memory barrier with AMD64 technology and 64-bit Windows Vista®.

STRIKINGLY EFFICIENT

Strikingly efficient Cool‘n’Quiet™ 2.0 technology. With the next generation of award-winning Cool‘n’Quiet™ technology, heat and noise are reduced so you can experience amazing performance without distraction. Combined with core enhancements that improve overall power savings, the AMD Phenom™ 9000 Series quad-core processor delivers seamless multitasking and optimum energy efficiency. Work, play, talk, and share on a PC that’s seen, not heard.

Purchase with Confidence
Founded in 1969, AMD has shipped more than 240 million PC processors worldwide. Customers can depend on AMD64 processors and AMD for compatibility and reliability. AMD processors undergo extensive testing to help ensure compatibility with Microsoft Windows XP, Vista, Windows NT®, Windows 2000, as well as Linux and other PC operating systems. AMD works collaboratively with Microsoft and other ecosystem partners to achieve compatibility of AMD processors and to expand the capability of software and hardware products leveraging AMD64 technology. AMD conducts rigorous research, development, and validation to help ensure the continued integrity and performance of its products.

Friday, November 16, 2007

FAQ: Installing Windows XP on a Mac (Without Mac OS)

This time I'm presenting a post I picked up from the blog of someone named adinoto, who says he is a Mac geek and also a Linux user. So enjoy....



The gist of that post (please refer to it yourself for the full version) is this: now that Apple has switched to Intel processors, and given the experiment showing that an Apple machine can be installed directly from a generic Windows XP CD intended for ordinary PCs, does that make a Mac just an ordinary PC? The question apparently still causes (and has great potential to cause) confusion among PC users who have just switched to the Mac, and among Mac users who are about to move, or have already moved, from PowerPC-based to Intel-based machines (iBook -> MacBook/MacBook Pro).

Since the potential for confusion seems very large, I felt the need to clear things up and share some information so the wider public is no longer mixed up.

My answer, more or less:

“Hehehe, woken up in the middle of the night before sahur by Tukul :D. It looks like some sharing is needed for the younger generation. :D The real issue is probably a misconception plus expectations set too high. Apple machines are of course ordinary PCs, just like Sun's and everyone else's. But the connotation of 'PC' here seems to be assumed to mean an Intel PC; well, Sun's AMD Opteron-based machines are also just AMD parts in a Sun box shipped with plain Solaris x86 (free, even), not the 'Hackintosh' known so far as a generic PC (the more accurate term) for installing a Mac-like OS.

What Tukul did is the reverse case: installing Windows ('Wedhus', plain Windows, in Tukul's slang) on an Intel Mac. I had already tried that experiment myself with 64-bit Vista on December 12, 2006.

So what is the real story behind all this? (Shared as a reference for the younger generation :D.) As I said, is a Mac with an Intel processor just an ordinary PC? Well, it has always been an ordinary PC; only the processor changed to Intel. Windows itself was originally designed to be multiplatform (for the Intel, MIPS, Alpha, and PowerPC architectures), but the last multiplatform build was Windows NT 4.0, because IBM stopped backing the PowerPC platform (OS/2 for PowerPC never even shipped).

To cut a long story short (I'm tired and sleepy), let me skip to the present; the history can come later on request, or this will run too long. The point is that an Intel-based Mac cannot be called an entirely ordinary PC; it is still more advanced. Apple uses EFI, a more sophisticated "BIOS," while most PCs still use a legacy BIOS with all its limitations. Originally, generic Windows for Intel did not support EFI, only BIOS booting, with the exception of the Windows XP 64-Bit Edition. Another factor besides the BIOS is the hard disk partition table layout. Windows uses the far more primitive MBR partition table (which supports only 4 partitions; if you format with DOS fdisk you get only 1 primary partition, with the second being an extended partition acting as a "container" for several logical drives), while Apple chose the more advanced GPT (GUID Partition Table) standard used on the Intel Itanium platform (Intel's 64-bit chip standard that failed in the market), which supports up to 128 partitions.
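The MBR-versus-GPT distinction described above can be checked programmatically. Here is a minimal Python sketch of my own (not from the original post): it classifies a raw disk image by looking for GPT's "EFI PART" signature at LBA 1 versus the MBR's 0x55AA boot signature, and counts the MBR's four primary-partition slots. The synthetic disk built at the end is purely for illustration.

```python
import struct

SECTOR = 512

def partition_scheme(disk: bytes) -> str:
    """Classify the start of a raw disk image as GPT, MBR, or unknown.

    An MBR ends sector 0 with the 0x55AA boot signature and holds four
    16-byte partition entries at offset 446; a GPT additionally places
    the ASCII signature "EFI PART" at the start of sector 1 (LBA 1).
    """
    if len(disk) < 2 * SECTOR:
        return "unknown"
    if disk[SECTOR:SECTOR + 8] == b"EFI PART":
        return "GPT"
    if disk[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"

def mbr_partition_count(disk: bytes) -> int:
    """Count non-empty entries in the 4-slot MBR partition table."""
    count = 0
    for i in range(4):  # the MBR has room for exactly 4 primary entries
        entry = disk[446 + 16 * i: 446 + 16 * (i + 1)]
        if entry[4] != 0:  # byte 4 of an entry is the partition type
            count += 1
    return count

# Build a tiny synthetic disk: an MBR with one FAT32 (type 0x0C) partition.
disk = bytearray(2 * SECTOR)
disk[446:462] = struct.pack("<8BII", 0x80, 0, 0, 0, 0x0C, 0, 0, 0, 2048, 20480)
disk[510:512] = b"\x55\xaa"

print(partition_scheme(bytes(disk)))     # MBR
print(mbr_partition_count(bytes(disk)))  # 1
```

The 4-slot limit visible in `mbr_partition_count` is exactly the constraint the post contrasts with GPT's much larger partition entry array.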


So the process of "experimenting" with Windows on an Intel-based Mac was not as simple at first as it is now, because standard Windows did not recognize EFI. Eventually the tinkerers at www.onmac.net ran a contest: whoever could patch the Mac's EFI to support booting via BIOS emulation would win the pot. The contest ran early in the year the Intel Macs appeared, and the donated prize pool reached $13,000 for whoever found the trick. One participant eventually won, and EFI could be patched to support BIOS emulation. Not long afterwards, Apple released Boot Camp. I suspect Apple licensed this solution so that Intel-based Macs could support BIOS emulation, because at first (when the Core Duo Macs appeared) you still had to patch Apple's firmware before Boot Camp would work. Try the experiment with a first-generation Intel Mac mini, or any of Apple's first desktop or portable machines using the Intel Core Duo, before the patch (the Intel Core 2 Duo machines were patched by Apple by default): you cannot install plain generic Windows.

My own older experiment (http://adinoto.org/?p=280) late last year was interesting because it was the first time Apple released a product with a 64-bit chip, so I was curious to install generic Windows, as people know it, on Apple's first 64-bit product. (For the uninitiated: the Intel Core Duo is a 32-bit processor; the Intel Core 2 Duo is a 64-bit processor; Intel desktop processors have been 64-bit for the last several years, as the EM64T label shows, and this 64-bit x86 instruction set is fully compatible with the AMD64 instructions: Intel licensed the 64-bit extensions from AMD after failing with its own 64-bit processor, the Itanium.) But installing Vista was not simple either, because although Vista did EVENTUALLY support EFI (initially it was decided it would not), Vista did not support the GUID/GPT table. So the partition layout had to be deleted first (this can be done with Windows XP, for example), and only then could Vista be installed.


So why can Windows XP be installed directly on an Intel-based Mac (without Mac OS), as in Tukul's "experiment"? Because:

1. Windows XP does not detect the GUID/GPT table, so the first step of its installation is to reformat/re-lay-out the partition table into an MBR table. Vista cannot do this directly, because it detects that a partition with an unknown layout is present.

2. Intel-based Macs already ship with EFI (the more advanced "BIOS" standard, which can, for example, download needed drivers before the OS installation and, with GPT, supports up to 128 partitions), patched at the factory, as described above, to recognize BIOS emulation, the hacker solution from www.onmac.net that I suspect Apple licensed. (Apple also needed this BIOS-emulation solution for Boot Camp; Boot Camp is really no more than a partition-resizing tool in the style of Partition Magic on Windows or Parted on Linux, plus a bundle of all the Apple hardware drivers for Windows, so you don't have to hunt for them.)

The question itself:
1. Do you need to run Windows ONLY on Apple hardware (an Apple with an Intel processor)?

Answer:
Yes: if you find Apple hardware more attractive and a better fit for your workspace. So far I haven't seen hardware as attractive as the 20″ aluminum iMac to put on your work desk: a slim Apple that doesn't eat up much space.

No: if your reason for choosing a Mac is that you want Mac OS X. Most people are satisfied running Boot Camp (Apple's dual-boot solution) or an emulator such as Parallels Desktop, VMware Fusion, or CodeWeavers CrossOver for Mac.

Yes: if you want to try it in order to experiment and broaden your knowledge. Not bad for trying things, right?

Next question:
Why couldn't earlier Macs have Windows installed natively?

Answer: Hmm, this needs clearing up, because the definition of "native" can be ambiguous. What does "native" mean here? If it means installing native generic Windows on a pre-Intel Mac, that is clearly impossible: the PowerPC architecture (Apple's processor before Intel) and the Motorola 68K (the generation before PowerPC) are simply different at the low-level instruction set from Intel x86 processors. (Even the Itanium is not compatible with ordinary Intel processors such as the Pentium, Core Duo, or Xeon, even though Intel produces both.) BUT if "native installation" means a native OS, then yes, it can still be done: just FIND an operating system built natively for that platform, for example Linux PPC (Linux for PowerPC), or OpenBSD (or FreeBSD or whatever BSD or Linux variant) built for PowerPC or for the Motorola 68K (match the chip inside your Mac). One of the OPEN SOURCE platforms that supports the most architectures is NetBSD; see the platform support list at www.netbsd.org

Another question:
Why was Virtual PC (formerly from Connectix, later acquired by Microsoft) so much slower running Windows on a (PowerPC-based) Mac than Parallels Desktop or VMware Fusion?

Answer:
Because the PowerPC chip has a different low-level instruction set from Intel x86, the emulator had to decode those instructions on the fly from PowerPC to x86 and vice versa, costing the processor considerable resources (a "CPU tax"), so everything visibly ran slowly. Parallels Desktop and VMware Fusion do not have to decode instructions, because both are designed to run on Intel processors and Windows already uses the same instruction set (both are x86).

Furthermore, Apple used several decoding conventions that differ from the PC, especially for graphics mapping: Apple used BIG-ENDIAN memory mapping (most significant byte first), while Windows uses LITTLE-ENDIAN (least significant byte first). The most visible performance penalty in Virtual PC's emulation was therefore slow screen REDRAW, because the encoding and decoding had to be done on the fly.
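The byte-order difference is easy to see concretely. A small Python sketch (my illustration, not from the original post) lays out the same 32-bit value in both orders and shows the swap an emulator must perform for every such word:

```python
import struct

value = 0x11223344  # an arbitrary 32-bit word (e.g. a pixel value)

big = struct.pack(">I", value)     # big-endian: most significant byte first
little = struct.pack("<I", value)  # little-endian: least significant byte first

print(big.hex())     # 11223344
print(little.hex())  # 44332211

# An emulator bridging the two formats must byte-swap each word on the fly:
swapped = struct.unpack("<I", big)[0]
print(hex(swapped))  # 0x44332211
```

Doing this translation for every framebuffer word, in software, is exactly the kind of overhead that made Virtual PC's screen redraws feel sluggish.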

Going even further (if you're still reading) :D ... the first emulation solution for running Windows on a Mac was built by a company named INSIGNIA. Insignia built SoftPC and then SoftWindows (before the better rival, Virtual PC from Connectix, was born). Insignia's solution was not used only on Apple machines: Insignia Solutions also built SoftWindows emulation for other commercial UNIX platforms, for example SoftWindows for HP-UX, and so on. Even Windows NT (the ancestor of the kernel used in Windows 2000 and Windows XP) was not actually compatible with earlier Windows (a story for later, or this will run too long), so starting in the Windows NT era Microsoft used Insignia's solution for built-in Win16 emulation on Win32, until the transition to fully 32-bit applications on the Win32 platform was complete.

Another question:
Why did Windows become so popular that everything refers back to Windows? Because Microsoft Windows could be installed everywhere: Bill Gates & Co. understood the potential of OEMs and of Taiwan as a partner producing compatible machines for the masses. That critical mass is what I had hoped would happen on the PowerPC platform, when IBM backed the distribution and the platform standards PReP (PowerPC Reference Platform) and CHRP (Common Hardware Reference Platform), but it was not carried forward because of managerial interests and vision under Louis V. Gerstner. (Ref: see several related posts on my blog.)

So what else distinguishes a Mac from an ordinary PC in hardware terms? Besides EFI and the different partition table layout mentioned above, there is actually one more thing: Gestalt (a sort of ID, an identifier) in every machine Apple produces. Every time a Mac receives an installation of its bundled OS X (or the retail version), the installer detects what machine type it is, what its configuration is, and which software needs to be installed. That differs from a PC, where, if we don't know the hardware, we have to identify each installed device and hunt for its driver whenever the operating system doesn't ship with one.

And why does Apple stubbornly stick with the hardware+OS business model and refuse to open its operating system for installation on home-built PCs? Steve & Co. apparently still believe that not everyone wants to bother with the details above, because the majority of users are end users who, in the user pyramid, treat the computer as a work tool without thinking about its technical complexity. Go ask Steve Jobs why, because I'm tired of saying it should just be opened up, hahaha... I too would like a cheap Mac for the odd one-or-two-machine deployment without having to buy a whole Mac unit every time, hehehe...

So is the Intel Core 2 Duo inferior to the PowerPC? Hmm, architecturally I prefer the PowerPC, since it is more efficient and delivered higher instructions per cycle (IPC, the rate at which instructions execute) than the Intel processors of its day (the Pentium 4). But remember that the last PowerPC Apple used is two generations older than the Intel Core 2 Duo generation. So the Intel Core 2 Duo is much newer and more advanced than the PowerPC G4. (The PowerPC G5 is architecturally advanced enough to compete, even though it is one generation older, but it is no longer developed and no longer used by Apple.)

In short, the PowerPC Mac is the past. The Intel processors Apple uses today are the latest generation and far outperform Apple's past PowerPC-based products (G4, G5), unless Apple someday decides to build a high-end workstation or server based on an IBM PowerPC G6.

The considerations that matter most, though, are marketing and critical mass. By using Intel processors, Apple gets to enjoy free advertising money from Intel and is more easily "accepted" by ordinary users, because advertising a product such as a processor consumes the largest share of a product's selling cost. By the way, Apple's bargaining power in adopting Intel's products is so large that you should note a few things:

1. Apple is the only company allowed to bundle Intel's products without the Intel Inside logo :D

2. Apple gets the biggest discounts, because Intel had wanted Apple to switch to Intel processors since 20 years ago. Andy Grove (Intel's former CEO) and Steve Jobs (Apple's CEO) are good friends (a father-and-son dynamic), so Apple's products can compete with, and undercut, those of rivals such as HP and Dell.

3. Apple gets first access to new products in the industry (the Intel Core 2 Duo, etc.), so Apple can offer higher specs at the same price as competitors (and in a case or two enjoys bigger margins).

My comment: So what I predicted long ago came true, back when AMD set out to build a new processor based on 64-bit technology while Intel still felt it wasn't yet time to enter the 64-bit world. At the time Intel was still busy with the 2.6 GHz LGA Pentium 4, all the way up to the worst of them, the Pentium 4 EE (Extreme Edition, or as I call it, Error Edition).

Why do I call it the Error Edition? Because this P4's power consumption was enormous (it could draw up to 100 watts). A friend who sells home-built computers even had a customer complain that the P4 EE ran so hot that the processor socket partially melted and fused to the processor, so his attempt to resell the motherboard and the P4 EE separately failed because the board was already ruined. THE SOCKET MELTED, FOLKS!!! Compare that with the Athlon 64: AMD learned from its Athlon XP series, which ran hotter than the Pentium 4 (but was much faster), and in its 64-bit series AMD added new instructions that were not power-hungry, so the chip did not generate excessive heat.

On top of that, the memory controller, usually placed in the motherboard's northbridge chipset, was integrated into the processor core. As a result, data flows from memory to the processor smoothly and directly, without having to pass through the northbridge. The funniest part, though: why couldn't Intel build its own 64-bit instruction set? Itanium failed, so Intel licensed the 64-bit instructions (paying AMD, hehehe) and applied them to its own processors, from the Pentium D up to the Core 2 Duo Extreme... So for the Intel maniacs out there, these are the facts about your beloved processors...

Once again, just enjoy.