Wednesday, December 12, 2007

AMD's K8L and 4x4 Preview

Twice each year, AMD hosts an analyst day. The spring event, which took place at the Sunnyvale headquarters, is more technically oriented and tends to deal with the actual details of the company’s products and technology rather than financial performance and metrics. The event was relatively low key and included speakers from AMD and a few key partners, such as Sun, VMware and Rackable, along with video presentations from Microsoft and Alienware nee Dell. There was quite a bit of interesting information presented, but what really seemed to be worth dwelling on were the new revelations about the K8L and the 4x4 gaming systems.

Just as an aside, several architects at AMD expressed puzzlement at the origins of the K8L name. There is an internal engineering code name for the project, but the marketing team is slightly behind and has yet to provide something catchy for the rest of the world. At least one architect at AMD indicated a preference for the name K8++, but it seems unlikely that anyone in marketing or PR would share this point of view.

Do you Have Change for Some Cache?

Previously, Chuck Moore had described several incremental enhancements in the K8L at the Spring Processor Forum. The instruction fetch unit now includes an indirect branch predictor and fetches 32 bytes per cycle. The FP and SSE units have all been widened to 128 bits, as have the memory pipes. The load/store units also have somewhat more flexible execution; they can re-order loads with respect to other loads (although loads cannot move around stores). Physical and virtual addressing is expanded to 48 bits, and the page tables have been augmented as well. The page tables now support nesting for virtualization, and include 1GB pages. On the power side, the cores and system functionality will have separate power planes and independent C and P states. These are not all of the changes, but most of the key elements.
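To put the 48 bit addressing and 1GB pages in perspective, here is a small back-of-the-envelope sketch in Python. The function names are illustrative only, and the 64GB region is an arbitrary example rather than a figure from AMD; the 4KB and 2MB sizes are the existing x86-64 page sizes, included for comparison.

```python
# Back-of-the-envelope arithmetic for 48-bit addressing and 1GB pages.
# All names are illustrative; only the 48-bit width and the 1GB page size
# come from the article. The 64GB region is a made-up example.

KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40


def address_space_bytes(bits: int) -> int:
    """Bytes addressable with the given number of address bits."""
    return 1 << bits


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Pages required to map a region, rounding up to whole pages."""
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    print(address_space_bytes(48) // TIB, "TiB addressable with 48 bits")
    region = 64 * GIB  # hypothetical working set, for illustration only
    for label, size in (("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1GB", GIB)):
        print(label, "pages needed for 64GB:", pages_needed(region, size))
```

The point of the comparison is simply that a larger page covers the same memory with far fewer translations, which is presumably why 1GB pages are attractive for large-memory and virtualized workloads.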


Figure 1 – Floor Plan of K8L

The first significant disclosures regarding the K8L had to do with the cache hierarchy within a single core. Despite an erroneous rumor to the contrary at Daily Tech, the L1D and L1I caches remain at 64KB each, according to a senior architect at AMD. The floor plan of the K8L also tends to confirm that the L1 caches have not decreased in size. The K8L did experience some L2 cache shrinkage, and initial parts will feature a 2MB shared L3 cache. Based on the cache sizes, the L2 cache is still exclusive of the L1 contents, and the L3 cache is certainly not inclusive (although this does not mean it is exclusive). Additionally, it is easy to deduce, based on information about the load/store units, that the bus between the L1 and L2 caches has been widened to 256 bits. The L3 cache is extensible, and it seems likely that 4MB parts will come out, perhaps as a way to differentiate between low-end parts intended for 1-2 sockets and higher-end parts for 4-8 sockets.

Scale up at Last?

Many industry insiders have commented that the K8 is eerily reminiscent of the ill-fated Alpha EV7. The EV7 augmented a high performance core with on-die directories for cache coherency and four interprocessor communication links operating at 6.4GB/s each. The EV7 also incorporated two memory controllers supporting eight channels of RDRAM, for a total memory bandwidth of 12.8GB/s per processor. Like the EV7, the K8 built on a prior generation design, adding a memory controller and three Hypertransport links. With three 8GB/s links, the K8 is an excellent choice for 1-8P servers. In theory, the K8 can scale up to 8 sockets; in practice, however, this is extremely difficult. First, the only glueless 8 socket systems require multiple system boards; the Tyan Thunder K8QW uses 2 boards, while the Iwill H8502 uses 5 boards. Second, the snoop broadcast protocol used in the K8 ends up saturating the Hypertransport links. Third, using 8 sockets requires slightly more complicated system topologies that increase the number of hops between sockets and hence average memory latency. As a result, performance projections for commercial server workloads (OLTP in particular) show very poor gains (10-40%, depending on the estimate) for glueless 8 socket systems over 4 socket systems.

AMD’s success with the Opteron for 4 socket systems, where they have roughly half the market, has prompted the architects to extend the K8L’s scalability a step further. The K8L will add an additional 16 bit Hypertransport 3.0 link to each device, providing 4 in total. Each link can run at up to 5.2GT/s, and can be split into two separate 8 bit links. So a single device could be configured with eight 8 bit Hypertransport links, instead of the regular four 16 bit links. Figure 2 below shows a fully connected system using split links.


Figure 2 – 4 and 8 Socket K8L System
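The link arithmetic behind these configurations can be sketched in a few lines of Python. This is a rough illustration, not AMD's own accounting: the helper names are made up, bandwidth is treated as raw transfer rate times link width with no protocol overhead, and the 2.0GT/s rate used to reproduce the K8's 8GB/s per-link figure is our assumption about the existing parts.

```python
# Rough Hypertransport link arithmetic. Helper names are hypothetical and
# the model ignores protocol overhead; it only multiplies rate by width.

def link_bandwidth_gbs(gt_per_s: float, width_bits: int,
                       bidirectional: bool = True) -> float:
    """Raw bandwidth in GB/s for a link of the given rate and width."""
    per_direction = gt_per_s * width_bits / 8
    return 2 * per_direction if bidirectional else per_direction


def links_for_full_mesh(sockets: int) -> int:
    """Links each socket needs to reach every other socket in one hop."""
    return sockets - 1


def spare_links(sockets: int, links_per_socket: int) -> int:
    """Links left over (e.g. for I/O) after fully connecting the sockets.

    A negative result means a fully connected system is not possible.
    """
    return links_per_socket - links_for_full_mesh(sockets)


if __name__ == "__main__":
    # Assumed 2.0GT/s for current K8 links, counting both directions.
    print("K8 16 bit link:", link_bandwidth_gbs(2.0, 16), "GB/s")
    print("K8L 16 bit link:", link_bandwidth_gbs(5.2, 16), "GB/s")
    print("K8L 8 bit link:", link_bandwidth_gbs(5.2, 8), "GB/s")
    print("4 sockets, four 16 bit links, spare:", spare_links(4, 4))
    print("8 sockets, four 16 bit links, spare:", spare_links(8, 4))
    print("8 sockets, eight 8 bit links, spare:", spare_links(8, 8))
```

Under these assumptions, a fully connected 8 socket system needs seven links per device, which only works with split links and leaves a single 8 bit link per device for I/O. The trade-off is that each split link carries half the bandwidth of a full 16 bit link.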

Alternative configurations are also conceivable, but are beyond the scope of this article. Given these disclosures, the K8L will be somewhat more suitable to 8 socket systems, since it solves the topology and latency issues, although there is no disclosed solution for the snooping problem.
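To illustrate why topology matters for latency, the following Python sketch compares hop counts in two 8 socket arrangements using a breadth-first search. The 2x4 "ladder" is a hypothetical layout chosen only because every node needs at most three links; it is not a description of any shipping K8 system, and the function names are our own.

```python
from collections import deque

# Hop-count comparison for two illustrative 8 socket topologies.
# The ladder layout is hypothetical, not an actual board design.

def hop_stats(adjacency):
    """Return (average hops, worst-case hops) over all ordered node pairs."""
    total = pairs = worst = 0
    for src in adjacency:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for dst, hops in dist.items():
            if dst != src:
                total += hops
                pairs += 1
                worst = max(worst, hops)
    return total / pairs, worst


def ladder(columns):
    """Two rows of sockets; each node links to its row and column neighbors."""
    adj = {(r, c): set() for r in range(2) for c in range(columns)}
    for c in range(columns):
        adj[(0, c)].add((1, c))
        adj[(1, c)].add((0, c))
        if c + 1 < columns:
            for r in range(2):
                adj[(r, c)].add((r, c + 1))
                adj[(r, c + 1)].add((r, c))
    return adj


def full_mesh(sockets):
    """Every socket linked directly to every other socket."""
    return {i: {j for j in range(sockets) if j != i} for i in range(sockets)}


if __name__ == "__main__":
    print("2x4 ladder (avg, worst):", hop_stats(ladder(4)))
    print("8 socket full mesh (avg, worst):", hop_stats(full_mesh(8)))
```

In this toy model the ladder averages two hops between sockets with a worst case of four, while the fully connected system is always one hop. Real systems would differ in the details, but the direction of the effect is the same.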

While AMD did not discuss the matter, we had initially hypothesized that the limit was still 8 nodes per system. It turns out that we were premature, and AMD has increased the number of nodes, although we do not know by how much. Any limitation is probably in the neighborhood of 16-64, due both to AMD’s partners and to technical constraints. AMD’s major partners (IBM, Sun and HP) all have highly scalable systems that use other architectures (PPC, SPARC and IPF, respectively). The notion of white box vendors selling high processor count systems would not sit well with any of those three, since the margins are far higher on their larger systems. Moreover, such a move could interfere with Newisys’ scalable Opteron servers. Lastly, scaling above 16 sockets would require a significant investment of technical resources that would not improve, and could even detract from, single device performance.

Ultimately, the 8 socket scaling for the K8L should substantially improve over the prior generation. Whether anyone will be willing to attempt glueless 16 sockets or more is certainly unclear, but it seems safe to say that 8 socket K8L systems will be quite compelling.

Gaming the System

Anyone who has been following the gaming segment has probably noticed an increasing desperation on the part of the major vendors (AMD, Intel, Nvidia and ATi) to retain leadership in their respective areas of expertise, no matter the cost. This started with the Pentium 4 Extreme Edition, which sold reasonably well, despite its equally extreme pricing. While these ‘extreme’ parts have ridiculous ASPs, it does seem like the major goal is really PR, rather than profitability. Sometimes, these products also fall short on real performance, because the software ecosystem is unprepared.

AMD introduced a new product line for the extreme gaming market, which is, in essence, a dual socket system using Athlon FX processors and standard DDR2 (rather than the registered DIMMs used for Opteron). This announcement is obviously an attempt to bolster one of AMD’s core markets against future encroachment from Intel’s Conroe XE. However, the good news is that AMD resisted the temptation to do a quick hack for bragging rights. Both MPUs can attach to memory, and the system is outfitted to work with dual GPUs. The latency should be similar to, or slightly better than, existing two socket Opteron systems, although it is quite unclear to what extent the extra processors will improve performance for most games.

There is an added wrinkle, which is that to some extent 4x4 will compete with Opteron based workstations and servers, such as the Tyan Thunder K8WE. Although product plan details have not been announced, it seems like the main differentiator between 4x4 and a 2 socket Opteron would be ECC protection for memory. For some applications this could be enough to ensure that the users pick the appropriately positioned product. However, some buyers will see the two as interchangeable and simply opt for the cheaper solution. In fact, if the 4x4 works with regular Athlon parts, it might be just the thing for a low cost scale out system, load balanced web servers for instance. However, since the system is not likely to appear till the latter part of the year, AMD will have a while to figure out a marketing strategy to avoid these undesirable crossovers.

Conclusion

AMD’s spring analyst day presented a lot of news on future plans, products and focus areas at the company. The most interesting part was a nice preview of the next generation K8 microprocessor. The K8L, as it has been dubbed, is a strong incremental improvement over the 65nm shrink of the K8. There will be several changes to the microarchitecture, most notably in the memory hierarchy. The level of integration will also increase, enhancing the scalability of systems built around this next generation part, which is due out in the middle of 2007. Naturally, once more details are available, full coverage of the subject would be in order. Preliminarily, the “K8L” looks to be a very solid MPU, elegantly integrating four cores together. The other topic we covered was AMD’s 4x4 announcement, which is somewhat more niche. Fortunately for end-users, 4x4 is a well planned out design from the technical perspective and not a simple grab for the performance crown.

Unfortunately, discussing everything that went on would be nearly impossible. While we focused heavily on the K8L and the 4x4 platform, there were other topics that are worth mentioning. AMD discussed their plans for fabs and manufacturing capacity, initiatives to serve the developing world and a 65nm mobile Turion part. By far the most exciting topic that we have not discussed is coprocessors, but that is an issue for another day.
