Wednesday, March 28, 2007

Intel's Update on Penryn and Nehalem


Earlier this morning, Intel hosted a press conference in San Francisco, where they previewed details on the microarchitecture for Penryn, the successor to Merom which will go into production towards the end of 2007. There was also a very high level description of Nehalem, which is the next microarchitecture to be released from Intel and will debut alongside CSI in the latter part of 2008.

Penryn Enhancements

The Penryn microarchitecture is largely similar to Merom; for those interested in the fine details, see our previous article from IDF (LINK). However, Penryn is not just a shrink to 45nm; there are several specific enhancements to improve performance. According to Intel, their initial performance analysis shows that a 3.2GHz Penryn system using a 1.33GHz front-side bus is 20% faster on gaming workloads than the current high-end 2.93GHz Conroe with a 1.07GHz bus. For more bandwidth and floating point intensive workloads, Intel claims that a 3+GHz quad core based on Penryn with a 1.6GHz bus will see a 45% improvement relative to a 2.67GHz/1.33GHz Clovertown server system. This is most likely measured using SPECfp_rate2006.
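To put the gaming figure in perspective, here is a rough back-of-the-envelope sketch that splits the claimed 20% into the part explained by the higher clock and the part left over for the faster bus, larger cache and microarchitectural changes. It assumes performance scales linearly with core frequency, which is optimistic for real workloads, and the function name is ours rather than anything from Intel.

```python
def split_speedup(total_speedup, new_clock_ghz, old_clock_ghz):
    """Split an overall speedup into a clock-ratio part and a residual part.

    Assumption: performance scales linearly with core clock. Real workloads
    usually scale less than that, so the residual is a lower bound on what
    the non-clock changes contribute.
    """
    clock_ratio = new_clock_ghz / old_clock_ghz
    residual = total_speedup / clock_ratio
    return clock_ratio, residual


# Intel's gaming claim: 3.2GHz Penryn is 20% faster than 2.93GHz Conroe.
gaming_clock, gaming_residual = split_speedup(1.20, 3.2, 2.93)
print(f"clock ratio: {gaming_clock:.3f}, residual: {gaming_residual:.3f}")
```

Under that assumption the clock accounts for a little over 9% and the remaining gain of just under 10% has to come from everything else.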

One of the most intriguing features of Penryn is something that is internally described as ‘turbo-mode’. The general idea is to increase the frequency of one core when the other core is idle and there is thermal headroom. While the mechanisms were not described in detail, it appears that the change in frequency is triggered by both software and electrical conditions. When one core switches into a low-power C-state, the other cores will evaluate whether they can increase frequency while staying within the TDP envelope. It is likely that more details will come out at IDF in Beijing later this year and at other relevant conferences. Further in this vein, Intel will be adding another low-power state, C6, to Penryn. In the C6 state, basically the entire chip is turned off, including PLLs, clock distribution and caches, and the voltage is dropped as far as it can go, which substantially reduces the power draw relative to other power states. These benefits are not free: entering or exiting C6 is more expensive because the L2 cache must be completely flushed to memory.
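Since Intel did not describe the mechanism, the following is only a toy model of the decision described above: boost the active cores when at least one core is idle and the estimated power stays within TDP. The function, its parameters and all of the numbers are hypothetical and are not taken from Intel's disclosure.

```python
def pick_frequency(core_states, base_mhz, turbo_mhz, est_power_w, tdp_w):
    """Return a per-core frequency list for a hypothetical turbo policy.

    core_states: list of C-state names; "C0" means active, anything else idle.
    est_power_w: estimated package power if the active cores ran at turbo_mhz
                 (how the hardware would estimate this is not public).
    """
    any_idle = any(state != "C0" for state in core_states)
    has_headroom = est_power_w <= tdp_w
    boost = any_idle and has_headroom

    freqs = []
    for state in core_states:
        if state != "C0":
            freqs.append(0)  # idle core: clock modeled as stopped
        else:
            freqs.append(turbo_mhz if boost else base_mhz)
    return freqs


# One core active, one idle, estimated power under the (made-up) 35W limit.
print(pick_frequency(["C0", "C3"], 2400, 2600, 30.0, 35.0))
```

With both cores active, or with the estimate above the limit, the same function leaves the active cores at the base frequency.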

Intel has also totally revamped the integer and floating point dividers, claiming that the latency is roughly half on average compared to the previous generation. The divider in Merom is a radix-4 (base 4) divider, which processes 2 bits each cycle. Penryn’s divider is a new radix-16 design which is roughly twice as fast, handling 4 bits per cycle. As with the divide unit in Yonah and Merom, the latency of an operation is somewhat variable, depending on the input operands. The minimum latency is 6 cycles, and during this 6-cycle startup period, the divider will determine the latency of the operation and forward that information to the reservation station, so that scheduling decisions for other resources and in-flight operations can be made. Since division is used to synthesize square root operations, the performance benefits accrue there as well. Single precision square root instructions are about 2x faster, and double or extended precision are roughly 3x faster.
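The radix arithmetic can be illustrated with a small sketch that counts how many cycles it takes to produce a quotient at 2 versus 4 bits per cycle. This is only the iteration count for a full-width significand; it ignores the startup period, rounding bits and any early-out for simple operands, so it should not be read as the real instruction latency.

```python
import math


def quotient_cycles(quotient_bits, radix):
    """Cycles needed to produce quotient_bits at log2(radix) bits per cycle."""
    bits_per_cycle = int(math.log2(radix))
    return math.ceil(quotient_bits / bits_per_cycle)


# Significand widths for IEEE single, IEEE double and x87 extended precision.
for name, bits in [("single", 24), ("double", 53), ("extended", 64)]:
    r4 = quotient_cycles(bits, 4)
    r16 = quotient_cycles(bits, 16)
    print(f"{name:9s} radix-4: {r4:2d} cycles, radix-16: {r16:2d} cycles")
```

In this simplified model the radix-16 design needs about half as many iterations at every precision, which is consistent with Intel's "roughly half on average" claim for division.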

Another change is increased support for SSE data format operations. The various iterations of SSE contained miscellaneous instructions such as (un)packing data, shuffling, shifting or concatenation. These instructions were decoded into multiple uops and often handled by microcode. The architects for Penryn have added functional unit support so that most of these 128-bit operations can be executed in a single cycle; they are now decoded into a single uop rather than several.

Lastly, Intel has improved the instructions used to enter and exit a virtual machine monitor. Intel claims that these transitions are roughly 25-75% faster.

Nehalem Disclosures

In addition to discussing Penryn, Intel also confirmed information that was known about Nehalem, while adding some new details. Nehalem will be released in late 2008 and will use a microarchitecture that is somewhat similar to Merom/Penryn: it will be quad issue/execute/retire. Nehalem will also feature a new instruction set extension, which Intel called “ATA”, on top of SSE4. Unsurprisingly, simultaneous multithreading will make a comeback, with each core supporting two logical processors. It will be interesting to see how resources are partitioned, as that was one of the critical problems in the Pentium 4, and later processors have shown much larger gains than the best case 20% that was reported for the P4. Intel describes Nehalem as dynamically managing cores, threads, cache, external interfaces and power, which certainly hints at more intelligent partitioning than the static division of resources in the P4.

There will be products based on Nehalem that offer from 1-8 cores, with different cache sizes, memory and interconnect bandwidth, etc. to fulfill different market requirements. The two other expected revelations were that Nehalem will use the Common System Interface to connect elements of a system, and that Nehalem will feature an integrated DDR3 memory controller. One surprise was that Intel’s integrated graphics will be integrated into the same package for some desktop and notebook products. Intel has clearly stated that they prefer package integration to chip level integration for a graphics processor. However, Intel might have opted for a discrete northbridge with integrated graphics. Integrating the CPU and GPU on the same package is certainly preferable from an OEM’s point of view; it substantially reduces the chip count, which in turn reduces the complexity of the board. The only downside is that the OEMs will have a slightly harder time differentiating their products.

Intel has also promised more information in mid-April, when the IDF in Beijing kicks off. However, this is still enough news to digest over the next few weeks.
