Paul DeMone, an eminently respected industry observer at Real World Technologies, has dug up some information on Tukwila, the next-generation microprocessor in the Itanium family.
Intel Caches Out with a Memory Controller
According to several slides at an HPC conference in Asia, Tukwila will be a quad-core part, confirming earlier rumors reported by Ashlee Vance of the Register and Charlie Demerjian of the Inquirer. Tukwila features an on-die FB-DIMM memory controller, which will lower access latency. The FB-DIMM controller likely supports 4 channels of memory, possibly more. As a result of the lower memory latency, Tukwila requires less cache than its predecessor. Montecito featured 27MB of cache for two cores, while Tukwila is reported to have 6MB of L3 cache per core, or 24MB for each MPU. Preliminary diagrams also indicate that there is an on-die switch for traffic between the four cores and caches on each chip.
A Digital Legacy
Tukwila will also debut the Common Systems Interconnect, or CSI. CSI is a low-latency, point-to-point serial interconnect that uses differential signaling. Tukwila will integrate four full-width CSI links and two half-width links. Full-width links operate at 6.4GT/s or 4.8GT/s in each direction, depending on the SKU. In comparison, current Itanium 2 systems have a 667MT/s bus that is 128 bits wide, for a total of 10.6GB/s of bandwidth. Unfortunately, the width of the CSI data path is unknown, so bandwidth estimates are difficult. The most likely scenario is that CSI is 8 or 16 bits wide, which at 6.4GT/s would yield 6.4 or 12.8GB/s per full-width link in each direction, respectively.
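The bandwidth figures above are simple arithmetic: transfers per second multiplied by bytes per transfer. A minimal sketch in Python follows. The function name is made up for illustration, the 8- and 16-bit CSI widths are the guesses from the paragraph above rather than confirmed specifications, and encoding or protocol overhead is ignored.

```python
def peak_bandwidth_gb_s(transfers_per_sec: float, width_bits: int) -> float:
    """Peak bandwidth in decimal GB/s: transfers/s times bytes per transfer."""
    return transfers_per_sec * width_bits / 8 / 1e9

# Current Itanium 2 bus, as quoted in the text: 667 MT/s, 128 bits wide.
itanium2_bus = peak_bandwidth_gb_s(667e6, 128)

# One full-width CSI link, one direction. The widths are assumptions.
csi_estimates = {
    (rate_gt_s, width_bits): peak_bandwidth_gb_s(rate_gt_s * 1e9, width_bits)
    for rate_gt_s in (4.8, 6.4)
    for width_bits in (8, 16)
}

if __name__ == "__main__":
    print(f"Itanium 2 bus: {itanium2_bus:.2f} GB/s")
    for (rate, width), gb_s in sorted(csi_estimates.items()):
        print(f"CSI at {rate} GT/s, {width} bits wide: {gb_s:.1f} GB/s per direction")
```

These are per-link, per-direction peaks. Aggregate socket bandwidth would depend on how many links are active and on whether both directions are counted.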
Tukwila also has an on-die CSI router and cache coherency directories. The router will improve latency for all systems, and the directories should ensure near-linear scalability for large (> 4 socket) systems. It is almost certain that the four full-width CSI links will be used for a 2D torus topology, while the half-width links will connect to I/O subsystems. This architecture is rather similar to the EV7, which was the first high-performance MPU to have an on-die memory controller, router, directories and interconnects. It should hardly be surprising that Intel is following in the footsteps of the EV7, considering that many former DEC architects are now at Intel.
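To see why four links per socket fit a 2D torus, note that every node in such a grid has exactly four neighbors, with the edges wrapping around. The sketch below is purely illustrative. The 4x4 grid size and the function name are assumptions, and nothing here reflects Intel's actual routing scheme.

```python
def torus_neighbors(x: int, y: int, cols: int, rows: int) -> list[tuple[int, int]]:
    """Return the four neighbors of node (x, y) in a cols x rows 2D torus.

    Edges wrap around, so each node uses exactly four links
    (west, east, north, south) regardless of its position.
    """
    return [
        ((x - 1) % cols, y),  # west
        ((x + 1) % cols, y),  # east
        (x, (y - 1) % rows),  # north
        (x, (y + 1) % rows),  # south
    ]

if __name__ == "__main__":
    # Hypothetical 16-socket system arranged as a 4x4 torus.
    for node in [(0, 0), (3, 3)]:
        print(node, "->", torus_neighbors(*node, cols=4, rows=4))
```

With at least three nodes in each dimension, the four neighbors are all distinct, so each full-width link reaches a different socket.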
Performance
Intel has estimated 40GFLOPS for Tukwila, using four cores. These cores will be very similar to those in Montecito. Hence, each core provides 4 FLOPS/cycle, implying that the device will operate at 2.5GHz. While Intel did not comment on whether Tukwila uses multithreading, it is most likely that each core has two threads, like Montecito, for a total of 8 threads per socket. The slide claims to improve on Montecito's scalar performance by a factor of 1.3. However, it is unclear what this claim means. Is scalar performance measured by SPECint_2000, SPECfp_2000 or perhaps another benchmark? Was the slide referring to Montecito at 2GHz, as was originally planned, or the 1.6GHz Montecito that will actually ship? These mysteries will undoubtedly be cleared up at a future conference, perhaps IDF, Hot Chips or ISSCC; for now, though, this leaves a bit to the imagination.
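The 2.5GHz figure follows from dividing the peak FLOPS estimate by the number of floating point operations the chip can retire per cycle. A minimal sketch is shown below. The function name is hypothetical, and the two-threads-per-core figure is the guess from the paragraph above, not something Intel has confirmed.

```python
def implied_clock_ghz(peak_gflops: float, cores: int, flops_per_cycle_per_core: int) -> float:
    """Clock frequency (GHz) implied by a peak GFLOPS figure."""
    return peak_gflops / (cores * flops_per_cycle_per_core)

# Figures from the text: 40 GFLOPS, four cores, 4 FLOPS/cycle per core.
clock_ghz = implied_clock_ghz(peak_gflops=40, cores=4, flops_per_cycle_per_core=4)

# Assumed two threads per core, as on Montecito.
threads_per_socket = 4 * 2

if __name__ == "__main__":
    print(f"Implied clock: {clock_ghz} GHz, threads per socket: {threads_per_socket}")
```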
