Tuesday, April 03, 2007

Intel’s Teraflops Research Project

One of Intel's research areas is what they call “Terascale Computing”. This research is really about discovering how computer architecture should evolve over the next decade or so. One element of this research, a project code-named Polaris, is a chip that delivers over a teraflop of performance. The first silicon prototype of this Teraflops chip was presented at ISSCC 2007 by members of the design team.

“Terascale Computing” is an Intel marketing term for research projects that are investigating how to take advantage of future process scaling and exploit greater parallelism. The first few generations of multicore products have been relatively straightforward extensions of conventional thinking. Right now, most MPU vendors are shipping products with 2-4 identical cores. For the next generation, integrating 4-8 cores with more cache, more memory bandwidth and more system functionality seems fairly reasonable. However, beyond that point, architects cannot continue using the same tricks with 16 or 32 cores. There are a variety of problems: yield, power consumption, thermal density, memory bandwidth and latency, ease-of-programming, efficiency, etc. As a simple example, look at the CELL microprocessor, which includes one PowerPC core and 8 SIMD cores connected using a ring topology. CELL is dominated by logic, rather than SRAM arrays, and according to IBM has 10-20% yields, which is fairly poor. Sony’s solution to this problem was to require only 7 of the 8 SIMD units, increasing the number of good dice per wafer.

The whole purpose of Terascale Computing is to explore the possibilities and figure out what will work (i.e. redundancy) and what won’t (i.e. dense logic with no redundancy) for the product groups at Intel. The Teraflops chip is entirely research oriented and is in no way, shape or form going to be productized. The teams responsible for this project were spread across several sites at Intel: Washington, California, Oregon, Arizona and India. The main goals for this project are to explore various options for clock distribution, interprocessor communication, power management and general design philosophy.

Teraflop Chip Overview

The Teraflops chip is manufactured in Intel’s high performance 65nm process with 8 layers of metal and uses 100M transistors on a 275mm² die. The device integrates 80 tiles (3mm² each) arranged in an 8x10 2-dimensional mesh. Each tile contains a processing element (PE) and a 5 port router for external communication. The system is designed with 15 FO4 delays per pipeline stage and operates from 1-5.6GHz. While the paper submitted to ISSCC contained simulated power results, the presentation contained actual measured data. As the various press announcements indicated, Intel was able to achieve 1TFLOP/s on a specific application, with the device operating at 0.95V and 3.16GHz. The on-chip network has a bisection bandwidth of 1.62Tbit/s when operating at 3.16GHz.


Figure 1 – Teraflops Die Micrograph

Figure 2 below shows the power dissipation versus voltage, accompanied by certain performance points. Note that the application in question is the best case for Intel, with relatively little communication (more on that later). What is most remarkable is that the leakage power is extraordinarily low, both in absolute terms and as a fraction of the total. When all tiles are computing, roughly 10-15% of the power is leakage, which is very good compared to most high performance MPUs. For instance, Montecito, the 90nm Itanium2 microprocessor, consumes on average around 25W of leakage power, 25% of its total dissipation [4].


Figure 2 – Power, Voltage and Performance for Stencil Application

VLIW Processing Engines

The PEs in each tile are minimalist VLIW cores that use a 96 bit instruction word to encode up to 8 operations against a 10 port register file. The emphasis here is on minimalist; the PEs don’t have caches, virtual memory, coherency or any of the other niceties that are found in modern embedded processors, let alone high performance ones. The operations in each word are two single precision FMACs, load, store, network send, network receive, branch and power management. The load and store instructions access the 2KB data memory, and instructions are fetched from a 3KB instruction memory. Roughly 74% of each PE is covered by NMOS sleep transistors, which carry a 5.4% area and 4% frequency penalty. The FMAC units use a 9 stage pipeline, with a single stage for accumulation. The sleep transistors wake up in a pipelined fashion over either 3 or 6 cycles, which reduces current spikes while still allowing execution to begin after a single cycle.
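
To make the instruction format a little more concrete, here is a rough model of one 96 bit instruction word as a C struct, with one field per operation slot named above. The field names and widths are my own placeholders for illustration; Intel has not disclosed the actual encoding.

    #include <stdint.h>

    /* Illustrative model of a Polaris VLIW instruction word.  The real word
     * is 96 bits and encodes up to 8 operations against a 10 port register
     * file; the eight slots below come from the ISSCC description, but the
     * field types and widths are placeholders, not the actual encoding.    */
    typedef struct {
        uint16_t fmac0, fmac1;   /* two single precision multiply-accumulates   */
        uint16_t load, store;    /* access the 2KB per-tile data memory         */
        uint16_t send, receive;  /* move data to/from the tile's router port    */
        uint8_t  branch;         /* branch within the 3KB instruction memory    */
        uint8_t  power;          /* sleep/wake control for the PE sleep regions */
    } vliw_word;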


Figure 3 – Tile Microarchitecture

One future research direction is 3D integration. The Teraflops group intends to connect to an external SRAM that is packaged below the device. This offers huge benefits in terms of bandwidth and latency, but is limited by power dissipation: only one chip can be mated to the heatsink, and all the others must dissipate relatively little heat. Initial 3D integration will probably use internally manufactured SRAMs; however, a natural evolution would be to use DRAMs. DRAMs have the advantage that they offer vastly greater capacity, lower power consumption and lower cost. However, Intel has long since exited that business and would have to work with an external partner.

Network and Router Design

The interesting part of the Teraflops research project is not the processing elements, but the routers, the mesochronous clocking and the network. The network topology is a 2D mesh. Each tile contains a 5 port router that connects to the adjacent tiles in the four cardinal directions and to the processing element itself. Each network packet is broken down into multiple flow control units (FLITs); the minimum packet size is 2 FLITs, and there is no maximum. The FLIT header has 3 bits to indicate the destination, and multiple headers may be chained together for long paths.
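
As a rough illustration of the packet format, the sketch below models a FLIT and a packet in C. Only the 3 bit destination field, the 2 FLIT minimum packet size and the chaining of headers for long paths come from the presentation; everything else (field names, payload width, tail marker) is assumed purely for the sake of the sketch.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative FLIT model.  Only the 3-bit destination field, the 2-FLIT
     * minimum packet size and the chaining of headers for long paths come
     * from the presentation; the payload width and the exact layout below
     * are assumptions made for the sketch.                                  */
    typedef struct {
        uint8_t  is_header;  /* leading FLIT(s) of a packet carry routing info */
        uint8_t  is_tail;    /* last FLIT of the packet (assumed marker)       */
        uint8_t  dest;       /* 3-bit destination/hop field, values 0-7        */
        uint32_t payload;    /* data carried by body FLITs (width assumed)     */
    } flit;

    /* A packet is simply a sequence of FLITs: one or more chained header
     * FLITs for longer routes, followed by body FLITs, at least 2 in total. */
    typedef struct {
        flit  *flits;
        size_t count;        /* >= 2, no upper bound                           */
    } packet;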


Figure 4 – Router Design

The router features five 36 bit ports, with a 5 stage pipeline and two virtual lanes to avoid deadlock, as shown above in Figure 4. At 4GHz, each router provides 80GB/s of communication bandwidth. Each lane has a FIFO buffer that can hold up to 16 FLITs. The first router design, which was shown at ISSCC 2001, had a crossbar dedicated to each lane. Instead, this new design shares a single non-blocking crossbar between both lanes, double pumped in the fourth pipestage using dual edge-triggered flip-flops. The dual edge-triggered flip-flops let the crossbar transfer data on both the rising and falling edges of the clock, similar to the way DDR memory transfers data. This improvement reduces the crossbar area by 50% and the overall router area by 36%, improves average power by 13%, and decreases latency by one cycle versus the prior design. Figure 5 below shows the design for the newer crossbar on top, and an area comparison between the two different designs at the bottom. The micrograph on the left is the new design, with a shared crossbar; the micrograph on the right is a scaled down 65nm version of the original dual crossbar design.


Figure 5 – Double Pumped Crossbar Switch Design

The on-chip network is wormhole switched. The sender launches the first FLIT; the receiving router inspects the header and forwards it to the appropriate port, and all subsequent FLITs of that packet follow the same path. This has the advantage of pipelining the message transmission. However, if there is a delay anywhere along the path, the message stalls in place. When a FLIT cannot be sent, it is stored in the 16 entry FIFO, and when a FIFO fills up, the receiving router signals the upstream router to stop sending. This process eventually creates back pressure on the original sender. Backpressure is a relatively simple technique for flow control, and is less efficient than credit based mechanisms. Backpressure and wormhole routing were used because they were simple and low-risk design choices, which let the team spend their time on other, more innovative portions of the project. For a real product, a much more sophisticated network would be implemented.
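
The model below is a much-simplified software sketch of this flow control: a single input port with a 16 entry FIFO, wormhole forwarding that locks an output port for the duration of a packet, and a stop signal raised toward the upstream sender when the FIFO fills. The real router is pipelined, has two virtual lanes and obviously looks nothing like C code; the stub header decoding and the data structures are assumptions purely for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    #define FIFO_DEPTH 16   /* each lane buffers up to 16 FLITs */

    typedef struct {
        uint64_t entries[FIFO_DEPTH];
        int head, tail, count;
        bool stop_upstream;      /* backpressure signal toward the sender     */
    } input_port;

    typedef struct {
        int  locked_output;      /* output port held open by the current worm */
        bool packet_in_flight;
    } wormhole_state;

    /* Stub decoders for the sketch; the real header format is not public.   */
    static int  route_header(uint64_t flit)  { return (int)(flit & 0x7); }
    static bool is_tail(uint64_t flit)       { return (flit >> 3) & 1;   }
    static bool output_ready(int port)       { (void)port; return true;  }

    /* The upstream router calls this; when the FIFO is full it asserts
     * "stop" and the sender must hold the FLIT until told to resume.        */
    bool port_push(input_port *p, uint64_t flit) {
        if (p->count == FIFO_DEPTH) {
            p->stop_upstream = true;
            return false;
        }
        p->entries[p->tail] = flit;
        p->tail = (p->tail + 1) % FIFO_DEPTH;
        p->count++;
        return true;
    }

    /* Forward at most one FLIT per step.  A header FLIT selects and locks
     * the output port; body FLITs follow the same path until the tail FLIT
     * releases it.  If the downstream port is busy, the FLIT stays in the
     * FIFO, which is how stalls propagate back toward the original sender. */
    void port_step(input_port *p, wormhole_state *w) {
        if (p->count == 0)
            return;
        uint64_t flit = p->entries[p->head];

        if (!w->packet_in_flight) {
            w->locked_output    = route_header(flit);
            w->packet_in_flight = true;
        }
        if (!output_ready(w->locked_output))
            return;                        /* stall in place                  */

        /* ...the FLIT would be driven onto w->locked_output here... */
        if (is_tail(flit))
            w->packet_in_flight = false;

        p->head = (p->head + 1) % FIFO_DEPTH;
        p->count--;
        if (p->count < FIFO_DEPTH)
            p->stop_upstream = false;      /* let the sender resume           */
    }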

Mesochronous Interfaces and Clock Distribution

One of the more novel aspects of this project is that the interfaces between the routers are mesochronous. Typically, most signals in a chip are synchronous, running at the same frequency with no phase difference. Two signals are mesochronous if they run at the same frequency, but their relative phase may vary. Plesiochronous signals run at slightly different frequencies, so their relative phase drifts over time. Figure 6 below shows two signals that are (a) synchronous (b) mesochronous and (c) plesiochronous.


Figure 6 – Various Types of Signal Synchronization

The mesochronous clocking for different tiles is extremely advantageous for clock distribution. Since the clock signals don’t need to be as precisely coordinated, MPU designers can implement simpler and less power hungry clock distribution networks, replacing a complicated H-tree with something simpler and shorter, like a grid. This also means that many of the repeaters and buffers, which are used to keep signals in phase, can be removed, reducing power draw and thermal dissipation. For example, the 180nm Itanium2 microprocessor used a balanced H-tree and burned around 30% of its 130W power envelope on clock distribution [1]. The 90nm Itanium2 uses 25W of a 100W power budget on clock distribution [2]. This sort of power consumption is fairly typical for a high performance microprocessor; usually one quarter to one third of the power goes into the clock tree when clock gating is used. Without clock gating, that can shoot up as high as 70% [3].

In comparison to high performance microprocessors, clock distribution in Polaris was both a simple and low power affair. The clock is distributed from the PLL by a grid of horizontal spines on M8 and vertical spines on M7 to the individual tiles. While the presentation did not give exact power numbers, the global clock distribution (across M7 and M8) is estimated to use 2.2W. Within the individual tiles, roughly 10% of the power is used for clocking, and 33% of the communication power is used for clocking, plus another 6% for the mesochronous interfaces. Given that the vast majority of clock distribution power goes into driving the final stage, which is inside each tile, this represents a substantial improvement. These estimates are at 4GHz and 1.2V, with a total of 181W dissipation for the entire chip. Altogether, it seems likely that clock distribution in Polaris accounts for around 12-20% of the total power, although hopefully future disclosures will be more precise.
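
As a rough sanity check on that range, the back-of-envelope below combines the disclosed figures: 2.2W for the global spines plus roughly 10% of the remaining power for in-tile clocking, at the 181W operating point. How those percentages should be combined was not spelled out in the presentation, so this is only one plausible reading.

    #include <stdio.h>

    int main(void) {
        /* Figures disclosed for the 4GHz, 1.2V operating point. */
        const double total_w       = 181.0;   /* whole-chip dissipation          */
        const double global_clk_w  = 2.2;     /* M7/M8 global clock spines       */
        const double tile_clk_frac = 0.10;    /* ~10% of tile power is clocking  */

        /* Assumption: the 10% figure applies to everything outside the global
         * spines; this is one plausible reading, not a disclosed breakdown.    */
        double tile_clk_w  = tile_clk_frac * (total_w - global_clk_w);
        double clk_total_w = global_clk_w + tile_clk_w;

        printf("clock power ~= %.1f W (%.0f%% of total)\n",
               clk_total_w, 100.0 * clk_total_w / total_w);
        /* Prints roughly 20 W, or ~11% of the total, i.e. the low end of the
         * 12-20% estimate before communication and mesochronous clocking
         * are added in.                                                        */
        return 0;
    }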

While mesochronous clocking saves power in clock distribution, it complicates the network design. Since there may be phase variation between any two tiles in the system, the mesochronous network interfaces must tolerate and correct any phase mismatch. The interfaces are implemented using a 4 deep circular FIFO; incoming data is captured with a programmably delayed strobe and then read out in the receiving tile’s clock domain. This synchronization normally has no latency impact, but occasionally adds a 1 cycle delay. The worst case phase misalignment results in a 2 cycle delay, but this is very infrequent.
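
The sketch below is a software model of a FIFO based mesochronous synchronizer of this kind: the write pointer advances with the (delayed) transmit strobe, the read pointer with the receiver's local clock, and the 4 entry depth absorbs the phase offset between the two. The real circuit is built from latches and a programmable delay line, so this is only meant to show why a small circular FIFO decouples the two clock phases.

    #include <stdint.h>

    #define SYNC_DEPTH 4

    /* Software model of a 4 deep circular FIFO used as a mesochronous
     * synchronizer.  Both domains run at the same frequency, so once an
     * initial pointer separation is established at reset the write and
     * read pointers advance in lockstep and never overrun each other.   */
    typedef struct {
        uint64_t slots[SYNC_DEPTH];
        unsigned wr;   /* advanced in the sender's (strobe) phase   */
        unsigned rd;   /* advanced in the receiver's clock phase    */
    } meso_sync;

    /* Called once per transmit strobe, in the sender's phase. */
    void meso_write(meso_sync *s, uint64_t data) {
        s->slots[s->wr] = data;
        s->wr = (s->wr + 1) % SYNC_DEPTH;
    }

    /* Called once per receive clock, in the receiver's phase.  The initial
     * rd/wr separation determines whether the crossing costs 0, 1 or, for
     * the worst phase alignment, 2 cycles of latency.                      */
    uint64_t meso_read(meso_sync *s) {
        uint64_t data = s->slots[s->rd];
        s->rd = (s->rd + 1) % SYNC_DEPTH;
        return data;
    }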

Results and Future Implications

Intel’s developers ported several high performance computing applications to this architecture, in what can only be described as a heroic effort. The applications were a heat equation solver using the stencil method, SGEMM (a single precision matrix multiply), financial modeling for spreadsheets and a 2D fast Fourier transform. The heat equation solver was able to achieve 1TFLOP/s at 80 degrees C, 1.07V and 4.27GHz, with a measured power draw of 97W. The efficiency of the applications varied strongly, depending on the communication required, as shown in Table 1 below.


Table 1 – Algorithmic Efficiency for Polaris

These results show that for algorithms with relatively little inter-node communication, the system design works well. The stencil algorithm uses the mathematical properties of the heat equation to reduce communication, so its high efficiency isn’t very surprising. However, for workloads with lots of communication, such as the 2D FFT, it is clear that the network could use some improvements, which points the way to further research.
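
To see why the stencil workload communicates so little, consider a generic 5-point stencil update for the 2D heat equation, sketched below. Each point reads only its four nearest neighbors, so when the grid is partitioned across tiles, each tile exchanges just a one-cell-wide halo with its neighbors per timestep. This is a generic illustration, not the code that Intel actually ported to Polaris.

    /* Generic 5-point stencil update (explicit Euler step) for the 2D heat
     * equation on an nx-by-ny grid.  Each output point depends only on its
     * four nearest neighbors, so a tile that owns a block of the grid only
     * needs a one-cell halo from adjacent tiles each timestep.             */
    void heat_step(int nx, int ny, float alpha,
                   const float *u, float *u_next) {
        for (int y = 1; y < ny - 1; y++) {
            for (int x = 1; x < nx - 1; x++) {
                int i = y * nx + x;
                u_next[i] = u[i] + alpha *
                    (u[i - 1] + u[i + 1] + u[i - nx] + u[i + nx] - 4.0f * u[i]);
            }
        }
    }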

The mesochronous clocking and network are very interesting, since they offer the opportunity to reduce a significant source of power consumption. However, the approach is not quite ready for implementation in mainstream x86 MPUs. Regular MPUs have caches, instead of non-coherent memory blocks, and usually the last level of cache is shared between multiple cores. The design of the cache architecture will likely have implications for clocking, since cores sharing a cache should probably operate synchronously; alternatively, the cache itself could be designed to be somewhat asynchronous, similar to the 12MB L3 cache in Montecito. There will also be interaction effects between the asynchronous clocking and other portions of the chip, but it seems likely that clever design can accommodate these issues.
