Wednesday, May 02, 2007

3D Integration: A Revolution in Design

The continuing pace of chip level feature miniaturization, known as Moore’s Law, has doubled the number of transistors per unit area approximately every two years. Chip designers have been provided with a plethora of transistor options to choose from in order to optimize for a given constraint. New materials with higher dielectric constants, such as hafnium-based high-k gate oxides [1], along with metal gate electrodes, decrease leakage and boost drive current. Strained silicon engineering [2] enables higher transistor switching speeds. Different transistor designs featuring multiple threshold voltages optimize for low power or high performance applications. New transistor structures such as the double-gate or tri-gate [3] transistor enable further increases in device switching speeds while reducing leakage. Fundamentally, these improvements have concentrated on transistors, while interconnect performance has languished and fallen behind these new and faster transistors.

Three dimensional integration uses multiple vertical layers of transistors to improve performance, instead of the single layer of transistors that most modern integrated circuits use today. This report presents the interconnect problem, which is a major motivator for three dimensional integrated circuits, explains in detail how three dimensional integration will improve the performance of modern integrated circuits, and reviews preliminary results for a 3D microprocessor.

Interconnect Problems – Think Globally

While transistor switching performance has continued to improve by roughly a third with each fabrication process generation, the metal interconnects, the wires that connect transistors throughout the chip, have comparatively deteriorated in performance [4]. Interconnect can also account for up to a third of the power consumed by a modern microprocessor. Indeed, the energy required to drive an operand across a chip’s wires can dwarf the energy the computation logic needs to operate on it [5]. There have been isolated one-time improvements to interconnects, such as slightly better insulating materials between interconnect layers [6] to reduce parasitic capacitance. The switch from aluminum to copper interconnects reduced resistance, which similarly improved interconnect performance. However, the outlook for wire performance is clear: it gets worse with each process generation.

The latency of an on-chip wire is governed by the product of its resistance and capacitance, the RC delay, so signals propagate at only a fraction of the speed of light. A wire’s RC propagation delay grows quadratically with its length (i.e. a wire that is twice as long might have an RC delay 4x larger, or more). As process feature sizes shrink, the capacitance of shrunken wires decreases marginally. However, the cross-section of the wire is cut in half, which doubles the resistance, effectively doubling the propagation delay. Functional unit blocks will also shrink, which reduces the length of the local intra-block interconnect (e.g. wires between different stages of a multiplier). This tends to mitigate latency increases, at least for the local wires. Yet constant wire latency in the presence of faster transistor switching effectively increases interconnect latency in relative terms. This comparative increase is tolerable for intra-block wires, as they are very short and contribute little latency to the overall cycle time of the chip. It is in the inter-block and especially the upper level global on-chip interconnect that the majority of the increasing delay is encountered. Assuming a constant microprocessor die size between process generations, the latency of the global wires that have to travel the length of the chip could double with each shrink. Combined with transistors that switch a third faster, this would triple the relative difference between global wire and transistor performance every process generation, reducing the chip area a global signal can reach within each shrinking clock cycle, as illustrated in Figure 1.
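
The scaling argument above can be sketched numerically. This is a minimal illustration in Python, assuming a simple distributed RC (Elmore-style) delay model with arbitrary units; the function name and the unit values are hypothetical, and the per-generation factors are the article's round numbers rather than measured data.

```python
def wire_rc_delay(length, r_per_len, c_per_len):
    """Distributed RC delay of an unbuffered wire (Elmore approximation).

    Total resistance and total capacitance both grow with length, so the
    delay grows with length squared. The 0.5 factor is the usual constant
    for a distributed RC line; units here are arbitrary.
    """
    return 0.5 * (r_per_len * length) * (c_per_len * length)


# Doubling the length quadruples the delay.
base = wire_rc_delay(1.0, r_per_len=1.0, c_per_len=1.0)
doubled = wire_rc_delay(2.0, r_per_len=1.0, c_per_len=1.0)
length_ratio = doubled / base

# One process shrink at constant die size (assumed round numbers):
# the global wire keeps its length, its resistance per unit length doubles,
# and its capacitance is treated as roughly unchanged.
shrunk = wire_rc_delay(1.0, r_per_len=2.0, c_per_len=1.0)
wire_ratio = shrunk / base

# Transistor delay improves by roughly a third per generation.
gate_ratio = 1.0 - 1.0 / 3.0

# Growth of the wire-to-transistor delay gap per generation.
gap_growth = wire_ratio / gate_ratio

print(length_ratio, wire_ratio, round(gap_growth, 2))
```

Under these assumptions the global wire delay doubles while transistor delay falls to two thirds, which is where the threefold growth in the relative gap comes from.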


Figure 1 – Range of a wire in a single clock cycle

A time honored solution to alleviate this problem is to insert buffers and flip-flops that partition a long wire into segments and boost the signal. Since wire delay is a quadratic function of wire length, splitting a wire into two equal segments halves the total wire latency, although the buffer itself introduces a small delay. However, buffers and flip-flops are not free, as they consume additional power. The number of buffers needed to offset the interconnect-transistor disparity would grow exponentially over successive process generations, making this unsuitable as a long term solution.
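
The effect of segmenting a wire can be shown with a few lines of Python. This is only a sketch under the quadratic delay assumption stated above; the constant `k` and the `buffer_delay` value are hypothetical placeholders, not figures from any real process.

```python
def segmented_delay(length, segments, k=1.0, buffer_delay=0.0):
    """Delay of a wire split into equal segments with a buffer between each.

    k is a hypothetical constant folding resistance and capacitance per unit
    length together; each segment's delay is k * (segment length) ** 2.
    """
    segment_length = length / segments
    wire_delay = segments * k * segment_length ** 2
    return wire_delay + (segments - 1) * buffer_delay


unbuffered = segmented_delay(10.0, segments=1)
two_segments = segmented_delay(10.0, segments=2)
with_buffer_cost = segmented_delay(10.0, segments=2, buffer_delay=5.0)

print(unbuffered, two_segments, with_buffer_cost)  # 100.0 50.0 55.0
```

Two half-length segments each cost a quarter of the original delay, so the wire portion halves, and the buffer adds back its own small delay on top.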


Figure 2 – 65nm generation diagonal interconnect test chip (Source: Applied Materials)

A proposed geometric improvement [7] for intermediate and global wires is to run wires diagonally across the chip (Figure 2) to connect logic blocks directly. Traditional designs use Manhattan routing, with alternating horizontal and vertical wires for each layer of metal in the chip. Diagonal routing shortens such wires by up to thirty percent and, because delay is quadratic in length, can cut the interconnect latency along these wires roughly in half. However, volume manufacturing and yield concerns have limited the appeal of this one time design improvement. The Electronic Design Automation (EDA) tools needed to support it have not matured to a level where the major semiconductor companies are comfortable using it.
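
The thirty percent and one half figures follow from simple geometry. The Python sketch below assumes the best case for diagonal routing, two blocks at opposite corners of a square region joined by a single 45 degree wire, and an unbuffered wire whose delay scales with length squared; the function names and coordinates are illustrative only.

```python
import math


def manhattan_length(dx, dy):
    """Wire length when routing is restricted to horizontal and vertical runs."""
    return abs(dx) + abs(dy)


def diagonal_length(dx, dy):
    """Wire length for a straight corner-to-corner route (best case for diagonals)."""
    return math.hypot(dx, dy)


# Best case: the two blocks sit at opposite corners of a square.
dx, dy = 10.0, 10.0
length_ratio = diagonal_length(dx, dy) / manhattan_length(dx, dy)
saving = 1.0 - length_ratio

# Unbuffered RC delay scales with length squared.
delay_ratio = length_ratio ** 2

print(round(saving, 3), round(delay_ratio, 3))  # 0.293 0.5
```

Wires that are not corner-to-corner save less, which is why the text says "up to" thirty percent.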

More Interconnect Tricks

The last, and most commonly used, method of mitigating the interconnect problem has been to add layers [4] of metal to the interconnect stack with each process shrink, as shown in Figure 3. Note that in Figure 3, the even numbered metal layers run horizontally across the image, while odd numbered layers run perpendicular to the image. More metal wires increase active power consumption, however, as more signal transitions occur. More importantly, additional metal layers require extra processing steps on the wafers, which increase fabrication costs. Extra processing steps also reduce yields (the number of good die per wafer), as adding extra metal layers increases the wafer defect rate, which in turn drives up overall cost for the end product. The new interconnect layers are usually inserted at the bottom, nearest to the functional unit blocks at the local level, with dimensions shrunk to match the scaling of the transistors.


Figure 3 – Interconnect scaling with additional metal layers M1 and M2 added at bottom

Metal layers are deposited one at a time on top of the silicon substrate. The wire dimensions and cross-sectional area increase with each successive layer. The global wires, which sit highest in the stack, are the thickest and therefore have the least resistance and latency per unit length. Of course, these highest quality wires must travel the farthest, as they are used to distribute the global clock and power to the functional unit blocks. Since new, finer metal layers are added at the bottom, the upper metal layers generally keep the dimensions and wire spacing of the last process generation. As a result, the latency characteristics of the global wires are the same as for the prior generation process.


Figure 4 – 130nm process generation with 6 layers of copper interconnect (Source: Intel)

Unfortunately, adding more metal layers is still not enough to appreciably reduce the growing mismatch in performance between transistors and global interconnect. The amount of global interconnect on die remains constant while the number of transistors doubles per generation. In some cases global wires can be moved further apart from one another, which reduces the parasitic capacitance and increases signal propagation performance. However, this significantly decreases global metal density, whereas transistors are always getting denser with each process generation. The effect is that chips are made unnecessarily large to accommodate global interconnect. The separation between the functional unit blocks and the intermediate wires above will grow, analogous to suburban sprawl, diminishing transistor density and negatively affecting Moore’s Law. To a certain degree this is occurring now with increasingly interconnect-constrained designs.

The Third Dimension

Modern integrated circuits are already three dimensional in nature. The various layers of interconnect tower over the transistor devices underneath, each layer of metal larger than the previous one, as shown in Figure 4. Yet the transistors remain in a planar configuration; only the connections between them are stacked. The innovation in the third dimension is using multiple layers of silicon substrate, each one containing transistors, arranged in a stacked configuration, one on top of the other. The interconnect wires run on top of each substrate as in conventional chips, but also tunnel directly through layers in a vertical fashion [8], as shown in Figure 5.


Figure 5 – Two layer stack, face to back, with through die vias tunneling between layers

The three dimensional integration of multiple device layers results in several advantages over the present regime. The chief benefit is that the interconnects between blocks are shorter, in some instances considerably so, as illustrated in Figure 6. This lowers power dissipation since fewer buffers and flip-flops are needed. Reducing the amount of metal that runs across the chip also reduces power dissipation. Lower inter-block latency reduces cycle time, increasing frequency and chip performance. Stacking layers also increases chip density, as more transistors are able to be placed per unit of volume and within one clock cycle of each other. Cost reduction is a byproduct of this as fewer pins are needed per chip to communicate with other nearby chips, compared to the prior arrangement, simplifying packaging.


Figure 6 – Wires running across opposite ends of planar chip can be run directly between stacked functional units, reducing millimeters of metal to micrometers

Three dimensional integration has also resulted in a reinterpretation of Rent’s Rule [9], which has traditionally been used to analyze planar integrated circuits, to better predict interconnect complexity, power dissipation, and cost for various circuit types. The final advantage of three dimensional integration is the ability to stack multiple heterogeneous device layers [8] on top of each other, as illustrated in Figure 7. One logic optimized IC can be stacked on a memory optimized chip, which can itself be stacked on top of a mixed-signal IC. Traditionally, incorporating distinct types of integrated circuits (e.g. logic, DRAM, analog) on one die requires compromising the performance of each type. With three dimensional integration, each layer can be individually optimized for performance (however performance is defined, be it density, frequency or precision), without the compromises needed to share the same layer. Thus the best of all these various process-optimized disciplines can be integrated in a single stack.


Figure 7 – Complete system integration with heterogeneous 3D device stack

While all these advantages are significant, three dimensional integration also has its drawbacks. Overall power consumption is reduced since less interconnect is used; however, power density can increase in parts of the 3D integrated circuit. Without careful attention early in the design and simulation of the chip, the resulting temperatures could reach unacceptable levels, affecting device reliability and requiring expensive cooling solutions.

3D Manufacturing Options

The most promising vertical interconnect strategies involve through die vias, particularly through silicon vias, which promise the highest vertical interconnect density. Vias are the short vertical wires between layers of interconnect, connecting planar wires. Figure 8 below shows a variety of vias from a scanning electron microscope.


Figure 8 – (a) Cross section of ~1.6um high via, (b) cleaved SEM image of isolated via, (c) cleaved SEM image of ~175nm diameter vias (Source: IBM)

There are two primary methods of manufacturing three dimensional chips [10]. The ‘bottom up’ wafer fabrication method builds silicon layers sequentially on top of each other. The first layer is formed and transistor devices are fabricated, followed by the deposition of the second layer and the subsequent fabrication of its devices. This method requires substantial changes to the manufacturing process. There are also concerns about the reliability of devices fabricated in later layers deposited on top of the earlier ones. One advantage of this method is that the size of the inter-layer vias can scale down with the transistor devices.

On the other hand, the ‘top down’ wafer fabrication method manufactures each layer separately and afterwards bonds them together. This is a more popular method for several reasons. Each wafer layer is qualified separately, and if a wafer meets quality criteria, it can then be assembled together with another already qualified wafer. Another advantage is that heterogeneous silicon layers, each one optimized for separate process functions, can be combined together. For example, one layer could be designed for memory density, while another is targeted at logic performance. One notable drawback of the ‘top down’ method is that the size of the inter-layer vias is not expected to scale at the same rate as the transistor devices. Even in the best case, vias cannot decrease below one micrometer in width, because of inter-layer bonding alignment tolerances. However, this fabrication method requires the fewest changes to existing processes, minimally perturbing manufacturing costs.

In the top down method, wafers and dies can be bonded face to face, or face to back. In the straightforward face to face bonding method, the tops of two layers are stacked facing each other, with their interconnect layers exposed and connected by vias. This results in the smallest possible inter-layer distance and hence smaller via sizes. In the more general face to back bonding approach, each layer is stacked on top of another, all having the same orientation. The distance between layers is larger and the vias must be larger as they have to go through the silicon substrate of each layer, forming a direct vertical interconnection. The wafer layers can be thinned, achieving better electrical characteristics and control.

The thicker vias do take away surface area from transistors. However this is not expected to be a problem, since transistors in current designs are not arranged densely due to global interconnect issues. The transistors could be densely arranged around the vias to form islands of logic. For three dimensional stacking with more than two layers, face to back bonding is the only viable approach in unlocking the true promise and full benefits of the third dimension.

3D Bonding and Yield

The final 3D IC fabrication issue facing semiconductor engineers is manufacturing yield and testing. The 3D stack can be assembled at the wafer scale, before testing, or at the tested die level. The wafer to wafer bonding technique assembles whole wafers to each other resulting in high manufacturing throughput [11]. Afterwards, the processed wafers are sliced and tested. Overall, wafer to wafer bonding should have a positive effect on yield if designers use 3D integration to reduce the size of each individual die.

Wafer to wafer bonding of large dice has a detrimental effect on yield: the more wafers that are stacked together, the more likely the whole chip stack is to be ruined by one bad layer. Assuming that the stacked dice are about the same size, with a die yield of 80%, a two die stack will yield at 64%, not counting stacks lost due to attachment. So designers are not likely to view 3D integration as a tool to substantially increase die size for integrated circuits.
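
The 64% figure is just the product of the per-die yields. A small Python sketch, assuming defects on different layers are independent and that every die has the same yield; the `attach_yield` parameter is a hypothetical knob for bonding losses, which the article's number leaves out.

```python
def stack_yield(die_yield, layers, attach_yield=1.0):
    """Yield of a stack built from untested dice of equal per-die yield.

    Assumes defects on different layers are independent. attach_yield is a
    hypothetical per-bond success rate; a stack of n layers has n - 1 bonds.
    """
    return (die_yield ** layers) * (attach_yield ** (layers - 1))


two_layer = stack_yield(0.80, layers=2)
three_layer = stack_yield(0.80, layers=3)

print(round(two_layer, 3), round(three_layer, 3))  # 0.64 0.512
```

Each additional untested layer multiplies in another factor below one, so stack yield falls quickly unless the individual dice get smaller or are tested first.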

However, the resulting stacked dice do not need to be the same size as the original. If the designers stack two dice that are each about half the size of the original integrated circuit, yields will actually improve. Specifically, breaking a planar die into multiple smaller pieces will increase the yield of the smaller dice, since more candidate dice can fit on a given wafer. Because there are more candidate dice per wafer, the number of defects (which should be roughly constant) will affect a smaller percentage of the overall number of candidate dice. Figure 9 below illustrates the benefits of wafer to wafer bonding on yield for a simple two layer stack. Note that the benefits of additional layers decrease (even with perfectly sub-divisible integrated circuits), and more layers will eventually reduce yield because the attachment process is not perfect. For example, a 100 layer stack of 1mm² dice would certainly have lower yields than 100 wafers of 100mm² integrated circuits. In reality, the optimal number of layers depends on how evenly the integrated circuit can be divided and on the yield of the attachment process.


Figure 9 – Yield impact of wafer to wafer bonding

Die to die bonding stacks known good dice together, ensuring that manufacturing yield is controlled at a sufficiently high level before attachment. The manufacturing throughput is reduced since individual dice are processed rather than whole wafers. However, the yields are higher than with wafer to wafer bonding. In the example in Figure 9, a two layer stack with 2 defects per wafer produced 24 out of 28 good stacks for wafer to wafer bonding. Assuming the same situation, die to die bonding would have even better yields: 26 out of 28 good stacks.
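
The 24 versus 26 comparison can be reproduced by counting. The Python sketch below uses the wafer size and defect count quoted above, but the specific defect positions are hypothetical, chosen so that the defects on the two wafers land on different die sites; attachment losses are ignored in both cases.

```python
def wafer_to_wafer_good_stacks(bad_sites_top, bad_sites_bottom, sites_per_wafer):
    """Stacks that survive when two untested wafers are bonded site to site.

    A stack is lost if either of its two dice is defective.
    """
    lost = set(bad_sites_top) | set(bad_sites_bottom)
    return sites_per_wafer - len(lost)


def die_to_die_good_stacks(bad_sites_top, bad_sites_bottom, sites_per_wafer):
    """Stacks that can be built when dice are tested first and only good dice are paired."""
    good_top = sites_per_wafer - len(set(bad_sites_top))
    good_bottom = sites_per_wafer - len(set(bad_sites_bottom))
    return min(good_top, good_bottom)


# Hypothetical defect positions: 28 die sites per wafer, 2 killer defects per
# wafer, landing on different sites on each wafer.
sites = 28
top_bad = [3, 17]
bottom_bad = [8, 22]

w2w = wafer_to_wafer_good_stacks(top_bad, bottom_bad, sites)
d2d = die_to_die_good_stacks(top_bad, bottom_bad, sites)

print(w2w, d2d)  # 24 26
```

Wafer to wafer bonding loses a stack for every defective site on either wafer, while die to die bonding only loses the defective dice themselves, since each good die can be paired with another good die.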

The overall impact of 3D integration is that the sweet spot for integrated circuits is likely to change a bit. Executives from Intel and AMD have noted that generally, 100-160mm² is ideal for high volume microprocessors. With 3D integration, the sweet spot is likely to change to 50-80mm² for two layer stacks, or 33-53mm² for a three layer stack, etc. The overall area for integrated circuits will probably stay the same or slightly increase, but the area for a given layer will decrease substantially.

3D Processor Design

Intel conducted two system level experiments breaking up an existing microprocessor into two stacked dice, addressing both the cache memory and the processor core itself [12]. One experiment was a simulation driven study, while the other was an actual working stacked silicon device; we will focus on the motives, results and implications of the manufactured device.

Current planar microprocessors are exceedingly complex designs, composed of many functional blocks interconnected through global wires. The more complex the design, the larger the surface area of the processor core, which increases the average global interconnect length and hence, signal propagation time. Modern processors have devoted whole pipeline stages to driving signals across the chip [13]. Unfortunately, longer pipelines reduce overall performance by increasing the number of in-flight instructions, and hence, the branch misprediction penalty.


Figure 10 – Planar Pentium 4 layout illustrating signals driving operands between and across functional unit blocks (Source: Intel)

The Intel Pentium 4 processor used in the experiment, a very high frequency design with a thirty stage misprediction pipeline, is shown in Figure 10. The processor was broken into two smaller dice (Figure 11), each half the size of the original, stacked on top of each other in a face to face arrangement. Because the Pentium 4 processor core is primarily composed of logic blocks, a minimal three dimensional arrangement of two stacked dice was sufficient to capture most of the benefits of 3D IC without compromising the resulting design due to power density issues. Logic elements switch and consume power at a much higher rate than memory circuitry. This presented a problem, since existing thermal issues could be exacerbated by hot spots in the design. To remain within the thermal limit of the original design, very active power regions could not be placed on top of each other, as this would increase power density and temperature. Blocks had to be carefully arranged to complement each other, power wise.

Intel has one chief goal for 3D integration: reducing the length of metal interconnects. Face to face stacking was therefore chosen, as it minimizes the inter-die interconnect distance. It reduces the length and latency of the inter-die vias, as well as their width, since they do not have to tunnel through the silicon substrate of each die as in a face to back arrangement. This denser arrangement places more transistors within a clock cycle of each other, reducing global metal interconnect latency as a proportion of cycle time, as well as improving overall power consumption. Reducing the metal wiring between functional unit blocks results in a processor design that is limited more by transistor switching than by interconnect delay. In this particular instance, Intel achieved both higher frequency operation and fewer pipeline stages [14]. The shorter pipeline decreases the branch misprediction penalty and therefore increases efficiency and performance. Power dissipation is also reduced as a result of lower wire capacitance and fewer repeaters associated with global metal interconnects. Additionally, the latches and flip-flops for the removed pipeline stages can be eliminated.

3D Processor Design Results

Latency between certain performance critical blocks was reduced by judiciously stacking these blocks close together, resulting in higher performance. For instance in the planar Pentium 4, the L1D cache was placed beside the functional units. The worst case operand latency is when the operand traverses from the far end of the data cache to the farthest functional unit. In the stacked Pentium 4 implementation, the functional units are placed right under the center of the data cache. This reduces wire length and latency, enabling one pipeline stage to be eliminated in a performance critical part of the design. Another pipeline stage was removed in the floating point cluster. In the planar implementation, a register file has to drive its operands not only to the multimedia (SIMD) unit, but across it and into the input of the floating point unit as well. The floating point unit requires two more cycles than necessary to access its operands as a result of this arrangement. In the 3D redesign, the multimedia unit is left beside the register file on the bottom stack, while the floating point unit is placed directly over the register file, on the top stack, removing two clock cycles due to wire delay and reducing the access latency [12]. As a result, both units have optimal access to the register file without penalizing either use case.


Figure 11 – 3D Pentium 4 with data cache and FP unit on top; SIMD unit and register file on bottom (Source: Intel)

Besides stacking whole functional units on top of one another, another technique Intel demonstrated is splitting the larger units into smaller pieces, with each slice occupying a layer of the stacked implementation. The benefit of this approach is reduced intra-block latency and power consumption, which complements the reduced inter-block wires and power savings. The Pentium 4’s large 1MB L2 cache was split into two sets of smaller sub-arrays, which reduced cache line read latency by 25% and cache power dissipation by 20%. The hottest block in the planar Pentium 4 is the dynamic instruction scheduler, which chooses ready instructions to execute in any given cycle. The instruction scheduler was split in a manner that greatly shortened critical internal wires, resulting in a 15% shorter access latency. Since the instruction scheduler is a critical part of the design, intricately timed to require a small fraction of the stages of the overall pipeline, the reduction in latency was not used to eliminate pipe stages but to allow a less aggressive circuit implementation. The scheduler circuitry was converted from dynamic to static logic, eliminating half the power dissipation of the block, with similar power density and block-level performance.


Chart 1 – Temperature and Power of 2D and 3D Pentium 4

The redesigned 3D layout of the Pentium 4 eliminated approximately a quarter of the pipeline stages from the final design, as shown in Table 1. These pipeline stages were extraneous, as they were used purely to drive signals across wires from one unit to another in the planar implementation. This pipeline compaction improved single threaded performance by roughly 15%, leading to a much more efficient design. Since most of the pipeline stages removed were dominated by global interconnect, the new 3D design halved the number of repeaters. In conjunction with more efficient intra-block interconnect, these two improvements achieved a 15% decrease in total power consumption [14]. Substantially increasing performance while decreasing power is an unusual feat in modern microprocessor design. Most improvements, such as out-of-order execution or multithreading, increase performance, but add to power consumption. This simple redesign of the Pentium 4 leads to a 35% improvement in efficiency, as measured by performance per watt. That is the beauty of three dimensional integration: in many cases the designer can have the best of both worlds.
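
The 35% efficiency figure follows from the two 15% numbers above. A minimal check in Python, where the function name is illustrative and the inputs are the article's rounded percentages:

```python
def perf_per_watt_gain(perf_ratio, power_ratio):
    """Relative change in performance per watt, given new/old ratios."""
    return perf_ratio / power_ratio - 1.0


# 15% more performance at 15% less power, as quoted in the text.
gain = perf_per_watt_gain(perf_ratio=1.15, power_ratio=0.85)

print(round(gain * 100))  # 35
```

Because the performance gain and the power reduction multiply rather than add, two 15% improvements combine into roughly a 35% improvement in performance per watt.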


Table 1 – Selected pipeline stage reductions and related performance improvements

The power consumption of the planar Pentium 4 is 147W, while the stacked implementation consumes 125W. However, the peak power density of the 3D design did increase. The hottest part of the processor rose from 99 degrees centigrade to 113 degrees, as displayed in Chart 1. This double digit thermal increase at the hotspot is problematic, since it will likely impact the operating reliability and long term durability of the integrated circuit.

In modern microprocessors, power dissipation is dominated by dynamic power, which is the power expended charging and discharging the parasitic capacitance of transistors and interconnect. Dynamic power is proportional to the product of the switching capacitance, the frequency, and the square of the supply voltage, so a small decrease in frequency enables the supply voltage to be reduced, which results in a disproportionately large decrease in power dissipation.
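
The disproportionate saving is easy to see with numbers. The Python sketch below applies the proportionality described above in arbitrary units; the 10% operating point is a hypothetical example, and the assumption that supply voltage can be lowered by the same fraction as frequency is a simplification, not a figure from Intel's experiment.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power, proportional to C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


baseline = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Hypothetical operating point: frequency and supply voltage both lowered by 10%.
scaled = dynamic_power(capacitance=1.0, voltage=0.9, frequency=0.9)

relative_power = scaled / baseline

print(round(relative_power, 3))  # 0.729
```

In this simplified model, giving up 10% of the frequency removes about 27% of the dynamic power, because the voltage term is squared.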

In order to reach neutral thermals, the frequency of the 3D Pentium 4 processor was reduced, with a corresponding drop in its supply voltage. The net result was a 97W thermal envelope, a total power reduction of one third compared to the original design. More importantly, the peak hotspot temperature of the stacked implementation was brought down to approximately 99 degrees centigrade. Even with these further reductions in frequency, the stacked implementation ran a modest 8% faster than the original planar design [12]. If low power dissipation were the ultimate goal, frequency could be scaled down further. The designers estimated that they could achieve the same performance as the planar design while cutting total power consumption by more than half.

Conclusion

As interconnect scaling problems exacerbate power, performance, and yield issues for current MPU designs, architects and manufacturers have sought new methods to surmount these obstacles. While certain techniques can alleviate some of the problems in the short term, they come with their own problems and are not a permanent solution for interconnect performance.

Three dimensional stacked integration presents a unique and novel solution to interconnects that promises to simultaneously improve performance, while reducing power and silicon real estate of future integrated circuits. Three dimensional stacking also permits heterogeneous layers to be integrated, each optimized for their unique functions, making novel integrated circuit applications possible. The increased density of stacked silicon designs can be achieved while mitigating device yield concerns and delivering on the promise of Moore’s Law for the immediate future.

