[ieee comput. soc. press 1996 ieee multi-chip module conference (cat. no.96ch35893) - santa cruz,...

Chip and Package Co-Design Technique for Clock Networks

Qing Zhu* Wayne W.M. Dai

RN5-09, Microprocessor Technology

Intel Corporation, Santa Clara, CA 95052

Computer Engineering Department

University of California, Santa Cruz, CA 95064

qzhu@scdt.intel.com dai@cse.ucsc.edu

Abstract

This paper presents the motivation and a case study for a new clock distribution technique: routing the global clock on the package. The technique can be used in single chips and multichip modules based on the area I/Os of flip chip technology. Because the interconnect resistance on package layers is 2-4 orders of magnitude lower, the clock skew and path delay of the clock network are significantly reduced.

1. Interconnect Scaling Property

In deep submicron chips, the delay of long interconnects (e.g. the clock net) will limit chip performance. For a multilayer embedded microstripline, the capacitance per unit length and the inductance per unit length are scale invariant:

$$ C = \varepsilon \frac{w}{t_i}, \qquad L = \mu \frac{t_i}{w} \qquad (1) $$

where $w$ is the line width, $t_i$ the dielectric thickness, $\varepsilon$ the dielectric permittivity, and $\mu$ the dielectric permeability. Therefore, if we uniformly scale the interconnect cross-section of a line as well as the dielectric thickness, the capacitance per unit length and the inductance per unit length remain the same. This ideal scaling considers only the line-to-ground capacitance and neglects the fringing capacitance and the line-to-line coupling capacitance. For small line widths and small spacing between lines, the fringing and coupling capacitance can easily dominate the line-to-ground capacitance.

The resistance per unit length is inversely proportional to the area of the line cross-section:

$$ R = \frac{\rho}{w \, t_m} \qquad (2) $$

where $\rho$ is the metal resistivity and $t_m$ the metal thickness. The interconnect delay is composed of two terms, $\tau_f$ and $\tau_{RC}$. The $\tau_f$ term is the time-of-flight delay, which does not depend on the area of the line cross-section but is set by material parameters and is proportional to the line length. For a line of length $l$, the time-of-flight delay is given by
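As a sketch of the expression this sentence leads into, assuming the standard lossless transmission-line result and substituting the per-unit-length capacitance and inductance from (1) (the symbol $\tau_f$ for the time-of-flight delay is a notational assumption):

$$ \tau_f = l \sqrt{LC} = l \sqrt{\mu \varepsilon} $$

The line width and both thicknesses cancel, which is why this term depends only on the material parameters and the length.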

*This work was done as part of the first author's Ph.D. research at the Computer Engineering Department of the University of California, Santa Cruz. It was supported in part by Intel Corporation and in part by a National Science Foundation Presidential Young Investigator Award.

On the other hand, the $\tau_{RC}$ term is the distributed RC delay, which is inversely proportional to the area of the line cross-section and proportional to the square of the line length [1]:
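A sketch of the corresponding expression, obtained by multiplying the per-unit-length resistance (2) by the per-unit-length capacitance (1). The order-one constant, which depends on the chosen delay threshold, is left out here as an assumption, so only the proportionality is shown:

$$ \tau_{RC} \propto R \, C \, l^2 = \frac{\rho}{w \, t_m} \cdot \frac{\varepsilon w}{t_i} \, l^2 = \frac{\rho \, \varepsilon \, l^2}{t_m \, t_i} $$

Under this form, scaling the cross-section and the dielectric thickness uniformly by a factor $s$ divides the distributed RC delay by $s^2$ at fixed length.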

0-8186-7286-2/96 $05.00 © 1996 IEEE

http://scdt.intel.com

The interconnect scaling properties have two major implications on the interconnect design:

(a) With all other factors the same, a thicker film results in lower signal delay.

(b) For a long line, the distributed RC term dominates the interconnect delay [1]. Long lines should be placed on the layer with the smaller RC parameters.

The package layers provide much wider and thicker interconnects, usually 1-2 orders of magnitude larger in scale than the interconnects on the chip layers. While the unit length inductance and capacitance of a line on the package are similar to those of a line on the chip, the unit length resistance is about 2-4 orders of magnitude lower. This suggests that it is more beneficial to place long lines on the package layer than on the chip layer.
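The scaling argument can be checked numerically. The following is a minimal sketch under the ideal parallel-plate assumptions of equations (1) and (2); the geometry, the relative permittivity, the resistivity, and the scale factor of 30 are illustrative values, not data from the paper.

```python
import math

EPS0 = 8.854e-12       # vacuum permittivity, F/m
MU0 = 4e-7 * math.pi   # vacuum permeability, H/m


def line_parameters(w, t_i, t_m, eps_r=3.9, rho=2.7e-8):
    """Per-unit-length C, L, R from equations (1) and (2).

    Ideal model only: fringing and line-to-line coupling are ignored.
    eps_r and rho are assumed, illustrative material values.
    """
    c = eps_r * EPS0 * w / t_i   # F/m, equation (1)
    l = MU0 * t_i / w            # H/m, equation (1)
    r = rho / (w * t_m)          # ohm/m, equation (2)
    return c, l, r


# Hypothetical on-chip geometry (metres) and the same geometry scaled 30x.
scale = 30.0
chip = line_parameters(w=1e-6, t_i=1e-6, t_m=1e-6)
package = line_parameters(w=scale * 1e-6, t_i=scale * 1e-6, t_m=scale * 1e-6)

c_ratio = package[0] / chip[0]   # capacitance is unchanged by uniform scaling
l_ratio = package[1] / chip[1]   # inductance is unchanged by uniform scaling
r_ratio = chip[2] / package[2]   # resistance drops by scale squared
print(c_ratio, l_ratio, r_ratio)
```

With a scale factor between 10 and 100 the resistance ratio falls between 100 and 10,000, which matches the 2-4 orders of magnitude quoted in the text.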

2. Implications from RC Tree Delay

The delay of an interconnect RC tree can be calculated to first order by the Elmore delay model [2]. Formally, the Elmore delay $d(s,t)$ from the source $s$ to a sink $t$ in an interconnect tree $T$ is calculated as follows:

$$ d(s,t) = R_d (C_d + C_0) + \sum_{e_i \in \mathrm{path}(s,t)} r_i \left( \frac{c_i}{2} + C_i \right) \qquad (5) $$

Here, $R_d$ and $C_d$ are the on-resistance and on-capacitance of the driver at the source; $C_0$ is the total capacitance of the lines and sinks of $T$; $e_i$ is the line from node $n_i$ to its parent node in $T$; $r_i$ and $c_i$ are the line resistance and line capacitance of $e_i$; and $C_i$ is the total capacitance of the lines and sinks in the subtree of $T$ rooted at node $n_i$. In (5), let $d_1(s,t)$ denote the first term and $d_2(s,t)$ the second term of $d(s,t)$. $d_1(s,t)$ can be reduced by decreasing the total wire length or total wire capacitance of $T$. The delay $d_2(s,t)$ has two implications for the design of an interconnect RC tree:

(a) In an RC tree, interconnect resistance increases the path delay more when more of the total resistance lies closer to the source than farther from it.

(b) In an RC tree, interconnect capacitance increases the path delay more when more of the total capacitance lies farther from the source than closer to it.
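Implication (a) can be illustrated with a direct evaluation of equation (5). This is a minimal sketch: the tree encoding, the function name `elmore_delay`, and the component values (arbitrary units) are hypothetical and chosen only to make the effect visible.

```python
def elmore_delay(tree, sink, r_d, c_d):
    """Elmore delay from the source to `sink`, following equation (5).

    tree maps node -> (parent, r_i, c_i, load). The edge e_i joins the node
    to its parent; load is the sink capacitance at the node (0 if none).
    The source is the only node whose parent is None.
    """
    children = {}
    for node, (parent, _, _, _) in tree.items():
        children.setdefault(parent, []).append(node)

    def subtree_cap(node):
        # C_i: the load at n_i plus every line and load below n_i.
        total = tree[node][3]
        for child in children.get(node, []):
            total += tree[child][2] + subtree_cap(child)
        return total

    root = children[None][0]
    c_0 = subtree_cap(root)             # all line and sink capacitance of T
    delay = r_d * (c_d + c_0)           # first term, d1(s, t)
    node = sink
    while tree[node][0] is not None:    # walk the path from sink to source
        _, r_i, c_i, _ = tree[node]
        delay += r_i * (c_i / 2 + subtree_cap(node))   # second term, d2(s, t)
        node = tree[node][0]
    return delay


# Two-segment chain s -> a -> t with the same total resistance (10) and the
# same capacitances; only the placement of the resistance differs.
near_source = {"s": (None, 0.0, 0.0, 0.0),
               "a": ("s", 9.0, 1.0, 0.0),
               "t": ("a", 1.0, 1.0, 1.0)}
near_sink = {"s": (None, 0.0, 0.0, 0.0),
             "a": ("s", 1.0, 1.0, 0.0),
             "t": ("a", 9.0, 1.0, 1.0)}

d_near_source = elmore_delay(near_source, "t", r_d=1.0, c_d=1.0)
d_near_sink = elmore_delay(near_sink, "t", r_d=1.0, c_d=1.0)
print(d_near_source, d_near_sink)
```

The resistance placed next to the source is multiplied by a larger downstream capacitance, so the first configuration has the larger delay even though both trees contain the same total resistance and capacitance.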

For clock distribution, this means we want to reduce the resistance of the global clock tree, which is closer to the clock source. Therefore, we assign the global clock tree to the package layer, where the interconnect resistance is much smaller. Meanwhile, we also want to reduce the total wire length (or total wire capacitance) of the local clock trees, which are farther from the clock source (closer to the clock terminals). In [3,4], we construct the local clock trees as delay bounded Steiner trees based on the tolerable skew, instead of zero skew, to decrease the total wire length of the local clock trees.

3. Routing Global Clock on Package

A two-level clock tree hierarchy is shown in Figure 1.

[Figure 1: clk source, driver, global clock tree, buffers, and local clock trees]

Figure 1. Two-Level Clock Tree Hierarchy


The clock netlist is partitioned into a set of clusters of clock terminals, and a local buffer is inserted at every cluster. The first-level tree, or global clock tree, connects the clock driver (source) to the local buffers, and the second-level trees, or local clock trees, connect the clock terminals within every cluster. One layer of local buffers is thus inserted between the global clock tree and the local clock trees.
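The text does not state how the clusters are formed. Purely as an illustration of the two-level structure, the sketch below partitions terminals with a uniform grid and places one buffer at the centroid of each cluster; the grid rule, the function names, and the coordinates are all assumptions, not the authors' method.

```python
from collections import defaultdict


def partition_terminals(terminals, cell):
    """Group (x, y) clock terminals into square grid cells of side `cell`."""
    clusters = defaultdict(list)
    for x, y in terminals:
        clusters[(int(x // cell), int(y // cell))].append((x, y))
    return dict(clusters)


def buffer_sites(clusters):
    """Place one hypothetical local buffer at the centroid of each cluster."""
    sites = {}
    for key, pts in clusters.items():
        sites[key] = (sum(p[0] for p in pts) / len(pts),
                      sum(p[1] for p in pts) / len(pts))
    return sites


# Illustrative terminal coordinates in arbitrary units.
terminals = [(1, 1), (3, 1), (1, 3), (9, 9), (11, 9)]
clusters = partition_terminals(terminals, cell=8)
sites = buffer_sites(clusters)
print(sites)
```

The buffer sites become the sinks of the global clock tree, and each cluster's terminals become the sinks of one local clock tree.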

The layout of the two-level clock tree in a multichip module is shown in Figure 2. The global clock tree is routed on the package substrate. The local clock trees, the clock driver, and the local buffers are on the chips. Flip chip solder bumps connect the on-package global clock tree to the on-chip driver and the on-chip local buffers.

[Figure 2: two chips on a substrate; the driver, local buffers, and local clock trees are on the chips, and the global clock tree and Clk are on the substrate]

Figure 2. Routing Global Clock on Package Layer and MCM Substrate

The flip chip assembly technology, which provides multiple low-inductance and low-capacitance area I/Os, makes it feasible to place the global clock tree on the package layer. In this technology, the dice or bare chips are attached with their pads facing down via solder bumps, which form the mechanical and electrical connections to the substrate. Flip chip provides area I/Os that are distributed over the entire chip surface rather than being confined to the periphery. Compared with wire bonding and TAB, flip chip has the highest I/O density, the smallest chip size, and the lowest inductance.

Table 1 compares the unit length interconnect RC parameters of a 1.0 μm CMOS chip layer and a plastic package layer. For the same wire length, the wire resistance on the package layer is 1690 times (three orders of magnitude) smaller than the wire resistance on the M2 (metal two) layer of the chip. Meanwhile, the wire capacitance on the package layer is 10 times smaller than the wire capacitance on the M2 layer of the chip. Note that the RC parasitics of a solder bump are negligible compared with the wire RC parasitics.
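As a rough worked consequence of these two ratios, assuming the distributed RC delay of a wire scales with the product of its unit length resistance and capacitance at fixed length:

$$ \frac{\tau_{RC}^{\,\mathrm{package}}}{\tau_{RC}^{\,\mathrm{chip}}} = \frac{R_{\mathrm{package}} \, C_{\mathrm{package}}}{R_{\mathrm{chip}} \, C_{\mathrm{chip}}} \approx \frac{1}{1690} \times \frac{1}{10} = \frac{1}{16900} $$

This applies to the wire term alone; driver and load contributions to the total delay are not captured by this ratio.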

Table 1: Comparison of interconnect RC parameters between chip layer and package layer.

A case study was carried out to evaluate the performance improvement when the global clock is routed on the package. The test chip has 13440 clock loads [5]. The global (first-level) clock tree is routed on a dedicated package layer as shown in Figure 3(a), and the other parts of the clock tree are on two chip layers as shown in Figure 3(b). This design uses an H clock tree for delay balance. HSPICE simulation results are shown in Figure 3(c), where the first-level H tree is either on the package (new design) or on the chip (old design). The trunk capacitance and trunk resistance are the sums of the wire capacitances and the wire resistances in the first-level H tree. The clock delay from the clock driver to the loads is reduced by 13% when the global clock tree is routed on the package layer. The ESD protection series resistances of the 4 area clock pads are not included in the simulation.
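The delay-balancing property of an H tree comes from its symmetry: every path from the root to a leaf has the same wire length. The following sketch generates a generic symmetric H tree and records that path length; the function name, dimensions, and number of levels are illustrative and unrelated to the test chip's actual layout.

```python
def h_tree_leaves(x, y, half, levels, horizontal=True, dist=0.0):
    """Return (leaf_x, leaf_y, wire_length_from_root) for a symmetric H tree.

    Each level branches from (x, y) to two points `half` away, alternating
    between horizontal and vertical branches.
    """
    if levels == 0:
        return [(x, y, dist)]
    dx, dy = (half, 0.0) if horizontal else (0.0, half)
    # Halve the branch length after each vertical branch, once per full "H".
    next_half = half if horizontal else half / 2
    leaves = []
    for sign in (-1, 1):
        leaves += h_tree_leaves(x + sign * dx, y + sign * dy, next_half,
                                levels - 1, not horizontal, dist + half)
    return leaves


leaves = h_tree_leaves(0.0, 0.0, half=4.0, levels=4)
path_lengths = {d for _, _, d in leaves}
print(len(leaves), path_lengths)
```

With equal wire lengths and symmetric loading, the Elmore delay to every leaf is the same, which is why an H tree is a common choice for the first-level clock tree.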

[Figure 3 panels: (a) On package, (b) On chip]

(c) Simulation data

              First-level clock tree   First-level clock tree   Reduction
              on chip                  on package
Clock delay   7.26 ns                  6.33 ns                  13.0%

Figure 3. Case Study of a Test Chip

4. Concluding Remarks

Routing the global clock on the package layer provides the following advantages:

(a) It dramatically reduces the clock skew and the path delay of the clock network because of the very low interconnect resistance on the package layer.

(b) It likely reduces the capacitance of the global clock network, as shown in Figure 3(c). Lower interconnect capacitance reduces the power consumed by the clock net.

(c) It reduces the chip size by removing the global clock from the chip. Chip layout also becomes easier because the global clock network no longer has to be handled on the chip.

Two problems must be solved when the global clock is routed on the package: the chip sort test and the ESD protection of the clock buffers (area I/Os). Because the on-chip clock network is incomplete, the probe card must supply multiple clocks to the chip during the sort test. The ESD circuit is designed specifically for the area clock pads, with a smaller protection resistance and smaller diodes.

The case study suggests that the chip and the package should be designed concurrently to achieve the optimum performance for VLSI systems. This concurrent chip and package design methodology encourages VLSI designers to incorporate the package in the early stages of the design flow for better clock and interconnect performance. The development of CAD tools for chip and package co-design is expected.

References

[1] R.C. Frye, "Physical Scaling and Interconnect Delay in Multichip Modules," IEEE Transactions on Components, Packaging, and Manufacturing Technology, Part B: Advanced Packaging, Vol. 17, No. 1, 1994, pp. 30-37.

[2] W.C. Elmore, "The Transient Response of Damped Linear Networks with Particular Regard to Wide-Band Amplifiers," Journal of Applied Physics, Vol. 19, No. 1, 1948, pp. 55-63.

[3] Q. Zhu, J.G. Xi, W.W.M. Dai and R. Shukla, "Low Power Clock Distribution Based on Area Pad Interconnect for Multichip Modules," Proceedings International Workshop on Low Power Design, 1994, pp. 87-92.

[4] Q. Zhu, W.W.M. Dai, "Planar Clock Routing for Chip and Package Co-design," to appear in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, March 1996.

[5] Q. Zhu, "Techniques for Design of High-Speed Low-Power ASICs," LSI-LOGIC Corporation Technical Report, Summer 1994.

[6] Q. Zhu, "High Performance Microprocessor Layout Case Study Based on C4 Flip-Chip Packaging," Intel Corporation Technical Report, Summer 1993.

[7] R.C. Frye, K. Tai, M. Lau, T. Gabara, "Trends in Silicon-on-Silicon Multichip Modules," IEEE Design & Test of Computers, Vol. 10, No. 4, 1993, pp. 8-17.

[8] D. Singh, J.M. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal, N. Sehgal, T.J. Mozdzen, "Power Conscious CAD Tools and Methodologies: A Perspective," Proceedings of the IEEE, Vol. 83, No. 4, April 1995, pp. 570-594.
