Submitted to IEEE Trans. on Components, Packaging, and Manufacturing Technology
Temperature Sensor Distribution, Measurement Uncertainty, and Data Interpretation for
Microprocessor Hotspots
Josef Miler1, Keivan Etessam-Yazdani2, Mehdi Asheghi1, Maxat Touzelbaev3, and Kenneth E. Goodson1
1Dept. of Mechanical Eng., Stanford Univ., Stanford, California 94305
2 Broadcom Corporation, Santa Clara, California 95054
3 Glew Engineering Consulting Inc., Mountain View, California 94040
9 March 2012
Abstract:
Microprocessor hotspots are a major reliability concern, with heat fluxes as much as 20 times
greater than those found elsewhere on the chip. Chip hotspots also augment thermo-mechanical
stress at chip-package interfaces which can lead to failure during cycling. Because highly
localized, transient chip cooling is both technically challenging and costly, chip manufacturers
are using dynamic thermal management (DTM) techniques that reduce hotspots by throttling
chip power. While much attention has focused on methods for throttling power, relatively little
research has considered the uncertainty inherent in measuring hotspots. The current work
introduces a method to determine the accuracy and resolution at which the hotspot heat flux
profile can be measured using distributed temperature sensors. The model is based on a novel,
computationally-efficient, inverse heat transfer solution. The uncertainties in the hotspot location
and intensity are computed for randomized chip heat flux profiles for varying sensor spacing,
sensor vertical proximity, sensor error, and chip thermal properties. For certain cases the inverse
solution method decreases mean absolute error in the heat flux profile by more than 30%. These
results and simulation methods can be used to determine the optimal spacing of distributed
temperature sensor arrays for hotspot management in chips.
Nomenclature
Variables
$a$ Width of chip, m
$b$ Length of chip, m
$f_x$ Sensor spatial frequency in the x-direction, m-1
$f_y$ Sensor spatial frequency in the y-direction, m-1
$h$ Convective heat transfer coefficient, W/m2-K
$k$ Thermal conductivity, W/m-K
$N$ Number of random heat flux profiles tested
$q''$ Heat flux, W/m2
$R''_{th}$ Chip vertical thermal resistance per unit area, m2-K/W
$t_0$ Thickness of chip, m
$T$ Temperature, K
Subscripts
$c$ Circuit level
$s$ Sensor level
$L$ Low resolution
$F$ Full resolution
$E$ Includes sensor error
$i$ Index in x-direction spatial domain
$j$ Index in y-direction spatial domain
I. Introduction
As microprocessor manufacturers have adopted multi-core circuit architectures, the detection and
management of temporal hotspots have become increasingly important for chip reliability and
performance. While much attention has been given to increases in the overall chip power,
hotspot heat fluxes are increasing even more rapidly for many applications [1]. Active portions
of a microprocessor can produce as much as 20 times as much heat as inactive regions [2]. These
high heat fluxes can cause elevated junction temperatures leading to electromigration and
subsequent circuit failure. Furthermore, temperature non-uniformities in the chip can cause
severe thermo-mechanical stress on the package leading to system failure. These challenges will
be exacerbated in future processors that are expected to include many more processor cores
integrated in three-dimensional geometries.
To date, chip cooling alone does not seem capable of addressing these challenges. Most cooling
solutions are best suited to address relatively slow thermal phenomena occurring over large
regions of the chip. It is especially difficult to directly address highly localized, dynamic hotspots
with cooling solutions implemented in chip packaging. Thermal engineers are forced to
overdesign the cooling solution to satisfy worst-case conditions for a hotspot region.
This can be both difficult and expensive, particularly because the cost of cooling solutions
increases rapidly as a function of maximum local heat flux [1]. Various methods of dynamic,
localized cooling (e.g. use of Peltier devices) are being investigated to address these difficult
thermal requirements, but none have been adopted to date.
An alternative overall approach to managing chip hotspots is to regulate the chip power output to
maintain device temperature within specified limits. Such techniques are referred to as dynamic
thermal management (DTM) and have been a subject of intense investigation since being
introduced by Brooks and Martonosi [3].
All DTM techniques fundamentally involve two steps: (1) interpreting temperature data from the
chip and (2) responding to that data by reducing power. The majority of research has focused on
the latter problem for DTM, specifically on finding innovative ways to locally regulate chip
power. Proposed DTM techniques involve clock gating [4], Dynamic Frequency Control [5],
DVFS [6], SMT thread reduction [7], and activity migration [2]. Much less attention has been
given to designing temperature sensor arrays and interpreting the resulting thermal signals. Two
important sources of uncertainty need to be considered for DTM applications. First, the thermal
sensors used for DTM feedback are subject to error. Most DTM studies do not consider the
effect of this error and thus provide overly optimistic results. Skadron et al [8] demonstrated that
sensor error can cause significant performance reductions due to incorrect DTM triggering and
reduced DTM threshold levels.
Discussions of uncertainty in DTM studies are typically limited to sensor error, but additional
attention should be paid to the uncertainty caused by sensor placement. Because thermal sensors
are not necessarily located at the chip hotspot, a DTM scheme must account for the temperature
difference between the sensor location and the actual hotspot. Skadron et al [9] used an estimated
spreading factor within a core to try to account for this discrepancy as an additional source of
error. In their study, the spreading factor contributed an additional 2°C error in the temperature
signal. To attempt to account for uncertainty in hotspot location and intensity, DTM methods are
currently designed to be conservative, which causes reduced system performance.
To reduce the uncertainties associated with thermal sensing for DTM, a challenging optimization
problem must be considered. Circuit designs with high circuit density but low sensor density
suffer from increased uncertainty in the thermal profile. Increased uncertainty about hotspot
location and magnitude requires more cautious DTM control algorithms, which diminishes
performance metrics. Increasing sensor resolution improves DTM control algorithms but also
reduces circuit density, ultimately reducing computational power. An optimization approach is
required to find a design that maximizes computational power while maintaining the chip in
reliable operating conditions.
This study endeavors to help address this challenging optimization problem by quantifying the
uncertainty that should be accounted for in a DTM scheme given a particular thermal sensor
array. We consider the generalized case of a grid-array of thermal sensors located some distance
above an arbitrary heat flux profile. In order to better represent real applications, the thermal
sensors are not necessarily located directly above known heat flux peaks. The chip heat flux
profile is considered unknown, and the purpose of the thermal sensor array is to detect the
regions of the chip that require dynamic power control.
We introduce a novel, computationally efficient, inverse heat transfer solution method and
determine the accuracy to which it resolves the underlying heat flux profile. We consider cases
with varying numbers of thermal sensors located with varying proximity to the circuitry level of
the chip. Sensor error is also introduced to determine its effect on the estimated heat flux profile.
For certain cases, the inverse solution method is shown to be susceptible to temperature sensor
error. The results of these tests are compared to the uncertainty that results from treating the
unprocessed thermal signal as a representation of the heat flux profile.
The approach taken here also has implications for the use of discrete thermal data in resolving
the source of a hotspot. DTM schemes need not consider this uncertainty because the standard
response to a hotspot is to throttle all activity in the vicinity. In chip development and
production, however, thermal measurements are used to characterize the power distribution of
the circuit design. For these tests, high resolution thermometry can be used (e.g. infrared
microscopy [10]). The maximum spatial resolution at which these techniques can resolve
neighboring hotspots is dictated by the resolution of the applied thermometry technique, the
extent of thermal spreading in the chip, and the measurement error. The present study simulates
the case of distinguishing two similar hotspot sources using discrete temperature measurements.
For a given measurement error and chip configuration, there is a minimum spatial sampling
frequency required to correctly resolve the source of a hotspot.
Section II of this paper presents the overall methodology used to simulate chip heat flux profiles
and determine the uncertainty associated with a particular thermal sensor array. Section III
presents the inverse heat transfer solution method derived for this study. The uncertainties in the
heat flux profile associated with direct temperature interpretation and inverse solution method
are presented in Section IV. Section V provides concluding remarks.
II. Methodology
A. Overall Simulation Methodology
The present study is based on a simplified conduction model for the chip. Figure 1 shows the
model geometry. The chip is modeled as an isotropic, single-layer structure. The isotropic
condition can be relaxed by transformation of the thermal conductivity and chip thickness [11].
The boundary condition on the top surface is convective heat transfer with a uniform heat
transfer coefficient. The boundary conditions on the four sidewalls are adiabatic. On the bottom
surface, an arbitrary heat flux profile boundary condition is applied. The chip is 1 cm by 1 cm, and
its thickness and thermal conductivity are varied in the simulations. The system operates in
steady-state. This simplified model of the chip facilitates the generalized simulation
methodology taken in this study which would be impractical with a highly discretized chip
model.
Figure 2 shows the four main steps involved in the overall simulation methodology. The
simulation begins by defining the geometry and system parameters and generating a randomized
heat flux profile. The forward solution method is used to resolve the sensor-level, full-resolution
temperature profile, $T_{s,F}$ (Figure 2b), based on the circuit-level, full-resolution heat flux
profile, $q''_{c,F}$ (Figure 2a). A set of low-resolution temperature profiles, $T_{s,L}$ (Figure 2c), is created
by interpolating the full-resolution temperature profile, $T_{s,F}$, at various spatial frequencies. Each
low-resolution temperature profile represents the temperature profile that would be measured by
a temperature sensor array of a particular spatial frequency. For example, for a temperature
sensor spatial sampling frequency of 1000 m-1 (equivalent to a nominal sensor spacing of 1 mm),
the low-resolution temperature profile, $T_{s,L}$, is a 10x10 grid on a 1 cm by 1 cm chip.
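The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_sensor_grid`, the use of separable linear interpolation, and the placement of sensors at cell centers are choices made here, since the text does not specify the interpolation scheme.

```python
import numpy as np

def sample_sensor_grid(T_full, chip_size_m, f_sensor_per_m):
    """Interpolate a square full-resolution map onto a coarser sensor grid.

    Assumption: sensors sit at the centers of equal square cells and the
    field is interpolated linearly (hypothetical choices for illustration).
    """
    n_full = T_full.shape[0]
    # Sensors per side, e.g. 1000 m^-1 * 0.01 m = 10.
    n_sens = int(round(f_sensor_per_m * chip_size_m))
    # Cell-center coordinates of both grids, in meters.
    x_full = (np.arange(n_full) + 0.5) * chip_size_m / n_full
    x_sens = (np.arange(n_sens) + 0.5) * chip_size_m / n_sens
    # Separable linear interpolation: along each row, then along each column.
    tmp = np.array([np.interp(x_sens, x_full, row) for row in T_full])
    return np.array([np.interp(x_sens, x_full, col) for col in tmp.T]).T
```

For a 1 cm chip sampled at 1000 m-1 this returns a 10x10 array, matching the example in the text.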
Random error is added to the low-resolution temperature profile, $T_{s,L}$, to simulate the
measurement error introduced by real temperature sensors. The sensor error, $T_{error}$, is normally
distributed about the interpolated temperature value with a standard deviation that is specified
relative to the maximum interpolated temperature. The sensor error at each index is calculated
as:
$$T_{error}(i,j) = \sigma_{relative}\, T_{s,L,max}\, \Gamma \tag{1}$$
where $\sigma_{relative}$ is the standard deviation of the relative sensor error, $T_{s,L,max}$ is the maximum
measured temperature, and $\Gamma$ is a random number with a mean value of zero and a standard
deviation of unity.
This study shows the results for standard deviations in the relative sensor error of 0, 0.5, and 1
percent. The case of 0 percent standard deviation in the relative sensor error is equivalent to no
measurement error.
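A minimal sketch of the error model in Eq. (1) follows. The function name `add_sensor_error` and the use of NumPy's `Generator` for the unit-variance random number are assumptions for illustration; the text only specifies a zero-mean, unit-standard-deviation normal variate scaled by the relative error and the maximum measured temperature.

```python
import numpy as np

def add_sensor_error(T_low, sigma_relative, rng):
    """Return the low-resolution profile with simulated sensor error added.

    sigma_relative is a fraction (0.005 for 0.5 percent). The error at each
    sensor is sigma_relative * max(T_low) * Gamma, with Gamma ~ N(0, 1).
    """
    gamma = rng.standard_normal(T_low.shape)   # zero mean, unit std
    T_error = sigma_relative * T_low.max() * gamma
    return T_low + T_error

# Example: 1 percent relative error on a 10x10 sensor grid.
rng = np.random.default_rng(seed=0)            # seed chosen arbitrarily
T_noisy = add_sensor_error(np.full((10, 10), 350.0), 0.01, rng)
```

Setting `sigma_relative` to zero reproduces the error-free case.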
The sensor-level, low-resolution temperature profile with error, $T_{s,L,E} = T_{s,L} + T_{error}$, is used to
calculate the circuit-level heat flux profile, $q''_{c,L,E}$ (Figure 2d), using a spatial sampling frequency
domain, inverse heat transfer solution, described in detail in the next section. Because the
inputted temperature profile is low resolution, the resulting heat flux profile, $q''_{c,L,E}$, is also low
resolution. To calculate the error resulting from the solution method, the low-resolution heat flux
profile is interpolated to full resolution. The mean absolute error (MAE) is calculated by finding
the difference between the correct profile and the calculated heat flux profile:
$$MAE = \frac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left| q''_{c,F}(i,j) - q''_{c,L,E}(i,j) \right| \tag{2}$$
where Nx and Ny are the total number of indices in the x and y directions, respectively. The mean
absolute error is normalized by the average heat flux:
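$$\text{Normalized MAE} = \frac{\dfrac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left| q''_{c,F}(i,j) - q''_{c,L,E}(i,j) \right|}{\dfrac{1}{N_x N_y} \sum_{i=1}^{N_x} \sum_{j=1}^{N_y} q''_{c,F}(i,j)} \tag{3}$$
$$\text{Normalized MAE} = \frac{\sum_{i=1}^{N_x} \sum_{j=1}^{N_y} \left| q''_{c,F}(i,j) - q''_{c,L,E}(i,j) \right|}{\sum_{i=1}^{N_x} \sum_{j=1}^{N_y} q''_{c,F}(i,j)} \tag{4}$$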
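The error metrics can be written compactly as below. This is a sketch; the function names `mae` and `normalized_mae` are chosen here for illustration, and both arrays are assumed to be on the same full-resolution grid.

```python
import numpy as np

def mae(q_true, q_calc):
    """Mean absolute error between two heat flux maps on the same grid."""
    return np.mean(np.abs(q_true - q_calc))

def normalized_mae(q_true, q_calc):
    """MAE divided by the average true heat flux.

    The 1/(Nx*Ny) factors cancel, so a ratio of plain sums is sufficient.
    """
    return np.abs(q_true - q_calc).sum() / q_true.sum()

q_true = np.array([[1.0, 3.0], [2.0, 2.0]])
q_calc = np.array([[1.0, 1.0], [2.0, 4.0]])
print(mae(q_true, q_calc), normalized_mae(q_true, q_calc))  # prints 1.0 0.5
```

Because the metric is a ratio, it is unchanged if both maps are scaled by the same factor, which is what allows results to be generalized across heat flux magnitudes.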
Figure 3 provides a block diagram of the simulation procedure. The procedure is repeated for
numerous randomly-generated heat flux profiles and the results are averaged. The average MAE
is plotted against the thermal sensor spatial sampling frequency.
In practice, an inverse heat transfer technique is not always used to interpret measured
temperature profiles. Instead, the measured temperature profile is assumed to be representative of
the chip heat flux profile. This technique is equivalent to treating the measured temperature
profile as directly proportional to the heat flux profile:
$$q''_c(i,j) = \frac{T_s(i,j)}{R''_{th}} \tag{5}$$
where $R''_{th}$ is the chip vertical thermal resistance per unit area:
$$R''_{th} = \frac{t_0}{k} \tag{6}$$
This simplification results in additional uncertainty in the heat flux profile, the magnitude of
which depends on the chip properties and boundary conditions. In this paper, this approach is
referred to as “direct interpretation” of the temperature profile and is compared to the inverse
solution method in the results section.
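For a sense of scale, direct interpretation per Eqs. (5) and (6) reduces to a single division. The sketch below uses the chip thickness (100 um) and conductivity (148 W/m-K) quoted elsewhere in this paper; the function name `direct_interpretation` is hypothetical.

```python
import numpy as np

def direct_interpretation(T_s, t0, k):
    """Estimate heat flux (W/m^2) as temperature divided by t0/k."""
    r_th = t0 / k          # vertical resistance per unit area, m^2-K/W
    return T_s / r_th

# 100 um thick chip with k = 148 W/m-K: r_th is about 6.76e-7 m^2-K/W,
# so each kelvin of signal maps to 1.48e6 W/m^2 (148 W/cm^2).
q_est = direct_interpretation(np.array([[1.0]]), t0=100e-6, k=148.0)
```

Note that this mapping ignores lateral spreading entirely, which is the source of the additional uncertainty discussed above.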
B. Randomization of Heat Flux Profiles
The uncertainty in the calculated heat flux is dependent on the characteristics of the heat flux
profile. Simple, well-spaced heat flux profiles are easier to resolve than overlapping, complicated
heat flux profiles. To represent the most general case, the simulation is conducted over a set of
heat flux profiles that contain varying degrees of complexities. The heat flux profiles are
randomly generated to include between 1 and 15 hotspots, which can vary in lateral dimension
between 273 um (equivalent to 7 grid cells) and 4.18 mm (equivalent to 107 grid cells). For
reference, the chip is 1 cm by 1 cm. The hotspots are created with soft edges; the edge of the hotspot
spans 156 um (equivalent to 4 grid cells) and has a linear slope from the value of the background
heat flux to the value of the hotspot heat flux. The background heat flux is 1 W/cm2 and the
maximum possible hotspot heat flux is 320 W/cm2. Hotspots are permitted to overlap with each
other but not with the edge of the chip. For the first set of simulations, the hotspots have a
random heat flux value between the background and the maximum heat flux. This is referred to
as “variable heat flux”. For the second set of simulations, all hotspots have the maximum heat
flux, referred to as “binary heat flux”. The case of binary heat flux represents a core that is either
active or inactive. The case of variable heat flux represents a core for which the amount of
activity is unknown. Since the variable heat flux case is the most challenging from an uncertainty
perspective, only selected results are presented for binary heat flux cases.
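One way to generate such profiles is sketched below. Several details are assumptions not stated in the text: a 256x256 grid (inferred from 273 um per 7 cells on a 1 cm chip), rectangular hotspots with independently drawn width and height, uniform random draws, a soft edge that lies inside the hotspot footprint, and taking the maximum where hotspots overlap. All names are hypothetical.

```python
import numpy as np

N_GRID = 256         # assumed grid: 1 cm / 256 is about 39 um per cell
Q_BACKGROUND = 1.0   # W/cm^2
Q_MAX = 320.0        # W/cm^2
EDGE_CELLS = 4       # soft-edge width in grid cells

def hotspot_patch(nx, ny, q_peak):
    """Rectangular hotspot whose edges ramp linearly over EDGE_CELLS cells."""
    # 1-based distance of each cell from the nearest side of the patch.
    dx = np.minimum(np.arange(1, nx + 1), np.arange(nx, 0, -1))
    dy = np.minimum(np.arange(1, ny + 1), np.arange(ny, 0, -1))
    ramp = np.clip(np.minimum.outer(dy, dx) / EDGE_CELLS, 0.0, 1.0)
    return Q_BACKGROUND + (q_peak - Q_BACKGROUND) * ramp

def random_heat_flux(rng, binary=False):
    """Random heat flux map in W/cm^2 with 1 to 15 hotspots."""
    q = np.full((N_GRID, N_GRID), Q_BACKGROUND)
    for _ in range(rng.integers(1, 16)):            # upper bound exclusive
        nx, ny = rng.integers(7, 108, size=2)       # 7 to 107 cells
        # Keep every patch off the outermost ring of cells (chip edge).
        x0 = rng.integers(1, N_GRID - nx)
        y0 = rng.integers(1, N_GRID - ny)
        q_peak = Q_MAX if binary else rng.uniform(Q_BACKGROUND, Q_MAX)
        view = q[y0:y0 + ny, x0:x0 + nx]
        # Assumed overlap rule: keep the larger of the two values.
        np.maximum(view, hotspot_patch(nx, ny, q_peak), out=view)
    return q
```

With a 4-cell ramp on each side, the smallest 7-cell hotspot reaches its peak value only at its center cell, which is consistent with the stated minimum size.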
The key result of each simulation is the mean absolute error (MAE) in the calculated heat flux
profile. Because conduction through the chip is linear, the results are generalized by normalizing
the error in the heat flux profile by the input heat flux profile. Thus only the relative magnitude
of the heat flux as compared to the background heat flux is relevant for consideration.
C. Resolution Study
A second study was conducted to quantify the ability of the inverse solution method to resolve a
single hotspot from a group of neighboring hotspots. Two circuit-level heat flux profiles are
created; the first heat flux profile consists of a single hotspot in the center while the second heat
flux profile consists of 9 closely packed hotspots in the center. The average heat flux is the same
in both cases. Figure 4 shows the two heat flux profiles. The circuit-level temperature profile
resulting from the single-hotspot heat flux profile is calculated using the forward solution. The
temperature profile is sampled at reduced spatial sampling frequency to simulate the signal from
a thermal sensor array, as before.
Each solution method is used to deduce which of two possible heat flux profiles yielded the
measured temperature profile. To do so, the inverse solution method is used to calculate the
circuit-level heat flux profile. The results are compared to the two possible inputted heat flux
profiles by calculating the mean absolute error. The profile resulting in the lower MAE
represents the solution chosen by the inverse solution method. For example, if the MAE between
the calculated heat flux profile and the single-hotspot heat flux profile is lower than the MAE
between the calculated heat flux profile and the multi-hotspot heat flux profile, the inverse
solution method chooses the single-hotspot heat flux profile. If the choice correctly corresponds
to the actual inputted heat flux profile, the inverse solution method is correct. This procedure is
conducted for all sensor spatial frequencies, and is also conducted for the direct interpretation
method.
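The decision rule described above amounts to a nearest-candidate selection under the MAE metric. A minimal sketch follows, with the function name `choose_profile` and the dictionary-based interface chosen here for illustration.

```python
import numpy as np

def choose_profile(q_calc, candidates):
    """Pick the candidate heat flux map with the lowest MAE to q_calc.

    candidates maps a label (e.g. "single", "multi") to a map on the same
    grid as q_calc. Returns the winning label and all MAE values.
    """
    maes = {name: float(np.mean(np.abs(q_calc - q)))
            for name, q in candidates.items()}
    return min(maes, key=maes.get), maes

# Toy example: the calculated map is a slightly attenuated single hotspot.
single = np.zeros((6, 6))
single[2:4, 2:4] = 9.0
multi = np.ones((6, 6))
label, maes = choose_profile(0.9 * single, {"single": single, "multi": multi})
```

The method is counted as correct when the returned label matches the profile that actually generated the temperature data.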
III. Introduction to Inverse Heat Transfer Solution in Spatial Frequency Domain
A. Inverse Heat Transfer Solution Method
To conduct the forward and inverse solutions needed for the overall simulation methodology, an
analytical, spatial-frequency domain heat transfer analysis has been developed. This approach is
more computationally efficient than finite-difference methods and thus facilitates rapid multi-
parameter design optimization and possible integration into DTM schemes.
The thermal profile in the model geometry is defined by the heat diffusion equation. For each
layer in the stack, the solution to the heat diffusion equation is given by:
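$$\begin{aligned}
T(x,y,z) = {} & A_0 + B_0 z \\
& + \sum_{m=1}^{\infty} \left[ A_m \cosh(\alpha_m z) + B_m \sinh(\alpha_m z) \right] \cos(\alpha_m x) \\
& + \sum_{n=1}^{\infty} \left[ A_n \cosh(\beta_n z) + B_n \sinh(\beta_n z) \right] \cos(\beta_n y) \\
& + \sum_{m=1}^{\infty} \sum_{n=1}^{\infty} \left[ A_{mn} \cosh(\gamma_{mn} z) + B_{mn} \sinh(\gamma_{mn} z) \right] \cos(\alpha_m x) \cos(\beta_n y)
\end{aligned} \tag{7}$$
where:
$$\gamma_{mn} = \sqrt{\alpha_m^2 + \beta_n^2} \tag{8}$$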
$$\alpha_m = \frac{m\pi}{a} \tag{9}$$
$$\beta_n = \frac{n\pi}{b} \tag{10}$$
For the boundary conditions imposed in this model, Etessam-Yazdani [11] demonstrated a
technique of representing this conduction problem as a two-port terminal network. The technique
has been shown to be both accurate and fast for the forward heat transfer solution [12] and is
adapted in this study for the inverse problem.
Figure 5 presents a schematic of the two-port terminal network for this system. The two-dimensional
Fourier transforms of the heat flux profiles at the circuit and sensor levels are $q''_c$ and $q''_s$,
respectively. Similarly, $T_c$ and $T_s$ are the two-dimensional Fourier transforms of the
temperature profiles at the circuit and sensor levels, respectively. The matrix $A$ is a 2x2 matrix
that relates $T_c$ and $q''_c$ to $T_s$ and $q''_s$:
$$\begin{bmatrix} T_c(f_x,f_y) \\ q''_c(f_x,f_y) \end{bmatrix} = A(f_x,f_y) \begin{bmatrix} T_s(f_x,f_y) \\ q''_s(f_x,f_y) \end{bmatrix} \tag{11}$$
For radial spatial frequency $f_r > 0$:
$$A(f_x,f_y) = \begin{bmatrix} \cosh(2\pi f_r t_0) & \dfrac{\sinh(2\pi f_r t_0)}{2\pi f_r k} \\ 2\pi f_r k \sinh(2\pi f_r t_0) & \cosh(2\pi f_r t_0) \end{bmatrix} \tag{12}$$
And for $f_r = 0$:
$$A(f_x,f_y) = \begin{bmatrix} 1 & t_0/k \\ 0 & 1 \end{bmatrix} \tag{13}$$
where the radial spatial frequency $f_r$ is defined as:
$$f_r = \sqrt{f_x^2 + f_y^2} \tag{14}$$
Further details on the derivation of the two-port terminal analysis are provided in [11].
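As a quick numerical check of the two-port matrix, the sketch below builds the matrix of Eqs. (12) and (13) for a single layer. The function name `two_port_matrix` is hypothetical, and the layer thickness is taken to be the chip thickness, which is an assumption about how the garbled symbol in the source should be read.

```python
import numpy as np

def two_port_matrix(f_r, k, t0):
    """2x2 transfer matrix of one conducting layer at radial frequency f_r."""
    if f_r == 0.0:
        # Zero-frequency (uniform) component: pure 1-D conduction.
        return np.array([[1.0, t0 / k],
                         [0.0, 1.0]])
    w = 2.0 * np.pi * f_r
    return np.array([[np.cosh(w * t0), np.sinh(w * t0) / (w * k)],
                     [w * k * np.sinh(w * t0), np.cosh(w * t0)]])

A = two_port_matrix(1000.0, k=148.0, t0=100e-6)
# cosh^2 - sinh^2 = 1, so the determinant is unity at every frequency.
det_A = np.linalg.det(A)
```

The unit determinant and the smooth approach to the zero-frequency matrix as the frequency goes to zero are two properties that can be used to sanity-check an implementation.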
Etessam-Yazdani et al [11] used the two-port terminal analysis to solve for the temperature as a
function of the heat flux on the same level of the geometry, which represents the forward
solution. In this study, the solution was modified to determine the heat flux profile on the circuit
plane, $q''_c$, using the temperature profile on the sensor plane, $T_s$, which represents the inverse
solution. From the two-port terminal analysis, the equation for $q''_c$ is:
$$q''_c = A_{21} T_s + A_{22}\, q''_s \tag{15}$$
Applying the top boundary condition, $q''_s = h T_s$, and substituting the appropriate values of $A_{ij}$,
the result for cases where $f_r > 0$ is:
$$q''_c = \left( 2\pi f_r k \sinh(2\pi f_r t_0) + h \cosh(2\pi f_r t_0) \right) T_s \tag{16}$$
For which the inverse solution transfer function $H_{inv}(f_r)$ can be defined such that:
$$q''_c = H_{inv}(f_r)\, T_s \tag{17}$$
and
$$H_{inv}(f_r) = 2\pi f_r k \sinh(2\pi f_r t_0) + h \cosh(2\pi f_r t_0) \tag{18}$$
For cases where $f_r = 0$, the transfer function reduces to the heat transfer coefficient, $h$, and
the equation is given as $q''_c = h T_s$.
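The inverse solution can be sketched as a multiplication in the spatial frequency domain. The code below is illustrative only: the names `h_inv` and `inverse_solve` are hypothetical, a plain FFT is used (which implies periodic lateral boundaries rather than the adiabatic sidewalls of the model), and the temperature is assumed to be measured relative to the coolant reference implied by the convective boundary condition.

```python
import numpy as np

def h_inv(f_r, k, t0, h):
    """Inverse transfer function of Eq. (18); equals h at f_r = 0."""
    w = 2.0 * np.pi * np.asarray(f_r, dtype=float)
    return w * k * np.sinh(w * t0) + h * np.cosh(w * t0)

def inverse_solve(T_s, dx, k, t0, h):
    """Estimate circuit-level heat flux from sensor-level temperature.

    T_s is sampled on a uniform grid with spacing dx (m).
    Assumption: periodic boundaries, because a plain FFT is used.
    """
    ny, nx = T_s.shape
    fx = np.fft.fftfreq(nx, d=dx)              # cycles per meter
    fy = np.fft.fftfreq(ny, d=dx)
    f_r = np.hypot(*np.meshgrid(fx, fy))       # radial frequency, Eq. (14)
    return np.fft.ifft2(h_inv(f_r, k, t0, h) * np.fft.fft2(T_s)).real

# A uniform 2 K signal carries only the f_r = 0 component, so q = h * T.
q = inverse_solve(np.full((8, 8), 2.0), dx=1.25e-3, k=148.0, t0=100e-6, h=1e4)
```

Because the hyperbolic terms grow rapidly with frequency, any noise in the high-frequency part of the temperature signal is strongly amplified by this operation.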
B. High-Frequency Filtering
A filtering technique based on the forward solution transfer function is employed to reduce error
in the inverse solution method. As shown by [13], the forward solution to the conduction
problem yields a transfer function in the frequency domain that acts as a low pass filter.
Physically this represents the attenuation of high spatial frequency components of the thermal
signal via heat spreading in the chip.
The inverse transfer function has the form of a high-pass filter, as shown in Figure 6. The
transfer function is at its minimum at $f_r = 0$ and increases rapidly as a function of $f_r$, thus
amplifying the high frequency components of the temperature profile. The components of the
temperature profile that are greater than the -3dB frequency of the forward solution transfer
function, however, represent sensor noise. A filtering method has been developed to prevent this
noise from propagating to the calculated heat flux profile. A low-pass filter is applied to the
inverse transfer function with a filter cut-off frequency at the -3dB frequency of the forward
solution transfer function. The filter has a soft roll-off. Figure 6c shows the filtered transfer
function. This filtering technique dramatically improves the performance of the inverse solution
method by decreasing sensitivity to high-frequency noise.
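A possible realization of the filtering step is sketched below. The text does not give the roll-off shape, so a Gaussian roll-off scaled to pass exactly -3 dB at the cut-off is assumed here; the forward transfer function is taken as the reciprocal of the inverse one, and the function names and the bisection search are likewise illustrative choices.

```python
import numpy as np

def h_inv(f_r, k, t0, h):
    """Unfiltered inverse transfer function (heat flux per unit temperature)."""
    w = 2.0 * np.pi * np.asarray(f_r, dtype=float)
    return w * k * np.sinh(w * t0) + h * np.cosh(w * t0)

def cutoff_minus_3db(k, t0, h):
    """Frequency where the forward gain 1/h_inv falls to 1/sqrt(2) of DC."""
    target = np.sqrt(2.0) * h
    f_lo, f_hi = 0.0, 1.0
    while h_inv(f_hi, k, t0, h) < target:      # bracket the crossing
        f_lo, f_hi = f_hi, 2.0 * f_hi
    for _ in range(100):                       # bisection
        f_mid = 0.5 * (f_lo + f_hi)
        if h_inv(f_mid, k, t0, h) < target:
            f_lo = f_mid
        else:
            f_hi = f_mid
    return 0.5 * (f_lo + f_hi)

def filtered_h_inv(f_r, k, t0, h):
    """Inverse transfer function times an assumed Gaussian low-pass filter."""
    f_c = cutoff_minus_3db(k, t0, h)
    f = np.asarray(f_r, dtype=float)
    # Filter gain is exactly 1/sqrt(2) (-3 dB) at f = f_c.
    rolloff = np.exp(-0.5 * np.log(2.0) * (f / f_c) ** 2)
    return h_inv(f, k, t0, h) * rolloff
```

A Gaussian is used in this sketch because it decays faster than the hyperbolic terms grow, so the filtered transfer function eventually falls toward zero at high spatial frequency instead of amplifying noise.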
C. Solution Validation
The solution method was validated by comparison to COMSOL Multiphysics software using
representative simulation parameters. The heat transfer coefficient was 10 W/m2-K and the
thermal conductivity was 148 W/m-K. The simulated chip was 1cm by 1cm in lateral dimensions
and 100 microns in thickness. A representative heat flux was applied in the COMSOL model and
the temperature profile was resolved. The temperature profile was used as an input to the inverse
solution method and the applied heat flux was calculated. The calculated heat flux matched the
COMSOL heat flux to within 0.01%.
Additional testing was conducted to ensure the results for average heat flux error are independent
of the number of random heat flux maps tested, $N$. Figure 7 shows the results for varying numbers
of randomly generated heat flux maps for both variable heat flux and binary heat flux. The results
are shown to be $N$-independent (i.e. independent of the number of random heat flux profiles)
after 50 randomly generated heat maps. For all of the reported results, data were averaged over 50
heat maps ($N = 50$).
IV. Results
Figure 8 shows a representative distribution of mean absolute error for 50 randomized heat flux
distributions. Results are reported for the case of variable heat flux and binary heat flux. These
results provide a basis for understanding the effects of sensor spatial frequency on the calculated
heat flux profile. Simulation parameters are typical of chip applications: the distance from the
circuit level to the sensor array is 100 um, the conductivity is 148 W/m-K, and the heat transfer
coefficient is 10,000 W/m2-K. The sensor error is zero for this case. The mean value (shown in
bold black) follows the
expected trend of increased accuracy at higher spatial sampling frequency. A sampling frequency
of 2000 m-1 (approximately 500 um sensor spacing) is required to achieve an average mean
absolute error (MAE) below 25% for the variable heat flux case. At lower resolutions, the
average MAE is dramatically higher. Significant deviations from the mean value are caused by
variations between the randomized heat flux profiles.
The average MAE is dependent on the heat transfer coefficient, the sensor error, and the
proximity between thermal sensors and the circuit level. These effects are discussed in more
detail below. For clarity, only the average MAE is shown. The solid curves and dotted curves
represent the average MAE for the inverse solution method and the direct temperature
interpretation method, respectively.
The average MAE of the inverse solution method is dependent on whether the inputted heat flux
profile is binary or variable. Figure 9 shows an approximately 65% drop in average MAE if the
input heat flux is binary rather than variable. Since the heat flux cannot always be assumed to be
binary, the remaining plots show results for variable heat flux.
Figure 10 shows the performance of the inverse solution method for varying heat transfer
coefficients for variable heat flux with no sensor error. As expected, the average MAE for both
methods is reduced by increasing heat transfer coefficient values. The direct interpretation
method performs poorly at low heat transfer coefficients but makes significant improvements as
the heat transfer coefficient is increased. The inverse method produces significantly lower
average MAE and is less sensitive to changes in the heat transfer coefficient.
Figure 11 illustrates the difficulty of calculating the heat flux profile from temperature profiles
containing sensor error. For the ideal case of zero sensor error, the inverse solution method
outperforms the direct method by up to 50% MAE for variable heat flux. However, measurement
error causes the inverse method to diverge from the solution. For a case of 0.5% measurement
error, the inverse solution is slightly better than the direct method for spatial frequencies up to
about 3000 m-1, at which point it diverges rapidly. For the case of 1% standard deviation in the
sensor error, the direct interpretation method is superior for sensor spatial frequencies greater
than 2000 m-1. Similar trends are observed for the case of binary heat flux profiles as well.
Figure 12 presents the effect of vertical proximity between the sensor level and the circuit level.
Average MAE results are shown for vertical distances between 1um and 1mm for variable heat
flux profiles. As the vertical distance is reduced, modest improvements in MAE are observed
for both the inverse and direct interpretation techniques. The exception is the 1 um case,
where the direct interpretation method improves by approximately 0.8 in normalized
average MAE. For the extreme case of 1 um of vertical distance, the inverse and direct
interpretation methods are comparable, but for all other cases the inverse solution significantly
outperforms the direct interpretation method.
Figures 13 and 14 show the performance of the inverse solution method in resolving neighboring
hotspots. The figures show the minimum sensor spatial frequency required to correctly
differentiate between a single hotspot and a group of equivalent neighboring hotspots. The
results are presented as a function of vertical proximity between the distributed thermal sensor
array and the circuit plane, and a moving-average smoothing function is applied to remove
discretization artifacts. The gray region of the plot shows the domain in which the inverse
solution can correctly identify the underlying heat flux profile. A relatively low sensor spatial
frequency is adequate when positioned in close proximity to the hotspot. Increasing the
separation between the sensor array and the hotspot requires an increase in the sensor spatial
frequency. Figures 13 and 14 show results for convective heat transfer coefficients of 10,000 and
50,000 W/m2-K, respectively. For a convective heat transfer coefficient of 10,000 W/m2-K at
distances greater than approximately 240 um, the inverse solution method is unable to resolve
the hotspot. Figure 14 shows that the limit of the inverse solution can be extended by increasing
the heat transfer coefficient. For this case, the inverse solution method produces the correct
results up to 300 um. For all cases shown, the direct interpretation method failed to correctly
identify the single hotspot. The inverse technique is shown to be superior to the direct
interpretation method for resolving neighboring hotspots. These results provide insight into the
optimization of sensor vertical proximity and sensor spatial frequency for resolving neighboring
hotspots.
V. Summary and Concluding Remarks
This study investigates uncertainty and error propagation in distributed thermal sensor arrays in
microprocessors. A novel, inverse heat transfer solution methodology is developed to provide a
computationally efficient method for determining the heat flux profile at a remote level in a chip.
The inverse solution method is used to determine the expected mean absolute error of the
calculated heat flux profile in a chip. Several key conclusions are drawn.
• For systems with relatively low sensor spatial frequency such as typical microprocessors,
large improvements in the accuracy of the calculated heat flux can be made by making
relatively small improvements in the resolution of the sensor array. As the sensor array
increases resolution, the uncertainty in the calculated heat flux is much reduced.
• For cases of very low sensor error, the proposed inverse solution technique more accurately
calculates the heat flux profile than direct interpretation of the temperature profile.
• Depending on the system configuration and the magnitude of the sensor error, the inverse
solution method can be inaccurate. This inaccuracy is mitigated by the proposed filtering
method, but nonetheless represents a fundamental limitation of this technique.
• Direct interpretation of the temperature signal is shown to result in significant error in the
calculated heat flux profile. Accounting for these errors in DTM techniques reduces
computational performance, so they should be considered during overall system design.
These conclusions regarding the nature of error propagation from distributed thermal sensor
arrays can provide a basis for considering the difficult system-level optimization required for
integrated circuit design. Sensor error, sensor spatial frequency, proximity between a sensor
array and hotspots, and signal processing all affect hotspot uncertainty as well as circuit design.
Each of these parameters can help improve DTM accuracy but can also impose costs on the
performance of the circuit. Careful optimization of these parameters is necessary to maximize
computational performance while ensuring reliable thermal conditions.
Acknowledgements
The authors gratefully acknowledge support from Advanced Micro Devices (AMD) Inc. as part
of the Semiconductor Research Corporation (SRC) and further support from the Stanford
Department of Mechanical Engineering Graduate Teaching and Research Fellowship.
Works Cited
[1] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall, “Managing the Impact of Increasing
Microprocessor Power Consumption,” Intel Technology Journal, vol. 1, pp. 1-9, 2001.
[2] S. Heo, K. Barr, and K. Asanovic, “Reducing Power Density through Activity Migration,”
in Proceedings of the 2003 International Symposium on Low Power Electronics and Design
(ISLPED ’03), 2003, pp. 217–222.
[3] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-Performance
Microprocessors,” in Proceedings of the 7th International Symposium on High-Performance
Computer Architecture, 2001.
[4] E. Kursun, G. Reinman, S. Sair, A. Shayesteh, and T. Sherwood, “Low-overhead Core
Swapping for Thermal Management,” Power-Aware Computer Systems, pp. 46–60, 2005.
[5] A. Cohen, F. Finkelstein, A. Mendelson, R. Ronen, and D. Rudoy, “On Estimating
Optimal Performance of CPU Dynamic Thermal Management,” Computer Architecture
Letters, vol. 2, no. 1, pp. 6–6, 2003.
[6] A. K. Coskun, R. Strong, D. M. Tullsen, and T. Simunic Rosing, “Evaluating the Impact
of Job Scheduling and Power Management on Processor Lifetime for Chip
Multiprocessors,” in Proceedings of the Eleventh International Joint Conference on
Measurement and Modeling of Computer Systems, 2009, pp. 169–180.
[7] M. Gomaa, M. D. Powell, and T. Vijaykumar, “Heat-and-Run: Leveraging SMT and CMP
to Manage Power Density through the Operating System,” in ACM SIGARCH Computer
Architecture News, 2004, vol. 32, no. 5, pp. 260–270.
[8] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan,
“Temperature-aware Microarchitecture,” in ACM SIGARCH Computer Architecture
News, 2003, vol. 31, no. 2, pp. 2–13.
[9] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan,
“Temperature-aware Microarchitecture: Modeling and Implementation,” ACM
Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 1, pp. 94–125,
2004.
[10] L. Hom, A. Durieux, J. Miler, M. Asheghi, K. Ramani, and K. E. Goodson, “Calibration
Methodology for Interposing Liquid Coolants Infrared Thermography of
Microprocessors,” in ITHERM, 2012.
[11] K. Etessam-Yazdani, “Continuum and Subcontinuum Thermal Modeling of Electronic
Devices and Systems,” Carnegie Mellon University, 2006.
[12] K. Etessam-Yazdani and H. Hamann, “Fast and Accurate Simulation of Heat Transfer in
Microarchitectures Using Frequency Domain Techniques,” IPACK, pp. 1-5, 2007.
[13] K. Etessam-Yazdani, H. F. Hamann, and M. Asheghi, “Spatial Frequency Domain
Analysis of Heat Transfer in Microelectronic Chips with Applications to Temperature
Aware Computing,” in IPACK, 2007.
Figures
Figure 1: Schematic of model geometry. An arbitrary heat flux profile is applied on the bottom boundary. The boundary
condition on all sidewalls is adiabatic; the boundary condition on the top surface is uniform heat transfer.
Figure 2: Representative images of each of the four main steps in the simulation methodology. The inputted heat flux
profile (a) is used as a reference for determining the error in the calculated heat flux profile (d).
Panels: (a) inputted heat flux profile [W/mm^2]; (b) actual temperature profile [°C]; (c) measured temperature
profile [°C]; (d) solution power map [W/mm^2], Case #1.
Figure 3: Block diagram of numerical approach used for determining hotspot detection accuracy. FFT and IFFT refer to
the Fast Fourier Transform and the Inverse Fast Fourier Transform, respectively.
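[Block diagram: the inputted heat flux profile is transformed by FFT, multiplied by the forward transfer function, and
returned by IFFT to give the actual temperature profile. The resolution of this profile is reduced and error is introduced
to give the measured temperature profile, which is transformed by FFT, multiplied by the inverse solution transfer
function, and returned by IFFT to give the calculated heat flux profile.]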
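The sequence of steps in Figure 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the square grid, the hard low-pass cutoff `k_cut`, and in particular the placeholder `forward_transfer` (a generic low-pass expression standing in for the transfer function derived from the two-port network model) are all hypothetical.

```python
import numpy as np


def spatial_frequency(n, pitch):
    """Magnitude of angular spatial frequency [rad/m] on an n x n FFT grid."""
    k1d = 2.0 * np.pi * np.fft.fftfreq(n, d=pitch)
    kx, ky = np.meshgrid(k1d, k1d, indexing="ij")
    return np.hypot(kx, ky)


def forward_transfer(k, depth, conductivity=150.0, h=1.0e4):
    """ASSUMED placeholder transfer function [K per W/m^2], not the paper's model.

    It only mimics the low-pass character of heat spreading: attenuation
    grows with spatial frequency k and with source-to-sensor distance depth.
    """
    return np.exp(-k * depth) / (h + conductivity * k)


def simulate(q_in, pitch, depth, stride=1, sensor_sigma=0.0, k_cut=None, seed=0):
    """Run the forward and inverse steps on a square map; return (flux, MAE)."""
    rng = np.random.default_rng(seed)

    # Forward problem: FFT, multiply by the transfer function, IFFT.
    z_fwd = forward_transfer(spatial_frequency(q_in.shape[0], pitch), depth)
    t_actual = np.fft.ifft2(np.fft.fft2(q_in) * z_fwd).real

    # Reduce resolution to the sensor grid, then introduce sensor error.
    t_meas = t_actual[::stride, ::stride]
    t_meas = t_meas + rng.normal(0.0, sensor_sigma, size=t_meas.shape)

    # Inverse problem on the sensor grid, with an optional hard low-pass filter.
    k_s = spatial_frequency(t_meas.shape[0], pitch * stride)
    z_inv = 1.0 / forward_transfer(k_s, depth)
    if k_cut is not None:
        z_inv = np.where(k_s <= k_cut, z_inv, 0.0)
    q_calc = np.fft.ifft2(np.fft.fft2(t_meas) * z_inv).real

    # Mean absolute error against the input, sampled at the sensor locations.
    mae = float(np.mean(np.abs(q_calc - q_in[::stride, ::stride])))
    return q_calc, mae


def blocky_heat_flux(n_blocks=8, block=8, q_max=1.0e6, seed=1):
    """Random piecewise-constant heat flux map [W/m^2] (illustrative only)."""
    rng = np.random.default_rng(seed)
    blocks = rng.uniform(0.0, q_max, (n_blocks, n_blocks))
    return np.kron(blocks, np.ones((block, block)))
```

Calling `simulate(blocky_heat_flux(), pitch=1e-5, depth=5e-6, sensor_sigma=0.1)` with and without `k_cut` shows, for this placeholder model, how the inverse transfer function amplifies sensor error at high spatial frequency and how a low-pass filter trades that amplification for lost resolution.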
(a) (b)
Figure 4: Heat flux profiles used for resolution study. Both heat flux profiles have equivalent average heat flux and
produce similar temperature response profiles. The solution methods are tested for their ability to correctly resolve these
heat flux profiles.
Figure 5: Schematic of two-port terminal network [11].
Figure 6: Representative plots of inverse solution transfer function. Plots show two-dimensional shape of transfer function
(a) without filtering and (b) with filtering. (c) Values of transfer function for varying x-direction spatial frequency and
for y-direction frequency of zero (shown as “on-axis”) as well as for maximum y-direction frequency (shown as “off-axis”).
The effect of the applied filter can be seen at approximately 4000 m^-1.
Figure 7: Average mean absolute error (MAE) for varying numbers of randomized heat flux profiles for (a) variable heat
flux and (b) binary heat flux. Results for both cases are independent of the number of heat flux profiles for more than 50
heat flux profiles.
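The convergence check behind Figure 7 can be illustrated with a short sketch. It assumes that a mean absolute error has already been computed for each randomized heat flux profile; the function names and the relative tolerance `rel_tol` are hypothetical and are not taken from this work.

```python
import numpy as np


def running_average(mae_per_profile):
    """Average MAE over the first N profiles, for N = 1 .. len(input)."""
    values = np.asarray(mae_per_profile, dtype=float)
    return np.cumsum(values) / np.arange(1, values.size + 1)


def profiles_needed(mae_per_profile, rel_tol=0.02):
    """Smallest N from which the running average stays within rel_tol
    (relative) of its final value. The tolerance is an assumed criterion."""
    avg = running_average(mae_per_profile)
    within = np.abs(avg - avg[-1]) <= rel_tol * abs(avg[-1])
    violations = np.flatnonzero(~within)
    # The last violating index is 0-based, so the first converged count is +2.
    return 1 if violations.size == 0 else int(violations[-1]) + 2
```

Plotting `running_average` against the number of profiles gives a curve of the kind described in the caption, and `profiles_needed` reports where that curve settles for a chosen tolerance.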
Figure 8: Demonstration of the averaging technique for (a) variable heat flux and (b) binary heat flux. Results for 50 heat
flux profiles are shown. The bold black line indicates the average value.
Figure 9: Effects on uncertainty of variable versus binary inputted heat flux profile for varying vertical proximity
between sensor and circuit level. The binary heat flux profile results in substantially lower MAE.
Figure 10: Uncertainty in calculated heat flux profile for varying convective heat transfer coefficient. The inverse solution
method is much less sensitive to heat transfer coefficient than the direct interpretation method.
Figure 11: Uncertainty in calculated heat flux profile for varying sensor error at a vertical proximity of (a) 2.575 um and
(b) 7.53 um. The inverse solution method is susceptible to sensor error at high spatial frequency. The MAE for the direct
interpretation method is not affected by varying sensor error.
Figure 12: Uncertainty in calculated heat flux profile for varying vertical proximity between the sensor and circuit levels
for zero sensor error. For most cases, large changes in vertical proximity yield modest improvements in heat flux
uncertainty.
Figure 13: Plot of minimum accurate sampling frequency as a function of vertical proximity between chip and sensor
level for heat transfer coefficient of 10^4 W/m^2-K. The inverse solution method is accurate in the shaded region. The direct
interpretation technique is inaccurate across the entire domain.
Figure 14: Plot of minimum accurate sampling frequency as a function of vertical proximity between chip and sensor
level for heat transfer coefficient of 10^5 W/m^2-K. The inverse solution method is accurate in the shaded region. The direct
interpretation technique is inaccurate across the entire domain.