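Attribution-NonCommercial-NoDerivs 2.0 Korea
Users are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that they follow the conditions below:
Attribution. You must attribute the work to the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.
For any reuse or distribution, you must make clear the license terms applied to this work.
These conditions do not apply if you obtain separate permission from the copyright holder.
Your rights under copyright law are not affected by the above.
This is an easy-to-understand summary of the Legal Code.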
M.S. THESIS
Power Optimization Techniques Applicable in Pre/Post Placement Stages
for Modern System-on-Chips
최신 System-on-Chip에서의 Placement 전/후 파워 최적화 기법
BY
Yi Dongyoun
FEBRUARY 2017
DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
COLLEGE OF ENGINEERING
SEOUL NATIONAL UNIVERSITY
Advisor: 김태환 (Taewhan Kim)
Submitted as a thesis for the degree of Master of Science in Engineering
February 2017
Graduate School of Seoul National University
Department of Electrical Engineering and Computer Science
이동윤 (Yi Dongyoun)
Approving the Master of Science in Engineering thesis of Yi Dongyoun
February 2017
Chair:
Vice Chair:
Member:
Abstract
In this paper, we introduce two power optimization techniques, applicable before
and after the placement stage, respectively.
First, a new approach to the problem of allocating multi-bit flip-flops for data stor-
age is presented. Previous approaches divide the allocation problem into two separate
steps: (i) placing single-bit flip-flops under circuit timing constraints and (ii) minimiz-
ing the flip-flop and clock tree power by grouping single-bit flip-flops to form multi-bit
flip-flops. Yet, there is no easy way to predict the result of step (ii) during step (i). In
our approach, we place primary importance on the cost of power consumption. Con-
sequently, we try to minimize power consumption by synthesizing multi-bit flip-flops
first and then to place them later. For a number of benchmark circuits, it is shown that
our approach of early consideration of synthesizing multi-bit flip-flops is very effec-
tive, reducing the clock power by 16.8% over that of the conventional method while
satisfying all the timing constraints.
The second work addresses the practical problem of optimizing the switch cells in
power-gated modern System-on-Chips (SoCs) to save unnecessary standby leakage
under a noise (i.e., IR-drop) constraint. Since power-gating switch cells are physically
connected directly to the power rails, their overall allocation structure is synthesized in a
stage before logic cell placement. Consequently, the allocation of switch cells at the
pre-placement stage could lead to unnecessarily high standby leakage in modern designs.
This work proposes a practical remedy for this problem at the post-placement stage.
Specifically, for an initial design with a grid-based switch cell allocation, which is
a commonly used design methodology in industry, we propose a comprehensive solution
for determining, for each switch cell, (i) whether the cell can be removed or (ii) the type
of switch cell to replace it with, so that the resulting total standby leakage of the switch
cells is minimized under the noise constraint. We formulate the problem as a variant
of the weighted set cover problem and solve it efficiently by employing an approximate
set cover algorithm. Through experiments with benchmark circuits in ISCAS89,
openMSP430, and fpu, it is shown that our method is able to reduce the standby leakage
by 35.0% and 13.9% over the initial designs and the designs produced by the previous
switch cell optimization method in [9], respectively.
keywords: Low Power, Multi-bit Flip-flop, Logic Synthesis, Power-gating, Switch
cell
student number: 2015-20962
Contents
Abstract
Contents
List of Tables
List of Figures
1 Allocation of Multi-bit Flip-flops in Logic Synthesis for Power Optimization
1.1 Introduction
1.2 Algorithm for Multi-bit Flip-flop Allocation
1.3 Placement Aware Multi-bit Flip-flop Allocation
1.3.1 Extraction of Mergeable Flip-flop Sets
1.3.2 Construction of Merging Conflict Graph
1.3.3 Selection of Mergeable Flip-flop Sets
1.4 Experimental Results
1.4.1 Experimental Setup
1.4.2 Comparing with Academic Algorithm
1.4.3 Comparing with Industry Algorithm
1.5 Conclusion
2 Switch Cell Optimization of Power-gated Modern System-on-Chips
2.1 Introduction
2.2 Preliminaries and Motivations
2.3 Problem Formulation
2.4 The Proposed Algorithm
2.4.1 Extraction of Maximally Feasible Subregions
2.4.2 Switch Cell Covering for Minimal Standby Leakage
2.4.3 Consideration of Practical Issues
2.5 Experimental Results
2.5.1 Experimental Setup
2.5.2 Experimental Result
2.6 Conclusions
Abstract (In Korean)
List of Tables
1.1 Comparison of normalized power and area of 1-bit, 2-bit and 4-bit flip-flops [1].
1.2 Computation of intersection densities of some subsets of flip-flops in the circuit in Fig. 1.3.
1.3 The number of mergeable flip-flop sets.
1.4 Comparison of the number of flip-flop cells, clock tree power P_CLK_T, flip-flop power P_FF, and the total power for the results produced by the MBFF allocation [1] in post-placement, our algorithm, and our algorithm followed by [1].
1.5 Comparison of the number of flip-flop cells, clock tree power P_CLK_T, flip-flop power P_FF, combinational logic power P_COMBI, the total power, the longest path delay t_lpd and the worst local clock skew t_skew for the results produced by Design Compiler's MBFF algorithm, and our algorithm followed by [1].
2.1 An example showing a minimum weighted set cover solution.
2.2 Comparison of the number of switch cells and estimated standby leakage I_standby by evenly distributed, the previous current based approach in [9] and our algorithm.
2.3 Power/ground density specification.
List of Figures
1.1 The structures of two 1-bit flip-flops and a 2-bit flip-flop merging the two 1-bit flip-flops [14].
1.2 Flow of the proposed design methodology: Sub-step 2 takes into account the placement cost while performing multi-bit flip-flop allocation in step (i). The cost of multi-bit flip-flop allocation in Eq. 1.2 and the placement cost are balanced by controlling the value of the placement relaxation parameter ρ.
1.3 Conceptual view of partitioning a circuit based on the transitive fan-outs of flip-flops.
1.4 An example of a conflict graph for 7 mergeable flip-flop sets. (a) Given the conflict graph, we set the vertex weights to the amount of power saving. For simplicity, we ignore the vertices of 1-bit flip-flops, whose power saving weight is 0. (b) We evaluate the vertex degrees considering their neighbors and select the minimum degree vertex L7. (c) We remove the selected vertex and its neighbors from G. (d)(e)(f) We repeat these steps until G becomes empty.
1.5 Normalized power consumption per bit of MBFF up to 16-bit.
1.6 Layout views for S5378. The yellow lines indicate clock tree networks. The colored small rectangles represent 1-bit flip-flops and MBFFs; light-orange is 1-bit flip-flop, light-red is 2∼4-bit MBFF, orange is 5∼9-bit MBFF and red is 10∼16-bit MBFF. (a) Result produced by our algorithm. (b) Result produced by random FF grouping to be the same number of MBFFs as that in (a); red dash indicates timing violation. (c) Result produced by [1].
2.1 A logical structure of power-gated power delivery network (PDN).
2.2 The changes of standby leakage of openMSP430 circuit as the number and type (size) of switch cells vary.
2.3 Typical PDN structure used in current mobile SoCs.
2.4 An example of (a) switch cell placement, (b) its IR-drop map under an assumption of fully placed power/ground IO bumps, and (c) IR-drop and resistance composition for the worst IR-drop path of aes256 circuit categorized by elements of power net.
2.5 The flow of our two-step algorithm of switch cell optimization.
2.6 An illustration of bin partitioning (Step 1.1), horizontally along the power/ground rails and vertically along the center of the locations of two serially placed cells.
2.7 An illustration of extracting maximally feasible regions (Step 1.2) for switch cell s6.
2.8 Comparison of IR-drop map on all instances and its distribution on switch cells for the fpu circuit, but using different package. (a) flipchip package, (b) wirebond package. Orange and white dots indicate power and ground sources, respectively.
2.9 An example illustrating how to handle wakeup delay sensitive subcircuits. Replacement of the unused HEADBUF with an always-on buffer to maintain the wakeup signal delay.
2.10 Comparison of IR-drop distribution for the fpu circuit. (a) evenly distributed switch with 60um pitch, (b) the previous switch optimization method in [9] and (c) our algorithm.
Chapter 1
Allocation of Multi-bit Flip-flops in Logic Synthesis for
Power Optimization
1.1 Introduction
Under the limited power and thermal budget constraints for modern system-on-chips
(SoCs) which integrate an increasing number of transistors and interconnects, the min-
imization of the power consumption has become one of the most important design
goals for diverse applications. One effective methodology to save power consumption,
in particular, on the flip-flops and driving clock network is allocating multi-bit flip-flops
or register banks.
Fig. 1.1 shows a comparison of the structures of two 1-bit flip-flops (left) and
their functionally equivalent 2-bit flip-flop (right). The power saving in the 2-bit
flip-flop is attributed to the sharing of the two inverters among the two master and
two slave latches. The advance of process technology below 65nm enables even a
minimum-sized inverter to drive multiple master and slave latches in flip-flops [1].
Table 1.1 shows a comparison of the power consumption and area of the 1-bit, 2-bit,
and 4-bit flip-flops. In summary, using a 2-bit flip-flop and a 4-bit flip-flop
saves per-bit power by 14% and 22% while reducing the per-bit area by 4% and 29%,
respectively [14]. Moreover, since the use of multi-bit flip-flops reduces the number of
cells driven by the clock tree, clock resources such as clock wires and buffers can be
reduced as well, which in turn decreases the clock power consumption. Thus, as process
technology advances, the use of flip-flops of 8 bits or more is expected to become
feasible without timing degradation and to yield a considerable reduction in power
consumption.
Figure 1.1: The structures of two 1-bit flip-flops and a 2-bit flip-flop merging the two 1-bit flip-flops
[14]. (Diagram labels: master latch, slave latch, D, Q, CLK; 1-bit FF, 2-bit FF.)
Using a multi-bit flip-flop (MBFF) library during RTL synthesis was first introduced
in [10]. However, the selection of flip-flops to form a multi-bit flip-flop was done
manually, by only collecting the flip-flops on the same bus. Later, the works in [1, 2, 18]
proposed merging multiple 1-bit flip-flops to form an MBFF at the post-placement
stage. Specifically, during the merging process, based on the cell placement information,
they extracted a so-called feasible location region for legal placement of MBFFs, the
placement density for acquiring available space for MBFFs, and the routing congestion for
legal routing of nets from/to MBFFs. On the other hand, the works in [5, 15, 16]
considered the effect of flip-flop merging on the synthesis of the clock tree at the
in-placement stage.
Note that almost all of the previous MBFF allocation algorithms ([1, 2, 5, 15,
16, 18]) performed the flip-flop merging process by solely relying on the placement
information obtained beforehand. Namely, the previous approaches divide the MBFF
allocation problem into two steps: (i) placing 1-bit flip-flops while meeting the timing
constraints and (ii) grouping 1-bit flip-flops placed in step (i) to form multi-bit flip-
flops to save power consumption while meeting the timing constraints. Since the flip-
flop grouping, thus the power reduction, depends on the placement of 1-bit flip-flops,
the result of step (i) has a strong effect on the result of step (ii). Yet, there is no easy
way to predict the result of step (ii) during step (i). In our approach, we place primary
importance on the cost of power saving rather than on the cost of placement. In other
words, our approach attempts to minimize the cost of power consumption first and then
the cost of placement later. Consequently, our algorithm tries to minimize the cost of
power consumption without committing to a particular way of placing flip-flops, while
taking into account a balance between the cost of power consumption and the cost of
placement.
Table 1.1: Comparison of normalized power and area of 1-bit, 2-bit and 4-bit flip-flops [1].
Bit size   Total power   Total area   Per-bit power   Per-bit area
1          100           100          1.00            1.00
2          172           192          0.86            0.96
4          312           285          0.78            0.71
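The per-bit savings quoted in the introduction follow directly from the totals in Table 1.1. The short check below recomputes them; the dictionary layout and the function name are illustrative choices, not part of the thesis.

```python
# Normalized totals from Table 1.1 (a 1-bit flip-flop is 100 for both power and area).
TABLE_1_1 = {1: (100, 100), 2: (172, 192), 4: (312, 285)}  # bits -> (total power, total area)

def per_bit_saving(bits):
    """Return (power saving, area saving) per bit relative to a 1-bit flip-flop."""
    base_power, base_area = TABLE_1_1[1]
    power, area = TABLE_1_1[bits]
    return 1 - power / (bits * base_power), 1 - area / (bits * base_area)

for bits in (2, 4):
    power_saving, area_saving = per_bit_saving(bits)
    print(f"{bits}-bit: power saving {power_saving:.0%}, area saving {area_saving:.0%}")
```

The 4-bit area saving is 28.75% before rounding, which is reported as 29% in the text.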
1.2 Algorithm for Multi-bit Flip-flop Allocation
We assume that the traditional logic synthesis has already been done. In other words,
the logic gates and (1-bit) flip-flops have been optimally synthesized under certain timing
constraints. However, how the 1-bit flip-flops are to be grouped to form multi-bit flip-flops
and how the logic gates and multi-bit flip-flops are to be placed under the timing
constraints have not yet been determined. We also assume that we have a library of
multi-bit flip-flops of any bit size. The problem to be solved in step (i) can be expressed as:
Problem 1. (Early multi-bit flip-flop allocation problem) The problem is to transform
an input logic circuit C with 1-bit flip-flops only into a logic circuit C′ with a
mixture of 1-bit and multi-bit flip-flops that minimizes the cost function
P(C′) = P_FF(C′) + P_CLK_TREE(C′)    (1.1)
while satisfying the timing constraints in placement, in which P_FF(C′) indicates the
total power consumed by the flip-flops in C′ and P_CLK_TREE(C′) indicates the power
consumed by the synthesized clock tree driving the flip-flops in C′. (In some literature,
the sum of the flip-flop power P_FF(C′) and the clock tree power P_CLK_TREE(C′) is
referred to as the clock power of C′.)
Let N be the number of 1-bit flip-flops in the input circuit C, and let n_1, n_2, ..., n_K
be the numbers of 1-bit, 2-bit, ..., K-bit flip-flops in C′, respectively, so that
1 · n_1 + 2 · n_2 + · · · + K · n_K = N. Then, the value of P_FF(C′) can be obtained by
summing the power consumptions of all flip-flops in C′, but the value of P_CLK_TREE(C′)
cannot be computed since the clock tree is not yet available at this pre-placement stage.
However, since the value of P_CLK_TREE(C′) is closely related to the total wire-length of
the clock tree, which is in turn significantly influenced by the number of sinks (i.e.,
flip-flops), the clock power cost P(C′) in Eq. 1.1, which should be minimized in solving
Problem 1, can be estimated by
\hat{P}(C') = \sum_{m=1}^{K} n_m \cdot P^{m}_{FF} + \alpha \cdot \sum_{m=1}^{K} n_m    (1.2)
where P^m_FF represents the power consumption of an m-bit flip-flop and α is a weighting
factor.
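As a minimal sketch of how the estimate in Eq. 1.2 could be evaluated, the function below sums the flip-flop power term and the sink-count term. The function name, the dictionary-based inputs, the value of α, and the example grouping are assumptions made for illustration; only the per-flip-flop powers are taken from Table 1.1.

```python
def estimated_clock_power(counts, ff_power, alpha):
    """Evaluate Eq. 1.2: sum_m n_m * P^m_FF + alpha * sum_m n_m.

    counts   -- dict mapping bit size m to n_m, the number of m-bit flip-flops
    ff_power -- dict mapping bit size m to P^m_FF, the power of one m-bit flip-flop
    alpha    -- weighting factor applied to the number of clock sinks
    """
    flip_flop_term = sum(n * ff_power[m] for m, n in counts.items())
    sink_term = alpha * sum(counts.values())
    return flip_flop_term + sink_term

ff_power = {1: 100, 2: 172, 4: 312}  # normalized totals from Table 1.1
alpha = 10                           # assumed weight, chosen only for illustration

# Hypothetical circuit with N = 8 one-bit flip-flops, before and after grouping
# into two 1-bit, one 2-bit and one 4-bit flip-flop (2*1 + 1*2 + 1*4 = 8).
before = estimated_clock_power({1: 8}, ff_power, alpha)
after = estimated_clock_power({1: 2, 2: 1, 4: 1}, ff_power, alpha)
print(before, after)
```

Both terms decrease after grouping here: the flip-flop term because of the per-bit saving, and the sink term because eight sinks become four.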
Fig. 1.2 shows the overall flow of our design methodology. Starting from an RTL
netlist, we apply a logic synthesizer to the netlist and produce a logic circuit C. Step
(i) groups 1-bit flip-flops in C to allocate multi-bit flip-flops, producing a circuit C′
that minimizes the quantity P̂(C′) in Eq. 1.2. (The details of step (i), which consists
of three sub-steps, are described in Section 1.3.) Step (ii) then performs placement and
clock tree synthesis for the C′ obtained from step (i), followed by checking whether the
timing constraints are satisfied. If the timing constraints are not met, we increase
the placement relaxation parameter ρ and reiterate steps (i) and (ii) until there is no
timing violation in the placement. ρ is a control parameter used to take into
account an estimate of the placement cost during the multi-bit flip-flop allocation in
step (i).
Figure 1.2: Flow of the proposed design methodology: Sub-step 2 takes into account the placement
cost while performing multi-bit flip-flop allocation in step (i). The cost of multi-bit flip-flop allocation in
Eq. 1.2 and the placement cost are balanced by controlling the value of the placement relaxation parameter
ρ. (Flowchart: Start; RTL netlist; logic synthesis; gate-level netlist; Step (i) multi-bit flip-flop allocation
with flip-flop grouping: 1. extract mergeable FF sets, 2. construct merging conflict graph with relaxation
parameter ρ, 3. select mergeable FF sets in G; Step (ii) placement/clock tree synthesis; decision
"Meet timings?", where No leads to placement relaxation and Yes returns the netlist with multi-bit
flip-flops.)
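The iteration between step (i) and step (ii) can be sketched as a simple loop. This is only an outline under stated assumptions: the three callables stand in for the allocation, placement/clock tree synthesis and timing check (they are hypothetical placeholders, not tool APIs), and the starting value of ρ, its increment, and the stop at ρ = 1 are illustrative choices that the text does not specify.

```python
def allocate_with_relaxation(circuit, allocate_mbff, place_and_cts, meets_timing,
                             rho=0.0, rho_step=0.25):
    """Repeat step (i) and step (ii), raising rho until the timing check passes."""
    while True:
        merged = allocate_mbff(circuit, rho)   # step (i): MBFF allocation for this rho
        layout = place_and_cts(merged)         # step (ii): placement and clock tree synthesis
        # Assumption: stop at rho = 1, the upper bound of the relaxation parameter.
        if meets_timing(layout) or rho >= 1.0:
            return merged, layout, rho
        rho = min(1.0, rho + rho_step)

# Toy stand-ins so the loop can be run; timing is "met" once rho reaches 0.5.
def toy_allocate(circuit, rho):
    return {"circuit": circuit, "rho": rho}

def toy_place(merged):
    return merged

def toy_timing(layout):
    return layout["rho"] >= 0.5

merged, layout, final_rho = allocate_with_relaxation("C", toy_allocate, toy_place, toy_timing)
print(final_rho)
```

A larger ρ admits fewer mergeable sets, so each retry trades some of the power saving for an easier placement.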
1.3 Placement Aware Multi-bit Flip-flop Allocation
Our multi-bit allocation algorithm in step (i) consists of three sub-steps: (Sub-step 1)
extracting all mergeable 1-bit flip-flop sets for the input circuit C, (Sub-step 2) constructing
a merging conflict graph G from the mergeable sets according to the value of the
placement relaxation parameter ρ, and (Sub-step 3) finding an independent set in G that
minimizes the value of P̂(C′) in Eq. 1.2.
1.3.1 Extraction of Mergeable Flip-flop Sets
If we assume a simple circuit C contains three flip-flops f1, f2 and f3 only, all possi-
bly mergeable sets of flip-flops will be {f1, f2}, {f1, f3}, {f2, f3}, and {f1, f2, f3}.
Thus, theoretically for C with N flip-flops, the number of mergeable sets of flip-flops
would exponentially increase as N increases. However, in practice, locality in C is
able to drastically prune the mergeable sets. We translate locality into a sort of place-
ment constraint and impose it on the meaning of ‘mergeable’, based on the following
definitions.
Definition 1. Let S_fi be the set of transitive fan-out gates (i.e., successor gates) of a
(1-bit) flip-flop f_i in C. For a set of (1-bit) flip-flops {f_1, f_2, ..., f_M} (M ≥ 2),
D({f_1, f_2, ..., f_M}) is defined as
D(\{f_1, f_2, \ldots, f_M\}) = \frac{|S_{f_1} \cap S_{f_2} \cap \cdots \cap S_{f_M}|}{|S_{f_1} \cup S_{f_2} \cup \cdots \cup S_{f_M}|}.    (1.3)
We call D(L), 0 ≤ D(L) ≤ 1, the intersection density driven by the flip-flops in set
L.
Fig. 1.3 shows the partition of a small circuit according to the derivation of tran-
sitive fan-outs of flip-flops. The intersection density of some subsets of flip-flops in
Fig. 1.3 is computed, as shown in Table 1.2, in which for example, |Sf1 ∩ Sf2 | =
|{g8, g9, g10, g15}| = 4 gates and |Sf1 ∪ Sf2 | = 13 gates, thus D({f1, f2}) = 4/13 =
0.31.
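The intersection density of Eq. 1.3 maps directly onto set operations. In the sketch below, the shared gates g8, g9, g10 and g15 come from the example above, while the remaining gate names in each fan-out set are hypothetical fillers chosen only so that the union has 13 gates.

```python
def intersection_density(fanouts):
    """Eq. 1.3: size of the common fan-out divided by the size of the combined fan-out."""
    fanouts = [set(s) for s in fanouts]
    union = set().union(*fanouts)
    if not union:
        return 0.0  # no fan-out gates at all, so nothing is shared
    common = set.intersection(*fanouts)
    return len(common) / len(union)

shared = {"g8", "g9", "g10", "g15"}
fanout_f1 = shared | {"g1", "g2", "g3", "g4", "g5"}   # hypothetical private gates of f1
fanout_f2 = shared | {"g6", "g7", "g11", "g12"}       # hypothetical private gates of f2

d12 = intersection_density([fanout_f1, fanout_f2])
print(round(d12, 2))
```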
Figure 1.3: Conceptual view of partitioning a circuit based on the transitive fan-outs of flip-flops.
(Diagram labels: flip-flops f1, f2, f3 and f4, each with D, Q and CK pins.)
Table 1.2: Computation of intersection densities of some subsets of flip-flops in the circuit in Fig. 1.3.
Set#   Flip-flops    Intersection density D(L): formulation           Value
L1     f1, f2        |S_f1 ∩ S_f2| / |S_f1 ∪ S_f2|                     4/13 = 0.31
L2     f2, f3        |S_f2 ∩ S_f3| / |S_f2 ∪ S_f3|                     3/12 = 0.25
L3     f3, f4        |S_f3 ∩ S_f4| / |S_f3 ∪ S_f4|                     1/10 = 0.10
L4     f1, f2, f3    |S_f1 ∩ S_f2 ∩ S_f3| / |S_f1 ∪ S_f2 ∪ S_f3|       1/17 = 0.06
Definition 2. L = {f1, f2, · · · , fM} (M ≥ 2) is called a mergeable flip-flop set if
D({fi, fj}) ≥ ρ, where ρ (0 ≤ ρ ≤ 1) is a parameter controlling the placement cost.
For example, when ρ is set to 0.20, L1 = {f1, f2} and L2 = {f2, f3} are mergeable
sets, but L3 = {f3, f4} and L4 = {f1, f2, f3} are not since D(L3) = 0.1 < 0.2 and
D(L4) = 0.06 < 0.2.
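The effect of ρ on the example can be reproduced with a small filter over the densities of Table 1.2. The sketch assumes, as in the worked example, that the threshold is compared against D(L) of the whole candidate set; the data structure and function name are illustrative.

```python
# Candidate sets and their intersection densities, taken from Table 1.2.
DENSITY = {
    "L1": (("f1", "f2"), 4 / 13),
    "L2": (("f2", "f3"), 3 / 12),
    "L3": (("f3", "f4"), 1 / 10),
    "L4": (("f1", "f2", "f3"), 1 / 17),
}

def mergeable_sets(density, rho):
    """Return the names of candidate sets whose density reaches the threshold rho."""
    return [name for name, (_, d) in density.items() if d >= rho]

print(mergeable_sets(DENSITY, 0.20))
print(mergeable_sets(DENSITY, 0.05))
```

Lowering ρ admits more candidate sets, which matches the trend in the ρ columns of Table 1.3.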
Table 1.3: The number of mergeable flip-flop sets
Design    # of FFs   # of Gates   # of mergeable sets
                                  Theoretical bound   ρ = 0.01   ρ = 0.2
s5378     161        686          5.06 × 10^21        411        240
s13207    647        1724         3.84 × 10^31        1235       789
s15850    560        2049         3.71 × 10^30        1023       809
s35932    1728       5837         2.84 × 10^38        3466       3079
s38417    1564       5410         5.73 × 10^37        2901       2191
s38584    1300       5996         2.94 × 10^36        2931       1849
Note that extracting mergeable flip-flop sets is accomplished by traversing gates in
topological order to identify the flip-flops driving the gates. Due to the locality
property of flip-flops, a circuit is most likely to be divided into a set of sub-circuits based on
their functions such that each sub-circuit is driven by a distinct set of flip-flops. This
flip-flop locality in fact keeps the number of mergeable sets manageable, as
validated by the data in Table 1.3, in which the number of mergeable flip-flop
sets is far below the theoretical bound.
1.3.2 Construction of Merging Conflict Graph
We extend the collection of mergeable flip-flop sets, L = {L1, L2, ...}, extracted in
Sub-step 1 by setting L′ = L ∪ {{f1}, {f2}, ..., {fN}}, where f1, f2, ..., fN
are the 1-bit flip-flops in the input circuit C and each {fi}, i = 1, ..., N, is
regarded as a self-mergeable set. Then, from L′ we create a weighted graph G(V, E, W),
called the merging conflict graph: (1) a unique node vi ∈ V is created for each Li ∈ L′,
i = 1, 2, ..., |L′|, (2) the weight wi ∈ W of vi is set to the amount of power
saving |Li| · P^1_FF − P^{|Li|}_FF, and (3) an edge (vi, vj) ∈ E exists if and only if
Li ∩ Lj ≠ ∅. Note that each edge expresses the constraint that the two flip-flop sets
corresponding to its endpoints cannot both be grouped to form multi-bit flip-flops. (We
assume that a 1-bit flip-flop will be grouped into one and only one multi-bit flip-flop.
In other words, we do not maintain multiple copies of data items.)
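The construction of the merging conflict graph can be sketched with plain dictionaries and sets. The representation (a weight dictionary plus an adjacency dictionary), the function name and the three example sets are assumptions for illustration; the flip-flop powers are the normalized totals from Table 1.1.

```python
from itertools import combinations

def build_conflict_graph(mergeable_sets, ff_power):
    """Return (weights, adjacency) of the merging conflict graph.

    mergeable_sets -- dict mapping a node name to the set of 1-bit flip-flops it merges
    ff_power       -- dict mapping bit size m to P^m_FF
    """
    # Node weight: power saved by one |L|-bit flip-flop versus |L| one-bit flip-flops.
    weights = {name: len(ffs) * ff_power[1] - ff_power[len(ffs)]
               for name, ffs in mergeable_sets.items()}
    # Edge: two sets conflict when they share at least one flip-flop.
    adjacency = {name: set() for name in mergeable_sets}
    for a, b in combinations(mergeable_sets, 2):
        if mergeable_sets[a] & mergeable_sets[b]:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return weights, adjacency

# Hypothetical example: two overlapping 2-bit candidates and one self-mergeable set.
sets = {"A": {"f1", "f2"}, "B": {"f2", "f3"}, "C": {"f4"}}
weights, adjacency = build_conflict_graph(sets, {1: 100, 2: 172})
print(weights, adjacency)
```

Self-mergeable sets get weight 0 by construction, since a 1-bit flip-flop saves nothing over itself.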
1.3.3 Selection of Mergeable Flip-flop Sets
In the last step, we transform the problem of minimizing the quantity P̂(C′) in
Eq. 1.2 into the problem of finding a set cover on the merging conflict graph G(V, E, W)
that leads to a minimal value of P̂(C′). Equivalently, we solve the problem of finding
a set R = {L_m1, L_m2, ..., L_mK} ⊆ V such that (1) L_m1 ∪ L_m2 ∪ · · · ∪ L_mK =
{f1, f2, ..., fN}, which is the set of flip-flops in C, and (2) the value of
\sum_{i=1}^{K} P^{|L_{m_i}|}_{FF} + \alpha \cdot |R|
is minimal.
Since the problem is a variant of the set packing problem, it is very unlikely that an
optimal solution can be found in polynomial time. (The set packing problem asks if
some k subsets in a list of subsets of a finite set S are pairwise disjoint, in other words,
no two of them share an element. When α = 0, since minimizing
\sum_{i=1}^{K} P^{|L_{m_i}|}_{FF}
is equivalent to maximizing
\sum_{i=1}^{K} ( |L_{m_i}| \cdot P^{1}_{FF} - P^{|L_{m_i}|}_{FF} ),
in which each term is the amount of power saving by an |L_mi|-bit flip-flop, the problem
is reduced to the maximum weighted set packing problem, whose decision version, set
packing, is one of Karp's 21 NP-complete problems [4].) To solve our multi-bit flip-flop
allocation problem efficiently, we use the greedy algorithm shown in Algorithm 1, where
we look for the node (i.e., mergeable set) in G(V, E, W) which has the smallest node degree
d_i = ( \sum_{L_j \in adjacent(L_i)} w_j ) / w_i
among all nodes Li ∈ V, add the node to our solution, and remove it and the nodes it is
connected to from G. We repeat this until no nodes are left.
Figure 1.4: An example of a conflict graph for 7 mergeable flip-flop sets. (a) Given the conflict graph,
we set the vertex weights to the amount of power saving. For simplicity, we ignore the vertices of 1-bit
flip-flops, whose power saving weight is 0. (b) We evaluate the vertex degrees considering their neighbors
and select the minimum degree vertex L7. (c) We remove the selected vertex and its neighbors from G.
(d)(e)(f) We repeat these steps until G becomes empty. (Panels (a) to (f); vertices are annotated with
their flip-flop sets, weights and degrees, and marked "Selected" or "Removed".)
Fig. 1.4 shows a step-by-step procedure of our algorithm for a design with seven
mergeable flip-flop sets L1, L2, ..., L7 extracted among eight flip-flops f1, f2, ..., f8.
We create a conflict graph G in Sub-step 2 and set the node weights to the amount of
power saving, as shown in Fig. 1.4(a). We ignore the 1-bit flip-flop sets {f1}, {f2}, ..., {f8}
for simplicity. In Fig. 1.4(b), we calculate the degree of all nodes, select the
node L7, which has the smallest degree among them, and put it into R. In Fig. 1.4(c),
we remove the selected node L7 and its neighbors L6 and L4 from the graph G. In
Figs. 1.4(d), (e) and (f), we repeat the same process on the remaining graph G until it
becomes empty. Finally, we get the set R, which contains two 3-bit flip-flops L1, L7 and
one 2-bit flip-flop L5.
Although no efficient algorithm is guaranteed to always produce results close to the
minimum of the P̂ cost, our greedy heuristic does so on many practical inputs. (If we
assume that no node exceeds size k, i.e., |Li| ≤ k with k ≥ 3, the answer can be
approximated within a factor of k/2 + ε for any ε > 0; in particular, the problem in
which every node has |Li| = 3 can be approximated within about 50%. In another, more
tractable variant, if no 1-bit flip-flop occurs in more than k of the nodes, the answer
can be approximated within a factor of k [4].)
1.4 Experimental Results
1.4.1 Experimental Setup
We test our algorithm on two sets of circuit netlists: one set consists of ISCAS89
benchmark circuits and the other is a set of two large circuits, created by combining
benchmark circuits, having about 100,000 flip-flops each (84,000 and 119,200; see
Table 1.4). We use the 45nm Nangate Open Cell Library and its PDK for our process and
library data. To guarantee worst-case performance, we use a slow PVT corner. In addition,
we generate an MBFF library with bit sizes from 2 to 16. Fig. 1.5 shows the normalized
per-bit power consumption of the flip-flops of different bit widths in logarithmic scale.
We also adjust the clock pin capacitance of each MBFF to be proportional to its bit size.
We use Design Compiler (DC) as the synthesis tool and IC Compiler (ICC) as the P&R tool,
both from Synopsys Inc., and apply default tool options to all design steps: logic
synthesis, cell placement and clock tree synthesis.
Algorithm 1 A greedy algorithm for Sub-step 3 (input: G(V, E, W); output: R)
R = ∅;
while G ≠ ∅ do
    // Find Lmax, the vertex with the smallest degree dmin
    Lmax = nil;
    dmin = ∞;
    for each vertex Li in G do
        // Calculate degree d of vertex Li
        d = 0;
        for each vertex Lj ∈ adjacent(Li) do
            d += wj; // Add the weights of all neighbors
        end for
        d = d/wi; // Divide by its own weight
        if (Lmax is nil || d < dmin) then
            // Keep the vertex whose degree is the smallest so far
            Lmax = Li;
            dmin = d;
        end if
    end for
    // Remove the adjacent vertices of Lmax from G
    for each vertex Lj ∈ adjacent(Lmax) do
        G = G − {Lj};
    end for
    G = G − {Lmax}; // Remove Lmax from G
    R = R ∪ {Lmax}; // Add Lmax to R
end while
return R
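A runnable sketch in the spirit of Algorithm 1 is given below. Several details are assumptions rather than the thesis implementation: the graph is rebuilt from the flip-flop sets inside the function, zero-weight (1-bit) nodes are given an infinite degree so that they are picked only when nothing better remains (which also avoids the division by zero), and ties are broken by node name. The example sets are hypothetical, with weights derived from Table 1.1 (28 for a 2-bit and 88 for a 4-bit flip-flop).

```python
import math
from itertools import combinations

def greedy_select(sets, weights):
    """Repeatedly pick the node with the smallest weighted degree, then drop it and its neighbors."""
    adjacency = {name: set() for name in sets}
    for a, b in combinations(sets, 2):
        if sets[a] & sets[b]:              # conflict: the two sets share a flip-flop
            adjacency[a].add(b)
            adjacency[b].add(a)

    remaining = set(sets)
    selected = []

    def degree(node):
        if weights[node] == 0:
            return math.inf                # assumption: 1-bit sets are a last resort
        neighbors = adjacency[node] & remaining
        return sum(weights[n] for n in neighbors) / weights[node]

    while remaining:
        best = min(sorted(remaining), key=degree)   # sorted() makes tie-breaking deterministic
        selected.append(best)
        remaining -= adjacency[best] | {best}
    return selected

sets = {"Q": {"f1", "f2", "f3", "f4"}, "A": {"f4", "f5"}, "B": {"f5", "f6"}}
sets.update({f"s{i}": {f"f{i}"} for i in range(1, 7)})   # self-mergeable 1-bit sets
weights = {"Q": 88, "A": 28, "B": 28, **{f"s{i}": 0 for i in range(1, 7)}}

selected = greedy_select(sets, weights)
print(selected)
```

In this example the selected sets are pairwise disjoint and together cover f1 to f6, so the result is both an independent set in the conflict graph and a cover of the flip-flops.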
Figure 1.5: Normalized power consumption per bit of MBFF up to 16-bit. (Plot: x-axis, bits, 0 to 16;
y-axis, normalized power per bit, 0.5 to 1.1.)
Figure 1.6: Layout views for S5378. The yellow lines indicate clock tree networks. The colored small
rectangles represent 1-bit flip-flops and MBFFs; light-orange is 1-bit flip-flop, light-red is 2∼4-bit MBFF,
orange is 5∼9-bit MBFF and red is 10∼16-bit MBFF. (a) Result produced by our algorithm. (b) Result
produced by random FF grouping to be the same number of MBFFs as that in (a); red dash indicates
timing violation (labeled "Timing Failed Path"). (c) Result produced by [1].
Table 1.4: Comparison of the number of flip-flop cells, clock tree power P_CLK_T, flip-flop power
P_FF, and the total power for the results produced by the MBFF allocation [1] in post-placement, our
algorithm (pre-placement), and our algorithm followed by [1]. Power values are in mW; Red. is the
reduction in total power.
Circuit   # of 1-bit FFs   Method       # of n-bit FFs                     P_CLK_T   P_FF     Total     Red.
                                        1       2∼4     5∼9    10∼16
s5378     161              [1]          20      57      0      0          0.061     0.136    0.243     6.9%
                           Ours         44      10      9      2          0.054     0.124    0.216     17.2%
                           Ours + [1]   24      19      9      2          0.054     0.121    0.212     18.8%
s13207    647              [1]          85      195     0      0          0.183     0.537    0.767     13.2%
                           Ours         274     38      4      13         0.187     0.541    0.782     11.5%
                           Ours + [1]   60      128     4      13         0.172     0.501    0.727     17.8%
s15850    560              [1]          65      183     0      0          0.149     0.441    0.640     12.3%
                           Ours         274     34      2      12         0.142     0.432    0.626     14.2%
                           Ours + [1]   60      119     2      12         0.139     0.401    0.592     18.9%
s35932    1728             [1]          164     638     0      0          0.627     1.776    3.075     9.4%
                           Ours         286     1       32     128        0.575     1.477    2.732     19.5%
                           Ours + [1]   144     63      32     128        0.569     1.456    2.705     20.3%
s38584    1300             [1]          185     459     0      0          0.381     1.091    1.943     10.5%
                           Ours         465     97      33     26         0.365     1.011    1.854     14.6%
                           Ours + [1]   202     211     33     26         0.363     0.980    1.821     16.2%
large1    84000            [1]          10066   28074   21     0          17.883    57.753   86.714    11.2%
                           Ours         51464   4624    1045   1018       18.235    59.742   89.554    8.3%
                           Ours + [1]   14379   19363   1055   1018       17.524    55.181   84.611    13.4%
large2    119200           [1]          13571   40898   34     0          19.621    56.243   99.555    9.7%
                           Ours         58358   11743   3095   953        19.542    57.099   100.728   8.7%
                           Ours + [1]   20021   26802   3108   953        18.948    54.003   97.097    12.0%
Average P_TOTAL reduction: [1] 10.5%, Ours 13.5%, Ours + [1] 16.8%
1.4.2 Comparing with Academic Algorithm
Table 2.2 summarizes the results produced by the recent work of MBFF allocation in
[1], which is applicable at the post-placement stage, our algorithm applied at the logic
synthesis (pre-placement) stage, and our algorithm followed by [1]. We compare the
numbers of 1-bit and multi-bit flip-flops, power consumption, PCLK T , of clock tree,
power consumption, PFF , of flip-flops, and the total power consumption which is the
14
Table 1.5: Comparison of the number of flip-flop cells, clock tree power P_CLK_T, flip-flop power
P_FF, combinational logic power P_COMBI, total power P_TOTAL, longest path delay t_lpd, and worst
local clock skew t_skew for the results produced by Design Compiler's MBFF algorithm and by our
algorithm followed by [1]. Power is in mW; t_lpd and t_skew are in ns; (Red.) is the reduction.
Circuit | Flow | # of MBFFs (1 / 2-4 / 5-9 / 10-16 bits) | P_CLK_T (Red.) | P_FF (Red.) | P_COMBI (Red.) | P_TOTAL (Red.) | t_lpd (Red.) | t_skew (Red.)
s35932 | Design Compiler | 0 / 0 / 69 / 102 | 0.558 (18.4%) | 1.454 (29.9%) | 0.518 (-24.2%) | 2.530 (20.3%) | 0.593 (-2.4%) | 0.069 (-19.3%)
s35932 | Ours + [1] | 42 / 139 / 26 / 124 | 0.570 (16.7%) | 1.482 (28.6%) | 0.445 (-6.7%) | 2.497 (21.4%) | 0.534 (9.9%) | 0.076 (-8.9%)
s38584 | Design Compiler | 0 / 1 / 44 / 70 | 0.283 (20.5%) | 0.714 (32.5%) | 0.356 (-7.2%) | 1.353 (22.5%) | 0.568 (-12.1%) | 0.101 (-32.6%)
s38584 | Ours + [1] | 173 / 218 / 31 / 17 | 0.314 (11.8%) | 0.841 (20.5%) | 0.302 (9.0%) | 1.458 (16.5%) | 0.500 (12.0%) | 0.087 (14.0%)
large2 | Design Compiler | 0 / 9 / 2791 / 6271 | 13.475 (17.5%) | 32.599 (33.2%) | 21.381 (-45.0%) | 67.455 (15.6%) | 2.069 (-4.5%) | 0.326 (-60.6%)
large2 | Ours + [1] | 16337 / 23993 / 1901 / 513 | 14.860 (9.1%) | 41.545 (14.9%) | 14.922 (-1.2%) | 71.327 (10.7%) | 1.498 (24.3%) | 0.265 (-30.9%)
sum of P_CLK_T, P_FF, and the power consumption of combinational logic cells. The power
reduction numbers in parentheses are the rates of reduction from the initial testcases with no
MBFF allocation. All reported power values were evaluated after clock tree synthesis.
The comparison shows that our algorithm reduces the total power by 13.5% on average
(up to 19.5%), which is more than the reduction achieved by [1] at the post-placement stage,
which is 10.5% on average (up to 13.2%). Note that the power reduction by our algorithm
varies considerably, from 8.3% for LARGE1 to 19.5% for S35932. Regarding the number
of MBFFs, our algorithm was able to group up to 16 flip-flops, while [1], used at the
post-placement stage, could not group more than 10 flip-flops due to timing violations. As
shown in the last four columns of Table 2.2, our algorithm can be combined with any
MBFF allocation algorithm used in the placement stage. The power improvement from
the combined application of ours and [1] is consistent for all testcases, saving
about 6% more power than [1] alone. In addition, the total power reduction in all cases
indicates that the power increase of combinational logic cells due to MBFF allocation is
not significant, because P_CLK_T and P_FF are much more dominant in the total power
consumption.
Fig. 1.6 shows the chip layout views, for S5378, produced by our algorithm, by random
FF grouping until the number of MBFFs equals that in (a), and by [1]. We can see
that the view in (a) has a simpler clock tree structure (due to fewer flip-flop
cells) than that in (c).
1.4.3 Comparing with Industry Algorithm
We compare our results with those produced by Synopsys Design Compiler, which
supports in-placement MBFF allocation. Design Compiler (DC) recognizes the
cell placement in topographical mode and groups flip-flops with the
identify_register_banks and compile_ultra commands. DC finds compatible groups of 1-bit
flip-flops that are physically close to each other to create MBFFs. However, it does not take
timing information into account and thus sometimes incurs timing violations. For a fair
comparison, we performed the same flow as DC for s35932, s38584, and large2. The
only difference is that our algorithm used the compile_ultra command. Table 1.5 summarizes
the results in terms of the numbers of resulting flip-flops, the power consumptions
P_CLK_T, P_FF, and P_COMBI of the clock tree, flip-flops, and combinational cells,
the longest path delay t_lpd, and the worst local clock skew t_skew. Since DC allocates
MBFFs very aggressively, no 1-bit flip-flops are left, and the total power consumption
is reduced by 19.5% on average, which is more than the 16.2% average reduction achieved
by our algorithm. However, since DC does not consider the effect on timing, DC must be
used carefully for MBFF allocation. In addition, the analysis of Table 1.5 reveals a
number of weaknesses in DC's MBFF allocation: (1)
DC increases the power consumption of combinational cells significantly, by 25.5%
on average (up to 45.0% in large2), while our algorithm decreases it by 0.4%
on average. Since ours uses the same total number of combinational cells as
DC, the power increase in combinational cells by DC is more likely to increase chip
power density and cause IR-drop hot-spot problems; (2) DC increases the longest path
delay by 6.3% on average, while our algorithm decreases it by 15.4% on average. The
longer delay produced by DC makes the chip more vulnerable to process variation; (3) DC
increases the worst local clock skew significantly, by 37.5% on average (up to 60.6%), while
our algorithm increases it by only 8.6% on average. The large increase of clock skew by
DC may increase the chance of timing or even functional failure.
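The averages quoted in this comparison can be re-derived from the per-circuit reduction entries of Table 1.5. The following is a minimal sketch of that arithmetic; the dictionary layout and the names `reductions`, `average`, and `summary` are illustrative assumptions and not part of the original flow. Negative values denote an increase.

```python
# Reduction percentages copied from Table 1.5 (negative = increase).
# Circuit order in each list: s35932, s38584, large2.
reductions = {
    "Design Compiler": {
        "P_TOTAL": [20.3, 22.5, 15.6],
        "P_COMBI": [-24.2, -7.2, -45.0],
        "t_lpd": [-2.4, -12.1, -4.5],
        "t_skew": [-19.3, -32.6, -60.6],
    },
    "Ours + [1]": {
        "P_TOTAL": [21.4, 16.5, 10.7],
        "P_COMBI": [-6.7, 9.0, -1.2],
        "t_lpd": [9.9, 12.0, 24.3],
        "t_skew": [-8.9, 14.0, -30.9],
    },
}


def average(values):
    """Arithmetic mean of a list of percentages."""
    return sum(values) / len(values)


# Average reduction per flow and metric, rounded to one decimal place.
summary = {
    flow: {metric: round(average(values), 1) for metric, values in metrics.items()}
    for flow, metrics in reductions.items()
}

for flow, metrics in summary.items():
    print(flow, metrics)
```

A negative average for P_COMBI, t_lpd, or t_skew corresponds to the "increase" wording used in the text, and a positive one to a "decrease".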
1.5 Conclusion
We presented in this paper a new approach to the multi-bit flip-flop allocation prob-
lem. Previous approaches divided the allocation problem into two steps: (i) placing
single-bit flip-flops under circuit timing constraints and (ii) minimizing the clock tree
and flip-flop power by grouping single-bit flip-flops to form multi-bit flip-flops. In
our approach, we placed primary importance on the cost of power consumption rather
than on the cost of placement. Consequently, we attempted to minimize the power
consumption by synthesizing multi-bit flip-flops first and then to place them later. Ex-
perimental results have demonstrated that our approach of early consideration of syn-
thesizing multi-bit flip-flops offered great benefits on reducing the power consumption
of designs while minimizing the impact of timing during the synthesis.
Chapter 2
Switch Cell Optimization of Power-gated Modern System-
on-Chips
2.1 Introduction
As smartphones have become popular, various kinds of mobile devices are now
manufactured and used, and the demand for higher performance and longer
battery life remains strong in the marketplace. Integrating many cores into a
chip and enhancing the core architecture have been the mainstream strategy for
performance improvement, while power-gating, clock-gating, and dynamic voltage and
frequency scaling (DVFS) have been essential ingredients of low-power on-chip
design methodologies.
In particular, power-gating is one of the most commonly used techniques to save
standby leakage by shutting off the current to blocks of the circuit that are not in
use. This is accomplished by adding switch cells to either the VDD or the VSS supply.
Fig. 2.1 illustrates the logical structure of a power-gated power delivery network (PDN),
in which the switch cells are able to cut the current flowing from the power sources
(e.g., IO power pads or bumps) to the power sinks (e.g., standard cells) during the standby
state. Switch cells can be classified into two types depending on their role: the header
type for cutting power and the footer type for cutting ground. In real products, the header
type of switch cell is more popular. For the implementation of header-type cells, the part
of the PDN from the power sources to the switch cells is called the real power network or
RVDD net, and the part from the switch cells to the power sinks is called the virtual
network or VVDD net. In other words, the RVDD net, the switch cells, and the VVDD net form a
serially connected power network, and the switch cells cut or link the power network.
Since the switch cells act as resistors, expressed as Ron, during active (i.e., ON)
state, multiple switch cells of small size should be placed on the area of logic standard
cells in order not to violate IR-drop constraint [20]. This dispersion of switch cells also
helps prevent excessive in-rush current during wake-up state (e.g., [8, 11]).
[Figure 2.1 labels: VDD, real power net (RVDD), NSLEEP, switch cells, virtual power net (VVDD), standard cells (IN, OUT, CLK), circuit area, ground net (VSS)]
Figure 2.1: A logical structure of power-gated power delivery network (PDN)
As the portion of static power dissipation grows with technology scaling [7], and as
internet-of-things (IoT) devices, which spend more time in standby mode, become
widespread, reducing standby leakage becomes important for ensuring much longer
battery life. This work addresses the problem of reducing the standby leakage dissipated
through the switch cells. The Super Cut-Off CMOS (SCCMOS) scheme [6] is one
notable solution on this topic: it overdrives the control (i.e., wakeup) signals
above VDD, which reduces the standby leakage at the switch cells, and
it is adopted in [12] for ultra-low power management. This scheme, however,
requires additional circuitry to generate the overdriving voltage level, which itself
dissipates standby leakage. In this work, we aim to reduce standby leakage at switch cells
without such an additional burden.
A common knob that prior works ([9, 20, 13]) have used to control the number and
type of switch cells is the total current around the cells. Kozhaya and Bakir
[9] determined the minimal number of switch cells by computing I_VI / I_PGC, where
I_VI is the given total current demand of the circuit on a voltage island and I_PGC is the
maximum current that can be supplied by a switch cell, and then iteratively allocated
switch cells so that their total number is as close as possible to I_VI / I_PGC while
meeting the IR-drop constraint. They employed a greedy approach. Unlike [9], Yong and Ung [20]
exploited the effect of cell placement on the determination of the cells to be allocated:
they proposed a method, called power perimeter scanning, to find the widest
sub-regions on which the contained switch cells can endure the supply current.
Every cell lying on two or more sub-regions is considered for removal if the removal
causes no IR-drop violation. One limitation of this work is that the number of
sub-regions considered is exponential. On the other hand, Lin and Kin [13] partitioned a
whole circuit into sub-regions according to the current constraint and computed the
effective resistance of the sub-regions to estimate the number and type of switch cells.
They performed two steps: (1) they formulated the problem of determining, for each
sub-region, the locations at which switch cells should be placed as an ILP (integer
linear programming) problem; (2) they iteratively controlled the effective resistance (i.e.,
resizing) of the switch cells allocated in step 1 until there was no IR-drop violation. The
works in [20, 13] considered the effect of cell placement. However, (limitation 1) since
their controlling knob is based on current rather than IR-drop, their optimizations are
indirect. In addition, (limitation 2) the current-based knob cannot handle IR-drop
hot-spots that take place on the power/ground rails because of their large resistance, as
supported by the data shown in Fig. 2.4. This work overcomes these two limitations of
the prior works.
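As a small illustration of the current-based knob described for [9], the sketch below computes a lower bound on the switch cell count from I_VI and I_PGC. Rounding the ratio up to an integer is an assumption made here for illustration, and the function name and the current values are hypothetical, not data from [9] or from this work.

```python
import math


def min_switch_cell_count(i_vi_ma, i_pgc_ma):
    """Lower bound on the number of switch cells from the ratio I_VI / I_PGC.

    i_vi_ma:  total current demand of the voltage island (mA)
    i_pgc_ma: maximum current one switch cell can supply (mA)
    Taking the ceiling of the ratio is an assumption of this sketch.
    """
    if i_pgc_ma <= 0:
        raise ValueError("I_PGC must be positive")
    return math.ceil(i_vi_ma / i_pgc_ma)


# Hypothetical numbers, for illustration only.
print(min_switch_cell_count(121.0, 2.5))
```

Such a count depends only on current, which is why it cannot see resistance-dominated hot-spots on the rails.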
2.2 Preliminaries and Motivations
[Figure 2.2 plot: standby leakage (uA, 0 to 120) versus the number of switch cells (1 to 71) for HEADX32_RVT, HEADX16_RVT, HEADX8_RVT, HEADX4_RVT, and HEADX2_RVT; fitted line y = 0.154x + 0.474, R^2 = 0.99]
Figure 2.2: The changes of standby leakage of the openMSP430 circuit as the number and
type (size) of switch cells vary.
Fig. 2.2 shows how the (total) standby leakage of the openMSP430 circuit changes as
the number and the type (i.e., size) of the switch cells used in the circuit
change. The curves show that as the number of switch cells increases, the standby
leakage consistently increases as well. In addition, the larger the switch cells are,
the higher the leakage: for example, when using HEADX32_RVT and HEADX16_RVT cells, the
standby leakage increases roughly 3 and 5 times more than when using HEADX2_RVT,
respectively. Consequently, the analysis of Fig. 2.2 indicates that it is necessary to
use as few switch cells as possible and to keep the cell size as small as possible.
However, since smaller cells act as a higher resistance (R_on) in the PDN, using too few
cells or cells that are too small would lead to a sharp voltage drop in the PDN. Thus,
we need to find a minimal number (i.e., switching locations) and type of switch cells that
prevent as much unnecessary standby leakage dissipation as possible while meeting the
IR-drop constraint. Note that the number of switch cells required to meet the IR-drop
constraint is typically much larger than the number required by the wake-up delay
constraint, because of strong physical constraints such as the large resistance of the
switch cells and power/ground rails; we therefore do not take the wake-up delay constraint
into consideration in our design specification. (If the wake-up delay still matters, one
simple strategy is to choose and revive additional switch cells later.)
[Figure 2.3 labels: C4 bumps; upper thick metal stack, low resistance (Metal7~Metal9); middle thin metal stack, middle resistance (Metal3~Metal6); lower thin power rail, high resistance (Metal1~Metal2); power-gating switch cell, high R_on]
Figure 2.3: Typical PDN structure used in current mobile SoCs.
Fig. 2.3 shows a bird's-eye view of the typical PDN structure of current mobile SoCs, where
seven or more metal layers are commonly used: the two uppermost layers, with wide
and thick metals having very low resistance, are exclusively assigned to the power/ground
network to deliver as large a current as possible; the lowest one or two layers are used as
the rails of the PDN, which are either in the cell library or added before/after the cell
placement to deliver current to standard cells. The remaining layers are implemented as a
power/ground mesh to bridge between the uppermost stacks and the bottom power/ground rails.
[Figure 2.4 panels: (a) switch cell placement with the switch cell pitch marked; (b) IR-drop map with violation spots; (c) stacked bars (0% to 100%) of resistance and IR-drop split among VVDD rail, VVDD mesh/via, switch cell, and RVDD, with labeled segments 46.5%, 59.2%, 50.6%, and 33.7%]
Figure 2.4: An example of (a) switch cell placement, (b) its IR-drop map under an
assumption of fully placed power/ground IO bumps, and (c) IR-drop and resistance
composition for the worst IR-drop path of the aes256 circuit, categorized by elements of
the power net.
The large resistance of the thin and narrow power/ground rails makes a large
contribution to the IR-drop violation. Likewise, since the effective resistance of switch
cells is high, the portion of the IR-drop caused by the cells is large as well. (Supporting
data are included in Fig. 2.4(c) and will be explained later.) Switch cells are evenly
distributed in the standard cell area with a pre-defined pitch p, as shown in Fig. 2.4(a),
where the small yellow boxes indicate pre-placed switch cells. Since the switch cell
placement has already been done at the pre-placement stage, as in Fig. 2.4(a), it may
cause an IR-drop violation at the post-placement stage, as indicated in Fig. 2.4(b), in
which the red spots indicate violations of the IR-drop constraint. We traced the power
delivery path from the source to the standard cells around which the worst IR-drop
occurs in Fig. 2.4(b) and measured the voltage drops and resistances along RVDD →
switch cell → VVDD mesh/via → VVDD rail of the worst IR-drop path.
Fig. 2.4(c) shows the relative portions of the resistances and voltage drops among
RVDD, switch cell, VVDD mesh/via, and VVDD rail of the worst IR-drop path.
Clearly, it shows that switch cells can be the major source of the violation of the
IR-drop constraint, contributing about 50%. Because a large number of switch cells must be
distributed and the secondary routing of always-on and wakeup signals to the
switch cells is a burden, the switch cell pitch p is determined at the power-plan stage
before cell placement. Consequently, setting a small p at the pre-placement stage may avoid
any IR-drop problem across the entire circuit, but the standby leakage of so many switch
cells will be enormous. This work attenuates this leakage by selecting switch cells to be
removed or resized while strictly satisfying the IR-drop constraint.
2.3 Problem Formulation
Objective: Let L denote the switch cell library that contains K types of switch cells
sw_1, sw_2, ..., sw_K, arranged in increasing order of driving strength. Then, the
standby leakage dissipated by the switch cells in a power-gated circuit C can be
expressed as:

    I_{standby}(C) = \alpha \cdot \sum_{sw_i \in L} I_{sw_i} \cdot n_{sw_i} + \beta    (2.1)

where I_{sw_i} and n_{sw_i} represent the amount of standby leakage and the number of switch
cells of type sw_i used in C, respectively, α is a weighting factor, and β is a constant
that indicates the total leakage on the always-on parts.
Constraint: Let S = {s_1, s_2, ...} be the set of switch cells placed on C. For
simplicity, we adopt a constant global voltage margin rather than the variation margin
for advanced OCV (AOCV) [19], and consider static IR-drop. Then, the voltage drop
V_{s_i} at cell s_i must not exceed the constraint value V_{limit} for every
i = 1, ..., |S|, i.e.,

    \max_{s_i \in S} V_{s_i} \le V_{limit}    (2.2)

Optimization: We want to determine the switch cells in S to be removed or resized
(i.e., replaced with cells of another type) so as to minimize I_{standby}(C) in
Eq. 2.1 while satisfying the constraint in Eq. 2.2.
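The objective of Eq. 2.1 and the constraint of Eq. 2.2 can be evaluated with a few lines of code. The sketch below is only an illustration: the function names, the per-type leakage values, and the voltage drops are assumptions chosen for the example, and α = 1 and β = 0 are used as defaults for simplicity.

```python
def standby_leakage(cell_counts, leakage_per_type, alpha=1.0, beta=0.0):
    """Eq. 2.1: alpha * sum over types of (leakage of type * count) + beta."""
    return alpha * sum(
        leakage_per_type[sw_type] * count for sw_type, count in cell_counts.items()
    ) + beta


def meets_ir_drop(voltage_drops_mv, v_limit_mv):
    """Eq. 2.2: the largest voltage drop over all switch cells must not exceed V_limit."""
    return max(voltage_drops_mv.values()) <= v_limit_mv


# Hypothetical example: two cells of type sw2 and one cell of type sw3.
leakage = {"sw1": 0.404, "sw2": 0.672, "sw3": 1.239}
counts = {"sw2": 2, "sw3": 1}
drops = {"s1": 18.0, "s2": 24.5, "s4": 21.0}  # mV, illustrative values

print(round(standby_leakage(counts, leakage), 3))
print(meets_ir_drop(drops, v_limit_mv=25.0))
```

The optimization then searches over which cells to keep and which types to assign so that the first quantity is as small as possible while the second check still passes.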
2.4 The Proposed Algorithm
The input to our switch cell optimization algorithm is a circuit C in which the
placement and routing of logic cells as well as a uniform distribution of switch cells
have been completed. Our algorithm then performs two steps: (Step 1) extracting a set of
maximally feasible regions for every switch cell location in C and (Step 2) selecting a
subset of the maximally feasible regions obtained in Step 1 that covers C such that the
value of I_standby(C) is minimized under the V_limit constraint. The overall flow of our
algorithm is shown in Fig. 2.5. The output of the algorithm is the set of switch cells to be
turned on (i.e., the cells in the regions selected in Step 2) and their switch cell types in
L for replacement.
2.4.1 Extraction of Maximally Feasible Subregions
This step identifies all subregions, expressed as r(s_i, sw_j, b_{x1,y1}, ..., b_{xk,yk}),
where the notation denotes that deploying switch cell s_i of type sw_j alone resolves any
IR-drop violation that may occur at the locations in b_{x1,y1}, ..., b_{xk,yk}. We call such
a subregion r(·) feasible, and call it maximally feasible if it is feasible and no
feasible subregion of s_i with type sw_j properly contains r(·). We collect all such
maximal subregions into the set R. The extraction of all maximally feasible subregions is
performed in two sub-steps:
[Figure 2.5 flow: Chip with completed P&R → Step 1. Extract feasible regions for each switch cell (Sec. 2.4.1): Step 1.1 Bin partitioning, Step 1.2 Generating all maximally feasible regions → Step 2. Switch cell covering for minimal standby leakage adopting approximation algorithm (Sec. 2.4.2) → Remove the switch cells not in the solution set from C]
Figure 2.5: The flow of our two-step algorithm of switch cell optimization
Step 1.1 (Bin partitioning): We partition circuit C into regular bins in such a way that
every switch cell is placed at the center of a distinct bin. Fig. 2.6 shows an
example of the partition. Any style of bin partitioning is acceptable in our
algorithm, but the bottom line is that each bin must be small enough that the centered
switch cell of the smallest type sw_1 is able to completely resolve the IR-drop violation
on the bin. In this work, we apply vertically and horizontally equal-interval
partitioning. We use the notation b_{x,y} to denote the bin at the xth column and yth row.
[Figure 2.6: grid of bins with X columns and Y rows]
Figure 2.6: An illustration of bin partitioning (Step 1.1), horizontally along the
power/ground rails and vertically along the center of the locations of two serially
placed cells.
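The equal-interval partitioning of Step 1.1 can be sketched as a simple coordinate-to-bin mapping. The helper names `bin_of` and `bin_center`, the 1-based indexing, and the region dimensions below are assumptions for illustration; the actual partitioning follows the rail and switch cell locations of the placed design.

```python
def bin_of(px, py, width, height, n_cols, n_rows):
    """Return the 1-based (column, row) index of the bin b_{x,y} containing a point."""
    bin_w = width / n_cols
    bin_h = height / n_rows
    # Points on the right or top boundary are assigned to the last bin.
    x = min(int(px // bin_w), n_cols - 1) + 1
    y = min(int(py // bin_h), n_rows - 1) + 1
    return x, y


def bin_center(x, y, width, height, n_cols, n_rows):
    """Return the center coordinate of bin b_{x,y}, where a switch cell would sit."""
    return ((x - 0.5) * width / n_cols, (y - 0.5) * height / n_rows)


# Hypothetical 180um x 120um region split into 3 columns and 4 rows.
cx, cy = bin_center(2, 4, 180.0, 120.0, 3, 4)
print((cx, cy), bin_of(cx, cy, 180.0, 120.0, 3, 4))
```

With this mapping, checking that every switch cell falls into a distinct bin amounts to verifying that no two cells return the same (column, row) pair.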
Step 1.2 (Generating all maximally feasible subregions): We then use an incremental
approach to find the maximally feasible subregions. The following process is performed
iteratively for every s_i ∈ S. For example, let R = {} and let us focus on s_6 in
Fig. 2.7(a). By the definition of the bin generation, R = {r(s_6, sw_1, b_{2,4})}, as
indicated in Fig. 2.7(b). Now, we look at the upper neighbor bin to see if it can be
included. The IR-drop analysis shows that the exclusive use of s_6 of type sw_1 makes the
subregion consisting of b_{2,4} and b_{2,5} feasible. Thus, R = {r(s_6, sw_1, b_{2,4}, b_{2,5})},
as shown in Fig. 2.7(c). Likewise, we examine the lower neighbor bin to see if it can be
included. The IR-drop analysis shows that the exclusive use of s_6 of type sw_1 leaves
the subregion of b_{2,3} and b_{2,4} infeasible, as shown in Fig. 2.7(d), but the use of s_6
of type sw_2 makes the subregion feasible, as shown in Fig. 2.7(e). Consequently, the
maximally feasible set becomes R = {r(s_6, sw_1, b_{2,4}, b_{2,5}), r(s_6, sw_2, b_{2,3}, b_{2,4})}.
Further expansion of the feasible subregion is possible, as shown in Fig. 2.7(f), by
increasing the cell size from sw_2 to sw_3, which updates R = {r(s_6, sw_1, b_{2,4}, b_{2,5}),
r(s_6, sw_2, b_{2,3}, b_{2,4}), r(s_6, sw_3, b_{2,2}, b_{2,3}, b_{2,4})}. We employ a pruning
technique while expanding the subregions, as illustrated in Fig. 2.7(f), to avoid further
tests for a subregion if it has already been found to be feasible on a cell of type sw_j.
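The expansion with pruning can be sketched for a single column of bins as follows. This is a simplified illustration, not the thesis implementation: subregions are restricted to contiguous bin intervals that contain the home bin of the switch cell, and `toy_is_feasible` is a hypothetical stand-in for a real IR-drop analysis in which each bin has a made-up load and each cell type a made-up capacity.

```python
def extract_maximal_subregions(home, first_bin, last_bin, n_types, is_feasible):
    """Collect, per cell type, the maximal feasible bin intervals containing `home`.

    Each interval (lo, hi) is recorded under the smallest type that makes it
    feasible; larger types are not tested for it (pruning).
    """
    smallest_type = {}
    for lo in range(home, first_bin - 1, -1):
        for hi in range(home, last_bin + 1):
            for t in range(n_types):
                if is_feasible(t, lo, hi):
                    smallest_type[(lo, hi)] = t
                    break  # pruning: skip larger types for this interval
    result = {t: [] for t in range(n_types)}
    for (lo, hi), t in smallest_type.items():
        # Drop an interval if another interval of the same type properly contains it.
        contained = any(
            t2 == t and (l2, h2) != (lo, hi) and l2 <= lo and hi <= h2
            for (l2, h2), t2 in smallest_type.items()
        )
        if not contained:
            result[t].append((lo, hi))
    return {t: sorted(intervals) for t, intervals in result.items()}


# Hypothetical oracle: a region is feasible if its total load fits the capacity.
LOAD = {1: 3, 2: 3, 3: 3, 4: 1, 5: 1, 6: 4}   # made-up load per bin
CAPACITY = [2, 4, 7]                          # made-up capacity of sw1, sw2, sw3


def toy_is_feasible(t, lo, hi):
    return sum(LOAD[b] for b in range(lo, hi + 1)) <= CAPACITY[t]


regions = extract_maximal_subregions(4, 1, 6, 3, toy_is_feasible)
print(regions)
```

In the real flow, each call to the feasibility oracle corresponds to an IR-drop check with only the given switch cell of the given type enabled, so pruning directly reduces the number of such checks.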
2.4.2 Switch Cell Covering for Minimal Standby Leakage
From the set R that contains all maximally feasible covering subregions obtained in
Step 1, this step extracts a subset F ⊆ R with the least standby leakage while
meeting the IR-drop constraint in C. We formulate this problem as a weighted set cover
problem, which is stated as follows: given a set U of elements and a set H of subsets
A_1, A_2, ... of U with weights w(A_1), w(A_2), ... such that A_1 ∪ A_2 ∪ ... = U, find a
subset F of H of minimum total weight such that, for every e_i ∈ U, e_i ∈ A_j for some
A_j ∈ F. Thus, our transformation into the weighted set cover problem constructs
U, H, and the weights w(A_i) as follows: (1) U = {b_{1,1}, b_{1,2}, ...} is the set of
physical locations in C on which the IR-drop constraint should be met; (2) H = R, i.e.,
A(·) = r(·); (3) w(A(·)) = w(r(·)) is the standby leakage at the switch cell s(·) of type
sw(·) in r(·).
To speed up the computation of set covering, we adopt the approximation algorithm
in [3]. A pseudocode of the approximation algorithm tailored to our context is
given in Algorithm 2. The algorithm is a mixture of greedy and exact algorithms. Since
the whole initial problem is too big, a greedy algorithm is applied to an extent, followed
by an exact algorithm. Because the input to the exact algorithm becomes relatively
small, it takes much less time to obtain an optimal solution for that part. By integrating
both algorithms into one framework in Algorithm 2, in which a parameter ρ (≥ 1) is used
to control the problem sizes for the greedy and exact algorithms, the run time can
be significantly reduced while bounding the sub-optimality. The approximation ratio of
Algorithm 2 is 1 + ln ρ [3].

[Figure 2.7 panels (a) to (f): bins around switch cell s6 with switch cells marked ON or OFF, IR-drop status (IR meet / IR violated, infeasible), switching to a larger cell type, and the resulting set R after each expansion]
Figure 2.7: An illustration of extracting maximally feasible regions (Step 1.2) for
switch cell s6.
Algorithm 2 (Step 2): Approximation algorithm for optimizing switch cells
Input: U, R, ρ (control parameter); Output: F ⊆ R
    Rg = ∅    // a set of subregions in R for greedy part
    F = R     // F will be iteratively refined.
    n = |U|   // total bins in C
    // ⋃B is a bin union operation.
    while (⋃B R) ∪ (⋃B Rg) = U do
        Select r(·) ∈ R that minimizes w(r(·)) / binCount(r(·) \ ⋃B Rg)
        // ρ (≥ 1) controls problem sizes for greedy and exact.
        // Increasing ρ places more weight on greedy.
        if n − binCount(⋃B Rg ∪ {r(·)}) > n/ρ then
            Rg = Rg ∪ {r(·)}    // greedy covering
            Remove conflict subregions against the r(·) from R
        else
            Rt = {r′(·) ∈ R | r′(·) ∉ ⋃ Rg ∪ {r(·)}}
            Ropt = an exact covering for Rt
            if w(Rg) + w({r(·)}) + w(Ropt) < w(F) then
                F = Rg ∪ {r(·)} ∪ Ropt
            end if
            R = R \ r(·)
        end if
    end while
    Return F
Table 2.1 shows an example of how Algorithm 2 is applied, for U = {b_{1,1}, b_{1,2}, ..., b_{1,7}},
R = {r(s_1, sw_1, b_{1,1}), r(s_1, sw_2, b_{1,1}, b_{1,2}), ..., r(s_4, sw_3, b_{1,5}, b_{1,6}, b_{1,7})},
and ρ = 1 (i.e., for an exact solution), to find the switch cells to be turned on together
with the resizing types that use the least total standby leakage. The
minimum cover solution in Table 2.1 indicates that only three switch cells, s_1, s_2, and
s_4, suffice to meet the IR-drop constraint on all bins if they are sized to sw_2, sw_3, and
sw_2, respectively. The resulting standby leakage is 2.583 (= 0.672 + 1.239 + 0.672), which
is minimal.
Table 2.1: An example showing a minimum weighted set cover solution
R (feasible subregions) | b1,1 | b1,2 | b1,3 | b1,4 | b1,5 | b1,6 | b1,7 | Leakage w(r(·))
r(s1, sw1, b1,1) | √ | | | | | | | 0.404
r(s1, sw2, b1,1, b1,2) | √ | √ | | | | | | 0.672
r(s1, sw3, b1,1, b1,2, b1,3) | √ | √ | √ | | | | | 1.239
r(s2, sw1, b1,3) | | | √ | | | | | 0.404
r(s2, sw2, b1,2, b1,3) | | √ | √ | | | | | 0.672
r(s2, sw3, b1,3, b1,4, b1,5) | | | √ | √ | √ | | | 1.239
r(s2, sw3, b1,1, b1,2, b1,3) | √ | √ | √ | | | | | 1.239
r(s2, sw3, b1,2, b1,3, b1,4) | | √ | √ | √ | | | | 1.239
r(s3, sw2, b1,5) | | | | | √ | | | 0.672
r(s3, sw3, b1,4, b1,5) | | | | √ | √ | | | 1.239
r(s4, sw2, b1,6, b1,7) | | | | | | √ | √ | 0.672
r(s4, sw3, b1,5, b1,6, b1,7) | | | | | √ | √ | √ | 1.239
min. cover F | √ | √ | √ | √ | √ | √ | √ | 2.583
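The covering step can be tried out on the instance of Table 2.1 with the short sketch below. It is a simplified greedy-then-exact scheme written for illustration, not a line-by-line implementation of Algorithm 2 or of the algorithm in [3], and no approximation guarantee is claimed for it. The names `REGIONS`, `exact_cover`, and `hybrid_cover` are hypothetical, bin k stands for b_{1,k}, the sw1 subregion of s2 is taken to cover b_{1,3} (an assumption), and each switch cell is allowed at most one type.

```python
from itertools import combinations

# (switch cell, type, covered bins, leakage) taken from Table 2.1.
REGIONS = [
    ("s1", "sw1", {1}, 0.404), ("s1", "sw2", {1, 2}, 0.672),
    ("s1", "sw3", {1, 2, 3}, 1.239), ("s2", "sw1", {3}, 0.404),
    ("s2", "sw2", {2, 3}, 0.672), ("s2", "sw3", {3, 4, 5}, 1.239),
    ("s2", "sw3", {1, 2, 3}, 1.239), ("s2", "sw3", {2, 3, 4}, 1.239),
    ("s3", "sw2", {5}, 0.672), ("s3", "sw3", {4, 5}, 1.239),
    ("s4", "sw2", {6, 7}, 0.672), ("s4", "sw3", {5, 6, 7}, 1.239),
]
UNIVERSE = set(range(1, 8))


def total_weight(cover):
    return sum(region[3] for region in cover)


def exact_cover(universe, regions):
    """Brute force: cheapest cover that uses each switch cell at most once."""
    best = None
    for size in range(len(regions) + 1):
        for combo in combinations(regions, size):
            cells = [region[0] for region in combo]
            if len(set(cells)) != len(cells):
                continue  # a switch cell can take only one type
            covered = set().union(*(region[2] for region in combo))
            if universe <= covered and (
                best is None or total_weight(combo) < total_weight(best)
            ):
                best = list(combo)
    return best


def hybrid_cover(universe, regions, rho):
    """Greedy picks while more than n/rho bins stay uncovered, then exact search."""
    n = len(universe)
    chosen, covered = [], set()
    candidates = list(regions)
    while True:
        useful = [r for r in candidates if r[2] - covered]
        if not useful:
            break
        # Greedy criterion: leakage per newly covered bin.
        pick = min(useful, key=lambda r: r[3] / len(r[2] - covered))
        if n - len(covered | pick[2]) <= n / rho:
            break  # the remaining problem is small enough for the exact part
        chosen.append(pick)
        covered |= pick[2]
        candidates = [r for r in candidates if r[0] != pick[0]]  # drop conflicts
    rest = exact_cover(universe - covered, candidates)
    return None if rest is None else chosen + rest


exact = hybrid_cover(UNIVERSE, REGIONS, rho=1)
greedy_heavy = hybrid_cover(UNIVERSE, REGIONS, rho=7)
print(sorted((r[0], r[1]) for r in exact), round(total_weight(exact), 3))
print(sorted((r[0], r[1]) for r in greedy_heavy), round(total_weight(greedy_heavy), 3))
```

With ρ = 1 the sketch falls through to the exact search immediately and returns the cover of Table 2.1 (s1 as sw2, s2 as sw3, s4 as sw2, total 2.583), while ρ = 7 lets the greedy part decide almost everything and ends at a total of 2.987, which illustrates the run time versus quality trade-off that ρ controls.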
2.4.3 Consideration of Practical Issues
Spatial Variation in Wirebond Package
So far, we have used a constant IR-drop constraint V_limit across all switch cells.
This is acceptable only when every switch cell is assumed to have a small
IR-drop variation. This is true for the flipchip package, since the portion of the RVDD
drop in the total IR-drop is very small, as previously shown in Fig. 2.4(c). However,
for the wirebond package, the distances from the IO pads to the switch cells vary
significantly depending on the cells' placement locations, and for some cells the distance
is much longer than in the flipchip package, resulting in a considerable spatial IR-drop
variation on the switch cells, as demonstrated in Fig. 2.8.
Our work can handle this high IR-drop variation by simply setting, for every switch
cell s_i preplaced in C, a different IR-drop constraint V_limit(s_i) = V_limit + ΔV(s_i), where
ΔV(s_i) is the cumulative IR-drop value from the IO pads to the location of s_i. The
value of ΔV(·) can be obtained separately from IR-drop analysis at the sign-off stage.
Then, our extraction of maximally feasible subregions can be performed according to
the updated IR-drop constraints.
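The per-cell constraint is a one-line adjustment, sketched below. The function name and the ΔV values are hypothetical; in practice the cumulative drops would come from the sign-off IR-drop analysis.

```python
def per_cell_limits(v_limit_mv, cumulative_drop_mv):
    """V_limit(s_i) = V_limit + dV(s_i) for every preplaced switch cell s_i."""
    return {cell: v_limit_mv + dv for cell, dv in cumulative_drop_mv.items()}


# Hypothetical cumulative drops (mV) from the IO pads to two switch cells.
limits = per_cell_limits(25.0, {"s1": 1.5, "s2": 6.0})
print(limits)
```

The feasibility test of Step 1 would then compare each drop against the limit of the corresponding switch cell instead of a single global value.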
Wakeup Signal Sensitive Subcircuits
Usually, the NSLEEP (wakeup control) signals of the switch cells are delivered
through a daisy chain to ensure a sufficient wakeup signal delay and thereby avoid a large
surge current. However, if our proposed cell optimization solution reports that a
particular switch cell can be left unused, the resulting wire rerouting among cells may
shorten the wakeup signal delay and disturb the timing constraints on the nearby
timing-sensitive subcircuits. We can solve this problem by replacing the HEADBUF in the
removed cell with an always-on buffer, as shown in Fig. 2.9.
[Figure 2.8 panels: (a), (b) IR-drop maps; (c) histogram of IR-drop on switch cells (x-axis 0 to 16, y-axis 0 to 600) for wirebond and flipchip, with labels Mean = 9.79mV and Mean = 6.26mV]
Figure 2.8: Comparison of the IR-drop map on all instances and its distribution on switch
cells for the fpu circuit using different packages: (a) flipchip package, (b) wirebond
package. Orange and white dots indicate power and ground sources, respectively.
[Figure 2.9 labels: NSLEEP chain through HEADBUF cells between RVDD and VVDD; the HEADBUF of the removed switch cell is replaced with an always-on BUF]
Figure 2.9: An example illustrating how to handle wakeup-delay-sensitive subcircuits:
the unused HEADBUF is replaced with an always-on buffer to maintain the wakeup
signal delay.
2.5 Experimental Results
2.5.1 Experimental Setup
The proposed algorithm has been implemented in Python 3. The circuits
s35932, s38417, and s38584 from ISCAS'89, and openMSP430 (16-bit microcontroller core)
and fpu (floating point unit) from OpenCores [17], were synthesized with Design Compiler
(DC) J-2014.09-SP5-2 and placed and routed with IC Compiler (ICC) I-2013.12-SP5-1, and
their standby leakage was measured with HSPICE, all from Synopsys Inc., based on the
Synopsys 32/28nm Generic Library and its PDK. IR-drop analysis was performed at the worst
timing corner, ss/0.95v/125c for PVT and Cmax for the BEOL corner, using RedHawk V15.2.1
from Ansys Inc. The on-state power consumption was calculated in vectorless mode,
and the toggle rate was set to 0.3 and 2.0 for data and clock signals, respectively.
The power/ground metal density shown in Table 2.3 was applied identically to all
circuits, and switch cells were placed evenly at every cross point of M1 and M5 in
a staggered way. The given switch cell library is L = {HEADX2_RVT, HEADX4_RVT,
HEADX8_RVT, HEADX16_RVT, HEADX32_RVT}. We assumed a flipchip package
and set the IR-drop constraint V_limit to 25mV.
Table 2.2: Comparison of the number of switch cells and the estimated standby leakage
I_standby for the evenly distributed placement, the previous current-based approach in [9],
and our algorithm
Circuit | Circuit size (W×H) | Gate count | Target frequency | Evenly distributed: # of switch cells | I_standby (uA) | Current-based method ([9]): # of switch cells | I_standby (uA) | Ours: # of switch cells | I_standby (uA)
s35932 | 198um x 197um | 4965 | 667MHz | HEADX32: 80 | 149.1 | HEADX32: 59 | 93.2 (37.5%) | HEADX32: 51, HEADX16: 2 | 86.7 (41.9%)
s38417 | 189um x 187um | 5183 | 667MHz | HEADX32: 75 | 132.6 | HEADX32: 58 | 94.1 (29.0%) | HEADX32: 45, HEADX16: 4 | 88.6 (33.2%)
s38584 | 178um x 178um | 5002 | 1000MHz | HEADX32: 71 | 121.7 | HEADX32: 61 | 107.9 (11.3%) | HEADX32: 52, HEADX16: 3 | 97.8 (19.6%)
openMSP430 | 167um x 166um | 6611 | 1250MHz | HEADX32: 80 | 123.6 | HEADX32: 59 | 98.4 (20.4%) | HEADX32: 49, HEADX16: 10 | 91.0 (26.4%)
fpu | 450um x 450um | 49397 | 909MHz | HEADX32: 420 | 579.6 | HEADX32: 277 | 441.0 (23.9%) | HEADX32: 59, HEADX16: 106, HEADX8: 53 | 266.1 (35.0%)
Average reduction in I_standby | | | | | | | 24.4% | | 35.0%
Since the IR-drop value expected at the power-plan stage tends to be pessimistic due to an
unrealistic hot-spot margin, the IR-drop value obtained at the sign-off stage is usually
less than our IR-drop constraint. To obtain a tighter IR-drop condition, we brought the
IR-drop value closer to our constraint by increasing the frequency to the target frequency
presented in Table 2.2, which in turn increases the total power consumption.
We used N/10 as the ρ value for the switch cell covering algorithm (Step 2) in
order to reduce the run time; as a result, this step finished in no more than
5 minutes even in the worst case, the fpu circuit.
2.5.2 Experimental Result
Table 2.2 summarizes the results produced by the initial evenly distributed switch cell
placement, the previous optimization algorithm in [9], and our algorithm in terms of
the number of switch cells of each type and the standby leakage I_standby. The number
in parentheses is the rate of reduction from the initial case. The comparison shows that
our algorithm reduces the standby leakage by 35.0% on average, which is 13.9% more
reduction than [9].
Note that the IR-drop distribution of the initial case has a long tail, as shown in
Fig. 2.10(a). It comes from a small number of IR-drop hot-spots whose values are
near V_limit. The switch cell pitch p was defined to prevent these hot-spots from
exceeding V_limit. Therefore, the IR-drop values of most instances other than the hot-spots
are distributed around the mean value, 6.8mV, which is very small compared to V_limit.
Fig. 2.10(b) shows an example of how the current-based approach can be misleading.
According to the IR-drop map of Fig. 2.10(b), the left side of the circuit seems to be
weaker in IR-drop than the right side because of the long distance from the switch cells to
the instances there; that is, the dominant factor in this case is resistance, not current.
Our algorithm reduced the switch cells very efficiently while shifting the IR-drop
distribution to the right, near V_limit. We are also able to resize a switch cell where
possible, which can reduce more power than using only one type of switch cell, as
demonstrated by the result for openMSP430 in Table 2.2. The standard deviation σ of our
IR-drop distribution, shown in Fig. 2.10(c), also decreases, like the initial case in
Fig. 2.10(a).
Table 2.3: Power/ground density specification
Layer          RVDD                    VVDD                    VSS
               Width/Pitch (density)   Width/Pitch (density)   Width/Pitch (density)
MRDL           8um/20um (40%)          N/A                     8um/20um (40%)
M9             8um/20um (80%)          N/A                     8um/20um (40%)
M5             1um/60um (1.7%)         1um/60um (1.7%)         1um/60um (1.7%)
M1 (PG rail)   N/A                     0.06um/3.344um (1.8%)   0.06um/3.344um (1.8%)
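Most density entries in Table 2.3 are consistent with reading density as strap width divided by strap pitch. The short sketch below checks that reading for three rows; the function name and the dictionary layout are assumptions made for illustration only.

```python
def pg_density(width_um, pitch_um):
    """Fraction of a layer covered by straps of the given width and pitch.

    Assumes density = width / pitch, which is an interpretation of the table,
    not a definition stated in the text.
    """
    if pitch_um <= 0 or width_um < 0:
        raise ValueError("width must be non-negative and pitch positive")
    return width_um / pitch_um


# (width, pitch) pairs in um, copied from the MRDL, M5, and M1 rows.
straps = {
    "MRDL": (8.0, 20.0),
    "M5": (1.0, 60.0),
    "M1 (PG rail)": (0.06, 3.344),
}
density_percent = {
    layer: round(100.0 * pg_density(w, p), 1) for layer, (w, p) in straps.items()
}
print(density_percent)
```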
[Figure 2.10 panel annotations: (a) mean = 6.69mV, long tail of hot-spots; (b) mean =
12.95mV, 23.9% Istandby reduced; (c) mean = 12.00mV, 51.1% Istandby reduced.]
Figure 2.10: Comparison of IR-drop distribution for the fpu circuit. (a) Evenly
distributed switch cells with 60um pitch, (b) the previous switch optimization method
in [9], and (c) our algorithm.
2.6 Conclusions
In this work, we introduced the structure of the power delivery network of modern
power-gated SoCs and demonstrated that the standby leakage at switch cells is
considerable. To reduce the unnecessary leakage, we proposed a comprehensive solution
that determines, for each switch cell, whether the cell can be removed or which type
of switch cell should replace it, so that the total standby leakage of the switch
cells is minimized under the noise constraint. We formulated the problem as a variant
of the weighted set cover problem and solved it efficiently by employing an
approximate set cover algorithm. Experiments showed that our method reduced standby
leakage by 35.0% over the initial designs and by 13.9% more than the designs produced
by the previous current-based method in [9].
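For readers unfamiliar with weighted set cover, the classic greedy approximation is sketched below. It is not the variant formulated in this work, nor necessarily the approximate algorithm employed here; it only illustrates the underlying idea. The mapping of candidates to (switch cell, type) choices, the coverage sets, and the leakage weights are hypothetical.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Classic greedy heuristic for weighted set cover.

    candidates maps a name to (weight, covered_elements), with weight > 0.
    Each round picks the candidate with the lowest weight per newly covered
    element. Returns the chosen names and their total weight.
    """
    uncovered = set(universe)
    chosen, total = [], 0.0
    while uncovered:
        best, best_ratio = None, float("inf")
        for name, (weight, covered) in candidates.items():
            gain = len(covered & uncovered)
            if gain == 0:
                continue  # adds nothing new (includes already chosen sets)
            if weight / gain < best_ratio:
                best, best_ratio = name, weight / gain
        if best is None:
            raise ValueError("candidates cannot cover the universe")
        weight, covered = candidates[best]
        chosen.append(best)
        total += weight
        uncovered -= covered
    return chosen, total


# Hypothetical instance: elements are cell instances that must stay within
# the noise limit; weights stand in for switch cell standby leakage.
instances = {1, 2, 3, 4, 5, 6}
switch_options = {
    "sw_a_large": (3.0, {1, 2, 3, 4}),
    "sw_a_small": (1.0, {1, 2}),
    "sw_b_small": (1.0, {3, 4}),
    "sw_c_small": (1.0, {5, 6}),
}
chosen, total_leakage = greedy_weighted_set_cover(instances, switch_options)
print(chosen, total_leakage)
```

In this toy instance the three small options cover every instance at a lower total weight than any cover that includes the large one, which mirrors the intuition behind replacing or removing oversized switch cells.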
Bibliography
[1] Y.-T. Chang, C.-C. Hsu, M. P.-H. Lin, Y.-W. Tsai, and S.-F. Chen. Post-placement
power optimization with multi-bit flip-flops. In Proceedings of IEEE/ACM
International Conference on Computer-Aided Design, 2010.
[2] Z.-W. Chen and J.-T. Yan. Routability-constrained multi-bit flip-flop construction
for clock power reduction. Integration, the VLSI Journal, 46(3), 2013.
[3] M. Cygan, Ł. Kowalik, and M. Wykurz. Exponential-time approximation of
weighted set cover. Information Processing Letters, 109(16):957–961, 2009.
[4] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.
[5] C.-C. Hsu, Y.-C. Chen, and M. P.-H. Lin. In-placement clock-tree aware multi-
bit flip-flop generation for power optimization. In Proceedings of IEEE/ACM
International Conference on Computer-Aided Design, 2013.
[6] H. Kawaguchi, K. Nose, and T. Sakurai. A super cut-off CMOS (SCCMOS) scheme
for 0.5-V supply voltage with picoampere stand-by current. IEEE Journal of
Solid-State Circuits, 35(10):1498–1501, 2000.
[7] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin,
M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static
power. Computer, 36(12):68–75, 2003.
[8] S. Kim, S. V. Kosonocky, and D. R. Knebel. Understanding and minimizing
ground bounce during mode transition of power gating structures. In Proceedings
of IEEE/ACM International Symposium on Low Power Electronics and Design.
ACM, 2003.
[9] J. N. Kozhaya and L. A. Bakir. An electrically robust method for placing power
gating switches in voltage islands. In Proceedings of the IEEE 2004 Custom
Integrated Circuits Conference, pages 321–324, 2004.
[10] Y. Kretchmer (LSI Logic). Using multi-bit register inference to save area and
power: the good, the bad, and the ugly. EE Times Asia, 2001.
[11] Y. Lee, D.-K. Jeong, and T. Kim. Simultaneous control of power/ground current,
wakeup time and transistor overhead in power gated circuits. In Proceedings of
IEEE/ACM International Conference on Computer-Aided Design, 2008.
[12] Y. Lee, M. Seok, S. Hanson, D. Blaauw, and D. Sylvester. Standby power
reduction techniques for ultra-low power processors. In Proceedings of European
Solid-State Circuits Conference. IEEE, 2008.
[13] J.-M. Lin and C.-C. Lin. Placement density aware power switch planning
methodology for power gating designs. 34(5):766–777, 2015.
[14] M. P.-H. Lin, C.-C. Hsu, and Y.-T. Chang. Recent research in clock power saving
with multi-bit flip-flops. In IEEE MWSCAS, 2011.
[15] M. P.-H. Lin, C.-C. Hsu, and Y.-C. Chen. Clock-tree aware multibit flip-flop
generation during placement for power optimization. 34(2), 2015.
[16] H. Moon and T. Kim. Design and allocation of loosely coupled multi-bit
flip-flops for power reduction in post-placement optimization. In Proceedings of
IEEE Asia-South Pacific Design Automation Conference, 2016.
[17] OpenCores. http://www.opencores.org.
[18] Y.-T. Shyu, J.-M. Lin, C.-P. Huang, C.-W. Lin, Y.-Z. Lin, and S.-J. Chang.
Effective and efficient approach for power reduction by using multi-bit flip-flops.
21(4), 2013.
[19] S. Walia. PrimeTime® advanced OCV technology. Synopsys, Inc., 2009.
[20] L. K. Yong and C. K. Ung. Power density aware power gate placement
optimization scheme. In Proceedings of IEEE Asia Symposium on Quality Electronic
Design, 2010.
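Abstract (translated from the Korean)
This thesis introduces power optimization techniques applicable before and after cell
placement.
First, we propose a method for implementing multi-bit flip-flops that differs from the
previously introduced methods. Existing multi-bit flip-flop allocation is carried out
in two main steps: (i) single-bit flip-flops are placed so that the timing constraints
are not violated, and then (ii) the flip-flops are merged so that the power consumed by
the flip-flops and the clock tree is minimized. However, because it is difficult to
predict the result of step (ii) while performing step (i), the achievable power
reduction is limited. In this work, we therefore place the emphasis on power
minimization: multi-bit flip-flops are first synthesized so as to minimize power and
are placed afterwards. By synthesizing at this early design stage, we were able to
reduce clock power consumption by 16.8% without violating the timing constraints, and
we confirmed through experiments that this is effective in reducing clock power
compared with the existing approach.
Second, we address the practical problem of optimizing unnecessary switch cells in
modern power-gated designs while satisfying a noise constraint (for example, IR-drop).
Because power gating switch cells are connected directly to the power rails, their
locations are determined before cell placement. Consequently, more switch cells are
placed than are actually needed, which in turn causes unnecessarily large standby
power consumption. We therefore introduce a technique that optimizes the switch cells
after cell placement. Specifically, starting from an initial design in which switch
cells are evenly placed in the grid style commonly used in industry, we propose a
method that comprehensively determines, for each switch cell, (i) whether it can be
removed or (ii) whether it can be replaced with another type, so that the total
standby power of the switch cells is minimized under the given noise constraint. To
this end, we formulate the problem as a variant of the weighted set cover problem and
solve it efficiently with an approximate set cover algorithm. Experimental results on
the ISCAS89 benchmarks, openMSP430, and the fpu circuit confirm that this technique
reduces standby power by 35.0% compared with the initial grid style and by 13.9%
compared with the switch cell optimization technique previously proposed in [9].
Keywords: low-power design, multi-bit flip-flop, logic synthesis, power gating, switch
cell
Student Number: 2015-20962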