
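Attribution-NonCommercial-NoDerivs 2.0 Korea

You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that you follow the conditions below.

Attribution. You must attribute the work to the original author.

Noncommercial. You may not use this work for commercial purposes.

No Derivative Works. You may not alter, transform, or build upon this work.

For any reuse or distribution, you must make clear the license terms applied to this work.

These conditions do not apply if you obtain separate permission from the copyright holder.

Your rights under copyright law are not affected by the above.

This is a human-readable summary of the Legal Code: http://creativecommons.org/licenses/by-nc-nd/2.0/kr/legalcode

Disclaimer: http://creativecommons.org/licenses/by-nc-nd/2.0/kr/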

M.S. THESIS

Power Optimization Techniques Applicable in Pre/Post Placement Stages for Modern System-on-Chips

BY

Yi Dongyoun

FEBRUARY 2017

DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

COLLEGE OF ENGINEERING

SEOUL NATIONAL UNIVERSITY

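Power Optimization Techniques Applicable in Pre/Post Placement Stages for Modern System-on-Chips

Advisor: 김태환

Submitted as a thesis for the degree of Master of Engineering

February 2017

Graduate School of Seoul National University

Department of Electrical and Computer Engineering

Yi Dongyoun

The Master of Engineering thesis of Yi Dongyoun is hereby approved

February 2017

Chair:        Vice Chair:        Member: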

Abstract

In this thesis, we introduce two power optimization techniques applicable before and after the placement stage.

First, a new approach to the problem of allocating multi-bit flip-flops for data storage is presented. Previous approaches divide the allocation problem into two separate steps: (i) placing single-bit flip-flops under circuit timing constraints and (ii) minimizing the flip-flop and clock tree power by grouping single-bit flip-flops to form multi-bit flip-flops. Yet, there is no easy way to predict the result of step (ii) during step (i). In our approach, we place primary importance on the cost of power consumption. Consequently, we try to minimize power consumption by synthesizing multi-bit flip-flops first and placing them later. For a number of benchmark circuits, it is shown that our approach of considering the synthesis of multi-bit flip-flops early is very effective, reducing the clock power by 16.8% over that of the conventional method while satisfying all the timing constraints.

The second work addresses a practical problem of optimizing the switch cells in power-gated modern System-on-Chips (SoCs) to save unnecessary standby leakage under a noise (i.e., IR-drop) constraint. Since power gating switch cells are physically connected directly to power rails, their overall allocation structure is synthesized in a stage before logic cell placement. Consequently, the allocation of switch cells at the pre-placement stage could lead to unnecessarily high standby leakage for modern designs. This work proposes a practical remedy for this problem at the post-placement stage. Specifically, for an initial design with a grid-based switch cell allocation, which is a design methodology commonly used in industry, we propose a comprehensive solution to determining, for each switch cell, (i) if the cell can be removed or (ii) the type of switch cell to replace it with, so that the resulting total standby leakage of switch cells is minimized under the noise constraint. We formulate the problem as a variant of the weighted set cover problem and solve it efficiently by employing an approximate set cover algorithm. Through experiments with benchmark circuits in ISCAS89, openMSP430, and fpu, it is shown that our method is able to reduce the standby leakage by 35.0% and 13.9% over the initial designs and the designs produced by the previous switch cell optimization method in [9], respectively.

keywords: Low Power, Multi-bit Flip-flop, Logic Synthesis, Power-gating, Switch cell

student number: 2015-20962

Contents

Abstract

Contents

List of Tables

List of Figures

1 Allocation of Multi-bit Flip-flops in Logic Synthesis for Power Optimization
  1.1 Introduction
  1.2 Algorithm for Multi-bit Flip-flop Allocation
  1.3 Placement Aware Multi-bit Flip-flop Allocation
    1.3.1 Extraction of Mergeable Flip-flop Sets
    1.3.2 Construction of Merging Conflict Graph
    1.3.3 Selection of Mergeable Flip-flop Sets
  1.4 Experimental Results
    1.4.1 Experimental Setup
    1.4.2 Comparing with Academic Algorithm
    1.4.3 Comparing with Industry Algorithm
  1.5 Conclusion

2 Switch Cell Optimization of Power-gated Modern System-on-Chips
  2.1 Introduction
  2.2 Preliminaries and Motivations
  2.3 Problem Formulation
  2.4 The Proposed Algorithm
    2.4.1 Extraction of Maximally Feasible Subregions
    2.4.2 Switch Cell Covering for Minimal Standby Leakage
    2.4.3 Consideration of Practical Issues
  2.5 Experimental Results
    2.5.1 Experimental Setup
    2.5.2 Experimental Result
  2.6 Conclusions

Abstract (In Korean)

List of Tables

1.1 Comparison of normalized power and area of 1-bit, 2-bit and 4-bit flip-flops [1].

1.2 Computation of intersection densities of some subsets of flip-flops in the circuit in Fig. 1.3.

1.3 The number of mergeable flip-flop sets.

1.4 Comparison of the number of flip-flop cells, clock tree power $P_{CLK\_T}$, flip-flop power $P_{FF}$, and the total power for the results produced by the MBFF allocation [1] in post-placement, our algorithm, and our algorithm followed by [1].

1.5 Comparison of the number of flip-flop cells, clock tree power $P_{CLK\_T}$, flip-flop power $P_{FF}$, combinational logic power $P_{COMBI}$, the total power, the longest path delay $t_{lpd}$ and the worst local clock skew $t_{skew}$ for the results produced by Design Compiler's MBFF algorithm, and our algorithm followed by [1].

2.1 An example showing a minimum weighted set cover solution.

2.2 Comparison of the number of switch cells and estimated standby leakage $I_{standby}$ by evenly distributed switch cells, the previous current-based approach in [9] and our algorithm.

2.3 Power/ground density specification.

List of Figures

1.1 The structures of two 1-bit flip-flops and a 2-bit flip-flop merging the two 1-bit flip-flops [14].

1.2 Flow of the proposed design methodology: Sub-step 2 takes into account the placement cost while performing multi-bit flip-flop allocation in step (i). The cost of multi-bit flip-flop allocation in Eq. 1.2 and the placement cost are balanced by controlling the value of the placement relaxation parameter $\rho$.

1.3 Conceptual view of partitioning a circuit based on the transitive fan-outs of flip-flops.

1.4 An example of a conflict graph for 7 mergeable flip-flop sets. (a) Given the conflict graph, we set the weight of each vertex to its amount of power saving. For simplicity, we ignore the vertices of 1-bit flip-flops, whose power saving weight is 0. (b) We evaluate the degree of each vertex considering its neighbors and select the minimum-degree vertex $L_7$. (c) We remove the selected vertex and its neighbors from $G$. (d)(e)(f) We repeat these steps until $G$ becomes empty.

1.5 Normalized power consumption per bit of MBFF up to 16-bit.

1.6 Layout views for s5378. The yellow lines indicate clock tree networks. The colored small rectangles represent 1-bit flip-flops and MBFFs; light-orange is a 1-bit flip-flop, light-red is a 2∼4-bit MBFF, orange is a 5∼9-bit MBFF and red is a 10∼16-bit MBFF. (a) Result produced by our algorithm. (b) Result produced by random FF grouping with the same number of MBFFs as in (a); the red dashed line indicates a timing violation. (c) Result produced by [1].

2.1 A logical structure of a power-gated power delivery network (PDN).

2.2 The changes of standby leakage of the openMSP430 circuit as the number and type (size) of switch cells vary.

2.3 Typical PDN structure used in current mobile SoCs.

2.4 An example of (a) switch cell placement, (b) its IR-drop map under an assumption of fully placed power/ground IO bumps, and (c) IR-drop and resistance composition for the worst IR-drop path of the aes256 circuit, categorized by elements of the power net.

2.5 The flow of our two-step algorithm of switch cell optimization.

2.6 An illustration of bin partitioning (Step 1.1), horizontally along the power/ground rails and vertically along the center of the locations of two serially placed cells.

2.7 An illustration of extracting maximally feasible regions (Step 1.2) for switch cell $s_6$.

2.8 Comparison of the IR-drop map on all instances and its distribution on switch cells for the fpu circuit using different packages: (a) flip-chip package, (b) wire-bond package. Orange and white dots indicate power and ground sources, respectively.

2.9 An example illustrating the handling of wakeup-delay-sensitive subcircuits: replacement of the unused HEADBUF with an always-on buffer to maintain the wakeup signal delay.

2.10 Comparison of IR-drop distribution for the fpu circuit: (a) evenly distributed switch cells with 60 µm pitch, (b) the previous switch optimization method in [9] and (c) our algorithm.

Chapter 1

Allocation of Multi-bit Flip-flops in Logic Synthesis for

Power Optimization

1.1 Introduction

Under the limited power and thermal budgets of modern system-on-chips (SoCs), which integrate an increasing number of transistors and interconnects, minimizing power consumption has become one of the most important design goals for diverse applications. One effective methodology for saving power, in particular on the flip-flops and the clock network driving them, is allocating multi-bit flip-flops or register banks.

Fig. 1.1 compares the structures of two 1-bit flip-flops (left) and their functionally equivalent 2-bit flip-flop (right). The power saving in the 2-bit flip-flop is attributed to the sharing of the two inverters among the two master and two slave latches. The advance of process technology below 65nm enables even a minimum-sized inverter to drive multiple master and slave latches in flip-flops [1]. Table 1.1 compares the power consumption and area of a 1-bit flip-flop, a 2-bit flip-flop, and a 4-bit flip-flop. In summary, using a 2-bit flip-flop and a 4-bit flip-flop can save power by 14% and 22% per bit while reducing the area by 4% and 29% per bit, respectively [14]. Moreover, since the use of multi-bit flip-flops reduces the number of cells driven by the clock tree, clock resources such as clock wires and buffers can be reduced as well, which in turn decreases the clock power consumption. Thus, as process technology advances, the use of flip-flops of 8 bits or more is expected to become feasible without timing degradation and to yield a considerable reduction in power consumption.

[Figure: two 1-bit flip-flops, each consisting of a master latch and a slave latch between D and Q and driven by CLK (left), and a 2-bit flip-flop in which the two master/slave latch pairs share a single CLK input (right).]

Figure 1.1: The structures of two 1-bit flip-flops and a 2-bit flip-flop merging the two 1-bit flip-flops [14].

The use of a multi-bit flip-flop (MBFF) library during RTL synthesis was first introduced in [10]. However, the flip-flops forming a multi-bit flip-flop were selected manually, by collecting only the flip-flops on the same bus. Later, the works in [1, 2, 18] proposed merging multiple 1-bit flip-flops to form an MBFF at the post-placement stage. Precisely, during the merging process they extracted, from the cell placement information, a so-called feasible location region for legal placement of MBFFs, the placement density for acquiring available space for MBFFs, and the routing congestion for legal routing of nets from/to MBFFs. On the other hand, the works in [5, 15, 16] considered the effect of flip-flop merging on the synthesis of the clock tree at the in-placement stage.

Note that almost all of the previous MBFF allocation algorithms ([1, 2, 5, 15, 16, 18]) performed the flip-flop merging by relying solely on placement information obtained beforehand. Namely, the previous approaches divide the MBFF allocation problem into two steps: (i) placing 1-bit flip-flops while meeting the timing constraints and (ii) grouping the 1-bit flip-flops placed in step (i) to form multi-bit flip-flops to save power while meeting the timing constraints. Since the flip-flop grouping, and thus the power reduction, depends on the placement of the 1-bit flip-flops, the result of step (i) has a strong effect on the result of step (ii). Yet, there is no easy way to predict the result of step (ii) during step (i). In our approach, we place primary importance on the cost of power saving rather than on the cost of placement. In other words, our approach attempts to minimize the cost of power consumption first and the cost of placement later. Consequently, our algorithm tries to minimize the cost of power consumption without committing to a particular way of placing flip-flops, while taking into account a balance between the cost of power consumption and the cost of placement.

Table 1.1: Comparison of normalized power and area of 1-bit, 2-bit and 4-bit flip-flops [1].

Bit size   Total power   Total area   Per-bit power   Per-bit area
1          100           100          1.00            1.00
2          172           192          0.86            0.96
4          312           285          0.78            0.71
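As a quick check of how the per-bit columns and the quoted savings follow from the totals in Table 1.1, the arithmetic can be written out as below. This is only a worked restatement of the table entries, with totals normalized so that a 1-bit flip-flop equals 100.

```latex
\begin{align*}
\text{2-bit, power:} \quad & \frac{172}{2 \cdot 100} = 0.86 \;\Rightarrow\; 1 - 0.86 = 14\%\ \text{saving per bit} \\
\text{4-bit, power:} \quad & \frac{312}{4 \cdot 100} = 0.78 \;\Rightarrow\; 1 - 0.78 = 22\%\ \text{saving per bit} \\
\text{2-bit, area:}  \quad & \frac{192}{2 \cdot 100} = 0.96 \;\Rightarrow\; 1 - 0.96 = 4\%\ \text{reduction per bit} \\
\text{4-bit, area:}  \quad & \frac{285}{4 \cdot 100} \approx 0.71 \;\Rightarrow\; 1 - 0.71 = 29\%\ \text{reduction per bit}
\end{align*}
```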

1.2 Algorithm for Multi-bit Flip-flop Allocation

We assume that traditional logic synthesis has already been done. In other words, the logic gates and (1-bit) flip-flops have been optimally synthesized under certain timing constraints. However, neither the grouping of 1-bit flip-flops into multi-bit flip-flops nor the placement of the logic gates and multi-bit flip-flops under the timing constraints has been determined yet. We also assume that we have a library of multi-bit flip-flops of any bit size. The problem to be solved in step (i) can be expressed as:

Problem 1. (Early multi-bit flip-flop allocation problem) The problem is to transform an input logic circuit $C$ with 1-bit flip-flops only into a logic circuit $C'$ with a mixture of 1-bit and multi-bit flip-flops that minimizes the cost function

\[ P(C') = P_{FF}(C') + P_{CLK\_TREE}(C') \tag{1.1} \]

while satisfying the timing constraints in placement, in which $P_{FF}(C')$ indicates the total power consumed by the flip-flops in $C'$ and $P_{CLK\_TREE}(C')$ indicates the power consumed by the synthesized clock tree driving the flip-flops in $C'$. (In some literature, the sum of the flip-flop power $P_{FF}(C')$ and the clock tree power $P_{CLK\_TREE}(C')$ is referred to as the clock power of $C'$.)

Let $N$ be the number of 1-bit flip-flops in the input circuit $C$, and let $n_1, n_2, \cdots, n_K$ be the numbers of 1-bit, 2-bit, $\cdots$, $K$-bit flip-flops in $C'$, respectively, so that $1 \cdot n_1 + 2 \cdot n_2 + \cdots + K \cdot n_K = N$. Then, the value of $P_{FF}(C')$ can be obtained by summing the power consumptions of all flip-flops in $C'$, but the value of $P_{CLK\_TREE}(C')$ cannot be computed since the clock tree is not yet available at this pre-placement stage. However, since the value of $P_{CLK\_TREE}(C')$ is closely related to the total wire-length of the clock tree, which is in turn significantly influenced by the number of sinks (i.e., flip-flops), the clock power cost $P(C')$ in Eq. 1.1, which should be minimized in solving Problem 1, can be estimated by

\[ \hat{P}(C') = \sum_{m=1}^{K} n_m \cdot P^{m}_{FF} + \alpha \cdot \sum_{m=1}^{K} n_m \tag{1.2} \]

where $P^{m}_{FF}$ represents the power consumption of an $m$-bit flip-flop and $\alpha$ is a weighting factor.
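To make the estimate in Eq. 1.2 concrete, the following minimal Python sketch evaluates it for two ways of storing the same eight bits. The flip-flop powers are the normalized totals from Table 1.1; the value of `ALPHA` and the function names are illustrative assumptions, not values or names from this work.

```python
def estimated_clock_power(counts, ff_power, alpha):
    """Evaluate the estimate of Eq. 1.2.

    counts[m]   -- n_m, the number of m-bit flip-flops in C'
    ff_power[m] -- P^m_FF, the power of one m-bit flip-flop
    alpha       -- weighting factor on the number of clock sinks
    """
    flip_flop_power = sum(n * ff_power[m] for m, n in counts.items())
    sink_count = sum(counts.values())
    return flip_flop_power + alpha * sink_count


def total_bits(counts):
    """Left-hand side of the constraint 1*n_1 + 2*n_2 + ... + K*n_K = N."""
    return sum(m * n for m, n in counts.items())


# Normalized flip-flop powers from Table 1.1; ALPHA is a made-up value.
FF_POWER = {1: 100, 2: 172, 4: 312}
ALPHA = 10

all_single = {1: 8}   # eight 1-bit flip-flops
two_quads = {4: 2}    # the same eight bits stored in two 4-bit flip-flops
assert total_bits(all_single) == total_bits(two_quads) == 8

print(estimated_clock_power(all_single, FF_POWER, ALPHA))  # 880
print(estimated_clock_power(two_quads, FF_POWER, ALPHA))   # 644
```

Both terms of the estimate drop in the second configuration: the flip-flop power falls from 800 to 624, and the sink term falls from 80 to 20 because only two cells remain on the clock tree.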

Fig. 1.2 shows the overall flow of our design methodology. Starting from an RTL netlist, we apply a logic synthesizer to the netlist and produce a logic circuit $C$. Step (i) groups 1-bit flip-flops in $C$ to allocate multi-bit flip-flops, producing a circuit $C'$ that minimizes the quantity $\hat{P}(C')$ in Eq. 1.2. (The procedure of step (i), which consists of three sub-steps, is described in detail in Section 1.3.) Step (ii) then performs placement and clock tree synthesis for the circuit $C'$ obtained from step (i), followed by a check of whether the timing constraints are satisfied. If the timing constraints are not met, we increase the placement relaxation parameter $\rho$ and repeat steps (i) and (ii) until there is no timing violation in the placement. $\rho$ is a control parameter used to take into account an estimate of the cost of placement during the multi-bit flip-flop allocation in step (i).

[Figure: flowchart. Start; RTL netlist; logic synthesis; gate-level netlist; set $\rho$; Step (i), multi-bit flip-flop allocation (flip-flop grouping): 1. extract mergeable FF sets, 2. construct the merging conflict graph with the relaxation parameter $\rho$, 3. select mergeable FF sets in the graph, minimizing the estimated cost; Step (ii), placement / clock tree synthesis; decision "Meet timings?": if no, placement relaxation (increase $\rho$) and return to step (i); if yes, return the netlist with multi-bit flip-flops.]

Figure 1.2: Flow of the proposed design methodology: Sub-step 2 takes into account the placement cost while performing multi-bit flip-flop allocation in step (i). The cost of multi-bit flip-flop allocation in Eq. 1.2 and the placement cost are balanced by controlling the value of the placement relaxation parameter $\rho$.
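The iteration between steps (i) and (ii) can be sketched as a simple loop. In the sketch below, the three callbacks stand in for the allocation, placement/clock tree synthesis, and timing check stages; their names, the initial value of `rho`, the step size, and the decision to stop once `rho` reaches its upper bound are all assumptions made for illustration, since the text does not specify them.

```python
def allocate_until_timing_met(circuit, allocate, place_and_cts, meets_timing,
                              rho=0.0, rho_step=0.1, rho_max=1.0):
    """Repeat step (i) and step (ii), raising rho after each timing failure."""
    while True:
        merged = allocate(circuit, rho)      # step (i): MBFF allocation
        layout = place_and_cts(merged)       # step (ii): placement and CTS
        # Assumption: give up relaxing once rho has reached rho_max.
        if meets_timing(layout) or rho >= rho_max:
            return merged, rho
        rho = min(rho + rho_step, rho_max)   # placement relaxation


# Toy stand-ins: pretend that timing is met once rho is at least 0.25.
merged, final_rho = allocate_until_timing_met(
    circuit="C",
    allocate=lambda c, r: (c, r),
    place_and_cts=lambda m: m,
    meets_timing=lambda layout: layout[1] >= 0.25,
)
print(merged[0], round(final_rho, 2))
```

A larger `rho` makes fewer flip-flop sets mergeable (Definition 2 in Section 1.3.1), so each retry trades some power saving for an easier placement.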

1.3 Placement Aware Multi-bit Flip-flop Allocation

Our multi-bit flip-flop allocation algorithm in step (i) consists of three sub-steps: (Sub-step 1) extracting all mergeable 1-bit flip-flop sets for the input circuit $C$, (Sub-step 2) constructing a merging conflict graph $G$ from the mergeable sets according to the value of the placement relaxation parameter $\rho$, and (Sub-step 3) finding an independent set in $G$ that minimizes the value of $\hat{P}(C')$ in Eq. 1.2.

1.3.1 Extraction of Mergeable Flip-flop Sets

If a simple circuit $C$ contains only three flip-flops $f_1$, $f_2$ and $f_3$, all possibly mergeable sets of flip-flops are $\{f_1, f_2\}$, $\{f_1, f_3\}$, $\{f_2, f_3\}$, and $\{f_1, f_2, f_3\}$. Thus, theoretically, for $C$ with $N$ flip-flops, the number of mergeable sets of flip-flops increases exponentially as $N$ increases. However, in practice, locality in $C$ drastically prunes the mergeable sets. We translate locality into a sort of placement constraint and impose it on the meaning of 'mergeable', based on the following definitions.

Definition 1. Let $S_{f_i}$ be the set of transitive fan-out gates (i.e., successor gates) of a (1-bit) flip-flop $f_i$ in $C$. For a set of (1-bit) flip-flops $\{f_1, f_2, \cdots, f_M\}$ ($M \ge 2$), $D(\{f_1, f_2, \cdots, f_M\})$ is defined as

\[ D(\{f_1, f_2, \cdots, f_M\}) = \frac{|S_{f_1} \cap S_{f_2} \cap \cdots \cap S_{f_M}|}{|S_{f_1} \cup S_{f_2} \cup \cdots \cup S_{f_M}|}. \tag{1.3} \]

We call $D(L)$, $0 \le D(L) \le 1$, the intersection density driven by the flip-flops in set $L$.

Fig. 1.3 shows the partition of a small circuit according to the derivation of the transitive fan-outs of flip-flops. The intersection densities of some subsets of flip-flops in Fig. 1.3 are computed as shown in Table 1.2, in which, for example, $|S_{f_1} \cap S_{f_2}| = |\{g_8, g_9, g_{10}, g_{15}\}| = 4$ gates and $|S_{f_1} \cup S_{f_2}| = 13$ gates, thus $D(\{f_1, f_2\}) = 4/13 = 0.31$.
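The intersection density of Eq. 1.3 is a ratio of set sizes, so it maps directly onto set operations. The Python sketch below reproduces the $D(\{f_1, f_2\}) = 4/13$ example; only the four shared gates and the union size of 13 come from the text, while the remaining gate names in `s_f1` and `s_f2` are made up so that the sizes match.

```python
def intersection_density(fanout_sets):
    """D(L) from Eq. 1.3: |intersection| / |union| of the fan-out sets."""
    common = set.intersection(*fanout_sets)
    union = set.union(*fanout_sets)
    return len(common) / len(union) if union else 0.0


# Shared gates as given in the text; the other members are hypothetical
# fillers chosen so that the union contains 13 gates.
shared = {"g8", "g9", "g10", "g15"}
s_f1 = shared | {"g1", "g2", "g3", "g4"}
s_f2 = shared | {"g5", "g6", "g7", "g11", "g12"}

density = intersection_density([s_f1, s_f2])
print(round(density, 2))  # 0.31
```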

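[Figure: a small circuit with four flip-flops $f_1$ to $f_4$ (each with D, Q and CK pins), partitioned according to the transitive fan-outs of the flip-flops.]

Figure 1.3: Conceptual view of partitioning a circuit based on the transitive fan-outs of flip-flops.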

Table 1.2: Computation of intersection densities of some subsets of flip-flops in the circuit in Fig. 1.3.

                        Intersection density D(L)
Set   Flip-flops        Formulation                                            Value
L1    f1, f2            |S_f1 ∩ S_f2| / |S_f1 ∪ S_f2|                          4/13 = 0.31
L2    f2, f3            |S_f2 ∩ S_f3| / |S_f2 ∪ S_f3|                          3/12 = 0.25
L3    f3, f4            |S_f3 ∩ S_f4| / |S_f3 ∪ S_f4|                          1/10 = 0.1
L4    f1, f2, f3        |S_f1 ∩ S_f2 ∩ S_f3| / |S_f1 ∪ S_f2 ∪ S_f3|            1/17 = 0.06

Definition 2. $L = \{f_1, f_2, \cdots, f_M\}$ ($M \ge 2$) is called a mergeable flip-flop set if $D(\{f_i, f_j\}) \ge \rho$, where $\rho$, $0 \le \rho \le 1$, is a parameter controlling the placement cost.

For example, when $\rho$ is set to 0.20, $L_1 = \{f_1, f_2\}$ and $L_2 = \{f_2, f_3\}$ are mergeable sets, but $L_3 = \{f_3, f_4\}$ and $L_4 = \{f_1, f_2, f_3\}$ are not, since $D(L_3) = 0.1 < 0.2$ and $D(L_4) = 0.06 < 0.2$.

Table 1.3: The number of mergeable flip-flop sets.

          # of cells       # of mergeable sets
Design    FFs     Gates    Theoretical bound   ρ = 0.01   ρ = 0.2
s5378     161     686      5.06 × 10^21        411        240
s13207    647     1724     3.84 × 10^31        1235       789
s15850    560     2049     3.71 × 10^30        1023       809
s35932    1728    5837     2.84 × 10^38        3466       3079
s38417    1564    5410     5.73 × 10^37        2901       2191
s38584    1300    5996     2.94 × 10^36        2931       1849

Note that extracting mergeable flip-flop sets is accomplished by traversing the gates in topological order to identify the flip-flops driving each gate. Due to the locality property of flip-flops, a circuit is most likely to be divided into a set of sub-circuits based on their functions such that each sub-circuit is driven by a distinct set of flip-flops. This flip-flop locality in fact keeps the number of mergeable sets manageable, as validated by the data in Table 1.3, in which the number of mergeable flip-flop sets is far below the theoretical bound.
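The traversal described above can be sketched as follows: visiting gates in topological order, each gate inherits the driving flip-flops of its inputs, and inverting that map gives the transitive fan-out set $S_f$ of every flip-flop. The netlist representation (a `fanin` dictionary keyed by gate name), the function names, and the choice to stop propagation at flip-flop boundaries are assumptions made for this illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def driving_flip_flops(fanin, flip_flops):
    """Map each gate to the set of flip-flops that transitively drive it.

    fanin[g] lists the signals (gate or flip-flop names) feeding gate g.
    """
    # Only gate-to-gate edges constrain the visiting order.
    gate_preds = {g: [s for s in srcs if s not in flip_flops]
                  for g, srcs in fanin.items()}
    drivers = {}
    for gate in TopologicalSorter(gate_preds).static_order():
        found = set()
        for src in fanin.get(gate, ()):
            if src in flip_flops:
                found.add(src)          # directly driven by a flip-flop
            else:
                found |= drivers[src]   # inherited from an upstream gate
        drivers[gate] = found
    return drivers


def transitive_fanouts(fanin, flip_flops):
    """Invert the driver map to obtain S_f for every flip-flop f."""
    fanouts = {f: set() for f in flip_flops}
    for gate, ffs in driving_flip_flops(fanin, flip_flops).items():
        for f in ffs:
            fanouts[f].add(gate)
    return fanouts


# Hypothetical four-gate netlist driven by three flip-flops.
fanin = {"g1": ["f1"], "g2": ["f2"], "g3": ["g1", "g2"], "g4": ["g3", "f3"]}
fanouts = transitive_fanouts(fanin, {"f1", "f2", "f3"})
print(sorted(fanouts["f1"]))  # ['g1', 'g3', 'g4']
```

Each gate is visited once and each connection is followed once; the dominant cost is merging the flip-flop sets at each gate, which stays small when the circuit has the locality discussed above.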

1.3.2 Construction of Merging Conflict Graph

We extend the set of mergeable flip-flop sets, $\mathcal{L} = \{L_1, L_2, \cdots\}$, extracted in Sub-step 1 to $\mathcal{L}' = \mathcal{L} \cup \{\{f_1\}, \{f_2\}, \cdots, \{f_N\}\}$, where $f_1, f_2, \cdots, f_N$ are the 1-bit flip-flops in the input circuit $C$ and each $\{f_i\}$, $i = 1, \cdots, N$, is regarded as a self-mergeable set. Then, from $\mathcal{L}'$ we create a weighted graph $G(V, E, W)$, called the merging conflict graph: (1) a unique node $v_i \in V$ is created for each $L_i \in \mathcal{L}'$, $i = 1, 2, \cdots, |\mathcal{L}'|$, (2) the weight $w_i \in W$ of $v_i$ is set to the amount of power saving $|L_i| \cdot P^{1}_{FF} - P^{|L_i|}_{FF}$, and (3) an edge $(v_i, v_j) \in E$ exists if and only if $L_i \cap L_j \neq \emptyset$. Note that each edge expresses the constraint that the two flip-flop sets at its endpoints cannot both be grouped to form multi-bit flip-flops. (We assume that a 1-bit flip-flop is grouped into one and only one multi-bit flip-flop. In other words, we do not maintain multiple copies of data items.)
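The three construction rules above can be sketched in a few lines of Python. The function name and the list/dictionary representation are assumptions for illustration; the flip-flop powers are the normalized totals of Table 1.1, which is why the example only uses sets of size 1, 2 and 4.

```python
from itertools import combinations


def build_conflict_graph(mergeable_sets, ff_power):
    """Return (nodes, weights, adjacency) of the merging conflict graph.

    nodes[i]     -- the flip-flop set L_i
    weights[i]   -- power saving |L_i| * P^1_FF - P^{|L_i|}_FF
    adjacency[i] -- indices of sets sharing at least one flip-flop with L_i
    """
    nodes = [frozenset(s) for s in mergeable_sets]
    weights = [len(s) * ff_power[1] - ff_power[len(s)] for s in nodes]
    adjacency = {i: set() for i in range(len(nodes))}
    for i, j in combinations(range(len(nodes)), 2):
        if nodes[i] & nodes[j]:          # L_i and L_j overlap: conflict edge
            adjacency[i].add(j)
            adjacency[j].add(i)
    return nodes, weights, adjacency


FF_POWER = {1: 100, 2: 172, 4: 312}      # normalized totals from Table 1.1
sets = [{"f1", "f2"}, {"f2", "f3"}, {"f1", "f2", "f3", "f4"},
        {"f1"}, {"f2"}, {"f3"}, {"f4"}]  # last four: self-mergeable sets
nodes, weights, adjacency = build_conflict_graph(sets, FF_POWER)
print(weights)        # [28, 28, 88, 0, 0, 0, 0]
print(adjacency[0])   # {1, 2, 3, 4}
```

The pairwise overlap test is quadratic in the number of sets, which is acceptable here because Table 1.3 shows that the number of mergeable sets stays in the low thousands for the benchmark circuits.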

1.3.3 Selection of Mergeable Flip-flop Sets

In the last step, we transform the problem of minimizing the quantity $\hat{P}(C')$ in Eq. 1.2 into the problem of finding a set cover on the merging conflict graph $G(V, E, W)$ that leads to a minimal value of $\hat{P}(C')$. Equivalently, we solve the problem of finding a set $R = \{L_{m_1}, L_{m_2}, \cdots, L_{m_K}\} \subseteq V$ such that (1) $L_{m_1} \cup L_{m_2} \cup \cdots \cup L_{m_K} = \{f_1, f_2, \cdots, f_N\}$, which is the set of flip-flops in $C$, and (2) the value of $\sum_{i=1}^{K} P^{|L_{m_i}|}_{FF} + \alpha \cdot |R|$ is minimal.

Since the problem is a variant of the set packing problem, it is very unlikely that an optimal solution can be found in polynomial time. (The set packing problem asks if some $k$ subsets in a list of subsets of a finite set $S$ are pairwise disjoint, in other words, no two of them share an element. When $\alpha = 0$, since minimizing $\sum_{i=1}^{K} P^{|L_{m_i}|}_{FF}$ is equivalent to maximizing $\sum_{i=1}^{K} (|L_{m_i}| \cdot P^{1}_{FF} - P^{|L_{m_i}|}_{FF})$, in which each term is the amount of power saved by an $|L_{m_i}|$-bit flip-flop, the problem is reduced to the maximum weighted set packing problem, which is one of Karp's 21 NP-complete problems [4].) To solve our multi-bit flip-flop allocation problem efficiently, we use the greedy algorithm shown in Algorithm 1, in which we look for the node (i.e., mergeable set) in $G(V, E, W)$ that has the smallest node degree $d_i = (\sum_{L_j \in adjacent(L_i)} w_j) / w_i$ among all nodes $L_i \in V$, add the node to our solution, and remove it and the nodes adjacent to it from $G$. We repeat this until no nodes are left.

[Figure: six panels (a) to (f) showing the conflict graph for the mergeable sets $L_1$ to $L_7$. Each vertex is annotated with its flip-flop set, its weight (0.18 for sets of three flip-flops, 0.12 for sets of two flip-flops) and, from panel (b) on, its degree; vertices are marked "Selected" or "Removed" as the algorithm proceeds.]

Figure 1.4: An example of a conflict graph for 7 mergeable flip-flop sets. (a) Given the conflict graph, we set the weight of each vertex to its amount of power saving. For simplicity, we ignore the vertices of 1-bit flip-flops, whose power saving weight is 0. (b) We evaluate the degree of each vertex considering its neighbors and select the minimum-degree vertex $L_7$. (c) We remove the selected vertex and its neighbors from $G$. (d)(e)(f) We repeat these steps until $G$ becomes empty.

Fig. 1.4 shows a step-by-step procedure of our algorithm for a design with seven mergeable flip-flop sets $L_1, L_2, \cdots, L_7$ extracted among eight flip-flops $f_1, f_2, \cdots, f_8$. We create a conflict graph $G$ in Sub-step 2 and set the weight of each node to its amount of power saving, as shown in Fig. 1.4(a). We ignore the 1-bit flip-flop sets $\{f_1\}, \{f_2\}, \cdots, \{f_8\}$ for simplicity. In Fig. 1.4(b), we calculate the degree of every node, select the node $L_7$, which has the smallest degree among them, and put it into $R$. In Fig. 1.4(c), we remove the selected node $L_7$ and its neighbors $L_6$ and $L_4$ from the graph $G$. In Figs. 1.4(d), (e) and (f), we repeat the same process on the remaining graph $G$ until it becomes empty. Finally, we obtain a set $R$ that has two 3-bit flip-flops, $L_1$ and $L_7$, and one 2-bit flip-flop, $L_5$.

Although the greedy heuristic is not guaranteed to produce results close to the minimum of the $\hat{P}$ cost, it does so on many practical inputs. (If we assume that no node exceeds a size of $k$, i.e., $|L_i| \le k$ with $k \ge 3$, the answer can be approximated within a factor of $k/2 + \epsilon$ for any $\epsilon > 0$; in particular, the problem in which every node has $|L_i| = 3$ can be approximated within about 50%. In another, more tractable variant, if no 1-bit flip-flop occurs in more than $k$ of the nodes, the answer can be approximated within a factor of $k$ [4].)

1.4 Experimental Results

1.4.1 Experimental Setup

We test our algorithm on two sets of circuit netlists: one set consists of ISCAS89 benchmark circuits and the other consists of two large circuits, created by combining benchmark circuits, with roughly 100,000 flip-flops each. We use the 45nm Nangate Open Cell Library and its PDK for our process and library data. To guarantee worst-case performance, we use a slow PVT corner. In addition, we generate an MBFF library covering all sizes from 2-bit to 16-bit. Fig. 1.5 shows the normalized power consumption of the flip-flops of different bit-widths in logarithmic scale. We also adjust the clock pin capacitance of each MBFF to be proportional to its bit size. We use Design Compiler (DC) as the synthesis tool and IC Compiler (ICC) as the P&R tool, both from Synopsys Inc., and apply the default tool options to all design steps: logic synthesis, cell placement and clock tree synthesis.

Algorithm 1 A greedy algorithm for Sub-step 3 (input: G(V, E, W); output: R)

R = ∅;
while G ≠ ∅ do
    // Find Lmax, the vertex with the smallest degree dmin
    Lmax = nil;
    dmin = ∞;
    for each vertex Li in G do
        // Calculate the degree d of vertex Li
        d = 0;
        for each vertex Lj ∈ adjacent(Li) do
            d += wj; // Add the weights of all neighbors
        end for
        d = d / wi; // Divide by its own weight
        if (Lmax is nil || d < dmin) then
            // Keep the vertex whose degree is the smallest so far
            Lmax = Li;
            dmin = d;
        end if
    end for
    // Remove the adjacent vertices of Lmax from G
    for each vertex Lj ∈ adjacent(Lmax) do
        G = G − {Lj};
    end for
    G = G − {Lmax}; // Remove Lmax from G
    R = R ∪ {Lmax}; // Add Lmax to R
end while
return R
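For readers who want to experiment with the selection step, the following is a minimal Python sketch of the greedy procedure of Algorithm 1. The data layout (dictionaries keyed by node index) and the example graph are hypothetical. One point the pseudocode leaves open is a vertex of weight 0, for which the degree would involve a division by zero; the sketch assumes such vertices get an infinite degree, so that they are picked only when nothing better remains.

```python
import math


def greedy_select(weights, adjacency):
    """Greedy selection on a conflict graph, in the spirit of Algorithm 1.

    weights[v]   -- power-saving weight w_v of node v
    adjacency[v] -- set of nodes that conflict with v
    Returns the selected nodes in the order they were picked.
    """
    remaining = set(adjacency)
    selected = []
    while remaining:
        best, best_degree = None, math.inf
        for v in sorted(remaining):      # sorted for a deterministic result
            neighbor_weight = sum(weights[u] for u in adjacency[v]
                                  if u in remaining)
            # Assumption: zero-weight (1-bit) nodes get an infinite degree.
            degree = (neighbor_weight / weights[v]
                      if weights[v] > 0 else math.inf)
            if best is None or degree < best_degree:
                best, best_degree = v, degree
        selected.append(best)
        remaining -= adjacency[best] | {best}   # drop the node and its neighbors
    return selected


# Hypothetical graph: 0={f1,f2}, 1={f2,f3}, 2={f1,f2,f3,f4}, 3={f4,f5}, 4={f5}
weights = {0: 28, 1: 28, 2: 88, 3: 28, 4: 0}
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(greedy_select(weights, adjacency))  # [2, 4]
```

In this example node 2 has the smallest degree (84/88), so the 4-bit set is chosen first and its three conflicting neighbors are discarded; the only remaining node is the 1-bit set {f5}, which is then kept as a single flip-flop.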

[Figure: plot of normalized power per bit (vertical axis, 0.5 to 1.1) against the number of bits (horizontal axis, 0 to 16).]

Figure 1.5: Normalized power consumption per bit of MBFF up to 16-bit.

[Figure: three layout views (a), (b) and (c); panel (b) is annotated "Timing Failed Path".]

Figure 1.6: Layout views for s5378. The yellow lines indicate clock tree networks. The colored small rectangles represent 1-bit flip-flops and MBFFs; light-orange is a 1-bit flip-flop, light-red is a 2∼4-bit MBFF, orange is a 5∼9-bit MBFF and red is a 10∼16-bit MBFF. (a) Result produced by our algorithm. (b) Result produced by random FF grouping with the same number of MBFFs as in (a); the red dashed line indicates a timing violation. (c) Result produced by [1].

Table 1.4: Comparison of the number of flip-flop cells, clock tree power $P_{CLK\_T}$, flip-flop power $P_{FF}$, and the total power for the results produced by the MBFF allocation [1] in post-placement, our algorithm, and our algorithm followed by [1]. All power values are in mW.

                     [1] (post-placement)                     Ours (pre-placement)                     Ours + [1]
Circuit  # of 1-bit  # of n-bit    P_CLK_T  P_FF    Total     # of n-bit    P_CLK_T  P_FF    Total     # of n-bit    P_CLK_T  P_FF    Total
         FFs         FFs                            (Red.)    FFs                            (Red.)    FFs                            (Red.)

s5378    161         1: 20         0.061    0.136   0.243     1: 44         0.054    0.124   0.216     1: 24         0.054    0.121   0.212
                     2∼4: 57                        (6.9%)    2∼4: 10                        (17.2%)   2∼4: 19                        (18.8%)
                     5∼9: 0                                   5∼9: 9                                   5∼9: 9
                     10∼16: 0                                 10∼16: 2                                 10∼16: 2

s13207   647         1: 85         0.183    0.537   0.767     1: 274        0.187    0.541   0.782     1: 60         0.172    0.501   0.727
                     2∼4: 195                       (13.2%)   2∼4: 38                        (11.5%)   2∼4: 128                       (17.8%)
                     5∼9: 0                                   5∼9: 4                                   5∼9: 4
                     10∼16: 0                                 10∼16: 13                                10∼16: 13

s15850   560         1: 65         0.149    0.441   0.640     1: 274        0.142    0.432   0.626     1: 60         0.139    0.401   0.592
                     2∼4: 183                       (12.3%)   2∼4: 34                        (14.2%)   2∼4: 119                       (18.9%)
                     5∼9: 0                                   5∼9: 2                                   5∼9: 2
                     10∼16: 0                                 10∼16: 12                                10∼16: 12

s35932   1728        1: 164        0.627    1.776   3.075     1: 286        0.575    1.477   2.732     1: 144        0.569    1.456   2.705
                     2∼4: 638                       (9.4%)    2∼4: 1                         (19.5%)   2∼4: 63                        (20.3%)
                     5∼9: 0                                   5∼9: 32                                  5∼9: 32
                     10∼16: 0                                 10∼16: 128                               10∼16: 128

s38584   1300        1: 185        0.381    1.091   1.943     1: 465        0.365    1.011   1.854     1: 202        0.363    0.980   1.821
                     2∼4: 459                       (10.5%)   2∼4: 97                        (14.6%)   2∼4: 211                       (16.2%)
                     5∼9: 0                                   5∼9: 33                                  5∼9: 33
                     10∼16: 0                                 10∼16: 26                                10∼16: 26

large1   84000       1: 10066      17.883   57.753  86.714    1: 51464      18.235   59.742  89.554    1: 14379      17.524   55.181  84.611
                     2∼4: 28074                     (11.2%)   2∼4: 4624                      (8.3%)    2∼4: 19363                     (13.4%)
                     5∼9: 21                                  5∼9: 1045                                5∼9: 1055
                     10∼16: 0                                 10∼16: 1018                              10∼16: 1018

large2   119200      1: 13571      19.621   56.243  99.555    1: 58358      19.542   57.099  100.728   1: 20021      18.948   54.003  97.097
                     2∼4: 40898                     (9.7%)    2∼4: 11743                     (8.7%)    2∼4: 26802                     (12.0%)
                     5∼9: 34                                  5∼9: 3095                                5∼9: 3108
                     10∼16: 0                                 10∼16: 953                               10∼16: 953

Average P_TOTAL reduction                           10.5%                                    13.5%                                    16.8%

1.4.2 Comparing with Academic Algorithm

The preceding table summarizes the results produced by the recent MBFF allocation work in [1], which is applicable at the post-placement stage, by our algorithm applied at the logic synthesis (pre-placement) stage, and by our algorithm followed by [1]. We compare the numbers of 1-bit and multi-bit flip-flops, the clock tree power PCLK_T, the flip-flop power PFF, and the total power, which is the sum of PCLK_T, PFF, and the power consumption of combinational logic cells. The power reduction numbers in parentheses are the rates of reduction from the initial testcases with no MBFF allocation. All reported power values were evaluated after clock tree synthesis. The comparison shows that our algorithm reduces the total power by 13.5% on average (up to 19.5%), which is more than the 10.5% average reduction (up to 13.2%) achieved by [1] at the post-placement stage. Note that the power reduction by our algorithm varies considerably, from 8.3% for large1 to 19.5% for s35932. Regarding the number of MBFFs, our algorithm was able to group up to 16 flip-flops, while [1], used at post-placement, could not group more than 10 flip-flops due to timing violations. As shown in the last four columns of the table, our algorithm can be combined with any MBFF allocation algorithm used in the placement stage. The power improvement from the combined application of ours and [1] is consistent across all testcases, saving about 6% more power than [1] alone. In addition, the total power reduction in all cases indicates that the power increase of combinational logic cells due to MBFF allocation is not significant, because PCLK_T and PFF dominate the total power consumption.

Table 1.5: Comparison of the number of flip-flop cells, clock tree power PCLK_T, flip-flop power PFF, combinational logic power PCOMBI, the total power PTOTAL, the longest path delay tlpd, and the worst local clock skew tskew for the results produced by Design Compiler's MBFF algorithm and by our algorithm followed by [1]. Power is in mW and delay in ns; values in parentheses are reductions (Red.).

Circuit | Method          | # of MBFFs (1 / 2-4 / 5-9 / 10-16) | PCLK_T (Red.)  | PFF (Red.)     | PCOMBI (Red.)   | PTOTAL (Red.)  | tlpd (Red.)    | tskew (Red.)
--------+-----------------+------------------------------------+----------------+----------------+-----------------+----------------+----------------+---------------
s35932  | Design Compiler | 0 / 0 / 69 / 102                   | 0.558 (18.4%)  | 1.454 (29.9%)  | 0.518 (-24.2%)  | 2.530 (20.3%)  | 0.593 (-2.4%)  | 0.069 (-19.3%)
s35932  | Ours + [1]      | 42 / 139 / 26 / 124                | 0.570 (16.7%)  | 1.482 (28.6%)  | 0.445 (-6.7%)   | 2.497 (21.4%)  | 0.534 (9.9%)   | 0.076 (-8.9%)
s38584  | Design Compiler | 0 / 1 / 44 / 70                    | 0.283 (20.5%)  | 0.714 (32.5%)  | 0.356 (-7.2%)   | 1.353 (22.5%)  | 0.568 (-12.1%) | 0.101 (-32.6%)
s38584  | Ours + [1]      | 173 / 218 / 31 / 17                | 0.314 (11.8%)  | 0.841 (20.5%)  | 0.302 (9.0%)    | 1.458 (16.5%)  | 0.500 (12.0%)  | 0.087 (14.0%)
large2  | Design Compiler | 0 / 9 / 2791 / 6271                | 13.475 (17.5%) | 32.599 (33.2%) | 21.381 (-45.0%) | 67.455 (15.6%) | 2.069 (-4.5%)  | 0.326 (-60.6%)
large2  | Ours + [1]      | 16337 / 23993 / 1901 / 513         | 14.860 (9.1%)  | 41.545 (14.9%) | 14.922 (-1.2%)  | 71.327 (10.7%) | 1.498 (24.3%)  | 0.265 (-30.9%)

Fig. 1.6 shows the chip layout views for s5378 produced by (a) our algorithm, (b) random FF grouping until the number of MBFFs equals that in (a), and (c) [1]. We can see that the view in (a) has a simpler clock tree structure (due to fewer flip-flop cells) than that in (c).

1.4.3 Comparing with Industry Algorithm

We compare ours with the results produced by Synopsys Design Compiler, which supports in-placement MBFF allocation. Design Compiler (DC) recognizes the cell placement in topological mode and groups flip-flops with the commands identify_register_bank and compile_ultra. DC finds compatible groups of 1-bit flip-flops that are physically close to each other to create MBFFs. However, it does not take timing information into account, and thus sometimes incurs timing violations. For a fair comparison, we performed the same flow as DC for s35932, s38584, and large2. The only difference is that our algorithm used the compile_ultra command. Table 1.5 summarizes the results in terms of the numbers of resulting flip-flops, the power consumptions PCLK_T, PFF, and PCOMBI of the clock tree, flip-flops, and combinational cells, the longest path delay tlpd, and the worst local clock skew tskew. Since DC allocates MBFFs very aggressively, no 1-bit flip-flops are left, and the total power consumption is reduced by 19.5% on average, which is more than the 16.2% average reduction by our algorithm. However, since DC does not consider the effect on timing, it must be used carefully for MBFF allocation. In addition, the analysis of Table 1.5 reveals a number of weaknesses in DC's MBFF allocation: (1) DC increases the power consumption of combinational cells significantly, by 25.5% on average (up to 45.0% in large2), while our algorithm decreases it by 0.4% on average. Since ours uses the same total number of combinational cells as DC, the power increase in combinational cells by DC is more likely to increase chip power density and cause IR-drop hot-spot problems; (2) DC increases the longest path delay by 6.3% on average, while our algorithm decreases it by 15.4% on average. The longer delay by DC makes the chip more vulnerable to process variation; (3) DC increases the worst local clock skew significantly, by 37.5% on average (up to 60.6%), while our algorithm increases it by only 8.6% on average. The large increase of clock skew by DC may increase the chance of timing or even functional failure.
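The averages quoted above can be reproduced from the per-circuit reductions in Table 1.5. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name `average_reductions` are illustrative assumptions, and only the percentages are taken from the table.

```python
from statistics import mean

# Per-circuit reductions (%) for s35932, s38584, large2, copied from Table 1.5.
# Positive values are reductions, negative values are increases.
REDUCTIONS = {
    "Design Compiler": {
        "PCOMBI": [-24.2, -7.2, -45.0],
        "PTOTAL": [20.3, 22.5, 15.6],
        "tlpd": [-2.4, -12.1, -4.5],
        "tskew": [-19.3, -32.6, -60.6],
    },
    "Ours + [1]": {
        "PCOMBI": [-6.7, 9.0, -1.2],
        "PTOTAL": [21.4, 16.5, 10.7],
        "tlpd": [9.9, 12.0, 24.3],
        "tskew": [-8.9, 14.0, -30.9],
    },
}


def average_reductions(table):
    """Average each metric over the circuits, rounded to one decimal place."""
    return {
        method: {metric: round(mean(values), 1) for metric, values in cols.items()}
        for method, cols in table.items()
    }


AVERAGES = average_reductions(REDUCTIONS)
for method, cols in AVERAGES.items():
    print(method, cols)
```

A negative average (for example, PCOMBI for Design Compiler) corresponds to an average increase of the same magnitude in the text.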

1.5 Conclusion

We presented in this paper a new approach to the multi-bit flip-flop allocation problem. Previous approaches divided the allocation problem into two steps: (i) placing single-bit flip-flops under circuit timing constraints and (ii) minimizing the clock tree and flip-flop power by grouping single-bit flip-flops to form multi-bit flip-flops. In our approach, we placed primary importance on the cost of power consumption rather than on the cost of placement. Consequently, we attempted to minimize the power consumption by synthesizing multi-bit flip-flops first and then placing them. Experimental results have demonstrated that our approach of considering multi-bit flip-flop synthesis early offers great benefits in reducing the power consumption of designs while minimizing the impact on timing during synthesis.

Chapter 2

Switch Cell Optimization of Power-gated Modern System-on-Chips

2.1 Introduction

As smartphones have become popular, various sorts of mobile devices have been manufactured and used, and the need for higher performance and longer battery life remains strong in the marketplace. Integrating many cores into a chip and enhancing the core architecture have been the mainstream strategy for performance improvement, while power-gating, clock-gating, and dynamic voltage and frequency scaling (DVFS) have been essential ingredients of low-power on-chip design methodologies.

In particular, power-gating is one of the most commonly used techniques to save standby leakage by shutting off the current to blocks of the circuit that are not in use. This is accomplished by adding switch cells to either the VDD or the VSS supply. Fig. 2.1 illustrates the logical structure of a power-gated power delivery network (PDN), in which the switch cells cut the current flowing from the power sources (e.g., IO power pads or bumps) to the power sinks (e.g., standard cells) during the standby state. Switch cells can be classified into two types depending on their role: the header type for cutting power and the footer type for cutting ground. In real products, header-type switch cells are more popular. For header-type cells, the part of the PDN from the power sources to the switch cells is called the real power network, or RVDD net, and the part from the switch cells to the power sinks is called the virtual power network, or VVDD net. In other words, the RVDD net, the switch cells, and the VVDD net form a serially connected power network, and the switch cells cut or link it.

Since the switch cells act as resistors, expressed as R_on, during the active (i.e., ON) state, multiple small switch cells should be distributed over the area of logic standard cells in order not to violate the IR-drop constraint [20]. This dispersion of switch cells also helps prevent excessive in-rush current during the wake-up state (e.g., [8, 11]).

Figure 2.1: A logical structure of power-gated power delivery network (PDN). [Diagram labels: VDD, Real Power Net (RVDD), NSLEEP, Switch Cells, Virtual Power Net (VVDD), Standard Cells (IN, OUT, CLK), Circuit Area, VSS, Ground Net (VSS).]

As the share of static power dissipation grows with technology scaling [7], and as internet-of-things (IoT) devices, which spend more time in standby mode, become widespread, reducing standby leakage becomes an important issue for ensuring much longer battery life. This work addresses the problem of reducing the standby leakage dissipated through the switch cells. The Super Cut-Off CMOS (SCCMOS) scheme [6] is one of the notable solutions on this topic; it overdrives the control (i.e., wakeup) signals higher than VDD, thereby reducing the standby leakage at the switch cells, and it is adopted in [12] for ultra-low power management. This scheme, however, requires additional circuitry to generate the overdriving voltage level, which also dissipates standby leakage. In this work, we aim to reduce the standby leakage at switch cells without such additional burden.

A common knob that prior works ([9, 20, 13]) have used to control the number and type of switch cells is the total current around the cells. Kozhaya and Bakir [9] determined the minimal number of switch cells by computing I_VI / I_PGC, where I_VI is the given total current demand of the circuit on a voltage island and I_PGC is the maximum current that can be supplied by a switch cell, and then iteratively allocated switch cells so that their total number is as close as possible to I_VI / I_PGC while meeting the IR-drop constraint. They employed a greedy approach. Unlike [9], Yong and Ung [20] exploited the effect of cell placement on the determination of the cells to be allocated; they proposed a method, called power perimeter scanning, to find the widest sub-regions on which the contained switch cells can sustain the supply current. Every cell that lies on two or more sub-regions is considered for removal if the removal causes no IR-drop violation. One limitation of this work is that the number of sub-regions considered is exponential. On the other hand, Lin and Kin [13] partitioned the whole circuit into sub-regions according to the current constraint and computed the effective resistance of the sub-regions to estimate the number and type of switch cells. They performed two steps: (1) they formulated the problem of determining, for each sub-region, the locations at which switch cells should be placed as an integer linear programming (ILP) problem; (2) they iteratively adjusted the effective resistance of (i.e., resized) the switch cells allocated in step 1 until there was no IR-drop violation. The works in [20, 13] considered the effect of cell placement. However, (limitation 1) since their controlling knob is based on current rather than IR-drop, their optimizations are indirect. In addition, (limitation 2) the current-based knob cannot handle IR-drop hot-spots that occur on the power/ground rails because of their large resistance, as supported by the data shown in Fig. 2.4. This work overcomes these two limitations of the prior works.
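As a small illustration of the current-based knob, the ratio I_VI / I_PGC can be turned into a cell count as sketched below. Rounding the ratio up to the next integer is an assumption made here (the text only gives the ratio), and the function name and the current values are hypothetical.

```python
import math


def min_switch_cells(i_vi_ma: float, i_pgc_ma: float) -> int:
    """Lower bound on the switch cell count from the current demand alone.

    i_vi_ma:  total current demand of the voltage island (mA)
    i_pgc_ma: maximum current one switch cell can supply (mA)
    The ceiling is an assumption: a fractional cell is rounded up.
    """
    if i_vi_ma < 0 or i_pgc_ma <= 0:
        raise ValueError("currents must be non-negative and the per-cell limit positive")
    return math.ceil(i_vi_ma / i_pgc_ma)


# Hypothetical numbers: a 100 mA island and switch cells that supply 30 mA each.
print(min_switch_cells(100.0, 30.0))
```

This bound ignores where the current is drawn and how resistive the path is, which is exactly the indirection that limitation 1 refers to.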

2.2 Preliminaries and Motivations

Figure 2.2: The changes of standby leakage of openMSP430 circuit as the number and type (size) of switch cells vary. [Plot: x-axis, the number of switch cells (1 to 71); y-axis, standby leakage (uA, 0 to 120); one curve per cell type HEADX32_RVT, HEADX16_RVT, HEADX8_RVT, HEADX4_RVT, and HEADX2_RVT; annotated linear fit y = 0.154x + 0.474, R^2 = 0.99.]

Fig. 2.2 shows the changes of the (total) standby leakage of the openMSP430 circuit as the number and the type (i.e., size) of the switch cells used in the circuit change. The curves show that as the number of switch cells increases, the standby leakage consistently increases as well. In addition, the standby leakage grows with the size of the switch cells: for example, when using HEADX32_RVT and HEADX16_RVT cells, the standby leakage is roughly 3 and 5 times higher, respectively, than when using HEADX2_RVT. Consequently, the analysis of Fig. 2.2 indicates that it is necessary to use as few switch cells as possible, with cell sizes as small as possible. However, since smaller cells present a higher resistance (R_on) in the PDN, using too few and too small cells would lead to a sharp voltage drop in the PDN. Thus, we need to find a minimal number (i.e., locations) and type of switch cells that avoids unnecessary standby leakage dissipation as much as possible while meeting the IR-drop constraint. Note that the number of switch cells required to meet the IR-drop constraint is typically much larger than the number required by the wake-up delay constraint, due to strong physical constraints such as the large resistance of the switch cells and power/ground rails; hence we do not take the wake-up delay constraint into consideration in our design specification. (If the wake-up delay still matters, one simple strategy is to choose and revive additional switch cells later.)

Figure 2.3: Typical PDN structure used in current mobile SoCs. [Diagram labels: C4 bumps; upper thick metal stack, low resistance (Metal7~Metal9); middle thin metal stack, middle resistance (Metal3~Metal6); lower thin power rail, high resistance (Metal1~Metal2); power-gating switch cell, high R_on.]

Fig. 2.3 shows a bird's-eye view of the typical PDN structure of current mobile SoCs, in which seven or more metal layers are commonly used: the two uppermost layers, with wide and thick metals of very low resistance, are exclusively assigned to the power/ground network to deliver as much current as possible; the lowest one or two layers are used as the rails of the PDN, which are either in the cell library or added before/after cell placement, to deliver current to the standard cells. The remaining layers are implemented as a power/ground mesh to bridge the uppermost layers and the bottom power/ground rails.

Figure 2.4: An example of (a) switch cell placement, (b) its IR-drop map under an assumption of fully placed power/ground IO bumps, and (c) IR-drop and resistance composition for the worst IR-drop path of aes256 circuit categorized by elements of power net. [Panel (a) marks the switch cell pitch; panel (b) marks a switch cell and the violation region; panel (c) is a stacked bar chart (0% to 100%) of resistance and IR-drop split among VVDD rail, VVDD mesh/via, switch cell, and RVDD, with labeled shares 46.5%, 59.2%, 50.6%, and 33.7%.]

The large resistance of the thin and narrow power/ground rails contributes significantly to IR-drop violations. Likewise, since the effective resistance of switch cells is high, the portion of the IR-drop caused by the cells is large as well. (Supporting data is included in Fig. 2.4(c) and will be explained later.) Switch cells are evenly distributed in the standard cell area with a pre-defined pitch p, as shown in Fig. 2.4(a), where the small yellow boxes indicate pre-placed switch cells. Since the switch cell placement has already been done in the pre-placement stage, as in Fig. 2.4(a), it may cause IR-drop violations at the post-placement stage, as indicated in Fig. 2.4(b), in which the red spots indicate violations of the IR-drop constraint. We traced the power delivery path from the source to the standard cells around which the worst IR-drop occurs in Fig. 2.4(b), and measured the voltage drops and resistances along RVDD → switch cell → VVDD mesh/via → VVDD rail on the worst IR-drop path. Fig. 2.4(c) shows the relative portions of the resistances and voltage drops among RVDD, switch cell, VVDD mesh/via, and VVDD rail on that path. Clearly, it shows that switch cells can be the major source of IR-drop constraint violations, contributing about 50%. Because a large number of switch cells must be distributed, and because of the burden of secondary routing of the always-on and wakeup signals to the switch cells, the switch cell pitch p is determined at the power-plan stage, before cell placement. Consequently, setting a small p in the pre-placement stage may avoid any IR-drop problem across the entire circuit, but the standby leakage of so many switch cells will be enormous. This work attenuates this leakage by selecting switch cells to be removed or resized while strictly satisfying the IR-drop constraint.

2.3 Problem Formulation

Objective: Let L denote the switch cell library that contains K types of switch cells sw_1, sw_2, ..., sw_K, arranged in increasing order of driving strength. Then, the standby leakage dissipated by the switch cells in a power-gated circuit C can be expressed as:

\[ I_{standby}(C) = \alpha \cdot \sum_{sw_i \in L} I_{sw_i} \cdot n_{sw_i} + \beta \tag{2.1} \]

where I_{sw_i} and n_{sw_i} represent the standby leakage of a switch cell of type sw_i and the number of switch cells of type sw_i used in C, respectively; α is a weighting factor, and β is a constant that indicates the total leakage of the always-on parts.

Constraint: Let S = {s_1, s_2, ...} be the set of switch cells placed on C. For simplicity, we adopt a constant global voltage margin rather than the variation margin for advanced OCV (AOCV) [19], and consider static IR-drop. Then, the voltage drop V_{s_i} at cell s_i, for every i = 1, ..., |S|, must not exceed the constraint value V_limit, i.e.,

\[ \max_{s_i \in S} V_{s_i} \le V_{limit} \tag{2.2} \]

Optimization: We want to determine the switch cells in S to be removed or resized (i.e., replaced with cells of another type) to minimize I_standby(C) in Eq. 2.1 while satisfying the constraint in Eq. 2.2.
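The objective and constraint can be evaluated directly once the cell counts and per-cell voltage drops are known. The sketch below assumes α = 1 and β = 0 by default; the function names, the per-type leakage values (borrowed from the leakage column of Table 2.1), and the voltage drops are illustrative assumptions rather than measured data.

```python
def standby_leakage(leak_per_type, count_per_type, alpha=1.0, beta=0.0):
    """Eq. 2.1: alpha * sum(I_sw * n_sw over the types used) + beta."""
    total = sum(leak_per_type[t] * n for t, n in count_per_type.items())
    return alpha * total + beta


def meets_ir_drop(v_drop_per_cell, v_limit):
    """Eq. 2.2: the worst voltage drop over all switch cells must not exceed v_limit."""
    return max(v_drop_per_cell.values()) <= v_limit


# Hypothetical per-type leakage and a solution using two sw2 cells and one sw3 cell.
LEAK = {"sw1": 0.404, "sw2": 0.672, "sw3": 1.239}
COUNTS = {"sw2": 2, "sw3": 1}
print(standby_leakage(LEAK, COUNTS))

# Hypothetical static IR-drop values in volts, checked against a 25 mV limit.
DROPS = {"s1": 0.021, "s2": 0.024, "s4": 0.019}
print(meets_ir_drop(DROPS, v_limit=0.025))
```

Removing a cell lowers its count n_{sw_i} and resizing moves a cell from one type to another, so both decisions act on the same sum.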

2.4 The Proposed Algorithm

The input to our switch cell optimization algorithm is a circuit C in which the placement and routing of logic cells as well as a uniform distribution of switch cells have been completed. Our algorithm then performs two steps: (Step 1) extracting a set of maximally feasible regions for every switch cell location in C, and (Step 2) selecting a subset of the maximally feasible regions obtained in Step 1 that covers C such that I_standby(C) is minimized under the V_limit constraint. The overall flow of our algorithm is shown in Fig. 2.5. The output of the algorithm is the set of switch cells to be turned on (i.e., the cells in the regions selected in Step 2) and their switch cell types in L for replacement.

2.4.1 Extraction of Maximally Feasible Subregions

This step identifies all subregions, expressed as r(s_i, sw_j, b_{x1,y1}, ..., b_{xk,yk}), which denotes that deploying switch cell s_i with type sw_j can by itself resolve any IR-drop violation that may occur at the locations in b_{x1,y1}, ..., b_{xk,yk}. We call such a subregion r(·) feasible, and call r(·) maximally feasible if it is feasible and no feasible subregion of s_i with type sw_j properly contains r(·). We collect all such maximal subregions into the set R. The extraction of all maximally feasible subregions is performed in two sub-steps:

Figure 2.5: The flow of our two-step algorithm of switch cell optimization. [Flowchart: chip with completed P&R → Step 1, extract feasible regions for each switch cell (Sec. 2.4.1), consisting of Step 1.1 bin partitioning and Step 1.2 generating all maximally feasible regions → Step 2, switch cell covering for minimal standby leakage adopting an approximation algorithm (Sec. 2.4.2) → remove the switch cells not in the solution set.]

Step 1.1 (Bin partitioning): We partition circuit C into regular bins in such a way that every switch cell is placed at the center of a distinct bin. Fig. 2.6 shows an example of a partition. Any style of bin partitioning is acceptable in our algorithm, as long as each bin is small enough that the centered switch cell of the smallest type sw_1 is able to completely resolve the IR-drop violation in the bin. In this work, we apply vertically and horizontally equal-interval partitioning. We use the notation b_{x,y} to denote the bin at the x-th column and y-th row.

Figure 2.6: An illustration of bin partitioning (Step 1.1), horizontally along the power/ground rails and vertically along the center of the locations of two serially placed cells. [Grid of X columns by Y rows.]
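With equal-interval partitioning, the bin b_{x,y} that contains a given location follows from simple arithmetic. The sketch below is one possible reading, assuming a lower-left origin, 1-based column and row indices, and a hypothetical function name `bin_index`; none of these conventions are stated in the text.

```python
def bin_index(x_um, y_um, width_um, height_um, n_cols, n_rows):
    """Return the 1-based (column, row) of the bin containing point (x, y).

    Assumes the circuit occupies [0, width] x [0, height] with the origin at
    the lower-left corner and that bins have equal width and equal height.
    """
    if not (0 <= x_um <= width_um and 0 <= y_um <= height_um):
        raise ValueError("point lies outside the circuit area")
    bin_w = width_um / n_cols
    bin_h = height_um / n_rows
    # Points on the right/top boundary are assigned to the last bin.
    col = min(int(x_um / bin_w), n_cols - 1) + 1
    row = min(int(y_um / bin_h), n_rows - 1) + 1
    return col, row


# Hypothetical 100um x 100um area split into 5 x 5 bins.
print(bin_index(30.0, 70.0, 100.0, 100.0, 5, 5))
```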

Step 1.2 (Generating all maximally feasible subregions): We then use an incremental approach to find the maximally feasible subregions. The following process is performed iteratively for every s_i ∈ S. For example, let R = {} and let us focus on s_6 in Fig. 2.7(a). By the definition of the bin generation, R = {r(s_6, sw_1, b_{2,4})}, as indicated in Fig. 2.7(b). Now, we look at the upper neighbor bin to see if it can be included. The IR-drop analysis shows that the exclusive use of s_6 with type sw_1 makes the subregion consisting of b_{2,4} and b_{2,5} feasible. Thus, R = {r(s_6, sw_1, b_{2,4}, b_{2,5})}, as shown in Fig. 2.7(c). Likewise, we examine the lower neighbor bin to see if it can be included. The IR-drop analysis shows that the exclusive use of s_6 with type sw_1 leaves the subregion of b_{2,3} and b_{2,4} infeasible, as shown in Fig. 2.7(d), but the use of s_6 with type sw_2 makes the subregion feasible, as shown in Fig. 2.7(e). Consequently, the maximally feasible set becomes R = {r(s_6, sw_1, b_{2,4}, b_{2,5}), r(s_6, sw_2, b_{2,3}, b_{2,4})}. Further expansion of the feasible subregion is possible, as shown in Fig. 2.7(f), by increasing the cell size from sw_2 to sw_3, updating R = {r(s_6, sw_1, b_{2,4}, b_{2,5}), r(s_6, sw_2, b_{2,3}, b_{2,4}), r(s_6, sw_3, b_{2,2}, b_{2,3}, b_{2,4})}. We employ a pruning technique while expanding the subregions, as illustrated in Fig. 2.7(f), to avoid further tests for a subregion if it has already been found to be feasible with a cell of type sw_j.
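The incremental expansion can be sketched in a simplified one-dimensional form. This is not the thesis implementation: it grows a single contiguous interval of bins in one column, keeps one region per cell type instead of all maximal regions, and replaces the IR-drop analysis with a toy feasibility oracle. The names `grow_regions`, `toy_oracle`, and `REACH` are hypothetical, and the sketch assumes that a region feasible for a smaller cell type stays feasible for a larger one, which is what lets the larger type start from the region already found.

```python
def grow_regions(center, n_bins, cell_types, is_feasible):
    """Grow a contiguous bin interval [lo, hi] around `center` for each cell type.

    cell_types is ordered by increasing drive strength. is_feasible(type, lo, hi)
    stands in for an IR-drop analysis of the interval when only this cell is on.
    """
    regions = {}
    lo = hi = center  # the cell's own bin is feasible by construction of the bins
    for cell_type in cell_types:
        grown = True
        while grown:  # keep trying both neighbors until neither can be added
            grown = False
            if hi + 1 < n_bins and is_feasible(cell_type, lo, hi + 1):
                hi += 1
                grown = True
            if lo - 1 >= 0 and is_feasible(cell_type, lo - 1, hi):
                lo -= 1
                grown = True
        regions[cell_type] = (lo, hi)
    return regions


# Toy oracle: a cell of a given type can serve bins up to REACH bins away.
REACH = {"sw1": 1, "sw2": 2, "sw3": 3}
CENTER = 4


def toy_oracle(cell_type, lo, hi):
    return max(CENTER - lo, hi - CENTER) <= REACH[cell_type]


REGIONS = grow_regions(CENTER, 10, ["sw1", "sw2", "sw3"], toy_oracle)
print(REGIONS)
```

In a real flow each oracle call is an IR-drop analysis, so skipping tests for regions already known to be feasible is where the pruning saves time.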

2.4.2 Switch Cell Covering for Minimal Standby Leakage

From the set R that contains all maximally feasible covering subregions obtained in Step 1, this step extracts a subset F ⊆ R with the least standby leakage while meeting the IR-drop constraint in C. We formulate this problem as a weighted set cover problem, which is stated as follows: given a set U of elements and a set H of subsets A_1, A_2, ... of U with weights w(A_1), w(A_2), ... such that A_1 ∪ A_2 ∪ ... = U, find a subset F of H with minimum total weight such that every e_i ∈ U satisfies e_i ∈ A_j for some A_j ∈ F. Thus, our transformation into the weighted set cover problem constructs U, H, and the weights w(A_i) as follows: (1) U = {b_{1,1}, b_{1,2}, ...} is the set of physical locations in C on which the IR-drop constraint should be met; (2) H = R, i.e., A(·) = r(·); (3) w(A(·)) = w(r(·)) is the standby leakage of the switch cell s(·) of type sw(·) in r(·).

To speed up the computation of set covering, we adopt the approximation algorithm in [3]. A pseudocode of the approximation algorithm tailored to our context is given in Algorithm 2. The algorithm is a mixture of greedy and exact algorithms. As the whole initial problem is too big, a greedy algorithm is applied up to a certain extent, followed by an exact algorithm. Because the input to the exact algorithm is then relatively small, it takes much less time to get an optimal solution for that part. By integrating both algorithms into one framework in Algorithm 2, in which a parameter ρ (≥ 1) controls the problem sizes handled by the greedy and exact algorithms, the run time can be significantly reduced while bounding the sub-optimality. The approximation ratio of Algorithm 2 is 1 + ln ρ [3].

Figure 2.7: An illustration of extracting maximally feasible regions (Step 1.2) for switch cell s_6. [Panels (a) to (f): bin grids around s_6 with ON/OFF switch cells, annotated with the IR-drop result of each expansion (IR met, IR violated, infeasible), the switch of the cell type, and the resulting set R.]

Algorithm 2 (Step 2) Approximation algorithm for optimizing switch cells
Input: U, R, ρ (control parameter); Output: F ⊆ R

Rg = ∅   // a set of subregions in R for greedy part
F = R   // F will be iteratively refined.
n = |U|   // total bins in C
// ⋃_B is a bin union operation.
while (⋃_B R) ∪ (⋃_B Rg) = U do
    Select r(·) ∈ R that minimizes w(r(·)) / binCount(r(·) \ ⋃_B Rg)
    // ρ (≥ 1) controls problem sizes for greedy and exact.
    // Increasing ρ places more weight on greedy.
    if n − binCount(⋃_B Rg ∪ {r(·)}) > n/ρ then
        Rg = Rg ∪ {r(·)}   // greedy covering
        Remove conflict subregions against the r(·) from R
    else
        Rt = {r′(·) ∈ R | r′(·) ∉ ⋃ Rg ∪ {r(·)}}
        Ropt = an exact covering for Rt
        if w(Rg) + w({r(·)}) + w(Ropt) < w(F) then
            F = Rg ∪ {r(·)} ∪ Ropt
        end if
        R = R \ r(·)
    end if
end while
Return F
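The selection rule used in the greedy part of Algorithm 2 (pick the subregion with the smallest weight per newly covered bin) is the classical greedy heuristic for weighted set cover. The sketch below shows only that greedy rule, not the hybrid greedy/exact scheme of [3] and not the conflict removal; the function name and the toy data are hypothetical.

```python
def greedy_weighted_cover(universe, subsets, weights):
    """Repeatedly pick the subset with the lowest weight per newly covered element."""
    covered = set()
    chosen = []
    while covered != universe:
        best_name, best_ratio = None, None
        for name, members in subsets.items():
            gain = len((members & universe) - covered)
            if gain == 0:
                continue  # this subset adds nothing new
            ratio = weights[name] / gain
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the subsets do not cover the universe")
        chosen.append(best_name)
        covered |= subsets[best_name] & universe
    return chosen


# Toy instance: three bins and three candidate subregions.
TOY_UNIVERSE = {1, 2, 3}
TOY_SUBSETS = {"a": {1, 2, 3}, "b": {1, 2}, "c": {3}}
TOY_WEIGHTS = {"a": 3.0, "b": 1.0, "c": 1.0}
print(greedy_weighted_cover(TOY_UNIVERSE, TOY_SUBSETS, TOY_WEIGHTS))
```

The greedy rule alone can miss the optimum on other instances, which is why Algorithm 2 hands the remaining bins to an exact solver once the uncovered part is small enough.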

Table 2.1 shows an example of how Algorithm 2 is applied, for U = {b_{1,1}, b_{1,2}, ..., b_{1,7}}, R = {r(s_1, sw_1, b_{1,1}), r(s_1, sw_2, b_{1,1}, b_{1,2}), ..., r(s_4, sw_3, b_{1,5}, b_{1,6}, b_{1,7})}, and ρ = 1 (i.e., for the exact solution), to find the switch cells to be turned on, together with their resized types, that yield the least total standby leakage. The minimum cover solution in Table 2.1 indicates that only three switch cells, s_1, s_2, and s_4, suffice to meet the IR-drop constraint on all bins if they are sized to sw_2, sw_3, and sw_2, respectively. The resulting standby leakage is 2.583, which is minimal.

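Table 2.1: An example showing a minimum weighted set cover solution

R (feasible subregions)                    | b_{1,1} | b_{1,2} | b_{1,3} | b_{1,4} | b_{1,5} | b_{1,6} | b_{1,7} | Leakage w(r(·))
-------------------------------------------+---------+---------+---------+---------+---------+---------+---------+----------------
r(s_1, sw_1, b_{1,1})                      |    √    |         |         |         |         |         |         | 0.404
r(s_1, sw_2, b_{1,1}, b_{1,2})             |    √    |    √    |         |         |         |         |         | 0.672
r(s_1, sw_3, b_{1,1}, b_{1,2}, b_{1,3})    |    √    |    √    |    √    |         |         |         |         | 1.239
r(s_2, sw_1, b_{1,3})                      |         |         |    √    |         |         |         |         | 0.404
r(s_2, sw_2, b_{1,2}, b_{1,3})             |         |    √    |    √    |         |         |         |         | 0.672
r(s_2, sw_3, b_{1,3}, b_{1,4}, b_{1,5})    |         |         |    √    |    √    |    √    |         |         | 1.239
r(s_2, sw_3, b_{1,1}, b_{1,2}, b_{1,3})    |    √    |    √    |    √    |         |         |         |         | 1.239
r(s_2, sw_3, b_{1,2}, b_{1,3}, b_{1,4})    |         |    √    |    √    |    √    |         |         |         | 1.239
r(s_3, sw_2, b_{1,5})                      |         |         |         |         |    √    |         |         | 0.672
r(s_3, sw_3, b_{1,4}, b_{1,5})             |         |         |         |    √    |    √    |         |         | 1.239
r(s_4, sw_2, b_{1,6}, b_{1,7})             |         |         |         |         |         |    √    |    √    | 0.672
r(s_4, sw_3, b_{1,5}, b_{1,6}, b_{1,7})    |         |         |         |         |    √    |    √    |    √    | 1.239
min. cover F                               |    √    |    √    |    √    |    √    |    √    |    √    |    √    | 2.583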
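The minimum cover in Table 2.1 is small enough to check by exhaustive search. The sketch below enumerates all combinations of the twelve subregions; it is a brute-force stand-in for the exact solver, not the algorithm of [3]. Two assumptions are made here: the r(s_2, sw_1, ·) row is taken to cover bin b_{1,3}, and a switch cell may appear at most once in a solution (one type per cell). Bins are indexed 1 to 7 by column, and all names are illustrative.

```python
from itertools import combinations

UNIVERSE = frozenset(range(1, 8))  # bins b_{1,1} .. b_{1,7}, indexed by column

# (switch cell, type, covered bins, leakage), following the rows of Table 2.1.
SUBREGIONS = [
    ("s1", "sw1", {1}, 0.404),
    ("s1", "sw2", {1, 2}, 0.672),
    ("s1", "sw3", {1, 2, 3}, 1.239),
    ("s2", "sw1", {3}, 0.404),  # assumption: this row covers b_{1,3}
    ("s2", "sw2", {2, 3}, 0.672),
    ("s2", "sw3", {3, 4, 5}, 1.239),
    ("s2", "sw3", {1, 2, 3}, 1.239),
    ("s2", "sw3", {2, 3, 4}, 1.239),
    ("s3", "sw2", {5}, 0.672),
    ("s3", "sw3", {4, 5}, 1.239),
    ("s4", "sw2", {6, 7}, 0.672),
    ("s4", "sw3", {5, 6, 7}, 1.239),
]


def exact_min_cover(universe, subregions):
    """Try every combination and keep the lightest one that covers all bins."""
    best_weight, best_cover = None, None
    for k in range(1, len(subregions) + 1):
        for combo in combinations(subregions, k):
            cells = [r[0] for r in combo]
            if len(set(cells)) != len(cells):
                continue  # assumption: at most one subregion per switch cell
            if set().union(*(r[2] for r in combo)) != universe:
                continue  # some bin is left uncovered
            weight = sum(r[3] for r in combo)
            if best_weight is None or weight < best_weight:
                best_weight, best_cover = weight, combo
    return best_weight, best_cover


BEST_WEIGHT, BEST_COVER = exact_min_cover(UNIVERSE, SUBREGIONS)
print(round(BEST_WEIGHT, 3), [(r[0], r[1]) for r in BEST_COVER])
```

Exhaustive search is exponential in the number of subregions, so it is only practical for a toy instance like this one.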


2.4.3 Consideration of Practical Issues

Spatial Variation in Wirebond Package

So far, we have used a constant IR-drop constraint V_limit across all switch cells. This is acceptable only when the IR-drop variation among switch cells is small. This holds for the flipchip package, since the portion of the RVDD drop in the total IR-drop is very small, as previously shown in Fig. 2.4(c). However, for the wirebond package, the distances from IO pads to switch cells vary significantly depending on the cells' placement locations, and for some cells the distance is much longer than that of the flipchip package, resulting in a considerable spatial IR-drop variation on switch cells, as demonstrated in Fig. 2.8.

Our work can handle this high IR-drop variation by simply setting, for every switch cell s_i preplaced in C, a different IR-drop constraint V_limit(s_i) = V_limit + ΔV(s_i), where ΔV(s_i) is the cumulative IR-drop from the IO pads to the location of s_i. The value of ΔV(·) can be separately obtained from IR-drop analysis at the sign-off stage. Then, our extraction of maximally feasible subregions can be performed according to the updated IR-drop constraints.

Wakeup Signal Sensitive Subcircuits

Usually, the NSLEEP (wakeup control) signals of the switch cells are delivered through a daisy chain to ensure a sufficient wakeup signal delay and avoid a large surge current. However, if our cell optimization solution reports that a particular switch cell can be left unused, the resulting wire rerouting among cells may shorten the wakeup signal delay and disturb the timing constraints of nearby timing-sensitive subcircuits. We can solve this problem by replacing the HEADBUF in the removed cell with an always-on buffer, as shown in Fig. 2.9.

Figure 2.8: Comparison of the IR-drop map on all instances and its distribution on switch cells for the fpu circuit using different packages: (a) flipchip package, (b) wirebond package. Orange and white dots indicate power and ground sources, respectively. [Panel (c): histogram of IR-drop on switch cells, x-axis 0 to 16, y-axis 0 to 600, with series Wirebond and Flipchip and annotated means 9.79mV and 6.26mV.]

Figure 2.9: An example of handling wakeup-delay-sensitive subcircuits: the unused HEADBUF is replaced with an always-on buffer to maintain the wakeup signal delay. [Diagram: HEADBUF cells driven by NSLEEP between RVDD and VVDD, with one HEADBUF replaced by an always-on BUF.]

2.5 Experimental Results

2.5.1 Experimental Setup

The proposed algorithm has been implemented in Python 3. The circuits s35932, s38417, and s38584 from ISCAS'89, and openMSP430 (a 16-bit microcontroller core) and fpu (a floating point unit) from OpenCores [17], were synthesized by Design Compiler (DC) J-2014.09-SP5-2 and placed and routed by IC Compiler (ICC) I-2013.12-SP5-1, and their standby leakage was measured by HSPICE, all from Synopsys Inc., based on the Synopsys 32/28nm Generic Library and its PDK. IR-drop analysis was performed at the worst timing corner (ss/0.95v/125c for PVT and Cmax for the BEOL corner) using RedHawk V15.2.1 from Ansys Inc. The on-state power consumption was calculated in vectorless mode, with the toggle rate set to 0.3 and 2.0 for data and clock signals, respectively.

The power/ground metal density shown in Table 2.3 was applied identically to all circuits, and switch cells were placed evenly at every cross point of M1 and M5 in a staggered way. The given switch cell library is L = {HEADX2_RVT, HEADX4_RVT, HEADX8_RVT, HEADX16_RVT, HEADX32_RVT}. We assumed the flipchip package and set the IR-drop constraint V_limit to 25mV.

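Table 2.2: Comparison of the number of switch cells and estimated standby leakage I_standby for the evenly distributed placement, the previous current-based approach in [9], and our algorithm. Values in parentheses are reductions in I_standby.

Circuit    | Circuit size (W x H) | Gate count | Target frequency | Evenly distributed: # of switch cells, I_standby (uA) | Current-based method ([9]): # of switch cells, I_standby (uA) | Ours: # of switch cells, I_standby (uA)
-----------+----------------------+------------+------------------+-------------------------------------------------------+---------------------------------------------------------------+----------------------------------------
s35932     | 198um x 197um        | 4965       | 667MHz           | HEADX32: 80, 149.1                                    | HEADX32: 59, 93.2 (37.5%)                                     | HEADX32: 51, HEADX16: 2, 86.7 (41.9%)
[missing]  | [missing]            | [missing]  | [missing]        | [missing]                                             | [missing] (29.0%)                                             | HEADX16: 4, [missing] (33.2%)
[missing]  | [missing]            | [missing]  | [missing]        | [missing]                                             | [missing] (11.3%)                                             | HEADX16: 3, [missing] (19.6%)
openMSP430 | 167um x 166um        | 6611       | 1250MHz          | HEADX32: 80, 123.6                                    | HEADX32: 59, 98.4 (20.4%)                                     | HEADX32: 49, HEADX16: 10, 91.0 (26.4%)
fpu        | 450um x 450um        | 49397      | 909MHz           | HEADX32: 420, 579.6                                   | HEADX32: 277, 441.0 (23.9%)                                   | HEADX32: 59, HEADX16: 106, HEADX8: 53, 266.1 (35.0%)

Average reduction in I_standby: current-based method ([9]) 24.4%, ours 35.0%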

Since the IR-drop value expected at the power-plan stage tends to be pessimistic due to an unrealistic hot-spot margin, the IR-drop value obtained at the sign-off stage is usually less than our IR-drop constraint. To obtain a tighter IR-drop condition, we brought the IR-drop value closer to our constraint by raising the clock frequency to the target frequency listed in Table 2.2, which in turn increases the total power consumption.

We used N/10 as the ρ value for the switch cell covering algorithm (Step 2) in order to reduce the run time; as a result, this step finished within 5 minutes even in the worst case, the fpu circuit.

2.5.2 Experimental Result

Table 2.2 summarized the results produced by initial evenly distributed switch cell

placement, the previous optimization algorithm in [9], and our algorithm in terms of

the number of switch cells of each type and the standby leakage Istandby. The number

in parenthesis is the rate of reduction from the initial case. The comparison shows that

our algorithm reduces the standby leakage by 35.0% on average, which is 13.9% more

reduction over [9].

Note that the IR-drop distribution of the initial case has a long tail, as shown in Fig. 2.10(a). The tail comes from a small number of IR-drop hot-spots whose values are near Vlimit. The switch cell pitch p was defined to prevent these hot-spots from exceeding Vlimit. As a result, the IR-drop values of most instances other than the hot-spots are distributed around the mean value, 6.8mV, which is very small compared to Vlimit. Fig. 2.10(b) shows an example of how the current-based approach can be misleading. According to the IR-drop map of Fig. 2.10(b), the left side of the circuit appears to be weaker in IR-drop than the right side because of the long distance from the switch cells to the instances there; that is, the dominant factor in this case is resistance, not current. Our algorithm removed switch cells very efficiently while shifting the IR-drop values to the right, closer to Vlimit. It can also resize a switch cell where possible, which saves more power than using only one type of switch cell, as demonstrated by the result for openMSP430 in Table 2.2. The standard deviation σ of our IR-drop distribution, shown in Fig. 2.10(c), also decreases, as in the initial case in Fig. 2.10(a).
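To make the quantities discussed above concrete, the following sketch computes the mean, the standard deviation σ, and the number of hot-spot instances for a list of per-instance IR-drop values. The sample values, the 20mV limit, and the 90% hot-spot threshold are made-up assumptions for illustration only; they are not data from the experiments.

```python
from statistics import mean, pstdev


def summarize_ir_drop(drops_mv, v_limit_mv, hot_spot_ratio=0.9):
    """Summarize an IR-drop distribution (all values in mV).

    An instance counts as a hot-spot when its IR-drop is at least
    hot_spot_ratio * v_limit_mv (an illustrative threshold).
    """
    threshold = hot_spot_ratio * v_limit_mv
    return {
        "mean": mean(drops_mv),
        "sigma": pstdev(drops_mv),
        "hot_spots": sum(1 for d in drops_mv if d >= threshold),
        "violations": sum(1 for d in drops_mv if d > v_limit_mv),
    }


# Hypothetical long-tailed distribution: most instances sit near a small
# value, while a few hot-spots approach the limit.
sample = [6.0, 6.5, 7.0, 6.8, 7.2, 6.4, 18.5, 19.0]
summary = summarize_ir_drop(sample, v_limit_mv=20.0)
print(summary)
```

In such a distribution the few hot-spots set the constraint while the bulk of the instances keep a large margin, which is the slack that removing or downsizing switch cells can exploit.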

Table 2.3: Power/ground density specification

Layer | RVDD Width/Pitch (density) | VVDD Width/Pitch (density) | VSS Width/Pitch (density)
MRDL | 8um/20um (40%) | N/A | 8um/20um (40%)
M9 | 8um/20um (80%) | N/A | 8um/20um (40%)
M5 | 1um/60um (1.7%) | 1um/60um (1.7%) | 1um/60um (1.7%)
M1 (PG rail) | N/A | 0.06um/3.344um (1.8%) | 0.06um/3.344um (1.8%)
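The density figures in Table 2.3 can be read as the ratio of metal width to pitch. The sketch below recomputes three of the entries under that assumption; the helper name `metal_density` is illustrative.

```python
def metal_density(width_um: float, pitch_um: float) -> float:
    """Return metal density in percent, assuming density = width / pitch."""
    return width_um / pitch_um * 100.0


# (layer, width in um, pitch in um) taken from Table 2.3.
entries = [("MRDL", 8.0, 20.0), ("M5", 1.0, 60.0), ("M1", 0.06, 3.344)]

for layer, width, pitch in entries:
    print(f"{layer}: {metal_density(width, pitch):.1f}%")
```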

[Figure 2.10: three IR-drop distribution plots. Panel annotations: (a) Mean = 6.69mV, long tail of hot-spot; (b) Mean = 12.95mV, 23.9% Istandby reduced; (c) Mean = 12.00mV, 51.1% Istandby reduced.]

Figure 2.10: Comparison of IR-drop distribution for the fpu circuit. (a) evenly distributed switch with 60um pitch, (b) the previous switch optimization method in [9] and (c) our algorithm.

2.6 Conclusions

In this work, we introduced the structure of the power delivery network of modern power-gated SoCs and demonstrated that the standby leakage at switch cells is considerable. To reduce this unnecessary leakage, we proposed a comprehensive solution that determines, for each switch cell, whether the cell can be removed or which type of switch cell should replace it, so that the total standby leakage of the switch cells is minimized under the noise constraint. We formulated the problem as a variant of the weighted set cover problem and solved it efficiently by employing an approximate set cover algorithm. Experiments showed that our method reduced standby leakage by 35.0% relative to the initial designs and by 13.9% relative to the designs produced by the previous current-based method in [9].
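For readers unfamiliar with weighted set cover, the sketch below shows the classical greedy approximation for the plain problem: repeatedly pick the candidate with the lowest weight per newly covered element. It is not the variant or the algorithm used in this work. The mapping of candidates to switch cells, the candidate names, and the leakage weights are hypothetical and serve only to illustrate the formulation.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Classical greedy approximation for weighted set cover.

    universe:   set of elements that must be covered.
    candidates: dict mapping name -> (weight, set of covered elements).
    Returns the list of chosen candidate names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, float("inf")
        for name, (weight, covers) in candidates.items():
            gain = len(covers & uncovered)
            if gain == 0:
                continue
            ratio = weight / gain  # cost per newly covered element
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("universe cannot be covered by the candidates")
        chosen.append(best_name)
        uncovered -= candidates[best_name][1]
    return chosen


# Hypothetical example: instances 1..6 must each be served by a switch cell.
# Each candidate has a leakage weight and the set of instances it can serve.
instances = {1, 2, 3, 4, 5, 6}
cells = {
    "A_X32": (4.0, {1, 2, 3, 4}),
    "B_X16": (2.0, {3, 4, 5}),
    "C_X8": (1.0, {5, 6}),
    "D_X32": (4.0, {1, 2, 5, 6}),
}
selected = greedy_weighted_set_cover(instances, cells)
total_leakage = sum(cells[name][0] for name in selected)
print(selected, total_leakage)
```

The greedy rule is simple and fast, but it only guarantees a logarithmic approximation factor; other approximation schemes, such as the one in [3], trade run time for solution quality.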


Bibliography

[1] Y.-T. Chang, C.-C. Hsu, M. P.-H. Lin, Y.-W. Tsai, and S.-F. Chen. Post-placement power optimization with multi-bit flip-flops. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design, 2010.

[2] Z.-W. Chen and J.-T. Yan. Routability-constrained multi-bit flip-flop construction for clock power reduction. Integration, the VLSI Journal, 46(3), 2013.

[3] M. Cygan, Ł. Kowalik, and M. Wykurz. Exponential-time approximation of weighted set cover. Information Processing Letters, 109(16):957–961, 2009.

[4] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman, 1979.

[5] C.-C. Hsu, Y.-C. Chen, and M. P.-H. Lin. In-placement clock-tree aware multi-bit flip-flop generation for power optimization. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design, 2013.

[6] H. Kawaguchi, K. Nose, and T. Sakurai. A super cut-off CMOS (SCCMOS) scheme for 0.5-V supply voltage with picoampere stand-by current. IEEE Journal of Solid-State Circuits, 35(10):1498–1501, 2000.

[7] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan. Leakage current: Moore's law meets static power. Computer, 36(12):68–75, 2003.

[8] S. Kim, S. V. Kosonocky, and D. R. Knebel. Understanding and minimizing ground bounce during mode transition of power gating structures. In Proceedings of IEEE/ACM International Symposium on Low Power Electronics and Design. ACM, 2003.

[9] J. N. Kozhaya and L. A. Bakir. An electrically robust method for placing power gating switches in voltage islands. In Custom Integrated Circuits Conference, 2004. Proceedings of the IEEE 2004, pages 321–324, 2004.

[10] Y. Kretchmer and L. Logic. Using multi-bit register inference to save area and power: the good, the bad, and the ugly. EE Times Asia, 2001.

[11] Y. Lee, D.-K. Jeong, and T. Kim. Simultaneous control of power/ground current, wakeup time and transistor overhead in power gated circuits. In Proceedings of IEEE/ACM International Conference on Computer-Aided Design, 2008.

[12] Y. Lee, M. Seok, S. Hanson, D. Blaauw, and D. Sylvester. Standby power reduction techniques for ultra-low power processors. In Proceedings of European Solid-State Circuits Conference. IEEE, 2008.

[13] J.-M. Lin and C.-C. Lin. Placement density aware power switch planning methodology for power gating designs. 34(5):766–777, 2015.

[14] M. P.-H. Lin, C.-C. Hsu, and Y.-T. Chang. Recent research in clock power saving with multi-bit flip-flops. In IEEE MWSCAS, 2011.

[15] M. P.-H. Lin, C.-C. Hsu, and Y.-C. Chen. Clock-tree aware multibit flip-flop generation during placement for power optimization. 34(2), 2015.

[16] H. Moon and T. Kim. Design and allocation of loosely coupled multi-bit flip-flops for power reduction in post-placement optimization. In Proceedings of IEEE Asia-South Pacific Design Automation Conference, 2016.

[17] OpenCores. http://www.opencores.org.

[18] Y.-T. Shyu, J.-M. Lin, C.-P. Huang, C.-W. Lin, Y.-Z. Lin, and S.-J. Chang. Effective and efficient approach for power reduction by using multi-bit flip-flops. 21(4), 2013.

[19] S. Walia. PrimeTime® advanced OCV technology. Synopsys, Inc, 2009.

[20] L. K. Yong and C. K. Ung. Power density aware power gate placement optimization scheme. In Proceedings of IEEE Asia Symposium on Quality Electronic Design, 2010.

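Abstract (Korean)

This thesis introduces power optimization techniques applicable before and after cell placement.

First, we propose an approach to implementing multi-bit flip-flops that differs from previously introduced methods. Existing approaches implement multi-bit flip-flops in two main steps: (i) single-bit flip-flops are placed without violating the timing constraints, and then (ii) flip-flops are merged so that the power consumed by the flip-flops and the clock tree is minimized. However, because it is difficult to predict the outcome of step (ii) during step (i), the achievable power reduction is limited. This work therefore gives priority to power minimization: multi-bit flip-flops are first synthesized to minimize power and are placed afterwards. By synthesizing at this early design stage, clock power consumption was reduced by 16.8% without violating the timing constraints, and experiments confirmed that the approach is more effective at reducing clock power than the existing methods.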

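Second, we address the practical problem of optimizing unnecessary switch cells in modern power-gated designs while satisfying a noise constraint (for example, IR-drop). Because power gating switch cells are connected directly to the power rails, their locations are determined before cell placement. As a result, more switch cells are placed than are actually needed, which causes unnecessarily high standby power consumption. This work therefore introduces a technique that optimizes the switch cells after cell placement. Specifically, starting from an initial design in which switch cells are evenly placed in the grid style commonly used in industry, we propose a method that comprehensively decides, for each switch cell, (i) whether it can be removed or (ii) whether it can be replaced with another type, so that the total standby power of the switch cells is minimized under the given noise constraint. To this end, the problem is expressed as a variant of the weighted set cover problem and solved effectively with an approximate set cover algorithm. Experiments with the ISCAS89 benchmarks, openMSP430, and the fpu circuit confirmed that this technique reduces standby power by 35.0% compared with the initial grid-style placement and by 13.9% compared with the switch cell optimization technique previously proposed in [9].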

Keywords: low-power design, multi-bit flip-flop, logic synthesis, power gating, switch cell

Student Number: 2015-20962
