
Page 1: [IEEE 2011 International Symposium on VLSI Design, Automation and Test (VLSI-DAT) - Hsinchu, Taiwan (2011.04.25-2011.04.28)] Proceedings of 2011 International Symposium on VLSI Design,

Energy-Aware Partitioning for On-Chip Bus Architecture using a Multi-Objective Genetic Algorithm

Lih-Yih Chiou, Yi-Siou Chen and Ya-Lun Jian

Dept. of Electrical Engineering National Cheng Kung University Tainan, Taiwan, 70101, R.O.C.

Abstract – Incorporating power management during system-level partitioning contributes considerably to energy-efficient architectures. Designers commonly implement systems as a mix of partitioned blocks of various sizes connected through a bus interconnection architecture. Consequently, partitioning approaches that maximize idle time on a dedicated interconnection architecture have become impractical for power management development. This work presents a novel energy-aware hardware clustering algorithm with a performance estimator for on-chip bus-based architectures during high-level synthesis, to enhance the quality of solutions for implementing power management systems. Experimental results obtained in four cases reveal that the proposed strategy generates a wide range of cost-effective solutions and is highly effective for today's hardware systems.

I. INTRODUCTION

System-level partitioning divides an executable specification into numerous manageable parts in the early design phase so that critical design issues can be addressed. Power consumption is a major concern, given growing consumer demand for energy-conserving portable products. Partitioning a system while considering how power affects subsequent implementation is a highly promising approach for designing low-power architectures [1]. The goal of this research is to address the partitioning issues that arise in high-level synthesis (HLS) by developing an energy-aware algorithm and using an effective performance/power estimation approach.

Power partitioning has been an active research topic in the high-level synthesis domain. Partitioning for power aims to reduce both the power consumption of the functional blocks and the amount of communication among the blocks. Conventional partitioning approaches divide a system of resources into many blocks of similar size and minimize the number of cut nets (nets that connect resources in different blocks) among these blocks. The clustered blocks provide a good starting point for subsequent place-and-route (P&R) algorithms. Communication power is also reduced because fewer transfers occur on inter-block wires.
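As an illustration of the cut-net objective, the minimal Python sketch below counts how many nets a partition cuts. It assumes, purely for illustration, that each net is a list of the resource names it connects and that a partition is a dict from resource name to block id; the function name count_cut_nets and the sample nets are hypothetical, not taken from the paper.

```python
from typing import Dict, Sequence

def count_cut_nets(nets: Sequence[Sequence[str]], block_of: Dict[str, int]) -> int:
    """Count nets whose resources lie in more than one block (hypothetical helper)."""
    cut = 0
    for net in nets:
        blocks = {block_of[r] for r in net}  # distinct blocks touched by this net
        if len(blocks) > 1:
            cut += 1
    return cut

# Hypothetical 4-resource netlist and a 2-block partition.
nets = [["r1", "r2"], ["r2", "r3"], ["r3", "r4"], ["r1", "r4"]]
balanced = {"r1": 0, "r2": 0, "r3": 1, "r4": 1}
print(count_cut_nets(nets, balanced))  # 2: nets r2-r3 and r1-r4 cross blocks
```

In this sketch a multi-pin net counts once no matter how many blocks it spans; a variant weighted by transfer counts would be one way to approximate communication power instead.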

Reducing power consumption also involves considering the temporal relationships among components. A power management unit is commonly adopted in modern electronic systems to turn components or blocks off when they are idle and back on when they are reactivated [2]. Partitioning a system into several blocks of similar size while maximizing the overlap of the active times of the resources within each block seems feasible for implementing power management systems; however, this yields invalid partitions when the partitioned blocks differ in size.

Communication architecture design significantly affects the performance and power of a system due to escalating data transfers among the partitioned blocks. Temporal partitioning based on a set of active patterns obtained by simulation without considering communication latency can yield infeasible end systems, because communication latency may greatly change the active patterns of the partitioned blocks.

This work is sponsored by Taiwan National Science Council under project No. NSC 99-2220-E-006-017.

This work presents a bus-based energy-aware partitioning (BEAP) scheme that resolves issues which, to the best of our knowledge, have not been addressed previously. Invalid solutions may be generated because maximizing the system idle time does not necessarily minimize the power consumed by the system. Active patterns obtained by a simulation that accounts for communication latency benefit energy-aware partitioning. BEAP adopts a multi-objective genetic algorithm with a performance estimator to generate a set of partitioning solutions for further development of power management systems.

II. RELATED WORK

Circuit partitioning approaches isolate the control and/or data path into sub-circuits at the gate level, such that the resources inside each sub-circuit have similar active times. Designers can then reduce power consumption by gating the clock of a sub-circuit when it is idle [4], [5]. As system complexity increases, low-power clustering at early design stages has also attracted much attention. Various resources, which may take the form of processing elements, are normally clustered into complex blocks to reduce the number of cut nets between the blocks in a system [6]-[8]. Lowering the number of cut nets reduces the routing wire area and the power consumed by connections among the blocks, and helps the subsequent P&R algorithm achieve a small cell area.

Applying a clustering approach in the early design phase can promote the development of power management for increasingly complex systems [9]. That approach uses a segment tree to cluster the resources whose active times overlap the most, maximizing the idle time of the system. A binding approach that reduces switching activity and leakage is presented in [10]; it binds operations to resources so as to require fewer power switches, and adopts a clique partitioning algorithm to cluster these resources so that their active-time overlap is maximized. A low-power partitioning approach in [11] uses a weighted objective in a genetic algorithm to maximize idle time and minimize cut nets, and also considers the power switching penalty.

However, these approaches neglect the impact of communication latency on the active pattern and implicitly assume that all partitioned blocks consume equal power, which yields impractical solutions for implementing power management systems.

III. ENERGY-AWARE PARTITIONING

978-1-4244-8499-7/11/$26.00 ©2011 IEEE


A. Observations

A partitioning approach that maximizes the idle time of a system appears to be the most practical for power management; however, its power-saving efficiency inevitably decreases when the partitioned block sizes become unbalanced. To explain this phenomenon, consider the following formula for energy consumption:

$$\mathrm{Energy} = \sum_{h \in \mathrm{block}} P_h \, T_h \qquad (1)$$

where P_h is the power consumption of partitioning block h and T_h is the active time of block h. The power consumed by a block depends strongly on its resources. The traditional temporal partitioning approach assumes that the partitioned blocks are of equal size and therefore consume equal power in Eq. (1). With a common power P for every block, Eq. (1) reduces to P·ΣT_h, so for a fixed end time minimizing energy is equivalent to maximizing the total idle time. Under this assumption, the power management partitioning problem can be described as system partitioning that maximizes idle time. This formulation becomes impractical, however, because increasingly complex designs require partitioned blocks of various sizes, each consuming different power.
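The following sketch shows numerically how Eq. (1) can disagree with an idle-time objective once block powers differ. The block powers, active times, and end time are purely hypothetical values in arbitrary units, not data from the paper: partition A has more total idle time than partition B, yet consumes more energy because its longer-active block is also its higher-power block.

```python
from typing import Sequence

def energy_eq1(power: Sequence[float], active: Sequence[float]) -> float:
    """Eq. (1): Energy = sum over blocks h of P_h * T_h."""
    return sum(p * t for p, t in zip(power, active))

def total_idle(active: Sequence[float], end_time: float) -> float:
    """Idle time summed over blocks for a fixed end time."""
    return sum(end_time - t for t in active)

END = 100.0
# Hypothetical partitions: (block powers P_h, block active times T_h).
part_a = ([1.0, 5.0], [30.0, 70.0])
part_b = ([4.0, 2.0], [50.0, 60.0])

for name, (p, t) in (("A", part_a), ("B", part_b)):
    print(name, total_idle(t, END), energy_eq1(p, t))
# A 100.0 380.0
# B 90.0 320.0
```

If every block had the same power, the ordering by energy would match the ordering by idle time, which is exactly the equal-power assumption discussed above.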

Dedicated-wire and shared-bus architectures are commonly synthesized in HLS. A shared bus minimizes interconnection cost but incurs more communication latency than directly connected wires, and this latency can invalidate the original partitioning solution. Fig. 1 shows an example. The task graph (TG) is a directed graph in which each node represents a task and each edge represents a data and/or control dependence between tasks. Assume that the system consists of 13 tasks T = {T1, T2, …, T13} (Fig. 1(a)). These tasks are assigned to four resources RS1, RS2, RS3, and RS4 as follows: RS1 = {T1, T6, T4, T8}, RS2 = {T5, T7, T10, T11}, RS3 = {T2, T3, T9}, and RS4 = {T12, T13}. RS1 and RS2 require 40 execution cycles, RS3 requires 60, and RS4 requires 80.

Fig. 1(c) shows the activity of the resources when the tasks are scheduled without considering communication latency. For demonstration purposes, the switching penalty of the power management system is ignored, and every clustered block is assumed to consume the same power in each operation. Consider two clustering cases, Case1 = {H1{RS1, RS3}, H2{RS2, RS4}} and Case2 = {H1{RS1, RS2}, H2{RS3, RS4}}, in which the resources are clustered into two blocks, H1 and H2. The traditional energy gain, EG, is given by

$$EG = \mathrm{Idle}(H_1) + \mathrm{Idle}(H_2) \qquad (2)$$

where Idle(H_i) is the amount of time during which the resources in H_i are all idle. The system finishes at time 260, and Case1 is the better solution for implementing a power management system because its energy gain is 100, compared with 60 for Case2.
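Eq. (2) can be sketched for any number of blocks as follows. This minimal Python version assumes that each resource's activity is a list of half-open [start, end) intervals and that a block counts as idle only when none of its resources is active; Idle(H_i) is then the end time minus the length of the union of the active intervals of the resources in H_i. The activity values below are hypothetical and do not reproduce Fig. 1.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) active interval

def busy_time(intervals: List[Interval]) -> int:
    """Length of the union of the intervals (time when at least one is active)."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start  # close the previous merged run
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping or touching: extend the run
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def energy_gain(clusters: Dict[str, List[str]],
                activity: Dict[str, List[Interval]],
                end_time: int) -> int:
    """EG = sum of Idle(H_i) over all blocks, generalizing Eq. (2)."""
    return sum(end_time - busy_time([iv for r in res for iv in activity[r]])
               for res in clusters.values())

# Hypothetical activity traces (not the Fig. 1 data).
activity = {"RS1": [(0, 40), (100, 140)], "RS2": [(40, 80)],
            "RS3": [(0, 60)], "RS4": [(140, 220)]}
case1 = {"H1": ["RS1", "RS3"], "H2": ["RS2", "RS4"]}
case2 = {"H1": ["RS1", "RS2"], "H2": ["RS3", "RS4"]}
print(energy_gain(case1, activity, 220), energy_gain(case2, activity, 220))  # 220 180
```

Clustering resources whose intervals overlap (RS1 with RS3 here) shrinks the union and therefore raises the idle time of the block, which is the intuition behind overlap-maximizing approaches such as [9].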

Various interconnection topologies are typically used to transfer data between partitioned blocks in increasingly complex systems. Assume that a shared bus connects the two partitioned blocks H1 and H2, as shown in Fig. 1(b), and that each data transfer on the bus takes 40 cycles. Figs. 1(d) and 1(e) present the activities of the resources on this interconnection topology for Case1 and Case2, respectively. Both cases now end at time 340. In contrast to the previous result, Case2 is now the better solution for implementing power management systems, because the energy gain is 120 in Case1 and 180 in Case2. The results indicate that neglecting communication latency leads to ineffective partitioning.
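To see how bus latency reshapes active patterns, the following is a hypothetical greedy list scheduler, not the APDT estimator used by BEAP. It assumes tasks are visited in a given topological order, each resource executes one task at a time, and every dependence edge between resources in different blocks pays a fixed bus latency (40 cycles, as in the example above). The three-task chain and all names are illustrative only.

```python
from typing import Dict, List, Tuple

def list_schedule(order: List[str],
                  preds: Dict[str, List[str]],
                  duration: Dict[str, int],
                  resource_of: Dict[str, str],
                  block_of: Dict[str, int],
                  bus_latency: int) -> Dict[str, Tuple[int, int]]:
    """Greedy schedule: visit tasks in topological order; each resource runs one
    task at a time; an edge between resources in different blocks pays bus_latency."""
    free_at: Dict[str, int] = {}
    times: Dict[str, Tuple[int, int]] = {}
    for task in order:
        res = resource_of[task]
        start = free_at.get(res, 0)
        for p in preds.get(task, []):
            crosses_bus = block_of[resource_of[p]] != block_of[res]
            start = max(start, times[p][1] + (bus_latency if crosses_bus else 0))
        times[task] = (start, start + duration[task])
        free_at[res] = times[task][1]
    return times

def end_time(times: Dict[str, Tuple[int, int]]) -> int:
    return max(finish for _, finish in times.values())

# Hypothetical three-task chain A -> B -> C mapped onto R1, R2, R1.
order = ["A", "B", "C"]
preds = {"B": ["A"], "C": ["B"]}
duration = {"A": 10, "B": 10, "C": 10}
resource_of = {"A": "R1", "B": "R2", "C": "R1"}
split = {"R1": 0, "R2": 1}   # R1 and R2 in different blocks
merged = {"R1": 0, "R2": 0}  # R1 and R2 in the same block

no_bus = list_schedule(order, preds, duration, resource_of, split, 0)
with_bus = list_schedule(order, preds, duration, resource_of, split, 40)
same_block = list_schedule(order, preds, duration, resource_of, merged, 40)
print(end_time(no_bus), end_time(with_bus), end_time(same_block))  # 30 110 30
print(with_bus)  # {'A': (0, 10), 'B': (50, 60), 'C': (100, 110)}
```

In the latency-aware schedule, R1 is idle from 10 to 100 instead of from 10 to 20, which is the kind of active-pattern change that can reverse which clustering yields the larger energy gain.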

B. Methodology

Fig. 2 shows the BEAP algorithm. The input of BEAP is the executable specification, described by a task graph, along with performance constraints. By considering both energy and cut nets during clustering exploration, BEAP clusters the system resources R = {r1, r2, …, rr} into h partitioned circuit blocks using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [3], so that the partitioned architecture has lower energy and/or fewer cut nets. Genetic algorithms (GAs) have been shown to be effective in solving combinatorial problems, and NSGA-II is designed to generate Pareto-optimal solutions for problems with multiple objectives.
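For reference, the non-dominated sorting step at the heart of NSGA-II [3] can be sketched as follows for two minimized objectives, (energy, cut nets). This is a plain re-implementation of the fast non-dominated sort described in [3], without crowding distance, selection, or the rest of the NSGA-II loop; the sample objective values are hypothetical.

```python
from typing import List, Sequence, Tuple

def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b: no worse in every objective and strictly better in one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def fast_non_dominated_sort(points: List[Tuple[float, float]]) -> List[List[int]]:
    """Return fronts as lists of indices; fronts[0] is the Pareto front."""
    n = len(points)
    dominated_by_me: List[List[int]] = [[] for _ in range(n)]
    dom_count = [0] * n  # how many points dominate point p
    fronts: List[List[int]] = [[]]
    for p in range(n):
        for q in range(n):
            if dominates(points[p], points[q]):
                dominated_by_me[p].append(q)
            elif dominates(points[q], points[p]):
                dom_count[p] += 1
        if dom_count[p] == 0:
            fronts[0].append(p)
    i = 0
    while fronts[i]:
        nxt: List[int] = []
        for p in fronts[i]:
            for q in dominated_by_me[p]:
                dom_count[q] -= 1
                if dom_count[q] == 0:  # all dominators already ranked
                    nxt.append(q)
        i += 1
        fronts.append(nxt)
    return fronts[:-1]  # drop the trailing empty front

# Hypothetical (energy, cut nets) values for five candidate partitions.
points = [(10, 5), (8, 7), (12, 4), (11, 6), (13, 8)]
print(fast_non_dominated_sort(points))  # [[0, 1, 2], [3], [4]]
```

The first front is the set of energy/cut-net trade-offs among which a designer would choose; NSGA-II additionally uses crowding distance within a front to keep the solutions spread out.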

Assume that three power modes, namely idle, active, and switch, are available in a power management system. The energy objective function in Eq. (1) is then modified as follows:

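$$\mathrm{Energy} = \sum_{i=1}^{h} \Big( P_i(\mathrm{active})\,U_i(\mathrm{active}) + P_i(\mathrm{idle})\,U_i(\mathrm{idle}) \Big) + \sum_{i=1}^{h} \mathrm{Energy}_i(\mathrm{switch})\,SW_i \qquad (3)$$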

where P_i(active) and P_i(idle) are the active and idle power consumption of block h_i, respectively, and U_i(active) and U_i(idle) are the amounts of active time and idle time of h_i. SW_i denotes the number of mode switches in h_i, and Energy_i(switch) is the energy consumed per switch in h_i.
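Eq. (3) translates directly into code. The sketch below assumes per-block profiles of P_i(active), P_i(idle), U_i(active), U_i(idle), SW_i, and Energy_i(switch) are already available (in BEAP they would presumably be derived from the estimated resource activity); the BlockProfile class, its field names, and all numbers are hypothetical, in arbitrary units.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BlockProfile:
    """Hypothetical per-block inputs to Eq. (3)."""
    p_active: float   # P_i(active)
    p_idle: float     # P_i(idle)
    u_active: float   # U_i(active)
    u_idle: float     # U_i(idle)
    switches: int     # SW_i
    e_switch: float   # Energy_i(switch), energy per mode switch

def energy_eq3(blocks: List[BlockProfile]) -> float:
    """Active + idle energy of every block plus its mode-switching energy."""
    mode_energy = sum(b.p_active * b.u_active + b.p_idle * b.u_idle for b in blocks)
    switch_energy = sum(b.e_switch * b.switches for b in blocks)
    return mode_energy + switch_energy

blocks = [BlockProfile(2.0, 0.5, 60.0, 40.0, 2, 5.0),
          BlockProfile(1.0, 0.25, 30.0, 70.0, 4, 2.0)]
print(energy_eq3(blocks))  # 205.5
```

With zero idle power and no switching, energy_eq3 reduces to the sum of P_h·T_h in Eq. (1), taking U_i(active) as the active time T_h.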

Each iteration consists of three main steps: block partitioning, architectural transformation, and cut-net and energy estimation. The first step searches the design space and generates candidate solutions by applying the multi-objective genetic algorithm. The design space is explored by encoding the architecture as genes and applying operations such as crossover and mutation. A chromosome comprises a set of genes and represents a candidate partitioning solution (Fig. 3). The chromosome length equals the number of resources. The value of a gene denotes a block id ranging from 0 to h-1, where h is the number of partitioning blocks; that is, resource r_i is assigned to partitioning block j when the value of the (i-1)-th gene (genes are indexed from 0) is j.
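The encoding can be sketched as below. The decoding follows the text (gene i-1 holds the block id of resource r_i, with values 0 to h-1); the one-point crossover and per-gene random-reset mutation are common GA operators chosen here as assumptions, since the specific operators BEAP uses are not detailed in this section. All function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

def decode(chromosome: List[int], h: int) -> Dict[int, List[int]]:
    """Block id -> resource numbers; gene i-1 holds the block of resource r_i."""
    blocks: Dict[int, List[int]] = {j: [] for j in range(h)}
    for gene_index, block in enumerate(chromosome):
        if not 0 <= block < h:
            raise ValueError(f"gene {gene_index} has block id {block}, expected 0..{h - 1}")
        blocks[block].append(gene_index + 1)  # resources are numbered r_1..r_r
    return blocks

def random_chromosome(num_resources: int, h: int, rng: random.Random) -> List[int]:
    """One gene per resource, each a block id in 0..h-1."""
    return [rng.randrange(h) for _ in range(num_resources)]

def one_point_crossover(a: List[int], b: List[int],
                        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Swap gene tails after a random cut point (assumed operator)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("parents must have equal length >= 2")
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chromosome: List[int], h: int, rate: float, rng: random.Random) -> List[int]:
    """Reset each gene to a random block id with probability `rate` (assumed operator)."""
    return [rng.randrange(h) if rng.random() < rate else g for g in chromosome]

print(decode([0, 2, 1, 0, 2], 3))  # {0: [1, 4], 1: [3], 2: [2, 5]}
```

Because nothing constrains how many resources a block receives, a decoded chromosome may be unbalanced or even leave a block empty, consistent with the unbalanced setting of Section IV-A, where the number of resources assigned to a block was not limited.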

The architectural transformation step builds the architecture by connecting the partitioned blocks obtained in the first step through the communication architecture. Then, the energy consumption and the number of cut nets are estimated to guide the evolution. BEAP adopts a dynamic trace-based method, APDT [13], to estimate the activity of the resources on a given communication architecture, and then calculates the performance, number of cut nets, and energy of the system. APDT uses concurrent execution traces to ensure the proper order of task executions, and allows designers to rapidly explore architectures with various resource partitions, data allocations, and communication designs without remodeling the system for trace reconstruction. These three steps are repeated until the computing-time budget given to NSGA-II is exhausted. Notably, the communication architecture can be replaced with other efficient interconnection topologies, or can differ among candidate solutions.

FIGURE 1. TASK GRAPH SCHEDULING ON TARGET ARCHITECTURE

IV. EXPERIMENTAL RESULTS

BEAP was implemented in C++ and applied in three experiments to demonstrate its effectiveness for low-power partitioning. All simulations were run on a 2.5 GHz Xeon workstation with 14 GB of memory running Linux.

A. Effect of the objective function on energy saving

BEAP was applied to a wavelet transform application (WTA) to demonstrate how idle time maximization and energy minimization influence the effectiveness of power management. The WTA implements an audio analysis algorithm based on the wavelet transform. The solutions were explored with respect to three factors: idle time, cut nets, and energy. The resources were assumed to be clustered into four partitioned circuit blocks. The performance and power consumption of all basic resources in the application were measured using Synopsys Design Vision with a 0.18 μm cell library. To allow unbalanced partitioning, the number of resources assigned to a block was not limited. The population size and the number of generations were set to 200 and 2000, respectively.

Fig. 4 shows the generated solutions, with their energy values, cut nets, and idle time, at the final generation of the BEAP algorithm; each point represents a solution. The front points in the plane formed by the energy and cut-net axes indicate that the energy consumption of a solution increases as the number of cut nets decreases. However, idle time is not closely related to energy consumption. To explain this phenomenon, Fig. 5 illustrates the relation between the energy and idle time of the solutions in three distinct generations: the 0th, 500th, and 2000th. The solutions in the 2000th generation are the optimized solutions. According to Fig. 5, the solutions with the maximum amount of idle time have power consumption spread over a wide range. This spread arises because the power consumption of the system depends not only on the amount of idle time of the blocks but also on the power of the blocks. Therefore, conventional partitioning approaches [9]-[11] that adopt maximum idle time as the optimization objective may lead to poor solutions in implementation. In contrast, BEAP, which accounts for the power of blocks, can generate feasible solutions based on appropriate objective functions.

B. BEAP versus two-phase approach

An extension of a traditional temporal partitioning algorithm [11], called the two-phase approach, was adopted to illustrate how communication latency affects the effectiveness of partitioning algorithms. The two-phase approach partitions a system for minimum energy and cut nets using a set of active patterns from a simulation that does not consider communication latency. After partitioning, the two-phase approach recalculates the power consumption of its solutions with the partitioned blocks connected through a given communication architecture.

The input cases were generated by TGFF [12] with 500 nodes and 30 resources. The resources were to be clustered into five blocks, which were assumed to be connected through a shared-bus architecture. The two-phase approach and BEAP explored the same number of solutions. Fig. 6 shows the solutions in the last generation; triangles and diamonds represent the solutions generated by the two-phase approach and by BEAP, respectively.

A power/cut-net trade-off analysis is very useful for designers to determine which system architecture is suitable for incorporating power management. The Pareto front formed by the generated solutions helps designers define the system architecture and select a satisfactory solution. The number of cut nets of the generated solutions decreases as the number of blocks decreases, whereas the power consumption increases as the number of clusters decreases. As shown in Fig. 6, the solutions obtained using BEAP are better than those obtained using the two-phase approach.

Another important observation is that, for the solutions generated by the two-phase approach, energy increases with the number of cut nets once that number exceeds 200. The results indicate that the effectiveness of the two-phase approach is closely related to the impact of communication latency on the active pattern: more interactions among clusters lead to significant performance drops and increased power consumption in the communication architecture.

C. BEAP versus exhaustive approach

BEAP and the exhaustive approach were compared in terms of solution search speed. The exhaustive approach finds the globally optimal solution by visiting the entire design space. BEAP and the exhaustive approach were applied to an image processing application (IPA). The power consumption of all basic resources in the application was evaluated using the same approach as in Section IV-A.

FIGURE 2. BUS-BASED ENERGY-AWARE PARTITIONING DESIGN FLOW

FIGURE 3. GENETIC ALGORITHM ENCODING

The results of the exhaustive approach were compared with those of BEAP. The minimum-energy solutions and the minimum-cut-net solutions from the two approaches were identical in all three cases. Table I lists their simulation times. The run time of BEAP depends largely on the population size and the number of generations, and therefore stays nearly constant (about 32 s) in Table I. The run time of the exhaustive approach depends on the size of the design space, which grows rapidly with the number of blocks (from 1.17 s for two blocks to 34256.3 s for four blocks). Therefore, apart from the smallest two-block case, BEAP is far more efficient than the exhaustive approach while generating solutions of similar quality.
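The scaling behind Table I can be checked with a back-of-the-envelope sketch. It assumes the exhaustive search enumerates every gene assignment of the chromosome encoding (h^r candidates for r resources and h blocks), and also shows the Stirling number of the second kind, which counts partitions into exactly h non-empty unlabeled blocks. The resource count R = 20 is a hypothetical value, since the IPA resource count is not given here, and the GA budget reuses the Section IV-A settings only as an example.

```python
from math import comb, factorial

def labeled_assignments(r: int, h: int) -> int:
    """Chromosomes in the GA encoding: each of r genes takes one of h block ids."""
    return h ** r

def stirling2(r: int, h: int) -> int:
    """Ways to split r resources into exactly h non-empty, unlabeled blocks."""
    return sum((-1) ** k * comb(h, k) * (h - k) ** r for k in range(h + 1)) // factorial(h)

def ga_evaluations(population: int, generations: int) -> int:
    """Candidate evaluations for a fixed population size and generation count."""
    return population * generations

R = 20  # hypothetical resource count, for illustration only
for h in (2, 3, 4):
    print(h, labeled_assignments(R, h), stirling2(R, h))
print(ga_evaluations(200, 2000))  # 400000
```

The exhaustive design space grows exponentially in r with a base that increases with h, whereas the GA cost is fixed by population size times generations, which is consistent with the nearly constant BEAP times in Table I.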

V. CONCLUSIONS

This work presents a novel energy-aware partitioning approach that estimates energy and analyzes activity to develop architectures suitable for subsequent power management design. Experimental results demonstrate that both the activity patterns and the formulation of the objective function greatly influence the effectiveness of the generated architecture. Furthermore, the proposed algorithm generates near-optimal solutions far more rapidly than the exhaustive approach as the number of blocks grows.

REFERENCES

[1] A. Stammermann et al., "System-level optimization and design space exploration for low power," in Proc. Int. Symp. System Synthesis, 2001, pp. 142-146.

[2] L. Benini, A. Bogliolo, and G. De Micheli, "A survey of design techniques for system-level dynamic power management," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 8, no. 3, pp. 299-316, June 2000.

[3] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multi-objective genetic algorithm: NSGA-II," IEEE Trans. Evolutionary Computation, vol. 6, no. 2, pp. 182-197, April 2002.

[4] N. Agarwal and N. Dimopoulos, "FSMD partitioning for low power using simulated annealing," in Proc. Int. Symp. Circuits and Syst., 2008, pp. 1244-1247.

[5] E. Hwang, F. Vahid, and Y.-C. Hsu, "FSMD functional partitioning for low power," in Proc. Int. Conf. Des. Autom. Test Eur., 1999, pp. 22-28.

[6] C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network partitions," in Proc. Int. Conf. Des. Autom., 1982, pp. 175-181.

[7] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, vol. 49, pp. 291-307, 1970.

[8] J. Li, L. Behjat, and A. Kennings, "Net Cluster: A net-reduction-based clustering preprocessing algorithm for partitioning and placement," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, pp. 669-679, 2007.

[9] A. H. Farrahi and M. Sarrafzadeh, "System partitioning to maximize sleep time," in Proc. Int. Conf. Comput.-Aided Design, 1995, pp. 452-455.

[10] D. Dal and N. Mansouri, "Power optimization with power islands synthesis," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 28, pp. 1025-1037, 2009.

[11] P. Ghafari, E. Mirhadi, M. Anis, S. Areibi, and M. Elmasry, "A low-power partitioning methodology by maximizing sleep time and minimizing cut nets," in Proc. Int. Workshop Syst.-on-Chip for Real-Time Applications, 2005, pp. 368-371.

[12] R. Dick, D. Rhodes, and W. Wolf, "TGFF: task graphs for free," in Proc. Int. Workshop Hardware/Software Codesign, 1998, pp. 97-101.

[13] Y. S. Chen, L. Y. Chiou, and H. H. Chang, "A fast and effective dynamic trace-based method for analyzing architectural performance," in Proc. Asia and South Pacific Des. Autom. Conf., 2011, pp. 591-596.

FIGURE 6. THE OBTAINED SOLUTIONS USING THE TWO-PHASE APPROACH AND THE BEAP APPROACH AT THE FINAL GENERATION

TABLE I. SIMULATION TIME USING THE EXHAUSTIVE APPROACH AND THE BEAP APPROACH

Blocks    Exhaustive (s)    BEAP (s)
2         1.17              32.48
3         497.17            32.39
4         34256.3           31.98

FIGURE 4. GENERATED SOLUTIONS AT THE FINAL GENERATION

FIGURE 5. THE OBTAINED SOLUTIONS IN THE 0TH, 500TH, AND 2000TH GENERATIONS