
RISPP: Rotating Instruction Set Processing Platform Lars Bauer, Muhammad Shafique, Simon Kramer and Jörg Henkel

University of Karlsruhe, Chair for Embedded Systems, Karlsruhe, Germany {lars.bauer, shafique, henkel} @ informatik.uni-karlsruhe.de

ABSTRACT
Adaptation in embedded processing is key in order to address efficiency. The concept of extensible embedded processors works well if a few a-priori known hot spots exist. However, they are far less efficient if many and possibly at-design-time-unknown hot spots need to be dealt with. Our RISPP approach advances the extensible processor concept by providing flexibility through run-time adaptation by what we call “instruction rotation”. It allows sharing resources in a highly flexible scheme of compatible components (called Atoms and Molecules). As a result, we achieve high speed-ups at moderate additional hardware. Furthermore, we can dynamically trade off between area and speed-up through run-time adaptation. We present the main components of our platform and discuss them by means of an H.264 video codec.

Categories and Subject Descriptors: C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems
General Terms: Performance, Design
Keywords: ASIP, extensible embedded processors, run-time adaptation, reconfigurable computing

1. INTRODUCTION AND RELATED WORK
Embedded processors are key for rapidly growing application fields ranging from automotive to personal mobile communication/computation/entertainment etc. In the early 1990s, the term ASIP emerged, denoting processors with an application-specific instruction set. They present a far better efficiency in terms of performance/area and MIPS/mW compared to mainstream processors and eventually make today’s embedded (and often mobile) devices possible. The term ASIP nowadays comprises a far larger variety of embedded processors allowing for customization in various ways, including a) instruction set extensions, b) parameterization and c) inclusion/exclusion of predefined blocks tailored to specific applications (like, for example, an MPEG-4 decoder) [14]. Tool suites and architectural IPs for embedded customizable processors with different flavors are available from major vendors like Tensilica [8], CoWare/LisaTek [2], ARC [1], 3DSP [5] and Target [4] as well as academic approaches like ASIP Meister (PEAS-III) [11], LISA [3], [12] and Expression [13]. A generic design flow of an embedded processor can be described as follows: i) an application is analyzed/profiled, ii) an extensible instruction set is defined, iii) the extensible instruction set is synthesized together with the core instruction set, iv) retargetable tools for compilation, instruction set simulation etc. are (often automatically) created and application characteristics are analyzed, v) the process might be iterated several times until the design constraints are met.

These approaches assume that customizations are undertaken at design-time, with little or no adaptation possible during run-time. Our project combines the paradigm of extensible processor design with the paradigm of dynamic reconfiguration in order to address the following concerns in embedded processing: a) an application might have many hot spots (instead of only a few) and would require a large additional chip area in order to comprise all necessary customizations, b) the characteristics of an application may vary widely during run-time due to switching to different operation modes, changes in design constraints (the system runs out of energy, for example), or highly uncorrelated input stimuli patterns. These and similar concerns are addressed by our paradigm of “Rotating Instruction Set Processing” that fixes some customizations and is able to adapt dynamically during run-time.

Before we present our approach further, we discuss related work, which is twofold: extensible embedded processor design and approaches for dynamic reconfiguration; both build the foundation for our approach.

An overview of the benefits and challenges of ASIPs is given in [14], [15]. The architecture in [16] uses a library of reusable instructions which are manually designed, but therefore of high quality. In [17], [18] the authors describe methods to generate special instructions from operation pattern matching. [19] introduces an estimation model for area, overhead, latency, and power consumption under a wide range of customization parameters. In [20] an approach for an instruction set description at the architecture level is proposed, which avoids inconsistencies between compiler and instruction set simulator. The authors in [21] investigate local memories in the functional units which are then exploited by special instructions.

On the other hand, we utilize the techniques of reconfigurable computing for our approach. Overviews can be found in [22], [23]. In the scope of this paper we focus on extensions of CPUs with reconfigurable fabrics which are then made available to the programmer. An overview of this specific area of reconfigurable computing is given in [24]. Still, most of the CPU extensions are coarse-grained. The Molen Processor couples a reconfigurable processor to a core processor via a dual-port register file and an arbiter for shared memory [25]. The run-time reconfiguration is explicitly controlled by additional instructions in the application. The OneChip project [26] and its successor OneChip98 [27] use a Reconfigurable Functional Unit (RFU) to utilize reconfigurable computing in a CPU. This RFU is coupled to the host processor and obtains its speed-up mainly from memory streaming applications [27]. A first commercial product is presented in [6], although it concentrates on startup- instead of run-time reconfigurability.

The rest of the paper is organized as follows: In section 2, we present the motivation for our RISPP project. The Atom Model and Special Instruction Composition are presented in section 3. Section 4 describes how Special Instruction Forecasts are added at compile-time. The run-time architecture is explained in section 5. We show a case study and results in section 6 and conclude our paper in section 7.

2. MOTIVATIONAL CASE STUDY
Extensible/Configurable Processors have played a significant role in targeting multifaceted applications at low power. But one of the major shortcomings of this domain is the fixed hardware for hot spots along the whole application span. The special instructions’ (SIs) hardware is fixed at design-time, irrespective of its utilization during the application’s run-time. To address this problem, another approach is coarse-grained dynamic reconfiguration, which offers adaptation at run-time. But in case of intricate applications like Multimedia TV, where 30 frames are encoded and decoded at high video quality together with audio, multiplexing, pre- and post-processing, the time schedule is very tight and does not allow for time-consuming reconfigurations.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2007, June 4–8, 2007, San Diego, California, USA Copyright 2007 ACM 978-1-59593-627-1/07/0006…$5.00


Within this paper we present a novel platform, the Rotating Instruction Set Processing Platform, which enables a new way of thinking about instruction set extensions. It provides rotations of hardware for SIs at run-time. It eliminates the need for dedicated hardware for hot spots and presents a framework for a time-multiplexed utilization of rotating hardware resources. We will demonstrate our concept with the help of an ITU-T H.264 Video Encoder [9].

Figure 1 shows the comparison of the hardware requirements (Gate Equivalents, GE) of an Extensible Processor and of RISPP, while showing how performance is maintained using RISPP’s rotating concept. The distribution of processing complexity is shown in percent for Motion Estimation (ME), Motion Compensation (MC), Transform and Quantization (TQ), and Loop Filter (LF), which are the major functional blocks containing various hot spots in the H.264 Video Encoder. An Extensible Processor provides dedicated hardware resources in form of SIs which are constructed at design-time. It is obvious from Figure 1 that these resources are not used during the complete run-time span of the application. The hardware for LF, TQ, and MC is not used while processing ME, resulting in power/energy loss and an overhead of silicon area.

Figure 1: Comparison of Extensible Processors and RISPP

RISPP provides a new concept that requires only the silicon area for the largest hot spot plus some additional area. This additional silicon area is used to accommodate hot spots while considering the rotation overhead. A benefit of the RISPP model is that it upholds the performance of Extensible Processors in terms of speed. Here, we introduce α as a scaling factor to find the trade-off points between rotation overhead and performance preservation. When the need for another hot spot arises, the RISPP manager performs a rotation of the instruction set to accommodate the needs of the new hot spot.

In the example in Figure 1, MC consumes only 17% of the total processing time, but it requires the biggest area (GEmax). Using the RISPP concept we can select α × GEmax to construct the hardware which fulfills the needs of every single hot spot. This yields a GE saving of [(GEtotal − α × GEmax) × 100 / GEtotal] percent. For a given area constraint GEconstraint we define the RISPP requirements as HWrequired = α × GEmax ≤ GEconstraint.
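The area-saving term above is simple arithmetic; as a sanity check, a small sketch with hypothetical gate-equivalent figures (the numbers are ours, not taken from the paper):

```python
# Hypothetical gate-equivalent figures illustrating the RISPP
# area-saving formula: (GEtotal - alpha * GEmax) * 100 / GEtotal.
def ge_saving_percent(ge_total, ge_max, alpha):
    """Percentage of gate equivalents saved when only alpha * GEmax
    is provisioned instead of dedicated hardware for every hot spot."""
    ge_rispp = alpha * ge_max
    return (ge_total - ge_rispp) * 100.0 / ge_total

# Example: hot spots totalling 400k GE, the largest needing 150k GE
print(round(ge_saving_percent(400_000, 150_000, 1.2), 2))  # -> 55.0
```

With α = 1.2 the provisioned area is 180k GE, so 220k of the 400k GE are saved, i.e. 55%.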

It is apparent from Figure 1 that the cycles to perform ME consume the major portion of the total processing time while requiring the least amount of hardware. Therefore, we can perform ME at the same performance as the Extensible Processor, as it completely fits into α × GEmax. While ME is executed, the unused hardware (ME is small compared to MC) will be prepared for the next hot spot, which demands a robust forecasting mechanism for the consecutively needed SIs. Alternatively, we can assign some more hardware to ME to expedite it, resulting in a performance improvement.

Recapitulating the concept: the notion of rotating instructions at run-time eliminates the need for dedicated hardware for each hot spot while retaining the application performance. The subsequent sections will explain this concept and the fundamental building blocks of our RISPP approach in more detail.

We do not lose any performance compared to ASIPs in terms of execution time for ME. One of the benefits of our approach is that we can support SIs using slower hardware executions. It is apparent from Figure 1 that during the time of ME execution the partial hardware for MC is nearly completely loaded. When ME execution is finished, the MC hardware is not yet completely loaded. Nevertheless, the available hardware resources can be used to immediately start executing MC in hardware, although not with the same performance as the full hardware. This is due to our fine-grained SI composition. When the faster hardware version is loaded, the MC execution will gradually upgrade to a faster implementation. We call this “Rotation in Advance”, which is the analog of pre-fetching or forecasting in the context of succeeding SIs.

Our Novel Contributions: We present a novel platform for Dynamically Extensible Processors. Our RISPP approach features:
a) A Special Instruction forecasting algorithm/scheme.
b) Composition and a formal model of Special Instructions in form of Atoms and Molecules.
c) Run-time adaptation to different Molecules, depending upon the available Atoms and application requirements.
d) An optimized software Molecule for each Special Instruction.

Our RISPP approach offers a high degree of freedom for switching between software components, tasks, functions, and operations (from top to bottom level) while reducing the number of rotations and forecasting the expected rotations well in advance. Furthermore, our forecast updating scheme maximizes the expectation/probability of the prediction.

3. SPECIAL INSTRUCTION COMPOSITION
We present a new vision of Special Instructions (SIs) in our Rotating Instruction Set Processing Platform. The SIs are realized as complex combinations of distinct atomic operations which are key for low-power, high-speed data paths with high reusability. We call a basic data path an Atom and a combination of Atoms that implements an SI a Molecule. Comprehensive analysis of a set of SIs sharing similar functional capabilities leads to the conclusion that the underlying hardware has a certain degree of similarity, which leads to the formulation of Molecules in terms of connecting Atoms. With enhanced reusability an Atom can be used for different Molecules, even of different SIs. Figure 2 shows a visualization using an ITU-T H.264 Video Encoder [9]. It is noteworthy that three different SIs can be implemented while sharing the same set of Atoms.

Figure 2: Molecule implementations of HT_4x4, DCT_4x4, and SATD_4x4 using different numbers of available Atoms

Let us take the example of the Hadamard Transformation for a 4x4 Macroblock (HT_4x4 in Figure 2), as it is used in the ITU-T H.264 Video Codec. For our designed Atoms each HT_4x4 requires 4 Transform and 4 Pack_LSB_MSB executions to complete the computation. To implement this SI we have the choice to perform these atomic operations in parallel, sequentially, or in a combination of both1, giving the best trade-off for performance vs. area or power. Additionally, our RISPP approach moves this degree of freedom from design-time to run-time by offering different SI implementations (i.e. Molecules). Hence, in this particular case, the Hadamard Transformation is performed using a single call of the HT_4x4 SI in the application binary, and the run-time system has multiple Molecule options to execute it, considering the available hardware, relative speed-up gain, required rotation effort and power savings.

1 for simplicity Figure 2 shows only 3 possible combinations

3.1. Model for Molecule Assembly, Definitions
We will now explain our fine-grained SI composition on a formal basis to present the Molecule/Atom interdependencies and to simplify expressions in the pseudo code used later. We define a data structure (ℕⁿ, ∪, ∩, ≤), where ℕⁿ is the set of all Molecules and n is the number of different available Atoms. For expediency we consider m, o, p ∈ ℕⁿ as Molecules with m = (m1, ..., mn), where mi describes the desired number of instances of Atom i to implement the Molecule (similarly for o and p). The operator ∪: ℕⁿ × ℕⁿ → ℕⁿ with p = m ∪ o := ∀i: pi = max{mi, oi} describes a Meta-Molecule which contains the Atoms required to implement both m and o2. We name the resulting Molecule p a Meta-Molecule to distinguish it from the elementary Molecules, which are dedicated to implement a specific SI. Since the operator ∪ is commutative and associative with the neutral element (0, ..., 0), (ℕⁿ, ∪) is an Abelian semigroup. The same is true for (ℕⁿ, ∩) with ∩: ℕⁿ × ℕⁿ → ℕⁿ being defined as p = m ∩ o := ∀i: pi = min{mi, oi}. The relation m ≤ o is defined to be true iff ∀i ∈ [1, n]: mi ≤ oi. As ≤ is reflexive, anti-symmetric, and transitive, (ℕⁿ, ≤) is a partially ordered set. For a set of Molecules M ⊂ ℕⁿ, the supremum is defined as sup M := ∪ {m : m ∈ M}. The supremum of M is a Meta-Molecule declaring all Atoms that are needed to implement any of the Molecules in M, such that ∀m ∈ M: m ≤ sup M. The infimum is correspondingly defined as inf M := ∩ {m : m ∈ M}, containing those Atoms that are collectively needed for all Molecules of M. As any subset ∅ ≠ M ⊂ ℕⁿ has a well-defined supremum and infimum, (ℕⁿ, ≤) is a complete lattice.
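The operators defined above are element-wise max/min on integer tuples together with a component-wise partial order; a minimal sketch (the function names are ours, not from the paper):

```python
# Molecules as tuples of Atom-instance counts; the lattice operators of
# Section 3.1 are element-wise max/min plus a component-wise <=.
from functools import reduce

def union(m, o):          # Meta-Molecule covering both m and o
    return tuple(max(a, b) for a, b in zip(m, o))

def intersect(m, o):      # Atoms collectively needed by m and o
    return tuple(min(a, b) for a, b in zip(m, o))

def leq(m, o):            # partial order: m <= o iff m_i <= o_i for all i
    return all(a <= b for a, b in zip(m, o))

def sup(M):               # supremum of a non-empty set of Molecules
    return reduce(union, M)

def inf(M):               # infimum of a non-empty set of Molecules
    return reduce(intersect, M)

M = [(2, 0, 1), (1, 1, 0), (0, 1, 1)]
print(sup(M))             # -> (2, 1, 1): Atoms to implement any Molecule in M
print(inf(M))             # -> (0, 0, 0): Atoms needed by all of them
print(all(leq(m, sup(M)) for m in M))  # -> True
```

The supremum/infimum bounds follow directly from taking max/min per component, which is what makes (ℕⁿ, ≤) a complete lattice for finite Atom counts.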

3.2. Molecule and SI Compatibility
Given these definitions, we can now combine multiple Molecules which are chosen to implement different SIs. To calculate the cost for constituting Molecules in hardware, we define the following two functions: The determinant of a Molecule m is defined as |m| := Σi∈[1,n] mi, i.e. the total number of Atoms that are required to implement m. To further consider already configured Atoms we define the function −: ℕⁿ × ℕⁿ → ℕⁿ; p = o − m := ∀i: pi = oi − mi if oi − mi ≥ 0, else 0. The created Meta-Molecule p contains the minimum set of Atoms that additionally have to be offered to implement o, assuming that the Atoms in m are already available.

To find a metric for the compatibility of SIs we have to consider that an SI in general consists of multiple Molecules with potentially different compatibilities. To handle this, we can compute statistical indicators like the average compatibility of multiple SIs. As we need to compute the compatibility of SIs at run-time (it depends on the currently available Atoms), we decided to represent each SI by a Meta-Molecule for the average Atom usage of its Molecules. By doing so we reduce the incompatibilities of the SIs to the incompatibilities of the representing Meta-Molecules. We define this representing Meta-Molecule for an SI S ⊂ ℕⁿ as Rep(S) := m with ∀i: mi := ⌈(Σo∈S oi) / |S|⌉, where |S| is the number of hardware Molecules3 for the SI S. These definitions will be used in section 4.2 to select forecast instructions out of a set of candidates, and they are also used at run-time to select SIs that should be supported in hardware out of a set of requested SIs.

2 although they are not necessarily usable concurrently
3 omitting the Molecule for software-execution
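The determinant, the remaining-Atoms function, and Rep(S) can be sketched the same way (function names and the example Molecules are ours, chosen for illustration):

```python
# Determinant |m|, remaining-Atoms function, and the representing
# Meta-Molecule Rep(S) from Section 3.2, on tuple-encoded Molecules.
from math import ceil

def det(m):                      # |m|: total number of Atoms in Molecule m
    return sum(m)

def additional(o, m):            # Atoms still to be loaded for o, given
    return tuple(max(oi - mi, 0)  # that the Atoms in m are already there
                 for oi, mi in zip(o, m))

def rep(S):                      # ceiling of the average Atom usage over
    n = len(S[0])                # the SI's hardware Molecules
    return tuple(ceil(sum(m[i] for m in S) / len(S)) for i in range(n))

S = [(4, 4, 0), (2, 2, 0), (1, 1, 0)]    # three Molecules of one SI
print(det((4, 4, 0)))                    # -> 8
print(additional((4, 4, 0), (2, 1, 0)))  # -> (2, 3, 0)
print(rep(S))                            # -> (3, 3, 0)
```

For the example SI, the average usage per Atom type is 7/3, so Rep(S) rounds each component up to 3.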

4. ADDING FORECASTS AT COMPILE-TIME
As the rotation time is in the range of milliseconds4, it is of paramount importance to forecast which SIs will be needed next. This allows starting the rotation earlier, thus increasing the chance that the hardware implementation can actually be used when the SI is required. Adding the Forecast Points (FCs) is done at compile-time on the Base-Block (BB) level of the application. Figure 3 shows the BB-graph of the AES application as it is automatically generated by our tool-chain. The coloring visualizes profiling information for the execution time. To add FCs to this BB-graph we use the following scheme:
1) For each SI-type, determine a set of BBs as FC Candidates
2) For each BB, remove those FC Candidates that are incompatible with the other FCs in the same BB
3) Choose FCs out of the FC Candidates and combine them to FC Blocks, which will ease the run-time computation effort

4.1. Determining FC Candidates
To determine whether a BB ’B’ is a good candidate to forecast a specific SI ’S’ we have to consider the following points:
• The probability to reach an execution of S.
• The minimal, typical, and maximal temporal distance between B and any usage of S.
• The expected number of executions when S is reached.

If the probability at location B for executing S is high but the actual execution of S occurs only a few cycles after B, then B is an inappropriate candidate for initiating a rotation, since there is too little time left to finish the rotation. In the case that B is temporally too far from the execution of S, a rotation would block the according Atom-Containers (ACs) for too long a period, resulting in inefficiency.

Figure 3 shows an example BB graph. Outlined is the usage of an SI-type (circles). From profiling information we obtain the before-mentioned measurements: distance, probability, and number of executions. These values are used to evaluate whether a BB represents a good FC Candidate or not. The probability is computed by a recursive algorithm that segments the BB graph into a tree of strongly connected components5 (SCC) [28], recursively calls itself to compute the probability values of the SCCs, and finally executes the algorithm proposed by Li/Hauck [29] to compute the probability in the resulting tree. Our recursive addition to Li/Hauck is needed for our more fine-grained approach.

Next, the Forecast Decision Function (FDF) is executed. Its inputs are the probability and the temporal distance. The output of the FDF is the minimal number of expected executions of the SI that is needed for the current BB to turn it into an FC Candidate. This output is then compared with the computed number of expected executions. The FDF for the probability p and the temporal distance t is computed as6:

FDF(p, t) := offset + max{ (TRot − t) / (TSW · p), t / (TRot · p), 0 }

4 as it is shown in the results in Section 6
5 e.g. loops or subroutine calls
6 for clarity some additional adjustment parameters are omitted

Figure 3: BB-graph for AES with profiling info, SI usages and computed FC Candidates


where TRot is the average rotation time for S, TSW is the time that S needs for software execution, and offset is the minimum number of executions that is needed to make the rotation process energy efficient. The offset is computed as the energy cost ERot for the rotation divided by the difference between the execution time of S in software and in hardware. The offset is then multiplied by a parameter α with which a trade-off between energy efficiency and speed-up can be achieved:

offset := α · ( ERot / (TSW − THW) )

An instance of the FDF is shown in Figure 4, where the temporal distance is shown relative to TRot (logarithmic scale). In the range where the BB is closer than 1 TRot to the SI, a high number of expected executions is needed to become an FC Candidate. In the range where the BB is farther away than 10 TRot, the number of needed executions increases to avoid an unnecessarily long blocking of the Atom-Containers.

4.2. Trimming the FC Candidates and Choosing FCs
At run-time, we need to regard every single FC to trigger rotation decisions. But as every FC invokes the run-time system to re-evaluate, we need to reduce the number of FC Candidates in the first place. Good candidates for elimination are those which would contribute little to the overall speed-up or efficiency. The main observation is that one BB can contain FCs which are definitely not going to fit collectively into the according ACs. Then, those FCs whose SIs provide the worst relation of speed-up to additionally needed hardware resources7 are truncated. The pseudo-code is shown in Figure 5.

// S1, …, Sk are the SIs of the FC Candidates in this BB
1.  M ← ∅;
2.  ∀ Si: M ← M ∪ {Rep(Si)};
3.  while ((|sup(M)| > #AvailableAtomContainers) ∧ (M ≠ ∅)) {
4.    relation ← 0; candidate ← null;
5.    ∀ m ∈ M {
6.      temp ← (|sup(M)| − |sup(M \ {m})|) / ExpectedSpeedup(m);
7.      if (temp > relation) {
8.        relation ← temp; candidate ← m;
9.      }
10.   }
11.   if (candidate ≠ null) M ← M \ {candidate}
12.   else break;
13. }

Figure 5: Removing FC Points with worst Expected-Speed-up per hardware resources
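The trimming loop of Figure 5 can be sketched in runnable form; the tuple encoding of Molecules, the ExpectedSpeedup values, and the AC budget below are our illustrative assumptions:

```python
# Runnable sketch of the Figure-5 trimming loop. Molecules (here: the
# representing Meta-Molecules Rep(Si)) are tuples of Atom counts; the
# speed-up values and Atom-Container budget are illustrative only.
from functools import reduce

def sup(M):
    """Supremum (element-wise max) of a non-empty set of Molecules."""
    return reduce(lambda m, o: tuple(max(a, b) for a, b in zip(m, o)), M)

def trim(M, expected_speedup, available_acs):
    """Drop the FC Candidates whose SIs give the worst speed-up per
    additionally needed Atom-Container until sup(M) fits into the ACs."""
    M = set(M)
    while M and sum(sup(M)) > available_acs:
        relation, candidate = 0.0, None
        for m in M:
            rest = M - {m}
            saved = sum(sup(M)) - (sum(sup(rest)) if rest else 0)
            temp = saved / expected_speedup[m]   # ACs freed per speed-up
            if temp > relation:
                relation, candidate = temp, m
        if candidate is None:    # no single removal frees any ACs: abort
            break
        M.remove(candidate)
    return M

speedups = {(2, 1): 8.0, (1, 2): 3.0}        # Rep(S) -> expected speed-up
print(trim({(2, 1), (1, 2)}, speedups, 3))   # -> {(2, 1)}
```

With a budget of 3 ACs, the supremum (2, 2) of both Meta-Molecules does not fit; removing (1, 2) frees one AC at the smaller speed-up loss, after which the remaining candidate fits exactly.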

The ExpectedSpeedup(m) term in the algorithm reflects the difference in execution speed between the Molecule and the software execution. It may happen that no single SI exists whose removal would reduce the number of needed Atom-Containers, e.g. for the three Molecules (1,0), (0,1), (1,1)8. In such a case the condition in lines 11/12 aborts the algorithm, because we do not want to remove a complete cluster of SIs from the FCs, as this would be a major reduction of the search space for the run-time decision system. The algorithm is executed for each SI-type individually and runs on the transposed BB-graph9.

7 i.e. allocated Atom-Containers
8 in general: ∀m ∈ M: m ≤ sup(M \ {m})

For each not-yet-visited FC Candidate a Depth First Search is performed. Each node’s suitability as an FC Candidate is re-evaluated. If a node is not suitable and the next FC Candidate is far away10, then the preceding FC Candidate is turned into an actual FC. The FCs are finally annotated with the profiled probability, temporal distance, and the expected number of executions as initial values for the online phase.

5. RUN-TIME ARCHITECTURE

Besides the off-line phase where FCs (forecast points) are inserted (see Section 4), the run-time phase actually rotates instructions depending on compile-time obtained information plus run-time updated information. Whether a rotation actually takes place depends on various parameters including the state of the dynamic hardware, updated profiling information etc. The run-time architecture is described in more detail in [30]. The main tasks during the run-time phase are as follows:

a) Monitoring FCs and SIs in order to fine-tune the profiling information to reflect varying run-time situations.

b) Selecting and composing Molecules from Atoms to implement a subset of the forecasted SIs.

c) Scheduling the rotations and replacing Atoms to accommodate new rotations.

Figure 6 shows an exemplary scenario to illustrate the run-time architecture by means of an ITU-T H.264 Video Codec [9]. An excerpt consisting of two Tasks A and B is seen sharing six Atom Containers (ACs) to implement various Molecules of their respective SIs. Among others it shows that the video codec Task A employs the '4x4 Sum of Absolute Transformed Differences' (SATD_4x4) SI that is implemented with varying numbers of instances of the four Atoms, namely QuadSub, Pack, Transform, and SATD (more details in Section 6). Task B instead is using two different SIs, called SI0 and SI1 for brevity. We now discuss the run-time scheme at selected points of time Ti in Figure 6.

T0: Both tasks are supposed to be running in steady state. The ACs 0 to 3 comprise the Atoms that are needed to implement the smallest Molecule implementing SATD_4x4. The ACs 4 and 5 belong to B and are used to implement SI0.

T1: SI1 in Task B is forecasted. That triggers a reallocation of AC 3 and a subsequent rotation to implement this SI. Task A then executes SATD_4x4 in software.

T2: SI1 has already been executed several times deploying AC 3 and reusing ACs 1 and 2. Then, the forecast states that SI1 is no longer needed. This triggers a reallocation of ACs 3 to 5 to Task A, which initiates rotation to achieve a hardware implementation of SATD_4x4.

T3: SI0 of Task B is executed in HW utilizing the ACs 4 and 5 although they now ‘belong’ to Task A. This is possible, as they still contain the Atoms needed to implement that SI and they share the available HW resources.

T4: AC 3 is rotated and the execution of SATD_4x4 immediately switches from SW to HW implementation.

T5: AC 4 is rotated and SATD_4x4 is now executed in an even faster Molecule implementation.

This scenario shows that our run-time architecture does not follow a fixed rotation schedule. Instead, adaptation takes place during run-time according to the state of the HW and SW as well as to run-time updated forecasts. This scenario also shows that our approach is suitable for multi-mode systems with their changing demands on quasi-parallel executing tasks. By means of a video codec we show the advantages of RISPP.

6. CASE STUDY AND RESULTS

ITU-T H.264 [9] is one of the latest video coding standards and also the most complex one (10x computation power increase relative to MPEG-4 simple profile for encoding, 2x for decoding [10]).


Figure 4: Forecast Decision Function (FDF). (3D surface: the probability of SI usage [%] and the temporal distance until usage of the SI, relative to the rotation time of the SI on a logarithmic scale [t / TRot], determine the output, i.e. the minimal number of SI usages required to issue a Forecast Candidate [#SI usages].)


Figure 10: HW implementation with Xilinx PlanAhead, visualized with FPGA Editor

Figure 7: Flow of Test Application

In our case study in Figure 7, the SATD operation is calculated first for 16 candidate sub-blocks. The candidate with minimum SATD value is chosen as the best candidate and forwarded to DCT. This whole process is performed for 16 sub-blocks within one MacroBlock (MB) (16x16 pixel block; basic processing unit of encoder). In the worst case after SATD value calculation, the Quality Manager tool can take a decision to switch to the Intra MB injection. Then, after 16 DCTs there will be one 4x4 Hadamard Transform operation on the 16 DC coefficients. Therefore, one SI implements this functionality. For this case study we have designed the SIs manually. Automatic detection and generation of SIs might be done similar to [17] or [18] but is not in the focus of this paper.

As the coefficients do not exceed the 16-bit range, we have considered a 16-bit storage pattern in the design of our SIs, hence enhancing the Atom Container (AC) reusability: two 16-bit data values are packed into one 32-bit register. In case of Chroma there is no SATD operation, as ME only performs on the Luma component. In both cases (inter and intra MB) there are two Chroma components Cr and Cb. Each block is 8x8 (4 calls of DCT), in total 2 x 4 = 8 calls of this SI to complete the transform step for both Chroma components. But for each Chroma component there is one additional 2x2 Hadamard Transform operation. An SI is designed for this operation which constitutes only one Atom. Further, QuadSub and SATD can also be combined to form an SI that can execute the SAD operation used in Integer-Pixel Motion Estimation.
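For reference, the functionality computed by the SATD_4x4 SI can be expressed in plain software. This is an illustrative sketch of the standard 4x4 Hadamard-based SATD, not the paper's optimized routine; loosely, the subtraction step corresponds to the QuadSub Atom, the butterflies to the Transform Atom, and the absolute-value accumulation to the SATD Atom:

```python
# Software reference for SATD_4x4: Hadamard-transform the 4x4 residual
# block (row-wise, then column-wise) and sum the absolute coefficients.

def hadamard4(v):
    """Unnormalized 4-point Hadamard transform via add/subtract butterfly."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + d, b + c, a - d, b - c
    return [s0 + s1, d0 + d1, s0 - s1, d0 - d1]

def satd_4x4(cur, ref):
    """Sum of Absolute Transformed Differences of two 4x4 pixel blocks."""
    diff = [[cur[i][j] - ref[i][j] for j in range(4)] for i in range(4)]
    rows = [hadamard4(r) for r in diff]                # transform rows
    cols = [hadamard4([rows[i][j] for i in range(4)])  # then columns
            for j in range(4)]
    return sum(abs(x) for col in cols for x in col)
```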

Figure 8: Block Diagram of SATD_4x4 Special Instruction

It can be noticed from the block diagram of SATD_4x4 in Figure 8 that this SI can be executed using different Molecules depending upon the number of available Atoms at run-time. These Molecules realize a trade-off between spatial and temporal execution. The minimum requirement for this SI is 4 Atoms, i.e. 1 Atom of each kind.

Figure 9 shows the data path of the Transform Atom. It is designed with consideration of high reusability and high speed. This design makes it utilizable by the SATD_4x4, DCT_4x4, HT_4x4 and HT_2x2 SIs. There are three different transforms used in the ITU-T H.264 Video Codec: 2x2 Hadamard Transform, 4x4 Integer Transform, and 4x4 Hadamard Transform. The addition and subtraction flow is identical in all three transforms. Therefore, by just adding the shift elements multiplexed with two control signals DCT and HT we can make this Atom reusable, so that all three Transform SIs can then use this Atom. For future work we consider automatic generation of reusable Atoms by e.g. methods for finding the longest common subsequence of multiple sequences, as proposed in [31].
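The shared add/subtract flow with multiplexed shifts can be sketched as follows. This is our own illustrative model of one 4-point stage of the data path; the `dct` flag stands for the DCT/HT control signals mentioned above:

```python
# Sketch of the shared Transform data path: the add/subtract butterfly
# is identical for the Hadamard and the H.264 4x4 integer (DCT-like)
# transform; only two multiplexed shift-left elements differ.

def transform4(v, dct=False):
    """4-point butterfly; with dct=True the H.264 core transform rows
    (1,1,1,1 / 2,1,-1,-2 / 1,-1,-1,1 / 1,-2,2,-1) are produced via shifts."""
    a, b, c, d = v
    e0, e1 = a + d, b + c          # even part
    o0, o1 = a - d, b - c          # odd part
    if dct:
        return [e0 + e1, (o0 << 1) + o1, e0 - e1, o0 - (o1 << 1)]
    return [e0 + e1, o0 + o1, e0 - e1, o0 - o1]
```

Only the two shifted terms depend on the mode, which is why a single Atom with two multiplexers can serve all three transforms.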

Table 1: Results for hardware implementation of individual Atoms

Characteristics         Transform   SATD     Pack     QuadSub
# Slices                517         407      406      352
# LUTs                  1034        808      812      700
Utilization             50.5%       39.5%    39.7%    34.2%
Bitstream Size [Byte]   59,353      58,141   65,713   58,745
Rotation Time [µs]      857.63      840.11   949.53   848.84

We have implemented the H.264 case study by means of four ACs on a Xilinx XC2v3000-6 FPGA. This FPGA implementation is meant as a prototype, showing the feasibility of our concept, not as a final product. ASIPs are typically evaluated similarly in FPGA prototypes. The floorplan of the design after synthesis and implementation is shown in Figure 10. The left (static) module contains the core CPU with the connection to some peripherals. We currently use a DLX core, but conceptually we are not limited to any specific core. The four partially reconfigurable ACs contain 4 exemplary Atoms. Each AC has a width of four Configurable Logic Blocks (CLBs). For technical (Xilinx Virtex-II related) reasons our ACs have to use the full height of the FPGA. Each AC comprises 1024 slices and thus 2048 4-input LUTs. We have designed an interconnection bus to attach the ACs via BusMacros [7] as extension of the execution data-paths. The resource usage and the rotation time for the SelectMap [7] interface for each Atom are shown in Table 1. As the AC to which Pack is loaded covers an embedded BlockRAM row, the corresponding bitstream (and thus rotation time) is significantly bigger, although its logic utilization is moderate. The rotation time generally corresponds to the memory transfer rate (e.g. 66 MB/s for Virtex-II) and the bitstream size, and our concept would directly profit from faster rotation times, due to e.g. faster memory bandwidth.

Figure 6: Scenario of the H.264 showing the run-time architecture capabilities. (Timeline T0 to T5 of six Atom Containers shared by Tasks A and B, annotated with forecasts, Container rotations, and whether each SI is executed in hardware or in software.)

Figure 9: An example of an Atom showing the data path of Transform. (Add/subtract butterfly on inputs X00..X11 with shift elements multiplexed by the DCT and HT control signals, producing outputs T00..T11.)
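This proportionality can be checked with a back-of-the-envelope estimate (66 MB/s assumed as stated above; the measured times in Table 1 come out somewhat lower, i.e. the effective transfer rate is slightly higher):

```python
# Rough rotation-time estimate: bitstream size divided by the
# configuration interface's transfer rate (66 MB/s assumed here).

def rotation_time_us(bitstream_bytes, rate_mb_per_s=66):
    return bitstream_bytes / (rate_mb_per_s * 1e6) * 1e6  # microseconds

# e.g. the Transform Atom's 59,353-byte bitstream:
# rotation_time_us(59353) -> about 899 us, in the same range as the
# measured 857.63 us from Table 1.
```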

Figure 11: SI execution time [cycles] for a different amount of RISPP resources. (SATD_4x4, DCT_4x4, and HT_4x4; optimized SW vs. 4, 5, and 6 Atoms; logarithmic scale.)

Figure 12: Allover performance [cycles] of the H.264 encoding engine for a different amount of RISPP resources. (Optimized SW: 201,065; 4 Atoms: 60,244; 5 Atoms: 59,135; 6 Atoms: 58,287.)

The execution times of three SIs for different Molecule options and an optimized software implementation are presented in Figure 11 (logarithmic scale). It is noteworthy that the SIs with minimum Atom requirements are more than 22 times faster than the optimized software implementation. When hardware resources are unavailable during certain points at run-time, this software routine is executed as an alternative. Once the minimum number of Atoms is loaded, the special instruction starts using the hardware resources. Figure 12 shows the relative speed-up for different Molecule options for the encoder application. The implementation with minimum Atom requirements is more than 300% faster than the optimized software implementation. Amdahl's law prevents significant further speed-up when offering more Atoms. To overcome this we will consider additional SIs focusing on different hot spots in future work.
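The saturation visible in Figure 12 follows directly from Amdahl's law; a small sketch illustrates it (the 70% accelerated-runtime fraction below is our own rough assumption for illustration, not a measured value):

```python
# Amdahl's law: overall speedup when a fraction f of the runtime is
# accelerated by factor s; bounded by 1/(1-f) no matter how large s gets.

def overall_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# With an assumed accelerated fraction of 70%:
# overall_speedup(0.7, 22)  -> about 3.0x
# overall_speedup(0.7, 1e9) -> about 3.3x (the 1/(1-f) ceiling)
```

Once the SI region itself runs in hardware, the remaining software dominates, which is why additional Atoms yield only marginal gains.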

Figure 13: SI performance for different Molecules, facilitating the dynamic trade-off. (Chart "RISPP SI Trade-off: Performance vs. Resources": execution time [#cycles] over RISPP resources [#Atoms] for SATD_4x4, HT_4x4, DCT_4x4, and HT_2x2.)

Table 2: Molecule composition of different SIs. (For each Molecule of the HT_2x2, HT_4x4, DCT_4x4, and SATD_4x4 SIs: the required numbers of Load, QuadSub, Pack, Transform, SATD, Add, and Store Atoms and the resulting execution cycles.)

Figure 13 explains the role of the number of Atoms to speed up the SI execution. Each entry in this figure corresponds to a Molecule for the specific SI, where the detailed Molecule formulation is given in Table 2. Our RISPP architecture can dynamically adapt the trade-off between performance and resource requirements by rotating the implementations of the SIs. This corresponds to a movement on the highlighted lines of Pareto-optimal Molecules, whereas an ASIP has to choose fixed SI implementations at design-time. These results excel any results that can be achieved with today's extensible processors. It is due to a carefully selected boundary of design-time and run-time decisions.
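Operationally, this movement along the Pareto front amounts to selecting, for the currently available Atom budget, the fastest Molecule that fits. The (atoms, cycles) pairs below are hypothetical placeholders, not the exact formulations of Table 2:

```python
# Pick the fastest Molecule that fits the available Atom budget,
# i.e. move along the Pareto front of Figure 13.

def pick_molecule(molecules, available_atoms):
    """molecules: iterable of (atoms_needed, exec_cycles) pairs."""
    feasible = [m for m in molecules if m[0] <= available_atoms]
    return min(feasible, key=lambda m: m[1]) if feasible else None

# Hypothetical SATD_4x4-like options: more Atoms -> fewer cycles.
opts = [(4, 24), (6, 19), (8, 15), (16, 9)]
```

With `opts` and a budget of 7 Atoms, the (6, 19) Molecule is chosen; if fewer than 4 Atoms are available, no Molecule fits and the software routine must be used.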

7. CONCLUSION

It can be seen that our approach of instruction rotation provides a high flexibility during run-time to adapt to varying application constraints. It is able to exploit the area-performance design space as it can choose between various Pareto-optimal solutions and switch between them. This is a feature that extensible processors (with predefined special instructions) cannot deliver. As a result, we achieve a very high efficiency, as can be seen in Figure 12. It comes at the cost of a thorough and therefore computationally extensive design-time analysis that carefully trades off the boundary between design-time decisions and run-time flexibility.

8. REFERENCES

[1] ARCtangent processor. ARC Int'l (www.arc.com/configurablecores/)

[2] CoWare Inc, LISATek (www.coware.com/)

[3] LISA (http://servus.ert.rwth-aachen.de/lisa/)

[4] Target Compiler (http://www.retarget.com/)

[5] SP-5Flex DSP Core, 3DSP (www.3dsp.com/sp5_flex.shtml)

[6] Stretch processor (www.stretchinc.com)

[7] Xilinx Corp. "XAPP 290: Two Flows for Partial Reconfiguration: Module Based or Difference Based" (www.xilinx.com)

[8] Xtensa processor, Tensilica Inc. (www.tensilica.com)

[9] ITU-T Rec. H.264 and ISO/IEC 14496-10:2005 (E) (MPEG-4 AVC) "Advanced video coding for generic audiovisual services", 2005

[10] J Ostermann, et al. "Video coding with H.264/AVC: tools, performance, and complexity", IEEE Circuits and Systems, 2004

[11] S Kobayashi, H Mita, Y Takeuchi, M Imai "Design space exploration for DSP applications using the ASIP development system PEAS-III", ICASSP 2002

[12] A Hoffmann et al. "A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language", IEEE Trans. on CAD of Int. Circ. and Syst., 2001

[13] A Halambi et al. "Expression: A language for architecture exploration through compiler/simulator retargetability", DATE, 1999

[14] J Henkel “Closing the SoC Design Gap”, IEEE Computer Volume 36, Issue 9, September 2003

[15] K Keutzer, S Malik, AR Newton "From ASIC to ASIP: the next design discontinuity", IEEE Int'l Conf. on Computer Design, 2002

[16] N Cheung, J Henkel, S Parameswaran "Rapid configuration & instruction selection for an ASIP: a case study", DATE, 2003

[17] K Atasu, L Pozzi, P Ienne "Automatic application-specific instruction-set extensions under microarchitectural constraints", DAC, 2003

[18] F Sun, S Ravi, A Raghunathan, NK Jha "A scalable application-specific processor synthesis methodology", ICCAD, 2003

[19] N Cheung, S Parameswaran, J Henkel "A quantitative study and estimation models for extensible instructions in embedded processors", ICCAD, 2004

[20] G Braun, et al. "A novel approach for flexible and consistent ADL-driven ASIP design", DAC, 2004

[21] P Biswas, et al. "Introduction of local memory elements in instruction set extensions", DAC, 2004

[22] R Hartenstein "A decade of reconfigurable computing: a visionary retrospective", DATE, 2001

[23] K Compton, S Hauck "Reconfigurable computing: a survey of systems and software", ACM Computing Surveys 2002

[24] F Barat, R Lauwereins "Reconfigurable instruction set processors: a survey", RSP 2000

[25] S Vassiliadis, et al. “The MOLEN polymorphic processor”, IEEE Transaction on Computers, 2004

[26] RD Wittig, P Chow "OneChip: an FPGA processor with reconfigurable logic", IEEE Symp. FCCM, 1996

[27] JE Carrillo, P Chow “The effect of reconfigurable units in superscalar processors”, Int’l Symp. on FPGAs, 2001

[28] T Cormen, C Leiserson, R Rivest “Introduction to Algorithms”, MIT Press, 23rd printing, 1999, ISBN 0-262-03141-8

[29] Z Li, S Hauck: “Configuration prefetching techniques for partial reconfigurable coprocessors with relocation and defragmentation”, Int'l Symp. on FPGAs, 2002

[30] L Bauer, M Shafique, D Teufel, J Henkel: “A Self-Adaptive Extensi-ble Embedded Processor”, SASO 2007 (accepted for publication)

[31] P Brisk, A Kaplan, M Sarrafzadeh “Area-efficient instruction set synthesis for reconfigurable system-on-chip designs”, DAC 2004
