
UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
INSTITUTO DE INFORMÁTICA
PROGRAMA DE PÓS-GRADUAÇÃO EM COMPUTAÇÃO

MATEUS BECK RUTZIG

A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation

Thesis presented in partial fulfillment of the requirements for the degree of Doctor of Computer Science

Advisor: Prof. Dr. Luigi Carro

Porto Alegre, January 2012


CIP - CATALOGAÇÃO NA PUBLICAÇÃO

UNIVERSIDADE FEDERAL DO RIO GRANDE DO SUL
Reitor: Prof. Carlos Alexandre Netto
Vice-Reitor: Prof. Rui Vicente Oppermann
Pró-Reitora de Pós-Graduação: Prof. Aldo Bolten Lucion
Diretor do Instituto de Informática: Prof. Luís da Cunha Lamb
Coordenador do PPGC: Prof. Alvaro Freitas Moreira
Bibliotecária-Chefe do Instituto de Informática: Beatriz Regina Bastos Haro

Beck Rutzig, Mateus
A Transparent and Energy Aware Reconfigurable Multiprocessor Platform for Efficient ILP and TLP Exploitation / Mateus Beck Rutzig. Porto Alegre: Programa de Pós-Graduação em Computação, 2012. 119 p.: il.
Tese (doutorado) - Universidade Federal do Rio Grande do Sul. Programa de Pós-Graduação em Computação. Porto Alegre, BR-RS, 2012. Orientador: Luigi Carro.
1. Sistemas Multiprocessados 2. Arquiteturas Reconfiguráveis 3. Sistemas Embarcados I. Carro, Luigi II. Título

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Contributions
2 RELATED WORK
2.1 Single-Threaded Reconfigurable Systems
2.2 Multiprocessing Systems
2.3 Multi-Threaded Reconfigurable Systems
2.4 The Proposed Approach
3 ANALYTICAL MODEL
3.1 Performance Comparison
3.1.1 Low-End Single Processor
3.1.2 High-End Single Processor
3.1.3 High-End Single Processor versus Homogeneous Multiprocessor Chip
3.1.4 Applying the Performance Modeling in Real Processors
3.1.5 Communication Modeling in Multiprocessing Systems
3.1.6 Applying the Performance Modeling in Real Processors considering the Communication Overhead
3.2 Energy Comparison
3.2.1 Applying the Energy Modeling in Real Processors
3.2.2 Communication Modeling in Energy of Multiprocessing Systems
3.2.3 Applying the Energy Modeling in Real Processors considering the Communication Overhead for Multiprocessing Systems
3.3 Example of an Application Parallelization Process in a Multiprocessing System
4 CREAMS
4.1 Dynamic Adaptive Processor (DAP)
4.1.1 Processor Pipeline (Block 2)
4.1.2 Reconfigurable Data Path Structure (Block 1)
4.1.3 Dynamic Detection Hardware (Block 4)
4.1.4 Storage Components (Block 3)
5 RESULTS
5.1 Methodology
5.1.1 Benchmarks
5.1.2 Simulation Environment
5.1.3 VHDL Descriptions
5.1.4 How Does the Thread Synchronization Work?
5.1.5 Organization of this Chapter
5.2 The Potential of CReAMS
5.2.1 Considering the Same Chip Area
5.2.2 Considering the Power Budget
5.2.3 Energy-Delay Product
5.3 The Impact of Inter-thread Communication
5.3.1 Considering the Same Chip Area
5.3.2 Considering the Power Budget
5.3.3 Energy-Delay Product
5.4 Heterogeneous Organization CReAMS
5.4.1 Methodology
5.5 CReAMS versus Out-Of-Order Superscalar SparcV8
6 CONCLUSIONS AND FUTURE WORK
6.1 Future Work
6.1.1 Scheduling Algorithm
6.1.2 Studies over TLP and ILP considering the Operating System
6.1.3 Behavior of CReAMS in a Multitask Environment
6.1.4 Automatic CReAMS Generation
6.1.5 Area Reductions by Applying the Data Path Virtualization Strategy
6.1.6 Boosting TLP Performance with Heterogeneous Multithreaded CReAMS
7 PUBLICATIONS
7.1 Book Chapters
7.2 Journals
7.3 Conferences
APPENDIX A
Introdução
Objetivos
CReAMS
DAP
Metodologia
Resultados
Conclusões

LIST OF ABBREVIATIONS AND ACRONYMS

ALU  Arithmetic and Logic Unit
ARM  Advanced RISC Machine
ASIC  Application Specific Integrated Circuit
BT  Binary Translation
CAD  Computer Aided Design
CCA  Configurable Compute Array
DIM  Dynamic Instruction Merging
DSP  Digital Signal Processor
FIFO  First In, First Out
FPGA  Field Programmable Gate Array
ILP  Instruction Level Parallelism
TLP  Thread Level Parallelism
IPC  Instructions per Cycle
RAW  Read After Write
RFU  Reconfigurable Functional Unit
RPU  Reconfigurable Processor Unit
SIMD  Single Instruction Multiple Data

LIST OF FIGURES

Figure 1. Different Architectures and Organizations
Figure 2. Speedup of homogeneous multiprocessing systems on embedded applications
Figure 3. Coupling setups (HAUCK and COMPTON, 2002)
Figure 4. Virtualization process of Piperench (GOLDSTEIN, SCHMIT, et al., 2000)
Figure 5. How the DIM system works (BECK, RUTZIG, et al., 2008)
Figure 6.
KAHRISMA architecture overview (KOENIG, BAUER, et al., 2010)
Figure 7. Overview of Thread Warping execution process (STITT and VAHID, 2007)
Figure 8. Blocks of the Reconfigurable Architecture (YAN, WU, et al., 2010)
Figure 9. Block Diagram of Annabelle SoC (SMIT, 2008)
Figure 10. The Montium Core Architecture
Figure 11. Fabric utilization considering many architecture organizations (WATKINS, CIANCHETTI and ALBONESI, 2008)
Figure 12. (a) SPL cell architecture (b) Interconnection strategy (WATKINS, CIANCHETTI and ALBONESI, 2008)
Figure 13. (a) Spatial sharing (b) Temporal sharing (WATKINS, CIANCHETTI and ALBONESI, 2008)
Figure 14. Thread intercommunication steps (ALBONESI and WATKINS, 2010)
Figure 15. Modeling of the (a) Multiprocessor System and the (b) High-End Single Processor
Figure 16. Multiprocessor system and Superscalar performance regarding a power budget using different ILP and TLP
Figure 17. Execution time of different designs considering θ = 0.16
Figure 18. Execution time of different designs considering θ = 0.33
Figure 19. Execution time of different designs considering θ = 0.66
Figure 20. Execution time of different designs considering θ = 0.99
Figure 21. Multiprocessing systems and high-end single processor energy consumption
Figure 22. Energy consumption of different designs considering θ = 0.16
Figure 23. Energy consumption of different designs considering θ = 0.33
Figure 24. Energy consumption of different designs considering θ = 0.66
Figure 25. Energy consumption of different designs considering θ = 0.99
Figure 26. Speedup provided in 18-tap FIR filter execution for Superscalar, MPSoC, and a mix of both approaches
Figure 27. C-like FIR Filter
Figure 28. (a) CReAMS architecture (b) DAP blocks
Figure 29. Interconnection mechanism
Figure 30. Example of an allocation of a code region inside of the data path
Figure 31. DAP acceleration process
Figure 32. Activity Diagram of the DIM process
Figure 33. (a) Simulation Flow (b) How the synchronization process is done
Figure 34. How the simulation handles synchronization from the software point of view
Figure 35. Example of Same Area #1 (left) and Same Area #2 (right) comparison schemes
Figure 36.
Relative Performance of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme
Figure 37. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 scheme
Figure 38. Relative Performance of HeteroLarge over HomoSmall CReAMS considering the Same Area #2 scheme
Figure 39. Relative Energy Consumption of HeteroLarge over HomoSmall CReAMS considering the Same Area #2 scheme
Figure 40. Relative Energy-Delay Product of HeteroLarge over HomoSmall CReAMS considering the Same Area #1 and #2 schemes
Figure 41. Relative Speedup of Dynamic over Static Thread Scheduling considering the Same Area #1 scheme
Figure 42. Relative Speedup of Dynamic over Static Thread Scheduling considering the Same Area #2 scheme

LIST OF TABLES

Table 1. Summarized Commercial Multiprocessing Systems
Table 2. Load balancing and mean basic block size of the selected applications
Table 3. The configuration of both basic processors
Table 4. (a) Area (µm²) of DAP and SparcV8 components
Table 5. Area, in µm², of: (a) Same Area Chip scheme #1 (b) Same Area Chip scheme #2
Table 6.
Speedup provided by MPSparcV8 and CReAMS over a standalone single SparcV8 processor
Table 7. Execution time of MPSparcV8 and CReAMS
Table 8. Average power consumption of MPSparcV8 and CReAMS
Table 9. Energy consumption of MPSparcV8 and CReAMS
Table 10. Energy-Delay product of MPSparcV8 and CReAMS
Table 11. Average number of hops for different multiprocessing systems
Table 12. (a) Area of DAP and SparcV8 components (b) Same Area #1 scheme (c) Same Area #2 scheme
Table 13. Execution time (in ms) considering the Same Area #1 scheme for CReAMS and MPSparcV8
Table 14. Execution time (in ms) considering the Same Area #2 scheme for CReAMS and MPSparcV8
Table 15. Energy (in mJ) considering the Same Area #1 scheme for CReAMS and MPSparcV8
Table 16. Energy (in mJ) considering the Same Area #2 scheme for CReAMS and MPSparcV8
Table 17. Execution time (in ms) of both CReAMS and MPSparcV8 considering a power budget
Table 18.
Energy (in mJ) of both CReAMS and MPSparcV8 considering a power budget
Table 19. Energy-Delay product of MPSparcV8 and CReAMS considering the Same Area #2 scheme
Table 20. Energy-Delay product of MPSparcV8 and CReAMS considering the power budget
Table 21. (a) Different DAP sizes (b) Percentage of DAPs that compose each Heterogeneous CReAMS
Table 22. (a) Area of the components of the different DAP sizes (b) Area of the Homogeneous and Heterogeneous CReAMS setups
Table 23. (a) Thermal Design Power (TDP) of DAP configurations (b) TDP of heterogeneous and homogeneous CReAMS
Table 24. TDP of the 4-issue Out-Of-Order SparcV8 multiprocessing system
Table 25. Execution time of 4-issue OOO MPSparcV8 and CReAMS

ABSTRACT

As the number of embedded applications is increasing, the current strategy of several companies is to launch a new platform within short periods, to execute the application set more efficiently, with low energy consumption. However, for each new platform deployment, new tool chains must come along, with additional libraries, debuggers and compilers.
This strategy implies high hardware redesign costs, breaks binary compatibility, and results in a high overhead in the software development process. Therefore, focusing on area savings, low energy consumption, maintenance of binary compatibility and, mainly, improved software productivity, we propose the exploitation of Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS). CReAMS is composed of multiple adaptive reconfigurable systems to efficiently exploit Instruction and Thread Level Parallelism (ILP and TLP) at the hardware level, in a totally transparent fashion. Conceived as a homogeneous organization, CReAMS shows a reduction of 37% in energy-delay product (EDP) compared to an ordinary multiprocessing platform when assuming the same chip area. When a variety of processors with different capabilities for exploiting ILP are coupled in a single die, conceiving CReAMS as a heterogeneous organization, performance improvements of up to 57% and energy savings of up to 36% are shown in comparison with the homogeneous platform. In addition, the efficiency of the adaptability provided by CReAMS is demonstrated in a comparison with a multiprocessing system composed of 4-issue Out-of-Order SparcV8 processors: a 28% performance improvement is shown considering a power budget scenario.

Keywords: Multiprocessors, Reconfigurable Architectures, Instruction and Thread Level Parallelism.
1 INTRODUCTION

Industry competition in the current wide and expanding embedded market makes the design of a device increasingly complex. Nowadays, embedded systems are in a transition process from closed devices to a world in which products have to run applications, previously unforeseen at design time, during their whole life cycle. Thus, companies are always enhancing their repository of applications to sustain their profit even after the product has been sold. Current cell phones, a clear example of devices that exploit today's convergence, are capable of downloading applications during the product life cycle. Android, Google's software framework, in less than three years of existence offers 380,297 applications for download, while Apple's platform, iOS, has three times as many applications as Android. Apple became the most valuable company in the United States four years after the launch of the iPhone. Customers are attracted to having more and more applications in their devices, such as games, text editors and VoIP communication interfaces.

However, most embedded products are mobile and hence battery-powered. Hardware designers must cope with well-known design constraints such as energy consumption, chip area, process costs and processing capability. The strategy of embedding different applications during the product life cycle produces new design challenges, which makes embedded platform development even more difficult. Thus, current embedded system design is not only constrained by the requirements of the existing applications. To reach a wider market, one should carefully conceive the design to cope with the requirements of the wide software repository that will be developed even after the product deployment.

The fast deployment of embedded applications dynamically enlarges the range of different types of code that the platform should execute.
Consequently, the life cycle of modern embedded products is getting increasingly short, since the hardware platform was not originally built to handle such software heterogeneity. A few years ago, cell phone manufacturers launched a major product line per year, which was sufficient to supply the performance required by the new applications launched during that period. The life cycle of a cell phone has since shortened to meet the requirements of new applications (HENKEL, 2003), which implies less revenue per new design due to the reduced product lifetime. However, companies try hard to stretch their product lines to amortize the costs and to increase the profit per design. Typically, companies use the natural life cycle of the applications in the market as a strategy to stretch the product life cycle and to avoid the costs of frequent hardware redesigns. The application life cycle is divided into three phases (BRANDAO and WYNN, 2008): introduction, growth and maturity. The introduction phase reflects the time when the application is launched in the market (e.g., a new video decoding standard such as H.264).
Due to doubts about consumer acceptance, the logical behavior of a new application is described using well-known high-level software languages (e.g., C++, Java and .NET) supported by the platform tool chain, which could possibly overload some parts of the underlying platform. The general-purpose processor is responsible for executing the new application. In this life cycle step, companies still avoid hardware costs, since the target platform is the very same as that of the previous product, or very close to it. After market consolidation, the growth and maturity phases start, thanks to the widespread use of the application in different products. At this time, a redesign of the hardware platform is mandatory to shrink the gap between the application and the hardware, achieving better energy/performance execution.

Generally, two approaches are used to provide efficient execution of the latest embedded applications. In the first approach, new instructions are added to the original instruction set architecture (ISA) of the platform. This approach aims at solving performance bottlenecks created by the massive execution of certain application parts with well-defined behavior. For instance, this used to be the scenario in the last generation of embedded systems. After the profiling and evaluation phase, parts of applications that exhibit similar behavior are implemented as specialized hardwired instructions. These instructions extend the processor ISA to assist a delimited range of applications (GONZALEZ, 2000). Since multimedia applications and digital filters are massively used in the embedded systems field, current ARM processors implement DSP extensions to execute these kinds of applications efficiently, in terms of both energy and performance.

A second technique uses a more radical approach to close the gap between the hardware and the embedded application.
Application Specific Instruction Set Processors (ASIPs) implement the entire logical behavior of an application in hardware. ASIP development can be considered a better design solution than ISA extensions, since it provides higher energy savings and performance. Nowadays, such an approach is widely explored by the leading companies in the market. The Open Multimedia Application Platform (OMAP), designed by Texas Instruments, comprises one or two ARM processors surrounded by several ASIPs (for communication, graphics and audio standards), each of them with particular architectural characteristics to efficiently execute a restricted type of software. Companies usually employ ASIPs to meet the demand for new applications in the shortened-design-time scenario and to reach the performance requirements imposed by the market.

However, the use of ASIPs causes frequent platform redesigns, which besides increasing hardware deployment costs also affect the software development process. It is already difficult to develop applications for current platforms, where one can find up to 21 ASIPs. This difficulty will increase in the coming decade, since it is expected that 600 different ASIPs will be needed to cover the growing convergence of applications for embedded devices (SEMICONDUCTORS, 2009). To soften such complexity, hardware companies (e.g., Texas Instruments for OMAP, and Nvidia) provide particular tool chains to support the software development process. These tool chains make the implementation details of the platform transparent to the software designers, even in the presence of a great number of ASIPs. However, each release of a platform relies on tool chain modifications, since the tool chain must be aware of the existence of the underlying ASIPs.
Thus, changes to both software and hardware are mandatory when ASIPs are employed to supply the energy and performance efficiency demanded by current embedded platforms.

Despite the great advantages shown by the employment of ISA extensions and ASIPs, such approaches rely on frequent hardware and software redesigns, which go against the current market trend of stretching the life cycle of a product line. These strategies attack only a very specific application class, failing to deliver the required performance when executing applications with behaviors that were not considered at design time. In addition, both the ISA extensions and the ASIPs employed in current platforms only exploit instruction level parallelism (ILP). Aggressive ILP exploitation techniques no longer provide an advantageous tradeoff between the number of transistors added and the extra speedup obtained (MAK, 1991).

Due to the aforementioned reasons, the foreseen scenario dictates the need for changes in the paradigm of hardware platform development for embedded systems. Many advantages can be obtained by combining different processing elements in a single die. The execution time can clearly benefit, since several different parts of the program can be executed concurrently on the processing elements. In addition, the flexibility to combine different processing elements, in terms of performance, appears as a solution to the heterogeneous software execution problem. Hardware developers can select the set of processing elements that best fits the heterogeneity present in their designs.

Multiprocessing systems provide several advantages, and three of them stand out: performance, energy consumption and validation time. The life cycle of these devices has halved in comparison with products of the last decade. Validation time appears as an important consumer electronics constraint that should be carefully handled. Studies indicate that 70% of the design time is spent on platform validation (ANANTARAMAN, SETH, et al., 2003), making it an attractive point for time-to-market optimization. Considering this subject, the use of multiprocessing systems softens the hard
task to shrink time-to-market. Commonly, such an approach is built by combining validated processing elements that are aggregated into a single die like a puzzle. Since each puzzle block reflects a validated processing element, the remaining design challenge is to assemble the different blocks. In practice, the designers should select a communication mechanism to connect the entire system, which eases the design process through the use of standard communication mechanisms such as buses or networks on chip (NoC).

Multiprocessing systems introduce a new parallel execution paradigm aiming to overcome the performance barrier created by the limits of instruction level parallelism. Nowadays, the software team must manually detect the parts of the program that can be executed in parallel. The hardware team is only responsible for the encapsulation of the chosen processing elements and for the communication infrastructure. In contrast to ILP exploitation, in multiprocessing chips the complexity of extracting parallelism moves to the software team, since there is no robust methodology that can support automatic software parallelization. The software team is responsible for the non-trivial task of spawning and distributing the code among the processing elements. For this reason, software productivity arises as the hardest challenge in a multiprocessing system design, since applications should be launched as fast as possible to supply the demand of the market. The binary code of these applications should be as generic as possible to provide compatibility among different products and
platforms. In addition, the communication infrastructure should be efficient enough to smooth the latency of inter-thread communication.

The four quadrants plotted in Figure 1 show the strengths and weaknesses of the existing hardware strategies used to design a multiprocessor platform. This figure considers the organization and the architecture of the multiprocessing platforms. The main strategy used by the leading companies in the market is to build embedded platforms as illustrated in the lower left quadrant of Figure 1. Such a strategy can be area inefficient, since it relies on the employment of a particular ASIP to efficiently cover the execution of software with a restricted behavior in terms of ILP and TLP. Each release of a platform will not be transparent to the software developers, since together with a new platform, a new version of its tool chain with particular libraries and compilers must be provided. Besides the obvious deleterious effects on software productivity and compatibility for any new hardware upgrade, there will also be intrinsic costs of new hardware and software development for every new product.

Figure 1. Different Architectures and Organizations

On the other hand, the upper right quadrant of Figure 1 illustrates the multiprocessing systems that are composed of multiple copies of the same processor, in terms of architecture and organization. Typically, such a strategy is employed in general-purpose platforms where performance is mandatory. However, energy consumption is also becoming relevant in this domain (e.g., it is necessary to reduce energy costs in datacenters). In order to cope with this drawback, the homogeneous architecture with heterogeneous organization, shown in the upper left quadrant of Figure 1, has been emerging to provide better energy and area efficiency than the other two aforementioned platforms. This approach brings the cost of higher design validation time, since many different organizations of processors are used. However, it has the advantage of implementing a unique ISA, so the software development process is not
penalized. It is possible to generate assembly code using the very same tool chain for any platform version, maintaining full binary compatibility for the already developed applications. However, the scheduling of the threads appears as an additional challenge when the heterogeneous organization approach is used. Threads that have different levels of instruction-level parallelism should be assigned to processors with different performance capabilities.

Software partitioning is a key feature in multiprocessing environments. A computationally powerful multiprocessing platform becomes useless if the threads of a certain application show a significant load unbalance ratio. Usually, this is caused by the poor quality of the software partitioning, or by the nature of the application, which does not provide a minimum of thread-level parallelism to be explored. Amdahl's law shows that the speedup of a certain application is limited by its sequential part. Therefore, if an application needs 1 hour to execute, of which 5 minutes are sequential (almost 9% of the entire application code), the maximum speedup provided by a multiprocessing system is 12 times, no matter how many processing elements are available.

Figure 2 shows the performance of some well-known embedded applications that were split into threads using a traditional shared-memory parallel programming language. As can be seen, the performance of these applications does not scale as the number of processors increases, when executed on a multiprocessor system composed of multiple copies of five-stage pipeline RISC processors. In the best case of the examples, even overlooking inter-thread communication costs, a speedup of nine times is achieved when 64 processors are used.
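The 12-times bound quoted above follows directly from Amdahl's law; a minimal sketch of the arithmetic (the function name is illustrative, and the 1-hour/5-minute figures come from the example in the text):

```python
def amdahl_speedup(sequential_fraction, n_processors):
    """Amdahl's law: overall speedup is bounded by the sequential part."""
    return 1.0 / (sequential_fraction
                  + (1.0 - sequential_fraction) / n_processors)

# Example from the text: 1 hour total, of which 5 minutes are sequential.
f = 5.0 / 60.0  # ~8.3% of the execution is sequential

speedup_64 = amdahl_speedup(f, 64)  # with 64 processors: 10.24x
limit = 1.0 / f                     # asymptotic limit (n -> infinity): 12x
```

Even with an unbounded number of processors, the speedup can never exceed 1/f = 12 times, which is why adding processing elements beyond a certain point yields diminishing returns.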
Clearly, these embedded applications are good examples of Amdahl's law, demonstrating that multiprocessing systems can fail to accelerate applications that have a meaningful sequential part. Since there is a limit of TLP for most applications (BLAKE, DRESLINSKI, et al., 2010), standalone TLP exploitation does not provide the energy and performance optimization demanded by current embedded designs.

Figure 2. Speedup of homogeneous multiprocessing systems on embedded applications

The ideal platform would have the hardware benefits of the third quadrant of Figure 1, with the ease of software development of the second quadrant, without any cost associated with new hardware development. This means that the physical structure of the hardware could be homogeneous, since chip area is no longer a drawback for billion-transistor technologies. Nevertheless, it is mandatory that the costs, such as power and energy consumption, virtually behave as in a heterogeneous organization. However, this can only be achieved if the available hardware has the ability to be tuned for each different application, or even program phase, on the fly. Dynamic reconfigurable architectures have already shown to be very attractive for embedded platforms, since they can adapt the fine-grain parallelism exploitation (i.e., at the instruction level) to the application requirements at run-time (CLARK, KUDLUR, et al., 2004) (LYSECKY, STITT and VAHID, 2004). However, besides having restricted thread-level parallelism, embedded applications also exhibit limits of instruction-level parallelism. Thus, gains in performance when such exploitation is employed tend to stagnate, even if a huge amount of resources is available in the reconfigurable accelerator. The Single Dynamic Accelerator bars of Figure 2 illustrate this claim. This assumption considers the performance of a dynamic reconfigurable architecture with an area equivalent to sixteen five-stage pipeline RISC processors.
Although it is faster than a RISC processor executing a single thread, even outperforming four processors when running the Fast Fourier Transform (FFT), the standalone ILP exploitation of the single dynamic accelerator does not provide an advantageous trade-off between area and performance when compared to the multithreaded versions of the remaining benchmarks. Summarizing, this figure indicates that neither ILP nor TLP alone provides a meaningful area-performance tradeoff, considering a heterogeneous software environment.

The state of the art of multiprocessing systems with both ILP and TLP exploitation is very divergent, if one considers the complexity of the processing element. At one side of the spectrum, there are multiprocessing systems composed of multiple copies of simple cores to better explore the coarse-grain parallelism of highly thread-based applications (HAMMOND, HUBBERT, et al., 2000) (ANDRE, BARROSO, et al., 2000). At the other side, there are multiprocessor chips assembled with a few complex superscalar/SMT processing elements, to explore applications where ILP exploitation is mandatory. There is no consensus on the hardware logic distribution in a multiprocessing environment to explore the best of ILP and TLP together regarding a wide range of application classes. Considering the wide range of instruction-level parallelism that current applications exhibit, there is a large design space to explore by creating platforms composed of processors with different capabilities of exploiting ILP.
Although current technology allows the encapsulation of billions of transistors in a single chip, area could be saved and the performance of the homogeneous platform could be maintained by exploiting the diversity of computational capabilities of a heterogeneous organization. Such a strategy relies on a scheduling algorithm that correlates the intrinsic characteristics of the threads, such as load unbalance and ILP, with the computational capability of the available processors (KUMAR, FARKAS, et al., 2003) (KUMAR, JOUPPI and TULLSEN, 2006).

Summarizing, an ideal multiprocessing system for embedded devices should be composed of replications of generic processing elements that can adapt to the particularities of the applications throughout the product life cycle. This platform should emulate the behavior, in terms of performance and energy, of the ASIPs that are successfully employed in current embedded platforms. At the same time, in contrast to such platforms, the use of the same ISA for all processing elements is mandatory, to increase software productivity by avoiding time spent on tool chain modifications and to maintain binary compatibility with the already developed applications. This ideal platform would be able to efficiently attack the whole spectrum of application behaviors: those that contain dominant thread-level parallelism and those that are single-threaded. However, the platform should be conceived as a heterogeneous organization to provide the best fit between the heterogeneous characteristics that the applications exhibit and the processing capability necessary to execute them. Moreover, the number of processing elements should be carefully investigated, since the overall system performance can be affected by inter-thread communication costs as the number of processing elements grows. This way, the hypothesis is that by using such a strategy one can reach a satisfactory tradeoff in terms of energy, performance and area, without extra software and hardware costs.
1.1 Contributions

Considering all the motivations discussed before, the first goal of this work is to reinforce, by the use of an analytical model, that the employment of a standalone level of parallelism exploration does not provide a meaningful energy-performance tradeoff when a heterogeneous application environment is handled. In addition, this study gives some clues about the ratio of hardware deployment in multiprocessing chips, in terms of fine- and coarse-grain parallelism exploitation, to achieve a balanced architecture in terms of area and performance. A Network-on-Chip is also modeled to investigate the impact of inter-thread communication latency on the gains obtained by thread-level parallelism exploitation.

In this scenario, we propose a platform based on Custom Reconfigurable Arrays for Multiprocessor Systems (CReAMS), merging two different architectural concepts: reconfigurable architectures and multiprocessing systems. In the first step of this work, CReAMS is built as homogeneous in both architecture and organization. However, it virtually behaves as a homogeneous architecture with a heterogeneous organization. Thanks to its dynamic adaptive hardware, coupled to each basic processor, CReAMS takes advantage of the flexibility provided by the reconfigurable architecture.

This system is capable of transparently exploring (no changes in the binary code are necessary at all) the fine-grained parallelism of the individual threads, offering a much greater ability to adapt to the ILP demands of the applications, while at the same time it makes the most of the available thread parallelism. The coarse-grained parallelism exploitation does not rely on the employment of special tools, since it is explored through well-known application programming interfaces (e.g.
OpenMP and POSIX threads), making CReAMS execution independent of any particular software partitioning process. Thus, dynamically and in a transparent fashion, it is possible to balance the best of both the thread and instruction parallelism levels. This way, any kind of code is accelerated, from code that presents high TLP and low ILP to code that is exactly the opposite. CReAMS achieves performance improvements while providing lower energy consumption, but with the software productivity of a multiprocessor device based on a homogeneous architecture. In addition, a single tool chain is used for the whole platform and for any new version launched, with full binary compatibility.

Aiming at showing the potential of the CReAMS platform in adapting to a wide range of software behaviors, we selected applications from general-purpose (e.g. SPEC OMP2001), parallel (e.g. Splash2) and embedded benchmark suites (e.g. MiBench). The experimental setup was supported by simulation using the SparcV8 ISA model supplied by the Simics instruction set accurate simulator (MAGNUSSON, CHRISTENSSON, et al., 2002). CReAMS measurements were obtained through the replication of cycle-accurate simulators that model the behavior of the basic processing element of CReAMS, named the Dynamic Adaptive Processor (DAP). The cycle-accurate simulators precisely account for thread synchronization, such as barriers and locks. CReAMS implements thread communication through a shared-memory mechanism and, as already mentioned, supports the well-known application programming interfaces, which makes the thread spawning process transparent to the hardware. Performance improvements and energy savings were demonstrated when comparing CReAMS to an ordinary multiprocessing system composed of multiple copies of pipelined SparcV8 processors, considering the same chip area for both designs.
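The intuition behind balancing both parallelism levels can be sketched with a toy extension of Amdahl's law, in the spirit of the analytical study mentioned above (the formula, function name and numbers below are illustrative assumptions, not the actual model of Chapter 3):

```python
def combined_speedup(f_seq, n_cores, ilp_accel):
    """Toy model: the sequential part benefits only from per-core ILP
    acceleration ('ilp_accel'); the parallel part benefits from both
    TLP (n_cores) and ILP (ilp_accel) at the same time."""
    return 1.0 / (f_seq / ilp_accel
                  + (1.0 - f_seq) / (n_cores * ilp_accel))

f = 0.10  # hypothetical workload: 10% of the code is sequential

tlp_only = combined_speedup(f, 16, 1.0)  # 16 plain cores        -> 6.4x
ilp_only = combined_speedup(f, 1, 2.0)   # 1 core, 2x ILP accel. -> 2.0x
both     = combined_speedup(f, 16, 2.0)  # 16 accelerated cores  -> 12.8x
```

Under these assumptions, combining TLP and per-core ILP acceleration outperforms either technique alone, which is precisely the design point CReAMS targets.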
Since very interesting performance and energy results were obtained considering CReAMS as a homogeneous organization platform, and aiming at reducing the area occupied by CReAMS, we investigated the advantages of using DAPs with different processing capabilities, taking advantage of a heterogeneous organization. One of the motivations for such a design space exploration is the diversity of instruction-level parallelism available in a heterogeneous application workload. Some threads may have a larger amount of instruction-level parallelism than others, which can be exploited by a DAP that can issue many instructions per cycle.

However, a powerful DAP could be assigned to execute a certain thread that requires little ILP exploitation, consuming more power than a simpler core that would be better matched to the characteristics of such a thread. This wrong thread assignment could cause load-unbalanced execution, significantly affecting the overall execution time. Thus, DAPs with different processing capabilities bring a diversity of ILP opportunities to explore, opening room to achieve larger area savings and lower power consumption than the homogeneous organization strategy. However, this also brings the need for a thread scheduling strategy that matches the performance requirements of each thread, to maintain the performance shown by the homogeneous DAPs. Thus, we developed a simple thread scheduling algorithm only to prove the need for a dynamic thread scheduling strategy when heterogeneous organizations are employed. The scheduling strategy assigns threads to DAPs with different ILP exploitation capabilities considering the number of executed instructions. As all DAPs have the same instruction set in the heterogeneous environment, the transparency offered by the homogeneous CReAMS in the software development process is not affected, which maintains the same software productivity.
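The text only fixes the metric (the number of executed instructions); the greedy ranking below is a hypothetical sketch of such a scheduler, not the actual CReAMS implementation, and all identifiers (`schedule`, the thread and DAP names) are illustrative:

```python
def schedule(thread_instr_counts, dap_issue_widths):
    """Greedy sketch: rank threads by executed-instruction count and
    assign the heaviest thread to the widest-issue DAP, and so on.
    Inputs are dicts: {thread_id: instructions}, {dap_id: issue_width}."""
    threads = sorted(thread_instr_counts,
                     key=thread_instr_counts.get, reverse=True)
    daps = sorted(dap_issue_widths,
                  key=dap_issue_widths.get, reverse=True)
    return dict(zip(threads, daps))

# Hypothetical profile: T0 executed far more instructions than T2, so it
# should land on the DAP with the largest ILP exploitation capability.
mapping = schedule({"T0": 9_000_000, "T1": 4_000_000, "T2": 500_000},
                   {"DAP_wide": 4, "DAP_mid": 2, "DAP_small": 1})
```

Because every DAP implements the same ISA, this reassignment needs no recompilation: only the placement of threads changes, not the binaries.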
Chapter 2 presents the related work, discussing issues related to reconfigurable architectures and multiprocessing systems based on heterogeneous architectures. We also describe the contribution and main novelty of this work in comparison with these other studies. In Chapter 3, we first discuss, using an analytical model, the potential of standalone exploitation of instruction- and thread-level parallelism. We model a multiprocessing architecture composed of several simple and homogeneous cores, and we compare it to the modeling of a superscalar architecture in terms of performance and energy. The impact of the communication infrastructure is also analytically modeled. After that, in Chapter 4, we present the structure of the CReAMS platform. Chapter 5 shows the methodology and tools employed to gather the results. The performance, energy and area results regarding the homogeneous organization of CReAMS are demonstrated in this chapter. Afterwards, results considering CReAMS conceived as a heterogeneous organization are shown. Finally, the performance of CReAMS is compared to a 4-issue out-of-order SparcV8 multiprocessor. Chapter 6 discusses future work and concludes this thesis.

2 RELATED WORK

In this chapter, we review traditional works that explore reconfigurable fabrics to accelerate single-threaded applications. Afterwards, we show some approaches that use multiprocessing systems in the commercial and academic fields. Finally, the characteristics of several research efforts that employ reconfigurable architectures in a multiprocessing environment are shown. At the end of this section, we analyze our approach, linking its similarities and dissimilarities with the other works that use the same strategy.

2.1 Single-Threaded Reconfigurable Systems

Although there are no common criteria for the classification of single-threaded reconfigurable systems, a careful study with respect to coupling, granularity and reconfiguration type is presented in (HAUCK and COMPTON, 2002).
In a reconfigurable architecture design, the choice of the coupling between the reconfigurable datapath and the basic processor is crucial for performance. As can be seen in Figure 3, tightly coupled is the classification given to a reconfigurable fabric implemented as an additional functional unit (FU) of the processor. As the communication between both elements occurs only inside the chip, its high throughput is a benefit over the loosely coupled reconfigurable fabric. There are many sub-classifications of loosely coupled fabrics. When the fabric is classified as a co-processor, the datapath is implemented outside the chip, as shown in Figure 3. Design constraints guide the coupling employment: when there is not enough silicon area to hold the reconfigurable fabric, loosely coupled architectures are used, where an external bus is responsible for the communication between the processor and the reconfigurable fabric. Attached is the coupling strategy that connects the reconfigurable fabric between the cache memory and the I/O interface. Its communication cost is high, but lower than that of the standalone strategy, which connects the reconfigurable datapath to the I/O interface.

The size and complexity of the basic reconfigurable elements is referred to as the block granularity. For example, one could build a reconfigurable fabric as replications of one-bit-wide adders as the basic reconfigurable element. Alternatively, a 32-bit-wide adder could be encapsulated as a black box, building a coarser basic reconfigurable element. The latter design provides lower reconfiguration flexibility than the former, since operations that need less than a 32-bit-wide adder would always occupy a whole basic element. On the other hand, a simpler controller is required as the granularity becomes coarser, and fewer bits are needed to reconfigure the whole fabric.

Figure 3.
Coupling setups (HAUCK and COMPTON, 2002)

Static reconfiguration is exploited by several research works as a strategy to extract, at compile time, the most suitable parts of the application code to execute efficiently in the reconfigurable fabric. This strategy avoids any kind of execution-time task by adding a compilation phase to discover the suitable parts of the application code. However, it breaks binary compatibility, since it relies on some kind of source code modification. In addition, the time-to-market constraint can be affected, as a new compilation phase is inserted. Many successful reconfigurable fabrics employ static reconfiguration.

Processors like Chimaera (HAUCK, FRY, et al., 2004) have a tightly coupled reconfigurable array in the processor core, working as an additional functional unit, limited to combinational logic only. This simplifies the control logic and diminishes the communication overhead between the reconfigurable array and the rest of the system. Look-up tables are used as the basic reconfigurable block, which leads to high reconfiguration costs, in memory footprint as well as in reconfiguration time. The GARP machine (WAWRZYNEK, 1997) is a MIPS-compatible processor with a loosely coupled reconfigurable array. The communication is done using dedicated move instructions; as it also employs look-up tables as basic reconfigurable blocks, this approach incurs the same design costs as Chimaera.

Piperench (GOLDSTEIN, SCHMIT, et al., 2000) proposes a pipeline-based reconfigurable fabric attached to the processor to reduce the reconfiguration/execution time of FPGAs. This approach uses a technique named virtualization to reduce the area costs of the reconfigurable fabric. The upper side of Figure 4 (Figure 4(a)) shows an example of a Piperench execution without the virtualization technique. In this case, the application was divided into 5 parts and takes 7 cycles to be configured and executed, since no parallelism in the configuration/execution process is provided.
The lower side of Figure 4 shows the virtualization technique: only 3 cycles are needed to execute the same application. The reuse of the same datapath stage at different periods is the key factor to achieve high performance with low area.

More recently, new reconfigurable architectures, very similar to the dataflow approaches, were proposed. For instance, TRIPS is based on a hybrid von Neumann/dataflow architecture that combines an instance of coarse-grained, polymorphous grid processor cores with an adaptive on-chip memory system (SANKARALINGAM, NAGARAJAN, et al., 2004). To better explore the application parallelism and utilize the available resources, TRIPS uses three different modes of execution, focusing on instruction-, data- or thread-level parallelism. Wavescalar (SWANSON, 2007), in turn, totally abandons the program counter and the linear von Neumann execution model that could limit the amount of exploited parallelism. The major difference between this approach and conventional systems is that there is no central processing unit at all; it is replaced by many distributed processing nodes. In agreement with the previous examples, one can also refer to Molen (VASSILIADIS, WONG, et al., 2004). All cited approaches still rely on static reconfiguration to achieve code optimization and better resource utilization when applying reconfigurable logic.

Figure 4. Virtualization process of Piperench (GOLDSTEIN, SCHMIT, et al., 2000)

Concerned about the overheads created by the static reconfiguration process, Stitt (LYSECKY, STITT and VAHID, 2004) did pioneering work in proposing the dynamic detection strategy for reconfigurable fabrics. The employment of dynamic detection techniques does not rely on code recompilation, providing software compatibility and maintaining the device's time-to-market. Stitt et al.
(LYSECKY, STITT and VAHID, 2004) presented Warp Processing, which is based on a system that performs dynamic partitioning using reconfigurable logic. Performance improvements are shown when applying such a technique to a set of popular embedded system benchmarks. It is composed of a microprocessor to execute the application software, another microprocessor where a simplified CAD algorithm runs, local memory and a dedicated simplified FPGA.

In (CLARK, KUDLUR, et al., 2004), the Configurable Compute Array (CCA), a coarse-grained array tightly coupled to an ARM processor, is proposed. The feeding process of the CCA involves two steps: the discovery of which subgraphs are suitable for running on the CCA, and their replacement by microops in the instruction stream. Two alternative approaches are presented: static, where the subgraphs for the CCA are found at compile time, and dynamic. Dynamic discovery assumes the use of a trace cache to perform subgraph discovery on the retiring instruction stream at run-time. Even applying dynamic techniques, Warp Processing and the CCA present some drawbacks, though. First, significant memory resources are required for the kernel transformations. In the case of Warp Processing, the FPGA presents long latency, consumes area and is power inefficient. In the case of the CCA, some operations, such as memory accesses and shifts, are not supported at all. Consequently, usually just the very critical parts of the software are optimized, limiting their field of application.

In (BECK, RUTZIG, et al., 2008), Beck proposes the coupling of a reconfigurable system with a special binary translation (BT) technique implemented in hardware, named Dynamic Instruction Merging (DIM). DIM is designed to detect and transform instruction groups for reconfigurable hardware execution. Therefore, this work proposes a completely dynamic nature for the reconfigurable array: besides being dynamically reconfigurable, the sequences of instructions to be executed on it are also detected and transformed into a datapath configuration at run-time.
As can be observed in Figure 5, this is done concurrently while the main processor fetches other instructions (Step 1). When a sequence of instructions is found, a binary translation is applied to it (Step 2). Thereafter, this configuration is saved in a special cache, indexed by the memory address of the first detected instruction (Step 3).

Figure 5. How the DIM system works (BECK, RUTZIG, et al., 2008)

The next time the saved sequence is found (Step 4), the dependence analysis and the translation are no longer necessary: the BT mechanism loads the previously stored configuration from the special cache and the operands from the register file and memory (Step 5), and activates the reconfigurable hardware as a functional unit (Step 6). Then, the array executes that configuration in hardware (including the write-back of the results) (Step 7), instead of the ordinary (not translated) processor instructions. Finally, the PC is updated in order to continue the execution. This way, repetitive dependence analysis of the same sequence of instructions throughout program execution is avoided.

The reconfigurable datapath is tightly coupled to the processor, working as another ordinary functional unit in the pipeline. It is composed of coarse-grained functional units, such as arithmetic and logic units and multipliers. A set of multiplexers is responsible for the routing. Because of the small context size and simple structure, the use of a coarse-grained datapath is more suitable for this kind of dynamic technique. In this technique, both the DIM engine and the reconfigurable datapath are designed to work in parallel with the processor and do not introduce any delay overhead or penalties in the critical path of the pipeline structure.
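The detect/translate/cache/reuse flow of Steps 1-7 can be summarized in a toy software model (DIM itself is a hardware mechanism; the class, method names and cache layout below are illustrative assumptions only):

```python
class DIMCacheModel:
    """Toy model of the DIM flow: translate a hot instruction sequence
    once, index the resulting configuration by the PC of its first
    instruction, and reuse the stored configuration on later hits."""

    def __init__(self):
        self.config_cache = {}  # PC of first instruction -> configuration
        self.translations = 0   # how many times the costly BT actually ran

    def translate(self, sequence):
        # Stand-in for dependence analysis + binary translation (Step 2).
        self.translations += 1
        return tuple(sequence)

    def execute(self, pc, sequence):
        if pc not in self.config_cache:        # Steps 1-3: detect & store
            self.config_cache[pc] = self.translate(sequence)
        return self.config_cache[pc]           # Steps 4-7: load & reuse

dim = DIMCacheModel()
loop_body = ["add r1,r2,r3", "mul r4,r1,r5", "sub r6,r4,r7"]
for _ in range(100):          # a hot loop re-executes the same sequence
    dim.execute(0x400, loop_body)
# the translation runs only once, no matter how often the loop executes
```

This is exactly why the special cache pays off: the expensive dependence analysis is amortized over every subsequent execution of the same sequence.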
All the works explained in this subsection show the potential of transforming parts of the software for reconfigurable logic execution. However, both the dynamic and static approaches are still limited to optimizing single-threaded applications, which narrows their field of application since, nowadays, due to the limited instruction-level parallelism (MAK, 1991), performance will not increase at the same pace as the number of functional units.

2.2 Multiprocessing Systems

In the nineties, sophisticated architectural features that exploit instruction-level parallelism, like aggressive out-of-order instruction execution, provided a higher increase in the overall circuit complexity than in performance. Therefore, as the technology reached an integration of almost a billion transistors in a single die in this decade, researchers started to explore thread-level parallelism by integrating many processors in a single die.

In the academic field, several research efforts address the chip-multiprocessing subject. Hydra (HAMMOND, HUBBERT, et al., 2000) was one of the pioneering designs that integrated many processors within a single die. The authors argue that the hardware cost of extracting parallelism from a single-threaded application is becoming prohibitive, and advocate the use of software support to extract thread-level parallelism, allowing the hardware to be simple and fast. In addition, they discourage a complex single-processor implementation in a billion-transistor design, since wire delay increases with technology scaling, which makes the handling of long wires complex in pipeline-based designs. For instance, in the Pentium 4 design, the long wire distances add two pipeline stages to the floating-point pipeline, so the FPU has to wait two whole clock cycles for the operands to arrive from the register file (PATTERSON and HENNESSY, 2010).

The Hydra Chip Multiprocessor is composed of multiple copies of the same processors, being homogeneous from both the architecture and organization points of view.
The Hydra implementation contains eight processors, each of them capable of issuing two instructions per cycle. The choice of a simple processor organization provides advantages over multiprocessing systems composed of complex processors since, besides allowing a higher operating frequency for the chip, it fits a larger number of processors in the same area. A performance comparison among the Hydra design, a 12-issue superscalar processor and an 8-thread 12-issue simultaneous multithreading processor showed promising results for applications that could be parallelized into multiple threads, since Hydra uses relatively simpler hardware than the compared architectures. However, disadvantages appear when applications contain code that cannot be multithreaded: Hydra is then slower than the compared architectures, because only one processor can be targeted to the task, and this processor does not have a strong ability to extract instruction-level parallelism.

Piranha (ANDRE, BARROSO, et al., 2000), like Hydra, invests in the coupling of many simple single-issue in-order processors to massively explore the thread-level parallelism of commercial database and web server applications. The project provides a complete platform composed of eight simple processor cores along with a complete cache hierarchy, memory controllers, coherence hardware and a network router, all on a single chip running at 500 MHz. Results on web server applications show that Piranha outperforms an aggressive out-of-order processor running at 1 GHz by over a factor of three. Like Hydra, the authors explicitly declare that Piranha is a wrong design choice if the goal is to achieve performance improvements in applications that lack sufficient thread-level parallelism, due to the simple organization of its processors.
Tullsen (KUMAR, TULLSEN, et al., 2004) demonstrates that there can be a great advantage in providing a diversity of processing capabilities within a multiprocessing chip, allowing the architecture to adapt to the application requirements. A heterogeneous-organization, homogeneous-ISA multiprocessing chip is assembled with four different processor organizations, each one with its particular power consumption and instruction-level parallelism exploitation capability. To motivate the use of such an approach, a study of the SPEC2000 benchmark suite was performed. It shows that applications have different execution phases and that they require different amounts of resources in each phase. On that account, several dynamic switching algorithms are employed to examine the limits of the power and performance improvements possible in a heterogeneous multiprocessing organization environment. Huge energy reductions with little performance penalty are achieved simply by moving applications to a better-matched processor.

For almost ten years now, multiprocessing systems have increasingly been taking over the general-purpose processor marketplace. Intel and AMD have been using this approach to speed up their high-end processors. In 2006, Intel shipped its multiprocessor chip based on the homogeneous architecture strategy. The Intel Core Duo is composed of two processing elements that communicate with each other through an on-chip cache memory. In this project, Intel thought beyond the benefits of such a system and created an approach to increase the process yield. A new processor market line, named Intel Core Solo, was created to increase the process yield by selling even Core Duo dies with manufacturing defects. In this way, the Intel Core Solo has the very same two-core die as the Core Duo, but only one core is defect-free.

Recently, embedded processors have been following the trend of high-end general-purpose processors by coupling many processing elements, with the same architecture, on a single die.
Earlier, due to the hard constraints of these designs and the few parallel applications that would benefit from several GPPs, homogeneous multiprocessors were not suitable for this domain. However, the embedded software scenario is getting similar to the personal computer one, due to the convergence of applications to embedded devices, already discussed in the beginning of this work. The ARM Cortex-A9 processor is the pioneer in employing the homogeneous multiprocessing approach in the embedded domain, coupling up to four Cortex-A9 cores into a single die. Each processing element uses powerful techniques for ILP exploration, such as superscalar execution and SIMD instruction set extensions, which closes the gap between embedded processor designs and high-end general-purpose processors.

The Texas Instruments strategy better illustrates the embedded domain trend of using multiprocessor systems. This heterogeneous architecture handles in hardware the most widely used applications on embedded devices, like multimedia and digital signal processing. In 2002, Texas Instruments launched in the market an Innovator Development Kit (IDK) targeting high performance and low power consumption for multimedia applications. The IDK provides easy design development, with open software, based on a customized hardware platform called the Open Multimedia Applications Processor (OMAP). Since its launch, OMAP has been a successful platform, being used by the embedded market leaders, like Nokia with its N90 cell phone series, the Samsung OMNIA HD and the Sony Ericsson IDOU. Currently, due to the large diversity found in the embedded consumer market, Texas Instruments has divided the OMAP family into two different lines, covering different aspects. The high-end OMAP line supports the current sophisticated smartphones and powerful cell phone models, providing pre-integrated connectivity solutions for the latest technologies (3G, 4G, WLAN, Bluetooth and GPS) and audio and video applications (WUXGA), also including high-definition television. The low-end OMAP platforms cover down-market products, providing older connectivity technologies (GSM/GPRS/EDGE) and low-definition displays (QVGA).
Recently, Texas Instruments released one of its latest high-end products. The OMAP4440 covers connectivity besides high-quality video, image and audio support. This mobile platform came to supply the needs of the increasing multimedia application convergence in a single embedded device. The platform incorporates the dual-core ARM Cortex-A9 MPCore, providing higher mobile general-purpose computing performance. The power management technique available in the ARM Cortex-A9 MPCore balances the power consumption with the performance requirements, activating only the cores that are needed for a particular execution. In addition, due to the high performance requirements of today's smartphones, up to eight threads can be fired concurrently in the MPCore, since each MPCore can be composed of four single-core Cortex-A9 processors. The single-core ARM Cortex-A9 implements superscalar execution, a SIMD instruction set and DSP extensions, showing almost the same processing power as a personal computer in an embedded mobile platform. Apart from the ARM Cortex MPCore, the remaining processing elements are dedicated to multimedia execution.

In 2011, NVIDIA introduced the mobile processor project named Kal-El (NVIDIA, 2011). This project is the first to encapsulate four processors in a single die for mobile computation. The main novelty introduced by this project is the Variable Symmetric Multiprocessing (vSMP) technology. vSMP introduces a fifth processor, named the Companion Core, that executes tasks at a low frequency for the active standby mode, as mobile systems tend to stay in this mode most of the time. All five processors are ARM Cortex-A9 cores, but the Companion Core is built in a special low-power silicon process. In addition, all cores can be enabled/disabled individually, and when the active standby mode is on, only the Companion Core works, so battery life can be significantly improved. NVIDIA reports that the switching from the Companion Core to the regular cores is supported only by hardware and takes less than 2 milliseconds, being imperceptible to the end users. In comparison with the Tegra 2 platform, vSMP achieves up to 61% energy savings when running HD video playback.
As with OMAP, Samsung designs are focused on multimedia-based development. Their projects are very similar due to the increasing market demand for powerful multimedia platforms, which stimulates designers to take the same decisions to achieve efficient multimedia execution. Commonly, the integration of specific accelerators is used, since this reduces design time by avoiding validation and testing time. In 2008, Samsung launched the most powerful member of its Mobile MPSoC family. At first, the S3C6410 was a multimedia MPSoC like the OMAP4440. However, after its deployment in the Apple iPhone 3G, it became one of the most popular MPSoCs, shipping 3 million units during the first month of its lifetime. Later, Apple developed the iPhone 3GS, which assures better performance with lower power consumption. These benefits are supplied by the replacement of the S3C6410 architecture with the high-end S5PC100 version.

Following the multimedia-based multiprocessor trend, Samsung platforms are composed of several application-specific accelerators, building heterogeneous multiprocessor architectures. The S3C6410 and S5PC100 have a central general-purpose processing element, in both cases ARM-based, surrounded by several multimedia accelerators tightly targeted to DSP processing. Both platform skeletons follow the same execution strategy, changing only the processing capability of their IP cores. Small platform changes were made from the S3C6410 to the S5PC100 aiming to increase performance. More specifically, a 9-stage pipelined ARM1176JZF-S core with SIMD extensions is replaced by a 13-stage superscalar-pipelined ARM Cortex-A8, providing greater computation capability for general-purpose applications. Besides its double-sized L1 cache compared to the ARM1176JZF-S, the ARM Cortex-A8 also includes a 256KB L2 cache, avoiding external memory accesses due to L1 cache misses. NEON ARM technology is included in the ARM Cortex-A8 to provide flexible and powerful acceleration for intensive multimedia applications.
Its SIMD-based execution accelerates multimedia and signal-processing algorithms, such as video encoding/decoding, 2D/3D graphics, speech processing and image processing, at least twice as fast as the previous SIMD technology. However, these hardware changes impose mandatory tool chain modifications to support the use of the new dedicated hardware, which consequently breaks binary compatibility, since software developers must change and recompile the application code.

Regarding multimedia accelerators, both systems are able to provide suitable performance for any high-end mobile device. However, the S5PC100 includes the latest multimedia codec support using powerful accelerators. This strategy of changing some platform elements from the S3C6410 to the S5PC100 illustrates the growth and maturity phases of the functionality lifecycle discussed in the beginning of this work. In this phase, the electronic consumer market has already absorbed these functionalities, and their hard-wired execution is mandatory for energy and performance efficiency.

Other multiprocessing systems have already been released in the market with different goals from the architectures discussed before. Sony, IBM and Toshiba have worked together to design the Cell Broadband Engine Architecture (CHEN, RAGHAVAN, et al., 2007). The Cell architecture combines a powerful central processor with eight SIMD-based processing elements. Aiming to accelerate a large range of application behaviors, the IBM PowerPC architecture is used as the general-purpose processor. In addition, this processor has the responsibility of managing the processing elements surrounding it. These processing elements, called synergistic processing elements (SPE), are built to support streaming applications with SIMD execution. Each SPE has a local memory that can only be accessed through explicit and particular software directives. These facts make software development for the Cell processor even more difficult, since the software team must be aware of this local memory and manage it at the software level to better exploit SPE execution.
Despite its high processing capability, the Cell processor does not yet have large market acceptance because of the intrinsic difficulty of coding software to use the SPEs. When it was launched, the PlayStation console did not capture a great part of the gaming entertainment marketplace: the game developers did not have enough knowledge of the tool chain libraries to efficiently explore the complex Cell architecture, which resulted in a restricted number of games available in the market.

Homogeneous multiprocessing system organization is also explored in the market, mainly for personal computers with general-purpose processors, because of the huge number of different applications that these processors have to face, and hence the difficult task of defining specialized hardware accelerators. In 2005, Sun Microsystems announced its first homogeneous multiprocessor design, composed of up to 8 processing elements executing the SPARC V9 instruction set. The UltraSPARC T1, also called Niagara (JOHNSON e NAWATHE, 2007), is the first multithreaded homogeneous multiprocessor, and each processing element is able to execute four threads concurrently. In this way, Niagara can handle up to 32 threads at the same time. Recently, with the deployment of the UltraSPARC T2, this number has grown to 64 concurrent threads. The Niagara family targets massive data computation with distributed tasks, like the market for web servers, database servers and network file systems.

Intel has announced its first multiprocessing system based on a homogeneous organization, prototyped with 80 cores, which is capable of executing 1 trillion floating-point operations per second while consuming 62 Watts (VANGAL, HOWARD, et al., 2007). The company expects to launch this chip in the market within the next 5 years.
Hence, the x86 instruction set architecture era could be broken, since its processing elements are based on the very long instruction word (VLIW) approach, leaving to the compiler the responsibility for parallelism exploration. The interconnection mechanism of the 80-core chip uses a mesh network for communication among its processing elements. However, even with the mesh, communication turns out to be difficult, due to the great number of processing elements. Thus, this ambitious project uses a 20-Mbyte stacked on-chip SRAM memory to improve the processing elements' communication bandwidth.

The graphics processing unit (GPU) is another multiprocessing system approach, aimed at graphics-based software acceleration. However, this approach has also been arising as a promising architecture to improve general-purpose software. Intel Larrabee (SEILER, CARMEAN, et al., 2008) attacks both application domains thanks to its CPU- and GPU-like architecture. In this project, Intel employed the assumption of energy efficiency through simple core replication. Larrabee uses several P54C-based cores to explore general-purpose applications. In 1994, the P54C was shipped in 0.6um CMOS technology, reaching up to 100 MHz, and does not include out-of-order superscalar execution. However, some modifications have been made to the P54C architecture, like support for SIMD execution, aiming to provide more powerful graphics-based software execution. The Larrabee SIMD execution is similar to, but more powerful than, the SSE technology available in modern x86 processors. Each P54C is coupled to a 512-bit vector pipeline unit (VPU), capable of executing 16 single-precision floating-point operations in one processor cycle. In addition, Larrabee employs fixed-function graphics hardware that performs texture-sampling tasks like anisotropic filtering and texture decompression.
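The 16-operation figure follows directly from the vector width: a 512-bit register holds sixteen 32-bit single-precision floats, so one VPU instruction applies the same operation to all sixteen lanes at once. A minimal Python sketch of this lane arithmetic (illustrative only; Larrabee's actual vector instruction set is not modeled here):

```python
# A 512-bit vector register holds 512 / 32 = 16 single-precision lanes,
# so one VPU instruction operates on 16 floats in a single cycle.
VECTOR_BITS = 512
FLOAT_BITS = 32
LANES = VECTOR_BITS // FLOAT_BITS   # 16

def vpu_mul_add(a, b, c):
    """Emulate one vector multiply-add applied across all lanes: a*b + c."""
    assert len(a) == len(b) == len(c) == LANES
    return [ai * bi + ci for ai, bi, ci in zip(a, b, c)]

result = vpu_mul_add([1.0] * LANES, [2.0] * LANES, [0.5] * LANES)
print(LANES, result[0])   # 16 lanes, each computing 1.0*2.0 + 0.5 = 2.5
```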
However, in 2009, Intel discontinued the Larrabee project.

NVIDIA Tesla (LINDHOLM, NICKOLLS, et al., 2008) is another example of a multiprocessing system based on the concept of a general-purpose graphics processing unit. Its massively parallel computing architecture provides support for the Compute Unified Device Architecture (CUDA) technology. CUDA, NVIDIA's computing engine, eases the parallel software development process by providing software extensions in its framework. In addition, CUDA provides access to the native instruction set and memory of the processing elements, turning the NVIDIA Tesla into a CPU-like architecture. The Tesla architecture incorporates up to four multithreaded cores that communicate through a GDDR3 bus, which provides a huge data communication bandwidth.

Table 1. Summarized Commercial Multiprocessing Systems

Platform                    | Architecture  | Organization  | Cores                                                                      | Multithreaded Cores     | Interconnection
OMAP4440                    | Heterogeneous | Heterogeneous | 2 ARM Cortex-A9, 1 PowerVR graphics accelerator, 1 Image Signal Processor  | No                      | Integrated Bus
Samsung S3C6410/S5PC100     | Heterogeneous | Heterogeneous | 1 ARM1176JZF-S, 5 Multimedia Accelerators                                  | No                      | Integrated Bus
Cell                        | Heterogeneous | Heterogeneous | 1 PowerPC, 8 SPE                                                           | No                      | Integrated Bus
Niagara                     | Homogeneous   | Homogeneous   | 8 SPARC V9 ISA                                                             | Yes (4 threads)         | Crossbar
Intel 80 Cores              | Homogeneous   | Homogeneous   | 80 VLIW                                                                    | No                      | Mesh
Intel Larrabee              | Homogeneous   | Homogeneous   | n P54C x86 cores, SIMD execution                                           | No                      | Integrated Bus
NVIDIA Tesla (GeForce 8800) | Homogeneous   | Homogeneous   | 128 Stream Processors                                                      | Yes (up to 768 threads) | Network

As discussed in the beginning of this work, the employment of multiprocessing systems is a consensus for the current/next generation of both general-purpose and embedded processors, since aggressive exploration of the instruction level parallelism of single-threaded applications does not provide an advantageous tradeoff between extra transistor usage and performance improvement. All multiprocessing system designs mentioned in this section somehow explore thread level parallelism. Summarizing all the commercial multiprocessing systems discussed before, Table 1 compares their main characteristics, showing their differences depending on the target market domain. Heterogeneous architectures, like OMAP, Samsung and Cell, incorporate several specialized processing elements to attack specific applications for highly constrained mobile or portable devices. These architectures have multimedia-based processing elements, following the trend of embedded systems. However, as mentioned before, software productivity is affected when such a strategy is used: each new platform launch implies tool chain modifications, like library descriptions, to explore the execution of the coupled specialized hardware. In addition, this approach can be optimized for performance and area, but such platforms are costly to design and not programmable, making
upgradability a difficult task, and they bring no benefit outside the targeted applications.

Unlike heterogeneous architectures, homogeneous ones aim at the general-purpose processing market, handling a wide range of application behaviors by replicating general-purpose processors. Commercial homogeneous architectures still use only homogeneous organizations, coupling several processing elements with the same ISA and the same processing capability. Heterogeneous organizations have not been used in homogeneous architectures, since power management techniques, like DVFS, support variable processing capability. However, most of these techniques are restricted to reducing only dynamic power; the circuit still consumes leakage power, which increases with technology scaling. Even supposing a perfect power management scheme that solves both dynamic and leakage power, a platform with homogeneous architecture and organization still relies on a huge area overhead, which supports the need for a homogeneous architecture and heterogeneous organization strategy.

2.3 Multi-Threaded Reconfigurable Systems

As the scope of this work is motivated by multiprocessing systems that use some kind of adaptability in exploiting instruction level parallelism, this sub-section only contemplates the state-of-the-art research that employs multiprocessing systems together with reconfigurable architectures.

In (KOENIG, BAUER, et al., 2010), the authors propose KAHRISMA, a
heterogeneous organization and architecture platform. Figure 6 shows an overview of the KAHRISMA architecture; its multiple instruction sets (RISC, 2- and 6-issue VLIW, and EPIC) coupled with fine- and coarse-grained reconfigurable encapsulated datapath elements (EDPE) are the main novelty of this research. The resource allocation task is totally supported by a flexible software framework that, at compile time, analyzes the high-level C/C++ source code and builds an internal code representation. This code representation goes through an optimization process for dead code elimination and constant propagation. Afterwards, the internal representation is used to identify/select the parts of code that will implement custom instructions (CIs) to be executed in the reconfigurable arrays (FG- and CG-EDPEs).

The entire process considers that the amount of free hardware resources can vary at runtime, since some parts of code could present a greater number of parallel executing threads than others, so multiple implementations of the custom instructions are provided. The runtime system is responsible for selecting the best CI solution, which depends on the loading state of the architecture. Thus, the execution of a certain part of code can vary from a RISC implementation (low performance) to a custom instruction implementation using FG- as well as CG-EDPEs (high performance). Speedups are shown in the execution of a very compute-intensive kernel from the H.264 video encode/decode standard when exploring multiple ISAs in a multithreaded scenario.

However, this approach fails at several crucial constraints of embedded systems. High memory usage is caused by the generation of multiple assembly versions of the same part of code, which cannot always offer speedups due to the restricted amount of hardware resources available at a certain time. KAHRISMA is able to optimize multi-threaded applications, but it also relies on compiler support, static profiling and a tool to associate the code or custom instructions with the different hardware components at design time.
Despite inserting reconfigurable components in its platform, KAHRISMA maintains the main drawbacks of current embedded multiprocessing systems (e.g. OMAP), since it imposes a mandatory time overhead on each platform change to produce custom instructions, affecting software productivity by breaking binary compatibility.

Figure 6. KAHRISMA architecture overview (KOENIG, BAUER, et al., 2010)

Considering a system with homogeneous architecture and heterogeneous organization, one can find Thread Warping (TW) (STITT e VAHID, 2007), which extends the aforementioned Warp Processing system shown in Section 2.1. Prior work developed a CAD algorithm that dynamically remaps critical code regions of single-threaded applications from processor instructions to FPGA circuits using runtime synthesis. The contribution of TW consists of integrating the existing CAD algorithm into a framework capable of dynamically synthesizing many thread accelerators. Figure 7 gives an overview of the TW architecture and shows how the acceleration process occurs. As can be seen, TW is composed of four ARM11 microprocessors, a Xilinx Virtex IV FPGA and an on-chip CAD hardware block used for the synthesis process.

The thread creation process shown in Step 1 of Figure 7 is totally supported by an Application Programming Interface (API), so no source code modification is needed. However, changes in the operating system are mandatory to support the scheduling process. The operating system scheduler maintains a queue that stores the threads ready for execution. In addition, a structure named the schedulable resource list (SRL) holds the list of free resources. Thus, to trigger the execution of a thread, the operating system checks whether the resource requirements of a certain ready thread match the free resources in the SRL. One ARM11 is totally dedicated to running the operating system tasks needed to synchronize threads and to schedule their kernels in the FPGA (Step 2 of Figure 7). The framework, implemented in hardware, analyzes waiting threads and utilizes on-chip CAD tools to create custom accelerator circuits for execution in the FPGA (Step 3).
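The ready-queue/SRL matching step described above can be sketched as follows. This is a minimal Python sketch; the resource names and quantities are illustrative assumptions, since the original paper does not specify them.

```python
from collections import Counter

# Hypothetical free-resource pool (the SRL) and a ready queue of threads,
# each declaring the resources it needs before it can be dispatched.
srl = Counter({"arm11": 3, "fpga_region": 2})   # one ARM11 is reserved for the OS

ready_queue = [
    {"name": "t0", "needs": Counter({"fpga_region": 1})},
    {"name": "t1", "needs": Counter({"arm11": 1})},
    {"name": "t2", "needs": Counter({"fpga_region": 3})},   # cannot fit yet
]

def dispatch(ready_queue, srl):
    """Start every ready thread whose requirements fit the free resources."""
    started = []
    for thread in list(ready_queue):
        if all(srl[r] >= n for r, n in thread["needs"].items()):
            srl.subtract(thread["needs"])   # claim the resources
            ready_queue.remove(thread)
            started.append(thread["name"])
    return started

print(dispatch(ready_queue, srl))   # t2 stays queued: not enough FPGA regions
```

A thread whose requirements exceed the SRL simply stays in the ready queue until earlier threads release resources, which mirrors the matching rule the OS scheduler applies.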
After some time, 22 minutes on average, the CAD tool finishes mapping the accelerators onto the FPGA and stores the custom accelerator circuits in a non-volatile library for future executions, named AccLib in Figure 7. Assuming that the application has not finished during these 22 minutes, the operating system (OS) begins scheduling threads onto both FPGA accelerators and microprocessor cores (Step 4). Since the area requirements of the existing accelerators could exceed the FPGA capacity, a greedy knapsack heuristic is used to generate a solution for the instantiation process of the accelerators in the FPGA.

Figure 7. Overview of Thread Warping execution process (STITT e VAHID, 2007)

Despite its dynamic nature, which provides binary compatibility, there are several drawbacks in the Thread Warping proposal. First, there is an unacceptable latency in creating the accelerators with the CAD tool for applications that run for less than 22 minutes. TW shows good speedups (502 times) when the initial execution of the applications is not considered. In other words, these results do not consider the period when the CAD tool is working to create the custom accelerator circuits. When the custom instruction creation overhead is taken into account, all but one of the ten algorithms show performance loss. Summarizing, Thread Warping presents the same deficiency as the original work shown in Section 2.1: only critical code regions are optimized, due to the high overhead in time and memory imposed by the dynamic detection hardware. Thus, TW only optimizes applications with few and very well-defined kernels, which narrows its field of application. The optimization of a few kernels will very likely not satisfy the performance requirements of future embedded systems, where a high concentration of different software behaviors is foreseen (SEMICONDUCTORS, 2009).

In (YAN, WU, et al., 2010), Yan proposes the coupling of many FPGA-based reconfigurable processing units to SPARC V9 general-purpose processors. ISA extensions are made to support the execution of the reconfigurable processing units.
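Returning to Thread Warping's instantiation step: the greedy knapsack heuristic mentioned above can be sketched as below. The areas, speedups and the speedup-per-area value metric are illustrative assumptions; the paper only states that a greedy knapsack heuristic is used.

```python
# Hypothetical greedy knapsack: choose which thread accelerators to
# instantiate in the FPGA when their combined area exceeds capacity.
# Areas, speedups and the ranking metric are illustrative assumptions.

def select_accelerators(candidates, fpga_capacity):
    """Greedily pack accelerators by speedup-per-area ratio."""
    ranked = sorted(candidates, key=lambda a: a["speedup"] / a["area"],
                    reverse=True)
    chosen, used = [], 0
    for acc in ranked:
        if used + acc["area"] <= fpga_capacity:
            chosen.append(acc["name"])
            used += acc["area"]
    return chosen   # threads without an accelerator run on the ARM11 cores

candidates = [
    {"name": "fir",  "area": 40, "speedup": 20.0},
    {"name": "idct", "area": 70, "speedup": 25.0},
    {"name": "crc",  "area": 10, "speedup":  3.0},
]
print(select_accelerators(candidates, fpga_capacity=100))  # ['fir', 'crc']
```

As with any greedy knapsack, the result is not guaranteed optimal (here "idct" is skipped even though it has the largest absolute speedup), which is the usual price paid for a fast runtime decision.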
However, the system can also work without using the accelerators, in a backward-compatible manner. An overview of the reconfigurable architecture is shown in Figure 8. As can be seen, a crossbar is employed to connect the reconfigurable processing units to the homogeneous SPARC V9-based processors, which provides low-latency parallel communication.

Figure 8. Blocks of the Reconfigurable Architecture (YAN, WU, et al., 2010)

The Reconfigurable Processing Unit (RPU) is a data-driven computing system based on a fine-grained reconfigurable logic structure similar to the Xilinx Virtex-5. The RPU is composed of configurable logic block arrays to synthesize the logic; a local buffer responsible for the communication between the RPU and the SPARC V9 processors; a configuration context that stores the already implemented custom instructions; and a configuration selection multiplexer that selects the fetched custom instructions from the configuration context. Like Thread Warping, this approach also employs an extra circuit to provide consistency and synchronization on data memory accesses.

A software-hardware co-operative implementation is used to support the triggering of the reconfigurable executions. The execution is divided into four phases: configuring, pre-load, processing and post-store. The configuring phase starts when a special instruction that requests an RPU execution arrives at the execution stage of the SPARC V9 processor. If the custom instruction is available in the configuration context, the pre-load phase starts and an interrupt is generated to notify the operating system scheduler to configure the RPU with the configuration context. In this phase, the data required for the computation are also loaded into the local buffer of the respective RPU.
In the processing phase, the data-driven computing is done. Finally, some special instructions are fired to fetch the results from the local buffer and to return the execution process to the SPARC V9 processor.

This approach improves performance over software-only execution by, on average, 2.4 times in an application environment composed of an encryption standard and an image encoding algorithm. However, some implementation aspects make such an approach not viable for the embedded domain: binary compatibility is broken, since a compilation phase is used to extend the original SPARC V9 instruction set to support RPU execution. Moreover, the fine-grained reconfigurable structure implies a high reconfiguration overhead, which narrows the scope of such an approach to applications where very few kernels cover almost the whole execution time.

Differently from the other approaches, in (SMIT, 2008) a multiprocessing reconfigurable architecture focused on accelerating streaming DSP applications is presented. The authors argue that it is easier to control the reconfigurable architecture when handling such applications, since most of them can be specified as a data flow graph with streams of data items (the edges) flowing between computation kernels (the nodes). The Annabelle SoC is presented in Figure 9; its heterogeneous architecture and organization aggregates a traditional ARM926 that is surrounded by ASIC blocks (e.g. Viterbi decoder and DDC) and four domain-specific coarse-grained reconfigurable datapaths, named Montium cores. A network-on-chip infrastructure supports inter-Montium communication with higher bandwidth and multiple concurrent transmissions. The communication among the rest of the system elements is done through a 5-layer AMBA bus. As each processor operates independently, they need to be controlled separately, so the ARM926 processor controls the other cores by sending configuration messages to their network interfaces. Since the cores might not be running at the same clock speed as the NoC, the network interface synchronizes the data transfers.

Figure 9.
Block Diagram of Annabelle SoC (SMIT, 2008)

Figure 10 depicts the architecture of a single Montium core, which has five 16-bit-wide arithmetic and logic units interconnected with 10 local memories, due to the high bandwidth required by DSP applications. An interesting point considered in this work is locality of reference: accesses to a small, local memory are much more energy efficient than accesses to a big, distant memory, because of the increasing wire capacitance in recent nanometer technologies. A communication and configuration unit provides the functionality to configure the Montium, to manage the memories by means of direct memory access (DMA), and to start/wait/reset the computation of the configured algorithm. Since the Montium core is based on a coarse-grained reconfigurable architecture, the configuration memory is relatively small: on average, it occupies only 2.6 Kbytes. Because the configuration memory can be accessed as a RAM memory, the system allows dynamic partial reconfiguration. Results show that energy savings can be achieved just by exploiting locality of reference. In addition, this work supports the use of coarse-grained reconfigurable architectures by demonstrating lower reconfiguration time overhead. Despite the fact that Annabelle explores a reconfigurable fabric to accelerate streaming applications, this system still relies on a heterogeneous ISA implementation by coupling ASICs to provide energy- and performance-efficient execution. Like OMAP, such an approach affects software productivity, since each new platform requires tool chain modifications.

Figure 10. The Montium Core Architecture

Studies on sharing a reconfigurable fabric among general-purpose processors are shown in (WATKINS, CIANCHETTI e ALBONESI, 2008) and (GARCIA e COMPTON, 2008). These strategies are supported by the huge area overhead and the non-concurrent utilization of the reconfigurable units by multiple processors. In (GARCIA e COMPTON, 2008), a reconfigurable fabric sharing approach focused on accelerating multithreaded applications is presented.
This work exploits a type of parallelism named single program multiple data (SPMD), where each thread instantiation runs the same set of operations on different data. Multiple instantiations of the Xvid encoder are used to emulate this type of parallelism, acting as a digital video recorder encoding multiple video streams from different channels simultaneously. To avoid low utilization of the reconfigurable hardware kernels, different threads share the already configured reconfigurable hardware kernels. For example, if two instances of Xvid are executing, a single physical copy of each reconfigurable hardware kernel could be shared, so both instances of Xvid can benefit from it. Although the work does not specify any particular reconfigurable hardware design, the Xvid encoder instantiations are synthesized on a Xilinx Virtex-4 FPGA.

First experiments show that sharing a single physical copy of each reconfigurable hardware kernel among all Xvid instances performs very poorly, due to the frequent contention in accessing the kernels. Thus, the authors conclude that not all kernels can be effectively shared, so they created a modified strategy to provide better kernel allocation. This approach uses the concept of virtual kernels to control the physical kernel allocation. The algorithm uses the following strategy: when an application attempts to access a virtual kernel, the controller first checks whether any instance of the corresponding virtual kernel is already mapped to a physical kernel and whether any other physical kernel is free. If multiple physical kernels are available, one of them will be reserved to execute the virtual kernel, even if another physical kernel is already executing the same virtual kernel. This strategy eliminates the waiting for a busy shared physical kernel, increasing the combined throughput of the Xvid encoder in a multiprocessor system by 95-130% over software-only execution.

Watkins (WATKINS, CIANCHETTI e ALBONESI, 2008) proposes, as a first work, a shared specialized programmable logic (SPL) to decrease the large power and area costs of FPGAs when a multiprocessing environment is considered.
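Returning to the virtual-to-physical kernel allocation strategy of (GARCIA e COMPTON, 2008) described above, the controller's decision can be sketched as follows. This is a minimal Python sketch; the data structures and kernel names are illustrative assumptions, not the paper's actual controller design.

```python
# Hypothetical controller state: each physical kernel slot is either free
# (None) or holds the name of the virtual kernel currently configured on it.
physical_slots = [None, None, "dct"]

def allocate(virtual_kernel, slots):
    """Map a virtual kernel access to a physical slot, preferring a free
    slot even if the same virtual kernel is already mapped elsewhere,
    to avoid waiting on a busy shared physical kernel."""
    for i, occupant in enumerate(slots):
        if occupant is None:            # a free physical kernel exists:
            slots[i] = virtual_kernel   # reserve it for this access
            return i
    # No free slot: fall back to sharing an existing mapping, if any.
    for i, occupant in enumerate(slots):
        if occupant == virtual_kernel:
            return i                    # wait on the busy shared kernel
    return None                         # no mapping possible; run in software

print(allocate("dct", physical_slots))  # takes free slot 0, not busy slot 2
```

The key design choice mirrored here is the preference order: a free physical kernel is claimed even when a copy of the same virtual kernel is already configured, which is exactly what eliminates the contention observed in the first experiments.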
The main motivation for applying such an approach in multiprocessing systems is the intermittent use of the reconfigurable fabric: there are inevitably periods where one fabric is highly utilized while another lies largely or even completely idle. This motivation is supported by interesting experiments that show the poor utilization of the SPL rows when running applications of different domains on a multiprocessing system composed of eight cores. These data are depicted in Figure 11. The leftmost bars for the individual benchmarks show the utilization of a fabric with a 26-row configuration, which reflects twice the area of each core to which the SPL is coupled. The utilization of seven of the SPL fabrics is less than 10%, and the average SPL utilization is only 7%.

As can be seen in Figure 11, reducing each SPL to 12 rows (roughly the same area as the coupled core) increases the SPL utilization for some benchmarks and greatly reduces the occupied area. However, this comes at a high cost: an 18% overall performance loss, since all benchmarks use more than 12 rows. The two rightmost bars of Figure 11 show a spatially shared SPL organization with a naive control policy that equally divides the rows of the SPL among all cores at all times. Thus, an SPL fabric configuration composed of 24 rows shared among four cores (fourth bar of Avg Utilization in Figure 11) produces, on average, an utili