hardware-software partitioning

Hardware-Software Partitioning

Upload: abena

Post on 25-Feb-2016



TRANSCRIPT

Hardware-Software Partitioning and Co-Design Principles

Hardware-Software Partitioning
Definition: Given an application, HW/SW partitioning maps each region of the application onto either hardware (custom circuits) or software (microprocessors), but not both.
A partition is a mapping of each region to either HW or SW.
Mapping is done to meet certain design goals within constraints.

EEL6935 / 52

Design Constraints & Goals

Space, Area, Power, Performance, Yield, Schedule, Cost

You cannot get away with everything!

Challenges

Application with Multiple Hardware/Software Options
[Figure: processes FIR(), ACCUM() and SEARCH(), each with several hardware implementation options (area and execution time), under an area budget. Acknowledgement: modified from G. Stitt's slides in EEL5721.]
SW time: 50 s, 30 s, 20 s
Option times shown in the figure (unlabeled in the transcript): 10 s, 15 s, 25 s, 10 s, 5 s, 12 s, 8 s, 5 s; 5 s, 25 s, 10 s, 10 s, 15 s
Possible solutions:
  Use the fastest implementations
  Use the smallest implementations
  Consider all middle implementations
Performance:
  5 + 30 + 20 = 55 s
  25 + 15 + 10 = 50 s
  10 + 15 + 20 = 45 s
Figure labels: Best Partition, 15 s

Mathematical Modeling to arrive at the Optimum H/W-S/W Partition
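The search for a best partition under an area budget can be sketched as a brute-force enumeration. This is a minimal illustration, not the modeling used in the slides: the option table, its area and time numbers, and the names `OPTIONS` and `best_partition` are all hypothetical, and the sketch assumes the processes run one after another so that their times add, as in the sums above.

```python
from itertools import product

# Hypothetical implementation options per process: (label, area, time_s).
# These numbers are illustrative only; they are not taken from the slide figure.
OPTIONS = {
    "FIR":    [("sw", 0, 50), ("hw_small", 10, 25), ("hw_fast", 30, 5)],
    "ACCUM":  [("sw", 0, 30), ("hw_small", 8, 15), ("hw_fast", 20, 10)],
    "SEARCH": [("sw", 0, 20), ("hw_small", 5, 12), ("hw_fast", 15, 8)],
}


def best_partition(options, area_budget):
    """Try every combination of options and keep the fastest one that fits.

    Returns (total_time, total_area, {process: chosen_label}) or None
    if no combination fits the budget.
    """
    names = list(options)
    best = None
    for choice in product(*(options[n] for n in names)):
        area = sum(c[1] for c in choice)
        time = sum(c[2] for c in choice)  # assumes sequential execution
        if area > area_budget:
            continue
        if best is None or time < best[0]:
            best = (time, area, {n: c[0] for n, c in zip(names, choice)})
    return best


if __name__ == "__main__":
    print(best_partition(OPTIONS, area_budget=40))
```

Exhaustive enumeration grows exponentially with the number of regions, which is why realistic partitioners rely on mathematical models or heuristics instead.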

Granularity

Dynamic Hardware-Software Partitioning: A First Approach
Greg Stitt, Roman Lysecky, Frank Vahid, University of California, Riverside
DAC 2003, June 2-6, 2003, Anaheim, California, USA

Dynamic Hardware-Software Partitioning
Dynamically identify critical software kernels, loops, etc. and re-implement them in configurable fabric in order to achieve better performance, lower energy, or meet other design goals.

Multiple Applications: An Illustration

Application Usage Profile: An Illustration

Users: Mr. Jazz, Mr. Luigi, Mr. MTB
Applications: Music, Games, GPS, Data Access, SMS, Calls
Different users have different usage profiles.
While designing a product, a usage profile needs to be assumed to give the best user experience. However:
  The usage profile (application usage) may be user/code dependent, e.g. MP3, camera, video playback, calls.
  The usage profile may change over time.
  A generic product is optimal for the profile it assumes but sub-optimal in area or performance for other usage profiles.

Profiling in real time is key:
  The usage profile may identify critical kernels.
  Critical components may be pushed to the configurable area to boost performance and reduce energy.

Dynamic HW/SW Partitioner Requirements
Detect critical code regions
Decompile and synthesize them to hardware
Place and route the hardware onto on-chip configurable logic
Update the binary to communicate with the logic

Wait! Did you say on-chip PnR? You've got to be kidding, right?
All of the above, with very lean algorithms that can be implemented on-chip.

Binary Level Partitioning and Advantage
Partitioning at the binary level, offline or online
Steps:
  Identify critical code sections, i.e. loop-heavy sections
  Consider assembly code and object code as HW candidates
  Push these to configurable hardware
Advantage:
  Works with any software compiler
  Works with any high-level language

The paper uses the binary-level partitioning approach.
Critical loops are identified and implemented in the configurable logic.

Why Binary Level Partitioning instead of higher level optimizations?
Dynamic partitioning:
  Needs to run on a small on-chip partitioning system
  Needs to be lean enough to perform place and route etc. on-chip
Higher-level partitioning methodologies may be good for offline analysis, but are very difficult to implement on-chip due to the compute constraints.

HW/SW Partitioning of Software Binary

Acknowledgement: Figure taken from G. Stitt, F. Vahid, "HW/SW Partitioning of Software Binaries," ICCAD, Nov. 2002

System Architecture (Top)

Microprocessor and memory for the normal software application
On-chip configurable module:
  1. Detects the most frequently executed software region
  2. Re-implements (1) in the configurable logic
Architecture based on Triscend A7 (60 MHz)

System Architecture (Sub Blocks)

Direct Memory Access controller to access memory
Input and output: 32-bit i/p and o/p registers
Decompiles and synthesizes selected binary regions for HW implementation
Detects the most frequently executed application-software loops
Partitioning co-processor overhead:
  Not much: very lean compared to the main processor
  A platform with multiple main processors may share a single partitioning co-processor, reducing the overhead further

Simplified Configurable Logic Fabric
A simplified fabric was designed to support just inner-loop implementation.
Mapping, placing and routing a design onto a general configurable logic fabric is time consuming.

Architecture Limitations
No sequential logic support in the configurable logic (in the platform chosen)
Constraints:
  Loops to be implemented must have a body that can be implemented in a single cycle.
  The number of loop iterations must be determined before the loop executes, in order to specify the DMA block request size.
The number of iterations may be determined:
  Statically, in the case of constant bounds
  Dynamically, which requires extra instructions to configure the size of the DMA block request before HW execution starts

United States Patent 5,440,245: Galbraith et al., August 8, 1995, "Logic module with configurable combinational and sequential blocks"

CLF Architecture

Either-side connectability (only at bottom)
4 channels: given channel to given channel

Tool Flow: Loop Profiler

Detects critical SW regions that should be implemented in HW
Is non-intrusive
Monitors instruction addresses on the memory bus
Increments the branch frequency in the cache for a given backward branch
Small cache with a dozen entries, needed to save area and power
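The profiler behavior listed above can be modeled in software as a small frequency cache keyed by backward branches. The following is a sketch under stated assumptions, not the hardware design from the cited work: a backward branch is inferred whenever the observed address decreases (so returns and backward jumps are counted too), the eviction policy (drop the least-frequent entry) is an assumption, and the class name `LoopProfiler` is hypothetical.

```python
class LoopProfiler:
    """Software model of a tiny backward-branch frequency cache."""

    def __init__(self, entries=12):
        self.entries = entries   # "a dozen entries" per the slide
        self.counts = {}         # (branch address, target address) -> frequency
        self.prev_addr = None

    def observe(self, addr):
        """Feed one instruction address as seen on the memory bus."""
        prev, self.prev_addr = self.prev_addr, addr
        if prev is None or addr >= prev:
            return  # sequential fetch or forward jump: not a loop candidate
        key = (prev, addr)
        if key not in self.counts:
            if len(self.counts) >= self.entries:
                # Assumed replacement policy: evict the least-frequent entry.
                victim = min(self.counts, key=self.counts.get)
                del self.counts[victim]
            self.counts[key] = 0
        self.counts[key] += 1

    def hottest(self):
        """Return the most frequently seen backward branch, or None."""
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)


if __name__ == "__main__":
    profiler = LoopProfiler()
    for address in [100, 104, 108] * 5:  # a three-instruction loop, five passes
        profiler.observe(address)
    print(profiler.hottest(), profiler.counts)
```

Because the model only watches addresses, it never touches the running program, which mirrors the non-intrusive property the slide emphasizes.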

Reference: Ann Gordon-Ross et al., "Frequent Loop Detection Using Efficient Nonintrusive On-Chip Hardware," IEEE Transactions on Computers, vol. 54, no. 10, October 2005

Decompilation
Converts software loops into a higher-level abstraction more suitable for synthesis
Step 1: Converts each assembly instruction to a register transfer
Step 2: Using the register transfers, builds:
  CFG (Control Flow Graph) for the software region
  DFG (Data Flow Graph) by parsing the register transfers
Step 3: Applies compiler optimizations to remove overhead due to the assembly code and instruction set
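Steps 1 and 2 of the decompilation flow can be illustrated for a single straight-line block. This is a simplified sketch, not the tool described in the paper: the three-operand assembly syntax (`op dest, src1, src2`) and the function names are hypothetical, and control flow (the CFG) and the Step 3 optimizations are left out.

```python
def to_register_transfer(instr):
    """Step 1: turn 'add r3, r1, r2' into ('r3', 'add', ('r1', 'r2'))."""
    op, operands = instr.split(None, 1)
    dest, *srcs = [field.strip() for field in operands.split(",")]
    return dest, op, tuple(srcs)


def build_dfg(instrs):
    """Step 2 (data flow only): link each operand to the instruction that produced it.

    Returns a list of edges (producer_index, consumer_index, register).
    Operands with no earlier writer (live-in registers, immediates) get no edge.
    """
    last_writer = {}  # register -> index of the instruction that last wrote it
    edges = []
    for index, instr in enumerate(instrs):
        dest, _op, srcs = to_register_transfer(instr)
        for src in srcs:
            if src in last_writer:
                edges.append((last_writer[src], index, src))
        last_writer[dest] = index
    return edges


if __name__ == "__main__":
    body = ["add r3, r1, r2", "shl r4, r3, r1", "xor r5, r4, r3"]
    for edge in build_dfg(body):
        print(edge)
```

The resulting edge list is the kind of dataflow structure that a later synthesis step could traverse to derive Boolean expressions for each output.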

DMA Configuration Tool
Function: Maps the memory accesses of the decompiled loop onto the DMA architecture
Involves detection of:
  Reads / writes
  Increment and decrement address updates
  Single and block request modes
Removes the following from the decompiled loop:
  Loop counters and exit conditions
  Address calculations, as only sequential locations are accessed
DMA functioning:
  DMA transfers the data needed before the loop starts
  After HW initialization, the HW starts a block request that fetches 1 memory location per cycle in case of a read or write

Register Transfer Synthesis
Converts each o/p bit into a Boolean expression by traversing the dataflow graphs of the software region
Limitation:
  Single-cycle executable loop bodies only
  Multi-cycle bodies would need behavioral synthesis to schedule loop operations

Logic Synthesis, Tech Mapping, P&R
Converts Boolean equations into a netlist
  Boolean equations are transformed into a DAG (directed acyclic graph) of the Boolean logic network
  Internal nodes of the DAG correspond to simple logic gates (AND, OR, INV, XOR)
Logic minimization
  Lightweight, suited for on-chip execution
  Applied at each node starting with the input nodes, while traversing through the network
  Uses a single expand phase to achieve good optimization
Tech mapping
  Traverses the DAG starting from the output nodes
  Combines nodes that may create a 3 i/p, 1 o/p LUT
  Further combines nodes (where possible) to form 3 i/p, 2 o/p LUTs
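To make the tech-mapping step concrete, the sketch below shows what "combining nodes into a 3-input, 1-output LUT" produces: the LUT contents for a cone of simple gates with at most three inputs. It is only an illustration under assumptions. The netlist encoding and the names `GATES`, `evaluate` and `lut_contents` are hypothetical, and the sketch does not decide which nodes to group, which is the actual covering problem a mapper has to solve.

```python
from itertools import product

# Gate library matching the simple gates named in the slide.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "INV": lambda a: 1 - a,
}


def evaluate(node, netlist, values):
    """Recursively evaluate a node of the gate DAG for given input values."""
    if node in values:
        return values[node]
    gate, *fanin = netlist[node]
    return GATES[gate](*(evaluate(f, netlist, values) for f in fanin))


def lut_contents(output, netlist, inputs):
    """Collapse the cone feeding `output` into one LUT truth table.

    Entry i holds the output for the input bits of i, first input as MSB.
    """
    if len(inputs) > 3:
        raise ValueError("cone does not fit a 3-input LUT")
    return [
        evaluate(output, netlist, dict(zip(inputs, bits)))
        for bits in product((0, 1), repeat=len(inputs))
    ]


if __name__ == "__main__":
    # out = (a AND b) XOR (NOT c): three gate nodes, three inputs.
    netlist = {
        "n1": ("AND", "a", "b"),
        "n2": ("INV", "c"),
        "out": ("XOR", "n1", "n2"),
    }
    print(lut_contents("out", netlist, ["a", "b", "c"]))
```

Here three gate nodes collapse into a single eight-entry table, which is the saving that LUT-based mapping aims for.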

LUT Placement Steps
Step 1: Determine the relative placement of LUTs to one another by determining the critical path and placing it on a horizontal row
Step 2: Place the remaining nodes according to their dependency (i/p or o/p) with respect to the placed nodes:
  Place above for inputs to placed nodes
  Place below for outputs from placed nodes
Step 3: Place in the configurable logic
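Steps 1 and 2 of the placement can be sketched as a longest-path search followed by a one-pass classification of the remaining nodes. This is a hedged illustration rather than the paper's algorithm: it assumes unit delay per node, classifies nodes only relative to the critical path (not iteratively against newly placed nodes), assigns no physical coordinates, and uses hypothetical names (`critical_path`, `relative_placement`).

```python
def critical_path(fanin):
    """Longest path through the DAG, assuming unit delay per node.

    fanin maps each node to the list of nodes that drive it
    (primary inputs map to an empty list).
    """
    depth, best_pred = {}, {}

    def visit(node):
        if node in depth:
            return depth[node]
        deepest, pred = 0, None
        for driver in fanin[node]:
            candidate = visit(driver) + 1
            if candidate > deepest:
                deepest, pred = candidate, driver
        depth[node], best_pred[node] = deepest, pred
        return deepest

    node = max(fanin, key=visit)  # deepest node ends the critical path
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return path[::-1]


def relative_placement(fanin):
    """Step 1: critical path on one row. Step 2: others above or below it."""
    path = critical_path(fanin)
    placed = set(path)
    rows = {node: "row" for node in path}
    for node in fanin:
        if node in placed:
            continue
        if any(node in fanin[p] for p in placed):
            rows[node] = "above"   # node is an input to a placed node
        elif any(driver in placed for driver in fanin[node]):
            rows[node] = "below"   # node consumes a placed node's output
        else:
            rows[node] = "unplaced"
    return path, rows


if __name__ == "__main__":
    graph = {"a": [], "b": [], "l1": ["a"], "l2": ["l1", "b"],
             "l3": ["l2"], "l4": ["l1"]}
    print(relative_placement(graph))
```

Keeping the critical path on one row keeps its wires short, which is the motivation the step list implies.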

Routing
Simple greedy algorithm
Routes wires in the most direct fashion:
  Route the wires between input nodes and LUTs
  Route wires from LUTs to outputs
  Route wires connecting LUTs together
Routing decisions are made at the switch matrices within the configurable logic fabric

Bitfile Creation
Combines the placed and routed hardware description with the DMA configuration information into a single bitfile
The bitfile can be used to initialize the configurable logic

Bitfile Modification
Update the software binary to utilize HW for the loops:
  Replace the original software instructions for the loop with a jump to HW initialization code
  The initialization code sends a HW enable signal through a memory-mapped register
  The code is followed by a microprocessor power-down trigger
  Upon finishing, the HW asserts a completion signal, causing a software interrupt
  The software interrupt wakes the microprocessor
  A jump instruction at the end of the hardware initialization code goes to the end of the original software loop

Tool: Performance and Area Overhead

Typical tools for decompilation, synthesis, and place and route need huge LSF machines.
The designed tools are very lightweight and geared towards the partitioning co-processor.

Data Size: Memory required for tool execution
Time: Execution time of each tool, assuming a 60 MHz clock and 1.5 cycles/instruction

Results

Definitions:

Loop Time Perc: Percentage of total software time spent in the implemented loops

Loop Size Perc: Percentage of the total instructions that the loop required

Ideal Speedup: Speedup assuming the HW-implemented loops execute in zero time

Sw Loop Time: Time required by the loop if completely in software

HW Loop Time: Time when the loop is implemented in HW

Where is the partitioning time?
Partitioning is a one-time event and would not be done often, so the partitioning time may be left out of consideration.

Conclusion
Dynamic HW/SW partitioning offers advantages over the traditional approach:
  Transparent, i.e. the benefits of partitioning are available even with regular software flows
  Can adapt to the actual usage profile
  Up to 2.6 average speedup
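The speedup definitions used in the results can be written out as short formulas. The sketch below follows directly from those definitions (ideal speedup assumes the hardware loops take zero time, so only the non-loop software time remains); the function names and the example numbers are illustrative assumptions, not data from the paper.

```python
def ideal_speedup(loop_time_fraction):
    """Speedup if the HW-implemented loops ran in zero time.

    loop_time_fraction is "Loop Time Perc" expressed as a fraction in [0, 1).
    """
    if not 0.0 <= loop_time_fraction < 1.0:
        raise ValueError("loop_time_fraction must be in [0, 1)")
    return 1.0 / (1.0 - loop_time_fraction)


def actual_speedup(total_sw_time, sw_loop_time, hw_loop_time):
    """Speedup when the loops take hw_loop_time instead of sw_loop_time."""
    partitioned_time = total_sw_time - sw_loop_time + hw_loop_time
    return total_sw_time / partitioned_time


if __name__ == "__main__":
    # Illustrative numbers only: 80% of a 100 s run is spent in the loops.
    print(ideal_speedup(0.8))
    print(actual_speedup(100.0, 80.0, 10.0))
```

The gap between the two values shows how much of the ideal gain is lost to the time the hardware itself needs.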

Areas of Improvement & Future Work
Power required by the partitioning module and the HW is specified as 10-20% of total power
  Power data for individual modules is not presented
Realistic loops have sequential logic and may not always be single cycle
  Extend the implementation to a CLF that supports sequential logic
  Extend to include multi-cycle loops
The applications seem too biased, especially url, with 80% loop time and just 0.1% loop area overhead
Place and route and synthesis would have been difficult to do on a single partitioning chip:
  Today, as of 2013, it should be possible to interface the modules with cloud computing. I would rather have a complex algorithm run on a cloud network to get the best-suited partition profile than try small tricks with the lean co-processors
  This would be application dependent

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning
Roman Lysecky, Frank Vahid, University of California, Riverside
Design, Automation and Test in Europe Conference and Exhibition (DATE'05)

Motivation (1/2)
Hard processors
  Pros: Performance
  Cons: Flexibility
Soft processors
  Pros: Flexibility
  Cons: Degraded performance and energy consumption

Can we leverage the benefits of both using warp processing?

Motivation (2/2)
Warp processing: a technique for optimizing a software application by dynamically and transparently re-implementing critical software kernels as custom circuits in on-chip configurable logic

Study a MicroBlaze-based warp processing system to eliminate the performance and energy overhead of a soft processor compared to a hard processor

FPGA Single-Chip Systems: Hard-core vs. Soft-core
Hard-core
  Excellent packaging and communication with the FPGA
  Lower power and higher performance than soft-core
  E.g. Triscend, Atmel, Altera's Excalibur, Virtex* with PowerPCs
Soft-core
  Lower part cost
  Extreme flexibility during the design process
    Adding custom instructions or including/excluding particular data-path coprocessors
  Quickly integrate the processor within an FPGA
  Varying number of processors as per need
  E.g. NIOS, NIOS II, PicoBlaze, MicroBlaze

Use hardware/software partitioning techniques to alleviate the power and performance overhead of soft processors

MicroBlaze Soft Processor Core

MicroBlaze: 32-bit soft core by Xilinx
LMB: Local Memory Bus
BRAM: Block RAM, user-defined size
OPB: On-Chip Peripheral Bus
[Figure: Xilinx Platform Studio tool flow. Labels: specify system architecture and configure MicroBlaze; synthesize design; bitstream; software libraries; application; compile; final system bitstream.]

Key Features of MicroBlaze
User-configurable options
  Tailor the processor's functionality to the design's needs
  Configurable instruction and data caches
  Incorporate additional hardware:
    Hardware multiplier (mul instructions)
    Hardware divider (div instructions)
    Barrel shifter (bs and bsi instructions)
    Hardware bit manipulations and absolute plus

Peripheral Hardware Available at Present
Xilinx LogiCORE IP Floating-Point Operator v5.0 (Mar 11)

Available for Kintex-7, Virtex-7, Virtex-6, Virtex-5, Virtex-4, Spartan-6, Spartan-3/XA, Spartan-3E/XA, Spartan-3A/AN/3A DSP/XA FPGAs
Supported operators:
  multiply
  add/subtract
  divide
  square-root
  comparison
  conversion from floating-point to fixed-point
  conversion from fixed-point to floating-point
  conversion between floating-point types
Parameterized fraction and exponent word lengths

Applications Analyzed
brev (Powerstone benchmark suite)
  Critical kernel performs bit reversal, relying heavily on shift operations
  Software-only implementation (without mul or barrel shift): an n-bit shift uses n successive add operations
  Configurable hardware implementation: 2.1X speedup
matmul
  Critical region: matrix multiplication
  Hardware multiplier provides 1.3X speedup

MicroBlaze-based Warp Processor

Identify critical kernels at execution time
Implement critical kernels in the WCLA as custom HW
WCLA: Warp Configurable Logic Architecture

Warp Configurable Logic Architecture for Dynamic HW/SW Partitioning

DADG: Data Address Generator
  Used for any memory accesses to/from the configurable logic
LCH: Loop Control Hardware
  Handles loops and controls execution
Reg 0, Reg 1, Reg 2:
  Inputs to the CLF and/or MAC (as per mapping)
  Outputs from the configurable logic are stored in registers

MicroBlaze Multi-processor Warp Processing System

Multiple soft cores may be incorporated within a single FPGA
  Limited only by the FPGA size
A multi-processor warp processing system may share a common DPM and WCLA, and HW/SW partitioning may be done in a round-robin manner
  No overhead due to additional DPMs
Partitioning tools may be implemented as software tasks running on one of the cores
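The round-robin sharing of one partitioning module among several cores can be sketched as follows. This is a toy model under assumptions: the per-core profiles, the core names, and the functions `hottest_loop` and `partition_round_robin` are hypothetical, and "partitioning" is reduced to picking each core's most frequent loop.

```python
from itertools import cycle, islice


def hottest_loop(profile):
    """Pick the loop with the highest execution count from one core's profile."""
    return max(profile, key=profile.get)


def partition_round_robin(profiles, rounds, partition_fn=hottest_loop):
    """Serve the cores one at a time with a single shared partitioning module.

    profiles maps a core name to {loop name: execution count}.
    Returns the list of (core, selected kernel) in service order.
    """
    served = []
    for core in islice(cycle(profiles), rounds):
        served.append((core, partition_fn(profiles[core])))
    return served


if __name__ == "__main__":
    profiles = {
        "mb0": {"loopA": 5, "loopB": 9},
        "mb1": {"loopC": 3},
    }
    for core, kernel in partition_round_robin(profiles, rounds=3):
        print(core, kernel)
```

Since only one module does the work, adding cores lengthens the wait between a core's turns but adds no partitioning hardware.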

Experimental Setup
Execution time and power studied
Embedded systems applications chosen from the Powerstone and EEMBC benchmark suites
MicroBlaze processor core implemented on a Spartan3 FPGA
Barrel shifter and multiplier configured in hardware
Note: MicroBlaze max frequency is 85 MHz; however, FPGA circuits may run up to 250 MHz

Profiling Simulation
[Figure: Xilinx Microprocessor Debug Engine and MicroBlaze run Soft App 1, Soft App 2, Soft App n, producing Inst Trace 1, Inst Trace 2, Inst Trace n.]
Simulate on-chip profiler behavior
Critical region

Energy Equations

Performance / Power Simulation
[Figure: simulation flow. Labels follow.]
Critical regions, VHDL
Synthesis: Synopsys Design Compiler, UMC 0.18 um library
Execution traces of critical regions
Execute HW circuits (VHDL model for WCLA) for each partitioned critical region
Determine final application performance
Xilinx XPower: MicroBlaze and system components (excluding WCLA)
Dynamic power, static power
Configurable HW power
MicroBlaze power

Results

ARM execution determined using SimpleScalar

Conclusion
Warp processors (with a soft core), by pushing critical software kernels to the configurable logic, can provide:
  Flexibility of the soft core
    Due to the soft-core implementation
  Competitiveness with hard-core processors (such as ARM)
    Performance of the order of the hard core
    By leveraging specially configured HW
    5.8X (average) improvement (with MicroBlaze)
  Elimination of the energy overhead
    By faster execution due to dedicated hardware, and by trimming the soft processor down to fit the design needs
    Average energy reduction ~57%
  New avenues for soft-core processors that would not have been feasible previously due to energy/performance

Areas of Improvement & Future Work
Real processing systems do not execute just a single application at a time
  For realistic data, multiple applications should be run simultaneously
  Explore the parallel processing architecture further
Power estimation data
  Estimation is good
  It would be good to see real data as well
The online profiler has a dozen entries
  The number of entries should be configurable to avoid local maxima
Instead of a simplified configurable logic fabric, how about using the underlying FPGA physical fabric?
An algorithm to determine the re-partitioning time interval should be worked out

Questions

Design Goals