Exploiting Vector Parallelism in Software Pipelined Loops

Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

  • Multimedia Extensions

    Short vector extensions in ILP processors
      AltiVec, 3DNow!, SSE, etc.
    Accelerate loops in multimedia & DSP codes
    New designs have floating point support

  • Multimedia Extensions

    Vector resources do not overwhelm the scalar resources
      Scalar: 2 FP ops / cycle
      Vector: 4 FP ops / cycle
    Full vectorization may underutilize scalar resources
    ILP techniques do not target vector resources
    Need both

    Courtesy of International Business Machines Corporation. Unauthorized use not permitted.

  • Modulo Scheduling
    for (i=0; i
  • Traditional Vectorization
    for (i=0; i
  • Vectorization without Distribution
    for (i=0; i
  • Selective Vectorization
    for (i=0; i
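
The loop bodies on the four slides above were lost in extraction. As a stand-in, here is a minimal sketch of the four code shapes the slide titles name, applied to a hypothetical two-statement loop. The loop itself, the `vec2` type, and the `vload`/`vstore`/`vmul`/`vadd` helpers are assumptions made for illustration; they emulate a 2-wide vector unit in plain C and are not the authors' example. The arrays are assumed not to overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 2-wide vector of doubles, standing in for a vector register. */
typedef struct { double lane[2]; } vec2;

static vec2 vload(const double *p) { vec2 v = {{p[0], p[1]}}; return v; }
static void vstore(double *p, vec2 v) { p[0] = v.lane[0]; p[1] = v.lane[1]; }
static vec2 vmul(vec2 x, vec2 y) {
    vec2 r = {{x.lane[0] * y.lane[0], x.lane[1] * y.lane[1]}};
    return r;
}
static vec2 vadd(vec2 x, vec2 y) {
    vec2 r = {{x.lane[0] + y.lane[0], x.lane[1] + y.lane[1]}};
    return r;
}

/* Scalar loop: the form a modulo scheduler would software pipeline. */
void loop_scalar(size_t n, const double *a, const double *b,
                 const double *c, const double *d, double *t, double *u) {
    for (size_t i = 0; i < n; i++) {
        t[i] = a[i] * b[i];
        u[i] = c[i] + d[i];
    }
}

/* Traditional vectorization with loop distribution: one vector loop per statement. */
void loop_distributed(size_t n, const double *a, const double *b,
                      const double *c, const double *d, double *t, double *u) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2)
        vstore(&t[i], vmul(vload(&a[i]), vload(&b[i])));
    for (size_t i = 0; i < n; i += 2)
        vstore(&u[i], vadd(vload(&c[i]), vload(&d[i])));
}

/* Vectorization without distribution: both vector statements share one loop body. */
void loop_fused(size_t n, const double *a, const double *b,
                const double *c, const double *d, double *t, double *u) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        vstore(&t[i], vmul(vload(&a[i]), vload(&b[i])));
        vstore(&u[i], vadd(vload(&c[i]), vload(&d[i])));
    }
}

/* Selective vectorization: the multiply uses the vector unit, while the add
   stays scalar (two copies per unrolled iteration) to keep the FPUs busy. */
void loop_selective(size_t n, const double *a, const double *b,
                    const double *c, const double *d, double *t, double *u) {
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        vstore(&t[i], vmul(vload(&a[i]), vload(&b[i])));
        u[i]     = c[i]     + d[i];
        u[i + 1] = c[i + 1] + d[i + 1];
    }
}
```

All four functions compute the same results; they differ only in which functional units the operations would occupy once scheduled.
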
  • Complications

    Complex scheduling requirements
      Particularly in statically scheduled machines
    Memory alignment
    Example assumes no communication cost
      In reality, explicit operations required
      Often through memory
      Reserve critical resources
      Potential long latency
    Performance improvement still possible

  • Tomcatv main loop (50%)

  • Tomcatv (SpecFP 95)

    1.7x speedup over modulo scheduling

    Issue Width     6
    Memory Units    2
    ALUs            4
    FPUs            2
    Vector Units    1
    Vector Length   2*

    Technique                 ALU   MEM   FPU   VEC
    Modulo Scheduling           6    22    46     0
    Full Vectorization          7    13     0    46
    Selective Vectorization     7    27    19    27
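
One hedged way to read the Tomcatv table: if each entry is the number of cycles a resource class is kept busy per scheduled loop body (an assumption, since the slide does not label its units), then the largest entry in a row bounds that technique's initiation interval. The sketch below also assumes the flattened digits split as ALU/MEM/FPU/VEC = 6/22/46/0, 7/13/0/46 and 7/27/19/27. Under that reading, the ratio of the modulo-scheduling bound to the selective-vectorization bound is 46/27, about 1.70, which is consistent with the 1.7x speedup on the slide.

```c
#include <stddef.h>

/* Largest entry of a table row: the busiest resource class. */
int busiest(const int *row, size_t n) {
    int m = row[0];
    for (size_t i = 1; i < n; i++)
        if (row[i] > m) m = row[i];
    return m;
}

/* Rows as read from the slide, in the order ALU, MEM, FPU, VEC.
   The digit split and the "busy cycles" interpretation are assumptions. */
const int MODULO[4]    = {6, 22, 46, 0};
const int FULL_VEC[4]  = {7, 13, 0, 46};
const int SELECTIVE[4] = {7, 27, 19, 27};

/* Ratio of the modulo-scheduling bound to the selective bound. */
double projected_speedup(void) {
    return (double)busiest(MODULO, 4) / (double)busiest(SELECTIVE, 4);
}
```

Under the same reading, full vectorization is bounded by 46 as well, because the single vector unit becomes the bottleneck in place of the two FPUs.
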

  • Tomcatv (SpecFP 95)

  • Selective Vectorization

    Balance computation among resources
      Minimize II when loop is modulo scheduled
    Carefully manage communication
      Incorporate alignment information
      Software pipelining hides latency
    Adapt a 2-cluster partitioning heuristic
      [Fiduccia & Mattheyses 82]
      [Kernighan & Lin 70]
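
The partitioning idea can be sketched as a simplified move-based heuristic in the spirit of Kernighan-Lin and Fiduccia-Mattheyses. This is not the authors' algorithm. The `Op` record, the `projected_ii` cost, and the charge of 3 memory slots for every dependence that crosses between the scalar and vector sides are assumptions for illustration. Unit counts follow the machine configuration on the Tomcatv slide (4 ALUs, 2 memory units, 2 FPUs, 1 vector unit), and the loop body is assumed to be unrolled by the vector length of 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { ALU, MEM, FPU, VEC, NCLASS };
enum { MAXOPS = 32 };

typedef struct {
    int  cls;            /* scalar resource class: ALU, MEM or FPU */
    bool vectorizable;   /* may this op move to the vector side? */
    int  src[2];         /* indices of producer ops, -1 if none */
} Op;

static const int UNITS[NCLASS] = {4, 2, 2, 1};

/* Projected II for a body unrolled by 2: a scalar op needs two slots in its
   class, a vector memory op one MEM slot, any other vector op one VEC slot. */
int projected_ii(const Op *ops, int n, const bool *vec) {
    int slots[NCLASS] = {0};
    for (int i = 0; i < n; i++) {
        if (!vec[i])                slots[ops[i].cls] += 2;
        else if (ops[i].cls == MEM) slots[MEM] += 1;
        else                        slots[VEC] += 1;
        for (int k = 0; k < 2; k++) {
            int p = ops[i].src[k];
            if (p >= 0 && vec[p] != vec[i])
                slots[MEM] += 3;   /* assumed cost of a transfer through memory */
        }
    }
    int ii = 1;
    for (int c = 0; c < NCLASS; c++) {
        int need = (slots[c] + UNITS[c] - 1) / UNITS[c];
        if (need > ii) ii = need;
    }
    return ii;
}

/* Start all-scalar. In each pass, repeatedly flip the unlocked op whose move
   gives the lowest cost (even if worse), lock it, and remember the best
   configuration seen. Stop when a pass brings no improvement. */
int partition(const Op *ops, int n, bool *vec) {
    assert(n <= MAXOPS);
    memset(vec, 0, (size_t)n * sizeof *vec);
    int best = projected_ii(ops, n, vec);
    bool improved = true;
    while (improved) {
        improved = false;
        bool locked[MAXOPS] = {false};
        bool trial[MAXOPS];
        memcpy(trial, vec, (size_t)n * sizeof *vec);
        for (int step = 0; step < n; step++) {
            int pick = -1, pick_cost = 0;
            for (int i = 0; i < n; i++) {
                if (locked[i] || !ops[i].vectorizable) continue;
                trial[i] = !trial[i];
                int c = projected_ii(ops, n, trial);
                trial[i] = !trial[i];
                if (pick < 0 || c < pick_cost) { pick = i; pick_cost = c; }
            }
            if (pick < 0) break;
            trial[pick] = !trial[pick];
            locked[pick] = true;
            if (pick_cost < best) {
                best = pick_cost;
                memcpy(vec, trial, (size_t)n * sizeof *vec);
                improved = true;
            }
        }
    }
    return best;
}
```

For four independent, vectorizable FP operations, this model gives a projected II of 4 when all are scalar, 4 when all are vectorized, and 2 when two are moved to the vector unit, which is the balance the slide describes.
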

  • Selective Vectorization (figure; labels: scalar, vector, cost)

  • Cost Function

    Projected II due to resources (ResMII)
      Bin-packing approach [Rau MICRO 94]
      With some modifications

    Can ignore operation latency
      Software pipelining hides latency
      Vectorizable ops not on dependence cycles

    for (i=0; i
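
A resource-constrained II estimate by bin-packing can be sketched as follows. This is a simplified greedy version, not the authors' modified cost function: operations with the fewest alternatives are placed first, and each goes to the allowed resource class that ends up least loaded relative to its unit count. The bitmask encoding of alternatives and the name `res_mii` are assumptions for illustration.

```c
enum { NRES = 4 };   /* resource classes, e.g. ALU, MEM, FPU, VEC */

/* Number of resource classes an op may use (bit r set = class r allowed). */
static int count_alts(unsigned mask) {
    int c = 0;
    for (int r = 0; r < NRES; r++) c += (int)((mask >> r) & 1u);
    return c;
}

/* Greedy bin-packing estimate of ResMII.
   alts[i]  : bitmask of classes that can execute op i (ops with no
              alternatives are skipped)
   units[r] : number of units in class r, each assumed > 0 */
int res_mii(const unsigned *alts, int n, const int *units) {
    int load[NRES] = {0};
    /* Place the most constrained ops first. */
    for (int k = 1; k <= NRES; k++) {
        for (int i = 0; i < n; i++) {
            if (count_alts(alts[i]) != k) continue;
            int best = -1;
            for (int r = 0; r < NRES; r++) {
                if (!((alts[i] >> r) & 1u)) continue;
                /* Compare (load+1)/units by cross-multiplying, no floats. */
                if (best < 0 ||
                    (load[r] + 1) * units[best] < (load[best] + 1) * units[r])
                    best = r;
            }
            load[best]++;
        }
    }
    int ii = 1;
    for (int r = 0; r < NRES; r++) {
        int need = (load[r] + units[r] - 1) / units[r];   /* ceiling division */
        if (need > ii) ii = need;
    }
    return ii;
}
```

Operation latency does not appear anywhere in this estimate, which matches the slide's point that latency can be ignored when software pipelining hides it.
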

  • Evaluation

    SUIF front-end
      Dependence analysis
      Dataflow optimization

    Trimaran back-end
      Modulo scheduler
      Register allocator
      VLIW Simulator
      Added vector ops

    C or Fortran
    Simulation Binary

  • Evaluation

    Operands communicated through memory
    Software responsible for realignment

    Issue Width     6
    Memory Units    2
    ALUs            4
    FPUs            2
    Vector Units    1
    Vector Length   2*

  • Evaluation

    SpecFP 92, 95, 2000
      Easier to extract dependence information
      Detectable data parallelism
      64-bit data means vector length of 2
      Considered amenable to vectorization & SWP
    Apply selective vectorization to DO loops
      No control flow, no function calls
    Fully simulate with training sets

  • Traditional Vectorization

  • Vectorization without Distribution

  • Vectorization + Free Communication

  • Vectorization without Distribution

  • Selective Vectorization

  • Selective Vectorization (chart; benchmarks: tomcatv, su2cor, swim, mgrid)

  • Communication Support

    Transfer through memory
    Register to register copy
      Uses fewer issue slots
      Frees memory resources
    Shared register file
      Vector elements addressable in scalar ops
      Requires no extra issue slots
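
As a rough illustration of the first option, transfer through memory, the sketch below moves two scalar values into a vector and back by way of a small buffer. The `vec2` type and both function names are hypothetical. In plain C the compiler decides what actually lives in memory, so the comments mark which machine operations each line stands in for.

```c
/* Hypothetical 2-wide vector of doubles, standing in for a vector register. */
typedef struct { double lane[2]; } vec2;

/* Scalar to vector: two scalar stores, then one vector load. */
vec2 pack_via_memory(double x0, double x1) {
    double buf[2];
    buf[0] = x0;                     /* scalar store */
    buf[1] = x1;                     /* scalar store */
    vec2 v = {{buf[0], buf[1]}};     /* stands in for one vector load */
    return v;
}

/* Vector to scalar: one vector store, then two scalar loads. */
void unpack_via_memory(vec2 v, double *x0, double *x1) {
    double buf[2] = {v.lane[0], v.lane[1]};   /* stands in for one vector store */
    *x0 = buf[0];                             /* scalar load */
    *x1 = buf[1];                             /* scalar load */
}
```

Each direction costs three memory operations in this sketch. The register-to-register and shared-register-file options on the slide would reduce or remove that cost.
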

  • Through Memory (chart; benchmarks: tomcatv, su2cor, swim, mgrid)

  • Reg to Reg Transfer Support (chart; benchmarks: tomcatv, su2cor, swim, mgrid)

  • Shared Register File (chart; benchmarks: tomcatv, su2cor, swim, mgrid)

  • Related Work

    Traditional vectorization
      Allen & Kennedy, Wolfe
    Software pipelining
      Rau's iterative modulo scheduling
    Clustered VLIW
      [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
      Partitioning among clusters similar
      Ours is also an instruction selection problem
      No dedicated communication resources

  • Conclusion

    Targeting all FUs improves performance
      Selective vectorization
    Vectorization better in the backend
      Cost analysis more accurate
    Software pipeline vectorized loops
      Good idea anyway
      Facilitates selective vectorization
      Hides communication and alignment latency

    ILP techniques are instruction scheduling techniques; vectorization is a type of instruction selection. That's why we need both.

    Mention what the notation in the code means
    Mention loop distribution
    Communication between vector and scalar loops
    This is our contribution
    Example was very simple, but in reality there are complications
    Planning to make publicly available
    PACT
    Traditional never beats modulo scheduling for this architecture
    Free communication is unrealistic
    Mention theoretical maximum for this architecture
    Say what percentages mean
    Consider two other design points