TRANSCRIPT
Programming the FlexRAM Parallel Intelligent Memory System
B. B. Fraguela*, J. Renau†, P. Feautrier‡
D. Padua† and J. Torrellas†
*Univ. da Coruña, †Univ. of Illinois
‡ENS de Lyon
Intelligent Memory Architectures
Main memory enhanced with many simple processors
•Heterogeneous
•Highly parallel
B. B. Fraguela et al Programming FlexRAM, PPoPP 2003 2
Problem: Little research on how to program these architectures
Contributions
Language support for intelligent memories
•OpenMP-like directives (CFlex)
•Library of Intelligent Memory Operations (IMOs)
Runtime system and OS extensions
Speedups of over one order of magnitude
Outline
•FlexRAM Architecture
•Software Support
•CFlex
•IMOs
•Evaluation
•Related Work
•Conclusions
FlexRAM Architecture
•64 PArrays/chip, 64 MB/chip
•PArrays are much simpler than the PHost(s)
•PArrays are programmed independently (MIMD, SPMD)
•2D torus inside each chip
•Controller: communication and synchronization tasks
•PHost: off-the-shelf system
•The FlexRAM bus interconnects all the chips
[Figure: FlexRAM system diagram; labels: PHost, Standard Memory]
FlexRAM Architectural Issues
PArrays cannot interrupt/invoke the PHost(s)
•PArray requests are sent to the chip controller
•PHost polls the controllers and services the requests
Communication PHost - PArray: pass input and output arguments through memory
No HW cache coherence
•Compiler-inserted cache flushes and invalidations
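The polling scheme on this slide can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `controller_t` mailbox layout, the `phost_poll` function, and the `double_request` service routine are hypothetical and do not come from the FlexRAM runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHIPS 4  /* assumed number of FlexRAM chips, for illustration only */

/* Hypothetical mailbox that a chip controller exposes through memory. */
typedef struct {
    volatile bool pending;  /* set when a PArray has posted a request */
    int           request;  /* illustrative request payload */
    int           reply;    /* filled in by the PHost when it services the request */
} controller_t;

/* One polling pass by the PHost: service every pending request and
   return how many requests were served in this pass. */
int phost_poll(controller_t ctrl[], size_t n, int (*service)(int)) {
    assert(ctrl != NULL && service != NULL);
    int served = 0;
    for (size_t i = 0; i < n; i++) {
        if (ctrl[i].pending) {
            ctrl[i].reply = service(ctrl[i].request);
            ctrl[i].pending = false;
            served++;
        }
    }
    return served;
}

/* Illustrative service routine: doubles the request value. */
int double_request(int r) { return 2 * r; }
```

Because PArrays cannot interrupt the PHost, a request posted between two passes simply waits until the next call to `phost_poll`.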
Operating System Extensions
Common address space for PHost(s) and PArrays
PArray kernel
•Manages the TLB
•Manages spawn and termination of local tasks
PHost OS
•Updates the shared page table
•First-touch placement of pages
•Cooperates with PArray kernels to keep TLBs coherent
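First-touch placement can be illustrated with a small sketch. The `page_table_t` type, the `pt_init` and `pt_touch` functions, and the page count are hypothetical names and values chosen for this example, not part of the FlexRAM OS.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PAGES 8     /* assumed size of the toy page table */
#define UNPLACED  (-1)  /* marks a page that nobody has touched yet */

/* Hypothetical shared page table: home[i] is the memory bank holding page i. */
typedef struct {
    int home[NUM_PAGES];
} page_table_t;

void pt_init(page_table_t *pt) {
    for (size_t i = 0; i < NUM_PAGES; i++)
        pt->home[i] = UNPLACED;
}

/* Return the bank that holds the page. On the first touch the page is
   placed in the bank of the processor that touched it; later touches
   from other banks do not move it. */
int pt_touch(page_table_t *pt, size_t page, int toucher_bank) {
    assert(page < NUM_PAGES);
    if (pt->home[page] == UNPLACED)
        pt->home[page] = toucher_bank;
    return pt->home[page];
}
```

In this sketch, a program controls data placement simply by choosing which processor initializes each page.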
Programming FlexRAM
OpenMP-like directives (CFlex)
Library of Intelligent Memory Operations (IMOs)
CFlex
CFlex: family of directives inspired by OpenMP
•Execution modifiers: build/sync tasks
•Data modifiers: properties of data structures
•Executable directives: barriers, prefetches,...
#pragma FlexRAM directive-type [clauses]
Execution Modifier: Spawn
Directive-type
•[phost|parray]: specifies the kind of processor to use
Clauses
•on_home(x): run the task on the PArray in whose bank x is located
•sync/async: the parent task must wait until the child finishes (sync) or not (async)
•pfor: parallelize a for loop
Execution Modifier: Spawn (II)
Clauses (cont.):
•if(cond)/else: conditional execution of the compiler directive
•shared, private, firstprivate, lastprivate, reduction: scope clauses with the same meaning as in OpenMP
•flush: specify which pieces of data to flush from the PHost cache
Example: Parallelizing a Loop
for(p = head; p != NULL; p = p->next)
  process(p->data);
Example: Parallelizing a Loop
#pragma FlexRAM phost sync
for(p = head; p != NULL; p = p->next)
#pragma FlexRAM parray async on_home(*(p->data)) \
  firstprivate(p)
  process(p->data);
Example: Parallelizing a Loop
#pragma FlexRAM parray pfor on_home(*(p->data)) \
  firstprivate(p)
for(p = head; p != NULL; p = p->next)
  process(p->data);
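For readers who want to try the pfor form, here is a minimal self-contained sketch. The `node_t` type, the body of `process`, and the `process_list` wrapper are assumptions added for illustration; the slides do not show them. A standard C compiler has no CFlex support, so it skips the unknown pragma (possibly with a warning) and runs the loop sequentially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the slides only show the fields data and next. */
typedef struct node {
    int         *data;
    struct node *next;
} node_t;

/* Illustrative work function: doubles the payload in place. */
void process(int *data) {
    assert(data != NULL);
    *data *= 2;
}

/* Walk the list and process each element. With CFlex, each iteration
   would become a task on the PArray whose bank holds *(p->data). */
void process_list(node_t *head) {
    node_t *p;
#pragma FlexRAM parray pfor on_home(*(p->data)) \
    firstprivate(p)
    for (p = head; p != NULL; p = p->next)
        process(p->data);
}
```

The directive leaves the sequential code untouched, which is the same property OpenMP relies on.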
Parallelizing Complex Codes
int TreeAdd (register tree_t *t) {
  if (t == NULL) return 0;
  else {
    int leftval, rightval;
    leftval = TreeAdd(t->left);
    rightval = TreeAdd(t->right);
    return leftval + rightval + t->val;
  }
}
Parallelizing Complex Codes (II)
[Figure: tree diagram; labels: PHost-allocated node, PHost async task, PArray subtree, PArray async task]
Parallelizing Complex Codes (III)
int TreeAdd (register tree_t *t) {
  if (t == NULL) return 0;
  else {
    int leftval, rightval;
#pragma FlexRAM phost sync
    {
#pragma FlexRAM parray async on_home(*(t->left)) if (lcl(t->left))
#pragma FlexRAM phost async else
      leftval = TreeAdd(t->left);
#pragma FlexRAM parray async on_home(*(t->right)) if (lcl(t->right))
      rightval = TreeAdd(t->right);
    }
    return leftval + rightval + t->val;
  }
}
Intelligent Memory Operations (IMOs)
Libraries that hide FlexRAM while providing near-optimal performance
Implement common operations on data structures often used in programs
Highly optimized, both the sequential and the parallel versions
Example IMOs: Vector Container
IMO description                                                Syntax
Apply func f with arg a                                        Vector_apply(v,f,a)
Search element that fulfills cond f with arg a                 Vector_search(v,f,a)
Generate vector with the result of applying func f with arg a  v2=Vector_map(v,f,a)
Reduce vector applying func f, whose neutral element is a      Vector_reduce(v,f,a)
Process two vectors and an arg a, generating a new vector      v3=Vector_map2(v,v2,f,a)
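To make the semantics of the table concrete, the following is a minimal sequential sketch of three of the operations. It is not the IMO library: the `Vector` layout, the callback types, the `Vector_new`/`Vector_free` helpers, the example callbacks, and the exact C signatures are all assumptions, since the slides give only the call syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical int vector; the real IMO container layout is not shown in the slides. */
typedef struct {
    int   *data;
    size_t len;
} Vector;

typedef int (*elem_fn)(int elem, int arg);    /* used by map and search */
typedef int (*reduce_fn)(int acc, int elem);  /* used by reduce */

Vector *Vector_new(size_t len) {
    Vector *v = malloc(sizeof *v);
    assert(v != NULL);
    v->data = calloc(len ? len : 1, sizeof *v->data);
    assert(v->data != NULL);
    v->len = len;
    return v;
}

void Vector_free(Vector *v) {
    free(v->data);
    free(v);
}

/* v2 = Vector_map(v, f, a): new vector whose i-th element is f(v[i], a). */
Vector *Vector_map(const Vector *v, elem_fn f, int a) {
    Vector *out = Vector_new(v->len);
    for (size_t i = 0; i < v->len; i++)
        out->data[i] = f(v->data[i], a);
    return out;
}

/* Vector_reduce(v, f, a): fold the vector with f, starting from the neutral element a. */
int Vector_reduce(const Vector *v, reduce_fn f, int a) {
    int acc = a;
    for (size_t i = 0; i < v->len; i++)
        acc = f(acc, v->data[i]);
    return acc;
}

/* Vector_search(v, f, a): index of the first element for which f(elem, a)
   is nonzero, or -1 if there is none. */
long Vector_search(const Vector *v, elem_fn f, int a) {
    for (size_t i = 0; i < v->len; i++)
        if (f(v->data[i], a))
            return (long)i;
    return -1;
}

/* Illustrative callbacks. */
int add_arg(int elem, int arg)    { return elem + arg; }
int equals_arg(int elem, int arg) { return elem == arg; }
int sum(int acc, int elem)        { return acc + elem; }
```

A parallel version could split the index range among PArrays and, for the reduction, combine the partial results starting from the neutral element, which is why the neutral element is part of the interface.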
Architecture Parameters
PHost
•1.6 GHz, 5 issue
•L1 cache: 32 KB
•L2 cache: 1 MB
•Mem latency: 180 cycles
PArray
•1.2 GHz, 2 issue, in order
•L1 cache: 8 KB
•No FP support
•Mem latency: 14 cycles
Applications (I)
Application  Suite          Access  Data  Task Size
TSP          Olden          Ptr     FP    Large
TreeAdd      Olden          Ptr     Int   Large
Swim         SPEC OMP 2001  Reg     FP    Med
Mgrid        SPEC OMP 2001  Reg     FP    Var
Dmxdm        Kernel         Reg     FP    Large
Spmxv        Kernel         Ind     FP    Med
Distance     CAM            Ptr     Int   Large
Path         CAM            Ptr     Int   Small
Applications (II)
Application  Code size (lines)  Directives  Additional lines
TSP          485                12          5
TreeAdd      71                 8           4
Swim         272                8           0
Mgrid        470                13          0
Dmxdm        81                 1           2
Spmxv        47                 1           0
Distance     108                17          7
Path         165                17          9
Original code not changed
Speedups
Speedups: Over one order of magnitude!
Further Optimizations
•Single task for consecutive pages affected by an on_home residing in the same bank
•Consecutive rather than cyclic spawn among the FlexRAM chips
•Limiting the usage of on_home to large loops
Increase the average speedup by about 30%
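The difference between cyclic and consecutive spawning can be sketched as two task-to-chip mappings. This is one plausible reading of the slide, not the runtime's actual policy; the function names `chip_cyclic` and `chip_consecutive` and the block-size formula are assumptions.

```c
#include <assert.h>

/* Cyclic spawn: task i goes to chip i mod nchips, so neighbouring
   tasks land on different chips. */
int chip_cyclic(int task, int nchips) {
    assert(nchips > 0 && task >= 0);
    return task % nchips;
}

/* Consecutive spawn: tasks are split into contiguous blocks of
   ceil(ntasks / nchips) tasks, so neighbouring tasks share a chip. */
int chip_consecutive(int task, int ntasks, int nchips) {
    assert(nchips > 0 && task >= 0 && task < ntasks);
    int per_chip = (ntasks + nchips - 1) / nchips;
    return task / per_chip;
}
```

With 8 tasks and 4 chips, the cyclic mapping sends tasks 0 and 1 to chips 0 and 1, while the consecutive mapping keeps both on chip 0.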
Other Intelligent Mem. Arch.
Active Pages [Oskin et al], DIVA [Hall et al]
•Program the machine by hand
•Can also be programmed with CFlex/IMOs
•Seem more sensitive to data placement than FlexRAM
•Further extensions of CFlex to specify alignment and placement of data structures
•Message passing is more natural for DIVA
Related Work
FlexRAM [Solihin et al]: automatic partition and mapping by a compiler
•Only feasible for simple codes
•Performance is limited
Widespread compiler directives
•OpenMP: UMA model, no locality clauses
•HPF: extensive alignment + replication
•Both: inadequate for irregular applications
Conclusions
Effective programming support for Intelligent Mem
•CFlex: family of pragmas inspired by OpenMP
•IMOs: library of intelligent memory operations
CFlex parallelizes more problems than OpenMP
Complexity can be further hidden using IMOs
Speedups over one order of magnitude
Runtime System
Task management
•Creation of buffer with input args + message
•PHost spins on a termination flag set by the PArray
Interface to chip controller locks and constructions built upon them, like barriers
Heap memory management
Polling of the FlexRAM chips
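The task-management handshake can be sketched as a buffer shared through memory plus a termination flag. The `task_buf_t` layout and the `task_init`, `parray_run`, and `phost_wait` functions are hypothetical, and C11 atomics stand in for whatever mechanism (such as the cache flushes mentioned earlier) the real runtime uses to make the flag visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical task buffer shared between the PHost and a PArray. */
typedef struct {
    int        args[2];  /* input arguments, passed through memory */
    int        result;   /* output argument, written by the PArray */
    atomic_int done;     /* termination flag, set by the PArray */
} task_buf_t;

/* PHost side: fill the buffer before handing the task to a PArray. */
void task_init(task_buf_t *t, int a, int b) {
    assert(t != NULL);
    t->args[0] = a;
    t->args[1] = b;
    t->result = 0;
    atomic_init(&t->done, 0);
}

/* Stand-in for the code a PArray would run: compute, then raise the flag.
   The release store publishes the result before the flag becomes visible. */
void parray_run(task_buf_t *t) {
    t->result = t->args[0] + t->args[1];
    atomic_store_explicit(&t->done, 1, memory_order_release);
}

/* PHost side: spin on the termination flag, then read the result. */
int phost_wait(task_buf_t *t) {
    while (!atomic_load_explicit(&t->done, memory_order_acquire))
        ;  /* busy-wait until the PArray signals termination */
    return t->result;
}
```

In this single-process sketch `parray_run` must be called before `phost_wait`; on real hardware the two sides run concurrently on different processors.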
PArray Structure
Further Optimizations
Initial optimizations:
•Alignments using the OS first-touch policy
•on_home clause to exploit locality
New optimizations:
•H: single task for consecutive pages affected by an on_home residing in the same bank
•C: consecutive rather than cyclic spawn among the FlexRAM chips
•L: limiting the usage of on_home to large loops
Software Optimization Results
Final Results