TRANSCRIPT
The Apple-CORE Approach
• Hardware threading is not new... the Denelcor HEP introduced this 30 years ago
• Apple-CORE’s contribution is in the system interface with a consistent view on concurrency management whatever resources are available
• from a single thread to many cores
• The goal is to make concurrency the norm and the processing resource fungible
• like money it must be divisible and interchangeable
The Core Issue
• Continuity in generations of microprocessors requires a common (sequential) binary interface to the processor and a common software interface for system functionality
• The $1000 question is where concurrency is defined in this interface and how we can best adapt to dynamic resource heterogeneity in order to support future many-core systems
Current Approach
• A combination of two techniques is currently used in mainstream microprocessors:
• implicit concurrency - sequential instructions executed out of order with wide issue
• this is expensive in both area and power
• explicit (programmer) concurrency - managed in software
• large overhead limits this to coarse granularity
The Apple-CORE Approach
• With massive concurrency predicted the Apple-CORE approach is to capture maximal concurrency in the binary interface and automatically sequentialise this based on resource availability at run time
[Figure: spectrum of parallelism - the past approach parallelises sequential code; the Apple-CORE approach sequentialises parallel code]
Key Abstractions
• Dataflow synchronisation on instruction execution - latency tolerance
• Deterministic programming model based on multi-way fork/sync - create families of threads
• Memory consistent at fork and sync events
• Handles on resources - places
• Asynchronous execution of families at places
• Capture sequence where necessary by thread-to-thread dependencies - shared variables
University of Amsterdam
The DRISC Core
Dataflow synchronisation
• DRISC - Dynamically-scheduled RISC
• provides data-flow scheduling in a RISC ISA using i-structures as registers - this allows asynchronous instruction execution and, with multiple threads, tolerates latency in those operations
• A Bolychevsky, C R Jesshope and V B Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Proc. Computers and Digital Techniques, 143, pp 309-317
• Related work past & present
• Rishiyur S. Nikhil and Arvind (1989) Can dataflow subsume von Neumann computing? Proc. ISCA '89, doi: 10.1145/74925.74955
• 2 of 4 DARPA-funded UHPC projects have adapted a fine-grain dynamic asynchronous program execution model
Dataflow Scheduling
• Dataflow is a more effective way to tolerate latency in instruction execution - e.g. memory accesses take 100s or even 1000s of cycles
• Prior approaches e.g. HEP/Niagara context switch on the hazard-inducing instruction
• DRISC synchronises on the result of the hazard, i.e. captures the dependency - dataflow scheduling
• This requires a synchronising register file (i-structures) and an efficient mechanism to store and manage thread continuations
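A minimal software sketch of a synchronising register (i-structure) makes the mechanism concrete: an empty/full bit, and a suspended continuation that fires when the value arrives. The names `istruct`, `ist_read` and `ist_write` are invented for the sketch; this is not the DRISC hardware interface, and it assumes at most one suspended consumer per register.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*continuation)(long value, void *env);

/* One synchronising register: empty/full state plus room for one
 * suspended continuation. */
typedef struct {
    int full;             /* full/empty bit */
    long value;
    continuation waiter;  /* consumer suspended on an empty register */
    void *waiter_env;
} istruct;

/* Consumer side: run now if the register is full, otherwise suspend. */
void ist_read(istruct *r, continuation k, void *env) {
    if (r->full) {
        k(r->value, env);
    } else {
        r->waiter = k;
        r->waiter_env = env;
    }
}

/* Producer side: the write fills the register and fires any waiter -
 * synchronising on the *result* of the operation, not its issue. */
void ist_write(istruct *r, long v) {
    r->value = v;
    r->full = 1;
    if (r->waiter) {
        continuation k = r->waiter;
        r->waiter = NULL;
        k(v, r->waiter_env);
    }
}

/* helper continuation used in the example: store the value */
void store_result(long v, void *env) { *(long *)env = v; }
```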
[Figure: context-switch points - HEP context switches on the hazard-inducing instruction; DRISC context switches on the result of the hazard]
DRISC Continuation
• A continuation comprises a PC and a context of from 2 to 32 synchronising registers (optionally some local memory)
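An illustrative layout for such a continuation, assuming contexts are allocated as contiguous slices of a shared register file; the field names and the helper `ctx_fits` are invented for this sketch, not the hardware's actual encoding.

```c
#include <assert.h>

/* Illustrative layout of a DRISC continuation: a PC plus a context of
 * 2..32 synchronising registers allocated in the shared register file. */
enum { MIN_CTX_REGS = 2, MAX_CTX_REGS = 32 };

typedef struct {
    unsigned long pc;  /* where the thread resumes */
    unsigned base;     /* first register of the context in the file */
    unsigned nregs;    /* context size: 2..32 synchronising registers */
} thread_entry;

/* Check whether a proposed context respects the model's bounds and
 * fits inside a register file of the given size. */
int ctx_fits(unsigned base, unsigned nregs, unsigned regfile_size) {
    return nregs >= MIN_CTX_REGS && nregs <= MAX_CTX_REGS
        && base + nregs <= regfile_size;
}
```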
[Figure: a thread-table entry holds the PC and a register pointer; the register pointer selects the thread context in the register file, while the program and thread-local data reside in memory]
Asynchrony in DRISC Cores
• Intra-thread - i.e. an instruction that is issued and synchronised within the same thread:
• Memory references
• FPU operations (shared FPU)
• Functions implemented as logic
• Create a family and Sync on a family
• Inter-thread - read/write to a remote register file
• This asynchronously activates a waiting thread and is one of two mechanisms used for system signalling - the other is asynchronously creating a family of thread(s)
Achievements - DRISC cores
• Single DRISC core - FPGA implementation of single core (SPARC ISA)
• main result is that this transparently supports asynchronous functions in logic
• Many-core Microgrid (Alpha and SPARC ISA)
• emulates all pipeline interactions in each core and both on-chip and off-chip memories, giving cycle-accurate timing simulations
• Low-level C-based compiler for both platforms
Microthreaded Programming
Programming Strategy
• Restrict non-expert users to deterministic programming models e.g. sequential, data parallel, functional, etc.
• Then manage concurrency issues automatically in the language run-time or in a coordination layer
• Microthreading - the abstract execution model
• capture as much concurrency as possible using families of threads in a dynamically evolving concurrency tree (tree of frames)
• functions & loops captured concurrently as families of threads
• microthreading is deterministic, resource agnostic and free of deadlock given unbounded resources
Functions as Threads
• Events are create/sync and any remote register writes
• register writes are synchronising
• asynchronous parameter passing
• create executed early and parent and child execute concurrently
[Figure: parent and child thread timelines showing control flow and the create, remote-register-write and sync events]
foo(A){ ... A ... }
...
create() foo(P)
...
P := ...
...
sync
Loops as Threads
[Figure: parent thread and child threads 0-3, showing control flow and the create, remote-register-write and sync events]
foo(A){ ... A ... }
...
create(i=0..3) foo(P)
...
P := ...
...
sync
Sequence as Threads
[Figure: parent thread and child threads 0-3, with shared-variable dependencies between consecutive children capturing the sequence]
foo(shared A){ ... A := ... A ... }
...
create(i=0..3) foo(P)
...
P := ...
...
sync
Why Capture Sequence as Threads?
• Still need local concurrency for latency tolerance. e.g. naive inner product
• sum := sum + A[i]*B[i]
• compiles to a six-instruction thread in Alpha
• A+i, B+i, load, load, fmul are all independent
• only one instruction, fadd $S1 $D1 $L1, is constrained to execute sequentially
• No speedup but independent instructions can be scheduled concurrently on a single core
• This allows multiple concurrent memory loads and floating point multiplies to be executed concurrently on a single core
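The naive inner product from the slide, written out in C so the per-iteration dependency structure is explicit; the comments mark which operations correspond to the independent instructions of the six-instruction Alpha thread.

```c
#include <assert.h>

/* The slide's naive inner-product thread body: per iteration the address
 * arithmetic, both loads and the multiply are independent; only the
 * accumulation (fadd) forms a sequential chain between iterations. */
double inner_product(const double *A, const double *B, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double a = A[i];   /* load A+i : independent */
        double b = B[i];   /* load B+i : independent */
        double p = a * b;  /* fmul     : independent across iterations */
        sum = sum + p;     /* fadd     : the only serialised instruction */
    }
    return sum;
}
```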
Microthread Program Structure
• Any thread can create a subordinate family
• Microthreading can capture all concurrency down to innermost loops (even ILP)
• The program is a dynamically evolving tree of concurrently executing thread contexts
[Figure: a dynamically evolving tree of concurrently executing thread contexts]
Memory Model
• Use dataflow synchronisation between related threads to capture sequence e.g. in loops
• uses synchronising memory (DRISC i-structures)
• parent -> child 1 -> child 2 etc.
• Other memory is asynchronous between concurrent threads - is made consistent on create and sync events
• Any child thread will see consistent memory on its parent’s writes that were issued before the create
• The parent thread will see consistent memory on any writes by its children following the sync
Achievements - Deterministic Programming
• Two user-level compilers: plain old C and SaC compile to this deterministic programming model via an intermediate language (SL)
• We have shown that a combination of run-time system, low-level compiler and DRISC core is able to adapt the concurrency exposed by the programmer in these languages to the resources available at run time
• caveat: the run-time systems for these compilers also use the SVP system-level abstractions for resource management
System Virtualization
SVP Platform - see http://svp-home.org/
SVP Adds Resources to Microthreading
• Introduces the abstraction of place to capture a guaranteed set of resources - i.e. processing capability and a number of thread contexts
• Introducing places allows unrelated threads to signal each other (remote register writes) and enables operating system development
• With places, unrelated threads can also be executed with mutual exclusion, which is also required for operating system development
OS model
• In the past computing cycles were scarce but memory ubiquitous - time sharing OS
• time sharing is not energy efficient - it requires moving state in and out of the CPU as contexts are switched
• Now both processors and processor cycles are ubiquitous - space-sharing OS
• divide the processor space between units of work (dynamically)
• power considerations suggest it does not matter if a processor is idle, but if it is powered up and clocked it had better use its cycles efficiently
Places in Microgrid
• Families are bound to a place on create
• Create takes a place id parameter to specify where the family executes
• A place can be a single family/thread on a core
• A place can be a single core (32 families & up to 256 threads)
• A place may be a cluster of adjacent cores
• Families of independent threads are distributed automatically to all cores/contexts in a place
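The automatic distribution can be sketched as simple block arithmetic over the cores of a place; `threads_on_core` is an invented name and not necessarily the hardware's exact policy, but it reproduces the figures on the results slide (8192 threads on 64 cores gives 128 threads per core).

```c
#include <assert.h>

/* Illustrative block distribution of a family of n independent index
 * threads over the cores of a place. */
long threads_on_core(long n, long cores, long core) {
    long base = n / cores;   /* every core gets at least this many */
    long extra = n % cores;  /* the first `extra` cores get one more */
    return base + (core < extra ? 1 : 0);
}
```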
Delegation granularity
[Figure: GFLOPS, GIPS and efficiency (average IPC) against number of cores, up to 64, for n=4K and n=8K]
• Results plot GFLOPS & GIPS for a 1 GHz core using a shared FPU (1 FPU per 2 cores)
• Efficiency is the average IPC of all cores from a cold start on bare cores
• n=4K on 64 cores - we create & execute 64 threads/core in 3.5µsec
• n=8K on 64 cores - we create & execute 128 threads/core in 6.5 µsec
• These times include place allocation and delegation and show space-sharing in quanta of a few micro-seconds
Write Once - Run Anywhere
• Microthreaded code captures all concurrency to innermost loops (can even capture ILP)
• Resources are added dynamically and constrain concurrency
• The example above executes the same binary code for all executions of the program - i.e. binary compatibility regardless of place allocation
Resource Deadlock
• Because we have dependencies down the concurrency tree we have the potential for resource-constrained deadlocks
• With a finite number of contexts - limited by family table or thread table or number of registers available
• If a thread cannot create it is blocked, but there is the possibility that all threads are blocked waiting to create and none can terminate to release a context
Binary Compatibility
• The implementation of create allows thread family breadth to be managed automatically
• resource constraints can be placed by the compiler on concurrency trees with statically known depth to avoid deadlock
• Dynamic recursion requires compiler support
• both sequential and concurrent versions of the code are compiled
• Create must be implemented in two phases to support failure on resource exhaustion
• allocate acquires resources - if available - and returns a family identifier which is then used to create the threads
• allocate can either suspend and wait for resources (if deadlock is avoided statically) or fail and invoke sequential execution
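The two-phase scheme with a sequential fallback can be sketched as follows, assuming a small fixed pool of thread contexts. Every name here (`allocate_family`, `run_family`, `free_contexts`) is invented for the sketch; the real mechanism lives in the DRISC hardware and run-time system. The point is that allocation failure degrades to the compiled sequential version rather than deadlocking, and - because the model is deterministic - the result is the same either way.

```c
#include <assert.h>

int free_contexts = 2;           /* toy resource pool */

/* Phase 1: allocate may fail instead of deadlocking on exhaustion. */
int allocate_family(void) {
    if (free_contexts == 0)
        return -1;               /* caller falls back to sequential code */
    free_contexts--;
    return 1;                    /* dummy family identifier */
}

void release_family(int fid) {
    (void)fid;
    free_contexts++;
}

/* Phase 2: create the threads of the allocated family, or - on failure -
 * run the compiled sequential version of the same code. */
long run_family(long n, long (*body)(long)) {
    int fid = allocate_family();
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += body(i);          /* concurrent family if fid >= 0,
                                    otherwise the sequential version */
    if (fid >= 0)
        release_family(fid);
    return acc;                  /* deterministic: same result either way */
}

long ident(long i) { return i; } /* example thread body */
```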
Dynamically Partitioning Space
• SVP has three levels of concurrency management: two in space, one in time
• families of threads can be delegated to a new place (cluster)
• threads are automatically distributed within a place
• threads on a core are interleaved on data availability
Concurrency tree
[Figure: delegate a family to a new place (a new set of contexts is acquired here); distribute threads across the place; interleave threads on each core]
Achievements - Systems Engineering
• The main achievement is the system level support (mixture of h/w and s/w) that allows concurrent code to be managed dynamically according to resources available
• We also have sufficient system-level development to support the porting of most standard codes to the Microgrid platform
• This includes efficient mechanisms for delegated I/O to/from the host system
Conclusion
• In conclusion we have taken a concurrent deterministic programming model
• developed an efficient implementation of this in a DRISC core supporting many-core chips & transparent execution of functions in logic
• provided a system software layer that manages the concurrency dynamically adapting to whatever resources are available using the same binary code
• and developed compilers and system services to be able to port complex legacy software to the platform with very little effort
Apple-CORE Research Strategy
• In this project we have developed a viable framework of tools for the further development, evaluation and eventual exploitation of DRISC cores using the microthreaded programming model and the SVP system-level abstraction for concurrency management
• These tools have been published under an open-source licence so that interested parties can explore the achievements of the Apple-CORE research