TRANSCRIPT
The Apple-CORE Approach
• Hardware threading is not new... the Denelcor HEP introduced this 30 years ago
• Apple-CORE’s contribution is in the system interface with a consistent view on concurrency management whatever resources are available
• from a single thread to many cores
• The goal is to make concurrency the norm and the processing resource fungible
• like money it must be divisible and interchangeable
The Core Issue
• Continuity in generations of microprocessors requires a common (sequential) binary interface to the processor and a common software interface for system functionality
• The $1000 question is where concurrency is defined in this interface and how we can best adapt to dynamic resource heterogeneity in order to support future many-core systems
Current Approach
• A combination of two techniques is currently used in mainstream microprocessors:
• implicit concurrency - sequential instructions executed out of order with wide issue
• this is expensive in both area and power
• explicit (programmer) concurrency - managed in software
• large overhead limits this to coarse granularity
The Apple-CORE Approach
• With massive concurrency predicted the Apple-CORE approach is to capture maximal concurrency in the binary interface and automatically sequentialise this based on resource availability at run time
[Figure: spectrum of parallelism - the past approach parallelises sequential code; the Apple-CORE approach sequentialises parallel code]
Key Abstractions
• Dataflow synchronisation on instruction execution - latency tolerance
• Deterministic programming model based on multi-way fork/sync - create families of threads
• Memory consistent at fork and sync events
• Handles on resources - places
• Asynchronous execution of families at places
• Capture sequence where necessary by thread-to-thread dependencies - shared variables
University of Amsterdam
The DRISC Core
Dataflow synchronisation
• DRISC - Dynamically-scheduled RISC
• provides data-flow scheduling in a RISC ISA using i-structures as registers - this allows asynchronous instruction execution and, with multiple threads, tolerates latency in those operations
• A Bolychevsky, C R Jesshope and V B Muchnick (1996) Dynamic scheduling in RISC architectures, IEE Proc. Computers and Digital Techniques, 143, pp 309-317
• Related work past & present
• Rishiyur S. Nikhil and Arvind (1989) Can dataflow subsume von Neumann computing? Proc. ISCA '89, doi: 10.1145/74925.74955
• 2 of 4 DARPA-funded UHPC projects have adapted a fine-grain dynamic asynchronous program execution model
Dataflow Scheduling
• Dataflow is a more effective way to tolerate latency in instruction execution - e.g. memory accesses take 100s or even 1000s of cycles
• Prior approaches e.g. HEP/Niagara context switch on the hazard-inducing instruction
• DRISC synchronises on the result of the hazard, i.e. captures the dependency - dataflow scheduling
• This requires a synchronising register file (i-structures) and an efficient mechanism to store and manage thread continuations
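A minimal software sketch of a synchronising register (i-structure) makes the mechanism concrete: an empty/full bit, and a suspended continuation that fires when the value arrives. The names `istruct`, `ist_read` and `ist_write` are invented for the sketch; this is not the DRISC hardware interface, and it assumes at most one suspended consumer per register.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*continuation)(long value, void *env);

/* One synchronising register: empty/full state plus room for one
 * suspended continuation. */
typedef struct {
    int full;             /* full/empty bit */
    long value;
    continuation waiter;  /* consumer suspended on an empty register */
    void *waiter_env;
} istruct;

/* Consumer side: run now if the register is full, otherwise suspend. */
void ist_read(istruct *r, continuation k, void *env) {
    if (r->full) {
        k(r->value, env);
    } else {
        r->waiter = k;
        r->waiter_env = env;
    }
}

/* Producer side: the write fills the register and fires any waiter -
 * synchronising on the *result* of the operation, not its issue. */
void ist_write(istruct *r, long v) {
    r->value = v;
    r->full = 1;
    if (r->waiter) {
        continuation k = r->waiter;
        r->waiter = NULL;
        k(v, r->waiter_env);
    }
}

/* helper continuation used in the example: store the value */
void store_result(long v, void *env) { *(long *)env = v; }
```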
[Figure: context-switch points - HEP context switches on the hazard-inducing instruction; DRISC context switches on the result of the hazard]
DRISC Continuation
• A continuation comprises a PC and a context of from 2 to 32 synchronising registers (optionally some local memory)
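An illustrative layout for such a continuation, assuming contexts are allocated as contiguous slices of a shared register file; the field names and the helper `ctx_fits` are invented for this sketch, not the hardware's actual encoding.

```c
#include <assert.h>

/* Illustrative layout of a DRISC continuation: a PC plus a context of
 * 2..32 synchronising registers allocated in the shared register file. */
enum { MIN_CTX_REGS = 2, MAX_CTX_REGS = 32 };

typedef struct {
    unsigned long pc;  /* where the thread resumes */
    unsigned base;     /* first register of the context in the file */
    unsigned nregs;    /* context size: 2..32 synchronising registers */
} thread_entry;

/* Check whether a proposed context respects the model's bounds and
 * fits inside a register file of the given size. */
int ctx_fits(unsigned base, unsigned nregs, unsigned regfile_size) {
    return nregs >= MIN_CTX_REGS && nregs <= MAX_CTX_REGS
        && base + nregs <= regfile_size;
}
```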
[Figure: a thread-table entry holds the PC and a register pointer; the register pointer selects the thread context in the register file, while the program and thread-local data reside in memory]
Asynchrony in DRISC Cores
• Intra-thread - i.e. an instruction that is issued and synchronised within the same thread:
• Memory references
• FPU operations (shared FPU)
• Functions implemented as logic
• Create a family and Sync on a family
• Inter-thread - read/write to a remote register file
• This asynchronously activates a waiting thread and is one of two mechanisms used for system signalling - the other is asynchronously creating a family of thread(s)
Achievements - DRISC cores
• Single DRISC core - FPGA implementation of single core (SPARC ISA)
• main result is that this transparently supports asynchronous functions in logic
• Many-core Microgrid (Alpha and SPARC ISA)
• emulates all pipeline interactions in each core and both on-chip and off-chip memories, giving cycle-accurate timing simulations
• Low-level C-based compiler for both platforms
Microthreaded Programming
Programming Strategy
• Restrict non-expert users to deterministic programming models e.g. sequential, data parallel, functional, etc.
• Then manage concurrency issues automatically in the language run-time or in a coordination layer
• Microthreading - the abstract execution model
• capture as much concurrency as possible using families of threads in a dynamically evolving concurrency tree (tree of frames)
• functions & loops captured concurrently as families of threads
• microthreading is deterministic, resource agnostic and free of deadlock given unbounded resources
Functions as Threads
• Events are create/sync and any remote register writes
• register writes are synchronising
• asynchronous parameter passing
• create executed early and parent and child execute concurrently
[Figure: parent and child thread timelines showing control flow and the create, remote-register-write and sync events]
foo(A){ ... A ... }
...
create() foo(P)
...
P := ...
...
sync
Loops as Threads
[Figure: parent thread and child threads 0-3, showing control flow and the create, remote-register-write and sync events]
foo(A){ ... A ... }
...
create(i=0..3) foo(P)
...
P := ...
...
sync
Sequence as Threads
[Figure: parent thread and child threads 0-3, with shared-variable dependencies between consecutive children capturing the sequence]
foo(shared A){ ... A := ... A ... }
...
create(i=0..3) foo(P)
...
P := ...
...
sync
Why Capture Sequence as Threads?
• Still need local concurrency for latency tolerance. e.g. naive inner product
• sum := sum + A[i]*B[i]
• compiles to a six-instruction thread in Alpha
• A+i, B+i, load, load, fmul are all independent
• only one instruction, fadd $S1 $D1 $L1, is constrained to execute sequentially
• No speedup but independent instructions can be scheduled concurrently on a single core
• This allows multiple concurrent memory loads and floating point multiplies to be executed concurrently on a single core
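The naive inner product from the slide, written out in C so the per-iteration dependency structure is explicit; the comments mark which operations correspond to the independent instructions of the six-instruction Alpha thread.

```c
#include <assert.h>

/* The slide's naive inner-product thread body: per iteration the address
 * arithmetic, both loads and the multiply are independent; only the
 * accumulation (fadd) forms a sequential chain between iterations. */
double inner_product(const double *A, const double *B, long n) {
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        double a = A[i];   /* load A+i : independent */
        double b = B[i];   /* load B+i : independent */
        double p = a * b;  /* fmul     : independent across iterations */
        sum = sum + p;     /* fadd     : the only serialised instruction */
    }
    return sum;
}
```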
Microthread Program Structure
• Any thread can create a subordinate family
• Microthreading can capture all concurrency down to innermost loops (even ILP)
• The program is a dynamically evolving tree of concurrently executing thread contexts
[Figure: a dynamically evolving tree of concurrently executing thread contexts]
Memory Model
• Use dataflow synchronisation between related threads to capture sequence e.g. in loops
• uses synchronising memory (DRISC i-structures)
• parent -> child 1 -> child 2 etc.
• Other memory is asynchronous between concurrent threads - is made consistent on create and sync events
• Any child thread will see consistent memory on its parent’s writes that were issued before the create
• The parent thread will see consistent memory on any writes by its children following the sync
Achievements - Deterministic Programming
• Two user-level compilers: plain old C and SaC compile to this deterministic programming model via an intermediate language (SL)
• We have shown that a combination of run-time system, low-level compiler and DRISC core is able to adapt the concurrency exposed by the programmer in these languages to the resources available at run time
• caveat: the run-time systems for these compilers also use the SVP system-level abstractions for resource management
System Virtualization
SVP Platform - see http://svp-home.org/
SVP Adds Resources to Microthreading
• Introduces the abstraction of place to capture a guaranteed set of resources - i.e. processing capability and a number of thread contexts
• Introducing places allows unrelated threads to signal each other (remote register writes) and enables operating system development
• With places, unrelated threads can also be executed with mutual exclusion, which is also required for operating system development
OS model
• In the past computing cycles were scarce but memory ubiquitous - time sharing OS
• time sharing is not energy efficient - it requires moving state in and out of the CPU as contexts are switched
• Now both processors and processor cycles are ubiquitous - space-sharing OS
• divide the processor space between units of work (dynamically)
• power considerations suggest it does not matter if a processor is idle, but if it is powered up and clocked it had better use its cycles efficiently
Places in Microgrid
• Families are bound to a place on create
• Create takes a place id parameter to specify where the family executes
• A place can be a single family/thread on a core
• A place can be a single core (32 families & up to 256 threads)
• A place may be a cluster of adjacent cores
• Families of independent threads are distributed automatically to all cores/contexts in a place
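The automatic distribution can be sketched as simple block arithmetic over the cores of a place; `threads_on_core` is an invented name and not necessarily the hardware's exact policy, but it reproduces the figures on the results slide (8192 threads on 64 cores gives 128 threads per core).

```c
#include <assert.h>

/* Illustrative block distribution of a family of n independent index
 * threads over the cores of a place. */
long threads_on_core(long n, long cores, long core) {
    long base = n / cores;   /* every core gets at least this many */
    long extra = n % cores;  /* the first `extra` cores get one more */
    return base + (core < extra ? 1 : 0);
}
```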
Delegation granularity
[Figure: GFLOPS, GIPS and efficiency (average IPC) against number of cores, up to 64, for n=4K and n=8K]
• Results plot GFLOPS & GIPS for a 1 GHz core using a shared FPU (1 FPU per 2 cores)
• Efficiency is the average IPC of all cores from a cold start on bare cores
• n=4K on 64 cores - we create & execute 64 threads/core in 3.5µsec
• n=8K on 64 cores - we create & execute 128 threads/core in 6.5 µsec
• These times include place allocation and delegation and show space-sharing in quanta of a few micro-seconds
Write Once - Run Anywhere
• Microthreaded code captures all concurrency to innermost loops (can even capture ILP)
• Resources are added dynamically and constrain concurrency
• The example above executes the same binary code for all executions of the program - i.e. binary compatibility regardless of place allocation
Resource Deadlock
• Because we have dependencies down the concurrency tree we have the potential for resource-constrained deadlocks
• With a finite number of contexts - limited by family table or thread table or number of registers available
• If a thread cannot create it is blocked, but there is the possibility that all threads are blocked waiting to create and none can terminate to release a context
Binary Compatibility
• The implementation of create allows thread family breadth to be managed automatically
• resource constraints can be placed by the compiler on concurrency trees with statically known depth to avoid deadlock
• Dynamic recursion requires compiler support
• both sequential and concurrent versions of the code are compiled
• Create must be implemented in two phases to support failure on resource exhaustion
• allocate acquires resources - if available - and returns a family identifier which is then used to create the threads
• allocate can either suspend and wait for resources (if deadlock is avoided statically) or fail and invoke sequential execution
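The two-phase scheme with a sequential fallback can be sketched as follows, assuming a small fixed pool of thread contexts. Every name here (`allocate_family`, `run_family`, `free_contexts`) is invented for the sketch; the real mechanism lives in the DRISC hardware and run-time system. The point is that allocation failure degrades to the compiled sequential version rather than deadlocking, and - because the model is deterministic - the result is the same either way.

```c
#include <assert.h>

int free_contexts = 2;           /* toy resource pool */

/* Phase 1: allocate may fail instead of deadlocking on exhaustion. */
int allocate_family(void) {
    if (free_contexts == 0)
        return -1;               /* caller falls back to sequential code */
    free_contexts--;
    return 1;                    /* dummy family identifier */
}

void release_family(int fid) {
    (void)fid;
    free_contexts++;
}

/* Phase 2: create the threads of the allocated family, or - on failure -
 * run the compiled sequential version of the same code. */
long run_family(long n, long (*body)(long)) {
    int fid = allocate_family();
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += body(i);          /* concurrent family if fid >= 0,
                                    otherwise the sequential version */
    if (fid >= 0)
        release_family(fid);
    return acc;                  /* deterministic: same result either way */
}

long ident(long i) { return i; } /* example thread body */
```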
Dynamically Partitioning Space
• SVP has three levels of concurrency management: two in space, one in time
• families of threads can be delegated to a new place (cluster)
• threads are automatically distributed within a place
• threads on a core are interleaved on data availability
Concurrency tree
[Figure: delegate a family to a new place (a new set of contexts is acquired here); distribute threads across the place; interleave threads on each core]
Achievements - Systems Engineering
• The main achievement is the system level support (mixture of h/w and s/w) that allows concurrent code to be managed dynamically according to resources available
• We also have sufficient system-level development to support the porting of most standard codes to the Microgrid platform
• This includes efficient mechanisms for delegated I/O to/from the host system
Conclusion
• In conclusion we have taken a concurrent deterministic programming model
• developed an efficient implementation of this in a DRISC core supporting many-core chips & transparent execution of functions in logic
• provided a system software layer that manages the concurrency dynamically adapting to whatever resources are available using the same binary code
• and developed compilers and system services to be able to port complex legacy software to the platform with very little effort
Apple-CORE Research Strategy
• In this project we have developed a viable framework of tools for the further development, evaluation and eventual exploitation of DRISC cores using the microthreaded programming model and the SVP system-level abstraction for concurrency management
• These tools have been published under an open-source licence so that interested parties can explore the achievements of the Apple-CORE research