
A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs

Marco Bekooij & Frank Ophelders

Outline

  Context

  What is cache coherence

  Addressed challenge

  Short overview of related work

  Related issue: memory consistency

  Proposed software cache coherence protocol

  Performance evaluation results

  Concluding remarks

Multi-stream car-entertainment system

Car-radio IC of NXP

[Figure: block diagram of the car-radio IC. Labelled blocks: Tile 0 to Tile 3 with DSP EPICS cores and an ARM based subsystem (controller, ARM, MEM); Inter Tile Communication (ITC) with ITC AHB interfaces; a multi-layer AHB bus (3 layers); VPB domains 0 to 2 behind AHB2VPB bridges; memories, DMA, SPI, CD block decoder; accelerators (Cordic, FIR, SRC) and peripherals (audio ADC 4x, audio DAC 4x, IIS and SPDIF inputs and outputs, IF-in, keyed AGC, radio 8*fs in and out) behind a Digital In Out (DIO) switch.]

Unsuitable for general-purpose applications (e.g. Pthreads applications)

Developed experimental embedded multiprocessor system

  Processors communicate through shared memory

  Processors have private caches

  Cache coherence problem!

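[Figure: experimental platform on a Virtex 4. Labelled blocks: two ARM926EJ-S processors (PE1, PE2), each with instruction (I) and data (D) caches; instruction memory for PE2; Æthereal network-on-chip; Shared Memory 1 (8 MB) and Shared Memory 2 (8 MB); SDRAM (256 MB); peripherals (RS232, display, touchscreen, audio in/out, video in/out, timers).]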

Cache coherence problem

  A cache coherency protocol ensures that writes eventually become visible to all processors

[Figure: shared memory with two caches ($), each holding X: 3. Annotation: "Read returns 3 !!!"]
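The stale read in the figure can be reproduced with a toy model. The following is a minimal sketch in C, assuming two processors with private write-back caches and no hardware coherence. All names (`toy_cache`, `cache_read`, `cache_write`, `cache_clean`, `cache_invalidate`) are hypothetical and model a single variable X only; they are not taken from the presentation.

```c
#include <stdbool.h>

/* Toy model: one shared variable X and one private cache entry per processor. */
typedef struct {
    bool valid;   /* entry holds a copy of X */
    bool dirty;   /* copy was modified and not yet written back */
    int  value;
} toy_cache;

static int shared_x = 3;   /* X in shared memory */

/* Read X through a private cache; a hit never consults shared memory. */
static int cache_read(toy_cache *c) {
    if (!c->valid) {
        c->value = shared_x;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Write X into the private cache only (write-back policy). */
static void cache_write(toy_cache *c, int v) {
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Clean: write a dirty copy back to shared memory. */
static void cache_clean(toy_cache *c) {
    if (c->valid && c->dirty) {
        shared_x = c->value;
        c->dirty = false;
    }
}

/* Invalidate: drop the local copy so that the next read refetches X. */
static void cache_invalidate(toy_cache *c) {
    c->valid = false;
    c->dirty = false;
}
```

In this model, if both processors have read X and one of them then writes 4, the other keeps reading 3. The new value becomes visible only after the writer cleans its copy and the reader invalidates its own.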

Addressed challenge

Define a cache coherence protocol that is suitable for real-time embedded systems with a NoC and with off-the-shelf processors

Related work on cache coherency

  Hardware cache coherency protocols

–  Snooping based protocols
•  Require processors to observe all memory accesses
•  Do not match well with a NoC: preferably point-to-point communication instead of broadcasting

–  Directory based protocols
•  Significant overhead as a result of accessing the directory

–  Transactional memory
•  Relies on speculation: suitable for real-time systems?

–  Remark: most embedded processors do not support a hardware cache coherency protocol

  Software cache coherency protocols

–  Require a specific programming style: explicit coupling between each synchronization operation and the data structure it protects

  Prevent cache coherency issues: put shared data in an uncached address range

–  Low efficiency


Issues in sharing cache lines

  Cache operations often operate on lines

[Figure: processors P1 and P2 each hold a copy of the same cache line containing A and B.]
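Because cache operations work on whole lines, two variables that share a line are cleaned and invalidated together. One common way to keep them apart is to align each variable to a line boundary. The sketch below assumes a 32-byte cache line, which is the line size of the ARM926EJ-S. The struct names and the `line_index` helper are hypothetical, and this is not necessarily the approach taken in the presented work.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 32   /* assumption: 32-byte cache lines */

/* A and B packed together: both fall into the same cache line. */
struct packed_pair {
    int a;
    int b;
};

/* A and B each start on their own cache line. */
struct padded_pair {
    alignas(CACHE_LINE_SIZE) int a;
    alignas(CACHE_LINE_SIZE) int b;
};

/* Cache line index of a byte offset within a line-aligned object. */
static size_t line_index(size_t offset) {
    return offset / CACHE_LINE_SIZE;
}
```

The padded layout costs one full line per variable, so it trades memory for independence between the two processors' cache operations.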

Related issue: memory consistency

  Memory accesses are reordered by
–  Memory system
–  Processor
–  Compiler

  We need a memory consistency model
–  Defines constraints on the order in which memory operations become visible to other processors
–  Enables programmers to reason about the outcome

P1:               P2:
A = 1             while ( flag != 1 );
flag = 1          print A
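The flag example above only prints 1 if the write to A becomes visible before the write to flag. As a minimal sketch, the same pattern can be written with C11 atomics and POSIX threads, using a release store and an acquire load on the flag. This assumes a host with hardware cache coherence. On a platform without it, cache maintenance operations would be needed in addition. The function names `p1`, `p2`, and `run_message_passing` are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>

static int A;              /* ordinary shared data */
static atomic_int flag;    /* synchronization variable */
static int observed;       /* value of A seen by P2 */

/* P1: write the data, then publish it with a release store. */
static void *p1(void *arg) {
    (void)arg;
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

/* P2: wait for the flag with acquire loads, then read the data. */
static void *p2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;   /* spin until P1 has set the flag */
    observed = A;
    return NULL;
}

/* Runs P1 and P2 once and returns the value of A that P2 observed. */
static int run_message_passing(void) {
    pthread_t t1, t2;
    A = 0;
    atomic_store(&flag, 0);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return observed;
}
```

With plain, non-atomic accesses to `flag`, the compiler and the processor would be free to reorder the two writes in P1 or the two reads in P2.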


Sequential Memory Consistency

P1:          P2:                  P3:
A=1          while (A!=1);        while (B!=1);
             B=1                  Print A

•  All writes must be seen in one single order by all processors (write atomicity)

•  Likely to be inefficient in combination with a NoC

Proposed software cache coherence protocol

Tuneable software cache coherence protocol

  Proposed software cache coherence protocol

–  Minimal hardware requirements
•  Suitable for heterogeneous MPSoCs with a NoC
•  Off-the-shelf processors and caches are supported

–  Should support cache maintenance operations (clean, invalidate)

–  Sufficient for POSIX threads (Pthreads)
•  Explicit synchronization operations

  Tuneable

–  Separate shared and private data
•  Shared in write-through and private in write-back cache region

–  Minimize unnecessary invalidations
•  Putting shared data in a specific cache way

  Suitable for real-time systems

–  Bounded protocol overhead: WCET is independent of accesses by other processors

Release Consistency

  Ensuring sequential consistency is (too) costly, so support release consistency instead

  Acquire
–  Guarantees reading the most recent data from memory

  Release
–  Makes writes visible to other processors

  Cache coherence operations only required on acquire and release

[Figure: timeline with acquire(S) followed by release(S).]

SWCC protocol in POSIX threads

  POSIX threads
–  No two threads can access data at the same memory location simultaneously while at least one of the threads is modifying the location...

  pthread_mutex_lock (acquire)
–  Obtain lock
–  Clean & invalidate Dcache

  pthread_mutex_unlock (release)
–  Clean Dcache
–  Release lock

[Figure: timeline of one thread: reads / writes, pthread_mutex_lock(S), reads / writes (exclusive access to shared data), pthread_mutex_unlock(S), reads / writes.]
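The ordering of lock and cache operations listed on this slide can be sketched as two wrapper functions. This is an illustration, not the authors' implementation. The names `swcc_mutex_lock`, `swcc_mutex_unlock`, `dcache_clean`, and `dcache_clean_invalidate` are hypothetical. The cache operations are stubs that only count calls, and on a real target they would issue the processor's cache maintenance operations.

```c
#include <pthread.h>

/* Call counters so that the stubs have an observable effect on a host. */
static unsigned clean_calls;
static unsigned clean_invalidate_calls;

/* Stub: on the target this would write dirty data cache lines back to memory. */
static void dcache_clean(void) {
    clean_calls++;
}

/* Stub: on the target this would write back dirty lines and discard all lines. */
static void dcache_clean_invalidate(void) {
    clean_invalidate_calls++;
}

/* Acquire: obtain the lock first, then clean and invalidate the data cache
   so that later reads fetch up-to-date data from shared memory. */
static int swcc_mutex_lock(pthread_mutex_t *m) {
    int rc = pthread_mutex_lock(m);
    if (rc == 0)
        dcache_clean_invalidate();
    return rc;
}

/* Release: clean the data cache first, so that writes reach shared memory
   before another thread can obtain the lock. */
static int swcc_mutex_unlock(pthread_mutex_t *m) {
    dcache_clean();
    return pthread_mutex_unlock(m);
}
```

The order matters in both directions. Invalidating before the lock is held could let stale data be refetched, and releasing the lock before cleaning could let the next owner read memory that does not yet contain the writes.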

Tuning the protocol

  Place shared and private data in different address ranges

  Private data does not need to become visible to other processors –  Private data in write-back region of the cache

  Shared data – solve the sharing problem –  Shared data in write-through region of the cache

[Figure: two charts, "Execution time FFT" and "Memory accesses FFT".]
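Placing shared data in its own address range requires that the program allocates shared objects from a dedicated region. A minimal sketch of such a region is given below, assuming a statically sized pool and 32-byte cache lines. The names `shared_pool`, `shared_alloc`, and `is_shared` are hypothetical. The sketch does not configure the MMU. On the target, the pool's address range would additionally have to be mapped as write-through.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE  32     /* assumption: 32-byte cache lines */
#define SHARED_POOL_SIZE 4096   /* assumption: size of the shared region */

/* Dedicated region for shared data, aligned to a cache line boundary. */
static alignas(CACHE_LINE_SIZE) unsigned char shared_pool[SHARED_POOL_SIZE];
static size_t shared_used;

/* Bump allocator: hands out whole cache lines from the shared region,
   so that no line holds two separately allocated objects. */
static void *shared_alloc(size_t size) {
    if (size == 0 || size > SHARED_POOL_SIZE)
        return NULL;
    size_t rounded = (size + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE * CACHE_LINE_SIZE;
    if (rounded > SHARED_POOL_SIZE - shared_used)
        return NULL;
    void *p = &shared_pool[shared_used];
    shared_used += rounded;
    return p;
}

/* Returns nonzero if the pointer lies inside the shared region. */
static int is_shared(const void *p) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t base = (uintptr_t)shared_pool;
    return a >= base && a < base + SHARED_POOL_SIZE;
}
```

Everything outside the pool (stack, ordinary globals, heap) is then private by construction and can stay in a write-back region.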

Experiments

  Embedded the software cache coherence operations in POSIX threads calls

  Clean and invalidate entire shared address range on each synchronization
–  Entire cache
–  Way with shared data
–  Address range (MVA)

  Executed Splash2 applications

  Low latency 4 cycles / word

  Each processor gets equal budget
–  TDM arbitration on memory port

[Figure: experimental setup. Labelled blocks: two ARM926EJ-S processors (PE1, PE2) with caches ($); Shared Memory (16 MB); peripherals (RS232, display, touchscreen, audio in/out, video in/out, timers).]
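Of the three granularities listed on this slide, the address range variant visits the cache lines of the shared range one by one. A minimal sketch of such a loop follows, assuming 32-byte cache lines. The per-line operation is a stub that only counts calls, and the names `dcache_clean_invalidate_line` and `dcache_clean_invalidate_range` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* assumption: 32-byte cache lines */

static unsigned line_ops;     /* number of per-line operations issued */

/* Stub for a per-line "clean and invalidate by address" operation. */
static void dcache_clean_invalidate_line(uintptr_t line_addr) {
    (void)line_addr;
    line_ops++;
}

/* Clean and invalidate every cache line that overlaps [start, start + size).
   Returns the number of lines visited. */
static size_t dcache_clean_invalidate_range(uintptr_t start, size_t size) {
    if (size == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t first = start & mask;                /* line holding the first byte */
    uintptr_t last  = (start + size - 1) & mask;   /* line holding the last byte */
    size_t n = 0;
    for (uintptr_t a = first; ; a += CACHE_LINE_SIZE) {
        dcache_clean_invalidate_line(a);
        n++;
        if (a == last)
            break;
    }
    return n;
}
```

The loop count depends only on the start address and the size of the range, not on the behaviour of other processors, which is consistent with the bounded overhead that the protocol aims for.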

Cost of cache coherence operations

Two cost types:
•  Cost of the cache maintenance operation
•  Cost of unnecessary invalidations

Speedup Splash2 applications

Speedup between 1.89 and 2.01

Increase of memory accesses

Protocol does not increase number of memory accesses significantly

Conclusion

  Presented a cache coherence protocol that is suitable for real-time systems with a NoC and with off-the-shelf processors

  Most important optimization is the separation of shared and private data

  Experimental results
–  Speedup between 1.89 and 2.01
•  Higher synchronization/computation ratio (e.g. with hardware floating point support): lower speed-up?
–  Protocol does not significantly increase memory bandwidth requirements

  Suitable for real-time systems because software cache coherency protocol overhead is predictable

Questions?

Backup slides

SWCC protocol in POSIX threads

•  pthread_mutex_lock (acquire)
–  Obtain lock
–  Clean and invalidate

•  pthread_mutex_unlock (release)
–  Clean
–  Release lock

[Figure: reads / writes to memory.]

Existing cache coherence protocols

  Transactional memory multiprocessor systems are based on speculation
–  Suitable for real-time systems?

  Hardware protocols
–  Snooping in a NoC
•  Requires processors to observe all memory accesses
•  Writes to one location are serialized

[Figure: bus snooping. Labels: shared memory, bus snoop, cache-to-memory transaction.]

Existing cache coherence protocols

  Hardware protocols
–  Directories in a NoC
•  A directory is consulted on memory accesses
•  Increase in memory access latency
–  Hardware protocols require support from processors
•  Supported by off-the-shelf processors?

[Figure: shared memory attached to an interconnect.]
