Post on 02-Aug-2018
A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs
Marco Bekooij & Frank Ophelders
Outline
Context
What is cache coherence
Addressed challenge
Short overview of related work
Related issue: memory consistency
Proposed software cache coherence protocol
Performance evaluation results
Concluding remarks
Multi-stream car-entertainment system
Car-radio IC of NXP
[Figure: block diagram of the car-radio IC. Tile 0 to Tile 3 each contain an EPICS DSP with local memory (MEM) and an ITC AHB interface, linked by the Inter Tile Communication (ITC) network. The ARM based subsystem (controller, memories, DMA, SPI, CD block decoder) sits on a 3-layer multi-layer AHB bus with AHB2VPB bridges to VPB domains 0 to 2. Accelerators and peripherals include the Digital In Out (DIO) switch, 4x audio DAC, 4x audio ADC, Cordic, FIR, SRC, SPDIF-in, PCM interface, IIS inputs and outputs, IF-in, keyed AGC and radio in/out.]
Unsuitable for general purpose applications (e.g. Pthread)
Developed experimental embedded multiprocessor system
Processors communicate through shared memory
Processors have private caches
Cache coherence problem!
[Figure: experimental system on a Virtex 4 FPGA. Two ARM926EJ-S processors (PE1, PE2), each with instruction and data caches and its own instruction memory, are connected through the Æthereal network-on-chip to two 8 MB shared memories (TDM), a 256 MB SDRAM (RR) and peripherals: RS232, display, touchscreen, audio in/out, video in/out, timers.]
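Cache coherence problem
A cache coherence protocol ensures that writes eventually become visible to all processors
[Figure: P1 and P2 both cache X = 3 from shared memory. One processor writes X = 10 into its private cache; shared memory and the other cache still hold X = 3, so a read on the other processor returns 3.]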
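The stale read on the slide above can be reproduced with a small host-side model. This is only an illustrative sketch: `cache_t`, `cache_read`, `cache_write` and `stale_read_demo` are hypothetical names, and each private cache is reduced to a single write-back entry for X.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: one shared word X and one private write-back cache entry
 * per processor. All names are hypothetical, for illustration only. */
typedef struct {
    bool valid;  /* entry holds a copy of X */
    bool dirty;  /* copy is newer than shared memory */
    int  value;
} cache_t;

int shared_x = 3;

/* Read X through a private cache; a miss fetches from shared memory. */
int cache_read(cache_t *c)
{
    assert(c != NULL);
    if (!c->valid) {
        c->value = shared_x;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Write X into a private write-back cache; shared memory is not updated. */
void cache_write(cache_t *c, int v)
{
    assert(c != NULL);
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Replays the slide: both processors cache X = 3, then P1 writes 10. */
int stale_read_demo(void)
{
    cache_t p1 = {0}, p2 = {0};
    (void)cache_read(&p1);   /* P1 caches X = 3 */
    (void)cache_read(&p2);   /* P2 caches X = 3 */
    cache_write(&p1, 10);    /* only P1's cache holds X = 10 */
    return cache_read(&p2);  /* P2 hits its own stale copy */
}
```

In this model nothing ever forces P1 to write its copy back or P2 to drop its copy, which is exactly the gap a coherence protocol has to close.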
Addressed challenge
Define a cache coherence protocol that is suitable for real-time embedded systems with a NoC and with off-the-shelf processors
Related work on cache coherency
Hardware cache coherency protocols
– Snooping based protocols
• Require processors to observe all memory accesses
• Do not match well with a NoC: preferably point-to-point communication instead of broadcasting
– Directory based protocols
• Significant overhead as a result of accessing the directory
– Transactional memory
• Relies on speculation: suitable for real-time systems?
– Remark: most embedded processors do not support a hardware cache coherency protocol
Software cache coherency protocols
– Require a specific programming style: explicit coupling between each synchronization operation and the data structure it protects
Prevent cache coherency issues: put shared data in uncached address range
– Low efficiency
Issues in sharing cache lines
Cache operations often operate on lines
[Figure: a cache line holding variables A and B, present in shared memory and in the caches of P1 and P2.]
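One way the line granularity can hurt is a lost update: two processors modify different variables in the same line, and each write-back replaces the whole line in memory. The sketch below models that on a host. It assumes a cache that writes back complete lines; `line_t`, `clean_line` and `lost_update_demo` are hypothetical names, and the two-word line is a simplification.

```c
#include <assert.h>
#include <string.h>

#define LINE_WORDS 2  /* toy line holding A (index 0) and B (index 1) */

typedef struct { int w[LINE_WORDS]; } line_t;

/* Clean: write the whole cached line back to memory. */
void clean_line(line_t *mem, const line_t *cached)
{
    assert(mem && cached);
    memcpy(mem, cached, sizeof *mem);
}

/* Returns B in memory after P1 updates A and P2 updates B in one line. */
int lost_update_demo(void)
{
    line_t mem = {{0, 0}};
    line_t p1 = mem, p2 = mem;  /* both processors cache the line */
    p1.w[0] = 1;                /* P1 writes A */
    p2.w[1] = 2;                /* P2 writes B */
    clean_line(&mem, &p2);      /* memory: A = 0, B = 2 */
    clean_line(&mem, &p1);      /* memory: A = 1, B = 0, B's update is lost */
    return mem.w[1];
}
```

Keeping data that is protected by different locks in different cache lines, for example by aligning it to the line size (32 bytes on the ARM926EJ-S), avoids this interaction.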
Related issue: memory consistency
Memory accesses can be reordered by
– Memory system
– Processor
– Compiler
We need a memory consistency model
– Defines constraints on the order in which memory operations become visible to other processors
– Enables programmers to reason about outcome
P1: A = 1; flag = 1
P2: while ( flag != 1 ); print A
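The flag example above can be written with explicit ordering constraints. The sketch below uses C11 atomics, which are not part of the presentation, and it assumes a platform whose caches are kept coherent in hardware; on the non-coherent system described here, cache maintenance would be needed in addition. The function names `p1_producer` and `p2_consumer` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

int A = 0;            /* ordinary shared data */
atomic_int flag = 0;  /* synchronization variable */

/* P1: publish A, then set the flag with release ordering so the write
 * to A cannot be reordered after the write to flag. */
void p1_producer(void)
{
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* P2: spin until the flag is set; acquire ordering ensures the
 * following read of A is not reordered before the flag read. */
int p2_consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;  /* busy wait */
    return A;
}
```

Without such constraints, the compiler or the processor may reorder the two writes in P1 or the two reads in P2, and P2 could print the old value of A.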
Sequential Memory Consistency
[Figure: P1, P2 and P3 connected through a network-on-chip; program order read, write, lock, write, read, unlock, write, write.]
P1: A = 1
P2: while (A != 1); B = 1
P3: while (B != 1); print A
• All writes must be seen in one single order by all processors (write atomicity)
• Likely to be inefficient in combination with a NoC
Proposed software cache coherence protocol
Tuneable software cache coherence protocol
Software cache coherence protocol
– Minimal hardware requirements
• Suitable for heterogeneous MPSoCs with a NoC
• Off-the-shelf processors and caches are supported
– Should support cache maintenance operations (clean, invalidate)
– Sufficient for POSIX threads (Pthreads)
• Explicit synchronization operations
Tuneable
– Separate shared and private data
• Shared in write-through and private in write-back cache region
– Minimize unnecessary invalidations
• Putting shared data in a specific cache way
Suitable for real-time systems
– Bounded protocol overhead, WCET is independent of accesses by other processors
Release Consistency
Ensuring sequential consistency efficiently is (too) costly: support release consistency instead
Acquire
– Guarantees reading most recent data from memory
Release
– Makes writes visible to other processors
Cache coherence operations only required on acquire and release
[Figure: program order read, write, acquire(S), write, read, release(S), write, write.]
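A small host-side model can show how acquire and release map onto cache maintenance. This is a sketch under simplifying assumptions, not the authors' implementation: each private cache is a single write-back entry for one shared word, and `sw_acquire`, `sw_release` and the other names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { bool valid, dirty; int value; } cache_t;

int shared_x = 3;  /* the shared word in memory */

int cache_read(cache_t *c)
{
    if (!c->valid) {            /* miss: fetch from shared memory */
        c->value = shared_x;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

void cache_write(cache_t *c, int v)
{
    c->value = v;               /* write-back: memory is not updated */
    c->valid = true;
    c->dirty = true;
}

/* Release: clean, i.e. write a dirty copy back to shared memory. */
void sw_release(cache_t *c)
{
    assert(c != NULL);
    if (c->valid && c->dirty) {
        shared_x = c->value;
        c->dirty = false;
    }
}

/* Acquire: clean and invalidate, so the next read misses and fetches
 * the most recent value from shared memory. */
void sw_acquire(cache_t *c)
{
    sw_release(c);
    c->valid = false;
}

int release_acquire_demo(void)
{
    cache_t p1 = {0}, p2 = {0};
    (void)cache_read(&p2);   /* P2 caches X = 3 */
    sw_acquire(&p1);
    cache_write(&p1, 10);
    sw_release(&p1);         /* X = 10 reaches shared memory */
    sw_acquire(&p2);         /* P2 drops its stale copy */
    return cache_read(&p2);  /* miss: fetches 10 */
}
```

Between an acquire and the matching release the caches work without any coherence traffic, which is why the maintenance cost is only paid at synchronization points.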
SWCC protocol in POSIX threads
POSIX threads
– No two threads can access data at the same memory location simultaneously while at least one of the threads is modifying the location...
pthread_mutex_lock (acquire)
– Obtain lock
– Clean & invalidate Dcache
pthread_mutex_unlock (release)
– Clean Dcache
– Release lock
[Figure: reads / writes, pthread_mutex_lock(S), reads / writes (exclusive access to shared data), pthread_mutex_unlock(S), reads / writes.]
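The ordering of the steps above can be sketched as wrappers around the standard mutex calls. The presentation embeds the operations inside the POSIX threads calls themselves; the wrapper approach here is a substitute chosen so the sketch runs on a host. `swcc_mutex_lock`, `swcc_mutex_unlock`, `dcache_clean` and `dcache_clean_invalidate` are hypothetical names, and the cache hooks are stubs that only count calls.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical platform hooks. On a real target these would issue the
 * processor's cache maintenance operations; here they only count calls
 * so that the sketch runs on a host. */
int n_clean = 0, n_clean_invalidate = 0;
void dcache_clean(void)            { n_clean++; }
void dcache_clean_invalidate(void) { n_clean_invalidate++; }

/* Acquire: obtain the lock, then clean and invalidate the data cache so
 * that later reads fetch the most recent data from shared memory. */
int swcc_mutex_lock(pthread_mutex_t *m)
{
    assert(m != NULL);
    int rc = pthread_mutex_lock(m);
    if (rc == 0)
        dcache_clean_invalidate();
    return rc;
}

/* Release: clean the data cache so that writes reach shared memory,
 * then release the lock. */
int swcc_mutex_unlock(pthread_mutex_t *m)
{
    assert(m != NULL);
    dcache_clean();
    return pthread_mutex_unlock(m);
}
```

The order matters: the cache is cleaned before the lock is released, so the next lock holder cannot observe the lock as free while the protected data is still only in the previous holder's cache.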
Tuning the protocol
Place shared and private data in different address ranges
Private data does not need to become visible to other processors
– Private data in write-back region of the cache
Shared data: solve the sharing problem
– Shared data in write-through region of the cache
[Figures: execution time FFT; memory accesses FFT.]
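One possible way to separate shared and private data at the source level is to place variables in named linker sections. The sketch below assumes a GCC-compatible compiler on an ELF target. The section names `.shared_wt` and `.private_wb`, the macros and the variables are hypothetical, and the linker script and MMU configuration that would map these sections to write-through and write-back regions are not shown.

```c
#include <assert.h>

/* Hypothetical section names. A linker script and the MMU page tables
 * (not shown) would have to map ".shared_wt" to a write-through region
 * and ".private_wb" to a write-back region. */
#if defined(__GNUC__) && defined(__ELF__)
#define SHARED_DATA  __attribute__((section(".shared_wt")))
#define PRIVATE_DATA __attribute__((section(".private_wb")))
#else
#define SHARED_DATA
#define PRIVATE_DATA
#endif

#define SCRATCH_LEN 64

SHARED_DATA  int samples_ready = 0;                 /* read by other processors */
PRIVATE_DATA int local_scratch[SCRATCH_LEN] = {0};  /* used by one processor only */

/* Store a sample in private scratch space and bump the shared counter. */
int record_sample(int idx, int v)
{
    assert(idx >= 0 && idx < SCRATCH_LEN);
    local_scratch[idx] = v;  /* private: may stay in the write-back cache */
    return ++samples_ready;  /* shared: intended for the write-through region */
}
```

On a host build the attributes only change where the linker places the variables; the caching behaviour itself depends entirely on the target's memory map.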
Experiments
Embedded the software cache coherence operations in POSIX threads calls
Clean and invalidate entire shared address range on each synchronization
– Entire cache
– Way with shared data
– Address range (MVA, modified virtual address)
Executed Splash2 applications
Low latency: 4 cycles / word
Each processor gets equal budget
– TDM arbitration on memory port
[Figure: evaluation platform with two ARM926EJ-S processors (PE1, PE2), each with instruction and data caches and its own instruction memory, a 16 MB shared memory behind a TDM arbiter, and peripherals: RS232, display, touchscreen, audio in/out, video in/out, timers.]
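Of the three granularities listed above, the address-range variant amounts to a loop over the cache lines that overlap the shared range. The sketch below illustrates that loop on a host. It assumes a 32-byte line, and `dcache_clean_invalidate_line` is a hypothetical stub that counts calls where a real target would issue the per-line maintenance operation.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 32u  /* assumed cache line size */

/* Hypothetical hook: on a real target this would clean and invalidate
 * the data cache line containing addr; here it only counts calls. */
unsigned lines_touched = 0;
void dcache_clean_invalidate_line(uintptr_t addr)
{
    assert(addr % LINE_BYTES == 0);
    lines_touched++;
}

/* Clean and invalidate every cache line overlapping [start, start + len). */
void dcache_clean_invalidate_range(uintptr_t start, uintptr_t len)
{
    if (len == 0)
        return;
    uintptr_t addr = start & ~(uintptr_t)(LINE_BYTES - 1);  /* align down */
    uintptr_t last = (start + len - 1) & ~(uintptr_t)(LINE_BYTES - 1);
    for (;; addr += LINE_BYTES) {
        dcache_clean_invalidate_line(addr);
        if (addr == last)
            break;
    }
}
```

The number of iterations depends only on the size and alignment of the range, not on what other processors do, which fits the bounded-overhead argument made for real-time use.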
Cost of cache coherence operations
Two cost types:
• Cost of the cache maintenance operation
• Cost of unnecessary invalidations
Speedup Splash2 applications
Speedup between 1.89 and 2.01
Increase of memory accesses
Protocol does not increase number of memory accesses significantly
Conclusion
Presented a cache coherence protocol that is suitable for real-time systems with a NoC and with off-the-shelf processors
Most important optimization is separation of shared and private data
Experimental results
– Speedup between 1.89 and 2.01
• Higher synchronization/computation ratio (e.g. with hardware floating point support): lower speed-up?
– Protocol does not significantly increase memory bandwidth requirements
Suitable for real-time systems because software cache coherency protocol overhead is predictable
Questions?
Backup slides
SWCC protocol in POSIX threads
[Figure: P1 and P2, connected through the NoC to memory, each execute reads / writes, pthread_mutex_lock(S), reads / writes (exclusive access to shared data), pthread_mutex_unlock(S), reads / writes.]
pthread_mutex_lock (acquire)
– Obtain lock
– Clean and invalidate
pthread_mutex_unlock (release)
– Clean
– Release lock
Existing cache coherence protocols
Transactional memory multiprocessor systems are based on speculation
– Suitable for real-time systems?
Hardware protocols
– Snooping in a NoC
• Requires processors to observe all memory accesses
• Writes to one location are serialized
[Figure: processors P1 ... Pn with caches on a bus to shared memory; the caches snoop cache-to-memory transactions on the bus.]
Existing cache coherence protocols
Hardware protocols
– Directories in a NoC
• A directory is consulted on memory accesses
• Increase in memory access latency
– Hardware protocols require support from processors
• Supported by off-the-shelf processors?
[Figure: processors P1 ... Pn with caches connected through an interconnect to shared memory and a directory.]
Existing cache coherence protocols
Software protocols
• Explicit coupling between synchronization and data structure
– Conditional invalidation [Tartalja, HICSS 1992]
– Shared regions [Sandhu, ACM SIGPLAN 1993]
• Private data cached, shared data not cached
– In [Petrot, DSD 2006]
[Figure: enter critical region (D): 1) check administration, 2) invalidate?; access D; exit critical region (D): clean if write-back.]