TRANSCRIPT
The Rise and Fall of Scratchpad Memories
Aviral Shrivastava Compiler Microarchitecture Lab
Arizona State University
Rise
Web page: aviral.lab.asu.edu

Remember: It is all about Memory!
} First Generation } ENIAC, UNIVAC - no memory
} Second Generation } IBM 7000 series - magnetic-core memory
} Third Generation } IBM 360 - semiconductor memory
} Fourth Generation } PC and onwards - VLSI memory
} First documented use of a cache } IBM 360* } "to bridge the speed gap between processor and memory"
} Since then, caches may be the most important feature in a processor } Itanium 2: cache and cache-like structures
} More than 90% of transistors by count, 70% of chip area, 50% of power, 80% of leakage

*IBM (June 1968), IBM System/360 Model 85 Functional Characteristics, Second Edition, A22-6916-1.
Computer Architecture and Networks
First Generation (1945-1958)…

1943-46: ENIAC (Electronic Numerical Integrator and Calculator), by J. Mauchly and J. Presper Eckert, was the first general-purpose electronic computer. Built to calculate trajectories for ballistic shells during WWII, it was programmed by setting switches and plugging and unplugging cables. It used 18,000 tubes, weighed 30 tons, and consumed 160 kilowatts of electrical power. The size of its numerical word was 10 decimal digits, and it could perform 5,000 additions and 357 multiplications per second.
SPMs for Power, Performance, and Area

[Figure: energy per access (nJ) vs. memory size (256 to 16384 bytes) for a scratchpad and for 2-way caches with 4 GB, 16 MB, and 1 MB address spaces.]
[Figure: block diagram of a cache (data array, tag array, tag comparators, muxes, address decoder) vs. an SPM (data array and address decoder only).]
} 40% less energy compared to a cache [Banakar02] } Absence of tag arrays, comparators, and muxes
} 34% less area compared to a cache of the same size [Banakar02] } Simple hardware design (only a memory array and address-decoding circuitry) } Simpler and cheaper to build and verify
SPMs became popular in embedded systems
} DSPs have used SPMs for a long time } The TI-99/4A, released in 1981, had 256 bytes of SPM
} Gaming consoles regularly use SPMs } SuperH in the Sega Saturn } PS1: could use SPM for stack data } PS2: 16KB SPM } PS3: each SPU has 256KB SPM
} Network and graphics processors } Intel network processors, and Nvidia Tesla
} Many embedded processors used cache line locking } Coldfire MCF5249, PowerPC440, MPC5554, ARM940, and ARM946E-S
} Several versions of ARM and Renesas have SPMs } ARM supports up to 4M of SPM
[Images: Sony PlayStation, Sega Saturn]
Using SPMs in Embedded Systems

[Figure: ARM memory architecture - the ARM core connects to an SPM and a cache; a DMA engine moves data between the SPM and global memory.]

• Programs work without using SPM - SPM is for optimization - Improves power and performance
• Placing frequently used data in SPM - Typically arrays - Using a linker script
All of this was done manually!
Compilers for using SPMs

} As applications became more complex, it was not easy to identify what should be mapped to the SPM
} Compiler techniques to use SPMs in embedded systems } Global: Panda97, Brockmeyer03, Avissar02, Gao05, Kandemir02, Steinke02, Grosslinger09 } Code: Janapsatya06, Egger06, Angiolini04 } Stack: Udayakumaran06, Dominguez05 } Heap: Dominguez05, McIlroy08
Compilers to use SPMs

} In general-purpose systems } Kennedy proposed using the SPM for register spills
} SPMs have largely remained in embedded systems } Not popular in general-purpose computing

Not much work - because caches keep programming and debugging simple
Times are a-changing…
Inevitable march to multi-cores
} Marketing needs } Moore's law
} Real needs } Temperature and power problems } Microarchitecture level: hotspots } Chip level: cooling efficiency } System level: total power consumption
} The only way to improve performance without much increase in power
} Multi-cores } Reduce design complexity } Spread heat and alleviate hotspots } Improve reliability through redundancy
But… how do you scale the memory?
} Coherent-cache architectures (current path) } Can still write programs like in the uni-core era, but } Coherency overheads do not scale } Tilera64 has a whole separate mesh network for coherence traffic } http://www.theinquirer.net/inquirer/news/1006963/tilera-releases-core-chip
} Non-coherent cache architectures } 48-core Single-chip Cloud Computer (SCC)
} Partly coherent } TI-6678 - vertically coherent, but horizontally non-coherent
} Hybrid } Locally coherent, but globally non-coherent
} Caches still consume a very significant amount of power
Software Managed Memory (SMM) Architecture

} Cores have small local memories (scratchpads) } A core can only access its local memory } Accesses to global memory go through explicit DMAs in the program } e.g., the IBM Cell architecture, which is in the Sony PS3

[Figure: the Cell chip - a PPE (Power Processor Element) and eight SPEs (Synergistic Processor Elements), each containing an SPU and a Local Store (LS), connected by the Element Interconnect Bus (EIB) to off-chip global memory.]
SMM Execution

} Task-based programming, MPI-like communication

Main core:
#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
  int speid, status;
  speid = spe_create_thread(&hello_spu);
}

Local core (the same program runs on each SPE):
#include <spu_mfcio.h>
int main(speid, argp) {
  printf("Hello world!\n");
}

} Extremely power-efficient computation } If all code and data fit into the local memory of the cores

Processor                          Fab   Frequency  GFlops  Power  Power Efficiency
Cell/B.E.                          45nm  3.2 GHz    230     50 W   4.6
Intel i7 4-core Bloomfield 965 XE  45nm  3.2 GHz    70      130 W  0.5
SMM memory organization

[Figure: ARM memory architecture - ARM core with SPM, cache, and DMA to global memory - vs. IBM Cell memory architecture - SPE with SPM and DMA to global memory only.]

ARM: SPM is for optimization. Cell: SPM is essential.
} Dynamic code/data management is needed } All code/data must be managed
Previous works are not directly applicable
How to manage data within a core?

Original code:
int global;
f1() {
  int a, b;
  global = a + b;
  f2();
}

Local-memory-aware code:
int global;
f1() {
  int a, b;
  DMA.fetch(global);
  global = a + b;
  DMA.writeback(global);
  DMA.fetch(f2);
  f2();
}
Data Management in LLM multi-cores
} Manage any amount of heap, stack, and code in a core of an LLM multi-core
} Global data } If small, can be permanently located in the local memory
} Stack data } 'Liveness' depends on the call path } Function stack size is known at compile time, but stack depth is not
} Heap data } Dynamic, and size can be unbounded
} Code } Statically linked
} Our strategy } Partition local memory into regions for each kind of data } Manage each kind of data in a constant amount of space
[Figure: local memory partitioned into code, global, stack, and heap regions.]
Stack Management: Problem

Function  Frame size (bytes)
F1        28
F2        40
F3        60
F4        54

Local memory size = 128 bytes

[Figure: F1, F2, and F3 fill the local memory stack region at offsets 28, 68, and 128 bytes; F4 (54 bytes) does not fit and must be placed in global memory at the global stack pointer.]
Stack Management: Solution

} Keep the active portion of the stack in the local memory } The granularity of stack frames is chosen to minimize management overhead
} It is a dynamic software technique
} fci(func_stack_size) } Check for available space in local memory } Move old frame(s) to global memory if needed
} fco() } Check whether the caller frame exists in local memory } Fetch it from global memory if it is absent

Runtime library:
void fci(int func_stack_size);
void fco();

C source:
F1() { int a, b; F2(); }
F2() { F3(); }
F3() { int j = 30; }

After compilation (GCC 4.1.1, runtime library linked into the executable):
F1() { int a, b; fci(F2); F2(); fco(F1); }
F2() { fci(F3); F3(); fco(F2); }
F3() { int j = 30; }
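The fci/fco protocol above can be sketched in plain C. This is a simplified model, not the actual runtime library: the global memory is simulated with an array, memcpy stands in for DMA, eviction is whole-region rather than per-frame, and this fco takes the frame size as an argument (the library's fco() takes none and recovers it from its own bookkeeping). All names here are illustrative.

```c
#include <assert.h>
#include <string.h>

#define LOCAL_STACK_SIZE 128   /* bytes of SPM reserved for stack */

/* Simulated memories: the local scratchpad and a much larger global memory. */
static char local_stack[LOCAL_STACK_SIZE];
static char global_stack[4096];

static int local_used = 0;   /* bytes of local stack currently occupied */
static int evicted = 0;      /* bytes already evicted to global memory */

/* fci: called before a function call; makes room for the callee's frame
 * by evicting the current local frames to global memory if needed. */
void fci(int func_stack_size) {
    assert(func_stack_size <= LOCAL_STACK_SIZE);
    while (local_used + func_stack_size > LOCAL_STACK_SIZE) {
        /* Evict the whole local stack region (coarse policy for brevity). */
        memcpy(global_stack + evicted, local_stack, local_used);
        evicted += local_used;
        local_used = 0;
    }
    local_used += func_stack_size;
}

/* fco: called after the function returns; if the caller's frames were
 * evicted, fetch them back from global memory. */
void fco(int func_stack_size) {
    local_used -= func_stack_size;
    if (local_used == 0 && evicted > 0) {
        int restore = evicted < LOCAL_STACK_SIZE ? evicted : LOCAL_STACK_SIZE;
        memcpy(local_stack, global_stack + evicted - restore, restore);
        evicted -= restore;
        local_used = restore;
    }
}
```

With the frame sizes from the problem slide (28, 40, 60, 54 bytes), the fourth fci triggers an eviction exactly as the figure shows, and the matching fco restores the evicted frames.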
Code Management: Problem
} Static compilation } Functions need to be linked before execution } Divide the code part of the SPM into regions } Map functions to these SPM regions } Functions in the same region replace each other
[Figure: the code section of local memory divided into several regions.]
Code Management: Solution

(a) Application call graph over F1, F2, and F3.
(b) Linker script:
SECTIONS {
  OVERLAY { F1.o F3.o }
  OVERLAY { F2.o }
}
(c) Local memory: one code region holds F1 and F3 (which replace each other), another holds F2.
(d) Global memory: holds global, stack, and heap data, plus the images of F1, F2, and F3.

} Number of regions and function-to-region mapping } Two extreme cases
} Need careful code placement - the problem is NP-complete } Minimize data transfer within the given space
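Since the exact mapping is NP-complete, practical tools use heuristics. The sketch below is a minimal first-fit-decreasing heuristic, not the authors' algorithm: it opens a new region while the SPM code budget allows (functions in distinct regions never evict each other), and otherwise shares the largest existing region. Real mappers also weigh call frequency and interference; all names and the tie-break policy here are assumptions.

```c
#include <assert.h>

#define MAX_FUNCS 8

/* Greedy function-to-region mapping. sizes[] must be sorted in descending
 * order and n <= MAX_FUNCS. A region's footprint is the size of its largest
 * function; the sum of footprints must not exceed the SPM code budget.
 * Returns the number of regions used, or -1 if even the largest function
 * does not fit. region_of[i] receives function i's region index. */
int map_to_regions(const int sizes[], int n, int budget, int region_of[]) {
    int region_size[MAX_FUNCS];
    int nregions = 0, used = 0;
    for (int i = 0; i < n; i++) {
        if (used + sizes[i] <= budget) {
            /* Budget allows a fresh region: no replacement cost later. */
            region_size[nregions] = sizes[i];
            used += sizes[i];
            region_of[i] = nregions++;
        } else if (nregions > 0) {
            /* Share the largest region; since sizes are descending, the
             * function fits there without growing the footprint. */
            int best = 0;
            for (int r = 1; r < nregions; r++)
                if (region_size[r] > region_size[best]) best = r;
            region_of[i] = best;
        } else {
            return -1; /* largest function alone exceeds the budget */
        }
    }
    return nregions;
}
```

For example, with function sizes {60, 40, 28} and a 100-byte code budget, the first two functions get their own regions and the third shares the 60-byte region.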
Heap Data Management

typedef struct {
  int id;
  float score;
} Student;

main() {
  for (i = 0; i < N; i++) {
    student[i] = malloc(sizeof(Student));
  }
  for (i = 0; i < N; i++) {
    student[i]->id = i;
  }
}

• malloc() } Allocates space in local memory
• A new malloc() } May need to evict older heap objects to global memory } May need to allocate more global memory

[Figure: heap size = 32 bytes, sizeof(Student) = 16 bytes; malloc1 and malloc2 fill the local heap (HP), so malloc3 forces eviction to global memory (GM_HP).]
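The eviction behavior in the figure can be sketched as a toy allocator. This is an illustrative model only: global memory is simulated with an array, memcpy stands in for DMA, and the policy evicts the whole local heap region at once rather than individual objects; spm_malloc and the variable names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define HEAP_SIZE 32            /* SPM bytes reserved for heap objects */

static char local_heap[HEAP_SIZE];
static char global_heap[1024];  /* stand-in for DMA-reachable global memory */
static int hp = 0;              /* local heap pointer (HP in the figure) */
static int gm_hp = 0;           /* global heap pointer (GM_HP in the figure) */

/* spm_malloc: allocate in the local heap region; when the region is full,
 * evict its current contents to global memory and start over. */
void *spm_malloc(int size) {
    if (size > HEAP_SIZE) return 0;     /* object can never fit locally */
    if (hp + size > HEAP_SIZE) {
        memcpy(global_heap + gm_hp, local_heap, hp);  /* DMA in real HW */
        gm_hp += hp;
        hp = 0;
    }
    void *p = &local_heap[hp];
    hp += size;
    return p;
}
```

Two 16-byte Student objects fill the 32-byte heap; the third allocation evicts both to global memory and reuses the local region, matching malloc1/malloc2/malloc3 in the figure.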
Pointer Threat: Problem

F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
F2(int *a) { fci(F3); F3(a); fco(F2); }
F3(int *a) { int j = 30; *a = 100; }

[Figure: with a 100-byte stack region, the frames F1 (50 bytes), F3 (30), and F2 (20) all fit, so F3's write through the pointer finds "a" in local memory. With a 70-byte stack region, F1's frame has been evicted to global memory by the time F3 runs; the stale local address now points into another frame, so F3 writes the wrong location and "a" gets the wrong value.]
Pointer Threat: Resolution

Before:
F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
F2(int *a) { fci(F3); F3(a); fco(F2); }
F3(int *a) { int j = 30; *a = 100; }

After:
F1() { int a = 5, b; fci(F2); F2(&a); fco(F1); }
F2(int *a) { fci(F3); F3(a); fco(F2); }
F3(int *a) { int j = 30; t = g2l(a); *t = 100; l2p(a, t); }

In general, every pointer access is translated:

Write:  *ptr = val;   becomes   tptr = _g2l(ptr); *tptr = val; l2p(ptr, tptr);
Read:   val = *ptr;   becomes   tptr = _g2l(ptr); val = *tptr;
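A minimal model of the g2l/l2p pair is sketched below, under stated assumptions: both memories are simulated arrays, memcpy stands in for DMA, and a single staging buffer holds one translated pointee at a time (the real library's signatures, shown earlier, also carry size and write flags and handle many pointees). The function and variable names are illustrative.

```c
#include <assert.h>
#include <string.h>

static char global_mem[256];   /* evicted stack frames live here */
static char local_buf[16];     /* small SPM staging buffer for pointees */
static char *cur_global = 0;   /* global address currently staged */

/* g2l: if ptr falls inside the (evicted) global region, copy the pointee
 * into the local buffer and return its local address; otherwise the
 * pointer is already local and is returned unchanged. */
void *g2l(void *ptr, int size) {
    char *p = (char *)ptr;
    if (p >= global_mem && p < global_mem + (int)sizeof global_mem) {
        memcpy(local_buf, p, size);     /* DMA read in real hardware */
        cur_global = p;
        return local_buf;
    }
    return ptr;
}

/* l2p: write the (possibly modified) local copy back to global memory. */
void l2p(void *ptr, void *local, int size) {
    if (local == (void *)local_buf && cur_global == (char *)ptr)
        memcpy(ptr, local, size);       /* DMA write in real hardware */
}
```

This reproduces the resolution slide: F3's `*a = 100` on an evicted `a` becomes a g2l, a local write, and an l2p write-back, so the global copy ends up holding 100.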
How to evict data to global memory?

} Can use DMA to transfer a heap object to global memory } DMA is very fast - no core-to-core communication
} But eventually, you can overwrite some other data } Need mediation

[Figure: several execution cores each call malloc and DMA data into the same global memory; the main core must mediate their allocations.]
Compiler and Runtime Infrastructure

} Our infrastructure includes: } a code overlay script generating tool, } a runtime library implementing the API, } a compiler that inserts API functions in the application.

[Figure: toolflow - SPE source passes through the API-inserting compiler to SPE objects; the code overlay script generating tool produces a linker script; the SPE linker combines the objects, linker script, and runtime library into the SPE executable.]

Runtime Library API:
void *_malloc(int size, int chunkSize);
void _free(void *ppeAddr);
void _fci(int func_stack_size);
void _fco();
void *_g2l(void *ppeAddr, int size, int wrFlag);
void *_l2g(void *ppeAddr, void *speAddr, int size);
Experimental Setup
} Sony PlayStation 3 running Fedora Core 9 Linux } Only 6 SPEs available
} MiBench benchmark suite and some other applications
} Runtimes are measured with spu_decrementer() for the SPEs and _mftb() for the PPE, provided with IBM Cell SDK 3.1
} Download the GCC compiler patch } http://aviral.lab.asu.edu/?p=95
Results

Enables execution for arbitrary stack sizes - but with quite high overheads!

int rcount(int n) {
  if (n == 0) return 0;
  return rcount(n-1) + 1;
}

[Figure: log of runtime (us), from 100 to 100000, vs. parameter n, for rcount without stack management and with our approach.]

At n = 3842, without management the program crashes: there is no space left in local memory for the stack. Our technique works for arbitrary stack sizes.
How does it work?
} Pretty badly! } Several programs run, but with high overhead } Several programs still do not run
} Remaining issues } The pointer problem } How to evict to global memory } Reducing overheads } Number of times API functions are called } Number of times DMA is performed
} Good news: it only gets better from here!
Reduce Data Transfer Overhead

malloc() {
  if (enough space in global memory)
    write heap data using DMA
  else
    request more space in global memory
}

[Figure: the execution thread asks the main core, via mailbox-based communication, to allocate at least S bytes of global memory (startAddr to endAddr); the execution core then DMA-writes the evicted heap data from local memory directly into that region.]
Improving Stack Management
} Opportunities to reduce repeated API calls by consolidation

Sequential calls: F1(); F2();
  Naive:        fci(F1); F1(); fco(F0); fci(F2); F2(); fco(F0);
  Consolidated: fci(max(F1,F2)); F1(); F2(); fco(F0);

Nested call: F1() { F2(); }
  Naive:        fci(F1); F1() { fci(F2); F2(); fco(F1); } fco(F0);
  Consolidated: fci(F1+F2); F1() { F2(); } fco(F0);

Call in loop: while (<condition>) { F1(); }
  Naive:        while (<condition>) { fci(F1); F1(); fco(F0); }
  Consolidated: fci(F1); while (<condition>) { F1(); } fco(F1);
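The payoff of the loop consolidation above is easy to quantify with a toy instrumented runtime. This sketch only counts API calls; fci/fco here are hypothetical stubs, not the real management functions.

```c
#include <assert.h>

static int api_calls = 0;

/* Stub management functions that only count invocations. */
void fci(int frame_size) { (void)frame_size; api_calls++; }
void fco(void)           { api_calls++; }
void F1(void)            { /* body irrelevant to the count */ }

/* Naive placement: manage F1's frame on every iteration. */
int run_naive(int iters) {
    api_calls = 0;
    for (int i = 0; i < iters; i++) { fci(64); F1(); fco(); }
    return api_calls;
}

/* Loop consolidation: F1's frame stays resident across iterations,
 * so one fci/fco pair covers the whole loop. */
int run_consolidated(int iters) {
    api_calls = 0;
    fci(64);
    for (int i = 0; i < iters; i++) F1();
    fco();
    return api_calls;
}
```

For a 100-iteration loop, the naive placement performs 200 management calls while the consolidated placement performs 2, which is the kind of reduction Table 3 reports between CSM and SSDM.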
Find optimal stack management points
} Can consolidate function frame movement } Do not need to move frames at every function call
} Formulate the problem as inserting cuts in the GCCFG } At each cut, dump the SPM contents into global memory

[Figure: GCCFG with function frame sizes - main 128, print 32, stream 1936, init 0, update 160, final 80, transform 352 - edges weighted by call counts (1, 10, 100), and four candidate cuts.]
More Stack Management Optimizations
} Movement of function frames } The biggest contributor } Consolidate management for multiple functions
} Pointer management } Reduce the number of times p2s is called } If a stack variable is used continuously, perform p2s only once } If the stack variable belongs to a function that is in the SPM, p2s is not needed
} Reduce the instructions in the management functions } SPM-level management is simpler } Less fragmentation, so less management code
Efficient Execution
} Very few fci and fco calls inserted } Fewer g2l calls } Fewer instructions executed at every management point
Overheads

Table 3: Number of sstore/fci and sload/fco calls
                   sstore/fci        sload/fco
Benchmark          CSM      SSDM     CSM      SSDM
BasicMath          40012    0        40012    0
Dijkstra           60365    202      60365    202
FFT                7190     8        7190     8
FFT inverse        7190     8        7190     8
SHA                57       2        57       2
String Search      503      143      503      143
Susan Edges        776      1        776      1
Susan Smoothing    112      2        112      2

Table 4: Code size of stack manager (in bytes)
        sstore/fci   sload/fco   l2g   g2l    wb
CSM     2404         1900        96    1024   1112
SSDM    184          176         24    120    80

Table 5: Dynamic instructions per function
        sstore/fci    sload/fco    l2g    g2l          wb
        F     NF      F     NF            hit   miss   hit   miss
CSM     180   100     148   95     24     45    76     60    34
SSDM    46    0       44    0      6      11    30     4     20
* F: the stack region is full when the function is called; NF: the stack region is enough for the incoming function frame.

Table 6: Number of pointer mgmt. function calls
             l2g              g2l               wb
             CSM     SSDM     CSM      SSDM     CSM     SSDM
BasicMath    37010   0        123046   0        89026   0
SHA          2       2        163      158      68      68
Edges        1       0        515      0        514     0
Smoothing    1       0        515      0        514     0
* Edges - Susan Edges; Smoothing - Susan Smoothing

…only four of our eight applications contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and totally eliminates the pointer management functions for the other three benchmarks.

More results: Besides comparing results between SSDM and CSM, we also examined the impact of different stack space sizes and the scalability of our heuristic, and discussed SSDM against a cache design. We found that (i) performance improves as we increase the space for stack data, (ii) SSDM scales well with different numbers of cores, and (iii) the penalty of management is much less with SSDM than with a hardware cache. The detailed results are presented in the Appendix, sections F, G, and H.

9. CONCLUSION
Scratchpad-based Multicore Processor (SMP) architectures are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since the majority of accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing library functions at function call sites. In addition, we proposed a heuristic algorithm, SSDM, to generate an efficient function placement. For pointers to stack data, a scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.

10. REFERENCES
[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper, Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259–267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6–26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231–234, 2011.
[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287–296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73–78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521–540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223–233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE J. Solid-State Circuits, 41(1):63–70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3–14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13–20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratchpad Memory Hierarchy Design and Management. In Proc. DAC, pages 628–633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690–695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329–338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082–1087, 2003.
[24] R. McIlroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31–40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115–125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: The Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682–704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478–1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238–243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719–1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135–148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472–511, 2006.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell MultiprocessorCommunication Network: Built for Speed. IEEE Micro,26(3):10–23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A CompilerApproach for Scratchpad Memory Management. In Proc. PACT,pages 329–338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based MemoryOrganization for Low Power Embedded Architectures. In Proc.DATE, pages 1082–1087, 2003.
[24] R. Mcllroy, P. Dickman, and J. Sventek. E�cient Dynamic HeapAllocation of Scratch-pad Memory. In ISMM, pages 31–40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocationfor Embedded Systems with A Compile-time-unknownScratch-pad Size. In Proc. CASES, pages 115–125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. O↵-chipMemory: the Data Partitioning Problem in EmbeddedProcessor-based Systems. In ACM TODAES, pages 682–704,2000.
[27] S. Park, H.-w. Park, and S. Ha. A Novel Technique to UseScratch-pad Memory for Stack Management. In Proc. DATE,pages 1478–1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, andJ. M. Mendias. An Integrated Hardware/Software Approach forRun-time Scratchpad Management. In Proc. DAC, pages238–243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solutionto Use Scratch Pads for Stack Data. IEEE TCAD,28(11):1719–1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc.of ISCA, pages 135–148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. DynamicAllocation for Scratch-pad Memory Using Compile-timeDecisions. ACM TECS, 5(2):472–511, 2006.
(a) SSDM against ILP and CSM. (b) Overhead comparison between SSDM and CSM.
Figure 6: SSDM reduces the data management overhead and improves performance.
We first utilized the PPE and 1 SPE available in the IBM Cell BE and compared our SSDM performance against the results from ILP and CSM [10]. The y-axis in Figure 6(a) shows the execution time of each benchmark normalized to its execution time with ILP. In this section, the number of function calls used in the Weighted Call Graph (WCG) is estimated from profile information. In the Appendix, Section D, we present a compile-time scheme to assign weights on edges. Experimental results show that both the non-profiling-based scheme and the profiling-based scheme achieve almost the same performance. As observed from Figure 6(a), our SSDM shows very similar performance to the ILP approach. This means our heuristic approaches the optimal solution when the benchmark has a small call graph. Compared to the CSM scheme, our SSDM demonstrates up to 19% and on average 11% performance improvement. The management overhead comprises i) the time for data transfers, and ii) the execution of the instructions in the management library functions. Figure 6(b) compares the execution time overhead of CSM and the proposed SSDM. Results show that when using CSM, an average of 11.3% of the execution time was spent on stack data management. With our new approach SSDM, the overhead is reduced to a mere 0.8% – a reduction of 13X. Next we break down the overhead and explain the effect of our techniques on the different components of the overhead:
Opt1 - Increase in the granularity of management: Due to our stack-space-level granularity of management, the number of DMA calls has been reduced. Table 2 shows the number of stack data management DMAs executed when we use CSM vs. the new technique SSDM. Note that no DMAs are required for BasicMath: the whole stack fits into the stack space allowed for this benchmark. Our technique performs well for all benchmarks except Dijkstra, because of the recursive function print_path in Dijkstra. CSM performs a DMA only when the stack space is full of recursive function instantiations, while we have to evict recursive frames every time, even with unused stack space. As a result, our technique does not perform very well on recursive programs. However, since many embedded programs are non-recursive, we have left optimizing for recursive functions as future work.
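The granularity argument can be illustrated with a toy simulation. This is a deliberate simplification: it models only nested calls with no returns, and the frame sizes, region size, and eviction policies are invented stand-ins, not the actual CSM/SSDM libraries.

```python
from collections import deque

def dmas_per_frame(call_sizes, region):
    """CSM-like: on overflow, evict the oldest frames one at a time, one DMA each."""
    resident = deque()          # sizes of frames currently in the stack region
    used = dmas = 0
    for size in call_sizes:     # assumes every frame fits in the region alone
        while used + size > region:
            used -= resident.popleft()
            dmas += 1           # one DMA per evicted frame
        resident.append(size)
        used += size
    return dmas

def dmas_whole_region(call_sizes, region):
    """SSDM-like: on overflow, evict the entire occupied region with one DMA."""
    used = dmas = 0
    for size in call_sizes:
        if used + size > region:
            dmas += 1           # a single bulk DMA evicts everything at once
            used = 0
        used += size
    return dmas

calls = [200] * 5               # five nested 200-byte frames, no returns
assert dmas_per_frame(calls, 512) == 3
assert dmas_whole_region(calls, 512) == 2
```

Bulk eviction trades a few extra bytes per transfer for far fewer DMA start-up costs, which is where the savings come from on DMA-based local memories.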
Opt2 - Not performing management when not absolutely needed: Our SSDM scheme reduces the number
Table 1: Benchmarks, their stack sizes, and the stack space we manage them in.

  Benchmark         Stack Size (bytes)   Stack Region Size (bytes)
  BasicMath                400                    512
  Dijkstra                1712                   1024
  FFT                      656                    512
  FFT inverse              656                    512
  SHA                     2512                   2048
  String Search            992                    768
  Susan Edges              832                    768
  Susan Smoothing          448                    256
of library function calls because of our compile-time analysis. In Table 3, we compare the number of sstore and sload function calls executed when using SSDM vs. fci and fco calls when using CSM. We can observe that our scheme executes far fewer library function calls. The main reason is that our SSDM considers the thrashing effect discussed in Section 4. Our approach tries to avoid placing the management library functions sstore and sload around functions containing a large number of call sites if possible, while CSM always inserts management functions at all function call sites.
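The thrashing argument reduces to a simple cost model: bracketing a call edge with sstore/sload costs roughly the edge's dynamic call count times the per-call instruction cost, so a placement heuristic prefers to cut cold edges. A sketch of that reasoning, using the per-call instruction counts from Table 5; the edge weights are invented:

```python
SSTORE_COST = 46   # dynamic instructions per sstore, region full (Table 5)
SLOAD_COST = 44    # dynamic instructions per sload, region full (Table 5)

def edge_cost(call_count):
    """Cost if sstore/sload bracket every dynamic traversal of this call edge."""
    return call_count * (SSTORE_COST + SLOAD_COST)

# A hot edge inside a loop vs. a cold edge on the path into the loop:
assert edge_cost(1) == 90            # cutting the cold edge: 90 instructions
assert edge_cost(10_000) == 900_000  # cutting the hot edge: thrashing
```

This is only the cost side of the heuristic; the real placement must also ensure the frames between two cuts fit in the stack region.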
Opt3 - Performing minimal work each time management is performed: Our management library is simpler, since we only need to maintain a linear queue, as compared to a circular queue in CSM. Table 4 shows the amount of local memory required by SSDM and CSM; our runtime library has a much smaller footprint than CSM's. This is important for performance, since stack frames get less space in the local memory if the library occupies more of it. The reason for the larger footprint of CSM is that it needs to handle memory fragmentation, while our SSDM does not have this problem.
Table 5 shows the cost of extra instructions per library function call. We ran all benchmarks with both schemes and approximately calculated the average additional instructions incurred by each library call. As demonstrated in Table 5, our SSDM performs much better than CSM. There is no cost in SSDM when the stack region is sufficient to hold the incoming frames; CSM, however, still needs extra instructions, since it checks the status of the stack region at runtime. A hit for g2l and wb means the accessed stack data resides in the local memory when the function is called, while a miss denotes that the stack data is not in the local memory. In the CSM approach, more instructions are needed for the hit case than the miss case in the function wb: on a miss the library directly writes the data back to global memory, while on a hit it must look up the management table to translate the address. More importantly, as the table itself occupies space and therefore needs to be managed, CSM may need additional instructions to transfer table entries.
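The hit/miss asymmetry of wb can be seen in a sketch of the write-back helper: a hit must translate the global address through the management table before touching the local copy, while a miss writes straight to global memory. The dictionaries and addresses below are hypothetical stand-ins for the real data structures:

```python
local_mem = {}    # local-store contents, keyed by local address
global_mem = {}   # global memory, keyed by global address
mgmt_table = {}   # global address -> local address for resident stack data

def wb(global_addr, value):
    if global_addr in mgmt_table:                   # hit: extra table lookup
        local_mem[mgmt_table[global_addr]] = value
    else:                                           # miss: direct write-back
        global_mem[global_addr] = value

mgmt_table[0x1000] = 0x40
wb(0x1000, 7)     # hit: the write lands in local memory
wb(0x2000, 9)     # miss: the write goes straight to global memory
assert local_mem[0x40] == 7 and global_mem[0x2000] == 9
```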
Opt4 - Not performing pointer management when not needed: SSDM manages stack pointers only where needed, while CSM may manage all pointers excessively. Table 6 shows the results of four benchmarks with and without the pointer optimization technique. They are the
Table 2: Comparison of number of DMAs

  Benchmark         CSM   SSDM
  BasicMath           0      0
  Dijkstra          108    364
  FFT                26     14
  FFT inverse        26     14
  SHA                10      4
  String Search     380    342
  Susan Edges         8      2
  Susan Smoothing    12      4
Table 3: Number of sstore/fci and sload/fco calls

                     sstore/fci        sload/fco
  Benchmark          CSM    SSDM       CSM    SSDM
  BasicMath        40012       0     40012       0
  Dijkstra         60365     202     60365     202
  FFT               7190       8      7190       8
  FFT inverse       7190       8      7190       8
  SHA                 57       2        57       2
  String Search      503     143       503     143
  Susan Edges        776       1       776       1
  Susan Smoothing    112       2       112       2
Table 4: Code size of stack manager (in bytes)

         sstore/fci   sload/fco   l2g    g2l     wb
  CSM       2404         1900      96    1024   1112
  SSDM       184          176      24     120     80
only four applications among our eight benchmarks that contain pointers to stack data. We can observe that our scheme slightly improves the performance of SHA, and completely eliminates the pointer management functions for the other three benchmarks.
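The core of pointer management is a pair of address rebasings: l2g maps a pointer into the local stack region onto the frame's home in global memory (so it stays meaningful after eviction), and g2l maps it back when the data is resident. A minimal sketch, with made-up base addresses:

```python
REGION_BASE = 0x3000        # base of the stack region in the local store
GLOBAL_BASE = 0x80000000    # home of the managed stack in global memory

def l2g(local_ptr):
    """Rebase a local stack pointer onto its global-memory home."""
    return GLOBAL_BASE + (local_ptr - REGION_BASE)

def g2l(global_ptr):
    """Rebase a global stack pointer back into the local store."""
    return REGION_BASE + (global_ptr - GLOBAL_BASE)

p = 0x3010                  # e.g., &x, where x lives in the stack region
assert l2g(p) == 0x80000010
assert g2l(l2g(p)) == p     # the round trip preserves the pointer
```

Opt4 then amounts to proving at compile time that a pointer's target stays resident, so these calls (and the wb write-backs) can be omitted entirely.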
More results: Besides comparing SSDM and CSM, we also examined the impact of different stack space sizes and the scalability of our heuristic, and compared SSDM against a hardware cache. We found that i) performance improves as we increase the space for stack data, ii) our SSDM scales well with different numbers of cores, and iii) the management penalty is much lower with SSDM than with a hardware cache. The detailed results are presented in the Appendix, Sections F, G, and H.
9. CONCLUSION
Scratchpad based Multicore Processor (SMP) architectures
are promising, since they are more scalable. However, since scratchpad memory cannot always accommodate the whole program, certain schemes are required to manage the code, global data, stack data, and heap data of the program to enable its execution. The main focus of this paper is on managing stack data, since a majority of the accesses in embedded applications may be to stack variables. Even assuming other data are properly managed by other schemes, managing stack data is especially challenging. In this paper, we formulated the problem of efficiently placing management library functions at function call sites. In addition, we proposed a heuristic algorithm called SSDM to generate an efficient function placement. As for pointers to stack data, a proper scheme was presented to reduce the management cost. Our experimental results show that SSDM generates function placements that lead to significant performance improvement compared to CSM.
10. REFERENCES
[1] "GCC Internals". http://gcc.gnu.org/onlinedocs/gccint/.
[2] Intel Core i7 Processor Extreme Edition and Intel Core i7 Processor Datasheet, Volume 1. White paper. Intel.
[3] Raw Performance: SiSoftware Sandra 2010 Pro (GFLOPS).
[4] SPU C/C++ Language Extensions. Technical report.
[5] The SCC Programmer's Guide. Technical report.
[6] Compilers: Principles, Techniques, and Tools. Addison Wesley, 1986.
[7] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, and M. Olivieri. A Post-Compiler Approach to Scratchpad Mapping of Code. In Proc. CASES, pages 259–267, 2004.
[8] O. Avissar, R. Barua, and D. Stewart. An Optimal Memory Allocation Scheme for Scratch-pad-based Embedded Systems. ACM TECS, 1(1):6–26, 2002.
[9] K. Bai and A. Shrivastava. Heap Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. CODES+ISSS, 2010.
[10] K. Bai, A. Shrivastava, and S. Kudchadker. Stack Data Management for Limited Local Memory (LLM) Multi-core Processors. In Proc. ASP-DAC, pages 231–234, 2011.
Table 5: Dynamic instructions per function

          sstore/fci    sload/fco    l2g       g2l          wb
           F     NF      F     NF             hit  miss   hit  miss
  CSM     180   100     148    95     24      45    76     60    34
  SSDM     46     0      44     0      6      11    30      4    20

  * F: stack region is full when the function is called; NF: stack region
    is large enough for the incoming function frame.
Table 6: Number of pointer mgmt. function calls

                   l2g             g2l              wb
                CSM   SSDM     CSM     SSDM     CSM    SSDM
  BasicMath   37010      0   123046       0   89026       0
  SHA             2      2      163     158      68      68
  Edges           1      0      515       0     514       0
  Smoothing       1      0      515       0     514       0

  * Edges = Susan Edges, Smoothing = Susan Smoothing
[11] M. A. Baker, A. Panda, N. Ghadge, A. Kadne, and K. S. Chatha. A Performance Model and Code Overlay Generator for Scratchpad Enhanced Embedded Processors. In Proc. CODES+ISSS, pages 287–296, 2010.
[12] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory: Design Alternative for Cache On-chip Memory in Embedded Systems. In Proc. CODES+ISSS, pages 73–78, 2002.
[13] A. Dominguez, S. Udayakumaran, and R. Barua. Heap Data Allocation to Scratch-pad Memory in Embedded Systems. J. Embedded Comput., 1(4):521–540, 2005.
[14] B. Egger, C. Kim, C. Jang, Y. Nam, J. Lee, and S. L. Min. A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In Proc. CASES, pages 223–233, 2006.
[15] B. Flachs et al. The Microarchitecture of the Synergistic Processor for a Cell Processor. IEEE J. Solid-State Circuits, 41(1):63–70, 2006.
[16] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A Free, Commercially Representative Embedded Benchmark Suite. In Proc. Workload Characterization, pages 3–14, 2001.
[17] A. Janapsatya, A. Ignjatovic, and S. Parameswaran. A Novel Instruction Scratchpad Memory Optimization Method Based on Concomitance Metric. In Proc. ASP-DAC, pages 612–617, 2006.
[18] S. C. Jung, A. Shrivastava, and K. Bai. Dynamic Code Mapping for Limited Local Memory Systems. In Proc. ASAP, pages 13–20, 2010.
[19] M. Kandemir and A. Choudhary. Compiler-directed Scratch Pad Memory Hierarchy Design and Management. In Proc. DAC, pages 628–633, 2002.
[20] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh. Dynamic Management of Scratch-pad Memory Space. In Proc. DAC, pages 690–695, 2001.
[21] M. Kistler, M. Perrone, and F. Petrini. Cell Multiprocessor Communication Network: Built for Speed. IEEE Micro, 26(3):10–23, May 2006.
[22] L. Li, L. Gao, and J. Xue. Memory Coloring: A Compiler Approach for Scratchpad Memory Management. In Proc. PACT, pages 329–338, 2005.
[23] M. Mamidipaka and N. Dutt. On-chip Stack Based Memory Organization for Low Power Embedded Architectures. In Proc. DATE, pages 1082–1087, 2003.
[24] R. McIlroy, P. Dickman, and J. Sventek. Efficient Dynamic Heap Allocation of Scratch-pad Memory. In Proc. ISMM, pages 31–40, 2008.
[25] N. Nguyen, A. Dominguez, and R. Barua. Memory Allocation for Embedded Systems with a Compile-time-unknown Scratch-pad Size. In Proc. CASES, pages 115–125, 2005.
[26] P. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. Off-chip Memory: The Data Partitioning Problem in Embedded Processor-based Systems. ACM TODAES, pages 682–704, 2000.
[27] S. Park, H.-W. Park, and S. Ha. A Novel Technique to Use Scratch-pad Memory for Stack Management. In Proc. DATE, pages 1478–1483, 2007.
[28] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. M. Mendias. An Integrated Hardware/Software Approach for Run-time Scratchpad Management. In Proc. DAC, pages 238–243, 2004.
[29] A. Shrivastava, A. Kannan, and J. Lee. A Software-only Solution to Use Scratch Pads for Stack Data. IEEE TCAD, 28(11):1719–1728, 2009.
[30] J. E. Smith. A Study of Branch Prediction Strategies. In Proc. ISCA, pages 135–148, 1981.
[31] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic Allocation for Scratch-pad Memory Using Compile-time Decisions. ACM TECS, 5(2):472–511, 2006.
C M L Web page: aviral.lab.asu.edu C M L
Minimal Overhead
} 4% of execution time spent on management
Comparison with Caches
} Cache miss penalty = # misses * miss latency
} SPM miss overhead = # API function calls * # instructions per API function
  + # DMA transfers * DMA delay (depends on DMA size)
} Cache is better when miss latency < 260 ps (260 ps = 0.86 * cycle time)
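The break-even comparison on this slide is plain arithmetic; a sketch with invented counts and latencies (none of these numbers come from the talk):

```python
def cache_penalty(misses, miss_latency_ns):
    """Total cache overhead: # misses * miss latency."""
    return misses * miss_latency_ns

def spm_overhead(api_calls, instrs_per_call, cycle_ns, dmas, dma_delay_ns):
    """Total SPM overhead: API-call instructions plus DMA delays."""
    return api_calls * instrs_per_call * cycle_ns + dmas * dma_delay_ns

ov = spm_overhead(api_calls=1000, instrs_per_call=46, cycle_ns=0.3,
                  dmas=100, dma_delay_ns=200)  # about 13800 + 20000 = 33800 ns
assert cache_penalty(50_000, 0.5) < ov   # fast misses: the cache wins
assert cache_penalty(50_000, 1.0) > ov   # slow misses: the managed SPM wins
```

Solving cache_penalty = spm_overhead for the miss latency gives the break-even point the slide quotes.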
Scalability of Management
} The main core does not choke on the memory requests from several cores
[Figure: Normalized runtime vs. number of cores (1–6) for basicmath, DFS, dijkstra, fft, invfft, MST, rbTree, sha, and stringsearch; normalized runtime stays between 0.97 and 1.07.]
Summary
} SPMs are an embedded system technology
} SPMs will be needed in general-purpose computing
} Will need to manage stack, heap, and code } Do not work without management
} Need different strategies for different data } Code (statically linked) } Stack (Circular) } Heap (High associativity)
} Overheads of Software Data Management } DMA overhead can be comparable or better than cache
} We have just begun – lots of room for improvement
[Diagram: local memory divided into Stack, Heap, Global, and Code regions.]
Communication Management
} No problem in MPI-style
} Communication is explicit
} For multi-threaded programs
} Replace load => coh_load(), and store => coh_store()
} Too much overhead for sequential consistency
} Weak Consistency models allow for efficient software implementations of coherency protocols
} Lazy vs. Eager
} Invalidate vs. Update
} Page-based granularity in multi-processor systems
} Need finer granularity in multi-‐cores
[Figure: Execution time in ms (log scale, 1–100,000) across benchmarks for CRC, LRC-inv, and LRC-upd.]
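The coh_load()/coh_store() replacement under a lazy, invalidate-based weak consistency model can be sketched as follows. This is a toy protocol at word granularity; the real design would batch at a coarser granularity, and every name here is illustrative:

```python
shared = {}     # "global" memory visible to all cores
local = {}      # this core's local copies
valid = set()   # addresses whose local copy is valid
dirty = set()   # locally modified addresses awaiting write-back

def coh_load(addr):
    if addr not in valid:          # lazily fetch on first use
        local[addr] = shared.get(addr, 0)
        valid.add(addr)
    return local[addr]

def coh_store(addr, value):
    local[addr] = value            # write locally, defer propagation
    valid.add(addr)
    dirty.add(addr)

def coh_release():
    for addr in dirty:             # make writes visible at the release point
        shared[addr] = local[addr]
    dirty.clear()

def coh_acquire():
    valid.clear()                  # invalidate: subsequent loads refetch

coh_store(0x10, 5)
assert shared.get(0x10) is None    # not globally visible before the release
coh_release()
assert shared[0x10] == 5           # visible after the release
```

Deferring propagation to coh_release() is exactly what makes the software implementation cheap: ordinary loads and stores pay only a membership test, and communication happens in bulk at synchronization points.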
Real-time Multicores
} Data and communication management in software } Better timing guarantees
} Managing data at its natural granularity simplifies WCET calculation
} e.g., find out how many instruction cache misses vs. find out how many function swaps
} Not only lower WCET, but a tighter WCET estimate
} Excellent platform for Real-time Systems
} Can tune the management policy to improve WCET
} Software Branch Hinting
} Close to 1-bit HBP performance
} Can place hints to achieve tighter WCET