t ra c e d e c o d e r a n d c o n tro l s o ftw a re mo d...
Post on 04-Aug-2018
214 Views
Preview:
TRANSCRIPT
EXPERIMENTAL ENVIRONMENT
Need for trace modules that can guarantee
• Unobtrusive program tracing in real-time
• High compression (low trace port bandwidth)
• Low complex implementation: narrow trace ports & small buffers
mlvCFiat
• Trace out only data cache read misses or read hits that occur for the first time
• Relatively low-complexity
• Improvement: 15-33 times when N=1 & 14-20 times when N=8
• Variable encoding improves TPB 8-9% relative to base encoding
CONCLUSIONS
RESULTSmlvCFiat
Indispensable in many aspects of many life
Trends: more complex and smaller, running sophisticated software
SW development cost exceeds 80% of the total cost
Debugging SW is hard and time-consuming: developers spend 50%-75% of time in debugging; increases as we transition to multicores
Estimated cost of SW bugs and glitches: $20-$60 billion annually
Need for better tools to help SW/FW developers find bugs faster
On-the-Fly Load Data Value Tracing in Multicores
Mounika Ponugoti, Amrish K. Tewar, Aleksandar Milenković
University of Alabama in Huntsville, Huntsville, Alabama, U.S.A.
This work is supported in part by an NSF grant CNS-1217470.
Dynamic Trace Port Bandwidth Analysis
# Cores N=1 N=2 N=4 N=8
Mechan NX_b CF_e CF_e NX_b CF_e CF_e NX_b CF_e CF_e NX_b CF_e CF_e
Config - CS16 CS32 - CS16 CS32 - CS16 CS32 - CS16 CS32
barnes 15.03 2.21 0.79 15.31 2.30 1.16 15.59 2.39 1.47 15.86 2.44 1.77
cholesky 15.35 1.86 0.75 16.21 1.30 0.79 15.87 0.95 0.62 15.59 0.61 0.44
fft 10.65 2.57 1.48 10.84 2.62 1.50 11.02 2.65 1.52 11.19 2.67 1.54
fmm 8.82 0.36 0.23 8.96 0.37 0.24 9.14 0.38 0.25 9.33 0.38 0.26
lu 11.88 0.58 0.57 12.07 0.61 0.57 12.27 0.62 0.58 12.47 0.65 0.45
radiosity 12.11 0.25 0.09 12.36 0.55 0.45 12.59 0.56 0.44 12.58 0.63 0.54
radix 13.41 0.75 0.54 13.75 1.64 1.44 14.09 1.73 1.52 14.54 1.79 1.57
raytrace 15.17 1.06 0.34 15.45 1.28 0.61 15.73 1.38 0.71 16.01 1.54 0.90
water-ns 10.64 0.49 0.22 10.81 0.52 0.25 10.98 0.56 0.39 11.15 0.56 0.42
water-sp 11.38 0.07 0.05 11.55 0.07 0.06 11.73 0.08 0.07 11.90 0.09 0.08
Total 12.34 0.80 0.37 12.63 0.92 0.57 12.89 0.93 0.61 13.17 0.92 0.66
Data Tracing: Why Is It Challenging?
Format of Trace Messages
In mlvCFiat length of LV field depends on granularity
Encoding parameters (h0,h1) = (4, 2), (i0, i1) = (2, 2)
Speedup TPB(NX_b)/TPB(compressed NX_b)
ABSTRACT
Software testing and debugging of modern multicore-based embeddedsystems is a challenging proposition because of growing hardware andsoftware complexity, increased integration, and tightening time-to-market. To find more bugs faster, software developers of real-timeembedded systems increasingly rely on on-chip trace and debugresources, including hefty on-chip buffers and wide trace ports. However,these resources often offer limited visibility of the system, increase thesystem cost, and do not scale well with a growing number of cores. Thispaper introduces mlvCFiat, a hardware/software mechanism for capturingand filtering load data value traces in multicores. It relies on first-accesstracking in data caches and equivalent modules in the software debuggerto significantly reduce the number of trace events streamed out of thetarget platform. Our experimental evaluation explores the effectiveness ofthe proposed technique as a function of cache sizes, encodingmechanism, and the number of cores. The results show that mlvCFiatsignificantly reduces the total trace port bandwidth. The improvementsrelative to the existing Nexus-like load data value tracing range from 15 to33 times for a single core and from 14 to 20 times for an octa core.
ACKNOWLEDGMENT
Speedup TPB(NX_b)/TPB(CF_e)# Cores N=1 N=2 N=4 N=8Config CS16 CS32 CS16 CS32 CS16 CS32 CS16 CS32barnes 6.8 19.1 6.7 13.1 6.5 10.6 6.5 9.0cholesky 8.2 20.4 12.5 20.4 16.7 25.5 25.7 35.1fft 4.2 7.2 4.1 7.2 4.2 7.3 4.2 7.3fmm 24.2 37.8 24.1 37.2 24.2 36.7 24.4 36.6lu 20.5 20.7 19.8 21.3 19.8 21.2 19.2 27.6radiosity 47.7 128.8 22.3 27.6 22.6 28.4 19.8 23.4radix 17.9 24.8 8.4 9.6 8.1 9.3 8.1 9.3raytrace 14.3 44.2 12.1 25.4 11.4 22.1 10.4 17.9water-ns 21.7 47.4 20.8 42.8 19.7 28.0 20.0 26.6water-sp 168.1 210.3 158.5 189.4 147.3 170.9 136.9 156.2Total 15.3 33.4 13.7 22.3 13.9 21.0 14.3 20.1
# Cores N=1 N=2 N=4 N=8Config Split Unified Split Unified Split Unified Split Unifiedbarnes 2.10 1.38 1.78 1.30 1.66 1.24 1.58 1.27cholesky 6.73 1.74 3.85 1.67 3.53 1.85 4.13 2.50fft 1.93 1.39 1.79 1.36 1.68 1.30 1.66 1.37fmm 4.95 1.95 3.73 1.85 3.06 1.58 2.75 1.59lu 5.93 1.56 3.58 1.54 3.14 1.42 3.06 1.77radiosity 3.86 1.63 2.50 1.54 2.10 1.37 1.95 1.48radix 4.23 2.02 3.05 1.81 2.08 1.46 1.96 1.42raytrace 3.88 1.51 2.61 1.47 2.26 1.32 2.08 1.38water-ns 2.69 1.41 2.08 1.38 1.94 1.27 1.87 1.35water-sp 3.03 1.37 2.40 1.36 2.11 1.26 2.02 1.39Total 3.33 1.54 2.52 1.49 2.21 1.38 2.18 1.52
Trends in Embedded Systems
Tracing and Debugging Challenges
What is my system doing now? Limited visibility of internal signals
• High system complexity + high operating frequencies
• Limited bandwidth for debugging
Traditional approach to debugging: stop the processor & examine or change the system state
• Slow and expensive
• Not practical to use breakpoints in real-time embedded systems
• May change the sequence of events on target platform
• Does not scale well to multicores
Include dedicated on-chip trace and debug infrastructure
Tracing and Debugging in Multicores
Metrics
• 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑏𝑝𝑖 =𝑇𝑜𝑡𝑎𝑙 𝑡𝑟𝑎𝑐𝑒 𝑠𝑖𝑧𝑒 𝑖𝑛 𝑏𝑖𝑡𝑠
𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑒𝑥𝑒𝑐𝑢𝑡𝑒𝑑 𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠
• 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑏𝑝𝑐 =𝑇𝑜𝑡𝑎𝑙 𝑡𝑟𝑎𝑐𝑒 𝑠𝑖𝑧𝑒 𝑖𝑛 𝑏𝑖𝑡𝑠
𝐸𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 𝑖𝑛 𝑐𝑙𝑜𝑐𝑘 𝑐𝑦𝑐𝑙𝑒𝑠
Hardware Structures on Target Platform
Operations on Target Platform
Operations in Software Debugger
System Interconnect
TracePortMulticore
SoC
Trace & Debug Interconnect
On-chipTrace Buffer
Trace PortInterface
Inter-connect
TraceModule DSP
Core
TraceModule
DMACore
TraceModule
Debug & TraceControl
Software Debugger System View
Binaries
Multicore Instruction Set Simulator
GUI
. . . . . .
Core 0mlvCFiat Model
. . .. . .Nexus Trace
Software Debugger(s)
in HostWorkstation
Trace Probe
Host InterfaceBuffers(~GB)Target
Interface
Trace Decoder and Control Software Module
CPUCore i
TraceModule
mlvCFiat
CPUCore 0
TraceModule
mlvCFiat
CPUCore N-1
TraceModule
mlvCFiat
Core imlvCFiat Model
Core N-1mlvCFiat Model
IEEE Standard: Nexus 5001
• Class 1: Traditional run-control debugging
• Class 2: Captures control-flow tracing in near real time
• Class 3: Captures data tracing in near real time
• Class 4: Emulating memory and I/O through a trace port
...
Set/Reset FA flags
Trace Message B
uffer
Data Cache
DC Hit
Data Address
FA Hit
Tag FA Flags
0
1
q-1
index
Ti.fahCnt
way 0
way k-1
Load Value
Ti.PCC
Current Clock
-
Ti.dCC
Ti.LV
...
(a) Nexus-like encoding (NX_b) Legend:dCC Clock Cycle (differential enc.)Ti Thread/Core ID - élog2Nù bitsLV Load Value fahCnt First Access Hit Counterh0, h1 Chunk Sizes for CCi0, i1 Chunk Sizes for fahCnt
(b) mlvCFiat baseline encoding (CF_b) (c) mlvCFiat variable encoding (CF_e)
LVTidCC
dCC[0:7]8 b
dCC[8:15]8 b
C1 b
C 1 b
...
fahCnt[0:7] 8 b
fahCnt[8:15] 8 b
...C
1 bC
1 b
dCC fahCnt LVTi
dCC[0:7]8 b
dCC[8:15]8 b
C1 b
C 1 b
...
fahCnt LVTidCC
fahCnt[0:i0-1] i0 b
fahCnt[i0:i0+i1-1] i1 b
C 1 b
...C
1 b
dCC[0:h0-1] h0 b
C1 b
dCC[h0:h0+h1-1] h1 b
C 1 b
...
L1I L1D
...
L2 Cache
Main Memory
Network L1-L2
Network L2-MM
CS16:L1D/L1I cache size: 16 KBL2 cache size: N*64 KB
CS32:L1D/L1I cache size: 32 KBL2 cache size: N*128 KB
L1D/L1I hit time: 4 ccL1D/L1I associativity: 4-wayL2 hit time: 12 ccL2 associativity: 16-wayCache block size: 32 BFirst-access granularity: 4 BMemory latency: 100 cc
Core 0
L1I L1D
Core 1
L1I L1D
Core N-1
TmTrace: Software Timed Trace Generator
32 bit Target
Application
Application Input
Number Of Threads
Application Output
Multi2Sim Configuration
Files
TmTraceFlags
Performance Statistics
tmlsTrace
mlvCFiatConfiguration
tmlvCFiatTrace
Hardware traces
NX_b
CF_b
CF_e
Multi2Sim TmTrace
mlvCFiatSimulator
Trace Filtering
FixedEncoding
FixedEncoding
VariableEncoding
0
4
8
12
16
1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8
barnes cholesky fft fmm lu radiosity radix raytrace water-ns water-sp Total
(a) Trace Port Bandwidth [bpi, bits per instruction]LV Ti dCC
0
10
20
30
40
1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8
barnes cholesky fft fmm lu radiosity radix raytrace water-ns water-sp Total
(b) Trace Port Bandwidth [bpc, bits per clock cycle]LV Ti dCC
mlvCFiat – multicore load value cache first access tracking
• Hardware/software mechanism for filtering load data value traces
Target platform: L1D caches are extended to include first-access tracking bits (FA bits)
Software debugger maintains software copies of L1D caches
Same updating policies, same cache organization as in hardware
Instruction set simulator requires load data values from the target platform only for reads that cannot be inferred
0
1
2
3
4
5
CF_b(CS16) CF_b(CS32) CF_e(CS16) CF_e(CS32)
Trace port bandwidth (bpc)N=1 N=2 N=4 N=8
0
5
10
15
20
25
30
35
40
45
NX_b
Trace port bandwidth (bpc)
N=1 N=2 N=4 N=8
0
4
8
12
N=1 N=2 N=4 N=8
NX_b
Trace port bandwith [bpi]LV Ti dCC
0.0
0.2
0.4
0.6
0.8
1.0
1.2
N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8
CF_b(CS16) CF_b(CS32) CF_e(CS16) CF_e(CS32)
Trace port bandwidth [bpi] LV fahCnt Ti dCC
Trace Port Bandwidth [bpc]
Trace Port Bandwidth [bpi]
BACKGROUND AND MOTIVATION
What do we need to replay a program offline?
• Instruction Set Simulator + program binary
• Initial platform state (GPRs + SPRs)
• Exceptions Trace + Load Data Value Trace captured on the platform
0.00001
0.0001
0.001
0.01
0.1
1
10
100
0 100 200 300 400 500 600 700 800
Clock cycle [x106]
raytrace: Trace port bandwidth in bpc as a function of time
NX_b(CS16) NX_b(CS16).gz1 CF_e(CS16) CF_e(CS32)
0.0001
0.001
0.01
0.1
1
10
100
0 20 40 60 80 100 120 140 160 180 200
Clock cycle [x106]
water-ns: Trace port bandwidth in bpc as a function of time
NX_b(CS16) NX_b(CS16).gz1 CF_e(CS16) CF_e(CS32)
Ti.fahCnt--
Ti: Memory Read
Ti.fahCnt > 0?
Read n Bytes From Trace Msg.Update SW Cache
Get New Trace Msg.Load Ti.fahCnt
Lookup SW CacheGet Data From SW Cache
END
Y
N
Ti: Cache Lookup
Ti: Memory Write
HIT?
N
Y
END
Clear FA Bits
Acquire OwnershipUpdate The SW Cache
Set Corresponding FA Bits
Invalidate The Cache BlockClear FA Flags
END
External Invalidation
Ti: Cache Lookup
Ti: Memory Read
HIT?
Replace Cache BlockClear FA Bits
Y
Ti.fahCnt++
END
N
Corresponding FA Bits Set?
Y
N
Emit Trace Msg. [Ti.dCC, Ti, Ti.fahCnt, Ti.LV]Set Corresponding FA Bits
Ti.fahCnt = 0
Ti: Cache Lookup
Ti: Memory Write
HIT?
N
Y
END
Replace Cache BlockClear FA Bits
Acquire OwnershipUpdate The Cache
Set Corresponding FA
Invalidate The Cache blockClear FA flags
External Invalidation
END
top related