Transcript

EXPERIMENTAL ENVIRONMENT

Need for trace modules that can guarantee

• Unobtrusive program tracing in real-time

• High compression (low trace port bandwidth)

• Low complex implementation: narrow trace ports & small buffers

mlvCFiat

• Trace out only data cache read misses or read hits that occur for the first time

• Relatively low-complexity

• Improvement: 15-33 times when N=1 & 14-20 times when N=8

• Variable encoding improves TPB 8-9% relative to base encoding

CONCLUSIONS

RESULTSmlvCFiat

Indispensable in many aspects of many life

Trends: more complex and smaller, running sophisticated software

SW development cost exceeds 80% of the total cost

Debugging SW is hard and time-consuming: developers spend 50%-75% of time in debugging; increases as we transition to multicores

Estimated cost of SW bugs and glitches: $20-$60 billion annually

Need for better tools to help SW/FW developers find bugs faster

On-the-Fly Load Data Value Tracing in Multicores

Mounika Ponugoti, Amrish K. Tewar, Aleksandar Milenković

University of Alabama in Huntsville, Huntsville, Alabama, U.S.A.

This work is supported in part by an NSF grant CNS-1217470.

Dynamic Trace Port Bandwidth Analysis

# Cores N=1 N=2 N=4 N=8

Mechan NX_b CF_e CF_e NX_b CF_e CF_e NX_b CF_e CF_e NX_b CF_e CF_e

Config - CS16 CS32 - CS16 CS32 - CS16 CS32 - CS16 CS32

barnes 15.03 2.21 0.79 15.31 2.30 1.16 15.59 2.39 1.47 15.86 2.44 1.77

cholesky 15.35 1.86 0.75 16.21 1.30 0.79 15.87 0.95 0.62 15.59 0.61 0.44

fft 10.65 2.57 1.48 10.84 2.62 1.50 11.02 2.65 1.52 11.19 2.67 1.54

fmm 8.82 0.36 0.23 8.96 0.37 0.24 9.14 0.38 0.25 9.33 0.38 0.26

lu 11.88 0.58 0.57 12.07 0.61 0.57 12.27 0.62 0.58 12.47 0.65 0.45

radiosity 12.11 0.25 0.09 12.36 0.55 0.45 12.59 0.56 0.44 12.58 0.63 0.54

radix 13.41 0.75 0.54 13.75 1.64 1.44 14.09 1.73 1.52 14.54 1.79 1.57

raytrace 15.17 1.06 0.34 15.45 1.28 0.61 15.73 1.38 0.71 16.01 1.54 0.90

water-ns 10.64 0.49 0.22 10.81 0.52 0.25 10.98 0.56 0.39 11.15 0.56 0.42

water-sp 11.38 0.07 0.05 11.55 0.07 0.06 11.73 0.08 0.07 11.90 0.09 0.08

Total 12.34 0.80 0.37 12.63 0.92 0.57 12.89 0.93 0.61 13.17 0.92 0.66

Data Tracing: Why Is It Challenging?

Format of Trace Messages

In mlvCFiat length of LV field depends on granularity

Encoding parameters (h0,h1) = (4, 2), (i0, i1) = (2, 2)

Speedup TPB(NX_b)/TPB(compressed NX_b)

ABSTRACT

Software testing and debugging of modern multicore-based embeddedsystems is a challenging proposition because of growing hardware andsoftware complexity, increased integration, and tightening time-to-market. To find more bugs faster, software developers of real-timeembedded systems increasingly rely on on-chip trace and debugresources, including hefty on-chip buffers and wide trace ports. However,these resources often offer limited visibility of the system, increase thesystem cost, and do not scale well with a growing number of cores. Thispaper introduces mlvCFiat, a hardware/software mechanism for capturingand filtering load data value traces in multicores. It relies on first-accesstracking in data caches and equivalent modules in the software debuggerto significantly reduce the number of trace events streamed out of thetarget platform. Our experimental evaluation explores the effectiveness ofthe proposed technique as a function of cache sizes, encodingmechanism, and the number of cores. The results show that mlvCFiatsignificantly reduces the total trace port bandwidth. The improvementsrelative to the existing Nexus-like load data value tracing range from 15 to33 times for a single core and from 14 to 20 times for an octa core.

ACKNOWLEDGMENT

Speedup TPB(NX_b)/TPB(CF_e)# Cores N=1 N=2 N=4 N=8Config CS16 CS32 CS16 CS32 CS16 CS32 CS16 CS32barnes 6.8 19.1 6.7 13.1 6.5 10.6 6.5 9.0cholesky 8.2 20.4 12.5 20.4 16.7 25.5 25.7 35.1fft 4.2 7.2 4.1 7.2 4.2 7.3 4.2 7.3fmm 24.2 37.8 24.1 37.2 24.2 36.7 24.4 36.6lu 20.5 20.7 19.8 21.3 19.8 21.2 19.2 27.6radiosity 47.7 128.8 22.3 27.6 22.6 28.4 19.8 23.4radix 17.9 24.8 8.4 9.6 8.1 9.3 8.1 9.3raytrace 14.3 44.2 12.1 25.4 11.4 22.1 10.4 17.9water-ns 21.7 47.4 20.8 42.8 19.7 28.0 20.0 26.6water-sp 168.1 210.3 158.5 189.4 147.3 170.9 136.9 156.2Total 15.3 33.4 13.7 22.3 13.9 21.0 14.3 20.1

# Cores N=1 N=2 N=4 N=8Config Split Unified Split Unified Split Unified Split Unifiedbarnes 2.10 1.38 1.78 1.30 1.66 1.24 1.58 1.27cholesky 6.73 1.74 3.85 1.67 3.53 1.85 4.13 2.50fft 1.93 1.39 1.79 1.36 1.68 1.30 1.66 1.37fmm 4.95 1.95 3.73 1.85 3.06 1.58 2.75 1.59lu 5.93 1.56 3.58 1.54 3.14 1.42 3.06 1.77radiosity 3.86 1.63 2.50 1.54 2.10 1.37 1.95 1.48radix 4.23 2.02 3.05 1.81 2.08 1.46 1.96 1.42raytrace 3.88 1.51 2.61 1.47 2.26 1.32 2.08 1.38water-ns 2.69 1.41 2.08 1.38 1.94 1.27 1.87 1.35water-sp 3.03 1.37 2.40 1.36 2.11 1.26 2.02 1.39Total 3.33 1.54 2.52 1.49 2.21 1.38 2.18 1.52

Trends in Embedded Systems

Tracing and Debugging Challenges

What is my system doing now? Limited visibility of internal signals

• High system complexity + high operating frequencies

• Limited bandwidth for debugging

Traditional approach to debugging: stop the processor & examine or change the system state

• Slow and expensive

• Not practical to use breakpoints in real-time embedded systems

• May change the sequence of events on target platform

• Does not scale well to multicores

Include dedicated on-chip trace and debug infrastructure

Tracing and Debugging in Multicores

Metrics

• 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑏𝑝𝑖 =𝑇𝑜𝑡𝑎𝑙 𝑡𝑟𝑎𝑐𝑒 𝑠𝑖𝑧𝑒 𝑖𝑛 𝑏𝑖𝑡𝑠

𝑇𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑒𝑥𝑒𝑐𝑢𝑡𝑒𝑑 𝑖𝑛𝑠𝑡𝑟𝑢𝑐𝑡𝑖𝑜𝑛𝑠

• 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝑏𝑝𝑐 =𝑇𝑜𝑡𝑎𝑙 𝑡𝑟𝑎𝑐𝑒 𝑠𝑖𝑧𝑒 𝑖𝑛 𝑏𝑖𝑡𝑠

𝐸𝑥𝑒𝑐𝑢𝑡𝑖𝑜𝑛 𝑡𝑖𝑚𝑒 𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑑 𝑖𝑛 𝑐𝑙𝑜𝑐𝑘 𝑐𝑦𝑐𝑙𝑒𝑠

Hardware Structures on Target Platform

Operations on Target Platform

Operations in Software Debugger

System Interconnect

TracePortMulticore

SoC

Trace & Debug Interconnect

On-chipTrace Buffer

Trace PortInterface

Inter-connect

TraceModule DSP

Core

TraceModule

DMACore

TraceModule

Debug & TraceControl

Software Debugger System View

Binaries

Multicore Instruction Set Simulator

GUI

. . . . . .

Core 0mlvCFiat Model

. . .. . .Nexus Trace

Software Debugger(s)

in HostWorkstation

Trace Probe

Host InterfaceBuffers(~GB)Target

Interface

Trace Decoder and Control Software Module

CPUCore i

TraceModule

mlvCFiat

CPUCore 0

TraceModule

mlvCFiat

CPUCore N-1

TraceModule

mlvCFiat

Core imlvCFiat Model

Core N-1mlvCFiat Model

IEEE Standard: Nexus 5001

• Class 1: Traditional run-control debugging

• Class 2: Captures control-flow tracing in near real time

• Class 3: Captures data tracing in near real time

• Class 4: Emulating memory and I/O through a trace port

...

Set/Reset FA flags

Trace Message B

uffer

Data Cache

DC Hit

Data Address

FA Hit

Tag FA Flags

0

1

q-1

index

Ti.fahCnt

way 0

way k-1

Load Value

Ti.PCC

Current Clock

-

Ti.dCC

Ti.LV

...

(a) Nexus-like encoding (NX_b) Legend:dCC Clock Cycle (differential enc.)Ti Thread/Core ID - élog2Nù bitsLV Load Value fahCnt First Access Hit Counterh0, h1 Chunk Sizes for CCi0, i1 Chunk Sizes for fahCnt

(b) mlvCFiat baseline encoding (CF_b) (c) mlvCFiat variable encoding (CF_e)

LVTidCC

dCC[0:7]8 b

dCC[8:15]8 b

C1 b

C 1 b

...

fahCnt[0:7] 8 b

fahCnt[8:15] 8 b

...C

1 bC

1 b

dCC fahCnt LVTi

dCC[0:7]8 b

dCC[8:15]8 b

C1 b

C 1 b

...

fahCnt LVTidCC

fahCnt[0:i0-1] i0 b

fahCnt[i0:i0+i1-1] i1 b

C 1 b

...C

1 b

dCC[0:h0-1] h0 b

C1 b

dCC[h0:h0+h1-1] h1 b

C 1 b

...

L1I L1D

...

L2 Cache

Main Memory

Network L1-L2

Network L2-MM

CS16:L1D/L1I cache size: 16 KBL2 cache size: N*64 KB

CS32:L1D/L1I cache size: 32 KBL2 cache size: N*128 KB

L1D/L1I hit time: 4 ccL1D/L1I associativity: 4-wayL2 hit time: 12 ccL2 associativity: 16-wayCache block size: 32 BFirst-access granularity: 4 BMemory latency: 100 cc

Core 0

L1I L1D

Core 1

L1I L1D

Core N-1

TmTrace: Software Timed Trace Generator

32 bit Target

Application

Application Input

Number Of Threads

Application Output

Multi2Sim Configuration

Files

TmTraceFlags

Performance Statistics

tmlsTrace

mlvCFiatConfiguration

tmlvCFiatTrace

Hardware traces

NX_b

CF_b

CF_e

Multi2Sim TmTrace

mlvCFiatSimulator

Trace Filtering

FixedEncoding

FixedEncoding

VariableEncoding

0

4

8

12

16

1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8

barnes cholesky fft fmm lu radiosity radix raytrace water-ns water-sp Total

(a) Trace Port Bandwidth [bpi, bits per instruction]LV Ti dCC

0

10

20

30

40

1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8 1 2 4 8

barnes cholesky fft fmm lu radiosity radix raytrace water-ns water-sp Total

(b) Trace Port Bandwidth [bpc, bits per clock cycle]LV Ti dCC

mlvCFiat – multicore load value cache first access tracking

• Hardware/software mechanism for filtering load data value traces

Target platform: L1D caches are extended to include first-access tracking bits (FA bits)

Software debugger maintains software copies of L1D caches

Same updating policies, same cache organization as in hardware

Instruction set simulator requires load data values from the target platform only for reads that cannot be inferred

0

1

2

3

4

5

CF_b(CS16) CF_b(CS32) CF_e(CS16) CF_e(CS32)

Trace port bandwidth (bpc)N=1 N=2 N=4 N=8

0

5

10

15

20

25

30

35

40

45

NX_b

Trace port bandwidth (bpc)

N=1 N=2 N=4 N=8

0

4

8

12

N=1 N=2 N=4 N=8

NX_b

Trace port bandwith [bpi]LV Ti dCC

0.0

0.2

0.4

0.6

0.8

1.0

1.2

N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8 N=1 N=2 N=4 N=8

CF_b(CS16) CF_b(CS32) CF_e(CS16) CF_e(CS32)

Trace port bandwidth [bpi] LV fahCnt Ti dCC

Trace Port Bandwidth [bpc]

Trace Port Bandwidth [bpi]

BACKGROUND AND MOTIVATION

What do we need to replay a program offline?

• Instruction Set Simulator + program binary

• Initial platform state (GPRs + SPRs)

• Exceptions Trace + Load Data Value Trace captured on the platform

0.00001

0.0001

0.001

0.01

0.1

1

10

100

0 100 200 300 400 500 600 700 800

Clock cycle [x106]

raytrace: Trace port bandwidth in bpc as a function of time

NX_b(CS16) NX_b(CS16).gz1 CF_e(CS16) CF_e(CS32)

0.0001

0.001

0.01

0.1

1

10

100

0 20 40 60 80 100 120 140 160 180 200

Clock cycle [x106]

water-ns: Trace port bandwidth in bpc as a function of time

NX_b(CS16) NX_b(CS16).gz1 CF_e(CS16) CF_e(CS32)

Ti.fahCnt--

Ti: Memory Read

Ti.fahCnt > 0?

Read n Bytes From Trace Msg.Update SW Cache

Get New Trace Msg.Load Ti.fahCnt

Lookup SW CacheGet Data From SW Cache

END

Y

N

Ti: Cache Lookup

Ti: Memory Write

HIT?

N

Y

END

Clear FA Bits

Acquire OwnershipUpdate The SW Cache

Set Corresponding FA Bits

Invalidate The Cache BlockClear FA Flags

END

External Invalidation

Ti: Cache Lookup

Ti: Memory Read

HIT?

Replace Cache BlockClear FA Bits

Y

Ti.fahCnt++

END

N

Corresponding FA Bits Set?

Y

N

Emit Trace Msg. [Ti.dCC, Ti, Ti.fahCnt, Ti.LV]Set Corresponding FA Bits

Ti.fahCnt = 0

Ti: Cache Lookup

Ti: Memory Write

HIT?

N

Y

END

Replace Cache BlockClear FA Bits

Acquire OwnershipUpdate The Cache

Set Corresponding FA

Invalidate The Cache blockClear FA flags

External Invalidation

END

Top Related