CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging

Smruti R. Sarangi, Brian L. Greskamp, Josep Torrellas
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu



Motivation

50-70% of effort is spent on verification
1-2 year verification times
More features on-chip
Verification speed is not keeping up with complexity
Some bugs inevitably slip through
  Pentium FDIV bug
  Prefetching bug in Pentium 4 Xeon
  IBM G-3 frequency bug

Smruti R. Sarangi 2

Error Rate vs Time

[Figure: % of bugs detected over time; annotation: 8 weeks]

Reduced time to debug: a vital ingredient of profitability

Outline

Problems in Debugging
Sources of Non-Determinism
Handling Non-Determinism in Buses
CADRE Architecture
Evaluation
  Space Overhead
  Performance Overhead

Design Bugs

An example of a design bug in the IBM G3
  Power manager shuts down L1
  AND L1 is waiting for data
  AND L2 is being invalidated
  Consequence: all the L2 lines might not be invalidated

Two features
  Infrequent conditions
  Large detection latency: data corruption bugs are detected only after an observable event (program crash, HW hang, wrong output, etc.)

Problems in Debugging Hardware

Infrequent conditions
  RTL simulators are very slow, roughly 30 cycles/s
  Need to test as many paths as possible
  At breakpoints, transfer state to an RTL simulator
  Probably want to deploy it in the field and send bug reports back
Large detection latency
  Need a large debugging window with modest storage
Some bugs are NOT reproducible
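
To put the 30 cycles/s figure in perspective, here is a rough calculation. It is only a sketch: the 2.8 GHz clock is borrowed from the evaluation machine described later in the talk and is an assumption about the chip being debugged, not a number from this slide.

```python
# Rough estimate of how long an RTL simulator needs to cover one second
# of real execution. CLOCK_HZ is an assumption (the 2.8 GHz evaluation
# machine); RTL_CYCLES_PER_SEC is the ~30 cycles/s quoted on the slide.
CLOCK_HZ = 2.8e9
RTL_CYCLES_PER_SEC = 30
SECONDS_PER_YEAR = 365 * 24 * 3600


def rtl_seconds_for(real_seconds: float) -> float:
    """Wall-clock seconds of RTL simulation needed to cover real_seconds."""
    return real_seconds * CLOCK_HZ / RTL_CYCLES_PER_SEC


sim_seconds = rtl_seconds_for(1.0)
print(f"{sim_seconds:.3g} s of simulation for 1 s of execution")
print(f"that is about {sim_seconds / SECONDS_PER_YEAR:.1f} years")
```

Under these assumptions, simulating from reset to a bug that appears seconds into execution is impractical, which is why state is transferred to the RTL simulator only at breakpoints.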

Existing Debugging Frameworks

Pentium-M debugging framework (Golan)
  Log all the signals at the pins
  Replay them
Disadvantages
  Expensive pin-snooping electronics
  For very high speed buses
    Hard to snoop anymore
    Log signals inside the processor
    Extra pins required to send data to stable storage

CADRE

[Figure: system diagram showing the CMP chip, DIMMs, Memory Controller (MCH) and IO Controller (ICH) leading to IO devices. Labels: Golan, CADRE, Log all updates, Checkpoint.]

Agent: CMP, MCH, ... Each agent has its own clock.
An input/output is deterministic if it is always observed at the same clock cycle with respect to the agent.
An agent is deterministic if deterministic inputs imply deterministic outputs.
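
The two definitions above can be expressed as a small check over recorded traces. This is a minimal sketch, not part of CADRE: the `Trace` type and both function names are hypothetical, and comparing a finite set of runs can only reveal non-determinism, never prove its absence.

```python
from typing import List, Tuple

# A trace is the list of (cycle, value) pairs an agent observed on one
# signal during one run; cycles are counted on the agent's own clock.
Trace = List[Tuple[int, int]]


def is_deterministic(runs: List[Trace]) -> bool:
    """True if every run shows the same values at the same cycles."""
    return all(run == runs[0] for run in runs[1:])


def agent_is_deterministic(inputs: List[Trace], outputs: List[Trace]) -> bool:
    """Deterministic inputs must imply deterministic outputs.

    The implication holds vacuously when the inputs themselves differ.
    """
    return (not is_deterministic(inputs)) or is_deterministic(outputs)


same = [[(10, 0xAB), (14, 0xCD)], [(10, 0xAB), (14, 0xCD)]]
late = [[(10, 0xAB), (14, 0xCD)], [(10, 0xAB), (15, 0xCD)]]  # one cycle late

print(is_deterministic(same), is_deterministic(late))
print(agent_is_deterministic(same, late))
```

Note that the second run in `late` carries the same values, only one cycle later; under the definition above that is already non-deterministic.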

Ideal Hardware Debugger

High-speed execution until the “buggy point”
Minimal storage and a large debugging window
Executions are completely reproducible

                   RTL Simulator   Golan    CADRE
Speed              Very Low        High     High
Storage            Very High       High     Low
Debugging Window   Low             Medium   High
Reproducibility    High            High     High


Sources of Non-Determinism

[Figure: system diagram (CMP chip, DIMMs, Memory Controller (MCH), IO Controller (ICH), IO devices) annotated with the sources listed below]

Non-deterministic message delay
IO and interrupts
Power/thermal events
Voltage/frequency scaling
Soft errors
Refresh/scrubbing

Non-Determinism in Buses

[Figure: source-synchronous bus. The transmitter sends Data and Clock to the receiver's bus interface; each side has a PLL and a clock, and the receiver has a FIFO queue.]

Sources of variation
  Temperature variation
  Power supply noise
  Inter-symbol interference
  Process variation
  Crosstalk

The probability of non-determinism will be very high for future buses.

Enforcing Determinism

CPU
  Design deterministically
  Properly initialize all elements
    DETRST instruction (to set a deterministic state)
  Log all exceptions, power/thermal events
Memory
  Start every checkpoint interval with a refresh
  Checkpoint scrub register
IO
  Log all the data along with cycle counts
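
The IO rule (log the data together with its cycle count) can be illustrated with a toy record-and-replay loop. This is a sketch under stated assumptions: `IOLog` and `run` are hypothetical names, the "agent" is just a checksum, and at most one input arrives per cycle.

```python
from typing import Dict, List, Tuple


class IOLog:
    """Hypothetical log of IO inputs, each tagged with its arrival cycle."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, bytes]] = []

    def record(self, cycle: int, data: bytes) -> None:
        # Original run: remember what arrived and exactly when.
        self.entries.append((cycle, data))

    def replay(self) -> Dict[int, bytes]:
        # Replay run: map each cycle to the data to re-inject at that cycle.
        return dict(self.entries)


def run(inputs: Dict[int, bytes], cycles: int) -> List[Tuple[int, int]]:
    """Toy agent whose state depends on both the data and its arrival cycle."""
    checksum, history = 0, []
    for cycle in range(cycles):
        if cycle in inputs:
            checksum = (checksum * 31 + sum(inputs[cycle]) + cycle) % 65521
            history.append((cycle, checksum))
    return history


log = IOLog()
log.record(3, b"irq")
log.record(17, b"dma-block")

original = run({3: b"irq", 17: b"dma-block"}, cycles=32)
replayed = run(log.replay(), cycles=32)
print(original == replayed)
```

Logging the data alone would not be enough here: re-injecting the same bytes one cycle later produces a different history.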

Enforcing Determinism in Buses

[Figure: transmitter and receiver timelines. A message sent at cycle x_T arrives at the receiver at a non-deterministic cycle y_R; θ1 and θ2 bound the delay. Annotation: log2(W), marked Optimal.]

Uncertainty interval: W = θ2 - θ1 + 1
Deterministic processing: the receiver processes the message at z_R = x_T + θ2

Schemes to Enforce Determinism

Assume f_Transmitter = f_Receiver
Receiver needs to compute: z_R = x_T + θ2
Trivial solution: send x_T along with every message
Better solution: after the first message is delivered deterministically
  Transmitter sends a message every cycle
  Receiver then processes the messages at the rate of transmission
  Disadvantages
    This scheme requires a line between all pairs of nodes
    A receiver has to be aware of all the transmitters
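
The trivial solution can be simulated in a few lines. This is a sketch, assuming the bus delay always falls between θ1 and θ2 cycles; the values of `THETA1` and `THETA2` and the function name are illustrative, not taken from the talk.

```python
import random
from typing import List

THETA1, THETA2 = 2, 5  # assumed min/max bus delay in cycles (illustrative)


def processing_cycles(send_cycles: List[int], seed: int) -> List[int]:
    """Cycles at which the receiver processes each message.

    Each message carries its send cycle x_T. It arrives at a random
    y_R in [x_T + THETA1, x_T + THETA2], but the receiver holds it
    until z_R = x_T + THETA2, so the arrival jitter is hidden.
    """
    rng = random.Random(seed)
    result = []
    for x_t in send_cycles:
        y_r = x_t + rng.randint(THETA1, THETA2)  # non-deterministic arrival
        z_r = x_t + THETA2                       # deterministic processing
        assert y_r <= z_r                        # the message is always there in time
        result.append(z_r)
    return result


sends = [0, 4, 9, 20]
print(processing_cycles(sends, seed=1) == processing_cycles(sends, seed=2))
```

The cost of this scheme is bandwidth: every message has to carry a full cycle count.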

Offset Scheme

[Figure: transmitter and receiver timelines divided into windows of W = θ2 - θ1 + 1 cycles. A message sent at x_T, between min(x_T) and max(x_T), arrives at y_R and is processed at z_R = x_T + θ2; ρ is the offset of x_T from the start of its window. Labels: Case 1, Case 2, Disjoint.]

(y_R - θ1 - ρ) is in the same window as x_T
x_T = ⌊(y_R - θ1 - ρ)/W⌋ * W + ρ
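
The recovery formula can be checked exhaustively for a small window. This is a sketch with two assumptions that the slide does not state explicitly: the delay always lies in [θ1, θ2], and ρ equals x_T mod W. The delay bounds themselves are illustrative.

```python
THETA1, THETA2 = 2, 5          # assumed delay bounds in cycles (illustrative)
W = THETA2 - THETA1 + 1        # uncertainty interval, as defined in the talk


def recover_send_cycle(y_r: int, rho: int) -> int:
    """x_T = floor((y_R - theta1 - rho) / W) * W + rho."""
    return (y_r - THETA1 - rho) // W * W + rho


# Exhaustive check over a range of send cycles and every legal delay.
for x_t in range(200):
    rho = x_t % W              # assumption: rho is x_T's offset in its window
    for delay in range(THETA1, THETA2 + 1):
        assert recover_send_cycle(x_t + delay, rho) == x_t

print("recovered every send cycle")
```

The check works because y_R - θ1 lies in [x_T, x_T + W - 1], so subtracting ρ lands in the window that starts at x_T - ρ, and the floor snaps back to that window start.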

Implementation of Offset Scheme

[Figure: receiver bus interface. Blocks: Mod-W Counter, Domain Counter, adders for -θ1, -ρ, +ρ and +θ2, a Circular Queue holding z_R and Data, a comparator (=), and the Core.]

z_R = ⌊(y_R - θ1 - ρ)/W⌋ * W + ρ + θ2
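
A behavioral model of the receiver can make the formula for z_R concrete. This is a hedged sketch, not the actual hardware: it assumes ρ reaches the receiver together with the message, that messages arrive in send order, and that a queue entry is released when the cycle counter equals its z_R. All names and the delay bounds are illustrative.

```python
from collections import deque
from typing import Deque, List, Tuple

THETA1, THETA2 = 2, 5          # assumed delay bounds in cycles (illustrative)
W = THETA2 - THETA1 + 1


def release_cycle(y_r: int, rho: int) -> int:
    """z_R = floor((y_R - theta1 - rho) / W) * W + rho + theta2."""
    return (y_r - THETA1 - rho) // W * W + rho + THETA2


def receive(arrivals: List[Tuple[int, int, str]], cycles: int) -> List[Tuple[int, str]]:
    """arrivals: (y_R, rho, data) sorted by y_R. Returns (cycle, data) given to the core."""
    todo = deque(arrivals)
    pending: Deque[Tuple[int, str]] = deque()    # stands in for the circular queue
    delivered = []
    for now in range(cycles):                    # 'now' plays the domain counter
        while todo and todo[0][0] == now:        # message arrives at y_R
            y_r, rho, data = todo.popleft()
            pending.append((release_cycle(y_r, rho), data))
        while pending and pending[0][0] == now:  # comparator (=) fires at z_R
            delivered.append((now, pending.popleft()[1]))
    return delivered


# Messages sent at x_T = 0 and x_T = 6, once with the shortest delay, once with the longest.
early = [(0 + THETA1, 0 % W, "A"), (6 + THETA1, 6 % W, "B")]
late = [(0 + THETA2, 0 % W, "A"), (6 + THETA2, 6 % W, "B")]
print(receive(early, 16) == receive(late, 16))
```

Both runs hand the messages to the core at the same cycles, even though the arrival cycles differ by up to W - 1.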

Architecture

[Figure: CADRE architecture. Components shown: CMP, CPU, Reg. Ckpt, CPU Log, Synchronizers, CADRE Controller, Memory Controller, Memory Log Controller, Memory, Memory Log, IO Controller Hub, IO Log, PCI devices. Bus types: Synchronous Bus, Asynchronous Bus, Source Sync. Bus. Legend: Baseline, HW for determinism, HW for checkpointing.]


Evaluation - CADRE

Configuration
  2.8 GHz dual-processor Pentium 4 Xeon server with hyper-threading
  Intel E7525 chipset with 800 MHz FSB
  We estimate the overheads of a 4-processor CMP
  Benchmarks: SPEC Int, FP, JBB, OMP and Web
To estimate overheads
  Reconfigure MCH and ICH
  Use memory-mapped IO to access PCIX registers
  See our paper in WARP '06 (workshop held along with ISCA '06)

Space Overhead

Space overhead of memory checkpointing
  SafetyNet: 50 MB/s/proc
  ReVive: 38-120 MB/s/proc
Average IO bandwidth: 100 kB/s
Design point
  1 s checkpoint interval
  50 MB * 4 = 200 MB for memory checkpoint
  4 MB log for IO traffic
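
The design-point arithmetic can be reproduced directly from the numbers above. This is a sketch: the variable names are illustrative, 1 MB is taken as 1000 kB, and the talk does not say how the 4 MB IO log was sized, so the ratio printed at the end is only an observation.

```python
# Storage needed per checkpoint interval, using the numbers on the slide.
MEM_LOG_MB_PER_S_PER_PROC = 50   # SafetyNet figure
NUM_PROCS = 4                    # 4-processor CMP
INTERVAL_S = 1.0                 # checkpoint interval
AVG_IO_KB_PER_S = 100            # average IO bandwidth
IO_LOG_MB = 4                    # IO log size at the design point

mem_ckpt_mb = MEM_LOG_MB_PER_S_PER_PROC * NUM_PROCS * INTERVAL_S
avg_io_mb = AVG_IO_KB_PER_S * INTERVAL_S / 1000   # taking 1 MB = 1000 kB
headroom = IO_LOG_MB / avg_io_mb

print(f"memory checkpoint: {mem_ckpt_mb:.0f} MB")
print(f"average IO per interval: {avg_io_mb:.1f} MB; the IO log is {headroom:.0f}x that")
```

The memory checkpoint dominates the storage budget; the IO log is small even though it is provisioned well above the average traffic of one interval.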

Performance Overhead

Periodic cache flushing overhead for processor checkpointing
  Negligible: once per second
For every message, assume the worst-case delay, which increases bus latency
  Assume 1-4 (MCH) cycles of non-determinism
  Add an extra 1-4 cycles of bus latency
  Increase programmable read pointer delay, clock guard band, RAS-CAS delay

Performance - II

With a 1 s checkpoint interval, CADRE has a 1% slowdown and requires 200 MB of storage
Compared to 64 ms for Golan for the same overheads

Using CADRE

Hardware debugging
  Record and replay executions
  Transfer state to an RTL simulator
  Use scan chains to observe certain latches
Use CADRE in deployed systems in the field
  Send a hardware core dump back to the vendor
Lock-stepped execution in TMR systems
Software debugging: CADRE hardware guarantees determinism

CADRE: Cycle-Accurate Deterministic Replay for Hardware Debugging

Questions?

http://iacoma.cs.uiuc.edu