electrical and computer engineering – north carolina state university experiences with two...

40
Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury, Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan, Zhenqian Zhang, Randy Widialaksono, Thomas Belanger, Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon

Upload: linette-conley

Post on 12-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Experiences with Two FabScalar-based Chips

Elliott Forbes, Rangeen Basu Roy Chowdhury, Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan,

Zhenqian Zhang, Randy Widialaksono, Thomas Belanger, Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon

Page 2: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Introduction

• FabScalar– Synthesizable and parameterized RTL for OOO

superscalar cores• Fetch and issue widths, structure sizes

– No memory hierarchy or “uncore” (until next release)

• Two FabScalar-based research projects– H3 (“Heterogeneity in 3D”)

• Two cores with different microarchitectures• Hardware support for thread migration

– AnyCore• One core with reconfigurable microarchitecture

1

Page 3: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Goals

• Technical– Explore adaptivity: Ability to adjust microarchitecture

to current instruction-level behavior• Migrate program execution to more suitable core (H3)• Reconfigure core (AnyCore)

• Non-technical– Fulfill original vision of FabScalar

• Streamline development of single-ISA heterogeneous multi-core processors

– Experience realities of fabricating designs– Have fun building stuff

2

Page 4: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Overview• Two stacked asymmetric cores• Fast Thread Migration (FTM)

– Bulk swap of arch. register state• Cache-Core Decoupling (CCD)

– Cores may switch L1 cachesat thread migrations

• Two phases– Phase 1: 2D IC (completed testing in June 2015)

• Test cores, caches, compiled memories, migration logic, etc.– Phase 2: 3D IC (August 2015 tapeout)

• Demonstrate stacked out-of-order cores, benefits of heterogeneity, etc.

“top” core “bottom” core

fetch width 2 1

issue width 3 (alu, br, ld-st) 3

IQ 32 16

LQ/SQ 16/16 16/16

ROB 64 32

PRF 96 64

L1 I$ 4KB DM 4KB DM

L1 D$ 8KB 4-way 8KB 4-way

Face-to-face 3D bonding provides dense low-latency interconnect

3

Page 5: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 DesignTimeline - Design and Verification

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores

generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)

in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑

L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB

I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches

synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑

delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)

post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)

hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache

study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing

integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)

redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stage

integrated OOO cores + D$ checked-in to SVN repo ↑debug

Elliott: core-side FTM, perf. counters, debug coreglobal migration

perf. counterslocal migration

integrate with face-to-face bus controllerdebug core: start debug core from 2-wide core

debug core: replace L1 I/D caches with I/D scratchpadsdebug core: replace memory bus with scan I/O

Zhenqian: face-to-face bus for FTMRandy: flow, physical design

missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑

physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑

tapeout (May 20, 2013) ↑

2012 2013

9 months til RTL freeze

Two cores generated with FabScalar

D-cache: retool and integrate OpenSparc-T2 L1 D$.

Original plan (canceled): leverage T2’s 8 core x 8 bank crossbar and L2 $ implemented in stacked DRAM.

I-cache: in-house

Modify Fetch-1 for synch. R/W compiled memories

Chip I/O (mem. buses, serializer/deserializer)

New features:CCDperf. countersFTM

FabScalar saved effort.

Most effort on caches, buses, new features.

4

Page 6: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 DesignTimeline - Design and Verification

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores

generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)

in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑

L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB

I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches

synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑

delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)

post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)

hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache

study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing

integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)

redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stage

integrated OOO cores + D$ checked-in to SVN repo ↑debug

Elliott: core-side FTM, perf. counters, debug coreglobal migration

perf. counterslocal migration

integrate with face-to-face bus controllerdebug core: start debug core from 2-wide core

debug core: replace L1 I/D caches with I/D scratchpadsdebug core: replace memory bus with scan I/O

Zhenqian: face-to-face bus for FTMRandy: flow, physical design

missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑

physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑

tapeout (May 20, 2013) ↑

2012 2013

6 mo. phys. design

Physical design.

physical design data (phase 1)Technology IBM 8RF (130 nm)Dimensions 5.25 mm x 5.25 mmArea 27.6 mm2

Transistors 14.6 MillionCells 1.1 MillionNets 721 ThousandMemory macros 56Clock domains 10Pads 400 (100 for each of

four experiments)

tool execution timeEncounter PAR 9 hoursCalibre DRC 4.5 hoursCalibre LVS 2 hours

5

Page 7: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Design

Die photo + floorplan

Mitigating risk.

Debug CoreRationale:•Test a “pure” FabScalar core•Plan B, in case two-core-stack doesn’t work•Eliminate risky aspects•Enhance testability/debuggability

Core configuration:•Same configuration as 2-wide core

Key features:•I-cache and D-cache replaced with synthesized I and D scratchpads

• No compiled memories• No complex caches

•Full scan• Observability/controllability for debug• No memory buses: Scratchpads preloaded/examined via scan chains

Full scan in 1-wide core•Observability/controllability of at least one core with caches•2-wide core doesn’t have scan overhead

Dedicated memory buses for the two cores •Avoid a potential single point of failure

Parameterized memory bus width•Reduce schedule risk: Early pad planning is important but fluid.

6

Page 8: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Verification• RTL verified using SPEC2K SimPoints

– In retrospect, should have also used microbenchmarks

• Lesson: Budget enough time for netlist verification– Major effort to set up netlist simulation

• Testbench and debug more complicated (everything blasted into individual nets)• SDF annotation requires experience• Most issues caused by testbench and SDF problems

– Found serious, but not fatal, bug just after tapeout• A difference between RTL and netlist caused by misplaced `ifdef• `ifdef SIM guards instrumentation in the RTL. Thus, SIM is defined in testbench

but not in synthesis script. The problem is that a small real code fragment was also mistakenly guarded by it.

• Lessons: (1) Consolidate all instrumentation in testbench (none in RTL). (2) Do netlist verification, because netlist may not equal RTL.

– Netlist simulation also would have alerted us to hold-time violations in D$• OpenSparc T2 D$ is a heavily latch-based industry design• Problem encountered in chip bring-up, diagnosed with netlist simulation

7

Page 9: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Packaging, PCB, & Bring-up• Debug core uses only a dozen signal pins

– Allowed us to wirebond a die directly to an existing board to check debug core liveness

– Test Vdd/Gnd and V+/Gnd for shorts– Scan-in == Scan-out

Wirebond ConfigurationsExperiment Signal/Supply

PadsHetero. core pair 63/37Debug core 13/87Isolated F2F bus 62/38Isolated F2B bus 49/51

Chip-on-board debug core liveness

• Overall chip has 400 pads divided into four 100-pad experiments– Wirebond chip differently for each experiment– Allows for use of a 128-pin QFP package

• Success of debug core liveness tests was the green light to assemble the four configurations. For each configuration:

– Package the chip (wirebonding and lead-forming)– Design and fab a 4-layer PCB– Assemble the PCB

8

Page 10: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Packaging, PCB, & Bring-up

Test setup•PCB connects to LPC mezzanine of Xilinx ML605•FPGA handles memory requests from cores

– Block RAMs for L2 cache

•Host PC sends commands to FPGA via serial interface using a custom GUI•Custom compiler for writing microbenchmarks

– For good control of instruction selection and order, without assembly programming Fully assembled H3 PCB (Phase 1)

All signals go to both:•Headers (to oscilloscope)•LPC connector (to FPGA)

LPC

und

erne

ath

Layer 1 (shown): package, headersLayers 2,3: Vdd/V+, GndLayer 4 (underneath): LPC, DCAPs

9

Page 11: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Packaging, PCB, & Bring-up• Results of chip bring-up

– Identified 9 total issues– 3 setup, 1 “feature”, 4 bugs, 1 possible bug– See Table 3, #x

• 3 setup issues: #2, #3, #6 (fixed)• 1 “feature” of extra I$ bus traffic: #4

(no ill effects, but may want to fix in Phase 2)• 4 bugs (will fix in Phase 2)

– 1 bug detected post-tapeout, pre-silicon: #1(serious, but has workaround)

– 1 class of hold-time bugs in D$: #5(serious, fortunately top core ok with certain tags)

– 2 bugs exercised by thread migrations: #7, #8(just annoying, have workarounds)

• 1 possible bug when migrating with CCD enabled: #9 (debug in progress)

10

Page 12: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Packaging, PCB, & Bring-up

† 1-wide core has higher current (and power) than 2-wide core because it has full scan.

Microbenchmark Core Static instr.

Dynamic instr.

Cycles IPC Current (mA)

Energy (mJ)

Comments

PRNGpseudo-random number generator

1-wide

28 67.1M

67.1M 1.00 8.53† 59.6 Peak IPC achieved.

2-wide 64.4M 1.04 5.78 46.5IPC>1 due to branches.IPC not much greater than 1 because only one simple/complex ALU lane, and load/store lane unused.

PRIMESprime number generator

1-wide28 88.2M

100.4M 0.88 8.35† 87.3 OOO tolerates 14-cycle integer divide instruction well.

2-wide 301.8M 0.29 5.70 215.0Hypothesis: larger IQ and non-age-based scheduler exacerbates priority inversion, stalling retirement more.

ARRAY-SUMsum elements of an array

1-wide35 41.9M

Did not finish See Table 3, issue #5.

2-wide 20.9M 2.00 6.54 17.1 Peak IPC achieved.

BUBBLEbubble sort, list initiallyreverse sorted

1-wide

73 2.5M

Did not finish See Table 3, issue #5.

8.5M 0.29 5.24 5.57

Early diagnosis: SQ stalls dominate other resource stalls.Hypothesis: Limited store buffer size of write-through D$ causing back-pressure. Frequent swaps imply many consecutive stores. Write-through latency to FPGA is high.

2-wide

MIGRATE“migrate” instr.inside loop

             

Completes 1 million consecutive thread migrations correctly.Cores at different frequencies.~1 migration per 1K cycles.

11

Page 13: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Phase 2

• Scheduled tapeout in August 2015• Just the two-core stack

– No debug core– No scan chains

• Design tasks– Partition RTL for two tiers– Implement thread migration enhancements– Fix bugs– Replace T2 D-caches with in-house D-caches

• T2 not workable: explicit latch instances, everywhere

• T2 no longer needed: shelved crossbar and stacked DRAM L2$

• We now have in-house D-cache from AnyCore effort

– Custom-design M1 pads and F2F bondpoints

12

Page 14: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Overview• FabScalar evolution

– Released FabScalar:• Build-up cores from library of stage designs of different widths

– Next-gen FabScalar:• “Superset Core”: single Verilog description with

parameterized widths (structure sizes already parameterized)

• AnyCore derived from “superset core”– Keep static configurability of superset core to allow for

synthesis of different max sized AnyCore processors– Add dynamic configurability within max size

• AnyCore instance that was fabricated:Adaptive microarchitecture feature Configurationsfetch/dispatch width (instructions/cycle) 1, 2, 3, 4issue width (instructions/cycle) 3, 4, 5physical register file & active list 64, 96, 128load and store queues (each) 16, 32issue queue 16, 32, 48, 64

13

Page 15: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Design and Verification

• Automatic liveness test (BIST)– Upon applying power, clock, and reset, chip should toggle a dedicated

pin. This signals it is correctly executing the preset test program.

• Got netlist simulation working early– Both post-synthesis (ideal clocks, estimated SDF) and post-layout

(everything realistic)

• In-house L1 cache designs with three modes– Cache– Scratchpad– BIST (mode after processor reset)

• Scratchpad mode with first N rows preset to a test program, including a new instruction that toggles a dedicated pin.

• Debug interface for direct reading/writing key pipeline structures, scratchpads, core configuration registers, and performance counters

Notable Design Features

Applied Lessons from H3 Project

14

Page 16: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Packaging, PCB, & Bring-up• Applied more lessons from H3 project

– Used a socket instead of soldering package to PCB• Possible benefit that PCB may be repurposed for other projects• Replace defective chips• Study variations

– Used a dedicated bypassable shuntresistor to measurecurrent

– Narrowed PCB to LPCprofile. Compatible witharbitrary Xilinx boards(ML605 and Zynq boards).

Received Intermediate checks

Packaged chips from MOSIS

Examined chip orientation, wirebonds, and downbonds under microscope.Checked lead-forming to socket specification.

Sockets from Agile Checked fit of package in socket under microscope.Vdd, V+, and Gnd checks with chip in socket.

Bare PCB from B.B. Checked all connections.

Assembled PCB from B.B.

Repeated Vdd, V+, and Gnd checks with chip in socket.BIST: successful retirement of first instructions within 24 hours of receiving assembled PCB. (Toggle Pin observed April 9!)15

Page 17: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Discussion and Questions

• Thank you

• Any comments or questions?

The H3 project is supported by a grant from Intel. The AnyCore project is supported by NSF grant CCF-1018517. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Fully assembled H3 PCB

Fully assembled AnyCore PCB

16

Page 18: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Backup

Page 19: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

References

N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.

E. Rotenberg, B. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, October 2013.

Page 20: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Detailed Pre-tapeout TimelineTimeline - Design and Verification

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores

generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)

in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑

L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB

I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches

synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑

delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)

post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)

hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache

study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing

integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)

redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stageintegrated OOO cores + D$ checked-in to SVN repo ↑

debugElliott: core-side FTM, perf. counters, debug core

global migrationperf. counters

local migrationintegrate with face-to-face bus controller

debug core: start debug core from 2-wide coredebug core: replace L1 I/D caches with I/D scratchpads

debug core: replace memory bus with scan I/OZhenqian: face-to-face bus for FTMRandy: flow, physical design

missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑

physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑

tapeout (May 20, 2013) ↑

2012 2013

Page 21: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Detailed Post-tapeout Timeline

Timeline - post-tapeoutSep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr

Steve: packaging, PCBsbare die back from MOSIS (Sep. 6, 2013) ↑

debug core: wirebond chip-on-boarddebug core: package chip; design, fab, and assemble PCB

hetero. core pair: package chip; design, fab, and assemble PCBElliott: chip bring-up

develop compiler for controlled benchmark creation <<July 2013>>debug core: Vdd/Gnd short test, current test, scan test

debug core: retire millions of instructionsdebug core: simple power and freq. measurements

setup host PC and FPGA infrastructurehetero. core pair: first instructions retired ↑

hetero. core pair: synthesize full testbench to FPGAhetero. core pair: 10M+ instr. retired (no memory ops) ↑

hetero. core pair: on-going testing and debughetero. core pair: breakthrough - tracked down flaky reset issue ↑

research ICCD-32 - Design-Effort Alloy (DEA) ICCD-32 revisions & pres.research other

Thomas: scan chain scripts (debug core), chip bring-up

2013 2014 2015

Page 22: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Cool H3 Pictures

Final layout

Full test setup (host PC, power supply, ML605, etc.)

Page 23: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Errata

  Symptom Description Stage Workaround

1 Instructions that depend on loads will wake up early when the load misses (reading an incorrect value from the register file/bypass).

A misplaced `ifdef guarded key RTL during simulation, but the `ifdef condition was not enabled during synthesis.

post-tapeout Prefetch blocks such that loads will hit.

2 Source-synchronous clock port errors in FPGA testbench during PAR/mapping.

Clock signals from the cores to the FPGA were not routed to clock-capable pins of the LPC.

post-PCB-assembly

Use a jumper from the clock header of the PCB to a clock-capable pin of the XM105 debug card attached to the HPC connector of the ML605.

3 Non-repeatability in running single threaded test programs.

High precision ammeter used to measure logic current. Current draw of running core caused the ammeter to switch shunt resistance (with a relay) immediately after reset was de-asserted. During the switch, the supply as seen by the core dropped below threshold voltage, causing metastability.

post-silicon testing

Use dedicated shunt resistor, and measure voltage drop across that resistor instead of the in-line ammeter.

Page 24: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Errata

  Symptom Description Stage Workaround

4 Repeated I-cache misses to the same block.

- Frequently-executed and mostly-taken branch in the last slot of a cache block.- A BTB update occurs when the branch is being fetched (Fetch-1 stage), and the update is for this branch. Thus, there is a BTB write and a BTB read to the same entry, that of the branch. The BTB read indicates a miss due to the concurrent read and write. This causes Fetch-1 to fetch the next sequential block, generating an I-cache miss.- The Fetch-1 stage is redirected to the branch’s taken target in the next cycle, when the branch is predecoded in the Fetch-2 stage.- Meanwhile, the I-cache miss request for the sequential block is not canceled. Further, the retrieved block is dropped by the core if the Fetch-1 stage is fetching instructions.- End result: The same miss (of the sequential block) is generated repeatedly.

post-silicon testing

This issue does not impact performance, but does make it difficult to measure the number of actual I-cache misses. It is possible to add NOPs to ensure frequently-executed and mostly-taken branches are not in the last slot of a cache block.

Page 25: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Errata

  Symptom Description Stage Workaround

5 Cache-missed load never retires.

After block is retrieved from memory, netlist simulation shows the tag not being written to the correct set (set index goes to x’s), owing to a hold-time violation in the set index for the tag fill. The core replays a cache-missed load until it hits. After the MHSR performs the flawed line-fill, the replayed load will miss again, generating another miss request for the same block. This repeats indefinitely.

post-silicon testing

Top Core: Hold-time violation on fill path seems to affect only the tag fill, not the data fill. Further, it appears that uninitialized tags in the tag SRAM are, by chance, 0. Thus, limiting load and store addresses to have tags of 0 masks the flawed tag fill. The workaround is not guaranteed but does seem very stable. Bottom Core: Not only does the load’s tag need to be 0, but we also had to lower Vdd to slow circuit paths. With these two measures, the replayed load hits and retires. However, the replayed load does not appear to get the correct data, presumably due to the data fill path being compromised (conjecture).

Page 26: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Errata

  Symptom Description Stage Workaround

6 Inability to read/write TRF registers.

Attempted read/write of TRF registers in a single-thread experiment. The reset of the F2F controller is fed through clock-synchronizing flip-flop pairs, but the clock of the F2F controller was not running (since it was a single thread experiment).

post-silicon testing

Ensure F2F controller is clocked when either core is clocked, even when migrations are not being tested.

7 Execution deadlock after a thread migration.

The BTB is implemented in SRAM, including its valid bits, so valid bits are not reset after a migration. This caused BTB hits on non-control instructions. False BTB hits are detected and recovered in Fetch-2 stage, except CTIQ is still pushed. Those instructions would allocate entries in the CTIQ, but during retirement, the CTIQ entry was not de-allocated. The CTIQ would fill, and fetch would stall indefinitely.

post-silicon testing

Carefully craft test programs such that instruction addresses do not overlap. Hard reset tends to work, but not guaranteed.

Page 27: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

H3 Errata

  Symptom Description Stage Workaround

8 Repeated thread migrations only work for fewer than 33 migrations.

The MIGRATE instruction is in a loop. If the loop-ending branch is fetched before the MIGRATE instruction retires, then a CTIQ entry will be allocated for the branch. However, after the migration, the CTIQ is not reset, so the allocated entries are not de-allocated, eventually filling the CTIQ, blocking forward progress.

post-silicon testing

Add instructions (NOPs if necessary) after a MIGRATE instruction to guarantee no branches can be fetched before the MIGRATE retires. 

9 Unable to repeatedly migrate from core-to-core when CCD is enabled.

Debugging in progress post-silicon testing

Unknown

Page 28: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Ammeter Shunt Resistor Switching

• High precision meter still uses shunt resistor– Our meter switches shunt resistor using relay– The relay causes a lag between changes in core current draw,

and the ability of the ammeter to react

1. FPGA reset pressed (resets core, which then draws near zero current).

2. Ammeter switches to higher resistance shunt resistor

3. FPGA reset released (core starts executing, increasing current draw).

4. Ammeter switches back to lower resistance shunt resistor

1 2 3 4

Power Supply Core

DCM

Page 29: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Ammeter Burden Voltage

• Using a discrete, fixed value, resistor to measure current also has trade-offs– Increasing resistance gives more accurate measurements

of voltage drop (and hence current)– However, that same voltage drop also lowers the voltage

as seen by the core– This is why the ammeter switches shunt resistances in the

first place

Power Supply Core

DVM

Voltage regulated here

Page 30: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Power Supply Sense Input

• You could bump the voltage of the power supply up so that the voltage as seen by the core is within spec

• But then during reset, the core current drops to near zero, and the voltage as seen by the core is roughly the same as the bumped up output of the power supply

• A more fully featured power supply, with a sense input, can help this situation… regulation is at the sense node, not the output of node of the power supply

Power Supply Core

DVM

Sense

Voltage regulated here

Page 31: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Custom Compiler

• Needed a way to explicitly control all aspects of emitted instructions

• Language syntax is assembly with a few higher-level features– Registers and memory locations can be named– if statements and while loops– Arithmetic operators, assignments, address-of

• Syntax allows the ability to place code/data at arbitrary memory locations– Including non-contiguous locations

• The compiler also emits to our checkpoint file format• Written in flex/bison

Page 32: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Custom Compiler Example

% example program that % sums the values of an arraymem (0x00400000) {ii: $r1addr: $r2cond: $r4val: $r5total: $r6

addr = @data total = #0 ii = #0 cond = ii < #4 while (cond) { lw val, #0[addr] addr = addr + #4 total = total + val ii = ii + #1 cond = ii < #4 }

addr = @result sw total, #0[addr]}

% example program array datamem (0x00100000) {data: !0x0f0f0f0f !0xabcd1234 !0x00000001 !0xdeadbeefresult: !0x00000000}

Page 33: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Intel 80386 Errata (it’s not just us!)

• Hardware Peculiarity: Unless pin F13 of the 80386 is connected to the +5V power supply, the 80386 never terminates a memory cycle, halting the processor.– Datasheet indicates pin F13 is NC with a note that “Pins identified as

‘NC’ should remain completely unconnected.”• Successive Floating-Point Instructions: If two floating-point

instructions are executed close together, the 80386 may force the coprocessor to start the second one too soon if the first one did not require any memory operands.

• Misaligned Floating-Point Instructions: If 80287 and/or 80387 instructions are not word-aligned, the 80386 passes the wrong instruction to the coprocessor, causing unpredictable behavior.

• Self-test: The self-test feature does not work on the A1 stepping of the 80386.

• Double Page Faults: The bug that appeared in the B0 stepping regarding page faults that occur during page faults has been made a permanent feature of the 80386…

Turley, James L., Advanced 80386 Programming Techniques, McGraw-Hill, Berkeley, CA, 1988.

Page 34: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Detailed Pre-tapeout Timeline

Timeline - Design and VerificationJan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Rangeen: 6-wide coreAdd microarchiterular adaptivity to superset

Add dynamic reconfiguration controlBeef up testbench to support dynamic reconfiguration tests

Add debug facility and configuration registersRTL functional and performance verification and debug

synthesis, fine grain clock gating and scan chain insertion Flow FinalRTL feature freeze ↑

post-synthesis verification (synthesized netlist simulation)post-layout verification (P&R netlist simulation)

UPF based power aware design and synthesis flow researchUPF based power aware netlist simulation and power analysis

Rangeen: 4-wide coreAdd coarse grain clock gates - partition level

RTL functional and performance verification and debugsynthesis and scan chain insertion

post-synthesis verification (synthesized netlist simulation)post-layout verification (P&R netlist simulation)

UPF based power aware design and synthesis flow researchUPF based power aware netlist simulation and power analysis

Rangeen: L1 cachesNon blocking write through data cache

Instruction cacheMiss handling and off chip communication machinery

integrate and debugAdd scratch mode and BIST to caches

Change cache sizes and off-chip bus widthsAnil: perf. counters, debug access and verification

Add perf countersRead and write PRF via debug bus

Read AMT via debug busVerify perf counter read and reset

Verify debug access to PRF and AMTPost synthesis verification of debug features

Verify read/write access to caches via debug busRangeen: physical design

Modify H3 flow to suit AnyCore requirementsphysical design (floorplan, P&R, DRC, LVS)

missed tapeout (May 23, 2014): couldn't close on timing - design too dense ↑tapeout (August 22, 2014) ↑

Vinesh: scan chain verification - post layoutScript to generate bit sequence from executable

Add scan support in testbenchVerify basic scan functionality

2013 2014

Page 35: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Physical Design Details

Physical design dataTechnology IBM 8RF (130nm)Dimensions 5 mm x 5 mmArea 25 mm2

Pads(signal, power)

100(79, 21)

Transistors 3.4 millionCells 1.5 millionNets 7.6 million

Page 36: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Detailed Post-tapeout Timeline

Timeline - Post TapeoutOct Nov Dec Jan Feb Mar Apr

Rangeen/Anil: packagingPackage bonding diagram

Socket selection: dictates lead trimPackaging specifics : lead trim, downbond, fill

Packaging validation after receiving packaged partsAnil: PCB design

PCB planning and knowledge transferSocket footprint creation - consult with PCB manufacturer for design rule

Rev 1A- minimum required components and headersRev 1B- debug features, silk screen markings, headers, power measurement

Design review and detailed connectivity check with FPGA and headerssubmit final PCB design to Better Boards (April 1) ↑

test bare board at Better Boards (April 7) ↑pick up assembled boards from Better Boards (April 8) ↑

Rangeen/Anil/Elliott: chip bring-upinsert chip in socket, check Vdd/V+/Gnd, attach to FPGA, BIST success! (April 9) ↑

20152014

Page 37: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Cool AnyCore Pictures

Final layout Floorplan

Page 38: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

AnyCore Pipeline

Page 39: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

BIST Program

0x00 addi r1, r0, #00x08 addi r2, r0, #00x10 addi r3, r0, #00x18 addi r4, r0, #00x20 toggle0x28 nop0x30 nop0x38 nop0x40 st r3(#0), r40x48 addi r2, r2, #100x50 addi r1, r1, #50x58 addi r2, r2, #100x60 toggle0x68 addi r1, r1, #50x70 addi r2, r2, #100x78 addi r1, r1, #5

0x80 addi r4, r0, #00x88 ld r4, r3(#0)0x90 addi r1, r1, #50x98 addi r2, r2, #100xa0 toggle0xa8 addi r2, r2, #100xb0 addi r4, r4, #20xb8 addi r3, r3, #40xc0 addi r2, r2, #100xc8 addi r1, r1, #50xd0 addi r1, r1, #50xd8 jmp 0x400xe0 nop0xe8 nop0xf0 nop0xf8 nop

Page 40: Electrical and Computer Engineering – North Carolina State University Experiences with Two FabScalar-based Chips Elliott Forbes, Rangeen Basu Roy Chowdhury,

Electrical and Computer Engineering – North Carolina State University

Debug Bus

ConfigurationRegisters

Caches / AMT / PRF

Performance Counters

Debug Registers(PC, queue

pointers etc.)

Write Data / Wr En

Read/Write Addr

Read Data