electrical and computer engineering – north carolina state university experiences with two...
TRANSCRIPT
Electrical and Computer Engineering – North Carolina State University
Experiences with Two FabScalar-based Chips
Elliott Forbes, Rangeen Basu Roy Chowdhury, Brandon Dwiel, Anil Kannepalli, Vinesh Srinivasan,
Zhenqian Zhang, Randy Widialaksono, Thomas Belanger, Steve Lipa, Eric Rotenberg, W. Rhett Davis, Paul D. Franzon
Electrical and Computer Engineering – North Carolina State University
Introduction
• FabScalar– Synthesizable and parameterized RTL for OOO
superscalar cores• Fetch and issue widths, structure sizes
– No memory hierarchy or “uncore” (until next release)
• Two FabScalar-based research projects– H3 (“Heterogeneity in 3D”)
• Two cores with different microarchitectures• Hardware support for thread migration
– AnyCore• One core with reconfigurable microarchitecture
1
Electrical and Computer Engineering – North Carolina State University
Goals
• Technical– Explore adaptivity: Ability to adjust microarchitecture
to current instruction-level behavior• Migrate program execution to more suitable core (H3)• Reconfigure core (AnyCore)
• Non-technical– Fulfill original vision of FabScalar
• Streamline development of single-ISA heterogeneous multi-core processors
– Experience realities of fabricating designs– Have fun building stuff
2
Electrical and Computer Engineering – North Carolina State University
H3 Overview• Two stacked asymmetric cores• Fast Thread Migration (FTM)
– Bulk swap of arch. register state• Cache-Core Decoupling (CCD)
– Cores may switch L1 cachesat thread migrations
• Two phases– Phase 1: 2D IC (completed testing in June 2015)
• Test cores, caches, compiled memories, migration logic, etc.– Phase 2: 3D IC (August 2015 tapeout)
• Demonstrate stacked out-of-order cores, benefits of heterogeneity, etc.
“top” core “bottom” core
fetch width 2 1
issue width 3 (alu, br, ld-st) 3
IQ 32 16
LQ/SQ 16/16 16/16
ROB 64 32
PRF 96 64
L1 I$ 4KB DM 4KB DM
L1 D$ 8KB 4-way 8KB 4-way
Face-to-face 3D bonding provides dense low-latency interconnect
3
Electrical and Computer Engineering – North Carolina State University
H3 DesignTimeline - Design and Verification
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores
generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)
in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑
L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB
I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches
synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑
delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)
post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)
hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache
study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing
integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)
redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stage
integrated OOO cores + D$ checked-in to SVN repo ↑debug
Elliott: core-side FTM, perf. counters, debug coreglobal migration
perf. counterslocal migration
integrate with face-to-face bus controllerdebug core: start debug core from 2-wide core
debug core: replace L1 I/D caches with I/D scratchpadsdebug core: replace memory bus with scan I/O
Zhenqian: face-to-face bus for FTMRandy: flow, physical design
missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑
physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑
tapeout (May 20, 2013) ↑
2012 2013
9 months til RTL freeze
Two cores generated with FabScalar
D-cache: retool and integrate OpenSparc-T2 L1 D$.
Original plan (canceled): leverage T2’s 8 core x 8 bank crossbar and L2 $ implemented in stacked DRAM.
I-cache: in-house
Modify Fetch-1 for synch. R/W compiled memories
Chip I/O (mem. buses, serializer/deserializer)
New features:CCDperf. countersFTM
FabScalar saved effort.
Most effort on caches, buses, new features.
4
Electrical and Computer Engineering – North Carolina State University
H3 DesignTimeline - Design and Verification
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores
generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)
in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑
L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB
I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches
synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑
delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)
post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)
hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache
study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing
integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)
redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stage
integrated OOO cores + D$ checked-in to SVN repo ↑debug
Elliott: core-side FTM, perf. counters, debug coreglobal migration
perf. counterslocal migration
integrate with face-to-face bus controllerdebug core: start debug core from 2-wide core
debug core: replace L1 I/D caches with I/D scratchpadsdebug core: replace memory bus with scan I/O
Zhenqian: face-to-face bus for FTMRandy: flow, physical design
missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑
physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑
tapeout (May 20, 2013) ↑
2012 2013
6 mo. phys. design
Physical design.
physical design data (phase 1)Technology IBM 8RF (130 nm)Dimensions 5.25 mm x 5.25 mmArea 27.6 mm2
Transistors 14.6 MillionCells 1.1 MillionNets 721 ThousandMemory macros 56Clock domains 10Pads 400 (100 for each of
four experiments)
tool execution timeEncounter PAR 9 hoursCalibre DRC 4.5 hoursCalibre LVS 2 hours
5
Electrical and Computer Engineering – North Carolina State University
H3 Design
Die photo + floorplan
Mitigating risk.
Debug CoreRationale:•Test a “pure” FabScalar core•Plan B, in case two-core-stack doesn’t work•Eliminate risky aspects•Enhance testability/debuggability
Core configuration:•Same configuration as 2-wide core
Key features:•I-cache and D-cache replaced with synthesized I and D scratchpads
• No compiled memories• No complex caches
•Full scan• Observability/controllability for debug• No memory buses: Scratchpads preloaded/examined via scan chains
Full scan in 1-wide core•Observability/controllability of at least one core with caches•2-wide core doesn’t have scan overhead
Dedicated memory buses for the two cores •Avoid a potential single point of failure
Parameterized memory bus width•Reduce schedule risk: Early pad planning is important but fluid.
6
Electrical and Computer Engineering – North Carolina State University
H3 Verification• RTL verified using SPEC2K SimPoints
– In retrospect, should have also used microbenchmarks
• Lesson: Budget enough time for netlist verification– Major effort to set up netlist simulation
• Testbench and debug more complicated (everything blasted into individual nets)• SDF annotation requires experience• Most issues caused by testbench and SDF problems
– Found serious, but not fatal, bug just after tapeout• A difference between RTL and netlist caused by misplaced `ifdef• `ifdef SIM guards instrumentation in the RTL. Thus, SIM is defined in testbench
but not in synthesis script. The problem is that a small real code fragment was also mistakenly guarded by it.
• Lessons: (1) Consolidate all instrumentation in testbench (none in RTL). (2) Do netlist verification, because netlist may not equal RTL.
– Netlist simulation also would have alerted us to hold-time violations in D$• OpenSparc T2 D$ is a heavily latch-based industry design• Problem encountered in chip bring-up, diagnosed with netlist simulation
7
Electrical and Computer Engineering – North Carolina State University
H3 Packaging, PCB, & Bring-up• Debug core uses only a dozen signal pins
– Allowed us to wirebond a die directly to an existing board to check debug core liveness
– Test Vdd/Gnd and V+/Gnd for shorts– Scan-in == Scan-out
Wirebond ConfigurationsExperiment Signal/Supply
PadsHetero. core pair 63/37Debug core 13/87Isolated F2F bus 62/38Isolated F2B bus 49/51
Chip-on-board debug core liveness
• Overall chip has 400 pads divided into four 100-pad experiments– Wirebond chip differently for each experiment– Allows for use of a 128-pin QFP package
• Success of debug core liveness tests was the green light to assemble the four configurations. For each configuration:
– Package the chip (wirebonding and lead-forming)– Design and fab a 4-layer PCB– Assemble the PCB
8
Electrical and Computer Engineering – North Carolina State University
H3 Packaging, PCB, & Bring-up
Test setup•PCB connects to LPC mezzanine of Xilinx ML605•FPGA handles memory requests from cores
– Block RAMs for L2 cache
•Host PC sends commands to FPGA via serial interface using a custom GUI•Custom compiler for writing microbenchmarks
– For good control of instruction selection and order, without assembly programming Fully assembled H3 PCB (Phase 1)
All signals go to both:•Headers (to oscilloscope)•LPC connector (to FPGA)
LPC
und
erne
ath
Layer 1 (shown): package, headersLayers 2,3: Vdd/V+, GndLayer 4 (underneath): LPC, DCAPs
9
Electrical and Computer Engineering – North Carolina State University
H3 Packaging, PCB, & Bring-up• Results of chip bring-up
– Identified 9 total issues– 3 setup, 1 “feature”, 4 bugs, 1 possible bug– See Table 3, #x
• 3 setup issues: #2, #3, #6 (fixed)• 1 “feature” of extra I$ bus traffic: #4
(no ill effects, but may want to fix in Phase 2)• 4 bugs (will fix in Phase 2)
– 1 bug detected post-tapeout, pre-silicon: #1(serious, but has workaround)
– 1 class of hold-time bugs in D$: #5(serious, fortunately top core ok with certain tags)
– 2 bugs exercised by thread migrations: #7, #8(just annoying, have workarounds)
• 1 possible bug when migrating with CCD enabled: #9 (debug in progress)
10
Electrical and Computer Engineering – North Carolina State University
H3 Packaging, PCB, & Bring-up
† 1-wide core has higher current (and power) than 2-wide core because it has full scan.
Microbenchmark Core Static instr.
Dynamic instr.
Cycles IPC Current (mA)
Energy (mJ)
Comments
PRNGpseudo-random number generator
1-wide
28 67.1M
67.1M 1.00 8.53† 59.6 Peak IPC achieved.
2-wide 64.4M 1.04 5.78 46.5IPC>1 due to branches.IPC not much greater than 1 because only one simple/complex ALU lane, and load/store lane unused.
PRIMESprime number generator
1-wide28 88.2M
100.4M 0.88 8.35† 87.3 OOO tolerates 14-cycle integer divide instruction well.
2-wide 301.8M 0.29 5.70 215.0Hypothesis: larger IQ and non-age-based scheduler exacerbates priority inversion, stalling retirement more.
ARRAY-SUMsum elements of an array
1-wide35 41.9M
Did not finish See Table 3, issue #5.
2-wide 20.9M 2.00 6.54 17.1 Peak IPC achieved.
BUBBLEbubble sort, list initiallyreverse sorted
1-wide
73 2.5M
Did not finish See Table 3, issue #5.
8.5M 0.29 5.24 5.57
Early diagnosis: SQ stalls dominate other resource stalls.Hypothesis: Limited store buffer size of write-through D$ causing back-pressure. Frequent swaps imply many consecutive stores. Write-through latency to FPGA is high.
2-wide
MIGRATE“migrate” instr.inside loop
Completes 1 million consecutive thread migrations correctly.Cores at different frequencies.~1 migration per 1K cycles.
11
Electrical and Computer Engineering – North Carolina State University
H3 Phase 2
• Scheduled tapeout in August 2015• Just the two-core stack
– No debug core– No scan chains
• Design tasks– Partition RTL for two tiers– Implement thread migration enhancements– Fix bugs– Replace T2 D-caches with in-house D-caches
• T2 not workable: explicit latch instances, everywhere
• T2 no longer needed: shelved crossbar and stacked DRAM L2$
• We now have in-house D-cache from AnyCore effort
– Custom-design M1 pads and F2F bondpoints
12
Electrical and Computer Engineering – North Carolina State University
AnyCore Overview• FabScalar evolution
– Released FabScalar:• Build-up cores from library of stage designs of different widths
– Next-gen FabScalar:• “Superset Core”: single Verilog description with
parameterized widths (structure sizes already parameterized)
• AnyCore derived from “superset core”– Keep static configurability of superset core to allow for
synthesis of different max sized AnyCore processors– Add dynamic configurability within max size
• AnyCore instance that was fabricated:Adaptive microarchitecture feature Configurationsfetch/dispatch width (instructions/cycle) 1, 2, 3, 4issue width (instructions/cycle) 3, 4, 5physical register file & active list 64, 96, 128load and store queues (each) 16, 32issue queue 16, 32, 48, 64
13
Electrical and Computer Engineering – North Carolina State University
AnyCore Design and Verification
• Automatic liveness test (BIST)– Upon applying power, clock, and reset, chip should toggle a dedicated
pin. This signals it is correctly executing the preset test program.
• Got netlist simulation working early– Both post-synthesis (ideal clocks, estimated SDF) and post-layout
(everything realistic)
• In-house L1 cache designs with three modes– Cache– Scratchpad– BIST (mode after processor reset)
• Scratchpad mode with first N rows preset to a test program, including a new instruction that toggles a dedicated pin.
• Debug interface for direct reading/writing key pipeline structures, scratchpads, core configuration registers, and performance counters
Notable Design Features
Applied Lessons from H3 Project
14
Electrical and Computer Engineering – North Carolina State University
AnyCore Packaging, PCB, & Bring-up• Applied more lessons from H3 project
– Used a socket instead of soldering package to PCB• Possible benefit that PCB may be repurposed for other projects• Replace defective chips• Study variations
– Used a dedicated bypassable shuntresistor to measurecurrent
– Narrowed PCB to LPCprofile. Compatible witharbitrary Xilinx boards(ML605 and Zynq boards).
Received Intermediate checks
Packaged chips from MOSIS
Examined chip orientation, wirebonds, and downbonds under microscope.Checked lead-forming to socket specification.
Sockets from Agile Checked fit of package in socket under microscope.Vdd, V+, and Gnd checks with chip in socket.
Bare PCB from B.B. Checked all connections.
Assembled PCB from B.B.
Repeated Vdd, V+, and Gnd checks with chip in socket.BIST: successful retirement of first instructions within 24 hours of receiving assembled PCB. (Toggle Pin observed April 9!)15
Electrical and Computer Engineering – North Carolina State University
Discussion and Questions
• Thank you
• Any comments or questions?
The H3 project is supported by a grant from Intel. The AnyCore project is supported by NSF grant CCF-1018517. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Fully assembled H3 PCB
Fully assembled AnyCore PCB
16
Electrical and Computer Engineering – North Carolina State University
Backup
Electrical and Computer Engineering – North Carolina State University
References
N. K. Choudhary, S. V. Wadhavkar, T. A. Shah, H. Mayukh, J. Gandhi, B. H. Dwiel, S. Navada, H. H. Najaf-abadi, and E. Rotenberg. FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores within a Canonical Superscalar Template. Proceedings of the 38th IEEE/ACM International Symposium on Computer Architecture (ISCA-38), pp. 11-22, June 2011.
E. Rotenberg, B. Dwiel, E. Forbes, Z. Zhang, R. Widialaksono, R. Basu Roy Chowdhury, N. Tshibangu, S. Lipa, W. R. Davis, and P. D. Franzon. Rationale for a 3D Heterogeneous Multi-core Processor. Proceedings of the 31st IEEE International Conference on Computer Design (ICCD-31), pp. 154-168, October 2013.
Electrical and Computer Engineering – North Carolina State University
H3 Detailed Pre-tapeout TimelineTimeline - Design and Verification
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May JunBrandon: cores
generate dual hetero. cores with FabScalardust off in-order scalar core (cut from tapeout)
in-order core checked-in to SVN repo ↑FabScalar-generated cores checked-in to SVN repo ↑
L1 instruction cachemodify Fetch-1 stage for synch. R/W RAMs for I$, BP, BTB
I/O (4 buses: core 1 req+resp, core 2 req+resp), serializer/deserializercache-core decoupling (CCD) for both I- and D-caches
synthesis and scan chain insertionRTL freeze on "allcores" (Sep. 19, 2012) ↑
delivered synthesized netlist with scan chains inserted ↑regenerate memories for L1 I$, L1 D$, BP, BTB (in IBM 8RF)
post-RTL-freeze verification (RTL simulation)post-tapeout verification (P&R netlist simulation)
hetero. multi-core research (C++ simulator, power models, etc.) research (culminated in data for ICCD-31 paper)Rangeen: L1 data cache
study OpenSparc T2 L1 data cacheintegrate T2 D$ into in-order core for early design and testing
integrated in-order core + D$ checked-in to SVN repo ↑misc. mods to T2 D$ (remove TLB, reset strategy, etc.)
redesign per-thread MHSRs for single thread miss-under-miss (MLP)modify core's load/store lane for 2-cycle MEM stageintegrated OOO cores + D$ checked-in to SVN repo ↑
debugElliott: core-side FTM, perf. counters, debug core
global migrationperf. counters
local migrationintegrate with face-to-face bus controller
debug core: start debug core from 2-wide coredebug core: replace L1 I/D caches with I/D scratchpads
debug core: replace memory bus with scan I/OZhenqian: face-to-face bus for FTMRandy: flow, physical design
missed tapeout: Tezzaron 3D run deferred indefinitely ↑download & unpack IBM 8RF PDK, ARM IP (std cells, pads, mem) ↑
physical design (floorplan, P&R, DRC, LVS, power integrity)missed tapeout: couldn't close on DRC, LVS ↑
tapeout (May 20, 2013) ↑
2012 2013
Electrical and Computer Engineering – North Carolina State University
H3 Detailed Post-tapeout Timeline
Timeline - post-tapeoutSep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr
Steve: packaging, PCBsbare die back from MOSIS (Sep. 6, 2013) ↑
debug core: wirebond chip-on-boarddebug core: package chip; design, fab, and assemble PCB
hetero. core pair: package chip; design, fab, and assemble PCBElliott: chip bring-up
develop compiler for controlled benchmark creation <<July 2013>>debug core: Vdd/Gnd short test, current test, scan test
debug core: retire millions of instructionsdebug core: simple power and freq. measurements
setup host PC and FPGA infrastructurehetero. core pair: first instructions retired ↑
hetero. core pair: synthesize full testbench to FPGAhetero. core pair: 10M+ instr. retired (no memory ops) ↑
hetero. core pair: on-going testing and debughetero. core pair: breakthrough - tracked down flaky reset issue ↑
research ICCD-32 - Design-Effort Alloy (DEA) ICCD-32 revisions & pres.research other
Thomas: scan chain scripts (debug core), chip bring-up
2013 2014 2015
Electrical and Computer Engineering – North Carolina State University
Cool H3 Pictures
Final layout
Full test setup (host PC, power supply, ML605, etc.)
Electrical and Computer Engineering – North Carolina State University
H3 Errata
Symptom Description Stage Workaround
1 Instructions that depend on loads will wake up early when the load misses (reading an incorrect value from the register file/bypass).
A misplaced `ifdef guarded key RTL during simulation, but the `ifdef condition was not enabled during synthesis.
post-tapeout Prefetch blocks such that loads will hit.
2 Source-synchronous clock port errors in FPGA testbench during PAR/mapping.
Clock signals from the cores to the FPGA were not routed to clock-capable pins of the LPC.
post-PCB-assembly
Use a jumper from the clock header of the PCB to a clock-capable pin of the XM105 debug card attached to the HPC connector of the ML605.
3 Non-repeatability in running single threaded test programs.
High precision ammeter used to measure logic current. Current draw of running core caused the ammeter to switch shunt resistance (with a relay) immediately after reset was de-asserted. During the switch, the supply as seen by the core dropped below threshold voltage, causing metastability.
post-silicon testing
Use dedicated shunt resistor, and measure voltage drop across that resistor instead of the in-line ammeter.
Electrical and Computer Engineering – North Carolina State University
H3 Errata
Symptom Description Stage Workaround
4 Repeated I-cache misses to the same block.
- Frequently-executed and mostly-taken branch in the last slot of a cache block.- A BTB update occurs when the branch is being fetched (Fetch-1 stage), and the update is for this branch. Thus, there is a BTB write and a BTB read to the same entry, that of the branch. The BTB read indicates a miss due to the concurrent read and write. This causes Fetch-1 to fetch the next sequential block, generating an I-cache miss.- The Fetch-1 stage is redirected to the branch’s taken target in the next cycle, when the branch is predecoded in the Fetch-2 stage.- Meanwhile, the I-cache miss request for the sequential block is not canceled. Further, the retrieved block is dropped by the core if the Fetch-1 stage is fetching instructions.- End result: The same miss (of the sequential block) is generated repeatedly.
post-silicon testing
This issue does not impact performance, but does make it difficult to measure the number of actual I-cache misses. It is possible to add NOPs to ensure frequently-executed and mostly-taken branches are not in the last slot of a cache block.
Electrical and Computer Engineering – North Carolina State University
H3 Errata
Symptom Description Stage Workaround
5 Cache-missed load never retires.
After block is retrieved from memory, netlist simulation shows the tag not being written to the correct set (set index goes to x’s), owing to a hold-time violation in the set index for the tag fill. The core replays a cache-missed load until it hits. After the MHSR performs the flawed line-fill, the replayed load will miss again, generating another miss request for the same block. This repeats indefinitely.
post-silicon testing
Top Core: Hold-time violation on fill path seems to affect only the tag fill, not the data fill. Further, it appears that uninitialized tags in the tag SRAM are, by chance, 0. Thus, limiting load and store addresses to have tags of 0 masks the flawed tag fill. The workaround is not guaranteed but does seem very stable. Bottom Core: Not only does the load’s tag need to be 0, but we also had to lower Vdd to slow circuit paths. With these two measures, the replayed load hits and retires. However, the replayed load does not appear to get the correct data, presumably due to the data fill path being compromised (conjecture).
Electrical and Computer Engineering – North Carolina State University
H3 Errata
Symptom Description Stage Workaround
6 Inability to read/write TRF registers.
Attempted read/write of TRF registers in a single-thread experiment. The reset of the F2F controller is fed through clock-synchronizing flip-flop pairs, but the clock of the F2F controller was not running (since it was a single thread experiment).
post-silicon testing
Ensure F2F controller is clocked when either core is clocked, even when migrations are not being tested.
7 Execution deadlock after a thread migration.
The BTB is implemented in SRAM, including its valid bits, so valid bits are not reset after a migration. This caused BTB hits on non-control instructions. False BTB hits are detected and recovered in Fetch-2 stage, except CTIQ is still pushed. Those instructions would allocate entries in the CTIQ, but during retirement, the CTIQ entry was not de-allocated. The CTIQ would fill, and fetch would stall indefinitely.
post-silicon testing
Carefully craft test programs such that instruction addresses do not overlap. Hard reset tends to work, but not guaranteed.
Electrical and Computer Engineering – North Carolina State University
H3 Errata
Symptom Description Stage Workaround
8 Repeated thread migrations only work for fewer than 33 migrations.
The MIGRATE instruction is in a loop. If the loop-ending branch is fetched before the MIGRATE instruction retires, then a CTIQ entry will be allocated for the branch. However, after the migration, the CTIQ is not reset, so the allocated entries are not de-allocated, eventually filling the CTIQ, blocking forward progress.
post-silicon testing
Add instructions (NOPs if necessary) after a MIGRATE instruction to guarantee no branches can be fetched before the MIGRATE retires.
9 Unable to repeatedly migrate from core-to-core when CCD is enabled.
Debugging in progress post-silicon testing
Unknown
Electrical and Computer Engineering – North Carolina State University
Ammeter Shunt Resistor Switching
• High precision meter still uses shunt resistor– Our meter switches shunt resistor using relay– The relay causes a lag between changes in core current draw,
and the ability of the ammeter to react
1. FPGA reset pressed (resets core, which then draws near zero current).
2. Ammeter switches to higher resistance shunt resistor
3. FPGA reset released (core starts executing, increasing current draw).
4. Ammeter switches back to lower resistance shunt resistor
1 2 3 4
Power Supply Core
DCM
Electrical and Computer Engineering – North Carolina State University
Ammeter Burden Voltage
• Using a discrete, fixed value, resistor to measure current also has trade-offs– Increasing resistance gives more accurate measurements
of voltage drop (and hence current)– However, that same voltage drop also lowers the voltage
as seen by the core– This is why the ammeter switches shunt resistances in the
first place
Power Supply Core
DVM
Voltage regulated here
Electrical and Computer Engineering – North Carolina State University
Power Supply Sense Input
• You could bump the voltage of the power supply up so that the voltage as seen by the core is within spec
• But then during reset, the core current drops to near zero, and the voltage as seen by the core is roughly the same as the bumped up output of the power supply
• A more fully featured power supply, with a sense input, can help this situation… regulation is at the sense node, not the output of node of the power supply
Power Supply Core
DVM
Sense
Voltage regulated here
Electrical and Computer Engineering – North Carolina State University
Custom Compiler
• Needed a way to explicitly control all aspects of emitted instructions
• Language syntax is assembly with a few higher-level features– Registers and memory locations can be named– if statements and while loops– Arithmetic operators, assignments, address-of
• Syntax allows the ability to place code/data at arbitrary memory locations– Including non-contiguous locations
• The compiler also emits to our checkpoint file format• Written in flex/bison
Electrical and Computer Engineering – North Carolina State University
Custom Compiler Example
% example program that % sums the values of an arraymem (0x00400000) {ii: $r1addr: $r2cond: $r4val: $r5total: $r6
addr = @data total = #0 ii = #0 cond = ii < #4 while (cond) { lw val, #0[addr] addr = addr + #4 total = total + val ii = ii + #1 cond = ii < #4 }
addr = @result sw total, #0[addr]}
% example program array datamem (0x00100000) {data: !0x0f0f0f0f !0xabcd1234 !0x00000001 !0xdeadbeefresult: !0x00000000}
Electrical and Computer Engineering – North Carolina State University
Intel 80386 Errata (it’s not just us!)
• Hardware Peculiarity: Unless pin F13 of the 80386 is connected to the +5V power supply, the 80386 never terminates a memory cycle, halting the processor.– Datasheet indicates pin F13 is NC with a note that “Pins identified as
‘NC’ should remain completely unconnected.”• Successive Floating-Point Instructions: If two floating-point
instructions are executed close together, the 80386 may force the coprocessor to start the second one too soon if the first one did not require any memory operands.
• Misaligned Floating-Point Instructions: If 80287 and/or 80387 instructions are not word-aligned, the 80386 passes the wrong instruction to the coprocessor, causing unpredictable behavior.
• Self-test: The self-test feature does not work on the A1 stepping of the 80386.
• Double Page Faults: The bug that appeared in the B0 stepping regarding page faults that occur during page faults has been made a permanent feature of the 80386…
Turley, James L., Advanced 80386 Programming Techniques, McGraw-Hill, Berkeley, CA, 1988.
Electrical and Computer Engineering – North Carolina State University
AnyCore Detailed Pre-tapeout Timeline
Timeline - Design and VerificationJan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Rangeen: 6-wide coreAdd microarchiterular adaptivity to superset
Add dynamic reconfiguration controlBeef up testbench to support dynamic reconfiguration tests
Add debug facility and configuration registersRTL functional and performance verification and debug
synthesis, fine grain clock gating and scan chain insertion Flow FinalRTL feature freeze ↑
post-synthesis verification (synthesized netlist simulation)post-layout verification (P&R netlist simulation)
UPF based power aware design and synthesis flow researchUPF based power aware netlist simulation and power analysis
Rangeen: 4-wide coreAdd coarse grain clock gates - partition level
RTL functional and performance verification and debugsynthesis and scan chain insertion
post-synthesis verification (synthesized netlist simulation)post-layout verification (P&R netlist simulation)
UPF based power aware design and synthesis flow researchUPF based power aware netlist simulation and power analysis
Rangeen: L1 cachesNon blocking write through data cache
Instruction cacheMiss handling and off chip communication machinery
integrate and debugAdd scratch mode and BIST to caches
Change cache sizes and off-chip bus widthsAnil: perf. counters, debug access and verification
Add perf countersRead and write PRF via debug bus
Read AMT via debug busVerify perf counter read and reset
Verify debug access to PRF and AMTPost synthesis verification of debug features
Verify read/write access to caches via debug busRangeen: physical design
Modify H3 flow to suit AnyCore requirementsphysical design (floorplan, P&R, DRC, LVS)
missed tapeout (May 23, 2014): couldn't close on timing - design too dense ↑tapeout (August 22, 2014) ↑
Vinesh: scan chain verification - post layoutScript to generate bit sequence from executable
Add scan support in testbenchVerify basic scan functionality
2013 2014
Electrical and Computer Engineering – North Carolina State University
AnyCore Physical Design Details
Physical design dataTechnology IBM 8RF (130nm)Dimensions 5 mm x 5 mmArea 25 mm2
Pads(signal, power)
100(79, 21)
Transistors 3.4 millionCells 1.5 millionNets 7.6 million
Electrical and Computer Engineering – North Carolina State University
AnyCore Detailed Post-tapeout Timeline
Timeline - Post TapeoutOct Nov Dec Jan Feb Mar Apr
Rangeen/Anil: packagingPackage bonding diagram
Socket selection: dictates lead trimPackaging specifics : lead trim, downbond, fill
Packaging validation after receiving packaged partsAnil: PCB design
PCB planning and knowledge transferSocket footprint creation - consult with PCB manufacturer for design rule
Rev 1A- minimum required components and headersRev 1B- debug features, silk screen markings, headers, power measurement
Design review and detailed connectivity check with FPGA and headerssubmit final PCB design to Better Boards (April 1) ↑
test bare board at Better Boards (April 7) ↑pick up assembled boards from Better Boards (April 8) ↑
Rangeen/Anil/Elliott: chip bring-upinsert chip in socket, check Vdd/V+/Gnd, attach to FPGA, BIST success! (April 9) ↑
20152014
Electrical and Computer Engineering – North Carolina State University
Cool AnyCore Pictures
Final layout Floorplan
Electrical and Computer Engineering – North Carolina State University
AnyCore Pipeline
Electrical and Computer Engineering – North Carolina State University
BIST Program
0x00 addi r1, r0, #00x08 addi r2, r0, #00x10 addi r3, r0, #00x18 addi r4, r0, #00x20 toggle0x28 nop0x30 nop0x38 nop0x40 st r3(#0), r40x48 addi r2, r2, #100x50 addi r1, r1, #50x58 addi r2, r2, #100x60 toggle0x68 addi r1, r1, #50x70 addi r2, r2, #100x78 addi r1, r1, #5
0x80 addi r4, r0, #00x88 ld r4, r3(#0)0x90 addi r1, r1, #50x98 addi r2, r2, #100xa0 toggle0xa8 addi r2, r2, #100xb0 addi r4, r4, #20xb8 addi r3, r3, #40xc0 addi r2, r2, #100xc8 addi r1, r1, #50xd0 addi r1, r1, #50xd8 jmp 0x400xe0 nop0xe8 nop0xf0 nop0xf8 nop
Electrical and Computer Engineering – North Carolina State University
Debug Bus
ConfigurationRegisters
Caches / AMT / PRF
Performance Counters
Debug Registers(PC, queue
pointers etc.)
Write Data / Wr En
Read/Write Addr
Read Data