the impact of fetch rate and reorder buffer size on ...perl_newop 0x000a8260 save %sp, -96, %sp...

50
1 1 The Impact of Fetch Rate and Reorder Buffer Size on Speculative Pre-Execution David M. Koppelman Electrical & Computer Engineering Department Louisiana State University Baton Rouge, LA U.S.A. 1 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 1

Upload: others

Post on 24-Jan-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

1 1

The Impact of Fetch Rate and Reorder Buffer Size on

Speculative Pre-Execution

David M. Koppelman

Electrical & Computer Engineering Department

Louisiana State University

Baton Rouge, LA U.S.A.

1 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 1

Page 2: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

2 2Notes on These Slides

Slides for presentation made at the Workshop on Duplicating, Deconstructing, andDebunking, held in conjunction with the 30th International Symposium on ComputerArchitecture, 8 June 2003

Pages Added Since Presentation

Formulas for computing speedup components.

Paper available via http://www.ece.lsu.edu/koppel/pubs/pe_wddd.pdf

2 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 2

Page 3: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

3 3Dynamic Speculative Pre-Execution

Dynamic Speculative Pre-Execution:

A technique to reduce the impact of load misses. . .

. . . by pre-executing copies of loads. . .

. . . along with instructions needed to compute their addresses.

Some schemes also pre-execute branches.

3 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 3

Page 4: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

4 4Load Miss (perl)

Grid 5 insn X 5 cycTime 6,968,169

Perl_sv_grow+1790x00103634 mov %i1, %o0malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i30x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0, 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2, 2, %g20x0007515c add %g3, 120, %g3 {0x2040780x00075160 ldsb [ %g2 + %g3 ], %i2 {[b0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2, 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nextf+0x000751c8 cmp %o3, 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3, 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2

4 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 4

Page 5: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

5 5Load Miss and Slice (perl)

Grid 5 insn X 5 cycTime 6,968,169

Perl_sv_grow+1790x00103634 mov %i1 , %o0malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i30x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g20x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[b0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3 , 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3 , 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2

5 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 5

Page 6: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

6 6

Pre-Execution Idea

Preparation

Find troublesome (frequently missing) load.

Extract slice, load and predecessors . . .

. . . by examining instructions within a construction window.

Identify trigger instruction to initiate slice.

Put slice in cache, using trigger address for index.

Execution

Hardware usually added onto a simultaneous multithreading core.

When trigger encountered a p-thread is forked to run slice.

Slice’s load, prefetch, should execute in advance of original, target, load.

Margin is number of cycles from initiation of prefetch to target.

When slice fi anished results discarded.

6 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 6

Page 7: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

7 7Typical Pre-Execution Hardware

MainDecode,Rename

InstructionQueue

SliceCache

Slice Dcd,Rename

Reorder BufferSlice

Construct

Func.Unit

Func.Unit

Predict,Fetch

Func.Unit

Instructions

Instructions

Program Counter

MissPredict

7 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 7

Page 8: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

8 8Prefetch Load Miss (perl)

Grid 5 insn X 5 cycTime 6,968,503

malloc+10x00075134 orcc %g0, %i0, %i3morecore0x000756f8 save %sp, -104, %spmorecore+60x00075710 sethi %hi(0x215c00), %i0malloc+20x00075138 bne +4i {malloc+6}morecore+110x00075724 lduw [ %i0 + 876 ], %o3malloc+30x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g20x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[b0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3 , 00x000751cc bne +6i {malloc+45}

8 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 8

Page 9: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

9 9Prefetch Load Miss (perl)

Grid 5 insn X 5 cycTime 6,968,503

Perl_yylex+1210x00081264 lduh [ %g2 + 160 ], %g2 {[PL_lamalloc+120x00075160 ldsb [ %g2 + %g3 ], %i2Perl_yylex+1220x00081268 mov %o0, %i4malloc+140x00075168 sethi %hi(0x216000), %g2Perl_yylex+1230x0008126c cmp %g2, 150malloc+350x000751bc sll %i2, 2, %i1Perl_yylex+1240x00081270 bne +65i {Perl_yylex+189}malloc+360x000751c0 add %g2, 120, %i0Perl_yylex+1250x00081274 mov 12, %o0 {0xc}malloc+370x000751c4 lduw [ %i1 + %i0 ], %o3Perl_yylex+1890x00081374 call +39867i {Perl_newOP}0x00081378 mov 0, %o1 {0x0}Perl_newOP0x000a8260 save %sp, -96, %sp0x000a8264 call -52301i {malloc}0x000a8268 mov 24, %o0 {0x18}malloc

9 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 9

Page 10: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

10 10Pre-Execution Potential

No shortage of load misses to help . . .

. . . but will prefetch really execute before target?

10 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 10

Page 11: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

11 11Small-Window, Low Margin (gzip)

Grid 40 insn X 5 cycTime 89,179,264

deflate+1240x00012648 ldub [ %g3 - 1 ], %g30x0001264c sll %g4, 1, %g40x00012650 subcc %o0, 1, %o00x00012654 stw %o0, [ %g2 + 148 ] {[prev_0x00012658 xor %o1 , %g3 , %g20x0001265c and %g2 , %l6 , %o40x00012660 sll %o4, 1, %g20x00012664 lduh [ %g2 + %l3 ], %g30x00012668 sth %g3, [ %g4 + %l2 ]0x0001266c sth %o3, [ %g2 + %l3 ]0x00012670 bne,a -15i {deflate+119}0x00012674 add %o3 , 1, %o3deflate+1190x00012634 sethi %hi(0x72000), %g20x00012638 add %o3 , %l4 , %g30x0001263c lduw [ %g2 + 148 ], %o0 {[prev_0x00012640 sll %o4 , 5, %o10x00012644 and %o3, %l6, %g40x00012648 ldub [ %g3 - 1 ], %g30x0001264c sll %g4, 1, %g40x00012650 subcc %o0, 1, %o00x00012654 stw %o0, [ %g2 + 148 ] {[prev_0x00012658 xor %o1 , %g3 , %g20x0001265c and %g2 , %l6 , %o40x00012660 sll %o4 , 1, %g20x00012664 lduh [ %g2 + %l3 ], %g30x00012668 sth %g3 , [ %g4 + %l2 ]0x0001266c sth %o3, [ %g2 + %l3 ]0x00012670 bne,a -15i {deflate+119}0x00012674 add %o3, 1, %o3deflate+119

11 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 11

Page 12: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

12 12Large-Window, Low Margin (gzip)

Grid 200 insn X 25 cycTime 89,178,997

deflate+1240x00012648 ldub [ %g3 - 1 ], %g30x0001264c sll %g4, 1, %g40x00012650 subcc %o0, 1, %o00x00012654 stw %o0, [ %g2 + 148 ] {[prev_0x00012658 xor %o1 , %g3 , %g20x0001265c and %g2 , %l6 , %o40x00012660 sll %o4, 1, %g20x00012664 lduh [ %g2 + %l3 ], %g30x00012668 sth %g3, [ %g4 + %l2 ]0x0001266c sth %o3, [ %g2 + %l3 ]0x00012670 bne,a -15i {deflate+119}0x00012674 add %o3 , 1, %o3deflate+1190x00012634 sethi %hi(0x72000), %g20x00012638 add %o3 , %l4 , %g30x0001263c lduw [ %g2 + 148 ], %o0 {[prev_0x00012640 sll %o4 , 5, %o10x00012644 and %o3, %l6, %g40x00012648 ldub [ %g3 - 1 ], %g30x0001264c sll %g4, 1, %g40x00012650 subcc %o0, 1, %o00x00012654 stw %o0, [ %g2 + 148 ] {[prev_0x00012658 xor %o1 , %g3 , %g20x0001265c and %g2 , %l6 , %o40x00012660 sll %o4 , 1, %g20x00012664 lduh [ %g2 + %l3 ], %g30x00012668 sth %g3 , [ %g4 + %l2 ]0x0001266c sth %o3, [ %g2 + %l3 ]0x00012670 bne,a -15i {deflate+119}0x00012674 add %o3, 1, %o3deflate+119

12 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 12

Page 13: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

13 13Higher Margin, Ignoring Mispredictions (gcc)

Grid 200 insn X 25 cycTime 83,166,363

.LLM385+1 mark_target_live_regs+543 ../../gcc/resource.c:10x001b4e24 cmp %o0, 100x001b4e28 bne,a +36i {.LLM393+1 mark_target_live_0x001b4e2c lduw [ %o4 + 8 ], %o4.LLM393+1 mark_target_live_regs+580 ../../gcc/resource.c:10x001b4eb8 cmp %o4, 00x001b4ebc bne,a -38i {.LLM385+1 mark_target_live_r0x001b4ec4 ba +49i {.LLM403 mark_target_live_regs+0x001b4ec8 mov %l1 , %o2.LLM403 mark_target_live_regs+632 ../../gcc/resource.c:1550x001b4f88 cmp %l1, 00x001b4f8c be,a +33i {.LLM410 mark_target_live_regs0x001b4f94 lduw [ %l1 ], %o00x001b4f98 andcc %o0, 32, %g00x001b4f9c be,a +14i {.LLM407 mark_target_live_regs0x001b4fa0 lduw [ %o2 + 12 ], %o2.LLM407 mark_target_live_regs+651 ../../gcc/resource.c:1650x001b4fd4 cmp %o2, 00x001b4fd8 be +14i {.LLM410 mark_target_live_regs+0x001b4fdc mov %o2 , %l10x001b4fe0 lduh [ %o2 ], %o00x001b4fe4 cmp %o0, 290x001b4fe8 bne +11i {.LLM410+1 mark_target_live_re0x001b4fec cmp %l1, %i1

13 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 13

Page 14: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

14 14High Margin (swim)

Grid 200 insn X 25 cycTime 87,174,642

calc3_+2410x0001337c faddd %f12, %f8, %f260x00013380 fmuld %f22, %f14, %f20x00013384 fmuld %f20, %f14, %f80x00013388 lddf [ %l0 + %o1 ], %f120x0001338c stdf %f26, [ %l0 + %o0 ]0x00013390 fsubd %f30, %f16, %f160x00013394 stdf %f30, [ %i3 + %o1 ]0x00013398 fsubd %f4, %f2, %f20x0001339c stdf %f4, [ %i5 + %o1 ]0x000133a0 fsubd %f10, %f8, %f80x000133a4 stdf %f10, [ %l1 + %o1 ]0x000133a8 faddd %f16, %f0, %f00x000133ac add %o1 , 16, %o00x000133b0 lddf [ %i3 + %g2 ], %f300x000133b4 faddd %f2, %f6, %f20x000133b8 lddf [ %i5 + %g2 ], %f40x000133bc fmuld %f18, %f0, %f160x000133c0 lddf [ %l1 + %g2 ], %f100x000133c4 faddd %f8, %f12, %f80x000133c8 lddf [ %l4 + %g2 ], %f00x000133cc fmuld %f18, %f2, %f20x000133d0 lddf [ %l5 + %g2 ], %f60x000133d4 faddd %f24, %f16, %f160x000133d8 lddf [ %g4 + %g2 ], %f120x000133dc fmuld %f18, %f8, %f80x000133e0 lddf [ %l2 + %g2 ], %f240x000133e4 stdf %f16, [ %l2 + %o1 ]

14 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 14

Page 15: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

15 15Hardware-Only Schemes

Collins, et al, Micro 2001.

Dynamic Speculative Pre-Execution

Troublesome load selection: frequent blocker of ROB.

Construction window: 512 instructions.

Slice: some optimization.

Execution: up to 7 p-threads in ¤ight.

Moshovos, et al, ICS 2001

Slice Processor

Troublesome load selection: load miss predictor.

Construction window: 32 instructions.

Execution: up to 8 p-threads in ¤ight.

15 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 15

Page 16: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

16 16Pr fi ce Assisted Schemes

Roth & Sohi, HPCA 2001.

Data-Driven Multithreading

Troublesome load selection: trace analysis.

Construction window: 1024 instructions.

Execution: up to 3 p-threads in ¤ight.

Branches also pre-executed.

Results can be re-integrated. (No need to re-execute.)

Zilles & Sohi, ISCA 2001.

Troublesome load selection: trace analysis.

P-threads hand constructed and optimized.

Branches also pre-executed.

16 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 16

Page 17: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

17 17Remaining Issues

Focus of Other Studies

Resource consumption issues.

Slice construction and triggering techniques.

Remaining Issues

How well can ROB hide latency reduced by pre-execution?

How does fetch rate impact margin?

17 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 17

Page 18: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

18 18? ? ? Deconstruction ? ? ?

Pre-Execution v. Enlarged Reorder Buffer (ROB)

Both hide load miss latency.

Both consume CPU core chip area.

Performance comparison helpful for tradeoffs.

Larger Reorder Buffer Advantage

No special hardware needed.

Pre-Execution Advantages

Reduced branch resolution time.

No need to schedule from enlarged window.

18 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 18

Page 19: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

19 19Speedup Components

Interesting difference: pre-execution can reduce branch resolution time.

Use two speedup components to measure effect.

Squash Speedup Component

Speedup due to reduced branch and jump resolution times.

Larger ROB cannot realize this type of speedup.

Stall Speedup Component

Speedup due to certain reduced correct-path stall cycles.

Both pre-execution and larger ROB can realize this speedup.

Manuscript Note

Manuscript uses less precise speedup components (see below).

19 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 19

Page 20: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

20 20Details of Speedup Components

Execution Time Components

Each cycle examine slots where hardware decodes instructions. . .

. . . n-way superscalar processor has n slots.

Find each of the following slot counts:

iqfw, holds wrong-path instruction (not stalled),

iqsw, holds stalled wrong-path instruction and ROB is full,

iqscc, holds stalled correct-path instruction on mispredicted-branch or -jump resolutioncritical path and ROB is full,

isf , holds stalled correct-path instruction (not counting iqscc instructions) and ROB is full,

iofc, holds correct-path instruction (not stalled),

ioe, slot is empty or holds stalled instruction and ROB is not full.

Note: Full ROB causes pipeline to stall, stalls also caused by full instruction queues, lack offree physical registers, etc.

Component iqscc, not used in manuscript.

20 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 20

Page 21: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

21 21Details of Speedup Components

Slot Categories

Let n denote decode width and T denote run time in cycles.

Total slots: nT = iqfw + iqsw + iqscc + isf + iofc + ioe

Squash slots: iq = iqfw + iqsw + iqscc

Stall slots: is = isf

Other slots: is = iofc + ioe

CPI isiq + is + io

niofc.

21 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 21

Page 22: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

22 22

Let iq(X), is(X), and io(X), denote counts for system X.

Total speedup of A over B is:iq(B) + is(B) + io(B)iq(A) + is(A) + io(A)

.

The squash component of speedup of A or B is:iq(B) + is(B) + io(B)iq(A) + is(B) + io(B)

.

The stall component of speedup of A or B is:iq(B) + is(B) + io(B)iq(B) + is(A) + io(B)

.

22 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 22

Page 23: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

23 23Base Simulated Systems

Common Parameter Value

Decode Width 8-way superscalarReorder Buffer 256 instructionsReturn-Address Stack 8 entriesL1 ICache 5-way, 328 kB, 256-B lineL1 DCache Hit Latency 2 cycleL1 DCache 4-way, 16 kiB, 64-B lineL2 DCache 8-way, 256 kiB, 64-B lineL2 Hit Latency 16 cycles

L2 DCache Miss Latency ≈ 100 cyclesID to EX 3 cycles.Global History 16 branchesInteger Units 80Floating-Point Units 4Memory Units 12Instruction Queues 2048

Pre-Execution Parameter Value

Construction Window 256Injection Rate 4-insn / cycleMaximum Slice Size 64 instructionsIn-Flight Limit 8 slicesSlice Cache 216 slices

23 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 23

Page 24: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

24 24Benchmarks

Chosen for pre-execution suitability and comparability.

Olden: mst, em3d

SPEC2000/Ref Inputs: vpr

Other benchmarks.

SPEC2000/Ref Inputs: mcf, wupwise

SPEC2000/Test Inputs: swim

SPEC2000/Other Inputs: gzip, gcc, perl, TEX.

24 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 24

Page 25: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

25 25Pre-Execution Characterization, Construction Window Size

Speedup v. Construction Window

Vary construction window from 32 to 2048 instructions.

25 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 25

Page 26: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

26 26Construction Window Size, Benchmark Set 1

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

mst em3d mcf swim vpr wupwise

Total32 48 64 96 128 160 192 256 384 512768 1024 2048

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

26 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 26

Page 27: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

27 27Construction Window Size, Benchmark Set 1

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

mst em3d mcf swim vpr wupwise

Stall Squash Total32 48 64 96 128 160 192 256 384 512768 1024 2048

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

27 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 27

Page 28: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

28 28Construction Window Size, Benchmark Set 2

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

bzip2 gcc gzip perl TeX vpr

Stall Squash Total32 48 64 96 128 160 192 256 384 512768 1024 2048

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

1.25

28 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 28

Page 29: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

29 29Construction Window Size, Average

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

Stall Squash Total32 48 64 96 128 160192 256 384 512 768 10242048

Spe

edup

1.00

1.02

1.04

1.06

1.08

1.10

1.12

1.14

29 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 29

Page 30: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

30 30

Construction Window Size Comments

Average performance maxes at 256, at maximum margin of 32 cycles.

Limit appears due to limit of 8 threads.

Large Speedups

Large number of level-2 cache misses.

Speedup due mostly to stall component (except vpr).

Smaller Speedups

Squash plays a larger roll.

Results comparable with other studies.

30 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 30

Page 31: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

31 31Pre-Execution in bzip2

Grid 200 insn X 25 cycTime 40,080,754

.LLM189 mainSimpleSort+45 blocksort.c:5630x0001f058 add %i0 , %l1 , %i00x0001f05c mov %g2, %o00x0001f060 ldub [ %i1 + %i0 ], %g30x0001f064 ldub [ %i1 + %g1 ], %g2.LLM384+1 mainSimpleSort+533 blocksort.c:4200x0001f7f8 ldub [ %i1 + %g1 ], %g2.LLM190+3 mainSimpleSort+49 blocksort.c:4080x0001f068 cmp %g3 , %g2.LLM280 mainSimpleSort+272 blocksort.c:5740x0001f3e4 add %o7, 1, %o7.LLM374 mainSimpleSort+504 blocksort.c:5880x0001f784 add %o7, 1, %o7.LLM465+2 mainSimpleSort+731 blocksort.c:5990x0001fb10 mov %o7, %i0.LLM190+4 mainSimpleSort+50 blocksort.c:4080x0001f06c bne +80i {.LLM212+6 mainSim0x0001f070 cmp %g2, %g30x0001f074 add %i0, 1, %i00x0001f078 add %g1, 1, %g10x0001f07c ldub [ %i1 + %i0 ], %g30x0001f080 ldub [ %i1 + %g1 ], %g2

31 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 31

Page 32: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

32 32Pre-Execution in vpr

Grid 200 insn X 25 cycTime 117,235,010

32 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 32

Page 33: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

33 33ROB Size ³. Construction Window Size

Conventional Systems: ROB 256 to 512.

Pre-Execution Systems: Construction Window 32 - 2048,. . .. . . increments match change in ROB size.

Speedup is over conventional system with 256-entry ROB.

33 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 33

Page 34: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

34 34Reorder Buffer 256-512, Construction Window 32-2048, Set 1

C C P

mst

C C P

em3d

C C P

mcf

C C P

swim

C C P

vpr

C C P

wupwise

Stall Squash Totalr+0 r+64 r+128 r+160 r+192 r+256 w64w128 w160 w192 w256 w512 w1024 w2048

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

1.5

1.6

1.7

34 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 34

Page 35: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

35 35Reorder Buffer 256-512, Construction Window 32-2048, Set 2

C C P

bzip2

C C P

gcc

C C P

gzip

C C P

perl

C C P

TeX

C C P

vpr

Stall Squash Totalr+0 r+64 r+128 r+160 r+192 r+256 w64w128 w160 w192 w256 w512 w1024 w2048

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

1.25

35 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 35

Page 36: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

36 36Reorder Buffer 256-512, Construction Window 32-2048, Average

C C P

Stall Squash Totalr+0 r+64 r+128 r+160 r+192 r+256 w64w128 w160 w192 w256 w512 w1024 w2048

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

36 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 36

Page 37: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

37 37

Results

ROB more effective at stall component.

Exceptions are em3d and wupwise.

Pre-execution’s impact on squash provides advantage over ROB for many benchmarks.

37 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 37

Page 38: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

38 38Pre-Execution v. 512-Entry ROB with Varying Pipeline Lengths

Doubling ROB size (and scheduling queues) might have a critical path impact.

Compensate by increasing minimum number of cycles from decode to execute (ID-to-EX).

Systems

Conventional, 256-entry ROB, 3-cycle ID-to-EX pipeline.

Conventional, 512-entry ROB, 3- to 9-cycle ID-to-EX pipeline.

Pre-execution, 256-entry ROB, 3-cycle ID-to-EX pipeline.

38 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 38

Page 39: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

39 39Pipeline Lengths, Set 1

C C P

mst

C C P

em3d

C C P

mcf

C C P

swim

C C P

vpr

C C P

wupwise

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.8

1.0

1.2

1.4

1.6

1.8

39 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 39

Page 40: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

40 40Pipeline Lengths, Set 2

C C P

bzip2

C C P

gcc

C C P

gzip

C C P

perl

C C P

TeX

C C P

vpr

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.8

0.9

1.0

1.1

1.2

1.3

40 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 40

Page 41: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

41 41Pipeline Lengths, Average

C C P

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.90

0.95

1.00

1.05

1.10

1.15

1.20

41 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 41

Page 42: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

42 42Pre-Execution v. 512-Entry ROB with Varying Pipeline Lengths

Results

Pipeline delay handicaps conventional system in ways pre-execution helps.

42 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 42

Page 43: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

43 43Pre-Execution and Fetch Rate

Systems (Conventional and Pre-Execution)

4-way, 1- and 2-cache ports.

8-way, 2- and 3-cache ports. (Three-port predicts two blocks/cycle.)

16-way, 3- and 4-cache ports. (Four-port predicts three blocks/cycle.)

43 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 43

Page 44: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

44 44Fetch Rate Set 1, CPI

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mst

Ports 1 2DW 4

Ports 2 38

Ports 3 416

em3d

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mcf

Ports 1 2DW 4

Ports 2 38

Ports 3 416

swim

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Ports 1 2DW 4

Ports 2 38

Ports 3 416

wupwise

Commit Fetch Other Squash StallC P

CP

I Con

trib

utio

n

0.0

0.2

0.4

0.6

0.8

1.0

44 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 44

Page 45: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

45 45Fetch Rate Set 1, Speedup

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mst

Ports 1 2DW 4

Ports 2 38

Ports 3 416

em3d

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mcf

Ports 1 2DW 4

Ports 2 38

Ports 3 416

swim

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Ports 1 2DW 4

Ports 2 38

Ports 3 416

wupwise

Stall Squash TotalP

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

45 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 45

Page 46: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

46 46Fetch Rate Set 2, CPI

Ports 1 2DW 4

Ports 2 38

Ports 3 416

bzip2

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gcc

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gzip

Ports 1 2DW 4

Ports 2 38

Ports 3 416

perl

Ports 1 2DW 4

Ports 2 38

Ports 3 416

TeX

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Commit Fetch Other Squash StallC P

CP

I Con

trib

utio

n

0.0

0.2

0.4

0.6

0.8

1.0

46 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 46

Page 47: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

47 47Fetch Rate Set 2, Speedup

Ports 1 2DW 4

Ports 2 38

Ports 3 416

bzip2

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gcc

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gzip

Ports 1 2DW 4

Ports 2 38

Ports 3 416

perl

Ports 1 2DW 4

Ports 2 38

Ports 3 416

TeX

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Stall Squash TotalP

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

1.25

47 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 47

Page 48: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

48 48Fetch Rate Average, CPI

Ports 1 2DW 4

Ports 2 38

Ports 3 416

Commit Fetch Other Squash StallC P

CP

I Con

trib

utio

n

0.0

0.2

0.4

0.6

0.8

48 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 48

Page 49: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

49 49Fetch Rate Average, Speedup

Ports 1 2DW 4

Ports 2 38

Ports 3 416

Stall Squash TotalP

Spe

edup

1.02

1.04

1.06

1.08

1.10

1.12

1.14

49 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 49

Page 50: The Impact of Fetch Rate and Reorder Buffer Size on ...Perl_newOP 0x000a8260 save %sp, -96, %sp 0x000a8264 call -52301i {malloc} 0x000a8268 mov 24, %o0 {0x18} malloc 9 Workshop on

50 50Conclusions

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mst

Ports 1 2DW 4

Ports 2 38

Ports 3 416

em3d

Ports 1 2DW 4

Ports 2 38

Ports 3 416

mcf

Ports 1 2DW 4

Ports 2 38

Ports 3 416

swim

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Ports 1 2DW 4

Ports 2 38

Ports 3 416

wupwise

Commit Fetch Other Squash StallC P

CP

I Con

trib

utio

n

0.0

0.2

0.4

0.6

0.8

1.0

Ports 1 2DW 4

Ports 2 38

Ports 3 416

bzip2

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gcc

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gzip

Ports 1 2DW 4

Ports 2 38

Ports 3 416

perl

Ports 1 2DW 4

Ports 2 38

Ports 3 416

TeX

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Stall Squash TotalP

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

1.25

Ports 1 2DW 4

Ports 2 38

Ports 3 416

bzip2

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gcc

Ports 1 2DW 4

Ports 2 38

Ports 3 416

gzip

Ports 1 2DW 4

Ports 2 38

Ports 3 416

perl

Ports 1 2DW 4

Ports 2 38

Ports 3 416

TeX

Ports 1 2DW 4

Ports 2 38

Ports 3 416

vpr

Commit Fetch Other Squash StallC P

CP

I Con

trib

utio

n

0.0

0.2

0.4

0.6

0.8

1.0

C C P

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.90

0.95

1.00

1.05

1.10

1.15

1.20

C C P

mst

C C P

em3d

C C P

mcf

C C P

swim

C C P

vpr

C C P

wupwise

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.8

1.0

1.2

1.4

1.6

1.8

C C P

bzip2

C C P

gcc

C C P

gzip

C C P

perl

C C P

TeX

C C P

vpr

Stall Squash TotalR256-L3 R512-L3 R512-L4 R512-L5 R512-L6 R512-L7 R512-L9

Spe

edup

0.8

0.9

1.0

1.1

1.2

1.3

ROBROBROBROBROBC

ROBP

Stall Squash Total320 384 416 448 512 256

Spe

edup

0.95

1.00

1.05

1.10

1.15

1.20

1.25

1.30

Grid 5 insn X 5 cycTime 6,968,499

Perl_yylex+740x000811a8 sethi %hi(0x218400), %g2malloc+110x0007515c add %g3, 120, %g3Perl_yylex+1210x00081264 lduh [ %g2 + 160 ], %g2 {[PL_lamalloc+120x00075160 ldsb [ %g2 + %g3 ], %i2Perl_yylex+1220x00081268 mov %o0, %i4malloc+140x00075168 sethi %hi(0x216000), %g2Perl_yylex+1230x0008126c cmp %g2, 150malloc+350x000751bc sll %i2, 2, %i1Perl_yylex+1240x00081270 bne +65i {Perl_yylex+189}malloc+360x000751c0 add %g2, 120, %i0Perl_yylex+1250x00081274 mov 12, %o0 {0xc}malloc+370x000751c4 lduw [ %i1 + %i0 ], %o3Perl_yylex+1890x00081374 call +39867i {Perl_newOP}0x00081378 mov 0, %o1 {0x0}Perl_newOP

ROBC

ROBROBROBROBROBC P

bzip2

C C P

gcc

C C P

gzip

C C P

perl

C C P

TeX

C C P

vpr

Stall Squash Total256 320 384 416 448 512

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

Stall Squash Total32 48 64 96 128 160192 256 384 512 768 10242048

Spe

edup

1.00

1.02

1.04

1.06

1.08

1.10

1.12

1.14

Cnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst WndwCnst Wndw

mst em3d mcf swim vpr wupwise

Total32 48 64 96 128 160 192 256 384 512768 1024 2048

Spe

edup

0.9

1.0

1.1

1.2

1.3

1.4

Doubled ROB size can attain comparable performance . . .

. . . though pre-execution better on many benchmarks.

When larger ROB “paid for” with longer pipeline, pre-execution is clearly better.

Fetch rate has minor impact on pre-execution effectiveness.

Grid 40 insn X 5 cycTime 14,028,952

S_regmatch+21100x00163030 mov %i1 , %l3S_regmatch+1110x001610f4 cmp %i1, 00x001610f8 be -100i {S_regmatch+12}0x001610fc sethi %hi(0x1f0800), %g20x00161100 lduh [ %l3 + 2 ], %g20x00161104 ldub [ %l3 + 1 ], %o00x00161108 sll %g2, 2, %g20x0016110c add %l3, %g2, %i10x00161110 cmp %i1, %l30x00161114 bne +3i {S_regmatch+122}0x00161118 mov %o0, %l00x00161120 cmp %o0, 770x00161124 bgu +3509i {S_regmatch+36320x00161128 sethi %hi(0x219c00), %g20x0016112c sethi %hi(0x160c00), %g10x00161130 sll %o0 , 2, %g30x00161134 add %g1 , 512, %g2 {0x160e00x00161138 lduw [ %g3 + %g2 ], %g3 {[0x0x0016113c jmp %g3 + %g20x00161140 nop S_regmatch+3010

Grid 40 insn X 5 cycTime 14,029,691

malloc+110x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[0x20x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2 , 120, %i0 {0x2160780x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3, 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3, 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2malloc+1180x00075308 cmp %i2, 120x0007530c lduw [ %o3 ], %g20x00075310 bl +13i {malloc+133}0x00075314 stw %g2 , [ %i1 + %i0 ] {[nextf+malloc+1330x00075344 ret 0x00075348 restore %g0, %o3, %o0

Grid 5 insn X 5 cycTime 6,968,528

morecore+110x00075724 lduw [ %i0 + 876 ], %o3malloc+30x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g20x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[buc0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2 , 120, %i0 {0x216070x000751c4 lduw [ %i1 + %i0 ], %o3 {[ne0x000751c8 cmp %o3 , 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3 , 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2

Grid 5 insn X 5 cycTime 6,968,499

Perl_yylex+740x000811a8 sethi %hi(0x218400), %g2malloc+110x0007515c add %g3, 120, %g3Perl_yylex+1210x00081264 lduh [ %g2 + 160 ], %g2 {[PL_lamalloc+120x00075160 ldsb [ %g2 + %g3 ], %i2Perl_yylex+1220x00081268 mov %o0, %i4malloc+140x00075168 sethi %hi(0x216000), %g2Perl_yylex+1230x0008126c cmp %g2, 150malloc+350x000751bc sll %i2, 2, %i1Perl_yylex+1240x00081270 bne +65i {Perl_yylex+189}malloc+360x000751c0 add %g2, 120, %i0Perl_yylex+1250x00081274 mov 12, %o0 {0xc}malloc+370x000751c4 lduw [ %i1 + %i0 ], %o3Perl_yylex+1890x00081374 call +39867i {Perl_newOP}0x00081378 mov 0, %o1 {0x0}Perl_newOP

Grid 5 insn X 5 cycTime 6,968,503

malloc+10x00075134 orcc %g0, %i0, %i3morecore0x000756f8 save %sp, -104, %spmorecore+60x00075710 sethi %hi(0x215c00), %i0malloc+20x00075138 bne +4i {malloc+6}morecore+110x00075724 lduw [ %i0 + 876 ], %o3malloc+30x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g20x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[bu0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3 , 00x000751cc bne +6i {malloc+45}

Grid 5 insn X 5 cycTime 6,968,503

Perl_yylex+1210x00081264 lduh [ %g2 + 160 ], %g2 {[PL_lamalloc+120x00075160 ldsb [ %g2 + %g3 ], %i2Perl_yylex+1220x00081268 mov %o0, %i4malloc+140x00075168 sethi %hi(0x216000), %g2Perl_yylex+1230x0008126c cmp %g2, 150malloc+350x000751bc sll %i2, 2, %i1Perl_yylex+1240x00081270 bne +65i {Perl_yylex+189}malloc+360x000751c0 add %g2, 120, %i0Perl_yylex+1250x00081274 mov 12, %o0 {0xc}malloc+370x000751c4 lduw [ %i1 + %i0 ], %o3Perl_yylex+1890x00081374 call +39867i {Perl_newOP}0x00081378 mov 0, %o1 {0x0}Perl_newOP0x000a8260 save %sp, -96, %sp0x000a8264 call -52301i {malloc}0x000a8268 mov 24, %o0 {0x18}malloc

Grid 5 insn X 5 cycTime 6,968,169

Perl_sv_grow+1790x00103634 mov %i1 , %o0malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i30x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g20x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[bu0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3 , 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3 , 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2

Grid 5 insn X 5 cycTime 6,968,169

Perl_sv_grow+1790x00103634 mov %i1, %o0malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i30x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0, 1, %g20x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2, 2, %g20x0007515c add %g3, 120, %g3 {0x2040780x00075160 ldsb [ %g2 + %g3 ], %i2 {[bu0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2, 2, %i10x000751c0 add %g2, 120, %i0 {0x216078}0x000751c4 lduw [ %i1 + %i0 ], %o3 {[nextf+0x000751c8 cmp %o3, 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3, 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2

Grid 5 insn X 5 cycTime 498,720

S_new_logop+9660x000b36a0 mov 28, %o0 {0x1c}malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i3 {0x1c}0x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g2 {0x1b}0x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g2 {0x6}0x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[buc0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2 , 120, %i0 {0x2160780x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3, 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3, 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2malloc+1180x00075308 cmp %i2, 120x0007530c lduw [ %o3 ], %g20x00075310 bl +13i {malloc+133}0x00075314 stw %g2 , [ %i1 + %i0 ] {[nextf+

Grid 5 insn X 5 cycTime 498,720

S_new_logop+9660x000b36a0 mov 28, %o0 {0x1c}malloc0x00075130 save %sp, -176, %sp0x00075134 orcc %g0, %i0, %i3 {0x1c}0x00075138 bne +4i {malloc+6}0x0007513c mov 4, %i2 {0x4}malloc+60x00075148 cmp %i0, 800x0007514c bgu +8i {malloc+15}0x00075150 sub %i0 , 1, %g2 {0x1b}0x00075154 sethi %hi(0x204000), %g30x00075158 srl %g2 , 2, %g2 {0x6}0x0007515c add %g3 , 120, %g3 {0x204070x00075160 ldsb [ %g2 + %g3 ], %i2 {[buc0x00075164 ba +22i {malloc+35}0x00075168 sethi %hi(0x216000), %g2malloc+350x000751bc sll %i2 , 2, %i10x000751c0 add %g2 , 120, %i0 {0x2160780x000751c4 lduw [ %i1 + %i0 ], %o3 {[nex0x000751c8 cmp %o3, 00x000751cc bne +6i {malloc+45}0x000751d0 cmp %o3, 0malloc+450x000751e4 bne +73i {malloc+118}0x000751e8 sethi %hi(0x218400), %g2malloc+1180x00075308 cmp %i2, 120x0007530c lduw [ %o3 ], %g20x00075310 bl +13i {malloc+133}0x00075314 stw %g2 , [ %i1 + %i0 ] {[nextf+

Load Miss

50 Workshop on Duplicating, Deconstructing, and Debunking, ISCA 03. 50