Open64 on MIPS: Porting and enhancing Open64 for Loongson II. Loongcc Group, ICT, Beijing. Seattle, Mar. 21, 2009

Open64 on MIPS

Porting and enhancing Open64 for Loongson II

Loongcc Group, ICT, Beijing

Seattle, Mar. 21, 2009

Outline

· What's Loongson II?
· What's Loongcc?
· How Loongcc works, taking art as an example
· The porting process and performance evaluation

The chip

Loongson 2F in the Loongson II family

Features
· 64-bit, out-of-order, 4-issue (0.8~1 GHz)
· MIPS III-compatible
· On-chip 64 KB/64 KB L1 caches, 512 KB L2 cache
· On-chip MMU supporting DDR2 (533 MHz)

Loongcc

· Yet another Open64 branch
· Targeting the Loongson family
· Aims: good performance, robustness
· Open source

Loongcc’s transformation of art

Transformation of art

Structure peeling produces temporary arrays.

Before (array of structures):

    typedef struct {
        double *I;
        double W;
        …
    } f1_neuron;

After (one array per field):

    double **f1_layer_I;
    double *f1_layer_W;

Structure peeling

[Figure: interleaved layout, fields stored as A B A B A B. 50% cache line utilization.]

[Figure: peeled layout, ABABAB split into AAA and BBB. 100% cache line utilization.]

Temporary arrays

Special access pattern in a loop:

    A := i + C
    t := ||A||
    B := t^(-1) A
    …
    (next iteration of loop)

Old values are always killed.

There is no need to write the dirty cache lines of A back to memory after they have been used.

Temporary arrays

[Figure sequence: cache contents while the loop runs with separate temporary arrays A, B and C.]
· A is written and occupies the cache.
· Writing B pushes dirty lines of A out of the cache. They need to be written to memory!
· Writing C pushes out more lines. More is written to memory.
· By the next iteration the cache holds B and C but not A. Write misses!

Problems with temporary arrays

· Unnecessary writes to memory
· Large cache footprint

Solution? Contraction?

The access pattern prevents array contraction: t := ||A|| reads every element of A, so all of A needs to be ready before any element of B is computed.

Solution? Overlay.

Array overlay

[Figure sequence: A, B and C are overlaid on the same storage, which stays in the cache.]
· No write misses, not even cold misses.
· Nothing is pushed out of the cache, so there are no memory writes.
· In the next iteration A is still in the cache.

No writes to memory (as long as the overlaid array stays in the cache).

[Figure: with separate arrays, the cache holds B and C while A has been pushed out; with overlay, A, B and C share the storage of one array in the cache.]

Smaller cache footprint!

Effect of Overlay

On Loongson 2F, for art

[Charts not preserved in the transcript.]

Other source-to-source transformations

· Array
  · Transposition
  · Flattening: multi-dimensional array to one-dimensional
· Structure
  · Splitting
· Special loop patterns

Effect of source-to-source transformation of art

[Figure: bar chart of speedup (y-axis from 0 to 5) on Intel Xeon, AMD Opteron and Loongson 2F, with three series: Original, W/O Overlay, W/ Overlay.]

Effect of source-to-source transformation

· Works well when special patterns exist, such as a hot, large array of structures. It works well for art and equake.
· Applying it to the SPEC2000INT benchmarks does not yield good gains (yet).
· It can only process C sources.

Source-to-source transformation

Pros
· Complete source-level information
· Human-readable intermediate results
· Natural representation of data structure transformations

Cons
· Dataflow analysis, alias analysis, and collection of frequency information must be redone
· Interference with all subsequent optimization passes

Constructing Loongcc and its performance

Porting process

· Merge the front end and middle end from PathScale® with our team's ORC®-based back end
· Supports the full SPEC2000
· SPEC2006 support is in progress

Performance

We measure the contribution of an optimization by the performance loss when that optimization is disabled.

Performance comparison

· Loongcc base = -O3 -ipa
· Loongcc peak = follows the SPEC peak rules
· GCC base = -O3 -march=loongson2f -mtune=loongson2f
· GCC peak = mild tuning of flags
· GFortran is used.

Performance

· Loongcc base outperforms GCC base by 13%/35%.
· Loongcc peak outperforms GCC peak by 28%/78%.
· Our apologies: we are not real GCC experts.

SPEC2000INT

[Figure: pie chart, breakdown by optimization.]
· Others: 14%
· Profiling: 41%
· Prefetch: 9%
· Instruction scheduling: 5%
· Delay slot filling: 8%
· Flag tuning: 23%

SPEC2000INT

· Delay slot filling is present in Loongcc base. It is enhanced in Loongcc peak (bug fixes and more arcs in the CG dependence graph).
· Forward scheduling in IGLS improves gap by 8%.

Prefetch

· Stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.
· Loongson 2F has only a "pseudo prefetch": lbu %0,addr
  · The illegal-address exception is suppressed.
  · Higher cost
  · No effect on the SPEC2000FP cases yet.

Other optimizations

· Use of conditional move instructions
· Placing affine global data near each other
· Peephole optimizations in EBO

SPEC2000FP

[Figure: Loongcc compared to GCC on SPEC2000FP, with annotations: flush-to-zero mode, inlining.]

[Figure: SPEC2000FP results, with annotations: array contraction, source-to-source transformation, optimizing cache behavior.]

Thank you!

Questions please.

Answers to questions

What's the take-home message?
· We developed a working, open-source branch for MIPS, with good performance.
· We showed that source-to-source transformation is a good way to express some optimizations.

Why not CPU2006?
· Support is in progress.

Performance comparison

· The GCC peak numbers are the best results from our tests of GCC 4.4, GCC 4.3, and a special branch for Loongson 2F from STMicroelectronics®.
· The corresponding version of GFortran is used.

Question about source-to-source transformation

· The source-to-source transformation is implemented as a plugin to CIL.
· It can only process C sources because of front-end restrictions.
· The frequency information has to be collected independently.

Source-to-source transformation

Recover the index variable to avoid confusing Loongcc.

CIL (C Intermediate Language)
· Source-to-source transformation framework
· Dataflow analysis etc.
· Canonicalizes the C source.

Array contraction

Loop 1: Def of A B C D; Use of A B C D
Loop 2: Def of A B C; Use of A B C D

D is not defined in Loop 2, which prevents direct contraction.

Rematerialize D:

Loop 1: Def of A B C D; Use of A B C D
Loop 2: Def of A B C D; Use of A B C D