Open64 on MIPS

Porting and enhancing Open64 for Loongson II

Loongcc Group, ICT, Beijing

Seattle, Mar. 21, 2009

Outline

· What’s Loongson II?
· What’s Loongcc?
· How Loongcc works, using art as an example.
· The porting process and evaluation of performance

The chip

Loongson 2F in the Loongson II family

Features
· 64-bit, out-of-order, 4-issue (0.8~1 GHz)
· MIPS III-compatible
· On-chip 64K/64K L1 cache, 512K L2 cache
· On-chip MMU supporting DDR2 (533 MHz)

Loongcc

· Yet another Open64 branch
· Targeting the Loongson family
· Aims: good performance, robust, open source

Loongcc’s transformation of art

Transformation of art

Structure peeling produces temporary arrays

Original structure:

    typedef struct {
        double * I;
        double w;
        …
    } f1_neuron;

Peeled arrays:

    double ** f1_layer_I;
    double * f1_layer_W;

Structure peeling

[Figure: array of structures laid out in memory as A B A B A B; 50% cache line utilization.]

Structure peeling

[Figure: ABABAB peeled into separate arrays AAA and BBB; 100% cache line utilization.]
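The layout change in the figures can be sketched in C as follows. This is a hand-written illustration, not output of Loongcc: the names `pair_aos`, `pair_soa`, `peel`, and the two-field structure are hypothetical, chosen to mirror the A/B fields in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Original layout: array of structures, fields interleaved as A B A B ... */
typedef struct {
    double a;   /* hot field, read by the inner loop */
    double b;   /* cold field, not touched by the inner loop */
} pair_aos;

/* Peeled layout: one contiguous array per field (AAA... and BBB...). */
typedef struct {
    double *a;
    double *b;
    size_t n;
} pair_soa;

/* Allocate the peeled arrays; returns 0 on success, -1 on failure. */
int pair_soa_init(pair_soa *s, size_t n) {
    assert(s != NULL && n > 0);
    s->a = malloc(n * sizeof *s->a);
    s->b = malloc(n * sizeof *s->b);
    s->n = n;
    if (s->a == NULL || s->b == NULL) {
        free(s->a);
        free(s->b);
        return -1;
    }
    return 0;
}

void pair_soa_free(pair_soa *s) {
    free(s->a);
    free(s->b);
    s->a = s->b = NULL;
    s->n = 0;
}

/* Copy an array of structures into the peeled form. */
void peel(const pair_aos *in, pair_soa *out) {
    for (size_t i = 0; i < out->n; i++) {
        out->a[i] = in[i].a;
        out->b[i] = in[i].b;
    }
}

/* Interleaved layout: every cache line fetched also carries unused b values. */
double sum_a_aos(const pair_aos *p, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].a;
    return sum;
}

/* Peeled layout: every cache line fetched holds only a values. */
double sum_a_soa(const pair_soa *s) {
    double sum = 0.0;
    for (size_t i = 0; i < s->n; i++)
        sum += s->a[i];
    return sum;
}
```

Both sum functions compute the same result; only the memory traffic differs, since the peeled loop touches half as many bytes when the two fields have equal size.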

Temporary arrays

Special access pattern

In a loop:
· A := i + C
· t := ||A||
· B := t^-1 A
· …
· next iteration of loop

Temporary arrays



Old values are always killed.

No need to write the dirty cache lines of A back to memory after they are used.

Temporary arrays

[Figure sequence: arrays A, B, and C passing through the cache]
· A is brought into the cache.
· B is brought into the cache. The evicted lines of A need to be written to memory!
· C is brought into the cache. More is written to memory.
· A is accessed again. Write misses!

Problems with temporary arrays

Unnecessary writes to memory

Large cache footprint

Temporary arrays

Solution? Contraction?

Temporary arrays



Prevents array contraction!

All of A needs to be ready before any of B.

Temporary arrays

Solution? Overlay

Array overlay

[Figure sequence: A, B, and C overlaid on the same region of the cache]
· No write misses, even cold ones.
· Nothing is pushed out of the cache. No memory writes.
· A is still in cache.
· No writes to memory (as long as the data stays in cache).
· Smaller cache footprint!
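A minimal sketch of the overlay idea in C, under stated assumptions: the function names and the single shared buffer `buf` are hypothetical, and the 1-norm stands in for ||A|| so that the example needs no math library. It illustrates the pattern A := in + c, t := ||A||, B := A / t, not the actual transformation Loongcc applies to art.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without libm, so the sketch links with no extra flags. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Separate temporaries: A and B each occupy their own n doubles. */
double normalize_separate(const double *in, double c,
                          double *A, double *B, size_t n) {
    assert(n > 0);
    double t = 0.0;
    for (size_t i = 0; i < n; i++) {
        A[i] = in[i] + c;          /* old contents of A are killed */
        t += abs_d(A[i]);          /* the norm needs all of A */
    }
    assert(t > 0.0);
    for (size_t i = 0; i < n; i++)
        B[i] = A[i] / t;           /* last use of A */
    return t;
}

/* Overlaid temporaries: B is stored on top of A in the same buffer. */
double normalize_overlaid(const double *in, double c, double *buf, size_t n) {
    assert(n > 0);
    double t = 0.0;
    for (size_t i = 0; i < n; i++) {
        buf[i] = in[i] + c;        /* buf plays the role of A */
        t += abs_d(buf[i]);
    }
    assert(t > 0.0);
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] / t;       /* buf now plays the role of B */
    return t;
}
```

The overlaid version is legal here because each A[i] is dead as soon as B[i] has been computed. Contraction to a scalar is not possible, since the norm must see every element of A first, but sharing the storage still halves the footprint of the temporaries.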

Effect of Overlay

On Loongson 2F, for art

Other source-to-source transformations

Array
· Transposition
· Flattening: multi-dimensional array to one-dimensional

Structure
· Splitting

Special loop patterns
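As an illustration of flattening, the sketch below converts a row-pointer array into one contiguous block. The names `flat_index` and `flatten` are hypothetical, and row-major order is an assumption; the slides do not say which layout Loongcc produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Row-major position of element (r, c) in the flattened array. */
static inline size_t flat_index(size_t r, size_t c, size_t cols) {
    return r * cols + c;
}

/*
 * Copy a multi-dimensional array stored as an array of row pointers
 * into a single one-dimensional block. Returns NULL on allocation
 * failure; the caller frees the result.
 */
double *flatten(double *const *m, size_t rows, size_t cols) {
    assert(m != NULL && rows > 0 && cols > 0);
    double *flat = malloc(rows * cols * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            flat[flat_index(r, c, cols)] = m[r][c];
    return flat;
}
```

After flattening, an access such as `m[r][c]` becomes `flat[flat_index(r, c, cols)]`, which removes one pointer load per access and makes consecutive rows adjacent in memory.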

Effect of source-to-source transformation of art.

[Figure: bar chart of speedup (axis from 0 to 5) on Intel Xeon, AMD Opteron, and Loongson 2F, for three versions: Original, W/O Overlay, W/ Overlay.]

Effect of source-to-source transformation

Works well when special patterns exist, such as a hot, large array of structures. It works well for art and equake.

Applying it to SPEC2000INT benchmarks does not yield good gains (yet).

It can only process C sources.

Source-to-source transformation

Pros
· Complete source-level information
· Human-readable intermediate results
· Natural representation of data structure transformations

Cons
· Must redo dataflow analysis, alias analysis, and collection of frequency information.
· Interference with all subsequent optimization passes

Constructing Loongcc and its performance

Porting Process

· Merge the front/middle-end from PathScale® with our team's ORC®-based back-end
· Supports full SPEC2000
· SPEC2006 support is in progress

Porting process

Performance

We measure contribution of an optimization by the performance loss when the optimization is disabled.

Performance Comparison

· Loongcc base = -O3 -ipa
· Loongcc peak = follows SPEC peak rules
· GCC base = -O3 -march=loongson2f -mtune=loongson2f
· GCC peak = mild tuning of flags
· GFortran used.

Performance

Loongcc base outperforms GCC base by 13%/35%

Loongcc peak outperforms GCC peak by 28%/78%

We apologize that we are not real GCC experts.

SPEC2000INT

[Figure: pie chart, SPEC2000INT]
· Others: 14%
· Profiling: 41%
· Prefetch: 9%
· Instruction scheduling: 5%
· Delay slot filling: 8%
· Flag tuning: 23%

SPEC2000INT

· Delay slot filling is present in Loongcc base. It is enhanced in Loongcc peak (bug fixes and more arcs in the CG dependency graph).
· Forward scheduling in IGLS improves gap by 8%.

Prefetch

Stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.

Prefetch

Loongson 2F has only a “pseudo prefetch”: lbu %0,addr
· Illegal address exception is suppressed.
· Higher cost
· No effect for SPEC2000FP cases yet.
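The shape of a stride prefetch can be sketched at source level as below. This is an assumption-laden illustration: it uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in, and makes no claim about how Loongcc inserts prefetches or how any compiler lowers the builtin on Loongson 2F. The name `sum_stride` and the value of `PREFETCH_DISTANCE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations to prefetch ahead; hypothetical value, needs tuning per target. */
#define PREFETCH_DISTANCE 16

/* Sum every stride-th element, prefetching the element needed
 * PREFETCH_DISTANCE iterations from now. */
long sum_stride(const long *data, size_t n, size_t stride) {
    assert(data != NULL && stride > 0);
    long sum = 0;
    for (size_t i = 0; i < n; i += stride) {
        size_t ahead = i + PREFETCH_DISTANCE * stride;
        /* Stay in bounds so the address is valid even if the
         * prefetch is implemented as an ordinary load. */
        if (ahead < n)
            __builtin_prefetch(&data[ahead], 0 /* read */, 3 /* keep in cache */);
        sum += data[i];
    }
    return sum;
}
```

The prefetch only changes timing, never the result, so the function returns the same sum with or without it. The bounds check is the conservative choice for portable C; the slide notes that the load-based pseudo prefetch suppresses illegal address exceptions on Loongson 2F.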

Other optimizations

· Use of conditional move instructions
· Placing affine global data near each other
· Peephole optimizations in EBO

SPEC2000FP

Loongcc compared to GCC

[Figure labels: flush-to-zero mode, inlining]

SPEC2000FP

[Figure labels: array contraction, source-to-source transformation, optimizing cache behavior]

Thank you!

Questions please.

Answer to Questions

What’s the take-home message?

· We developed a working, open-source branch for MIPS, with good performance.
· We showcase that source-to-source transformation is a good way to express some optimizations.

Answer to Questions

Why not CPU2006?

Support is in progress.

Performance comparison

The GCC peak numbers are the best results across our tests of GCC 4.4, GCC 4.3, and the special branch for Loongson 2F from STMicroelectronics®.

GFortran of the corresponding version is used.

Question about source-to-source transformation

The source-to-source transformation is implemented as a plugin to CIL.

It can only process C sources due to restrictions of the front-end.

The frequency information has to be collected independently.

Index variables are recovered to avoid confusing Loongcc.

CIL, C Intermediate Language
· Source-to-source transformation framework
· Dataflow analysis etc.
· Canonicalizes the C source.

Array contraction

Loop 1
· Def of A B C D
· Use of A B C D

Loop 2
· Def of A B C
· Use of A B C D

Missing D prevents direct contraction.

Array contraction



Missing D prevents direct contraction.

Rematerialize D.
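A small C sketch of contraction with rematerialization, under stated assumptions: only two arrays (A and D) are shown instead of four, and the computations, the input `x`, and the function names are all hypothetical. It is meant to show why the second loop blocks contraction of D and how recomputing D removes the obstacle.

```c
#include <assert.h>
#include <stddef.h>

/* Before: A and D are full temporary arrays.
 * Loop 2 redefines A but reuses the D written by loop 1. */
double with_arrays(const double *x, double *A, double *D, size_t n) {
    assert(n > 0);
    double out = 0.0;
    for (size_t i = 0; i < n; i++) {      /* loop 1: defines and uses A, D */
        A[i] = x[i] + 1.0;
        D[i] = 2.0 * x[i];
        out += A[i] * D[i];
    }
    for (size_t i = 0; i < n; i++) {      /* loop 2: defines A, uses A and D */
        A[i] = x[i] - 1.0;
        out += A[i] * D[i];               /* D must survive from loop 1 */
    }
    return out;
}

/* After: both arrays are contracted to scalars. D is rematerialized
 * (recomputed) in loop 2, which is valid only because x is unchanged
 * between the two loops. */
double contracted(const double *x, size_t n) {
    assert(n > 0);
    double out = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = x[i] + 1.0;
        double d = 2.0 * x[i];
        out += a * d;
    }
    for (size_t i = 0; i < n; i++) {
        double a = x[i] - 1.0;
        double d = 2.0 * x[i];            /* rematerialized D */
        out += a * d;
    }
    return out;
}
```

The contracted version performs the same arithmetic in the same order, so the two functions return identical results while the second one needs no temporary array storage at all.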
