Post on 21-Dec-2015
Open64 on MIPS
Porting and enhancing Open64 for Loongson II
Loongcc Group, ICT, Beijing
Seattle, Mar. 21, 2009
Outline
What’s Loongson II?
What’s Loongcc?
How Loongcc works, with art as the example.
The porting process and evaluation of performance
The chip
Loongson 2F in the Loongson II family
Features
64-bit, out-of-order, 4-issue (0.8~1GHz)
MIPS III-compatible
On-chip 64K/64K L1 cache, 512K L2 cache
On-chip MMU supporting DDR2 (533MHz)
Loongcc
Yet another Open64 branch
Targeting the Loongson family
Aims
Good performance
Robust
Open source
Loongcc’s transformation of art
Transformation of art
Structure peeling produces temporary arrays
double ** f1_layer_I;
double * f1_layer_W;

typedef struct {
  double * I;
  double w;
  …
} f1_neuron;
Structure peeling
Before: fields A and B interleaved in memory (ABABAB): 50% cache line utilization
After: each field stored contiguously (AAA, BBB): 100% cache line utilization
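A minimal C sketch of the peeling idea, using the field names from the slide. The helper functions and the wrapper struct `f1_layer_peeled` are hypothetical and only illustrate the two layouts; they are not Loongcc's generated code.

```c
#include <stddef.h>

/* Original layout (array of structures): each element carries both fields. */
typedef struct {
    double *I;
    double w;
} f1_neuron;

/* Peeled layout: one dense array per field (array names follow the slide;
   the wrapper struct itself is a hypothetical convenience). */
typedef struct {
    double **f1_layer_I;
    double *f1_layer_W;
} f1_layer_peeled;

/* Sums w over the original layout. Every access also pulls the unused
   I field into the cache, so only part of each cache line is useful. */
double sum_w_aos(const f1_neuron *layer, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        s += layer[j].w;
    return s;
}

/* Sums w over the peeled layout. The loop walks one contiguous array,
   so every byte of each fetched cache line is used. */
double sum_w_peeled(const f1_layer_peeled *layer, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        s += layer->f1_layer_W[j];
    return s;
}
```

Both functions return the same value; only the memory traffic differs.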
Temporary arrays
Special pattern of visit
In a loop:
A := i + C
t := ||A||
B := t^-1 A
…
next iteration of loop
Old values are always killed.
No need to write dirty cache lines of A to memory after they are used.
Temporary arrays
[Animation: arrays A, B and C pass through the cache in turn]
A fills the cache.
B comes in and pushes out dirty lines of A: need to write them to memory!
C comes in: write more to memory.
A is accessed again in the next iteration: write misses!
Problems with temporary arrays
Unnecessary writes to memory
Large cache footprint
Temporary arrays
Solution? Contraction?
Temporary arrays
The same loop pattern (A := i + C, t := ||A||, B := t^-1 A) prevents array contraction!
All of A needs to be ready before any B.
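The pattern can be sketched in C as follows. Array contraction would shrink `a[]` to a scalar, which requires each `a[j]` to be consumed in the iteration that produces it; the norm breaks that. The names `in`, `c`, `a`, `b` and the function are assumptions for illustration, and the 1-norm is used only to keep the sketch free of library calls (the slide does not say which norm is meant).

```c
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_value(double x)
{
    return x < 0.0 ? -x : x;
}

/* The slide's pattern with full temporary arrays a[] and b[].
   in[] plays the role of "i" and c[] the role of "C".
   The caller must ensure the norm is nonzero. */
void normalize_with_temporaries(const double *in, const double *c,
                                double *a, double *b, size_t n)
{
    double t = 0.0;

    /* A := i + C, accumulating t := ||A|| (1-norm, an assumption). */
    for (size_t j = 0; j < n; j++) {
        a[j] = in[j] + c[j];
        t += abs_value(a[j]);
    }

    /* B := t^-1 A. This loop cannot start until the loop above has
       finished, because t depends on every element of A. That is why
       a[] cannot be contracted to a single scalar. */
    for (size_t j = 0; j < n; j++)
        b[j] = a[j] / t;
}
```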
Temporary arrays
Solution? Overlay
Array overlay
[Animation: A, B and C overlaid on the same cache lines]
No write misses, not even cold misses.
Nothing is pushed out of the cache.
No memory writes.
A is still in cache in the next iteration.
No writes to memory! (as long as the data stays in cache).
Less cache footprint!
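A minimal sketch of overlay in C, assuming B can share storage with A because each element of B depends only on the matching element of A and on the norm, and A is dead afterwards. The names `in`, `c`, `buf` and the function are hypothetical, and the 1-norm is an assumption made to avoid library calls; this is not Loongcc's actual output.

```c
#include <stddef.h>

/* Absolute value without pulling in the math library. */
static double abs_value(double x)
{
    return x < 0.0 ? -x : x;
}

/* One buffer holds A after the first loop and B after the second.
   The caller must ensure the norm is nonzero. */
void normalize_overlaid(const double *in, const double *c,
                        double *buf, size_t n)
{
    double t = 0.0;

    /* A := i + C, stored in buf; t := ||A|| (1-norm, an assumption). */
    for (size_t j = 0; j < n; j++) {
        buf[j] = in[j] + c[j];
        t += abs_value(buf[j]);
    }

    /* B := t^-1 A, written over A in place. The lines touched here are
       the ones the first loop just brought into the cache. */
    for (size_t j = 0; j < n; j++)
        buf[j] = buf[j] / t;
}
```

Compared with separate arrays for A and B, the working set is one array instead of two, which matches the "less cache footprint" point on the slide.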
Effect of Overlay
On Loongson 2F, for art
Other source-to-source transformations
Array
Transposition
Flattening: multi-dimension array to one-dimension
Structure
Splitting
Special loop patterns
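As an illustration of the flattening item above, here is a small C sketch that copies a pointer-to-pointer matrix into one contiguous row-major array. The function names `flat_index` and `flatten` are hypothetical; the slides do not show how Loongcc performs the rewrite.

```c
#include <stddef.h>
#include <stdlib.h>

/* Row-major position of element (r, c) in the flattened array. */
static size_t flat_index(size_t r, size_t c, size_t cols)
{
    return r * cols + c;
}

/* Copies a rows x cols matrix stored as an array of row pointers into a
   single contiguous block. Returns NULL if allocation fails; the caller
   frees the result. */
double *flatten(double *const *m, size_t rows, size_t cols)
{
    double *flat = malloc(rows * cols * sizeof *flat);
    if (flat == NULL)
        return NULL;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            flat[flat_index(r, c, cols)] = m[r][c];
    return flat;
}
```

After flattening, an access costs one load instead of two (no row-pointer dereference), and consecutive rows are adjacent in memory.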
Effect of source-to-source transformation of art.
[Chart: speedup (axis 0 to 5) on Intel Xeon, AMD Opteron and Loongson 2F; series: Original, W/O Overlay, W/ Overlay]
Effect of source-to-source transformation
Works well when special patterns exist, such as a hot, large structure array. It works well for art and equake.
Applying it to SPEC2000INT benchmarks does not yield good gains (yet).
It can only process C sources.
Source-to-source transformation
Pros
Complete source-level information
Human-readable intermediate results
Natural representation of data structure transformations
Cons
Must redo dataflow analysis, alias analysis, and collection of frequency information
Interference with all subsequent optimization passes
Constructing Loongcc and its performance
Porting Process
Merge the front end and middle end from PathScale® with our team's ORC®-based back end
Supports full SPEC2000
SPEC2006 support is in progress
Performance
We measure the contribution of an optimization by the performance loss when that optimization is disabled.
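One way to turn that definition into a number, sketched in C. The normalization (loss relative to the fully optimized score) is an assumption; the slides do not state the exact formula, and the function name is hypothetical.

```c
/* Relative performance loss when one optimization is turned off.
   Both arguments are benchmark scores where higher is better.
   score_all_on must be nonzero. */
double loss_when_disabled(double score_all_on, double score_one_off)
{
    return (score_all_on - score_one_off) / score_all_on;
}
```

For example, with made-up scores of 100 with everything enabled and 75 with one optimization disabled, the function returns 0.25, that is, a 25% contribution under this assumed definition.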
Performance Comparison
Loongcc base = -O3 -ipa
Loongcc peak = follows SPEC peak rules
GCC base = -O3 -march=loongson2f -mtune=loongson2f
GCC peak = mild tuning of flags
GFortran used.
Performance
Loongcc base outperforms GCC base by 13%/35%
Loongcc peak outperforms GCC peak by 28%/78%
We apologize: we are not real GCC experts.
SPEC2000INT
Others: 14%
Profiling: 41%
Prefetch: 9%
Instruction scheduling: 5%
Delay slot filling: 8%
Flag tuning: 23%
SPEC2000INT
Delay slot filling is included in Loongcc base. It is enhanced in Loongcc peak (bug fixes and more arcs in the CG dependency graph).
Forward scheduling in IGLS improves gap by 8%.
Prefetch
Stride prefetch improves mcf by 27%, parser by 4%, and gap by 6.3%.
Prefetch
Loongson 2F has only “pseudo prefetch”: lbu %0,addr
Illegal address exception suppressed.
Higher cost.
No effect for SPEC2000FP cases yet.
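At source level, stride prefetching can be sketched with the GCC/Clang builtin `__builtin_prefetch`. This is only an illustration of the idea, not what Loongcc emits; the prefetch distance and the function name are hypothetical, and how a compiler lowers the builtin on Loongson 2F is not covered by the slides.

```c
#include <stddef.h>

/* Hypothetical tuning knob: how many strided elements ahead to prefetch. */
#define PREFETCH_DISTANCE 8

/* Sums every stride-th element of data[0..n). stride must be nonzero. */
double sum_strided(const double *data, size_t n, size_t stride)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j += stride) {
        size_t ahead = j + PREFETCH_DISTANCE * stride;
        /* Hint that data[ahead] will be read soon (rw = 0, low temporal
           locality = 1). The bounds check keeps the address inside the
           array, so no out-of-range pointer is ever formed. */
        if (ahead < n)
            __builtin_prefetch(&data[ahead], 0, 1);
        s += data[j];
    }
    return s;
}
```

The prefetch is a hint only: removing it changes timing, never the result.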
Other optimizations
Use of conditional move instructions
Placing affine global data near each other
Peephole optimizations in EBO
SPEC2000FP
[Chart: Loongcc compared to GCC on SPEC2000FP; labeled contributions: flush-to-zero mode, inlining, array contraction, source-to-source transformation, optimizing cache behavior]
Thank you!
Questions please.
Answer to Questions
What’s the take-home message?
We develop a working, open source branch for MIPS, with good performance.
We showcase that source-to-source transformation is a good way to express some optimizations.
Answer to Questions
Why not CPU2006?
Support is in progress.
Performance comparison
The performance numbers of GCC peak are the best results from our testing of GCC 4.4, GCC 4.3, and a special branch for Loongson 2F from STMicroelectronics®.
GFortran of corresponding version is used.
Question about source-to-source transformation
The source-to-source transformation is implemented as a plugin to CIL.
It can only process C sources due to a restriction of the front end.
The frequency information has to be collected independently.
Source-to-source transformation
Recover index variable to avoid confusing Loongcc
Source-to-source transformation
CIL, C Intermediate Language
Source-to-source transformation framework
Dataflow analysis etc.
Canonicalize the C source.
Array contraction
Loop 1: Def of A B C D; Use of A B C D
Loop 2: Def of A B C; Use of A B C D
Missing D prevents direct contraction.
Rematerialize D:
Loop 1: Def of A B C D; Use of A B C D
Loop 2: Def of A B C D; Use of A B C D
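One plausible reading of the slide, sketched in C with a single temporary `d`. In the first version, loop 2 reads `d[]` without defining it, so `d` must stay a full array. In the second, loop 2 recomputes (rematerializes) the value, and `d` shrinks to a scalar in both loops. The function names and the arithmetic are hypothetical, and the rewrite is only valid if `x` is unchanged between the loops and does not alias the outputs.

```c
#include <stddef.h>

/* Before: loop 2 uses d[] but only loop 1 defines it. */
void with_temporary_array(const double *x, double *d,
                          double *out1, double *out2, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        d[j] = 2.0 * x[j];       /* def of D */
        out1[j] = d[j] + 1.0;    /* use of D */
    }
    for (size_t j = 0; j < n; j++)
        out2[j] = d[j] * d[j];   /* use of D, no def in this loop */
}

/* After: D is rematerialized in loop 2, so each loop defines what it
   uses and the array contracts to a scalar. */
void rematerialized(const double *x, double *out1, double *out2, size_t n)
{
    for (size_t j = 0; j < n; j++) {
        double d = 2.0 * x[j];
        out1[j] = d + 1.0;
    }
    for (size_t j = 0; j < n; j++) {
        double d = 2.0 * x[j];   /* recomputed instead of reloaded */
        out2[j] = d * d;
    }
}
```

The trade-off is a little extra arithmetic in exchange for removing the temporary array's memory traffic.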