CAPS team Compiler and Architecture for superscalar and embedded processors

Upload: marylou-wells

Post on 20-Jan-2016


CAPS team

Compiler and Architecture

for superscalar and embedded processors

CAPS members

2 INRIA researchers: A. Seznec, P. Michaud

2 professors: F. Bodin, J. Lenfant

11 Ph.D. students: R. Amicel, R. Dolbeau, A. Monsifrot, L. Bertaux, K. Heydemann, L. Morin, G. Pokam, A. Djabelkhir, A. Fraboulet, O. Rochecouste, E. Toullec

3 engineers: S. Bihan, P. Villalon, J. Simonnet

CAPS themes

Two interacting activities

High-performance microprocessor architecture

Performance-oriented compilation

CAPS Grail

Performance at the best cost

Progress in computer science and applications is driven by performance

CAPS path to the Grail

Defining the tradeoffs between:

what should be done through hardware

what can be done by the compiler

for maximum performance

or for minimum cost

or for minimum size, power, ...

Need for high-performance processors

Current applications:
• general purpose: scientific, multimedia, databases, …
• embedded systems: cell phones, automotive, set-top boxes, ...

Future applications:
• don’t worry: users have a lot of imagination!

New software engineering techniques are CPU hungry:
• reusability, generality
• portability, extensibility (indirections, virtual machines)
• safety (run-time verifications)
• encryption/decryption

CAPS (ancient) background

« Ancient » background in hardware and software management of ILP:
• decoupled pipeline architectures
• OPAC, a hardware matrix floating-point coprocessor
• software pipeline for LIW

« Supercomputing » background:
• interleaved memories
• Fortran-S

CAPS background in architecture

Solid knowledge of microprocessor architecture:
• technology watch on microprocessors
• A. Seznec worked with the Alpha Development Group in 1999-2000

Research in cache architecture

Research in branch prediction mechanisms

CAPS background in compilers

Software optimizations for cache memories:
• numerical algorithms on dense structures
• optimizing data layout

Many prototype environments for parallel compilers:
• CT++ (with CEA): image processing C++ library for a SIMD architecture
• Menhir: a parallel compiler for MatLab
• IPF (with Thomson-LER): Fortran compiler for image processing on Maspar
• Sage (with Indiana): infrastructure for source-level transformation

We build on

SALTO: System for Assembly-Language Transformations and Optimizations
• retargetable assembly source-to-source preprocessor
• Erven Rohou’s Ph.D.

TSF: scripting language for program transformation on top of ForeSys (Simulog)
• Yann Mevel’s Ph.D.

Salto overview

• Assembly source-to-source preprocessor
• Fine-grain machine description
• Independent from compilers

[Figure: SALTO block diagram. Labels: transformation tool, C++ interface, SALTO, machine description, assembly language (shown twice).]

Compiler activities

Code optimizations for embedded applications:
• infrastructures rather than compilers
• optimizing compiler strategies rather than new code optimizations

Global constraints:
• performance / code size / low power (starting)

Focus on interactive tools rather than automatic code tuning:
• case-based reasoning
• assembly code optimizations

Computer aided hand tuning

Automatic optimization has many shortcomings:
• instead, provide the user with a testbed to hand-tune applications

Target applications

Fortran codes and embedded C applications

Our approach

case based reasoning

static code analysis and pattern matching

profiling

learning techniques

the user bears ultimate responsibility

CAHT

Prototype built on:
• Foresys: Fortran interactive front-end (from Simulog)
• TSF: scripting language for program transformation
• Sage++: infrastructure for source-level transformation

Analysis and Tuning tool for Low Level Assembly and Source code (with Thomson Multimedia)

ATLLAS objectives:
• Has the compiler done a good job?
• Try to match source and optimized assembly at fine grain

Development/analysis environment:
• Models for both source and assembly
• Global and local analysis (WCET, …) at both levels
• Interactive environment for code visualization and manual/automatic analysis and optimization

Built using Salto and Sage++:
• Retargetable with compilers and architectures

ATLLAS - Analysis and Tuning tool for Low Level Assembly and Source code: tuning method

[Figure: tuning-loop flowchart. Labels: source code, assembly code, compilation, profiling, Atllas, "Good?" test ("Yes" leads to "End"), half-automatic or manual source optimisations, half-automatic or manual assembly optimisations, post-processing. Processing support: code matching, analysis and evaluations, graphic display of assembly and source code.]

Assembly Level Infrastructure for Software Enhancement (with STMicroelectronics)

ALISE: enhanced SALTO for code optimization:
• better integration with code generation
  – interface with front-end
  – interface for profiling data
• targets global optimization
• based on component software optimization engines

Answer to a real need from industry: a retargetable infrastructure

ALISE

Environment for:
• global assembly code optimization
• providing optimization alternatives

Support for new embedded processors:
• ISAs with ILP support (VLIW, EPIC)
• predicated instructions
• functional unit clusters, ...

ALISE

[Figure: ALISE architecture. Labels: architecture description, D to M, architecture model; text input, P to IR, intermediate representation; optimizations Opt 1, Opt 2, ..., Opt n; IR to Ass (Emit), optimized program; high-level API; interfaces; user interface (G.U.I.); intermediate code; external infrastructure.]

Preprocessor for media processors (MEDEA+ Mesa project)

Multimedia instructions are available on embedded and general-purpose processors, but:

No consensus on MMD instructions among manufacturers:
• saturated arithmetic or not, different instructions, …

Multimedia instructions are not well handled by compilers:
• but performance depends heavily on them
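As a small illustration of why "saturated arithmetic or not" matters for portability, the sketch below contrasts wraparound and saturating addition on unsigned 8-bit values, such as pixel intensities. The function names are illustrative only and do not correspond to any particular instruction set.

```python
def add_u8_wrap(a: int, b: int) -> int:
    """Unsigned 8-bit addition with wraparound (modulo 256)."""
    return (a + b) & 0xFF


def add_u8_sat(a: int, b: int) -> int:
    """Unsigned 8-bit addition with saturation (clamped at 255)."""
    return min(a + b, 0xFF)


def add_pixels(xs, ys, add):
    """Apply the chosen 8-bit addition element-wise to two pixel rows."""
    return [add(x, y) for x, y in zip(xs, ys)]
```

Brightening a pixel of value 200 by 100 yields 44 with wraparound but 255 with saturation, so code written for one behaviour cannot be mapped blindly onto an instruction that implements the other.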

Preprocessor for media processors: our approach

C source-to-source preprocessor

User-oriented idiom recognition:
• easy to retarget
• target-dedicated recognition

Exploiting loop parallelism:
• vectorization techniques
• multiprocessor systems

Available soon

Collaboration with STMicroelectronics

Iterative compilation

Embedded systems:
• compile time is not critical
• performance/code size/power are critical
• one can often rely on profiling

Classical compiler: local optimizations, but constraints are GLOBAL

Proof of concept for code size (Rohou’s Ph.D.)
• new Ph.D. beginning in September 2000
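The idea of trading compile time for a globally constrained result can be sketched as a search over per-function code variants under a code-size budget. This is only a toy model, not the method from the cited Ph.D. work: the function names, variant names, and the (size, cycles) figures in `VARIANTS` are invented for illustration, and a real system would obtain them by compiling and profiling each candidate.

```python
from itertools import product

# Hypothetical per-function variants: name -> (code size in bytes, profiled cycles).
VARIANTS = {
    "fir":    {"plain": (120, 900), "unrolled": (300, 600)},
    "decode": {"plain": (200, 500), "inlined": (350, 420)},
    "init":   {"plain": (80, 50),   "unrolled": (160, 45)},
}


def best_under_budget(variants, size_budget):
    """Try every combination of variants; keep the fastest one that fits."""
    names = sorted(variants)
    best = None
    for choice in product(*(sorted(variants[n]) for n in names)):
        size = sum(variants[n][c][0] for n, c in zip(names, choice))
        cycles = sum(variants[n][c][1] for n, c in zip(names, choice))
        if size <= size_budget and (best is None or cycles < best[1]):
            best = (dict(zip(names, choice)), cycles, size)
    return best  # None when no combination fits the budget
```

With a 650-byte budget, the search keeps only the unrolled `fir` and leaves the other functions plain: each local transformation is profitable on its own, but the global size constraint decides which ones are affordable.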

High performance instruction set simulation

Embedded processors: parallel development of silicon, ISA, compiler and applications

Need for flexible instruction set simulation:
• high-performance simulation of large codes
• debugging
• retargetable to experiment with:
  – new ISAs
  – various microarchitecture options

First results: up to 50x faster than an ad-hoc simulator

ABSCISS: Assembly Based System for Compiled Instruction Set Simulation

[Figure: ABSCISS tool flow. Labels: C source, tmcc, TriMedia assembly, tmas, TriMedia binary, tmsim; ABSCISS, architecture description, C/C++ source, gcc, compiled simulator.]
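The general principle of compiled instruction set simulation can be sketched on a toy example: instead of decoding each instruction every time it executes, the program is translated once into host code, which then runs directly. The three-opcode ISA and all names below are invented for illustration; this is not ABSCISS itself, which emits C/C++ from TriMedia assembly.

```python
# Toy program: (opcode, destination, operand 1, operand 2).
PROGRAM = [
    ("li", "r1", 5, None),
    ("li", "r2", 7, None),
    ("add", "r3", "r1", "r2"),
    ("mul", "r4", "r3", "r3"),
]


def interpret(program):
    """Interpretive simulation: decode every instruction each time it runs."""
    regs = {}
    for op, dst, a, b in program:
        if op == "li":
            regs[dst] = a
        elif op == "add":
            regs[dst] = regs[a] + regs[b]
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs


def compile_simulator(program):
    """Compiled simulation: decode once, emit host code, return it as a function."""
    lines = ["def run():", "    regs = {}"]
    for op, dst, a, b in program:
        if op == "li":
            lines.append(f"    regs[{dst!r}] = {a!r}")
        elif op == "add":
            lines.append(f"    regs[{dst!r}] = regs[{a!r}] + regs[{b!r}]")
        elif op == "mul":
            lines.append(f"    regs[{dst!r}] = regs[{a!r}] * regs[{b!r}]")
        else:
            raise ValueError(f"unknown opcode {op}")
    lines.append("    return regs")
    namespace = {}
    exec("\n".join(lines), namespace)  # the host "compiler" step
    return namespace["run"]
```

Both paths produce the same register state; the compiled one pays the decoding cost once, which is where the speedup of this style of simulator comes from when code is executed many times.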

Enabling superscalar processor simulation

Complete O-O-O microprocessor simulation:
• 10000-100000x slower than real hardware
• cannot simulate realistic applications, only slices
• even fast-mode emulation is slow (50-100x):
  – simulation generally limited to slices at the beginning of the application
  – representativeness?

Calvin2 + DICE:
• combines direct execution with simulation
• really fast mode: 1-2x slowdown
• enables simulating slices distributed over the whole application
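A back-of-the-envelope calculation shows why the cost of the fast mode decides where slices can be placed. The helper below is a sketch under stated assumptions: the function name, the native execution rate, and the instruction counts are all hypothetical; only the slowdown factors come from the figures above.

```python
def simulation_time(total_instr, detailed_instr, native_ips,
                    detailed_slowdown, fast_slowdown):
    """Seconds needed to skip through an application in fast mode
    while simulating `detailed_instr` instructions in full detail."""
    fast_instr = total_instr - detailed_instr
    detailed = detailed_instr / native_ips * detailed_slowdown
    fast = fast_instr / native_ips * fast_slowdown
    return detailed + fast


# Assumed workload: 100 billion instructions, 100 million of them simulated
# in detail, on a host executing 1 billion instructions per second.
TOTAL, DETAILED, NATIVE_IPS = 100_000_000_000, 100_000_000, 1_000_000_000

with_emulation = simulation_time(TOTAL, DETAILED, NATIVE_IPS, 10_000, 50)
with_direct_execution = simulation_time(TOTAL, DETAILED, NATIVE_IPS, 10_000, 2)
```

Under these assumptions, reaching slices spread over the whole run costs about 5995 s with 50x fast-mode emulation, most of it spent skipping, against about 1200 s with a 2x fast mode, where the detailed slices dominate.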

Calvin2 + DICE

[Figure: Calvin2 + DICE flow. Labels: original code, SPARC V9 assembly code, calvin2 (static code annotation tool), checkpoints, switching event, emulation mode, DICE (host ISA emulator), user analysis routines.]

Moving tools to IA64

New 64-bit ISA from Intel/HP:
• Explicitly Parallel Instruction Computing
• predicated execution
• advanced loads (i.e. speculative)

A very interesting platform for research!

Porting the SALTO and Calvin2+DICE approach to IA64

Exploring new trade-offs enabled by instruction sets:
• predicting the predicates?
• advanced loads versus predicting dependencies
• ultimate out-of-order execution versus compiler

Low power, compilation, architecture, … (just beginning :=)

Power consumption becomes a major issue:
• embedded and general purpose

Compilation (setting up a collaboration with STMicroelectronics/Stanford/Milan):
• Is it different from performance optimization?
• Global constraint optimization
• Instruction Set Architecture support?

Architecture:
• high-order bits are generally null, …
• registers and memory
• ALUs

Caches and branch predictors

International CAPS visibility in architecture:
• skewed associative cache
• decoupled sectored cache
• multiple block ahead branch prediction
• skewed branch predictor

Continue recurrent work on these topics:
• multiple block ahead
• complexity/accuracy tradeoffs

Simultaneous Multithreading

Sharing functional units among several processes

Among the first groups working on this topic:
• S. Hily’s Ph.D.
• SMT behavior well understood for independent threads
• now, focus on parallel threads from a single application

Current research directions:
• speculative multithreading
  – ultimate performance with a single thread through predicting threads
• performance/complexity tradeoffs: SMT/CMP/hybrid

« Enlarging » the instruction window (supported by Intel)

In an O-O-O processor, fireable instructions are chosen in a window of a few tens of RISC-like instructions.

Limitations are:
• size of the window
• number of physical registers

Prescheduling:
• separate data flow scheduling from resource arbitration
• coarser units of work?

Reducing the number of physical registers:
• how to detect when a physical register is dead?
• per-group validation?
• revisiting the CISC/RISC war?

Unwritten rule on superscalar processor designs

For general purpose registers:

Any physical register can be the source or the result of any instruction executed on any functional unit

4-cluster WSRS architecture (supported by Intel)

[Figure: four clusters C0 to C3 with register subsets S0 to S3.]

• Half the read ports, one fourth the write ports
• Register file:
  – silicon area x 1/8
  – power x 1/2
  – access time x 0.6
• Gains on:
  – bypass network
  – selection logic

Multiprocessor on a chip

Not just replicating board level solutions !

A way to manage a large on-chip cache capacity:
• how can a sequential application efficiently use a distributed cache?
• architectural support for distributing a sequential application on several processors?
• how should instructions and data be distributed?

HIPSOR: HIgh Performance SOftware Random number generation

Need for unpredictable random number generation:
• sequences that cannot be reproduced

State of the art:
• < 100 bit/s using the operating system
• 75 Kbit/s using the hardware generator on Pentium III

The internal state of a superscalar processor cannot be reproduced:
• use this state to generate unpredictable random numbers

HIPSOR (2)

1000’s of unmonitorable states modified by OS interrupts

Hardware clock counter to indirectly probe these states

Combined with in-line pseudo-random number generation

100 Mbit/s unpredictable random numbers

ARC INRIA with CODES
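The combination of a clock counter with in-line pseudo-random generation can be sketched as follows. This is a rough illustration of the idea only and not the HIPSOR algorithm: `time.perf_counter_ns()` stands in for the hardware clock counter, the xorshift generator and the mixing step are assumptions chosen for brevity, and the result must not be treated as cryptographically secure.

```python
import time

MASK64 = (1 << 64) - 1
DEFAULT_SEED = 0x9E3779B97F4A7C15  # arbitrary non-zero constant


def xorshift64(x: int) -> int:
    """One step of a 64-bit xorshift pseudo-random generator (shifts 13, 7, 17)."""
    x ^= (x << 13) & MASK64
    x ^= x >> 7
    x ^= (x << 17) & MASK64
    return x & MASK64


def unpredictable_words(count: int, seed: int = DEFAULT_SEED):
    """Return `count` 64-bit words, perturbing the generator with timer bits."""
    state = seed
    words = []
    for _ in range(count):
        # Low-order timer bits vary with interrupts, caches and scheduling.
        state ^= time.perf_counter_ns() & 0xFFFF
        # xorshift maps 0 to 0, so fall back to the seed in that case.
        state = xorshift64(state) or seed
        words.append(state)
    return words
```

The pseudo-random step supplies throughput while the timer samples inject state that an observer cannot replay, which is the division of labour the slides describe.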