NCI Report: Zephyr

PLDI NCI Tutorial, 6/16/2000

University of Virginia

Princeton University

Zephyr Goals

• Goal
  – Deliver high-quality, language-neutral tools for rapidly constructing compilers for experimental computing systems research

• How
  – Provide specification languages and processors to automatically generate key compiler components
    • Don’t write code, write specifications!

Zephyr Compilers

[Figure: Zephyr alongside SUIF. Front ends: EDG C++, Java, lcc. Connecting components: MachSUIF and the SUIF-to-VPO bridge. Targets: SPARC, MIPS, x86, Alpha. Optimizations shown: interprocedural analysis, parallelization and locality opts, object-oriented opts, scheduling, register allocation, instruction selection, code motion, memory access coalescing, induction variable elimination, CSE, loop unrolling, inlining.]

Zephyr Building Blocks

• ASDL: Abstract Syntax Description Language
• VPO: Very Portable Optimizer
• CSDL: Computer System Description Language

ASDL: Abstract Syntax Description Language

[Figure: compiler phases (Parser, Semantic Analysis, Translate, OPT1 ... OPTn, CodeGen) exchanging ASTs and IRs; a Glue Generator produces the connecting code from a Glue Description.]

• ASDL makes it easy to communicate complex recursive data structures

• ASDL and its tools provide
  – Concise descriptions of tree-like structures, including ASTs and compiler intermediate representations (IRs)
  – Automatic generation of data structure implementations and pickling functions for C, C++, Java, Standard ML, and Haskell
  – Graphical browsing and editing of data structures on disk

• For more information about ASDL see:
  – Give reference here
  – Give URL here

VPO: Very Portable Optimizer

• VPO is a retargetable optimizer that operates on a low-level, machine-independent representation called RTLs (register transfer lists)

• VPO is retargeted by providing a machine description (MD) of the target machine, and revising a few machine-dependent routines

• VPO is small, easily extended, and extremely effective

History Lesson

• PO developed in 1981
  – Pioneered use of RTLs
  – Demonstrated ability to do optimizations on a low-level representation

• Development split in 1982
  – gcc development
    • Richard Stallman and Len Tower
  – VPO development
    • Many people at UVa and a few industrial labs

Register Transfer Lists

• Based on Bell and Newell's ISP notation
• Machine-independent representation of a machine-dependent operation
• Algorithms that manipulate RTLs are machine-independent

• While assembly language notations may vary, RTLs are very similar across architectures

Example
    Instruction          Machine
    add %o1,%o2,%o2      SPARC
    addu $10,$10,$9      MIPS
    ar 10,9              IBM

In RTL, each operation would be represented as
    r[10] = r[10] + r[9];

• The form of RTLs is fixed
    • dst = src ; dst = src ; dst = src …
  – The individual register transfers are performed in parallel
  – Example
    • r[1] = r[1] + r[2] ; NZ = r[1] + r[2] ? 0
  – VPO provides machine-independent primitives for operating on and manipulating RTLs
    • Obtain the sources and destinations
    • Obtain the memory locations read and written
    • Obtain the type of instruction (arithmetic, branch, control transfer, etc.)
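The parallel semantics of a register transfer list can be illustrated with a small C sketch. The names used here (`Machine`, `exec_add_and_compare`) are hypothetical and are not part of VPO, and the comparison result is modeled as -1, 0, or +1 purely for illustration. The point is only that every source is read before any destination is written.

```c
#include <assert.h>

/* Hypothetical machine state: a few registers plus a condition code. */
typedef struct {
    int r[16];
    int nz;   /* result of the last comparison: -1, 0, or +1 */
} Machine;

/* Executes the RTL  r[d] = r[d] + r[s] ; NZ = r[d] + r[s] ? 0
 * Both transfers see the OLD value of r[d], because all sources are
 * evaluated before any destination is updated. */
void exec_add_and_compare(Machine *m, int d, int s)
{
    assert(d >= 0 && d < 16 && s >= 0 && s < 16);

    /* Phase 1: evaluate every right-hand side against the old state. */
    int sum = m->r[d] + m->r[s];
    int cmp = (sum > 0) - (sum < 0);   /* "sum ? 0" as -1, 0, or +1 */

    /* Phase 2: commit all destinations. */
    m->r[d] = sum;
    m->nz = cmp;
}
```

If the two transfers were executed one after the other instead, the comparison would see the updated r[d] and compute a different value.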

• Think of RTL as a machine-independent assembly language
  – For a machine X, each RTLx describes an instruction in X’s instruction set (may be a synthetic instruction)
  – RTLx should specify
    • the instruction’s inputs and outputs
    • the transformation the instruction makes on the machine state
  – VPO uses this information to compute a dataflow graph

Compilation with VPO

[Figure: Source Code → Front and Middle Ends → VPO (Mach) → Machine Code]

You supply the front end and a simple code generator; we supply an optimizing back end.

Generating RTLx

• Translate IL ops to semantically equivalent sequences of instructions for the target machine
  – Generate the RTL representation of instructions, not assembly language
  – Do not worry about code quality
    • Perform naïve, straightforward translation
    • Expose all computations (even effective address computations) to VPO
    • Use virtual or pseudo registers for temporaries
    • VPO handles activation record and data placement

Generating RTLx

The C code
    K = I + 1;

IL tree
    = <int,32>
        ADDR K <local,32>
        + <int,32>
            @ <int,32>
                ADDR I <local,32>
            CON 1 <int,32>

IL              SPARC RTL
ADDR int K      r[33]=r[14]+K.;
ADDR int I      r[34]=r[14]+I.;
@ int           r[35]=M[r[34]];         r[34]
CON int 1       r[36]=1;
+ int           r[37]=r[35]+r[36];      r[35]:r[36]
= int           M[r[33]]=r[37];         r[33]:r[37]
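The naïve translation above can be sketched as a tree walk that allocates a fresh pseudo-register for every node. This is an illustrative sketch only: the `Node`, `Expander`, and `expand` names are hypothetical, RTLs are emitted as text rather than through the real VPO interfaces, and the numbering simply starts at r[33] to match the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical IL operators, mirroring the tree for K = I + 1. */
typedef enum { IL_ADDR, IL_DEREF, IL_CON, IL_ADD, IL_ASSIGN } ILOp;

typedef struct Node {
    ILOp op;
    const char *name;          /* symbol for IL_ADDR */
    int value;                 /* constant for IL_CON */
    struct Node *kid[2];
} Node;

typedef struct {
    char text[512];            /* emitted RTLs, one per line */
    int next_reg;              /* next unused pseudo-register */
} Expander;

static void emit(Expander *e, const char *rtl)
{
    /* Append one RTL line; drop it if the buffer is full. */
    if (strlen(e->text) + strlen(rtl) + 2 < sizeof e->text) {
        strcat(e->text, rtl);
        strcat(e->text, "\n");
    }
}

/* Postorder walk: expand the operands, then emit one RTL for this node.
 * Returns the pseudo-register holding the result, or -1 for a store. */
int expand(Expander *e, const Node *n)
{
    char buf[64];
    int a, b, dst;

    assert(n != NULL);
    switch (n->op) {
    case IL_ADDR:   /* address of a local: r[14] plus the symbol's offset */
        dst = e->next_reg++;
        snprintf(buf, sizeof buf, "r[%d]=r[14]+%s.;", dst, n->name);
        break;
    case IL_DEREF:
        a = expand(e, n->kid[0]);
        dst = e->next_reg++;
        snprintf(buf, sizeof buf, "r[%d]=M[r[%d]];", dst, a);
        break;
    case IL_CON:
        dst = e->next_reg++;
        snprintf(buf, sizeof buf, "r[%d]=%d;", dst, n->value);
        break;
    case IL_ADD:
        a = expand(e, n->kid[0]);
        b = expand(e, n->kid[1]);
        dst = e->next_reg++;
        snprintf(buf, sizeof buf, "r[%d]=r[%d]+r[%d];", dst, a, b);
        break;
    default:        /* IL_ASSIGN: kid[0] is the address, kid[1] the value */
        a = expand(e, n->kid[0]);
        b = expand(e, n->kid[1]);
        dst = -1;
        snprintf(buf, sizeof buf, "M[r[%d]]=r[%d];", a, b);
        break;
    }
    emit(e, buf);
    return dst;
}
```

No attempt is made to reuse registers or fold the constant into the add; in the approach described here, that cleanup is left to the optimizer.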

VPO design rationale

• All "traditional" optimizations performed at the machine level on a single representation, RTL
  – most optimizations are machine-dependent
  – better code is produced
  – instruction selection can be performed on demand
  – avoids phase ordering problems
  – simplifies implementation of optimizations
  – easier to accommodate emerging architectures
  – "plug and play" structure

RTLs in VPO

• VPO optimization algorithm
  – repeat
        apply code-improving transformation
    until fixed-point reached or exhausted registers

• Maintaining two invariants
  – Semantic invariant (S)
    • Observable behavior of program unchanged (according to RTL semantics)
  – Machine invariant (M)
    • Every RTL equivalent to one machine instruction

VPO code improvements

• Each code-improving transformation is
  – machine-level, but
  – machine-independent

• Any semantics-preserving transformation is OK

• Preserve machine invariant (M) using machine description:
  – for each new RTL produced, ask MD if OK
  – if any is not a target machine instruction, roll back transformation
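The "ask the machine description, roll back if it says no" discipline can be sketched as follows. Everything here is a hypothetical simplification: `AddRtl`, `md_is_legal`, and `try_propagate_constant` are invented names, and the legality rule (a 13-bit signed immediate) is only an assumption standing in for a real machine description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified RTL: dst = src1 + (register src2 or immediate). */
typedef struct {
    int dst, src1;
    bool has_imm;
    int src2;      /* register number when !has_imm */
    long imm;      /* immediate value when has_imm */
} AddRtl;

/* Stand-in for the machine description: accept an add only if its
 * immediate fits a 13-bit signed field (an assumption of this sketch). */
bool md_is_legal(const AddRtl *rtl)
{
    return !rtl->has_imm || (rtl->imm >= -4096 && rtl->imm <= 4095);
}

/* Try to replace register src2, known to hold constant k, by k itself.
 * The machine invariant is preserved: the change is kept only if the
 * machine description accepts the new RTL; otherwise it is rolled back. */
bool try_propagate_constant(AddRtl *rtl, long k)
{
    assert(rtl);
    AddRtl saved = *rtl;       /* remember the old RTL for rollback */

    if (rtl->has_imm)
        return false;          /* nothing to propagate into */
    rtl->has_imm = true;
    rtl->imm = k;

    if (!md_is_legal(rtl)) {
        *rtl = saved;          /* roll back the transformation */
        return false;
    }
    return true;
}
```

The transformation itself never needs to know the target's immediate range; only the legality check does.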

Code improvement catalog

• Register assignment and allocation
• Common subexpression elimination
• Induction variable elimination
• Code motion
• Constant propagation
• Copy propagation
• Memory access coalescing
• Recurrence detection
• Instruction scheduling
• Dead code elimination
• Constant folding
• Loop unrolling
• Branch minimization
• Evaluation order determination

VPO Optimizations

• Common subexpression elimination
    • Davidson, J. W. and Fraser, C. W., ‘Eliminating Redundant Object Code’, in Conference Record of the Ninth Annual ACM Symposium on Principles of Programming Languages, January 1982, pp. 128–132.

• Evaluation order determination
    • Davidson, J. W., ‘A Retargetable Instruction Reorganizer’, in Proceedings of the SIGPLAN ’86 Symposium on Compiler Construction, 21(7), June 1986, pp. 234–241.

• Link-time optimization
    • Benitez, M. E. and Davidson, J. W., ‘A Portable Global Optimizer and Linker’, in Proceedings of the SIGPLAN ’88 Symposium on Programming Language Design and Implementation, June 1988, pp. 329–338.

• Memory access coalescing
    • Davidson, J. W. and Jinturkar, S., ‘Memory Access Coalescing: A Technique for Eliminating Redundant Memory Accesses’, in Proceedings of the SIGPLAN ’94 Symposium on Programming Language Design and Implementation, Orlando, FL, June 1994, pp. 186–195.

• Code motion
    • Benitez, M. E. and Davidson, J. W., ‘The Advantages of Machine-Dependent Global Optimization’, in Proceedings of the 1994 Conference on Programming Languages and Systems Architectures, Zurich, Switzerland, March 1994, pp. 105–124.

• Loop unrolling
    • Jinturkar, S. and Davidson, J. W., ‘Improving Instruction-level Parallelism by Loop Unrolling and Dynamic Memory Disambiguation’, in Proceedings of the 28th Annual IEEE/ACM International Symposium on Microarchitecture, Ann Arbor, MI, November 1995, pp. 125–132.

• Branch minimization
    • Mueller, F. and Whalley, D. B., ‘Avoiding Conditional Branches by Code Replication’, in Proceedings of the SIGPLAN ’95 Conference on Programming Language Design and Implementation, June 1995, pp. 56–66.
    • Yang, M., Uh, G., and Whalley, D., ‘Improving Performance by Branch Reordering’, in Proceedings of the SIGPLAN ’98 Conference on Programming Language Design and Implementation, June 1998, pp. 130–141.

• Recurrence detection and optimization
    • Benitez, M. E. and Davidson, J. W., ‘Code Generation for Streaming: an Access/Execute Mechanism’, in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991, pp. 132–141.

Building VPO

[Figure: the VPO Generator takes a CSDL specification (SPARC, MIPS, Alpha, or i486), the analysis and transformation libraries (register allocation, access coalescing, common subexpression elimination, evaluation order determination, induction variable elimination, instruction scheduling, code motion, SSA computation), and any new transformation, and produces a target-specific optimizer such as VPO for MIPS.]

CSDL: Computing System Description Language

• Computing System Description Language
  – Modular system of components
  – Allows applications to customize a description
  – Easily extensible for adding new details
  – Reusable/application independent

[Figure: the CSDL Core and its component descriptions: instruction representation (SLED), instruction semantics (λ-RTL), calling convention (CCL), memory system description (MSDL), pipeline description (PLUNGE), and object-file format.]

Zephyr Compilers

• EDG/SUIF-to-VPO compiler
  – Five targets (SPARC, Pentium, Alpha, MIPS, SimpleScalar)

[Figure: Source Code → EDG Front End → SUIF Pass 1 ... SUIF Passes → SUIF-to-LIRA → LIRA-to-SPARC → RTL (SPARC) → VPO (SPARC) → Target Machine]

Zephyr Compilers

• EDG-to-VPO C++ compiler
  – Funded by Edison Design Group
  – Targeted to SPARC only
  – Compiles all benchmark suites (SPEC, PGI, lcc)
  – Code generator (translator from EDG intermediate representation to RTLs) provided as a literate program

Zephyr Compilers

• lcc-to-VPO C compiler
  – Targeted to SPARC, X86, MIPS, ALPHA, and SimpleScalar
  – Code generators (translators from LIRA to target-machine RTLs) provided as literate programs
  – Currently producing good code, although some optimizations are not fully implemented/debugged

SPEC results for SPARC

Benchmark    gcc -O    lcc     vpolcc
go           13.4      6.45    11.0
m88ksim      5.70      4.98    6.2
li           8.98      5.93    7.48
compress     11.6      9.0     9.28
ijpeg        8.79      5.54    8.6
perl         12.3      9.2     10.2
vortex       10.7      8.27    11.2

Acknowledgements

• This work has been funded by:
  – Defense Advanced Research Projects Agency
  – National Science Foundation
  – Panasonic AVC Labs
  – Edison Design Group

Afternoon Schedule

Time         Talk

1:30-2:00    ASDL: Dan Wang

2:00-2:55    Using Zephyr for PL Research: Kevin Scott
               – The VPO Code Generation Interfaces
               – LIRA: The lcc intermediate representation
               – SUIF-to-LIRA

2:55-3:15    Using Zephyr for Architecture Research: Jason Hiser and Chris Milner
               – Introduction
               – Handling a target machine’s calling convention

3:15-3:30    Break

3:30-4:30    Using Zephyr for Architecture Research (continued): Jason Hiser and Chris Milner
               – Writing a VPO machine description (md.y)
               – Writing a VPO register specification (regs.rt)
               – EASE: Environment for Architecture Study and Experimentation
               – Case Study: Targeting SimpleScalar

4:30-5:20    Using Zephyr for Optimization Research: Jack Davidson
               – Introduction to VPO’s optimization structure
               – Adding a new optimization to VPO

5:20-5:40    Zephyr support tools: Raja Venkateswaran
               – VET: Observing and debugging VPO
               – VPOISO: Isolating optimization errors

5:40-6:00    Wrap up and Open Discussion

Using Zephyr for Programming Language Research

Kevin Scott

University of Virginia

Overview

• Zephyr organization and philosophy
• VPO code generation interfaces
• Adding a new front-end to Zephyr:
  – Using the Lira intermediate representation
  – With a custom code expander using the VPO code generation interfaces
• Language related issues in retargeting Zephyr
• Q & A

What is Zephyr?

• Set of tools for generating and optimizing RTL programs
  – VPO (Very Portable Optimizer)
    • SPARC, Alpha, x86, MIPS, SimpleScalar (PISA)
  – Code Expanders
    • Turn a front-end’s IR into RTLs
  – Glue for hooking front-ends up to VPO
    • VPO code generation interfaces
    • Lira IR
  – Debugging tools
    • VET – interface for controlling and visualizing VPO transformations
    • vpoiso – isolates optimizer bugs

National Compiler Infrastructure

[Figure: the National Compiler Infrastructure, made up of the SUIF infrastructure and the Zephyr infrastructure. Front ends: SML/NJ, EDG C++, Ada95, DEC FORTRAN, Java, lcc, IBM VisualAge C++. Connecting components: MachSUIF and the SUIF-to-VPO bridge. Targets: SPARC, MIPS, x86, Alpha. Optimizations shown: interprocedural analysis, parallelization and locality opts, object-oriented opts, scheduling, register allocation, instruction selection, code motion, memory access coalescing, induction variable elimination, CSE, loop unrolling, inlining. The legend marks some items as optional.]

Why use Zephyr?

• You’re a language researcher
  – Easy to hook a front-end up to VPO
  – Relatively little effort required to get multiple targets
  – VPO is a very good optimizer
    • Wide range of existing operations
    • Leverage work of others contributing new optimizations to VPO
  – Lets you concentrate on front-end issues
  – Less work than writing a VPO-quality optimizer yourself

Zephyr Organization

[Figure: Zephyr organization. Front ends: lcc, EDG, SUIF. Code expanders: Lira, EDG, CVM. Interfaces: VPOi and VPOasm. Targets: SPARC, MIPS, Alpha, x86.]

Four Front Ends

• VPCC – A K&R C compiler
  – IR is code for a C virtual machine (CVM)
  – Deprecated in favor of lcc front-end

• EDG – Edison Design Group C/C++
  – Very flexible IR

• Lcc – Retargetable C compiler
  – Simple backend emits Lira, an IR based on lcc trees

• SUIF 2.1
  – High level optimizations and analyses
  – suif2lira pass transforms SUIF IR into Lira

Code Expanders

• CVM Code Expanders
  – SPARC, x86, MIPS
  – Generate encoded RTL files directly; don’t use VPOi or VPOasm

• EDG Code Expanders
  – SPARC
  – First expander to use VPOi and VPOasm interfaces

Lira Code Expanders

• Targets
  – SPARC
  – X86
  – Alpha
  – MIPS32
  – MIPS64 and SimpleScalar (PISA)
• Input Lira code specialized for target
• Output encoded RTLs for VPO
• All use the VPOi and VPOasm interfaces

• VPOi provides a C interface for:
  – Creating RTLs
  – Sending RTLs to VPO for optimization
• Abstracts away specifics of:
  – RTL representation
  – How RTLs are sent to VPO
• RTL creation routines can be semi-automatically generated from a machine specification

VPOasm

• VPOasm provides a C interface for sending assembly language statements to VPO.

• Allows a code expander to:
  – Change segments
  – Define symbols
  – Initialize storage locations
  – Specify alignments for code or data

More on VPOi and VPOasm

• Why use these interfaces?
  – Simpler than writing out VPO encoded RTL files manually.
  – Can get some of the implementation for free if doing a new target architecture.
  – Allows us to change RTL and assembly language representations w/o fouling you up. Much.

• Reference manual for VPOi and VPOasm:
  – http://www.cs.virginia.edu/zephyr/vpoi

VPOi and VPOasm caveats

• Interfaces are written in C.
  – Bad if you’re writing a code expander in a language with no mechanism for calling C functions.

• Interfaces are relatively rigid.
  – Suppose you want to communicate something to the optimizer that doesn’t look like an RTL or assembly language.

• Interfaces have only been tested on C/C++ front ends.
  – Might have to change to accommodate new language features…

• Simple IR based on lcc trees
• Targets a stack-oriented virtual machine
• Two types of entities in a Lira file:
  – Instructions
  – Directives

Lira Instructions

• Instruction is composed of:
  – Operator (33)
        CALL   GE    MOD  ADD    CVF
        ARG    EQ    LSH  NEG    BCOM
        NE     ASGN  DIV  INDIR  CNST
        LABEL  LT    SUB  BXOR   CVU   ADDRL
        JUMP   LE    RSH  BOR    CVP   ADDRG
        RET    GT    MUL  BAND   CVI   ADDRF
  – Type
    • F (float), I (signed integer), U (unsigned integer), P (pointer), V (void), B (aggregate)
  – Size
    • 1, 2, 4, 8, …
  – Auxiliary info

Lira Instruction Example

• C fragment
    int a;
    a = a + 10;

• Lira translation
    ADDRGP4 “a”
    INDIRI4
    CNSTI4 10
    ADDI4
    ADDRGP4 “a”
    ASGNI4
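The stack-oriented model behind a Lira sequence can be sketched in C. This is a hypothetical interpreter, not part of Zephyr: the `LiraOp` subset, the `run` function, and the use of array indices as addresses are assumptions made for illustration. It assumes an add step between the constant and the second address push, so that the sum is formed before the store, and that the store pops the address first and the value second.

```c
#include <assert.h>

/* Hypothetical subset of Lira-like operators, all at one integer size. */
typedef enum { ADDRG, INDIR, CNST, ADD, ASGN } LiraOp;

typedef struct { LiraOp op; int arg; } LiraInsn;  /* arg: address or constant */

/* Runs a Lira-like sequence on a stack machine. Memory is an array of
 * ints and an "address" is simply an index into it. */
void run(const LiraInsn *code, int n, int *mem)
{
    int stack[32];
    int sp = 0;

    for (int i = 0; i < n; i++) {
        int a, v;
        switch (code[i].op) {
        case ADDRG: assert(sp < 32); stack[sp++] = code[i].arg; break;
        case CNST:  assert(sp < 32); stack[sp++] = code[i].arg; break;
        case INDIR: assert(sp >= 1); stack[sp - 1] = mem[stack[sp - 1]]; break;
        case ADD:   assert(sp >= 2); sp--; stack[sp - 1] += stack[sp]; break;
        case ASGN:  /* address was pushed last, value before it */
            assert(sp >= 2);
            a = stack[--sp];
            v = stack[--sp];
            mem[a] = v;
            break;
        }
    }
}
```

With the global "a" placed at index 2 and holding 32, the six-step sequence for a = a + 10 leaves 42 in that slot and an empty stack.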

Lira Directives

• Change program segments with:
  – code, data, bss, lit

• Specify alignment with:
  – align

• Control symbol visibility with:
  – import, export

• Initialize storage locations with:
  – bytes, string, address, skip

Lira Directives (cont)

• Indicate procedure boundaries with:
  – proc, endproc

• Describe procedure locals and parameters with:
  – local, param

• Describe source coordinates with:
  – file, line

Lira Directive Example

• Reserving storage for a global int “a”
    -bss
    -export a
    -align 4
    +LABELI4 “a”
    -skip 4

The truth about Lira

• Lira can be emitted from lcc using a postorder walk of lcc trees. Almost.

• Typical case: the lcc tree

    ADDI4
        INDIRI4
            ADDRGP4 “a”
        CNSTI4 10

  is emitted as

    ADDRGP4 “a”
    INDIRI4
    CNSTI4 10
    ADDI4

The truth about Lira (cont)

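• Sometimes, we don’t do a postorder traversal:

[Figure: the lcc tree for the assignment a = a + 10 next to its Lira translation. The value (ADDRGP4 “a”, INDIRI4, CNSTI4 10) is emitted first, followed by the destination address ADDRGP4 “a” and then ASGNI4.]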

The truth about Lira (cont)

• A Lira program is specialized to the compilation target.
  – Types, sizes and alignments are target specific
  – Front-end must generate appropriate target dependent code for accessing the components of aggregates (arrays and structs)

Lira Code Expander

• Structured for simplicity.
• Code is generated by a big switch statement.
• Two passes made over the input.
  – First gathers symbol information.
  – Second generates code.
• SPARC expander is about 1800 lines of C. Close to ½ of the code is machine independent or easily reused on new targets.

Retargeting Lira code expander

• Three big tasks:
  – Modify dumptree to map Lira ops onto RTLs for the new target. Easiest of the three since there is substantial opportunity for cut & paste coding.
  – Modify sp_call to emit target dependent RTLs. On the SPARC we emit the following when the caller returns a struct:

        VPOi_rtl(ST(tmp_loc, sp_plus(r[14], SP_OFS-4)),
                 VPOi_locSetBuild(tmp_loc, 0));

Retargeting Lira code expander

• Modify setup_frame to:
  – Use right offsets for parameters and locals.
  – Emit RTLs to do target dependent frame setup on procedure entry. For procedures returning a struct on the SPARC, we emit:

        VPOi_rtl(LD(sp_plus(r[30], SP_OFS-4), tmpreg), 0);
        locaddr = sp_plus_ra(r[30], locals.t[0].sym, 0);
        VPOi_rtl(ST(tmpreg, Rtl_fetch(locaddr, 32)),
                 VPOi_locSetBuild(locaddr, tmpreg, 0));

Why use Lira?

• Lira is a pretty good intermediate language for C-like languages. (Thanks to Chris Fraser and Dave Hanson!)
  – Abstracts away specifics of a target’s calling sequence! Left to code expander to implement.

• Separating Lira from lcc means that we can reuse the Lira code expanders for front-ends other than lcc. E.g., SUIF.

• Very easy to write a Lira code expander.

Lira References

• “A Retargetable C Compiler: Design and Implementation”

• Lcc version 4.1 code generation interfaces
  – http://www.cs.princeton.edu/software/lcc/pkg/doc/4.

• More on the way…

Adding a front-end to Zephyr

• Is your language C-like?
  – If yes then consider writing code to map your IR onto Lira. This gets you all of Lira’s targets almost for free.
  – If no then you might need to write a code expander for each target you want to support.

Adding a front-end to Zephyr

• Is my target already supported?
  – If yes then you’re golden.
  – If no then you may have to do one or more of the following:
    • Create VPOi and VPOasm interfaces for your target. This can be partially automated.
    • Write a Lira code expander for the new target, or
    • Write a custom code expander for the new target.
    • Port VPO to the new target.

Adding a front-end using Lira

• Difficulty depends on your IR.
  – Trivial for lcc – almost same IR!
  – Pretty easy for SUIF. E.g.

    void Translator::trans(BinaryExpression exp) {
        int lira_op;

        // Operands first, then the operator.
        translate(exp->get_source1());
        translate(exp->get_source2());
        switch (op_map(exp->get_opcode())) {
        case SOP_add: lira_op = LIRA_ADD; break;
        ...
        }
        emitter->emit(lira_op, lira_map_ty(exp->get_result_type()));
    }

Where can I find out more?

• Should be releasing suif2lira as a literate program around July 1.
  – Good starting point for someone familiar with SUIF wanting to hook up a front-end with Lira.

• Literate source for SPARC and x86 Lira code expanders will be available immediately after PLDI.

Adding a front-end using a custom code expander

• Difficulty again depends on your IR.

• Refer to EDG SPARC code expander:
  – http://www.cs.virginia.edu/zephyr/dist/edg-sparc-1.0.pdf

Language issues in retargeting Zephyr

• Calling convention
  – In addition to emitting RTLs to properly handle language calling conventions on function calls and function entry, you also need to consider fixentry in VPO.
  – fixentry finalizes a procedure’s prologue after optimization is complete.
  – More in next talk.

Using Zephyr for Architecture Research

Jason Hiser and Chris Milner

A Brief Introduction to Zephyr and Architectural Research

Jason Hiser

Roadmap

• Handling a machine’s calling convention
  – Jason

• Break
  – Coffee!

• Writing a VPO machine description and writing a VPO register description
  – Chris Milner

• Case Study: Targeting SimpleScalar
  – Jason

Handling a Machine’s Calling Convention

fixentry fun (regs.c)

Jason Hiser

University of Virginia

Introduction To regs.c

• Fixentry: The main routine of regs.c
  – Responsibilities of fixentry
• Parameters, external and global data used in fixentry
• Other functions: regarg, initmap, map, transfer, leaf

Responsibilities of Fixentry

• Calculate stack space needed
  – outgoing parameters, spill locations, local variables, saved registers, and incoming parameters

• Emit function prologue
  – adjust stack pointer
  – save return address and saved registers
  – add RTLs for local equates

Fixentry Responsibilities (continued)

• Create and maintain a “mapping” from the registers used to the actual hardware registers

• Save/restore necessary registers and incoming parameters to stack

• Emit function epilogue (including code to restore saved registers)

Not the responsibility of Fixentry

• Perform any optimization
• Insert spill code
• Make decisions about register usability
• Emit assembly code for any instructions
• Set up registers/stack for making a function call
• Allocate global data

Extern Variables (Where fixentry gets its data)

• struct bblock *top     list of basic blocks in current function
• struct locuse *locs    local variables and parameters
• int isused[MAXREGS]    which registers are used and which aren’t
• int varargs            is this a variable argument function?

Parameters to Fixentry

• struct list *ptr the RTLs in the current function

• struct blist *retb the basic blocks that need epilogue code

Global Variables

• int gpregmap[] The “mapping” of the general purpose registers

• int fpregmap[] The “mapping” of the float registers

• int spilloff Information to the code emitter about where to place spill variables

Calculating Stack Space

• Loop through RTLs and find out how much space is needed for outgoing params

• Loop through temps and calculate spill space needed

• Loop through locals and calculate local space needed

Calculating Stack Space (cont.)

• Loop through registers and find out which ones need to be saved

• Determine space needed for incoming parameters (register params only)
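The stack-space calculation amounts to summing the areas gathered by those loops and rounding to the frame alignment. The following is a minimal sketch under stated assumptions: `FrameInfo`, `round_up`, and `frame_size` are hypothetical names rather than the actual regs.c code, and the 4-byte register size and 8-byte frame alignment are chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical summary of what a fixentry-like routine has gathered. */
typedef struct {
    int outgoing_bytes;   /* largest outgoing parameter area of any call */
    int spill_bytes;      /* space for spilled temporaries */
    int local_bytes;      /* space for local variables */
    int saved_regs;       /* number of registers that must be saved */
    int incoming_bytes;   /* space for incoming register parameters */
} FrameInfo;

/* Rounds n up to a multiple of align (align must be a power of two). */
static int round_up(int n, int align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Total frame size; the 4-byte register size and 8-byte frame
 * alignment are assumptions of this sketch, not VPO constants. */
int frame_size(const FrameInfo *f)
{
    int total = f->outgoing_bytes + f->spill_bytes + f->local_bytes
              + f->saved_regs * 4 + f->incoming_bytes;
    return round_up(total, 8);
}
```

The resulting size is what the prologue would subtract from the stack pointer and the epilogue would add back.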

Emitting Prologue and Epilogue

• Prologue
  – Emit code to adjust stack pointer
  – Emit code to spill return address and saved regs

• Epilogue
  – For each exit block
    • Restore spilled registers
    • Restore stack pointer
    • Jump to return address

Register Map

• Register allocator determines what variables are in which register
  – Fixentry needs to put these variables in the proper register.

• Fixentry attempts to map registers so no movements are necessary, overriding the allocator assignment policy
  – If it can’t, register to register moves are necessary

Other Functions of regs.c

• regarg    Boolean function; returns true if a local variable is an argument and enters the function in a register
• initmap   Initializes the gpregmap and fpregmap
• map       Returns the mapping for a register

Other Functions of regs.c(continued)

• transfer  Creates a transfer RTL from two machine locations (memory, register, or spill)
• leaf      Boolean function; determines if a function is a leaf

Summary

• Fixentry is the main portion of regs.c

• Fixentry is responsible for
  – function prologue
  – function epilogue
  – register mapping to avoid register to register moves

• Regs.c also contains a few functions to let other areas know about the mapping.

Using Zephyr for Architecture Research (continued)

Jason Hiser and Chris Milner

Writing a VPO Machine Specification

Chris Milner

University of Virginia

Outline of talk

• Structure of VPO
• Machine descriptions
• How to construct the descriptions
• Getting machine dependent information for machine independent transformations
  – combiner
  – loop (and other) transformations
  – scheduler
• EASE

Structure of VPO

[Figure: structure of VPO. Machine-independent C code (CSE, strength reduction, dead code elimination; routines such as combiner() and loop_strength()) and machine-dependent source (simp.c, rtl.c, sched.c; routines such as inst_is_legal() and is_basic()) are compiled by the C compiler into the VPO optimizer, together with generated C code. The generated code comes from the instruction description via the instruction processor (yyfast), the register description (reg.rt) via the register processor (regtool), and the pipeline description (pipe.pg) via a pipeline processor (real soon now).]

• “Machine independent” transformations on low level “machine dependent” intermediate form (register transfer lists)

• Retargeted portion assists in:
  – recognizing legal RTLs
  – converting and inserting RTLs to assist transformations
  – picking apart RTLs to get information

Role of Machine Descriptions

• md.y - legal instructions
  – maintains VPO invariant
  – YACC grammars

• regs.rt - register file
  – register types
  – alignment
  – size
  – ABI

• RTL recognizer
  – Workhorse
  – RTLs come from combiner (at compile time)
  – ours are not the usual table-driven ones but directly executable (yyfast)

• How do you do it?
  – Work from existing ones (derive Alpha from MIPS); or,
  – construct one anew

Sample machine

• Subset SIMPLESCALAR
  – e.g. student project on FPGA
  – load/store
  – chars, half words and words
  – constants must be loaded into registers
  – add, and, not, sll, sra, srl
  – branch on less than, branch on equal, jump, call, return

Constructing md.y (continued)

• Operands - registers

    %token REG0 REG1 REG2
    (scanner converts 'b' '[' '1' ']' to REG0)

    reg: REG0
       | REG1
       | REG2

• Operands - memory

    %token BMEM WMEM RMEM
    (scanner converts 'B' '[' to BMEM)

    mem: BMEM reg ']'
       | WMEM reg ']'
       | RMEM reg ']'

• Operands - misc

    %token PC RT ST          (used for call and return)
    %token LOCAL GLOBAL CON LBL

    expr: LOCAL
        | GLOBAL

• Operations

    %left '=' '+' '&' '"' '{' '}'
    %nonassoc '~' ','

    rhs: reg '+' reg
       | reg '&' reg
       | reg '{' reg
       | reg '}' reg
       | reg '"' reg

• Binary operations

    binops: reg '=' rhs

• Unary operation

    not: reg '=' '~' rhs

• Load, load immediate and store

    l : reg '=' mem
    li: reg '=' expr
    s : mem '=' reg
    si: expr '=' reg         (FORTRAN)

• Branch

    bb: PC '=' reg ':' reg
      | PC '=' reg '<' reg

• Jump, call and return

    jmp: PC '=' reg
    jal: ST '=' expr
    ret: PC '=' RT

• All instructions

    inst: bb | jmp | jal | ret
        | binops | not
        | l | li | s

• Now, we need some glue and some checking

Glue for parser

• Build up semantic records
• Found in isem.c
  – addr() - record for addressing mode
        reg: REG0 {$$=addr(BYTE,BREGISTER…)}
  – memref() - record for memory access
  – brecord() - record for binary op
  – rrecord() - record for relational op
  – same() - ensure records are same

Semantic routines

• inst.c
  – each instruction or instruction class has a routine
  – routine checks for legal operands
  – is responsible for emitting legal asm
  – e.g. bb()
    • on MIPS, check the semantics for compare and branch
    • if the right hand operand is an immediate, use the immediate form of the instruction
    • records instruction type

Structure of VPO(again)

[Figure: the structure of VPO diagram, shown again: machine-independent C code, machine-dependent source (simp.c, rtl.c, sched.c), the register description (reg.rt) processed by regtool, the pipeline description (pipe.pg), yyfast, and the C compiler producing the VPO optimizer.]

regs.rt

• TYPES
  – basic types of registers on the machine
  – byte, half, word, float, double
  – BTREG, WTREG, RTREG, FTREG, …

• CODES
  – condition codes
  – IC, FC, etc.

regs.rt(continued)

• CLASS
  – general_purpose, float, spill
  – number
  – scratch
  – reserve
  – type
    • alignment (even-odd register pairs)
    • size - how many to allocate
    • invariant - mark as invariant for loops, e.g. fp and sp
    • memchar, regchar - give it a different name
    • stack, fifo - tells the allocator about them

regs.rt for MIPS

types BTREG, WTREG, RTREG, FTREG, DTREG

codes FC

class = general_purpose

number = 32

scratch = 2..15, 24, 25

reserve = 0, 1, 26, 27, 28, 29, 31

(notes: on MIPS, reg 0 is zero, reg 1 is the asm reg, regs 26 and 27 are used by the OS, reg 28 is gp, reg 29 is sp, reg 31 is the return address)

regs.rt for MIPS (continued)

type = RTREG

alignment = 1

size = 1

invariant = 28, 29

endtype

type = BTREG, WTREG

alignment = 1

size = 1

endtype

endclass

class = floating_point

number = 16

scratch = 0..9

type = FTREG, DTREG

alignment = 1

size = 1

endtype

endclass

class = SPILL

number = 32

type = BTREG, WTREG, RTREG, FTREG

alignment = 1

size = 1

endtype

type = DTREG

alignment = 2

size = 2

endtype

endclass

Structure of VPO(again)

[Figure: the structure of VPO diagram, shown again: machine-independent C code, machine-dependent source (simp.c, rtl.c, sched.c), the register description (reg.rt) processed by regtool, the pipeline description (pipe.pg), yyfast, and the C compiler producing the VPO optimizer.]

Other files

• simp.c - helps the combiner
• sched.c - machine specific portion of scheduling
• rtl.c - routines to find machine idioms in transformations

simp.c

• Combine RTLs in machine dependent way
• e.g. SPARC

    1       r[35]=~r[35]
    2 {1}   r[33]=r[33]&r[35]

  combines to

    r[33]=r[33]&~r[35]

  semantically ok, but not an instruction

  comp() makes machine idiom substitution

    r[33]=r[33] ANDNOT r[35]

simp.c(continued)

• e.g. SPARC constants: 4095 is the biggest immediate

    1       r[40]=4095
    2 {1}   r[41]=r[40]+13

  combines and folds to

    r[41]=4108

  comp() converts to

    r[41]=HI[4108]
    r[41]=r[41]|LO[4108]
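The arithmetic behind the HI/LO split can be checked with a short C sketch. The function names are hypothetical and are not taken from simp.c. The sketch relies on the SPARC convention that the high part is the upper 22 bits of a 32-bit value and the low part is the lower 10 bits, and it treats the 13-bit signed immediate range as -4096 to 4095.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v fits a 13-bit signed immediate (-4096..4095). */
bool fits_simm13(int32_t v)
{
    return v >= -4096 && v <= 4095;
}

/* Splits a 32-bit constant into the two pieces used on SPARC:
 * the upper 22 bits (HI) and the lower 10 bits (LO). */
void split_hi_lo(uint32_t v, uint32_t *hi, uint32_t *lo)
{
    assert(hi && lo);
    *hi = v >> 10;
    *lo = v & 0x3ffu;
}
```

For 4108 this yields a high part of 4 and a low part of 12, and (4 << 10) | 12 reconstructs 4108, which is why the folded constant needs two RTLs while 4095 needs only one.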

rtl.c

• Manipulate
  – reverse() - reverse a branch
  – don’t_bother_with() - tell cse to ignore

• Predicates
  – is_call(), is_rjmp(), ismem(), writes_mem()
  – is_pc(), …

• Pick apart
  – findlabel(), usetype()

rtl.c(continued)

• Insert code to help transformations
  – store(), load()
  – multconst()
    • add series of shifts and adds
  – locsub() - substitute reg for mem
    • SPARC has sign extend on load
    • no single sign extend move
    • have to insert shifts to do sign extend
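The "series of shifts and adds" idea behind a routine like multconst() can be sketched as follows. This is not the VPO implementation: `shifts_for_const` and `mult_by_const` are hypothetical names, the decomposition uses one shift-add per set bit of the constant (a real optimizer might also use subtractions), and `unsigned` is assumed to be at most 32 bits wide.

```c
#include <assert.h>

/* Records which shifted copies of x must be added to form x * c:
 * one entry per set bit of c. Returns the number of entries. */
int shifts_for_const(unsigned c, int shifts[32])
{
    int n = 0;
    for (int s = 0; c != 0; s++, c >>= 1) {
        if (c & 1u) {
            assert(n < 32);
            shifts[n++] = s;
        }
    }
    return n;
}

/* Evaluates the shift-and-add sequence for x * c. */
unsigned mult_by_const(unsigned x, unsigned c)
{
    int shifts[32];
    int n = shifts_for_const(c, shifts);
    unsigned result = 0;
    for (int i = 0; i < n; i++)
        result += x << shifts[i];
    return result;
}
```

Multiplying by 10, for instance, becomes (x << 1) + (x << 3), so each entry in the shift list would correspond to one inserted shift RTL and one add RTL.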

rtl.c(continued)

    r[1] = 0
    r[9] = r[14] + a
    r[8] = r[1]*4
    R[r[8]+r[9]]=0
    r[1]=r[1]+1
    IC=r[1]?100
    PC=IC<0,L32

Annotations in the figure: regular induction variable, induced expression, basic induction variable.

• Assist loop strength reduction
• might be one instruction or several

sched.c

• SPARC - yes, MIPS - no
• Scheduler uses mostly machine independent list scheduling algorithm
• keeps machine specific dependencies straight
• helps avoid hazards

sched.c(continued)

• md_sets_uses
  – what an instruction does
  – what an instruction is blocked by
  – reads can slide past reads, not past writes

        rtl->does |= READS
        rtl->blocks |= WRITES

  – writes cannot slide past anything

        rtl->does |= WRITES
        rtl->blocks |= WRITES | READS

sched.c(continued)

• md_sets_uses
  – condition code users can’t slide past one another

        rtl->does |= ICWRITES
        rtl->blocks |= ICWRITES | ICREADS

    and

        rtl->does |= ICREADS
        rtl->blocks |= ICWRITES | ICREADS

  – calls are treated conservatively
    • assume codes, floats and memory written
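One way to read the does/blocks flags is as a pairwise test: an instruction may slide past another only if nothing the other one does is in its own blocked-by set. The sketch below works under that assumed interpretation. The flag names follow the slide, but their values, the `SchedInfo` type, and both functions are hypothetical and are not the actual sched.c code.

```c
#include <stdbool.h>

/* Hypothetical flag values; the names follow the slide. */
enum { READS = 1, WRITES = 2, ICREADS = 4, ICWRITES = 8 };

typedef struct {
    unsigned does;     /* what the instruction does */
    unsigned blocks;   /* what the instruction is blocked by */
} SchedInfo;

/* Builds the flags for an instruction from four simple properties. */
SchedInfo sets_uses(bool reads_mem, bool writes_mem, bool reads_cc, bool writes_cc)
{
    SchedInfo s = {0, 0};
    if (reads_mem)  { s.does |= READS;    s.blocks |= WRITES; }
    if (writes_mem) { s.does |= WRITES;   s.blocks |= WRITES | READS; }
    if (reads_cc)   { s.does |= ICREADS;  s.blocks |= ICWRITES | ICREADS; }
    if (writes_cc)  { s.does |= ICWRITES; s.blocks |= ICWRITES | ICREADS; }
    return s;
}

/* An instruction may slide past another only if nothing the other
 * one does appears in its own "blocked by" set. */
bool can_slide_past(SchedInfo mover, SchedInfo other)
{
    return (mover.blocks & other.does) == 0;
}
```

Under this reading, two loads can be reordered freely, a load cannot cross a store in either direction, and a branch that reads the condition codes cannot cross the compare that sets them.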

sched.c(continued)

• sched_adv()
  – relative advantage or disadvantage of scheduling this instruction next
  – relative to last instruction scheduled
  – e.g. SPARC
    • space out float instructions
    • avoid consecutive stores
    • make consecutive instructions independent

• EASE: Environment for Architecture Study and Experimentation
  – VPO includes a facility for obtaining
    • Measurements of instruction usage
    • Instruction cache traces
    • Data cache traces
    • Precise timing
  – VPO provides facilities for emulating architectures
    • Can extend existing architectures

EASE(continued)

• Use control-flow graph to insert instrumentation code

• Low overhead (10 to 15%)

• Cache traces generated on the fly (no need to store)

[Figure: basic blocks in the control-flow graph, each with an inserted “bump counter” instruction.]
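The counter-based measurement can be sketched in C as one counter per basic block, incremented on entry, with dynamic instruction counts derived afterwards from each block's static size. The `Profile` type and both functions are hypothetical illustrations and are not EASE's actual data structures.

```c
#include <assert.h>

#define MAX_BLOCKS 64

/* Hypothetical instrumentation state: one counter per basic block. */
typedef struct {
    unsigned long count[MAX_BLOCKS];   /* times each block was entered */
    int insts[MAX_BLOCKS];             /* static instruction count per block */
    int nblocks;
} Profile;

/* The code inserted at the top of a basic block: bump its counter. */
void bump(Profile *p, int block)
{
    assert(block >= 0 && block < p->nblocks);
    p->count[block]++;
}

/* Dynamic instruction count: each block's static size times its
 * execution count, summed over all blocks. */
unsigned long dynamic_insts(const Profile *p)
{
    unsigned long total = 0;
    for (int b = 0; b < p->nblocks; b++)
        total += p->count[b] * (unsigned long)p->insts[b];
    return total;
}
```

Because only one increment is added per block rather than per instruction, the run-time cost of this style of instrumentation stays small.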

EASE (continued)

• Emulation of new architecture features
– Add new instructions to machine description
– Generate code and optimize as if new features exist
– In last step of VPO, emit code to emulate new features

r[3] = r[3] + r[2]

r[5] = r[5] + (r[3] * r[2])

add r2, r3, r3

mul r3, r2, r1
add r1, r5, r5

VPO Mach Last Step

Case Study: Targeting SimpleScalar

Jason Hiser
University of Virginia

Introduction

• What is SimpleScalar? Why use it?

• Why use VPO with SimpleScalar?
– SimpleScalar comes with gcc, why not use that?

• Experiences in porting VPO to SimpleScalar

• Research with SimpleScalar and VPO

What is SimpleScalar?

• SimpleScalar is a functional simulator designed for use with architectural research
– sim-safe -- a simple, fast simulator
– sim-bpred -- measures branch predictor statistics
– sim-cache -- measures cache statistics
– sim-outorder -- models a multi-issue, out of order superscalar processor

Why Use SimpleScalar?

• Easy to model many common architectural features.
– hybrid branch predictors, arbitrarily many functional units, much more
• Extendible instruction set -- PISA
– Allows any instruction to be “annotated”
  • easy to create new instructions or add fields to old ones
• Comes with GNU tools for SimpleScalar
– gcc, gas, gld, glibc, etc.

Why VPO and SimpleScalar? (Why not use gcc?)

• gcc does not generate instruction annotations

• difficult to write new optimizations to take advantage of new instructions

• just building gcc can be a challenge

Why VPO and SimpleScalar? (continued)

• VPO builds easily on any machine where you can build SimpleScalar

• Describe new instructions in machine description and optimizer will automatically use them when beneficial

• New optimizations can consult the machine description to see if architectural support is available
– allows portability of optimizations

Experiences with Porting VPO to SimpleScalar

• PISA is basically MIPS
– changes to some instruction formats
– dmfc1 appears to be broken, negu not available, branch if (not) equal to zero instructions don’t exist
• Change instruction format in inst.c
• When compiling for SimpleScalar tell the machine description that negu, beqz, bneqz and dmfc1 are not available

Research with SimpleScalar and VPO at UVa

• Idea
– Compiler managed on-chip memory can provide performance and power benefits
• Framework
– Add instructions to move data to/from on-chip memory from/to registers
  • to VPO (in md.y, inst.c)
  • to SimpleScalar (machine.def)
– Add optimization to promote variables from cache to on-chip memory

Summary

• SimpleScalar is a versatile functional simulator

• Porting VPO isn’t difficult
– SimpleScalar target soon to be included with VPO

• VPO and SimpleScalar make a great vehicle for architectural research

Using Zephyr for Optimization Research

Jack Davidson
University of Virginia

VPO Logical Structure

[Figure: VPO logical structure. Labels: VPO Generator; ZIFLow Analysis & Transformation Libraries; CSDL SPARC Specification; CSDL MIPS Specification; CSDL ALPHA Specification; CSDL i486 Specification; VPO MIPS; New Transformation. Transformations shown: Register Allocation, Access Coalescing, Comm. Subexpr. Elim., Eval. Order Determ., Induction Var. Elim., Instruction Scheduling, Code Motion, SSA Computation.]

Actual Structure

lib SPARC MIPS X86 ALPHA

VPO Program Representation

BASIC BLOCK: PREDS, IDOMS, DOM, NEST LVL, USES, DEFS, OUTS, PHI, REGSTATE

LIST (RTL struct): RTL, COST, INST TYPE, USES, SETS, DEF/USE

VPO Optimizations

• Review vpo.h

VPO Optimization Algorithm

repeat
    apply code-improving transformation
until fixed-point reached or exhausted registers

• Maintaining two invariants
– Semantic invariant (S)
  • Observable behavior of program unchanged (according to RTL semantics)
– Machine invariant (M)
  • Every RTL equivalent to one machine instruction
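The repeat-until-fixed-point loop can be sketched as a driver that reruns every pass until none reports a change. The pass_fn type, the two toy passes, and the int-array "code" are placeholders invented for illustration, and the register-exhaustion exit condition is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* A pass returns true if it changed the code. */
typedef bool (*pass_fn)(int *code, int n);

/* Toy pass: halve every even, nonzero value. */
static bool halve_even(int *code, int n)
{
    bool changed = false;
    for (int i = 0; i < n; i++)
        if (code[i] != 0 && code[i] % 2 == 0) {
            code[i] /= 2;
            changed = true;
        }
    return changed;
}

/* Toy pass: decrement every value above ten. */
static bool dec_above_ten(int *code, int n)
{
    bool changed = false;
    for (int i = 0; i < n; i++)
        if (code[i] > 10) {
            code[i]--;
            changed = true;
        }
    return changed;
}

/* Apply all passes repeatedly until a full round changes nothing.
   Returns the number of rounds executed. */
static int run_to_fixed_point(int *code, int n,
                              const pass_fn *passes, int npasses)
{
    int rounds = 0;
    bool changed;

    do {
        changed = false;
        for (int p = 0; p < npasses; p++)
            if (passes[p](code, n))
                changed = true;
        rounds++;
    } while (changed);
    return rounds;
}
```

The loop terminates here because each toy pass only ever decreases a value; a real driver needs a similar argument that transformations cannot undo each other forever.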

VPO code optimization

• Each code-improving transformation is
– machine-level, but
– machine-independent
• Any semantics-preserving transformation is OK
• Preserve machine invariant (M) using machine description;
– for each new RTL produced, ask MD if OK
– if any is not target machine instruction, roll back transformation
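The ask-the-MD-then-roll-back discipline might look like the sketch below. The toy rtl struct, md_is_legal(), and try_combine_adds() are hypothetical stand-ins: the real machine description query and RTL representation are not shown in the slides, and the 13-bit immediate limit is just an example constraint.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { MAX_RTLS = 8 };

/* Toy RTL: add an immediate to a register. */
struct rtl { int reg; int imm; };

/* Hypothetical stand-in for the machine description query:
   the toy target accepts only immediates that fit in 13 signed bits. */
static bool md_is_legal(const struct rtl *r)
{
    return r->imm >= -4096 && r->imm <= 4095;
}

/* Toy transformation: merge adjacent adds to the same register.
   Works on a scratch copy and commits only if every new RTL is legal. */
static bool try_combine_adds(struct rtl *code, int *n)
{
    struct rtl scratch[MAX_RTLS];
    int m = 0;
    bool changed = false;

    if (*n > MAX_RTLS)
        return false;
    for (int i = 0; i < *n; i++) {
        if (m > 0 && scratch[m - 1].reg == code[i].reg) {
            scratch[m - 1].imm += code[i].imm;
            changed = true;
        } else {
            scratch[m++] = code[i];
        }
    }
    if (!changed)
        return false;
    for (int i = 0; i < m; i++)
        if (!md_is_legal(&scratch[i]))
            return false;           /* roll back: code is left untouched */
    memcpy(code, scratch, (size_t)m * sizeof scratch[0]);
    *n = m;
    return true;
}
```

Because the original RTLs are only overwritten after every candidate passes the check, a rejected transformation leaves the code exactly as it was, so both invariants still hold.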

VPO Optimization Driver

• Review vpo.c

Adding a new optimization

• Determine where in optimize to insert the function
– What analyses does the optimization need?
  • Control-flow optimizations usually come first as they need very little data-flow information
  • Data-flow optimizations follow: code motion, induction-variable elimination, common subexpression elimination
– Does the optimization operate on a single basic block or does it operate across basic blocks?

Adding a new optimization

• Browse controlflow.c/fix_control_flow()

• Browse cdmotion.c/code_motion()

Semantic Safe Points

• A semantic safe point is a point in the optimization process where the code satisfies the M and S invariants
– Code can be emitted at any semantic safe point and it should run correctly
– Can insert new optimization between any two semantic safe points

Debugging the compiler

Source Code

Front and Middle Ends

VPO Mach Machine Code

Trans n .......... Trans 4  Trans 3  Trans 2  Trans 1

VET - VPO Examination Tool

• Allows transformations to be observed
– Observe data structure (control-flow graph)
– Set a break point at a transformation
– Set a break point at a phase
– Replay a transformation

VET and VPOISO

Raja Venkateswaran
UVA

• VET -> VPO Examination Tool
• GUI for viewing optimizations
• By Phase and By transformation
• Ability to revert to previous phases
• Wide range of user options

VPOISO

• Tool for isolating optimizer bugs

• Uses binary search to find the first transformation error

• Works by comparing against the correct output
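The binary search can be sketched as follows. The oracle callback is hypothetical: it stands for "compile with only the first limit transformations applied, run, and compare against the correct output". How VPOISO actually limits transformations is not described in the slides. The sketch also assumes the output is correct with zero transformations and stays wrong once the faulty one has been applied.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical oracle: true if the output is still correct when only
   the first `limit` transformations are applied. */
typedef bool (*output_ok_fn)(int limit, void *ctx);

/* Returns the 1-based index of the first transformation that makes the
   output wrong, or 0 if the output is correct with all `total` applied. */
static int first_bad_transformation(int total, output_ok_fn ok, void *ctx)
{
    if (ok(total, ctx))
        return 0;

    int lo = 0;        /* known good: no transformations applied */
    int hi = total;    /* known bad: all transformations applied */

    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;
        if (ok(mid, ctx))
            lo = mid;
        else
            hi = mid;
    }
    return hi;
}

/* Toy oracle for demonstration: transformation number *ctx is the faulty one. */
static bool toy_oracle(int limit, void *ctx)
{
    return limit < *(int *)ctx;
}
```

With n transformations this needs about log2(n) compile-and-run cycles instead of n.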
