Chapter 1. Overview
J. H. Wang, Sep. 15, 2015
Outline
• History of Compilation
• What Compilers Do
• Interpreters
• Syntax and Semantics
• Organization of a Compiler
• Programming Language and Compiler Design
• Computer Architecture and Compiler Design
• Compiler Design Considerations
• Integrated Development Environments
Language Processors
• Translators
  – Transforming human-oriented programming languages into computer-oriented machine languages
History of Compilation
• Early compilers
  – 1950s: by Grace Hopper
  – Late 1950s: Fortran
• Broad applications
  – Typesetting: TeX, LaTeX
  – Portable document representation: PostScript
  – Symbolic and numeric problem solving: Mathematica
  – VLSI: Verilog, VHDL
What Compilers Do
• Compilers may be distinguished in two ways
  – By the kind of machine code they generate
  – By the format of the target code they generate
Machine Code Generated by Compilers
• Pure machine code
  – Only instructions from a particular instruction set
  – No dependence on any software (libraries, OS)
  – Rare; mostly used in system implementation languages
• Augmented machine code
  – Machine code augmented with OS and runtime language-support routines
    • I/O, storage allocation, mathematical functions
    • Data transfer, procedure call, and dynamic storage instructions
  – The most common case
• Virtual machine code
  – Only virtual instructions, executed by a virtual machine
    • Pascal P-code
    • Java bytecode
  – Gains portability and reduces program size
Bootstrapping
Target Code Formats
• Assembly or other source formats
  – Easy to scrutinize
  – Useful for prototyping programming language designs and for cross-compilation
• Relocatable binary
  – More efficient, and gives more control over the translation process
  – External references, local instruction addresses, and data addresses are not yet bound
    • A linkage step is required
• Absolute binary
  – Faster, but limited ability to interface with other code
  – Useful for exercises and prototyping, where compilation costs far exceed execution costs
Interpreters
• Capabilities of interpreters
  – Programs can be easily modified as execution proceeds
    • Interactive debugging
  – Dynamic object typing can be easily supported
    • E.g., Lisp and Scheme
  – A significant degree of machine independence
• Drawbacks
  – Direct interpretation of source programs can involve significant overhead
Syntax and Semantics
• Syntax: structure
  – E.g., context-free grammars (CFGs)
    • a=b+c is legal, but b+c=a is not
• Semantics: meaning
  – E.g.:
    • a=b+c is illegal if any of the variables are undeclared, or if b or c is of type Boolean
  – Static semantics
  – Runtime semantics
Static Semantics
• A set of rules that specify which syntactically legal programs are actually valid
  – E.g., identifier declaration, type compatibility of operators and operands, proper number of parameters in procedure calls
• Can be specified either formally or informally
  – E.g., attribute grammars
An Example of Attribute Grammars
• Production rule:
  – E -> E + T
• Augmented production rule:
  – E_result -> E_v1 + T_v2
    • if v1.type = numeric and v2.type = numeric
      then result.type <- numeric
      else call ERROR()
  – Verbose and tedious
Runtime Semantics
• To specify what a program computes
  – Can be specified informally
    • E.g., program states
      – a=1: the state component corresponding to a is changed to 1
  – Formal approaches
    • Natural semantics: an operational model
      – Given assertions that hold before evaluation of a construct, we can infer assertions that will hold after the construct's evaluation
    • Axiomatic semantics: relations or predicates that relate program variables
      – E.g., for the assignment var <- exp:
        » a predicate is true after the statement executes iff the predicate obtained by replacing all occurrences of var by exp was true beforehand
      – Good for deriving proofs of program correctness, but difficult to use
    • Denotational semantics: more mathematical in form
      – E.g.: E[[T1 + T2]]m = E[[T1]]m + E[[T2]]m
• A difficulty in semantics: imprecise language specification
  – E.g. (in Java):

      public static int subr(int b) {
          if (b != 0) return b + 100;
      }

      public static int subr(int b) {
          if (b != 0) return b + 100;
          else if (10 * b == 0) return 1;
      }

  – The problem of deciding whether a particular statement in a program is reachable is undecidable
• In practice, a trusted reference compiler can serve as a de facto language definition
  – E.g., Lisp
Organization of a Compiler
[Diagram: a compiler is organized into two parts, analysis and synthesis]
The Structure of a Compiler
• Tasks performed by compilers
  – Analysis of the source program
    • Syntax analysis
    • Semantic analysis
  – Synthesis of a target program that, when executed, will correctly perform the computations described by the source program
    • Code generator
    • Optimizer
The Scanner
• Reading the input text and grouping individual characters into tokens
  – Identifiers
  – Integers
  – Reserved words
  – Delimiters
• What the scanner does
  – It puts the program into a compact and uniform format
  – It eliminates unneeded information
  – It processes compiler control directives
  – It sometimes enters preliminary information into the symbol table
  – It optionally formats and lists the source program
Lexical Analysis (Scanning)[Aho, Lam, Sethi, Ullman]
• Grouping characters into lexemes
• Producing tokens
  – (token-name, attribute-value)
• E.g.:
  – position = initial + rate * 60
  – <id,1> <=> <id,2> <+> <id,3> <*> <60>
• Regular expressions (Chap. 3)
  – An effective and powerful approach to describe tokens
  – A specification for automatic generation of finite automata that recognize regular sets
    • Scanner generator
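As a rough sketch of this phase (the class and its behavior are illustrative, not taken from the text), a hand-written scanner for the running example can group characters into lexemes and emit (token-name, attribute-value) pairs, numbering identifiers in order of first appearance:

```java
import java.util.ArrayList;
import java.util.List;

// A minimal hand-written scanner sketch: groups characters into
// lexemes and emits tokens like <id,1>, <+>, <60>.
public class MiniScanner {
    public static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        List<String> ids = new ArrayList<>();   // identifiers seen so far
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isLetter(c)) {  // identifier lexeme
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                String lexeme = src.substring(i, j);
                if (!ids.contains(lexeme)) ids.add(lexeme);
                tokens.add("<id," + (ids.indexOf(lexeme) + 1) + ">");
                i = j;
            } else if (Character.isDigit(c)) {   // integer literal
                int j = i;
                while (j < src.length() && Character.isDigit(src.charAt(j))) j++;
                tokens.add("<" + src.substring(i, j) + ">");
                i = j;
            } else {                              // single-character operator
                tokens.add("<" + c + ">");
                i++;
            }
        }
        return tokens;
    }
}
```

Scanning "position = initial + rate * 60" with this sketch yields the token stream <id,1> <=> <id,2> <+> <id,3> <*> <60> shown above.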
The Parser
• Reading tokens and grouping them into phrases according to a syntax specification such as a CFG
  – Grammars (Chap. 2 & 4)
  – Parsing (Chap. 5 & 6)
  – Parser generator
• It usually builds an Abstract Syntax Tree (AST) as a concise representation of program structure (Chap. 2 & 7)
Syntax Analysis (Parsing)[Aho, Lam, Sethi, Ullman]
• Creating a tree-like intermediate representation (e.g., a syntax tree) that depicts the grammatical structure of the token stream
  – E.g., for <id,1> <=> <id,2> <+> <id,3> <*> <60>:

                =
               / \
         <id,1>   +
                 / \
           <id,2>   *
                   / \
             <id,3>   60
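A tiny recursive-descent parser can illustrate how such a tree is built (this sketch is mine, not from the text; it handles only + and *, assumes well-formed input, and renders the tree as a parenthesized string rather than as node objects):

```java
// A recursive-descent parser sketch for the expression grammar
//   E -> T { + T }     T -> F { * F }     F -> identifier | number
// The grammar's precedence (binding * tighter than +) is encoded in
// which procedure calls which.
public class ExprParser {
    private final String[] toks;
    private int pos = 0;

    public ExprParser(String[] toks) { this.toks = toks; }

    public String parseE() {                  // E -> T { + T }
        String left = parseT();
        while (pos < toks.length && toks[pos].equals("+")) {
            pos++;
            left = "(+ " + left + " " + parseT() + ")";
        }
        return left;
    }
    private String parseT() {                 // T -> F { * F }
        String left = parseF();
        while (pos < toks.length && toks[pos].equals("*")) {
            pos++;
            left = "(* " + left + " " + parseF() + ")";
        }
        return left;
    }
    private String parseF() {                 // F -> identifier | number
        return toks[pos++];                   // assumes well-formed input
    }
}
```

Parsing the tokens initial + rate * 60 with this sketch yields (+ initial (* rate 60)): the multiplication correctly ends up deeper in the tree than the addition.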
The Type Checker (Semantic Analysis)
• Checking the static semantics of each AST node
  – If the construct is semantically correct, the type checker decorates the AST node by adding type information to it
  – Otherwise, a suitable error message is issued
Semantic Analysis[Aho, Lam, Sethi, Ullman]
• Type checking
• Type conversions or coercions
• E.g.:

                =
               / \
         <id,1>   +
                 / \
           <id,2>   *
                   / \
             <id,3>   int2float
                          |
                          60
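The coercion step can be sketched as follows (class and method names are my own; types are simplified to the strings "int" and "float", and trees to strings): when one operand of an arithmetic operator is int and the other float, the int operand is wrapped in an int2float conversion node.

```java
// A sketch of coercion insertion during semantic analysis.
public class TypeChecker {
    // Wrap expr in an int2float() node unless it already has the target type.
    public static String coerce(String expr, String type, String target) {
        return type.equals(target) ? expr : "int2float(" + expr + ")";
    }

    // Decorate a binary operation: the result type is float if either
    // operand is float, and both operands are coerced to the result type.
    public static String[] binop(String op, String lhs, String lt,
                                 String rhs, String rt) {
        String result = (lt.equals("float") || rt.equals("float")) ? "float" : "int";
        return new String[] {
            op + "(" + coerce(lhs, lt, result) + ", " + coerce(rhs, rt, result) + ")",
            result
        };
    }
}
```

For the example above, combining the float variable rate with the int literal 60 under * produces the decorated node *(rate, int2float(60)) of type float.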
Translator (Program Synthesis)
• Translating AST nodes into Intermediate Representation (IR) code
  – E.g., a while loop -> two subtrees: expression and body
• It's largely dictated by the semantics of the source language
• In simple, nonoptimizing compilers, the translator may generate target code directly
• More elaborate compilers such as GCC may first generate a high-level IR and then translate it into a low-level IR
Intermediate Code Generation
[Aho, Lam, Sethi, Ullman]
• Generating a low-level intermediate representation
  – It should be easy to produce
  – It should be easy to translate into the target machine
  – E.g., three-address code (in Chap. 6):

      t1 = int2float(60)
      t2 = id3 * t1
      t3 = id2 + t2
      id1 = t3
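The essence of generating such code is a postorder walk over the expression tree, emitting one instruction per interior node and inventing a fresh temporary for each intermediate result. A minimal sketch (my own construction, omitting the int2float conversion for brevity):

```java
import java.util.ArrayList;
import java.util.List;

// A three-address-code generation sketch: a postorder walk over a
// small expression tree, inventing temporaries t1, t2, ...
public class TacGen {
    public static class Node {                 // minimal AST node
        public final String label;
        public final Node left, right;
        public Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
        public Node(String label) { this(label, null, null); }
    }

    private final List<String> code = new ArrayList<>();
    private int temp = 0;

    public List<String> generate(Node n) { gen(n); return code; }

    // Returns the name holding n's value: the leaf's own label, or a
    // fresh temporary assigned by an emitted instruction.
    private String gen(Node n) {
        if (n.left == null) return n.label;    // leaf: name or literal
        String l = gen(n.left), r = gen(n.right);
        String t = "t" + (++temp);
        code.add(t + " = " + l + " " + n.label + " " + r);
        return t;
    }
}
```

Walking the tree for id2 + id3 * 60 emits "t1 = id3 * 60" for the inner multiplication first, then "t2 = id2 + t1", mirroring the shape of the example above.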
Symbol Tables
• A mechanism that allows information to be associated with identifiers and shared among compiler phases
  – Identifier declaration
  – Identifier use
  – Type checking
Symbol Table Management[Aho, Lam, Sethi, Ullman]
• To record the variable names and collect information about various attributes of each name
  – Storage, type, scope
  – For procedures: number and types of arguments, method of argument passing, and the type returned

  Name     | Type
  ---------|-----
  position | …
  initial  | …
  rate     | …
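A symbol table is, at its core, a map from names to their recorded attributes. A minimal sketch (the class, its fields, and the 8-byte slot size are illustrative assumptions, not from the text):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A symbol table sketch: maps identifier names to their attributes
// (here just a type and a storage offset).
public class SymbolTable {
    public static class Entry {
        public final String type;
        public final int offset;
        Entry(String type, int offset) { this.type = type; this.offset = offset; }
    }

    private final Map<String, Entry> table = new LinkedHashMap<>();
    private int nextOffset = 0;

    // Called when a declaration is seen; each entry gets the next
    // storage slot (8-byte slots assumed for illustration).
    public void declare(String name, String type) {
        table.put(name, new Entry(type, nextOffset));
        nextOffset += 8;
    }

    // Called at each use; returns null for undeclared identifiers,
    // which lets the type checker report a static-semantic error.
    public Entry lookup(String name) { return table.get(name); }
}
```

Declaring position, initial, and rate in order assigns them offsets 0, 8, and 16, and a later lookup of an undeclared name returns null.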
The Optimizer
• Analyzing and transforming the IR code generated by the translator into functionally equivalent but improved code
  – Complex
  – Optimizations may be performed in stages
• Optimization can also be done after code generation
  – E.g., peephole optimization: a few instructions at a time
    • Multiplications by 1
    • Additions of 0
    • Loading a value into a register when it's already in another register
    • Replacing a sequence of instructions by a single instruction with the same effect
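A peephole pass can be sketched as a single scan that inspects one instruction at a time and rewrites the trivial patterns named above (a sketch of my own over textual three-address code, handling only multiplication by 1 and addition of 0 with the constant on the right):

```java
import java.util.ArrayList;
import java.util.List;

// A peephole-optimization sketch over three-address code strings of
// the form "dst = a op b".
public class Peephole {
    public static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String insn : code) {
            String[] p = insn.split(" ");      // ["dst", "=", "a", "op", "b"]
            if (p.length == 5) {
                String dst = p[0], a = p[2], op = p[3], b = p[4];
                if (op.equals("*") && b.equals("1")) {   // multiplication by 1
                    out.add(dst + " = " + a);
                    continue;
                }
                if (op.equals("+") && b.equals("0")) {   // addition of 0
                    out.add(dst + " = " + a);
                    continue;
                }
            }
            out.add(insn);                      // no pattern matched: keep as is
        }
        return out;
    }
}
```

For example, "t1 = id3 * 1" becomes the plain copy "t1 = id3", while instructions that match no pattern pass through unchanged. Real peephole optimizers slide a window over machine instructions and apply many more patterns, but the structure is the same.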
Code Optimization[Aho, Lam, Sethi, Ullman]
• Attempts to improve the intermediate code
  – Better: faster, shorter code, or code that consumes less power
  – E.g.:

      t1 = id3 * 60.0
      id1 = id2 + t1
The Code Generator
• Mapping the IR code generated by the translator into target machine code
  – Machine-dependent, complex
    • Register allocation
    • Code scheduling
• Automatic construction of code generators has been actively studied
  – Matching a low-level IR to target-instruction templates
  – This makes it easy to retarget a compiler to a new target machine
    • E.g., GCC
Code Generation[Aho, Lam, Sethi, Ullman]
• Mapping the intermediate representation of the source program into the target language
  – Machine code: register/memory location assignments
  – E.g.:

      LDF  R2, id3
      MULF R2, R2, #60.0
      LDF  R1, id2
      ADDF R1, R1, R2
      STF  id1, R1
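A deliberately naive sketch of this mapping (my own construction, using the same made-up LDF/MULF/ADDF/STF mnemonics as the example, with no register allocation: every instruction uses R1):

```java
import java.util.ArrayList;
import java.util.List;

// A naive code-generation sketch: each three-address instruction
// "dst = a op b" becomes a load, an arithmetic instruction, and a
// store. Real code generators allocate registers to avoid the
// redundant loads and stores this scheme produces.
public class CodeGen {
    public static List<String> gen(String dst, String a, String op, String b) {
        List<String> out = new ArrayList<>();
        out.add("LDF R1, " + a);                        // load left operand
        out.add(opcode(op) + " R1, R1, " + b);          // compute into R1
        out.add("STF " + dst + ", R1");                 // store the result
        return out;
    }
    private static String opcode(String op) {           // only * and + here
        return op.equals("*") ? "MULF" : "ADDF";
    }
}
```

Translating "id1 = id2 + t1" yields a load of id2, an ADDF, and a store to id1; comparing this output with the hand-written sequence above shows exactly the kind of improvement register allocation buys.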
Phases of a Compiler [Aho, Lam, Sethi, Ullman]
character stream
  ↓ Lexical Analyzer
token stream
  ↓ Syntax Analyzer
syntax tree
  ↓ Semantic Analyzer
syntax tree
  ↓ Intermediate Code Generator
intermediate representation
  ↓ Machine-Independent Code Optimization (optional)
intermediate representation
  ↓ Code Generator
target machine code
  ↓ Machine-Dependent Code Optimization (optional)
target machine code

(The Symbol Table is shared by all phases.)
Compiler Writing Tools
• Compiler generators (compiler compilers)
  – Scanner generator
  – Parser generator
  – Symbol table manager
  – Attribute grammar evaluator
  – Code-generation tools
• Much of the effort in crafting a compiler lies in writing and debugging the semantic phases
  – These are usually hand-coded
Programming Language and Compiler Design
• Many compiler techniques arise from the need to cope with some programming language construct
• The state of the art in compiler design also strongly affects programming language design
• The advantages of a programming language that's easy to compile:
  – It is easier to learn, read, and understand
  – It will have quality compilers on a wide variety of machines
  – Better code will be generated
  – There will be fewer compiler bugs
  – The compiler will be smaller, cheaper, faster, more reliable, and more widely used
  – Better diagnostic messages and program development tools will be available
Computer Architecture and Compiler Design
• Compiler designers are responsible for making computing capability available to programmers
• Problems
  – Instruction sets for some popular architectures are highly nonuniform
  – High-level programming language operations are not always easy to support
  – Essential architectural features such as hardware caches and distributed processors and memory are difficult to present to programmers in an architecture-independent manner
  – Effective use of a large number of processors has always posed challenges to application developers and compiler writers
  – For some programming languages, runtime checks for data and program integrity are dropped in favor of gains in execution speed
Compiler Design Considerations
• Debugging (development) compilers
  – Detail programmer errors
  – E.g., CodeCenter
  – Can often tolerate or repair minor errors (e.g., inserting a missing comma or parenthesis)
• Optimizing compilers (Chap. 13 & 14)
  – Produce efficient target code at the cost of increased compiler complexity and increased compilation times
  – Optimal code, even when theoretically possible, is often infeasible in practice
  – A variety of transformations might interfere with each other
• Retargetable compilers (Chap. 11 & 13)
  – The target architecture can be changed without the machine-independent components having to be rewritten
  – More difficult to write, but development costs can be shared
Integrated Development Environments
• To integrate the program development cycle into a single framework
  – Editing, compilation, testing, debugging
• Immediate feedback on syntax and semantic problems
• Focus on the source program
• Easy access to information about the program
• Many of the techniques of batch compilation can be reformulated into incremental form to support IDEs
  – Parser, type checker, …
• In this book, we concentrate on the translation of C, C++, and Java
End of Chapter 1
• Any Questions or Comments?