1
Compilers: Principles, Techniques, and Tools
Jing-Shin Chang
Department of Computer Science & Information Engineering
National Chi-Nan University
2
What is a Compiler? Why? Applications?
How to Write a Compiler by Hand?
Theories and Principles behind compiler construction - Parsing, Translation & Compiling
Techniques for Efficient Parsing
How to Write a Compiler with Tools
Goals
3
1. Introduction: What, Why & Apps
2. How: A Simple Compiler
   - What is A Better & Typical Compiler
3. Lexical Analysis:
   - Regular Expression and Scanner
4. Syntax Analysis:
   - Grammars and Parsing
5. Top-Down Parsing: LL(1)
6. Bottom-Up Parsing: LR(1)
Table of Contents
4
7. Syntax-Directed Translation
8. Semantic Processing
9. Symbol Tables
10. Run-time Storage Organization
Table of Contents
5
11. Translation of Special Structures
    *. Modular Program Structures
    *. Declarations
    *. Expressions and Data Structure References
    *. Control Structures
    *. Procedures and Functions
12. General Translation Scheme:
    - Attribute Grammars
Table of Contents
6
13. Code Generation
14. Global Optimization
15. Tools: Compiler Compiler
Table of Contents
7
What is A Compiler?
- Functional blocks
- Forms of compilers
8
The Compiler
What is a compiler?
- A program for translating programming languages into machine languages
- source language => target language
Why compilers?
- Filling the gaps between a programmer and the computer hardware
9
Compiler: A Bridge Between PL and Hardware
Operating System
Hardware (Low Level Language)
Compiler
Applications (High Level Language): A := B + C * D
Assembly Codes: MOV A, C; MUL A, D; ADD A, B; MOV va, A
Register-based or Stack-based machines
10
Typical Machine Instructions – Register-based Machines
Data Transfer
- MOV A, B
- MOV A, [mem]
- More: IN/OUT, Push, Pop, ...
Arithmetic Operation
- ADD A, C // A := A + C
- MUL A, D // A := A * D
- More: ADC, SUB, SBB, INC …
Logical Operation
- AND A, 00001111B // A := A & 00001111B
- More: OR, NOT, XOR, Shift, Rotate
Program Control
- JMP, JZ, JNZ, Call, …
Low Level Instructions Features:
- Mostly Simple Binary Operators (using source & target operands)
Registers of an Intel 8085 processor: A, B, C, D, E, H, L
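The two-operand style above can be illustrated with a minimal sketch of a register machine. It is only an illustration under stated assumptions: the function name `run_register_program`, the tuple encoding `(op, dest, src)`, and the sample values of B, C and D are hypothetical and not part of the slides. Only MOV, ADD and MUL are modelled.

```python
def run_register_program(program, registers, memory):
    """Execute two-operand instructions encoded as (op, dest, src)."""
    def read(operand):
        # An operand is either a register name or a named memory cell.
        return registers[operand] if operand in registers else memory[operand]

    for op, dest, src in program:
        if op == "MOV":
            value = read(src)                 # dest := src
        elif op == "ADD":
            value = read(dest) + read(src)    # dest := dest + src
        elif op == "MUL":
            value = read(dest) * read(src)    # dest := dest * src
        else:
            raise ValueError(f"unknown instruction: {op}")
        if dest in registers:
            registers[dest] = value
        else:
            memory[dest] = value
    return registers, memory


# A := B + C * D, with the result stored in the memory cell "va".
program = [("MOV", "A", "C"), ("MUL", "A", "D"), ("ADD", "A", "B"), ("MOV", "va", "A")]
registers = {"A": 0, "B": 2, "C": 3, "D": 4}   # sample values (assumed)
memory = {"va": 0}
run_register_program(program, registers, memory)
print(memory["va"])  # 2 + 3 * 4 = 14
```

Each arithmetic instruction overwrites one of its own source operands, which is why the compiled sequence first copies C into A before multiplying.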
11
Typical Machine Instructions – Stack-based Machines
Data Transfer
- Push A // SP++; *(SP) := A
- Push [mem] // SP++; *(SP) := [mem]
- Dup // *(SP+1) := *(SP); SP++
- Pop [mem] // [mem] := *(SP); SP--
Arithmetic Operation
- ADD // *(SP-1) := *(SP) + *(SP-1); SP--
- MUL // *(SP-1) := *(SP) * *(SP-1); SP--
Logical Operation …
Program Control …
Low Level Instructions Features:
- Mostly Simple Binary Operators
- Operations are applied to the topmost 2 source operands
- Return results to new stack top (destination operand)
- Almost no general purpose registers
Stack layout: SP points to the top element *SP; below it are SP-1, …
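As a rough sketch of the stack discipline listed above, the following models the stack with a Python list whose last element plays the role of *(SP). The function name `run_stack_program`, the tuple encoding of instructions, and the sample memory values are assumptions made for illustration.

```python
def run_stack_program(program, memory):
    """Execute stack instructions; the end of the list is the stack top *(SP)."""
    stack = []
    for op, *args in program:
        if op == "PUSH":                 # SP++; *(SP) := [mem]
            stack.append(memory[args[0]])
        elif op == "DUP":                # duplicate the top element
            stack.append(stack[-1])
        elif op == "POP":                # [mem] := *(SP); SP--
            memory[args[0]] = stack.pop()
        elif op in ("ADD", "MUL"):       # combine the topmost two operands
            top = stack.pop()
            below = stack.pop()
            stack.append(below + top if op == "ADD" else below * top)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack


# A := B + C * D on a stack machine.
memory = {"A": 0, "B": 2, "C": 3, "D": 4}   # sample values (assumed)
program = [("PUSH", "B"), ("PUSH", "C"), ("PUSH", "D"), ("MUL",), ("ADD",), ("POP", "A")]
leftover = run_stack_program(program, memory)
print(memory["A"])  # 14
```

No register names appear in the program: operands are implied by their position on the stack.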
12
Compiler (1) - Compilation
Source Program/Code (P.L., Formal Spec.) → Compiler → Target Program/Code (P.L., Assembly, Machine Code)
Compiler → Error Message
Example: A := B + C * D => MOV A, C; MUL A, D; ADD A, B; MOV va, A
16
Compiler (2a) – Execution
Loader: Target code (compiled) is loaded into the Real Machine
Input → Target Code (in Real Machine) → Output
Running the compiled codes
17
Compiler (2b) – Compile & Go
Two working phases in two passes
Source Program → Compiler → Target Code (+ Error Message)
Input → Target Code (in Real Machine) → Output
Compiler: Two independent phases to complete the work
- (1) Compilation Phase: Source to Target compilation
- (2) Execution Phase: run compiled codes & respond to input & produce output
18
Compiler (2c) – compile & go
Two working phases in two passes
Source program (& executable Target code) → Compiler (+Loader)
Input → (target loaded into Real Machine) → Output
Compiler: Two independent phases to complete the work
- (1) Compilation Phase: Source to Target compilation
- (2) Execution Phase: run compiled codes & respond to input & produce output
19
Interpreter (1)
Source program + Input → Interpreter → Output (+ Error Message)
Interpreter: One single pass to complete the two-phase work
- Each source statement is compiled and then executed in turn
- The next statement is then handled in the same way
20
Interpreter (2)
Compile and then execute each incoming statement
Do not save compiled codes in executable files
- Saves storage
Re-compile the same statements if the program loops back
- Slower
Detect (compilation & runtime) errors as they occur during execution time
- Compiler: Detect syntax/semantic errors (“compilation errors”) during compilation time
21
Hybrid: Compiler + Interpreter?
Source program → Compiler → Intermediate program
Intermediate program + Input → Interpreter+ (with/without JIT) → Output (+ Error Message)
22
Hybrid: Compiler + Interpreter?
Source program → Compiler → Intermediate program
Intermediate program + Input → Interpreter+ (with/without JIT) → Output
Intermediate program:
- without syntax/semantic errors
- machine independent
Interpreter:
- does not interpret high level source
- but compiled low level code
- easy to interpret + efficient
23
Hybrid Method & Virtual Machine
Source program → Translator (Compiler) → Intermediate program
Intermediate program + Input → Virtual Machine (VM) (Interpreter with/without JIT) → Output
24
Example: Java Compiler & Java VM
Java program (app.java) → Java Compiler (Javac) → Java Bytecodes (app.class)
Java Bytecodes + Input → Java Virtual Machine (Interpreter with/without JIT) → Output
25
Hybrid Method & Virtual Machine
Compile source program into a platform-independent code
- E.g., Java => Bytecodes (stack-based instructions)
Execute the code with a virtual machine
High portability: The platform-independent code can be distributed on the web, downloaded and executed on any platform that has a VM pre-installed
- Good for cross-platform applications
26
Just-in-time (JIT) Compilation
Compile a new statement (only once) as it comes for the first time
- And save the compiled codes
- Executed by virtual/real machine
- Do not re-compile as it loops back
Example:
- Java VM (simple Interpreter version, without JIT): high penalty in performance due to interpretation
- Java VM + JIT: improved by the order of a factor of 10
- JIT: translate bytecodes during run time to the native target machine instruction set
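The difference between re-translating on every loop iteration and translating once can be sketched as follows. This is a toy model under stated assumptions: Python's built-in `compile` and `exec` stand in for the translation and execution steps (a real JIT emits native machine code), and the names `translate`, `run_interpreted` and `run_jit` are hypothetical.

```python
def translate(statement, counter):
    """Stand-in for the costly translation step; counts how often it runs."""
    counter[statement] = counter.get(statement, 0) + 1
    return compile(statement, "<stmt>", "exec")


def run_interpreted(statements, repeat, env):
    """Simple interpreter: translate every statement each time it is reached."""
    counter = {}
    for _ in range(repeat):
        for stmt in statements:
            exec(translate(stmt, counter), env)
    return counter


def run_jit(statements, repeat, env):
    """JIT-style: translate a statement on first use, then reuse the saved code."""
    counter, cache = {}, {}
    for _ in range(repeat):
        for stmt in statements:
            if stmt not in cache:
                cache[stmt] = translate(stmt, counter)
            exec(cache[stmt], env)
    return counter


loop_body = ["total = total + step"]
env_interp = {"total": 0, "step": 2}
env_jit = {"total": 0, "step": 2}
interp_count = run_interpreted(loop_body, 5, env_interp)
jit_count = run_jit(loop_body, 5, env_jit)
print(interp_count, jit_count)  # translated 5 times vs. once
```

Both runs compute the same result; only the number of translations differs, at the price of keeping the translated code in memory.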
27
Comparison of Different Compilation-and-Go Schemes
Normal Compilers
- Will generate codes for all statements whether they will be executed or not
- Separate the compilation phase and execution phase into two different phases
- Syntax & semantic errors are detected at compilation time
Interpreters and JIT Compilers
- Can generate codes only for statements that are really executed
- Will depend on your input – different execution flows mean different sets of executed codes
- Interpreter: Syntax & semantic errors are detected at run/execution time
JIT vs. Simple Interpreter
- JIT: save the target machine codes
  • Can be re-used, and compiled at most once
- Interpreter: do not save target machine codes
  • Compiled more than once
28
Register-Based Virtual Machine for Android Phone – Dalvik VM
Java VM (JVM) – Stack-based Instruction Set
- Normally less efficient than RISC or CISC instructions
- Limited memory organization
- Requires too many swap and copy operations
Java Program → Java Compiler → Java Bytecodes (stack based) → Java Virtual Machine
29
Register-Based Virtual Machine for Android Phone – Dalvik VM
Dalvik VM (for Android OS) – Register-based Instruction Set
- Smaller size
- Better memory efficiency
- Good for phone and other embedded systems
Generation and Execution of Dalvik byte codes
- Compiled/Translated from Java byte code into a new byte code
- app.java (Java source)
  =|| javac (Java Compiler) ||=> app.class (executable by JVM)
  =|| dx (in Android SDK tool) ||=> app.dex (Dalvik Executable)
  =|| compression ||=> apps.apk (Android Application Package)
  =|| Dalvik VM ||=> (execution)
Java Program → Java Compiler → Java Bytecodes (stack-based) → dx (+compression) → Dalvik Bytecodes (register-based) → Dalvik Virtual Machine
30
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
31
A Language-Processing System
Source Program
→ Preprocessor → Modified Source Program
→ Compiler → Target Assembly Program
→ Assembler → Relocatable Machine Code
→ Linker/Loader (with library files and/or relocatable object files) → Target Machine Code
32
Natural languages: for communication between native speakers of the same or different languages
- Chinese, English, French, Japanese
Programming languages: for communication between programmers and computers
- Generic High-Level Programming Languages:
  - Basic, Fortran, COBOL, Pascal, C/C++, Java
- Typesetting Languages:
  - TROFF (+TBL, EQN, PIC), LaTeX/TeX, PostScript
- Markup Language -- Structured Documents:
  - SGML, HTML, XML, ...
- Script Languages:
  - Csh, bsh, awk, perl, python, javascript, asp, jsp, php
Programming Languages vs. Natural Languages
33
Machine Independent Intermediate Instructions
Low Level Instructions Features:
- Mostly Simple Binary Operators
- Result is often saved to the Accumulator (A register)
- Not intuitive to programmers
Intermediate instructions:
- 3 address codes: (for register-based machines)
  - A := B + C
  - 2 source operands, one destination operand
  - Easy to map to machine instructions (share one source & destination operand)
    • A := A + B
- Stack machine codes: (for stack-based machines)
34
Compiler: A Bridge Between PL and Hardware
Applications (High Level Language): A := B + C * D
→ Compiler → Intermediate Codes: T1 := C * D; T2 := B + T1; A := T2
→ Assembly Codes: MOV A, C; MUL A, D; ADD A, B; MOV va, A
Operating System
Hardware (Low Level Language)
Register-based or Stack-based machines
35
Compiler: with Intermediate Codes
Source Program/Code (P.L., Formal Spec.) → Compiler → Target Program/Code (P.L., Assembly, Machine Code)
Compiler → Error Message
Source: A := B + C * D
Intermediate codes: T1 := C * D; T2 := B + T1; A := T2
Target: MOV A, C; MUL A, D; ADD A, B; MOV va, A
36
Typical Phases of a Compiler
Source:
  float position, initial, rate
  position := initial + rate * 60
lexical analyzer → Tokens:
  id1 := id2 + id3 * 60
syntax analyzer → Parse Tree or Syntax Tree:
  :=(id1, +(id2, *(id3, 60)))
semantic analyzer → Syntax Tree or Annotated Syntax Tree:
  :=(id1, +(id2, *(id3, inttoreal(60))))
intermediate code generator → 3-address codes, or Stack machine codes:
  temp1 := inttoreal (60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
code optimizer → Optimized codes:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
code generator → Assembly (or Machine) Codes:
  MOVF id3, R2
  MULF #60.0, R2
  MOVF id2, R1
  ADDF R2, R1
  MOVF R1, id1
37
Analysis-Synthesis Model of a Compiler
Analysis: Program => Constituents => I.R.
- Lexical Analysis: linear => token
- Syntax Analysis: hierarchical, nested => tree
  - Identify relations/actions among tokens: e.g., add(b, mult(c,d))
- Semantic Analysis: check legal constraints / meanings
  - By examining attributes associated with tokens & relations
Synthesis: I.R. => I.R.* => Target Language
- Intermediate Code Generation
  - generate intermediate representation (I.R.) from syntax
- Code Optimization: generate better equivalent IR
  - machine independent + machine dependent
- Code Generation
38
Typical Modules of a Compiler
Source Code → Lexical Analyzer → Tokens → Syntax Analyzer → Syntax Tree → Semantic Analyzer → Annotated Syntax Tree → Intermediate Code Generator → IR → Code Optimizer → IR → Code Generator → Target Code
Shared by all modules: Error Handler, Symbol Table, Literal Table
40
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
41
Syntax Analysis: Structure
Input tokens: id1 := id2 + id3 * 60
→ Syntax Analysis (using the Grammar) → Parse Tree (Concrete syntax tree):
  s → id1 := e
  e → id2 + t
  t → id3 * 60
Syntax Analysis (Parsing): match input tokens against a grammar of the language
- To ensure that the input tokens form a legal sentence (statement)
- To build the structural representation of the input tokens
  - So the structure can be used for translation (or code generation)
Knowledge source:
- Grammar in CFG (Context-Free Grammar) form
- Additional semantic rules for semantic checks and translation (in later phases)
Grammar:
  S → id := e
  S → …
  e → id + t
  e → …
  t → id * n
  t → …
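A minimal recursive-descent sketch of matching tokens against the three productions shown on this slide follows. It deliberately covers only S → id := e, e → id + t and t → id * n; the alternatives elided with "…" are not filled in. The function name `parse_statement`, the `(kind, text)` token encoding, and the nested-tuple tree are assumptions made for illustration.

```python
def parse_statement(tokens):
    """Parse (kind, text) tokens against s -> id := e; e -> id + t; t -> id * num."""
    pos = 0

    def expect(kind):
        # Consume one token of the required kind or report a syntax error.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        token = tokens[pos]
        pos += 1
        return token

    def parse_t():
        left = expect("id")
        expect("*")
        return ("t", left, "*", expect("num"))

    def parse_e():
        left = expect("id")
        expect("+")
        return ("e", left, "+", parse_t())

    target = expect("id")
    expect(":=")
    tree = ("s", target, ":=", parse_e())
    if pos != len(tokens):
        raise SyntaxError("unexpected trailing tokens")
    return tree


tokens = [("id", "id1"), (":=", ":="), ("id", "id2"), ("+", "+"),
          ("id", "id3"), ("*", "*"), ("num", "60")]
tree = parse_statement(tokens)
print(tree)
```

The returned tuple mirrors the concrete parse tree: each non-terminal becomes a node whose children appear in the order of the production's right-hand side. An illegal token sequence raises `SyntaxError` instead of producing a tree.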
42
Grammar: Context Free Grammar
43
Context Free Grammar (CFG):Specification for Structures & Constituency
Parse Tree: graphical representation of structure
- root node (S): a sentential level structure
- internal nodes: constituents of the sentence
- arcs: relationship between parent nodes and their children (constituents)
- terminal nodes: surface forms of the input symbols (e.g., words)
- alternative representation: bracketed notation:
  - e.g., [I saw [the [girl [in [the park]]]]]
Example: [Figure: partial parse tree for “girl in the park”, with NP nodes and a PP node over “in”]
44
Parse Tree: “I saw the girl in the park”
[Figure: parse tree with root S over NP and VP; internal nodes labeled NP and PP]
Parts of speech: I/pron saw/v the/det girl/n in/p the/det park/n
45
CFG: Components
CFG: formal specification of parse trees
- G = {Σ, N, P, S}
- Σ: terminal symbols
- N: non-terminal symbols
- P: production rules
- S: start symbol
Σ: terminal symbols
- the input symbols of the language
  - programming language: tokens (reserved words, variables, operators, …)
  - natural languages: words or parts of speech
- pre-terminal: parts of speech (when words are regarded as terminals)
N: non-terminal symbols
- groups of terminals and/or other non-terminals
S: start symbol: the largest constituent of a parse tree
P: production (re-writing) rules
- form: α → β (α: non-terminal, β: string of terminals and non-terminals)
- meaning: α re-writes to (“consists of”, “derived into”) β, or β is reduced to α
- start with “S-productions” (S → β)
46
CFG: Example Grammar
Grammar Rules
  S → NP VP
  NP → Pron | Proper-Noun | Det Norm
  Norm → Noun Norm | Noun
  VP → Verb | Verb NP | Verb NP PP | Verb PP
  PP → Prep NP
S: sentence, NP: noun phrase, VP: verb phrase
Pron: pronoun
Det: determiner, Norm: Nominal
PP: prepositional phrase, Prep: preposition
Lexicon (in CFG form)
  Noun → girl | park | desk
  Verb → like | want | is | saw | walk
  Prep → by | in | with | for
  Det → the | a | this | these
  Pron → I | you | he | she | him
  Proper-Noun → IBM | Microsoft | Berkeley
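The grammar and lexicon above can be written down as plain data and checked with a small top-down recognizer. This is a sketch, not an efficient parser: it relies on the grammar having no left recursion, and the names `GRAMMAR`, `LEXICON`, `derive` and `is_sentence` are hypothetical. Note that with these rules the prepositional phrase attaches to the verb phrase (VP → Verb NP PP).

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pron"], ["Proper-Noun"], ["Det", "Norm"]],
    "Norm": [["Noun", "Norm"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {
    "Noun": {"girl", "park", "desk"},
    "Verb": {"like", "want", "is", "saw", "walk"},
    "Prep": {"by", "in", "with", "for"},
    "Det": {"the", "a", "this", "these"},
    "Pron": {"I", "you", "he", "she", "him"},
    "Proper-Noun": {"IBM", "Microsoft", "Berkeley"},
}


def derive(symbol, words, start):
    """Yield every position where `symbol` can end if it begins at `start`."""
    if symbol in LEXICON:                      # pre-terminal: match one word
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield start + 1
        return
    for rule in GRAMMAR[symbol]:               # try each production in turn
        ends = [start]
        for part in rule:
            ends = [e2 for e in ends for e2 in derive(part, words, e)]
        yield from ends


def is_sentence(text):
    """True if the whole word sequence can be derived from the start symbol S."""
    words = text.split()
    return len(words) in set(derive("S", words, 0))


print(is_sentence("I saw the girl in the park"))
print(is_sentence("saw the girl I"))
```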
47
Syntax vs. Semantic Analyses
Syntax:
- What do the input tokens look like? Do they form a legal structure?
- Analysis of relationship between elements
  - e.g., operator-operands relationship
Semantic:
- What do they mean? And, thus, how do they act?
- Analysis of detailed attributes of elements and check constraints over them under the given syntax
- Not all knowledge between elements can be conveniently represented by a simple syntactic structure. Various kinds of attributes are associated with sub-structures in the given syntax
48
Syntax vs. Semantic Analyses
Examples:
  int a, b, c, d;
  float f;
  char s1[], s2[];
  a = b + c * d ;
  a = b + f * d ; // OK, but not strictly right
  a = b + s1 * s2 ; // BAD: * is undefined for strings
  a = b + s1 * 3 ; // OK? if properly defined
All the above statements have the same look
- Convenient to represent them with the same syntactic structure (grammar/production rules)
But Semantically …
- Not all of them are meaningful (?? string * string ??)
  • You have to check their other attributes for meanings
- Not all meaningful statements will mean/act the same and have the same codes (*: int * int ≠ int * float ≠ string * int)
  • You have to generate different codes according to other attributes of the tokens, since instructions are limited
  • E.g., INT and FLOAT additions may use different machine instructions, like ADD and ADDF respectively.
Syntax tree: :=(id1, +(id2, *(id3, id4)))
After the semantic analyzer: :=(id1, +(id2, *(id3, inttoreal(id4))))
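The point that one syntactic shape maps to different checks and different instructions can be sketched with a small type-checking routine. The function name `check_binary` and its return convention are assumptions; the opcode names ADD, ADDF, MUL and MULF follow the slides. The "string * int" case is left undefined here, although the slide notes that a language could define it.

```python
def check_binary(op, left_type, right_type):
    """Return (result_type, instruction, operands_to_convert) or raise TypeError."""
    numeric = {"int", "float"}
    if left_type in numeric and right_type in numeric:
        if left_type == "int" and right_type == "int":
            return "int", {"+": "ADD", "*": "MUL"}[op], []
        # Mixed or float operands: use the float instruction and mark
        # every int operand for an int-to-real conversion.
        convert = [side for side, t in (("left", left_type), ("right", right_type))
                   if t == "int"]
        return "float", {"+": "ADDF", "*": "MULF"}[op], convert
    raise TypeError(f"{op} is undefined for {left_type} and {right_type}")


print(check_binary("*", "int", "int"))      # c * d
print(check_binary("*", "float", "int"))    # f * d
try:
    check_binary("*", "string", "string")   # s1 * s2
except TypeError as error:
    print("semantic error:", error)
```

All three calls correspond to the same parse tree shape; only the type attributes decide whether the statement is legal and which instruction is selected.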
49
Semantic Analysis: Attributes
Parse Tree (Concrete Syntax Tree):
  s → id1 := e
  e → id2 + t
  t → id3 * 60
→ Semantic Analysis: semantic checks & abstraction, using the Semantic Rules associated with Grammar Productions
Syntax Tree (Abstract Syntax Tree): :=(id1, +(id2, *(id3, 60)))
After semantic checks: :=(id1, +(id2, *(id3, i2r(60))))
50
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
51
Symbol Table Management
Symbols:
- Variable names, procedure names, constant literals (3.14159)
Symbol Table:
- A record for each name describing its attributes
- Managing Information about names
  - Variable attributes:
    • Type, register/storage allocated, scope
  - Procedure names:
    • Number and types of arguments
    • Method of argument passing
      – By value, address, reference
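A symbol table of the kind described above can be sketched as one record per name, keyed by scope. The class names `Symbol` and `SymbolTable`, the field names, and the example procedure `area` are hypothetical; the two-level lookup (current scope, then global) is a simplifying assumption rather than a full scoping scheme.

```python
from dataclasses import dataclass, field


@dataclass
class Symbol:
    name: str
    kind: str                     # "variable" or "procedure"
    type: str = ""                # variable type or procedure return type
    storage: str = ""             # register/storage allocated to a variable
    scope: str = "global"
    param_types: list = field(default_factory=list)  # procedures only
    passing: str = ""             # "by value", "by address", "by reference"


class SymbolTable:
    def __init__(self):
        self._entries = {}

    def declare(self, symbol):
        key = (symbol.scope, symbol.name)
        if key in self._entries:
            raise KeyError(f"{symbol.name} already declared in scope {symbol.scope}")
        self._entries[key] = symbol

    def lookup(self, name, scope="global"):
        # Try the given scope first, then fall back to the global scope.
        for candidate in (scope, "global"):
            if (candidate, name) in self._entries:
                return self._entries[(candidate, name)]
        raise KeyError(f"undeclared name: {name}")


table = SymbolTable()
table.declare(Symbol("initial", "variable", type="float", storage="R1"))
table.declare(Symbol("area", "procedure", type="float",
                     param_types=["float", "float"], passing="by value"))
print(table.lookup("initial").type)
```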
52
[1] Lexical Analysis: Tokenization
Programming language:
  final := initial + rate * 60 [f := i + r * 60]
  → Lexical Analysis → id1 := id2 + id3 * 60
  1  id1     “final”    float  R2
  2  id2     “initial”  float  R1
  3  id3     “rate”     float
  4  const1  “60”       const  60.0
Natural language:
  I saw the girls [I see the girls]
  → Lexical Analysis → I(+1p+sg) see (+ed) the girl (+s) [I(+1p+sg) see (+prs) the girl (+s)]
  1  “I”     “I”      +1p+sg
  2  “see”   “saw”    +ed
  3  “the”   “the”
  4  “girl”  “girls”  +3p+pl  +s
Both look the same. So you want to represent them with the same normalized token string, and hide detailed features as additional attributes.
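The normalization step can be sketched with a small regular-expression scanner that replaces each distinct identifier by a numbered token and records the original name in a table. The pattern, the function name `tokenize`, and the decision to leave numeric literals unchanged are assumptions made for illustration.

```python
import re

# One token per match: an identifier, a number, or an operator.
TOKEN_PATTERN = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+(?:\.\d+)?)|(?P<op>:=|[+\-*/()]))"
)


def tokenize(source):
    """Return (normalized tokens, {identifier name: index})."""
    symbols = {}
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise SyntaxError(f"illegal character at position {pos}: {source[pos]!r}")
        pos = match.end()
        if match.lastgroup == "id":
            # The first occurrence of a name gets the next free index.
            index = symbols.setdefault(match.group("id"), len(symbols) + 1)
            tokens.append(f"id{index}")
        else:
            tokens.append(match.group(match.lastgroup))
    return tokens, symbols


tokens, symbols = tokenize("final := initial + rate * 60")
print(" ".join(tokens))   # id1 := id2 + id3 * 60
print(symbols)
```

Scanning "f := i + r * 60" yields the same token string, which is the sense in which both statements "look the same" to the parser.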
53
[2] Syntax Analysis: Structure
Programming language:
  id1 := id2 + id3 * 60
  → Syntax Analysis (using the Grammar) → Parse Tree (Concrete syntax tree):
    s → id1 := e
    e → id2 + t
    t → id3 * 60
Natural language:
  I see (+ed) the girl (+s)
  → Syntax Analysis → Sentence → NP verb NP, over “I see (+ed) the girl (+s)”
Normalized tokens have the same parse/syntax tree whether they were “see”/“saw” and “girl”/“girls”.
56
[3] Semantic Analysis: Attributes
Programming language:
  Parse Tree (Concrete Syntax Tree):
    s → id1 := e
    e → id2 + t
    t → id3 * 60
  → Semantic Analysis: semantic checks & abstraction, using the Semantic Rules associated with Grammar Productions
  Syntax Tree (Abstract Syntax Tree): :=(id1, +(id2, *(id3, i2r(60))))
Natural language:
  Parse tree: Sentence → NP verb NP, over “I see (+ed) the girl (+s)”
  → Semantic Analysis → Sentence → NP.subject verb NP.object, over “I see (+ed) the girl (+s)”
60
Semantic Checking
[Figure: the sentence tree (subject verb object) over “I see (+ed) the girl (+s)” is abstracted into the verb “see (+ed)” with a subject branch “I” and an object branch “the girl (+s)”]
Semantic Constraints:
- Agreement: (somewhat syntactic)
  - Subject-Verb: I have, she has/had, I do have, she does not
  - NP: Quantifier-noun: a book, two books
- Selectional Constraint:
  - Kill → Animate
  - Kiss → Animate
61
Semantic Checking
[Figure: the sentence tree (subject verb object) over “I see (+ed) the girl (+s)” goes through semantic checking]
- See[+ed](I, the girl[+s]) (semantically meaningful)
- Kill/Kiss (John, the Stone) (semantically meaningless unless the Stone refers to an animate entity)
Semantic Constraints:
- Agreement: (somewhat syntactic)
  - Subject-Verb: I have, she has/had, I do have, she does not
  - NP: Quantifier-noun: a book, two books
- Selectional Constraint:
  - Kill → Animate
  - Kiss → Animate
62
Parse Tree vs. Syntax Tree
Parse Tree: (aka concrete syntax tree)
- Concrete tree representation drawn according to a grammar
- For validating correctness of syntax of input
- For easy parsing (or fitting constraints of parsing algorithm)
- Normally constructed incrementally during parsing
Syntax Tree: (aka abstract syntax tree)
- Logical tree representation that characterizes the abstract relationships between constituents
- For representing semantic relationships & semantic checking
- Normalizing various parse trees of the same “meaning” (semantics)
- May ignore non-essential syntactic details
- Not always the same as parse tree
- May be constructed in parallel with the parse tree during parsing
- Or converted from parse tree after syntactic parsing
Annotated Syntax Tree (AST)
- Syntax Tree with annotated attributes
63
Parse Tree vs. Syntax Tree
Parse Tree: (depends on grammar)
- Input: T + T + T
- G1: T ((+ T) (+ T) …)
    E → T R’
    R’ → + T R’
    R’ → <null>
- G2: ((T) + T) + T
    E → E + T
    E → T
Syntax Tree:
- Abstract representation for syntax defined by G1/G2
- Use operations as parent nodes and operands as children nodes
- Operation-operand relationship: Easy for instruction selection in code generation (e.g., ADD R1, R2)
[Figures: Parse Tree for G1; Parse Tree for G2; Syntax Tree (independent of G1 or G2)]
64
[4] Intermediate Code Generation
Programming language:
  Syntax tree: :=(id1, +(id2, *(id3, i2r(60))))
  → Intermediate Code Generation: attribute evaluation (assembly codes are attributes for code generation)
  3-address codes:
    temp1 := i2r ( 60 )
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
Natural language:
  Syntax tree: see (+ed), with subject “I” (+anim) and object “the girl (+s)” (+anim); constraint: Action(+anim,+anim)
  → Intermediate Code Generation
  Logic form: See[+ed](I, the girl[+s])
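The 3-address codes on this slide can be produced by a post-order walk over the syntax tree: each operator node is emitted after its operands and receives a fresh temporary. The sketch below assumes a nested-tuple tree encoding and a hypothetical function name `gen_three_address`.

```python
import itertools


def gen_three_address(tree):
    """Emit 3-address codes for an assignment tree (':=', target, expression)."""
    code = []
    counter = itertools.count(1)

    def emit(node):
        # A leaf (identifier or literal) is its own "place".
        if isinstance(node, str):
            return node
        op, *operands = node
        places = [emit(child) for child in operands]   # operands first
        temp = f"temp{next(counter)}"
        if op == "i2r":
            code.append(f"{temp} := i2r ( {places[0]} )")
        else:
            code.append(f"{temp} := {places[0]} {op} {places[1]}")
        return temp

    _, target, expression = tree
    code.append(f"{target} := {emit(expression)}")
    return code


tree = (":=", "id1", ("+", "id2", ("*", "id3", ("i2r", "60"))))
code = gen_three_address(tree)
print("\n".join(code))
```

For the tree above this reproduces the four instructions listed on the slide, ending with `id1 := temp3`.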
66
Syntax-Directed Translation (1)
Translation from input to target can be regarded as attribute evaluation.
- Evaluate attributes of each node, in a well defined order, based on the particular piece of sub-tree structure (syntax) wherein the attributes are to be evaluated.
Attributes: the particular properties associated with a tree node (a node may have many attributes)
- Abstract representation of the sub-tree rooted at that node
- The attributes of the root node represent the particular properties of the whole input statement or sentence.
- E.g., value associated with a mathematical sub-expression
- E.g., machine codes associated with a sub-expression
- E.g., language translation associated with a sub-sentence
67
Syntax-Directed Translation (2)
- Synthesized attributes: attributes that can be evaluated from the attributes of the child nodes
  - E.g., the value of a math expression can be obtained from the values of its sub-expressions (and the operators being applied)
    - a := b + c * d (a.val = b.val + tmp.val, where tmp.val = c.val * d.val)
    - girls = girl + s (tr.girls = tr.girl + tr.s = 女孩 + 們 = 女孩們)
- Inherited attributes: attributes that can be evaluated from parent and/or sibling nodes
  - E.g., the data type of a variable can be obtained from the type declaration to its left or from the type of its left sibling
    - int a, b, c; (a.type = INT, b.type = a.type, ...)
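The two kinds of attributes can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Node` class and the helper names `eval_val` and `assign_types` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A parse-tree node; `attrs` holds its attributes (hypothetical layout)."""
    label: str
    children: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)


def eval_val(node, env):
    """Synthesized attribute: node.val is computed from the children's val."""
    if not node.children:                       # leaf: a variable name
        node.attrs["val"] = env[node.label]
    else:                                       # interior node: a binary operator
        left, right = (eval_val(child, env) for child in node.children)
        node.attrs["val"] = left + right if node.label == "+" else left * right
    return node.attrs["val"]


def assign_types(declared_type, id_nodes):
    """Inherited attribute: the first identifier takes its type from the
    declaration on its left, each later one from its left sibling."""
    inherited = declared_type
    for ident in id_nodes:
        ident.attrs["type"] = inherited
        inherited = ident.attrs["type"]         # handed on to the next sibling


# a := b + c * d, assuming b = 2, c = 3, d = 4
expr = Node("+", [Node("b"), Node("*", [Node("c"), Node("d")])])
a_val = eval_val(expr, {"b": 2, "c": 3, "d": 4})

# int a, b, c;
ids = [Node("a"), Node("b"), Node("c")]
assign_types("INT", ids)
```

Here `a_val` is computed bottom-up (children before parent), while the type flows left to right across siblings.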
68
Syntax-Directed Translation (3)
- Attribute evaluation order:
  - Any order that evaluates an attribute AFTER all the attributes it depends on have been evaluated will result in correct evaluation.
  - General: topological order
    - Analyze the dependencies between attributes and construct an attribute tree or forest
    - Evaluate the attribute of any leaf node, and mark it as "evaluated", thus logically removing it from the attribute tree or forest
    - Repeat for any leaf nodes that have not been marked, until no unmarked node remains
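The evaluation procedure above can be sketched as follows. This is a minimal sketch, assuming the dependencies are given as a dictionary; the function name `evaluate_in_dependency_order` and the rule table are hypothetical.

```python
def evaluate_in_dependency_order(deps, compute):
    """deps maps each attribute to the list of attributes it depends on.
    compute(name, values) returns the value of `name` from evaluated values."""
    values, order = {}, []
    remaining = dict(deps)
    while remaining:
        # "Leaf" attributes: everything they depend on is already evaluated.
        ready = [a for a, ds in remaining.items() if all(d in values for d in ds)]
        if not ready:
            raise ValueError("circular dependency among attributes")
        for attr in ready:
            values[attr] = compute(attr, values)
            order.append(attr)
            del remaining[attr]                 # mark as evaluated
    return order, values


# a := b + c * d, assuming b = 2, c = 3, d = 4
deps = {
    "a.val": ["b.val", "tmp.val"],
    "tmp.val": ["c.val", "d.val"],
    "b.val": [], "c.val": [], "d.val": [],
}
rules = {
    "b.val": lambda v: 2,
    "c.val": lambda v: 3,
    "d.val": lambda v: 4,
    "tmp.val": lambda v: v["c.val"] * v["d.val"],
    "a.val": lambda v: v["b.val"] + v["tmp.val"],
}
order, values = evaluate_in_dependency_order(deps, lambda name, v: rules[name](v))
```

Any order produced this way is a valid topological order; a cycle in the dependencies is reported as an error because no leaf can then be found.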
69
[5] Code Optimization [Normalization]
- Normalization into a better equivalent form (optional)
- Before code optimization:
    temp1 := i2r ( 60 )
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
- After code optimization:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
- Natural-language counterpart: unify passive/active voices
    See[+ed](I, the girl[+s]) => See[+ed](I, the girl[+s])
    Was_Kill[+ed](Bill, John) => Kill[+ed](John, Bill)
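The optimization of the three-address code on this slide can be mimicked by a toy pass. This is a sketch under assumptions, not the optimizer used in the course: the tuple format `(dest, op, arg1, arg2)`, the `"copy"` opcode, and the function name `optimize` are hypothetical, and the pass assumes a copied temporary is not used again later.

```python
def optimize(code):
    """Fold i2r(constant) at compile time and remove a trailing copy."""
    consts = {}                  # temporaries known to hold a constant
    out = []
    for dest, op, a, b in code:
        a = consts.get(a, a)     # replace known-constant temporaries
        b = consts.get(b, b)
        if op == "i2r" and isinstance(a, int):
            consts[dest] = float(a)          # 60 becomes 60.0, no instruction emitted
            continue
        if op == "copy" and out and out[-1][0] == a:
            # The copied temporary was defined by the previous instruction
            # (assumed dead afterwards): write the result to dest directly.
            prev = out.pop()
            out.append((dest,) + prev[1:])
            continue
        out.append((dest, op, a, b))
    return out


code = [
    ("temp1", "i2r", 60, None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
optimized = optimize(code)
```

The result has the same two-instruction shape as the slide; only the surviving temporary keeps its original name (`temp2`) because this sketch does not renumber temporaries.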
70
[6] Code Generation
- Input (optimized intermediate code):
    temp1 := id3 * 60.0
    id1 := id2 + temp1
- Output (target code):
    movf id3, r2
    mulf #60.0, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
- Selection of usable code, order of code, and allocation of available registers
- Natural-language counterpart: See[+ed](I, the girl[+s])
    Lexical: 看到[了](我, 女孩[們])
    Structural: 我 看到 女孩[們] [了]
- Selection of target words and order of phrases
71
Objectives of Optimizing Compilers
- Correct code: preserve meaning
- Better performance
  - Maximum execution efficiency
  - Minimum code size
    - Embedded systems
  - Minimum power consumption
    - Mobile devices
    - Typically, faster execution also implies lower power
- Reasonable compilation time
- Manageable engineering and maintenance effort
72
Optimization for Computer Architectures (1)
- Parallelism
  - Instruction level: multiple operations are executed simultaneously
    - The processor checks dependencies among sequential instructions and issues them in parallel
      - Hardware scheduler: changes the order of instructions
    - Compilers: rearrange instructions to make instruction-level parallelism more effective
    - Instruction set support:
      - Very long instruction word (VLIW): issues multiple operations in parallel
      - Instructions that can operate on vector data at the same time
    - Compilers: generate code for such machines from sequential code
  - Processor level: different threads of the same application are run on different processors
    - Multiprocessors + multithreaded code
      - Programmer: writes multithreaded code, vs.
      - Compiler: generates parallel code automatically
73
Optimization for Computer Architectures (2)
- Memory hierarchies
  - No storage is both fast and large
    - Registers (tens to hundreds of bytes), caches (KB to MB), main/physical memory (MB to GB), secondary/virtual memory (hard disks) (GB to TB)
  - Using registers effectively is probably the single most important problem in optimizing a program
  - Cache management by hardware is not effective in scientific code that has large data structures (arrays)
  - Improve the effectiveness of memory hierarchies:
    - By changing the layout of data, or
    - By changing the order of the instructions accessing the data
  - Improve the effectiveness of the instruction cache:
    - Change the layout of code
74
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phrases
- Quick Review on Syntax & Semantics
- Processing Phrases in Detail
- Structure of Compilers
75
Structure of a Compiler
- Front end: source dependent
  - Lexical analysis
  - Syntax analysis
  - Semantic analysis
  - Intermediate code generation
  - (Code optimization: machine independent)
- Back end: target dependent
  - Code optimization
  - Target code generation
76
Structure of a Compiler
- Source languages (front ends): Fortran, Pascal, C
- Intermediate code
- Target machines (back ends): MIPS, SPARC, Pentium
77
History
- 1st Fortran compiler: 1950s
  - Efficient? (compared with assembly programs)
    - Not bad, and programs were much easier to write
    - Showed that high-level languages are feasible
  - 18 man-years, ad hoc structure
- Today, we can build a simple compiler in a few months.
- Crafting an efficient and reliable compiler is still challenging.
78
Cousins of the Compiler
- Preprocessors: macro definition/expansion
- Interpreters
  - Compiler vs. interpreter vs. just-in-time compilation
- Assemblers: 1-pass / 2-pass
- Linkers: link the compiled program with library functions
- Loaders: load executables into memory
- Editors: editing sources (with/without syntax prediction)
- Debuggers: symbolically providing stepwise trace
- Profilers: gprof (call graph and time analysis)
- Project managers: IDE (Integrated Development Environment)
- Disassemblers, decompilers: low-level to high-level language conversion
79
Applications of Compilation Techniques
80
Applications of Compilation Techniques
- Virtually any kind of programming language or specification language with regular and well-defined grammatical structures needs a kind of compiler (or a variant, or a part of one) to analyze and then process it.
81
Applications of Lexical Analysis
- Text/pattern processing:
  - grep: get lines with a specified pattern
    - Ex: grep '^From ' /var/spool/mail/andy
  - sed: stream editor, editing specified patterns
    - Ex: ls *.JPG | sed 's/JPG/jpg/'
  - tr: simple translation between character sets (e.g., lowercase to uppercase)
    - Ex: tr 'a-z' 'A-Z' < mytext > mytext.uc
  - AWK: pattern-action rule processing
    - Pattern processing based on regular expressions
    - Ex: awk '$1=="John"{count++}END{print count}' < Students.txt
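For readers without these Unix tools at hand, the same four pattern-processing tasks can be approximated with Python's standard `re` and `string` modules. This is a rough sketch: the sample lines, file names, and records below are made-up stand-ins for the mailbox, directory listing, and `Students.txt` in the examples.

```python
import re
import string

# grep '^From ': keep only the lines that match the pattern
lines = ["From andy@example.org Mon", "Subject: From here", "From bob@example.org Tue"]
from_lines = [ln for ln in lines if re.search(r"^From ", ln)]

# sed 's/JPG/jpg/': replace the first match on each line
renamed = [re.sub(r"JPG", "jpg", name, count=1) for name in ["a.JPG", "JPG.JPG"]]

# tr 'a-z' 'A-Z': map each lowercase ASCII letter to its uppercase form
table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
upper = "my text".translate(table)

# awk '$1=="John"{count++}END{print count}': count records whose first field is John
records = ["John 90", "Mary 85", "John 70"]
count = sum(1 for rec in records if rec.split()[:1] == ["John"])
```

Note that `sed 's/JPG/jpg/'` without the `g` flag changes only the first occurrence per line, which is why `count=1` is passed to `re.sub`.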
82
Applications of Lexical Analysis
- Search engines / information retrieval
  - Full-text search, keyword matching, fuzzy match
- Database machine
  - Fast matching over a large database
  - Database filter
- Fast & multiple matching algorithms
  - Optimized/specialized lexical analyzers (FSA)
  - Examples: KMP, Boyer-Moore (BM), ...
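As an illustration of one of the matching algorithms named above, here is a compact sketch of Knuth-Morris-Pratt (KMP) search. The function names `failure_table` and `kmp_find_all` are chosen for this example only.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0                 # k = number of pattern symbols matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]            # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]
    return matches
```

The variable `k` plays the role of the state of a finite-state automaton built from the pattern, which is why such matchers can be viewed as specialized lexical analyzers: each text symbol is read exactly once.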
83
Applications of Syntax Analysis
- Structured editor / word processor
- Integrated Development Environment (IDE)
  - Automatic formatting, keyword insertion
- Incremental parser vs. full-blown parsing
  - Incremental: patch the existing analysis according to incremental changes, instead of re-parsing or re-compiling
- Pretty printer: beautify nested structures
  - cb (C beautifier)
  - indent (an even more versatile C beautifier)
84
Applications of Syntax Analysis
- Static checker / debugger: lint
  - Checks for errors without actually running the program, e.g.,
    - Statement not reachable
    - Variable used before being defined
85
Application of Optimization Techniques
- Data flow analysis
  - Software testing:
    - Locating errors before running (static checking)
    - Locating errors along all possible execution paths
      - Not only on the test data set
  - Type checking
    - Dereferencing null or freed pointers
    - "Dangerous" user-supplied strings
  - Bound checking
    - Security vulnerability: buffer over-run attack
    - Tracking values of pointers across procedures
  - Memory management
    - Garbage collection
86
Applications of Compilation Techniques
- Pre-processor: macro definition/expansion
- Active webpage processing
  - Script or programming languages embedded in webpages for interactive transactions
  - Examples: JavaScript, JSP, ASP, PHP
  - Compiler applications: expansion of embedded statements, in addition to web page parsing
- Database query language: SQL
87
Applications of Compilation Techniques
- Interpreter
  - No pre-compilation
  - Executed on the fly
  - E.g., BASIC
- Script languages: C-shell, Perl
  - Function: batch processing of multiple files/databases
  - Mostly interpreted, some pre-compiled
  - Some are interpreted and save the compiled code
88
Applications of Compilation Techniques
- Text formatters
  - troff, LaTeX, eqn, pic, tbl
- VLSI design: silicon compiler
  - Hardware description languages
    - Variables => control signals / data
  - Circuit synthesis
  - Preliminary circuit simulation by software
89
Applications of Compilation Techniques
- VLSI design
90
Advanced Applications
- Natural language processing
  - Advanced search engines: retrieve relevant documents
    - More than keyword matching
    - Natural language query
  - Information extraction: acquire relevant information (into structured form)
  - Text summarization: get the briefest and most relevant paragraphs
  - Text/web mining: mining information and rules from text/web
91
Advanced Applications
- Machine translation
  - Translating one natural language into another
  - Models:
    - Direct translation
    - Transfer-based model
    - Inter-lingua model
  - Transfer-based model: Analysis-Transfer-Generation (or Synthesis) model
92
Tools for Compiler Construction
93
Tools: Automatic Generation of Lexical Analyzers and Compilers
- Lexical analyzer generator: LEX
  - Input: token pattern specification (as regular expressions)
  - Output: a lexical analyzer
- Parser generator: YACC
  - "Compiler-compiler"
  - Input: grammar specification (as a context-free grammar)
  - Output: a syntax analyzer (aka "parser")
94
Tools
- Syntax-directed translation engines
  - Translations associated with nodes
  - Translations defined in terms of the translations of children
- Automatic code generation
  - Translation rules
  - Template matching
- Data flow analyses
  - Dependency of variables & constructs
95
Programming Languages
- Issues about modern PLs
- Modular programming & parameter passing
- Nested modules & scopes
- Static & dynamic allocation
96
Programming Language Basics
- Static vs. dynamic issues or policies
  - Static: determined at compile time
  - Dynamic: determined at run time
- Scope of a declaration
  - The region in which uses of x refer to a given declaration of x
  - Static scope (aka lexical scope):
    - The scope of a declaration can be determined by looking at the program
    - C, Java (and most PLs)
      - Delimited by block structures
  - Dynamic scope:
    - At run time, the same use of x could refer to any of several declarations of x.
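The difference between the two scoping policies can be simulated with explicit lookup functions. This is a sketch under assumptions: the dictionary layout for scopes and the names `lookup_static` and `lookup_dynamic` are hypothetical. The modeled program has a global `x = 1`, a procedure `show` that uses `x`, and a procedure `caller` that declares a local `x = 2` and then calls `show`.

```python
def lookup_static(name, scope):
    """Static scope: follow the chain of textually enclosing blocks."""
    while scope is not None:
        if name in scope["vars"]:
            return scope["vars"][name]
        scope = scope["enclosing"]
    raise NameError(name)


def lookup_dynamic(name, call_stack):
    """Dynamic scope: search the most recent activation first."""
    for frame in reversed(call_stack):
        if name in frame:
            return frame[name]
    raise NameError(name)


# Block structure, known from the program text alone
global_scope = {"vars": {"x": 1}, "enclosing": None}
show_scope = {"vars": {}, "enclosing": global_scope}      # show is declared globally

# Activations at run time: global -> caller -> show
call_stack = [{"x": 1}, {"x": 2}, {}]

static_x = lookup_static("x", show_scope)     # resolves to the global x
dynamic_x = lookup_dynamic("x", call_stack)   # resolves to caller's local x
```

Under static scope the answer depends only on where `show` is written; under dynamic scope it depends on who called `show`.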
97
Programming Language Basics
- Variable declaration
  - Static variables
    - It is possible to determine the location in memory where the declared variable can be found
      - public static int x; // Java
      - Only one copy of x, whose location can be determined at compile time
      - Global declarations and declared constants can also be made static
  - Dynamic variables:
    - Local variables without the "static" keyword
      - Each object of the class would have its own location where x would be held.
      - At run time, the same use of x in different objects could refer to any of several different locations.
98
Programming Language Basics
- Parameter passing mechanisms
  - Call by value
    - Make a copy of the physical value
  - Call by reference
    - Make a copy of the address of a physical object
  - Call by name (Algol 60)
    - The callee executes as if the actual parameter were substituted literally for the formal parameter in the code of the callee
      - Macro expansion of the actual parameter in place of the formal parameter
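The three mechanisms can be imitated in Python, with the caveat that Python itself passes object references, so this is a simulation rather than a description of how any particular compiler implements them. Call by value is modeled with an explicit copy, call by reference by sharing a mutable list, and call by name by passing a zero-argument function (a "thunk") that is re-evaluated at every use. All names are hypothetical.

```python
import copy


def set_first_by_value(arr):
    arr = copy.copy(arr)        # the callee works on its own copy
    arr[0] = 99
    return arr


def set_first_by_reference(arr):
    arr[0] = 99                 # the callee changes the caller's object


def twice_by_name(thunk):
    return thunk() + thunk()    # the actual parameter is evaluated at each use


data = [1, 2, 3]
set_first_by_value(data)
after_value = list(data)        # caller's data is unchanged

set_first_by_reference(data)
after_reference = list(data)    # caller's data was modified

counter = {"n": 0}


def next_n():
    """An actual parameter with a side effect: returns 1, 2, 3, ..."""
    counter["n"] += 1
    return counter["n"]


by_name_result = twice_by_name(next_n)           # evaluated twice: 1 + 2
by_value_result = (lambda v: v + v)(next_n())    # evaluated once: 3 + 3
```

The side-effecting argument makes the difference visible: with call by name the expression is evaluated twice inside the callee, with call by value only once before the call.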
99
Formal Languages
100
Languages, Grammars and Recognition Machines
- Language: I saw a girl in the park ...
- Grammar (expression): defines/generates the language
    S -> NP VP
    NP -> pron | det n
- Parser (automaton): accepts the language; constructed from the grammar
- Parsing table:
    S -> · NP VP
    NP -> · pron | · det n
101
Languages
- Alphabet: any finite set of symbols
  - {0, 1}: the binary alphabet
- String: a finite sequence of symbols from an alphabet
  - 1011: a string of length 4
  - ε: the empty string
- Language: any set of strings over an alphabet
  - {00, 01, 10, 11}: the set of binary strings of length 2
  - ∅: the empty set
105
Grammars
- The sentences in a language may be defined by a set of rules called a grammar
  - L: {00, 01, 10, 11} (the set of binary strings of length 2)
  - G: (0|1)(0|1)
- Languages of different degrees of regularity can be specified with grammars of different "expressive power"
  - Chomsky hierarchy: Regular Grammar < Context-Free Grammar < Context-Sensitive Grammar < Unrestricted Grammar
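The claim that G: (0|1)(0|1) defines exactly L can be checked mechanically. The sketch below uses Python's `re` module as a stand-in for the regular-expression notation on the slide; the helper name `in_language` is made up for this example.

```python
import re
from itertools import product

G = re.compile(r"(0|1)(0|1)")


def in_language(s):
    """True if the whole string s is generated by G."""
    return G.fullmatch(s) is not None


# Enumerate every binary string of length 2 and compare with L
L = {"".join(symbols) for symbols in product("01", repeat=2)}
accepted = {s for s in L if in_language(s)}
```

`fullmatch` is used rather than `search` because membership in a language concerns the entire string, not a substring of it.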
106
Automata
- An acceptor/recognizer of a language is an automaton that determines whether an input string is a sentence in the language
- A transducer of a language is an automaton that determines whether an input string is a sentence in the language, and may produce strings as output if it is
- Implementation: state transition functions (parsing table)
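A table-driven acceptor and a simple transducer for the language {00, 01, 10, 11} might look like the sketch below. The state numbering, the `TRANSITIONS` table, and the output vocabulary are assumptions made for illustration.

```python
# States: 0 = nothing read, 1 = one symbol read, 2 = two symbols read (accepting)
TRANSITIONS = {(0, "0"): 1, (0, "1"): 1, (1, "0"): 2, (1, "1"): 2}
ACCEPTING = {2}


def accepts(s):
    """Acceptor: run the state transition function over the input."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:           # no transition defined: reject
            return False
    return state in ACCEPTING


OUTPUT = {"0": "zero", "1": "one"}


def transduce(s):
    """Transducer: produce an output string only if the input is accepted."""
    if not accepts(s):
        return None
    return " ".join(OUTPUT[ch] for ch in s)
```

The acceptor only answers yes or no; the transducer reuses the same table and additionally emits a translation for accepted inputs.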
107
Transducer
- Grammar G1 defines/generates language L1
- Grammar G2 defines/generates language L2
- The automaton is constructed from the grammars; it accepts L1 and translates it into L2
108
Meta-languages
- Meta-language: a language used to define another language
- Different meta-languages will be used to define the various components of a programming language so that these components can be analyzed automatically
109
Definition of Programming Languages
- Lexical tokens: regular expressions
- Syntax: context-free grammars
- Semantics: attribute grammars
- Intermediate code generation: attribute grammars
- Code generation: tree grammars
110
Implementation of Programming Languages
- Regular expressions: finite automata, lexical analyzer
- Context-free grammars: pushdown automata, parser
- Attribute grammars: attribute evaluators, type checker and intermediate code generator
- Tree grammars: finite tree automata, code generator
111
Appendix: Machine Translation
112
Machine Translation (Transfer Approach)
- SL Text -> Analysis -> SL IR -> Transfer -> TL IR -> Synthesis -> TL Text
  - Analysis uses SL dictionaries & grammar
  - Transfer uses SL-TL dictionaries and transfer rules
  - Synthesis uses TL dictionaries & grammar
- IR: Intermediate Representation (SL: source language; TL: target language)
- Analysis is target independent, and generation (synthesis) is source independent
- Figure labels: Inter-lingua, SL, TL
113
Analysis
- Morphological and lexical analysis
- Part-of-speech (POS) tagging
- Example: Miss Smith put two books on this dining table.
    n. Miss
    n. Smith
    v. put (+ed)
    q. two
    n. book (+s)
    p. on
    d. this
    n. dining table
114
Syntax Analysis
- Example: Miss Smith put two books on this dining table.
    [S [NP Miss Smith] [VP [V put(+ed)] [NP two book(s)] [PP on this dining table]]]
115
Transfer
- (1) Lexical transfer
- Example: Miss Smith put two books on this dining table.
    Miss -> 小姐
    Smith -> 史密斯
    put (+ed) -> 放
    two -> 兩
    book (+s) -> 書
    on -> 在…上面
    this -> 這
    dining table -> 餐桌
116
Transfer
- (2) Phrasal/structural transfer
- Example: Miss Smith put two books on this dining table.
    小姐史密斯 放兩書 在上面這餐桌 => 史密斯小姐 放兩書 在這餐桌上面
117
Generation: Morphological & Structural
- Example: Miss Smith put two books on this dining table.
    史密斯小姐放兩書在這餐桌上面
    史密斯小姐放兩(本)書在這(張)餐桌上面
    史密斯小姐(把)兩(本)書放在這(張)餐桌上面
- Chinese translation (中文翻譯): 史密斯小姐把兩本書放在這張餐桌上面
118
source program -> lexical analyzer -> syntax analyzer -> semantic analyzer -> intermediate code generator -> code optimizer -> code generator -> target program
- The symbol-table manager and the error handler interact with all of these phases
[Aho 86]
119
position := initial + rate * 60
- Lexical analyzer: id1 := id2 + id3 * 60
- Syntax analyzer (tree): :=(id1, +(id2, *(id3, 60)))
- Semantic analyzer (tree): :=(id1, +(id2, *(id3, inttoreal(60))))
- SYMBOL TABLE
    1  position  ...
    2  initial   ...
    3  rate      ...
    4
[Aho 86]
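The first step of this example, turning the source text into the token stream id1 := id2 + id3 * 60 while filling the symbol table, can be sketched with a small regular-expression scanner. The token classes, the `TOKEN_RE` pattern, and the function name `scan` are assumptions made for this illustration; only the operators that occur in the example are handled.

```python
import re

# One token per match: identifier, integer literal, or operator
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>:=|[+*]))")


def scan(source):
    """Return (tokens, symbol_table); identifiers become id1, id2, ..."""
    symtab, tokens, pos = [], [], 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = m.end()
        if m.group("id"):
            name = m.group("id")
            if name not in symtab:          # first occurrence: new table entry
                symtab.append(name)
            tokens.append(f"id{symtab.index(name) + 1}")
        else:
            tokens.append(m.group("num") or m.group("op"))
    return tokens, symtab


tokens, symtab = scan("position := initial + rate * 60")
```

The index of a name in `symtab` (counting from 1) corresponds to the row number in the symbol table shown on the slide.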
120
- Intermediate code generator:
    temp1 := inttoreal (60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
- Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
- Code generator: binary code
[Aho 86]
121
Detailed Steps (1): Analysis
- Text pre-processing (separating text from tags)
  - Clean up garbage patterns (usually introduced during file conversion)
  - Recover sentences and words (e.g., <B>C</B> omputer)
  - Separate processing regions from non-processing regions (e.g., file header sections, equations, etc.)
  - Extract and mark strings that need special treatment (e.g., topics, keywords, etc.)
  - Identify and convert markup tags into internal tags (de-markup; however, markup tags also provide information)
- Discourse and sentence segmentation
  - Divide the text into various primary processing units (e.g., sentences)
  - Discourse: cue phrases
  - Sentence: mainly classify the type of "period" and "carriage return" in English ("sentence stops" vs. "abbreviations/titles")
122
Detailed Steps (2): Analysis (Cont.)
- Stemming
  - English: perform morphological analysis (e.g., -ed, -ing, -s, -ly, re-, pre-, etc.) and identify the root form (e.g., got <get>, lay <lie/lay>, etc.)
  - Chinese: mainly detect suffix lexemes (e.g., 孩子們, 學生們, etc.)
  - Text normalization: capitalization, hyphenation, ...
- Tokenization
  - English: mainly identify split idioms (e.g., turn NP on) and compounds
  - Chinese: word segmentation (e.g., [土地] [公有] [政策])
  - Regular expressions: numerical strings/expressions (e.g., twenty millions), dates, ... (each being associated with a specific type)
- Tagging
  - Assign part of speech (e.g., n, v, adj, adv, etc.)
  - Associated forms are basically independent of languages starting from this step
123
Detailed Steps (3): Analysis (Cont.)
- Parsing
  - Decide suitable syntactic relationships (e.g., PP-attachment)
- Decide word sense
  - Decide the appropriate lexicon sense (e.g., river-bank, money-bank, etc.)
- Assign case labels
  - Decide suitable semantic relationships (e.g., Patient, Agent, etc.)
- Anaphora and antecedent resolution
  - Pronoun reference (e.g., "he" refers to "the president")
124
Detailed Steps (4): Analysis (Cont.)
- Decide discourse structure
  - Decide suitable relationships between discourse segments (e.g., Evidence, Concession, Justification, etc. [Marcu 2000])
- Convert into logical form (optional)
  - Co-reference resolution (e.g., "president" refers to "Bill Clinton"), scope resolution (e.g., negation), temporal resolution (e.g., today, last Friday), spatial resolution (e.g., here, next), etc.
  - Identify the roles of named entities (Person, Location, Organization), and determine IS-A (also Part-of) relationships, etc.
  - Mainly used in inference-related applications (e.g., Q&A, etc.)
125
Detailed Steps (5): Transfer
- Decide suitable target discourse structure
  - For example: Evidence, Concession, Justification, etc. [Marcu 2000]
- Decide suitable target lexicon senses
  - Sense mapping may not be one-to-one (sense resolution might be different in different languages, e.g., "snow" has more senses in Eskimo)
  - Sense-token mapping may not be one-to-one (lexicon representation power might be different in different languages, e.g., "DINK", "睨", etc.). It could be 2-1, 1-2, etc.
- Decide suitable target sentence structure
  - For example: verb nominalization, constituent promotion and demotion (usually occurs when sense-token mapping is not 1-1)
- Decide appropriate target case
  - The case label might change after the structure has been modified
  - (Example) verb nominalization: "… that you (AGENT) invite me" -> "… your (POSS) invitation"
126
Detailed Steps (6): Generation
- Adopt a suitable sentence syntactic pattern
  - Depends on style (which is the distribution of lexicon selections and syntactic patterns adopted)
- Adopt suitable target lexicon
  - Select from the synonym set (depends on style)
- Add "de" (Chinese), comma, tense, measure (Chinese), etc.
  - Morphological generation is required for target-specific tokens
- Text post-processing
  - Final string substitution (replace the markers of special strings)
  - Extract and export associated information (e.g., glossary, index, etc.)
  - Restore the customer's markup tags (re-markup) to save typesetting work