Advanced Compilers: Lexical Analysis
Fall 2016
Chungnam National Univ.
Eun-Sun Cho
Compiler Front-End Structure

Source code
  → Preprocessor: handles #include, #define, #ifdef ...; reports trivial errors
  → preprocessed source code
  → Lexical Analysis
  → Syntax Analysis
  → Semantic Analysis
  → abstract syntax tree
(errors may be reported at each analysis phase)
Lexical Analysis

A lexical analyzer (= scanner) reads the source character by character and emits tokens:

  if (by == 0) ax = by;   →   if  (  by  ==  0  )  ax  =  by  ;
• A given source program is treated as one "long" string.
• Looking at each character in sequence, the lexical analyzer transforms what
  it reads into a stream of the smallest meaningful units (tokens).
• Spaces are eliminated, so the result is shorter than the source code.
Syntax Analysis

• Check the syntax of the input program
• Check the role of each word (token)

eg) a Korean statement: "똑똑한 우리들이 모였습니다." ("The smart ones, we, have gathered.")
  statement → subject phrase + verb phrase
    subject phrase → modifier 똑똑한 ("smart") + true subject 우리들이 ("we")
    verb phrase → 모였습니다 ("gathered")
Lexical Analysis
Lex

• A lexical analyzer generator, published in 1975
  – Input: user-defined regular expressions and supporting code
  – Output: a C program

Toolflow:
  lex input (regular expressions + actions, e.g. test.l)
    → lex → lex.yy.c
    → cc (linked with the lex library) → executable lexical analyzer
  The generated analyzer turns an input stream (source program) into a series of tokens.
Input for lex
<Definitions>
%{
definitions of data structures, variables and constants
for the resulting analyzer code
%}
definition of names
: each name is bound to a specific regular expression
%%
<Rules>
a rule = a regular expression (representing a token) +
an action (C code to be executed when the token is recognized)
%%
<User-defined functions>
: invoked in the actions of <Rules>
%{
/* file name : example1.l
 * input: file name
 * output: the numbers of lines, words and characters */
unsigned long charCount = 0, wordCount = 0, lineCount = 0;
%}
word [^ \t\n]+
eol  \n
%%
{word}  { wordCount++; charCount += yyleng; }
{eol}   { charCount++; lineCount++; }
.       { charCount++; }
%%
int main() {
    FILE *file;
    char fn[20];
    printf("Type an input file name:");
    scanf("%s", fn);
    file = fopen(fn, "r");
    if (!file) {
        fprintf(stderr, "file '%s' could not be opened.\n", fn);
        exit(1);
    }
    yyin = file;
    yylex();
    printf("%lu %lu %lu %s\n", lineCount, wordCount, charCount, fn);
}
int yywrap() { return 1; /* end of processing */ }
$ vi example1.l
$ ls
example1.l
$ lex example1.l
$ ls
example1.l  lex.yy.c
$ cc lex.yy.c -o example1 -ll
$ ls example1*
example1  example1.l
$ vi XXX.c
$ example1
Type an input file name: XXX.c
20 60 300 XXX.c
Regular expressions for lex (the usual regular expressions, plus more)

" : all characters between a pair of quotes are taken literally (text characters)
  eg. a"*"b and a*b are different
\ : escapes a single character
  eg. XYZ"++", "XYZ++" and XYZ\+\+ are all the same
[] : defines a character class
  eg. [abc] : one character among a, b and c
- : range operator inside a class
  eg. [a-z] : one lower-case letter from a to z
^ : complement (when first inside a class)
  eg. [^*] : any character except *
\ : C-style escape sequences
  eg. [ \t\n] : one of a blank, a tab or a newline character
* : repeats 0 or more times
  eg. [a-zA-Z][a-zA-Z0-9]* : a regular expression for a variable name
+ : repeats 1 or more times
  eg. [a-z]+ : any nonempty string of lower-case letters
? : an optional element
  eg. ab?c : either abc or ac
| : choice operator
  eg. (ab|cd) : either ab or cd
  (ab|cd+)?(ef)* : abefef, efefef, cdef, cddd, ...
^ : at the beginning of a line
  eg. ^abc : recognizes abc only if it appears at the beginning of a line
$ : at the end of a line
. : any character except the newline character
  eg. "--".* : from -- to the end of the line
{} : uses a name instead of the corresponding regular expression
Lex Expression    Matches
abc               abc
abc*              ab, abc, abcc, abccc, ...
abc+              abc, abcc, abccc, abcccc, ...
a(bc)+            abc, abcbc, abcbcbc, ...
a(bc)?            a, abc
[abc]             one of: a, b, c
[a-z]             any letter, a through z
[a\-z]            one of: a, -, z
[-az]             one of: -, a, z
[A-Za-z0-9]+      one or more alphanumeric characters
[ \t\n]+          whitespace
[^ab]             anything except: a, b
[a^b]             one of: a, ^, b
[a|b]             one of: a, |, b
a|b               a or b
Recognizing Regular Expressions
<= { … }
<> { … }
< { … }
= { … }
>= { … }
> { … }
Data Structure for Tokens
"Lexeme": the representation of each token inside the lexical analyzer
• token number
  – internal (unique) number for a token, for efficient processing
• token value
  – valid only if a token has a "value" that the programmer created
    token value of an identifier: the matched string
    token value of a constant: the constant's value

eg. if X < Y then X := 10;
→ (29,0) (1,X) (18,0) (1,Y) (35,0) (1,X) (9,0) (2,10) (7,0)
  where (1,X) and (1,Y) may appear as (1,10) and (1,12): the identifier itself
  may be kept in a symbol table, with the token value being its index there.
%{
/* calc.lex */
#include "global.h“
#include "calc.h“
#include <stdlib.h>
%}
white [ \t]+
digit [0-9]
integer {digit}+
exponent [eE]([+-])?{integer}
real {integer}("."{integer})?({exponent})?
%%
{white} {}
{real} { yylval=atof(yytext);
return(NUMBER); }
"+" { return(PLUS); }
"-" { return(MINUS);}
"*" { return(TIMES); }
"/" { return(DIVIDE); }
"^" { return(POWER); }
"(" { return(LEFT_PARENTHESIS); }
")" { return(RIGHT_PARENTHESIS);
}
"\n" { return(END); }
%%
int yywrap(void) {
return 1;
}
What is the main difference from the previous wordcount example?
… check the position of return statements ..
yylval holds the token value; yytext is the matched string (both set by the generated lexer)
References
• Text: lex & yacc, 2nd Edition, John R. Levine, Tony Mason, Doug Brown, O'Reilly, 1992
• Examples: http://myweb.stedwards.edu/laurab/cosc4342/lex-examples.html
• lex built-in functions etc.: http://www.tldp.org/HOWTO/Lex-YACC-HOWTO-3.html
More on Regular Expressions
Regular Expressions
Note: other places regular expressions are used (1)
• The Unix command grep
grep smug files {search files for lines with 'smug'}
grep '^smug' files {'smug' at the start of a line}
grep 'smug$' files {'smug' at the end of a line}
grep '^smug$' files {lines containing only 'smug'}
grep '\^s' files {lines starting with '^s', "\" escapes the ^}
grep '[Ss]mug' files {search for 'Smug' or 'smug'}
grep 'B[oO][bB]' files {search for BOB, Bob, BOb or BoB }
grep '^$' files {search for blank lines}
grep '[0-9][0-9]' file {search for pairs of numeric digits}
http://www.robelle.com/smugbook/regexpr.html
Note: other places regular expressions are used (2)
• In JavaScript
  – used by methods such as exec, test, match and search
  – methods that take regular expressions (partial list)
The /g flag means: do not stop after the first match,
but find everything that matches.
(http://www.w3schools.com/jsref/jsref_regexp_g.asp)
Note: other places regular expressions are used (3)
• In Java
  – package java.util.regex
Java Example

import java.io.Console;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class RegexTestHarness {
    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) ... // error.. exit!
        while (true) {
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at "
                    + "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) { console.format("No match found.%n"); }
        }
    }
}
Enter your regex: foo
Enter input string to search: foofoofoo
I found the text foo starting at index 0 and ending at index 3.
I found the text foo starting at index 3 and ending at index 6.
I found the text foo starting at index 6 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
Lexical Analyzer in Java

import java.io.Reader;
import java.util.Scanner;
import java.util.HashMap;
import java.util.regex.Pattern;

public class Lexer {
    public static enum TokenType {
        GTEQ(">="), LTEQ("<="), GT(">"), LT("<"), ARROW("-->"),
        PLUS("+"), MINUS("-"), STAR("*"), SLASH("/"), ASSIGN("="),
        LPAR("("), RPAR(")"), SEMI(";"), COMMA(","),
        IF("if"), ELSE("else"), WHILE("while"),
        IDENT(null), NUMERAL(null), EOF(null), ERROR(null);

        final private String lexeme;
        TokenType(String s) { lexeme = s; }
    }

    public String lastLexeme;

    private static HashMap<String, TokenType> tokenMap =
        new HashMap<String, TokenType>();
    static {
        for (TokenType c : TokenType.values())
            tokenMap.put(c.lexeme, c);
    }

    private Scanner inp;
    private static final Pattern tokenPat = Pattern.compile(
        "(\\s+|#.*)"
        + "|>=|<=|-->|if|def|else|fi|while"
        + "|([a-zA-Z][a-zA-Z0-9]*)|(\\d+)"
        + "|.");

    public Lexer(Reader reader) { inp = new Scanner(reader); }

    public TokenType nextToken() {
        if (inp.findWithinHorizon(tokenPat, 0) == null)
            return TokenType.EOF;
        else {
            lastLexeme = inp.match().group(0);
            if (inp.match().start(1) != -1)        // whitespace or comment
                return nextToken();
            else if (inp.match().start(2) != -1)   // identifier
                return TokenType.IDENT;
            else if (inp.match().start(3) != -1)   // numeral
                return TokenType.NUMERAL;
            TokenType result = tokenMap.get(lastLexeme);
            if (result == null) return TokenType.ERROR;
            else return result;
        }
    }
}

https://inst.eecs.berkeley.edu/~cs164/sp11/lectures/lecture2/Lexer.java