LEXING AND PARSING: THE BEGINNER’S GUIDE

Upload: elizabeth-smith

Post on 28-Jun-2015

DESCRIPTION

A beginner's guide to lexing and parsing for PHP developers, given at ZendCon 2014.

TRANSCRIPT

Page 1: Lexing and parsing

LEXING AND PARSING: THE BEGINNER’S GUIDE

Page 2: Lexing and parsing

WHY ARE WE DOING THIS?

• bbcode

• html

• xml

• programming language

Page 3: Lexing and parsing

BUT I CAN JUST REGEX

• sometimes you can

• sometimes you can’t

• is your html well formed? (view source some time)

• it depends!!
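One case where "it depends" turns into "you can't" is nesting. The following is a minimal Python sketch, not part of the original talk; the pattern and the sample strings are made up for illustration. It shows a regex that copes with flat bbcode but pairs the wrong tags once they nest.

```python
import re

# Hypothetical pattern: match [b]...[/b] as lazily as possible.
BOLD = re.compile(r"\[b\](.*?)\[/b\]")

flat = "[b]one[/b] and [b]two[/b]"
nested = "[b]outer [b]inner[/b] tail[/b]"

# Flat input: each opening tag is paired with its own closing tag.
print(BOLD.findall(flat))

# Nested input: the first [b] is paired with the first [/b],
# so the captured text contains a stray opening tag.
print(BOLD.findall(nested))
```

A regex has no way to count how deep it is, which is the job a parser with a stack takes over.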

Page 4: Lexing and parsing

CHOMSKY HIERARCHY

Page 5: Lexing and parsing

COMPUTER SCIENCE

WE LIKE ACRONYMS AND WEIRD WORDS

Page 6: Lexing and parsing

ENGLISH IS HARD!

• tokenizer

• scanner

• lexer

• parser

• lexical analyzer

• syntactic analyzer

• formal grammar

Page 7: Lexing and parsing

LEXICAL ANALYSIS

BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS

LEXING

Page 8: Lexing and parsing

SCANNING

• Finite State Machine

• Finds Lexemes

• Might backtrack
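A scanner built as a finite state machine can be sketched in a few lines. This is an illustrative Python sketch rather than anything from the talk: the state names (`start`, `number`, `ident`) and the function `scan` are assumptions. With these simple rules one character of lookahead is enough, so this scanner never has to backtrack.

```python
def scan(text):
    """Split text into lexemes with a tiny hand-rolled state machine."""
    lexemes = []
    state = "start"
    begin = 0
    i = 0
    while i <= len(text):
        ch = text[i] if i < len(text) else ""  # "" marks end of input
        if state == "start":
            begin = i
            if ch.isdigit():
                state = "number"
            elif ch.isalpha() or ch == "_":
                state = "ident"
            elif ch and not ch.isspace():
                lexemes.append(ch)  # single-character lexeme
            i += 1
        elif state == "number":
            if ch.isdigit():
                i += 1
            else:
                lexemes.append(text[begin:i])
                state = "start"  # look at ch again from the start state
        elif state == "ident":
            if ch.isalnum() or ch == "_":
                i += 1
            else:
                lexemes.append(text[begin:i])
                state = "start"
    return lexemes


print(scan("count = count + 42;"))
```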

Page 9: Lexing and parsing

FINITE STATE MACHINE

Page 10: Lexing and parsing

EVALUATOR

• looks at lexeme to get value

• lexeme + value = token
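The evaluator step can be sketched as a function that takes a lexeme and attaches a type and a value. This is a hedged Python illustration; the `Token` class, the type names and the function `evaluate` are hypothetical, not taken from any real lexer.

```python
from typing import NamedTuple, Union


class Token(NamedTuple):
    type: str
    value: Union[int, str]


def evaluate(lexeme: str) -> Token:
    """Look at a lexeme and decide what kind of token it is."""
    if lexeme.isdecimal():
        return Token("NUMBER", int(lexeme))  # the value is the parsed integer
    if lexeme.isidentifier():
        return Token("NAME", lexeme)
    return Token("PUNCT", lexeme)


print([evaluate(lexeme) for lexeme in ["count", "=", "42"]])
```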

Page 11: Lexing and parsing

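LEXING PHP - $y = 5;

• $y

• array(309, '$y', 1)

• =

• '=' (single-character tokens come back as plain strings)

• 5

• array(305, '5', 1)

• 309 == T_VARIABLE

• 305 == T_LNUMBER

• the numeric IDs differ between PHP versions; use token_name() to look them up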
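To make the shape of that output concrete, here is a rough Python imitation of it. It is not PHP's lexer: it only borrows the real token names `T_VARIABLE`, `T_LNUMBER` and `T_WHITESPACE`, and the regular expressions and the function `tokenize` are assumptions made for this sketch.

```python
import re

# One named alternative per token type; anything else is a single character.
TOKEN_RE = re.compile(r"""
    (?P<T_VARIABLE>\$[A-Za-z_][A-Za-z0-9_]*)
  | (?P<T_LNUMBER>[0-9]+)
  | (?P<T_WHITESPACE>\s+)
  | (?P<CHAR>.)
""", re.VERBOSE | re.DOTALL)


def tokenize(source):
    """Return (name, text, line) tuples, or plain strings for single characters."""
    tokens = []
    line = 1
    for match in TOKEN_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "CHAR":
            tokens.append(text)
        else:
            tokens.append((kind, text, line))
        line += text.count("\n")  # keep the line number current
    return tokens


for token in tokenize("$y = 5;"):
    print(token)
```

Unlike the slide, this sketch also reports the whitespace between the lexemes as tokens of its own.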

Page 12: Lexing and parsing

LEXER GENERATORS

DO NOT WRITE THIS BY HAND

Famous

• lex

• flex

• re2c

• ANTLR

• DFASTAR

• jflex

• jlex

• quex

PHP generators

• https://github.com/oliverheins/PHPSimpleLexYacc

• lex syntax

• https://github.com/pear/PHP_LexerGenerator

• re2c syntax

• https://github.com/wez/JLexPHP

• jlex syntax

• token_get_all (see php-parser)

• parse_ini_file/string (combined with parser)

Page 13: Lexing and parsing

RE2C

Page 14: Lexing and parsing

IN PHP LAND

Page 15: Lexing and parsing

SYNTACTIC ANALYSIS

CONSTRUCTING SOMETHING BASED ON A GRAMMAR

PARSING

Page 16: Lexing and parsing

THE PARSING PROCESS

• Tokens come in

• Magic

• Data structure comes out

• parse tree

• AST

Page 17: Lexing and parsing

GRAMMAR (FORMAL OF COURSE)

• Brave men run in my family.

• I can't recommend this book too highly.

• Prostitutes Appeal to Pope

• I had had my car for four years before I ever learned to drive it.

Page 18: Lexing and parsing

TYPES OF PARSERS

• Top Down

• Recursive Descent

• LL (left to right, leftmost derivation)

• Earley parser

• Bottom Up

• Precedence parser

• Operator-precedence parser

• Simple precedence parser

• BC (bounded context) parsing

• LR parser (Left-to-right, Rightmost derivation)

• Simple LR (SLR) parser

• LALR parser

• Canonical LR (LR(1)) parser

• GLR parser

• CYK parser

• Recursive ascent parser
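Of the styles listed, recursive descent is the easiest to write by hand: one function per grammar rule, each calling the others. The Python sketch below is an illustration only; the arithmetic grammar, the `Parser` class and the tuple-shaped tree are all assumptions chosen for brevity, and the toy tokenizer silently skips characters it does not recognise.

```python
import re


def tokenize(text):
    # Toy tokenizer: integers and a handful of operators.
    return re.findall(r"\d+|[-+*/()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected or 'a token'}, got {token!r}")
        self.pos += 1
        return token

    # expr := term (("+" | "-") term)*
    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            node = (self.take(), node, self.term())
        return node

    # term := factor (("*" | "/") factor)*
    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            node = (self.take(), node, self.factor())
        return node

    # factor := NUMBER | "(" expr ")"
    def factor(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            self.take(")")
            return node
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected {token!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected {parser.peek()!r}")
    return tree


print(parse("1 + 2 * 3"))
print(parse("(1 + 2) * 3"))
```

Operator precedence falls out of which rule calls which: `expr` calls `term`, so multiplication binds tighter than addition.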

Page 19: Lexing and parsing

SENTENCE DIAGRAMMING

• People who live in glass houses shouldn't throw stones.

Page 20: Lexing and parsing

PARSE TREE

Page 21: Lexing and parsing

TOP DOWN VS. BOTTOM UP PARSING

Page 22: Lexing and parsing

PARSE TREES

• Constituency-based parse trees

• Dependency-based parse trees

Page 23: Lexing and parsing

AST

• Not everything appears

• additional information may be applied

• can “improve” tree nodes

• PHP is getting one!
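The difference between a parse tree and an AST can be shown by converting one into the other. This Python sketch uses a made-up parse tree for `$y = 5;` and a hypothetical `to_ast` function; the node names are assumptions and do not match PHP's real AST.

```python
# A parse tree keeps every piece of the input, punctuation included.
parse_tree = (
    "assign",
    ("variable", "$y"),
    ("punct", "="),
    ("expr", ("number", "5")),
    ("punct", ";"),
)


def to_ast(node):
    kind, *children = node
    if kind == "punct":
        return None  # not everything appears in the AST
    if kind == "number":
        return ("number", int(children[0]))  # "improve" the node: text becomes an int
    if all(isinstance(child, str) for child in children):
        return node  # other leaves are kept as they are
    kept = [child for child in map(to_ast, children) if child is not None]
    if kind == "expr" and len(kept) == 1:
        return kept[0]  # drop a wrapper that adds no information
    return (kind, *kept)


print(to_ast(parse_tree))
```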

Page 24: Lexing and parsing

LALR(K)

• Look ahead prevents “ambiguous” parsing

• I have one token, what token comes next?
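The idea of lookahead, stripped of the table machinery an LALR generator builds around it, is just peeking at the next token before committing to a rule. The Python sketch below is a simplification and not an LALR parser; the function `classify` and the rule names are hypothetical.

```python
def classify(tokens):
    """Pick a rule for a statement that starts with a name, using one token of lookahead."""
    first = tokens[0]
    lookahead = tokens[1] if len(tokens) > 1 else None
    if not first.isidentifier():
        raise SyntaxError(f"expected a name, got {first!r}")
    if lookahead == "=":
        return "assignment"  # name = ...
    if lookahead == "(":
        return "call"  # name(...)
    return "expression"  # a bare name


print(classify(["total", "=", "5"]))
print(classify(["total", "(", "5", ")"]))
```

Looking at `total` alone, both rules are still possible; the one extra token settles it.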

Page 25: Lexing and parsing

PARSER GENERATORS

Famous

• bison

• bison

• bison

• bison

• yacc

• lemon

• ANTLR

PHP versions

• https://github.com/wez/lemon-php

• https://github.com/pear/PHP_ParserGenerator

• lemon

• https://github.com/scato/phpeg

• peg (peg.js)

• https://github.com/jakubkulhan/pacc

• yacc

Page 26: Lexing and parsing

BISON

• Generates LALR (or GLR) parsers

• Code in C, C++ or Java

• reentrant with %define api.pure set

• used by ALL THE THINGS

• PHP

• Ruby

• PostgreSQL

• Go

Page 27: Lexing and parsing

BISON IN C

Page 28: Lexing and parsing

LEMON

• Generates LALR(1) parser

• reentrant AND thread safe

• non-terminal destructor (leak avoidance)

• pull parsing

• SQLite

Page 29: Lexing and parsing

PHP LEMON

Page 30: Lexing and parsing

REENTRANT VS THREAD SAFE

• Process

• Thread

• Locking

• Scope

• Reentrant
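What reentrancy buys you is easiest to see when one scan starts while another is still in progress, for example when handling an include. In the Python sketch below all names are hypothetical. The first scanner keeps its state in globals and loses its place; the second keeps its state in an object and does not. Reentrant is still not the same as thread safe: one `Scanner` object shared between threads would need locking.

```python
# Non-reentrant: the scanner's state lives in module-level globals.
_words = []
_pos = 0


def global_start(text):
    global _words, _pos
    _words, _pos = text.split(), 0


def global_next():
    global _pos
    if _pos >= len(_words):
        return None
    word = _words[_pos]
    _pos += 1
    return word


class Scanner:
    """Reentrant: each scan carries its own state."""

    def __init__(self, text):
        self.words = text.split()
        self.pos = 0

    def next(self):
        if self.pos >= len(self.words):
            return None
        word = self.words[self.pos]
        self.pos += 1
        return word


def nested_scan_with_globals():
    global_start("a b")
    seen = [global_next()]
    global_start("x y")  # a nested scan starts and overwrites the state
    seen.append(global_next())
    seen.append(global_next())  # the outer scan resumes, but its place is gone
    return seen


def nested_scan_with_objects():
    outer = Scanner("a b")
    seen = [outer.next()]
    inner = Scanner("x y")  # the nested scan has its own state
    seen.append(inner.next())
    seen.append(outer.next())  # the outer scan picks up where it left off
    return seen


print(nested_scan_with_globals())
print(nested_scan_with_objects())
```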

Page 31: Lexing and parsing

COMPILE IT

• transforms a programming language into something a computer can run later (machine code, bytecode, another language)

Page 32: Lexing and parsing

INTERPRET IT

• directly executes the program, without writing out a translated version first
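The two options can be compared on the same tiny tree. This Python sketch assumes an AST made of nested tuples such as `("+", 1, ("*", 2, 3))` and an invented stack-machine instruction set (`PUSH`, `ADD`, `SUB`, `MUL`); neither comes from a real compiler.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
NAMES = {"+": "ADD", "-": "SUB", "*": "MUL"}


def interpret(node):
    """Walk the tree and compute the result right away."""
    if isinstance(node, int):
        return node
    op, left, right = node
    return OPS[op](interpret(left), interpret(right))


def compile_node(node):
    """Walk the tree and write out instructions for something else to run."""
    if isinstance(node, int):
        return [f"PUSH {node}"]
    op, left, right = node
    return compile_node(left) + compile_node(right) + [NAMES[op]]


ast = ("+", 1, ("*", 2, 3))
print(interpret(ast))
print(compile_node(ast))
```

Both functions walk the tree in the same order; the only difference is whether each step is carried out now or written down for later.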

Page 33: Lexing and parsing

PROFIT

Page 34: Lexing and parsing

UNDER THE HOOD

WHAT USES THIS STUFF?

Page 35: Lexing and parsing

PHP

re2c + Bison + these crazy opcodes…

Page 36: Lexing and parsing

LALR(1) WRITTEN BY HAND

How - pythonic

Page 37: Lexing and parsing

HHVM

Flex and Bison and JIT – OH MY!

Page 38: Lexing and parsing

SQLITE

Lemon is tasty!

Page 39: Lexing and parsing

WRITING PARSERS AND LEXERS

THEORIES OF CODING

Page 40: Lexing and parsing

STEP 1: THINK SMALL

• Writing a general purpose parser is hard – that’s why you use PHP

• Writing a single purpose parser is much easier

• markup text (markdown)

• configuration or definition files (behat/gherkin syntax)

• complex validation (addresses in multiple formats)

Page 41: Lexing and parsing

STEP 2: SEPARATE AND UNOPTIMIZED

• premature optimization yada yada

• combine after it’s ready to be used (or not at all if you’ll need to change it later)

• lexer and parser each have unique, well defined goals

• the ability to potentially switch parser styles later will help you!

Page 42: Lexing and parsing

STEP 3: LEXER

• the lexer's job is to recognize tokens

• it can do this via a giant switch statement of doom

• or maybe a giant loop

• or maybe a list of goto statements

• or maybe a complex class with methods

• … or you can just use a generator

Page 43: Lexing and parsing

LET’S BREAK THAT DOWN

1. Define a token format

2. Define grammar format (what are we looking for?)

3. Go over the input data (usually a string) and make matches

1. compare or regex or ctype_* or however it makes sense

4. Keep track of your current state

5. Have an output format – AST, tree, whatever
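The five steps above map almost one to one onto a small hand-written lexer. The sketch below is in Python and targets a toy bbcode dialect; the `Token` class, the rule names `OPEN`, `CLOSE` and `TEXT`, and the regular expressions are assumptions made for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):  # 1. a token format
    type: str
    value: str


# 2. what are we looking for? Rules are tried in order.
RULES = [
    ("OPEN", re.compile(r"\[([a-z]+)\]")),
    ("CLOSE", re.compile(r"\[/([a-z]+)\]")),
    ("TEXT", re.compile(r"[^\[]+|\[")),  # plain text, or a "[" that starts no tag
]


def lex(source: str) -> List[Token]:
    tokens = []
    pos = 0  # 4. current state: how far into the input we are
    while pos < len(source):  # 3. go over the input and make matches
        for kind, pattern in RULES:
            match = pattern.match(source, pos)
            if match:
                value = match.group(1) if match.groups() else match.group()
                tokens.append(Token(kind, value))
                pos = match.end()
                break
        else:
            # Not reachable with these rules; kept as a guard for new ones.
            raise ValueError(f"cannot lex input at position {pos}")
    return tokens  # 5. output format: a flat list of tokens


for token in lex("[b]hi [i]there[/i][/b]"):
    print(token)
```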

Page 44: Lexing and parsing

STEP 4: PARSER

• Loop over our tokens

• Look at the values and decide what to do
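For a nested format such as bbcode, "decide what to do" usually means pushing and popping a stack. In this Python sketch the token shape `(type, value)`, the type names and the dictionary-based tree are all assumptions; it expects the tokens to have been produced by some earlier lexing step.

```python
def parse(tokens):
    """Turn a flat list of (type, value) tokens into a nested tree."""
    root = {"tag": "root", "children": []}
    stack = [root]  # the innermost open tag is always on top
    for kind, value in tokens:
        current = stack[-1]
        if kind == "TEXT":
            current["children"].append(value)
        elif kind == "OPEN":
            node = {"tag": value, "children": []}
            current["children"].append(node)
            stack.append(node)
        elif kind == "CLOSE":
            if len(stack) == 1 or current["tag"] != value:
                raise SyntaxError(f"unexpected [/{value}]")
            stack.pop()
        else:
            raise ValueError(f"unknown token type {kind!r}")
    if len(stack) > 1:
        raise SyntaxError(f"unclosed [{stack[-1]['tag']}]")
    return root


tokens = [
    ("OPEN", "b"), ("TEXT", "hi "), ("OPEN", "i"),
    ("TEXT", "there"), ("CLOSE", "i"), ("CLOSE", "b"),
]
print(parse(tokens))
```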

Page 45: Lexing and parsing

STEP 5: DO SOMETHING WITH IT!

1. Compile – write out to something that can be run (html)

2. Interpret – run through another program to get output (templates to html)

3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)

4. Validate – check for proper “spelling and grammar”

5. ???

6. PROFIT
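As an example of the first option, compiling to HTML, here is a Python sketch that walks a small tree and writes markup. The tree shape (dictionaries with `tag` and `children`), the `ALLOWED` mapping and the function `to_html` are hypothetical and chosen only to keep the example short.

```python
from html import escape

# Hypothetical mapping from bbcode tags to HTML elements.
ALLOWED = {"b": "strong", "i": "em"}


def to_html(node):
    if isinstance(node, str):
        return escape(node)  # plain text is escaped, never trusted
    inner = "".join(to_html(child) for child in node["children"])
    if node["tag"] == "root":
        return inner
    element = ALLOWED.get(node["tag"])
    if element is None:
        # Unknown tag: show its original markup as text instead of guessing.
        tag = node["tag"]
        return escape(f"[{tag}]") + inner + escape(f"[/{tag}]")
    return f"<{element}>{inner}</{element}>"


tree = {
    "tag": "root",
    "children": [
        {"tag": "b", "children": ["1 < 2 ", {"tag": "i", "children": ["really"]}]}
    ],
}
print(to_html(tree))
```

Swapping `to_html` for a function that counts tags or checks nesting rules would turn the same tree walk into the "analyze" or "validate" option.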

Page 46: Lexing and parsing

“If you’re not sure how to do a job – ask!”

- silly poster on my laundry room wall

Page 48: Lexing and parsing

CONTACT ME

[email protected]

• auroraeosrose – freenode.net #phpmentoring #phpwomen

• Twitter - @auroraeosrose

• http://github.com/auroraeosrose