lexing and parsing
DESCRIPTION
Beginner's guide to lexing and parsing for PHP developers, given at ZendCon 2014
TRANSCRIPT
LEXING AND PARSING
THE BEGINNER’S GUIDE
WHY ARE WE DOING THIS?
• bbcode
• html
• xml
• programming language
BUT I CAN JUST REGEX
• sometimes you can
• sometimes you can’t
• is your html well formed? (view source some time)
• it depends!!
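A quick sketch of the "sometimes you can, sometimes you can't" point, assuming PHP 7+ and a made-up helper name (quoteToHtml). A simple, non-recursive pattern handles flat bbcode fine, but it pairs the first opening tag with the first closing tag, so nested tags come out wrong:

```php
<?php
// Hypothetical helper: turn [quote]...[/quote] into <blockquote> with one regex.
function quoteToHtml(string $text): string
{
    // Lazy match: stops at the FIRST closing tag it finds.
    return preg_replace(
        '~\[quote\](.*?)\[/quote\]~s',
        '<blockquote>$1</blockquote>',
        $text
    );
}

$flat   = quoteToHtml('[quote]hello[/quote]');
$nested = quoteToHtml('[quote]a[quote]b[/quote]c[/quote]');

echo $flat, "\n";   // <blockquote>hello</blockquote>
echo $nested, "\n"; // <blockquote>a[quote]b</blockquote>c[/quote]
```

The nested input leaves a stray [quote] inside the output and an unmatched [/quote] after it. That is the kind of input where a lexer and parser start to pay off.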
CHOMSKY HIERARCHY
COMPUTER SCIENCE
WE LIKE ACRONYMS AND WEIRD WORDS
ENGLISH IS HARD!
• tokenizer
• scanner
• lexer
• parser
• lexical analyzer
• syntactic analyzer
• formal grammar
LEXICAL ANALYSIS
BREAK DOWN INPUT INTO A SEQUENCE OF TOKENS
LEXING
SCANNING
• Finite State Machine
• Finds Lexemes
• Might backtrack
FINITE STATE MACHINE
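A minimal sketch of a scanner written as a finite state machine. The mini-language (integers and identifiers separated by whitespace), the state names and the function name scan are all assumptions made for this example. This one never needs to backtrack; real scanners sometimes do.

```php
<?php
// Finds lexemes by moving between three states: start, number, word.
function scan(string $input): array
{
    $state   = 'start'; // current state of the machine
    $lexeme  = '';      // characters collected for the current lexeme
    $lexemes = [];

    // A trailing space acts as a sentinel so the last lexeme gets flushed.
    foreach (str_split($input . ' ') as $char) {
        switch ($state) {
            case 'start':
                if (ctype_digit($char)) {
                    $state  = 'number';
                    $lexeme = $char;
                } elseif (ctype_alpha($char)) {
                    $state  = 'word';
                    $lexeme = $char;
                } elseif (!ctype_space($char)) {
                    throw new InvalidArgumentException("Unexpected character: $char");
                }
                break;

            case 'number':
                if (ctype_digit($char)) {
                    $lexeme .= $char;
                } elseif (ctype_space($char)) {
                    $lexemes[] = $lexeme; // lexeme complete
                    $state     = 'start';
                } else {
                    throw new InvalidArgumentException("Unexpected character: $char");
                }
                break;

            case 'word':
                if (ctype_alnum($char)) {
                    $lexeme .= $char;
                } elseif (ctype_space($char)) {
                    $lexemes[] = $lexeme; // lexeme complete
                    $state     = 'start';
                } else {
                    throw new InvalidArgumentException("Unexpected character: $char");
                }
                break;
        }
    }

    return $lexemes;
}

print_r(scan('total 42 x9')); // lexemes: total, 42, x9
```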
EVALUATOR
• looks at lexeme to get value
• lexeme + value = token
LEXING PHP - $y = 5;
• $y
• array(309, '$y', 1)
• =
• '=' (single-character tokens come back as plain strings)
• 5
• array(305, '5', 1)
• 309 == T_VARIABLE (numeric IDs differ between PHP versions)
• 305 == T_LNUMBER
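To see this on your own machine, here is a small sketch using token_get_all() and token_name() from PHP's tokenizer extension. It prints token names instead of numeric IDs, since the numbers are not the same in every PHP version. The 'CHAR' label and the $summary variable are just names picked for this example; the short list syntax assumes PHP 7.1+.

```php
<?php
// token_get_all() needs the opening tag to know it is looking at PHP code.
$tokens = token_get_all('<?php $y = 5;');

$summary = [];
foreach ($tokens as $token) {
    if (is_array($token)) {
        // Array tokens are [token id, lexeme, line number].
        [$id, $text] = $token;
        $summary[] = token_name($id) . ' ' . json_encode($text);
    } else {
        // Single-character tokens such as "=" and ";" are plain strings.
        $summary[] = 'CHAR ' . json_encode($token);
    }
}

echo implode("\n", $summary), "\n";
```

The output includes T_VARIABLE for $y, T_LNUMBER for 5, and the bare strings "=" and ";", plus the open tag and whitespace tokens in between.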
LEXER GENERATORS
DO NOT WRITE THIS BY HAND
Famous
• lex
• flex
• re2c
• ANTLR
• DFASTAR
• jflex
• jlex
• quex
PHP generators
• https://github.com/oliverheins/PHPSimpleLexYacc
• lex syntax
• https://github.com/pear/PHP_LexerGenerator
• re2c syntax
• https://github.com/wez/JLexPHP
• jlex syntax
• token_get_all (see php-parser)
• parse_ini_file/string (combined with parser)
RE2C
IN PHP LAND
SYNTACTIC ANALYSIS
CONSTRUCTING SOMETHING BASED ON A GRAMMAR
PARSING
THE PARSING PROCESS
• Tokens come in
• Magic
• Data structure comes out
• parse tree
• AST
GRAMMAR (FORMAL OF COURSE)
• “Brave men run in my family.”
• I can't recommend this book too highly.
• Prostitutes Appeal to Pope
• I had had my car for four years before I ever learned to drive it.
TYPES OF PARSERS
• Top Down
• Recursive Descent
• LL (left to right, leftmost derivation)
• Earley parser
• Bottom Up
• Precedence parser
• Operator-precedence parser
• Simple precedence parser
• BC (bounded context) parsing
• LR parser (Left-to-right, Rightmost derivation)
• Simple LR (SLR) parser
• LALR parser
• Canonical LR (LR(1)) parser
• GLR parser
• CYK parser
• Recursive ascent parser
SENTENCE DIAGRAMMING
• People who live in glass houses shouldn't throw stones.
PARSE TREE
TOP DOWN VS. BOTTOM UP PARSING
PARSE TREES
• Constituency-based parse trees
• Dependency-based parse trees
AST
• Not everything appears
• additional information may be applied
• can “improve” tree nodes
• PHP is getting one!
LALR(K)
• Look ahead prevents “ambiguous” parsing
• I have one token, what token comes next?
PARSER GENERATORS
Famous
• bison
• bison
• bison
• bison
• yacc
• lemon
• ANTLR
PHP versions
• https://github.com/wez/lemon-php
• https://github.com/pear/PHP_ParserGenerator
• lemon
• https://github.com/scato/phpeg
• peg (peg.js)
• https://github.com/jakubkulhan/pacc
• yacc
BISON
• Generates LALR (or GLR) parsers
• Code in C, C++ or Java
• reentrant with %define api.pure set
• used by ALL THE THINGS
• PHP
• Ruby
• PostgreSQL
• Go
BISON IN C
LEMON
• Generates LALR(1) parser
• reentrant AND thread safe
• non-terminal destructor (leak avoidance)
• pull parsing
• sqlite
PHP LEMON
REENTRANT VS THREAD SAFE
• Process
• Thread
• Locking
• Scope
• Reentrant
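A small sketch of what reentrant means in practice, assuming PHP 7+. The class name WordTokenizer is made up for this example. All state lives in the object, so two tokenizers can be interleaved without stepping on each other. Compare PHP's built-in strtok(), which remembers its position between calls, so starting a second string loses your place in the first.

```php
<?php
// Reentrant word tokenizer: no globals, no static variables.
final class WordTokenizer
{
    private $words;
    private $position = 0;

    public function __construct(string $input)
    {
        $this->words = preg_split('/\s+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Returns the next word, or null when the input is used up.
    public function next(): ?string
    {
        $word = $this->words[$this->position] ?? null;
        if ($word !== null) {
            $this->position++;
        }
        return $word;
    }
}

$a = new WordTokenizer('one two');
$b = new WordTokenizer('red blue');

// Interleaved calls: each instance keeps its own position.
$seen = [$a->next(), $b->next(), $a->next(), $b->next(), $a->next()];
var_dump($seen); // one, red, two, blue, NULL
```

Reentrant is not the same as thread safe: sharing one instance between threads would still need locking.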
COMPILE IT
• transforms a programming language into a language the computer can run (machine code or bytecode)
INTERPRET IT
• directly executes the programming language, without a separate compile step
PROFIT
UNDER THE HOOD
WHAT USES THIS STUFF?
PHP
RE2C + Bison + these crazy opcodes….
LALR(1) WRITTEN BY HAND
How - pythonic
HHVM
Flex and Bison and JIT – OH MY!
SQLITE
Lemon is tasty!
WRITING PARSERS AND LEXERS
THEORIES OF CODING
STEP 1: THINK SMALL
• Writing a general purpose parser is hard – that’s why you use PHP
• Writing a single purpose parser is much easier
• markup text (markdown)
• configuration or definition files (behat/gherkin syntax)
• complex validation (addresses in multiple formats)
STEP 2: SEPARATE AND UNOPTIMIZED
• premature optimization yada yada
• combine after it’s ready to be used (or not at all if you’ll need to change it later)
• lexer and parser each have unique, well defined goals
• the ability to potentially switch parser styles later will help you!
STEP 3: LEXER
• the lexer's job is to recognize tokens
• it can do this via a giant switch statement of doom
• or maybe a giant loop
• or maybe a list of goto statements
• or maybe a complex class with methods
• …. or you can just use a generator
LET’S BREAK THAT DOWN
1. Define a token format
2. Define grammar format (what are we looking for?)
3. Go over the input data (usually a string) and make matches
1. compare or regex or ctype_* or however it makes sense
4. Keep track of your current state
5. Have an output format – AST, tree, whatever
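The steps above can be sketched as a small hand-written lexer, assuming PHP 7+. The token format, the token type names and the function name lex are choices made for this example (a tiny arithmetic language), not a standard. For a lexer the output format is simply a list of tokens.

```php
<?php
function lex(string $input): array
{
    // Step 2: what are we looking for? One pattern per token type.
    $patterns = [
        'NUMBER' => '\d+',
        'OP'     => '[-+*/]',
        'LPAREN' => '\(',
        'RPAREN' => '\)',
        'SPACE'  => '\s+',
    ];

    $tokens = [];
    $offset = 0; // Step 4: our state is the current position in the input
    $length = strlen($input);

    // Step 3: go over the input and make matches.
    while ($offset < $length) {
        foreach ($patterns as $type => $pattern) {
            // The A modifier anchors the match at $offset.
            if (preg_match('~' . $pattern . '~A', $input, $m, 0, $offset)) {
                if ($type !== 'SPACE') {
                    // Step 1: token format is type + value.
                    $tokens[] = ['type' => $type, 'value' => $m[0]];
                }
                $offset += strlen($m[0]);
                continue 2; // matched, go back to the while loop
            }
        }
        throw new RuntimeException("Unexpected character at offset $offset");
    }

    // Step 5: output format is a flat list of tokens.
    return $tokens;
}

$tokens = lex('2 + 3 * (4 - 1)');
echo count($tokens), " tokens\n"; // 9 tokens
```

Trying every pattern at every position is slow but easy to read and change, which fits "separate and unoptimized".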
STEP 4: PARSER
• Loop over our tokens
• Look at the values and decide what to do
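A minimal sketch of "loop over tokens and decide what to do" as a recursive descent (top down) parser, assuming PHP 7+. The class name ExprParser, the token format and the array-based AST are assumptions for this example. The tokens are hardcoded so the sketch runs on its own.

```php
<?php
// Grammar assumed for this sketch:
//   expr   := term (('+' | '-') term)*
//   term   := factor (('*' | '/') factor)*
//   factor := NUMBER | '(' expr ')'
final class ExprParser
{
    private $tokens;
    private $pos = 0;

    public function __construct(array $tokens)
    {
        $this->tokens = $tokens;
    }

    public function parse(): array
    {
        $ast = $this->expr();
        if ($this->peek() !== null) {
            throw new RuntimeException('Unexpected token after expression');
        }
        return $ast;
    }

    // Look at the current token without consuming it.
    private function peek(): ?array
    {
        return $this->tokens[$this->pos] ?? null;
    }

    private function nextIsOp(array $ops): bool
    {
        $t = $this->peek();
        return $t !== null && $t['type'] === 'OP' && in_array($t['value'], $ops, true);
    }

    private function expr(): array
    {
        $node = $this->term();
        while ($this->nextIsOp(['+', '-'])) {
            $op   = $this->tokens[$this->pos++]['value'];
            $node = ['op' => $op, 'left' => $node, 'right' => $this->term()];
        }
        return $node;
    }

    private function term(): array
    {
        $node = $this->factor();
        while ($this->nextIsOp(['*', '/'])) {
            $op   = $this->tokens[$this->pos++]['value'];
            $node = ['op' => $op, 'left' => $node, 'right' => $this->factor()];
        }
        return $node;
    }

    private function factor(): array
    {
        $t = $this->peek();
        if ($t === null) {
            throw new RuntimeException('Unexpected end of input');
        }
        $this->pos++;
        if ($t['type'] === 'NUMBER') {
            return ['num' => (int) $t['value']];
        }
        if ($t['type'] === 'LPAREN') {
            $node  = $this->expr();
            $close = $this->peek();
            if ($close === null || $close['type'] !== 'RPAREN') {
                throw new RuntimeException('Expected closing parenthesis');
            }
            $this->pos++;
            return $node; // parentheses do not appear in the AST
        }
        throw new RuntimeException('Unexpected token: ' . $t['value']);
    }
}

// Tokens for "2 + 3 * 4".
$tokens = [
    ['type' => 'NUMBER', 'value' => '2'],
    ['type' => 'OP', 'value' => '+'],
    ['type' => 'NUMBER', 'value' => '3'],
    ['type' => 'OP', 'value' => '*'],
    ['type' => 'NUMBER', 'value' => '4'],
];
$ast = (new ExprParser($tokens))->parse();
print_r($ast);
```

The multiplication ends up deeper in the tree than the addition, so operator precedence is decided by the grammar, not by the code that walks the tree later.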
STEP 5: DO SOMETHING WITH IT!
1. Compile – write out to something that can be run (html)
2. Interpret – run through another program to get output (templates to html)
3. Analyze – run through to analyze the data inside (code analysis/sniffer tools)
4. Validate – check for proper “spelling and grammar”
5. ???
6. PROFIT
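Two of these options sketched on a tiny AST, assuming PHP 7+. The array-based node format and the function names evaluate and toPostfix are made up for this example. The first function interprets the tree directly; the second "compiles" it to another notation (postfix).

```php
<?php
// Hand-built AST for "2 + 3 * 4".
$ast = [
    'op'    => '+',
    'left'  => ['num' => 2],
    'right' => ['op' => '*', 'left' => ['num' => 3], 'right' => ['num' => 4]],
];

// Interpret: walk the tree and compute the value.
function evaluate(array $node)
{
    if (array_key_exists('num', $node)) {
        return $node['num'];
    }
    $left  = evaluate($node['left']);
    $right = evaluate($node['right']);
    switch ($node['op']) {
        case '+': return $left + $right;
        case '-': return $left - $right;
        case '*': return $left * $right;
        case '/': return $left / $right;
    }
    throw new RuntimeException('Unknown operator: ' . $node['op']);
}

// Compile: walk the same tree and write it out in postfix notation.
function toPostfix(array $node): string
{
    if (array_key_exists('num', $node)) {
        return (string) $node['num'];
    }
    return toPostfix($node['left']) . ' ' . toPostfix($node['right']) . ' ' . $node['op'];
}

echo evaluate($ast), "\n";  // 14
echo toPostfix($ast), "\n"; // 2 3 4 * +
```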
“If you’re not sure how to do a job – ask!”
- silly poster on my laundry room wall
RESOURCES
• http://savage.net.au/Ron/html/graphviz2.marpa/Lexing.and.Parsing.Overview.html
• http://nikic.github.io/2011/10/23/Improving-lexing-performance-in-PHP.html
• https://github.com/hafriedlander/php-peg
• https://github.com/nikic/PHP-Parser/
• http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
• http://wikipedia.org
CONTACT ME
• auroraeosrose – freenode.net #phpmentoring #phpwomen
• Twitter - @auroraeosrose
• http://github.com/auroraeosrose