TRANSCRIPT
Text Parsing in Python
- Gayatri Nittala
- Madhubala Vasireddy
Text Parsing
►The three W's!
►Efficiency and Perfection
What is Text Parsing?
►a common programming task
►extract or split a sequence of characters
Why Text Parsing?
►Simple file parsing
  A tab separated file
►Data extraction
  Extract specific information from a log file
►Find and replace
►Parsers - syntactic analysis
►NLP
  Extract information from a corpus
  POS tagging
Text Parsing Methods
►String Functions
►Regular Expressions
►Parsers
String Functions
►String methods in python
  Faster, easier to understand and maintain
►If you can do, DO IT!
►Different built-in functions
  Find-Replace
  Split-Join
  Startswith and Endswith
  Is methods
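A minimal sketch of the Split-Join pair (the record contents are invented for illustration; Python 3 syntax):

```python
# A hypothetical tab-separated record; the field values are made up
record = "alice\t42\tadmin"

fields = record.split("\t")   # break the record into a list of fields
line = ",".join(fields)       # glue the fields back with a new separator

print(fields)  # ['alice', '42', 'admin']
print(line)    # alice,42,admin
```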
Find and Replace
►find, index, rindex, replace
►EX: Replace a string in all files in a directory

import glob, fileinput, sys

# path, stext and rtext are assumed to be defined elsewhere
files = glob.glob(path)
for line in fileinput.input(files, inplace=1):
    # find() returns -1 when absent; test >= 0 so a match
    # at column 0 is not missed
    if line.find(stext) >= 0:
        line = line.replace(stext, rtext)
    sys.stdout.write(line)
startswith and endswith
►Extract quoted words from the given text

myString = "\"123\""
if myString.startswith("\""):
    print "string with double quotes"
►Find if the sentences are interrogative or exclamative
  What an amazing game that was!
  Do you like this?

endings = ('!', '?')
sentence.endswith(endings)
isMethods
►to check alphabets, numerals, character case etc.

m = 'xxxasdf '
m.isalpha()   # False - the trailing space is not alphabetic
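A few more of the is-methods in action (examples invented for illustration; Python 3 syntax):

```python
print("abc".isalpha())    # True - letters only
print("123".isdigit())    # True - digits only
print("abc1".isalpha())   # False - contains a digit
print("ABC".isupper())    # True - every cased character is upper case
```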
Regular Expressions
►concise way for complex patterns
►amazingly powerful
►wide variety of operations
►when you go beyond simple, think about regular expressions!
Real world problems
►Match IP addresses, email addresses, URLs
►Match balanced sets of parentheses
►Substitute words
►Tokenize
►Validate
►Count
►Delete duplicates
►Natural Language Processing
RE in Python
►Unleash the power - built-in re module
►Functions
  to compile patterns
    compile
  to perform matches
    match, search, findall, finditer
  to perform operations on a match object
    group, start, end, span
  to substitute
    sub, subn
►Metacharacters
Compiling patterns
►re.compile()
►pattern for IP Address

^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
^\d+\.\d+\.\d+\.\d+$
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
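To see why the longer octet pattern is worth the trouble, a quick comparison of the naive and strict versions (test addresses invented; Python 3 syntax):

```python
import re

# Naive: any 1-3 digits per octet, so 999 slips through
naive_pat = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')

# Strict: each octet limited to 0-255, as in the last pattern above
octet = r'([01]?\d\d?|2[0-4]\d|25[0-5])'
ip_pat = re.compile(r'^%s\.%s\.%s\.%s$' % (octet, octet, octet, octet))

print(bool(naive_pat.match('999.1.1.1')))   # True - naive accepts junk
print(bool(ip_pat.match('999.1.1.1')))      # False - strict rejects it
print(bool(ip_pat.match('192.168.0.1')))    # True
```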
Compiling patterns
►pattern for matching parentheses

\(.*\)
\([^)]*\)
\([^()]*\)
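The difference between the three parenthesis patterns shows up on strings with more than one pair (example strings invented; Python 3 syntax):

```python
import re

s = 'f(a) + g(b)'
print(re.findall(r'\(.*\)', s))     # ['(a) + g(b)'] - greedy .* spans both pairs
print(re.findall(r'\([^)]*\)', s))  # ['(a)', '(b)'] - stops at the first ')'

# On nested input, excluding '(' as well picks out the innermost pair
print(re.findall(r'\([^()]*\)', 'f(g(x))'))  # ['(x)']
```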
Substitute
►Perform several string substitutions on a given string

import re

def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate
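How the recipe is meant to be used - a self-contained sketch, with the function repeated in condensed form so the snippet runs on its own (substitution map invented; Python 3 syntax):

```python
import re

def make_xlat(*args, **kwargs):
    # one regex matching any key; every hit is replaced in a single
    # pass, so a replacement value is never itself re-substituted
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def xlate(text):
        return rx.sub(lambda m: adict[m.group(0)], text)
    return xlate

swap = make_xlat({"cat": "dog", "dog": "cat"})
print(swap("the cat chased the dog"))  # the dog chased the cat
```

The single-pass behaviour is the point of the recipe: chaining plain replace() calls here would first turn "cat" into "dog" and then turn that "dog" back into "cat".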
Count
►Split and count words in the given text

p = re.compile(r'\W+')
len(p.split('This is a test for split().'))
Tokenize
►Parsing and Natural Language Processing

s = 'tokenize these words'
words = re.compile(r'\b\w+\b|\$')
words.findall(s)
['tokenize', 'these', 'words']
Common Pitfalls
►operations on fixed strings, a single character class, no case-sensitivity issues - plain string functions suffice
►re.sub() and string.replace()
►re.sub() and string.translate()
►match vs. search
►greedy vs. non-greedy
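Two of these pitfalls can be shown in a couple of lines (Python 3 syntax):

```python
import re

# match anchors at the start of the string; search scans the whole string
print(bool(re.match('world', 'hello world')))   # False
print(bool(re.search('world', 'hello world')))  # True

# greedy .* grabs the longest match; non-greedy .*? the shortest
print(re.findall(r'<.*>', '<a><b>'))    # ['<a><b>']
print(re.findall(r'<.*?>', '<a><b>'))   # ['<a>', '<b>']
```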
PARSERS
►Flat and Nested texts
►Nested tags, Programming language constructs
►Better to do less than to do more!
Parsing Non flat texts
►Grammar
►States
►Generate tokens and Act on them
►Lexer - generates a stream of tokens
►Parser - generates a parse tree out of the tokens
►Lex and Yacc
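The lexer idea - turning raw text into a stream of (name, value) tokens - can be sketched without any library; the token names below are invented for illustration (Python 3 syntax):

```python
import re

# Each alternative becomes a named group; finditer yields one
# token per match, and m.lastgroup tells us which rule fired.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[A-Za-z_]\w*'),
    ('OP',     r'[+\-*/=]'),
    ('SKIP',   r'\s+'),
]
master = re.compile('|'.join('(?P<%s>%s)' % p for p in TOKEN_SPEC))

def tokenize(text):
    for m in master.finditer(text):
        if m.lastgroup != 'SKIP':     # drop whitespace tokens
            yield (m.lastgroup, m.group())

print(list(tokenize('x = 3 + 41')))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '41')]
```

A real parser would then consume this stream and build a tree according to the grammar; that is the part Lex and Yacc automate.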
Grammar Vs RE
►Floating Point

#---- EBNF-style description of Python ----#
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
Grammar Vs RE

pat = r'''(?x)
    (                          # exponentfloat
        (                      # intpart or pointfloat
            (                  # pointfloat
                (\d+)?[.]\d+   # optional intpart with fraction
                |
                \d+[.]         # intpart with period
            )                  # end pointfloat
            |
            \d+                # intpart
        )                      # end intpart or pointfloat
        [eE][+-]?\d+           # exponent
    )                          # end exponentfloat
    |
    (                          # pointfloat
        (\d+)?[.]\d+           # optional intpart with fraction
        |
        \d+[.]                 # intpart with period
    )                          # end pointfloat
    '''
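The float pattern can be exercised with fullmatch; a condensed one-line equivalent is used below so the snippet stands alone (requires Python 3.4+ for fullmatch; test strings invented):

```python
import re

# Condensed equivalent of the verbose pattern:
# exponentfloat = (pointfloat | intpart) exponent, or a bare pointfloat
floatpat = re.compile(
    r'((\d+)?[.]\d+|\d+[.]|\d+)[eE][+-]?\d+'   # exponentfloat
    r'|(\d+)?[.]\d+|\d+[.]'                    # pointfloat
)

for s in ('3.14', '10.', '.5', '1e10', '1.5e-3', '42'):
    print(s, bool(floatpat.fullmatch(s)))
# only '42' fails: a plain integer is not a floatnumber in the grammar
```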
PLY - The Python Lex and Yacc
►higher-level and cleaner grammar language
►LALR(1) parsing
►extensive input validation, error reporting, and diagnostics
►Two modules: lex.py and yacc.py
Using PLY - Lex and Yacc
►Lex:
►Import the lex module
►Define a list or tuple variable 'tokens' naming the tokens the lexer is allowed to produce
►Define tokens - by assigning to a specially named variable ('t_tokenName')
►Build the lexer

mylexer = lex.lex()
mylexer.input(mytext)  # handled by yacc
Lex

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Integer value too large", t.value
        t.value = 0
    return t

t_ignore = " \t"
Yacc
►Import the yacc module
►Get a token map from a lexer
►Define a collection of grammar rules
►Build the parser

yacc.yacc()
yacc.parse('x=3')
Yacc
►Specially named functions having a 'p_' prefix

def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print p[1]
Summary
►String Functions
  A thumb rule - if you can do, do it.
►Regular Expressions
  Complex patterns - something beyond simple!
►Lex and Yacc
  Parse non flat texts - that follow some rules
References
►http://docs.python.org/
►http://code.activestate.com/recipes/langs/python/
►http://www.regular-expressions.info/
►http://www.dabeaz.com/ply/ply.html
►Mastering Regular Expressions by Jeffrey E. F. Friedl
►Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher
►Text Processing in Python by David Mertz
Thank You
Q & A