TRANSCRIPT
Text Parsing in Python
- Gayatri Nittala
- Madhubala Vasireddy
Text Parsing
►The three W's!
►Efficiency and Perfection
What is Text Parsing?
►a common programming task
►extract or split a sequence of characters
Why Text Parsing?
►Simple file parsing
  A tab separated file
►Data extraction
  Extract specific information from a log file
►Find and replace
►Parsers - syntactic analysis
►NLP
  Extract information from a corpus
  POS tagging
Text Parsing Methods
►String Functions
►Regular Expressions
►Parsers
String Functions
►String methods in python
  Faster, easier to understand and maintain
►If you can do, DO IT!
►Different built-in functions
  Find-Replace
  Split-Join
  Startswith and Endswith
  Is methods
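A minimal sketch of the Split-Join pair (the record contents are invented for illustration; Python 3 syntax):

```python
# A hypothetical tab-separated record; the field values are made up
record = "alice\t42\tadmin"

fields = record.split("\t")   # break the record into a list of fields
line = ",".join(fields)       # glue the fields back with a new separator

print(fields)  # ['alice', '42', 'admin']
print(line)    # alice,42,admin
```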
Find and Replace
►find, index, rindex, replace
►EX: Replace a string in all files in a directory

import glob, fileinput, sys

# path, stext and rtext are assumed to be defined elsewhere
files = glob.glob(path)
for line in fileinput.input(files, inplace=1):
    # find() returns -1 when absent; test >= 0 so a match
    # at column 0 is not missed
    if line.find(stext) >= 0:
        line = line.replace(stext, rtext)
    sys.stdout.write(line)
startswith and endswith
►Extract quoted words from the given text

myString = "\"123\""
if myString.startswith("\""):
    print "string with double quotes"
►Find if the sentences are interrogative or exclamative
  What an amazing game that was!
  Do you like this?

endings = ('!', '?')
sentence.endswith(endings)
isMethods
►to check alphabets, numerals, character case etc.

m = 'xxxasdf '
m.isalpha()   # False - the trailing space is not alphabetic
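A few more of the is-methods in action (examples invented for illustration; Python 3 syntax):

```python
print("abc".isalpha())    # True - letters only
print("123".isdigit())    # True - digits only
print("abc1".isalpha())   # False - contains a digit
print("ABC".isupper())    # True - every cased character is upper case
```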
Regular Expressions
►concise way for complex patterns
►amazingly powerful
►wide variety of operations
►when you go beyond simple, think about regular expressions!
Real world problems
►Match IP addresses, email addresses, URLs
►Match balanced sets of parentheses
►Substitute words
►Tokenize
►Validate
►Count
►Delete duplicates
►Natural Language Processing
RE in Python
►Unleash the power - built-in re module
►Functions
  to compile patterns
    compile
  to perform matches
    match, search, findall, finditer
  to perform operations on a match object
    group, start, end, span
  to substitute
    sub, subn
►Metacharacters
Compiling patterns
►re.compile()
►pattern for IP Address

^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
^\d+\.\d+\.\d+\.\d+$
^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
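To see why the longer octet pattern is worth the trouble, a quick comparison of the naive and strict versions (test addresses invented; Python 3 syntax):

```python
import re

# Naive: any 1-3 digits per octet, so 999 slips through
naive_pat = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')

# Strict: each octet limited to 0-255, as in the last pattern above
octet = r'([01]?\d\d?|2[0-4]\d|25[0-5])'
ip_pat = re.compile(r'^%s\.%s\.%s\.%s$' % (octet, octet, octet, octet))

print(bool(naive_pat.match('999.1.1.1')))   # True - naive accepts junk
print(bool(ip_pat.match('999.1.1.1')))      # False - strict rejects it
print(bool(ip_pat.match('192.168.0.1')))    # True
```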
Compiling patterns
►pattern for matching parentheses

\(.*\)
\([^)]*\)
\([^()]*\)
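The difference between the three parenthesis patterns shows up on strings with more than one pair (example strings invented; Python 3 syntax):

```python
import re

s = 'f(a) + g(b)'
print(re.findall(r'\(.*\)', s))     # ['(a) + g(b)'] - greedy .* spans both pairs
print(re.findall(r'\([^)]*\)', s))  # ['(a)', '(b)'] - stops at the first ')'

# On nested input, excluding '(' as well picks out the innermost pair
print(re.findall(r'\([^()]*\)', 'f(g(x))'))  # ['(x)']
```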
Substitute
►Perform several string substitutions on a given string

import re

def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate
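How the recipe is meant to be used - a self-contained sketch, with the function repeated in condensed form so the snippet runs on its own (substitution map invented; Python 3 syntax):

```python
import re

def make_xlat(*args, **kwargs):
    # one regex matching any key; every hit is replaced in a single
    # pass, so a replacement value is never itself re-substituted
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def xlate(text):
        return rx.sub(lambda m: adict[m.group(0)], text)
    return xlate

swap = make_xlat({"cat": "dog", "dog": "cat"})
print(swap("the cat chased the dog"))  # the dog chased the cat
```

The single-pass behaviour is the point of the recipe: chaining plain replace() calls here would first turn "cat" into "dog" and then turn that "dog" back into "cat".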
Count
►Split and count words in the given text

p = re.compile(r'\W+')
len(p.split('This is a test for split().'))
Tokenize
►Parsing and Natural Language Processing

s = 'tokenize these words'
words = re.compile(r'\b\w+\b|\$')
words.findall(s)
['tokenize', 'these', 'words']
Common Pitfalls
►operations on fixed strings, a single character class, no case-sensitivity issues - plain string functions suffice
►re.sub() and string.replace()
►re.sub() and string.translate()
►match vs. search
►greedy vs. non-greedy
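Two of these pitfalls can be shown in a couple of lines (Python 3 syntax):

```python
import re

# match anchors at the start of the string; search scans the whole string
print(bool(re.match('world', 'hello world')))   # False
print(bool(re.search('world', 'hello world')))  # True

# greedy .* grabs the longest match; non-greedy .*? the shortest
print(re.findall(r'<.*>', '<a><b>'))    # ['<a><b>']
print(re.findall(r'<.*?>', '<a><b>'))   # ['<a>', '<b>']
```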
PARSERS
►Flat and Nested texts
►Nested tags, Programming language constructs
►Better to do less than to do more!
Parsing Non flat texts
►Grammar
►States
►Generate tokens and Act on them
►Lexer - generates a stream of tokens
►Parser - generates a parse tree out of the tokens
►Lex and Yacc
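The lexer idea - turning raw text into a stream of (name, value) tokens - can be sketched without any library; the token names below are invented for illustration (Python 3 syntax):

```python
import re

# Each alternative becomes a named group; finditer yields one
# token per match, and m.lastgroup tells us which rule fired.
TOKEN_SPEC = [
    ('NUMBER', r'\d+'),
    ('NAME',   r'[A-Za-z_]\w*'),
    ('OP',     r'[+\-*/=]'),
    ('SKIP',   r'\s+'),
]
master = re.compile('|'.join('(?P<%s>%s)' % p for p in TOKEN_SPEC))

def tokenize(text):
    for m in master.finditer(text):
        if m.lastgroup != 'SKIP':     # drop whitespace tokens
            yield (m.lastgroup, m.group())

print(list(tokenize('x = 3 + 41')))
# [('NAME', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '41')]
```

A real parser would then consume this stream and build a tree according to the grammar; that is the part Lex and Yacc automate.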
Grammar Vs RE
►Floating Point

#---- EBNF-style description of Python ----#
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
Grammar Vs RE

pat = r'''(?x)
    (                          # exponentfloat
        (                      # intpart or pointfloat
            (                  # pointfloat
                (\d+)?[.]\d+   # optional intpart with fraction
                |
                \d+[.]         # intpart with period
            )                  # end pointfloat
            |
            \d+                # intpart
        )                      # end intpart or pointfloat
        [eE][+-]?\d+           # exponent
    )                          # end exponentfloat
    |
    (                          # pointfloat
        (\d+)?[.]\d+           # optional intpart with fraction
        |
        \d+[.]                 # intpart with period
    )                          # end pointfloat
    '''
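The float pattern can be exercised with fullmatch; a condensed one-line equivalent is used below so the snippet stands alone (requires Python 3.4+ for fullmatch; test strings invented):

```python
import re

# Condensed equivalent of the verbose pattern:
# exponentfloat = (pointfloat | intpart) exponent, or a bare pointfloat
floatpat = re.compile(
    r'((\d+)?[.]\d+|\d+[.]|\d+)[eE][+-]?\d+'   # exponentfloat
    r'|(\d+)?[.]\d+|\d+[.]'                    # pointfloat
)

for s in ('3.14', '10.', '.5', '1e10', '1.5e-3', '42'):
    print(s, bool(floatpat.fullmatch(s)))
# only '42' fails: a plain integer is not a floatnumber in the grammar
```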
PLY - The Python Lex and Yacc
►higher-level and cleaner grammar language
►LALR(1) parsing
►extensive input validation, error reporting, and diagnostics
►Two modules: lex.py and yacc.py
Using PLY - Lex and Yacc
►Lex:
►Import the lex module
►Define a list or tuple variable 'tokens' naming the tokens the lexer is allowed to produce
►Define tokens - by assigning to a specially named variable ('t_tokenName')
►Build the lexer

mylexer = lex.lex()
mylexer.input(mytext)  # handled by yacc
Lex

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Integer value too large", t.value
        t.value = 0
    return t

t_ignore = " \t"
Yacc
►Import the yacc module
►Get a token map from a lexer
►Define a collection of grammar rules
►Build the parser

yacc.yacc()
yacc.parse('x=3')
Yacc
►Specially named functions having a 'p_' prefix

def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print p[1]
Summary
►String Functions
  A thumb rule - if you can do, do it.
►Regular Expressions
  Complex patterns - something beyond simple!
►Lex and Yacc
  Parse non flat texts - that follow some rules
References
►http://docs.python.org/
►http://code.activestate.com/recipes/langs/python/
►http://www.regular-expressions.info/
►http://www.dabeaz.com/ply/ply.html
►Mastering Regular Expressions by Jeffrey E. F. Friedl
►Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher
►Text Processing in Python by David Mertz
Thank You
Q & A