Perl regular expressions: string matching
Post on 22-Dec-2015
Perl
regular expressions: string matching
string matching
• For this lecture, we focus on string matching using an if statement
• The format
  - if ($str =~ /pattern to match/) # true when there is a match
  - if ($str !~ /pattern to match/) # true when there is no match
  - the same as
  - if ($str =~ m/pattern to match/) # true when there is a match
  - if ($str !~ m/pattern to match/) # true when there is no match
simple matching
• Match a string or string variable
• if ($str =~ /dog/)
  - true if $str contains dog
• If $str and the =~ or !~ operator are left off, the match is applied to $_
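The three forms above can be sketched as follows. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $str = "my dog has fleas";

# =~ is true when the pattern is found anywhere in the string
my $has_dog = ($str =~ /dog/) ? 1 : 0;

# !~ is true when the pattern is NOT found
my $no_cat = ($str !~ /cat/) ? 1 : 0;

# With no variable and no =~, the match is applied to $_
my $default_hits = 0;
for ("hot dog", "cold cat") {
    $default_hits++ if /dog/;    # same as: $_ =~ /dog/
}

print "has_dog=$has_dog no_cat=$no_cat default_hits=$default_hits\n";
```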
case insensitive matching
• /i ignores case
• if ($str =~ /dog/i)
  - true if $str contains dog. The match is case insensitive.
  - if ($str =~ /DOG/i) # same
alternation matching
• | allows matching with an or
• if ($str =~ /Fred|Wilma|Pebbles/)
  - true if $str contains Fred, Wilma, or Pebbles
• if ($str =~ /Fred|Wilma|Pebbles Flintstone/)
  - matches Fred, Wilma, or Pebbles Flintstone
• Grouping
• if ($str =~ /(Fred|Wilma|Pebbles) Flintstone/)
  - matches Fred Flintstone, Wilma Flintstone, or Pebbles Flintstone
• if ($str =~ /(Blue|Song)bird/)
  - matches Bluebird or Songbird
alternation matching (2)
• if ($str =~ /th(is|at)/)
  - true if $str contains this or that
• if ($str =~ /(p|g|m|s|b)et/)
  - true if $str contains pet, get, met, set, or bet
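A small sketch of why the grouping matters. The test names are invented, and qr// is used here only as a convenient way to store the two patterns.

```perl
use strict;
use warnings;

# Without grouping, " Flintstone" belongs only to the last alternative
my $ungrouped = qr/Fred|Wilma|Pebbles Flintstone/;
# With grouping, every alternative must be followed by " Flintstone"
my $grouped = qr/(Fred|Wilma|Pebbles) Flintstone/;

my @results;
for my $name ("Fred Rubble", "Wilma Flintstone") {
    my $u = ($name =~ $ungrouped) ? "yes" : "no";
    my $g = ($name =~ $grouped)   ? "yes" : "no";
    push @results, "$name: ungrouped=$u grouped=$g";
}
print "$_\n" for @results;
```

"Fred Rubble" passes the ungrouped pattern because a bare Fred is one of its alternatives, but it fails the grouped one.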
Single character matching
• Use []
• if ($str =~ /[abc]/)
  - true if $str contains a and/or b and/or c
• if ($str =~ /[pgmsb]et/)
  - true if $str contains pet, get, met, set, or bet
• if ($str =~ /[Fred]/)
  - true if $str contains F and/or r and/or e and/or d
• Characters not listed: ^ as the first character inside []
• if ($str =~ /[^abc]/)
  - true if $str contains at least one character other than a, b, or c (so "abcd" matches, but "abc" does not)
• if ($str =~ /[a-z]/)
  - true if $str contains any lowercase letter a through z
Single character or'd matching (2)
• if ($str =~ /[0-9]/)
  - true if $str contains any digit 0 through 9
• if ($str =~ /[0-9\-]/)
  - matches 0 through 9 or the minus sign
• if ($str =~ /[a-z0-9\^]/)
  - matches any single lowercase letter, digit, or ^
• if ($str =~ /[a-zA-Z0-9_]/)
  - matches any single letter, digit, or underscore
• if ($str =~ /[^aeiouAEIOU]/)
  - true if $str contains any character that is not a vowel
• if ($str !~ /[aeiouAEIOU]/)
  - true only if there are no vowels in $str
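The last two patterns are easy to confuse, so here is a minimal sketch that runs both on three made-up words. The helper names has_nonvowel and no_vowels are hypothetical.

```perl
use strict;
use warnings;

# =~ with a negated class: is there at least one character outside the set?
sub has_nonvowel { return $_[0] =~ /[^aeiouAEIOU]/ ? 1 : 0 }

# !~ with a plain class: is the set completely absent from the string?
sub no_vowels { return $_[0] !~ /[aeiouAEIOU]/ ? 1 : 0 }

for my $word ("rhythm", "aeiou", "cat") {
    printf "%-6s has_nonvowel=%d no_vowels=%d\n",
        $word, has_nonvowel($word), no_vowels($word);
}
```

Only "cat" separates the two tests: it contains a non-vowel, but it also contains a vowel.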
matching quantifiers
• Repetition uses {min,max}
• if ($str =~ /a{3}/)
  - true if $str contains aaa
• Common mistake
• if ($str =~ /Fred{3}/)
  - matches Freddd, not FredFredFred
• if ($str =~ /(Fred){3}/)
  - matches FredFredFred
• if ($str =~ /a{3,}/)
  - matches aaa, aaaa, aaaaa, aaaaaa, etc.
• if ($str =~ /a{3,5}b/)
  - matches aaab, aaaab, aaaaab
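The common mistake above can be checked directly. This is only an illustrative sketch; the hash name %seen is made up.

```perl
use strict;
use warnings;

my %seen;
for my $s ("Freddd", "FredFredFred") {
    # {3} applies only to the d that precedes it
    $seen{$s}{ungrouped} = ($s =~ /Fred{3}/) ? 1 : 0;
    # {3} applies to the whole parenthesized group
    $seen{$s}{grouped} = ($s =~ /(Fred){3}/) ? 1 : 0;
}

for my $s (sort keys %seen) {
    print "$s: Fred{3}=$seen{$s}{ungrouped} (Fred){3}=$seen{$s}{grouped}\n";
}
```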
matching quantifiers (2)
• if ($str =~ /a{0,5}/)
  - matches a, aa, aaa, aaaa, aaaaa, and also when there are no a's
• if ($str =~ /a*/)
  - * matches 0 or more times (max match)
• if ($str =~ /a*?/)
  - *? matches 0 or more times (min match)
• Difference between min and max matching
• $_ = "aaaa"; # matches all three above
  - Difference: * matches "aaaa", while *? matches the empty string (zero a's is the fewest it is allowed to take)
  - max matches as many characters as it can
  - while min matches as few characters as it can
  - This becomes important in the next lecture.
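To see what each quantifier actually consumes, the sketch below wraps it in parentheses and captures the text. The variable names are illustrative, and the +? line is included only for contrast.

```perl
use strict;
use warnings;

my $text = "aaaa";

# A match in list context returns the captured text
my ($greedy)    = $text =~ /(a*)/;     # as many a's as possible
my ($lazy_star) = $text =~ /(a*?)/;    # as few as possible: zero
my ($lazy_plus) = $text =~ /(a+?)/;    # as few as possible, but at least one

print "greedy='$greedy' lazy_star='$lazy_star' lazy_plus='$lazy_plus'\n";
```

All three matches succeed. Only the amount of text they take differs.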
matching quantifiers (3)
• + 1 or more times (max match)
• +? 1 or more times (min match)
• if ($str =~ /a+/)
  - true if there are 1 or more "a"s
• ? match 0 or 1 time (max match)
• ?? match 0 or 1 time (min match)
• if ($str =~ /a?/)
  - true if there is 1 "a" or no "a"s, so on its own it is always true
• Also {3,5}? min match: tries to match only 3 where possible
• and {3,5} max match: tries to match 5 where possible
matching quantifiers (4)
• if ($str =~ /fo+ba?r/)
  - matches f, 1 or more o's, b, 0 or 1 a, then an r
  - match: fobar, foobar, foobr
  - non-match: fbar (missing o), foobaar (too many a's)
• if ($str =~ /fo*ba?r/)
  - matches f, 0 or more o's, b, 0 or 1 a, then an r
  - match: fobar, fbr, fooobr, etc.
• Inside [], matching quantifiers are "normal" characters.
• if ($str =~ /[.?!+]*/)
  - matches zero or more ., ?, !, or +
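A quick way to confirm the match and non-match lists for /fo+ba?r/ is to loop over the sample words. This is a sketch; the hash name is made up.

```perl
use strict;
use warnings;

my %matched;
for my $s (qw(fobar foobar foobr fbar foobaar)) {
    # f, one or more o, b, an optional a, then r
    $matched{$s} = ($s =~ /fo+ba?r/) ? 1 : 0;
}

print "$_ => $matched{$_}\n" for sort keys %matched;
```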
Exercise 7
• What will the following match?
  1. /a+[bc]/
  2. /(a|be)t/i
  3. /Hi{1,3} There\!?/
  4. /(Foo)?Bar/i
  5. /[1-9][1-9][a-z]*/
  6. /[a-zA-z]+, [A-Z]{2} [0-9]{5}/
• Write a regular expression for these
  1. Match a social security number (with or without dashes)
  2. A street address: number Name with either St, Ln, Rd or nothing. Also case insensitive
metasymbols
• . matches one character (except newline)
• if ($str =~ /./)
  - always true, except when $str is "" or contains only newlines
• if ($str =~ /d.g/)
  - true for d, any character, and g
    - so dog, dbg, dag, dcg, d g, etc.
• if ($str =~ /d.*g/)
  - true for d, 0 or more characters, and g
    - so dg, dog, dasdfg, d g, etc.
• if ($str =~ /d.+g/)
  - true for d, 1 or more characters, and g
    - so NOT dg, but the rest: dog, dasdfg, d g, etc.
metasymbols (2)
• if ($str =~ /d.?g/)
  - true for d, any single character, and g, AND for dg
• if ($str =~ /d.{0,1}g/)
  - true for d, any single character, and g, AND for dg
  - same as above
• if ($str =~ /d.{2}g/)
  - true for d, 2 characters, and g
    - so doog, dafg, dghg, etc.
• if ($str =~ /d.{2,5}g/)
  - true for d, 2 to 5 characters, and g
    - so dooog, doog, dXXXXXg, dXobgg, etc.
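The dot patterns from the last two slides can be compared side by side. In this sketch the three test words were chosen so that each contains exactly one d and one g.

```perl
use strict;
use warnings;

my @rows;
for my $s (qw(dg dog doog)) {
    my $one      = ($s =~ /d.g/)  ? 1 : 0;   # exactly one character in between
    my $any      = ($s =~ /d.*g/) ? 1 : 0;   # zero or more characters
    my $some     = ($s =~ /d.+g/) ? 1 : 0;   # one or more characters
    my $optional = ($s =~ /d.?g/) ? 1 : 0;   # zero or one character
    push @rows, "$s: d.g=$one d.*g=$any d.+g=$some d.?g=$optional";
}
print "$_\n" for @rows;
```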
metasymbols (3)
• Anchoring
• ^ beginning of the string (it is an anchor only when it is not inside [])
• $ end of the string
• if ($str =~ /^dog$/)
  - true only for "dog", not "ddogg"
• if ($str =~ /^dog/)
  - true only when the string starts with "dog"
  - so "dog", "doga", etc.
metasymbols (4)
• if ($str =~ /dog$/)
  - true when the string ends with "dog"
  - "dog", "asdfadfdog", "ddddooodog"
• if ($str =~ /^.$/)
  - true when the string is one character long and that character is not the newline symbol
• if ($str =~ /^[abc]+/)
  - true when the string starts with
    - "a", "aa", "aaa", etc. with any characters following
    - "b", "bb", "bbb", etc. with any characters following
    - "c", "cc", "ccc", etc. with any characters following
    - as well as any combination of a's, b's, and c's, such as "abcabc", etc.
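The three anchoring styles can be sketched against a few sample strings. "hotdog" is a made-up addition to the slide's examples.

```perl
use strict;
use warnings;

my %result;
for my $s (qw(dog ddogg doga hotdog)) {
    $result{$s} = join ",",
        (($s =~ /^dog$/) ? 1 : 0),    # whole string is dog
        (($s =~ /^dog/)  ? 1 : 0),    # starts with dog
        (($s =~ /dog$/)  ? 1 : 0);    # ends with dog
}

print "$_ => $result{$_}\n" for sort keys %result;
```

Each value lists the exact, starts-with, and ends-with results in that order.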
metasymbols (5)
• \d match a Digit [0-9]
• \D match a Nondigit [^0-9]
• \s match whitespace [ \t\n\r\f]
• \S match a Nonwhitespace [^ \t\n\r\f]
• \w match a Word character [a-zA-Z0-9_]
• \W match a Non word Character [^a-zA-Z0-9_]
Examples
• if ($str =~ /\d/) # true when $str contains a digit
• if ($str =~ /\d+/) # true when $str contains 1 or more digits
• if ($str =~ /\w\d/) # true when $str contains a word character followed by 1 digit
• if ($str =~ /\w+\d/) # true when $str contains 1 or more word characters followed by 1 digit
  - true for "abc1", "a1", "11", "_9", "Z8", and "a1a1"
• if ($str =~ /^\s\w\d/)
  - true when it starts with a whitespace, then a word character, and then a digit
  - " 11", "\ta1", "\n11", etc.
• if ($str =~ /^\s*\w\d/)
  - true when it starts with 0 or more whitespaces, then a word character, and then a digit
  - " 11", "11", " \t a1", etc.
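The difference between \s and \s* at the start of a pattern shows up on the slide's own samples. The fourth string, "a b", is a made-up non-match.

```perl
use strict;
use warnings;

my @samples = (" 11", "11", " \t a1", "a b");
my (@one_space, @any_space);

for my $s (@samples) {
    # exactly one leading whitespace character is required
    push @one_space, ($s =~ /^\s\w\d/) ? 1 : 0;
    # leading whitespace is optional and may repeat
    push @any_space, ($s =~ /^\s*\w\d/) ? 1 : 0;
}

print "one_space: @one_space\n";
print "any_space: @any_space\n";
```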
boundaries assertions
• \b matches at any word boundary
  - as defined by \w and \W
• \B matches at any non word boundary
  - as defined by \W and \w
/\bis\b/ # matches "what it is" and "that is it"
  - similar to /\Wis\W/, except that \b also matches at the start and end of the string
  - won't match "tist"
/\Bis\B/ # matches "thistle" and "artist"
  - can also be written as /\wis\w/
  - won't match "that is it"
boundaries assertions (2)
/\bis\B/ # matches "istanbul" and "so--isn't that"
  - similar to /\Wis\w/
    - but /\Wis\w/ won't match "istanbul", because "is" is at the front of the string and there is no character for \W to match.
  - Since \w is [a-zA-Z0-9_], all punctuation counts as a word boundary.
  - So /\bisn\B/ won't match "isn't", because ' is not a word character
/\Bis\b/ # matches "this" and "this is for you"
  - similar to /\wis\W/
  - For the second example, the match is on the "is" inside "this", not on the word "is".
  - As in the example above, \W won't match at the end of a string.
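To check where a boundary pattern actually matched, the sketch below reads the match start offset from Perl's built-in @- array, where $-[0] is the start of the whole match. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "this is for you";

# \Bis\b: "is" glued to a preceding word character, ending at a boundary
my $inner_at = ($text =~ /\Bis\b/) ? $-[0] : -1;   # inside "this"

# \bis\b: "is" standing alone as a word
my $word_at = ($text =~ /\bis\b/) ? $-[0] : -1;

# \b matches at the start of the string, but \W needs a real character
my $b_front = ("istanbul" =~ /\bis\B/) ? 1 : 0;
my $w_front = ("istanbul" =~ /\Wis\w/) ? 1 : 0;

print "inner_at=$inner_at word_at=$word_at b_front=$b_front w_front=$w_front\n";
```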
Exercise 8
• What will the following match?
  1. /a+\w*?/
  2. /\w\s*\w+/
  3. /\bHi\bThere/
  4. /\b\w+\b.+There[!]?$/i
  5. /^\d+[a-z]*/
  6. /\w+,\s\w{2}\s{2}\d{5}/
• Write a regular expression for these
  1. Rewrite #6 so the city can be two or more words.
  2. Must start with a letter, then have any number of letters and/or numbers or none at all, but end with a number
Parentheses as memory
• special variables \1 .. \9
• \1 holds the text matched by the first ()
if ($str =~ /(\d)asdf\1/)
  - true when $str has a digit, then asdf, then the same digit
  - examples: 1asdf1, 3asdf3
if ($str =~ /(\w+)(\d+)as\2\1/)
  - true for a word, then digits, as, the same digits, then the same word
  - examples: "hi12as12hi", "1_31as311_"
Parentheses as memory (2)
if ($str =~ /(\d)+asdf\1/)
• Note: (\d)+ is different from (\d+)
  - (\d+) matches max digits, and all of them go into \1
  - (\d)+ matches one digit at a time, so only the last digit matched goes into \1
  - example: (\d)+ on 123, \1 = 3, but the match is on 123
    - So 123asdf3 would match the if at the top
    - In the next lecture, it does some strange things on substitutions.
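A sketch of the (\d)+ versus (\d+) difference. The ^ and $ anchors are an addition not on the slide: without them, (\d+) could start matching at the 3 in "123asdf3" and hide the difference.

```perl
use strict;
use warnings;

# (\d)+ : the group is overwritten on every repetition, so \1 is the last digit
my $last_only = "123asdf3"   =~ /^(\d)+asdf\1$/ ? $1 : "no match";

# (\d+) : the group holds the whole run of digits
my $whole_run = "123asdf123" =~ /^(\d+)asdf\1$/ ? $1 : "no match";

# Swapping the inputs makes both patterns fail
my $swap_a = "123asdf123" =~ /^(\d)+asdf\1$/ ? $1 : "no match";
my $swap_b = "123asdf3"   =~ /^(\d+)asdf\1$/ ? $1 : "no match";

print "last_only=$last_only whole_run=$whole_run\n";
print "swap_a=$swap_a swap_b=$swap_b\n";
```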
Parentheses as memory (3)
• parentheses around parentheses
• if ($str =~ /((\w+) (\w+))/)
  - \1, \2, \3 are bound to values
  - $str = "Hi There"; \2 = "Hi", \3 = "There", \1 = "Hi There"
• Perl numbers groups by their opening parenthesis, from the outermost to the inner: the first ( is 1, (\w+) is 2, the second (\w+) is 3
• (((\w+) )(\w+)) has \1, \2, \3, \4
  - the opening parentheses are numbered 1, 2, 3, 4 from left to right
  - \1 = "Hi There", \2 = "Hi ", \3 = "Hi", \4 = "There"
Variable Interpolation
• Using variables inside the match
• $find = "abc";
• if ($str =~ /$find/)
  - matches when $str contains the value of $find
• $str = "ddogg";
• if ($str =~ /\w$dog\w/)
  - true if $str contains the string in $dog with a word character in front of and behind it.
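A minimal sketch of interpolation, assuming $dog holds "dog" as the slide implies. The second half goes slightly beyond the slide: an interpolated value is treated as a pattern, and Perl's \Q...\E escape makes it literal.

```perl
use strict;
use warnings;

my $dog = "dog";
my $wrapped   = ("ddogg" =~ /\w$dog\w/) ? 1 : 0;   # d + dog + g
my $unwrapped = ("dog"   =~ /\w$dog\w/) ? 1 : 0;   # nothing around it

# The variable's contents are treated as a pattern, not as plain text
my $find = "a.c";
my $as_pattern = ("abc" =~ /$find/)     ? 1 : 0;   # the . acts as a wildcard
my $as_literal = ("abc" =~ /\Q$find\E/) ? 1 : 0;   # \Q...\E makes the . literal

print "wrapped=$wrapped unwrapped=$unwrapped\n";
print "as_pattern=$as_pattern as_literal=$as_literal\n";
```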
Special Read-Only variables
• We've seen \1 .. \9. They only have a value inside the match. But $1 .. $9 hold their value (same as \1 .. \9) after the match
if ($str =~ /(\d+)asd\1/) {
  print "matched $1 \n";
}
• If $str = "123asd123", then the output would be matched 123
capturing matches
• $str = "a xxx c xxxxxc xxx d";
• ($a, $b) = ($str =~ m/(.+)x(.+)c/);
  - $a = "a xxx c xxx"; Also $1 = "a xxx c xxx";
  - $b = "x"; Also $2 = "x";
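The list-context capture above can be run as is. The sketch uses $first and $second instead of $a and $b, because those two names are also used by Perl's sort, and it reuses the nested "Hi There" pattern from the parentheses slides.

```perl
use strict;
use warnings;

my $str = "a xxx c xxxxxc xxx d";

# In list context a successful match returns ($1, $2, ...)
my ($first, $second) = ($str =~ m/(.+)x(.+)c/);

# Nested groups come back in order of their opening parenthesis
my @nested = ("Hi There" =~ /((\w+) (\w+))/);

# A failed match returns an empty list
my @none = ("no digits" =~ /(\d+)/);

print "first='$first' second='$second'\n";
print "nested: ", join(" | ", @nested), "\n";
print "none has ", scalar(@none), " elements\n";
```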
match as a true value
• / / returns a true/false value
• building a switch structure
SWITCH: {
  $str =~ /abc/ && do {$a = 1; last SWITCH;};
  $str =~ /def/ and do {$d = 1; last SWITCH;};
  $c = 1;
}
• Strange looking code. Also, this is one of the very few places a ; is needed after a }
  - NOTE: either && or and could be used.
Commenting your matches
• /x ignores most whitespace and allows comments
/\w+:    # match a word and a colon
 (       # begin group
   \s+   # match one or more spaces
   \w+   # match another word
 )       # end group
 \s*     # match zero or more spaces
 \d+     # match 1 or more digits
/x;
• same as
/\w+:(\s+\w+)\s*\d+/;
Commenting your matches (2)
• Be careful that comments don't contain /, otherwise perl thinks it is the end of the match
• You have to think about where the whitespace is in the match.
• If you need to match a #, use \#
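A sketch showing that the commented and compact forms behave the same. The sample strings are invented, and qr// is used only to store the patterns.

```perl
use strict;
use warnings;

my $commented = qr/
    \w+:      # a word and a colon
    (         # begin group
      \s+     # one or more spaces
      \w+     # another word
    )         # end group
    \s*       # zero or more spaces
    \d+       # one or more digits
/x;
my $compact = qr/\w+:(\s+\w+)\s*\d+/;

my $line = "name: bob 42";
my ($from_commented) = $line =~ $commented;
my ($from_compact)   = $line =~ $compact;

# Under x, whitespace must be written as \s and a literal # as \#
my ($ticket) = "item #5" =~ /item \s \# (\d+)/x;

print "commented='$from_commented' compact='$from_compact' ticket=$ticket\n";
```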
more flags for pattern matching
• matching with newline in the string
• //s lets the . match the newline (\n)
  - $str = "asdf\n asdf\n";
  - /(f.)/; # no match
  - /(f.)/s; # $1 = "f\n";
• //m lets ^ and $ match next to embedded \n
  - $str = "af\nasdf\n";
  - /(af$)/; # won't match
  - /(af$)/m; # $1 = "af";
  - /^(as)/; # won't match
  - /^(as)/m; # matches, $1 = "as";
  - /(f.)$/ms; # matches only the last "f\n": the . consumes the \n, so $ can only match at the very end of the string
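The /s and /m cases above can be checked with a short sketch. The capture parentheses are dropped where they are not needed, and $-[0] (the match start offset from the built-in @- array) shows which "f\n" was matched.

```perl
use strict;
use warnings;

my $str = "af\nasdf\n";

my $dollar_plain = ($str =~ /af$/)  ? 1 : 0;   # the \n after af is not the last one
my $dollar_multi = ($str =~ /af$/m) ? 1 : 0;   # m lets $ match before any \n
my $caret_plain  = ($str =~ /^as/)  ? 1 : 0;   # as is not at the start of the string
my $caret_multi  = ($str =~ /^as/m) ? 1 : 0;   # m lets ^ match after a \n

# s lets . match the newline
my ($dot_nl) = "asdf\n asdf\n" =~ /(f.)/s;

# With both flags the first "f\n" fails: after it comes "a", not a line end
my $last_at = ($str =~ /(f.)$/ms) ? $-[0] : -1;

print "dollar: $dollar_plain $dollar_multi caret: $caret_plain $caret_multi\n";
print "last f at offset $last_at\n";
```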
Pattern Delimiters
• if ($str =~ /\/usr\/local/)
  - true if $str contains /usr/local
• To avoid backslashing / we can change the delimiter
  - choose another delimiter, which is a nonalphanumeric character, such as %%, ##, {}, [], <>, etc.
  - must use the m in front of the match so perl knows what you want
• if ($str =~ m%/usr/local%)
  - true if $str contains /usr/local
• if ($str =~ m[/usr/local])
  - true if $str contains /usr/local, but confusing since it can be mistaken for [] single character matching.
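The same path test with three delimiter styles, as a sketch. The sample path is made up.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin";

my $escaped = ($path =~ /\/usr\/local/) ? 1 : 0;   # default delimiter, escaped slashes
my $percent = ($path =~ m%/usr/local%)  ? 1 : 0;   # same character opens and closes
my $braces  = ($path =~ m{/usr/local})  ? 1 : 0;   # paired delimiters

print "escaped=$escaped percent=$percent braces=$braces\n";
```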
Exercise 9
• Which of the following will this match: /\-?(\|)?m\(\d+\)\1/i
  1. "–|m(12)|"
  2. "|M(12)|"
  3. "-|M(12)"
  4. "m(12)|"
  5. "M(12)"
• For /\-?(\|)?m\(\d+\)\1?/i
  6. "|m(12)|"
  7. "m(12)"
  8. "-|m(12)|"
  9. "-|m(12)"
Q&A