^[Rr]egular [Ee]xpressions$ Introduction

Upload: jason-noble

Post on 18-Dec-2014

TRANSCRIPT

Page 1: Regex Intro

^[Rr]egular [Ee]xpressions$

Introduction

Page 2: Regex Intro

Vocabulary

• Regular expression / Regex / Regexp
  – “Regex” is pronounced Reg (as in register) Ex (as in FedEx)
• Matching
  – Saying a regex “matches a string” means it matches somewhere in the string, not necessarily the whole string

Page 3: Regex Intro

Regular Expressions

• Composed of two types of characters
  – Metacharacters / Special characters
    • * ? ^ $ . [ ]
  – Literal characters
    • a b c d

Page 4: Regex Intro

Egrep tool

• Allows you to use Regular Expressions to find words that match

• Available for Macs, PCs and Linux

• cat /usr/share/dict/words | egrep '…'

• See http://regex.info/egrep.html if you don’t have it preinstalled

Page 5: Regex Intro

My first regex

• cat /usr/share/dict/words | egrep 'cat'
  – Matches any word containing a ‘c’ followed by an ‘a’ followed by a ‘t’
    • bobcat
    • cat
    • catwalk
    • scatter
• Simple regex, uses only literal characters
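
The same search can be tried without a dictionary file. This is a minimal sketch, assuming a POSIX shell; it uses `grep -E` (the standard equivalent of egrep) and a short made-up word list in place of /usr/share/dict/words.

```shell
# Hypothetical word list standing in for /usr/share/dict/words.
words='bobcat
cat
catwalk
scatter
dog
cart'

# Print every line that contains c, then a, then t, anywhere in the line.
cat_matches=$(printf '%s\n' "$words" | grep -E 'cat')
echo "$cat_matches"
```

Only the four words containing the literal sequence are printed; "dog" and "cart" are skipped.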

Page 6: Regex Intro

Metacharacters: ^ and $

• ^ matches the beginning of a line
• $ matches the end of a line
  – ^cat (start of line followed by ‘c’ then ‘a’ then ‘t’)
    • cat
    • catwalk
  – cat$ (‘c’ followed by ‘a’ then ‘t’ followed by EOL)
    • bobcat
    • cat
  – ^cat$ (start of line followed by ‘c’ then ‘a’ then ‘t’ then EOL)
    • cat
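
The three anchored patterns can be compared side by side. This is a sketch under the assumption that `grep -E` stands in for egrep; the word list is made up for illustration.

```shell
# Hypothetical word list.
words='bobcat
cat
catwalk
scatter'

starts=$(printf '%s\n' "$words" | grep -E '^cat')   # lines beginning with cat
ends=$(printf '%s\n' "$words" | grep -E 'cat$')     # lines ending with cat
exact=$(printf '%s\n' "$words" | grep -E '^cat$')   # lines that are exactly cat

echo "starts with cat:"; echo "$starts"
echo "ends with cat:";   echo "$ends"
echo "exactly cat:";     echo "$exact"
```

"scatter" appears in none of the three results because its "cat" is in the middle of the line.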

Page 7: Regex Intro

How to read regex

• Read each character one at a time
• ^bat
  – Start of line followed by ‘b’ then ‘a’ then ‘t’
• rat$
  – ‘r’ then ‘a’ then ‘t’ followed by end of line
• ^dog$
  – Start of line followed by ‘d’ then ‘o’ then ‘g’ then EOL

Page 8: Regex Intro

More simple regexes

• ^
  – Start of line
• ^$
  – Start of line followed by end of line (an empty line)
• $
  – End of line
• ^foot$
  – Start of line followed by ‘f’ then ‘o’ then ‘o’ then ‘t’ followed by EOL

Page 9: Regex Intro

Character Classes [ ]

• Matches one of the characters in the [ ]
  – [ae]
    • Matches ‘a’ or ‘e’
  – [aeiouy]
    • Matches any vowel (counting ‘y’)
  – ^gr[ae]y$
    • Start of line followed by ‘g’ then ‘r’ then ‘a’ or ‘e’ then ‘y’ followed by end of line
    • grey or gray
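
A quick way to see that a character class consumes exactly one character is to feed it near-misses. This is a sketch with made-up test words, again assuming `grep -E` in place of egrep.

```shell
# Hypothetical inputs: two valid spellings and three near-misses.
grey_matches=$(printf '%s\n' gray grey gruy greay grays | grep -E '^gr[ae]y$')
echo "$grey_matches"
```

"gruy" fails because ‘u’ is not in the class, "greay" fails because the class matches only one character, and "grays" fails because of the $ anchor.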

Page 10: Regex Intro

Character Classes cont.

• [Ss]
  – Matches upper or lower case ‘S’
• [123456]
  – Matches any one of the digits listed
• [Hh][123456]
  – Matches H1, h2, h3, H4, etc.

Page 11: Regex Intro

Special characters in [ ]’s

• - (dash) indicates a range
  – [1-6] is the same as [123456]
  – [a-f] is the same as [abcdef]
• Ranges can be mixed with literals
  – [0-9a-fA-F_!.?]
    • Any digit, upper or lower case ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, underscore, exclamation, period or question mark
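
To check that a range behaves like the spelled-out list, both forms can be run on the same input. This is a sketch with hypothetical heading-like tokens, assuming `grep -E` in place of egrep.

```shell
# Hypothetical tokens; h7 and x3 should be rejected by both patterns.
tokens='H1
h2
h7
H4
x3'

with_list=$(printf '%s\n' "$tokens" | grep -E '[Hh][123456]')
with_range=$(printf '%s\n' "$tokens" | grep -E '[Hh][1-6]')

echo "$with_range"
```

Both variables end up holding the same three lines (H1, h2, H4).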

Page 12: Regex Intro

Negated character class [^ ]

• ^ inside of [ ] means “not any of these”
  – [^1-6]
    • Any character other than 1, 2, 3, 4, 5, 6
  – [^a-fA-F]
    • Any character other than A-F (upper or lower)
  – The ^ must be the first character inside [ ]
    • [^c] (Matches anything but ‘c’)
    • [c^] (Matches a ‘c’ or ‘^’)
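
The position of ^ inside the brackets changes the meaning completely, which a small experiment makes visible. This is a sketch with made-up one-word lines, assuming `grep -E` in place of egrep.

```shell
# Hypothetical input lines.
lines='c
cc
a
^
abc'

# [^c]: the line must contain at least one character that is not c.
not_c=$(printf '%s\n' "$lines" | grep -E '[^c]')
# [c^]: the line must contain a c or a literal caret.
c_or_caret=$(printf '%s\n' "$lines" | grep -E '[c^]')

echo "not c:";      echo "$not_c"
echo "c or caret:"; echo "$c_or_caret"
```

Note that a negated class still has to match a character: "c" and "cc" fail [^c] because they contain nothing else.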

Page 13: Regex Intro

Translating regex practice

• List of words that have ‘q’ followed by a character other than ‘u’
  – q[^u]
• List of words with ‘f’ followed by an ‘i’ or ‘o’ followed by ‘r’ then ‘e’
  – f[io]re
• Line starts with ‘Qu’ or ‘qu’ followed by an ‘e’ followed by any letter between ‘p’ and ‘t’
  – ^[Qq]ue[p-t]
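
These three translations can be checked against small word lists. This is a sketch; the words are chosen for illustration and `grep -E` is assumed in place of egrep.

```shell
# q followed by a character other than u.
q_matches=$(printf '%s\n' Iraqi Iraq queen qintar | grep -E 'q[^u]')

# f, then i or o, then r, then e.
f_matches=$(printf '%s\n' fire fore before fare firefly | grep -E 'f[io]re')

# Line starts with Que or que followed by a letter from p to t.
que_matches=$(printf '%s\n' quest query queen Question conquest | grep -E '^[Qq]ue[p-t]')

echo "$q_matches"; echo "$f_matches"; echo "$que_matches"
```

"Iraq" is not matched by q[^u]: the ‘q’ is the last character, so there is nothing left for [^u] to match.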

Page 14: Regex Intro

Metacharacter: . (dot)

• Matches any single character
• c.t
  – ‘c’ followed by any character followed by ‘t’
    • cat
    • cot
    • c8t
• Period inside of [ ]’s matches a period
  – [a.c]
    • Matches ‘a’, ‘.’ or ‘c’

Page 15: Regex Intro

Periods cont.

• 03.19.76
  – Matches ‘03’ followed by any char then ‘19’ then any char then ‘76’
    • 03-19-76
    • 03/19/76
    • 03.19.76
    • 03 19 76
    • 03319876
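
The difference between a bare dot and a dot inside [ ] shows up clearly on date-like strings. This is a sketch, assuming `grep -E` in place of egrep; the last input line is an extra made-up non-match.

```shell
# Date-like strings from the slide, plus one without separators.
dates='03-19-76
03/19/76
03.19.76
03 19 76
03319876
031976'

any_sep=$(printf '%s\n' "$dates" | grep -E '03.19.76')      # dot = any character
dot_only=$(printf '%s\n' "$dates" | grep -E '03[.]19[.]76') # [.] = literal period

echo "any separator:"; echo "$any_sep"
echo "period only:";   echo "$dot_only"
```

The first pattern accepts everything except "031976", including the all-digit "03319876"; the second accepts only "03.19.76".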

Page 16: Regex Intro

Alternatives: | (pipe)

• Pipes allow you to specify alternatives
• grey|gray
  – Matches grey or gray
• Use parentheses to constrain alternatives
  – gr(e|a)y
• Within [ ]’s, | is a normal character
  – [a|b]
    • Matches ‘a’ or ‘|’ or ‘b’

Page 17: Regex Intro

Pipes (cont.)

• Use parentheses to constrain
  – gre|ay
    • matches ‘gre’ or ‘ay’
  – gr(e|a)y
    • matches ‘gr’ followed by ‘e’ or ‘a’ then ‘y’
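
Running both forms on the same words shows how far the unparenthesized pipe reaches. This is a sketch with a made-up word list, assuming `grep -E` in place of egrep.

```shell
# Hypothetical word list.
words='grey
gray
green
okay
gry'

grouped=$(printf '%s\n' "$words" | grep -E 'gr(e|a)y')   # gr, then e or a, then y
ungrouped=$(printf '%s\n' "$words" | grep -E 'gre|ay')   # "gre" anywhere OR "ay" anywhere

echo "gr(e|a)y:"; echo "$grouped"
echo "gre|ay:";   echo "$ungrouped"
```

Without the parentheses, "green" (contains "gre") and "okay" (contains "ay") match as well.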

Page 18: Regex Intro

Regex practice

• Match “First Street” or “1st street”
  – (First|1st) [Ss]treet
  – (Fir|1)st [Ss]treet
    • These are equivalent; which is better?
• Match “toothbrush” or “hairbrush”
  – (tooth|hair)brush

Page 19: Regex Intro

^ or $ and alternation

• Be careful when using ^ or $ with alternation
• ^From|Subject|Date:
  – Three separate alternatives: ‘From’ at the start of a line, OR ‘Subject’ anywhere in the line, OR ‘Date:’ anywhere in the line
• ^(From|Subject|Date):
  – Start of line followed by ‘From’ or ‘Subject’ or ‘Date’ followed by ‘:’
• Safer to use ( )’s to group your alternates
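
The effect of the missing parentheses can be seen on a few header-like lines. This is a sketch; the lines and addresses are invented for illustration, and `grep -E` is assumed in place of egrep.

```shell
# Hypothetical mail-like lines; only three are real headers of interest.
lines='From: alice@example.com
Subject: hello
The Subject was dull
Date: Monday
Due Date: Friday
From the desk of Bob
To: bob@example.com'

# Ungrouped: ^ applies to From only, and the colon applies to Date only.
loose_count=$(printf '%s\n' "$lines" | grep -E -c '^From|Subject|Date:')
# Grouped: ^ and the colon apply to all three alternatives.
strict=$(printf '%s\n' "$lines" | grep -E '^(From|Subject|Date):')

echo "ungrouped pattern matched $loose_count lines"
echo "$strict"
```

The ungrouped pattern matches six of the seven lines, while the grouped one keeps only the three lines that start with a header name followed by a colon.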

Page 20: Regex Intro

Case insensitive match

• Matches are case sensitive by default
  – [Ff]rom will match From but not FRom

• Use egrep’s -i option to do a case insensitive match

• Most languages have a case insensitive match as well
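
The -i option can be compared with a hand-written character class. This is a sketch with made-up inputs; it uses `grep -E -i`, on the assumption that `grep -E` stands in for egrep.

```shell
# Hypothetical capitalization variants.
variants='From
from
FRom
FROM'

by_class=$(printf '%s\n' "$variants" | grep -E '[Ff]rom')   # only the first letter may vary
by_flag=$(printf '%s\n' "$variants" | grep -E -i 'from')    # every letter may vary

echo "[Ff]rom:"; echo "$by_class"
echo "-i from:"; echo "$by_flag"
```

The class accepts only "From" and "from"; with -i all four lines match.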

Page 21: Regex Intro

Quantifiers: ?

• ? metacharacter means optional
  – colou?r
    • matches color or colour
    • ‘c’ then ‘o’ then ‘l’ then ‘o’ then optionally ‘u’ then ‘r’
• Match ‘July’ or ‘Jul’ followed by ‘fourth’, ‘4th’ or ‘4’
  – (July|Jul) (fourth|4th|4)
  – July? (fourth|4th|4)
  – July? (fourth|4(th)?)
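
To confirm that the compact form with ? accepts the same dates as the spelled-out alternation, both can be run over one list. This is a sketch with made-up inputs, assuming `grep -E` in place of egrep.

```shell
# Hypothetical date strings; the last two should not match.
dates='July fourth
Jul 4th
July 4
Jul fourth
June 4
Ju 4'

verbose=$(printf '%s\n' "$dates" | grep -E '(July|Jul) (fourth|4th|4)')
compact=$(printf '%s\n' "$dates" | grep -E 'July? (fourth|4(th)?)')

echo "$compact"
```

Both patterns select the first four lines.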

Page 22: Regex Intro

Quantifiers: + and *

• + (plus)
  – One or more of the previous item
• * (star)
  – Zero or more of the previous item
• b[0-9]*a
  – ba
  – b9999a
  – b999999999999999a
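
Swapping * for + in the same pattern shows the one practical difference between them. This is a sketch with made-up inputs, assuming `grep -E` in place of egrep.

```shell
# Hypothetical inputs.
inputs='ba
b9999a
b12a
bxa
b9'

star=$(printf '%s\n' "$inputs" | grep -E 'b[0-9]*a')   # zero or more digits between b and a
plus=$(printf '%s\n' "$inputs" | grep -E 'b[0-9]+a')   # at least one digit between b and a

echo "b[0-9]*a:"; echo "$star"
echo "b[0-9]+a:"; echo "$plus"
```

Only "ba" separates the two results: it has zero digits, which * allows and + does not.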

Page 23: Regex Intro

Summary of Quantifiers

Quantifier   Minimum required   Maximum to try   Meaning
?            none               1                zero or one occurrence
*            none               no limit         zero or more occurrences
+            1                  no limit         one or more occurrences

Page 24: Regex Intro

Escaping metacharacters

• Use \ (backslash) to escape metacharacters
  – \. matches ‘.’
  – . matches any character

• c.t matches cat

• c\.t does not match cat

• \(cat\) matches ‘(cat)’ not ‘cat’
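
Escaped and unescaped forms can be compared on the same input. This is a sketch with made-up lines, assuming `grep -E` in place of egrep; the single quotes keep the shell from touching the backslashes.

```shell
# Hypothetical input lines.
lines='cat
c.t
cot
ct
(cat)'

any_char=$(printf '%s\n' "$lines" | grep -E 'c.t')       # dot matches any character
literal_dot=$(printf '%s\n' "$lines" | grep -E 'c\.t')   # \. matches only a period
parens=$(printf '%s\n' "$lines" | grep -E '\(cat\)')     # \( and \) match literal parentheses

echo "c.t:";     echo "$any_char"
echo "c\\.t:";   echo "$literal_dot"
echo "\\(cat\\):"; echo "$parens"
```

The unescaped c.t also matches the line "(cat)" because it contains "cat"; the escaped patterns each match exactly one line.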

Page 25: Regex Intro

More practice

• Match chat, cat, chart
  – ch?ar?t
  – c[h]?a[r]?t
• Start of line then ‘M’ then one or more ‘a’ followed by ‘st’ and zero or more ‘b’
  – ^M[a]+st[b]*
• Lines ending with one or more ‘c’ followed by a ‘t’ then zero or one ‘e’
  – [c]+t[e]?$

Page 26: Regex Intro

More practice

• ^[Mm][^a-np-z]ney$
  – Start of line then ‘M’ or ‘m’ then any character other than a-n or p-z, then ‘ney’ followed by end of line
  – Money, money, m3ney
• ^be.*(bob|ted)$
  – Start of line followed by ‘be’ followed by zero or more characters followed by ‘bob’ or ‘ted’ followed by end of line

Page 27: Regex Intro

More practice

• Match truck, firetruck but not dumptruck
  – ^(fire)?truck$
• $0.99, $599.95, $1000.45, $5000
  – \$[0-9]+(\.[0-9][0-9])?$
• 404-555-1212, 404.555.1212, (404) 555-1212
  – ^[()0-9]+.[0-9]+.[0-9]+$
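
The price and phone patterns can be exercised on the slide's examples plus a few made-up near-misses. This is a sketch, assuming `grep -E` in place of egrep; the extra inputs ($5.5, 12.00, $12.345, 555-1212, "not a number") are hypothetical.

```shell
# Prices: single quotes stop the shell from expanding $0, $5 and so on.
price_matches=$(printf '%s\n' '$0.99' '$599.95' '$1000.45' '$5000' '$5.5' '12.00' '$12.345' \
  | grep -E '\$[0-9]+(\.[0-9][0-9])?$')

# Phone numbers: the unescaped dots accept any separator character.
phone_matches=$(printf '%s\n' '404-555-1212' '404.555.1212' '(404) 555-1212' '555-1212' 'not a number' \
  | grep -E '^[()0-9]+.[0-9]+.[0-9]+$')

echo "prices:"; echo "$price_matches"
echo "phones:"; echo "$phone_matches"
```

The price pattern rejects the three near-misses. The phone pattern is looser than it looks: because . matches any character, including a digit, the seven-digit "555-1212" is accepted too.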