grokking regex

php[tek] 2014 David Stockton May 21, 2014 Grokking Regex

Upload: david-stockton

Post on 24-Jun-2015

DESCRIPTION

Understanding regular expressions gives developers another extremely useful and powerful tool they can use to perform some operations that would otherwise be very tedious or difficult. This presentation goes over how to build and test regular expressions so developers can start using them within their own code.

TRANSCRIPT

Page 1: Grokking regex

php[tek] 2014

David Stockton, May 21, 2014

Grokking Regex

Page 2: Grokking regex

What are regular expressions?

Page 3: Grokking regex

Patterns to describe text

Page 4: Grokking regex

Regular

Page 5: Grokking regex

Extremely Powerful

Page 6: Grokking regex

Often Abused.

Page 7: Grokking regex

Regular Expression Joke

Page 8: Grokking regex

How to use regex in PHP

● The preg_* functions
○ Use Perl compatible regular expressions
○ Probably the most common regex syntax
● Don't use ereg_* functions (deprecated since PHP 5.3, removed in PHP 7)

Page 9: Grokking regex

PHP Functions

preg_match - Search a subject for a match

preg_match_all - Searches a subject for all matches

preg_replace - Replace a pattern with something else

preg_split - Split a string based on regex delimiter

Page 10: Grokking regex

PHP Functions

preg_replace_callback - Replacement defined in a callback

preg_grep - Return array of elements that match a pattern

preg_quote - Quote regular expression characters

preg_last_error - Error code of last regex function
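As a rough illustration of the functions listed on these two slides, the sketch below runs each of them against one made-up sample string. The input and variable names are assumptions chosen for the example, not part of the talk.

```php
<?php
// Made-up sample input for illustration.
$subject = 'apples: 3, pears: 12, plums: 7';

// preg_match: stops at the first match; returns 1, 0, or false on error.
$found = preg_match('/\d+/', $subject, $first);

// preg_match_all: collects every match; returns the number of matches.
$count = preg_match_all('/\d+/', $subject, $all);

// preg_replace: replace every match with a replacement string.
$masked = preg_replace('/\d+/', '#', $subject);

// preg_split: split on a regex delimiter (a comma plus optional whitespace).
$parts = preg_split('/,\s*/', $subject);

// preg_grep: keep only the array elements that match (keys are preserved).
$doubleDigit = preg_grep('/\d{2}/', $parts);

// preg_replace_callback: compute each replacement in a callback.
$doubled = preg_replace_callback('/\d+/', function ($m) {
    return (string) ($m[0] * 2);
}, $subject);

// preg_quote: escape characters that are special in a regex.
$quoted = preg_quote('3.5+2');

// preg_last_error: PREG_NO_ERROR when the last call succeeded.
$error = preg_last_error();
```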

Page 11: Grokking regex

Starting Pattern

● Matches letters, numbers, dots, underscores, plus and equals signs (1 or more)
● Followed by @
● Followed by letters, numbers, dots and dashes (1 or more)
● Followed by a dot
● Followed by 2 to 4 letters

/[A-Z0-9._+=]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i
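A quick way to get a feel for the starting pattern is to run it against a few strings. This is only a sketch: it assumes the domain character class is meant to repeat (a + after it, as the bullets describe), and the sample addresses are made up.

```php
<?php
// Starting pattern from the slide; the + after the domain class is an assumption.
$pattern = '/[A-Z0-9._+=]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i';

// Made-up sample inputs.
$samples = [
    'dave@example.com',
    'first.last+tag@mail.example.org',
    'not an email',
    'user@localhost',
];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern is found somewhere in the string.
    $results[$sample] = preg_match($pattern, $sample) === 1;
}
```

The last sample is rejected because the pattern insists on a dot followed by 2 to 4 letters, which hints at how much a simple pattern leaves out.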

Page 12: Grokking regex

What does it mean?

Page 13: Grokking regex

Email Addresses

Page 14: Grokking regex

Some Email Addresses

Page 15: Grokking regex

The "real" email address regex

[Slide shows the first part of what appears to be the well-known, very long RFC 822 email address regular expression. The pattern lost its backslashes and other characters in extraction and cannot be reproduced reliably here.]

Page 16: Grokking regex

More "real" regex

[Slide shows the remainder of the same regular expression, likewise garbled in extraction.]

Page 17: Grokking regex

How do we implement this regex?

Page 18: Grokking regex

Time for real learning

Page 19: Grokking regex

Letters and Numbers

Letters and numbers match... letters and numbers

/a/ - Matches a string that contains "a"

/7/ - Matches a string that contains a 7

Page 20: Grokking regex

Match a word

/regex/ - Matches a string with the word "regex" in it

Page 21: Grokking regex

Match a choice of words

Use pipe when you want a choice

/pizza|steak|cheeseburger/

Page 22: Grokking regex

Delimiters

So far, delimiters have been /

Delimiters tell the regex engine where the pattern starts and ends

Can use other delimiters

#\\My\\PHP\\Namespace#
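The benefit of switching delimiters shows up when the pattern itself contains slashes. A small sketch, using a made-up URL:

```php
<?php
$url = 'https://example.com/docs'; // made-up sample

// With / as the delimiter, a literal slash in the pattern must be escaped.
$withSlashes = preg_match('/\/docs/', $url);

// With # (or ~) as the delimiter, the slash can be written as-is.
$withHashes = preg_match('#/docs#', $url);
$withTildes = preg_match('~/docs~', $url);
```

All three calls search for the same text; only the amount of escaping differs.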

Page 23: Grokking regex

Character Matching

/[Pp][Hh][Pp]/ - Matches PHP in any case

Define ranges

/[abcdefghijklmnopqrstuvwxyz]/ - Any lower case alpha

/[a-z]/ - Any lower case alpha

Page 24: Grokking regex

Character Ranges

Combine Ranges:
/[A-Za-z0-9]/ - Matches any alphanumeric
/[A-Fa-f0-9]/ - Matches hex character

Invert Character selection
/[^0-9]/ - Non digit characters
/[^ ]/ - Non space characters
/[.!@#$%^&*]/ - Some punctuation
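The character selections above can be tried out directly. The strings below are made up for the example:

```php
<?php
// Each letter may be upper or lower case.
$anyCase = preg_match('/[Pp][Hh][Pp]/', 'I like pHp');

// Collect every hex character in a string.
preg_match_all('/[A-Fa-f0-9]/', 'Go 4 a Bagel', $hex);

// An inverted selection: strip everything that is not a digit.
$digitsOnly = preg_replace('/[^0-9]/', '', 'Tel: 555-0100');

// Inside a selection, ^ is only special as the first character.
preg_match_all('/[.!@#$%^&*]/', 'Wow! 100%.', $punct);
```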

Page 25: Grokking regex

Special Characters

Dot (.) matches any character (except a newline, unless the s modifier is used)
/./
/../ - Matches any two characters

To match an actual dot character, escape it
/\./

Not needed in character selection
/[.]/
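A short sketch of the difference between a bare dot, an escaped dot and a dot inside a character selection:

```php
<?php
// A bare dot accepts any character in that position.
$dotAny1 = preg_match('/a.c/', 'abc');
$dotAny2 = preg_match('/a.c/', 'a.c');

// An escaped dot only accepts a literal dot.
$escaped1 = preg_match('/a\.c/', 'abc');
$escaped2 = preg_match('/a\.c/', 'a.c');

// Inside a character selection the dot is already literal.
$inClass1 = preg_match('/a[.]c/', 'abc');
$inClass2 = preg_match('/a[.]c/', 'a.c');
```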

Page 26: Grokking regex

Character Classes

\d means [0-9] (digit; with the u modifier it also matches other Unicode digits)
\D means [^0-9]

\w means word characters - [A-Za-z0-9_]
\W means non word - [^A-Za-z0-9_]

\s means whitespace character [ \t\n\r]
\S means non-whitespace characters
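The shorthand classes can be sketched with a few one-liners. The inputs are made up:

```php
<?php
// \d: pick out the digits.
preg_match_all('/\d/', 'Room 42B', $digits);

// \D: remove every non-digit.
$onlyDigits = preg_replace('/\D/', '', '(720) 675-7471');

// \W: remove everything that is not a letter, digit or underscore.
$wordChars = preg_replace('/\W/', '', 'snake_case-and spaces!');

// \s: split on any single whitespace character.
$words = preg_split('/\s/', "one two\tthree\nfour");
```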

Page 27: Grokking regex

Repetition

Match two digits in a row
● /\d\d/
● /[0-9][0-9]/
● /\d{2}/
● /[0-9]{2}/

Match at least one, as many as possible
/\d+/
Zero or more: /\d*/

Page 28: Grokking regex

Repetition Repeated

● * match 0 or more
● + match 1 or more
● {x} match exactly x
● {x,} match x or more
● {0,y} match up to y (older PCRE versions treat {,y} as literal text)
● {x,y} match between x and y
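To see how the repetition operators differ, this sketch applies several of them to the same made-up string and keeps what each one matched first:

```php
<?php
$subject = 'abc12345'; // made-up sample

preg_match('/\d+/', $subject, $plus);        // one or more, as many as possible
preg_match('/\d{2}/', $subject, $exact);     // exactly two
preg_match('/\d{2,3}/', $subject, $range);   // between two and three
preg_match('/\d{3,}/', $subject, $atLeast);  // three or more
preg_match('/\d{0,2}/', '12345', $upTo);     // up to two
preg_match('/\d*/', $subject, $star);        // zero or more
```

The last call is worth a second look: \d* is satisfied by zero digits, so it matches the empty string at the very start of the subject and never reaches the numbers.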

Page 29: Grokking regex

More special characters

? - Preceding selection is optional
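A minimal sketch of the optional marker, using a made-up spelling example:

```php
<?php
// The u is optional, so both spellings match.
$american = preg_match('/colou?r/', 'color');
$british  = preg_match('/colou?r/', 'colour');

// ? allows zero or one, not two.
$tooMany = preg_match('/colou?r/', 'colouur');
```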

Page 30: Grokking regex

Step by Step

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Page 31: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Opening delimiter

Page 32: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Optional open paren

Page 33: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Capture group - Parens capture pattern inside

Page 34: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Three digits (captured)

Page 35: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Optional closing paren

Page 36: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Space or dash character

Page 37: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Optional space or dash character

Page 38: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Another three digit capture group

Page 39: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Optional space or dash character

Page 40: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Capture group for four digits

Page 41: Grokking regex

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Closing delimiter

Page 42: Grokking regex

More special characters

Put it together:

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471
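The formats listed on the slide can be checked in a loop. This sketch also records the three capture groups for each input:

```php
<?php
// Phone pattern from the slides.
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

$inputs = [
    '720-675-7471',
    '(720)675-7471',
    '(720) 675-7471',
    '7206757471',
    '720 675 7471',
    '720.675.7471', // dots are not handled by this pattern
];

$parsed = [];
foreach ($inputs as $input) {
    if (preg_match($pattern, $input, $m) === 1) {
        // $m[1], $m[2], $m[3] hold the three captured digit groups.
        $parsed[$input] = [$m[1], $m[2], $m[3]];
    } else {
        $parsed[$input] = null;
    }
}
```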

Page 43: Grokking regex

Phone number matching

Does not match 720.675.7471 or a number of other formats.

Other ways?

Replace all non-digits, check for length of 10

Page 44: Grokking regex

PHP Codes

// Strip everything that is not a digit
$number = preg_replace('/[^0-9]/', '', $potentialNumber);

// Valid if exactly 10 digits remain
$valid = strlen($number) === 10;
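Wrapped in a function, the strip-and-count approach might look like the sketch below. The function name is hypothetical, and note that this check is deliberately loose: any input that happens to contain exactly ten digits passes.

```php
<?php
// Hypothetical helper name; not from the talk.
function hasTenDigits(string $potentialNumber): bool
{
    // Remove everything that is not a digit, then count what is left.
    $number = preg_replace('/[^0-9]/', '', $potentialNumber);

    return strlen($number) === 10;
}

$dotted  = hasTenDigits('720.675.7471');
$parens  = hasTenDigits('(720) 675-7471');
$tooFew  = hasTenDigits('675-7471');
```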

Page 45: Grokking regex

Regex Anchors

Page 46: Grokking regex

Specify Position With Anchors

/^ab/ - Matches abcdefg but not cab

/ab$/ - Matches cab but not abcdefg

/^[a-z]+$/ - Matches a string of only lowercase characters
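The anchor examples can be verified directly. A small sketch using the strings from the slide plus two made-up ones:

```php
<?php
// ^ pins the match to the start of the string.
$startHit  = preg_match('/^ab/', 'abcdefg');
$startMiss = preg_match('/^ab/', 'cab');

// $ pins the match to the end of the string.
$endHit  = preg_match('/ab$/', 'cab');
$endMiss = preg_match('/ab$/', 'abcdefg');

// Both anchors together: the whole string must fit the pattern.
$allLower = preg_match('/^[a-z]+$/', 'lowercase');
$hasUpper = preg_match('/^[a-z]+$/', 'Mixed');
$hasSpace = preg_match('/^[a-z]+$/', 'two words');
```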

Page 47: Grokking regex

Word Boundaries

\b means word boundaries
● Before first character if first character is word character
● After last character if word character
● Between two characters if one is a word character and the other isn't

/\bfish\b/ matches fish but not fisherman or catfish
/fish\b/ matches fish and catfish
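A sketch of the word boundary examples, with one extra made-up sentence:

```php
<?php
// Boundaries on both sides: only the standalone word matches.
$fish      = preg_match('/\bfish\b/', 'fish');
$sentence  = preg_match('/\bfish\b/', 'one fish, two fish');
$fisherman = preg_match('/\bfish\b/', 'fisherman');
$catfish   = preg_match('/\bfish\b/', 'catfish');

// Boundary only at the end: anything ending in fish matches.
$endsInFish    = preg_match('/fish\b/', 'catfish');
$notAtTheEnd   = preg_match('/fish\b/', 'fisherman');
```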

Page 48: Grokking regex

Alternation

/cow|boy/ - Matches cow or boy or cowboy or coward, etc
/\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward

Parens capture the matching word - more on that later
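The effect of the parentheses and boundaries can be sketched as follows. The sentence is made up:

```php
<?php
// Without boundaries, the alternatives match inside longer words.
$coward = preg_match('/cow|boy/', 'coward');
$cowboy = preg_match('/cow|boy/', 'cowboy');

// With boundaries around the group, only whole words match.
$cowboyWhole = preg_match('/\b(cow|boy)\b/', 'cowboy');
$cowardWhole = preg_match('/\b(cow|boy)\b/', 'coward');

// The parens capture whichever alternative matched.
$found = preg_match('/\b(cow|boy)\b/', 'the boy ran', $m);
```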

Page 49: Grokking regex

Greedy vs Lazy

Default is greedy - match as much as possible

Grab starting HTML tag:
/<.+>/
Matches the entire string, not just the tag: <h1>Welcome to Tek</h1>

Not what we want.

Page 50: Grokking regex

Make it lazy.

Page 51: Grokking regex

Lazy Matching

/<.+?>/

Now matches only the opening <h1> tag in:

<h1>Welcome to Tek</h1>
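Running both versions against the same string shows the difference between greedy and lazy matching. The heading text is the one from the greedy slide:

```php
<?php
$html = '<h1>Welcome to Tek</h1>';

// Greedy: .+ grabs as much as it can, so the match runs to the last >.
preg_match('/<.+>/', $html, $greedy);

// Lazy: .+? grabs as little as it can, so the match stops at the first >.
preg_match('/<.+?>/', $html, $lazy);
```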

Page 52: Grokking regex

Another way to match tags

/<[^>]+>/

Literally match: “Less than” followed by one or more non-“greater than” characters followed by a “greater than” character.

Usually faster than the lazy example, because the negated character class needs little or no backtracking.
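A sketch of the negated character class approach, collecting every tag in a sample string:

```php
<?php
$html = '<h1>Welcome to Tek</h1>';

// [^>]+ can never run past a closing >, so each tag is matched on its own.
$count = preg_match_all('/<[^>]+>/', $html, $tags);
```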

Page 53: Grokking regex

Capture Part of Regex

Page 54: Grokking regex

Capturing Regex - Backreference

/__(construct|destruct)/

Backreference will contain construct or destruct so you can use it later

/([a-z]+)\1/ - Matches repeated sequence of characters

Page 55: Grokking regex

Backreference

/([a-z]{3})\1/

Matches words like booboo or bambam
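Backreferences inside a pattern can be sketched like this. The words other than booboo and bambam are made up for contrast:

```php
<?php
// \1 must repeat exactly what the first group captured.
$booboo = preg_match('/([a-z]{3})\1/', 'booboo', $m);
$bambam = preg_match('/([a-z]{3})\1/', 'bambam');
$bamboo = preg_match('/([a-z]{3})\1/', 'bamboo'); // "bam" is not followed by "bam"

// With + instead of {3}, any repeated run matches, even a doubled letter.
preg_match('/([a-z]+)\1/', 'hello', $double);
```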

Page 56: Grokking regex

Practical Backreference Uses

Search and replace

preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '($1) $2-$3', $phone);

Format phone numbers from a variety of input styles as (xxx) xxx-xxxx
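A runnable sketch of the search and replace, assuming the replacement string refers to the three capture groups as $1, $2 and $3:

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

$inputs = ['720 675 7471', '7206757471', '(720)675-7471', '720-675-7471'];

$formatted = [];
foreach ($inputs as $phone) {
    // $1, $2 and $3 are replaced by the captured digit groups.
    $formatted[] = preg_replace($pattern, '($1) $2-$3', $phone);
}
```

Every input style ends up in the same (xxx) xxx-xxxx shape.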

Page 57: Grokking regex

More Practical Backreferences

preg_replace( '/\b(\w+)\s+\1\b/', '\1', $string);

Replace duplicated words that that have been inadvertently been left in.

Replace duplicated words that have been inadvertently been left in.
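Applied to the sentence on the slide, the replacement behaves as sketched below. Only words repeated back to back are collapsed, which is why the second "been" survives:

```php
<?php
$string = 'Replace duplicated words that that have been inadvertently been left in.';

// \1 in the pattern demands the same word again; \1 in the replacement keeps one copy.
$fixed = preg_replace('/\b(\w+)\s+\1\b/', '\1', $string);
```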

Page 58: Grokking regex

Non-capturing groups

Match an IPv4 address

/((?:\d{1,3}\.){3}\d{1,3})/

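Matches 1-3 digits followed by a dot, repeated 3 times, then a final 1-3 digits. The inner (?:...) group is only for repetition and is not captured; the outer parens capture the whole address.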
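The sketch below shows what the non-capturing group does to the match array. It also shows a limitation: the pattern only checks the shape of the address, not the numeric range, so a stricter check is added with PHP's filter_var as one possible alternative (a substitution for illustration, not something the slides prescribe). The log line is made up.

```php
<?php
$pattern = '/((?:\d{1,3}\.){3}\d{1,3})/';

preg_match($pattern, 'Server at 192.168.1.10 is up', $m);
// $m[0] is the full match, $m[1] is the outer capture group.
// The (?: ) group adds no entry of its own.
$groupCount = count($m);

// Shape only: out-of-range octets still match.
$bogus = preg_match($pattern, '999.999.999.999');

// Range-aware validation; returns the address or false.
$strict = filter_var('999.999.999.999', FILTER_VALIDATE_IP, FILTER_FLAG_IPV4);
```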


Page 60: Grokking regex

Pattern Modifiers

Modifiers after the last delimiter:

i - case insensitive matching
m - multiline matching
s - dot matches all characters, including \n
x - ignore whitespace characters if not escaped or in a character class
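Each of these modifiers can be seen in action by running the same pattern with and without it. A sketch with made-up inputs:

```php
<?php
// i: ignore case.
$withoutI = preg_match('/php/', 'PHP');
$withI    = preg_match('/php/i', 'PHP');

// m: ^ and $ also match at line breaks inside the string.
$text = "one\ntwo\nthree";
preg_match_all('/^\w+/', $text, $firstOnly);
preg_match_all('/^\w+/m', $text, $everyLine);

// s: let the dot match a newline as well.
$withoutS = preg_match('/a.b/', "a\nb");
$withS    = preg_match('/a.b/s', "a\nb");

// x: whitespace in the pattern is ignored, so it can be spaced out for reading.
$withoutX = preg_match('/ \d{3} - \d{4} /', '675-7471');
$withX    = preg_match('/ \d{3} - \d{4} /x', '675-7471');
```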

Page 61: Grokking regex

More Pattern Modifiers

D - $ anchor matches at the very end of the string only (not before a trailing newline)
U - Invert the meaning of greediness

Other modifiers can be seen here:

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
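The D and U modifiers are easier to understand with a concrete case. A sketch, using made-up inputs:

```php
<?php
// Without D, $ is also satisfied just before a trailing newline.
$lenient = preg_match('/^\d+$/', "123\n");

// With D, $ only matches at the very end of the string.
$strict = preg_match('/^\d+$/D', "123\n");

// U flips greediness: .+ becomes lazy without writing .+?
preg_match('/<.+>/U', '<h1>Hi</h1>', $ungreedy);
```

The first result matters for input validation, where a stray trailing newline is usually not wanted.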

Page 62: Grokking regex

Named Capture Groups

Instead of numbers, get back names

No need to renumber in code later if you add another capture group

Page 63: Grokking regex

Named Capture Group - Phone

preg_match('/
    \(?                    # opt. open paren
    (?P<area_code>\d{3})   # area code
    \)?                    # opt. closed paren
    [ -]?                  # opt. space or dash
    (?P<exchange>\d{3})    # exchange
    [ -]?                  # opt. space or dash
    (?P<number>\d{4})      # last 4 digits
/x', // ignore spaces and comment stuff
$number, $matches);

Page 64: Grokking regex

Named Capture Group Result

array(7) {
  [0] => string(10) "7206757471"
  ['area_code'] => string(3) "720"
  [1] => string(3) "720"
  ['exchange'] => string(3) "675"
  [2] => string(3) "675"
  ['number'] => string(4) "7471"
  [3] => string(4) "7471"
}
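A self-contained sketch of the named capture example follows. The input value is taken from the result slide. One assumption worth flagging: the comments inside the pattern avoid the / character, because PHP treats the first unescaped / as the closing delimiter even when it sits inside an x-mode comment.

```php
<?php
$number = '7206757471';

$result = preg_match('/
    \(?                    # opt. open paren
    (?P<area_code>\d{3})   # area code
    \)?                    # opt. closed paren
    [ -]?                  # opt. space or dash (no slash here: it is the delimiter)
    (?P<exchange>\d{3})    # exchange
    [ -]?                  # opt. space or dash
    (?P<number>\d{4})      # last 4 digits
/x', $number, $matches);

// Each named group is available under its name and under its number.
$areaCode = $matches['area_code'];
$exchange = $matches['exchange'];
$last4    = $matches['number'];
```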

Page 65: Grokking regex

Positive Look Ahead Matches

Find a pattern followed by another pattern

/p(?=h)/ - Match a p followed by an "h" but don't include the "h"

Matches "phone", "phish", "telegraph"

Does not match "potassium"

Page 66: Grokking regex

Negative Look Ahead

Look for a pattern which is not followed by some other pattern

/p(?!h)/ - p not followed by h

Matches potassium

Does not match phone, telegraph or phish

Page 67: Grokking regex

Look aheads

● Positive and negative lookaheads do not capture anything
● They determine if a match is possible
● They are zero-width
● /p[^h]/ is not the same as /p(?!h)/
● /ph/ is not the same as /p(?=h)/
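The points above can be sketched in code. The word "stop" is a made-up extra that shows why /p[^h]/ and /p(?!h)/ differ:

```php
<?php
// Positive lookahead: the h must follow, but is not part of the match.
preg_match('/p(?=h)/', 'phone', $ahead);
preg_match('/ph/', 'phone', $plain);
$noH = preg_match('/p(?=h)/', 'potassium');

// Negative lookahead: the p must not be followed by h.
$potassium = preg_match('/p(?!h)/', 'potassium');
$telegraph = preg_match('/p(?!h)/', 'telegraph');

// [^h] needs an actual character after the p; (?!h) does not.
$classAtEnd     = preg_match('/p[^h]/', 'stop');
$lookaheadAtEnd = preg_match('/p(?!h)/', 'stop');
```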

Page 68: Grokking regex

Look behinds

Positive Look Behind
/(?<=oo)d/ - d preceded by oo
- Matches the d in "food" and "mood"

Negative Look Behind
/(?<!oo)d/ - d not preceded by oo
- Matches "dude", "crude" and "d"
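A sketch of both look behinds. The replacement example uses a made-up phrase to show that only the d is matched, not the oo before it:

```php
<?php
// Positive look behind: only a d that comes right after oo.
$marked = preg_replace('/(?<=oo)d/', 'D', 'food and mood');

// Negative look behind: a d that does not come right after oo.
$food = preg_match('/(?<!oo)d/', 'food');
$dude = preg_match('/(?<!oo)d/', 'dude');
$lone = preg_match('/(?<!oo)d/', 'd');
```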

Page 69: Grokking regex

With Great Power...

Test your regular expressions before they go to production

It's much easier to get them wrong than to get them right if you don't test

Use tools like Sublime Text, Atom

Page 70: Grokking regex

When to not use regex

When they are not needed

If you can use strstr, strpos or str_replace

If you cannot use those, maybe regex is appropriate
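For fixed text, the plain string functions do the same job without a pattern. A sketch with made-up inputs:

```php
<?php
$haystack = 'Grokking regex at php[tek]';

// Is a fixed substring present? strpos is enough (note the strict !== false).
$viaStrpos = strpos($haystack, 'regex') !== false;
$viaRegex  = preg_match('/regex/', $haystack) === 1;

// Replacing fixed text: str_replace instead of preg_replace.
$replaced = str_replace('cat', 'dog', 'cat and cat');

// Everything from the first occurrence onward: strstr.
$domainPart = strstr('user@example.com', '@');
```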

Don't use regex when you need a parser

Page 71: Grokking regex

Resources

http://regular-expressions.info
http://php.net/manual/en/ref.pcre.php
http://www.php.net/manual/en/reference.pcre.pattern.syntax.php

Page 72: Grokking regex

Photo Credits
● http://www.flickr.com/photos/justinbaeder/5317820857 (Hammer & Screw)
● http://www.flickr.com/photos/doug88888/5891638442 (Water Pattern)
● http://www.flickr.com/photos/mwparenteau/7566437660 (Laxative Cereal)
● http://www.flickr.com/photos/auyuchuco/3669864253 (Mantis Shrimp)
● http://www.flickr.com/photos/anderspiren/4678572968 (Spray Can)
● http://www.flickr.com/photos/dcmatt/473127479 (Comedy Club)
● http://www.flickr.com/photos/gschueler/72294706 (License Plate)
● http://www.flickr.com/photos/horiavarlan/4514164700 (Puzzle @ sign)
● http://www.flickr.com/photos/proimos/4199675334 (Facepalm)
● http://www.flickr.com/photos/mklapper/5812224468 (Teacher in Classroom)
● http://www.flickr.com/photos/light_arted/3927322326 (Anchor)
● http://www.flickr.com/photos/kpcauchi/5376768095 (Lizard)
● http://www.flickr.com/photos/focusshoot/5617788347 (Spider web)
● http://www.flickr.com/photos/oberazzi/318947873 (Cuff links)

Page 74: Grokking regex

Please rate this talk
https://joind.in/10642