grokking regex
DESCRIPTION
Understanding regular expressions gives developers another extremely useful and powerful tool they can use to perform some operations that would otherwise be very tedious or difficult. This presentation goes over how to build and test regular expressions so developers can start using them within their own code.
TRANSCRIPT
php[tek] 2014
David Stockton
May 21, 2014
Grokking Regex
What are regular expressions?
Patterns to describe text
Regular
Extremely Powerful
Often Abused.
Regular Expression Joke
How to use regex in PHP
● The preg_* functions
○ Use Perl compatible regular expressions
○ Probably the most common regex syntax
● Don't use ereg_* functions
PHP Functions
preg_match - Search a subject for a match
preg_match_all - Searches a subject for all matches
preg_replace - Replace a pattern with something else
preg_split - Split a string based on regex delimiter
PHP Functions
preg_replace_callback - Replacement defined in a callback
preg_grep - Return array of elements that match a pattern
preg_quote - Quote regular expression characters
preg_last_error - Error code of last regex function
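A minimal sketch of the functions listed above, assuming PHP 7 or later. The sample strings are made up for illustration and are not from the talk.

```php
<?php
$subject = 'php[tek] 2014 ran May 21, 2014';

// preg_match: stops at the first match; returns 1 if found, 0 if not.
$hasYear = preg_match('/\d{4}/', $subject, $first);

// preg_match_all: collects every match; returns how many were found.
$count = preg_match_all('/\d+/', $subject, $all);

// preg_replace: swap each match for a replacement string.
$masked = preg_replace('/\d/', '#', $subject);

// preg_split: split on a regex delimiter (commas and/or whitespace here).
$parts = preg_split('/[\s,]+/', 'a, b  c,d');

// preg_replace_callback: compute each replacement in a function.
$doubled = preg_replace_callback('/\d+/', function ($m) {
    return (string) ($m[0] * 2);
}, 'a1 b22');

// preg_grep: keep only the array elements that match (keys are preserved).
$numeric = preg_grep('/^\d+$/', ['12', 'ab', '7']);

// preg_quote: escape regex metacharacters in literal text.
$literal = preg_quote('1+1=2?');

// preg_last_error: check whether the last preg_* call hit an error.
$ok = preg_last_error() === PREG_NO_ERROR;
```

The third argument of preg_match and preg_match_all is filled by reference, so `$first` and `$all` do not need to exist beforehand.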
Starting Pattern
● Matches letters, numbers, dots, underscore, plus, equals, dash (1 or more)
● Followed by @
● Followed by letters, numbers, dots and dashes (1 or more)
● Followed by a dot
● Followed by 2 to 4 letters
/[A-Z0-9._+=-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i
What does it mean?
Email Addresses
Some Email Addresses
The "real" email address regex
[Slide filled edge to edge with the first part of the well-known RFC 822 address regex: thousands of characters of nested (?: ... ) groups and character classes.]
More "real" regex
[A second full slide continuing the same pattern.]
How do we implement this regex?
Time for real learning
Letters and Numbers
Letters and numbers match... letters and numbers
/a/ - Matches a string that contains "a"
/7/ - Matches a string that contains a 7
Match a word
/regex/ - Matches a string with the word "regex" in it
Match a choice of words
Use pipe when you want a choice
/pizza|steak|cheeseburger/
Delimiters
So far, delimiters have been /
Delimiters tell PHP where the pattern starts and ends
Can use other delimiters
#\\My\\PHP\\Namespace#
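A short sketch of why another delimiter can help, assuming PHP 7 or later. The class name `Acme\My\PHP\Namespace\Widget` is hypothetical. Note that inside a single-quoted PHP string `\\` is one backslash, so preg_quote is used here to produce the doubled backslashes the regex engine needs.

```php
<?php
$url = 'https://joind.in/10642';

// With / as the delimiter, every literal slash must be escaped.
$withSlash = preg_match('/^https?:\/\//', $url);

// With # as the delimiter, the slashes can stay as they are.
$withHash = preg_match('#^https?://#', $url);

// Hypothetical fully qualified class name (PHP string: Acme\My\PHP\Namespace\Widget).
$class = 'Acme\\My\\PHP\\Namespace\\Widget';

// preg_quote escapes each backslash (and the delimiter, if it appears).
$pattern = '#' . preg_quote('\\My\\PHP\\Namespace', '#') . '#';
$found = preg_match($pattern, $class);
```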
Character Matching
/[Pp][Hh][Pp]/ - Matches PHP in any case
Define ranges
/[abcdefghijklmnopqrstuvwxyz]/ - Any lower case alpha
/[a-z]/ - Any lower case alpha
Character Ranges
Combine Ranges:
/[A-Za-z0-9]/ - Matches any alphanumeric
/[A-Fa-f0-9]/ - Matches hex character
Invert Character selection
/[^0-9]/ - Non digit characters
/[^ ]/ - Non space characters
/[.!@#$%^&*]/ - Some punctuation
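A small sketch of ranges and inverted selections in use, assuming PHP 7 or later. The inputs are invented for illustration, and the ^ and $ anchors (covered later in the talk) are added so the whole string must fit the range.

```php
<?php
// Every character must be a hex digit.
$isHex = preg_match('/^[A-Fa-f0-9]+$/', 'c0ffee');

// Fails because "o" falls outside A-F, a-f and 0-9.
$notHex = preg_match('/^[A-Fa-f0-9]+$/', 'coffee');

// An inverted selection removes everything that is not a digit.
$digitsOnly = preg_replace('/[^0-9]/', '', 'Room 101-B');
```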
Special Characters
Dot (.) matches any character (except a newline, unless the s modifier is set)
/./
/../ - Matches any two characters
To match an actual dot character, escape it
/\./
Not needed in character selection
/[.]/
Character Classes
\d means [0-9] (digit; with the u modifier it also matches Unicode digits)
\D means [^0-9]
\w means word characters - [A-Za-z0-9_]
\W means non word - [^A-Za-z0-9_]
\s means whitespace character - [ \t\n\r]
\S means non-whitespace characters
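The shorthand classes above can be sketched like this, assuming PHP 7 or later and an invented log line. The subject uses double quotes so that `\t` is a real tab.

```php
<?php
$line = "user_42\tlogged in";

// \w+ grabs runs of word characters (letters, digits, underscore).
preg_match_all('/\w+/', $line, $words);

// \d picks out the individual digits.
preg_match_all('/\d/', $line, $digits);

// \s+ collapses any run of whitespace (including the tab) to one space.
$collapsed = preg_replace('/\s+/', ' ', $line);
```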
Repetition
Match two digits in a row
● /\d\d/
● /[0-9][0-9]/
● /\d{2}/
● /[0-9]{2}/
Match at least one, as many as possible
/\d+/
Zero or more: /\d*/
Repetition Repeated
● * match 0 or more
● + match 1 or more
● {x} match exactly x
● {x,} match x or more
● {0,y} match up to y (PCRE versions bundled before PHP 8.4 treat {,y} as literal text)
● {x,y} match between x and y
More special characters
? - Preceding selection is optional
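A quick sketch of the quantifiers, assuming PHP 7 or later. The test strings are invented, and the patterns are anchored with ^ and $ so the counts apply to the whole string.

```php
<?php
// {2,4}: between two and four digits, nothing else.
$tooShort = preg_match('/^\d{2,4}$/', '7');
$justRight = preg_match('/^\d{2,4}$/', '42');
$tooLong = preg_match('/^\d{2,4}$/', '12345');

// ?: the "u" is optional, so both spellings match.
$american = preg_match('/^colou?r$/', 'color');
$british = preg_match('/^colou?r$/', 'colour');
$typo = preg_match('/^colou?r$/', 'colouur');
```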
Step by Step
/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
Break it down
/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
/ - Opening delimiter
\(? - Optional open paren
( ) - Capture group - Parens capture pattern inside
(\d{3}) - Three digits (captured)
\)? - Optional closing paren
[\s-] - Space or dash character
[\s-]? - Optional space or dash character
(\d{3}) - Another three digit capture group
[\s-]? - Optional space or dash character
(\d{4}) - Capture group for four digits
/ - Closing delimiter
More special characters
Put it together:
/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/
Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471
Phone number matching
Does not match 720.675.7471 or a number of other formats.
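The claims above can be checked with a small loop. This is a sketch assuming PHP 7 or later; it uses the sample numbers from the slides.

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

$samples = [
    '720-675-7471',
    '(720)675-7471',
    '(720) 675-7471',
    '7206757471',
    '720 675 7471',
    '720.675.7471',   // dots are not in [\s-], so this one fails
];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$sample] = preg_match($pattern, $sample);
}

// The three capture groups hold area code, exchange and last four digits.
preg_match($pattern, '(720) 675-7471', $groups);
```

Because the pattern has no anchors, it also matches a phone number embedded in a longer string; wrapping it in ^ and $ is one way to require that the whole input is a phone number.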
Other ways?
Replace all non-digits, check for length of 10
PHP Code
$number = preg_replace('/[^0-9]/', '', $potentialNumber);
$valid = strlen($number) === 10;
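Wrapped in a function, the strip-and-count approach might look like the sketch below. The function name `isTenDigitNumber` is made up for this example, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when the input contains exactly ten digits
// once every non-digit character has been removed.
function isTenDigitNumber(string $potentialNumber): bool
{
    $number = preg_replace('/[^0-9]/', '', $potentialNumber);

    return strlen($number) === 10;
}

$dotted = isTenDigitNumber('720.675.7471');
$parens = isTenDigitNumber('(720) 675-7471');
$short = isTenDigitNumber('675-7471');
```

Unlike the single-pattern approach, this accepts the dotted format, because the separators are thrown away before the length check.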
Regex Anchors
Specify Position With Anchors
/^ab/ - Matches abcdefg but not cab
/ab$/ - Matches cab but not abcdefg
/^[a-z]+$/ - Matches a string of only lowercase characters
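A sketch of the anchor examples above, assuming PHP 7 or later. The last two lines show a detail worth knowing: by default $ also matches just before a trailing newline, which the D modifier turns off.

```php
<?php
$startsWithAb = preg_match('/^ab/', 'abcdefg');
$cabStart = preg_match('/^ab/', 'cab');

$endsWithAb = preg_match('/ab$/', 'cab');
$abcEnd = preg_match('/ab$/', 'abcdefg');

$allLower = preg_match('/^[a-z]+$/', 'regex');
$mixedCase = preg_match('/^[a-z]+$/', 'Regex');

// $ tolerates one trailing newline unless D is set.
$withNewline = preg_match('/^[a-z]+$/', "regex\n");
$strictEnd = preg_match('/^[a-z]+$/D', "regex\n");
```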
Word Boundaries
\b means word boundaries
● Before first character if first character is word character
● After last character if word character
● Between two characters if one is a word character and the other isn't
/\bfish\b/ matches fish but not fisherman or catfish
/fish\b/ matches fish and catfish
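The word boundary examples can be verified with a sketch like this, assuming PHP 7 or later.

```php
<?php
// \b on both sides: "fish" must stand alone as a word.
$wholeWord = preg_match('/\bfish\b/', 'one fish, two fish');
$fisherman = preg_match('/\bfish\b/', 'fisherman');
$catfish = preg_match('/\bfish\b/', 'catfish');

// \b only at the end: anything may come before "fish".
$catfishEnd = preg_match('/fish\b/', 'catfish');
$fishermanEnd = preg_match('/fish\b/', 'fisherman');
```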
Alternation
/cow|boy/ - Matches cow or boy or cowboy or coward, etc
/\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward
Parens capture the matching word - more on that later
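A sketch comparing the two alternation patterns on one invented sentence, assuming PHP 7 or later.

```php
<?php
$text = 'a boy and a cow, not a cowboy or coward';

// Without boundaries, the pieces inside "cowboy" and "coward" match too.
$looseCount = preg_match_all('/cow|boy/', $text, $loose);

// With \b and a group, only the standalone words match.
$strictCount = preg_match_all('/\b(cow|boy)\b/', $text, $strict);

// $strict[1] holds what the capture group matched each time.
$captured = $strict[1];
```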
Greedy vs Lazy
Default is greedy - match as much as possible
Grab starting HTML tag:
/<.+>/
Matches the entire string: <h1>Welcome to Tek</h1>
Not what we want.
Make it lazy.
Lazy Matching
/<.+?>/
Now matches only the opening <h1> in:
<h1>Welcome to Tek</h1>
Another way to match tags
/<[^>]+>/
Literally match: “Less than” followed by one or more non-“greater than” characters followed by a “greater than” character.
Faster than the last example. No backtracking.
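The three tag patterns side by side, as a sketch assuming PHP 7 or later.

```php
<?php
$html = '<h1>Welcome to Tek</h1>';

// Greedy: .+ runs to the end, then backs up to the LAST ">".
preg_match('/<.+>/', $html, $greedy);

// Lazy: .+? expands one character at a time and stops at the FIRST ">".
preg_match('/<.+?>/', $html, $lazy);

// Negated class: [^>]+ can never run past a ">", so no backing up is needed.
preg_match('/<[^>]+>/', $html, $negated);
```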
Capture Part of Regex
Capturing Regex - Backreference
/__(construct|destruct)/
Backreference will contain construct or destruct so you can use it later
/([a-z]+)\1/
Matches repeated sequence of characters
Backreference
/([a-z]{3})\1/
Matches words like booboo or bambam
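A sketch of captures and in-pattern backreferences, assuming PHP 7 or later. The method signature string is invented for illustration.

```php
<?php
// The capture group tells us which magic method was found.
preg_match('/__(construct|destruct)/', 'public function __construct()', $method);

// \1 must repeat exactly what group 1 matched.
$booboo = preg_match('/([a-z]{3})\1/', 'booboo', $repeat);
$bambam = preg_match('/([a-z]{3})\1/', 'bambam');

// "ban" is followed by "ana", not "ban", so there is no match.
$banana = preg_match('/([a-z]{3})\1/', 'banana');
```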
Practical Backreference Uses
Search and replace
preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '(\1) \2-\3', $phone);
Format phone numbers from a variety of input styles as (xxx) xxx-xxxx
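A sketch of the reformatting idea, assuming PHP 7 or later. In the replacement string, `\1`, `\2` and `\3` (or equivalently `$1`, `$2`, `$3`) stand for the three captured groups.

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

// Backslash-style references to the capture groups.
$fromSpaces = preg_replace($pattern, '(\1) \2-\3', '720 675 7471');
$fromDigits = preg_replace($pattern, '(\1) \2-\3', '7206757471');

// Dollar-style references behave the same way.
$fromParens = preg_replace($pattern, '($1) $2-$3', '(720)675-7471');
```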
More Practical Backreferences
preg_replace( '/\b(\w+)\s+\1\b/', '\1', $string);
Replace duplicated words that that have been inadvertently been left in.
Replace duplicated words that have been inadvertently been left in.
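Applied to the sentence on the slide, the replacement might be sketched as follows, assuming PHP 7 or later. The second call adds the i modifier, which is an addition for this example, so that differently capitalised repeats are caught as well.

```php
<?php
$before = 'Replace duplicated words that that have been inadvertently been left in.';

// \b(\w+)\s+\1\b finds a word, whitespace, then the same word again.
$after = preg_replace('/\b(\w+)\s+\1\b/', '\1', $before);

// With i, "The the" counts as a repeat; the first spelling is kept.
$caseless = preg_replace('/\b(\w+)\s+\1\b/i', '\1', 'The the end');
```

Only adjacent repeats are removed, which is why the two separated occurrences of "been" survive.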
Non-capturing groups
Match an IPv4 address
/((?:\d{1,3}\.){3}\d{1,3})/
Match 1-3 digits followed by a dot, repeated 3 times, then 1-3 more digits. The (?: ) group is repeated but not captured, so only the whole address is captured.
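A sketch of the pattern in use, assuming PHP 7 or later. The pattern accepts any 1-3 digits per part, so 999.1.1.1 would pass; the hypothetical helper `looksLikeIpv4` adds a range check as one possible way to tighten it. PHP's built-in `filter_var` with `FILTER_VALIDATE_IP` is an alternative that avoids regex entirely.

```php
<?php
// Only the outer group captures, so $m has just two entries.
preg_match('/((?:\d{1,3}\.){3}\d{1,3})/', 'host 10.0.0.7 up', $m);

// Hypothetical helper: shape check via regex, then a numeric range check.
function looksLikeIpv4(string $candidate): bool
{
    if (!preg_match('/^((?:\d{1,3}\.){3}\d{1,3})$/D', $candidate, $parts)) {
        return false;
    }
    foreach (explode('.', $parts[1]) as $octet) {
        if ((int) $octet > 255) {
            return false;
        }
    }

    return true;
}

$good = looksLikeIpv4('192.168.0.1');
$outOfRange = looksLikeIpv4('999.1.1.1');
$tooFewParts = looksLikeIpv4('1.2.3');
```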
Pattern Modifiers
Modifiers after the last delimiter:
i - case insensitive matching
m - multiline matching (^ and $ also match at line breaks)
s - dot matches all characters, including \n
x - ignore whitespace characters if not escaped or in a character class
More Pattern Modifiers
D - $ matches at the end of the string only (not before a trailing newline)
U - Invert the meaning of greediness
Other modifiers can be seen here:
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
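Each modifier can be seen in a one-line sketch, assuming PHP 7 or later. The subjects are invented for illustration.

```php
<?php
// i: case is ignored.
$caseless = preg_match('/php/i', 'PHP');

// m: ^ and $ match at every line, not just the ends of the string.
$singleLine = preg_match_all('/^\w+$/', "one\ntwo", $ignored);
$multiLine = preg_match_all('/^\w+$/m', "one\ntwo", $lines);

// s: the dot is allowed to match a newline.
$dotPlain = preg_match('/a.b/', "a\nb");
$dotAll = preg_match('/a.b/s', "a\nb");

// x: unescaped whitespace in the pattern is ignored.
$extended = preg_match('/ \d{3} - \d{4} /x', '675-7471');

// U: quantifiers become lazy by default.
preg_match('/<.+>/U', '<h1>Hi</h1>', $ungreedy);
```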
Named Capture Groups
Instead of numbers, get back names
No need to renumber in code later if you add another capture group
Named Capture Group - Phone
preg_match('/
    \(?                    # opt. open paren
    (?P<area_code>\d{3})   # area code
    \)?                    # opt. closed paren
    [ -]?                  # opt. space or dash
    (?P<exchange>\d{3})    # exchange
    [ -]?                  # opt. space or dash
    (?P<number>\d{4})      # last 4 digits
    /x',                   // x: ignore whitespace and allow # comments
    $number, $matches);
Named Capture Group Result
array(7) {
[0] => string(10) "7206757471"
['area_code'] => string(3) "720"
[1] => string(3) "720"
['exchange'] => string(3) "675"
[2] => string(3) "675"
['number'] => string(4) "7471"
[3] => string(4) "7471"
}
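A sketch of reading the named groups back, assuming PHP 7 or later. One thing to watch with the x modifier: the delimiter character must not appear inside the pattern's comments, because PHP looks for the closing delimiter before the regex engine ever sees the comments.

```php
<?php
$pattern = '/
    \(?                    # opt. open paren
    (?P<area_code>\d{3})   # area code
    \)?                    # opt. closed paren
    [ -]?                  # opt. space or dash
    (?P<exchange>\d{3})    # exchange
    [ -]?                  # opt. space or dash
    (?P<number>\d{4})      # last 4 digits
/x';

preg_match($pattern, '(720) 675-7471', $matches);

// Names keep working even if more groups are added to the pattern later.
$formatted = sprintf(
    '(%s) %s-%s',
    $matches['area_code'],
    $matches['exchange'],
    $matches['number']
);
```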
Positive Look Ahead Matches
Find a pattern followed by another pattern
/p(?=h)/ - Match a p followed by an "h" but don't include the "h"
Matches "phone", "phish", "telegraph"
Does not match "potassium"
Negative Look Ahead
Look for a pattern which is not followed by some other pattern
/p(?!h)/ - p not followed by h
Matches potassium
Does not match phone, telegraph or phish
Look aheads
● Positive and negative lookaheads do not capture anything
● They determine if a match is possible
● They are zero-width
● /p[^h]/ is not the same as /p(?!h)/
● /ph/ is not the same as /p(?=h)/
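The differences listed above can be sketched as follows, assuming PHP 7 or later. The word "stop" is an added example chosen because its p is the last character.

```php
<?php
// The lookahead checks for "h" but leaves it out of the match.
preg_match('/p(?=h)/', 'phone', $lookahead);
preg_match('/ph/', 'phone', $plain);

// At the end of a string, (?!h) succeeds: nothing follows, so no "h" follows.
$negativeAtEnd = preg_match('/p(?!h)/', 'stop');

// [^h] must consume a real character, and there is none left.
$classAtEnd = preg_match('/p[^h]/', 'stop');

$potassium = preg_match('/p(?!h)/', 'potassium');
$telegraph = preg_match('/p(?!h)/', 'telegraph');
```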
Look behinds
Positive Look Behind
/(?<=oo)d/ - d preceded by oo
- Matches the d in "food" and "mood"
Negative Look Behind
/(?<!oo)d/ - d not preceded by oo
- Matches "dude", "crude" and "d"
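A sketch of both lookbehinds, assuming PHP 7 or later. `PREG_OFFSET_CAPTURE` is used to show that only the d itself is part of the match.

```php
<?php
// Each match entry becomes [text, byte offset].
preg_match('/(?<=oo)d/', 'food', $behind, PREG_OFFSET_CAPTURE);

// No d in "dude" has "oo" in front of it.
$dudePositive = preg_match('/(?<=oo)d/', 'dude');

// The negative form flips both results.
$foodNegative = preg_match('/(?<!oo)d/', 'food');
$dudeNegative = preg_match('/(?<!oo)d/', 'dude');
```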
With Great Power...
Test your regular expressions before they go to production
It's much easier to get them wrong than to get them right if you don't test
Use tools like Sublime Text, Atom
When to not use regex
When they are not needed
If you can use strstr, strpos or str_replace
If you cannot use those, maybe regex is appropriate
Don't use regex when you need a parser
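For fixed text, the plain string functions named above are enough. A sketch assuming PHP 7 or later, with an invented subject string:

```php
<?php
$haystack = 'Grokking Regex at php[tek]';

// strpos returns an offset or false, so compare with !== false.
$contains = strpos($haystack, 'Regex') !== false;

// str_replace swaps literal text; no escaping rules to remember.
$renamed = str_replace('Regex', 'Regular Expressions', $haystack);

// strstr returns the rest of the string from the first occurrence.
$tail = strstr($haystack, 'at');
```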
Resources
http://regular-expressions.info
http://php.net/manual/en/ref.pcre.php
http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
Photo Credits
● http://www.flickr.com/photos/justinbaeder/5317820857 (Hammer & Screw)
● http://www.flickr.com/photos/doug88888/5891638442 (Water Pattern)
● http://www.flickr.com/photos/mwparenteau/7566437660 (Laxative Cereal)
● http://www.flickr.com/photos/auyuchuco/3669864253 (Mantis Shrimp)
● http://www.flickr.com/photos/anderspiren/4678572968 (Spray Can)
● http://www.flickr.com/photos/dcmatt/473127479 (Comedy Club)
● http://www.flickr.com/photos/gschueler/72294706 (License Plate)
● http://www.flickr.com/photos/horiavarlan/4514164700 (Puzzle @ sign)
● http://www.flickr.com/photos/proimos/4199675334 (Facepalm)
● http://www.flickr.com/photos/mklapper/5812224468 (Teacher in Classroom)
● http://www.flickr.com/photos/light_arted/3927322326 (Anchor)
● http://www.flickr.com/photos/kpcauchi/5376768095 (Lizard)
● http://www.flickr.com/photos/focusshoot/5617788347 (Spider web)
● http://www.flickr.com/photos/oberazzi/318947873 (Cuff links)
Please rate this talk
https://joind.in/10642