grokking regex

Post on 24-Jun-2015

DESCRIPTION

Understanding regular expressions gives developers another extremely useful and powerful tool they can use to perform some operations that would otherwise be very tedious or difficult. This presentation goes over how to build and test regular expressions so developers can start using them within their own code.

TRANSCRIPT

php[tek] 2014

David Stockton

May 21, 2014

Grokking Regex

What are regular expressions?

Patterns to describe text

Regular

Extremely Powerful

Often Abused.

Regular Expression Joke

How to use regex in PHP

● The preg_* functions
○ Use Perl compatible regular expressions
○ Probably the most common regex syntax

● Don't use ereg_* functions (deprecated since PHP 5.3 and removed in PHP 7)

PHP Functions

preg_match - Search a subject for a match

preg_match_all - Searches a subject for all matches

preg_replace - Replace a pattern with something else

preg_split - Split a string based on regex delimiter

PHP Functions

preg_replace_callback - Replacement defined in a callback

preg_grep - Return array of elements that match a pattern

preg_quote - Quote regular expression characters

preg_last_error - Error code of last regex function
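A minimal sketch of several of the functions listed above, run against one made-up subject string. The sample text and variable names are illustrative only, not from the talk.

```php
<?php
// Hypothetical sample input.
$subject = 'php[tek] 2014 ran May 19-23';

// preg_match: returns 1 on a match, 0 otherwise; the first match lands in $m.
$found = preg_match('/\d+/', $subject, $m);

// preg_match_all: returns the number of matches; full matches land in $all[0].
$count = preg_match_all('/\d+/', $subject, $all);

// preg_replace: replace every match (here, each digit) with "#".
$masked = preg_replace('/\d/', '#', $subject);

// preg_split: split on a regex delimiter (one or more spaces).
$words = preg_split('/ +/', $subject);

// preg_grep: keep only the array elements that match; keys are preserved.
$digitsOnly = preg_grep('/^\d+$/', $words);

// preg_quote: escape regex metacharacters so the text is matched literally.
$literal = preg_quote('php[tek]', '/');
```

Here `$m[0]` is "2014", `$count` is 3, and `$literal` is `php\[tek\]`, which is safe to embed in a larger pattern.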

Starting Pattern

● Matches letters, numbers, dots, underscore, plus, equals (1 or more)
● Followed by @
● Followed by letters, numbers, dots and dashes (1 or more)
● Followed by a dot
● Followed by 2 to 4 letters

/[A-Z0-9._+=]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i

What does it mean?

Email Addresses

Some Email Addresses

The "real" email address regex

More "real" regex

[Two slides filled with the RFC 822 address-matching regular expression, several thousand characters long; the pattern text did not survive extraction.]

How do we implement this regex?

Time for real learning

Letters and Numbers

Letters and numbers match... letters and numbers

/a/ - Matches a string that contains "a"

/7/ - Matches a string that contains a 7

Match a word

/regex/ - Matches a string with the word "regex" in it

Match a choice of words

Use pipe when you want a choice

/pizza|steak|cheeseburger/

Delimiters

So far, delimiters have been /

Delimiters tell the regex engine where the pattern starts and ends

Can use other delimiters

#\\My\\PHP\\Namespace#
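A small sketch of why a different delimiter can help. The URL and the `use` statement are made-up subjects; note that in PHP single-quoted source, each regex `\\` has to be written as four backslashes.

```php
<?php
// Hypothetical subject strings, for illustration only.
$url  = 'http://php.net/manual/en/ref.pcre.php';
$code = 'use \\My\\PHP\\Namespace;';   // the string: use \My\PHP\Namespace;

// With / as the delimiter, every literal slash must be escaped.
$withSlashes = preg_match('/^http:\/\/php\.net\/manual\//', $url);

// With # as the delimiter, the slashes can stay as they are.
$withHashes = preg_match('#^http://php\.net/manual/#', $url);

// The regex #\\My\\PHP\\Namespace# matches literal backslashes.
// PHP's string parser turns each \\\\ below into \\ before PCRE sees it.
$namespace = preg_match('#\\\\My\\\\PHP\\\\Namespace#', $code);
```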

Character Matching

/[Pp][Hh][Pp]/ - Matches PHP in any case

Define ranges

/[abcdefghijklmnopqrstuvwxyz]/ - Any lower case alpha

/[a-z]/ - Any lower case alpha

Character Ranges

Combine Ranges:

/[A-Za-z0-9]/ - Matches any alphanumeric

/[A-Fa-f0-9]/ - Matches hex character

Invert Character selection

/[^0-9]/ - Non digit characters

/[^ ]/ - Non space characters

/[.!@#$%^&*]/ - Some punctuation
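The ranges above can be tried out with a short sketch. The `isHex` helper and the sample strings are hypothetical; the `^`, `$` and `+` it uses are covered later in the talk.

```php
<?php
// Hypothetical helper: true when the whole string consists of hex characters.
// [A-Fa-f0-9] matches one hex character; ^, + and $ stretch it over the string.
function isHex(string $s): bool
{
    return preg_match('/^[A-Fa-f0-9]+$/', $s) === 1;
}

// [^0-9] matches any non-digit, so replacing it with '' keeps only the digits.
$digits = preg_replace('/[^0-9]/', '', 'Room 42, Floor 7');

// A punctuation class: inside [], characters such as . $ and * are literal.
$punctuationCount = preg_match_all('/[.!@#$%^&*]/', 'Wow! 100% regex.');
```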

Special Characters

Dot (.) matches any character (except a newline, by default)

/./

/../ - Matches any two characters

To match an actual dot character, escape it

/\./

Not needed in character selection

/[.]/

Character Classes

\d means [0-9] (digit; with the u modifier it also matches other Unicode digits)

\D means [^0-9]

\w means word characters - [A-Za-z0-9_]

\W means non word - [^A-Za-z0-9_]

\s means whitespace character [ \t\n\r\f]

\S means non-whitespace characters
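A quick sketch of the shorthand classes against a made-up string that contains a tab. The sample text is an assumption for illustration.

```php
<?php
// Hypothetical sample: "id_42", a tab character, then "ok".
$text = "id_42\tok";

preg_match_all('/\d/', $text, $digits);     // single digits
preg_match_all('/\w+/', $text, $words);     // runs of word characters
preg_match_all('/\s/', $text, $spaces);     // whitespace characters
preg_match_all('/\W/', $text, $nonWords);   // anything that is not a word character
```

The underscore counts as a word character, so "id_42" comes back as one run; the tab is both the only whitespace and the only non-word character.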

Repetition

Match two digits in a row
● /\d\d/
● /[0-9][0-9]/
● /\d{2}/
● /[0-9]{2}/

Match at least one, as many as possible

/\d+/

Zero or more: /\d*/

Repetition Repeated

● * match 0 or more
● + match 1 or more
● {x} match exactly x
● {x,} match x or more
● {0,y} match up to y ({,y} is only a quantifier in newer PCRE2 releases; older ones treat it as literal text)
● {x,y} match between x and y
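The quantifiers can be checked side by side with a small table-driven sketch. The patterns are anchored with `^` and `$` (covered later) so the quantifier has to account for the whole string; the sample subjects are made up.

```php
<?php
// Pattern => subjects that should match it in full.
$cases = [
    '/^\d*$/'     => ['', '123'],       // zero or more digits
    '/^\d+$/'     => ['7', '123'],      // one or more digits
    '/^\d{2}$/'   => ['42'],            // exactly two digits
    '/^\d{2,}$/'  => ['42', '12345'],   // two or more digits
    '/^\d{2,4}$/' => ['42', '1234'],    // between two and four digits
];

$allMatch = true;
foreach ($cases as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        $allMatch = $allMatch && preg_match($pattern, $subject) === 1;
    }
}

// Five digits is outside the {2,4} range, so this one does not match.
$tooLong = preg_match('/^\d{2,4}$/', '12345');
```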

More special characters

? - Preceding selection is optional

Step by Step

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Break it down

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

/ - Opening delimiter

\(? - Optional open paren

( ) - Capture group - Parens capture pattern inside

\d{3} - Three digits (captured)

\)? - Optional closing paren

[\s-] - Space or dash character

[\s-]? - Optional space or dash character

(\d{3}) - Another three digit capture group

[\s-]? - Optional space or dash character

(\d{4}) - Capture group for four digits

/ - Closing delimiter

More special characters

Put it together:

/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/

Matches 720-675-7471 or (720)675-7471 or (720) 675-7471 or 7206757471 or 720 675 7471

Phone number matching

Does not match 720.675.7471 or a number of other formats.
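A sketch that runs the pattern over the formats listed above and joins the three captured groups. The loop and the `$results` array are illustrative, not from the talk.

```php
<?php
$pattern = '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/';

$inputs = [
    '720-675-7471',
    '(720)675-7471',
    '(720) 675-7471',
    '7206757471',
    '720 675 7471',
    '720.675.7471',   // dots are not allowed by [\s-], so this one fails
];

$results = [];
foreach ($inputs as $input) {
    // $m[1], $m[2] and $m[3] hold the three captured digit groups.
    $results[$input] = preg_match($pattern, $input, $m) === 1
        ? $m[1] . $m[2] . $m[3]
        : null;
}
```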

Other ways?

Replace all non-digits, check for length of 10

PHP Codes

$number = preg_replace('/[^0-9]/', '', $potentialNumber);

$valid = strlen($number) === 10;

Regex Anchors

Specify Position With Anchors

/^ab/ - Matches abcdefg but not cab

/ab$/ - Matches cab but not abcdefg

/^[a-z]+$/ - Matches a string of only lowercase characters
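The anchor examples above, written out as calls. The variable names and the extra "Regex1" subject are illustrative.

```php
<?php
$startsWithAb = preg_match('/^ab/', 'abcdefg');      // ab at the start
$cabAtStart   = preg_match('/^ab/', 'cab');          // ab is there, but not at the start
$endsWithAb   = preg_match('/ab$/', 'cab');          // ab at the end
$abcAtEnd     = preg_match('/ab$/', 'abcdefg');      // ab is there, but not at the end
$onlyLower    = preg_match('/^[a-z]+$/', 'regex');   // nothing but lowercase letters
$mixed        = preg_match('/^[a-z]+$/', 'Regex1');  // uppercase letter and digit break it
```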

Word Boundaries

\b means word boundaries

● Before first character if first character is word character
● After last character if word character
● Between two characters if one is a word character and the other isn't

/\bfish\b/ matches fish but not fisherman or catfish

/fish\b/ matches fish and catfish
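A short sketch of the word-boundary examples. The sentence "one fish, two fish" is an extra made-up subject.

```php
<?php
$wholeWord = [];
foreach (['fish', 'fisherman', 'catfish', 'one fish, two fish'] as $subject) {
    // \b on both sides: "fish" must stand alone as a word.
    $wholeWord[$subject] = preg_match('/\bfish\b/', $subject);
}

// \b only at the end: "fish" may be the tail of a longer word.
$suffixCatfish   = preg_match('/fish\b/', 'catfish');
$suffixFisherman = preg_match('/fish\b/', 'fisherman');
```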

Alternation

/cow|boy/ Matches cow or boy or cowboy or coward, etc

/\b(cow|boy)\b/ - Matches cow or boy but not cowboy or coward

Parens capture the matching word - more on that later
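A sketch contrasting the loose and the word-bounded alternation. The subject strings are made up for illustration.

```php
<?php
// Without boundaries, "cow" is found inside "coward".
preg_match('/cow|boy/', 'coward', $loose);

// With \b on both sides, neither "coward" nor "cowboy" qualifies.
$coward = preg_match('/\b(cow|boy)\b/', 'coward');
$cowboy = preg_match('/\b(cow|boy)\b/', 'cowboy');

// The parens capture whichever alternative matched first in the subject.
preg_match('/\b(cow|boy)\b/', 'a boy and a cow', $m);
```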

Greedy vs Lazy

Default is greedy - match as much as possible

Grab starting HTML tag:

/<.+>/

Matches the whole string, not just the opening tag: <h1>Welcome to Tek</h1>

Not what we want.

Make it lazy.

Lazy Matching

/<.+?>/

Now matches only the opening tag, <h1>, in:

<h1>Welcome to Tek</h1>

Another way to match tags

/<[^>]+>/

Literally match: “Less than” followed by one or more non-“greater than” characters followed by a “greater than” character.

Faster than the last example. No backtracking.
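The three tag-matching patterns can be compared on the same input with a small sketch; the variable names are illustrative.

```php
<?php
$html = '<h1>Welcome to Tek</h1>';

// Greedy: .+ runs to the end, then backs up to the last ">".
preg_match('/<.+>/', $html, $greedy);

// Lazy: .+? stops at the first ">" it can.
preg_match('/<.+?>/', $html, $lazy);

// Negated class: [^>]+ can never run past a ">".
preg_match('/<[^>]+>/', $html, $negated);
```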

Capture Part of Regex

Capturing Regex - Backreference

/__(construct|destruct)/

Backreference will contain construct or destruct so you can use it later

/([a-z]+)\1/

Matches repeated sequence of characters

Backreference

/([a-z]{3})\1/

Matches words like booboo or bambam
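A sketch of both uses of a capture: reading it afterwards from the matches array, and referring to it inside the pattern with \1. The subject strings are made up.

```php
<?php
// Reading a capture after the match: $method[1] is the alternative that matched.
preg_match('/__(construct|destruct)/', 'public function __destruct()', $method);

// \1 inside the pattern must repeat exactly what group 1 captured.
$booboo = preg_match('/([a-z]{3})\1/', 'booboo', $repeat);
$bamboo = preg_match('/([a-z]{3})\1/', 'bamboo');   // "bam" is not followed by "bam"
```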

Practical Backreference Uses

Search and replace

preg_replace('/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/', '($1) $2-$3', $phone);

Format phone numbers from a variety of input styles as (xxx) xxx-xxxx
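A sketch of the replacement wrapped in a function, assuming the replacement string refers to the three captured groups as `$1`, `$2` and `$3`. The `formatPhone` name is hypothetical.

```php
<?php
// Hypothetical helper: rewrites any phone number the pattern recognizes
// into the (xxx) xxx-xxxx form; text that does not match is left alone.
function formatPhone(string $phone): string
{
    return preg_replace(
        '/\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4})/',
        '($1) $2-$3',   // single quotes keep PHP from interpolating $1
        $phone
    );
}
```

With this in place, `formatPhone('7206757471')` and `formatPhone('720 675 7471')` both give "(720) 675-7471".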

More Practical Backreferences

preg_replace( '/\b(\w+)\s+\1\b/', '\1', $string);

Replace duplicated words that that have been inadvertently been left in.

Replace duplicated words that have been inadvertently been left in.
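The duplicate-word cleanup can be tried on a made-up sentence. Only adjacent repeats with the same letter case are collapsed.

```php
<?php
// Hypothetical input with one doubled word.
$sentence = 'Replace duplicated words that that have been left in.';

// \b(\w+) captures a whole word, \s+ skips the gap, and \1\b requires the
// same word again; the pair is replaced with a single copy.
$fixed = preg_replace('/\b(\w+)\s+\1\b/', '\1', $sentence);

// Without the i modifier, "The the" is not treated as a duplicate.
$caseSensitive = preg_replace('/\b(\w+)\s+\1\b/', '\1', 'The the end');
```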

Non-capturing groups

Match an IPv4 address

/((?:\d{1,3}\.){3}\d{1,3})/

Matches 1-3 digits followed by a dot, repeated 3 times, then 1-3 more digits. (?: ) groups the repeated part without capturing it.
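A sketch showing that the non-capturing group adds nothing to the matches array, and that the pattern checks only the shape of an address. The sample sentence is made up, and pairing the regex with `filter_var` for range checking is an addition here, not something from the talk.

```php
<?php
$pattern = '/((?:\d{1,3}\.){3}\d{1,3})/';

// Hypothetical log-style input.
preg_match($pattern, 'Server at 192.168.0.1 is up', $m);
// $m holds the full match and group 1 only; (?: ) contributes no entry.

// The regex happily accepts out-of-range octets...
$shapeOnly = preg_match($pattern, '999.999.999.999');

// ...so a range check has to come from elsewhere, for example filter_var.
$inRange    = filter_var($m[1], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
$outOfRange = filter_var('999.999.999.999', FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
```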

Pattern Modifiers

Modifiers after the last delimiter:

i - case insensitive matching

m - multiline matching (^ and $ also match at line breaks)

s - dot matches all characters, including \n

x - ignore whitespace characters if not escaped or in a character class

More Pattern Modifiers

D - $ matches only at the very end of the string (not before a trailing newline)

U - Invert the meaning of greediness (quantifiers become lazy by default)

Other modifiers can be seen here:

http://php.net/manual/en/reference.pcre.pattern.modifiers.php
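A sketch with one call per modifier, each next to its unmodified counterpart where that helps. The subject strings and variable names are made up for illustration.

```php
<?php
// i: case insensitive.
$i = preg_match('/php/i', 'PHP');

// m: ^ and $ match at every line, so each of the three lines is a match.
$m = preg_match_all('/^\w+$/m', "one\ntwo\nthree");

// s: the dot also matches a newline.
$s   = preg_match('/a.b/s', "a\nb");
$noS = preg_match('/a.b/', "a\nb");

// x: unescaped whitespace and # comments in the pattern are ignored.
$x = preg_match('/ \d{3} - \d{4}  # local number /x', '675-7471');

// D: $ matches only at the very end, not before a trailing newline.
$D   = preg_match('/^abc$/D', "abc\n");
$noD = preg_match('/^abc$/', "abc\n");

// U: quantifiers are lazy by default, so .+ stops at the first ">".
preg_match('/<.+>/U', '<b>bold</b>', $u);
```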

Named Capture Groups

Instead of numbers, get back names

No need to renumber in code later if you add another capture group

Named Capture Group - Phone

preg_match('/
    \(?                   # opt. open paren
    (?P<area_code>\d{3})  # area code
    \)?                   # opt. closed paren
    [ -]?                 # opt. space or dash
    (?P<exchange>\d{3})   # exchange
    [ -]?                 # opt. space or dash
    (?P<number>\d{4})     # last 4 digits
    /x', // ignore spaces and comment stuff
    $number, $matches);

Named Capture Group Result

array(7) {

[0] => string(10) "7206757471"

['area_code'] => string(3) "720"

[1] => string(3) "720"

['exchange'] => string(3) "675"

[2] => string(3) "675"

['number'] => string(4) "7471"

[3] => string(4) "7471"

}
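A sketch of reading the named groups back out. The single-line pattern is the same as the commented version with the whitespace and comments removed; the `$formatted` variable is illustrative.

```php
<?php
$pattern = '/\(?(?P<area_code>\d{3})\)?[ -]?(?P<exchange>\d{3})[ -]?(?P<number>\d{4})/';

$formatted = null;
if (preg_match($pattern, '(720) 675-7471', $matches)) {
    // The names keep working even if another group is added to the pattern later.
    $formatted = sprintf(
        '(%s) %s-%s',
        $matches['area_code'],
        $matches['exchange'],
        $matches['number']
    );
}
```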

Positive Look Ahead Matches

Find a pattern followed by another pattern

/p(?=h)/ - Match a p followed by an "h" but don't include the "h"

Matches "phone", "phish", "telegraph"

Does not match "potassium"

Negative Look Ahead

Look for a pattern which is not followed by some other pattern

/p(?!h)/ - p not followed by h

Matches potassium

Does not match phone, telegraph or phish

Look aheads

● Positive and negative lookaheads do not capture anything
● They determine if a match is possible
● They are zero-width
● /p[^h]/ is not the same as /p(?!h)/
● /ph/ is not the same as /p(?=h)/

Look behinds

Positive Look Behind

/(?<=oo)d/ - d preceded by oo

- Matches the d in "food" and "mood"

Negative Look Behind

/(?<!oo)d/ - d not preceded by oo

- Matches "dude", "crude" and "d"
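A sketch of both lookbehinds. `PREG_OFFSET_CAPTURE` is used only to show where in the string the match sits.

```php
<?php
// Positive lookbehind: only the "d" is matched, at offset 3 in "food".
$food = preg_match('/(?<=oo)d/', 'food', $behind, PREG_OFFSET_CAPTURE);
$dude = preg_match('/(?<=oo)d/', 'dude');

// Negative lookbehind: both d's in "dude" qualify, the one in "food" does not.
$dudeCount = preg_match_all('/(?<!oo)d/', 'dude', $all);
$foodNeg   = preg_match('/(?<!oo)d/', 'food');
```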

With Great Power...

Test your regular expressions before they go to production

It's much easier to get them wrong than to get them right if you don't test

Use tools like Sublime Text, Atom
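Testing can also live in code. The sketch below is a hypothetical helper (not from the talk) that runs a pattern against strings that should and should not match and reports every surprise.

```php
<?php
// Hypothetical helper: returns a list of failure messages, empty when the
// pattern behaves as expected on every sample.
function checkPattern(string $pattern, array $shouldMatch, array $shouldNotMatch): array
{
    $failures = [];
    foreach ($shouldMatch as $subject) {
        if (preg_match($pattern, $subject) !== 1) {
            $failures[] = "expected a match: $subject";
        }
    }
    foreach ($shouldNotMatch as $subject) {
        if (preg_match($pattern, $subject) !== 0) {
            $failures[] = "expected no match: $subject";
        }
    }
    return $failures;
}

// The word-bounded pattern passes every sample.
$failures = checkPattern('/\bfish\b/', ['fish', 'big fish'], ['fisherman', 'catfish']);

// A pattern that is too loose gets caught: /fish/ also matches "catfish".
$looseFailures = checkPattern('/fish/', ['fish'], ['catfish']);
```

Because `preg_match` returns false on an error, a broken pattern shows up as a failure in both loops rather than passing silently.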

When to not use regex

When they are not needed

If you can use strstr, strpos or str_replace

If you cannot use those, maybe regex is appropriate

Don't use regex when you need a parser
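A sketch of the plain string functions doing jobs that do not need a regex, next to the regex equivalent for comparison. The sample string is made up.

```php
<?php
// Hypothetical sample text.
$haystack = 'Grokking Regex at php[tek]';

// strpos returns an offset or false, so compare with !== false.
$hasTek = strpos($haystack, 'php[tek]') !== false;

// The regex version needs preg_quote just to treat [ and ] literally.
$hasTekRegex = preg_match('/' . preg_quote('php[tek]', '/') . '/', $haystack) === 1;

// Fixed-text replacement: str_replace is enough.
$replaced = str_replace('Regex', 'Regular Expressions', $haystack);

// strstr returns the rest of the string from the first occurrence on.
$tail = strstr($haystack, 'at');
```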

Resources

http://regular-expressions.info

http://php.net/manual/en/ref.pcre.php

http://www.php.net/manual/en/reference.pcre.pattern.syntax.php

Photo Credits
● http://www.flickr.com/photos/justinbaeder/5317820857 (Hammer & Screw)
● http://www.flickr.com/photos/doug88888/5891638442 (Water Pattern)
● http://www.flickr.com/photos/mwparenteau/7566437660 (Laxative Cereal)
● http://www.flickr.com/photos/auyuchuco/3669864253 (Mantis Shrimp)
● http://www.flickr.com/photos/anderspiren/4678572968 (Spray Can)
● http://www.flickr.com/photos/dcmatt/473127479 (Comedy Club)
● http://www.flickr.com/photos/gschueler/72294706 (License Plate)
● http://www.flickr.com/photos/horiavarlan/4514164700 (Puzzle @ sign)
● http://www.flickr.com/photos/proimos/4199675334 (Facepalm)
● http://www.flickr.com/photos/mklapper/5812224468 (Teacher in Classroom)
● http://www.flickr.com/photos/light_arted/3927322326 (Anchor)
● http://www.flickr.com/photos/kpcauchi/5376768095 (Lizard)
● http://www.flickr.com/photos/focusshoot/5617788347 (Spider web)
● http://www.flickr.com/photos/oberazzi/318947873 (Cuff links)

dave@davidstockton.com

Please rate this talk

https://joind.in/10642
