regex basics

28
Expression Basics PHPNW 2008 Ciarán Walsh

Upload: jeremy-coates

Post on 18-May-2015

2.532 views

Category:

Technology


3 download

DESCRIPTION

Ciarán Walsh's PHPNW08 slides: In the right hands regular expressions can be a powerful tool, but it’s also far too easy for them to be used badly, or in the wrong situations. This talk will kick off with a look at alternatives to regular expressions, for when the power of pattern matching is not required, and will also go over some cases when there are better alternatives available. Then there will be a brief refresher on pattern syntax and some general tips and tricks to help when constructing regular expressions, before we go on to look at some situations where the use of pattern matching is a good fit, how to solve some common problems, and some common pitfalls when writing patterns.

TRANSCRIPT

Page 1: Regex Basics

Regular Expression Basics

PHPNW 2008Ciarán Walsh

Page 2: Regex Basics

What are regular expressions?

•Regular expressions allow matching and manipulation of textual data.

•Abbreviated as regex or regexp, or alternatively just “patterns”.

Page 3: Regex Basics

Regular Expression Basics

Literals

busMatches a ‘b’, followed by a ‘u’, followed by an ‘s’

Page 4: Regex Basics

Regular Expression Basics

Anchors

^ Matches at the beginning of a line

$ Matches at the end of a line

Page 5: Regex Basics

Regular Expression Basics

Character Classes

[abc] Matches one of ‘a’, ‘b’ or ‘c’

[a-c] Same as above (character range)

[^abc]Matches one character that is not listed

. Matches any single character

Page 6: Regex Basics

Regular Expression Basics

Alternation

a|b Matches one of ‘a’ or ‘b’

dog|cat Matches one of “dog” or “cat”

Page 7: Regex Basics

Regular Expression Basics

Quantifiers (repetition)

{x,y}Matches minimum of x and a maximum of y occurrences; either can be omitted

*Matches zero or more occurrences(any amount). Same as {0,}

+Matches one or more occurrences.Same as {1,}

?Matches zero or one occurrences.Same as {0,1}

Page 8: Regex Basics

Regular Expression Basics

Grouping

(…)

Groups the contents of the parentheses.Affects alternation and quantifiers.Allows parts of the match to be captured

for|backward “for” or “backward”(for|back)ward “forward” or “backward”

Page 9: Regex Basics

Regular Expression Basics

Delimiters

pattern / modifiers/

/i Makes match case-insensitive

Page 10: Regex Basics

Performing a Match

preg_match('/Te(.)f?/i','text',$matches);

•Returns number of matches (0 or 1)

•$matches will contain captured groups

Page 11: Regex Basics

Performing a Replacement

preg_replace('/some(text)/','\1',$text)

•Returns string after replacement

•Can use backreferences with \0-9

Page 12: Regex Basics

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:

(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\

]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n

)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)

*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0

31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\

](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\

\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\"

.\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?

:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(

?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]

))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]

+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:

\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?

:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:

\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?

:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]

|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@

,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".

\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[

\t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[

\t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(

?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|

\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\

.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|

\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[

\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\

031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

Don’t Use Regular Expressions!

Don’t Abuse Regular

Expressions!Some people, when confronted with a problem, think

“I know, I'll use regular expressions.”Now they have two problems. — Jamie Zawinski

Page 13: Regex Basics

Testing for a Substring

if (preg_match('/foo/', $var))

if (strpos($var, 'foo') !== false)

if (preg_match('/foo/i', $var))

if (stripos($var, 'foo') !== false)

Page 14: Regex Basics

if (preg_match('/^\d+$/', $value)) { // $value is a positive integer}

Validating an Integer

Regular Expression

•Intention is not immediately

obvious

•Not efficient

Page 15: Regex Basics

ctype (Character Type)

Validating an Integer

if (ctype_digit($value)) { // $value is a positive integer}

• Native C library (fast)

• Makes the intention obvious

Page 16: Regex Basics

$casted_value = intval($value);if ($casted_value > 0) { // $casted_value is a positive (non-zero) integer}

Validating an Integer

Casting

•Intention is fairly clear

•Casting is safe practice

•Any invalid values will result in zero

Page 17: Regex Basics

HTML Parsing

Page 18: Regex Basics

Using Regular Expressions

Page 19: Regex Basics

Using Regular Expressions

Postcodes/[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9]

[A-Z]{2}/

IP Addresses @^(\d{1,2})/(\d{1,2})/(\d{4})$@

Page 20: Regex Basics

Constructing Patterns

Writing patterns is a balance between matching what you do want, against not matching what you don’t want.

Page 21: Regex Basics

preg_match('/<b><s>.+<\/s>.+<\/b>/', $html)

preg_match('@<b><s>.+</s>.+</b>@', $html)

You don’t need to use

/…/ to denote a pattern!

Page 22: Regex Basics

Greediness$html = <<<HTML <span>some text</span><span>some more text!</span>HTML;

preg_match("@<span>(.+)</span>@", $html, $matches);echo $matches[0];

preg_match("@<span>(.+?)</span>@", $html, $matches);echo $matches[0];

Page 23: Regex Basics

preg_match('`^(\w+)://(?:(.+?):(.+?)@)?(.+?)\.(\w+)$`', $s, $matches)

preg_match('` ^ (\w+):// # Protocol (?: (.+?) # Username : # : (.+?) # Password @ # @ )? # Username/password are optional (.+?) # Hostname \.(\w+) # Top-level domain $ `x', $s, $matches);

You can make your pattern readable!

Page 24: Regex Basics

preg_match('`^ (?P<protocol>\w+):// (?: (?P<user>.+?) : (?P<pass>.+?) @ )? (?P<host>.+?) \.(?P<tld>\w+) $`x', $s, $matches);

Extracting Captures

Array(    [0] => http://foo:[email protected]    [protocol] => http    [1] => http    [user] => foo    [2] => foo    [pass] => bar    [3] => bar    [host] => baz.example    [4] => baz.example    [tld] => com    [5] => com)

preg_match('`^ (?P<protocol>\w+):// (?: (?P<user>.+?) : (?P<pass>.+?) @ )? (?P<host>.+?) \.(?P<tld>\w+) $`x', $s, $matches);

Page 25: Regex Basics

if (preg_match("!>$value</(?:div|span)>!", $text))

Variable Data

$value = preg_quote($value, '!');

Page 26: Regex Basics

preg_replace('/\w+/e', 'strtoupper("\0")', 'foo bar baz')

function upper_case_match($matches) { return strtoupper($matches[0]);}preg_replace_callback('/\w+/','upper_case_match','foo bar baz')

Performing Logic on Replacements

Page 27: Regex Basics

Testing Tools

•RegexBuddy

•Reggy

•http://rubular.com

Page 28: Regex Basics

Any Questions?