Regular Expressions: JavaScript and Beyond
DESCRIPTION
Regular expressions are a powerful tool for text and data processing. What kind of support do browsers provide for them? What are the little misconceptions that prevent people from using regular expressions effectively? The talk gives an overview of regular expression syntax and typical usage examples.
TRANSCRIPT
Regular Expressions: JavaScript and Beyond
Max Shirshin
Frontend Team Lead
deltamethod
Introduction
Types of regular expressions
• POSIX (BRE, ERE)
• PCRE = Perl-Compatible Regular Expressions
From the JavaScript language specification:
"The form and functionality of regular expressions is modelled after the regular expression facility in the Perl 5 programming language".
JS syntax (overview only)
var re = /^foo/;
// boolean
re.test('string');
// null or Array
re.exec('string');
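A minimal sketch of the two calls above, run against made-up sample strings, to show what each one hands back:

```javascript
const re = /^foo/;

// test() only answers "does it match?"
const hit = re.test('foobar');
const miss = re.test('barfoo');

// exec() returns an array with extra properties, or null when nothing matches.
const match = re.exec('foobar');
const noMatch = re.exec('barfoo');

console.log(hit, miss);             // true false
console.log(match[0], match.index); // foo 0
console.log(noMatch);               // null
```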
Regular expressions consist of...
● Tokens
  - common characters
  - special characters (metacharacters)
● Operations
  - quantification
  - enumeration
  - grouping
Tokens and metacharacters
Any character
/./.test('foo');  // true
/./.test('\r\n'); // false
What you need instead:
/[\s\S]/ for JavaScript, or
/./s (works in Perl/PCRE; in JS only since ES2018)
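A short sketch of why the dot is not really "any character". The sample text is invented for the illustration:

```javascript
const text = 'line one\nline two';

// The dot stops at line terminators, so this captures the first line only.
const dotOnly = /^(.*)/.exec(text)[1];

// [\s\S] is "whitespace or non-whitespace", i.e. any character at all.
const anyChar = /^([\s\S]*)/.exec(text)[1];

console.log(JSON.stringify(dotOnly)); // "line one"
console.log(JSON.stringify(anyChar)); // "line one\nline two"
```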
String boundaries
>>> /^something$/.test('something')
true
>>> /^something$/.test('something\nbad')
false
>>> /^something$/m.test('something\nbad')
true
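A small sketch of what the m flag changes for ^ and $, using an invented three-line string:

```javascript
const input = 'first\nsecond\nthird';

// Without m, ^ and $ refer to the start and end of the whole string.
const wholeOnly = /^second$/.test(input);

// With m, ^ and $ also match right after and right before a line break.
const perLine = /^second$/m.test(input);

// Combined with g, the same anchors collect every line separately.
const lines = input.match(/^.+$/gm);

console.log(wholeOnly, perLine); // false true
console.log(lines);              // [ 'first', 'second', 'third' ]
```

This is also why m is a poor choice for validating user input: one good line is enough to pass.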
Word boundaries
>>> /\ba/.test('alabama')
true
>>> /a\b/.test('alabama')
true
>>> /a\b/.test('naïve')
true
(ï is not a \w character, so a word boundary follows the first "a")
not a word boundary:
>>> /\Ba/.test('alabama')
true
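A sketch of \b in practice, with sample strings invented for the illustration. The first part replaces a whole word only; the second shows where \b sees boundaries inside "naïve" (the ï is written as \u00EF to keep it a single character):

```javascript
// \b hits the standalone word "cat" but not the "cat" inside "concatenate".
const replaced = 'cat concatenate cat.'.replace(/\bcat\b/g, 'dog');

// \w covers ASCII letters only, so \b finds boundaries around the "ï".
const pieces = 'na\u00EFve'.split(/\b/);

console.log(replaced); // dog concatenate dog.
console.log(pieces);   // [ 'na', 'ï', 've' ]
```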
Character classes
Whitespace
/\s/ (inverted version: /\S/)
FF: \t \n \v \f \r \u0020 \u00a0 \u1680 \u180e \u2000 \u2001 \u2002 \u2003 \u2004 \u2005 \u2006 \u2007 \u2008 \u2009 \u200a \u2028 \u2029 \u202f \u205f \u3000
Chrome, IE 9: as in FF plus \ufeff
IE 7, 8 :-( only: \t \n \v \f \r \u0020
Alphanumeric characters
/\d/ ~ digits from 0 to 9
/\w/ ~ Latin letters, digits, underscore
Does not work for Cyrillic, Greek etc.
Inverted forms:
/\D/ ~ anything but digits
/\W/ ~ anything but Latin letters, digits, underscore
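A sketch of the \w limitation. The last line goes beyond the slides: it assumes an engine with ES2018 Unicode property escapes (the u flag plus \p{...}), which older browsers do not have.

```javascript
const latin = /^\w+$/.test('hello_42');

// \w is ASCII-only, so a Cyrillic word fails the same check.
const cyrillic = /^\w+$/.test('привет');

// Assumption: ES2018 engine. \p{L} is any letter, \p{N} any number.
const anyLetters = /^[\p{L}\p{N}_]+$/u.test('привет');

console.log(latin, cyrillic, anyLetters); // true false true
```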
Custom character classes
Example:
/[abc123]/
Metacharacters and ranges supported:
/[A-F\d]/
More than one range is okay:
/[a-cG-M0-7]/
IMPORTANT: ranges come from Unicode, not from national alphabets!
"dot" means just dot!
/[.]/.test('anything') // false
adding \ ] -
/[\\\]-]/
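A sketch of the points above with invented inputs. The Cyrillic case illustrates "ranges come from Unicode": the letter ё sits outside the а-я code point range even though it belongs to the Russian alphabet.

```javascript
// Ranges and shorthand classes can be mixed: exactly one hex digit.
const hexDigit = /^[A-Fa-f\d]$/;
const isHexC = hexDigit.test('c');
const isHexG = hexDigit.test('g');

// а-я covers U+0430..U+044F; ё is U+0451, so it has to be listed explicitly.
const rangeMissesYo = /[а-я]/.test('ё');
const rangeWithYo = /[а-яё]/.test('ё');

// Inside a class the dot is a literal dot.
const dashed = 'a.b.c'.replace(/[.]/g, '-');

// Backslash and ] are escaped; a trailing hyphen is taken literally.
const special = /[\\\]-]/;
const hasSpecial = special.test('a-b');
const noSpecial = special.test('abc');

console.log(isHexC, isHexG);            // true false
console.log(rangeMissesYo, rangeWithYo); // false true
console.log(dashed);                    // a-b-c
console.log(hasSpecial, noSpecial);     // true false
```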
Inverted character classes
anything except a, b, c:
/[^abc]/
^ as a character:
/[abc^]/
/[^]/ matches ANY character; could be a nice alternative to /[\s\S]/
Chrome, FF:
>>> /([^])/.exec('a');
['a', 'a']
IE:
>>> /([^])/.exec('a');
['a', '']
>>> /([\s\S])/.exec('a');
['a', 'a']
Quantifiers
Zero or more, one or more
/bo*/.test('b') // true
/.*/.test('') // true
/bo+/.test('b') // false
Zero or one
/colou?r/.test('color');  // true
/colou?r/.test('colour'); // true
How many?
/bo{7}/ exactly 7
/bo{2,5}/ from 2 to 5 (in {x,y}, x must not exceed y)
/bo{5,}/ 5 or more
This does not work in JS:
/b{,5}/.test('bbbbb') // false: {,5} is read as literal text
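A sketch of the brace quantifiers with anchored patterns, so the counts are easy to check. The sample strings are invented; the last two lines show how browsers treat the unsupported {,5} form.

```javascript
const exactly3 = /^bo{3}$/.test('booo');
const tooFew = /^bo{3}$/.test('boo');

const inRange = /^bo{2,4}$/.test('boo');
const tooMany = /^bo{2,4}$/.test('booooo');

const atLeast2 = /^bo{2,}$/.test('boooooooo');

// {,5} is not a quantifier in JS: it matches the literal characters "{,5}".
const literalBraces = /b{,5}/.test('b{,5}');
const notACount = /b{,5}/.test('bbbbb');

console.log(exactly3, tooFew);         // true false
console.log(inRange, tooMany);         // true false
console.log(atLeast2);                 // true
console.log(literalBraces, notACount); // true false
```

Writing {0,5} gives the "up to 5" meaning.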
Greedy quantifiers
var r = /a+/.exec('aaaaa');
>>> r[0]
"aaaaa"
Lazy quantifiers
var r = /a+?/.exec('aaaaa');
>>> r[0]
"a"
r = /a*?/.exec('aaaaa');
>>> r[0]
""
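A sketch of where greedy versus lazy matters in practice. The markup string is invented for the illustration:

```javascript
const html = '<b>bold</b> and <i>italic</i>';

// Greedy: .+ runs to the end, then backs off to the LAST ">".
const greedy = html.match(/<.+>/)[0];

// Lazy: .+? grows one character at a time and stops at the FIRST ">".
const lazy = html.match(/<.+?>/)[0];

console.log(greedy); // <b>bold</b> and <i>italic</i>
console.log(lazy);   // <b>
```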
Groups
capturing:
/(boo)/.test("boo");
non-capturing:
/(?:boo)/.test("boo");
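Both kinds of group behave the same for matching; the difference shows up in the result of exec. A small sketch with an invented input:

```javascript
// Capturing: the group's last repetition is stored at index 1.
const capturing = /(boo)+!/.exec('booboo!');

// Non-capturing: the group only structures the pattern, nothing is stored.
const nonCapturing = /(?:boo)+!/.exec('booboo!');

console.log(capturing.length, capturing[0], capturing[1]); // 2 booboo! boo
console.log(nonCapturing.length, nonCapturing[0]);         // 1 booboo!
```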
Grouping and the RegExp constructor
var result = /(bo)o+(b)/.exec('the booooob');
>>> RegExp.$1
"bo"
>>> RegExp.$2
"b"
>>> RegExp.$9
""
>>> RegExp.$10
undefined
>>> RegExp.$0
undefined
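The RegExp.$1 ... $9 properties are shared global state that the next match anywhere in the program overwrites. As a sketch, the same information can be read from the array that exec returns for this particular match:

```javascript
const result = /(bo)o+(b)/.exec('the booooob');

console.log(result[0]);    // booooob  (the whole match)
console.log(result[1]);    // bo       (same value as RegExp.$1)
console.log(result[2]);    // b        (same value as RegExp.$2)
console.log(result[3]);    // undefined (no third group)
console.log(result.index); // 4        (where the match starts)
```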
Numbering of capturing groups
/((foo) (b(a)r))/
$1  ((foo) (b(a)r))  foo bar
$2  (foo)            foo
$3  (b(a)r)          bar
$4  (a)              a
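Groups are numbered by the position of their opening parenthesis, left to right. A quick sketch that checks this against the pattern above:

```javascript
const m = /((foo) (b(a)r))/.exec('foo bar');

// Index 0 is the whole match; indexes 1..4 follow the opening parentheses.
const groups = m.slice(1);

console.log(groups); // [ 'foo bar', 'foo', 'bar', 'a' ]
```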
Lookahead
var r = /best(?= match)/.exec('best match');
>>> !!r
true
>>> r[0]
"best"
>>> /best(?! match)/.test('best match')
false
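Lookahead checks what follows without consuming it, so several checks can be stacked at the same position. A sketch with invented inputs; the "password rule" is only an illustration, not a recommendation:

```javascript
// Negative lookahead: a "q" that is NOT followed by "u".
const iraq = /q(?!u)/.test('Iraq');
const quit = /q(?!u)/.test('quit');

// Two positive lookaheads at the start: "contains a digit" AND
// "contains a lowercase letter", then at least 8 characters overall.
const rule = /^(?=.*\d)(?=.*[a-z]).{8,}$/;
const ok = rule.test('abc12345');
const noDigit = rule.test('abcdefgh');
const tooShort = rule.test('a1');

console.log(iraq, quit);            // true false
console.log(ok, noDigit, tooShort); // true false false
```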
Lookbehind
NOT supported in JavaScript before ES2018
/(?<=text)match/ positive lookbehind
/(?<!text)match/ negative lookbehind
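Where lookbehind is unavailable, a common workaround is to match the prefix as well and capture only the part after it. A sketch with an invented input; the second half assumes an ES2018 engine, where the lookbehind syntax itself is accepted:

```javascript
const text = 'cost: $42';

// Workaround: "$" is part of the match, the group holds what we want.
const viaGroup = /\$(\d+)/.exec(text)[1];

// Assumption: ES2018 engine. The "$" is checked but not included.
const viaLookbehind = /(?<=\$)\d+/.exec(text)[0];

console.log(viaGroup, viaLookbehind); // 42 42
```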
Enumerations
Logical "or"
/red|green|blue light/
/(red|green|blue) light/
>>> /var a(;|$)/.test('var a')
true
Backreferences
>>> /(red|green) apple is \1/.test('red apple is red')
true
>>> /(red|green) apple is \1/.test('green apple is green')
true
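A backreference matches the text a group actually captured, not the group's pattern. A sketch that uses this to find and remove an accidentally doubled word (the sentence is invented):

```javascript
const sentence = 'this is is a test';

// \1 must repeat exactly what (\w+) captured.
const doubled = /\b(\w+) \1\b/.exec(sentence)[0];

// In the replacement string, $1 refers to the same group.
const fixed = sentence.replace(/\b(\w+) \1\b/g, '$1');

// Mixed colours fail: \1 is "red", but the text says "green".
const mixed = /(red|green) apple is \1/.test('red apple is green');

console.log(doubled); // is is
console.log(fixed);   // this is a test
console.log(mixed);   // false
```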
Alternative character representations
Representing a character
\x09 === \t (not Unicode but ASCII/ANSI)
\u20AC === € (in Unicode)
backslash takes away special character meaning:
/\(\)/.test('()') // true
/\\n/.test('\\n') // true
...or vice versa!
/\f/.test('f') // false!
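A sketch of the escapes above, plus one related trap: when a pattern is passed as a string, the string literal processes backslashes first, so they have to be doubled.

```javascript
const tab = /\x09/.test('\t');
const euro = /\u20AC/.test('€');

// \f is the form feed character, not the letter "f".
const letterF = /\f/.test('f');
const formFeed = /\f/.test('\f');

// In a string literal '\\d' becomes \d, while '\d' silently becomes "d".
const doubledBackslash = new RegExp('\\d+').test('42');
const singleBackslash = new RegExp('\d+').test('42');

console.log(tab, euro);                         // true true
console.log(letterF, formFeed);                 // false true
console.log(doubledBackslash, singleBackslash); // true false
```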
Flags
Regular expression flags
g i m s x y
g: global match
i: ignore case
m: multiline matching for ^ and $
s: string as single line (NOT supported in JavaScript before ES2018)
x: extend pattern (NOT supported in JavaScript)
y: sticky (Mozilla-only and non-standard when introduced, standard since ES2015). Match only from the .lastIndex index (a regexp instance property), so the match is anchored at a predefined position. In the original Mozilla implementation ^ could match at that position; in the standard, ^ still matches only at the start of the input (or of a line with m).
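Both g and y make a regexp instance stateful through its lastIndex property. A sketch of what that means for repeated calls; the sticky part assumes an engine that implements the y flag as standardized in ES2015:

```javascript
// With g, every successful test() moves lastIndex; a failure resets it to 0.
const re = /o/g;
const calls = [];
for (let i = 0; i < 3; i++) {
  calls.push([re.test('foo'), re.lastIndex]);
}

// With y, the match must start exactly at lastIndex.
const sticky = /foo/y;
sticky.lastIndex = 3;
const atThree = sticky.test('barfoo');
sticky.lastIndex = 1;
const atOne = sticky.test('barfoo');

console.log(JSON.stringify(calls)); // [[true,2],[true,3],[false,0]]
console.log(atThree, atOne);        // true false
```

Reusing one /g instance for unrelated strings is a frequent source of "random" false results for exactly this reason.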
Alternative syntax for flags
/(?i)foo/
/(?i-m)bar$/
/(?i-sm).x$/
/(?i)foo(?-i)bar/
Some implementations do NOT support flag switching on-the-go.
In JS, flags are set for the whole regexp instance and you can't change them.
RegExp in JavaScript
Methods
RegExp instances:
/regexp/.exec('string')
null or array ['whole match', $1, $2, ...]
/regexp/.test('string')
false or true
String instances:
'str'.match(/regexp/)
'str'.match('\\w{1,3}')
- same as /regexp/.exec if no 'g' flag used;
- array of all matches if 'g' flag used (internal capturing groups ignored)
'str'.search(/regexp/)
'str'.search('\\w{1,3}')
first match index, or -1
'str'.replace(/old/, 'new');
WARNING: special magic supported in the replacement string:
$$ inserts a dollar sign "$"
$& substring that matches the regexp
$` substring before $&
$' substring after $&
$1, $2, $3 etc.: string that matches n-th capturing group
'str'.replace(/(r)(e)gexp/g, function(matched, $1, $2, offset, sourceString) {
  // what should replace the matched part on this iteration?
  return 'replacement';
});
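A sketch of the string methods above on invented inputs, showing how the g flag changes what match returns and how the replacement "magic" behaves:

```javascript
const s = 'a1 b22 c333';

// No g: like exec, the first match plus its groups.
const firstOnly = s.match(/[a-z](\d+)/);

// With g: every whole match, capturing groups are dropped.
const all = s.match(/[a-z](\d+)/g);

// search: index of the first match, or -1.
const where = s.search(/\d\d/);
const nowhere = s.search(/x/);

// $1, $2 and $& in the replacement string.
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1');
const wrapped = 'price'.replace(/price/, '[$&]');

// A function is called once per match; its return value is inserted.
const doubledNumbers = '1 2 3'.replace(/\d/g, function (digit) {
  return String(Number(digit) * 2);
});

console.log(firstOnly[0], firstOnly[1]); // a1 1
console.log(all);                        // [ 'a1', 'b22', 'c333' ]
console.log(where, nowhere);             // 4 -1
console.log(swapped);                    // Smith, John
console.log(wrapped);                    // [price]
console.log(doubledNumbers);             // 2 4 6
```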
RegExp injection
// BAD CODE
var re = new RegExp('^' + userInput + '$');
// ...
var userInput = '[abc]'; // oops!
// GOOD, DO IT AT HOME
RegExp.escape = function(text) {
  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, "\\$&");
};
var re = new RegExp('^' + RegExp.escape(userInput) + '$');
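A sketch of the difference the escaping makes. It uses the same character list as the slide, but in a standalone helper (escapeRegExp is a hypothetical name chosen here, to avoid touching the RegExp object):

```javascript
// Prefix every character that is special in a pattern with a backslash.
function escapeRegExp(text) {
  return text.replace(/[-[\]{}()*+?.,\\^$|#\s]/g, '\\$&');
}

const userInput = '[abc]';

// Unescaped: the input is interpreted as a character class.
const unsafe = new RegExp('^' + userInput + '$');

// Escaped: the input is matched literally.
const safe = new RegExp('^' + escapeRegExp(userInput) + '$');

console.log(unsafe.test('a'), unsafe.test('[abc]')); // true false
console.log(safe.test('a'), safe.test('[abc]'));     // false true
```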
Recommended reading
Online, just google it:
MDN Guide on Regular Expressions
The Book:
Mastering Regular Expressions
O'Reilly Media
Thank you!