
Post on 19-Dec-2015


Regular Expression (1)

Learning Objectives:
1. To understand the concept of regular expression
2. To learn commonly used operations involving regular expression / pattern matching
3. To learn the special cases that occur in regular expression / pattern matching

COMP111 Lecture 16

Simple Uses of Regular Expressions

In Perl, we can make Shakespeare a regular expression by enclosing it in slashes:

if (/Shakespeare/) {
    print $_;
}

What is tested in the if-statement?

Answer: $_. Can you write an even shorter statement using &&?
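One possible answer to the && question, as a sketch: because && short-circuits, its right-hand side runs only when the match on $_ succeeds. The helper name lines_with_shakespeare and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the lines that contain "Shakespeare".
# It uses the same short-circuit idea as the one-liner
#     /Shakespeare/ && print;
# but pushes onto a list instead of printing, so the result can be inspected.
sub lines_with_shakespeare {
    my @hits;
    for (@_) {
        /Shakespeare/ && push @hits, $_;   # push runs only if the match succeeds
    }
    return @hits;
}

# Made-up sample input.
my @hits = lines_with_shakespeare("Shakespeare in Love\n", "Titanic\n");
print @hits;
```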


Simple Uses of Regular Expressions

if (/Shakespeare/) {
    print $_;
}

The previous example tests only one line, and prints out the line if it contains Shakespeare.

To work on all lines, add a loop:

while (<>) {
    if (/Shakespeare/) {
        print;
    }
}


Simple Uses of Regular Expressions

What if we are not sure how to spell Shakespeare? The first part, Shak, is easy, and there must be an r near the end. How can we express this idea?

grep:
grep "Shak.*r" movie > result

Perl:
while (<>) {
    if (/Shak.*r/) {
        print;
    }
}

.* means “zero or more of any character” (any character except the newline).
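To see which spellings /Shak.*r/ accepts, here is a small sketch. The helper name shak_matches and the candidate spellings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the candidates that match /Shak.*r/.
sub shak_matches {
    return grep { /Shak.*r/ } @_;
}

# Made-up candidate spellings.
my @found = shak_matches("Shakespeare", "Shakspere", "Shakr", "Shak", "shakespeare");
print "$_\n" for @found;
```

"Shakr" matches because .* may match zero characters. "Shak" fails because there is no r after it, and "shakespeare" fails because the match is case-sensitive.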


Single-Character Patterns

The dot “.” matches any single character except the newline (\n). For example, the pattern /a./ matches any two-character sequence that starts with a and is not “a\n”. Use \. if you really want to match the period.

$ cat test
hi
hi bob.
$ cat sub3
#!/usr/local/bin/perl5 -w
while(<>){
    if(/\./){ print; }
}
$ sub3 test
hi bob.
$
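The difference between the unescaped and the escaped dot can be sketched as follows. The sample lines are made up, modelled on the test file above.

```perl
use strict;
use warnings;

# Made-up sample lines.
my @lines = ("hi\n", "hi bob.\n");

my @any_char = grep { /./  } @lines;   # unescaped dot: any character except newline
my @period   = grep { /\./ } @lines;   # escaped dot: a literal period only

# /a./ needs some non-newline character after the a.
my $a_then_newline = ("a\n"  =~ /a./) ? 1 : 0;
my $a_then_b       = ("ab\n" =~ /a./) ? 1 : 0;

print "any character: ", scalar(@any_char), " lines\n";
print "literal period: ", scalar(@period), " line\n";
print "a + newline: $a_then_newline, ab: $a_then_b\n";
```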


Single-Character Groups (1)

If you want to specify one out of a group of characters to match, use [ ]:

/[abcde]/

This matches a string containing any one of the first 5 lowercase letters, while:

/[aeiouAEIOU]/

matches any of the 5 vowels in either upper or lower case.


Single-Character Groups (2)

If you want ] in the group, put a backslash before it, or put it as the first character in the list:
/[abcde]]/ # matches [abcde] + ]
/[abcde\]]/ # okay
/[]abcde]/ # also okay

Use - for ranges of characters (like a through z):
/[0123456789]/ # any single digit
/[0-9]/ # same

If you want - in the list, put a backslash before it, or put it at the beginning/end:
/[X-Z]/ # matches X, Y, Z
/[X\-Z]/ # matches X, -, Z
/[XZ-]/ # matches X, Z, -
/[-XZ]/ # matches -, X, Z


Single-Character Groups (3)

More range examples:
/[0-9\-]/ # match 0-9, or minus
/[0-9a-z]/ # match any digit or lowercase letter
/[a-zA-Z0-9_]/ # match any letter, digit, underscore

There is also a negated character group, which starts with a ^ immediately after the left bracket. This matches any single character not in the list.

/[^0123456789]/ # match any single non-digit
/[^0-9]/ # same
/[^aeiouAEIOU]/ # match any single non-vowel
/[^\^]/ # match any single character except ^
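The rules for ], - and ^ inside a group can be checked with a short sketch. The helper name hit and the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str matches $re, 0 otherwise.
sub hit {
    my ($re, $str) = @_;
    return $str =~ $re ? 1 : 0;
}

my %result = (
    range_Y        => hit(qr/[X-Z]/,  "Y"),   # Y is inside the range X..Z
    escaped_Y      => hit(qr/[X\-Z]/, "Y"),   # only X, - and Z are listed
    escaped_minus  => hit(qr/[X\-Z]/, "-"),
    leading_rbrack => hit(qr/[]abc]/, "]"),   # ] first in the list is literal
    all_digits     => hit(qr/[^0-9]/, "42"),  # no non-digit anywhere
    has_letter     => hit(qr/[^0-9]/, "4a"),  # the a is a non-digit
);

print "$_: $result{$_}\n" for sort keys %result;
```

Note that a negated group still has to match one character, so /[^0-9]/ fails on "42" but succeeds on "4a".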


Single-Character Groups (4)

For convenience, some common character groups are predefined:

Predefined        Group           Negated           Negated Group
\d (a digit)      [0-9]           \D (non-digit)    [^0-9]
\w (word char)    [a-zA-Z0-9_]    \W (non-word)     [^a-zA-Z0-9_]
\s (space char)   [ \t\n]         \S (non-space)    [^ \t\n]

\d matches any digit
\w matches any letter, digit, underscore
\s matches any space, tab, newline (in Perl it also matches carriage return and form feed)

You can use these predefined groups in other groups:
/[\da-fA-F]/ # match any hexadecimal digit
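A minimal sketch of a predefined group used inside another group. The sample characters are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample characters; keep those that are hexadecimal digits.
my @hex = grep { /[\da-fA-F]/ } qw(7 b F g _ z);

# \w also accepts the underscore and letters that are not hex digits,
# but not punctuation such as - or !.
my @word = grep { /\w/ } qw(7 b F g _ z - !);

print "hex digits: @hex\n";
print "word characters: @word\n";
```

The square brackets matter: without them, /\da-fA-F/ would look for a digit followed by the literal text a-fA-F.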


Split (1)

The split function allows you to break a string into fields. split takes a regular expression and a string, and breaks up the string wherever the pattern occurs.

$ cat split1
#!/usr/local/bin/perl5 -w
$line = "Bill Shakespeare in love with Bill Gates";
@fields = split(/ /,$line); # split $line using space as delimiter
print "$fields[0] $fields[3] $fields[6]\n";
$ split1
Bill love Gates
$


Split (2)

split also works with $_: called with no arguments, it splits $_ on whitespace.

$ cat split2
#!/usr/local/bin/perl5 -w
$_ = "Bill Shakespeare in love with Bill Gates";
@fields = split; # split $_ using whitespace (default) as delimiter
print "$fields[0] $fields[3] $fields[6]\n";
$ split2
Bill love Gates
$
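The two forms of split are not identical when the input contains runs of spaces. The following sketch, with made-up input strings, shows the difference.

```perl
use strict;
use warnings;

# Made-up input with a doubled space between the words.
my $line = "Bill  Gates";

# Splitting on a single space yields an empty field between the two spaces.
my @by_space = split(/ /, $line);

# With no arguments, split works on $_ and treats any run of whitespace
# as one delimiter, ignoring leading whitespace.
$_ = "  Bill  Gates";
my @by_default = split;

print scalar(@by_space),   " fields: ", join("|", @by_space),   "\n";
print scalar(@by_default), " fields: ", join("|", @by_default), "\n";
```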


Pattern Memory (1)

How would we match a pattern that starts and ends with the same letter or word?

For this, we need to remember the pattern. Use ( ) around any pattern to put that part of the string into memory (it has no effect on the pattern itself).

To recall memory, include a backslash followed by an integer.

/Bill(.)Gates\1/


Pattern Memory (2)

Example:
/Bill(.)Gates\1/

This example matches a string starting with Bill, followed by any single non-newline character, followed by Gates, followed by that same single character.

So, it matches:
Bill!Gates!
Bill-Gates-

but not:
Bill?Gates!
Bill-Gates_

(Note that /Bill.Gates./ would match all four.)
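The four strings above can be run through both patterns to confirm the claim. This is a small sketch; the variable names are made up.

```perl
use strict;
use warnings;

my @candidates = ("Bill!Gates!", "Bill-Gates-", "Bill?Gates!", "Bill-Gates_");

# \1 must repeat exactly what the parentheses captured.
my @same_char = grep { /Bill(.)Gates\1/ } @candidates;

# Two independent dots accept any pair of characters.
my @any_char = grep { /Bill.Gates./ } @candidates;

print "with \\1: @same_char\n";
print "with dots: @any_char\n";
```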


Pattern Memory (3)

More examples:

/a(.)b(.)c\2d\1/

This example matches a string starting with a, a character (#1), followed by b, another single character (#2), c, the character #2, d, and the character #1.

So it matches: a-b!c!d-.


Pattern Memory (4)

The reference part can have more than a single character.

For example:
/a(.*)b\1c/

This example matches an a, followed by any number of characters (even zero), followed by b, followed by the same sequence of characters, followed by c.

So it matches: aBillbBillc and abc, but not: aBillbBillGatesc.


Or

How do we pick from a set of alternatives when the patterns are longer than one character?

The following example matches either Gates or Clinton or Shakespeare:

/Gates|Clinton|Shakespeare/

For single-character alternatives, /[abc]/ is the same as /a|b|c/.
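A short sketch of alternation in use. The name list, including "Bill Murray", is made up for illustration.

```perl
use strict;
use warnings;

# Made-up names; only some contain one of the three alternatives.
my @names = ("Bill Gates", "Bill Clinton", "Bill Shakespeare", "Bill Murray");
my @known = grep { /Gates|Clinton|Shakespeare/ } @names;

# For single characters, a group and an alternation select the same strings.
my @letters  = qw(a b c d);
my @by_group = grep { /[abc]/ } @letters;
my @by_alt   = grep { /a|b|c/ } @letters;

print "$_\n" for @known;
print "group: @by_group, alternation: @by_alt\n";
```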


Anchoring Patterns

Anchors require that the pattern be at the beginning or end of the line.

^ matches the beginning of the line (only if ^ is the first character of the pattern):
/^Bill/ # match lines that begin with Bill
/^Gates/ # match lines that begin with Gates
/Bill\^/ # match lines containing Bill^ somewhere
/\^/ # match lines containing ^

$ matches the end of the line (only if $ is the last character of the pattern):
/Bill$/ # match lines that end with Bill
/Gates$/ # match lines that end with Gates
/$Bill/ # match with contents of scalar $Bill
/\$/ # match lines containing $
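The anchors can be tried on a few strings as a sketch. The sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample lines.
my @lines = ("Bill Gates", "Gates, Bill", "Bill^Gates");

my @begins = grep { /^Bill/   } @lines;   # Bill at the start of the line
my @ends   = grep { /Bill$/   } @lines;   # Bill at the end of the line
my @caret  = grep { /Bill\^/  } @lines;   # escaped ^ is an ordinary character

print "begin with Bill: ", join(" | ", @begins), "\n";
print "end with Bill: ",   join(" | ", @ends),   "\n";
print "contain Bill^: ",   join(" | ", @caret),  "\n";
```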


Using =~ (1)

What if you want to match a different variable than $_? Answer: Use =~. Examples:

$name = "Bill Shakespeare";
$name =~ /^Bill/; # true
$name =~ /(.)\1/; # also true (matches ll)
if ($name =~ /(.)\1/) {
    print "$name\n";
}


Using =~ (2)

An example using =~ to match <STDIN>:

$ cat match1
#!/usr/local/bin/perl5 -w
print "Quit (y/n)? ";
if(<STDIN> =~ /^[yY]/){
    print "Quitting\n";
    exit;
}
print "Continuing\n";
$ match1
Quit (y/n)? y
Quitting
$


Ignoring Case

In the previous examples, we used [yY] and [nN] to match either upper or lower case.

Perl has an “ignore case” option for pattern matching: /somepattern/i

$ cat match1a
#!/usr/local/bin/perl5 -w
print "Quit (y/n)? ";
if(<STDIN> =~ /^y/i){
    print "Quitting\n";
    exit;
}
print "Continuing\n";
$ match1a
Quit (y/n)? Y
Quitting
$
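The same test can be wrapped in a function so it is easy to try without typing at a prompt. This is a sketch; the function name wants_to_quit and the sample answers are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the answer starts with y or Y, 0 otherwise.
sub wants_to_quit {
    my ($answer) = @_;
    return $answer =~ /^y/i ? 1 : 0;
}

# Made-up answers, including the trailing newline that <STDIN> would deliver.
for my $answer ("y\n", "Yes\n", "n\n", "maybe\n") {
    my $shown = $answer;
    chomp $shown;
    print "$shown: ", wants_to_quit($answer), "\n";
}
```

"maybe" is rejected because its y is not at the start of the line.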


Slash and Backslash

If your pattern has a slash character (/), you must precede each with a backslash (\):

$ cat slash1
#!/usr/local/bin/perl5 -w
print "Enter path: ";
$path = <STDIN>;
if($path =~ /^\/usr\/local\/bin/){
    print "Path is /usr/local/bin\n";
}
$ slash1
Enter path: /usr/local/bin
Path is /usr/local/bin
$


Different Pattern Delimiters

If your pattern has lots of slash characters (/), you can also use a different pattern delimiter with the form: m#somepattern#

The # can be any non-alphanumeric, non-whitespace character.

$ cat slash1a
#!/usr/local/bin/perl5 -w
print "Enter path: ";
$path = <STDIN>;
if($path =~ m#^/usr/local/bin#){
# if($path =~ m@^/usr/local/bin@){ # also works
    print "Path is /usr/local/bin\n";
}
$ slash1a
Enter path: /usr/local/bin
Path is /usr/local/bin
$
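As a sketch, the same pattern written with three different delimiters gives the same result. The sample path is made up; the m{...} form is a further delimiter choice that Perl allows and is not taken from the slides.

```perl
use strict;
use warnings;

# Made-up sample path.
my $path = "/usr/local/bin/perl";

my $with_slashes = ($path =~ /^\/usr\/local\/bin/) ? 1 : 0;  # every / escaped
my $with_hash    = ($path =~ m#^/usr/local/bin#)   ? 1 : 0;  # no escaping needed
my $with_braces  = ($path =~ m{^/usr/local/bin})   ? 1 : 0;  # paired delimiters

print "slashes: $with_slashes, hash: $with_hash, braces: $with_braces\n";
```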


Special Read-Only Variables (1)

After a successful pattern match, the variables $1, $2, $3,… are set to the same values as \1, \2, \3,…

You can use $1, $2, $3,… later in your program.

$ cat read1
#!/usr/local/bin/perl5 -w
$_ = "Bill Shakespeare in Love";
/(\w+)\W+(\w+)/; # match first two words
# $1 is now "Bill" and $2 is now "Shakespeare"
print "The first name of $2 is $1\n";
$ read1
The first name of Shakespeare is Bill


Special Read-Only Variables (2)

You can also get the values of $1, $2, $3,… directly by placing the match in a list context:

$ cat read2
#!/usr/local/bin/perl5 -w
$_ = "Bill Shakespeare in Love";
($first, $last) = /(\w+)\W+(\w+)/;
print "The first name of $last is $first\n";
$ read2
The first name of Shakespeare is Bill
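A sketch of the list-context form wrapped in a function, which also shows what happens when the match fails. The function name first_two_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first two words of $text.
# If the pattern does not match, the match returns an empty list,
# so both variables stay undefined.
sub first_two_words {
    my ($text) = @_;
    my ($first, $second) = $text =~ /(\w+)\W+(\w+)/;
    return ($first, $second);
}

my ($first, $last) = first_two_words("Bill Shakespeare in Love");
print "The first name of $last is $first\n";

my ($only, $none) = first_two_words("Bill");   # a single word: no match
print "no second word found\n" unless defined $none;
```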


Special Read-Only Variables (3)

Other read-only variables:
$& is the part of the string that matched the pattern.
$` is the part of the string before the match.
$' is the part of the string after the match.

$ cat read3
#!/usr/local/bin/perl5 -w
$_ = "Bill Shakespeare in Love";
/ in /;
print "Before: $`\n";
print "Match: $&\n";
print "After: $'\n";
$ read3
Before: Bill Shakespeare
Match:  in 
After: Love
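The three variables are only meaningful after a successful match, so a cautious version checks the result first. This is a sketch with made-up variable names; the values are put in brackets when printed so the spaces inside the match are visible.

```perl
use strict;
use warnings;

my $text = "Bill Shakespeare in Love";
my ($before, $match, $after);

# Copy the special variables right away: the next successful match overwrites them.
if ($text =~ / in /) {
    ($before, $match, $after) = ($`, $&, $');
}

print "Before: [$before]\n";
print "Match: [$match]\n";
print "After: [$after]\n";
```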


Repeat {n}

/(fred){5,15}/ Match from five to fifteen repetitions of “fred”

/a{5,}/ Match five or more repetitions of “a”

/\w{8}/ Match exactly 8 word characters.
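The repeat counts can be checked with a few made-up strings. The sketch also shows that, without anchors, /\w{8}/ succeeds on any string containing at least eight consecutive word characters.

```perl
use strict;
use warnings;

# "fred" x 5 builds the string "fredfredfredfredfred".
my $five_freds = (("fred" x 5) =~ /(fred){5,15}/) ? 1 : 0;
my $four_freds = (("fred" x 4) =~ /(fred){5,15}/) ? 1 : 0;

my $five_as = ("aaaaa" =~ /a{5,}/) ? 1 : 0;
my $four_as = ("aaaa"  =~ /a{5,}/) ? 1 : 0;

# "password1" has nine word characters.
my $unanchored = ("password1" =~ /\w{8}/)   ? 1 : 0;  # finds eight of the nine
my $anchored   = ("password1" =~ /^\w{8}$/) ? 1 : 0;  # the whole string must be eight

print "fred x5: $five_freds, fred x4: $four_freds\n";
print "aaaaa: $five_as, aaaa: $four_as\n";
print "unanchored: $unanchored, anchored: $anchored\n";
```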