ling 388: language and computers sandiway fong lecture 2: 8/23

19
LING 388: Language and Computers Sandiway Fong Lecture 2: 8/23

Post on 20-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

LING 388: Language and Computers

Sandiway Fong

Lecture 2: 8/23

Administrivia

• Class:

Major

Math

Linguistics

Creative WritingCivil Engineering

MIS

Accounting

Economics

Computer Science

PsychologyGerman

Administrivia

• Class:Programming Experience

Some

None

Administrivia

• Did you bring your laptop?

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Administrivia

• Did you install Perl yet?– Active State Perl

– http://www.activestate.com– install free version

Regular Expressions

• regular expressions are used in string pattern-matching – important tool in automated searching

– formally equivalent to finite-state automata (FSA) and regular grammars

• popular implementations– Unix grep

• command line program• returns lines matching a regular expression• standard part of all Unix-based systems

– including MacOS X (command-line interface in Terminal)

• many shareware/freeware implementations available for Windows XP– just Google and see...

– grep functionality is built into many programming languages• e.g. Perl

– wildcard search in Microsoft Word• limited version of regular expressions (not full power)• with differences in notation

Regular Expressions

• Historical note – grep:

• name comes from Unix ed command – g/re/p– “search globally for lines matching the regular expression, and

print them”– [Source: http://en.wikipedia.org/wiki/Grep]

– ed• is an obscure and difficult-to-use text edit program on

Unix systems– doesn’t need a screen display– would work on an ancient teletype

Regular Expressions

• Formally– a regular expression

(regexp) is formed from:• an alphabet

(= set of characters)

• operators

– a regexp is shorthand for a set of strings

• (possibly infinite set)

• (strings are of finite length)

• Formally– a set of strings is

called a language– a language that can

be defined by a regular expression is called a regular language

• (not all languages are regular)

Regular Expressions

• alphabet– e.g. {a,b,c,...,z}

set of lower case English letters

Note: case is important

• operators– a single symbol

a– an exactly n

occurrences of

a, n a positive integer

– an a3 aaa

– a* zero or more occurrences of a

– a+ one or more occurrences of a

– concatenation• two regexps may be

concatenated, the resulting string is also a regexp

• e.g. abc

– disjunction• infix operator: |• (vertical bar)• e.g. a|b

– parentheses may be used for disambiguation

• e.g. gupp(y|ies)

Regular Expressions

• Technically, a+ is not necessary– aa* = a+

“a concatenated with a* (zero or more occurrences of a)”

= “one or more occurrences of a”

• Disjunction– [set of characters]

set of characters enclosed in square brackets means match one of the characters

– e.g. [aeiou]

matches any of the vowels a, e, i, o or u but not d

– dash (-) shorthand for a range

– e.g. [a-e]

matches a, b, c, d or e

Regular Expressions

• Range defined over a computer character set– typically ASCII– originally a 7 bit

character set – 2^7 = 128 (0-127)

different characters– ASCII = American

Standard Code for Information Interchange

– e.g. [A-z] [0-9A-Za-z]

• http://www.idevelopment.info/data/Programming/ascii_table/PROGRAMMING_ascii_table.shtml

Regular Expressions: Microsoft Word

• terminology:– wildcard search

Regular Expressions: Microsoft Word

Note: zero or more times is missing in Microsoft Word

Regular Expressions

• Perl uses the same notation as grep (textbook also uses grep notation)

• More shorthand– question mark (?) means the

previous regexp is optional– e.g. colou?r– matches color or colour– metacharacters or operators

like ? have a function– to match a question mark,

escape it using a backslash (\)– e.g. why\?– ? in Microsoft Word means

match any character

• More shorthand– period (.) stands for any

character (except newline)

– e.g. e.tmatches eat as well as eet

– caret sign (^) as the first character of a range of characters [^set of characters]means don’t match any of the characters mentioned (after the caret)

– e.g. [^aeiou]– any character except for one

of the vowels listed

Regular Expressions

• Text files in Unix consists of sequences of lines separated by a newline character (LF = line feed)

• Typically, text files are read a line at a time by programs

• Matching in Perl and grep is line-oriented(can be changed in Perl)

• Differences in platforms for line breaking:– Unix: LF– Windows (DOS): CR LF– MacOS (X): CR

Regular Expressions

• Line-oriented metacharacters:– caret (^) at the beginning of a

regexp string matches the “beginning of a line”

– e.g. ^The

matches lines beginning with

the sequence The– Note: the caret is very

overloaded...• [^ab]

• a^b

– dollar sign ($) at the end of a regexp string matches the “end of the line”

– e.g. end\.$– matches lines ending in

the sequence end.– e.g. ^$

matches blank lines only– e.g. ^ $

matches lines contains exactly one space

Regular Expressions

• Word-oriented metacharacters:– a word is any sequence of digits [0-9],

underscores (_) and letters [a-zA-Z]– (historical reasons for this)– \b matches a word boundary, e.g. a

space or beginning or end of a line or a non-word character

– e.g. the– matches the, they, breathe and other– but \bthe will only match the and they– the\b will match the and breathe– \bthe\b will only match the– (\< and \> can also be used to match the

beginning and end of a word)

– e.g. \b99– matches 99 but not 299– also matches $99

Regular Expressions

• Range abbreviations:– \d (digit) = [0-9]

– \s (whitespace character) = space (SP), tab (HT), carriage return (CR), newline (LF) or form feed (FF)

– \w (word character) = [0=9a-zA-Z_]

• uppercase versions denote negation– e.g. \W means a non-word

character

\D means a non-digit

• Repetition abbreviations:– a? a optional– a* zero or more a’s– a+ one or more a’s– a{n,m} between n and m a’s– a{n,} at least n a’s– a{n} exactly n a’s– e.g. \d{7,}

matches numbers with at least 7 digits

– e.g. \d{3}-\d{4}– matches 7 digit telephone

numbers with a separating dash

Reading

• Perl Quick Intro– http://perldoc.perl.org/perlintro.html

• Perl Regular Expressions (RE)– perlrequick - Perl regular expressions quick start

• http://perldoc.perl.org/perlrequick.html

– perlretut - Perl regular expressions tutorial• http://perldoc.perl.org/perlretut.html