automata and formal languages course …grammarware.net/slides/2014/regular.pdf · regular...

48
R EGULAR L ANGUAGES , E XPRESSIONS AND A PPLICATIONS A UTOMATA AND F ORMAL L ANGUAGES , # COURSE [15103] D R . V ADIM Z AYTSEV A . K . A . @ GRAMMARWARE

Upload: buidien

Post on 22-May-2018

233 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

R E G U L A R L A N G U A G E S , E X P R E S S I O N S A N D A P P L I C A T I O N S

A U T O M A T A A N D F O R M A L L A N G U A G E S , # C O U R S E [ 1 5 1 0 3 ]

D R . V A D I M Z A Y T S E V A . K . A . @ G R A M M A R W A R E

Page 2: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

R O A D M A P

• Chomsky hierarchy revisited

• How to see if the language is regular?

• The class of regular languages

• Tools to work with regular languages

• Advanced methods

source is given at the bottom of each slide

Page 3: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C H O M S K Y H I E R A R C H Y

Duncan Rawlinson, Chomsky.jpg, 2004, CC-BY.

Page 4: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C H O M S K Y H I E R A R C H Y

l a n g u a g e s

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

Noam Chomsky. On Certain Formal Properties of Grammars, Information & Control 2(2):137–167, 1959.

Page 5: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

l a n g u a g e s

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

C H O M S K Y : A U T O M A T A Tu r i n g m a c h i n e

p u s h d o w n a u t o m a t o n

f i n i t e s t a t e a u t o m a t o n

l i n e a r b o u n d e d a u t o m a t o n

(too many to list)

Page 6: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

l a n g u a g e s

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

C H O M S K Y : T O O L S i m a g i n a r y

g r a m m a r w a r er e g e x p

c o m p u t e r

Page 7: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

l a n g u a g e s

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

C H O M S K Y : R E W R I T I N G α → β

X → γX → a X → a B

αXβ → αγβ

Axel Thue. Probleme über Veränderungen von Zeichenreihen nach gegebenen Regeln, 1914. http://arxiv.org/abs/1308.5858

Page 8: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

R E G E X P S R E V I S I T E D

• Regular sets by Stephen Kleene in 1956

• ∅, ε, letters from Σ

• concatenation

• iteration

• alternation

• Precisely fit the regular classS. C. Kleene, Representation of Events in Nerve Nets and Finite Automata. In Automata Studies, pp. 3–42, 1956.

photo from: Konrad Jacobs, S. C. Kleene, 1978, MFO.

Page 9: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

D E T E R M I N I S T I C F I N I T E A U T O M A T O N

C. E. Shannon, W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, 1949. (finite state grammars and finite diagrams and finite state Markov processes)

Page 10: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

T O W H I C H C L A S S D O L A N G U A G E S B E L O N G ?

• ∅

• {ε}

• {ε} in a non-empty alphabet

• {x, y, z}

• {0ⁿ | n > 1}

• decimal numbers

• {0ⁿ1ⁿ | n > 1}

• {0ⁿ1 ⁿ | n > 1}

• {0ⁿ1ⁿ2ⁿ | n > 1}interactive

a l l

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

²

R E G U L A R

C - F R E E

C - S E N S I T I V E

F I N I T E

F I N I T E

F I N I T E

F I N I T E

R E G U L A R

C - F R E E

Page 11: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

I S A T A S K S O LV A B L E B Y R E G U L A R M E A N S ?

• Substring search

• grep, contains(), find(), substring(), …

• Substring replacement

• sed, awk, perl, vim, replace(), replaceAll(), …

• Pretty-printing

• VS.NET, Sublime, TextMate, …

interactive

Page 12: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

I S A T A S K S O LV A B L E B Y R E G U L A R M E A N S ?

• Counting [non-empty] lines in a file

• wc -l, grep -c “”

• grep -v “^$”, sed -n /./p | wc -l, …

• Parsing HTML

• <BODY><TABLE><P><A HREF=…

• Parsing a postcode

• 1098 XG, …

interactive

Page 13: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

H O W T O P R O V E W H I C H C L A S S A L A N G U A G E B E L O N G S T O

Page 14: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

P U M P I N G L E M M A

• In simple terms

• sufficiently long words have repeatable parts

• (works for all infinite regular languages)

• L is regular ⇒ formula holds

• Formula does not hold ⇒ L is finite or not regular

a l l

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

Jos C.M. Baeten, Models of Computation: Automata, Formal Languages and Communicating Processes, §2.9, p.58.

F O R R E G U L A R L A N G U A G E S

Page 15: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

J O H N M Y H I L L A N D A N I L N E R O D E

Cornell, Faculty and Senior Researcher Profiles. Who's That Mathematician? Paul R. Halmos Collection - Page 36.

Page 16: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

M Y H I L L – N E R O D E T H E O R E M

• Myhill-Nerode equivalence

• u~v ⟺ ∀w: (uw∈L ∧ vw∈L) ∨ (uw∉L ∧ vw∉L)

• Theorem: L is regular iff the number of Myhill-Nerode equivalence classes is finite.

• In simple terms

• few groups of forgettable prefixes

• Works both ways

Anil Nerode, Linear Automaton Transformations, Proceedings of the AMS 9, 1958.

Page 17: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

L I M I T E D M E M O R Y

• Advice from teh internetz:

• how many characters must you remember from the stream?

• bounded ⇒ regular

• unbounded ⇒ ?

• Correct or not?

Brian M. Scott, http://math.stackexchange.com/questions/282216/determine-if-a-language-is-regular-from-the-first-sight

c o r r e c t ! m e m o r y i s l i m i t e d , a l p h a b e t i s l i m i t e d ⇒ p r e f i x e s a r e l i m i t e d

Page 18: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

N U M B E R O F C O U N T E R S

• {0ⁱ1ⁿ…}

• no relation between i and n ⇒ regular

• 1 counter ⇒ context-free

• n counters ⇒ context-sensitive

• ∞ counters ⇒ recursively enumerable

Himanshu Saikia, http://math.stackexchange.com/questions/282216/determine-if-a-language-is-regular-from-the-first-sight

Page 19: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

D I S A S S E M B L E / M A S S A G E

• {0ⁿ1ⁿ | n > 1}

• {0ⁱ1ⁿ | n > 1, i > 1, i ≠ n}

• matching brackets language not regular

• ⇒ no matching pairs language is regular

• Many combinations of regular languages are regular

• Proving by decomposition is valid

Page 20: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

T H E C L A S S O F R E G U L A R L A N G U A G E S

Page 21: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R C O M P L E M E N T

• If A is a regular language, then

• Ā is regular

• Meaning…

• grep -v “123” file.txt

• (Must know the alphabet Σ)

• (Actually stronger: any finite number of errors)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4. E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.

Page 22: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R S E T U N I O N

• If A and B are regular languages, then

• A⋃B is regular

• Meaning…

• [a-z]

• x | y | z (in some notations)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 23: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R I N T E R S E C T I O N

• If A and B are regular languages, then

• A⋂B is regular

• Meaning…

• cat file.txt | grep “abc” | grep “xyz”

• (Not true for context-free languages!)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 24: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R D I F F E R E N C E

• If A and B are regular languages, then

• A∖B is regular

• Meaning…

• cat file.txt | grep “abc” | grep -v “123”

• (Not true for context-free languages!)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 25: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R I T E R A T I O N

• If A is a regular language, then

• A* and A⁺ are regular

• Meaning…

• [a]*

• [a]⁺

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 26: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R C O N C A T E N A T I O N

• If A and B are regular languages, then

• AB is regular

• Meaning…

• [Bb][Oo][Dd][Yy]

• (Just glue regexps; in practice, watch out for subgroups)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 27: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S [ S O M E T I M E S ] C L O S E D U N D E R D E C O M P O S I T I O N

• If A is a regular language, then

• “front halves” is regular

• “tail halves” is regular

• “middle thirds” is regular

• “arbitrary halves/thirds” is regular

• NB: glued side thirds is NOT regular

E. Stearns, J. Hartmanis, Regularity Preserving Modifications of Regular Expressions, Information & Control 6:55–69, 1963.

Page 28: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C L A S S C L O S E D U N D E R H O M O M O R P H I S M

• If A is a regular language and

• h : Σ → Σ*

• then

• h(A) is regular

• Meaning that debugging is feasible

• (Even better for context-free languages: substitutions)

J. E. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages and Computation, Chapter 4.

Page 29: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

var whitelist = @"</?p>|<br\s?/?>|</?b>|</?strong>|</?i>|</?em>| </?s>|</?strike>|</?blockquote>|</?sub>|</?super>| </?h(1|2|3)>|</?pre>|<hr\s?/?>|</?code>|</?ul>| </?ol>|</?li>|</a>|<a[^>]+>|<img[^>]+/?>";

R E G E X P S Y O U N E E D T O D E B U G

Jeff Atwood, If You Like Regular Expressions So Much, Why Don't You Marry Them?, 22 Mar 2005. Jeff Atwood, Regular Expressions: Now You Have Two Problems, 27 Jun 2008.

Page 30: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*))*)?;\s*)

“somewhat pushes the limits of what it is

sensible to do

with regular expressions”

Jeff Atwood, Regex use vs. Regex abuse, 16 Feb 2005. RFC822. Paul Warren, Mail::RFC822::Address: regexp-based address validation, 17/09/2012.

Page 31: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

T O O L S O V E R V I E W

Page 32: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

A L L T O O L S A R E

D I F F E R E N T

POSIX standard since 1993

(who the hell uses [[:digit:]] anyway?)

Page 33: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

G R E P

• Ken Thompson: qed, ed, grep

• grep “abc” program.c

• grep \\d file.txt

• grep ^#*\ \\w README.md

photo from: Archetypal hackers ken (left) and dmr (right).

Page 34: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

S E D

• sed 's/Finite/Regular/g' oldfile >newfile

• sed -n 12,18p myfile

• sed 12,18d myfile

• sed 12q myfile

• sed 12,/foo/d myfile

• sed ‘$d’ myfile

• sed -n '/[0-9]\{2\}/p' myfile

• sed ‘5!s/Finite/Regular/g' oldfile >newfile

• sed ‘/Automaton/!s/Finite/Regular/g’ oldfile >newfile

Lee E. McMahon, sed, Stream EDitor, 1973 or 1974, http://www.columbia.edu/~rh120/ch106.x09 photo from: http://merdivengo.blogspot.com/2012/03/turnuva-sistemleri-uzerine.html

Page 35: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

O R I G I N S O F A W K

Page 36: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

A W K

• Turing-complete one-liner language with regexps

• Built-in variables

• $0, NF, $1, $2, $3, …, $NF

• FILENAME, NR, FS, OFS, RS, ORS

• Built-in operators

• print, printf, length, $

• Can define own functions & variables

A. V. Aho, B. W. Kernighan, P. J. Weinberger, AWK — A Pattern Scanning and Processing Language. SPE, 9(4): 267-279, 1979.

Page 37: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

A W K I N A C T I O N

Page 38: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

A W K E X A M P L E S

• { w += NF c += length + 1 }END { print NR, w, c }

• yes Wikipedia | awk 'NR % 4 == 1 { printf "%6d %s\n", NR, $0 }' | sed 5q

• #!/usr/bin/awk -fBEGIN { print "Hello, world!" }

https://en.wikipedia.org/wiki/AWK

Page 39: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

A W K O U T P U T

Page 40: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

L E X

• Regexps used for the first phase of parsing since 1968.

• Wikipedia explains why it is used together with yacc/bison:

!

• Collection of regexp patterns with actions

• [a-zA-Z]+ { printf("Word: %s\n", yytext); }

• .|\n {}

• Easy to write a tokeniserW. L. Johnson, J. H. Porter, S. I. Ackley, D. T. Ross, Automatic Generation of Efficient Lexical Processors Using Finite State

Techniques. Communications of the ACM 11 (12): 805–813, 1968. M. E. Lesk, LEX - A Lexical Analyzer Generator, CSTR 39, Bell Laboratories, 1975.

Page 41: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

L E X E X A M P L E

int lineno=1; !letter [a-zA-Z] digit [0-9] id {letter}({letter}|{digit})* number {digit}+ %% printf("\nTokeniser running -- ^D to exit\n"); !^{id} {line();printf("<id>");} {id} printf("<id>"); ^{number} {line();printf("<number>");} {number} printf("<number>"); ^[ \t]+ line(); [ \t]+ printf(" "); [\n] ECHO; ^[^a-zA-Z0-9 \t\n]+ {line();printf("\\%s\\",yytext);} [^a-zA-Z0-9 \t\n]+ printf("\\%s\\",yytext); %% line() { printf("%4d: ",lineno++); }

M. G. Roth, CS 631, Lex example, https://www.cs.uaf.edu/~cs631/lex_token.txt

c h a r a c t e r- l e v e l g r a m m a r

::= ::= ::= ::=

Page 42: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

P E R L [ , T C L , P Y T H O N , … ]

• Henry Spencer made advanced regex in 1986

• his DFA/NFA-based TCL version is faster!

• Can be used as sed:

• perl -pi -w -e 's/Perl/Python/g;' *

• Or, in programs:

• $bar =~ /foo/

• Redesigned in Perl 6 (merged with PEG)

Page 43: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

P E R L R E G E X E X A M P L E S

• Match

• my ($hs, $ms, $ss) = ($time =~ m/(\d+):(\d+):(\d+)/);

• Substitute

• $s =~ s/dog/cat/;

• Transliterate

• $uc =~ tr/a-z/A-Z/;

Tutorialspoint, PERL Regular Expressions, http://www.tutorialspoint.com/perl/perl_regular_expression.htm

Page 44: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

P C R E

• “Perl Compatible Regular Expressions”

• P.S.: not compatible with Perl

• P.P.S.: not regular

• C library by Philip Hazel (stable release Dec. 2013)

• PCREs are used in other languages

• PHP, Ruby, JavaScript, …

• Way beyond regular: backrefs, recursion, assertions, …

• <(\w+)>.*<\/\1>, \((?R)*\)

http://www.pcre.org

Page 45: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

R A S C A L ( M E T A P R O G R A M M I N G )

• Java Regex

• /xyz/ := “xyz”

• if (/xyz/ := s) {…}

• if (/x<m:y+>z/ := s) println(m);

• /[0-9]+ \w*/ := “1098 XG”

• Lexical grammars

• lexical Number = [1-9][0-9]*;

• parse(#Number, file);

http://rascal-mpl.org/

Page 46: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

C O N C L U S I O N

• Benefits of regular languages:

• lexical tools are fast & always applicable

• (relatively) easy to develop

• Drawbacks:

• very limited context

• (usually) many false positives, requires tweaking

Page 47: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

S U M M A R Y

a l l

r e c u r s i v e l y e n u m e r a b l e

c o n t e x t -s e n s i t i v e

c o n t e x t -f r e e

r e g u l a r

f i n i t e

• Chomsky hierarchy • languages, automata, algorithms, rewriting systems, hardware

• Judging if regular • pumping lemma, Myhill-Nerode, memory, counters, disassembly

• Class closed under • complement, union, intersection, • difference, concatenation, • homomorphism

• Tools • grep, perl, sed, awk, rsc

• To read: Jeffrey E. F. Friedl, Mastering Regular Expressions: Understand Your Data and Be More Productive, O’Reilly, 2006.

Page 48: AUTOMATA AND FORMAL LANGUAGES COURSE …grammarware.net/slides/2014/regular.pdf · regular languages, expressions and applications automata and formal languages, #course[15103] dr

T H A N K S F O R Y O U R A T T E N T I O N !

• This was Dr. Vadim Zaytsev a.k.a. grammarware

• grammarware.net, twitter.com/grammarware, grammarware.github.com, …

• I usually teach at Master Software Engineering

• and do research on grammars and software languages

• Affiliations

• UvA (2013–2014), CWI (2000, 2010–2013), Uni Koblenz (2008–2010), VU (2004–2008), UTwente (2002–2004), Rostov State Transport University (1999–2008), Rostov State University (1998–2003), Desk.nl (1999, 2001)

• Slides are CC-BY-SA: grammarware.net/slides/2014/regular.pdf