einführung in die computerlinguistik - reguläre ausdrücke ... · dfas and regular expressions...

24
Einf¨ uhrung in die Computerlinguistik Regul¨ are Ausdr¨ ucke und regul¨ are Grammatiken Laura Kallmeyer Heinrich-Heine-Universit¨atD¨ usseldorf Summer 2019 1 / 24

Upload: others

Post on 30-Aug-2019

2 views

Category:

Documents


0 download

TRANSCRIPT

Einfuhrung in die ComputerlinguistikRegulare Ausdrucke und regulare Grammatiken

Laura Kallmeyer

Heinrich-Heine-Universitat Dusseldorf

Summer 2019

1 / 24

Regular expressions (1)

Let Σ be an alphabet. The set of regular expressions over Σ isrecursively defined:

∅ is a regular expression denoting the set ∅.ε is a regular expression denoting the set {ε}.For each a ∈ Σ: a is a regular expression denoting {a}.If r and s are regular expressions denoting R and S, then(r|s), (rs) and (r∗) are also regular expressions denotingR ∪ S, R ◦ S and R∗ respectively.

Assume that ∗ has a higher priority than concatenation which inturn has a higher priority than |.a+ is an abbreviation for aa∗.

2 / 24

Regular expressions (2)

Beispiel: regulare Ausdrucke

ε|a denotiert {ε} ∪ {a} = {a, ε}.εa denotiert {ε} ◦ {a} = {a}.∅|a denotiert ∅ ∪ {a} = {a}.∅a denotiert ∅ ◦ {a} = ∅.cc∗ denotiert die Menge aller Worter uber {c}, derenLange ≥ 1 ist.

(cc)∗ denotiert die Menge aller Worter uber {c}, derenLange eine gerade Zahl ist.

3 / 24

Regular expressions (3)

Beispiel: regulare Ausdrucke

(a|ε)bcc∗ denotiert {bc, abc,bcc, abcc,bccc, abccc, . . . } ={anbcm | 0 ≤ n ≤ 1,m ≥ 1}((a|b)∗|de)∗ denotiert {w1 . . .wn |n ≥ 0, jedes der wi fur1 ≤ i ≤ n ist entweder aus {a,b}∗ oder hat die Form (de)k

fur irgendein k ≥ 0}⇒ die Sprache ist die Menge aller Worter uber den Alpha-bet {a,b,d, e}, in denen jedes d notwendig von einem egefolgt ist.Aquivalent: (a|b|de)∗

a(bd+b)∗a denotiert die Menge aller Worter der Form awa,wobei w = ε oder w = bw′b und es gilt, dass w′ ∈ {b,d}∗mit d anfangt und endet und dass in w′ jede Gruppe vonbs genau die Lange 2 hat.

4 / 24

Regular expressions (4)

The following holds:

Equivalence of regular expressions and FSA

For each language L: there is a regular expression x with L =L(x) iff there is a FSA M with L = L(M).

In order to show this, we

1 define NFAs with empty transitions and show their equiva-lence to NFAs;

2 show how to construct an equivalent NFA with emptytransitions for a given regular expression;

3 show how to construct an equivalent regular expression fora given DFA.

Zu 1. und 2. siehe Vorlesung “Mathematische Grundlagen derComputerlinguistik”.

5 / 24

DFAs and regular expressions (1)

For each DFA, there is an equivalent regular expression.

Beispiel: reg. Ausdruck fur FSA

q0

q1

q2

a

c

ac

b

Um von q0 zu q2 zu gelangen, kann man entweder

(1) einen Weg ohne Unterwegszustand q2 wahlen. Dies kannwiederum entweder ohne Durchlauf von q1 sein (c) oder einerstes Mal von q0 zu q1 (a), dann beliebig oft von q1 zu q1

ohne Durchlauf von q1 (a∗), dann von q1 nach q2 (c)⇒ (c|a+c), oder

(2) man geht ein erstes Mal von q0 zu q2 (c|a+c) und dannbeliebig oft von q2 zu q2 ohne Durchlauf von q2 ((ba∗c)∗).

⇒ (c|a+c)(ba∗c)∗

6 / 24

DFAs and regular expressions (2)

Beispiel: reg. Ausdruck fur FSA

q0

q1

q2

a

d

c

a

c

von q0 nach q2 ohne q1, q2 ∅von q0 nach q1 ohne q1, q2 avon q1 nach q1 ohne q1, q2 a|davon q1 nach q2 ohne q1, q2 cvon q0 nach q2 ohne q2 ∅|a(a|da)∗c = a(a|da)∗cvon q2 nach q2 ohne q2 ca(a|da)∗cvon q0 nach q2 a(a|da)∗c(ca(a|da)∗c)∗

7 / 24

DFAs and regular expressions (3)

Algorithm for construction the regular expression for a givenDFA: Assume that {q1, . . . , qn} it the set of states.

Define Rki,j as the set of sequences of input symbols that

can be obtained from following all paths from qi to qj thatdo not pass through states ql with l > k.Define rki,j as the corresponding regular expression.

q1

q2

q3

a

d

c

a

c

r11,3 = ∅ r11,2 = a

r12,2 = a|da r12,3 = c

r21,3 = ∅|a(a|da)∗c = a(a|da)∗c r23,3 = ca(a|da)∗c

r01,3 = a(a|da)∗c(ca(a|da)∗c)∗

8 / 24

DFAs and regular expressions (5)

Question: What do the Rki,j look like and what are the

corresponding regular expressions rki,j?

k = 0: Only paths of length 0 or 1 can be considered, since,between qi and qj, no other states ql are possible.Each r0i,j has the form a1|a2| · · · |am or (if i = j) a1|a2| · · · |am|εor (if i 6= j and there is no edge from qi to qj) ∅.

9 / 24

DFAs and regular expressions (4)

In general, Rkij contains

all elements from Rk−1ij ,

all concatenations ofa) an element from Rk−1

ik ,

b) any number (including 0) of elements from Rk−1kk and

c) an element from Rk−1kj .

Consequently, rki,j = rk−1ij |rk−1ik (rk−1kk )∗rk−1kj .

Finally, the regular expression for the language accepted by theautomaton is rn1f1 |r

n1f2| · · · |rn1fm if F = {qf1 , . . . , qfm}.

10 / 24

DFAs and regular expressions (5)

From DFA to its regular expression

q1 q2

a

b

c

r212 = r112|r112(r122)∗r122

r112 = r012|r011(r011)∗r012= a|ε(ε∗)a = a

r122 = r022|r021(r011)∗r012= (b|ε)|c(ε)∗a = ε|b|ca

r212 = a|a(ε|b|ca)+ = a(ε|b|ca)∗ = a(b|ca)∗

11 / 24

Abschlusseigenschaften regularer Sprachen (1)

Die von regularen Ausdrucken denotierten Sprachen heißenregulare Sprachen.

Abschlusseigenschaften

1 Wenn L1 und L2 regulare Sprachen sind, dann

ist die Vereinigung von L1 und L2 (L1 ∪ L2) ebenfalls eineregulare Sprache.ist die Schnittmenge von L1 und L2 (L1 ∩ L2) ebenfalls eineregulare Sprache.ist die Konkatenation von L1 und L2 (L1 ◦ L2) ebenfalls eineregulare Sprache.

2 Das Komplement einer regularen Sprache ist eine regulareSprache.

3 Wenn L eine regulare Sprache, dann ist L∗ eine regulareSprache.

12 / 24

Abschlusseigenschaften regularer Sprachen (2)

Fur Vereinigung, Konkatenation und L∗ kann man ganz einfachdie entsprechenden regularen Ausdrucke angeben:

Seien r1 und r2 regulare Ausdrucke mit L1 = L(r1) undL2 = L(r2). Dann gilt

L((r1)|(r2)) = L1 ∪ L2

L((r1)(r2)) = L1 ◦ L2

L((r1)∗) = L∗1

13 / 24

Abschlusseigenschaften regularer Sprachen (3)

Fur die Komplementbildung nimmt man den DFA, der dieSprache erkennt, macht die Ubergangsfunktion vollstandig,indem man einen weiteren Zustand (trap state) hinzufugt, undwahlt als neue Endzustande alle Zustande, die imUrsprungsautomaten kein Endzustand sind.

Komplementbildung im DFA

trap state

q0

q1a

b

a

q2b

a,b

Komplementbildung:q0

q1

q2

a

bba,b

a

14 / 24

Abschlusseigenschaften regularer Sprachen (4)

Schnittbildung kann (de Morgan) durch Vereinigung undKomplementbildung ausgedruckt werden:

L1 ∩ L2 = L1 ∩ L2 = L1 ∪ L2

Damit ist L1 ∩ L2 eine regulare Sprache, falls L1 und L2 regularsind.

15 / 24

Regular Grammars (1)

Another way to define a language is via a grammar. A grammarformalism defines the form of rules and combination operationsallowed in a grammar.

A context-free grammar (CFG) G is a tuple 〈N,T,P, S〉 with

N and T disjoint alphabets, the nonterminals and termi-nals,

S ∈ N the start symbol, and

P a set of productions of the form A→ α with A ∈ N, α ∈(N ∪ T)∗.

16 / 24

Regular Grammars (2)

Let G = 〈N,T,P, S〉 be a CFG. The (string) language L(G) of G

is the set {w ∈ T∗ |S ∗⇒ w} where

for w,w′ ∈ (N ∪ T)∗: w ⇒ w′ iff there is a A→ α ∈ P andthere are v, u ∈ (N∪T)∗ such that w = vAu and w′ = vαu.∗⇒ is the reflexive transitive closure of ⇒:

w0⇒ w for all w ∈ (N ∪ T)∗, and

for all w,w′ ∈ (N ∪ T)∗: wn⇒ w′ iff there is a v such that

w⇒ v and vn−1⇒ w′.

for all w,w′ ∈ (N∪T)∗: w∗⇒ w′ iff there is a i ∈ N such that

wi⇒ w′.

A language is called a context-free language (CFL) iff it isgenerated by a CFG.

17 / 24

Regular Grammars (3)

A CFG is called

right-linear iff for all productions A → β: β ∈ T∗ orβ = β′X with β′ ∈ T∗, X ∈ N.

left-linear iff for all productions A→ β: β ∈ T∗ or β = Xβ′

with β′ ∈ T∗, X ∈ N.

regular if it is left-linear or right-linear.

For each left-linear grammar there is an equivalent right-lineargrammar and vice versa.

18 / 24

Regular Grammars (4)

Examples: regular grammars

A CFG G = 〈{S,A,B}, {a,b, c},P, S〉

1 with P = {S → abS | cA,A → bB,B → cbB | ε} is right-linear.It generates the language L((ab)∗cb(cb)∗).

2 with P = {S → Sa |Sb |Abc,A → B,B → Bcb | ε} isleft-linear.It generates the language L((cb)∗bc(a|b)∗).

3 with P = {S→ abSA | c,A→ b} is not regular.It generates the language {(ab)ncbn |n ≥ 0}.

19 / 24

Regular Grammars (5)

Equivalence of regular grammars and regular expressions

L = L(G) for a regular grammar G iff L is a regular set (i.e.,can be described by a regular expression).

To show this, we show that

for each right-linear grammar, there exists an equivalentNFA; and

for each DFA, there is an equivalent right-linear grammar.

20 / 24

Regular Grammars (6)

Example: Constructing an NFA with empty transition for aright-linear grammar

G = 〈{S,A}, {a, b, c},P,S〉 with P = {S→ abS | cA,A→ bA | c}

Corresponding NFA:

[S]

[abS]

[cA]

[bS]

[c] [ε]

[A] [bA]

ε

ε

εε

b

a

cb

c

21 / 24

Regular Grammars (7)

Construction of equivalent NFA (with empty transitions) forgiven right-linear G:

Start state [S].

For all productions A→ β1β2 there is a state [β2].

Transitions simulate top-down left-to-right traversal ofderivation tree:

For every A→ β:[A] [β]

ε

For every state [aβ] with a ∈ T:[aβ] [β]

a

Final state is [ε].

22 / 24

Regular Grammars (8)

Construction of a right-linear grammar for given DFA:

The states are the nonterminals.

The start state is the start symbol.

For each transitionQ Q′a

add a production Q → aQ′ and, if Q′ ∈ F, a productionQ→ a.

If the start state Q0 is a final state, then add Q0 → ε

Example: Constructing a right-linear grammar for a DFA

q0

q1

q2

a

c

a c

b

〈{q0, q1, q2}, {a, b, c},{q0 → aq1 | cq2 | c,q1 → aq1 | cq2 | c, q2 → bq1},

q0〉

23 / 24

Regular Grammars (9)

For a left-linear grammar, we have to follow the transitions inthe opposite direction.

The nonterminals are Q∪{S} where S is a new start symbol.

S→ qf for all qf ∈ F.

For each transitionq q′a

we add q′ → qa.

q0 → ε where q0 is the DFA’s start state.

Example: Constructing a left-linear grammar for a DFA

q0

q1

q2

a

c

a c

b

〈{q0, q1, q2, S}, {a,b, c},{S→ q2, q0 → ε,q2 → q1c | q0b,q1 → q1a | q0a | q2b},

S〉

24 / 24