einführung in die computerlinguistik - reguläre ausdrücke ... · dfas and regular expressions...

Einfuhrung in die ComputerlinguistikRegulare Ausdrucke und regulare Grammatiken

Laura Kallmeyer

Heinrich-Heine-Universitat Dusseldorf

Summer 2019

1 / 24

Regular expressions (1)

Let Σ be an alphabet. The set of regular expressions over Σ isrecursively defined:

∅ is a regular expression denoting the set ∅.ε is a regular expression denoting the set {ε}.For each a ∈ Σ: a is a regular expression denoting {a}.If r and s are regular expressions denoting R and S, then(r|s), (rs) and (r∗) are also regular expressions denotingR ∪ S, R ◦ S and R∗ respectively.

Assume that ∗ has a higher priority than concatenation which inturn has a higher priority than |.a+ is an abbreviation for aa∗.

2 / 24


Beispiel: regulare Ausdrucke

ε|a denotiert {ε} ∪ {a} = {a, ε}.εa denotiert {ε} ◦ {a} = {a}.∅|a denotiert ∅ ∪ {a} = {a}.∅a denotiert ∅ ◦ {a} = ∅.cc∗ denotiert die Menge aller Worter uber {c}, derenLange ≥ 1 ist.

(cc)∗ denotiert die Menge aller Worter uber {c}, derenLange eine gerade Zahl ist.

3 / 24


Beispiel: regulare Ausdrucke

(a|ε)bcc∗ denotiert {bc, abc,bcc, abcc,bccc, abccc, . . . } ={anbcm | 0 ≤ n ≤ 1,m ≥ 1}((a|b)∗|de)∗ denotiert {w1 . . .wn |n ≥ 0, jedes der wi fur1 ≤ i ≤ n ist entweder aus {a,b}∗ oder hat die Form (de)k

fur irgendein k ≥ 0}⇒ die Sprache ist die Menge aller Worter uber den Alpha-bet {a,b,d, e}, in denen jedes d notwendig von einem egefolgt ist.Aquivalent: (a|b|de)∗

a(bd+b)∗a denotiert die Menge aller Worter der Form awa,wobei w = ε oder w = bw′b und es gilt, dass w′ ∈ {b,d}∗mit d anfangt und endet und dass in w′ jede Gruppe vonbs genau die Lange 2 hat.

4 / 24


The following holds:

Equivalence of regular expressions and FSA

For each language L: there is a regular expression x with L =L(x) iff there is a FSA M with L = L(M).

In order to show this, we

1 define NFAs with empty transitions and show their equiva-lence to NFAs;

2 show how to construct an equivalent NFA with emptytransitions for a given regular expression;

3 show how to construct an equivalent regular expression fora given DFA.

Zu 1. und 2. siehe Vorlesung “Mathematische Grundlagen derComputerlinguistik”.

5 / 24

DFAs and regular expressions (1)

For each DFA, there is an equivalent regular expression.

Beispiel: reg. Ausdruck fur FSA

q0

q1

q2

a

c

ac

b

Um von q0 zu q2 zu gelangen, kann man entweder

(1) einen Weg ohne Unterwegszustand q2 wahlen. Dies kannwiederum entweder ohne Durchlauf von q1 sein (c) oder einerstes Mal von q0 zu q1 (a), dann beliebig oft von q1 zu q1

ohne Durchlauf von q1 (a∗), dann von q1 nach q2 (c)⇒ (c|a+c), oder

(2) man geht ein erstes Mal von q0 zu q2 (c|a+c) und dannbeliebig oft von q2 zu q2 ohne Durchlauf von q2 ((ba∗c)∗).

⇒ (c|a+c)(ba∗c)∗

6 / 24


Algorithm for construction the regular expression for a givenDFA: Assume that {q1, . . . , qn} it the set of states.

Define Rki,j as the set of sequences of input symbols that

can be obtained from following all paths from qi to qj thatdo not pass through states ql with l > k.Define rki,j as the corresponding regular expression.

q1

q2

q3

a

d

c

a

c

r11,3 = ∅ r11,2 = a

r12,2 = a|da r12,3 = c

r21,3 = ∅|a(a|da)∗c = a(a|da)∗c r23,3 = ca(a|da)∗c

r01,3 = a(a|da)∗c(ca(a|da)∗c)∗

8 / 24


Question: What do the Rki,j look like and what are the

corresponding regular expressions rki,j?

k = 0: Only paths of length 0 or 1 can be considered, since,between qi and qj, no other states ql are possible.Each r0i,j has the form a1|a2| · · · |am or (if i = j) a1|a2| · · · |am|εor (if i 6= j and there is no edge from qi to qj) ∅.

9 / 24


In general, Rkij contains

all elements from Rk−1ij ,

all concatenations ofa) an element from Rk−1

ik ,

b) any number (including 0) of elements from Rk−1kk and

c) an element from Rk−1kj .

Consequently, rki,j = rk−1ij |rk−1ik (rk−1kk )∗rk−1kj .

Finally, the regular expression for the language accepted by theautomaton is rn1f1 |r

n1f2| · · · |rn1fm if F = {qf1 , . . . , qfm}.

10 / 24

Abschlusseigenschaften regularer Sprachen (1)

Die von regularen Ausdrucken denotierten Sprachen heißenregulare Sprachen.

Abschlusseigenschaften

1 Wenn L1 und L2 regulare Sprachen sind, dann

ist die Vereinigung von L1 und L2 (L1 ∪ L2) ebenfalls eineregulare Sprache.ist die Schnittmenge von L1 und L2 (L1 ∩ L2) ebenfalls eineregulare Sprache.ist die Konkatenation von L1 und L2 (L1 ◦ L2) ebenfalls eineregulare Sprache.

2 Das Komplement einer regularen Sprache ist eine regulareSprache.

3 Wenn L eine regulare Sprache, dann ist L∗ eine regulareSprache.

12 / 24


Fur Vereinigung, Konkatenation und L∗ kann man ganz einfachdie entsprechenden regularen Ausdrucke angeben:

Seien r1 und r2 regulare Ausdrucke mit L1 = L(r1) undL2 = L(r2). Dann gilt

L((r1)|(r2)) = L1 ∪ L2

L((r1)(r2)) = L1 ◦ L2

L((r1)∗) = L∗1

13 / 24


Fur die Komplementbildung nimmt man den DFA, der dieSprache erkennt, macht die Ubergangsfunktion vollstandig,indem man einen weiteren Zustand (trap state) hinzufugt, undwahlt als neue Endzustande alle Zustande, die imUrsprungsautomaten kein Endzustand sind.

Komplementbildung im DFA

trap state

q0

q1a

b

a

q2b

a,b

Komplementbildung:q0

q1

q2

a

bba,b

a

14 / 24


Schnittbildung kann (de Morgan) durch Vereinigung undKomplementbildung ausgedruckt werden:

L1 ∩ L2 = L1 ∩ L2 = L1 ∪ L2

Damit ist L1 ∩ L2 eine regulare Sprache, falls L1 und L2 regularsind.

15 / 24

Regular Grammars (1)

Another way to define a language is via a grammar. A grammarformalism defines the form of rules and combination operationsallowed in a grammar.

A context-free grammar (CFG) G is a tuple 〈N,T,P, S〉 with

N and T disjoint alphabets, the nonterminals and termi-nals,

S ∈ N the start symbol, and

P a set of productions of the form A→ α with A ∈ N, α ∈(N ∪ T)∗.

16 / 24


Let G = 〈N,T,P, S〉 be a CFG. The (string) language L(G) of G

is the set {w ∈ T∗ |S ∗⇒ w} where

for w,w′ ∈ (N ∪ T)∗: w ⇒ w′ iff there is a A→ α ∈ P andthere are v, u ∈ (N∪T)∗ such that w = vAu and w′ = vαu.∗⇒ is the reflexive transitive closure of ⇒:

w0⇒ w for all w ∈ (N ∪ T)∗, and

for all w,w′ ∈ (N ∪ T)∗: wn⇒ w′ iff there is a v such that

w⇒ v and vn−1⇒ w′.

for all w,w′ ∈ (N∪T)∗: w∗⇒ w′ iff there is a i ∈ N such that

wi⇒ w′.

A language is called a context-free language (CFL) iff it isgenerated by a CFG.

17 / 24


A CFG is called

right-linear iff for all productions A → β: β ∈ T∗ orβ = β′X with β′ ∈ T∗, X ∈ N.

left-linear iff for all productions A→ β: β ∈ T∗ or β = Xβ′

with β′ ∈ T∗, X ∈ N.

regular if it is left-linear or right-linear.

For each left-linear grammar there is an equivalent right-lineargrammar and vice versa.

18 / 24


Examples: regular grammars

A CFG G = 〈{S,A,B}, {a,b, c},P, S〉

1 with P = {S → abS | cA,A → bB,B → cbB | ε} is right-linear.It generates the language L((ab)∗cb(cb)∗).

2 with P = {S → Sa |Sb |Abc,A → B,B → Bcb | ε} isleft-linear.It generates the language L((cb)∗bc(a|b)∗).

3 with P = {S→ abSA | c,A→ b} is not regular.It generates the language {(ab)ncbn |n ≥ 0}.

19 / 24


Equivalence of regular grammars and regular expressions

L = L(G) for a regular grammar G iff L is a regular set (i.e.,can be described by a regular expression).

To show this, we show that

for each right-linear grammar, there exists an equivalentNFA; and

for each DFA, there is an equivalent right-linear grammar.

20 / 24


Example: Constructing an NFA with empty transition for aright-linear grammar

G = 〈{S,A}, {a, b, c},P,S〉 with P = {S→ abS | cA,A→ bA | c}

Corresponding NFA:

[S]

[abS]

[cA]

[bS]

[c] [ε]

[A] [bA]

ε

ε

εε

b

a

cb

c

21 / 24


Construction of equivalent NFA (with empty transitions) forgiven right-linear G:

Start state [S].

For all productions A→ β1β2 there is a state [β2].

Transitions simulate top-down left-to-right traversal ofderivation tree:

For every A→ β:[A] [β]

ε

For every state [aβ] with a ∈ T:[aβ] [β]

a

Final state is [ε].

22 / 24


Construction of a right-linear grammar for given DFA:

The states are the nonterminals.

The start state is the start symbol.

For each transitionQ Q′a

add a production Q → aQ′ and, if Q′ ∈ F, a productionQ→ a.

If the start state Q0 is a final state, then add Q0 → ε

Example: Constructing a right-linear grammar for a DFA

q0

q1

q2

a

c

a c

b

〈{q0, q1, q2}, {a, b, c},{q0 → aq1 | cq2 | c,q1 → aq1 | cq2 | c, q2 → bq1},

q0〉

23 / 24


For a left-linear grammar, we have to follow the transitions inthe opposite direction.

The nonterminals are Q∪{S} where S is a new start symbol.

S→ qf for all qf ∈ F.

For each transitionq q′a

we add q′ → qa.

q0 → ε where q0 is the DFA’s start state.

Example: Constructing a left-linear grammar for a DFA

q0

q1

q2

a

c

a c

b

〈{q0, q1, q2, S}, {a,b, c},{S→ q2, q0 → ε,q2 → q1c | q0b,q1 → q1a | q0a | q2b},

S〉

24 / 24

einführung in die computerlinguistik - reguläre ausdrücke ... · dfas and regular expressions...

Documents