automata theory cs 3313

Automata TheoryCS 3313

Chapter 3

Regular Expression and Regular Languages

Regular language - one that can be generated from the null string and the individual symbols in the alphabets, using a finite number of applications of certain standard operations.

A Regular Expression is a notational device for representing the symbols and the operations used in this construction, and so our first characterization of regular language is that a language is regular if it can be described by a regular expression.

An abstract recognition device is called Finite Automaton, whose memory is limited to its being able to distinguished among the fixed finite number of states.

Three basic operations for constructing new language from existing ones are UNION, CONCATENATION, and the CLOSURE operation *.

Simple possible languages containing a single string that is either the null string or the length 1, is the language that we obtain by using a combination of these operations are the "regular languages".

It is common to simplify the formula slightly, by leaving out the set brackets { , } and U by +; the result is called a "regular expression".

For example:

Language Regular Expression representing the language

{}

{0} 0

{001} (i.e., {0} {0}{1}) 001

{0,1} (i.e., {0} U {1}) 0 + 1

{0,10} (i.e., {0} U {10}) 0 + 10

{1, } {001} (1 + ) 001

{110}* {0, 1} (110)* (0 + 1)

{0, 10)* ({11}* U {001, })* (0 + 10)* ((11)* + (001 +))*

{1}* {10} 1* 10

{10, 111, 11010}* (10 + 111 + 11010)*

We interpret a regular expression as representing the "most typical" string in the language to which the expression corresponds.

For example, 1* 10 stands for "any string consisting of the sub-string 10 preceded by arbitrarily many 1's.

For example in {1}* {10}. We could obtain the language in three steps

1) apply concatenation to {1} and {0}, yielding {10};

2) apply * to {1}, yielding {1}*; and

3) apply concatenation to {1}* and {10}, yielding {1}* {10}

It is exactly this type of definition, involving a sequence of steps in which operations are applied to objects obtained earlier in the sequence, that can be simplified by using recursion.

The recursive definition below, therefore, defines two things at once, regular expressions and the corresponding languages.

Definition:

regular expression over the alphabet , and the corresponding language are defined as follows:

1) is a regular expression, corresponding to the empty language .

2) is a regular expression, corresponding to the language {}

3) for each symbol a , a is a regular expression corresponding to the language {a}.

4) For any regular expression r and s over , corresponding to the languages Lr and Ls, respectively, each of the following is a regular expression over , corresponding to the language indicated.

(r s), corresponding to Lr Ls

(r + s), corresponding to Lr U Ls

(r*), corresponding to L*r

5) Only those things that can be produced using parts 1-4 are regular expressions over .

* A language over the alphabet is a regular language if

there is some regular expression over corresponding to it.

* A few comments:

First, the regular expression with which we begin the definition is , corresponding to the empty language.

* From the earlier examples, we can see why it is helpful to be able to use in building up more complicated regular expression, and it is even more obvious that the single symbols of the alphabet are crucial regular expressions, but it is not obvious that will be helpful in constructing regular expression.

In fact, we seldom use it in larger expression; we don't expect to see regular expression of the form:

((((0 + 1)* ) ) (( + (01)* )) )

Although according to our definition they are legal.

This one, in fact, corresponding to the empty language.

The main reason for including is consistency.

Second, since regular expressions corresponds to languages, and since we have enlarged our set of notion in the case of languages to include expressions like L2, it seems reasonable to make an unofficial addition to he language of regular expressions: if r is a regular expression, we will sometimes write ( r2 ) to stand for the regular expression ( r r ), (r3 ) to stand for ((r r) r), and so forth; and (r+) will stand for the regular expression ((r* ) r).

Third, if the definition is compared with some of the example, it will appear that they are not legitimate regular expressions after all.

For example, according to the definition, ((00)1) is a regular expression corresponding to {001}, and (0 (01)) is another, but clearly the definition doesn't allow 001.

Unfortunately, we can't just leave out the parentheses in part-4 of the definition; if we did it would follow, for example,

that 0 + 1 corresponding to {0, 1}, and 0 + 11 corresponds to {0, 1} {1}.

But it would seem equally unfortunate not to allow 011 as a regular expression.

We can resolve the difficulty the same way we do in ordinary algebraic notation.

* The arithmetic expression a + b * c is interpreted as a + (b * c), because we say that * has higher precedence than +.

* In this way, we can often dispense with parentheses.

* In a similar way, a + b + c is interpreted as (a + b) + c, because among operations of equal precedence, those to the left are performed first.

* Of course (a + b) + c and a + (b + c) have the same value if a, b, and c are numbers; but we are speaking of how the expression is defined, and our convention gives us an unambiguous definition.

* Let us stipulate at this point that in regular expressions the operations * has higher precedence, concatenation next, and + lowest.

* Secondly, we specify that operations associated to the left - e.g., that a + b + c is to be interpreted as (a + b) + c.

* For example, (001)* stands for (((00) 1)* ), and 1*10 + 11* stands for ((((1)* 1) 0) + (1 (1* )).

* Some important expressions equal to :

1* (1 + ) = 1*

1* 1* = 1*

0* + 1* = 1* + 0*

(0* 1*)* = (0 + 1)*

(0 + 1)* 01(0 + 1)* + 1* 0* = (0 + 1)*

Examples and Applications:

1) Let L be the language of all strings of 0s and 1s that have even length, (Since 0 is even, L contains ). Is L regular,

and if so, what is a regular expression corresponding to it?

- We can answer this by realizing that if a string has even length, it can be thought of as consisting of a number,

possibly zero, of string of length 2 concatenated.

- And, conversely, any such concatenation has even length.

- Since we can easily enumerate the strings of length 2, we may write the answer:

(00 + 01 + 10 + 11)*

2) Let L be the language of all string of 0's and 1s that have odd length. We can use the previous example: odd length means in particular length at least one, and so we may view L as the language of all strings consisting of single symbol followed by an even-length string. Since we have a regular expression for even-length strings, and we can easily find one for strings of length 1, a regular expression for L is

- (0 + 1) (00 + 01 + 10 + 11)*

- one may ask why we couldn't have described the language in this example as the set of string consisting of an even-length string followed by a single symbol, which would have led to

- (00 + 01 + 10 + 11)* (0 + 1)

- The answer is that we could, and this brings up a point that is worth commenting on.

- For any regular expression, there are many others that correspond to the same language.

- Since a little earlier we mentioned that problem of simplifying a regular expression, it might seems as though there is a single regular expression that is the simplest, or one that most concisely expresses the language.

- This is not necessarily the case, and in fact the regular expression we choose may depend on the aspect of the structure that we wish to emphasize.

- In this example, the two regular expressions we wrote are equally simple; the first is more appropriate if we want to call attention to the first symbol of the string, the second if we want to emphasize the last symbol of the string.

- If for some reason we wanted to emphasized the third symbol of the string, it might very well be better to use the more complicated regular expression:

- 0 + 1 + (00 + 01 + 10 + 11) (0 + 1) (00 + 01 + 10 + 11)*

- The first two terms are necessary to take care of the length-1 case.

3) Let L be the language of all strings of 0s and 1s containing at least one 1. Here are three regular expressions corresponding to L:

- 0* 1 (0 + 1)*

- (0 + 1)* 1 (0 + 1)*

- (0 + 1)* 10*

- The first expresses the fact that a string in L can be decomposed as follows: an arbitrary number of 0's (possibly none), the first 1, and then any arbitrary string.

- The second, which is some sense is the most general, or the closest to our definition of L, expresses the fact that a string in L has a 1, both preceded and followed by an arbitrary string.

- the third is similar to the first, but emphasized the last 1 in string in L.

4) Let L be the set of all strings of length less than or equal to 6. there is no doubt that L is regular: a simple, but unpleasant, regular expression corresponding to it is

- Λ + 0 + 1 + 00 + 01 + 10 + 11 + 000 + . . . + 111110 + 111111

- The only question is whether or not we can find a regular expression that we can feasibly write in its entirety,

without using " …" to abbreviate.

- There is one for the set of strings of length exactly 6:

- (0 + 1) (0 + 1) (0 + 1) (0 + 1) (0 + 1) (0 + 1)

- Or, in our extended notation, (0 + 1)6.

- If we are careful, we can modify this so that it will suffice.

- We may view a string of length 6 or less as being in fact a concatenation of six things, some or all of which may be Λ

as a third choice in each of the six factors, we obtain (0 + 1 + Λ)6

5) L = {x ε {0, 1}* | x ends with 1 and does not contain the sub-string 00}

- This mean that every string in L corresponds to the regular expression R = (1 + 01)*.

- This extra constraint simply means that Λ can't be included, and that L corresponds to the regular expression.

- (1 + 01)+ = (1 + 01)* (1 + 01)

6) For an example from real life, with a flavor that's a little different, let us enlarge the alphabets.

- Take Σ = {P, M, D, E, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

- Here P should be thought as "+", M as "-", D as "." .

- The letters are used to avoid typographical confusion.

- E, though, really means the symbol E.

- Let the example simpler to write, let S, which stands for "sign", be the regular expression Λ + P + M.

- Let DIG ("digit") be the regular expression 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9

- Consider the regular expression.

- S DIG+ (D DIG+ + D DIG+ E S DIG+ + E S DIG+)

- What language does this corresponds to?

- Of course one answer is:

- {S} {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}+ ({D} …)

Theorem-1:

* The class of regular languages is closed under the operation of union concatenation, and *.

* Proof:

* The theorem says simply that if L1 and L2 are regular

languages, then so are L1 U L2, L1L2, and L1*.

* This is immediate from the definition, since if R1 and R2 are regular expressions corresponding to L1 and L2 respectively, then (R1 + R2), (R1R2), and (R1)* are regular expressions corresponding to the three languages

Theorem-2:

* Every finite language is regular.

* Proof:

* Assignment: Use the Mathematical Induction to prove it.

automata theory cs 3313

Documents