Transcript — source: e-studytoccomp.weebly.com
Context Free Grammars
UNIT –III
By
Prof.T.H.Gurav
Smt.Kashibai Navale COE, Pune
Context-Free Grammar
Definition. A context-free grammar is a 4-tuple:
G = (V, T, P, S) or G = (V, Σ, P, S), where
V = non-terminals (variables), a finite set
T = terminals (the alphabet), a finite set
P = productions, a finite set
S = start variable, S ∈ V
Productions have the form
A → α
where A ∈ V and α ∈ (V ∪ T)*, i.e., each rule rewrites one non-terminal to a string of non-terminals and terminals.
String generation by CFG
• Generate strings by repeated replacement of non-
terminals with string of terminals and non-
terminals.
1. write down start variable (non-terminal)
2. replace a non-terminal with the right-hand-
side of a rule that has that non-terminal as its
left-hand-side.
3. repeat above until no more non-terminals
Context-Free Languages
• Definition. Given a context-free grammar G = (V, T, P, S), the language generated (derived) from G is the set:
L(G) = {w ∈ T* | S ⇒* w}
• All intermediate strings arising from the start symbol S during the derivation are called sentential forms.
• Definition. A language L is context-free if there is a context-free grammar G = (V, T, P, S) such that L = L(G).
Types of derivations
• There are two ways to derive the string from the grammar
1. Leftmost derivation : When at each step of derivation a production is applied to the leftmost NT, then the derivation is said to be leftmost.
2. Rightmost derivation: When at each step of derivation a production is applied to the rightmost NT, then the derivation is said to be rightmost.
• Consider the following grammar
• S → A | AB
• A → ε | a | Ab | AA
• B → b | bc | Bc | bB
Sample derivations:
S ⇒ AB ⇒ AAB ⇒ aAB ⇒ aaB ⇒ aabB ⇒ aabb
S ⇒ AB ⇒ AbB ⇒ Abb ⇒ AAbb ⇒ Aabb ⇒ aabb
These two derivations use the same productions, but in different orders.
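The replacement loop can be sketched in a few lines of Python (my own illustration, not from the slides); the `steps` list replays the first sample derivation of aabb, applying each production to the leftmost non-terminal.

```python
def apply_leftmost(sentential, nonterminal, rhs):
    """Replace the leftmost occurrence of `nonterminal` with `rhs`."""
    i = sentential.index(nonterminal)
    return sentential[:i] + rhs + sentential[i + 1:]

# Each step names the production applied to the leftmost non-terminal.
steps = [("S", "AB"), ("A", "AA"), ("A", "a"), ("A", "a"), ("B", "bB"), ("B", "b")]
form = "S"
for head, body in steps:
    form = apply_leftmost(form, head, body)
    print(form)  # AB, AAB, aAB, aaB, aabB, aabb
```

The final sentential form contains only terminals, so the derivation is complete.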
Parse Trees
• The pictorial representation of the derivations in
the form of a tree is very useful. This tree is called
parse tree OR derivation Tree.
Root label = start variable.
Each interior label = variable.
Each parent/child relation = derivation step.
Each leaf label = terminal or ε.
All leaf labels together = derived string = yield.
[Figure: parse tree for aabb — S has children A and B; A has children A, A (each deriving a); B has children b and B (deriving b); yield = aabb.]
Yield of a parse Tree
• If we look at the leaves of any parse tree and
concatenate them from left to right we get a string
called yield of parse tree.
Derivation Trees/parse trees
[Figures: several parse trees for the grammar below; infinitely many others are possible.]
S → A | AB
A → ε | a | Ab | AA
B → b | bc | Bc | bB
w = aabb. Other derivation trees for this string?
CFGs & CFLs: Example 1
{a^n b^n | n ≥ 0}
It is non-regular, as already proved by the pumping lemma.
It can be represented by the CFG
G = ({S}, {a,b}, {S → ε, S → aSb}, S)
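As a quick sanity check (my own sketch, not part of the slides), applying S → aSb exactly n times and then finishing with S → ε generates precisely the strings a^n b^n:

```python
def derive(n):
    """Apply S -> aSb exactly n times, then S -> e."""
    s = "S"
    for _ in range(n):
        s = s.replace("S", "aSb")   # there is always exactly one S
    return s.replace("S", "")       # final step: S -> e

print([derive(n) for n in range(4)])  # ['', 'ab', 'aabb', 'aaabbb']
```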
Example 2
Construct a CFG for the language L of all strings which are palindromes over Σ = {a,b}.
Example: madam is a palindrome.
Gpal = ({S}, {a,b}, A, S),
where A = {S → ε,
S → a,
S → b,
S → aSa,
S → bSb}
Sometimes we group productions with the same head, e.g.
S → ε | a | b | aSa | bSb.
Example
• The string abaaba can be derived as
S (start symbol)
⇒ aSa (rule S → aSa)
⇒ abSba (rule S → bSb)
⇒ abaSaba (rule S → aSa)
⇒ abaεaba = abaaba (rule S → ε)
abaaba is a palindrome.
Ambiguity: Definition
• A CFG is ambiguous if there is a string in
the language that is the yield of two or more
parse trees.
• A CFG is ambiguous if there is a terminal
string that has multiple leftmost derivations
from the start variable. Equivalently:
multiple rightmost derivations
Example
Let G = ({E}, {a,b,-,/}, P, E)
P = {E → E-E | E/E | a | b}, E is the start symbol.
Solution: Consider the derivation of the string a-b/a.
Derivation 1: E ⇒ E-E ⇒ a-E ⇒ a-E/E ⇒ a-b/E ⇒ a-b/a
Derivation 2: E ⇒ E/E ⇒ E-E/E ⇒ a-E/E ⇒ a-b/E ⇒ a-b/a
Parse trees
[Figures: two parse trees for a-b/a — one with '-' at the root, grouping a-(b/a); one with '/' at the root, grouping (a-b)/a.]
Reasons
• The relative precedence of subtraction and
division is not uniquely defined.
• The two groupings correspond to
expressions with different values.
• It doesn't capture associativity!
Unambiguous G
• Try for a-b/a Now!!
E → E - T | T
T → T / F | F
F → (E) | I
I → a | b
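A small sketch (mine, not in the slides) of why this grammar is unambiguous: a recursive-descent evaluator follows the rules directly, with the left-recursive E and T handled by loops, so "-" and "/" are forced to be left-associative and "/" binds tighter. The values a=8, b=2 are hypothetical, chosen only to make the grouping visible.

```python
def parse(expr, env={"a": 8.0, "b": 2.0}):
    tokens = list(expr)
    pos = [0]

    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None

    def factor():          # F -> (E) | I,  I -> a | b
        if peek() == "(":
            pos[0] += 1
            v = expression()
            pos[0] += 1    # consume ")"
            return v
        v = env[peek()]
        pos[0] += 1
        return v

    def term():            # T -> T / F | F   ("/" is left-associative)
        v = factor()
        while peek() == "/":
            pos[0] += 1
            v /= factor()
        return v

    def expression():      # E -> E - T | T   ("-" is left-associative)
        v = term()
        while peek() == "-":
            pos[0] += 1
            v -= term()
        return v

    return expression()

# a-b/a groups as a - (b/a): 8 - 2/8 = 7.75, not (8-2)/8 = 0.75
print(parse("a-b/a"))    # 7.75
print(parse("(a-b)/a"))  # 0.75
```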
CFG Simplification
A grammar may contain extra symbols which unnecessarily
increase its length, so simplification is needed:
1. Eliminate ambiguity.
2. Eliminate "useless" variables.
3. Eliminate ε-productions: A → ε.
4. Eliminate unit productions: A → B.
5. Eliminate redundant productions.
Eliminate “useless” variables.
• A variable is useful if it occurs in a derivation that begins
with the start symbol and generates a terminal string.
• Two types of symbols (NT or T) are useless:
– Non-generating symbols: symbols not generating any terminal
string.
– Non-reachable symbols: symbols that cannot be reached from the start symbol.
We use the dependency-graph method to decide which NTs are not reachable.
Example: S → aA, B → A.
[Dependency graph: edge S → A from S → aA; edge B → A from B → A.]
Here A is reachable and B is not reachable from S.
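Both checks are easy to mechanize. A sketch (my own; I add the rule A → a so that A also generates a terminal string, which the slide's fragment leaves implicit):

```python
nonterminals = {"S", "A", "B"}
productions = {"S": ["aA"], "B": ["A"], "A": ["a"]}

# 1. Generating symbols: fixpoint of "every symbol on the RHS generates".
generating = set()
changed = True
while changed:
    changed = False
    for head, bodies in productions.items():
        for body in bodies:
            if all(c in generating or c not in nonterminals for c in body):
                if head not in generating:
                    generating.add(head)
                    changed = True

# 2. Reachable symbols: walk the dependency graph from S.
reachable, stack = {"S"}, ["S"]
while stack:
    for body in productions.get(stack.pop(), []):
        for c in body:
            if c in nonterminals and c not in reachable:
                reachable.add(c)
                stack.append(c)

print(sorted(generating))  # ['A', 'B', 'S']
print(sorted(reachable))   # ['A', 'S']  -- B is useless: unreachable from S
```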
Eliminate ε-productions: A → ε
• A CFG may have productions of the form A → ε. Such a
production is used to erase A and is called a null production.
• While eliminating ε-rules from a grammar, the meaning of the
CFG should not change.
Example: G = {S → 0S | 1S | ε}; construct G' generating L(G) − {ε}.
Solution: For each occurrence of S on a right-hand side, also add
the rule with S erased (replacing S → ε), giving the new rules
S → 0 and S → 1.
Therefore G' = {S → 0S | 1S | 0 | 1}.
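The general step can be sketched as follows (my own code, not from the slides): for every subset of nullable occurrences on a right-hand side, also emit the rule with those occurrences dropped.

```python
from itertools import combinations

def eliminate_epsilon(productions, nullable):
    """Drop every subset of nullable-NT occurrences from each RHS."""
    new = {}
    for head, bodies in productions.items():
        out = set()
        for body in bodies:
            spots = [i for i, c in enumerate(body) if c in nullable]
            for r in range(len(spots) + 1):
                for drop in combinations(spots, r):
                    cand = "".join(c for i, c in enumerate(body) if i not in drop)
                    if cand and cand != "e":   # discard the empty/ε rule
                        out.add(cand)
        new[head] = sorted(out)
    return new

# The slide's grammar, with ε written as "e": S -> 0S | 1S | e
g = eliminate_epsilon({"S": ["0S", "1S", "e"]}, nullable={"S"})
print(g)  # {'S': ['0', '0S', '1', '1S']}
```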
Eliminate unit productions: A → B
• Unit productions are productions in which one NT yields
exactly one other NT, e.g. A → B or X → Y.
Steps:
1. Select a unit production A → B such that there exists a
production B → X1X2X3…Xn.
2. While removing A → B, add A → X1X2X3…Xn to the grammar.
3. Eliminate A → B from the grammar.
Example
G = {S → 0A | 1B | C
A → 0S | 00
B → 1 | A
C → 01}
Solution: the unit productions are S → C and B → A.
We have C → 01, so S → 0A | 1B | 01.
We have A → 0S | 00, so B → 0S | 00 | 1.
Thus G' = {S → 0A | 1B | 01
A → 0S | 00
B → 0S | 00 | 1
C → 01}
(C is now unreachable, so the useless-symbol step could drop it.)
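A sketch (mine) of the unit-production step applied to this grammar. One substitution level suffices here; in general the step is repeated until no unit productions remain.

```python
def eliminate_units(productions):
    """Replace each unit production A -> B by B's non-unit bodies."""
    result = {}
    for head, bodies in productions.items():
        keep, units = [], []
        for body in bodies:
            # a body is a unit production iff it is a single NT (a dict key)
            (units if body in productions else keep).append(body)
        for b in units:
            keep += [x for x in productions[b] if x not in productions]
        result[head] = keep
    return result

g = {"S": ["0A", "1B", "C"], "A": ["0S", "00"], "B": ["1", "A"], "C": ["01"]}
print(eliminate_units(g))
# {'S': ['0A', '1B', '01'], 'A': ['0S', '00'], 'B': ['1', '0S', '00'], 'C': ['01']}
```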
Two normal forms:
1. Chomsky Normal Form (CNF)
2. Greibach Normal Form (GNF)
Chomsky Normal Form
• A grammar is in CNF if all rules are of the form
NT → NT NT
NT → T
In CNF we restrict the length of the RHS
and the nature of the symbols in the RHS of rules.
Greibach Normal Form
• A CFG is in Greibach Normal Form if each rule is of the form
NT → one terminal followed by any number of NTs
Example
S → aA is in GNF
S → a is in GNF
But S → AA or S → Aa is not in GNF.
Rules: 1. Substitution Rule
Let G = (V,T,P,S) be a given grammar with productions
A → Bα and
B → β1 | β2 | β3 | … | βn;
then we can convert the A-rule toward GNF as
A → β1α | β2α | β3α | … | βnα
Example: let S → Aa and A → aA | bA | aAS | b.
Applying rule 1:
S → aAa | bAa | aASa | ba
A → aA | bA | aAS | b
2. Left Recursion Rule
Let G = (V,T,P,S) be a given grammar with a left-recursive
production set
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
such that the βi do not start with A. Then an equivalent grammar
(with the left recursion removed) is:
A → β1 | β2 | … | βn
A → β1Z | β2Z | … | βnZ
Z → α1 | α2 | … | αm
Z → α1Z | α2Z | … | αmZ
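The rule mechanized as a short Python sketch (my own, single-character symbols assumed for simplicity):

```python
def remove_left_recursion(head, bodies, new_nt="Z"):
    """Split A's bodies into A-prefixed alphas and betas; introduce Z."""
    alphas = [b[1:] for b in bodies if b.startswith(head)]
    betas = [b for b in bodies if not b.startswith(head)]
    if not alphas:
        return {head: bodies}          # nothing to do
    return {
        head: betas + [b + new_nt for b in betas],
        new_nt: alphas + [a + new_nt for a in alphas],
    }

# Example: A -> Aa | b  becomes  A -> b | bZ,  Z -> a | aZ
print(remove_left_recursion("A", ["Aa", "b"]))
# {'A': ['b', 'bZ'], 'Z': ['a', 'aZ']}
```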
Left linear grammar and right linear grammar
1. If an NT appears only as the rightmost symbol in each production
of a CFG, then it is called a right-linear grammar.
2. If an NT appears only as the leftmost symbol in each production,
then it is called a left-linear grammar.
• Linear grammars (either left or right) generate exactly
the regular languages, and are also called regular
grammars (which means that all regular languages are also CF).
Regular grammars
Right Linear Grammars:
Rules of the forms
A → ε
A → a
A → aB
A,B: variables(NT) and
a: terminal
Left Linear Grammars:
Rules of the forms
A → ε
A → a
A → Ba
A,B: variables (NT) &
a: terminal
RLG to FA
Grammar G is right-linear.
Example:
S → aA | B
A → aaB
B → bB | a
Steps
Given a grammar G, the corresponding FA M is obtained as follows:
1. The initial state of the FA is the start NT of G.
2. Each production in G corresponds to a transition in M.
3. The transitions in M are defined as:
1. Each production A → aB gives a transition from state A
to state B on input symbol a.
2. Each production A → a gives a transition from state A to
qf (the final state of the FA) on input symbol a.
Example
Construct an NFA M for the grammar:
S → aA | B
A → aaB
B → a | bB
1. Every state is a grammar variable, plus a special final state qf.
2. Add edges for each production:
(a) S → aA gives S --a--> A; S → B gives S --ε--> B
(b) A → aaB gives A --aa--> B (via an intermediate state in a strict NFA)
(c) B → a gives B --a--> qf; B → bB gives the loop B --b--> B
L(G) = L(M)
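The construction is mechanical. A sketch (mine) for the strictly-linear case, where each rule is A → aB, A → a, or a unit production A → B (rules with longer terminal prefixes, like A → aaB, would first be split via intermediate non-terminals):

```python
def rlg_to_nfa(productions, final="qf"):
    """Return NFA edges (src, symbol, dst); 'e' marks an epsilon edge."""
    edges = set()
    for head, bodies in productions.items():
        for body in bodies:
            if len(body) == 2:              # A -> aB
                edges.add((head, body[0], body[1]))
            elif body in productions:       # A -> B (unit production)
                edges.add((head, "e", body))
            else:                           # A -> a
                edges.add((head, body, final))
    return edges

g = {"S": ["aA", "B"], "A": ["aB"], "B": ["a", "bB"]}
for edge in sorted(rlg_to_nfa(g)):
    print(edge)
# ('A', 'a', 'B') ('B', 'a', 'qf') ('B', 'b', 'B') ('S', 'a', 'A') ('S', 'e', 'B')
```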
FA to RLG
• Steps:
1. The start state of the FA becomes the start symbol of G.
2. Create the set of productions as follows:
a. If q0 (the initial state of the FA) ∈ F, then add a production S → ε to P.
b. For every transition B --a--> C, add the production B → aC.
c. If C is a final state, additionally add the production B → a.
FA to RLG (example)
Convert the following FA to an RLG.
[Figure: states q0, q1, q2, q3; transitions q0 --a--> q1, q1 --b--> q1,
q1 --a--> q2, q2 --b--> q3, q3 --ε--> q1; q3 is final.]
q0 → aq1
q1 → bq1 | aq2
q2 → bq3 | b
q3 → q1
L(G) = L(M)
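The reverse direction is equally mechanical. A sketch (mine), run on a small FA chosen for illustration (ε-transitions such as q3 --ε--> q1 are omitted; they would simply become unit productions):

```python
def fa_to_rlg(transitions, finals, q0="q0"):
    """Every transition B --a--> C yields B -> aC, plus B -> a if C is final."""
    productions = {}
    for (src, sym, dst) in transitions:
        productions.setdefault(src, []).append(sym + dst)
        if dst in finals:
            productions[src].append(sym)
    if q0 in finals:                       # step 2a: add S -> e
        productions.setdefault(q0, []).append("e")
    return productions

t = [("q0", "a", "q1"), ("q1", "b", "q1"), ("q1", "a", "q2"), ("q2", "b", "q3")]
print(fa_to_rlg(t, finals={"q3"}))
# {'q0': ['aq1'], 'q1': ['bq1', 'aq2'], 'q2': ['bq3', 'b']}
```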
Conversion from RLG to LLG and vice versa
Steps (from RLG to LLG):
1. Represent the RLG using a transition graph (FA).
2. Interchange the start state and the final state.
3. Reverse the directions of all transitions, keeping the labels
and the states unchanged.
4. Write the left-linear grammar from the changed transition graph.
[Flow: Right-linear G → Transition graph → Left-linear G]
Properties of CFL
1. The union and the concatenation of two context-free languages are context-free, but the intersection need not be.
2. The reverse of a context-free language is context-free, but the complement need not be.
3. Every regular language is context-free because it can be described by a regular grammar.
4. The intersection of a context-free language and a regular language is always context-free.
5. There exist context-sensitive languages which are not context-free.
6. To prove that a given language is not context-free, one may employ the pumping lemma for context-free languages
Pumping lemma for CFL
• Let G be a CFG. Then there exists a constant n such that any
string w ∈ L(G) with |w| ≥ n can be rewritten as w = uvxyz,
subject to the following conditions:
1. |vxy| ≤ n: the middle portion is not longer than n.
2. |vy| ≥ 1: the strings v and y to be pumped are not both empty.
3. For all i ≥ 0, u v^i x y^i z ∈ L: the two strings v and y can
be pumped zero or more times.
[Figure: w split left to right as u, v, x, y, z.]
Example 1
• L = {a^n b^n c^n | n ≥ 0}
• Assume L is a CFL with pumping-lemma constant p.
• Choose w = a^p b^p c^p ∈ L.
• Applying the PL, w = uvxyz, where |vy| ≥ 1 and |vxy| ≤ p, such that u v^i x y^i z ∈ L for all i ≥ 0.
• Because |vxy| ≤ p, vxy can touch at most two adjacent blocks. Two possible cases:
– vxy lies within the a's and b's: uv²xy²z has more a's and/or b's than c's, so it is not in L.
– vxy lies within the b's and c's: uv²xy²z has more b's and/or c's than a's, so it is not in L.
• Contradiction; L is not a CFL.
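The case analysis can be backed up by brute force (my own sketch, with p = 4 chosen small enough to enumerate): no split w = uvxyz with |vxy| ≤ p and |vy| ≥ 1 survives pumping with i = 2.

```python
def in_L(s):
    """Membership in {a^n b^n c^n}: compare against the canonical string."""
    k = s.count("a")
    return s == "a" * k + "b" * k + "c" * k

p = 4
w = "a" * p + "b" * p + "c" * p
survivors = []
for i in range(len(w)):                              # vxy starts at i
    for j in range(i, min(i + p, len(w)) + 1):       # |vxy| = j - i <= p
        for a in range(i, j + 1):                    # v = w[i:a]
            for b in range(a, j + 1):                # x = w[a:b], y = w[b:j]
                u, v, x, y, z = w[:i], w[i:a], w[a:b], w[b:j], w[j:]
                if (v + y) and in_L(u + v * 2 + x + y * 2 + z):
                    survivors.append((u, v, x, y, z))
print(len(survivors))  # 0 -- every admissible split fails, as the proof claims
```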
Grammar types
• There are 4 types of grammars according to the types of
rules:
• Each grammar type generates a family of languages.
– General grammars → RE languages
– Context Sensitive grammars → CS languages
– Context Free grammars → CF languages
– Linear grammars → Regular languages
Chomsky Hierarchy
• Comprises four types of languages and their associated
grammars and machines.
• Type 3: Regular Languages
• Type 2: Context-Free Languages
• Type 1: Context-Sensitive Languages
• Type 0: Recursively Enumerable Languages
• These languages form a strict hierarchy
1. Type 3: rules of the forms
A → ε
A → a | aB (right-linear), or
A → Ba (left-linear)
2. Type 2:
A → α
where A ∈ V and α ∈ (V ∪ T)*
3. Type 1:
αAβ → αXβ
where α, β, and X are strings of NTs and/or Ts,
X is not null (so |αXβ| ≥ |αAβ|), and A is an NT.
Language               | Grammar                                       | Machine                                                      | Example
Regular language       | Regular grammar (right-linear or left-linear) | Deterministic or nondeterministic finite-state acceptor (FA) | a*
Context-free language  | Context-free grammar                          | Pushdown automaton (PDA)                                     | a^n b^n
Context-sensitive      | Context-sensitive grammar                     | Linear-bounded automaton (LBA)                               | a^n b^n c^n
Recursively enumerable | Unrestricted grammar                          | Turing machine (TM)                                          | any computable function
Graph grammars
• Graph grammars were invented in order
to generalize (Chomsky) string grammars.
Graph grammars: definition
• A graph grammar is a pair:
GG = (G0,P)
G0 is called the starting graph, and P is a set of production rules.
L(GG) is the set of graphs that can be derived by starting with G0 and applying the rules in P.
Continue..
• A set of production rules are used to replace
one subgraph by another.
• The process of replacing depends upon the
embedding: edges to/from the old subgraph
must be transformed into edges to/from the
new subgraph.
Types of GG
• Often, on a high level, two kinds of graph grammars are distinguished:
– Hyperedge replacement grammars: a rewrite rule replaces a (hyper)edge by a new graph.
– Node replacement grammars: a rewrite rule replaces a node by a new graph.
Node replacement grammars
• Node replacement grammars have rules of the form:
N → G / E
(node label → labeled graph / embedding rules)
Replace any node with label N by G, connecting
G to N's neighborhood according to the embedding rules listed in E.
Embedding rules are based on node labels.
Example NR-GG rule
N → G / {(a,b), (b,c)}
[Figures: a host graph contains a node labeled N among nodes labeled a, b, c.
The rule replaces N by a graph G with nodes labeled a, b, b, c, c; the
embedding pairs (a,b) and (b,c) specify, by node label, how edges to N's
former neighbours are re-established.]
Production Rules
• The following two approaches are used to describe
the production rules in a GG:
1. Algebraic (using the gluing construction)
2. Set-theoretic (uses expressions to describe
the embedding)
Applications
• Picture processing: a picture can be
represented as a graph, where labelled
nodes represent primitives and labelled
edges represent geometric relations
(such as "is right of", "is below").
• Diagram recognition.
Recursively Enumerable Languages
• A TM accepts a string w if the TM halts in a final state. A
TM rejects a string w if the TM halts in a non final state or
the TM never halts.
• A language L is recursively enumerable if some TM
accepts it. Hence such languages are also called
Turing-acceptable.
• Recursively enumerable languages are also called
recognizable.
[Figure: a Turing machine for L reads an input string and may halt in q_accept or q_reject.]
For a Turing-acceptable language L, it is possible that for some input string the machine enters an infinite loop.
Recursive Language
• Recursive Language : A language L is recursive if some TM
accepts it and halts on every input.
• Recursive languages are also called Decidable Languages
because a Turing Machine can decide membership in those
languages (it can either accept or reject a string).
[Figure: a decider for L reads an input string, always halts, and announces the decision: Accept (q_accept) or Reject (q_reject).]
For a decidable language L, on each input string the computation halts in the accept or reject state.
Undecidable Languages
• An undecidable language is a language that is not decidable:
there is no Turing machine which accepts the
language and makes a decision (halts) for every
input string.
• Note: the machine may still make a decision for some
input strings.
• For an undecidable language, the corresponding
problem is undecidable (unsolvable).
Applications of RE and CFG in
compilers
[Flow: Programming Language (Source) → Compiler → Machine Language (Target)]
The Structure of a Compiler
1. RE and FA: are usually used to classify the basic
symbols (e.g. identifiers, constants, keywords) of a
language.
2. Context-free grammars:
1. describe the structure of a program;
2. are used to match nested constructs: brackets (),
begin...end, if...then...else.
Lexical Analysis/ Scanning
Converts a stream of characters (input program) into a
stream of tokens.
Terminology
Token: Name given to a family of words.
e.g., integer constant
Lexeme: Actual sequence of characters representing a
word. e.g., 32894
Pattern: Notation used to identify the set of lexemes
represented by a token. e.g., [0-9]+
Some more examples
Token            | Sample Lexemes  | Pattern
while            | while           | while
integer constant | 32894, -1093, 0 | [0-9]+
identifier       | buffer size     | [a-zA-Z]+
Patterns
How do we compactly represent the set of all lexemes
corresponding to a token?
For instance: The token integer constant represents
the set of all integers: that is, all sequences of digits (0–9),
preceded by an optional sign (+ or −).
Obviously, we cannot simply enumerate all lexemes.
Use Regular Expressions.
Regular Definitions
Assign “names” to regular expressions.
For example,
digit → 0 | 1 | ··· | 9
natural → digit digit∗
Shorthands:
a+: set of strings with one or more occurrences of a.
a*: set of strings with zero or more occurrences of a.
Example:
integer → (+ | − | ε) digit+
Regular Definitions and Lexical Analysis
Regular Expressions and Definitions specify sets of strings
over an input alphabet.
They can hence be used to specify the set of lexemes
associated with a token.
That is, regular expressions and definitions can be used as
the pattern language
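This is exactly how many hand-written lexers work. A sketch (mine, not from the slides) using Python's `re` module, pairing each token name from the table above with its pattern:

```python
import re

# Each token name is paired with a regular expression, as in the table.
token_spec = [
    ("WHILE", r"while"),
    ("INTCONST", r"[0-9]+"),
    ("ID", r"[a-zA-Z]+"),
    ("SKIP", r"\s+"),       # whitespace, discarded
]
lexer = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in token_spec))

def tokenize(code):
    """Convert a stream of characters into a stream of (token, lexeme) pairs."""
    return [(m.lastgroup, m.group()) for m in lexer.finditer(code)
            if m.lastgroup != "SKIP"]

print(tokenize("while buffer 32894"))
# [('WHILE', 'while'), ('ID', 'buffer'), ('INTCONST', '32894')]
```

Note that `WHILE` is listed before `ID`, so the keyword wins over the identifier pattern; alternation order matters in this style of lexer.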
Parsing/ syntax analysis
Main function of the parser: produce a parse tree from
the stream of tokens received from the lexical analyzer;
the tree is then used by the code generator to produce target
code.
This tree is the main data structure that a
compiler uses to process the program. By traversing this
tree the compiler can produce machine code.
Secondary function of the parser: syntactic error
detection, i.e., reporting to the user where any errors in the
source code are.
Applications of RE
1. Data Validation:
Test for a pattern within a string.
For example, you can test an input string to see if a
telephone number pattern or a credit card number
pattern occurs within the string. This is called data
validation.
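A sketch (mine) of such a validation check; the US-style `ddd-ddd-dddd` telephone pattern is just an illustrative choice, not a claim about any particular format.

```python
import re

# Anchored pattern: the whole string must look like ddd-ddd-dddd.
phone = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def is_valid_phone(s):
    """Return True iff the entire string matches the telephone pattern."""
    return bool(phone.match(s))

print(is_valid_phone("555-867-5309"))  # True
print(is_valid_phone("hello"))         # False
```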
Continue…
2. Pattern matching:
You can find specific text within a document or input field.
For example, you may need to search an entire Web site,
remove outdated material, and replace some HTML
formatting tags. In this case, you can use a regular
expression to determine whether the material or the HTML
formatting tags appear in each file. This process reduces
the affected-files list to those that contain material
targeted for removal or change. You can then use a
regular expression to remove the outdated material.
Finally, you can use a regular expression to search for and
replace the tags.