COT6315 Formal Languages and Theory of Computation

R. E. Newman-Wolfe, University of Florida

Last modified 3/27/95.

Context-Free Grammars

Definition 9.1: Context-Free Grammar

A context-free grammar (CFG) is a 4-tuple, G = , where

V is a finite set of variables
T is a finite set of terminals (an alphabet)
T ^ V = 0 (no intersection between V and T)
S is the distinguished start symbol of V
P is a production function mapping elements of V to finite sets of strings over (V U T)

Generally, we will use by convention capital letters A, B, ... for variables in V, small letters, a, b, c, ... for terminal symbols in T, small letters u, v, x, ... for terminal strings in T*, and we will spell out Greek letter names alpha, beta, gamma, ... for mixed strings in (V U T)*.

Since there may be any (finite) number of such productions for each variable A in V, P maps V into 2^(V U T)*, subsets of (V U T)*. The function P is often thought of as a set of individual productions of the form A -> alpha, where A is in V and alpha is in (V U T)*. In that case, it is said that A -> alpha is a production of P.

Definition 9.1a: Context-Free Grammar Production

A string alpha in (V U T)* produces in one step a string beta in (V U T)* for a CFG, G = , iff

alpha = alpha_1 A alpha_2 where A is in V,
beta = alpha_1 beta_1 alpha_2, and
beta_1 is in P(A) (i.e., A -> beta_1 is a production of P).

Then we say alpha =>_G beta. It is useful to talk about the number of production steps that may be applied to produce one string from another. For this, we superscript the => symbol with the number of steps:

alpha =>0 alpha,
if alpha =>i-1 beta, and beta => gamma, then alpha =>i gamma.

Similarly, =>*_G is the transitive closure of =>_G.

Definition 9.1b: Context-Free Grammar Derivation

A CFG, G = , derives a string alpha in (V U T)* iff S =>*_G alpha. Strings in (V U T)* that are derivable from S are sentential forms. A derivation in which at each step the production applied to the sentential form alpha is applied to the leftmost (rightmost) variable in alpha is called a leftmost (rightmost) derivation.

Definition 9.2: Context-Free Language

Given a CFG, G = , the language generated by G is L(G) = { w in T* | S =>_G w }, exactly those strings of terminals that are derivable by G. For a given language, L, if there is a CFG, G, such that L = L(G), then L is a context-free language (CFL).

Theorem 9.1: CFLs are closed under concatenation, union, Kleene closure

For any L_1, L_2 in CFL, L_1.L_2 is in CFL, L_1 U L_2 is in CFL, and L_1* is in CFL.

Proof

Definition 9.3: Regular Grammar

A CFG, G, is a regular grammar iff every production is of one of the three forms:

B -> a
B -> aC
B -> e (empty string)

where B and C are variables in V and a is a terminal in T.

Theorem 9.2: Regular Languages and Regular Grammars

Any language L in T* is regular iff there is a regular grammar G such that L = L(G).

Proof

Definition 10.0: Derivation Tree

Given a CFG, G = and a string beta in (V U T)*, an ordered, rooted tree D= is a derivation (or parse) tree for beta in G iff:

the nodes of D are labeled with symbols from V U T,
the root of D is labeled S,
the children of a node labeled A are exactly the symbols in alpha, in the order they appear in alpha, where A -> alpha is a production in P,
symbols of beta appear in order as the leaves of D.

Note that the production A -> e is allowed, and e may appear any number of times without effect in the string beta. Also note that a there may be many different derivations corresponding to the same derivation tree (the leftmost and rightmost derivations are in effect, depth-first searches of the derivation tree). Derivation trees may be rooted at variables other than S, but of course, sentential forms of these may not be sentential forms of the grammar.

It is useful to talk about the height of a derivation tree rooted at a variable A that derives a sentential form alpha. For this, we superscript the => symbol with the number of steps in parentheses:

A =>(1) alpha iff A => alpha (i.e., alpha in P(A)),
A =>(j+1) alpha_1.alpha_2....alpha_k for j >= 0 iff
- A => X1.X1....Xk (i.e., X1.X2....Xk in P(A)) and
- for all i=1,2,...,k, Xi =>(j_i) alpha_i if Xi is in V or
- alpha_i = Xi if Xi is in T (in which case, let j_i = 0), and
- j = max{j_i | i=1,2,...,k}.

The utility of this notion and notation will become apparent when we set out to reason about derivations. It is also useful to know the following lemma.

Lemma 9.1: Piecewise Derivation

Let G= in CFG, A in V, x = x1.x2...xn, xi in V U T for i=1,2,...,n. Then for k > 0, A =>k x <=> A => X1.X2....Xn =>k-1 x1.x2.....xn and for i=1,2,...,n, either Xi=xi in T (and ji = 0) or Xi in V and Xi =>ji xi, with Sum {i=1 to n} ji = k.

Definition 9.1b: Context-Free Grammar Derivation

A CFG, G = , derives a string alpha in (V U T)* iff S =>*_G alpha.

Definition 10.1: Ambiguous Grammar

A CFG G is ambiguous iff there is at least one string in L(G) that has two or more distinct derivation trees (or equivalently, two or more distinct leftmost derivations).

Definition 11.1: Nullable Variable

A variable A in V of grammer G = in CFG is nullable iff A =>* e, or equivalently, iff

e is in P(A) or
B1B2...Bk is in P(A) and for all i=1,2,...,k, Bi is nullable.
No other variables are nullable.

The first condition states that A => e directly, while the second recursively accounts for indirect derivation of the null string, which can be shown graphically. We present an algorithm a bit different from the text to find the set of nullable variables.

Algorithm 11.1a: FindNull(G)

    NN = N = 0
    REPEAT
        N = NN
        NN = N U {A in V | P(A) ^ N* <> 0}
    UNTIL (N = NN)
    RETURN N

The algorithm adds variables that can derive a null string in one or more steps (depending on the iteration at which the variable is added) and halts when a fixed point is reached. We now use this to eliminate e-productions from a grammar. This can be shown graphically.

Algorithm 11.1: Elim-e-Prod(G)

    P2 = P1 = P
    N = FindNull(G)
    REPEAT
       P1 = P2
       FOREACH (A in V)
           P2(A) = P1(A) U 
		{alpha1 alpha2 | (alpha1 B alpha2) in P1(A) and B in N}
    UNTIL (P1 = P2)
    FOREACH (A in V) P1(A) = P1(A) - {e,A}
    RETURN

Theorem 11.1: Equivalence of Lambda-Productionless Grammar

G in CFG and G1 = Elim-e-Prod(G) => L(G) - {e} = L(G1) and G1 has no e-productions.

Proof

Definition: A-Derivable Variable

A variable B in V of grammer G = in CFG where G has no e-productions, is A-derivable for a variable A in V iff A =>* B, or equivalently, iff

B is in P(A) or
C is A-derivable, B is in P(C), and B <> A.
No other variables are A-derivable.

The first condition states that A => B directly, while the second applies the transitive closure to obtain all A-derivable variables, which can be shown graphically. As in the e-production case, we first give an algorithm to find all A-derivable variables, then use it in an algorithm to eliminate all unit productions. Note that the initialization step does not include all of P(A), which consists of strings, but only the variables corresponding to singleton strings in P(A).

Algorithm 11.2a: Find-A-Deriv(A,G)

    NAD = {B | B in P(A)}
    REPEAT
       AD = NAD
       NAD = AD U {B in V | Exists C in AD s.t. B in P(C)} - {A}.
    UNTIL (AD = NAD)
    RETURN AD

The following algorithm assumes there are no e-productions in G.

Algorithm 11.2: Elim-Unit-Prod(G)

    FOREACH (B in V) AD(B) = Find-A-Deriv(B,G)
    FOREACH (A in V, B in AD(A))
        P1(A) = P(A) U P(B)
    FOREACH (A in V)
        P1(A) = P1(A) - AD(A)
    RETURN

When P1(A) is first created, it includes all the right-hand sides for every non-e-production for any variable B that A can derive. Finally, all the unit productions are removed. Note that, provided the appropriate minor modifications are made, these two steps may be done in either order!

Theorem 11.2: Equivalence of Unit-Productionless Grammar

G in CFG, G has no e-productions, and G1 = Elim-Unit-Prod(G) => L(G) - {e} = L(G1) and G1 has no unit productions.

Proof

Definition 11.2a: Live Variable

A variable A in V of grammer G = in CFG is live iff

Exists x in P(A)^T*, or
Exists alpha in P(A) and all variables in alpha are live.
No other variables are live.

Definition 11.2b: Reachable Variable

A variable A in V of grammer G = in CFG is reachable iff

Exists alpha1 A alpha2 in P(S), or
Exists reachable B and alpha1 A alpha2 in P(B).
No other variables are reachable.

These can be shown graphically.

Definition 11.2c: Useful Variable

A variable A in V of grammer G = in CFG is useful iff A appears in a derivation of a word in L(G), i.e.,

S =>* alpha1 A alpha2 =>* x in T*

If a variable is not useful, it is useless.

Algorithm 11.3a: FindLive(G)

    NL = {A | P(A) ^ T* <> 0 }
    REPEAT
       L = NL
       NL = L U {B in V | P(B) ^ (T U L)* <> 0 }
    UNTIL (L = NL)
    RETURN L

On iteration i, the algorithm adds those variables that take at least i steps to derive a string of terminals.

Algorithm 11.3b: FindReachable(G)

    NR = {S}
    REPEAT
       R = NR
       NR = R U {A in V | Exists B in R s.t. P(B) ^ (VUT)*A(VUT)* <> 0 }
    UNTIL (R = NR)
    RETURN R

On iteration i, the algorithm adds those variables that only appear in sentential forms that take least i steps to derive from S.

Algorithm 11.3: Elim-Useless(G)

    V' = FindLive(G)
    FOREACH (A in V1)
        P'(A) = P(A) ^ (V' u T)*
    G' = 
    V1 = FindReachable(G')
    FOREACH (A in V1)
        P1(A) = P(A) ^ (V1 u T)*
    RETURN

Only reachable and live variables can be useful (although these alone are not sufficient!), so the algorithm restricts attention to them. Only productions that involve live and reachable variables can contribute to L(G), so the rest are pruned. A variable may be both live and reachable, but may appear only in sentential forms with non-live variables, and so be useless. For this reason, it is important that the grammar G' with only live variables be produced first, since the pruning operation on P will eliminate all productions involving dead variables. If a live, reachable variable can only appear in a sentential form with non-live variables, then it will not be reachable in the pruned grammar, since the necessary step that produced the non-live variable(s) with which it must appear is no longer available.

Theorem 11.3: Equivalence of Grammar lacking Useless Variables

G in CFG and G1 = Elim-Useless(G) => L(G) = L(G1) and G1 has no useless variables.

Proof

Theorem 11.4: Equivalence of e-Productionless, Unit-Productionless Grammar lacking Useless Variables

G in CFG => Exists G1 in CFG such that L(G1) = L(G) - {e} and G1 has no e-productions, no unit productions, and no useless variables.

Proof

Definition 11.3: Chomsky Normal Form

A grammer G = in CFG is in Chomsky Normal Form (CNF) iff for all A in V and x in P(A), either

x in V^2, or
x in T.

All of the productions are of the form A -> BC or A -> a for A, B, C in V and a in T.

Theorem 11.5: Equivalence of CNF Grammar

G in CFG => Exists G1 in CNF such that L(G1) = L(G) - {e}.

Proof

Decision questions for CFLs

For a CFL L with CFG G, is

L infinite?
Determine the pumping lemma constant N for G in GNF, then run all possible derivations with N to 2N steps. This will generate all strings in L with length n, N <= n <= 2N. L is infinite if and only if there is at least one such string.
L empty?
Again, determine the pumping lemma constant N for G in GNF, then run all possible derivations of length up to N. L is empty if and only if there are no strings of length less than N.
a string w in L?
For this, using a grammar in GNF suggests an algorithm that parses one input character at a time, and keeps track of all sentential forms that have a prefix of the input as a prefix of the sentential form consisting of all of the its terminal symbols.
The CYK algorithm takes a grammar in CNF, and using dynamic programming parses the string bottom-up. It may also be used to determine the number of distinct derivation trees in G for a particular string.

Send comments to nemo@cis.ufl.edu