Context Free Grammars

Category: 

A context-free grammar (cfg) is a formal grammar in which exactly one nonterminal symbol is derived on an arbitrarily long sequence of other symbols. They are important for programming languages, because their syntax is described by a cfg.

Definition

A context-free grammar $G$ is a 4-tuple $(N, T, P, S)$ with the following features:

  • Startsymbol $S \in N$
  • Nonterminals $N$ and terminals $T$ are disjunctive alphabets
  • $P$ is a finite subset of $N \times (N \cup T)*$ and defines the production rules

Nonterminals are symbols of the alphabet, which will be replaced by production rules and do not occur in the final set. Terminals on the other hand will not be replaced, and they occur in the final set. The rules of a cfg are of the following form:
$A \rightarrow \alpha$ with $A \in N, \alpha \in (N \cup T)*$
There is exactly one nonterminal symbol on the left hand side of the rule and an arbritary number of terminal and nonterminal symbols on the right hand side.

Example:
$S \rightarrow aSb$
$S \rightarrow \epsilon$ (is the "empty" symbol)
With those two rules, the following language is defined: $\{ a^n b^n \vert n \ge 0\}$.
The same language can be described like this:
$S \rightarrow aSb | \epsilon$ ( | means "or")

Production rules are often written like this: $S ::= \alpha$
By convention, nonterminals are written in uppercase and terminals in lowercase. Moreover instead of 'a' one often writes a.

Programming language example

Let's look at a bit more complex example, which is based on a programming language.

G = ( N, T, P, S ) mit
N = { Program, StmtList, Stmt, While, If, Print, Cond, CondOp, StringOrIdent }
T = { 'while', '{', '}', 'if', 'print', '<', '>', '>=', '<=', '==', '!=', string, ident }
S = Program

P = {
  Program ::= StmtList
  StmtList ::= Stmt StmtList
  StmtList ::=
  Stmt ::= While | If | Print
  While ::= 'while' Cond '{' StmtList '}'
  If ::= 'if' Cond '{' StmtList '}'
  Print ::= 'print' StringOrIdent
  StringOrIdent ::= string | ident
  Cond ::= ident CondOp ident
  CondOp ::= '>' | '<' | '>=' | '<=' | '==' | '!='
}

This cfg is able to check the syntax for programs like this:

while a>b {
  if x==y {
    print "x equals y"
  }
}

Evaluation of a sentence

To check if a sentence, like the program code above, is valid, we use a derivation tree.
We start with the startsymbol and apply different production rules to obtain the sentence. Leave nodes of a tree are terminal symbols.

Let's look at another small example. The formal language of palindromes. A palindrome is word which reads the same forwards and backwards.

S ::= 'a' S 'a' | 'b' S 'b'
S ::= 'a' | 'b' | $\epsilon$

To check if "abba" is a plandrome, we do the following:

S => aSa => abSba => abba
//or with an derivation tree
   S
  /|\
 a S a
  /|\
 b S b
   |
   $\epsilon$

Ambiguity of grammars

A formal grammar is ambiguous iff one can obtain the same sentence by different ways. For a better understanding we look at the following example:

Expr ::= Expr '+' Expr
Expr ::= Number

This language can create the same sentence by different ways. Imagine the sentence "1+2+3":

This grammar can now easily made unambiguous, by only growing in one direction of the tree.

Expr ::= Number '+' Expr
Expr ::= Number

Now the tree can only grow to the right, since the left part is always derived to a Number.

Restrictions

Up to now we allow an arbitrary number of symbols on the right hand side. What happens if one restricts the right side?
Depending on the restriction, the grammar may lose it's expressiveness, but it also leads to a much simpler parser.
Three special restrictions are regular grammars, Chomsky-normal form and Greibach-normal form.

Regular grammar

Strictly speaking, a distinction in right regular and left regular is made. Usually the right regular form is used and it's only called regular.
The right hand side of a production rule must consist of not more than one terminal and one nonterminal symbol. Which of those is written first depends one the type of regularity (left or right).
More specifically, the production must have the following form:
$A \rightarrow b B$ with terminal $b$ and nonterminal $B$ (or $A \rightarrow B b$ in a left regular grammar)
$A \rightarrow b$ with terminal $b$
$A \rightarrow \epsilon$

For each right regular grammar there exists a left regualr grammar, which will define the same language and vice versa.

Not every context-free grammar can be described by an equivalent regular grammar. For example, you can not define the language $L=\{a^n b^n \}$ by a regular grammar.

Chomsky normal form

For every context-free language there exists a grammar in Chomsky normal form. So they are equally powerful.
In order to satisfy the Chomsky normal form, the productions muste be of the following form:
$A \rightarrow BC$ with $B,C \in N$
$A \rightarrow a$ with $a \in T$
$S \rightarrow \epsilon$ (only allowed, iff $S$ does not occur on the right hand side of any rule)
So every nonterminal symbol is derived to either exactly two other nonterminals or on exactly one terminal. The only exception is the start symbol, which can also be mapped to $\epsilon$, but only if S never occurs on the right hand side of any rule.

Greibach normal form

Just like the Chomsky normal form, every cfg can also be described by a Greibach normal form.
A grammar $G$ is in Greibach normal form, iff all productions from $P$ are of the form $A \rightarrow b B_1 ... B_k$ with $k \ge 0$, where $b$ is a terminal and all $B_i$ for $i \in \{1,...,k\} $ are nonterminals. So every rule derives to a single terminal and an after that an arbitrary number of nonterminals. The only exception is again the startsymbol, which can be derived to $\epsilon$, but only iff it never occurs on the right hand side.

On the next page you will find some exercises.