Compilers

FIRST and FOLLOW sets

Can aid both top-down and bottom-up parsers.

FIRST

If x is a terminal, FIRST(x) = {x}.
If X is a non-terminal where X = Y₁Y₂…Yₖ for k ≥ 1,
- FIRST(X) = FIRST(X) ∪ a if a ∈ FIRST(Yᵢ) for the first Yᵢ where ε ∉ FIRST(Yᵢ).
- If ∀i, ε ∈ FIRST(Yᵢ) then FIRST(X) = FIRST(X) ∪ ε
If X → ε is a production, ε ∈ FIRST(X).

FOLLOW

If S is the start symbol, $ ∈ FOLLOW(S).
If A → αBβ, then ∀x ∈ (FIRST(β) - {ε}), x ∈ FOLLOW(B).
If A → αB, then everything in FOLLOW(A) is in FOLLOW(B).
If A → αBβ and ε ∈ FIRST(β), then everything in FOLLOW(A) is in FOLLOW(B).

Parsing table construction

foreach production A → α in the grammar; do
  foreach terminal x in FIRST(α), add T[A, x] = A → α
  if ε ∈ FIRST(α), then
    foreach terminal y in FOLLOW(A), add T[A, y] = A → α
    if ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add T[A, $] = A → α

Examples

Eg1

E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id

where id is a terminal.

FIRST set:

Symbol	FIRST set
E	F(T) = { (, id }
E'	{ +, ε }
T	F(F) = { (, id }
T'	{ *, ε }
F	{ (, id }

FOLLOW set:

Symbol	FOLLOW set	Comment
E	{ $, ) }	Starting symbol
E'	{ $, ) }
T	{ +, $, ) }	FIRST(E')
T'	{ +, $, ) }	FOLLOW(T)
F	{ *, +, $, ) }	FIRST(T') ∪ FOLLOW(T)

Parsing table:

	id	+	*	(	)	$
E	E → T E'			E → T E'
E'		E' → + T E'			E' → ε	E' → ε
T	T → F T'			T → F T'
T'		T' → ε	T' → * F T'		T' → ε	T' → ε
F	F → id			F → ( E )

E → T E'
- FIRST(T) = {(, id}
E' → + T E'
- FIRST(+) = { + }
E' → ε
- FOLLOW(E') = {$, )}
T → F T'
- FIRST(F) = {(, id}
T' → * F T'
- FIRST(*) = { * }
T' → ε
- FOLLOW(T') = {+, $, )}
F → ( E )
- FIRST( ( ) = { ( }
F → id
- FIRST(id) = { id }

Eg2

S -> ( L ) | a
L -> L , S | S

This grammar is left-recursive. We need to remove it before we can find FIRST set.

S  → ( L ) | a
L  → L , S | S
L  → SL'
L' → ,L | ε

A -> Aα | b

A -> bA' A' -> αA | ε

FIRST set:

Symbol	FIRST set
S	{(}
L

Top-down parsing

Can be thought of as finding a left-most derivation for a sentence.

Example: LL(k)

For example, for the grammar

E → T E'
E'→ + T E' | ε
T → F T'
T'→ * F T' | ε
F → ( E ) | id

parse tree for an input string id + id * id could be built like

E

 E
/ \
T E'


   E
 /    \
 T      E'
 |     
/ \
F  T'



   E
 /    \
 T      E'
 |     
/ \
F  T'
|  
id 


   E
 /    \
 T      E'
 |     
/ \
F  T'
|  |
id ε



   E
 /    \
 T      E'
 |      |
/ \     + T E'
F  T'     
|  |
id ε


   E
 /    \
 T      E'
 |      |
/ \     + T      E'
F  T'     |      |
|  |      F T'   ε
id ε




   E
 /    \
 T      E'
 |      |
/ \     + T      E'
F  T'     |      |
|  |      F T'   ε
id ε      |
          id



   E
 /    \
 T      E'
 |      |
/ \     + T                 E'
F  T'     |                 |
|  |      F    T'           ε
id ε      |    |
          id   * F T'


   E
 /    \
 T      E'
 |      |
/ \     + T                E'
F  T'     |                |
|  |      F    T'          ε
id ε      |    |
          id   * F   T'
             |   |
                 id  ε

Recursive descent parsing

Every symbol may be thought of as functions.
- A procedure is associated with each non-terminal.
May need backtracking.
Left recursive grammars mean infinite loop.
- Same non-terminal's procedure would be called again and again, without consuming any input.

Predictive parsing

A special case of recursive descent parsing.
A type of top-down parsing which doesn't need backtracking.
Appropriate production chosen by looking ahead a fixed number of symbols.
- LL(k) parsing. k is number of lookahead symbols.
- LL means 'Left-to-right (scanning), Left-most derivation'.
- Usually just one symbol => LL(1) parsing
The lookahead symbol(s) unambiguously determines the production to be chosen for a grammar that can be parsed with a predictive parser.

Bottom-up parsing

Examples:

Shift-reduce parser
- LR(k)
- An ambiguous grammar cannot be parsed by an LR parser
Operator precedence parser

For example, for the grammar

E → E + T | T
T → T * F | F
F → ( E ) | id

parse tree for an input string id * id could be built like

id * id                 F * id
                        |
                       id



 T * id                 T * F
 |                      |   |
 F                      F  id
 |                      |
id                     id




   T                      E
   |                      |
 /   \                    T
 T * F                    |
 |   |                  /   \
 F  id                  T * F
 |                      |   |
id                      F  id
                        |
                       id

Bottom-parsing may be thought as a process of 'reducing' input sentence to the start symbol of the grammar.

Goal of bottom parsing is to do a derivation in reverse.

For the example that we just saw:

Right sentential	Handle	Reducing prodn
id₁ * id₂	id₁	F → id
F * id₂	F	T → F
T * id₂	id₂	F → id
T * F	T * F	T → T * F
T	T	E → T

Shift reduce parsing

Stack holds grammar symbols.
Rest of the input in an input buffer.
Top of the stack is a handle => reduce!

Stack has start symbol of grammar and input buffer empty => parsing success.

Shift means pushing to stack.

For our usual example grammar:

E → E + T | T
T → T * F | F
F → ( E ) | id

and input id₁ * id₂,

Stack	input	action
$	id₁ * id₂$	shift
id₁\| * id₂	reduce by F → id
F\| * id₂	reduce by T → F
T\| * id₂	shift
T * \|id₂	shift
$T * id₂	$	shift
$T * id₂	$	reduce by F → id
$T * F	$	reduce by T → T * F
$T	$	reduce by E → T
$E	$	accept input

Conflicts in shift-reduce parsing

shift-reduce conflict: not sure whether to shift or reduce
reduce-reduce conflict: not sure which production should be used for reducing

LR parsing

Stands for Left-to-right, Right-most derivation.
- Left-to-right scanning of input.
- Constructs a right-most derivation in reverse.

LR parsers

table driven.
usually made with parser generators.

Usually, LR parser means LR(1) parser.

LR parsers maybe classified into:

Simple LR (SLR)
Lookahead LR (LALR)
Canonical LR (CLR)

Parsing power and complexity: SLR < LALR < CLR

For most practical purposes, LALR would suffice.

Shift-reduce parsing

Form of bottom-up parsing.
Handle: something that can be reduced to LHS of a production.

Conflicts in shift-reduce parsing

Conflicts may arise while doing shift-reduce parsing (which is a bottom-up parsing mechanism).

From purple dragon book:

The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds.

Reduce

Convert the RHS of a production to its LHS.

From the Purple dragon book:

At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production.

For example, if production is:

S -> ab

and input is at:

ab
  ↑

where ab has already matched, we can reduce and make the ab into an S. ie, we decide that we go with this production.

Shift

Delay the decision a bit more. Let's see more of the input before deciding. Leave options open.

Suppose there is this production:

S -> ab

and input is at

a b
 ↑

at this point there is no production is complete, we need to accept more input. In such a case, the current input character is 'shifted' in.

Shift-reduce conflict

Parser isn't sure whether it should shift or reduce.

Suppose we have the following productions:

S -> a
S -> ab

and input is at

a b
 ↑

Here S->a could be used for a reduce. But S->ab could be used for a shift as well. Hence a conflict arises.

Consider a more realistic example:

stmt := if expr then stmt
      | if expr then stmt else expr

When the parsing is at this stage:

Stack: if expr then stmt
Input: else ...

there is a shift-reduce conflict.

Shifting could be done with production 2 and reduce with production 1.

Reduce-reduce conflict

There is more than one production which can be used to do a reduce.

Suppose we have the following productions:

S -> a
E -> a

and input is at

a
 ↑

Here either of the productions can be used to do a reduce. Parser is not sure how to proceed. Hence the conflict.

Why no shift-shift conflict?

Because either way shifting of the same character is done.

LR parsing

Table driven.
A shift-reduce parsing method
SLR < LALR < CLR
Yacc is a LR parser generator
Parser usually generated with a parser generator.
- Still, it is helpful to know how it works.
LALR is the most common LR parser. Why? :DBT:
Multiple entries in parsing table => conflict
LR parsing is bottom up parsing

Reason for the name:

LR(k): k is the number of look-ahead characters
- k not mention => k=1
- only k={0,1} 'are of practical interest'. Why? :DBT:
L: left-to-right scanning of input
R: right-most derivation

An ambigous grammar can never be LR.

Symbol table

Used to keep track of variables and their values. Symbol table needn't be in binary executable. Can be included if we want to debug later.

Python has a module to examine its symbol table: https://docs.python.org/3/library/symtable.html

import symtable
table = symtable.symtable("def foo(): pass", "string", "exec")

print(list(table.get_identifiers()))
# ['foo']

Ambiguous CFGs

A grammar that can produce more than one parse tree.

For example, the following CFG is ambiguous:

E := E + E
   | E * E
   | id

because the string id + id * id allows for two different left-most derivations in this CFG:

E  + E                     E
id + E                     |
id + E  * E                +
id + id * E                |
id + id * id         +----------+
                     |          |
                     E          E
                     |          |
                     id         *
                                |
                            +-------+
                            |       |
                            E       E
                            |       |
                           id      id

and

E       * E                E
E  + E  * E                |
id + E  * E                *
id + id * E                |
id + id * id         +----------+
                     |          |
                     E          E
                     |          |
                     +          id
                     |      
                 +-------+  
                 |       |  
                 E       E  
                 |       |  
                id      id

(Ambiguity only on top-down parsing? Of course not.)

This grammar can be made unambiguous by adding a few extra productions (sort of restricts the entry-point of parsing):

E := E + P | P
P := P * Q | Q
Q := id

—

Not every ambiguous CFG has an unambiguous equivalent. Such grammars are said to be inherently ambiguous.

—

No algorithm exists that can tell if a grammar is ambiguous or not. ˢᵒ Also, there's no algorithm that can give unambiguous counterpart of an ambiguous grammar. Both of these are undecidable problems. ˢᵒ

Optimizations

Loop unrolling
Operator strength reduction
Constant folding
Common subexpression elimination
Dead code elimination

Terms

Nullable non-terminal

A non-terminal is nullable if it can become ε. Even if the derivation is long.

Chomsky Normal Form (CNF)

A grammar is in CNF if every rule is of one of the following forms:

A → BC
A → a

Every grammar in CNF is context free (ie, CNF ⊂ CFG)

Greibach Normal Form (GNF)

Grammar where all production rules' RHS starts with a terminal character, optionally followed by one or more non-terminals.
Doesn't have left recursions.

Left recursion

RHS of a non-terminal (even if via derivation) starts with the same non-terminal.
Top-down parsers can't handle left-recursion.
Immediate left recursion: No need derivation. The given production itself is left-recursive.

Eliminating left recursion

A → Aα | β

can be made into

A  → βA'
A' ⇒ αA' | ε

Or if there are multiple productions for the non-terminal,

A  → Aα₁ | Aα₂ | ... | Aαₘ | β₁ | β₂ | ... | βₙ

can become

A  → β₁A' | β₂A' | ... | βₙA' 
A' ⇒ α₁A' | α₂A' | ... | αₘA' | ε

Left factoring

Making a grammar suitable for predictive, or top-down parsing.

If two productions of a non-terminal has the same prefix, parser cannot choose between them till enough of the input (ie, more than the prefix common to the two productions) has been seen.

A -> αβ₁
A -> αβ₂

We can fix this by making the prefix a non-terminal:

A  -> αA'
A' -> β₁ | β₂

Single production

A production rule whose body consists solely of a single non-terminal. Like: A → B

ε-production

A production rule whose RHS (or body) is ε. Like: A → ε

ε-free grammar

A grammar without any ε-productions.

CYK algorithm

Cocke–Younger–Kasami algorithm.
An algorithm to parse context free grammars.
𝑂(n³) time complexity, where n is length of input string.

Derivations

For a grammar

E := E + E
   | E * E
   | - E
   | ( E )
   | id

E => -E means E derives -E.

And the sequence of replacements

E => -E => -(E) => -(id)

is a derivation of -(id) from E.

A => A' means A derives A' in one step
A =>* A' means A derives A' in zero or more steps
A =>+ A' means A derives A' in one or more steps

For a grammar G with start symbol S, if S =>* α, we say that α is a sentential form of G. Sentential form may contain terminals and non-terminals. May even be empty.

A sentence of G is a sentential form without non-terminals.

Language generated by a grammar is the set of its sentences.

Two grammars are equivalent if the languages generated by them are the same.

Left-most derivation

Left-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).

E =>* -(id+id)

E -E -(E) -(E + E) -(id + E) -(id + id)

If S =>* α by left-most derivation, we can say that α is a left-sentential form of G.

Right-most derivation

Right-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).

E =>* -(id+id)

E -E -(E) -(E + E) -(E + id) -(id + id)

If S =>* α by right-most derivation, we can say that α is a right-sentential form of G.

Right-most derivations are also known as canonical derivations. (WHY??)

Parse tree

E =>* -(id+id)

   E
   |
+--+--+
|     |
-     E
   +--+--+
   |  |  |
   (  E  )
      |
   +--+--+
   |  |  |
   E  +  E
   |     |
  id    id

Parse tree is same for left-most and right-most derivations.

Ambiguous grammars

A grammar capable of producing more than one parse tree for the same sentence. ie, more than one left-most derivation or more than one right-most derivation for the same sentence.

For the grammar

E := E + E
   | E * E
   | ( E )
   | id

id + id * id can have two parse trees.

Left-most derivation:

E => E + E => id + E => id + E * E => id + id * E => id + id * id
E => E * E => E + E * E => id + E * E => id + id * E => id + id * id

Right-most derivation:

E => E + E => E + E * E => E + E * id => id + id * id
E => E * E => E * id => E + E * id => E + id * id => id + id * id

Reentrant parsers

Similar to the concept of reentrant functions
A 'pure' parser à la pure functions
No state maintained
- No global variables
Can parse a bit exit and resume parsing
https://stackoverflow.com/questions/2441351/what-is-a-re-entrant-parser

Grammars

Ambiguity

Every string of the grammar must have only one derivation.
Either the grammar itself should be unambiguous, or there should be something else done to mitigate the grammar's ambiguity.
- Like precedence levels??

Example CFGs

CFG generating balanced parentheses: S := ( S ) S | ε
Palindromes where Σ={a,b}: S := 0S0 | 1S1 | 0 | 1 | ε
Even number zeros and odd number of 1s:

:TODO:

2 zeros: Z -> 001 | 010 | 100
1 one:   O -> 1

This is actually a regular language...

https://cs.stackexchange.com/questions/110821/context-free-grammar-for-language-with-even-number-of-0s-and-1s

Dbts

Why right-most derivation parsers not common?
Is there a parser that can parse an ambiguous grammar?
- Non-determinism => ambiguity not a problem.
What if A → Aα | ε in left recursion elimination?

References

Compilers: Principles, techniques and tools, 2nd edition - Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman
Modern compiler implementation in ML - Andrew W. Appel