FIRST and FOLLOW sets
Can aid both top-down and bottom-up parsers.
FIRST
- If
x
is a terminal, FIRST(x) = {x}. - If X is a non-terminal where X = Y₁Y₂…Yₖ for k ≥ 1,
- FIRST(X) = FIRST(X) ∪ a if a ∈ FIRST(Yᵢ) for the first Yᵢ where ε ∉ FIRST(Yᵢ).
- If ∀i, ε ∈ FIRST(Yᵢ) then FIRST(X) = FIRST(X) ∪ ε
- If
X → ε
is a production, ε ∈ FIRST(X).
FOLLOW
- If S is the start symbol, $ ∈ FOLLOW(S).
- If
A → αBβ
, then ∀x ∈ (FIRST(β) - {ε}), x ∈ FOLLOW(B). - If
A → αB
, then everything in FOLLOW(A) is in FOLLOW(B). - If
A → αBβ
andε ∈ FIRST(β)
, then everything in FOLLOW(A) is in FOLLOW(B).
Parsing table construction
foreach production A → α in the grammar; do
foreach terminal x in FIRST(α), add T[A, x] = A → α
if ε ∈ FIRST(α), then
foreach terminal y in FOLLOW(A), add T[A, y] = A → α
if ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add T[A, $] = A → α
Examples
Eg1
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
where id
is a terminal.
FIRST set:
Symbol | FIRST set |
---|---|
E | F(T) = { (, id } |
E' | { +, ε } |
T | F(F) = { (, id } |
T' | { *, ε } |
F | { (, id } |
FOLLOW set:
Symbol | FOLLOW set | Comment |
---|---|---|
E | { $, ) } | Starting symbol |
E' | { $, ) } | |
T | { +, $, ) } | FIRST(E') |
T' | { +, $, ) } | FOLLOW(T) |
F | { *, +, $, ) } | FIRST(T') ∪ FOLLOW(T) |
Parsing table:
id | + | * | ( | ) | $ | |
---|---|---|---|---|---|---|
E | E → T E' | E → T E' | ||||
E' | E' → + T E' | E' → ε | E' → ε | |||
T | T → F T' | T → F T' | ||||
T' | T' → ε | T' → * F T' | T' → ε | T' → ε | ||
F | F → id | F → ( E ) |
- E → T E'
- FIRST(T) = {(, id}
- E' → + T E'
- FIRST(+) = { + }
- E' → ε
- FOLLOW(E') = {$, )}
- T → F T'
- FIRST(F) = {(, id}
- T' → * F T'
- FIRST(*) = { * }
- T' → ε
- FOLLOW(T') = {+, $, )}
- F → ( E )
- FIRST( ( ) = { ( }
- F → id
- FIRST(id) = { id }
Eg2
S -> ( L ) | a
L -> L , S | S
This grammar is left-recursive. We need to remove it before we can find FIRST set.
S → ( L ) | a
L → L , S | S
L → SL'
L' → ,L | ε
A -> Aα | b
A -> bA' A' -> αA | ε
FIRST set:
Symbol | FIRST set |
---|---|
S | {(} |
L |
Top-down parsing
Can be thought of as finding a left-most derivation for a sentence.
Example: LL(k)
For example, for the grammar
E → T E'
E'→ + T E' | ε
T → F T'
T'→ * F T' | ε
F → ( E ) | id
parse tree for an input string id + id * id
could be built like
E
E
/ \
T E'
E
/ \
T E'
|
/ \
F T'
E
/ \
T E'
|
/ \
F T'
|
id
E
/ \
T E'
|
/ \
F T'
| |
id ε
E
/ \
T E'
| |
/ \ + T E'
F T'
| |
id ε
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε |
id
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε | |
id * F T'
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε | |
id * F T'
| |
id ε
Recursive descent parsing
- Every symbol may be thought of as functions.
- A procedure is associated with each non-terminal.
- May need backtracking.
- Left recursive grammars mean infinite loop.
- Same non-terminal's procedure would be called again and again, without consuming any input.
Predictive parsing
- A special case of recursive descent parsing.
- A type of top-down parsing which doesn't need backtracking.
- Appropriate production chosen by looking ahead a fixed number of symbols.
- LL(k) parsing. k is number of lookahead symbols.
- LL means 'Left-to-right (scanning), Left-most derivation'.
- Usually just one symbol => LL(1) parsing
- The lookahead symbol(s) unambiguously determines the production to be chosen for a grammar that can be parsed with a predictive parser.
Bottom-up parsing
Examples:
- Shift-reduce parser
- LR(k)
- An ambiguous grammar cannot be parsed by an LR parser
- Operator precedence parser
For example, for the grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
parse tree for an input string id * id
could be built like
id * id F * id
|
id
T * id T * F
| | |
F F id
| |
id id
T E
| |
/ \ T
T * F |
| | / \
F id T * F
| | |
id F id
|
id
Bottom-parsing may be thought as a process of 'reducing' input sentence to the start symbol of the grammar.
Goal of bottom parsing is to do a derivation in reverse.
For the example that we just saw:
Right sentential | Handle | Reducing prodn |
---|---|---|
id₁ * id₂ | id₁ | F → id |
F * id₂ | F | T → F |
T * id₂ | id₂ | F → id |
T * F | T * F | T → T * F |
T | T | E → T |
Shift reduce parsing
- Stack holds grammar symbols.
- Rest of the input in an input buffer.
- Top of the stack is a handle => reduce!
Stack has start symbol of grammar and input buffer empty => parsing success.
Shift means pushing to stack.
For our usual example grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
and input id₁ * id₂
,
Stack | input | action |
---|---|---|
$ | id₁ * id₂$ | shift |
id₁| * id₂ | reduce by F → id | |
F| * id₂ | reduce by T → F | |
T| * id₂ | shift | |
T * |id₂ | shift | |
$T * id₂ | $ | shift |
$T * id₂ | $ | reduce by F → id |
$T * F | $ | reduce by T → T * F |
$T | $ | reduce by E → T |
$E | $ | accept input |
Conflicts in shift-reduce parsing
- shift-reduce conflict: not sure whether to shift or reduce
- reduce-reduce conflict: not sure which production should be used for reducing
LR parsing
- Stands for Left-to-right, Right-most derivation.
- Left-to-right scanning of input.
- Constructs a right-most derivation in reverse.
LR parsers
- table driven.
- usually made with parser generators.
Usually, LR parser means LR(1) parser.
LR parsers maybe classified into:
- Simple LR (SLR)
- Lookahead LR (LALR)
- Canonical LR (CLR)
Parsing power and complexity: SLR < LALR < CLR
For most practical purposes, LALR would suffice.
Shift-reduce parsing
- Form of bottom-up parsing.
- Handle: something that can be reduced to LHS of a production.
Conflicts in shift-reduce parsing
- Conflicts may arise while doing shift-reduce parsing (which is a bottom-up parsing mechanism).
From purple dragon book:
The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds.
Reduce
Convert the RHS of a production to its LHS.
From the Purple dragon book:
At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production.
For example, if production is:
S -> ab
and input is at:
ab
↑
where ab has already matched, we can reduce and make the ab
into an S
. ie, we decide that we go with this production.
Shift
Delay the decision a bit more. Let's see more of the input before deciding. Leave options open.
Suppose there is this production:
S -> ab
and input is at
a b
↑
at this point there is no production is complete, we need to accept more input. In such a case, the current input character is 'shifted' in.
Shift-reduce conflict
Parser isn't sure whether it should shift or reduce.
Suppose we have the following productions:
S -> a
S -> ab
and input is at
a b
↑
Here S->a
could be used for a reduce. But S->ab
could be used for a shift as well. Hence a conflict arises.
Consider a more realistic example:
stmt := if expr then stmt
| if expr then stmt else expr
When the parsing is at this stage:
Stack: if expr then stmt
Input: else ...
there is a shift-reduce conflict.
Shifting could be done with production 2 and reduce with production 1.
Reduce-reduce conflict
There is more than one production which can be used to do a reduce.
Suppose we have the following productions:
S -> a
E -> a
and input is at
a
↑
Here either of the productions can be used to do a reduce. Parser is not sure how to proceed. Hence the conflict.
Why no shift-shift conflict?
Because either way shifting of the same character is done.
LR parsing
- Table driven.
- A shift-reduce parsing method
- SLR < LALR < CLR
- Yacc is a LR parser generator
- Parser usually generated with a parser generator.
- Still, it is helpful to know how it works.
- LALR is the most common LR parser. Why? :DBT:
- Multiple entries in parsing table => conflict
- LR parsing is bottom up parsing
Reason for the name:
- LR(k): k is the number of look-ahead characters
- k not mention => k=1
- only k={0,1} 'are of practical interest'. Why? :DBT:
- L: left-to-right scanning of input
- R: right-most derivation
An ambigous grammar can never be LR.
Symbol table
Used to keep track of variables and their values. Symbol table needn't be in binary executable. Can be included if we want to debug later.
Python has a module to examine its symbol table: https://docs.python.org/3/library/symtable.html
import symtable
= symtable.symtable("def foo(): pass", "string", "exec")
table
print(list(table.get_identifiers()))
# ['foo']
Ambiguous CFGs
A grammar that can produce more than one parse tree.
For example, the following CFG is ambiguous:
E := E + E
| E * E
| id
because the string id + id * id
allows for two different left-most derivations in this CFG:
E + E E
id + E |
id + E * E +
id + id * E |
id + id * id +----------+
| |
E E
| |
id *
|
+-------+
| |
E E
| |
id id
and
E * E E
E + E * E |
id + E * E *
id + id * E |
id + id * id +----------+
| |
E E
| |
+ id
|
+-------+
| |
E E
| |
id id
(Ambiguity only on top-down parsing? Of course not.)
This grammar can be made unambiguous by adding a few extra productions (sort of restricts the entry-point of parsing):
E := E + P | P
P := P * Q | Q
Q := id
—
Not every ambiguous CFG has an unambiguous equivalent. Such grammars are said to be inherently ambiguous.
—
No algorithm exists that can tell if a grammar is ambiguous or not. ˢᵒ Also, there's no algorithm that can give unambiguous counterpart of an ambiguous grammar. Both of these are undecidable problems. ˢᵒ
Optimizations
- Loop unrolling
- Operator strength reduction
- Constant folding
- Common subexpression elimination
- Dead code elimination
Terms
Nullable non-terminal
A non-terminal is nullable if it can become ε. Even if the derivation is long.
Chomsky Normal Form (CNF)
A grammar is in CNF if every rule is of one of the following forms:
A → BC
A → a
Every grammar in CNF is context free (ie, CNF ⊂ CFG)
Greibach Normal Form (GNF)
- Grammar where all production rules' RHS starts with a terminal character, optionally followed by one or more non-terminals.
- Doesn't have left recursions.
Left recursion
- RHS of a non-terminal (even if via derivation) starts with the same non-terminal.
- Top-down parsers can't handle left-recursion.
- Immediate left recursion: No need derivation. The given production itself is left-recursive.
Eliminating left recursion
A → Aα | β
can be made into
A → βA'
A' ⇒ αA' | ε
Or if there are multiple productions for the non-terminal,
A → Aα₁ | Aα₂ | ... | Aαₘ | β₁ | β₂ | ... | βₙ
can become
A → β₁A' | β₂A' | ... | βₙA'
A' ⇒ α₁A' | α₂A' | ... | αₘA' | ε
Left factoring
Making a grammar suitable for predictive, or top-down parsing.
If two productions of a non-terminal has the same prefix, parser cannot choose between them till enough of the input (ie, more than the prefix common to the two productions) has been seen.
A -> αβ₁
A -> αβ₂
We can fix this by making the prefix a non-terminal:
A -> αA'
A' -> β₁ | β₂
Single production
A production rule whose body consists solely of a single non-terminal. Like: A → B
ε-production
A production rule whose RHS (or body) is ε. Like: A → ε
ε-free grammar
A grammar without any ε-productions.
CYK algorithm
- Cocke–Younger–Kasami algorithm.
- An algorithm to parse context free grammars.
- 𝑂(n³) time complexity, where n is length of input string.
Derivations
For a grammar
E := E + E
| E * E
| - E
| ( E )
| id
E => -E
means E
derives -E
.
And the sequence of replacements
E => -E => -(E) => -(id)
is a derivation of -(id) from E.
A => A'
means A derives A' in one stepA =>* A
' means A derives A' in zero or more stepsA =>+ A
' means A derives A' in one or more steps
For a grammar G with start symbol S, if S =>* α, we say that α is a sentential form of G. Sentential form may contain terminals and non-terminals. May even be empty.
A sentence of G is a sentential form without non-terminals.
Language generated by a grammar is the set of its sentences.
Two grammars are equivalent if the languages generated by them are the same.
Left-most derivation
Left-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).
E =>* -(id+id)
E -E -(E) -(E + E) -(id + E) -(id + id)
If S =>* α by left-most derivation, we can say that α is a left-sentential form of G.
Right-most derivation
Right-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).
E =>* -(id+id)
E -E -(E) -(E + E) -(E + id) -(id + id)
If S =>* α by right-most derivation, we can say that α is a right-sentential form of G.
Right-most derivations are also known as canonical derivations. (WHY??)
Parse tree
E =>* -(id+id)
E
|
+--+--+
| |
- E
+--+--+
| | |
( E )
|
+--+--+
| | |
E + E
| |
id id
Parse tree is same for left-most and right-most derivations.
Ambiguous grammars
A grammar capable of producing more than one parse tree for the same sentence. ie, more than one left-most derivation or more than one right-most derivation for the same sentence.
For the grammar
E := E + E
| E * E
| ( E )
| id
id + id * id
can have two parse trees.
Left-most derivation:
E
=>E + E
=>id + E
=>id + E * E
=>id + id * E
=>id + id * id
E
=>E * E
=>E + E * E
=>id + E * E
=>id + id * E
=>id + id * id
Right-most derivation:
E
=>E + E
=>E + E * E
=>E + E * id
=>id + id * id
E
=>E * E
=>E * id
=>E + E * id
=>E + id * id
=>id + id * id
Reentrant parsers
- Similar to the concept of reentrant functions
- A 'pure' parser à la pure functions
- No state maintained
- No global variables
- Can parse a bit exit and resume parsing
- https://stackoverflow.com/questions/2441351/what-is-a-re-entrant-parser
Grammars
Ambiguity
- Every string of the grammar must have only one derivation.
- Either the grammar itself should be unambiguous, or there should be something else done to mitigate the grammar's ambiguity.
- Like precedence levels??
Example CFGs
CFG generating balanced parentheses:
S := ( S ) S | ε
Palindromes where Σ={a,b}:
S := 0S0 | 1S1 | 0 | 1 | ε
Even number zeros and odd number of 1s:
:TODO:
2 zeros: Z -> 001 | 010 | 100
1 one: O -> 1
This is actually a regular language...
Dbts
- Why right-most derivation parsers not common?
- Is there a parser that can parse an ambiguous grammar?
- Non-determinism => ambiguity not a problem.
- What if
A → Aα | ε
in left recursion elimination?
References
- Compilers: Principles, techniques and tools, 2nd edition - Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman
- Modern compiler implementation in ML - Andrew W. Appel