Can aid both top-down and bottom-up parsers.
x
is a terminal, FIRST(x) = {x}.X → ε
is a production, ε ∈ FIRST(X).A → αBβ
, then ∀x ∈ (FIRST(β) - {ε}), x ∈ FOLLOW(B).A → αB
, then everything in FOLLOW(A) is in FOLLOW(B).A → αBβ
and ε ∈ FIRST(β)
, then everything in FOLLOW(A) is in FOLLOW(B).foreach production A → α in the grammar; do
foreach terminal x in FIRST(α), add T[A, x] = A → α
if ε ∈ FIRST(α), then
foreach terminal y in FOLLOW(A), add T[A, y] = A → α
if ε ∈ FIRST(α) and $ ∈ FOLLOW(A), add T[A, $] = A → α
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
where id
is a terminal.
FIRST set:
Symbol | FIRST set |
---|---|
E | F(T) = { (, id } |
E' | { +, ε } |
T | F(F) = { (, id } |
T' | { *, ε } |
F | { (, id } |
FOLLOW set:
Symbol | FOLLOW set | Comment |
---|---|---|
E | { $, ) } | Starting symbol |
E' | { $, ) } | |
T | { +, $, ) } | FIRST(E') |
T' | { +, $, ) } | FOLLOW(T) |
F | { *, +, $, ) } | FIRST(T') ∪ FOLLOW(T) |
Parsing table:
id | + | * | ( | ) | $ | |
---|---|---|---|---|---|---|
E | E → T E' | E → T E' | ||||
E' | E' → + T E' | E' → ε | E' → ε | |||
T | T → F T' | T → F T' | ||||
T' | T' → ε | T' → * F T' | T' → ε | T' → ε | ||
F | F → id | F → ( E ) |
S -> ( L ) | a
L -> L , S | S
This grammar is left-recursive. We need to remove it before we can find FIRST set.
S → ( L ) | a
L → L , S | S
L → SL'
L' → ,L | ε
A -> Aα | b
A -> bA' A' -> αA | ε
FIRST set:
Symbol | FIRST set |
---|---|
S | {(} |
L |
Can be thought of as finding a left-most derivation for a sentence.
Example: LL(k)
For example, for the grammar
E → T E'
E'→ + T E' | ε
T → F T'
T'→ * F T' | ε
F → ( E ) | id
parse tree for an input string id + id * id
could be built like
E
E
/ \
T E'
E
/ \
T E'
|
/ \
F T'
E
/ \
T E'
|
/ \
F T'
|
id
E
/ \
T E'
|
/ \
F T'
| |
id ε
E
/ \
T E'
| |
/ \ + T E'
F T'
| |
id ε
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε |
id
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε | |
id * F T'
E
/ \
T E'
| |
/ \ + T E'
F T' | |
| | F T' ε
id ε | |
id * F T'
| |
id ε
Examples:
For example, for the grammar
E → E + T | T
T → T * F | F
F → ( E ) | id
parse tree for an input string id * id
could be built like
id * id F * id
|
id
T * id T * F
| | |
F F id
| |
id id
T E
| |
/ \ T
T * F |
| | / \
F id T * F
| | |
id F id
|
id
Bottom-parsing may be thought as a process of 'reducing' input sentence to the start symbol of the grammar.
Goal of bottom parsing is to do a derivation in reverse.
For the example that we just saw:
Right sentential | Handle | Reducing prodn |
---|---|---|
id₁ * id₂ | id₁ | F → id |
F * id₂ | F | T → F |
T * id₂ | id₂ | F → id |
T * F | T * F | T → T * F |
T | T | E → T |
Stack has start symbol of grammar and input buffer empty => parsing success.
Shift means pushing to stack.
For our usual example grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
and input id₁ * id₂
,
Stack | input | action |
---|---|---|
$ | id₁ * id₂$ | shift |
id₁| * id₂ | reduce by F → id | |
F| * id₂ | reduce by T → F | |
T| * id₂ | shift | |
T * |id₂ | shift | |
$T * id₂ | $ | shift |
$T * id₂ | $ | reduce by F → id |
$T * F | $ | reduce by T → T * F |
$T | $ | reduce by E → T |
$E | $ | accept input |
LR parsers
Usually, LR parser means LR(1) parser.
LR parsers maybe classified into:
Parsing power and complexity: SLR < LALR < CLR
For most practical purposes, LALR would suffice.
From purple dragon book:
The key decisions during bottom-up parsing are about when to reduce and about what production to apply, as the parse proceeds.
Convert the RHS of a production to its LHS.
From the Purple dragon book:
At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production.
For example, if production is:
S -> ab
and input is at:
ab
↑
where ab has already matched, we can reduce and make the ab
into an S
. ie, we decide that we go with this production.
Delay the decision a bit more. Let's see more of the input before deciding. Leave options open.
Suppose there is this production:
S -> ab
and input is at
a b
↑
at this point there is no production is complete, we need to accept more input. In such a case, the current input character is 'shifted' in.
Parser isn't sure whether it should shift or reduce.
Suppose we have the following productions:
S -> a
S -> ab
and input is at
a b
↑
Here S->a
could be used for a reduce. But S->ab
could be used for a shift as well. Hence a conflict arises.
Consider a more realistic example:
stmt := if expr then stmt
| if expr then stmt else expr
When the parsing is at this stage:
Stack: if expr then stmt
Input: else ...
there is a shift-reduce conflict.
Shifting could be done with production 2 and reduce with production 1.
There is more than one production which can be used to do a reduce.
Suppose we have the following productions:
S -> a
E -> a
and input is at
a
↑
Here either of the productions can be used to do a reduce. Parser is not sure how to proceed. Hence the conflict.
Because either way shifting of the same character is done.
Reason for the name:
An ambigous grammar can never be LR.
Used to keep track of variables and their values. Symbol table needn't be in binary executable. Can be included if we want to debug later.
Python has a module to examine its symbol table: https://docs.python.org/3/library/symtable.html
import symtable
= symtable.symtable("def foo(): pass", "string", "exec")
table
print(list(table.get_identifiers()))
# ['foo']
A grammar that can produce more than one parse tree.
For example, the following CFG is ambiguous:
E := E + E
| E * E
| id
because the string id + id * id
allows for two different left-most derivations in this CFG:
E + E E
id + E |
id + E * E +
id + id * E |
id + id * id +----------+
| |
E E
| |
id *
|
+-------+
| |
E E
| |
id id
and
E * E E
E + E * E |
id + E * E *
id + id * E |
id + id * id +----------+
| |
E E
| |
+ id
|
+-------+
| |
E E
| |
id id
(Ambiguity only on top-down parsing? Of course not.)
This grammar can be made unambiguous by adding a few extra productions (sort of restricts the entry-point of parsing):
E := E + P | P
P := P * Q | Q
Q := id
—
Not every ambiguous CFG has an unambiguous equivalent. Such grammars are said to be inherently ambiguous.
—
No algorithm exists that can tell if a grammar is ambiguous or not. ˢᵒ Also, there's no algorithm that can give unambiguous counterpart of an ambiguous grammar. Both of these are undecidable problems. ˢᵒ
A non-terminal is nullable if it can become ε. Even if the derivation is long.
A grammar is in CNF if every rule is of one of the following forms:
A → BC
A → a
Every grammar in CNF is context free (ie, CNF ⊂ CFG)
A → Aα | β
can be made into
A → βA'
A' ⇒ αA' | ε
Or if there are multiple productions for the non-terminal,
A → Aα₁ | Aα₂ | ... | Aαₘ | β₁ | β₂ | ... | βₙ
can become
A → β₁A' | β₂A' | ... | βₙA'
A' ⇒ α₁A' | α₂A' | ... | αₘA' | ε
Making a grammar suitable for predictive, or top-down parsing.
If two productions of a non-terminal has the same prefix, parser cannot choose between them till enough of the input (ie, more than the prefix common to the two productions) has been seen.
A -> αβ₁
A -> αβ₂
We can fix this by making the prefix a non-terminal:
A -> αA'
A' -> β₁ | β₂
A production rule whose body consists solely of a single non-terminal. Like: A → B
A production rule whose RHS (or body) is ε. Like: A → ε
A grammar without any ε-productions.
For a grammar
E := E + E
| E * E
| - E
| ( E )
| id
E => -E
means E
derives -E
.
And the sequence of replacements
E => -E => -(E) => -(id)
is a derivation of -(id) from E.
A => A'
means A derives A' in one stepA =>* A
' means A derives A' in zero or more stepsA =>+ A
' means A derives A' in one or more stepsFor a grammar G with start symbol S, if S =>* α, we say that α is a sentential form of G. Sentential form may contain terminals and non-terminals. May even be empty.
A sentence of G is a sentential form without non-terminals.
Language generated by a grammar is the set of its sentences.
Two grammars are equivalent if the languages generated by them are the same.
Left-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).
E =>* -(id+id)
E -E -(E) -(E + E) -(id + E) -(id + id)
If S =>* α by left-most derivation, we can say that α is a left-sentential form of G.
Right-most non-terminal in the sentential (ie, sentential form) is chosen as the next non-terminal to be replaced (always).
E =>* -(id+id)
E -E -(E) -(E + E) -(E + id) -(id + id)
If S =>* α by right-most derivation, we can say that α is a right-sentential form of G.
Right-most derivations are also known as canonical derivations. (WHY??)
E =>* -(id+id)
E
|
+--+--+
| |
- E
+--+--+
| | |
( E )
|
+--+--+
| | |
E + E
| |
id id
Parse tree is same for left-most and right-most derivations.
A grammar capable of producing more than one parse tree for the same sentence. ie, more than one left-most derivation or more than one right-most derivation for the same sentence.
For the grammar
E := E + E
| E * E
| ( E )
| id
id + id * id
can have two parse trees.
Left-most derivation:
E
=> E + E
=> id + E
=> id + E * E
=> id + id * E
=> id + id * id
E
=> E * E
=> E + E * E
=> id + E * E
=> id + id * E
=> id + id * id
Right-most derivation:
E
=> E + E
=> E + E * E
=> E + E * id
=> id + id * id
E
=> E * E
=> E * id
=> E + E * id
=> E + id * id
=> id + id * id
CFG generating balanced parentheses: S := ( S ) S | ε
Palindromes where Σ={a,b}: S := 0S0 | 1S1 | 0 | 1 | ε
Even number zeros and odd number of 1s:
:TODO:
2 zeros: Z -> 001 | 010 | 100
1 one: O -> 1
This is actually a regular language...
A → Aα | ε
in left recursion elimination?