Dana Vrajitoru 
C311 Programming Languages 
Parsers
Parsing
- The most important part of any compiler.
 - It relies on a context-free grammar.
 - The input consists of code tokens.
 - If n is the number of tokens in the code, then a typical parser
can compile a program in O(n3) time.
 - Some particular grammars can be parsed in linear time: LL (left to
right, left-most derivation) and LR (left to right, right-most
derivation).
 - LL parsers are top-down (predictive), while LR parsers are
bottom-up (constructive).
 
LL and LR Grammars
- An LL grammar is a grammar that can be recognized by an LL parser.
 - Notation: LL(k) is an LL grammar for which the parser needs to
scan ahead k terminal symbols in the string to recognize. A similar
notation is used for LR(k).
 - LL grammars are more general than LR. There exist grammars that
are not LL(k) for any given k.
 
Example
----- A: LL(1)--------
P -> ( P )
P -> R
R -> [ R ]
R -> e
----- B: non-LL--------
P -> ( P ) 
P -> P P
P -> e
----- C: LL(2)--------
R -> [ R ]
R -> [ ]
R -> P
P -> ( P )
P -> ( R )
P -> e
Languages Accepted by the Above Grammars
- A: (n [m ]m )n, where
n, m >= 0.
  
In other words, a sequence of parentheses followed by a sequence of
brackets, followed by the same number of closed brackets and
parantheses in reverse order.
Example: ( ( ( [ [ ] ] ) ) )
 - B: a1 a2 a3 ... an where 
ai is either ( or ),
for any i<n, count{ak = (, for k <= i} >=
count{ak = ), for k <= i}, (or the count of left parentheses
seen so far should be larger than or equal to the count of right
parenteses at any intermediate point in the string) and
count{ak = (, for k <= n} = count{ak = ), for k
<= n} (or the total count of left parentheses in the whole string
should be equal to the total count of right parenteses in the string).
In other words, a well formed sequence of parentheses where we never
close more parentheses than we have opened and where all opened
parentheses are closed by the end of the string.
Example: ( ( ( ) ) ( ) ) ( ( ) )
 - C: a1 a2 a3 ... an
bn bn-1 ... b2 b1 where
n>=1 and
 ai is either ( or [ , bi is either
) or ] and 
 if ai = (, then bi = ) otherwise
bi = ] (in other words, the parentheses and brackets should
be closed in reverse order than they were open).
  
In other words, a sequence of any combination of opened parentheses
and brackets followed by the same number of closed parentheses and
brackets in reverse order.
Example: [ ( ( [ ( [ ] ) ] ) ) ]
 
LL Parsing
- The parser starts with the root of the tree.
 - At every step, it will try to expand the left-most non-terminal
active symbol (let's say S).
 - The parser analyzes the next terminal symbol in the string (a) and
tries to make a prediction about the rule to be applied. 
S =>a b c T
 - It must match all terminal symbols to the left of the non-terminal
in the production rule (the string must contain a b c). The active
non-terminal symbol is replaced with the one to the right (T).
 - The process stops when the parsed string is composed only of
terminal symbols.
 
More Examples LL Grammars
A Non-LL Grammar
 This grammar is non-LL because to
choose between the two rules expanding UN one must find if there is a
preiod in the string or not. For that the parser must search forward
through all the digits. No matter what constant k we choose, it is
possible to have a number having more than k digits to the left of the
period, in which case the parser would fail to make a decision. So the
grammar is not LL(k) for any constant k, and so it is not an LL
grammar.
N => { + | - | e } UN
UN => UI . UI
UN => UI
UI => digit UI
UI => digit
LL - LR Parsing
- An example of LL(1) and LR(2) grammars to recognize a list of
identifiers in C++:
 - LL:
id_list => id id_list_tail
id_list_tail => , id id_list_tail
id_list_tail=> ;
 - LR:
id_list => id_list_prefix ;
id_list_prefix => id_list_prefix , id
id_list_prefix => id
 
Example
- Parsing the following list:   n, m, size, result;
 - Starting with the id_list.
 - Applying the first rule
id_list =>n id_list_tail
 - Then we find a "," in the string which means that we apply the
second rule:
n id_list_tail =>n, m id_list_tail
 - And so on until we find the ";" which prompts us to apply the
third rule:
n, m, size, result id_list_tail =>n, m, size, result;
 
Parse Tree
LR Parsing
- A LR parser is bottom-up.
 - It starts by scanning the string and storing partially-completed
subtrees in a list.
 - As soon as it notices that the right hand side of the list
corresponds to sequence to the right hand side of a rule, it replaces
that sequence with the non-terminal symbol to the left of the rule.
 - The algorithm continues until we arrive at the starting
non-terminal, when the tree becomes complete.
 
Example Bottom-Up
- It starts from the left side and asks what subtreen could be part
of. Since it’s an id, it marks it as an IDLP(id(n)).
 - Then it examines the next terminal, the comma, and asks what rule
could follow an IDLP with a comma. So it builds another
subtree IDLP(IDLP(id(n)) , id(m)).
 - It continues this way adding things on top of the partial tree
that was constructed this far until the end of the input.
 - It is called the right-most derivation because the root of the
tree starts with the terminals to the right. The processing of the
input string is still left to right.
 
Example LR
- Starting with the processing list
( n , m , size , result ; )
 - We identify n as an id that can be a subtree of IDLP
(is_list_prefix):
( (IDLP n) , m , size , result ; )
 - Then we notice that the combination of IDLP , m is the subtree of
another IDLP based on the second rule:
( (IDLP ( IDLP n) , m) , size , result ; )
 - After repeating these steps the final result is:
 
(IDL (IDLP (IDLP (IDLP (IDLP n) , m) , size) , result) ; )
 
Parse Tree
Backus-Naur Notation
- BN notation used to describe the language Algol (1960).
 - The Backus-Naur Form of a grammar is a convenient way to describe it.
 - It condenses the rules by separating the alternatives with "|".
id_list_tail => , id id_list_tail | ;
 - Kleene star or plus are used to denote repetition:
id_list => id ( , id) * ;
UN => (digit)+
 - * means that the content of the parenthesis is repeated 0 or any number of times.
 - + means that the content of the parenthesis is repeated one or more times.
 
Arithmetic Expression
expr => term term_tail				1
term_tail => add_op term term_tail | e		2|3
term => factor factor_tail				4
factor_tail => mult_op factor factor_tail | e	5|6
factor => ( expr ) | id | number			7|8|9
add_op => + | -					10|11
mult_op => * | /					12|13
Discriminating Terminals
- expr : no need to check a terminal
 - term_tail : 
+ or -  -> the first rule, 
anything else 	-> the second
 - term : one rule, no need to check
 - factor_tail :
* or / 		-> the first rule
anything else 	-> the second
 - factor :
( 			-> the first rule
id or number 	-> the two others
anything else 	-> error
 - add_op, mult_op: +, -, *, / ok, anything else error
 - This is LL(1).
 
Parsing Process
2 + a * ( 5 – b )
E ->T  TT ->F  FT  TT -> (F 2) FT  TT -> (F 2) TT ->
(F 2) AO  T  TT -> (F 2) (AO +) T  TT -> (F 2) (AO +) F  FT  TT ->
(F 2) (AO +) (F a) FT  TT -> (F 2) (AO +) (F a) MO  F  FT  TT ->
(F 2) (AO +) (F a) (MO *)  F  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (E)  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  (F FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) TT)  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) AO T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) T TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) F FT TT )  FT  TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b) FT TT ) FT TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b) TT) FT TT ->
(F 2) (AO +) (F a) (MO *)  ( (F 5) (AO -) (F b)  )
Parse Tree