# CS 131: Programming Languages and Compilers

ShanghaiTech University - Spring 2021

# Introduction

## Definition

A Compiler is "a program that takes a source-code program and translates it into an equivalent program in a target language."

## String

A String is a sequence of characters.

• Alphabet: a finite set of symbols (e.g., the ASCII characters)

• A String is also called a word or a sentence

• $st$ is the concatenation of $s$ and $t$

• $\epsilon$ is the empty string

• $s \epsilon = \epsilon s = s$

• $s^0 = \epsilon$

• $|s|$ is the length of $s$

• Suppose $s$ is banana:

• Prefix: e.g. ban, banana

• Suffix: e.g. banana, ana

• Substring: e.g. nan

• Subsequence: e.g. bnan, nn

A Language is a set of Strings over a fixed Alphabet $\Sigma$, constructed using a specific Grammar.

• e.g. $\{\varepsilon, 0,01,011,0111, \ldots\}$

• Not all Strings of chars in the Alphabet are in a certain Language; only those which satisfy the Grammar rules are.

• With Alphabet $\{0,1\}$ and Grammar rule $\mathrm{RE} = \epsilon + 01^{*}$, we can specify the above example Language, while e.g. $10$ is not in it.

Common operations on Languages: union ($L \cup M$), concatenation ($LM$), and Kleene closure ($L^{*} = \bigcup_{i \geq 0} L^{i}$).

A Grammar $G$ is a description of the method (rules) of how to construct a certain Language over a certain Alphabet.

• Type 0: Recursively Enumerable Grammar (Turing Machine)

• Type 1: Context-sensitive Grammar (CSG)

• Type 2: Context-free Grammar (CFG), mostly recursive

• Type 3: Right-linear Grammar / Regular Expressions (RE), non-recursive

Expressiveness: Type 0 > Type 1 > Type 2 > Type 3

# Lexical Analyzer

Reads the source program character by character and returns the tokens of the source program.

• Analyzes identifiers and inserts them into the symbol table

• Uses Regular Expressions to describe tokens

• Uses Finite Automata to recognize Tokens, completing lexical analysis

• Lexical analyzer = scanning + lexical analysis

## Token

Describes a pattern of characters having some meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on)

• e.g., identifiers, operators, numbers

A Lexeme is an instance of a Token, along with its attributes.

• e.g. the input character sequence 17 is a Lexeme of Token INT, with attribute Token.value = 17

## Regular Expression

• Each Regular Expression $r$ describes a Language $L(r)$

• $L(r)$ is called a regular set

• Regular Expressions are left-associative and matched from left to right

• A Regular Expression is a Type-3 Grammar rule

### Examples

• e.g.

• Integers:

• Digit = [0-9]

• Integer = Digit Digit*

• Identifiers:

• letter = [a-zA-Z]

• digit = [0-9]

• identifier = letter (letter + digit)*

• Whitespace:

• WS = ('\n' + '\t' + ' ')+

• Even binary numbers:

• 1(0+1)*0 | 0
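These token patterns can be sanity-checked with Python's `re` module (a quick sketch; the variable names are ad hoc, and Python's `|` plays the role of the `+` alternation above):

```python
# Quick sanity check of the token patterns above, using Python regexes.
import re

INTEGER    = re.compile(r'[0-9][0-9]*')           # Integer = Digit Digit*
IDENTIFIER = re.compile(r'[a-zA-Z][a-zA-Z0-9]*')  # letter (letter + digit)*
WS         = re.compile(r'[\n\t ]+')              # ('\n' + '\t' + ' ')+
EVEN_BIN   = re.compile(r'1[01]*0|0')             # even binary numbers

print(bool(INTEGER.fullmatch('2021')))     # True
print(bool(IDENTIFIER.fullmatch('x1')))    # True
print(bool(EVEN_BIN.fullmatch('10')))      # True
print(bool(EVEN_BIN.fullmatch('11')))      # False
```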

## Finite Automata

### Transition Diagram

• If the Lexeme is a keyword, return the Token of that keyword

• If it is an ID already in the symbol table, return that entry of the symbol table

• If the symbol table does not have this ID, insert it and return the new entry of the symbol table

```
switch (state) {
case 0:
    c = nextchar();
    if (c == [something]) {
        state = [state];
        lexeme_beginning++;
    } else if (c == [something]) {
        [do something];
    } else if (c == [final state]) {   // terminal
        retract(1);                    // move the forward pointer back
        lexical_value = install_num(); // install the lexeme's value
        return (NUM);
    } else {
        state = fail();
    }
}
```

### Finite automata

A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.

We call the recognizer of tokens a finite automaton.

Example

• Start Arrow

• State

• Transition Edge

• Accepting State: drawn as two concentric circles; accept and stop (state 3 in the example)

• Death State: the error state; all undefined transitions point to it

• Transition table (this example NFA recognizes (a|b)*abb):

| state | a | b |
| --- | --- | --- |
| 0 | {0,1} | {0} |
| 1 | -- | {2} |
| 2 | -- | {3} |

## NFA

Non-Deterministic Finite Automata (NFAs) easily represent regular expressions, but are non-deterministic: a state may have several possible moves on the same input symbol.

Accept $s$: some path from the start state to an Accepting State spells out $s$.

An NFA is a mathematical model that consists of:

• S, a finite set of states

• $\Sigma$, the symbols of the input alphabet

• move, a transition function

• move(state, symbol) $\to$ sets of states

• A state $s_0 \in S$, the start state

• $F \subseteq S$, a set of final or accepting states

• $\epsilon$-move: $\epsilon$-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol

The $\epsilon$-Closure of $S$ is $S \cup \{$ all States reachable from $S$ without consuming any input $\}$.

## DFA

Deterministic Finite Automata (DFAs) may require more States to represent a regular expression, but are deterministic and efficient to run.

Does not allow $\epsilon$-moves; for every $s \in S$, there is ONLY ONE decision for every input Symbol.

Accept $s$: the one and only path spelled out by $s$ ends at an Accepting State.

No $\epsilon$-closure needed!

## Implementation of Lexers

Each Token needs its own Recognizer. There are 4 construction routes ($r$ stands for an RE):

1. $r \to NFA \to Recognizer$

2. $r \to NFA \to DFA \to Recognizer$

3. $r \to DFA \to Recognizer$

4. $r \rightsquigarrow DFA \to Minimized ~ DFA \to Recognizer$

## RE to NFA

The algorithm is called Thompson's Construction.

There are some requirements on such construction:

• $N(s)$ and $N(t)$ CANNOT have any intersections

• REMEMBER to assign unique names to all states

Properties of the resulting NFA:

• Exactly 1 Start State & 1 Accepting State

• # of States in NFA $\leq 2 \times$ (# of Symbols + # of Operators) in $r$

• States do not have multiple outgoing edges with the same input symbol

• States have at most 2 outgoing $\epsilon$ edges

## RE to DFA

[Step 1]: We make the Augmented RE: concatenate the RE with a trailing symbol # (meaning "finish").

• e.g. (a+b)*a#

• Ensures at least one operator in the RE

[Step 2]: Build syntax tree for this Augmented RE:

• $\epsilon$, # and $a \in \Sigma$ all are at leaves

• All other operators are inner nodes

• Non-$\epsilon$ leaves get their position numbers, increasing from left to right

[Step 3]: Compute nullable(), firstpos() & lastpos() for ALL nodes.

1. firstpos(n): Function returning the set of positions where the first Symbol can be at, in the sub-RE rooted at n

2. lastpos(n): Function returning the set of Positions where the last Symbol can be at, in the sub-RE rooted at n

3. nullable(n): Function judging whether the sub-RE rooted at n can generate $\epsilon$

[Step 4]: Compute followpos() for Leaf positions

followpos(i): Function returning the set of positions which can follow position i in the generated String

Conduct a post-order depth-first traversal on the syntax tree, and do the following operations when leaving $\cdot$ (concatenation) and $*$ (star) nodes:

• $c_1 \cdot c_2$: For all $i \in lastpos(c_1)$, $followpos(i) = followpos(i) \cup firstpos(c_2)$

• $c^{*}$: For all $i \in lastpos(c)$, $followpos(i) = followpos(i) \cup firstpos(c)$
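Steps 3-4 can be sketched in Python for the classic augmented RE (a|b)*a# (positions 1-4 label the non-ε leaves; the tuple tree encoding here is ad hoc, not from the notes):

```python
# Tree nodes: ('leaf', pos), ('star', c), ('or', l, r), ('cat', l, r).
A1 = ('leaf', 1)   # a at position 1
B2 = ('leaf', 2)   # b at position 2
A3 = ('leaf', 3)   # a at position 3
H4 = ('leaf', 4)   # '#' at position 4
TREE = ('cat', ('cat', ('star', ('or', A1, B2)), A3), H4)   # (a|b)*a#

def nullable(n):
    k = n[0]
    if k == 'leaf': return False
    if k == 'star': return True
    if k == 'or':   return nullable(n[1]) or nullable(n[2])
    return nullable(n[1]) and nullable(n[2])                # cat

def firstpos(n):
    k = n[0]
    if k == 'leaf': return {n[1]}
    if k == 'star': return firstpos(n[1])
    if k == 'or':   return firstpos(n[1]) | firstpos(n[2])
    # cat: the right child's firstpos joins only if the left child is nullable
    return firstpos(n[1]) | (firstpos(n[2]) if nullable(n[1]) else set())

def lastpos(n):
    k = n[0]
    if k == 'leaf': return {n[1]}
    if k == 'star': return lastpos(n[1])
    if k == 'or':   return lastpos(n[1]) | lastpos(n[2])
    return lastpos(n[2]) | (lastpos(n[1]) if nullable(n[2]) else set())

def followpos(root):
    fp = {i: set() for i in (1, 2, 3, 4)}
    def walk(n):
        k = n[0]
        if k == 'leaf':
            return
        walk(n[1])
        if k != 'star':
            walk(n[2])
        if k == 'cat':                    # rule for c1 . c2
            for i in lastpos(n[1]):
                fp[i] |= firstpos(n[2])
        elif k == 'star':                 # rule for c*
            for i in lastpos(n[1]):
                fp[i] |= firstpos(n[1])
    walk(root)
    return fp

fp = followpos(TREE)
print(firstpos(TREE))   # {1, 2, 3}: this set becomes the DFA start state S0
```

This reproduces the textbook result: followpos(1) = followpos(2) = {1, 2, 3} and followpos(3) = {4}.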

[Step 5]: Construct the DFA.

```
void construct() {
    S0 = firstpos(root);
    DStates = {(S0, unmarked)};
    while (DStates has an unmarked State U) {
        Mark State U;
        for (each possible input char c) {
            V = {};
            for (each position p in U whose symbol is c)
                V = union of V and followpos(p);
            if (V is not empty) {
                if (V is not in DStates)
                    Include V in DStates, unmarked;
                Add the Transition U --c--> V;
            }
        }
    }
}
```

• A State $S$ in the resulting DFA is an Accepting State iff the position of the # node $\in S$

• Start State of the resulting DFA is $S_0$

### Calculating the ε-Closure

```
set epsClosure(set S) {
    for (each State s in S)
        Push s onto stack;
    closure = S;
    while (stack is not empty) {
        Pop State u;
        for (each State v such that u -> v is an epsilon Transition) {
            if (v is not in closure) {
                Include v in closure;
                Push v onto stack;
            }
        }
    }
    return closure;
}
```

### Implement NFA as Recognizer

```
bool recognizer() {
    S = epsClosure({s0});
    while ((c = getchar()) != EOF)
        S = epsClosure(move(S, c));
    if (S and F have intersections)
        return ACCEPT;
    return REJECT;
}
```

Performance of NFA-type Recognizers: Space $O(|r|)$; Time $O(|r| \times |s|)$

### Implement DFA as Recognizer

```
bool recognizer() {
    s = s0;
    while ((c = getchar()) != EOF)
        s = move(s, c);
    if (s is in F)
        return ACCEPT;
    return REJECT;
}
```

Performance of DFA-type Recognizers: Space $O(2^{|r|})$; Time $O(|s|)$
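As a runnable sketch of this loop, here is a hypothetical hand-built DFA (assumed to match the earlier even-binary-number RE 1(0+1)*0 | 0; the state names are ad hoc):

```python
# states: q0 = start; q1 = seen a leading 1; q2 = accepted the lone "0";
# q3 = number starting with 1 and currently ending in 0 (accepting).
MOVE = {
    ('q0', '0'): 'q2', ('q0', '1'): 'q1',
    ('q1', '0'): 'q3', ('q1', '1'): 'q1',
    ('q3', '0'): 'q3', ('q3', '1'): 'q1',
    # every undefined transition (e.g. out of q2) goes to a death state
}
ACCEPTING = {'q2', 'q3'}

def recognize(s):
    state = 'q0'
    for c in s:
        state = MOVE.get((state, c), 'dead')   # s = move(s, c)
    return state in ACCEPTING                  # s is in F?

print(recognize('10'))    # True  (2 in binary)
print(recognize('11'))    # False (3 in binary)
```

Each input character causes exactly one table lookup, which is where the $O(|s|)$ running time comes from.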

### Convert NFA to DFA

The algorithm is called Subset Construction, since each subset of States of the original NFA becomes a single State of the resulting DFA.

```
void subsetConstruction() {
    S0 = epsClosure({s0});
    DStates = {(S0, unmarked)};
    while (DStates has any unmarked State U) {
        Mark State U;
        for (each possible input char c) {
            V = epsClosure(move(U, c));
            if (V is not empty) {
                if (V is not in DStates)
                    Include V in DStates, unmarked;
                Add the Transition U --c--> V;
            }
        }
    }
}
```
• A State $S$ in resulting DFA is an Accepting State iff $\exists s \in S, s$ is an Accepting State in original NFA

• Start State of the resulting DFA is $S_0$
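A runnable sketch of this algorithm, applied to the earlier transition-table NFA (states 0-3, accepting state 3, which appears to recognize (a|b)*abb). That NFA has no ε-moves, so epsClosure is the identity here:

```python
# The NFA from the transition table earlier; missing entries mean "no move".
NFA_MOVE = {
    (0, 'a'): {0, 1}, (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}

def subset_construction(start, accepting, alphabet):
    S0 = frozenset({start})                  # epsClosure({s0}), no eps-moves
    dstates, unmarked, trans = {S0}, [S0], {}
    while unmarked:
        U = unmarked.pop()                   # mark State U
        for c in alphabet:
            V = frozenset(t for s in U for t in NFA_MOVE.get((s, c), set()))
            if V:
                if V not in dstates:         # include V in DStates, unmarked
                    dstates.add(V)
                    unmarked.append(V)
                trans[(U, c)] = V            # add the Transition U --c--> V
    accept = {S for S in dstates if S & accepting}
    return S0, dstates, trans, accept

S0, dstates, trans, accept = subset_construction(0, {3}, 'ab')
print(len(dstates))   # 4 DFA states: {0}, {0,1}, {0,2}, {0,3}
```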

## Minimize DFA

```
void minimize() {
    PI = {G_A, G_N};   // initial partition: accepting vs. non-accepting states
    do {
        for (every group G in PI) {
            for (every pair of States (s, t) in G) {
                if (for every possible input char c, transitions s --c--> and t --c--> go to states in the same group)
                    s, t stay in the same subgroup;
                else
                    s, t split into different subgroups;
            }
            Split G according to the above information;
        }
    } while (PI changed in this iteration);
    Every Group in PI is a State in the minimal DFA;
}
```
• A State S in the minimal DFA is an Accepting State iff $\exists s \in S$, s is an Accepting State in original DFA

• Start State of the minimal DFA is the one containing original Starting State
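A minimal runnable sketch of this partition-refinement loop, on a small hypothetical DFA over {a, b} in which states 1 and 2 turn out to be equivalent (both mean "the input currently ends with b"):

```python
STATES    = {0, 1, 2}
ACCEPTING = {1, 2}
DELTA = {
    (0, 'a'): 0, (0, 'b'): 1,
    (1, 'a'): 0, (1, 'b'): 2,
    (2, 'a'): 0, (2, 'b'): 2,
}

def minimize(states, alphabet, delta, accepting):
    # PI starts as {accepting group, non-accepting group}
    partition = [frozenset(accepting), frozenset(states - accepting)]
    partition = [g for g in partition if g]
    while True:
        group_of = {s: i for i, g in enumerate(partition) for s in g}
        refined = []
        for g in partition:
            # split g by where each state's transitions land, group-wise
            buckets = {}
            for s in g:
                sig = tuple(group_of[delta[(s, c)]] for c in sorted(alphabet))
                buckets.setdefault(sig, set()).add(s)
            refined.extend(frozenset(b) for b in buckets.values())
        if len(refined) == len(partition):   # PI unchanged: done
            return refined
        partition = refined

groups = minimize(STATES, 'ab', DELTA, ACCEPTING)
print(len(groups))   # 2: states 1 and 2 merge into one minimal-DFA state
```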

# Syntax Analyzer

## Difference with Lexical Analyzer

• Both do similar things, but the lexical analyzer deals with the simple, non-recursive constructs of the language, while the syntax analyzer deals with the recursive constructs.

• The lexical analyzer recognizes the smallest meaningful units （tokens） in a source program.

• The syntax analyzer works on the smallest meaningful units （tokens） in a source program to recognize meaningful structures （sentences） in our programming language.

## Parse Tree Abstraction

A Parse Tree / Syntax Tree is a graphical representation of the structure of a program, where leaf nodes are Tokens.

## CFG

A Context-free Grammar (CFG) is a Type-2 Grammar rule, which serves the construction of a Parse Tree from a stream of Tokens. We use a set of Production Rules to characterize a CFG.

A Terminal is a Token; a Non-terminal is a syntactic variable.

• The Start Symbol is the first of the Non-terminals; it usually represents the whole program

• A Sentence is a string $s$ of Terminals such that Start Symbol $S \Rightarrow^{+} s$

A Production Rule is a law of production, from a Non-terminal to a sequence of Terminals & Non-terminals.

• e.g. $A \rightarrow \alpha A \mid \beta$, where $A$ is a Non-terminal and $\alpha, \beta$ are strings of Terminals and/or Non-terminals

• May be recursive

• The procedure of applying these rules to get a sentence of Terminals is called a Derivation; each intermediate string of symbols is a Sentential Form

$|$Context-free Languages$| > |$Regular Languages$|$; e.g. $\{ (^{i} )^{i} : i \geq 0 \}$ is context-free but not regular.

## Derivation Directions & Ambiguity

Left-most Derivation $(\Rightarrow_{lm})$ means to replace the leftmost Non-terminal at each step.

• If $\beta A \gamma \Rightarrow_{lm} \beta \delta \gamma$, then there are NO Non-terminals in $\beta$

• Corresponds to Top-Down Parsing

Right-most Derivation $(\Rightarrow_{rm})$ means to replace the rightmost Non-terminal at each step.

• If $\beta A \gamma \Rightarrow_{rm} \beta \delta \gamma$, then there are NO Non-terminals in $\gamma$

• Corresponds to Bottom Up Parsing, in reversed manner

A CFG is Ambiguous when it produces more than one Parse Tree for the same sentence. Ambiguity must be removed for a practical CFG, by:

• Enforcing Precedence and Associativity

• e.g. if $* > +$, then $+$ gets expanded first (nearer the root), so $*$ binds tighter

• Rewriting the Grammar
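A standard rewrite for the ambiguous expression grammar $E \to E + E \mid E * E \mid (E) \mid id$ (the usual textbook form, shown here as a sketch) bakes in the precedence $* > +$ and left-associativity of both operators:

```
E -> E + T | T      (+ expanded first, nearest the root)
T -> T * F | F      (* binds tighter)
F -> ( E ) | id
```

Note that this grammar is left-recursive, so it still needs left-recursion elimination before top-down parsing.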

## Top-Down Parsers

Construction of the parse tree starts at the root, and proceeds towards the leaves.

• Recursive Predictive Parsing

• Non-Recursive Predictive Parsing (LL Parsing) (L: left-to-right scan; L: leftmost derivation)

• Weaker in the class of grammars it can handle

### Implement

1. Eliminate Left Recursion $\to$ Recursive-descent Parsing

2. Eliminate Left Recursion $\to$ Left Factoring $\to$ Recursive Predictive Parsing

3. Eliminate Left Recursion $\to$ Left Factoring $\to$ Construct Parsing Table $\to$ Non-recursive Predictive Parsing

### Left Recursion Elimination

$A \Rightarrow^{+} A\alpha$: Left Recursion

• Top Down Parsing CANNOT handle Left-recursive Grammars

• Can be eliminated by rewriting

For Immediate Left Recursions (Left Recursion that may appear in a single step), eliminate by:

```
/* Non-terminals arranged in order: A1, A2, ..., An. */
void eliminate() {
    for (i from 1 to n) {
        for (j from 1 to i-1)
            Replace Aj with its productions in every Production Rule Ai -> Aj ...;
        Eliminate Immediate Left Recursions Ai -> Ai ...;
    }
}
```
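As a worked instance of the rewriting (the standard textbook expression-grammar case, not from the original notes), the immediately left-recursive rule

```
E -> E + T | T
```

becomes, by introducing a fresh Non-terminal E':

```
E  -> T E'
E' -> + T E' | epsilon
```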

### Implementing Recursive-descent Parsing

```
/* Example grammar:
 *   E -> T | T + E
 *   T -> int | int * T | ( E )
 */
bool term(TOKEN tok) { return *ptr++ == tok; }

bool E1() { return T(); }
bool E2() { return T() && term(PLUS) && E(); }
bool E() {
    TOKEN *save = ptr;
    return (ptr = save, E1()) || (ptr = save, E2());
}

bool T1() { return term(INT); }
bool T2() { return term(INT) && term(TIMES) && T(); }
bool T3() { return term(OPEN) && E() && term(CLOSE); }
bool T() {
    TOKEN *save = ptr;
    return (ptr = save, T1()) || (ptr = save, T2()) || (ptr = save, T3());
}
```
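The same backtracking scheme can be sketched in runnable Python (hypothetical token strings stand in for the TOKEN stream; the alternatives are reordered longest-first so that e.g. `int + int` is not cut short by the bare `T` alternative):

```python
# Backtracking recursive-descent parser for the toy grammar
#   E -> T + E | T          T -> int * T | ( E ) | int
# `pos` is the input cursor, saved/restored around each alternative,
# like `TOKEN *save = ptr` in the C sketch.

tokens, pos = [], 0

def term(tok):
    global pos
    if pos < len(tokens) and tokens[pos] == tok:
        pos += 1
        return True
    return False

def E():
    global pos
    save = pos
    if T() and term('+') and E():            # E -> T + E
        return True
    pos = save
    return T()                               # E -> T

def T():
    global pos
    save = pos
    if term('int') and term('*') and T():    # T -> int * T
        return True
    pos = save
    if term('(') and E() and term(')'):      # T -> ( E )
        return True
    pos = save
    return term('int')                       # T -> int

def parse(toks):
    global tokens, pos
    tokens, pos = toks, 0
    return E() and pos == len(tokens)        # must consume all input

print(parse(['int', '+', 'int']))            # True
print(parse(['(', 'int', '*', 'int', ')']))  # True
print(parse(['int', '+']))                   # False
```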

### Left Factoring: Produce LL(1) Grammar

LL(1) means only 1 Token of look-ahead is needed to decide which Production Rule to expand.

To left-factor a Grammar towards LL(1), for each Non-terminal $A$ with a common prefix $A \to \alpha \beta_1 \mid \alpha \beta_2$, rewrite it as $A \to \alpha A'$ and $A' \to \beta_1 \mid \beta_2$.

$\|$ LL(1) $\| < \|$ CFG $\|$, so not all Grammars can be converted to LL(1).

• Such a Grammar will have an entry with multiple Production Rules in the Parsing Table, and is thus inappropriate for Predictive Parsing.

### Implementing Recursive Predictive Parsing

For this part, it is strongly suggested to read https://www.josehu.com/assets/file/compilers.pdf for better understanding.

### Parsing Table Construction

For this part, it is strongly suggested to read https://www.josehu.com/assets/file/compilers.pdf for better understanding.

### Implementing LL(1) Parsing

For this part, it is strongly suggested to read https://www.josehu.com/assets/file/compilers.pdf for better understanding.

## Bottom-Up Parsers

Construction of the parse tree starts at the leaves, and proceeds towards the root.

• Bottom-up parsing is also known as shift-reduce parsing

• LR Parsing – a much more general form of shift-reduce parsing: LR, SLR, LALR (L: left-to-right scan; R: rightmost derivation)

For this part, it is strongly suggested to read https://www.josehu.com/assets/file/compilers.pdf for better understanding.