# CS 131: Programming Languages and Compilers

ShanghaiTech University - Spring 2021

## Compiler

A **Compiler** is "a program that takes a source-code program and translates it into an equivalent program in a target language."

## String

A **String** is a sequence of characters.

- Alphabet: a finite set of symbols (e.g., the ASCII characters)
- Strings form words and sentences
- $st$ is the concatenation of $s$ and $t$
- $s\epsilon = \epsilon s = s$
- $\epsilon$ is the empty string
- $s^0 = \epsilon$
- $|s|$ is the length of $s$
- Suppose $s$ is `banana`:
  - **Prefix (前缀):** ban, banana
  - **Suffix (后缀):** banana, ana
  - **Substring (子字符串):** e.g. nan
  - **Subsequence (子序列):** bnan, nn

## Language

A **Language** is a set of Strings over a fixed **Alphabet** $\Sigma$, constructed using a specific Grammar.

- e.g. $\{\varepsilon, 0, 01, 011, 0111, \ldots\}$
- Not all Strings of chars over the Alphabet are in a given Language; only those that satisfy the Grammar rules are.
- With Alphabet $\Sigma = \{0, 1\}$ and the Grammar rule $\mathrm{RE} = 01^{*}$, we can specify the example Language above, while a string such as $10$ is not in it.

Below are the **Operations** on Languages, with examples.

*(Figure: Operations of Languages)*

*(Figure: Example of Language Operations)*
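As an illustration (not from the lecture; the helper names are mine), the Language operations can be sketched over finite string sets in Python, approximating the Kleene closure $L^*$ up to a bounded power:

```python
def union(L, M):
    """L ∪ M"""
    return L | M

def concat(L, M):
    """LM = { st : s in L, t in M }"""
    return {s + t for s in L for t in M}

def kleene_upto(L, k):
    """Approximate L* by accumulating the powers L^0 .. L^k."""
    result = {""}        # L^0 = { epsilon }
    power = {""}
    for _ in range(k):
        power = concat(power, L)
        result |= power
    return result

print(sorted(concat({"0"}, {"1", "11"})))   # ['01', '011']
print(sorted(kleene_upto({"1"}, 3)))        # ['', '1', '11', '111']
```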

## Grammar

A **Grammar** $G$ is a description of the method (*rules*) for constructing a certain Language over a certain Alphabet.

- Type 0: Recursively Enumerable Grammar (recognized by a Turing Machine)
- Type 1: Context-sensitive Grammar (CSG)
- Type 2: Context-free Grammar (CFG, 上下文无关文法), *mostly recursive*
- Type 3: Right-linear Grammar / Regular Expressions (RE, 正则表达式), *non-recursive*

Expressiveness: Type 0 > Type 1 > Type 2 > Type 3

## Lexical Analyzer

The Lexical Analyzer reads the source program character by character and returns the **tokens** of the source program.

- Analyzes identifiers and puts them into the symbol table
- Uses regular expressions to describe tokens
- Uses finite automata to recognize Tokens, completing lexical analysis
- Lexical analyzer = scanning + lexical analysis

A **Token** describes a pattern of characters having some meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters, and so on).

- e.g., identifiers, operators, numbers

A **Lexeme (词素)** is an instance of a Token, along with its unique attributes.

- e.g. `17`
  - Token: INT
  - Token.value = 17

During the process of determining a Token, a pair of buffers is usually used to avoid the problem of a single buffer being too small.

We use **Regular Expressions** to describe and match **Tokens**.

- Each **Regular Expression** $r$ describes a language $L(r)$
- $L(r)$ is called a **regular set**
- **Regular Expressions** are left-associative and are matched from left to right
- A **Regular Expression** is a **Type-3 Grammar Rule**

*(Figures: Properties; Extended Properties)*

Examples:

- Integers:
  - Digit = [0-9]
  - Integer = Digit Digit*
- Identifiers:
  - letter = [a-zA-Z]
  - identifier = letter (letter + digit)*
- Whitespace:
  - WS = ('\n' + '\t' + ' ')+
- Even binary numbers:
  - 1(0+1)*0 | 0
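The example patterns above translate almost directly into Python's `re` syntax; note that the '+' used for alternation in the lecture notation becomes '|' in `re`. A sketch (the variable names are mine):

```python
import re

DIGIT = r"[0-9]"
INTEGER = re.compile(rf"{DIGIT}{DIGIT}*")      # Digit Digit*
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")    # letter (letter + digit)*
WS = re.compile(r"[\n\t ]+")                   # ('\n' + '\t' + ' ')+
EVEN_BIN = re.compile(r"1[01]*0|0")            # even binary numbers

print(bool(EVEN_BIN.fullmatch("10110")))   # True: ends in 0
print(bool(EVEN_BIN.fullmatch("101")))     # False: odd number
```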

*(Figure: Transition Diagram Examples)*

When a Token is recognized:

- If it is a keyword, return the **Token of the keyword**
- If it is an ID already in the symbol table, return its **entry of the symbol table**
- If the ID is not yet in the symbol table, insert it and return the new **entry of the symbol table**

```c
switch (state) {
  case 0:
    c = nextchar();
    if (c == [something]) {
      state = [state];
      lexeme_beginning++;
    } else if (c == [something]) {
      [do something]
    } else if (c == [final state]) {    // terminal
      retract(1);                       // move the forward pointer back
      lexical_value = install_num();    // install something
      return (NUM);
    } else {
      state = fail();
    }
}
```

A **recognizer** for a language is a program that takes a string x and answers "yes" if x is a sentence of that language, and "no" otherwise. We call the recognizer of the tokens a **finite automaton**.

**Example.** Regular expression: (a+b)*abb

- **Start Arrow**, **State**, **Transition Edge**
- **Accepting State**: drawn as concentric circles; accepts and ends (state 3 here)
- **Death State**: an error state; undefined transitions point to it
- Transition table:

| state | a     | b   |
|-------|-------|-----|
| 0     | {0,1} | {0} |
| 1     | --    | {2} |
| 2     | --    | {3} |

Note that the empty string can be accepted by some automata.
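The transition table above can be simulated directly. A minimal sketch in Python (the dictionary encoding of the table is mine):

```python
# NFA for (a+b)*abb, taken from the transition table above.
MOVE = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = {0}, {3}

def nfa_accepts(s):
    states = set(START)
    for c in s:
        # Follow every possible transition from every current state.
        states = set().union(*(MOVE.get((q, c), set()) for q in states))
    return bool(states & ACCEPT)

print(nfa_accepts("aabb"))   # True: matches (a+b)*abb
print(nfa_accepts("abab"))   # False: does not end in abb
```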

## Non-Deterministic Finite Automata (NFAs)

NFAs **easily** represent regular expressions, but are **less precise**.

**Accept s:** there is a path to an Accepting State that spells out **s**.

An NFA is a mathematical model that consists of:

- $S$, a **finite** set of **states**
- $\Sigma$, the symbols of the **input alphabet**
- *move*, a **transition function**: move(state, symbol) $\to$ set of states
- A state $s_0 \in S$, the **start state**
- $F \subseteq S$, a set of **final** or **accepting states**
- $\epsilon$-**moves**: $\epsilon$-**transitions** are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol

The $\epsilon$-closure of $S$ is $S \cup \{$ all states reachable from $S$ without consuming any input $\}$.

## Deterministic Finite Automata (DFAs)

DFAs require **more complexity** to represent regular expressions, but offer **more precision**.

- **No $\epsilon$-moves allowed:** for every $s \in S$, there is ONLY ONE decision for every input *symbol*
- **Accept s:** an Accepting State is reached by spelling out **s**, along ONE and ONLY ONE *path*
- No $\epsilon$-closure needed!

## One Token, A Recognizer

There are 4 ways (r stands for RE):

1. $r \to NFA \to Recognizer$
2. $r \to NFA \to DFA \to Recognizer$
3. $r \to DFA \to Recognizer$
4. $r \rightsquigarrow DFA \to Minimized\ DFA \to Recognizer$

The RE $\to$ NFA algorithm is called **Thompson's Construction**. There are some requirements on the construction:

- $N(s)$ and $N(t)$ CANNOT have any intersection
- REMEMBER to assign unique names to all states

Properties of the resulting NFA:

- Exactly 1 Start State & 1 Accepting State
- # of States in NFA $\leq 2 \times$ (# of Symbols + # of Operators) in $r$
- States do not have multiple outgoing edges with the same input symbol
- States have at most 2 outgoing $\epsilon$ edges
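Thompson's Construction can be sketched as fragment constructors that are combined bottom-up; this is an illustrative Python version (the fragment representation and helper names are assumptions, not from the notes). Each fragment is a triple (start, accept, edges), where an edge key of `None` marks an $\epsilon$-transition:

```python
from itertools import count

fresh = count()   # unique state names, as the construction requires

def symbol(a):
    s, t = next(fresh), next(fresh)
    return (s, t, {(s, a): {t}})

def merge(*edge_dicts):
    out = {}
    for d in edge_dicts:
        for k, v in d.items():
            out.setdefault(k, set()).update(v)
    return out

def concat(f, g):
    # Link f's accept state to g's start state with an epsilon edge.
    return (f[0], g[1], merge(f[2], g[2], {(f[1], None): {g[0]}}))

def union(f, g):
    s, t = next(fresh), next(fresh)
    eps = {(s, None): {f[0], g[0]}, (f[1], None): {t}, (g[1], None): {t}}
    return (s, t, merge(f[2], g[2], eps))

def star(f):
    s, t = next(fresh), next(fresh)
    eps = {(s, None): {f[0], t}, (f[1], None): {f[0], t}}
    return (s, t, merge(f[2], eps))

def eps_closure(states, edges):
    stack, closure = list(states), set(states)
    while stack:
        u = stack.pop()
        for v in edges.get((u, None), ()):
            if v not in closure:
                closure.add(v)
                stack.append(v)
    return closure

def accepts(nfa, s):
    start, accept, edges = nfa
    cur = eps_closure({start}, edges)
    for c in s:
        nxt = set()
        for q in cur:
            nxt |= edges.get((q, c), set())
        cur = eps_closure(nxt, edges)
    return accept in cur

# Build the NFA for (a+b)*abb and try it out.
nfa = concat(concat(concat(star(union(symbol("a"), symbol("b"))),
                           symbol("a")), symbol("b")), symbol("b"))
print(accepts(nfa, "abb"))    # True
print(accepts(nfa, "abab"))   # False
```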

To construct a DFA directly from an RE:

**[Step 1]**: Make the Augmented RE: concatenate the RE with the symbol # (meaning "finish").

- e.g. (a+b)*a#
- Ensures at least one operator in the RE

**[Step 2]**: Build the syntax tree for this Augmented RE:

- $\epsilon$, # and $a \in \Sigma$ all sit at leaves
- All other operators are inner nodes
- Non-$\epsilon$ leaves get position numbers, increasing from left $\to$ right

**[Step 3]**: Compute `nullable()`, `firstpos()` & `lastpos()` for ALL nodes.

1. `firstpos(n)`: returns the set of positions where the *first* symbol can be, in the *sub-RE* rooted at `n`
2. `lastpos(n)`: returns the set of positions where the *last* symbol can be, in the *sub-RE* rooted at `n`
3. `nullable(n)`: returns whether the *sub-RE* rooted at `n` can generate $\epsilon$

**[Step 4]**: Compute `followpos()` for leaf positions.

`followpos(i)`: returns the set of positions *which can follow* position i in a generated string.

Conduct a *post-order depth-first traversal* on the syntax tree, and do the following operations when leaving $\cdot$ / $*$ nodes:

- $c_{1} \cdot c_{2}$: for all $i \in$ `lastpos(c1)`, `followpos(i)` = `followpos(i)` $\cup$ `firstpos(c2)`
- $c^{*}$: for all $i \in$ `lastpos(c)`, `followpos(i)` = `followpos(i)` $\cup$ `firstpos(c)`
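Steps 3 and 4 can be checked on the running example (a+b)*a#, with positions 1 = a, 2 = b, 3 = a, 4 = #. A sketch with a hand-built syntax tree (the tuple encoding is mine):

```python
from collections import defaultdict

followpos = defaultdict(set)

def leaf(pos):
    return (False, {pos}, {pos})          # (nullable, firstpos, lastpos)

def alt(l, r):                            # the '+' (union) node
    return (l[0] or r[0], l[1] | r[1], l[2] | r[2])

def cat(l, r):                            # the concatenation node
    for i in l[2]:                        # lastpos(c1) -> firstpos(c2)
        followpos[i] |= r[1]
    first = l[1] | r[1] if l[0] else l[1]
    last = l[2] | r[2] if r[0] else r[2]
    return (l[0] and r[0], first, last)

def star(c):                              # the '*' node
    for i in c[2]:                        # lastpos(c) -> firstpos(c)
        followpos[i] |= c[1]
    return (True, c[1], c[2])

# (a+b)*a#  with positions 1=a, 2=b, 3=a, 4='#'
root = cat(cat(star(alt(leaf(1), leaf(2))), leaf(3)), leaf(4))
print(root)             # nullable=False, firstpos={1,2,3}, lastpos={4}
print(dict(followpos))  # followpos(1)=followpos(2)={1,2,3}, followpos(3)={4}
```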

**[Step 5]**: Construct the DFA.

```c
void construct() {
    S0 = firstpos(root);
    DStates = {(S0, unmarked)};
    while (DStates has an unmarked State U) {
        Mark State U;
        for (each possible input char c) {
            V = {};
            for (each position p in U whose symbol is c)
                V = Union of V and followpos(p);
            if (V is not empty) {
                if (V is not in DStates)
                    Include V in DStates, unmarked;
                Add the Transition U --c--> V;
            }
        }
    }
}
```

- A State $S$ in the resulting DFA is an Accepting State iff the # node $\in S$
- The Start State of the resulting DFA is $S_0$

```c
set epsClosure(set S) {
    for (each State s in S)
        Push s onto stack;
    closure = S;
    while (stack is not empty) {
        Pop State u;
        for (each State v such that u -> v is an epsilon Transition) {
            if (v is not in closure) {
                Include v in closure;
                Push v onto stack;
            }
        }
    }
    return closure;
}
```
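The pseudocode above maps directly to Python. A sketch, assuming epsilon edges are stored as a dict from state to successor set (the encoding is mine):

```python
def eps_closure(S, eps):
    """Worklist version of epsClosure: eps maps state -> epsilon successors."""
    stack = list(S)
    closure = set(S)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in closure:
                closure.add(v)      # include v in closure
                stack.append(v)     # and explore v's epsilon edges too
    return closure

eps = {0: {1, 7}, 1: {2, 4}, 6: {1, 7}}   # a fragment of a typical NFA
print(eps_closure({0}, eps))               # contains 0, 1, 2, 4, 7
```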

```c
/* NFA-type recognizer */
bool recognizer() {
    S = epsClosure({s0});
    while ((c = getchar()) != EOF)
        S = epsClosure(move(S, c));
    if (S and F have an intersection)
        return ACCEPT;
    return REJECT;
}
```

Performance of NFA-type Recognizers: Space $O(|r|)$; Time $O(|r| \times |s|)$.

```c
/* DFA-type recognizer */
bool recognizer() {
    s = s0;
    while ((c = getchar()) != EOF)
        s = move(s, c);
    if (s is in F)
        return ACCEPT;
    return REJECT;
}
```

Performance of DFA-type Recognizers: Space $O(2^{|r|})$; Time $O(|s|)$.
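A table-driven sketch of the DFA-type recognizer in Python, using the minimal DFA for (a+b)*abb (the state names are mine):

```python
# Minimal DFA for (a+b)*abb: each state records progress toward the suffix abb.
MOVE = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "s1", ("s1", "b"): "s2",
    ("s2", "a"): "s1", ("s2", "b"): "s3",
    ("s3", "a"): "s1", ("s3", "b"): "s0",
}
START, FINAL = "s0", {"s3"}

def recognizer(text):
    s = START
    for c in text:                 # each char forces exactly ONE move
        s = MOVE[(s, c)]
    return s in FINAL              # ACCEPT iff we end in a final state

print(recognizer("aabb"))   # True
print(recognizer("abab"))   # False
```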

The NFA $\to$ DFA algorithm is called **Subset Construction (子集构造法)**, since we make a subset of States in the original NFA into a single State in the resulting DFA.

```c
void subsetConstruction() {
    S0 = epsClosure({s0});
    DStates = {(S0, unmarked)};
    while (DStates has any unmarked State U) {
        Mark State U;
        for (each possible input char c) {
            V = epsClosure(move(U, c));
            if (V is not empty) {
                if (V is not in DStates)
                    Include V in DStates, unmarked;
                Add the Transition U --c--> V;
            }
        }
    }
}
```

- A State $S$ in the resulting DFA is an Accepting State iff $\exists s \in S$ such that $s$ is an Accepting State in the original NFA
- The Start State of the resulting DFA is $S_0$
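Subset construction sketched in Python on the NFA table of the (a+b)*abb example; since that NFA has no epsilon edges, epsClosure is the identity here (the encoding is mine):

```python
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
ALPHABET = "ab"

def subset_construction(start, nfa):
    s0 = frozenset({start})
    dstates, dtran = {s0}, {}
    unmarked = [s0]
    while unmarked:                       # DStates has an unmarked state U
        U = unmarked.pop()                # mark state U
        for c in ALPHABET:
            V = frozenset(set().union(*(nfa.get((q, c), set()) for q in U)))
            if V:
                if V not in dstates:
                    dstates.add(V)        # include V in DStates, unmarked
                    unmarked.append(V)
                dtran[(U, c)] = V         # add the transition U --c--> V
    return dstates, dtran

dstates, dtran = subset_construction(0, NFA)
print(len(dstates))   # 4 DFA states come out of this NFA
```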

DFA minimization:

```c
void minimize() {
    PI = {G_A, G_N};   /* accepting group and non-accepting group */
    do {
        for (every group G in PI) {
            for (every pair of States (s,t) in G) {
                if (for every possible input char c, transitions s--c-> and t--c->
                    go to states in the same group)
                    s, t stay in the same subgroup;
                else
                    s, t split into different subgroups;
            }
            Split G according to the above information;
        }
    } while (PI changed in this iteration);
    Every Group in PI is a State in the minimal DFA;
}
```

- A State $S$ in the minimal DFA is an Accepting State iff $\exists s \in S$ such that $s$ is an Accepting State in the original DFA
- The Start State of the minimal DFA is the one containing the original Start State
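The partition-refinement idea above, sketched in Python on the five-state DFA that subset construction produces for (a+b)*abb (states A..E; the encoding is mine). States A and C turn out to be equivalent and end up in the same group:

```python
ALPHABET = "ab"
MOVE = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
FINAL = {"E"}

def minimize(states, move, final):
    # PI starts as {accepting group, non-accepting group}.
    pi = [set(final), set(states) - set(final)]
    changed = True
    while changed:
        changed = False
        new_pi = []
        for G in pi:
            # Two states stay together iff, for every input char,
            # their transitions land in the same group of the partition.
            buckets = {}
            for s in G:
                key = tuple(
                    next(i for i, H in enumerate(pi) if move[(s, c)] in H)
                    for c in ALPHABET
                )
                buckets.setdefault(key, set()).add(s)
            new_pi.extend(buckets.values())
            if len(buckets) > 1:
                changed = True      # G was split; refine again
        pi = new_pi
    return pi

groups = minimize("ABCDE", MOVE, FINAL)
print(len(groups))   # 4 groups: A and C merge into one minimal-DFA state
```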

## Syntax Analyzer

If lexical analysis provides the **words** that the computer can recognize, then syntax analysis aims to understand the structure of sentences. This step usually generates a **parse tree**, which describes the syntactic structure.

- The syntax analyzer deals with **recursive** constructs of the language
- Both do similar things, but the lexical analyzer deals with simple **non-recursive** constructs of the language
- The lexical analyzer recognizes the smallest meaningful units (**tokens**) in a source program
- The syntax analyzer works on these smallest meaningful units (**tokens**) to recognize meaningful structures (**sentences**) in our programming language

A **Parse Tree / Syntax Tree (语法树)** is a graphical representation of the structure of a program, where leaf nodes are Tokens.

A **Context-free Grammar (CFG)** is a **Type-2** Grammar, which serves the construction of a Parse Tree from a stream of Tokens. We use a set of Production Rules to characterize a CFG.

A **Terminal (终结符号)** is a Token; a **Non-terminal (非终结符号)** is a syntactic variable.

- The Start Symbol is the first of the Non-terminals; it usually represents the whole program
- A Sentence is a string of Terminals $s$ such that Start Symbol $S \Rightarrow^{+} s$

A **Production Rule (生成规则)** is a law of production, from a Non-terminal to a sequence of Terminals & Non-terminals.

- e.g. $A \rightarrow \alpha A \mid \beta$, where $A$ is a Non-terminal and $\alpha, \beta$ are Terminals
- May be *recursive*
- The procedure of applying these rules to get a sentence of Terminals is called a **Derivation**; each intermediate string of symbols is a **Sentential Form**

$|$ Context-free Languages $| > |$ Regular Languages $|$; e.g. $\{ (^{i} )^{i} : i \geq 0 \}$ is context-free but not regular.

**Left-most Derivation (最左推导)** $(\Rightarrow_{lm})$ means replacing the leftmost Non-terminal at each step.

- If $\beta A \gamma \Rightarrow_{lm} \beta \delta \gamma$, then there are NO Non-terminals in $\beta$
- Corresponds to *Top-Down Parsing*

**Right-most Derivation (最右推导)** $(\Rightarrow_{rm})$ means replacing the rightmost Non-terminal at each step.

- If $\beta A \gamma \Rightarrow_{rm} \beta \delta \gamma$, then there are NO Non-terminals in $\gamma$
- Corresponds to *Bottom-Up Parsing*, in reversed manner

A CFG is **Ambiguous** when it produces more than one Parse Tree for the same sentence. Ambiguity must be removed for a practical CFG, by:

- Enforcing *Precedence (优先级)* and *Associativity (结合律)*
  - e.g. if $* > +$, then $+$ gets expanded first
- Grammar rewriting

## Top-Down Parsing

Construction of the parse tree starts at the root, and proceeds towards the leaves.

- Recursive Predictive Parsing
- Non-Recursive Predictive Parsing (**LL Parsing**; **L** - left to right, **L** - leftmost derivation)
- Relatively weak in the grammars it can handle

Possible pipelines:

1. Eliminate Left Recursion $\to$ Recursive-descent Parsing
2. Eliminate Left Recursion $\to$ Left Factoring $\to$ Recursive Predictive Parsing
3. Eliminate Left Recursion $\to$ Left Factoring $\to$ Construct Parsing Table $\to$ Non-recursive Predictive Parsing

$A \Rightarrow^{+} A\alpha$: Left Recursion.

- Top-Down Parsing **CANNOT** handle left-recursive Grammars
- Can be eliminated by rewriting

For *immediate* Left Recursions (Left Recursion that may appear in a single step), eliminate as follows:

```c
/* Non-terminals arranged in order: A1, A2, ..., An. */
void eliminate() {
    for (i from 1 to n) {
        for (j from 1 to i-1)
            Replace Aj with its products in every Production Rule Ai -> Aj ...;
        Eliminate Immediate Left Recursions Ai -> Ai ...;
    }
}
```
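The transformation for a single Non-terminal ($A \to A\alpha \mid \beta$ becomes $A \to \beta A'$, $A' \to \alpha A' \mid \epsilon$) can be sketched as follows (the list-of-symbols rule encoding is mine):

```python
def eliminate_immediate(head, productions):
    """A -> A alpha | beta   becomes   A -> beta A' ;  A' -> alpha A' | eps."""
    recursive = [p[1:] for p in productions if p and p[0] == head]   # the alphas
    others = [p for p in productions if not p or p[0] != head]       # the betas
    if not recursive:
        return {head: productions}       # nothing to eliminate
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],
        new: [alpha + [new] for alpha in recursive] + [["eps"]],
    }

# E -> E + T | T   becomes   E -> T E' ;  E' -> + T E' | eps
rules = eliminate_immediate("E", [["E", "+", "T"], ["T"]])
print(rules)
```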

Elimination of left recursion - with the recursion eliminated, the grammar below can be parsed by recursive descent:

```c
/* Example grammar (right-recursive):
 *   E -> T | T + E
 *   T -> int | int * T | ( E )
 */

bool term(TOKEN tok) { return *ptr++ == tok; }

bool E1() { return T(); }
bool E2() { return T() && term(PLUS) && E(); }

bool E() {
    TOKEN *save = ptr;
    return (ptr = save, E1()) || (ptr = save, E2());
}

bool T1() { return term(INT); }
bool T2() { return term(INT) && term(TIMES) && T(); }
bool T3() { return term(OPEN) && E() && term(CLOSE); }

bool T() {
    TOKEN *save = ptr;
    return (ptr = save, T1()) || (ptr = save, T2()) || (ptr = save, T3());
}
```
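The backtracking scheme above can be mirrored in Python; this sketch keeps the `(ptr = save, ...)` idiom as an explicit `_try` helper. As an assumption of this sketch, alternatives are ordered longest-first so that a successful but shorter parse does not mask a longer one (the lecture code tries E1 before E2 instead):

```python
# Grammar: E -> T + E | T ;  T -> int * T | ( E ) | int
class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def term(self, tok):               # consume one expected token, like *ptr++ == tok
        ok = self.pos < len(self.toks) and self.toks[self.pos] == tok
        self.pos += 1
        return ok

    def _try(self, *alternatives):     # backtracking choice, like (ptr = save, ...)
        save = self.pos
        for alt in alternatives:
            self.pos = save            # restore the input pointer before each try
            if alt():
                return True
        return False

    def E(self):
        return self._try(lambda: self.T() and self.term("+") and self.E(),
                         self.T)

    def T(self):
        return self._try(lambda: self.term("int") and self.term("*") and self.T(),
                         lambda: self.term("(") and self.E() and self.term(")"),
                         lambda: self.term("int"))

def parses(tokens):
    p = Parser(tokens)
    return p.E() and p.pos == len(p.toks)   # must consume ALL tokens

print(parses(["int", "+", "int", "*", "int"]))   # True
print(parses(["int", "+"]))                      # False
```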

LL(1) means only 1 Token of look-ahead is needed to decide which Production Rule to expand.

To make a Grammar LL(1), apply left factoring to each Non-terminal.

$\|$ LL(1) $\| < \|$ CFG $\|$, so not every Grammar can be converted to LL(1).

- Such a Grammar will have an entry with multiple Production Rules in the Parsing Table, and will thus be inappropriate for Predictive Parsing

For this part, it is strongly suggested to see https://www.josehu.com/assets/file/compilers.pdf for a better understanding.

## Bottom-Up Parsing

Construction of the parse tree starts at the leaves, and proceeds towards the root.

- Bottom-up parsing is also known as **shift-reduce parsing**
- **LR Parsing** - a much more general form of shift-reduce parsing: **LR**, **SLR**, **LALR** (**L** - left to right; **R** - rightmost derivation)

- TAs: 季杨彪, 杨易为, 尤存翰
