# CS 131: Programming Languages and Compilers

ShanghaiTech University - Spring 2021

## Introduction

### Definition

A **Compiler **is "A program that takes a source-code program and translates it into an equivalent program in target language"

### String

A **String **is a sequence of characters.

Alphabet: A finite set of symbols （ASCII characters）

String has words and sentences

Suppose s is banana

**Prefix:**前缀ban banana

**Suffix:**后缀banana, ana

**Substring:**子字符串**Subsequence**:子字符bnan, nn

Not all Strings of chars in the Alphabet is in the certain Language, only those who satisfy the Grammar rules.

Below is **Operations **and **Examples**.

Type 0: Turing Machine Recursive Enumerable Gramma

Type 1: Context-sensitive Grammar （CSG）

Type 2: Context-free Grammar (CFG,上下文无关文法），

*mostly recursive*Type 3: Right-linear Grammar Regular Expressions (RE,正则表达式），

*non-recursive*

Expressiveness: Type 0 > Type 1 > Type 2 > Type 3

### Phases

## Lexical Analyzer （词法分析器）

Reads the source program character by character and returns the **tokens **of the source program

分析并把 identifiers 放入符号表

利用正则表达式来分析 tokens

利用有限自动机来分析 Token 以完成词法分析

词法分析器 = 读入（scanning） + 词法分析（lexical analysis）

### Token

Describes a pattern of characters having some meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on)

e.g., identifiers, operators, numbers

A **Lexeme （词素） **is an instance of a Token, along with its unique attributes.

e.g.

`17`

INT

Token.value = 17

### Regular Expression （正则表达式）

我们利用**正则表达式**来 描述 **Tokens **匹配 **Tokens**

**正则表达式**是左结合的，从左向右匹配的**正则表达式**是一个**Type-3 Grammar Rule**

**Properties**

**Properties**

e.g.

Integers:

Digit = [0-9]*

Integer = Digit Digit*

Identifier

letter = [a-zA-Z]

identifier = letter （letter + digit）*

Whitespace

WS = ('\n' + '\t' + ' ')+

Examples

Even Binary number

1[0+1]*0 | 0

### Finite Automata （有限自动机）

#### Transition Diagram

当一个 Token 被识别：

如果是关键字，那么会返回

**Token of the keyword**如果是符号表里的 ID，那么返回

**entry of symbol table**如果符号表没有这个ID，那么加入ID并返回新的

**entry of symbol table**

#### Finite automata

A **recognizer **for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.

We call the recognizer of the tokens as a **finite automaton**.

**Example**

**Start Arrow****State****Transition Edge****Accepting State**:同心圆，接受并结束，状态3**Death State:错误**状态，未定义的 transition 指向该状态Transition table:

注意 empty string 可以被某些自动机接收

### NFA

Non-Deterministic Finite Automata （NFAs） **easily **represent regular expression, but are **less precise.**

**Accept s: **an Accepting State that spells out **s**.

An NFA is a mathematical model that consists of:

S, a

**finite**set of**states***move*, a**transition function**

### DFA

Deterministic Finite Automata （DFAs） require **more complexity** to represent regular expressions but offer **more precision**.

**Accept s: **an Accepting State that spells out **s*** and ***ONLY and ONLY ONE ***path*.

### Implementation of Lexers

One Token, A *Recognizer. There are 4 ways. (r stands for RE)*

### RE to NFA

Algorithm is called **Thompson's Construction**.

There are some requirements on such construction:

REMEMBER to assign unique names to all states

Properties of the resulting NFA:

Exactly 1 Start State & 1 Accepting State

States do not have multiple outgoing edges with the same input symbol

### RE to DFA

**[Step 1]**: We make Augmented RE: concatenate with symbol # (meaning "finish").

e.g. (a+b)*a#

Ensures at least one operator in the RE

**[Step 2]:** Build syntax tree for this Augmented RE:

All other operators are inner nodes

**[Step 3]:** Compute `nullable()`

,` firstpos()`

& `lastpos()`

for ALL nodes.

`firstpos(n)`

: Function returning the set of positions where the*first*Symbol can be at, in the*sub-RE*rooted at`n`

`lastpos(n)`

: Function returning the set of Positions where the*last*Symbol can be at, in the sub-RE rooted at`n`

**[Step 4]:** Compute `followpos()`

for Leaf positions

`followpos(i)`

: Function returning the set of positions *which can follow* position i in the generated String

**[Step 5]:** Construct the DFA.

#### Calculate \epsilon-Closure

#### Implement NFA as Recognizer

#### Implement DFA as Recognizer

#### Convert NFA to DFA

Algorithm is called **Subset Construction(子集构造法)**, since we make subset of States in original NFA into a single State in resulting DFA

### Minimize DFA

Start State of the minimal DFA is the one containing original Starting State

### Other Issues for Lexers

#### Look ahead

#### Comment Skip

#### Symbol Table

## Syntax Analyzer （句法分析）

如果说词法分析这一步提供了可供计算机识别的**词**，那么句法分析是为了理解句子结构。

通常这一步会生成 **parse tree**, **parse tree** 用以描述句法结构。

### Difference with Lexical Analyzer

The syntax analyzer deals with

**recursive**constructs of the languageBoth do similar things; But the lexical analyzer deals with simple

**non-recursive**constructs of the language.The lexical analyzer recognizes the smallest meaningful units （

**tokens**） in a source program.The syntax analyzer works on the smallest meaningful units （

**tokens**） in a source program to recognize meaningful structures （**sentences**） in our programming language.

### Parse Tree Abstraction

A **Parse Tree / Syntax Tree (语法树) **is a graphical representation of the structure of a program, where leaf nodes are Tokens.

### CFG （上下文无关文法）

A** Context-free Grammar (CFG)** is a** Type-2** Grammar rule, which serves the construction of a Parse Tree from a streamof Tokens. We use a set of Production Rules to characterize a CFG

A **Terminal (终结符号)** is a Token; A **Non-terminal (非终结符号) **is a syntactic variable.

The Start Symbol is the first one of Non-terminals; Usually represents the whole program

A **Production Rule (生成规则)** is a law of production, from a Non-terminal to a sequence of Terminals & Non-terminals.

May be

*recursive*The procedure of applying these rules to get a sentence of Terminals is called

**Sentential Form**/**Derivation**

### Derivation Directions（派生文法）&Ambiguity（二义性）

Corresponds to

*Top Down Parsing*

Corresponds to

*Bottom Up Parsing*, in reversed manner

A CFG is **Ambiguous** when it produces more than one Parse Tree for the same sentence. Must remove Ambiguity for apractical CFG, by:

Enforce

*Precedence (优先级)*and*Associativity (结合律)*Grammar Rewritten

### Top-Down Parsers

Construction of the parse tree starts at the root, and proceeds towards the leaves.

Recursive Predictive Parsing

Non-Recursive Predictive Parsing （

**LL Parsing**）. （**L**-left to right;**L**-leftmost derivation）语法构架能力弱

#### Implement

#### Left Recursion Elimination (消除左递归)

Top Down Parsing

**CANNOT**handle Left-recursive GrammarsCan be eliminated by rewriting

For *Immediate* Left Recursions (Left Recursion that may appear in a single step), eliminate by:

#### Implementing Recursive-descent Parsing

#### Left Factoring: Produce LL(1) Grammar

LL(1) means Only 1 Token Look-ahead ensures which Pruduction Rule to expand now.

To convert LL(1) to a CFG, for each Non-terminal :

|| LL(1) || < || CFG ||, so not all Grammar can be convert to LL(1)

Such Grammar will have an entry with multiple Production Rules to use in the Parsing Table, thusWill be inappropriate for Predictive Parsing

#### Implementing Recursive Predictive Parsing

This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.

#### Parsing Table Construction

This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.

#### Implementing LL(1) Parsing

This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.

### Bottom-Up Parsers

Construction of the parse tree starts at the leaves, and proceeds towards the root.

Bottom-up parsing is also known as

**shift-reduce parsing****LR Parsing**– much general form of shift-reduce parsing:**LR**,**SLR**,**LALR**(**L**-left to right;**R**-rightmost derivation)

## IR

## Thanks

[Guanzhou HU's Notes](https://www.josehu.com/)

TAs: 季杨彪, 杨易为, 尤存翰

Last updated