CS 131: Programming Languages and Compilers
ShanghaiTech University - Spring 2021
Last updated
ShanghaiTech University - Spring 2021
Last updated
A Compiler is "A program that takes a source-code program and translates it into an equivalent program in target language"
A String is a sequence of characters.
Alphabet: A finite set of symbols (ASCII characters)
String has words and sentences
Suppose s is banana
Prefix:前缀
ban banana
Suffix:后缀
banana, ana
Substring:子字符串
Subsequence:子字符
bnan, nn
Not all Strings of chars in the Alphabet is in the certain Language, only those who satisfy the Grammar rules.
Below is Operations and Examples.
Type 0: Turing Machine Recursive Enumerable Gramma
Type 1: Context-sensitive Grammar (CSG)
Type 2: Context-free Grammar (CFG,上下文无关文法), mostly recursive
Type 3: Right-linear Grammar Regular Expressions (RE,正则表达式), non-recursive
Expressiveness: Type 0 > Type 1 > Type 2 > Type 3
Reads the source program character by character and returns the tokens of the source program
分析并把 identifiers 放入符号表
利用正则表达式来分析 tokens
利用有限自动机来分析 Token 以完成词法分析
词法分析器 = 读入(scanning) + 词法分析(lexical analysis)
Describes a pattern of characters having some meaning in the source program (such as identifiers, operators, keywords, numbers, delimiters and so on)
e.g., identifiers, operators, numbers
A Lexeme (词素) is an instance of a Token, along with its unique attributes.
e.g. 17
INT
Token.value = 17
我们利用正则表达式来 描述 Tokens 匹配 Tokens
正则表达式是左结合的,从左向右匹配的
正则表达式是一个 Type-3 Grammar Rule
e.g.
Integers:
Digit = [0-9]*
Integer = Digit Digit*
Identifier
letter = [a-zA-Z]
identifier = letter (letter + digit)*
Whitespace
WS = ('\n' + '\t' + ' ')+
Examples
Even Binary number
1[0+1]*0 | 0
当一个 Token 被识别:
如果是关键字,那么会返回 Token of the keyword
如果是符号表里的 ID,那么返回 entry of symbol table
如果符号表没有这个ID,那么加入ID并返回新的 entry of symbol table
A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence of that language, and “no” otherwise.
We call the recognizer of the tokens as a finite automaton.
Example
Start Arrow
State
Transition Edge
Accepting State:同心圆,接受并结束,状态3
Death State:错误状态,未定义的 transition 指向该状态
Transition table:
注意 empty string 可以被某些自动机接收
Non-Deterministic Finite Automata (NFAs) easily represent regular expression, but are less precise.
Accept s: an Accepting State that spells out s.
An NFA is a mathematical model that consists of:
S, a finite set of states
move, a transition function
Deterministic Finite Automata (DFAs) require more complexity to represent regular expressions but offer more precision.
Accept s: an Accepting State that spells out s and ONLY and ONLY ONE path.
One Token, A Recognizer. There are 4 ways. (r stands for RE)
Algorithm is called Thompson's Construction.
There are some requirements on such construction:
REMEMBER to assign unique names to all states
Properties of the resulting NFA:
Exactly 1 Start State & 1 Accepting State
States do not have multiple outgoing edges with the same input symbol
[Step 1]: We make Augmented RE: concatenate with symbol # (meaning "finish").
e.g. (a+b)*a#
Ensures at least one operator in the RE
[Step 2]: Build syntax tree for this Augmented RE:
All other operators are inner nodes
[Step 3]: Compute nullable()
, firstpos()
& lastpos()
for ALL nodes.
firstpos(n)
: Function returning the set of positions where the first Symbol can be at, in the sub-RE rooted at n
lastpos(n)
: Function returning the set of Positions where the last Symbol can be at, in the sub-RE rooted at n
[Step 4]: Compute followpos()
for Leaf positions
followpos(i)
: Function returning the set of positions which can follow position i in the generated String
[Step 5]: Construct the DFA.
Algorithm is called Subset Construction(子集构造法), since we make subset of States in original NFA into a single State in resulting DFA
Start State of the minimal DFA is the one containing original Starting State
如果说词法分析这一步提供了可供计算机识别的词,那么句法分析是为了理解句子结构。
通常这一步会生成 parse tree, parse tree 用以描述句法结构。
The syntax analyzer deals with recursive constructs of the language
Both do similar things; But the lexical analyzer deals with simple non-recursive constructs of the language.
The lexical analyzer recognizes the smallest meaningful units (tokens) in a source program.
The syntax analyzer works on the smallest meaningful units (tokens) in a source program to recognize meaningful structures (sentences) in our programming language.
A Parse Tree / Syntax Tree (语法树) is a graphical representation of the structure of a program, where leaf nodes are Tokens.
A Context-free Grammar (CFG) is a Type-2 Grammar rule, which serves the construction of a Parse Tree from a streamof Tokens. We use a set of Production Rules to characterize a CFG
A Terminal (终结符号) is a Token; A Non-terminal (非终结符号) is a syntactic variable.
The Start Symbol is the first one of Non-terminals; Usually represents the whole program
A Production Rule (生成规则) is a law of production, from a Non-terminal to a sequence of Terminals & Non-terminals.
May be recursive
The procedure of applying these rules to get a sentence of Terminals is called Sentential Form / Derivation
Corresponds to Top Down Parsing
Corresponds to Bottom Up Parsing, in reversed manner
A CFG is Ambiguous when it produces more than one Parse Tree for the same sentence. Must remove Ambiguity for apractical CFG, by:
Enforce Precedence (优先级) and Associativity (结合律)
Grammar Rewritten
Construction of the parse tree starts at the root, and proceeds towards the leaves.
Recursive Predictive Parsing
Non-Recursive Predictive Parsing (LL Parsing). (L-left to right; L-leftmost derivation)
语法构架能力弱
Top Down Parsing CANNOT handle Left-recursive Grammars
Can be eliminated by rewriting
For Immediate Left Recursions (Left Recursion that may appear in a single step), eliminate by:
LL(1) means Only 1 Token Look-ahead ensures which Pruduction Rule to expand now.
To convert LL(1) to a CFG, for each Non-terminal :
|| LL(1) || < || CFG ||, so not all Grammar can be convert to LL(1)
Such Grammar will have an entry with multiple Production Rules to use in the Parsing Table, thusWill be inappropriate for Predictive Parsing
This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.
This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.
This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.
Construction of the parse tree starts at the leaves, and proceeds towards the root.
Bottom-up parsing is also known as shift-reduce parsing
LR Parsing – much general form of shift-reduce parsing: LR, SLR, LALR (L-left to right; R-rightmost derivation)
This part stongly suggest to see https://www.josehu.com/assets/file/compilers.pdf for better understanding.
[Guanzhou HU's Notes](https://www.josehu.com/)
TAs: 季杨彪, 杨易为, 尤存翰
is the concatenation of and
is the empty string
is the length of s
A Language is a set of Strings over a fixed Alphabet , constructed using a specific Grammar.
e.g.
Alphabet and using Grammar rule , we can specify the above example Language, while isn't.
A Grammar is the description of method (rules) of how to construct a certain Language over a certain Alphabet
每个正则表达式 描述一个语言
被叫做 regular set
, the symbols of the input alphabet
move(state, symbol) sets of states
A state , the start state
, a set of final or accepting states
move: ε- transitions are allowed in NFAs. In other words, we can move from one state to another one without consuming any symbol
The -Closure of All States that can go to without consuming any input
Does not allow move, for every , there is ONLY ONE decision for every input Symbol.
No -closure !!!
and CANNOT have any intersections
# of States in NFA (# of Symbols + # of Operators) in
States have at most 2 outgoing edges
, # and all are at leaves
Non- leaves get its position number, increasing from left right
nullable(n)
: Function judging whether the sub-RE rooted at n
can generate
Conduct a Post-order Depth First Traversal on the syntax tree, and do the following oprations when leaving / * nodes:
For all lastpos(c1)
, followpos(i)
= followpo(i)
firstpos(c2)
For all lastpos(c)
, followpos(i)
followpos(i)
firstpos(c)
A State in resulting DFA is an Accepting State iff # node
Start State of the resulting DFA is
Performance of NFA-type Recognizers: Space ; Time
Performance of DFA-type Recognizers: Space ; Time
A State in resulting DFA is an Accepting State iff is an Accepting State in original NFA
Start State of the resulting DFA is
A State S in the minimal DFA is an Accepting State iff , s is an Accepting State in original DFA
A Sentence is a string of Terminals such that Start Symbol
e.g. , where is a Non-terminal and are Terminals
Context-free Languages Regular Languages , e.g. .
Left-most Derivation (左递归) means to replace the leftmost Non-terminal at each step.
If , then NO Non-terminals in
Right-most Derivation means Replace the rightmost Non-terminal at each step.
If , then NO Non-terminals in
e.g. , then gets expanded first
Eliminate Left Recursion Recursive-descent Parsing
Eliminate Left Recursion Left Factoring Recursive Predictive Parsing
Eliminate Left Recursion Left Factoring Construct Parsing Table Non-recursive Predictive Parsing
: Left Recursion
state
a
b
0
{0,1}
{0}
1
--
{2}
2
--
{3}