Phases of Compiler
Lexical Analysis
The first phase of compiler is also known as Scanner. The
scanner works as a text scanner. This phase scans the source code as a stream
of characters and converts it into meaningful lexemes. Lexical analyzer
represents these lexemes in the form of tokens as:<Token-name,
attribute-value>
- Token: Token is a sequence of characters that represent lexical unit, which matches with the pattern, such as keywords, operators, identifiers etc.
- Lexeme: Lexeme is instance of a token i.e., group of characters forming a token.
- Pattern: Pattern describes the rule that the lexemes of a token takes. It is the structure that must be matched by strings.
• Once a token is generated the corresponding
entry is made in the symbol table.
Syntax Analysis
The next phase is called the Syntax Analysis or Parser. It
takes the token produced by lexical analysis, as input and generates a parse
tree (or syntax tree). In this phase, token arrangements are checked against
the source code grammar, i.e., the parser checks if the expression made by the
tokens is syntactically correct or not.
Ex. a = b + i can be represented in syntax tree form as
Semantic Analysis
Semantic analysis checks whether the parse tree constructed
thus follows the rules of language. For example, it checks type casting, type
conversions issues and so on. Also, the semantic analyzer keeps track of
identifiers, their types and expressions; whether identifiers are declared
before use or not, etc. The semantic analyzer produces an annotated syntax tree
as an output.
- Convert i to real
- Add to b to get result T(assume)
Intermediate Code Generation
After semantic analysis, the compiler generates an
intermediate code of the source code for the target machine. It represents a
program for some abstract machine. It is in between the high-level language and
the machine language. This intermediate code should be generated in such a way
that it makes it easier to be translated into the target machine code. The
intermediate code may be a Three Address code or Assembly code. As each
statement requires maximum three addresses, so it is called three address code.
Ex. a = d * b + c
Three address code is,
temp1 = d * b
temp2 = temp1 +
c
a = temp2
where temp1 and temp2 are compiler generated temporary
names.
Code Optimization
The next phase does code optimization, it is an optional
phase. Optimization can be assumed as something that removes unnecessary code
lines, and arranges the sequence of statements in order to speed up the program
execution without wasting resources like CPU, memory. The output of this phase
is an optimized intermediate code.
Code Generation
In this phase, the code generator takes the optimized
representation of the intermediate code and maps it to the target machine
language. The code generator translates the intermediate code into a sequence
of re-locatable machine code. Sequence of instructions of machine code performs
the task as the intermediate code would do.
Ex. a = b + c is converted as
LOAD b
ADD c
STORE a
Now if by previous
operation b is already loaded then LOAD b is redundant. Such statements can be
avoided at code generation stage. To avoid this compiler has to keep track of
run time contents of registers.
Symbol Table is also known as Book Keeping. It is a
data-structure maintained throughout all the phases of a compiler. All the identifiers‟
names along with their information like type, size, etc., are stored here. The
symbol table makes it easier for the compiler to quickly search and retrieve
the identifier’s record. The symbol table is also used for scope management.
Error Hander
A parser should be able to detect and report any error in
the program. It is expected that when an error is encountered, the parser
should be able to handle it and do parsing with the rest of the inputs. Mostly
it is expected from the parser to check for errors. But errors may be
encountered at various stages of the compilation process.