How Do Compilers Work

Feb 5th, 2021 - written by Kimserey with .

With the recent introduction of Apple M1 chip, we have seen the start of the adoption of the ARM processor architecture for Apple Macbook line. ARM being a different family of instruction set architecture than x86 (most common as of today), this means that any application built for x86 (for Mac with Intel or AMD chips) needs to be recompiled to run on ARM (M1 chip). The process of compilation is a multistep process starting taking code written in programming languages and transforming it into machine code understood by the processor. In today’s post we will look at the different part involved in the compilation process.

Compiler
Front end
Back end
Emulator

Compiler

A compiler is a program translating code written in a programming language into another language. For the purpose of this post, we will look into compilers which take code written in general programming languages like Python or C# and transform them to machine code to be ran on computers. The process of translation is composed of multiple steps, most commonly:

lexical analysis - tokenizing the characters from the code,
parsing (syntax analysis) - understanding the token generated by the lexical analysis,
semantic analysis - generating the intermediate representation,
execute the instruction from intermediate representaiton - generating the machine code from it or interpreting it directly depending on the compiler.

A common structure of a compiler separates the functionalities into two disctincts parts known as the front end and the back end with both being linked by the intermediate representation.

The front end takes care of the lexical analysis, parsing and semantic analysis, transforming the program into the intermediate representation while the back end takes the intermediate representation generated by the front end and takes care of code generation and running the application.

The intermediate representation solves the complexity of having many languages and many hardware with different instruction set. For example if we have C, C++ and Python as programming languages, and we have x86 and ARM, we would need n * m compiler implementations where n is the programming languages and m is the instruction representation sets. With an intermediate representation we can have all languages C, C++, Python and other languages being translated to the intermediate representation, and a virtual machine per instruction set able to run the intermediate representation on the specific architecture. This reduces the problem to n + m, decoupling languages from processor architectures.

The instruction set is an abstraction on its own allowing a program compiled to x86 to run on any processor implementing the x86 instruction set architecture. For example given a program compiled to x86 with a VM able to run the machine code, we can run on machine with a Intel i9-10900K or Intel i5-10400 or AMD Ryzen 5 3600 as they are all x86 processors.

Front end

The front end of a compiler role is to build the intermediate representation. The front end tasks are in order:

lexical analysis
syntax analysis
semantic analysis

The lexical analysis is the step taking the programming language source code text and tokenizing it in order to prepare the tokens to be provided to the syntax analyzer. The tokenizer breaks down a string into tokens identifying keyword, separator, operators, identifiers, literals, etc..

The syntax analyzer, or simply called parser, takes the tokens strings identified by the lexical analyzer and constructs the abstract syntax tree by identifying the keywords and applying the language grammar. For example, the parser will take an operator like if and create the appropriate branches for the condition, the true path and the false path in the abstract syntax tree.

The last step is the semantic analysis which verifies the semantic of the code, the meaning of the code, adding checks around type checking or variable instantiations.

The output of the front end is the intermediate representation. Depending on the compiler the intermediate representation is named differently for example for Java it is Java bytecode, while dotnet is known as IL, Python would be Python bytecode and LLVM woul generate LLVM IR.

Back end

Once the intermediate representation is generated, it is passed to the back end. The role of the back end is to:

transform intermediate representation into machine code
run the machine code

Languages like C or C++ are directly compiled into machine code and can directly be executed by the OS.

Other languages like C#, Java or Python include a virtual machine (JVM for Java, coreCLR for dotnet, PVM for Python) which is an abstraction on top of the processor providing added functionalities for the runtime like memory management (heap and stack for example are concepts provided by the virtual machine) or garbage collection - the processor does not have such concepts.

Intermediate representation is not always entirely translated into machine code, compilers also provide JIT (Just-in-time) compilation which allows to selectively translate portions of the intermediate representation during runtime. This decouples build and run where artifacts are created containing the intermediate representation and then distributed, and only translated at runtime where the compiler knows exactly what chip is being used as the application runs on the users machine, hence can tailor customization during translation from intermediate representation to machine code.

We can see now how binaries created for a specific intruction set architecture will not be able to be used for another architecture. For programs which were created with a language with a compiler having an intermediate representation, it would be easier to port as long as a VM is available for the targeted architecture. But even this isn’t a given as some applications use specific functionalities from an instruction set making them tied to a specific type of processor.

Emulator

When the process of recompiling an existing application to target a new architecture is too expensive, emulation is usually the option, more precisely dynamic binary translation is an option. Dynamic binary translation reads the compiled binary and translates the call to a binary compatible for the targeted architecture.

In the latest MacOS Big Sur, Apple has included Rosetta 2, an emulator allowing to run x86 applications on ARM. This allows to decouple the porting of applications from software companies from the release of the computer line as any existing application is expected to work.

Conclusion

Today we’ve looked into the composition of compilers. We saw the common composition with a front end and a back end and we dived into each part with their respective responsibilities. The latest M1 chip was a great opportunity to understand how changing a chip with a different architecture would impact the whole ecosystem with all applications needing to be recompiled. Another interesting point is how abstraction is present everywhere at each level, not just in computer programs but also in physical objects like CPUs or even printers. If we look from the beginning of the chain, a programming language is simply a specification which is an abstraction of the compilers implementing that specification. Then within the compiler, the intermediate representation is an abstraction of the back end, allowing different back end to target various sort of processors and add functionalities (with virtual machines). Then under a specific back end, we saw that an instruction set architecture is an abstraction of a family of processor which allows a program compiled to run on all processors from the same family (provided that the OS and library set are the same). And that concludes today’s post, I hope you liked this one and I see you on the next one!