Developing an LLVM backend is not a particularly glamorous affair. You will soon realize that it is largely an exercise in copy-pasting and adapting code from other existing backends. In fact, LLVM developers in online forums suggest getting started by “copying an existing backend, rename it and modify it to suit your needs”. Sounds simple, except that even relatively small backends, like Lanai or XCore, are rather complex and the code is not easy to follow!
I will take a slightly different approach in this series of posts. We will be using an existing LLVM backend as a starting point, but I have stripped out most of the code and reduced it to the bare minimum needed to compile a (tiny) program. The stripped-down backend, called RISCW, is simple enough to help understand the LLVM Target-Independent Code Generator without getting bogged down in the details. In the remainder of this post I will use the RISCW backend to show how to set up a new LLVM backend. We will also see how to build LLVM with an experimental backend and even compile a (very simple) C program down to assembly.
NOTE: The code for the RISCW backend can be found here.
NOTE: Posts in this series:
- Getting Started
- Setting Up a New Backend
- Configuring the Build System
- Instruction Selection
- Arithmetic Instructions
LLVM Triple and ELF Configuration
We start by configuring a new target triple for our backend. For historical reasons, the triple is a way of encoding information about the target such as the architecture, vendor and operating system. Here are the steps to configure a new triple:
- Declare a new architecture for the triple in
- Provide type conversions between string and Triple architecture (see here, here, here and here).
- Indicate what type of object format the backend generates, e.g. ELF, COFF, etc. RISCW will work with ELF only (see here).
- Indicate the architecture variant, e.g. 32- or 64-bit, and the pointer size (see here, here and here).
NOTE: The architecture variant does not necessarily imply a pointer size.
For example, it is not always the case that pointers are 64-bit when compiling
for RV64. The pointer size is usually given by the ABI, which could make
long and pointers 32 bits wide even on a 64-bit machine.
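The triple steps above can be sketched outside of LLVM as follows. This is a minimal, self-contained illustration and not the actual LLVM code: the enum is a hypothetical, stripped-down stand-in for the architecture enum in the Triple class, the function names only loosely mirror what a real backend touches, and the triple string "riscw-unknown-elf" is an assumption made for the example.

```cpp
#include <string>

// Hypothetical, stripped-down stand-in for the architecture enum of the
// Triple class. The real enum has many more entries.
enum class ArchType { UnknownArch, riscv32, riscv64, riscw };

// String -> enum: look at the first component of a triple such as
// "riscw-unknown-elf" (an assumed triple, used only for illustration).
ArchType parseArch(const std::string &Triple) {
  const std::string Arch = Triple.substr(0, Triple.find('-'));
  if (Arch == "riscv32") return ArchType::riscv32;
  if (Arch == "riscv64") return ArchType::riscv64;
  if (Arch == "riscw")   return ArchType::riscw;
  return ArchType::UnknownArch;
}

// Enum -> string: the reverse conversion.
const char *getArchTypeName(ArchType Kind) {
  switch (Kind) {
  case ArchType::riscv32: return "riscv32";
  case ArchType::riscv64: return "riscv64";
  case ArchType::riscw:   return "riscw";
  case ArchType::UnknownArch: break;
  }
  return "unknown";
}

// Architecture variant: RISCW is treated as a 32-bit architecture here.
bool isArch32Bit(ArchType Kind) {
  return Kind == ArchType::riscv32 || Kind == ArchType::riscw;
}
```

Each conversion needs its own entry for the new architecture, which is why the list above points at several separate places in the code.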
Since RISCW uses ELF, this is a good time to configure the following parameters related to that:
- Create a new machine architecture enum for RISCW (see
This integer is encoded in the e_machine field of the ELF header. The value is not arbitrary; it must match the registered architecture types for the ELF format, e.g. 0xF3 for RISCV. But we will set it to an unused value for now.
- Declare the ELF relocation types (see here and here). Again, these are architecture-dependent and those for RISCV are listed here. At this stage, we will simply put place-holders for RISCW.
- The file format name (see here).
- Indicate the target triple for a given class (see here). Currently, the class in the ELF header is a byte that encodes whether the format is 32- or 64-bit.
NOTE: Take a look at Wikipedia for more information on the ELF file format.
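To make the role of the machine value and the class byte more concrete, here is a small standalone sketch of how the two combine to identify an architecture. The ELFCLASS32, ELFCLASS64 and EM_RISCV values come from the ELF format; EM_RISCW is a hypothetical placeholder chosen for this example and is not necessarily the value used in the RISCW code. The helper function is also made up for illustration.

```cpp
#include <cstdint>
#include <string>

// Class byte values defined by the ELF format.
constexpr std::uint8_t ELFCLASS32 = 1;
constexpr std::uint8_t ELFCLASS64 = 2;

// Registered machine value for RISCV (0xF3).
constexpr std::uint16_t EM_RISCV = 243;

// HYPOTHETICAL placeholder for RISCW: an arbitrary value picked for this
// sketch, not a registered architecture type.
constexpr std::uint16_t EM_RISCW = 0xFFF0;

// Illustrative helper: derive the architecture name from the class byte
// and the e_machine field of an ELF header.
std::string archForHeader(std::uint8_t EIClass, std::uint16_t EMachine) {
  switch (EMachine) {
  case EM_RISCV:
    if (EIClass == ELFCLASS32) return "riscv32";
    if (EIClass == ELFCLASS64) return "riscv64";
    break;
  case EM_RISCW:
    // RISCW only exists as a 32-bit target in this sketch.
    if (EIClass == ELFCLASS32) return "riscw";
    break;
  default:
    break;
  }
  return "unknown";
}
```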
Recall that we are using clang to compile the input C code down to LLVM IR. But clang is not just our frontend compiler, it is also a driver, like GCC, that drives the compilation pipeline to transform an input C program into another representation e.g. C to assembly or object code. Therefore, we need to modify clang to tell it
- that there is a new RISCW backend target with a particular feature set. For example, clang needs to be aware whether RISCW is 32- or 64-bit.
- what the RISCW compilation pipeline looks like. For instance, which assembler should it use? Which linker? Which include paths?
We can tell clang about RISCW by adding a new target class
that is instantiated alongside the existing LLVM targets as shown
The class is declared and defined
There are a few important things to highlight in this code:
- RISCWTargetInfo describes the data layout via a string. This string encodes information like the number of bits in a pointer and stack alignment requirements.
- The target may indicate the sizes of basic C data types.
- A function RISCWTargetInfo::getTargetDefines() indicates which C preprocessor macros are defined at compile time. For example, these macros are defined when compiling code using the RISCV target. The macros generally describe what architecture is used, the ABI, any enabled/disabled architectural features, etc.
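To get a feel for what a data layout string encodes, the following standalone sketch pulls the endianness, pointer size and stack alignment out of one. The parser is a toy written for this post, not LLVM's own, and it only understands the default address space. The layout string in the usage note is, to my knowledge, the one used by the 32-bit RISCV target; that RISCW uses the same string is an assumption.

```cpp
#include <sstream>
#include <string>

// A few of the properties a data layout string can describe.
struct LayoutSummary {
  bool LittleEndian = false;
  unsigned PointerBits = 0;    // size of a pointer in bits
  unsigned StackAlignBits = 0; // natural stack alignment in bits
};

// Toy parser: specifications are separated by '-'. Only "e"/"E", "p:<size>"
// (default address space) and "S<align>" are handled; everything else is
// ignored.
LayoutSummary summarizeDataLayout(const std::string &Layout) {
  LayoutSummary Summary;
  std::istringstream Stream(Layout);
  std::string Spec;
  while (std::getline(Stream, Spec, '-')) {
    if (Spec == "e") {
      Summary.LittleEndian = true;
    } else if (Spec == "E") {
      Summary.LittleEndian = false;
    } else if (Spec.rfind("p:", 0) == 0) {
      // "p:32:32" -> pointer size 32; stoul stops at the next ':'.
      Summary.PointerBits = static_cast<unsigned>(std::stoul(Spec.substr(2)));
    } else if (Spec.size() > 1 && Spec[0] == 'S') {
      Summary.StackAlignBits =
          static_cast<unsigned>(std::stoul(Spec.substr(1)));
    }
  }
  return Summary;
}
```

Calling summarizeDataLayout("e-m:e-p:32:32-i64:64-n32-S128") reports a little-endian target with 32-bit pointers and a 128-bit stack alignment.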
NOTE: A backend might target multiple instruction sets, ABIs, etc, so the
driver configuration must be changed according to the selected target triple.
For example, the
RISCVTargetInfo changes the data layout string depending on
whether the triple contains
NOTE: Take a look
at the declaration of the parent class
contains a lot more options that you can configure.
Configuring the toolchain is relatively straightforward. We simply need to
define a RISCWToolChain class that inherits from
ToolChain as shown
The code is mostly self-explanatory, but there are a lot more options that your
target can modify by overriding the members of the
ToolChain class (see
Creating a New Target
Each backend has a separate directory under
llvm/lib/Target where the
majority of its code is contained. We will not go into the details in this post
(we will do that later on) because even a small backend, like RISCW, has a lot
of files. For now, it suffices to say that we can broadly classify the files
into three groups:
- TableGen files: The LLVM Target-Independent Code Generation framework implements an elaborate pattern matching algorithm to select instructions for the input program. The patterns used for matching are described to LLVM using the TableGen syntax. Additionally, TableGen files also describe important architecture-specific features like the number of registers and the procedure calling convention.
- Build files: The directory for every backend must be declared
otherwise it will not be built. Additionally, the top directory for our target,
llvm/lib/Target/RISCW, and every subdirectory must contain two build files:
CMakeLists.txt and LLVMBuild.txt. The former adds source files and any subdirectories to the build target while the latter sets simple build parameters for the target component. Parameters include the library name, required libraries for linking, etc.
- C++ classes: The C++ files comprise the bulk of the backend code and implement everything from simple configuration options to more complex instruction selection functionality that is not (or cannot be) captured by TableGen.
Building the Experimental Backend
Now that everything is set up, we can build LLVM with our new RISCW backend.
But we cannot simply modify the -DLLVM_TARGETS_TO_BUILD option of the CMake
command from the previous post to include RISCW because that backend is still
experimental. Instead, we use the -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD option:
cmake -G "Ninja" -DLLVM_ENABLE_PROJECTS="clang" -DLLVM_TARGETS_TO_BUILD="ARM;Lanai;RISCV" -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="RISCW" -DCMAKE_BUILD_TYPE="Debug" -DLLVM_ENABLE_ASSERTIONS=On ../llvm
When the build is complete, you can check that RISCW is now an available target as follows:
$ ./build/bin/llc --version
LLVM version 10.0.1
DEBUG build with assertions.
Default target: x86_64-unknown-linux-gnu
Host CPU: znver2
arm - ARM
armeb - ARM (big endian)
lanai - Lanai
riscv32 - 32-bit RISC-V
riscv64 - 64-bit RISC-V
riscw - 32-bit RISC-V <== YAY!!
thumb - Thumb
thumbeb - Thumb (big endian)
Compiling our First C Program
Our RISCW backend can only emit two instructions, add and ret, but it cannot
properly handle function calls, stacks and pretty much everything else! So
we will restrain ourselves and only compile this tiny function:
int test(int a, int b) {
  return a + b;
}
And voilà! We get this code:
.globl test ; -- Begin function test
test: ; @test
; %bb.0: ; %entry
add x0, x1, x0
.size test, .Lfunc_end0-test
; -- End function
.ident "clang version 10.0.1 (https://github.com/llvm/llvm-project 89f2d2cc3bba7cb12cee346b3205cb0335e758cd)"
Again, there are a lot of things missing and the code is actually incorrect
because x0 in RISCV is a read-only register hard-wired to 0. But I think we
achieved our objective: we set up a minimal LLVM backend that we can easily
extend with more features.
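To see why writing to x0 is a problem, here is a tiny model of a RISCV-style register file. It is a sketch written for this post (the class and function names are made up) and it only models the one property that matters here: reads of x0 always return 0 and writes to it are discarded.

```cpp
#include <array>
#include <cstdint>

// Minimal model of a 32-entry register file where x0 is hard-wired to 0.
class RegisterFile {
public:
  std::uint32_t read(unsigned Index) const {
    return Index == 0 ? 0 : Regs.at(Index);
  }
  void write(unsigned Index, std::uint32_t Value) {
    if (Index != 0) // writes to x0 are silently dropped
      Regs.at(Index) = Value;
  }

private:
  std::array<std::uint32_t, 32> Regs{};
};

// add rd, rs1, rs2
void add(RegisterFile &RF, unsigned Rd, unsigned Rs1, unsigned Rs2) {
  RF.write(Rd, RF.read(Rs1) + RF.read(Rs2));
}
```

With this model, add(RF, 0, 1, 0) computes x1 + 0 and then throws the result away, so the sum our test function is supposed to return never reaches a usable register.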
NOTE: Make sure you set clang’s
-target riscw and llc’s
you are using the commands from the previous post to compile the
NOTE: Attempting to compile more complex programs will result in a
cannot select... error. Give it a try if you are interested.
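Conceptually, the error appears because the instruction selector has no pattern for the operation it is asked to lower. The toy selector below illustrates the idea under heavy simplification: the opcode enum, the function names and the error text are all hypothetical, and the real selector works on a DAG with TableGen-generated patterns rather than a switch statement.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical set of operations the selector may be asked to lower.
enum class Opcode { Add, Ret, Mul, Call };

std::string opcodeName(Opcode Op) {
  switch (Op) {
  case Opcode::Add:  return "add";
  case Opcode::Ret:  return "ret";
  case Opcode::Mul:  return "mul";
  case Opcode::Call: return "call";
  }
  return "unknown";
}

// Only the operations the backend has patterns for can be selected;
// anything else is reported as an error.
std::string selectInstruction(Opcode Op) {
  switch (Op) {
  case Opcode::Add:
  case Opcode::Ret:
    return opcodeName(Op);
  default:
    throw std::runtime_error("cannot select: " + opcodeName(Op));
  }
}
```

Extending the backend then amounts to teaching the selector about more operations, one pattern at a time.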
NOTE: You can instruct the compiler to print debug information by passing
-debug option to