The purpose of this project is to get you acquainted with LLVM. In particular, it will give you familiarity with the data structures, types, and code organization you will need to understand to implement program analyses.
For this class, we require that you use Linux for development. If Linux is not your primary operating system, please use a virtual machine (e.g., VirtualBox). We recommend Ubuntu 12.04 as a relatively pain-free Linux distribution. All code will be written in C++, as that is the language in which LLVM is written.
In this section, we’ll get started by downloading and compiling the appropriate LLVM source packages to set up your development environment. Then we’ll test out the installation by running a simple analysis on a sample program.
First, create a new directory for this project.
$ mkdir cse231-proj1
$ cd cse231-proj1
$ mkdir llvm
Download LLVM 3.4 and Clang 3.4 to the project directory and decompress them as follows.
$ tar xvzf llvm-3.4.src.tar.gz
$ tar xvzf clang-3.4.src.tar.gz
$ mv llvm-3.4 llvm/src
$ mv clang-3.4 llvm/src/tools/clang
Now download the Project 1 Utilities and extract them in the project root.
$ tar xvzf cse231-proj1.tar.gz
Now we’re ready to compile everything. Doing so requires a C++ compiler like g++, so please install one now if it’s not already available. If you are on Ubuntu, the easiest way to do this is to install the build-essential package via sudo apt-get install build-essential.
Once you’re ready, compile like so:
$ cd llvm
$ mkdir build
$ cd build
$ ../src/configure
$ make
LLVM is a large codebase, so this will take a long time (~30 minutes). If all went well, you should see the following message:
llvm[0]: ***** Completed Release+Asserts Build
We have supplied a shell script called startenv.sh that adds LLVM to your PATH and sets up required environment variables. Make sure to source it, or the infrastructure we have provided will not work.
$ source startenv.sh
The script adds the following variables to your shell environment.
Variable | Description
---|---
CSE231ROOT | The project root directory.
LLVMLIB | The LLVM lib directory.
LLVMBIN | The LLVM bin directory.
BENCHMARKS | The location of example programs for testing analyses.
INSTRUMENTATION | The location to place instrumentation code for Sections 2 and 3 of this assignment.
OUTPUTLOGS | The location to place analysis output files.
Finally, run make from the project root directory to generate various project-related files provided by us.
$ cd $CSE231ROOT
$ make
At a high level, LLVM can be split conceptually into three parts: (1) front ends that translate source languages such as C and C++ into the LLVM intermediate representation (IR), (2) the passes that analyze and transform programs in LLVM IR form, and (3) back ends that translate LLVM IR into machine code for a particular architecture.

In this class, we’ll only be adding code to (2). In particular, we will be writing additional passes that analyze and transform LLVM IR programs.
LLVM is architected such that a single run consists of a sequence of many passes run back to back. By keeping the logic of each pass relatively isolated from the others, LLVM achieves a nice modularity that makes it easy to add new passes. Additionally, you’ll probably find that LLVM IR is a refreshingly simple representation compared to, say, C++ or x86 assembly. This allows our analyses to avoid the complexity of reasoning about the idiosyncrasies of those more complex languages.
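To make the shape of a pass concrete, here is a minimal sketch of a function pass modeled on LLVM's Hello example (discussed later in this handout). It targets the LLVM 3.4 API and is not compiled here; the pass flag and description strings are placeholders:

```cpp
#include "llvm/Pass.h"
#include "llvm/IR/Function.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

namespace {
// A pass that runs once per function and makes no changes to the IR.
struct SkeletonPass : public FunctionPass {
  static char ID;
  SkeletonPass() : FunctionPass(ID) {}

  virtual bool runOnFunction(Function &F) {
    errs() << "Visiting function: " << F.getName() << "\n";
    return false; // false = the IR was not modified
  }
};
}

char SkeletonPass::ID = 0;
// Binds the pass to the (hypothetical) -skeleton flag of opt.
static RegisterPass<SkeletonPass> X("skeleton", "A do-nothing example pass");
```

Built into a shared library and loaded into opt, a pass like this would run back to back with whatever other passes are requested on the command line.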
The binary file format for LLVM IR is called bitcode, and by convention it uses the .bc file extension. Although we have supplied makefiles to generate the bitcode for the test programs you are expected to analyze, you will often need to generate your own bitcode for your own purposes (for example, in Section 2 of this assignment). This is the general recipe:
$ clang -O0 -emit-llvm -c welcome.cpp -o welcome.bc
$ llvm-dis welcome.bc
The first line generates the unoptimized bitcode file welcome.bc, and the second line disassembles welcome.bc to welcome.ll, a textual format that you can read more easily.
Go ahead and open $BENCHMARKS/welcome/welcome.ll, which contains the human-readable LLVM IR produced from welcome.cpp in the same directory. You will notice that LLVM IR looks a lot like assembly code. In fact, LLVM IR is an assembly language, with a few unusual features:

- It is explicitly typed: type annotations such as i32 (a 32-bit integer) and i8* (a pointer to an 8-bit integer) appear throughout the code.
- Instead of a fixed set of machine registers, it provides an unlimited number of virtual registers, whose names are prefixed with %.
Otherwise, the usual suspects are there: arithmetic instructions, logical instructions, branch and jump instructions, etc. The LLVM Language Reference lists them all, in case you are interested.
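For example, a trivial function that adds two 32-bit integers might look like this in LLVM IR (a hand-written illustration, not taken from welcome.ll):

```
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}
```

Note the explicit i32 types, the %-prefixed virtual registers, and that each instruction writes to a fresh register.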
LLVM includes an example pass called Hello that simply prints the names of all functions in a program, each prefixed with the string “Hello:”. You can find it at llvm/src/lib/Transforms/Hello/Hello.cpp.
To run this pass, execute the following command from the project root directory:
opt -load $LLVMLIB/LLVMHello.so -hello < $BENCHMARKS/welcome/welcome.bc > /dev/null
Let’s dissect this command line:

- opt is LLVM’s command line tool for executing passes.
- -load $LLVMLIB/LLVMHello.so causes opt to load the shared library that contains the Hello pass.
- -hello tells opt that we wish to run the Hello pass, which was bound to the -hello command line flag by the RegisterPass<Hello> declaration at line 40 of Hello.cpp.
- The output of opt is a transformed program. Since the Hello pass doesn’t perform any transformations, we just redirect the output to /dev/null to ignore it.

Analogous to -hello, opt has flags to enable each of the other passes provided by LLVM. If you’re curious, running opt -help dumps out a massive laundry list of all the passes that are built into LLVM.
Although the Hello pass may be trivial, it contains a lot of the boilerplate that you will need to use and understand to write your own passes. Part of this assignment will be to read the documentation and learn how to use the existing infrastructure provided by LLVM.
Here is some high-level guidance for how to understand Hello.cpp.

- The Pass class: Hello subclasses FunctionPass, which implements functionality for analyses and optimizations that only look at a single function at a time. There are analogous classes like ModulePass and BasicBlockPass for passes that look at entire modules or single basic blocks, respectively.
- For each kind of Pass, there is a corresponding entry point function called runOn<suffix>, where <suffix> depends on the kind of pass. For example, FunctionPass defines a virtual function called runOnFunction(Function &F) that you fill in with your pass’s implementation. You don’t need to worry about how this function gets called: you simply write the details of the function, and the system makes sure it gets called for each function in the program.
- For printing, use the output streams declared in raw_ostream.h rather than those in standard C++. errs() is the error output stream, and there is a corresponding outs() for standard output.

We’re ready to start writing code. In this section, we’ll write a pass that counts the number of static instructions in a program. When you’re done writing this pass, you should have a basic understanding of the in-memory representation of LLVM IR and the API for navigating and inspecting the elements of a program.
Your task is to write a pass that counts the number of instructions in the program. The pass should output the total number of instructions, as well as a per-instruction count. For example, if there are 2 loads and 3 adds, then the output should be similar to this:
load 2
add 3
TOTAL 5
Directions

- Write your pass in $CSE231ROOT/llvm/src/lib/CSE231/CountStaticInstructions.cpp.
- LLVM provides a STATISTIC convenience macro that you may find useful.
- To compile your pass, run make llvm from the project root directory. If everything went well, then the pass should exist in the shared library $LLVMLIB/CSE231.so, ready to load into opt.
- Run your pass on all of the programs in the $BENCHMARKS directory. For each benchmark, generate output file $OUTPUTLOGS/<benchmark name>.static.log.

Hints

- To traverse the program, use the iterator interfaces on modules, functions, and basic blocks. (For example, calling begin() on a Module instance produces an iterator over the functions defined in the module.) This section of the LLVM Programmer’s Guide is a good resource for understanding this interface.
- To identify an Instruction, use Instruction::getOpcode(). Just like in machine languages, each instruction has its own unique integer opcode. Instruction also provides a function to get the name of an instruction based on its opcode.
- Consider using a std::map or LLVM’s DenseMap to store instruction counts.
- To print your results, implement the print method provided by the Pass class, and run opt with the -analyze flag.

The analysis you wrote in Section 1 was a static analysis because it did not actually execute the program in order to analyze it. In this section, we’re going to write a dynamic analysis that executes the program and collects runtime information. In particular, you will write a pass that instruments the original program by inserting code to count the number of times each instruction executes.
The general strategy you should follow:

1. Write instrumentation code: functions that update execution counters and print the final counts, compiled into their own module.
2. Write a pass that instruments the program under analysis by inserting calls to your instrumentation functions.
3. Link the instrumented program with your instrumentation module.
4. Run the resulting executable to collect and output the dynamic counts.
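The instrumentation step might look roughly like the sketch below. This is an uncompiled illustration against the LLVM 3.4 API; the hook name countInstruction and its signature are made up for this example, and real code needs more care than shown here:

```cpp
// Sketch only: requires the LLVM 3.4 headers and is not compiled here.
#include "llvm/IR/Module.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IRBuilder.h"
using namespace llvm;

bool instrumentModule(Module &M) {
  LLVMContext &Ctx = M.getContext();
  // Declare (or find) the hypothetical hook: void countInstruction(i32 opcode).
  Constant *Hook = M.getOrInsertFunction("countInstruction",
                                         Type::getVoidTy(Ctx),
                                         Type::getInt32Ty(Ctx),
                                         NULL);
  for (Module::iterator F = M.begin(); F != M.end(); ++F)
    for (Function::iterator BB = F->begin(); BB != F->end(); ++BB)
      for (BasicBlock::iterator I = BB->begin(); I != BB->end(); ++I) {
        Instruction *Inst = I;          // iterator converts to pointer in 3.4
        if (isa<PHINode>(Inst))
          continue;                     // calls may not precede PHI nodes
        IRBuilder<> B(Inst);            // insertion point: just before Inst
        B.CreateCall(Hook, B.getInt32(Inst->getOpcode()));
      }
  return true; // the module was modified
}
```

The countInstruction function itself would live in your instrumentation module, incrementing a counter keyed by opcode and printing the totals when the program finishes.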
Directions
Directions

- Write your pass in $CSE231ROOT/llvm/src/lib/CSE231/CountDynamicInstructions.cpp.
- Place your instrumentation code in the $INSTRUMENTATION/dynamic directory.
- Run your pass on all of the programs in the $BENCHMARKS directory. For each benchmark, generate output file $OUTPUTLOGS/<benchmark name>.dynamic.log.

Hints

- To insert instructions, look at the IRBuilder class, which is a convenience class for adding instructions. IRBuilder::SetInsertPoint sets the insertion point, and the various Create... methods on the IRBuilder instance insert instructions at the specified point.
- You will likely find the following functions useful: Module::getOrInsertFunction, IRBuilder::CreateCall, FunctionType::get.
- The $INSTRUMENTATION/sample directory contains an example of how to link a module into the welcome benchmark, then execute the linked binary.
- Given a bitcode file, you can use llc to produce a .cpp file containing the API calls needed to reconstruct that bitcode: llc -march=cpp file.bc
Now, write a dynamic analysis that computes the branch bias on a per-function basis: count the number of times branch instructions are taken/not taken and compute the taken/total ratio. The analysis output should look like this:
Function Bias Taken Total
func1 0.23 23 100
func2 0.48 48 100
func3 0.79 79 100
Do not demangle the function names. Only consider conditional branches. A branch is considered taken if its condition evaluates to true.
Directions

- Write your pass in $CSE231ROOT/llvm/src/lib/CSE231/BranchBias.cpp.
- Place your instrumentation code in the $INSTRUMENTATION/branchbias directory.
- Run your pass on all of the programs in the $BENCHMARKS directory. For each benchmark, generate output file $OUTPUTLOGS/<benchmark name>.branchbias.log.

Hints

- The hints from Section 2 apply here as well, particularly llc -march=cpp.

Write a short document (2-3 pages) describing your algorithm, implementation, and results for Sections 1, 2, and 3. Here are some guidelines:
Create directory $CSE231ROOT/runscripts and populate it with shell scripts (that you write) to run your analyses. The contents of this directory should be three scripts named run-section1.sh, run-section2.sh, and run-section3.sh. Each script should take a single argument that is the path to the benchmark on which it runs. For instance, run-section3.sh $BENCHMARKS/hadamard should run the analysis from Section 3 on the hadamard benchmark.

These scripts can assume that CSE231.so has already been compiled. They can also assume that bitcode files for the benchmarks have already been generated. However, they should not assume that your instrumentation code has been compiled yet. (In other words, your scripts should ensure that the instrumented bitcode gets generated correctly.)
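As a sketch, run-section1.sh might look something like the following. The pass flag -count-static and the bitcode file layout inside each benchmark directory are guesses for illustration; adjust them to match your actual pass registration and the files produced by the provided makefiles:

```sh
#!/bin/sh
# Usage: run-section1.sh <path-to-benchmark>
BENCH="$1"
NAME=$(basename "$BENCH")

# Run the static-count pass (hypothetical flag: -count-static) over the
# benchmark's bitcode, writing the analysis output to the required log file.
opt -load "$LLVMLIB/CSE231.so" -count-static -analyze \
    < "$BENCH/$NAME.bc" > "$OUTPUTLOGS/$NAME.static.log"
```

The Section 2 and 3 scripts additionally need to compile and link the instrumentation bitcode before running the instrumented executable.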
Create a text file $CSE231ROOT/MEMBERS that lists the members of the team in this format:
<pid> <last name> <first name>
<pid> <last name> <first name>
<pid> <last name> <first name>
Place your project report at $CSE231ROOT/report.pdf.
Download the file package.sh (link) and place it in $CSE231ROOT. When you are ready to turn in, execute package.sh from $CSE231ROOT, which will perform some simple sanity checking and generate an archive cse231-proj1.turnin.tar.gz that contains the contents of $CSE231ROOT. Follow the instructions given by package.sh to submit your archived project.