Recent Updates


The purpose of this project is to get you acquainted with LLVM. In particular, it will give you familiarity with the data structures, types, and code organization you will need to understand to implement program analyses.

For this class, we require that you use Linux for development.1 If Linux is not your primary operating system, please use a virtual machine (e.g., VirtualBox). We recommend Ubuntu 12.04 as a relatively pain-free Linux distribution. All code will be written in C++, as that is the language in which LLVM is written.

Section 0: Getting Started

In this section, we’ll get started by downloading and compiling the appropriate LLVM source packages to set up your development environment. Then we’ll test out the installation by running a simple analysis on a sample program.

Installing LLVM

First, create a new directory for this project.

$ mkdir cse231-proj1
$ cd cse231-proj1
$ mkdir llvm

Download LLVM 3.4 and Clang 3.4 to the project directory and decompress them as follows.

$ tar xvzf llvm-3.4.src.tar.gz
$ tar xvzf clang-3.4.src.tar.gz
$ mv llvm-3.4 llvm/src
$ mv clang-3.4 llvm/src/tools/clang

Now download the Project 1 Utilities and extract them in the project root.

$ tar xvzf cse231-proj1.tar.gz

Now we’re ready to compile everything. Doing so requires a C++ compiler like g++, so please install one now if it’s not already available. If you are on Ubuntu, the easiest way to do this is to install the build-essential package via sudo apt-get install build-essential.

Once you’re ready, compile like so:

$ cd llvm
$ mkdir build
$ cd build
$ ../src/configure
$ make

LLVM is a large codebase, so this will take a long time (~30 minutes). If all went well, you should see the following message:

llvm[0]: ***** Completed Release+Asserts Build

Setting Up Your Environment

We have supplied a shell script called startenv.sh that adds LLVM to your PATH and sets up required environment variables. Make sure to source it, or the infrastructure we have provided will not work.

$ source startenv.sh

The script adds the following variables to your shell environment.

Variable Description
CSE231ROOT The project root directory.
LLVMLIB The LLVM lib directory.
LLVMBIN The LLVM bin directory.
BENCHMARKS The location of example programs for testing analyses.
INSTRUMENTATION The location to place instrumentation code for Sections 2 and 3 of this assignment.
OUTPUTLOGS The location to place analysis output files.

Important! Remember to source startenv.sh in any shell you use for this class.

Finally, run make from the project root directory to generate various project-related files provided by us.

$ cd $CSE231ROOT
$ make

The Design of LLVM

At a high level, LLVM can be split conceptually into three parts:

  1. A compiler frontend called Clang that takes C++ and translates it to a simplified program representation called LLVM IR,
  2. a target-independent backend that performs various analyses and transformations on LLVM IR, and
  3. a target-dependent backend that lowers LLVM IR to native machine code like x86.

In this class, we’ll only be adding code to (2). In particular, we will be writing additional passes that analyze and transform LLVM IR programs.

LLVM is architected such that a single run consists of a sequence of many passes run back to back. By keeping the logic of each pass relatively isolated from the others, LLVM achieves a nice modularity that makes it easy to add new passes. Additionally, you’ll probably find that LLVM IR is a refreshingly simple representation compared to, say, C++ or x86 assembly. This allows our analyses to avoid the complexity of reasoning about the idiosyncrasies of those more complex languages.

First Look at LLVM IR

The binary file format for LLVM IR is called bitcode, and by convention it uses the .bc file extension. Although we have supplied makefiles to generate the bitcode for the test programs you are expected to analyze, you will often need to generate your own bitcode for your own purposes (for example, in Section 2 of this assignment). This is the general recipe:

$ clang -O0 -emit-llvm -c welcome.cpp -o welcome.bc
$ llvm-dis welcome.bc

The first line generates the unoptimized bitcode file welcome.bc, and the second line disassembles welcome.bc to welcome.ll in a textual format that you can read more easily.

Go ahead and open $BENCHMARKS/welcome/welcome.ll, which contains the human-readable LLVM IR produced from welcome.cpp in the same directory. You will notice that LLVM IR looks a lot like assembly code. In fact, LLVM IR is an assembly language, with a few unusual features:

  • It contains types, such as i32 and i8*
  • It has an unlimited number of registers with names that start with %
  • No register occurs on the left hand side of more than one assignment.2

Otherwise, the usual suspects are there: arithmetic instructions, logical instructions, branch and jump instructions, etc. The LLVM Language Reference lists them all, in case you are interested.

First Look at an LLVM Pass

LLVM includes an example pass called Hello that simply prints the names of all functions in a program, each prefixed with the string “Hello:”. You can find it at llvm/src/lib/Transforms/Hello/Hello.cpp.

To run this pass, execute the following command from the project root directory:

opt -load $LLVMLIB/LLVMHello.so -hello < $BENCHMARKS/welcome/welcome.bc > /dev/null

Let’s dissect this command line:

  • opt is LLVM’s command line tool for executing passes.
  • -load $LLVMLIB/LLVMHello.so causes opt to load the shared library that contains the Hello pass.
  • -hello tells opt that we wish to run the Hello pass, which was bound to the -hello command line flag by the RegisterPass<Hello> declaration at line 40 of Hello.cpp.
  • By default, the output of opt is a transformed program. Since the Hello pass doesn’t perform any transformations, we just redirect the output to /dev/null to ignore it.

Analogous to -hello, opt has flags to enable each of the other passes provided by LLVM. If you’re curious, running opt -help dumps out a massive laundry list of all the passes that are built into LLVM.

Although the Hello pass may be trivial, it contains a lot of the boilerplate that you will need to use and understand to write your own passes. Part of this assignment will be to read the documentation and learn how to use the existing infrastructure provided by LLVM.

Here is some high-level guidance for how to understand Hello.cpp.

  • In LLVM, each pass is implemented in a separate C++ class that inherits from one of the subclasses of the Pass class: Hello subclasses FunctionPass, which implements functionality for analyses and optimizations that only look at a single function at a time. There are analogous passes like ModulePass and BasicBlockPass that look at entire modules or single basic blocks, respectively.
  • For each kind of Pass, there is a corresponding entry point function called runOn<suffix>, where <suffix> depends on the kind of pass. For example, FunctionPass defines a virtual function called runOnFunction(Function &F) that you fill in with your pass’s implementation. You don’t worry about how this function gets called: you simply write the details of the function, and the system makes sure it gets called for each function in the program.
  • Idiomatic I/O in LLVM uses the output streams provided by raw_ostream.h rather than those in standard C++. errs() is the error output stream, and there is a corresponding outs() for standard output.
  • Read the Quick Start section of Writing An LLVM Pass, which goes line-by-line explaining Hello.cpp. Use the rest of the guide as a resource for navigating the LLVM APIs.
  • For more detailed reference and guidance, see the LLVM Programmer’s Manual, especially the section Helpful Hints for Common Operations. LLVM’s documentation is quite extensive, as you can see from the full list here.

Section 1: Collecting Static Instruction Counts

We’re ready to start writing code. In this section, we’ll write a pass that counts the number of static instructions in a program. When you’re done writing this pass, you should have a basic understanding of the in-memory representation of LLVM IR and the API for navigating and inspecting the elements of a program.

Your task is to write a pass that counts the number of instructions in the program. The pass should output the total number of instructions, as well as a per-instruction count. For example, if there are 2 loads and 3 adds, then the output should be similar to this:

load  2
add   3
TOTAL 5

Directions

Hints

Section 2: Collecting Dynamic Instruction Counts

The analysis you wrote in Section 1 was a static analysis because it did not actually execute the program to analyze it. In this section, we’re going to write a dynamic analysis that executes the program and collects runtime information. In particular, you will write a pass that instruments the original program by inserting code to count the number times each instruction executes.

The general strategy you should follow:

  1. Write a C++ module that contains helper function(s) and global data structure(s) that your instrumentation code will use.
  2. From your LLVM pass, instrument the program with calls to your module’s functions at the correct locations.
  3. When it’s time to run the analysis, run your LLVM pass on the original bitcode file, link your helper module to the instrumented bitcode, then execute the linked program. The instrumented program should output the analysis results at program termination.

Directions

Hints

Section 3: Profiling Branch Bias

Now, write a dynamic analysis that computes the branch bias on a per-function basis: count the number of times branch instructions are taken/not taken and compute the taken/total ratio. The analysis output should look like this:

Function Bias Taken Total
func1    0.23 23    100
func2    0.48 48    100
func3    0.79 79    100

Do not unmangle the function names. Only consider conditional branches. A branch is considered taken if its condition evaluates to true.

Directions

Hints

Project Report

Write a short document (2-3 pages) describing your algorithm, implementation, and results for Sections 1, 2, and 3. Here are some guidelines:

Turnin Instructions

  1. Create directory $CSE231ROOT/runscripts and populate it with shell scripts (that you write) to run your analyses. The contents of this directory should be three scripts named run-section1.sh, run-section2.sh, and run-section3.sh. Each script should take a single argument that is the path to the benchmark on which it runs. For instance, run-section3.sh $BENCHMARKS/hadamard should run the analysis from Section 3 on the hadamard benchmark.

    These scripts can assume that CSE231.so has already been compiled. They can also assume that bitcode files for the benchmarks have already been generated. However, they should not assume that your instrumentation code has been compiled yet. (in other words, your scripts should ensure that the instrumented bitcode gets correctly generated).

  2. Create a text file $CSE231ROOT/MEMBERS that lists the members of the team in this format:

    <pid> <last name> <first name>
    <pid> <last name> <first name>
    <pid> <last name> <first name>
  3. Place your project report at $CSE231ROOT/report.pdf.

  4. Download the file package.sh (link) and place it in $CSE231ROOT. When you are ready to turnin, execute package.sh from $CSE231ROOT, which will perform some simple sanity checking and generate an archive cse231-proj1.turnin.tar.gz that contains the contents of $CSE231ROOT. Follow the instructions given by package.sh to submit your archived project.


  1. Although LLVM is also available on Windows and Mac OS X, they each have their idiosyncrasies that will make life harder than it needs to be.

  2. This is called Static Single Assignment Form, which will be covered later in the quarter.