Cse240a Project 1: Data prefetching competition


Due: Thursday, March 5, at 11 am

Introduction:

In this project you are asked to implement a hardware prefetcher. We use the simulation framework provided by the Data Prefetch Championship. A CMP simulator is given as a library, along with a header file specifying the prefetcher API. Your task is to implement a prefetcher using a prefetching algorithm of your own choice. The effectiveness of your prefetcher will be measured against a baseline prefetcher. You will also compete against your fellow classmates for extra credit!

Please note that, unlike the homeworks, you are asked to complete this project individually.

Simulation Infrastructure:

Your prefetcher will work in the context of a well-defined simulation framework based on the CMP$im simulator. The framework models a simple out-of-order core with the following basic parameters:

Setting up the simulation infrastructure:

Prefetcher Interface:

Your main task in this project is to implement a hardware prefetcher using the given prefetcher API. The prefetcher has access to a processor state structure that reflects the processor state from the last cycle. The following data fields are updated every cycle and can be polled by the prefetcher:

An L1 state structure is updated with the following information about events in the last cycle:

Up to 4 records can be given in a single cycle depending on the number of requests (up to 4) issued to the L1 data cache. If a requested address misses in the L1 data cache, a request is made to the L2 data cache. Each cycle, an L2 state structure is updated with the following information:

In addition to this information, you are provided with 32 Kbits of storage to implement the state machines that manage your prefetching algorithm. The algorithm consists of prefetchers for the L1 data cache and the L2 data cache, and the storage limit covers all state required by both. The limit applies only to the storage used by the algorithm's state; you will not be penalized for the complexity of the algorithm's logic. Your source code must clearly indicate which variables are used as state, and your project report must include a detailed accounting of how much state is kept.

There are three files under the directory /src/prefetch/: interface.h, sample_prefetcher.h, and sample_prefetcher.cpp. The file interface.h declares three prefetching functions that must be implemented; you may not modify this file. The only files you should modify, and the only source code you will submit, are sample_prefetcher.h and sample_prefetcher.cpp. A sample prefetcher implementation has been provided as a starting point.

Generating traces to simulate:

To generate a trace for any single-threaded application, do the following (from the PREF_KIT/ directory):

	mkdir traces
	cd traces/
	../pin/pin -t ../bin/CMPsim.gentrace32 -threads 1 -o [trace file name] -- [app] [app_args]

For example, to generate a trace for the application "/bin/ls" with the arguments "-al":

	../pin/pin -t ../bin/CMPsim.gentrace32 -threads 1 -o ls.out -- /bin/ls -al

This will create three files:

	ls.out.gz       ==> a statistics file reporting how many instructions were traced
	ls.out.trace.gz ==> the collected trace, in binary form
	ls.out.dep.gz   ==> an ASCII representation of the instruction dependency information used for out-of-order simulation

To collect traces for a snippet of an application, use the -fwd and -icount options to the pin tool.

	-fwd    : e.g. -fwd 10 fast-forwards past 10 million instructions (the argument is in millions)
	-icount : e.g. -icount 10 traces 10 million instructions of the application

For example, to generate a trace of "ls -al" that fast-forwards for 1 million instructions and traces another 1 million instructions:

	../pin/pin -t ../bin/CMPsim.gentrace32 -threads 1 -o ls.out -fwd 1 -icount 1 -- /bin/ls -al

To run the simulator on the generated trace, do the following (from the PREF_KIT/ directory):

	mkdir runs
	cd runs/
	../pin/pin -t ../bin/CMPsim.usetrace -threads 1 -t [trace file name] -- /bin/ls

Note that the '/bin/ls' at the end exists only to complete the pin command-line syntax; any valid command can be used.

The trace file name is the name of the trace collected by the CMPsim.gentrace tool. For example, if you wanted to run the ls.out.trace.gz trace generated above, you can run:

	../pin/pin -t ../bin/CMPsim.usetrace -threads 1 -t ../traces/ls.out.trace.gz -o ls.out.stats -- /bin/ls

Configurations Used for Evaluation:

Your code will be evaluated based on the measured performance of your prefetching algorithm across three configurations that differ in L2 data cache size and cache/memory bandwidth as follows:

Below are the command lines for the three configurations, with and without prefetching. (All commands are executed from the PREF_KIT/runs directory. Note that for the actual traces you may need to change the -icount parameter to 100 so that the simulation runs for 100 million instructions.)

Simulation Statistics

The given simulator will output load/store hit/miss counts for all caches. This information will help you understand how your prefetcher is performing and why.

The product of CPI and the average memory access time (AMAT) will be used to compare your prefetcher to the baseline and to your classmates' prefetchers. AMAT is computed as follows:
AMAT = 1 + (L1_misses * 20 + L2_misses * 200)/L1_accesses
The values of L1 misses, L2 misses, and L1 accesses are in the [trace].out.configi_pf.stats.gz file.
We provide 4 traces from SPEC 2000 (art, mcf, parser, and twolf) for you to test your prefetcher. These traces were generated for 50 million instructions, so the -icount flag should be set to 50 when you run them. For example, to run art with configuration 2, use the following command:

	../pin/pin -t ../bin/CMPsim.usetrace -threads 1 -t ../traces/art.out.trace.gz -o art.out.config2_pf.stats -icount 50 -l2lat 20 -dramlat 200 -cache DL1:32:64:8 -cache UL2:2048:64:16 -pref 1 -bwlimited 1 -- /bin/ls

Prefetcher starting points

Here are some papers to get you thinking about different approaches to prefetching. You are not required to choose an algorithm from these papers, but they can provide useful starting points, and their bibliographies will provide pointers to other papers on the topic.

Collaboration

You should feel free to discuss the project with others in the class, including sharing detailed performance results of your prefetcher. Sharing code is expressly forbidden.

Evaluation

50% of your grade for the project will be based on your write-up of the prefetcher you implemented; the other 50% will be based on its performance. In your report you should discuss how the prefetcher works and why it performs the way it does.

Fabulous PRIZES!!!

The authors of the top three prefetchers will receive prizes and 3 points of extra credit toward their final grades. The prizes will be awarded in class on March 10th.

Deliverable

Project Report

Your report should consist of the following sections:

  • Description of Prefetcher - One or two paragraphs describing the prefetching algorithm that you are using. If your prefetcher is based on an existing design, you must explicitly state the original source. While you are free to implement existing designs, failure to provide a citation will result in you receiving no credit for the project. Your description should include the rationale for the design, supported, where appropriate, by data evaluating the "tweaks" or changes you made to improve performance.
  • State Accounting - How much state your prefetcher uses as well as a detailed description of how and where the state is used. For example, if you have a prediction table, you should list its size as well as its layout.
  • Result table - Put your simulation results in a table as follows:
             |              Config 1              |              Config 2              |              Config 3
             | L1 miss L2 miss AMAT CPI AMAT*CPI  | L1 miss L2 miss AMAT CPI AMAT*CPI  | L1 miss L2 miss AMAT CPI AMAT*CPI
      art    |
      mcf    |
      parser |
      twolf  |
    You need to run all 3 prefetching configurations for each trace. For each configuration, 5 values should be reported: the L1 miss count, the L2 miss count, the average memory access time (AMAT), the CPI, and the product of CPI and AMAT.
Your report should not exceed 3 pages, and should be in PDF format and have the following name: lastname-firstname-cse240a-wi09-proj.pdf

Source Code

You only need to submit two source code files: lastname-firstname-prefetcher.h and lastname-firstname-prefetcher.cpp. Your prefetcher must compile against the unmodified 32-bit simulation infrastructure that has been provided. You can test your code on the APE lab computers. These files should contain your name as well as your PID at the top.

Electronic Submission

Please submit the report and source code file via e-mail to Chengmo.

Due: 5 pm, March 5