1. The AMD Bulldozer microarchitecture has a 16K, 4-way, 64-byte blocked L1 data cache in each core. This processor uses 64-bit memory addresses. For this processor, please answer the following questions:

   (1) When the L1 data cache receives a memory access request, how many bits will be used as “tag”? How many bits will be used as “index”? How many bits will be used as “offset”?

   (2) In addition to the data array, the cache needs to contain tag arrays and dirty bits that are considered as overheads to the cache. Including these overheads, how many bits are required to build this cache?
2. Consider the following matrix multiplication code

```c
int i, j,k;
double *A, *B, *C;
A = (double *)malloc(sizeof(double)*N*N);
B = (double *)malloc(sizeof(double)*N*N);
C = (double *)calloc(N*N, sizeof(double));
init_data(A, B, N*N);
for(i = 0; i < N; i++)
    for(j = 0; j < N; j++)
        for(k = 0; k < N; k++)
            C[i*N+j] += A[i*N+k]*B[k*N+j];
// assume load A[i*N+k], load B[k*N+j], load C[i*N+j] then
store C[i*N+j]
output_data(C, N*N);
```

Assume that the starting address of array A is 0x10000, array B is 0x18000, and array C is 0x20000. Assume N = 64. Please answer the following questions:

(1) How many memory accesses are there in the nested for-loop? How many of them are writes? How many of them are reads?
(2) Assuming that you have an AMD Phenom II processor, which has a 64KB, 2-way, 64-byte blocked L1 data cache, please estimate the cache miss rate.
3. You are building a system around a single-issue in-order processor running at 1 GHz and the processor has a base CPI of 1 if all memory accesses are hits. The only instructions that read or write data from memory are loads (20% of all instructions) and stores (5% of all instructions). The memory system for this computer is composed of a split L1 cache that imposes no penalty on hits. Both the I-cache and D-cache are direct mapped and hold 32KB each. You may assume the caches use write-allocate and write-back policies. The L1 I-cache has a 2% miss rate and the L1 D-cache has a 5% miss rate. Also, 50% of all blocks replaced from L1 D-cache are dirty. The 512KB write-back, unified L2 cache has an access time of 10ns. Of all memory references sent to the L2 cache in this system, 80% are satisfied without going to main memory. Also 25% of all blocks replaced are dirty. The main memory has an access latency of 60ns.

(1) What is the overall CPI, including memory accesses?
(2) You are considering replacing the 1 GHz CPU with one that runs at 2 GHz, but is otherwise identical. How much faster does the system run with a faster processor? Assume the L1 cache still has no hit penalty, and that the speeds of the L2 cache, and main memory remain the same in absolute terms (e.g. the L2 cache still has a 10 ns access time).
4. Please download the C code from
http://cseweb.ucsd.edu/classes/sp16/cse141-a/homework/homework4.tar.gz and compile the source files on a machine.

(1) Please find out the cache configuration of the machine and fill the following table:

<table>
<thead>
<tr>
<th>Memory layer</th>
<th>Capacity</th>
<th>Block size</th>
<th>Way associativity</th>
</tr>
</thead>
<tbody>
<tr>
<td>L1 data cache</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>L2 cache</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>L3 cache</td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

(2) Please execute the program “sampling” and draw a line graph in which the x-axis represents the “distance between samples” and the y-axis represents the “throughput (MB/secs)”.  

(3) Explain the performance curve of the line graph in (2) using the cache configuration on your machine.
(4) Please execute the program “sampling_2” and draw a line graph in which the x-axis represents the “array size” and the y-axis represents the “throughput (MB/secs)”.

(5) Explain the performance curve of the line graph in (4) using the cache configuration on your machine.