CSE 240B Parallel Computer Architecture - Winter 2010

    

Course Goals

This class is designed to enable students to follow the latest developments in computer architecture, especially those related to parallel computer architecture. Although this is clearly useful for those who wish to do research in computer architecture, it is also useful for those who work in related areas or who have general interests. The class pursues these goals in four ways:
  1. Covering advanced material that is commonly understood by practicing architects but is not covered in core-level grad classes.

  2. Presenting programming assignments that facilitate advanced understanding.

  3. Providing students with the opportunity to
    { find, analyze, communicate, discuss } advanced material.

  4. Examining fundamental ideas that are the "frontier" of the field and have not yet made it into industry. This will be done through reading papers and through course assignments.

      
Prof. Michael Taylor

Topics

Directory Coherence
Memory Consistency
Interconnection Networks
Synchronization
Transactions
Streams
Vector Architectures
Heterogeneous Multi-core
Simultaneous Multi-threading
Cell Architecture
GPUs
Tiled Architectures
CUDA
Cilk
Arsenal Style Processors

Announcements

January 5, 2010: The course forum is up! Make sure to sign up in order to receive important course details. Click here to join. You must give your name as your nickname, and enter your UCSD email address in the information box. It may take a day or two for you to be confirmed.
January 5, 2010: 240B is being held in room 4140 in the Computer Science and Engineering Building!
January 10, 2010: Yes, one analysis per paper! No analysis for textbook items UNLESS specified below.

Course Materials

The class will consist of readings generally found in the following locations:
  1. IEEE Xplore (free access from UCSD network)
    For IEEE publications.

  2. ACM Portal (free access from UCSD network)
    For ACM publications.

  3. Computer Architecture: A Quantitative Approach, Hennessy & Patterson.
    Hopefully, you already have this.


Grading

Since this is an advanced graduate class, I expect a high level of work quality and independence. Participation in in-class discussions is an integral part of the course and makes up a significant component of the participation grade.

In true computer architecture form, your Spec240B number (also known as your grade!) includes a multiplication:

Final Grade = Paper Analysis * ( Class Participation      20%
                               + Midterm                  20%
                               + Programming Assignments  40%
                               + Final Exam               20% )
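Because Paper Analysis multiplies the weighted sum rather than adding to it, skipping the readings scales down every other component. A minimal sketch of this computation (the variable names and example scores below are hypothetical, not official course data):

```python
# Sketch of the Spec240B formula: the Paper Analysis score (a fraction
# in [0, 1]) multiplies the weighted sum of the other components.
# Weights come from the grading table; example scores are made up.

WEIGHTS = {
    "class_participation": 0.20,
    "midterm": 0.20,
    "programming_assignments": 0.40,
    "final_exam": 0.20,
}

def final_grade(paper_analysis: float, scores: dict) -> float:
    """Multiply the paper-analysis fraction into the weighted component sum."""
    weighted_sum = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    return paper_analysis * weighted_sum

# Example: strong component scores, compared with and without the readings.
scores = {
    "class_participation": 0.90,
    "midterm": 0.85,
    "programming_assignments": 0.95,
    "final_exam": 0.80,
}
print(final_grade(1.0, scores))  # all analyses thoughtful
print(final_grade(0.5, scores))  # only half done: everything is halved
```

Note how a student with excellent exams and assignments still loses half the grade if only half the paper analyses are submitted.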

Paper Analysis

Since much of the class will consist of discussions, it is absolutely essential that you do the reading BEFORE class. There is no greater waste of everybody's time than a discussion class where nobody has read the paper. To help keep the quality of class high, we will collect responses to a set of questions for each paper via a Google form. A script will be used to harvest results; it will take the last entry that you submitted before 1:30 p.m. on the day of the class. I will provide a 5-minute grace period, no exceptions.

If you have concerns about this policy, I recommend you submit early to give yourself margin for unexpected issues; maybe even the day before! To ensure that you receive credit, you should record your notes in a Google Doc and then copy them into the form when you are done. This way, if there are any issues with the form, you will be able to show both your content and the times it was written. I will provide people with a list of their answers at the end of class to confirm grades.

Your Paper Analysis grade will be based on the percentage of paper summaries that you have filled out with thoughtful (but not necessarily perfect) answers. If 90% appear thoughtful, you will receive full credit. Thus, if an act of God or force majeure causes two classes' worth of entries to go unsubmitted, it will not affect your grade. However, a third such event will affect your grade as much as getting 25% wrong on your final!

Programming Assignments

Thanks to the generosity of SDSC, the programming assignments will be done on cutting-edge 8-core to 32-core Nehalem machines at the San Diego Supercomputer Center. Awesome! We will be using the Cilk programming language.

Midterm and Final Exams

The midterm and final exams will test different things. The midterm will test your ability to really understand a subject in depth. The final will cover the breadth of your knowledge of the material in the class. This material will include both the reading and topics that come up in class discussion. Some of these topics will almost certainly not be in the reading. Generally speaking, the final will be fairly easy if you have done the reading carefully, participated in the in-class discussions, and jotted down a few keywords to remind yourself what to study later (e.g., via internet sources).

The Midterm

Each student's "midterm exam" consists of presenting the papers assigned for a particular day (20 minutes per student if presenting with another student; otherwise 30-45 minutes). Each day, one or two students take responsibility for being the "class experts" for a particular set of papers that we read. They will give a 40-minute presentation on the papers (roughly 20 minutes each). This presentation should overview background material, place the research in the context of related work and the time it was written, motivate the problem the paper is trying to solve, and present the key ideas in the paper, using the IMD (ideas, mechanisms, dinosaurs) framework. Be sure to go through the key architectural mechanisms proposed in the paper in detail. You must create your own slides; you may not simply download a talk from the internet. However, you may reuse diagrams.

Each student will have to do this once or twice, depending on enrollment.

In order to do this, student experts must read other materials outside the paper (tracing back the references) in order to give the class additional context. It should be clear from the presentation that the students have done this.

I expect that student experts will practice the presentation in order to make sure that it fits in the allotted time and has appropriate transitions. This is good preparation for giving talks, whether for communicating one's own research or for a research exam.

Logistically, this works in the following way - students will sign up for a given day ahead of time. They will prepare the slides, and will send them to me on the following schedule:

Day Due
Tuesday Classes Previous Wednesday, 11 p.m.
Thursday Classes Previous Thursday, 11 p.m.

If I do not receive the slides by those times, a late penalty may apply. I will follow up with feedback by Sunday at noon, quite possibly earlier. Remember, this is worth 20% of your Spec240B. After getting feedback, you will revise the presentation, practice it, and then give it in class. During the presentation, students are free to ask questions of the experts, and if a discussion topic arises, we will pause to discuss it.

Reading Assignments

The readings

Check this regularly for updates; I may post clarifications!

Abbreviations (if none given, available on IEEE Xplore or ACM Portal):
H&P: Computer Architecture: A Quantitative Approach, 4th Ed., Hennessy and Patterson.


Note: calendar below subject to change!
Due         Item
Tu Jan 5 First Class.
Th 7 Multiprocessors / Coherence
H & P 4th ed.: 4.1-4.4
Tu 12 Advanced Coherence and DSM

D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, ``The directory-based cache coherence protocol for the DASH multiprocessor,'' in ISCA '90: Proceedings of the 17th annual international symposium on Computer Architecture, pp. 148-159, 1990 link.

D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, ``The DASH prototype: implementation and performance,'' in ISCA '98: 25 years of the international symposia on Computer architecture (selected papers), pp. 418-429, 1998 link.

H & P 4.5 - 4.8 (please submit an "analysis" for the description of the Sun T1, for a total of 3 analyses today)
Th 14
Tu 19 Tiled Microprocessors
Tiled Microprocessors are interesting because they blur the boundaries between multiprocessors and microprocessors.

The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs, by Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Jae-Wook Lee, Paul Johnson, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe and Anant Agarwal. IEEE Micro, March/April 2002.

Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams, by Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Proceedings of the International Symposium on Computer Architecture, June 2004.

Space-Time Scheduling of Instruction-Level Parallelism on a Raw Machine, by Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek Sarkar, and Saman Amarasinghe. Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), San Jose, CA, October 4-7, 1998.
Tu 26 Tilera: Tiled Processors Commercialized
Presenter: Gopi
Bell, S et al. "TILE64 - Processor: A 64-core SoC with Mesh Interconnect", ISSCC, February 2008.

Wentzlaff et al. "On-Chip Interconnection Architecture of the Tile Processor", IEEE Micro, Sept-Oct 2007.

TILE-Gx Processor Family

Tilera: 5 Innovations
Tu Feb 2 Assignment 1 Out.
Th Feb 4 Tiled Discussion.
Tu Feb 9 Assignment 2.
Tu 16 IBM / Sony Cell
Presenter: Samson
M. Baron, ``The Cell, At One,'' Microprocessor Report, March 2006 link.

Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor, by Pham et al. IEEE Journal of Solid-State Circuits, January 2006.

The Microarchitecture of the Synergistic Processor for a Cell Processor, by Flachs et al. IEEE Journal of Solid-State Circuits, January 2006.
Th 18 Vectors

Appendix F of H & P.
Su 21 Assignment 2 due (11:59p); Assignment 3 out
Tu 23 Tiled Dataflow
Presenter: Sravanthi Kota Venkata
K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore, ``Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture,'' ISCA 2003 link.

S. Swanson, A. Schwerin, M. Mercaldi, A. Petersen, A. Putnam, K. Michelson, M. Oskin, and S. J. Eggers, ``The WaveScalar Architecture.''
To Appear in ACM Transactions On Computer Systems. link
Th 25 Cilk
The Implementation of the Cilk-5 Multithreaded Language. Frigo et al. PLDI 1998.

The Cilk++ Concurrency Platform. Leiserson, C. E. Design Automation Conference (DAC '09), 2009, pp. 522-527.

SMT
Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Dean Tullsen, Susan Eggers, Joel Emer, Henry Levy, Jack Lo, and Rebecca Stamm. Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

E. İpek, M. Kırman, N. Kırman, and J. F. Martínez. Core Fusion: Accommodating software diversity in chip multiprocessors. In Intl. Symp. on Computer Architecture, San Diego, CA, June 2007.
Mar 1
Programming Assignment Three Due (11:59p)
Tu Mar 2 Streams
J.D. Owens, P.R. Mattson, S. Rixner, W.J. Dally, U.J. Kapasi, B. Khailany, A. Lopez-Lagunas, "A Bandwidth-Efficient Architecture for Media Processing," 31st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'98), 1998

Rixner et al. Memory Access Scheduling. International Symposium on Computer Architecture, Vancouver, British Columbia, Canada, 2000.

Pizza Party; brief analysis of experiment outcome
Th 4 GPU
NVIDIA Tesla: A Unified Graphics and Computing Architecture. Lindholm, E.; Nickolls, J.; Oberman, S.; Montrym, J. IEEE Micro, March/April 2008.

Scalable Parallel Programming with CUDA. Nickolls et al. ACM Queue. 2008.
Tu 9 Arsenal Style Processors

Conservation Cores: Reducing the Energy of Mature Computations. Ganesh Venkatesh, John Sampson, Nathan Goulding, Saturnino Garcia, Slavik Bryskin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Architectural Support for Programming Languages and Operating Systems, March 2010. link
Asanovic et al, "A View of the Parallel Computing Landscape", Communications of the ACM, October 2009.
Th 11 No class.
Th 18 Final Exam

email: mbtaylor at you see ess dee dot ee dee you
web:   Michael Taylor's Website.