CSE 260 (Fall 2008): Notes for Lecture 1 (9/25/08)

Introduction to the course

The URL for the class home page is http://www.cse.ucsd.edu/classes/fa08/cse260. Important announcements will be made via this page, so check it regularly for changes.

There is one required text for the course:

        Introduction to Parallel Computing 2nd Ed, by Grama, Gupta, Karypis, and Kumar. ISBN 0-201-64865-2, Addison-Wesley Publisher, 2003
        Be sure to get the 2nd edition, not the 1st edition.

A handy book on MPI is

      Parallel Programming with MPI, by P. Pacheco, Morgan Kaufmann Publishers, 1997.

I recommend that you purchase this book if you anticipate doing a lot of MPI programming in the future, or if you are more comfortable using a text than on-line materials. See also the list of MPI reading materials at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html.

Additional course readings will be posted on-line or handed out in class.

Background

The prerequisite for CSE 260 is graduate standing. I recommend that you have a background in undergraduate-level computer architecture or operating systems. This background may also be met by equivalent work experience. Students from other departments are welcome; contact me if you have any questions about your academic background.
 


Grading and Course policies

Grading will be based on homework assignments, class participation, and the project. Written homework assignments must be done individually, but the programming assignments may be done in teams of two.  Be sure to complete all assigned readings and be prepared to discuss them in class.

Homework assignments (45%)
There will be 4 homework assignments. The assignments will vary in form: commenting on assigned readings, problem sets, working out pencil and paper designs, programming and experimentation. Some will be more involved than others. The written assignments must be done individually. The programming assignments may be done in teams of two. I encourage you to discuss the assignments with your classmates, but be sure to think and write independently. If you aren't sure about this, see me. Here are some notes on technical writing.
 
Class participation (10%)
Your grade will be based in part on your advance preparation and participation in lecture. Be prepared to ask and answer questions, to contribute to the discussions, and to comment on the readings. You may find it helpful to keep a journal of what you've read, especially research papers that will be assigned from time to time. Here are some tips on writing summaries.
 
Project (45%)
One of the purposes of this course is to teach you how to conduct research. A project may be done individually, but many projects are better suited to a team of 2. Extreme programming is an extremely helpful technique in parallel programming! You should allow 5 weeks to complete the project. I will discuss the projects in a couple of weeks. If you have an idea for a research project, come see me.
 

Academic Integrity

I expect you to do your own work. Though I don't expect there to be a problem in a graduate course, academic honesty will be strictly enforced. Anyone found plagiarizing another's work, or making their own work available to others, will receive a 0 on the assignment in question and face other possible consequences. You are assumed to be familiar with the Academic Integrity Policies for this course, as described in the following document: http://www.cse.ucsd.edu/~baden/Integrity.html. If you aren't sure about this policy, be sure to see me.

Lateness and Grading appeals

I'll consider any reasonable request for a delayed turnin, but all other assignments must be turned in on time. I'll accept regrade requests for one week after the assignment or exam has been returned to the class. After that, the grading decision is final.

Contact

If you have comments or suggestions, send email to baden @ ucsd.edu

My faculty assistant is Sheila Manalo, shmanalo @ ucsd.edu



Computing environment

Various hardware platforms will be used in the class, including Beowulf clusters and the STI Cell Broadband Engine. The first programming assignment will involve MPI. Students who wish to do their project on Cell BE will be given the option to switch to that machine early in the course, and their programming labs will involve Cell rather than MPI. This arrangement will accommodate the diverse needs of the class. Students wishing to work on the Cell machine should submit an account request form for a machine called CellBuzz, which is located at the Georgia Institute of Technology. See Getting started with Cell or a handy web page of on-line resources for Cell and GPGPU programming. The high performance resources should only be used for parallel program development and experimentation. If you need to use a Linux or UNIX system, please use your student account in CSE or your home department.

Please see the following web page with extensive documentation, papers, software repositories, and other goodies:

http://www.cse.ucsd.edu/users/baden/Doc


Topics in the course

Parallel computation has a rich and varied history, and the techniques for solving problems on high performance parallel computers are both intriguing and intellectually appealing. In this course we'll study techniques for solving applications drawn from science and engineering as well as the  fundamental principles of  parallel computing.  The course is divided into five main topics as follows.

Fundamentals
We'll begin with the fundamental underpinnings of parallel computation, including address space organization, programming models, and performance.

Software
The choice of an appropriate software interface is important in terms of both programmer productivity and machine performance. An effective interface permits the user to employ information specific to the application to optimize performance while providing a convenient notation for expressing the problem concisely and at a high level. We'll begin with basic models such as message passing and multithreading and work our way up to programming language and run time support. We'll learn how to write message passing programs with MPI, the primary means today of writing scalable scientific applications. Some students may choose to use STI Cell, and it will therefore be instructive to also learn about threads programming and vectorization for SIMD parallelism.

Algorithms
A guiding principle in parallel computation, and in high performance computation more generally, is that knowledge about an application may often be used to improve performance significantly. We'll study some important algorithms for solving problems and use them to motivate the study of programming, performance, and implementation. Algorithms will include iterative finite difference methods, numerical linear algebra, sorting, and some irregular problems. We'll articulate application requirements and explore cross cutting issues. 
 
Performance programming
The underlying problem representation constrains the implementation strategy and performance.  Data structures play an important role, and their selection is a crucial cross cutting issue. We'll look at the interaction of data structures with data partitioning and software issues. We'll also look at advanced programming techniques with MPI and come up with performance models. Of course, architecture also plays a role and we’ll look at some classic designs for inspiration.

Technology
Historically, technological change has had a deep impact on efficient algorithm design, and we'll look at architectural trends to better understand where the field is heading.

Readings

The lecture reading schedule is linked into the main course web page.  Readings will come from the text (ICA below), class handouts and on-line resources. The following texts will be used, and are shown with the abbreviations to be used in the schedule.
  • PCA: Parallel Computer Architecture, by Culler, Singh with Gupta, Morgan Kaufmann Publisher
  • ICA: Introduction to Parallel Computing 2nd Ed, by Grama, Gupta, Karypis, and Kumar. ISBN 0-201-64865-2, Addison-Wesley Publisher, 2003.
  • Pacheco:  Parallel Programming with MPI, by P. Pacheco, Morgan Kaufmann Publishers, 1997.
  • Foster:  Designing and building Parallel Programs, Addison Wesley. Available on-line as http://www-unix.mcs.anl.gov/dbpp/text/book.html.
  • Recipes: Numerical Recipes in C, 2nd Ed., by Press et al., Cambridge University Press. Also available on-line
  • UCB: Extensive on-line course notes from a course given several times over a period of years at U.C. Berkeley by Jim Demmel, Kathy Yelick, Horst Simon, and David Bailey. A particularly useful set of notes from CS267, Applications of Parallel Computers, is posted here: http://www.cs.berkeley.edu/~demmel/cs267_Spr99.
    These will be referred to as Demmel. Also see the CS267 class pages from other incarnations of the course:

    [ spring 2007 | spring 2006 | spring 2005 | spring 2004 | Fall 2002 | spring 2000 | spring 1999 | spring 1997 | spring 1996 ]

    What is parallelism and why is it useful?

    Beginning in the mid-1980s, processing rates increased at an exponential rate of about 50% per year, doubling about every 18 months. This phenomenon, a consequence of Moore's Law, was primarily dependent on exponentially increasing clock rates. However, due to high power densities, clock speeds started flattening out a few years ago. In response, industry began developing multi-core processors, where multiple CPUs are integrated onto a single chip. It is hoped that multi-core processing will enable a return to historic trends. (If you are interested in learning more, contact my colleagues Steve Swanson, Michael Taylor, or Dean Tullsen.)

    However, even this extraordinary rate of improvement is not sufficient to meet the needs of some applications, which can require more memory and I/O capacity than can be connected to a single computer, or a multi-core computer. We define parallel processing as the simultaneous computation or overlap over separate physical resources to increase capacity or speed.

    A parallel computer is a collection of processing elements that may execute concurrently, and communicate via an interconnect. The processing elements co-operate to solve a related set of tasks comprising a single problem. An automobile assembly line, for example, is a form of parallel processing; each stage of the assembly line may be overlapped with the other stages. Parallelism is attractive because it provides a way to deliver increased machine performance, memory, and storage, in a manner that compounds improvements in processor technology. At present, it is also an effective way of dealing with the physical limitations of micro-electronics, e.g. with multi-core CPUs. Thus, it is possible to have a hierarchically constructed parallel computer comprising many multi-core processors.

    It follows that if we have 1 computer capable of computing 5 Billion FlOating Point Operations per second--5 gigaflops/second (5 × 10^9/sec)--then a team of 1000 computers should run at 1000 times the rate, that is, 5000 gigaflops/second = 5 teraflops/second. In practice, this expectation is optimistic unless the algorithm and its implementation can map optimally onto the technology. In fact, some machines may deliver only 10% of their peak performance for certain applications. Attaining even this level of performance can still tax the programmer, as application code may require extensive tuning and recoding with each tenfold increase in parallelism. The study of parallel computation helps us to understand the guiding principles that can lead to an effective solution.
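
    The arithmetic above can be sketched in a few lines of Python. The function names are hypothetical and exist only for illustration; the figures (5 gigaflops per computer, 1000 computers, 10% of peak) are the ones quoted in the paragraph.

```python
# Peak versus delivered performance, using the figures from the notes.
# aggregate_peak_flops and delivered_flops are hypothetical helper names.

GIGA = 1e9
TERA = 1e12


def aggregate_peak_flops(per_computer_flops, computers):
    """Ideal aggregate rate: every computer runs at its peak, with no overhead."""
    return per_computer_flops * computers


def delivered_flops(peak_flops, efficiency):
    """Sustained rate when only a fraction `efficiency` of peak is achieved."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return peak_flops * efficiency


peak = aggregate_peak_flops(5 * GIGA, 1000)  # 1000 computers at 5 gigaflops each
sustained = delivered_flops(peak, 0.10)      # the 10%-of-peak case

print(f"peak      = {peak / TERA:.1f} teraflops")
print(f"sustained = {sustained / TERA:.1f} teraflops")
```

    Under these assumptions the 5 teraflop machine sustains about half a teraflop on such an application, which is why the peak figure alone says little about what a given code will achieve.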

    We may employ parallel processing without being aware of it, and we may write a parallel program without having a parallel computer. In the first case, we may rely on  a compiler to take care of the details. In the second case we may rely on our system to emulate parallelism in a way that allows us to debug our software in a familiar environment. We note that multiprogramming, which is supported by time-sharing operating systems, treats parallelism by interleaving the execution of multiple instruction streams running in separate processes.

    Concurrency can occur at many different levels, usually referred to as granularity.  Here are several possibilities, ranging from fine to coarse grained.

  • Pipelined arithmetic units within a CPU (e.g. IBM Power 4) support concurrency at the finest grain.
     
  • Shared memory multiprocessors provide fine-grain communication via cache coherence protocols. Examples include the Sun Microsystems Enterprise server series, which are Symmetric MultiProcessors (SMPs).

  • Networks of workstations, such as the Berkeley NOW project (http://now.cs.berkeley.edu), are connected by low-cost switches like Ethernet.

  • Multicomputers may employ high performance interconnect, and range from low cost machines such as "Beowulf" clusters to high end mainframes with thousands of processors. SDSC's DataStar is an example of a high end multicomputer. It has 2368 processors. Like many other high end machines, it is hierarchically organized. DataStar's computational nodes are multiprocessor servers. Some nodes have 8 CPUs while others have 32. The primary distinctions between a low-cost cluster and a mainframe are that mainframes employ higher speed but costlier interconnect, have more aggressive network connectivity and I/O capacity, and offer an extensive development environment and support infrastructure. Despite this, mainframes are generally configured with less ambitious local cache and memory than found on workstations based on the same CPU. Beowulf class clusters fit somewhere in between mainframes and networks of workstations in terms of performance and cost. Today's fastest machine is IBM's Blue Gene/L, which clocks in at 280.6 teraflops/second (that's 280.6 mega mega flops/second). It has 128K processors organized into 64K dual processor nodes. See the Top 500 Computers web site to see how Blue Gene/L and other machines measure up.
     
  • All the computers in the world may be viewed as a single global parallel computer. This is more accurately referred to as a distributed computer. Distributed computation has many applications, ranging from bank auto-tellers, airline reservation systems, and air traffic control systems to multicomponent scientific models that run on several parallel computers and may access large amounts of observational data, obtained either from an instrument in real time or from a data repository.
     

    Distributed vs. parallel processing

    In this course we'll consider medium to coarse levels of granularity found in multicomputers. However, as the result of improvements in networking technology, it has become feasible to connect geographically distributed computing resources together, possibly with repositories of stored data, an approach sometimes called "Grid" or "Cloud" computing. For example, collaborative activities, in which multiple users at geographically distributed locations may share information, introduce issues of scheduling, security, and connectivity. An exciting application involves remote access to scientific instruments such as microscopes and telescopes, such that the user may share the results with others. Another application is to provide convenient access to virtualized remote computing resources and data stores. For example, using a portal or a web service, a user may not be entirely aware of where their job is run, or for that matter, the mechanics of running the job. The results may be delivered to them by email, or via a convenient web interface.

    In this course, we will focus on relatively tightly coupled applications, involving parallel computation rather than distributed or grid computation. For our purposes this implies that compared with distributed computation, our parallel computations

  • exhibit more tightly coupled communication;
  • don't incorporate knowledge about the processor interconnect structure;
  • won't need to synchronize or form a consensus;
  • won't incur significant external contention due to other jobs on the processor, memory, or interconnect;
  • won't address security issues; and
  • perform I/O locally.

    It is important to remember that the performance of a distributed application is limited by the rates at which the individual resources are able to process their workloads. Thus, the lessons learned in this course may be used to improve performance of distributed applications comprising parallel parts. The issues involved in remote access to data touch upon some of the issues involved in accessing and manipulating local data.

    Motivation for using a parallel computer

    Parallel computers are often used to perform elaborate numerical simulations, which can provide a cost-effective alternative to expensive "wet lab" experimentation. The Boeing 777 jetliner, for example, was designed by computer, avoiding costly wind tunnel experiments. The reliance on computers to solve problems effectively has stimulated increased resource demands. (For example, see Jim Demmel's notes "Motivation for Parallel Computing" at http://www.eecs.berkeley.edu/~demmel/cs267/lecture01.html.)

    Example of a parallel computation: oceanographic simulation of overturn regions

  • Embed a 3-D mesh in a cubical volume of water, compute observables like velocity, temperature, and pressure at each point.
  • Carry out a long computation to determine the location and characteristics of pockets of water lying above water that has a lower density.
  • Single processor simulations run with a 128^3 mesh.
  • A run takes several hours on a single processor workstation and uses 256 MB of memory.
  • We'd like to run with larger meshes, in order to resolve features at a wider range of spatial scales.
  • When we double the mesh size in each dimension, memory requirements increase by 8x and running time by 16x.
  • A 1024^3 run takes about a year and requires 128 GB of memory.
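
    The scaling figures in this list can be checked with a short calculation. The sketch below assumes exactly the growth rates quoted above (memory by 8x and time by 16x per doubling of the mesh) and, since the notes say only "several hours", takes 2 hours as a purely hypothetical baseline for the 128^3 run. The function name scale_run is also hypothetical.

```python
import math


def scale_run(base_n, base_mem_mb, base_hours, n):
    """Extrapolate memory and running time from a base_n^3 mesh to an n^3 mesh.

    Assumes the rule of thumb quoted in the notes: each doubling of the
    mesh multiplies memory by 8 and running time by 16.
    """
    doublings = math.log2(n / base_n)
    mem_mb = base_mem_mb * 8 ** doublings
    hours = base_hours * 16 ** doublings
    return mem_mb, hours


# 128^3 baseline: 256 MB from the notes; 2 hours is an assumed stand-in
# for "several hours".
mem_mb, hours = scale_run(base_n=128, base_mem_mb=256, base_hours=2.0, n=1024)

print(f"memory: {mem_mb / 1024:.0f} GB")
print(f"time:   {hours:.0f} hours, or about {hours / 24:.0f} days")
```

    Going from 128 to 1024 points per side is three doublings, so memory grows by 8^3 = 512 (256 MB becomes 128 GB) and time by 16^3 = 4096, which turns a couple of hours into most of a year. One plausible reading of the 16x factor is 8x more mesh points combined with a time step that must shrink as the mesh is refined, but the notes do not say so explicitly.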
    Running in parallel

  • Divide the ocean (and the underlying mesh) into sub-problems
  • Solve each sub-problem locally, communicating information across the boundaries between sub-problems.
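
    The two steps above can be illustrated with a deliberately small, serial sketch. It is not the ocean code: it uses a 1-D Jacobi averaging sweep as a stand-in computation, splits the mesh into sub-problems that each carry one "ghost" cell per side, and emulates communication by copying edge values between neighbouring lists. All function names are hypothetical; in an MPI program the copies in exchange_ghosts would become messages.

```python
def jacobi_step(u):
    """One Jacobi sweep on a 1-D list; the first and last entries stay fixed."""
    new = list(u)
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def decompose(u, parts):
    """Split the interior of u into equal chunks, each with one ghost cell per side."""
    interior = len(u) - 2
    if interior % parts != 0:
        raise ValueError("interior points must divide evenly among sub-problems")
    size = interior // parts
    return [u[p * size:(p + 1) * size + 2] for p in range(parts)]


def exchange_ghosts(subs):
    """Copy each sub-problem's edge values into its neighbour's ghost cells."""
    for left, right in zip(subs, subs[1:]):
        left[-1] = right[1]   # left's right ghost <- right's first owned point
        right[0] = left[-2]   # right's left ghost <- left's last owned point


def gather(subs):
    """Reassemble the global array from the owned points and the outer boundaries."""
    result = [subs[0][0]]
    for s in subs:
        result.extend(s[1:-1])
    result.append(subs[-1][-1])
    return result


initial = [0.0] * 9 + [1.0]   # 8 interior points, boundary values 0 and 1
steps = 5

serial = initial
for _ in range(steps):
    serial = jacobi_step(serial)

subs = decompose(initial, parts=4)
for _ in range(steps):
    subs = [jacobi_step(s) for s in subs]  # purely local work
    exchange_ghosts(subs)                  # the only "communication"
decomposed = gather(subs)

print(decomposed == serial)
```

    The decomposed run reproduces the single-domain result because every sub-problem sees up-to-date neighbour values in its ghost cells before each sweep. Only the values on sub-problem boundaries are exchanged, which is what makes this style of decomposition attractive when communication is expensive.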
    Issues

  • Communication doesn't come for free.
  • The best way of handling communication depends on the hardware.
  • We must give each processor a fair share of the work, or the speedup of using multiple processors will be low.
  • Some parts of our task can't run on multiple processors, and by Amdahl's law these ultimately limit performance.
  • How do we manage and access the data?
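
    Amdahl's law, mentioned in the list above, can be stated compactly: if a fraction f of the running time is inherently serial, the speedup on P processors is at most 1 / (f + (1 - f) / P), which can never exceed 1 / f. The sketch below evaluates this bound; the function name and the 5% serial fraction are illustrative assumptions, not figures from the notes.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


# Hypothetical workload in which 5% of the running time is serial.
for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 2))
```

    With a 5% serial fraction the bound approaches, but never reaches, a speedup of 20 regardless of how many processors are added, so the serial portion of a code deserves attention early.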

     

    The bottom line: cost effectiveness

    Our success in the above endeavor depends on our ability to split up the problem effectively across multiple processors. Parallel processing is cost effective when the running time and programming effort are competitive with alternative computing solutions. Note that there are often many different ways of solving a problem on a parallel computer, and that the best implementation may depend on the hardware in use. Thus, we need to choose a "good" algorithm, since the best implementation cannot be expected to make up for a "bad" algorithm. On the other hand, if we have selected a good algorithm, we want to avoid introducing excessive overhead costs. Programming a parallel computer involves juggling many more factors than on a single processor computer, and is correspondingly more challenging. Managing software development costs is vital.

    To get a rough idea of why parallel computing can be more challenging, we should note that most of us tend to take the operating system and compiler for granted. These tools often handle many decisions about memory hierarchy management (e.g. cache, virtual memory and registers) and scheduling for us. What if we had to manage cache locality and virtual memory ourselves? Parallel processing introduces other activities that are often part and parcel of OS programming: messages and synchronization.

    Two different views on the benefits of parallelism

    There are generally two reasons for using a parallel computer

  • Capability: tackle larger or more ambitious problems requiring more memory, more processing capability, or both.
  • Performance: solve the same problem in less time.

    While both viewpoints are valid, users are often attracted to the opportunity offered by improved capability. The reason is that a user's primary activity is to engage in scientific discovery. While performance is important, it is often not the primary goal. Scientists are less concerned about whether they get the answer in 10 hours or 20. Why is this the case? There are two reasons.

    First, the difference between 10 and 20 hours may not be the deciding factor in whether or not a scientist makes a discovery (though it may impact their CPU time budget!). Moreover, on a mainframe system, job turnaround may include long queuing delays--think of what happens when everyone is trying to meet a conference deadline at the same time, and you get the idea. Second, it takes time to improve performance, and the added software development time generally increases with the level of parallelism and hence performance. This is especially true if the user has old "legacy" or "dusty deck" code, and may have difficulty in upgrading their software.

    On the other hand, performance programming is vital in maintaining scalability, whereby we expect performance to increase as we increase the level of parallelism. (Larry Carter has lots to say about performance programming.)  The moral of the story is that computer centers may have more of an incentive  than the user in seeing that  user programs run efficiently, but the equalizer is that computing time is a finite resource.

    Data management is also an important concern to many users. With increased levels of performance comes the ability to generate prodigious amounts of data. We will briefly discuss data-intensive applications and software support issues.


    Technology

    Parallel computers are generally constructed from commercial off-the-shelf components (COTS): microprocessors. Microprocessors are fast and inexpensive, and are becoming even faster and cheaper because they are mass-produced. Commercial machines like the IBM SP systems are generally constructed using customized interconnect. Processing clusters are built using commercial switch fabric like Myrinet, as with Valkyrie, or others, such as Infiniband.  Another trend in system design is to employ a hierarchical organization. Each computational node is in turn a tightly coupled parallel computer called a symmetric multiprocessor (SMP).   We'll return to these multi-tier systems later on. 

    Not long ago, processing clusters were not yet in production status, but that is no longer true. Many computer centers now sport lower cost clusters. A production quality machine provides services like user accounting, job submission services, and a full range of numerical libraries. One dilemma with production machines is their upgrade path. Generally a machine is designed around a particular chip, and it may take some time before newly designed chips can be incorporated into an existing system. The lifetime of a mainframe is 3 to 4 years. During this time, lower cost systems such as laptops and PCs will have improved in performance. As a result, some low cost Beowulf clusters can outperform a mainframe at lower levels of parallelism.

    The Processor-Memory Gap

    While processing speeds were rising at an astronomical rate as predicted by Moore's law, memory speeds increased far more slowly. During the 1990s, this "processor-memory gap" had a dramatic effect on processor design and on the difficulty of obtaining acceptable machine performance. Design techniques such as super-scalar processing, multi-threaded execution, instruction level parallelism, and deeper memory hierarchies have historically played an important role. However, these techniques also consume large amounts of power. Recent work in low power processing examines ways of avoiding power-hungry features, in favor of a more compact design. The results are a higher physical density of processing power, or a lower power consumption cost per flop. Another direction is to employ multi-core processors, though the increased processing rates require an aggressive on-chip memory hierarchy.

    In dealing with the memory hierarchy, we may have multiple levels of cache to contend with, in addition to main memory and virtual memory (even the processing pipeline may be included). Our goal is to improve memory locality, whereby we access memory most frequently at the lowest levels of the hierarchy (i.e. registers and primary cache, which is on the processor). Serial computers are more forgiving about poor memory locality than parallel computers, though the gap is narrowing due to the effects of Moore's Law. When commercial parallel computers began appearing in the early to mid 1980s, we were not as far along on the processing speed curve as we are today, so main memory wasn't as "far away" from the processor as it is today. But in the intervening decades processing speeds ramped up more rapidly than memory speeds, and the situation has changed. The moral of the story: perhaps parallelism provided a prescient image of what was to come.
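
    To make the idea of memory locality concrete, here is a toy model rather than a measurement: a direct-mapped cache simulator that counts misses for two traversal orders of an n x n array stored row by row. The cache geometry (64 lines of 8 words) and all function names are assumptions chosen for illustration and do not describe any machine mentioned in these notes.

```python
def count_misses(addresses, num_lines=64, line_words=8):
    """Count misses in a toy direct-mapped cache.

    Each cache line holds `line_words` consecutive words, and a memory
    block may live only in line (block number mod num_lines).
    """
    resident = [None] * num_lines
    misses = 0
    for addr in addresses:
        block = addr // line_words
        slot = block % num_lines
        if resident[slot] != block:
            resident[slot] = block
            misses += 1
    return misses


def row_order(n):
    """Word addresses of an n x n row-major array, visited one row at a time."""
    return (i * n + j for i in range(n) for j in range(n))


def column_order(n):
    """The same array visited one column at a time (stride of n words)."""
    return (i * n + j for j in range(n) for i in range(n))


n = 256
row_misses = count_misses(row_order(n))
column_misses = count_misses(column_order(n))

print("accesses:            ", n * n)
print("row-order misses:    ", row_misses)
print("column-order misses: ", column_misses)
```

    In this model the row-order sweep misses once per cache line, since the remaining words of each line are reused before it is evicted, while the column-order sweep strides past a whole row on every access and keeps evicting lines before they can be reused. Both loops do the same arithmetic; only the order of memory accesses differs.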


    Copyright © 2008 Scott B. Baden. Last modified: 09/24/2008 20:24 -0700