The URL for the class home page is http://www.cse.ucsd.edu/classes/fa08/cse260. Important announcements will be made via this page, so watch for changes regularly.
There is one required text for the course:
Introduction to Parallel Computing, 2nd Ed., by Grama, Gupta, Karypis, and Kumar. Addison-Wesley, 2003. ISBN 0-201-64865-2.
Be sure to get the 2nd edition, not the 1st edition.
A handy book on MPI is Parallel Programming with MPI, by P. Pacheco, Morgan Kaufmann Publishers, 1997.
I recommend that you purchase this book if you anticipate doing a lot of MPI programming in the future, or if you
are more comfortable using a text than on-line materials.
(But see the list of MPI reading materials at http://www-cse.ucsd.edu/users/baden/Doc/mpi.html.)
Additional course readings will be posted on-line or handed out in class.
The prerequisite for CSE 260 is graduate standing. I recommend that you
have a background in undergraduate-level
computer architecture or operating systems. This background
may also be met by equivalent
work experience. Students from other departments are welcome; contact me if
you have any questions about your academic background.
Grading will be based on homework assignments, class participation, and the project. Written homework assignments must be done individually, but the programming assignments may be done in teams of two. Be sure to complete all assigned readings and be prepared to discuss them in class.
I expect you to do your own work. Though I don't expect there to be a problem in a graduate course, academic honesty will be strictly enforced. Anyone found plagiarizing another's work, or making their own work available to others, will receive a 0 on the assignment in question and face other possible consequences. You are assumed to be familiar with the Academic Integrity Policies for this course, as described in the following document: http://www.cse.ucsd.edu/~baden/Integrity.html. If you aren't sure about this policy, be sure to see me.
I'll consider any reasonable request for a delayed turnin, but all other
assignments must be turned in on time. I'll accept regrade requests for one week
after the assignment or exam has been returned to the class. After that, the
grading decision is final.
If you have comments or suggestions, send email to baden@ucsd.edu or shmanalo@ucsd.edu.
Please see the following web page with extensive documentation, papers, software repositories, and other goodies:
Parallel computation has a rich and varied history, and the techniques for solving problems on high performance
parallel computers are both intriguing and intellectually appealing. In this course we'll study techniques for solving
applications drawn from science and engineering as well as the fundamental
principles of parallel computing. The course is divided into
five main topics as follows.
Beginning in the mid 1980s, processing rates increased at an exponential rate of about 50% per year, doubling about every 18 months. This phenomenon, a consequence of Moore's Law, was primarily dependent on exponentially increasing clock rates. However, due to high power densities, clock speeds started flattening out a few years ago. In response, industry began developing multi-core processors, where multiple CPUs are integrated onto a single chip. It is hoped that multi-core processing will enable a return to historic trends. (If you are interested in learning more, contact my colleagues Steve Swanson, Michael Taylor, or Dean Tullsen.)
However, even this extraordinary rate of improvement is not sufficient to meet the needs of some applications, which can require more memory and I/O capacity than can be connected to a single computer, even a multi-core computer. We define parallel processing as the simultaneous computation or overlap over separate physical resources to increase capacity or speed.
A parallel computer is a collection of processing elements that may execute concurrently, and communicate via an interconnect. The processing elements co-operate to solve a related set of tasks comprising a single problem. An automobile assembly line is an everyday example of parallel processing; each stage of the assembly line may be overlapped with the other stages. Parallelism is attractive because it provides a way to deliver increased machine performance, memory, and storage, in a manner that compounds improvements in processor technology. At present, it is also an effective way of dealing with the physical limitations of micro-electronics, as with multi-core CPUs. Thus, it is possible to have a hierarchically constructed parallel computer comprising many multi-core processors.
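The assembly-line analogy can be made concrete with a simple timing model. This is a minimal sketch under idealized assumptions that are not stated in the text above: every stage takes the same amount of time, and there are no stalls between stages. The function names are hypothetical and chosen only for illustration.

```python
def serial_time(items, stages, stage_time):
    """Each item passes through all stages before the next item starts."""
    return items * stages * stage_time

def pipelined_time(items, stages, stage_time):
    """Stages overlap: once the line fills, one item completes per stage_time."""
    if items == 0:
        return 0
    return (stages + items - 1) * stage_time

# Hypothetical workload: 100 items, 4 stages, 1 time unit per stage.
items, stages, stage_time = 100, 4, 1
t_serial = serial_time(items, stages, stage_time)
t_pipe = pipelined_time(items, stages, stage_time)
print("serial:", t_serial, "pipelined:", t_pipe, "speedup:", t_serial / t_pipe)
```

In this model the speedup approaches the number of stages as the number of items grows, which is the sense in which overlapping the stages increases throughput.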
It follows that if we have 1 computer capable of computing 5 billion floating-point operations per second, that is, 5 Gigaflops (5 × 10^9 flop/s), then a team of 1000 computers should run at 1000 times the rate, that is, 5000 Gigaflops = 5 Teraflops. In practice, this expectation is optimistic unless the algorithm and its implementation can map optimally onto the technology. In fact, some machines may deliver only 10% of their peak performance for certain applications. Attaining this level of performance can still tax the programmer, as application code may require extensive tuning and recoding with each tenfold increase in parallelism. The study of parallel computation helps us to understand the guiding principles that can lead to an effective solution.
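As a rough illustration of the arithmetic above, the following sketch computes the aggregate peak rate of 1000 computers at 5 Gigaflops each, and the rate actually delivered when an application sustains only 10% of peak. The function names are hypothetical, and the model assumes peak rates simply add across computers.

```python
# Illustrative arithmetic only; the names here are hypothetical.
GIGA = 10**9
TERA = 10**12

def aggregate_peak_flops(per_node_flops, nodes):
    """Peak rate if every node ran at full speed with no overhead."""
    return per_node_flops * nodes

def delivered_flops(peak_flops, fraction_of_peak):
    """Rate actually sustained when only a fraction of peak is achieved."""
    return peak_flops * fraction_of_peak

peak = aggregate_peak_flops(5 * GIGA, 1000)
sustained = delivered_flops(peak, 0.10)
print(peak / TERA, "Tflop/s peak")
print(sustained / TERA, "Tflop/s sustained at 10% of peak")
```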
We may employ parallel processing without being aware of it, and we may write a parallel program without having a parallel computer. In the first case, we may rely on a compiler to take care of the details. In the second case we may rely on our system to emulate parallelism in a way that allows us to debug our software in a familiar environment. We note that multiprogramming, which is supported by time-sharing operating systems, treats parallelism by interleaving the execution of multiple instruction streams running in separate processes.
Concurrency can occur at many different levels, usually referred to as granularity. Here are several possibilities, ranging from fine to coarse grained.
In this course we'll consider medium to coarse levels of granularity found in multicomputers. However, as the result of improvements in networking technology, it has become feasible to connect geographically distributed computing resources together, possibly with repositories of stored data; this is sometimes called "Grid" or "Cloud" Computing. For example, collaborative activities, in which multiple users at geographically distributed locations may share information, introduce issues about scheduling, security, and connectivity. An exciting application involves remote access to scientific instruments such as microscopes and telescopes, such that the user may share the results with others. Another application is to provide convenient access to virtualized remote computing resources and data stores. For example, using a portal or a web service, a user may not be entirely aware of where their job is run, or for that matter, the mechanics of running the job. The results may be delivered to them by email, or via a convenient web interface.
In this course, we will focus on relatively tightly coupled applications, involving parallel computation rather than distributed or grid computation. For our purposes this implies that compared with distributed computation, our parallel computations
It is important to remember that the performance of a distributed application is limited by the rates at which the individual resources are able to process their workloads. Thus, the lessons learned in this course may be used to improve performance of distributed applications comprising parallel parts. The issues involved in remote access to data touch upon some of the issues involved in accessing and manipulating local data.
Parallel computers are often used to perform elaborate numerical simulations, which can provide a cost-effective alternative to expensive "wet lab" experimentation. The Boeing 777 jetliner, for example, was designed by computer, avoiding costly wind tunnel experiments. The reliance on computers to solve problems effectively has stimulated increased resource demands. (For example, see Jim Demmel's notes "Motivation for Parallel Computing" at http://www.eecs.berkeley.edu/~demmel/cs267/lecture01.html.)
Our success in the above endeavor depends on our ability to split up the problem effectively across multiple processors. Parallel processing is cost effective when the running time and programming effort are competitive with alternative computing solutions. Note that there are often many different ways of solving a problem on a parallel computer, and that the best implementation may depend on the hardware in use. Thus, we need to choose a "good" algorithm, since the best implementation cannot be expected to make up for a "bad" algorithm. On the other hand, if we have selected a good algorithm, we want to avoid introducing excessive overhead costs. Programming a parallel computer involves juggling many more factors than on a single processor computer, and is correspondingly more challenging. Managing software development costs is vital.
To get a rough idea of why parallel computing can be more challenging, we should note that most of us tend to take the operating system and compiler for granted. These tools often handle many decisions about memory hierarchy management (e.g. cache, virtual memory and registers) and scheduling for us. What if we had to manage cache locality and virtual memory ourselves? Parallel processing introduces other activities that are often part and parcel of OS programming: messages and synchronization.
There are generally two reasons for using a parallel computer
While both viewpoints are valid, users are often attracted to the opportunity offered by improved capability. The reason is that a user's primary activity is to engage in scientific discovery. While performance is important, it is often not the primary goal. Scientists are less concerned about whether they get the answer in 10 hours or 20. Why is this the case? There are two reasons.
First, the difference between 10 and 20 hours may not be the deciding factor in
determining whether or not a scientist makes a discovery (though it may impact
their CPU time budget!). Moreover, on a mainframe system, job turnaround may
include long queuing delays--think of what happens when everyone is trying to
meet a conference deadline at the same time, and you get the idea. Second, it
takes time to improve performance, and the added software development time
generally increases with the level of parallelism and hence performance. This is
especially true if the user has old "legacy" or "dusty deck" code, and may have
difficulty in upgrading their software.
On the other hand, performance programming is vital in maintaining scalability,
whereby we expect performance to increase as we increase the level of
parallelism. (Larry Carter has
lots to say about performance programming.) The moral of the story is that
computer centers may have more of an incentive than the user in seeing
that user programs run efficiently, but the equalizer is that computing
time is a finite resource.
Data management is also an important concern to many users. With increased levels of performance comes the ability to generate prodigious amounts of data. We will briefly discuss data-intensive applications and software support issues.
Parallel computers are generally constructed from commercial off-the-shelf components (COTS): microprocessors. Microprocessors are fast and inexpensive, and are becoming even faster and cheaper because they are mass-produced. Commercial machines like the IBM SP systems are generally constructed using customized interconnect. Processing clusters are built using commercial switch fabric like Myrinet, as with Valkyrie, or others, such as Infiniband. Another trend in system design is to employ a hierarchical organization. Each computational node is in turn a tightly coupled parallel computer called a symmetric multiprocessor (SMP). We'll return to these multi-tier systems later on.
Not long ago, processing clusters were not yet in production status, but that is no longer true. Many computer centers now sport lower-cost clusters. A production quality machine provides services like user accounting, job submission services, and a full range of numerical libraries. One dilemma with production machines is their upgrade path. Generally a machine is designed around a particular chip, and it may take some time before newly designed chips can be incorporated into an existing system. The lifetime of a mainframe is 3 to 4 years. During this time, lower-cost systems such as laptops and PCs will have improved in performance. As a result, some low-cost Beowulf clusters can outperform a mainframe at lower levels of parallelism.
While processing speeds were rising at an astronomical rate as predicted by Moore's law, memory speeds increased far more slowly. During the 1990s, this "processor-memory gap" had a dramatic effect on processor design and on the difficulty of obtaining acceptable machine performance. Design techniques such as super-scalar processing, multi-threaded execution, instruction level parallelism, and deeper memory hierarchies have historically played an important role. However, these techniques also consume large amounts of power. Recent work in low power processing examines ways of avoiding power-hungry features, in favor of a more compact design. The result is a higher physical density of processing power, or a lower power consumption cost per flop. Another direction is to employ multi-core processors, though the increased processing rates require an aggressive on-chip memory hierarchy.
In dealing with the memory hierarchy, we may have multiple levels of cache to
contend with, in addition to main memory and virtual memory (even the processing
pipeline may be included). Our goal is to improve memory locality,
whereby we access memory most frequently at the lowest levels of the hierarchy
(i.e. registers and primary cache, which is on the processor). Serial computers
are more forgiving about poor memory locality than parallel computers, though
the gap is narrowing due to the effects of Moore's Law.
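One way to see what memory locality means is to look at the order in which a traversal touches a two-dimensional array stored in row-major order. The sketch below only models addresses as flat offsets; it does not measure cache behavior or running time, and the helper names and array shape are hypothetical.

```python
def row_major_index(i, j, cols):
    """Flat offset of element (i, j) in a row-major array with `cols` columns."""
    return i * cols + j

def access_order(rows, cols, by_rows):
    """Flat offsets touched, in order, by a row-wise or column-wise sweep."""
    if by_rows:
        return [row_major_index(i, j, cols) for i in range(rows) for j in range(cols)]
    return [row_major_index(i, j, cols) for j in range(cols) for i in range(rows)]

def mean_stride(offsets):
    """Average distance between consecutively touched offsets."""
    steps = [abs(b - a) for a, b in zip(offsets, offsets[1:])]
    return sum(steps) / len(steps)

# Hypothetical array shape: 4 rows of 1024 elements each.
rows, cols = 4, 1024
row_sweep = access_order(rows, cols, by_rows=True)
col_sweep = access_order(rows, cols, by_rows=False)
print("row-wise mean stride:", mean_stride(row_sweep))
print("column-wise mean stride:", mean_stride(col_sweep))
```

The row-wise sweep touches adjacent offsets, so consecutive accesses tend to fall in the same cache line. The column-wise sweep visits the same elements but jumps by at least a full row on every step.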
When commercial parallel computers began appearing
in the early to mid 80's, we were not as far along on the processing speed curve
as we are today. So main memory wasn't so "far away" from the processor
as it is today.
But in the intervening decades processing speeds ramped up more
rapidly than memory speeds, and the situation has changed. The moral of the story:
perhaps parallelism provided a prescient image of what was to come.
Copyright © 2008 Scott B. Baden. Last modified: 09/24/2008 20:24 -0700