![]() |
Overview of Parallel Computing |
| 1.1 | Von Neumann Machine | ||||
|
A discussion of parallel computing must begin with a discussion of sequential
computing and the von Neumann machine, our sequential computer.
The von Neumann machine is one of the computer designs of John von Neumann. A
processor fetches instructions and data from a memory, operates on the data, and writes
the results back into memory. Computation is accomplished by making small
incremental changes[Backus] in the global memory. The problem with the von Neumann machine is that the design relies on making a sequence of small changes, a highly sequential process. Note that current programming languages are designed assuming the von Neumann machine. The assignment statement fetches data from memory on the right hand side, performs computations, and writes results back into the memory for the left hand side variable. The statements are executed sequentially with control accomplished by branches. In the language, the branches are given the syntactic sugar of if statements, while statements, and so on. There is a problem trying to speed up von Neumann machines. They are inherently sequential in principle. Attempts may be made to execute several instructions at once ( super scalar execution) but that gives only a few times the speed up. Similarly, it is difficult to gain high speedup from a program written in a von Neumann language without doing an extensive rewrite of the program. Flynn's taxonomy Flynn produced a taxonomy of parallel machines that is widely used. He classified computers with respect to how many different instruction streams they are fetching and executing at the same time, and how many data sets (data streams) they are fetching and processing. His taxonomy is
MIMD machines are divided into two varieties: shared memory and distributed memory. |
|||||
| 1.2 | Control/Memory taxonomy | ||||
|
Flynn's taxonomy is usually applied to machines with von Neumann processors. Further
insight may be gained by considering other control mechanisms. The von Neumann
machines may be characterized as "control driven"; it is the flow of control represented
by the program counter that schedules operations for execution. "Data driven" processors schedule operations for execution when their operands become available. In the paradigmatic variety of data driven machines, scalar data values flow in tokens over an interconnection network to the instructions that work upon them (hence the term "data flow"). When a token arrives, the hardware checks that all operands are present, and if they are, schedules the instruction for execution. Data flow machines are easily built distributed-memory. It is also possible to store the data in shared memory and signal each instruction whose operand is now available. Similarly, there is a technique for handling large data structures. An entire data structure cannot be passed in a token, so the structure is stored in a memory where components of the structure arrive as they are computed. Fetches of the elements arrive in tokens and wait for the value of the element to arrive, whereupon the value is sent in a token on to where the fetch specifies. A "demand driven" processor performs computations when their values are demanded. For example, when the value of a binary operator is demanded, the operator in turn demands the values of its operands. A common implementation of demand driven processors is based on "reductions," where a functional program is repeatedly rewritten until the solution is computed. The rewritings include replacing an operator applied to data values with its result and replacing a function call with the function body with the actual parameters substituted for the formal. Reductions are performed on an internal representation of the program. Two common representations are graphs and strings. Graphs consist of nodes linked together with pointers and hence work best with shared memory. Strings can be spread across a chain of processors so that an individual processor can reduce subexpressions contained entirely in its memory and neighboring processors can shift expressions falling across their boundary into one of them. "Pattern driven" computation is typically not done with specialized hardware, but implemented atop von Neumann machines. Shared-memory, pattern-driven programming usually means parallel logic programming. Distributed-memory, pattern-driven programming is represented in this book by Message-Driven Computing (MDC). Messages are sent to locations where a pattern-matches on the messages present decide what computations, if any, are to be triggered. Consider the table ... . |
|||||
| 1.3 | Speedup, efficiency | ||||
We want to use parallelism to compute answers more quickly. How much more quickly?
We define speedup as follows:
S = T1 / Tn
where T1 is defined as the execution time of the best sequential algorithm for the problem
on a single processor and Tn is the execution time of the parallel algorithm on n
processors. Notice several things:
We would expect that speedup cannot be better (larger) than linear, and indeed should be smaller. If the entire work of the sequential program could be evenly divided among the n processors they could all complete in 1/n the time, but it is unlikely that the work could be divided evenly; programs tend to have a sequential part, such as initialization such as reading data from or writing results to sequential files. If the sequential part can only be done on a single machine, only the rest can be run in parallel. We will examine this in more detail when we discuss Amdahl's Law. Even if the program could be evenly divided among n processors, the processors would probably have to coordinate their work with each other, which will take extra instruction executions beyond the sequential program. Therefore, Tn may be 1/n of a larger value than T1. Moreover, T1 is supposed to be the best possible sequential algorithm. If the parallel algorithm runs faster on a single machine, it would be a better sequential algorithm and you'd use it, so you can expect the algorithm T1 to be at least as good as the algorithm for Tn. You cannot expect any help from differences in the algorithms in achieving even linear speedup.
The hardware is different. The parallel machine has more processors, hence more cache memory, for one thing. The algorithm is different. For example, a depth-first search on a sequential machine might be translated into a collection of depth first searches on the nodes of a parallel computer, but the parallel depth-first searches would have an element of breadth-first search. A single depth-first search might spend a large amount of time on one fruitless branch, where with several searches, it is likely that another path might find the solution quickly Efficiency is defined as The formula shows two ways to think about efficiency.
|
|||||
| 1.4 | Amdahl's law | ||||
|
"Amdahl's Law" does not really deserve the title "law." It is merely a back-of-the-
envelope attempt to prove there are severe limits to the speedup that can be achieved by a
parallel program. Amdahl's Law asserts there is a serial part of any parallel program that
much be executed sequentially and the time required for this part will be a lower limit
on the time the program takes to execute. Consider a serial program that executes in
time T. Let's calculate what is the best speedup we could achieve if fraction f of the
execution time is taken up by sequential execution. If you divide the parallel execution time
into the serial and parallel parts, you get speedup bounded by : We get this by taking the definition of speedup and breaking down Tn into the time taken by the serial fraction (f*T) and the time taken by the parallel fraction ((1-f)*T). The parallel fraction we divide by n to calculate the best we could expect from a linear speedup.
Removing T, we have
Note that so that no matter how many processors we use, we would not expect to gain any more speedup than the reciprocal of the serial fraction. If five per cent of the program must run sequentially, speedup will never be better than twenty. |
|||||
| 1.5 | Scalability | ||||
|
A flaw in the reasoning behind Amdahl's Law is that it deals with fixed sized problems
and questions how much faster they can be run. This is not, however, the way massively
parallel processors are used. Take the example of weather forecasting. The calculations
are done by superimposing a mesh on the atmosphere and calculating pressure, tempurature,
humidity, and so on at each mesh point from the values at the surrounding points
repeatedly at small time intervals. The more the mesh points and the smaller the time
intervals, the better the forecast. But the more calculations required the slower the
program runs and for weather forecasting, if the calculation takes too long it loses all value.
When presented with a faster machine, weather forecasters will use more grid points
and a smaller step size. They increase the problem size to the largest they can and still
get the answer in the same time.
Let's rephrase the the calculation, starting with a parallel program with serial fraction g that runs in time R on n processors. If we ran it on a single processor, how long would it take?
since the serial fraction will still take the same time (g*R), and the n parts of the parallel fraction (\(1-g)*R) would have to be interleaved. This gives the speedup
a linear speedup with slope (1-g). It gives an efficiency which approaches the parallel fraction as the number of processors increases. In this formulation there is no limit on speedup. As long as we scale the problem size to the size of the machine, we will not run into limits. Another aspect of this argument against Amdahl's Law is that as the problem size increases, the serial fraction may decrease. Consider a program that reads in two NxN matrices, multiplies them, and writes out the result. The serial I/O time grows as N2 while the multiply, which is highly parallelizable, grows as N3. |
|||||
| 1.6 | Problems of parallelism | ||||
| 1.6.1 | Grain size | ||||
|
Grain size loosely means the amount of computation that is done between
communications or synchronizations. Too large a grain size can result in an unbalanced load. Too
small a grain size can waste too much time on system overhead.
Consider eight processors that are to execute ten independent tasks each of which takes
t time units. Suppose the system takes s time units to run a task. The schedule looks as
follows:
Suppose we divide each task into ten independent tasks giving us 100 tasks for the
entire job. Each task now will take t/10 time units. The schedule now looks as follows: The overall completion time is the maximum of any processor's completion time: 13t/10+13s.
How do these compare? If s is negligible compared to t, then schedule (1) will complete
in 2t and (2) in 1.3 t. However, 13 s is significantly larger than 2 s, so system overhead s
being even a small fraction of grain size t might destroy all the advantages of load balancing.
What is the cutover point? I.e. at what fraction of t does s cause schedule (2) to
take as long as schedule (1)? So if s is even seven per cent of t, the version with 100 tasks will be as slow as the version with ten.
So how do you choose a good grain size? |
|||||
| 1.6.2 | Starvation | ||||
| Starvation results when some user computations do not get adequate processor time.
Here's an example of starvation on a distributed-memory machine: For some distributed
computations, it is difficult to determine if they are finished. There are some algorithms
that send around system probe messages to inquire about the state of the computation.
Starvation can result if the probe messages use up significant processor time making
processor time unavailable to the user computations. On shared-memory machines, processors lock and unlock resources. When a resource is unlocked, one of the processors waiting for it (if any) is allowed to proceed. If the resource allocation mechanism is unfair, some waiting process may be long delayed while other processes acquire the resource repeatedly. |
|||||
| 1.6.3 | Deadlock | ||||
| A set of processes is deadlocked if each is waiting for resourses others in the set hold
and none will release resources until they have been granted the other resources they are
waiting for. There are four conditions required for deadlock:
There are three things you can try to do about deadlock. |
|||||
| 1.6.4 | Flooding & throttling | ||||
| Strangely, one of the problems with parallelism is having too much rather than too little.
For many parallel algorithms (especially divide and conquer and combinatoric search),
a problem is repeatedly broken into smaller parts which can be run in parallel. Once the
number of parallel parts significantly exceeds the number of processors available, it is
detrimental to create more parallelism: all processors will be kept busy anyway; the
time to create more parallel tasks will be wasted; and the storage for those task descriptions
will tax the parallel machine's memory. Preventing a flood of parallelism typically requires extra programming: the algorithm must be broken into the code that is executed before enough parallel tasks are created which creates more tasks, and the code that is executed after sufficient tasks are available which does its work within a single task. Choice of when to switch from creating more tasks to executing within tasks can be made statically, before the algorithm runs, or dynamically in response to the system load. Dynamic switching requires additional information about the current state of the system that is often-times not available or is highly imprecise. |
|||||
| 1.6.5 | Layout | ||||
| The layout of a data structure on a distributed-memory machine can make a significant
difference in performance. There are two interacting concerns. First, it is important to balance the load so that all nodes have approximately the same amount of work to do. Secondly, it helps to have most communication between neighboring nodes; there won't be as many queueing delays as messages contend for communication edges along longer paths. Consider though a simulation of the cosmos: If you divide space into equally sized cubes and assign one each to nodes of a multicomputer, then communication of gravitation and movements of mass can be done between neighboring nodes on a mesh-connected computer. Unfortunately, there will be vast regions of nearly empty space mapped to some regions of nodes, while those parts of space with clusters of galaxies will be mapped into other nodes; the load will be horribly imbalanced. A way to balance the load is to divide space into a larger number of regions and randomize their mapping on the nodes, say by hashing their coordinates to give the node number. Then you can count on the law of large numbers to balance the load, but communication between neighboring regions is no longer between neighboring nodes. Suppose we have N rows of an array that must be processed. How can we divide them evenly among P nodes?
Some algorithms divide arrays into regions that communicate along their edges, so messages will be shorter and faster if the perimeters of regions are smaller: square regions tend to be better than long rectangular regions. |
|||||
| 1.6.6 | Latency | ||||
| As machines get larger, physical packaging itself requires that components get further
apart and it will take longer for information to flow between some components rather
than others. This implies that the larger shared-memory machines will be NUMA (non-
uniform memory access). Data layout becomes increasingly important, and algorithms
may benefit by being rewritten to fetch remote objects, operate on them locally, and
then write them back - rather than just manipulating them in-place. Latency is also one of the considerations in laying out tasks and data on distributed memory machines. On distributed memory machines, one has the extra option of using asynchronous message passing to allow other computations to be performed while messages are being passed. |
|||||
| 1.6.7 | Scheduling | ||||
| Scheduling assigns tasks to processors to be run in a particular order or at particular
times. There is a large literature devoted to process scheduling, the major import of which is this: almost any scheduling of activities on processors is an NP-hard problem. For practical purposes, the meaning of NP-hard is this: the worst-case time to run an NP-hard algorithm grows so rapidly (e.g. doubling when you add one element to the input) that you may not get a solution for an modestly sized problem before the sun burns out. Do not seek perfect solutions to NP-hard problems. Instead, look for ways to get solutions quickly that are reasonably good most of the time. Static scheduling of tasks on processors would be done before the tasks are run. Dynamic scheduling assigns tasks during execution. Self scheduling is a form of dynamic scheduling where the processors themselves select which task to execute next.
The techniques for partitioning rows among nodes we saw in the discussion of layout
are also applicable to processor scheduling on shared-memory machines. For technique
(3), an easy way to assign process i all rows j such that j mod P = i is to handle the rows
in a loop the rows are numbered 0 through N-1 my_id is the node number in the range 0..P-1
Rather than assign entire rows or columns to processors, better load ballancing can
sometimes be accomplished by assigning groups of elements. If there are K elements
total in an array, we can assign them numbers 0 through K-1, assign ranges of those
numbers to processors, and convert from the element number to the array indices when
necessary. For an MxN matrix A with zero origin addressing,
A simple form of self scheduling for K elements is to keep the index C of the next element
to be processed and to allocate items by incrementing or decrementing C. For
example,
However, if the processing of a single element takes little time, the grain size is too
small. Of course the processor could allocate some constant number of elements greater
than one at a time. This is clearly better for grain size, but may still have load balance
problems. |
|||||
| 1.7 | Designing experiments | ||||
| 1.8 | Programming techniques | ||||
| There are a number of techniques for programming MIMD parallel processors. There are two general techniques based on SIMD and MIMD computers. Dataparallel computation is based on SIMD. A common data structure is divided up among the processors so that they perform the same operations at the same time but on different data elements. Control parallel computation is based on MIMD. The processors execute different code at the same time. There are a number of ways to divide up the computations. In the job jar - or processor farm - organization, the program keeps a list of jobs (or tasks, or chores, or whatever you wish to call them); we call this list the job jar. When a processor becomes free, it pulls a job out of the job jar and executes it. The job may place more jobs in the job jar. In the pipelined or stream organization, a stream of data passes down a chain of processes each of which reads, transforms, and writes the stream. A reactive object organization is based on message passing. All computation is performed by objects responding to messages. Similar to the reactive object system is the client/server organization: client processes send requests to servers which perform the services requested and send responses back. Macro dataflow is based on the data-driven computation model. The presence of its operands triggers a computation that uses them. The "macro" refers to the operands being larger than individual scalars, e.g. sections of arrays. |