![]() |
Termination Detection with The Overlapping-Intervals /Charged Systems Metaphor In Systems With Asynchronous Message Passing |
| 5.1 | Abstract | |
| The overlapping interval/charged system metaphor distributed termination detection
technique is presented. This technique is applicable to systems with a fixed number of
computational locations and asynchronous message passing. It is experimentally compared
to the conventional technique of maintaining a tree of active locations.
Keywords: distributed termination detection, asynchronous message passing, Bell-man-Ford algorithm. |
||
| 5.2 | INTRODUCTION | |
| One of the problems with distributed algorithms is simply determining when the algorithm
has terminated. Distributed termination detection algorithms have received considerable
attention [DIJK80] [DIJK83] [FRAN80] [FRAN82] [APT84] [TOPO84] [APT86], typically with
synchronous message passing of the CSP style. Asynchronous message passing complicates
termination detection, since messages cannot be detected while in transit, but the
computation has terminated only when there are no further such messages. Our work with
Message Driven Computing [CHRI92a] necessitated our investigation of termination detection
with asynchronous message passing. We were inspired by the weighted throw count scheme of
[ROKU88], although our algorithm differs from theirs.
In this paper we investigate a pair of termination detection algorithms based on
overlapping intervals to determine the global state of the system at
some instant of time and the charged system metaphor to determine that no messages are in transit. We prove the cor- |
||
| 5.3 | ALGORITHMS | |
| 5.3.1 | Active Tree Algorithm | |
| The conventional active tree termination detection algorithm [DIJK80] works as follows:
A location may be either active or inactive. A location is initially inactive. It becomes
active upon receipt of a task message, which will be called the critical message.
An active location maintains a count of the number of tasks it has sent which have
not yet been acknowledged. While active, a location immediately acknowledges task
messages, increments its count for each task message it sends, and decrements the count
for each acknowledgement it receives. When it has received acknowledgements for all
the task messages it has sent, a location becomes inactive and acknowledges the critical
message. The active locations from a tree. Each directed edge leads from one active location to the location that sent the critical message that made it active. All paths lead to the root. An edge cannot lead to an inactive location, since that location cannot become inactive until the critical message is acknowledged, which will not happen until the source of the edge becomes inactive, dropping out of the tree, and implicitly removing the edge. The computation has terminated when the root becomes inactive. |
||
| 5.3.2 | Ordering of computational events | |
| Before presenting the overlapping intervals/charged-system metaphor termination
detection algorithm, we need to present a model of distributed computation in terms of
computational events connected by message passing. A computational event is an
execution of a block of code which begins and ends with sending or receiving messages (or
the initiation or termination of the task). Message transmission imposes a partial order on the events in an execution of a distributed algorithm. Let mean that event Ei sends a message that immediately triggers event Ej. (In MDC90, the language we use, there are no processes, and the only causality between events is in fact by message passing. For more conventional languages, please consider the activation stack to be a message passed between sequential events within a process.)
The relation ->+ is the transitive closure of ->. We can linearize the events in the execution of the algorithm, numbering them so that if Ei->+ Ej then i<j. Further, if i<j then Ei precedes Ej in the ordering (they are not necessarily ordered via messages). Our use of "linearize" is derived from [HERL87], albeit simplified by the functional nature of computational events and their mutual exclusion at locations. |
||
| 5.3.3 | Overlapping Intervals | |
| We will use the term location to refer to an object or
process that computes in response to asynchronous messages. We will deal only with algorithms in
which the computations occur at a fixed number of locations. Let the n locations in the
system be numbered from 1 to n: L1, L2, ..., Ln. To determine whether an execution of the algorithm has terminated, probe messages are sent throughout the system. The arrival of a probe message at a location will cause a probe event. Let Pik be the kth probe event at location i. We would like to probe all locations simultaneously to determine the global state of the system, but unfortunately we are unable to. What we can and will do is determine the state of the system over an interval.
The kth interval at location i is the time between probe Pi(k-1) and Pik.
(Pi0 is defined as the beginning of the execution at location i.)
Probe tree. In the tree method [FRAN82] [TOPO84], one location Lr
is selected as the root, a set of locations, I, are the internal nodes, and the remaining locations
form the set of leaf nodes, F. If there is more than one location, then Lr
is in set I. We shall assume this case, since a single location system does not need elaborate termination
detection.
Query messages travel down the tree from root to leaves, and probes travel back up the
tree from leaves to the root. Query messages are sent by Q events. Let
Qik be the kth Q event at location i.
Note: |
||
| 5.3.4 | Charged system metaphor | |
| The computational events relevant to the algorithm (i.e. the events other than those only
used for termination detection) will be called actions. Let Aijk be the
jth action at location i in the kth interval. Actions are triggered
by the arrival of one or more task messages and may in turn send
zero or more task messages. When there are no more task messages in transit, no further
actions can occur, and the algorithm has terminated. Unfortunately, the number of messages
in transit cannot be directly examined. We will detect the number of task messages in transit reaching zero by using a charged system metaphor. Each location Li keeps a count ci which records the current charge of Li. Each task message carries a charge of -1. Each task message sent increases the charge of its sender by one; each task message received decreases the charge of its receiver. Action Aijk which is triggered by the receipt of g>0 task messages and which sends h >= 0 task messages will add h-g to the charge of location Li. Assuming that when the algorithm begins executing the charges in the locations sum up to the initial number of task messages, the sum of the charges in a linearization of the execution will always equal the number of task messages in transit. |
||
| 5.3.5 | Overlapping Interval/charged system metaphor | |
| The overlapping interval/charged system metaphor termination detection algorithm works as follows: Probe messages are used as described above to create a sequence of overlapping intervals. The probes accumulate the sum of the charges and the logical or of the busy bits across all locations. The system detects that the computation has terminated if at the end of some set of overlapping intervals, the sum of the charges is zero and the or of the busy bits is false. The proof is essentially this:
This proof is at a rather high level. For a more detailed proof, it is advantageous to
phrase the proof in terms of linearizations of the computational events.
Theorem : The algorithm has terminated iff |
||
| Proof | ||
|
||
| 5.4 | TEST METHODS | |
| We compared the overlapping interval/charged-system metaphor termination detection algorithm to the conventional active tree algorithm. For our experiments, we used the Bellman-Ford algorithm for the single source/shortest path problem. | ||
| Test problem | ||
| Single-source, shortest path problem. A network is a directed graph where each
edge has a non-negative length. The length of a directed path is the sum of the lengths of
the edges in the path. The single source shortest path problem associates a distance Dv
with each vertex v where Dv is the minimum length of any directed path from the start
vertex to v. The Bellman-Ford Algorithm. The asynchronous Bellman-Ford algorithm works as follows: messages are sent along edges of the graph giving the length of a path from the start vertex that ends with that edge. All vertices have an initial distance of infinity, except for the start vertex which has distance zero. When a vertex receives a message specifying a lower distance than it is already assigned, the vertex assigns itself that distance and sends messages along each of its outgoing edges containing its new distance plus the length of the edge. The algorithm begins with the start vertex sending messages along its outgoing edges giving the length of the edge. In our experiments, we use the two following networks: Chain+Random. The "chain+random" graph has a chain of all the vertices in numeric order. In addition, each vertex except the last has two extra edges leading from it to other vertices chosen at random. The final vertex has three randomly chosen edges. There are no duplicate edges. The lengths of the edges are uniformly distributed random integers in the range [0..9]. X-ladder. The x-ladder graph is shown in Figure 1. There are even number, N, of vertices numbered 0 ... N-1. |
||
| The edges are | ||
As shown in [CHRI92b], the X-ladder graph with random edge lengths yields an number of distance
messages in the Bellman-Ford algorithm exponential in the number of vertices.
Experiments. In our experiments, we ran each algorithm 32 times for a given number of vertices to permit the use of large sample statistics. The lengths assigned to the edges were assigned using a random number generator seeded from a different random number generator. No two runs in the same experiment used the same seed. The order of runs was randomized to avoid being influenced by secular trends (e.g. other load on the machine). Hardware system. The experiments were run on an Encore Multimax computer with 20 processors at a time of low utilization. To implement MDC90's distributed memory model, processes on the Encore simulate the nodes on a multicomputer: an MDC90 location is assigned to a particular process. Messages, however, are allocated in shared memory and passed among the processes through linked queues, not by copying. |
||
| 5.5 | RESULTS | |
| We would expect the chain+random networks to have relatively shallow active trees and relatively few task
messages sent. Hence these networks should terminate the Bellman-Ford algorithm quickly. The X-ladder networks
would have deep active trees and an exponential number of task messages. The cost of termination detection in Active Tree algorithm is proportional to the number of task messages sent since there is one acknowledgement message per task message. Hence we would expect the Active Tree algorithm to be more efficient for the Chain+random networks than the X-ladder. The cyclic probe algorithm puts the least demand of the system while the underlying algorithm is running: at any one time there is at most one probe message waiting to be processed. The major problem with the cyclic probe technique is that it visits the locations sequentially and the probe message must make two cycles after the underlying algorithm has terminated before detecting the termination. Worse yet, our cyclic probe visits all locations in numeric order. Since location i is allocated to process (i modulus P) for P processes, the probe cycles through the processes round robin. Since there are no other activities keeping the processes busy, each probe transmission during the last cycles wakes up a process that is waiting on a semaphore. Our probe tree algorithm uses a multiway tree. The probe intervals run reasonably quickly, but that means that the system has a high density of query and probe messages to process, which takes processing power away from the underlying algorithm. The probe tree algorithm could potentially be the worst of the three by swamping the system with termination detection messages. Figures 2 and 3 show Chain+Random networks of a range of sizes both with four and with ten processes. The cyclic probe algorithm is uniformly the worst, probably due to the high cost of the final two cycles waking a process for each location visited. The Active Tree algorithm is better with four processes; the Probe Tree with ten. It is unclear why, but it is possible that having extra processors available to process the probe messages speeds up the Probe Tree. Figure 4 shows the algorithms under heavier load. The X-ladder graph has an exponential growth in the number of task messages sent, so fewer total vertices were tested to save computing time. Here the Active Tree algorithm was the worst, due to the costs of termination growing proportionally to the number of task messages. The Probe Tree algorithm is the best and appears to be diverging from the Cyclic Probe algorithm as in the earlier figures. Figures 5 and 6 hold the number of vertices constant while varying the number of processors. Figure 5 tests Chain+Random networks composed of 500 vertices. Figure 6 tests the X-ladder network with 150 vertices. Folklore says that the best number of processors to use on a shared-memory machine is about half the total available, so the results for around ten processors are likely to be of most interest. Probably the cyclic+random graph computation (Figure 5) terminates very quickly causing the cyclic probe to show the worst behavior: the probe message must move from location to location, all of which are on separate processes, waking each of them from a semaphore wait. However, that fact alone would argue that the costs across varying numbers of processes should be nearly constant with a slight decline for the underlying algorithm terminating more quickly as more processors are applied. The rising cost is unexpected. The rising cost may be due to a strain placed on underlying UNIX scheduling algorithms by a strictly round-robin activation of processes (due to page replacement, perhaps). In Figure 6, the active tree algorithm is significantly worse than the other techniques. The acknowledgement messages equal the number of distance messages sent, and since the number of distance messages are exponential in the number of nodes, the number of probe messages that have to be processed are far overshadowed by the number of acknowledgements. |
||
| 5.6 | DISCUSSION | |
| We have presented the overlapping interval/charged system metaphor technique for detecting termination of
distributed computations that take place at a fixed collection of locations. We have compared this technique
to the conventional "active tree" technique using the Bellman-Ford single-source, shortest path algorithm.
We find the overlapping interval/charged system metaphor technique is slower for some underlying networks
and faster for others. The time taken to detect termination in the active tree technique is
proportional to the number of task messages sent, but the time taken in the overlapping interval/charged
system metaphor technique is independent of the number of task messages sent. Probe trees were faster than cyclic probes for creating sets of overlapping intervals, but cyclic probes may not have been treated fairly; further experiments should be run with the probe visiting all locations in the same process before going on to (and waking) the next process. |
||
| 5.7 | ACKNOWLEDGEMENTS | |
| We ran these experiments at the Advanced Computing Research Facility, Mathematics and Computer Science Division, Argonne National Laboratory, whose provision of computing facilities we gratefully acknowledge. | ||
| 5.8 | REFERENCES | |
|
||