Chapter 5 Termination Detection with The Overlapping-Intervals / Charged Systems Metaphor In Systems With Asynchronous Message Passing

Copyright © 1994. Thomas W. Christopher


5.1 Abstract
  The overlapping interval/charged system metaphor distributed termination detection technique is presented. This technique is applicable to systems with a fixed number of computational locations and asynchronous message passing. It is experimentally compared to the conventional technique of maintaining a tree of active locations.

Keywords: distributed termination detection, asynchronous message passing, Bell-man-Ford algorithm.

5.2 INTRODUCTION
  One of the problems with distributed algorithms is simply determining when the algorithm has terminated. Distributed termination detection algorithms have received considerable attention [DIJK80] [DIJK83] [FRAN80] [FRAN82] [APT84] [TOPO84] [APT86], typically with synchronous message passing of the CSP style. Asynchronous message passing complicates termination detection, since messages cannot be detected while in transit, but the computation has terminated only when there are no further such messages. Our work with Message Driven Computing [CHRI92a] necessitated our investigation of termination detection with asynchronous message passing. We were inspired by the weighted throw count scheme of [ROKU88], although our algorithm differs from theirs.

In this paper we investigate a pair of termination detection algorithms based on overlapping intervals to determine the global state of the system at some instant of time and the charged system metaphor to determine that no messages are in transit. We prove the cor-
rectness of these techniques and experimentally compare their performance to the conventional technique of maintaining a tree of active locations (objects or processes).

5.3 ALGORITHMS
5.3.1 Active Tree Algorithm
  The conventional active tree termination detection algorithm [DIJK80] works as follows: A location may be either active or inactive. A location is initially inactive. It becomes active upon receipt of a task message, which will be called the critical message. An active location maintains a count of the number of tasks it has sent which have not yet been acknowledged. While active, a location immediately acknowledges task messages, increments its count for each task message it sends, and decrements the count for each acknowledgement it receives. When it has received acknowledgements for all the task messages it has sent, a location becomes inactive and acknowledges the critical message.
The active locations from a tree. Each directed edge leads from one active location to the location that sent the critical message that made it active. All paths lead to the root. An edge cannot lead to an inactive location, since that location cannot become inactive until the critical message is acknowledged, which will not happen until the source of the edge becomes inactive, dropping out of the tree, and implicitly removing the edge.
The computation has terminated when the root becomes inactive.
5.3.2 Ordering of computational events
  Before presenting the overlapping intervals/charged-system metaphor termination detection algorithm, we need to present a model of distributed computation in terms of computational events connected by message passing. A computational event is an execution of a block of code which begins and ends with sending or receiving messages (or the initiation or termination of the task).
Message transmission imposes a partial order on the events in an execution of a distributed algorithm. Let
Ei -> Ej
mean that event Ei sends a message that immediately triggers event Ej. (In MDC90, the language we use, there are no processes, and the only causality between events is in fact by message passing. For more conventional languages, please consider the activation
stack to be a message passed between sequential events within a process.)

The relation ->+ is the transitive closure of ->.
Ei->+Ej
means either Ei-> Ej or there is an event Ek such that Ei-> Ek and Ek ->+ Ej.
The relation ->+ induces a partial temporal order on the events in the algorithm. We say
Ei precedes Ej (Ej follows Ei) if Ei->+ Ej.

We can linearize the events in the execution of the algorithm, numbering them so that if Ei->+ Ej then i<j. Further, if i<j then Ei precedes Ej in the ordering (they are not necessarily ordered via messages). Our use of "linearize" is derived from [HERL87], albeit simplified by the functional nature of computational events and their mutual exclusion at locations.

5.3.3 Overlapping Intervals
  We will use the term location to refer to an object or process that computes in response to asynchronous messages. We will deal only with algorithms in which the computations occur at a fixed number of locations. Let the n locations in the system be numbered from 1 to n: L1, L2, ..., Ln.
To determine whether an execution of the algorithm has terminated, probe messages are sent throughout the system. The arrival of a probe message at a location will cause a probe event. Let Pik be the kth probe event at location i. We would like to probe all locations simultaneously to determine the global state of the system, but unfortunately we are unable to. What we can and will do is determine the state of the system over an interval.

The kth interval at location i is the time between probe Pi(k-1) and Pik. (Pi0 is defined as the beginning of the execution at location i.)
We create overlapping intervals by requiring that
Pi(k-1)->+ Pjk
for all locations i and j and probe number k>0. The set of all intervals Pi(k-1) to Pik will be the kth set of overlapping intervals.
Two techniques for achieving sets of overlapping intervals are cyclic probes and probe trees.
Cyclic probe. In the cyclic probe method [DIJK83], a probe message travels from location to location around a ring. Suppose the locations are numbered L1, L2 , ..., Ln and the probe message visits them in ascending order. Then
Pik -> P(i+1)k i<n
Pnk -> P1(k+1)
Clearly
Pik ->+Pj(k+1) for all locations Li ,Lj

Probe tree. In the tree method [FRAN82] [TOPO84], one location Lr is selected as the root, a set of locations, I, are the internal nodes, and the remaining locations form the set of leaf nodes, F. If there is more than one location, then Lr is in set I. We shall assume this case, since a single location system does not need elaborate termination detection.
We define a function parent that maps each location other than Lr into its parent in the tree. Since the locations form a tree, there are no cycles in the parent graph. For any location Li not in F, children(Li) = {Lj | parent (Lj) = Li}

Query messages travel down the tree from root to leaves, and probes travel back up the tree from leaves to the root. Query messages are sent by Q events. Let Qik be the kth Q event at location i.
For all locations Lj, j < >r, Qjk is triggered by a query message send by event Qik, Li = parent(Lj). For all Li in I, Qik sends query messages to each location Lj in children(Li).
For each leaf Lj, Qjk = Pjk, i.e. the kth Q event is the kth probe event. A probe event Pjk,j < > r, sends a probe message to parent (Lj). A probe event at each internal location, Lj, is triggered by the arrival of probe messages from all its children.
The probe event at the root, Prk, is the same as the k+1st Q event at the root, Prk = Qr(k+1)

Note:
Pjk ->+ Prk, all j < >r
Pr(k-1) = Qrk ->+ Qjk, all j < > r
Pjk -> Pik, all j in children (Li)
It can be seen by induction on the distance from the leaves that
Pr(k-1)->+ Pik, all i
Therefore
Pj(k-1)->+ Pik, all i and j
hence the tree method gives overlapping intervals.

5.3.4 Charged system metaphor
  The computational events relevant to the algorithm (i.e. the events other than those only used for termination detection) will be called actions. Let Aijk be the jth action at location i in the kth interval. Actions are triggered by the arrival of one or more task messages and may in turn send zero or more task messages. When there are no more task messages in transit, no further actions can occur, and the algorithm has terminated. Unfortunately, the number of messages in transit cannot be directly examined.
We will detect the number of task messages in transit reaching zero by using a charged system metaphor. Each location Li keeps a count ci which records the current charge of Li. Each task message carries a charge of -1. Each task message sent increases the charge of its sender by one; each task message received decreases the charge of its receiver. Action Aijk which is triggered by the receipt of g>0 task messages and which sends h >= 0 task messages will add h-g to the charge of location Li. Assuming that when the algorithm begins executing the charges in the locations sum up to the initial number of task messages, the sum of the charges in a linearization of the execution will always equal the number of task messages in transit.
5.3.5 Overlapping Interval/charged system metaphor
  The overlapping interval/charged system metaphor termination detection algorithm
works as follows: Probe messages are used as described above to create a sequence of overlapping intervals. The probes accumulate the sum of the charges and the logical or of the busy bits across all locations.
The system detects that the computation has terminated if at the end of some set of overlapping intervals, the sum of the charges is zero and the or of the busy bits is false.
The proof is essentially this:
  1. If the underlying algorithm terminates, then in the following interval, no busy bits will be set, and since there are no messages in transit, the sum of the charges will be zero.
  2. If at the end of one interval all busy bits are false and the sum of the charges is zero, then the charge at each location at the end of the interval is the charge there throughout the entire interval. Since the intervals are overlapping, there is some instant across all the locations at which the sum of the charges at the locations is zero and hence no message is in transit. If there are no messages in transit, the algorithm has terminated.

This proof is at a rather high level. For a more detailed proof, it is advantageous to phrase the proof in terms of linearizations of the computational events.
Let bi be the busy bit at location Li. This bit is set (to true) by any action Aijk. It is reset (to false) by any probe Pik. Let kbi be the value of bi as probe event Pik begins execution.
Let kci be the value of ci at event Pik.

Theorem : The algorithm has terminated iff
(SUMik ci = 0) AND ¬ (kb1 ORkb2 OR ... ORkbn)

Proof  
 
  1. Proof that if the algorithm has terminated, then
    (SUMik ci = 0) AND ¬ (kb1 ORkb2 OR ... ORkbn)
    Linearize the events in the execution giving a sequence E1, E2 , ..., EN. (Recall that if Ei ->+ Ej , then i Let
    • bij be the value of bi following event Ej. bi0 is the initial value, true.
    • cij be the value of ci following event Ej. ci0 is the initial value.
    • Mj be the number of task messages in transit following event Ej.

    is invariant.

    SUMiCij = Mj
    If the algorithm terminates at action Ex=Aij(k'), then Mx = Mx+1 = Mx+2 = .... = 0. Also, there is no action Eu where u>x, hence cgx = cg(x+1) = cg(x+2) = .... for all locations Lg

    Pick an event Ew = Ai'v'k", w <= x, for which k" is largest. (It might not be Ex.)
    Following P
    ik" (for all i), bi will remain false since Pik resets the bit and there are no actions to set it. Therefore k"+1 bi = false for all i.
    Recall E
    x=Aij(k') is the event (the last action) at which the algorithm terminated, and k'<= k". Let Ez=Pik". Observe x<z, since the actions of the kth interval occur before the kth probe.
    Since P
    ik" ->+ Pj(k"+1) for all Lj , and since after termination, M=0

    SUMik"+1Ci = 0

    Therefore, letting k=k"+1, (SUMik ci = 0) AND¬ (kb1 ORkb2 OR ... ORkbn)

  2. Proof that if
    (SUM
    ik ci = 0) AND¬ (kb1 ORkb2 OR ... ORkbn)
    for some k
    th set of overlapping intervals, then the algorithm has terminated.
    Suppose the condition (SUM
    ik ci = 0) AND¬ (kb1 ORkb2 OR ... ORkbn)
    is met at the end of interval k.

    We will show that for any legal linearizations of events E1, E2, ..., EN, the formula will hold only if the algorithm has terminated.
    Choose an arbitrary linearization of the events.
    Let E
    x =Pi(k-1) such that for all Pj(k-1)=Ez , z <= x. Ex is the last probe event for any of the k-1st intervals in the linearization.
    By the definition of overlapping intervals, P
    i(k-1) ->+ Pjk. In the linearization, this is preserved, hence for all Ey = Pjk, y>x.
    Since no location L
    i has been busy over interval k, ci has remained constant from Pi(k-1) to Pik. At the instant following event Ex, which is within all of the kth intervals and the

    SUMiCix=Mx = 0

    algorithm has terminated.

5.4 TEST METHODS
  We compared the overlapping interval/charged-system metaphor termination detection algorithm to the conventional active tree algorithm. For our experiments, we used the Bellman-Ford algorithm for the single source/shortest path problem.
Test problem  
  Single-source, shortest path problem. A network is a directed graph where each edge has a non-negative length. The length of a directed path is the sum of the lengths of the edges in the path. The single source shortest path problem associates a distance Dv with each vertex v where Dv is the minimum length of any directed path from the start vertex to v.
The Bellman-Ford Algorithm. The asynchronous Bellman-Ford algorithm works as follows: messages are sent along edges of the graph giving the length of a path from the start vertex that ends with that edge. All vertices have an initial distance of infinity, except for the start vertex which has distance zero. When a vertex receives a message specifying a lower distance than it is already assigned, the vertex assigns itself that distance and sends messages along each of its outgoing edges containing its new distance plus the length of the edge. The algorithm begins with the start vertex sending messages along its outgoing edges giving the length of the edge.
In our experiments, we use the two following networks:
Chain+Random. The "chain+random" graph has a chain of all the vertices in numeric order. In addition, each vertex except the last has two extra edges leading from it to other vertices chosen at random. The final vertex has three randomly chosen edges. There are no duplicate edges. The lengths of the edges are uniformly distributed random integers in the range [0..9].
X-ladder. The x-ladder graph is shown in Figure 1. There are even number, N, of vertices numbered 0 ... N-1.
The edges are  
 
  • <i,i+2> for all 0 <= i <i,i+1> for all even i in the range 0...N-2.
  • <i,i+3> for all odd i in the range 0...N-5.
  • The lengths of the edges are uniformly distributed random integers in the range [0..9].

As shown in [CHRI92b], the X-ladder graph with random edge lengths yields an number of distance messages in the Bellman-Ford algorithm exponential in the number of vertices.
We laid out the vertices so that vertex number i is allocated to processor number (i modulus P) for P processors.
Details of termination detection algorithms. All locations in a process are scheduled round robin. Each time a location is dispatched in the active tree algorithm, it consumes one distance or acknowledgement message, giving higher priority to distance messages; a location will not consume an acknowledgement message if there is a distance message waiting to be processed. The probe tree algorithm superimposes a hypercube upon the locations and builds a spanning tree along edges of the hypercube. The locations are numbered from zero. Location zero is the root. If the binary representation of a location's number has k rightmost zeros, the location has k children in the tree whose numbers are the same as their parent's except for precisely one of k low order zeros being changed to a one (if they exist).
Language. We use the language MDC90 [CHRI92] based on the Message Driven Computing model. A Message Driven Computation system has five characteristics:

  1. Computational events are executions of functions that map sets of input messages into output messages.
  2. Computational events communicate with each other only through asynchronous, unidirectional message passing.
  3. Messages are sent to locations where the messages matching a pattern become inputs to a computational event.
  4. Locations have structured, computable names.
  5. There is mutual exclusion between computational events at the same location.

Experiments. In our experiments, we ran each algorithm 32 times for a given number of vertices to permit the use of large sample statistics. The lengths assigned to the edges were assigned using a random number generator seeded from a different random number generator. No two runs in the same experiment used the same seed. The order of runs was randomized to avoid being influenced by secular trends (e.g. other load on the machine).

Hardware system. The experiments were run on an Encore Multimax computer with 20 processors at a time of low utilization. To implement MDC90's distributed memory model, processes on the Encore simulate the nodes on a multicomputer: an MDC90 location is assigned to a particular process. Messages, however, are allocated in shared memory and passed among the processes through linked queues, not by copying.

5.5 RESULTS
  We would expect the chain+random networks to have relatively shallow active trees and relatively few task messages sent. Hence these networks should terminate the Bellman-Ford algorithm quickly. The X-ladder networks would have deep active trees and an exponential number of task messages.
The cost of termination detection in Active Tree algorithm is proportional to the number of task messages sent since there is one acknowledgement message per task message. Hence we would expect the Active Tree algorithm to be more efficient for the Chain+random networks than the X-ladder.
The cyclic probe algorithm puts the least demand of the system while the underlying algorithm is running: at any one time there is at most one probe message waiting to be processed. The major problem with the cyclic probe technique is that it visits the locations sequentially and the probe message must make two cycles after the underlying algorithm has terminated before detecting the termination. Worse yet, our cyclic probe visits all locations in numeric order. Since location i is allocated to process (i modulus P) for P processes, the probe cycles through the processes round robin. Since there are no other activities keeping the processes busy, each probe transmission during the last cycles wakes up a process that is waiting on a semaphore.
Our probe tree algorithm uses a multiway tree. The probe intervals run reasonably quickly, but that means that the system has a high density of query and probe messages to process, which takes processing power away from the underlying algorithm. The probe tree algorithm could potentially be the worst of the three by swamping the system with termination detection messages.
Figures 2 and 3 show Chain+Random networks of a range of sizes both with four and with ten processes. The cyclic probe algorithm is uniformly the worst, probably due to the high cost of the final two cycles waking a process for each location visited.
The Active Tree algorithm is better with four processes; the Probe Tree with ten. It is unclear why, but it is possible that having extra processors available to process the probe messages speeds up the Probe Tree.
Figure 4 shows the algorithms under heavier load. The X-ladder graph has an exponential growth in the number of task messages sent, so fewer total vertices were tested to save computing time. Here the Active Tree algorithm was the worst, due to the costs of termination growing proportionally to the number of task messages. The Probe Tree algorithm is the best and appears to be diverging from the Cyclic Probe algorithm as in the earlier figures.
Figures 5 and 6 hold the number of vertices constant while varying the number of processors. Figure 5 tests Chain+Random networks composed of 500 vertices. Figure 6 tests the X-ladder network with 150 vertices. Folklore says that the best number of processors to use on a shared-memory machine is about half the total available, so the results for around ten processors are likely to be of most interest.
Probably the cyclic+random graph computation (Figure 5) terminates very quickly causing the cyclic probe to show the worst behavior: the probe message must move from location to location, all of which are on separate processes, waking each of them from a semaphore wait. However, that fact alone would argue that the costs across varying numbers of processes should be nearly constant with a slight decline for the underlying algorithm terminating more quickly as more processors are applied. The rising cost is unexpected. The rising cost may be due to a strain placed on underlying UNIX scheduling algorithms by a strictly round-robin activation of processes (due to page replacement, perhaps).
In Figure 6, the active tree algorithm is significantly worse than the other techniques. The acknowledgement messages equal the number of distance messages sent, and since the number of distance messages are exponential in the number of nodes, the number of probe messages that have to be processed are far overshadowed by the number of acknowledgements.
5.6 DISCUSSION
  We have presented the overlapping interval/charged system metaphor technique for detecting termination of distributed computations that take place at a fixed collection of locations. We have compared this technique to the conventional "active tree" technique using the Bellman-Ford single-source, shortest path algorithm. We find the overlapping interval/charged system metaphor technique is slower for some underlying networks and faster for others. The time taken to detect termination in the active tree technique is proportional to the number of task messages sent, but the time taken in the overlapping interval/charged system metaphor technique is independent of the number of task messages sent.
Probe trees were faster than cyclic probes for creating sets of overlapping intervals, but cyclic probes may not have been treated fairly; further experiments should be run with the probe visiting all locations in the same process before going on to (and waking) the next process.
5.7 ACKNOWLEDGEMENTS
  We ran these experiments at the Advanced Computing Research Facility, Mathematics and Computer Science Division, Argonne National Laboratory, whose provision of computing facilities we gratefully acknowledge.
5.8 REFERENCES
 
  • AGHA86 Agha, G. A.; Actors: A Model of Concurrent Computation in Distributed Systems, MIT Press, Cambridge, MA., 1986.
  • APT84 Apt, K. R., Francez, N., "Modeling the Distributed Termination Convention of CSP," ACM TOPLAS , 6,3, July, 1984.
  • APT86 Apt, K. R., "Correctness Proofs of Distributed Termination Algorithms," ACM TOPLAS, 8,3, July, 1986.
  • CHRI92a Christopher, T. W., and Freitag, G. A. MDC90 Language Description , Version 2.0, Technical Report, Computer Science Department, Illinois Institute of Technology, Chicago IL 60616, 1992.
  • CHRI92b Christopher, T. W. "A Technique for Damping Exponential Behavior in Reactive Object Algorithms With Invalidation," 1992 Int'l. Conf. on Parallel Processing, St. Charles IL, Aug. 1992.
  • DIJK80 Dijkstra, E. W., and Scholten, C. S., "Termination Detection for Diffusing Computations," Information Processing Letters, 11, pp1-4, Aug. 1980.
  • DIJK83 Dijkstra, E. W., Feijen, W.H.J., van Gasteren, A.J.M., "Derivation of a Termination Detection Algorithm for Distributed Computations," Information Processing Letters, 16, pp 217-219, June 1983.
  • FRAN80 Francez, N., "Distributed Termination," ACM TOPLAS, 2,1, Jan. 1980.
  • FRAN82 Francez, N., Rodeh, M., "Achieving Distributed Termination without Freezing," IEEE Trans. on Software Engineering, SE-8,3, May 1982.
  • HERL87 Herlihy, M. P., and Wing, J. M., "Axioms for Concurrent Objects", ACM Principles of Programming Languages Conference, January 1987.
  • TOPO84 Topor, R. W., "Termination Detection for Distributed Computations," Information Processing Letters, 18 pp.33-36, Jan. 20, 1984.
  • ROKU88 Kazuaki Rokusawa, Nobuyuki Ichiyoshi, Takashi Chikayama, Hiroshi Nakashima, "An Efficient Termination Detection and Abortion Algorithm for Distributed Processing Systems," 1988 Int'l. Conf. on Parallel Processing, St. Charles IL, Aug. 1988.

Copyright © 1994. Thomas W. Christopher