10-1996

An empirical comparison of networks and routing strategies for parallel computation

Ronald I. Greenberg
Rgreen@luc.edu

Lee Guan

Author Manuscript
This is a pre-publication author manuscript of the final, published article.

Recommended Citation

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.
An Empirical Comparison of Networks and Routing Strategies for Parallel Computation
(preliminary version)

Ronald I. Greenberg
Mathematical and Computer Sciences
Loyola University
6525 North Sheridan Road
Chicago, IL 60626
rig@math.luc.edu

Lee Guan
Electrical Engineering
University of Maryland
College Park, MD 20742
leegaun@eng.umd.edu

Abstract
This paper compares message routing capabilities of important networks proposed for general-purpose parallel computing. All the networks have been proven to have some type of universality property, i.e., an ability to simulate other networks of comparable cost with modest slowdown, using appropriate cost and communication models. But in this paper we seek an empirical comparison of communication capability under typical direct use rather than an analysis of worst-case results for simulating message traffic of another network.

1 Introduction
A significant challenge to massively parallel computing is providing an economical interconnection network that can support general patterns of communication among processors. It has been noted that the hypercube is universal in the sense that it can simulate any network on the same number of processors with logarithmic slowdown (e.g., see [18]). The high pin-out and area requirements of the hypercube are serious detriments, however, which led to development of various types of “fat-tree” networks, which have the property that they can simulate any network of comparable VLSI area with slowdown polylogarithmic in the area under circuit-switched, packet, or wormhole routing models [3, 4, 8, 9, 10, 13, 14]. This research has influenced the design of parallel computers by Thinking Machines Corporation and Meiko [5, 12, 15]. Most area-universality analyses have been performed in the “unit wire delay model”, i.e., assuming that unit time suffices to send a bit across a wire regardless of its length. This model has generally proved reasonable for current technology, but may become less appropriate as we build larger systems; an extension of the fat-tree referred to as the “fat-pyramid” [8] has been shown to be area-universal given any reasonable dependence of delay on wire length. We use a simple version of the “butterfly fat-tree” (BFT) and fat-pyramid as illustrated in Figure 1. Finally, the mesh has been a popular network for parallel computing, and it is easy to see that the mesh is area-universal under linear wire delay, which may be the most accurate model in the distant future (e.g., see [6]).

We perform an empirical comparison of message routing on the networks mentioned above, since examination of typical performance in practice may lead to substantially different conclusions than an analysis of worst-
case slowdown for simulating another network. This paper focuses on the unit wire delay model; though the universality advantages of the mesh or fat-pyramid come in to play with different models of wire delay, it is interesting to see whether there is significant detriment to using these networks in the unit delay model. Most of the simulations here are performed in the simple (store-and-forward) packet routing model, but we also compare to wormhole routing, where messages (worms) are composed of fits or flow control digits, and worms snake through the network one fit after another with only a constant number of fits being stored in an intermediate node at any time. In store-and-forward routing, messages are conceptually transferred from node to node as atomic units, but we still count an appropriate number of fit steps to transfer a packet of many bits from one node to another to achieve a fair comparison with wormhole routing.

2 Equalizing Hardware Cost

To make fair comparisons between different networks (with a given number of processors), we adjust the channel width (the number of wires connecting adjacent nodes) in each, to make the cost of interconnections equal. We consider three models of hardware cost that have received substantial recent attention. Bisection width and pin-out constraints have been considered in prior empirical studies of wormhole routing on k-ary n-cube networks [1, 7]. The Thompson model for area [16, 17] has been the focus of theoretical analyses of area-universality [3, 4, 8, 9, 10, 13, 14]. The bisection width of a network is the minimum number of wires cut when the network is divided into two equal halves. The pin-out of a node is the degree times the channel width W, and total pin-out is the sum of the pin-outs over all nodes. VLSI layout area is evaluated based on the assumption that all processors and switches are placed on a 2-D substrate. The substrate has two layer of interconnect for the x-direction and y-direction, respectively, with a minimum wire width and separation. Such a model can serve as a good abstraction for a variety of VLSI packaging technologies, such as wafer-scale integration or printed circuit boards [2]. Since the area required for the processors is the same for all networks, we consider only the area necessary to achieve the interconnections. For each of the networks, we can analyze the area by expressing the side length with n processors as \( S(n) = \sqrt{n} \cdot d \cdot W \cdot P \), where \( P \) is the wiring pitch and \( d \) can be thought of as an average wire density per row or column (with channel width 1) when the processors are laid out in a \( \sqrt{n} \) by \( \sqrt{n} \) grid.

Tables 1 through 3 give the bisection widths (\( B \)), pin-outs (\( PO \)), and average wire densities (\( d \)) for a range of values of \( n \) along with channel widths (\( W \)) to equalize cost.

<table>
<thead>
<tr>
<th>( n )</th>
<th>mesh</th>
<th>hypercube</th>
<th>BFT</th>
<th>fat-pyramid</th>
</tr>
</thead>
<tbody>
<tr>
<td>B</td>
<td>W</td>
<td>B</td>
<td>W</td>
<td>B</td>
</tr>
<tr>
<td>16</td>
<td>4</td>
<td>32</td>
<td>8</td>
<td>16</td>
</tr>
<tr>
<td>64</td>
<td>8</td>
<td>32</td>
<td>32</td>
<td>8</td>
</tr>
<tr>
<td>256</td>
<td>16</td>
<td>32</td>
<td>128</td>
<td>4</td>
</tr>
<tr>
<td>1024</td>
<td>32</td>
<td>32</td>
<td>512</td>
<td>2</td>
</tr>
<tr>
<td>4096</td>
<td>64</td>
<td>32</td>
<td>2048</td>
<td>1</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 1: The bisection width with channel width 1 and the channel width to maintain constant bisection width across different networks.

<table>
<thead>
<tr>
<th>( n )</th>
<th>mesh</th>
<th>hypercube</th>
<th>BFT</th>
<th>fat-pyramid</th>
</tr>
</thead>
<tbody>
<tr>
<td>PO</td>
<td>W</td>
<td>PO</td>
<td>W</td>
<td>PO</td>
</tr>
<tr>
<td>16</td>
<td>48</td>
<td>32</td>
<td>64</td>
<td>24</td>
</tr>
<tr>
<td>64</td>
<td>224</td>
<td>32</td>
<td>384</td>
<td>19</td>
</tr>
<tr>
<td>256</td>
<td>960</td>
<td>32</td>
<td>2048</td>
<td>15</td>
</tr>
<tr>
<td>1024</td>
<td>3968</td>
<td>32</td>
<td>1024</td>
<td>12</td>
</tr>
<tr>
<td>4096</td>
<td>16128</td>
<td>32</td>
<td>49152</td>
<td>10</td>
</tr>
</tbody>
</table>

Table 2: The pin-out with channel width 1 and the channel width to maintain constant pin-out across different networks.

<table>
<thead>
<tr>
<th>( d )</th>
<th>mesh</th>
<th>hypercube</th>
<th>BFT</th>
<th>fat-pyramid</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>16</td>
<td>1</td>
<td>32</td>
<td>2</td>
<td>16</td>
</tr>
<tr>
<td>64</td>
<td>1</td>
<td>32</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td>256</td>
<td>1</td>
<td>32</td>
<td>10</td>
<td>3</td>
</tr>
<tr>
<td>1024</td>
<td>1</td>
<td>32</td>
<td>21</td>
<td>2</td>
</tr>
<tr>
<td>4096</td>
<td>1</td>
<td>32</td>
<td>42</td>
<td>1</td>
</tr>
</tbody>
</table>

Table 3: The wire density per row/column and the channel width under constant layout area constraints.

3 Experimental Results and Conclusions

It has been customary to use network latency as the primary performance measure because of its tendency to limit performance in practice in today’s fine-grained parallel systems. The average latency is the average time to completely transmit a message from source to destination. It depends on load rate (the number of message bits generated per cycle per node), and simulations are run at a fixed load rate with random sources and destinations until average latency reaches a steady state. The average latency generally stays rather constant at low load rates and then increases rapidly as the network saturates. In practice, parallel networks should be designed to operate on the the flat portion of the latency curve. Maximum throughput is another important performance metric, and for certain applications such as sample sorting
can be dominant [11]. Maximum throughput can be read from the latency graphs by looking for the load rate at which the network saturates.

Figures 2 through 4 show packet routing simulation results under constant bisection width, constant pin-out, and constant area constraints, respectively. Simulations are shown for several values of $n$ up to $n = 4096$. Message lengths of 320 bits are used throughout.

The most striking aspect of the packet-routing simulation results in Figures 2 through 4 is that the mesh always performs very well in comparison to the other networks despite the use of the unit wire delay model. While the best low-load latency is obtained with the fat-tree under constant bisection and constant pin-out constraints (for large networks), it is surprising that the performance of the fat-tree is not generally better than what is shown by our simulations, particularly under the sort of area constraint that motivated study of the fat-tree. Performance of the fat-tree and fat-pyramid might be better with the more area-efficient variation in [8, Secs. II–III]. Also interesting is that for the most part, the packet routing graphs look qualitatively very similar to those obtained from wormhole routing with a reasonable range of worm lengths. (As would be expected, however, the packet routing results tend to show higher average latencies and higher maximum throughput.) Only in the case of constant pin-out did the choice of packet routing versus wormhole routing cause some change in the ranking of networks by low-load latency; our wormhole routing results for constant pin-out are shown in Figure 5.

References


Figure 2: Comparison of packet routing latency under constraint of equal bisection width.

Figure 3: Comparison of packet routing latency under constraint of equal pin-out.
Figure 4: Comparison of packet routing latency under constraint of equal interconnect area.

Figure 5: Comparison of wormhole routing latency under constraint of equal pin-out.