Dynamic Load-balancing

Motivation

Load balance is one of the major challenges for the efficient use of supercomputers, especially for applications that exhibit workload variations. Load imbalance is a considerable impediment on the path towards higher degrees of parallelism.

In particular, when load conditions change dynamically, efficient mesh partitioning becomes an indispensable part of a scalable design.

Goals

Fluid dynamics applications in the field of industrial engineering require high degrees of parallelism to achieve an acceptable time to solution for large problem sizes. Typical mesh-based approaches therefore rely on suitable partitioning strategies to distribute the computational load across the set of processes.

Therefore, an efficient load-balancing approach aims to achieve two goals:

  • The work load should be distributed evenly
    • avoid waiting times of processing units

  • At the same time, the interface boundaries between partitions should be as small as possible.
    • minimize the time spent in communication

The optimization problem is NP-hard.

Implementing SFC

The numerical approximation of the wave equation is an hp-adaptive approach. That is, elements can split or merge (h-adaptive) according to the required resolution. They can also raise or lower the polynomial degree (p-adaptive) to adjust the convergence rate.

Due to this hp-adaptivity, different elements can have different individual computation times, which introduces load imbalance into the application. Re-meshing and domain partitioning therefore become unavoidable.

With the help of an SFC, the 2D structured mesh partitioning problem can be reduced to a 1D chains-on-chains partitioning (CCP) problem. Knowing the index of an element, the indices of its neighbours can be computed locally.

Hilbert Curve

There are many SFCs, for example the Morton curve (Z-curve) and the Hilbert curve.

We choose the Hilbert curve as our SFC. Although Hilbert ordering is more expensive to compute (it requires flips and rotations) than the Morton curve, the Hilbert curve provides better locality (no sudden “jumps”).
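As a concrete sketch, the standard bit-manipulation mapping between grid coordinates and the Hilbert index can be written as follows. This is an illustrative implementation (the `_rot` helper performs the flips and rotations mentioned above), not necessarily the exact code used in the application:

```python
def _rot(n, x, y, rx, ry):
    # Rotate/flip a quadrant so it matches the base orientation; this
    # is the extra work that Morton ordering avoids.
    if ry == 0:
        if rx == 1:
            x = n - 1 - x
            y = n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Map grid coordinates (x, y) on an n-by-n grid (n a power of two)
    to the 1D Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rot(n, x, y, rx, ry)
        s //= 2
    return d

def d2xy(n, d):
    """Inverse mapping: Hilbert index d back to grid coordinates."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        x, y = _rot(s, x, y, rx, ry)
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

The “no sudden jump” property can be checked directly: any two consecutive Hilbert indices map to grid cells that share an edge. On an 8 × 8 grid this orientation also reproduces the numbering used in the example below, e.g. `xy2d(8, 3, 4)` returns 31.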

(Left: Morton curve; right: Hilbert curve)

Static Grid Neighbour-finding algorithm

In computational fluid dynamics, elements need to exchange information (e.g. fluxes, velocity, pressure) with their neighbours in most cases. Thus, an efficient way to locate neighbours cuts down the computation time. When a neighbour is not stored locally, communication between processors is inevitable.

_images/hilber_numbering.svg

For instance, suppose we are on element 31. The domain is partitioned into 4 parts, and each part is assigned to one processor. The integer coordinate of element 31 is (3, 4).

Its neighbours' coordinates can therefore be computed easily. Say we want to find its north and east neighbours: their coordinates are (3, 5) and (4, 4), respectively.

North neighbour: we can use our Hilbert-numbering function to map between coordinates and element indices. Then (3, 5) corresponds to element 28, and we have successfully located the neighbour.

East neighbour: by the same method, we can compute the east neighbour's index: 32. However, this element is not stored locally. Locating the processor that stores the target element is done by broadcasting the element ranges stored on each processor after the partitioning, and one-sided communication is invoked to ensure efficient MPI message exchange.
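Once every processor knows the element ranges (gathered, for example, with an allgather after partitioning), the owner lookup is a binary search. The following is a serial sketch; `find_owner` and `range_starts` are illustrative names, and contiguous index ranges along the curve are assumed:

```python
import bisect

def find_owner(element_index, range_starts):
    """Return the rank storing `element_index`, where range_starts[r]
    is the first global element index owned by rank r (rank r owns
    range_starts[r] .. range_starts[r+1] - 1)."""
    return bisect.bisect_right(range_starts, element_index) - 1

# 64 elements split evenly across 4 ranks: rank r owns 16r .. 16r + 15.
range_starts = [0, 16, 32, 48]
```

With this layout, element 31 is found on rank 1, while its east neighbour 32 lives on rank 2, so remote access is required for the latter.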

Dynamic Grid Neighbour-finding algorithm

When h-adaptivity is introduced into the code, elements split or merge according to the error indicator. Once an element splits, it generates four identical “children” quadrants. The octree partitioning is motivated by octree-based mesh generation.

_images/quardtree_mesh.jpg

Neighbour-finding is achieved by using a global index (k, l, j, s) to identify each element.

  • k: Root element number.

  • l: h-refinement level (split number).

  • j: child's relative position inside its parent quadrant.

  • s: element state, which can be used to determine the Hilbert curve orientation.

_images/hilbert_adaptive_mesh.png
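A minimal sketch of this global index as a record, with an illustrative split operation. The state-transition table for s is application-specific (it follows from the Hilbert orientation rules), so it is passed in here as a plain lookup rather than hard-coded:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class GlobalIndex:
    k: int  # root element number
    l: int  # h-refinement level (number of splits)
    j: int  # child position inside the parent quadrant (0..3)
    s: int  # element state (local Hilbert curve orientation)

    def split(self, child_states: List[int]) -> List["GlobalIndex"]:
        # One h-refinement step produces four children one level
        # deeper; each child's state is taken from the supplied
        # orientation table.
        return [GlobalIndex(self.k, self.l + 1, j, child_states[j])
                for j in range(4)]
```

Merging is the inverse: four siblings with the same (k, l) collapse back into their parent at level l - 1.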

Partitioning strategy

We consider a 2D mesh represented as a one-dimensional array using the Hilbert curve.

Implementation

We follow the CCP strategy described in [HKR+12]. The array has length \(N\), which corresponds to the number of mesh cells. Weights are given as \(\omega_i\), where \(i\) corresponds to the global index of each element. The weights represent the computational effort of each element. In fact, the load on each element due to the fluid computation is \(O(N^4)\) [ZBMZ+18].

_images/Hilbert_uniform_grid_partition.png

The task of the partitioning step is to decide which elements to move to which processor. Here, we use \(p\) to denote the total number of processors; every processor is identified by a unique number, its \(rank\) (\(0 \leqslant rank < p\)).

We use an exclusive prefix sum to determine the partition.

(1)\[prefix(I) = \sum_{i = 0}^{I - 1}\omega_i\]

for \(0 < I \leqslant N\) and \(prefix(0) = 0\). Local prefix sums are calculated first, and the global offsets are adjusted afterwards using the MPI_EXSCAN() collective with MPI_SUM as the reduction operation. Then each processor has the global prefix sum for each of its local elements.
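A serial stand-in for this computation is shown below; in the parallel code each rank computes the same values over its local elements and shifts them by the offset returned by MPI_EXSCAN:

```python
from itertools import accumulate

def exclusive_prefix(weights):
    """Exclusive prefix sum as in Eq. (1): result[i] is the sum of
    weights[0..i-1], with result[0] = 0."""
    return [0] + list(accumulate(weights))[:-1]
```

For example, `exclusive_prefix([3, 1, 4, 1, 5])` yields `[0, 3, 4, 8, 9]`: each entry is the total weight of all preceding elements along the curve.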

The ideal work load per partition is given by

(2)\[\omega_{opt} = \frac{\omega_{globalsum}}{p}\]

where \(\omega_{globalsum}\) is the global sum of all weights. Since the overall sum is already computed through the prefix sum, we can use the last processor as the root to broadcast (MPI_BCAST) \(\omega_{opt}\). The splitting positions between balanced partitions can then be computed locally; no further information exchange is needed to decide which elements to move to which processor. The complete message exchange for the partitioning relies on only two MPI collective operations, both of which can be implemented efficiently with an asymptotic running time and memory complexity of \(O(\log p)\).

Assuming homogeneous processors, the ideal splitters are multiples of \(\omega_{opt}\), i.e., \(r \cdot \omega_{opt}\) for all integers \(r\) with \(1 \leqslant r < p\). The splitting positions between actual elements closest to the ideal splitters can be found by comparing against the global prefix sum of each element.
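A serial sketch of this splitter search, assuming the exclusive prefix sums of Eq. (1) are available as a list; `compute_splitters` is an illustrative name:

```python
import bisect

def compute_splitters(prefix, total, p):
    """For each ideal splitter r * w_opt (r = 1..p-1), pick the element
    whose exclusive prefix sum lies closest to it. Elements with index
    below splitter r then belong to partition r - 1."""
    w_opt = total / p
    splitters = []
    for r in range(1, p):
        target = r * w_opt
        i = bisect.bisect_left(prefix, target)
        # step back when the previous candidate is at least as close
        if i == len(prefix) or (i > 0 and
                                target - prefix[i - 1] <= prefix[i] - target):
            i -= 1
        splitters.append(i)
    return splitters
```

With uniform weights the splitters fall exactly at multiples of \(N / p\); with non-uniform weights each splitter snaps to the nearest achievable cut along the curve.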

The efficiency \(E\) of the work distribution is bounded by the slowest process, and thus cannot be better than:

\[E = \frac{\omega_{opt}}{\max_{r=0}^{p-1}(\omega_{sum}(r))}\]

where \(\omega_{sum}(r)\) is the total weight assigned to rank \(r\).
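As a small illustration of this bound, assuming each list entry is the total weight \(\omega_{sum}(r)\) assigned to rank \(r\):

```python
def efficiency(partition_loads):
    """E = w_opt / max_r w_sum(r): ratio of the ideal per-rank load to
    the heaviest actual per-rank load."""
    w_opt = sum(partition_loads) / len(partition_loads)
    return w_opt / max(partition_loads)
```

A perfectly balanced distribution gives \(E = 1\); any overloaded rank drags the efficiency below that.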

Exchange of Elements

After the splitting positions are decided, elements need to be relocated. This relocation, or exchange of elements, is done via communication between processors. The challenging part is that, although each sender knows which elements to send to which processor, the receivers do not know what they will receive.

Some applications use a regular all-to-all collective operation to inform all processors about their communication partners before doing the actual exchange of the elements with an irregular all-to-all collective operation (e.g. MPI_Alltoallv).

Alternatively, elements can be propagated only between neighbouring processors in an iterative fashion. This method can be beneficial when the re-partitioning modifies an existing distribution of elements only slightly. Unfortunately, worst cases can lead to \(O(p)\) forwarded messages.

In our implementation, one-sided MPI communication, also known as RDMA or RMA (Remote Memory Access), is used. With RMA, the irregular communication pattern can be handled easily without an extra step to determine how many sends and receives to issue. This makes dynamic communication easier to code, with the help of MPI_Put and MPI_Get.

References

HKR+12

Daniel F. Harlacher, Harald Klimach, Sabine Roller, Christian Siebert, and Felix Wolf. Dynamic load balancing for unstructured meshes on space-filling curves. 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, pages 1661–1669, 2012.

TSA06

Srikanta Tirthapura, Sudip Seal, and Srinivas Aluru. A formal analysis of space filling curves for parallel domain decomposition. 2006 International Conference on Parallel Processing (ICPP‘06), pages 505–512, 2006.

ZBMZ+18

Keke Zhai, Tania Banerjee-Mishra, David Zwick, Jason Hackl, and Sanjay Ranka. Dynamic load balancing for compressible multiphase turbulence. ArXiv, 2018.