AHPCRC Bulletin: Winter/Spring 1997 - Volume 7 Number 1-2
Parallelization of a Zonal Missile Aerodynamics Code
Marek Behr (AHPCRC-UM) and Jubaraj Sahu (ARL-WMRD)
Introduction
Researchers at the Weapons and Materials Research Directorate (WMRD) of the Army Research Laboratory (ARL) in Aberdeen, Maryland, rely in their day-to-day work on a set of missile aerodynamics CFD codes. A typical analysis involving a one-million-cell grid requires over 8 hours of computation on the CRAY C90. In a recent effort to incorporate scalable algorithms and move the computations to modern parallel machines, WMRD is collaborating with the ARL Corporate Information and Computing Center (CICC) and the AHPCRC author on rewriting the F3D zonal code for distributed-memory architectures. The objective is to produce a possible replacement for the vector implementation of F3D, with minimal effect on the code interface and numerical characteristics. The final version should be portable among the scalable architectures available through the DoD Major Shared Resource Centers and Distributed Centers, and should offer a significant reduction in analysis time.
The F3D code is based on the Reynolds-averaged 3D compressible Navier-Stokes equations written in conservation form. A curvilinear, body-conforming discretization is mapped onto a uniform computational space, and a thin-layer approximation is made in the grid direction pointing away from the body surface. The resulting equations are solved using a flux-split upwind finite difference scheme in the streamwise direction and central differencing in the other directions. Both implicit and explicit versions of the algorithm exist, although only the implicit one is in widespread use. An overlapping composite grid scheme is used to reduce storage costs, as well as to enable the handling of complicated geometries. The original code was optimized for a single vector processor, achieving a sustained speed of 300 MegaFLOPS on the CRAY C90.
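The flux-split upwind idea can be illustrated on a 1-D linear advection model problem. This is a deliberately simplified sketch, not F3D's scheme: the actual code applies flux-vector splitting to the 3-D Navier-Stokes fluxes, where the split parts are matrices rather than scalars.

```python
# Toy 1-D analogue of flux-vector splitting: the flux f(u) = a*u is split
# into a part carried by the positive wave speed and a part carried by the
# negative one, and each part is differenced in its own upwind direction.

def upwind_step(u, a, dt, dx):
    """One explicit time step of u_t + a u_x = 0 with flux splitting."""
    ap = max(a, 0.0)          # positive part of the wave speed
    am = min(a, 0.0)          # negative part
    n = len(u)
    unew = u[:]
    for j in range(1, n - 1):
        # backward difference for the + flux, forward difference for the -
        dfdx = ap * (u[j] - u[j - 1]) / dx + am * (u[j + 1] - u[j]) / dx
        unew[j] = u[j] - dt * dfdx
    return unew

# advect a step profile to the right; CFL = a*dt/dx = 0.5 (stable)
u = [1.0] * 10 + [0.0] * 10
for _ in range(5):
    u = upwind_step(u, a=1.0, dt=0.5, dx=1.0)
```

Because the differencing direction follows the wave direction, the scheme remains stable and monotone for the step profile, which is the property the streamwise upwind treatment in F3D exploits.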
To better explain the parallelization issues, an overview of the F3D computation steps is given below. All operations proceed on a zone-by-zone basis, with inactive zone data stored either in memory or on a fast mass storage device. A single zone consists of a regular NJ × NK × NL block of cells aligned with the J, K, and L directions. The J direction is assumed to be streamwise and is treated semi-implicitly with two solver sweeps, in the J+ and J- directions. During the J+ sweep, for each consecutive streamwise plane, the grid points are coupled in the L direction, while they are treated independently in the K direction. This requires the solution of K block-tridiagonal systems of size L with 5 × 5 blocks. In the J- sweep the roles are reversed, with coupling present in the K direction only, and L block-tridiagonal systems of size K. Before the sweeping can commence, a volume calculation of the right-hand side (RHS) must take place (Figure 1). An efficient parallel implementation of these two distinct computation phases, RHS formation and solver sweeps, is crucial to the overall effectiveness and scalability of the code.
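The sweep structure can be sketched with a scalar tridiagonal solve (the Thomas algorithm) standing in for the 5 × 5 block solves; in F3D each scalar entry below would be a 5 × 5 block and each division a block inverse. The loop nest in the comment shows where the independent K systems arise.

```python
# Scalar Thomas algorithm as a stand-in for the block-tridiagonal solves
# performed in each streamwise plane of the J+/J- sweeps.

def thomas(a, b, c, d):
    """Solve a tridiagonal system; a = sub-, b = main, c = super-diagonal."""
    n = len(b)
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# In the J+ sweep, each streamwise plane solves NK independent systems of
# size NL -- it is these independent K solves that a parallel version can
# distribute over PEs:
#
#   for j in range(NJ):          # sequential in J
#       for k in range(NK):      # independent -> parallel over PEs
#           solve block-tridiagonal system coupled in L
#
# example: tridiag(-1, 2, -1) system whose exact solution is [1, 2, 3, 4]
x = thomas([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])
```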
Of the two computation-intensive stages of the code, the RHS formation lends itself to parallelization most easily. It is a volume computation in which each grid point is operated on independently, with only older values at neighboring points required to complete the computation. The entire set of zone cells can be distributed over the available processing elements (PEs) in an arbitrary manner. However, for the sake of the subsequent solver computations, it makes sense to decompose only the K and L grid dimensions, leaving the entire J dimension associated with a single PE. The K-L plane is mapped onto a rectangular grid of all PEs. To avoid repetition of interprocessor transfers, each rectangular portion of the K-L plane also contains 2 layers of "ghost" points, which track the two closest sets of values in the subgrids belonging to neighboring PEs. The parallelization of the solver sweeps is not as straightforward. The algorithm requires sequential processing in the J direction, and can be parallelized in both the K and L directions simultaneously only at great added computational cost, e.g., via a cyclic reduction algorithm. An alternative is to accept serial treatment of the J and L directions (J and K for the J- sweep) and devote all PEs to parallelizing the K dimension (L for the J- sweep). This approach has an obvious disadvantage: scalability is not maintained once the number of PEs exceeds either the NK or NL zone dimension. In typical computations, however, the number of PEs and the zone dimensions are matched so that the problem does not arise. Therefore, for the solver sweeps the desired data distribution has the entire J and L dimensions associated with a single PE and the K dimension decomposed among all available PEs for the J+ sweep; and the J and K dimensions associated with a single PE and the L dimension distributed for the J- sweep.
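The K-L decomposition with two ghost layers can be sketched as follows. The function and variable names here are illustrative, not F3D's; the sketch only computes, for one PE in a rectangular npk × npl processor grid, its owned index range and its ghost-extended range clipped at the zone boundary.

```python
# Sketch of the K-L plane decomposition used for the RHS phase, with
# 2 ghost layers mirroring the nearest values of neighboring PEs.

def local_range(n, p, np_):
    """Indices [lo, hi) of a dimension of size n owned by PE p of np_."""
    base, rem = divmod(n, np_)
    lo = p * base + min(p, rem)
    hi = lo + base + (1 if p < rem else 0)
    return lo, hi

def subdomain(NK, NL, pk, pl, npk, npl, nghost=2):
    """Owned and ghost-extended K-L extents for PE (pk, pl)."""
    klo, khi = local_range(NK, pk, npk)
    llo, lhi = local_range(NL, pl, npl)
    # ghost layers are clipped at the physical zone boundary
    ext = (max(klo - nghost, 0), min(khi + nghost, NK),
           max(llo - nghost, 0), min(lhi + nghost, NL))
    return (klo, khi, llo, lhi), ext

# PE (1, 0) of a 4 x 2 processor grid on a 100 x 60 K-L plane
owned, ext = subdomain(NK=100, NL=60, pk=1, pl=0, npk=4, npl=2)
```

The J dimension is absent on purpose: it stays whole on every PE, matching the distribution described above. The sweep phases need a different distribution, which motivates the array reshaping described next.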
Supporting the two sweep-specific distributions requires repeated reshaping of a small number of arrays between the original and the two solver-specific layouts (Figure 2). A number of smaller parallelization issues had to be resolved as well, including reading parameters and broadcasting them among the PEs, efficient disk I/O, and the exchange of boundary data between zones.
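The reshape between the two solver layouts is essentially a distributed transpose. The following serial sketch (toy dimensions and names, not F3D's data structures) redistributes a K-decomposed layout into an L-decomposed one; in the parallel code each assignment across PE boundaries becomes an interprocessor transfer.

```python
# Serial sketch of the layout reshape between solver sweeps: the J+ sweep
# wants K distributed over PEs (L fully local); the J- sweep wants L
# distributed (K fully local). In parallel this is an all-to-all exchange
# of sub-blocks between the two decompositions.

NPES = 4
NK, NL = 8, 8          # toy zone dimensions, divisible by NPES for brevity

def slices(n, npes):
    w = n // npes
    return [(p * w, (p + 1) * w) for p in range(npes)]

def reshape_k_to_l(blocks_k):
    """blocks_k[p] maps (k, l) -> value for PE p's k-rows; returns the
    L-decomposed layout, where PE q owns l-columns in slice q."""
    blocks_l = [dict() for _ in range(NPES)]
    for blk in blocks_k:
        for (k, l), v in blk.items():
            q = l // (NL // NPES)        # destination PE from l alone
            blocks_l[q][(k, l)] = v      # an interprocessor "put" in parallel
    return blocks_l

# build the K-decomposed layout and reshape it
blocks_k = [{(k, l): k * NL + l for k in range(*slices(NK, NPES)[p])
             for l in range(NL)} for p in range(NPES)]
blocks_l = reshape_k_to_l(blocks_k)
```

Since every PE exchanges a sub-block with every other PE, the efficiency of this step depends heavily on the communication layer, which is the issue taken up in the next section.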
Implementation and Timing
The initial attempt to port the F3D code to a scalable architecture involved the CRAY T3D and the CRAFT shared-memory programming model. The advantages of code maintainability and ease of transition were offset by poor performance, and alternative approaches were explored. The more difficult task of rewriting the code in a message-passing framework was undertaken, and the PVM-based code provided initial speed-ups. The reshaping of the arrays during the solver sweeps, however, was a difficult target for efficient implementation using two-sided PVM communication. A much better solution was found in the form of the one-sided CRAY SHMEM communication libraries. In addition to eliminating concerns about deadlocking, the use of SHMEM reduces message latency and increases bandwidth. The problem of SHMEM non-portability was addressed by translating the SHMEM calls into equivalent Bulk Synchronous Parallel (BSP) library calls [3, www.bsp-worldwide.org]. The BSP library, developed at Oxford University, provides one-sided communication capabilities equivalent to SHMEM on a wide range of distributed-memory machines, including the IBM SP2 and SGI PCA. Apart from the communication issues, scalar optimization of the code was attempted in order to extract a reasonable fraction of peak speed on cache-constrained architectures, but that aspect still leaves something to be desired. The current code speed-up on the CRAY T3D and T3E platforms reaches a factor of 3.7 over the C90 for the 49-PE T3E (Figure 3), using Mach 1.8 flow past an ogive cylinder on a 3-zone, 1-million-cell grid as a benchmark case.
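The one-sided semantics that make SHMEM (and BSP) attractive can be illustrated with a toy emulation; the class and function here are invented for illustration and only mimic the shape of calls such as SHMEM's put or BSPlib's bsp_put. The writer deposits data directly into the target PE's buffer, so no matching receive is posted and there is no send/receive pairing to deadlock on; a barrier separates communication phases instead.

```python
# Toy emulation of a one-sided "put", replacing two-sided send/receive.
# All names are illustrative; real code would call CRAY SHMEM or BSPlib.

class PE:
    def __init__(self, npes):
        self.buf = [None] * npes    # symmetric buffer, one slot per source

def put(pes, src_pe, dst_pe, data):
    """One-sided write: src deposits data into dst's buffer directly."""
    pes[dst_pe].buf[src_pe] = data

NPES = 4
pes = [PE(NPES) for _ in range(NPES)]

# every PE writes its rank to every other PE, with no receives posted
for s in range(NPES):
    for d in range(NPES):
        put(pes, s, d, s)
# (a barrier would go here before any PE reads its buffer)
```

The all-to-all pattern above is exactly the shape of the array-reshape communication in the solver sweeps, which is why the one-sided model fit it so well.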
The current implementation allows only for a single-cell overlap between separate zones. Many cases of interest to WMRD involve moving grids for aerodynamic computations of multiple bodies in relative motion. A more general interpolation scheme is required for data exchange between such zones, and one is already incorporated in a Chimera version of the vector implementation. Adding the Chimera methodology to the scalable version of F3D is the main goal of future work.
References
J. Sahu and J. Steger, "Numerical Simulation of Three-Dimensional Transonic Flows," AIAA Paper 87-2293 (1987).