
## AHPCRC Bulletin: Winter/Spring 1997 - Volume 7 Number 1-2

## Parallelization of a Zonal Missile Aerodynamics Code

Marek Behr (AHPCRC-UM) and Jubaraj Sahu (ARL-WMRD)

### Introduction
Researchers at the Weapons and Materials Research Directorate (WMRD) of the Army Research Laboratory (ARL) in Aberdeen, Maryland, rely in their day-to-day activities on a set of missile aerodynamics CFD codes. A typical analysis involving a one-million-cell grid requires over 8 hours of computation on the CRAY C90. In a recent effort to incorporate scalable algorithms and move the computations to modern parallel machines, WMRD is collaborating with the ARL Corporate Information and Computing Center (CICC) and the AHPCRC author on rewriting the F3D zonal code for use on distributed memory architectures. The objective is to produce a possible replacement for the vector implementation of F3D, with minimal effect on the code interface and numerical characteristics. The final version should be portable among the scalable architectures available through the DoD Major Shared Resource Centers and Distributed Centers, and should offer a significant reduction in analysis time.
The F3D code is based on the Reynolds-averaged 3D compressible Navier-Stokes equations written in conservation form. A curvilinear body-conforming discretization is mapped onto a uniform computational space, and a thin-layer approximation is made in the grid direction pointing away from the body surface. The resulting equations are solved using a flux-split upwind finite difference scheme in the streamwise direction and central differencing in the other directions. Both implicit and explicit versions of the algorithm exist, although only the implicit one is in widespread use. An overlapping composite grid scheme is used to reduce storage costs, as well as to enable handling of complicated geometries [1]. The original code was optimized for use on a single vector processor, achieving a sustained speed of 300 MegaFLOPS on the CRAY C90.
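The flux-split upwind idea in the streamwise direction can be illustrated with a minimal one-dimensional sketch (this is illustrative only, not the F3D implementation): the flux is split into forward- and backward-moving parts according to the sign of the wave speed, and each part is differenced from its own upwind side.

```python
def split_speed(a):
    """Split a wave speed into non-negative and non-positive parts."""
    return max(a, 0.0), min(a, 0.0)

def upwind_flux_derivative(u, a, dx):
    """First-order flux-split upwind derivative of the flux f(u) = a*u.

    The forward-moving part (a+ >= 0) uses a backward difference, and the
    backward-moving part (a- <= 0) uses a forward difference, so each part
    is differenced from its upwind side.  Interior points only; boundary
    entries are returned as zero.
    """
    ap, am = split_speed(a)
    n = len(u)
    dfdx = [0.0] * n
    for i in range(1, n - 1):
        dfdx[i] = ap * (u[i] - u[i - 1]) / dx + am * (u[i + 1] - u[i]) / dx
    return dfdx
```

For a positive wave speed only the backward difference contributes, which is what allows the streamwise direction to be handled by sequential forward and backward sweeps in an implicit solver.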
In order to better explain the parallelization issues, an overview of the F3D computation steps is given below. All operations proceed on a zone-by-zone basis, with inactive zone data stored either in memory or on a fast mass storage device. A single zone is constructed of a regular NJ × NK × NL block of cells aligned with the J, K, and L directions. The J direction is assumed to be streamwise, and is treated semi-implicitly, with two solver sweeps in the J direction.
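The sequential nature of the J-direction sweeps can be sketched with a pair of scalar recurrences (a hedged illustration, not the F3D solver itself): each J station depends on the result at the previous station, so the sweep cannot be distributed along J.

```python
def forward_sweep(rhs, c):
    """Forward J sweep: solve x[j] + c*x[j-1] = rhs[j] station by station.

    Each x[j] needs x[j-1], so the loop over J is inherently sequential.
    """
    x = [0.0] * len(rhs)
    x[0] = rhs[0]
    for j in range(1, len(rhs)):
        x[j] = rhs[j] - c * x[j - 1]
    return x

def backward_sweep(rhs, c):
    """Backward J sweep: solve x[j] + c*x[j+1] = rhs[j], running J downward."""
    n = len(rhs)
    x = [0.0] * n
    x[n - 1] = rhs[n - 1]
    for j in range(n - 2, -1, -1):
        x[j] = rhs[j] - c * x[j + 1]
    return x
```

Because the J loop must run in order, any parallelism in the solver has to come from the K and L directions instead.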
Of the two computation-intensive stages of the code, the RHS formation lends itself to parallelization most easily. This is a volume computation, in which each grid point is operated on independently, with only older values at neighboring points required to complete the computation. The entire set of zone cells can be distributed over the available processing elements (PEs) in an arbitrary manner. However, for the sake of the subsequent solver computations, it makes sense to decompose only the K and L grid dimensions, leaving the entire J dimension associated with a single PE. The K-L plane is mapped onto a rectangular grid of all PEs. To avoid repetition of interprocessor transfers, each rectangular portion of the K-L plane also contains 2 layers of "ghost" points which track the two closest sets of values in the subgrids belonging to neighboring PEs.

The parallelization of the solver sweeps is not as straightforward. The algorithm requires sequential processing in the J direction, and can be parallelized simultaneously in both the K and L directions only at a great added computation cost, e.g., via a cyclic reduction algorithm. An alternative method is to accept serial treatment of the J and L directions (J and K for the J- sweep), and devote all PEs to parallelizing the K dimension (L for the J- sweep).
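The block decomposition and two-layer ghost-point bookkeeping described above can be sketched in one dimension as follows (a minimal Python illustration; `block_range`, `exchange_ghosts`, and `GHOST` are illustrative names, not F3D routines):

```python
GHOST = 2  # two layers of ghost points on each side, as in the K-L decomposition

def block_range(n, nparts, part):
    """Contiguous block of indices [lo, hi) owned by one PE along a direction."""
    base, extra = divmod(n, nparts)
    lo = part * base + min(part, extra)
    hi = lo + base + (1 if part < extra else 0)
    return lo, hi

def exchange_ghosts(subgrids):
    """Fill the GHOST layers of each 1-D subgrid from its neighbors' edges.

    subgrids: list of lists, each padded with GHOST entries on both sides;
    the owned values sit in sub[GHOST:-GHOST].  A real code would perform
    interprocessor transfers here instead of list copies.
    """
    for p, sub in enumerate(subgrids):
        if p > 0:                      # copy left neighbor's last owned values
            left = subgrids[p - 1]
            sub[:GHOST] = left[-2 * GHOST:-GHOST]
        if p < len(subgrids) - 1:      # copy right neighbor's first owned values
            right = subgrids[p + 1]
            sub[-GHOST:] = right[GHOST:2 * GHOST]
```

Keeping two ghost layers up to date lets each PE complete the RHS stencil computation on its owned points without further communication during the stage.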
The initial attempt to port the F3D code to a scalable architecture involved the CRAY T3D and the CRAFT shared memory programming model. The advantages of code maintainability and ease of transition were offset by poor performance, and alternative approaches were explored. The more difficult task of rewriting the code in a message-passing framework was undertaken, and the PVM-based code provided initial speed-ups. The reshaping of the arrays during solver sweeps, however, was a difficult target for efficient implementation using two-sided PVM communication. A much better solution was found in the form of the one-sided CRAY SHMEM communication libraries [2]. In addition to eliminating concerns about deadlocking, the use of SHMEM reduces message latency and increases bandwidth. The problem of SHMEM non-portability was addressed by transitioning the SHMEM calls to equivalent Bulk Synchronous Parallel (BSP) library calls [3, www.bsp-worldwide.org]. The BSP library, developed at Oxford University, provides one-sided communication capabilities equivalent to SHMEM on a wide range of distributed memory machines, including the IBM SP2 and SGI PCA. Apart from the communication issues, scalar optimization of the code was attempted in order to extract a reasonable fraction of peak speed on cache-constrained architectures, but that aspect still leaves something to be desired. The current code speed-up on the CRAY T3D and T3E platforms reaches a factor of 3.7 over the C90 for the 49-PE T3E (Figure 3), using Mach 1.8 flow past an ogive cylinder on a 3-zone, 1-million-cell grid as a benchmark case.
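The array reshaping that proved troublesome with two-sided messaging is essentially a distributed transpose between a K-decomposed and an L-decomposed layout. The sketch below simulates the one-sided style in plain Python (illustrative only; `transpose_decomposition` and the even column split are assumptions, not the F3D scheme): each PE writes its contributions directly into the target layout, the way a `shmem_put` would, with no matching receive to pair up.

```python
def transpose_decomposition(plane_by_k, npes):
    """Reshape a K-decomposed 2-D plane into an L-decomposed one.

    plane_by_k[p] holds the rows (K indices) of the K-L plane owned by
    PE p.  Each PE 'puts' its slice of every L column directly into the
    target owner's buffer, mimicking one-sided communication semantics.
    """
    nk = sum(len(block) for block in plane_by_k)
    nl = len(plane_by_k[0][0])
    # Target layout: PE q owns a contiguous block of L columns.
    cols = [list(range(q * nl // npes, (q + 1) * nl // npes))
            for q in range(npes)]
    plane_by_l = [[[0.0] * nk for _ in cols[q]] for q in range(npes)]
    k0 = 0
    for p, block in enumerate(plane_by_k):        # "sending" PE
        for dk, row in enumerate(block):
            for q in range(npes):                 # one-sided put per target PE
                for dl, l in enumerate(cols[q]):
                    plane_by_l[q][dl][k0 + dk] = row[l]
        k0 += len(block)
    return plane_by_l
```

With one-sided writes there is no send/receive ordering to deadlock on; a single barrier after all puts complete suffices, which matches the latency and deadlock advantages noted above.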
The current implementation allows for only a single-cell overlap between separate zones. Many cases of interest to WMRD involve moving grids for aerodynamic computations of multiple bodies in relative motion [4]. A more general interpolation scheme is required for data exchange between zones, and such a scheme is incorporated in a Chimera version [5] of the vector implementation. Adding the Chimera methodology to the scalable version of F3D is the main goal of future work.
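The kind of inter-zone interpolation a Chimera-style overlap requires can be illustrated with a minimal 2-D bilinear lookup (a hedged sketch under simplifying assumptions: a uniform donor grid and a receiver point already expressed in the donor's index space; real Chimera schemes also handle donor-cell search and hole cutting):

```python
def bilinear_interpolate(field, x, y):
    """Interpolate a uniform-grid 2-D field at fractional index (x, y).

    field[i][j] holds values at integer grid indices; (x, y) is a receiver
    point in the donor zone's index space, with 0 <= x < len(field)-1 and
    0 <= y < len(field[0])-1.  Weights come from the fractional offsets
    within the enclosing donor cell.
    """
    i, j = int(x), int(y)
    fx, fy = x - i, y - j
    return ((1 - fx) * (1 - fy) * field[i][j]
            + fx * (1 - fy) * field[i + 1][j]
            + (1 - fx) * fy * field[i][j + 1]
            + fx * fy * field[i + 1][j + 1])
```

Unlike the single-cell overlap, such interpolation does not require receiver points to coincide with donor grid points, which is what makes it suitable for grids in relative motion.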
[1] J. Sahu and J. Steger, "Numerical Simulation of Three-Dimensional Transonic Flows," AIAA Paper 87-2293 (1987).