Hello, today we’re going to talk about the theory of numerical method parallelization for scalable high-performance computing simulation, using a simple computational fluid dynamics case. Did your eyes just glaze over reading that? Well, don’t worry! This is a practical discussion intended to help users find real-world ways to speed up their computations, not a deep dive into mathematics, and I’ll even bring along a few analogies to help it all make sense.

First, the background and some definitions.

“Numerical methods” is a broad term for the field of mathematics in which solutions to equations are found by developing algorithms and offloading the extensive calculations to computers. It turns out that certain mathematical equations are simply too complex for humans to solve by hand, so we instead develop a roadmap of repetitive calculations for a computer to follow, and the computer then executes them many billions of times until the solution is obtained. A particular subset of these routines is called “iterative”. For these methods, a computer is instructed to “iterate” (take incremental steps) as it “converges” toward (slowly approaches) the solution. Since that solution is not typically known ahead of time (or why would we be doing the analysis?), a layman’s description could be “guess and check” – we start with a guess and some mechanism to judge error, and then methodically reduce the error in subsequent guesses until we determine the guesses are close enough.
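The “guess and check” loop can be sketched in a few lines of Python. This toy example (the function name, tolerance, and starting guess are my own, purely for illustration) uses Newton’s method to converge on a square root:

```python
def iterate_sqrt(target, guess=1.0, tol=1e-10, max_iters=100):
    """Guess-and-check for sqrt(target): judge the error, refine, repeat."""
    for i in range(max_iters):
        error = guess * guess - target          # check: how wrong is the guess?
        if abs(error) < tol:                    # converged: close enough
            return guess, i
        guess = 0.5 * (guess + target / guess)  # refine the guess
    return guess, max_iters

root, iters = iterate_sqrt(2.0)
print(f"{root:.8f} after {iters} iterations")  # converges in a handful of steps
```

Notice that the loop never needs to know the true answer in advance – only a way to measure how wrong the current guess is, which is exactly the situation an iterative CFD solver is in.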

By and large, most simulation tools follow similar iterative approaches. The example we’re discussing today, Computational Fluid Dynamics, or CFD, is certainly one of the most widely known, and if you’re reading this blog, I assume you’re familiar with the field. Here, we solve a series of equations, known as the Navier-Stokes equations, through iterative methods. I won’t go into the mathematics here (I promised I wouldn’t), but the gist is this: **they describe the motion and physical properties of a fluid, given the boundary conditions and geometry of the physical problem to be solved.**

The reason I’m reminding you of those equations for this discussion is that they are considered “highly coupled”, which means they must be solved simultaneously. You can’t know the density somewhere without knowing the pressure and temperature. You can’t know the pressure and temperature without knowing the velocity and streamline history, and you can’t determine that velocity field information without knowing the pressure field information, which relies on momentum (a product of density and velocity) and viscous forces (also a function of velocity gradients). Furthermore, the solution in one geometric space changes the result in other geometric spaces nearby, so you can’t parcel out the geometry and solve each portion independently.

All of this is what causes the field of parallel distribution to be so interesting. Let’s introduce one of those analogies here:

Imagine an exam given to middle school students learning multiplication. The exam has 60 questions on it – each one a simple product of two numbers (3*2=6, 5*7=35, etc.). Say the teacher gives the test to a student and records their time to completion. Each student would differ, certainly, but let’s assume this exam takes an average of 4 minutes (or 4 seconds per problem) for the students in this class. This type of computational method would be called “serial”, meaning each equation is solved sequentially until the exam is completed. Now, let’s instead assume there are 30 students in the classroom and the teacher changes the assignment, giving each student 2 of the questions to solve. This would be called a “parallel” method of computation. The time savings would be hugely significant, with the entire class finishing in the time it takes the slowest student to complete their 2 questions (say 12 seconds). Since there were 30 times as many minds devoted to the problem, and the solution time was cut by a factor of 20 (4 minutes/12 seconds), we could say that the “parallel efficiency” was 20/30 or 66.7%. The reason it isn’t 100% is that the students did not all have identical times to completion, so the tasks were not perfectly balanced among them. In the field of numerical methods, this is called “load balancing”, and it is a huge field of study.
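That efficiency arithmetic is simple enough to capture in a short Python helper (the function name is mine, purely illustrative):

```python
def parallel_efficiency(serial_time, parallel_time, workers):
    """Speedup divided by worker count: 1.0 means perfect (linear) scaling."""
    speedup = serial_time / parallel_time
    return speedup / workers

# The classroom example: 240 s for one student vs. 12 s with 30 students
eff = parallel_efficiency(240.0, 12.0, 30)
print(f"{eff:.1%}")  # 66.7%
```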

Now let’s introduce a new concept. Let’s say the students are arranged in a circle, and instead of both questions given to the student being independent, let’s say the second question each student was given relied on the answer to the first question of the person sitting to their left. Something like:

Student #4 -

1) 5*3 = __

2) 4*x = __ where x is the solution to question 1 from Student #3

Therefore, at the completion of question #1, every student is required to share their answer with the person to their right, while receiving the answer from the person on their left. They can then carry out solving question #2. Remember when I mentioned how in fluid dynamics the solution to one location relies heavily on the solution to the other nearby locations?
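The ring of students can be mimicked in a toy Python sketch (entirely illustrative, not any real solver’s communication scheme): every “student” solves their independent question, the answers are shifted one seat around the ring, and only then can the dependent question be solved:

```python
def ring_exam(q1_factors, q2_factors):
    """Each student i solves question 1, then solves question 2 using the
    question-1 answer from the student on their left (index i-1, wrapping)."""
    n = len(q1_factors)
    # Step 1: everyone solves their independent question "in parallel"
    answers1 = [a * b for a, b in q1_factors]
    # Step 2: answers shift one seat around the ring (i receives from i-1)
    received = [answers1[(i - 1) % n] for i in range(n)]
    # Step 3: the dependent question uses the neighbor's answer
    answers2 = [q2_factors[i] * received[i] for i in range(n)]
    return answers1, answers2

# Two students: student 1's q2 = 4 * (2*7) = 56, student 0's q2 = 3 * (5*3) = 45
a1, a2 = ring_exam([(2, 7), (5, 3)], [3, 4])
print(a2)  # [45, 56]
```

The key structural point: step 3 cannot begin anywhere until step 2’s exchange has completed everywhere – the whole ring moves at the pace of its slowest communicator.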

This communication time would certainly impact the overall solution time, resulting in a further reduction in parallel efficiency. If the students were very quick to communicate, this might only reduce it to 50% (a total exam time of 16 seconds), but it’s far more likely the students’ communication time would drop that number to 20% (a total exam time of 40 seconds).

So now you can see how two major factors influence the efficiency of parallel scalability:

1) Load balancing to ensure each task is completed in as close to equal time as possible.

2) Communication speed, to ensure no process is sitting idle waiting for information to be conveyed.
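Putting the two factors together, a minimal model of the classroom (again, just an illustrative sketch) treats the parallel time as the slowest worker’s compute time plus the communication time, and reproduces the numbers from the example above:

```python
def exam_time_model(serial_time, workers, slowest_work, comm_time):
    """Total parallel time = slowest worker's compute time + communication,
    plus the resulting parallel efficiency."""
    parallel_time = slowest_work + comm_time
    efficiency = (serial_time / parallel_time) / workers
    return parallel_time, efficiency

# Classroom numbers from the text: 240 s serial, 30 students, 12 s slowest
print(exam_time_model(240, 30, 12, 4))   # (16, 0.5) -> quick communicators
print(exam_time_model(240, 30, 12, 28))  # (40, 0.2) -> slow communicators
```

Factor 1 (load balancing) shows up in `slowest_work`, and factor 2 (communication) in `comm_time`; efficiency suffers whenever either term grows.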

These exact same concepts apply directly to computational parallelization. The students in this analogy are computer cores, and the communication is done through memory channels. And the types of problems being solved can motivate very different architectures. Graphics processing, for instance, which typically does not require much coupling between core processes (i.e., the rendering of one section of the screen doesn’t typically require another part of the screen to be rendered first), can be done on hugely parallel arrays, which is why modern graphics cards have several thousand compute cores. Think of this as a class of 1000 kids each independently doing 1 question on a 1000-question exam. Other types of computations, though, such as those found in CFD, require a significant amount of communication and can therefore be quite slow in such highly parallel environments, as each core is bogged down waiting for others to convey the information needed to proceed.

So now, let’s dive into our CFD example.

We’ll start with something that could not be simpler – the straight duct – and we’ll assign it simple dimensions to make meshing very simple (Figure 1).

*Figure 1: Geometry used for analysis. Flow direction from bottom-left to top-right.*

And to illustrate a point, I’m going to take things to extremes – I’ll use a structured grid ranging from extremely coarse to extremely fine. Somewhere in the middle will be a mesh-independent solution where the results can begin to be trusted for accuracy, but our goal here isn’t to find that, our goal is solely to assess the speed of the computation under a range of parallel conditions.

However, another interesting component is worth discussing here. To converge a solution in a reasonable time, you need to consider 1) how many iterations are required for convergence and 2) how fast each iteration computes. This post attempts to address point #2: how computer architecture and core count result in different **iteration speeds**. But the size and structure of the grid, and therefore how well it captures the necessary gradients of the flowfield, can influence #1: **how many iterations are required for convergence**. Assessing the total time required would mean fully converging the solution under each scenario and comparing the results – a much more rigorous exercise, and of course completely unique to each problem type solved. The important (and potentially unintuitive) point here is that there may be times when a finer mesh converges in less time than a coarser mesh simply due to being more numerically stable and requiring fewer iterations.
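In other words, total wall-clock time is the product of the iteration count and the cost per iteration. A sketch with hypothetical numbers (made up for illustration, not from any real run) shows how a finer mesh can still come out ahead:

```python
def total_solve_time(iterations_needed, seconds_per_iteration):
    """Wall-clock time to converge = iteration count x cost per iteration."""
    return iterations_needed * seconds_per_iteration

# Hypothetical illustration: a finer mesh costs more per iteration but,
# being more numerically stable, needs fewer iterations to converge.
coarse = total_solve_time(iterations_needed=2000, seconds_per_iteration=0.5)
fine   = total_solve_time(iterations_needed=600,  seconds_per_iteration=1.2)
print(coarse, fine)  # coarse: 1000.0 s, fine: 720.0 s
```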

For the mesh of this simple duct, we used a structured H-grid (Figure 2). Not the most efficient or effective for capturing near-wall gradients, but again, we aren’t looking for accuracy here, we’re looking for computational efficiency and stability, and perfectly orthogonal elements are ideal for load balancing.

*Figure 2: Mesh structure and 3 sizes used for analysis*

A little aside on “partitioning”. A “partition” is a segment of the problem that is divided up and given to each process to solve (think “questions” in our student exam analogy). People may think of these as abstract computational concepts buried within the operating system’s code, but in the field of CFD, they are easily viewable physical regions of the geometry. I mentioned that there’s a huge field of study related to this, but luckily for us, Fluent engineers largely solved this problem decades ago, and a modern CFD user hardly needs to be aware of it. If you were curious, however, there’s a field variable in Fluent called “Active Cell Partition” that returns a numerical value for each cell, indicating which processor core it is assigned to. Pretty handy, and it allows me to show the partitions within this duct for core counts from 1 to 256 (Figure 3).
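In its simplest 1D form, partitioning is just an even split of cell indices across cores – a toy version of the kind of assignment a field like “Active Cell Partition” reports (this sketch is mine, not Fluent’s actual algorithm, which must also account for geometry and minimizing inter-partition boundaries):

```python
def partition_cells(n_cells, n_partitions):
    """Assign each cell index a partition id, splitting as evenly as possible
    (any remainder cells go to the first partitions)."""
    base, extra = divmod(n_cells, n_partitions)
    assignment = []
    for p in range(n_partitions):
        count = base + (1 if p < extra else 0)
        assignment.extend([p] * count)
    return assignment

print(partition_cells(10, 4))  # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
```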

*Figure 3: Visualization of mesh partitions in Fluent*

The physics being solved in this duct will be 1 m/s velocity fed in one end, and backpressure of 0 Pa gauge applied to the other end. The walls will be no-slip walls, thereby creating a pressure gradient as the viscous boundary layers develop. The fluid will be incompressible water and we’ll use the SST turbulence model. Very simple.

For comparison, for each core count tested, we will analyze the time required to solve 100 iterations. The computational hardware is a cluster of identical compute nodes, each with two Xeon Gold 6242 processors and connected to the others via an InfiniBand fabric. Therefore, any computation with 16 or fewer cores will typically be solved on a single CPU, with information passed within the memory lanes of the chip. Any computation with 32 cores will be solved on a single compute node, requiring information to be passed across the board between the 2 chips, and any solution over 32 cores is solved over multiple nodes, with information passed through InfiniBand.

Now a disclaimer: This is most certainly a case where we must be careful not to over-scrutinize. There are thousands of variables at play here, and the constants in all of it that allow us to even begin to analyze trends are:

- Simple flow in a perfectly constant cross-section duct
- Incompressible flow with simple gradients
- Perfectly orthogonal mesh elements
- Identical computer hardware between tests (core frequency, memory lanes, client submission machines, etc.)
- Repeatable partitioning/load-balance methodology
- Consistent pressure-based method
- Consistent turbulence model
- A defined number of iterations

However, like many things in simulation – trends are king, and the lessons we can learn from this data are likely to inform us going forward.

So now for the results data and interpretation (Figure 4). First, looking directly at the time required to compute the 100 iterations on a log-log plot, we see the obvious – as mesh count increases, computational time increases, while increasing computational cores reduces time. You can also search horizontally along the chart to deduce that after multiplying the node count by 8, you need around 16-32x as many cores to solve it in the same amount of time – another way to think about scalability. The second plot, converting the data to a linear scalability metric, illustrates the trends well, and now we can more easily see the effects of diminishing (and eventually negative) returns. On this simple log plot, each curve follows an S-shape. Initially, the scalability is quite good: going from 1 core to 2 can have linear or even super-linear time savings, with more benefit seen at higher node counts. Then that return tapers off and eventually flatlines, or, at very low node counts, begins to slightly increase the time required to solve. This is the equivalent of students spending a proportionately high amount of their time communicating their answers and relatively little time solving the math problems. Conversely, the more work each student has (i.e., more mesh nodes to compute the solution on), the longer that scalability holds on before it, too, begins to fall.

Another way to process the data for interpretation is through this fairly made-up term of “Core Productivity” (Figure 5), which is the output of the simulation (namely the total iterations multiplied by the total mesh nodes) divided by the input (the number of CPU core-seconds). If we assume the mathematical output rate of a CPU core is relatively constant between runs, we can infer this to be a rough metric of how much of that output is directly being used to solve the problem. A bit of hand-waving, but the results are interesting. It tells us that at low core counts, the CPU is more productive when the node count is kept low, while that trend flips at higher core counts. It also implies an optimum location along each curve, which can be seen in the higher node count cases, while the two lower node count cases are likely too small to show one.
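Written out, the metric is simply output over input. A small helper makes the definition concrete (the numbers below are made up for illustration, not the data behind Figure 5):

```python
def core_productivity(iterations, mesh_nodes, cores, wall_seconds):
    """'Core Productivity': simulation output (iterations x mesh nodes)
    per unit of input (CPU core-seconds)."""
    return (iterations * mesh_nodes) / (cores * wall_seconds)

# Hypothetical run: 100 iterations on a 1,000,000-node mesh,
# using 64 cores for 250 s of wall-clock time
print(f"{core_productivity(100, 1_000_000, 64, 250.0):,.0f}")  # 6,250
```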

*Figure 4: Time in seconds required to compute 100 iterations with various mesh node and CPU core counts. (Water in duct, SST)*

*Figure 5: Core Productivity metric for various node counts (Water in duct, SST)*

From this very simple study, we can start to take away a few lessons through trend analysis, but any discussion of specific inflection point locations or rules of thumb can go right out the window as soon as any of that huge list of variables change – computer to computer, problem to problem, mesh to mesh. Some of the lessons that can be seen:

- Increasing core counts can have a positive effect on solving time, but it can’t be expected to maintain linear scaling forever. It will have diminishing returns. Sometimes the returns can be negative.
- The higher the mesh node count, the better the scalability of higher core counts will be – up to a point.
- For HPC environments with sufficient licenses to support it, the highest productivity will therefore often come from splitting the cluster into separate jobs run in parallel rather than running each job sequentially with the maximum number of nodes.

Hopefully, this was a helpful discussion and lent some insight that can inform our intuition of how parallel scalability works in practice. It turns out there’s much more going on than just throwing computational power at the problem. It’s… complicated.