Vectorization

Simcenter STAR-CCM+ uses AVX-2 instructions on all Xeon processors that support the AVX-2 instruction set—currently, the Haswell, Broadwell, and Skylake processors.

Machines with older processors continue to use the SSE2 instruction set as before. This change enhances the performance of Simcenter STAR-CCM+ on the newer processors for simulations that make use of coupled solvers, though it does not have as much impact on simulations that make use of segregated solvers. You do not need to do anything special to use this feature—Simcenter STAR-CCM+ detects the hardware automatically and runs the appropriate level of vectorization.

What Is Vectorization?

Vectorization refers to performing several arithmetic operations with a single instruction. To illustrate this, consider two vectors of the same arbitrary length whose element-wise sum you want to store in a third vector. Scalar arithmetic (the opposite of vectorization) adds the first element of the first vector to the first element of the second vector and stores the result in the first element of the third vector, then repeats the process for the second element, third element, and so on. Vectorization, however, makes it possible for the CPU to add, for example, the first and second elements of the first vector to the first and second elements of the second vector simultaneously and store the respective results in the first and second elements of the third vector, decreasing the time required for this particular calculation by a factor of up to 2.
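The difference between the two approaches can be sketched in a few lines of Python. This is a conceptual model only: plain Python does not issue vector instructions, and the function names and the `width` parameter are illustrative assumptions. The sketch counts how many modelled "instructions" each approach needs for the same sum.

```python
def add_scalar(a, b):
    """Add element by element: one modelled instruction per element."""
    result = []
    steps = 0
    for x, y in zip(a, b):
        result.append(x + y)
        steps += 1
    return result, steps


def add_vectorized(a, b, width=2):
    """Add `width` elements per modelled instruction (hypothetical lane count)."""
    result = []
    steps = 0
    for i in range(0, len(a), width):
        # One modelled vector instruction handles a whole chunk of elements.
        result.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        steps += 1
    return result, steps


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
scalar_sum, scalar_steps = add_scalar(a, b)
vector_sum, vector_steps = add_vectorized(a, b, width=2)
print(scalar_sum == vector_sum, scalar_steps, vector_steps)  # prints: True 4 2
```

Both approaches produce the same sums here; only the number of steps changes, which is where the speed-up of a factor of 2 in the example above comes from.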

What Are Different Levels of Vectorization, and How Do I Know What I Have?

Vectorization requires hardware support. If your hardware only has the capability to do one calculation at a time, then no amount of software manipulation can obtain any performance benefit from vectorization.

The different levels of available vectorization are as follows:

SSE2 vectorization has been available since 2000, when Intel released its Pentium 4 processors. This level of vectorization allows simultaneous arithmetic operations on 2 double-precision floating point numbers or 4 single-precision floating point numbers.
AVX vectorization was first released by Intel with its Sandy Bridge architecture in 2011 (which means that Sandy Bridge and Ivy Bridge computers are capable of both SSE2 and AVX). This level of vectorization allows simultaneous arithmetic operations on 4 double-precision floating point numbers or 8 single-precision floating point numbers—double what was available with SSE2.
AVX-2 vectorization was first released by Intel with its Haswell architecture in 2013 (which means that Haswell and Broadwell are capable of all of AVX-2, AVX, and SSE2).

AVX-2 still operates on only 4 double-precision floating point numbers or 8 single-precision floating point numbers at a time (the same as AVX), but it provides extra operations. For example, one new operation that is available on processors with AVX-2 is fused multiply-add (FMA for short). FMA is an instruction that allows the computer to calculate $a \cdot b + c$ in one operation instead of two (that is, first calculating $a \cdot b$ and then adding $c$ to the result).
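Besides saving an instruction, a fused multiply-add rounds the result only once, whereas a separate multiply and add round twice. The following Python sketch models that difference with exact rational arithmetic. It does not call a hardware FMA instruction, and the function names and input values are assumptions chosen so that the effect is visible.

```python
from fractions import Fraction


def multiply_then_add(a, b, c):
    """Two operations: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Model of FMA: evaluate a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so the exact product needs more bits than a double can hold.
a = b = 1.0 + 2.0 ** -27
c = -(1.0 + 2.0 ** -26)

two_step = multiply_then_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print(two_step == 0.0, fused == 2.0 ** -54)  # prints: True True
```

In this contrived case the two-step calculation loses the smallest term of the product when it rounds the intermediate result, while the single rounding of the fused model keeps it.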

How Vectorization Affects Results

Although vectorization does not cause any accuracy issues, it can cause a precision issue. That is, the final results are just as accurate as before, but instead of, for example, a temperature of 300.0001 K, the solver might return 299.9999 K when running with one level of vectorization versus another.

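To understand better how vectorization affects results, consider a primary issue with floating-point arithmetic: round-off error. Round-off error can produce a counterintuitive situation in which $(a + b) + c \neq a + (b + c)$. Different levels of vectorization group and order the same arithmetic operations differently, so the round-off error that accumulates can differ from one level to another. For details on round-off error, see the "Accuracy problems" section of the Wikipedia article "Floating-point arithmetic" [12].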
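A short Python sketch makes both effects concrete: the order-dependence of floating-point addition, and how summing the same values in a different number of lanes can change the total. The function `sum_in_lanes` and the input values are illustrative assumptions, and the values are deliberately extreme; in a real simulation the differences are far smaller.

```python
# Floating-point addition is not associative.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # prints: False


def sum_in_lanes(values, width):
    """Accumulate `width` interleaved partial sums, then combine them."""
    partial = [0.0] * width
    for i, v in enumerate(values):
        partial[i % width] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Contrived values: 1.0 is below the rounding granularity of 1e16.
values = [1e16, 1.0, -1e16, 1.0]
print(sum_in_lanes(values, 1), sum_in_lanes(values, 2))  # prints: 1.0 2.0
```

With one lane, the first 1.0 is absorbed by 1e16 and lost. With two lanes, the large values cancel in one lane while the small values add up in the other, so the total is different even though the inputs are identical.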

Avoid Using Mixed-Vector Hardware in a Single Parallel Job

You are strongly advised to avoid using a heterogeneous compute cluster with respect to vectorization. That is, do not use a set of resources in which some nodes have AVX-2 capabilities while others do not. Variation in vectorization levels can lead to different round-off errors (as stated above), so for the same simulation, results computed on nodes with AVX-2 capabilities can be slightly different from those computed on nodes without them.
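One way to check a set of nodes before you submit a job is to compare the instruction-set flags that each node reports. The sketch below assumes x86 Linux nodes, where `/proc/cpuinfo` lists flags such as `sse2`, `avx`, and `avx2`. The function names, node names, and abbreviated sample text are hypothetical, and this is not a description of how Simcenter STAR-CCM+ itself detects hardware.

```python
def vectorization_level(cpuinfo_text):
    """Return 'AVX-2', 'AVX', 'SSE2', or 'none' from Linux /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
    # Check the highest level first.
    for flag, level in (("avx2", "AVX-2"), ("avx", "AVX"), ("sse2", "SSE2")):
        if flag in flags:
            return level
    return "none"


def is_homogeneous(cpuinfo_by_node):
    """Report whether all nodes share one level, plus the level of each node."""
    levels = {node: vectorization_level(text)
              for node, text in cpuinfo_by_node.items()}
    return len(set(levels.values())) <= 1, levels


# Hypothetical, heavily abbreviated cpuinfo excerpts for two nodes.
sample = {
    "node01": "flags\t\t: fpu sse sse2 avx avx2 fma",
    "node02": "flags\t\t: fpu sse sse2 avx",
}
ok, levels = is_homogeneous(sample)
print(ok, levels)  # prints: False {'node01': 'AVX-2', 'node02': 'AVX'}
```

To classify the local machine, you could pass the contents of `/proc/cpuinfo` to `vectorization_level`. How you collect that text from each node depends on your cluster setup.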

Bibliography

[12] Wikipedia. [2018.] "Floating-point arithmetic", https://en.wikipedia.org/wiki/Floating-point_arithmetic.