By Jim Dempsey
This blog post is a summary of a technical article that you can find in the Intel Software Network Knowledge Base.
In the newest series of Intel processors, the programmer is presented with AVX (Advanced Vector eXtensions). AVX provides 32-byte wide small vector instructions; double that of SSE’s 16-byte wide small vector instructions. AVX now provides for manipulating 4-wide doubles or 8-wide floats. Integers are supported; however, the article addresses double precision floating point. The problem vexing programmers is how to program to take advantage of the ever-widening small vectors. The Intel® Xeon™ Phi (Knights Corner) supports 64-byte SIMD instructions (8-wide doubles).
With the introduction of AVX in the Sandy Bridge architecture, and now Ivy Bridge, programmers are finding their programs are getting left behind with increasing vector widths. While a programmer can use the intrinsic templates in C++, they will find that doing so is a very intrusive way of programming with minimal return on their investment of time. The technical article addresses how users can make a small change to the data layout and code for this change.
Most programmers are taught that there are two principal ways to program:
Array of structures (AOS)
Structure of arrays (SOA)
The technical article describes a third way, which is a hybrid of AOS and SOA. The author uses the term:
Packed array of structures (PAOS)
But the programming technique can alternately be called
Partitioned structure of arrays (PSOA)
Both descriptive terms describe the same thing and are coded the same way.
Most programs are written using AOS. The current generation Intel C++ and FORTRAN compilers do an excellent job at vectorization of AOS code using SSE and AVX instructions. However, compiler vectorization opportunities are greatly dependent on data layout. Large arrays often have good opportunities for code vectorization. But not all AOS applications can organize the data in this manner. Often data are organized in structures where the structures contain: scalar values (single variables), small 3-wide XYZ vectors, small 3×3 translation matrices, and for some classes of these structures, large arrays, such as vertices, to draw an object.
Scalar variables typically cannot be vectorized. The 3-wide XYZ small vectors are either half vectorized when performing addition or subtraction with another such XYZ small vector (one 2-wide SSE instruction followed by one scalar instruction), and use 2-wide half-vectorization when multiplying by a scalar.
When performing an operation, such as obtaining the magnitude of an XYZ vector, the generated code is also mostly scalar. The squaring of the three components can be performed 2-wide half vectored, but the summation and square root are performed scalar. A similar half scalar situation arises when applying the 3×3 matrix to 3-wide vectors for rotations. Large arrays in many applications are often arrays of 3-wide vectors (e.g., a list a apexes). For these types of large arrays, many of the operations may degrade to half 2-wide plus one scalar operations.
This puts the programmer in a challenging position: How to make use of a 4-wide SIMD processor when most of the application is scalar and part of the application is 2-wide vectored. The technique discussed in the article is what the author calls packed array of structures – PAOS. In the PAOS method the programmer matches up like objects in groupings that match the SIMD vector width.
// AOS
struct Object_t
{
double Position[3];
double Velocity[3];
double Acceleration[3];
double AngularVelocity[3];
double AngularAcceleration[3];
double Mass;
…
double BodyLines[nPoints][3];
…
};
// PAOS for 4-wide SIMD
struct ObjectAVX_t
{
double Position[3][4];
double Velocity[3][4];
double Acceleration[3][4];
double AngularVelocity[3][4];
double AngularAcceleration[3][4];
double Mass[4];
…
double BodyLines[nPoints][3][4];
…
};
The PAOS method packs 4 objects across the width of the SIMD register. All operations are performed on 4 objects at the same time. Furthermore, as the processor runs through your code, the same data for these 4 objects will reside in the same cache line. This increases not only instruction efficiency but also L1 cache efficiency. PAOS will require some code changes, but most of these changes can be eliminated by using operator overload functions. For the C++ programmer, Intel provides the header dvec.h.
The SIMD width [4] in the above struct would generally not be used. Instead, the programmer would use a type declaration for the 4-wide vector and a struct of 3-point vector of these types. Furthermore, the 4-wide vector would have operator overloads for +-*/ etc.
The Intel compiler contains the appropriate header (dvec.h) with the 4-wide vectors and operator overload functions.
#include < dvec.h >
struct xyzF64vec4
{
F64vec4 xyz[3];
};
// PAOS for 4-wide SIMD
struct ObjectAVX_t
{
xyzF64vec4 Position;
xyzF64vec4 Velocity;
xyzF64vec4 Acceleration;
xyzF64vec4 AngularVelocity;
xyzF64vec4 AngularAcceleration;
F64vec4 Mass;
…
xyzF64vec4 BodyLines[nPoints];
…
};
Note, existing programs are generally written in type opaque manner, meaning “double” or “float” is not directly used. Instead, a pseudo name is used either as #define or typedef. By changing the declaration of the pseudo name in one place, much of the conversion effort is complete.
Instruction efficiencies gained using PAOS
The technical article contains an example of a subroutine that computes a time determinant vector at the estimated midpoint between the current integration time and the time at the next integration interval. The application was written in FORTRAN, but for the benefit of you C++ programmers, an analog of this section of the subroutine was written in C++ below.
AOS code (before modification)
struct xyz_t
{
double xyz[3];
};
struct FiniteSoluton_t
{
double rDELTIS;
int nBeads;
xyz_t* rBeadIPIF;
xyz_t* rBeadDIPIF;
xyz_t* rBeadDDIPIF;
FiniteSoluton_t()
{
nBeads = 0;
rBeadIPIF = rBeadDIPIF = rBeadDDIPIF = NULL;
}
void Allocate(int n)
{
if(rBeadIPIF) delete [] rBeadIPIF;
if(rBeadDIPIF) delete [] rBeadDIPIF;
if(rBeadDDIPIF) delete [] rBeadDDIPIF;
nBeads = n;
rBeadIPIF = new xyz_t[n];
rBeadDIPIF = new xyz_t[n];
rBeadDDIPIF = new xyz_t[n];
}
};
FiniteSoluton_t fs; // or an array of FiniteSolution_t’s
void TENSEG(FiniteSoluton_t* pFiniteSolution)
{
int JSEG;
// pull in frequently used items from fs struct
int JSEGlast = pFiniteSolution->nBeads+1;
double rDELTIS = pFiniteSolution->rDELTIS;
xyz_t* rBeadIPIF = pFiniteSolution->rBeadIPIF;
xyz_t* rBeadDIPIF = pFiniteSolution->rBeadDIPIF;
xyz_t* rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF;
xyz_t TOSVX1, TOSVX2, SUI;
for(JSEG=0; JSEG < JSEGlast; ++JSEG)
{
for(int i=0; i<3; ++i)
{
// Determine TUI at 1/2 integration step from now
// using current velocity and acceleration
// Position of X-End of segment
TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
+ (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
+ (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
// Position of Y-End of segment
TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
+ (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
+ (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
// SUI X->Y
SUI.xyz[i] = TOSVX2.xyz[i] – TOSVX1.xyz[i];
} // for(int i=0; i<3; ++i)
} // for(JSEG=0; JSEG < JSEGlast; ++JSEG)
}
PAOS code
#include < dvec.h >
struct xyzF64vec4
{
F64vec4 xyz[3];
};
struct FiniteSolutonF64vec4
{
F64vec4 rDELTIS;
int nBeads;
xyzF64vec4* rBeadIPIF;
xyzF64vec4* rBeadDIPIF;
xyzF64vec4* rBeadDDIPIF;
FiniteSolutonF64vec4()
{
nBeads = 0;
rBeadIPIF = rBeadDIPIF = rBeadDDIPIF = NULL;
}
void Allocate(int n)
{
if(rBeadIPIF) delete [] rBeadIPIF;
if(rBeadDIPIF) delete [] rBeadDIPIF;
if(rBeadDDIPIF) delete [] rBeadDDIPIF;
nBeads = n;
rBeadIPIF = new xyz_t[n];
rBeadDIPIF = new xyz_t[n];
rBeadDDIPIF = new xyz_t[n];
}
};
FiniteSolutonF64vec4 fsF64vec4;
void TENSEG(FiniteSolutonF64vec4* pFiniteSolution)
{
int JSEG;
// pull in frequrently used items from fs struct
int JSEGlast = pFiniteSolution->nBeads+1;
F64vec4 rDELTIS = pFiniteSolution->rDELTIS;
xyzF64vec4* rBeadIPIF = pFiniteSolution->rBeadIPIF;
xyzF64vec4* rBeadDIPIF = pFiniteSolution->rBeadDIPIF;
xyzF64vec4* rBeadDDIPIF = pFiniteSolution->rBeadDDIPIF;
xyzF64vec4 TOSVX1, TOSVX2, SUI;
for(JSEG=0; JSEG < JSEGlast; ++JSEG)
{
for(int i=0; i<3; ++i)
{
// Determine TUI at 1/2 integration step from now
// using current velocity and acceleration
// Position of X-End of segment
TOSVX1.xyz[i] = rBeadIPIF[JSEG].xyz[i]
+ (rBeadDIPIF[JSEG].xyz[i]*(rDELTIS/2.0))
+ (rBeadDDIPIF[JSEG].xyz[i]*(rDELTIS*rDELTIS/4.0));
// Position of Y-End of segment
TOSVX2.xyz[i] = rBeadIPIF[JSEG+1].xyz[i]
+ (rBeadDIPIF[JSEG+1].xyz[i]*(rDELTIS/2.0))
+ (rBeadDDIPIF[JSEG+1].xyz[i]*(rDELTIS*rDELTIS/4.0));
// SUI X->Y
SUI.xyz[i] = TOSVX2.xyz[i] – TOSVX1.xyz[i];
} // for(int i=0; i<3; ++i)
} // for(JSEG=0; JSEG < JSEGlast; ++JSEG)
}
Other than for the type declarations, the code section remains the same.
Some sections of code will have to be adapted for PAOS
Some additional coding consideration will have to be made when programming in PAOS. One such example is defensive code for divide by zero protection. One example where you may have had:
xyz_t& unitVector(xyz_t& v)
{
xyz_t ret;
double Length = sqrt(
v.xyz[0]*v.xyz[0]
+ v.xyz[1]*v.xyz[1]
+ v.xyz[2]*v.xyz[2]);
// defensive code to avoid divide by 0
// When (if) Length = 0.0 then substitute 1.0
// X = 0.0 / 1.0 = uX 0.0 (not divide by 0)
// Y = 0.0 / 1.0 = uY 0.0 (not divide by 0)
// Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0)
if(Length == 0.0) Length = 1.0;
ret.xyz[0] = v.xyz[0] / Length;
ret.xyz[1] = v.xyz[1] / Length;
ret.xyz[2] = v.xyz[2] / Length;
return ret;
}
Becomes:
xyzF64vec4& unitVector(xyzF64vec4& v)
{
xyzF64vec4 ret;
static const F64vec4 Zero(0.0);
static const F64vec4 One(1.0);
F64vec4 Length = sqrt(
v.xyz[0]*v.xyz[0]
+ v.xyz[1]*v.xyz[1]
+ v.xyz[2]*v.xyz[2]);
// defensive code to avoid divide by 0
// When (if) Length = 0.0 then substitute 1.0
// X = 0.0 / 1.0 = uX 0.0 (not divide by 0)
// Y = 0.0 / 1.0 = uY 0.0 (not divide by 0)
// Z = 0.0 / 1.0 = uZ 0.0 (not divide by 0)
Length = select_eq(Length, Zero, Length, One);
ret.xyz[0] = v.xyz[0] / Length;
ret.xyz[1] = v.xyz[1] / Length;
ret.xyz[2] = v.xyz[2] / Length;
return ret;
}
Other than for the type declaration, only one statement is changed.
The next blog post will go down to the assembler level to show why PAOS is so effective.
Copyright © 2012 Jim Dempsey. All rights reserved.
Mr. Dempsey is available for program- or code-related consultations.
jim.00.dempsey@gmail.com
http://www.linkedin.com/in/jim00dempsey