SimpleMath - a simplified wrapper for DirectXMath

Originally posted to Shawn Hargreaves Blog on MSDN, Tuesday, January 8, 2013

SimpleMath, created by my colleague Chuck Walbourn, is a header file that wraps the DirectXMath SIMD vector/matrix math API with an easier-to-use C++ interface.  It provides the following types, with similar names, methods, and operator overloads to the XNA Game Studio math API:

Download SimpleMath here.

 

Why wrap DirectXMath?

DirectXMath provides highly optimized vector and matrix math functions, which take advantage of SSE SIMD intrinsics when compiled for x86/x64, or the ARM NEON instruction set when compiled for an ARM platform such as Windows RT or Windows Phone.  The downside of being designed for efficient SIMD usage is that DirectXMath can be somewhat complicated to work with.  Developers must be aware of correct type usage (understanding the difference between SIMD register types such as XMVECTOR vs. memory storage types such as XMFLOAT4), must take care to maintain correct alignment for SIMD heap allocations, and must carefully structure their code to avoid accessing individual components from a SIMD register.  This complexity is necessary for optimal SIMD performance, but sometimes you just want to get stuff working without so much hassle!

Enter SimpleMath...

These types derive from the equivalent DirectXMath memory storage types (for instance Vector3 is derived from XMFLOAT3), so they can be stored in arbitrary locations without worrying about SIMD alignment, and individual components can be accessed without bothering to call SIMD accessor functions. But unlike XMFLOAT3, the Vector3 type defines a rich set of methods and overloaded operators, so it can be directly manipulated without having to first load its value into an XMVECTOR.  Vector3 also defines an operator for automatic conversion to XMVECTOR, so it can be passed directly to methods that were written to use the lower level DirectXMath types.
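
To make the pattern concrete, here is a minimal sketch of the same idea. It is not the SimpleMath source: StorageFloat3, Register4, SketchVector3 and SumXYZ are hypothetical stand-ins (for XMFLOAT3, XMVECTOR, Vector3 and a function written against the lower level API), chosen so the example compiles without the DirectX headers.

```cpp
// Hypothetical stand-in for a memory storage type such as XMFLOAT3:
// three packed floats with ordinary float alignment.
struct StorageFloat3
{
    float x, y, z;
};

// Hypothetical stand-in for a SIMD register type such as XMVECTOR:
// four lanes with 16 byte alignment.
struct alignas(16) Register4
{
    float lane[4];
};

// A function written against the lower level register type.
float SumXYZ(const Register4& v)
{
    return v.lane[0] + v.lane[1] + v.lane[2];
}

// Wrapper in the style described above: derives from the storage type,
// adds operators, and converts itself to the register type on demand.
struct SketchVector3 : public StorageFloat3
{
    SketchVector3(float ix, float iy, float iz) : StorageFloat3{ ix, iy, iz } {}

    // Automatic conversion: "load" the three components into a register.
    operator Register4() const
    {
        return Register4{ { x, y, z, 0.0f } };
    }

    // Operator overload that hides the load / compute / store round trip.
    SketchVector3& operator+=(const SketchVector3& rhs)
    {
        Register4 a = *this;
        Register4 b = rhs;
        x = a.lane[0] + b.lane[0];
        y = a.lane[1] + b.lane[1];
        z = a.lane[2] + b.lane[2];
        return *this;
    }
};

// The wrapper adds no data and no alignment requirement of its own.
static_assert(sizeof(SketchVector3) == sizeof(StorageFloat3), "same size as the storage type");
static_assert(alignof(SketchVector3) == alignof(float), "no SIMD alignment needed");
```

With this in place, a SketchVector3 can live inside any struct or array, its x, y and z members are directly readable, and a call such as SumXYZ(position) compiles even though SumXYZ only knows about Register4.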

If that sounds horribly confusing, the short version is that the SimpleMath types pretty much Just Work™ the way you would expect them to.

By now you must be wondering, where is the catch?  And of course there is one.  SimpleMath hides the complexities of SIMD programming by automatically converting back and forth between memory and SIMD register types, which tends to generate additional load and store instructions.  This can add significant overhead compared to the lower level DirectXMath approach, where SIMD loads and stores are under explicit control of the programmer.

 

Who is SimpleMath for?

You should use SimpleMath if you are:

You should go straight to the underlying DirectXMath API if you:

This need not be a global either/or decision.  The SimpleMath types know how to convert themselves to and from the corresponding DirectXMath types, so it is easy to mix and match.  You can use SimpleMath for the parts of your program where readability and development time matter most, then drop down to DirectXMath for performance hotspots where runtime efficiency is more important.

 

Example

Here is a simple object movement calculation, implemented using DirectXMath.  Note the skullduggery to make sure the PlayerCat instance will always be 16 byte aligned (and I didn't even include the implementation of the AlignedNew helper here!)

    #include <DirectXMath.h>

    using namespace DirectX;


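    // Over-aligned so the XMFLOAT3A members can use aligned SIMD loads and stores.
    // AlignedNew (implementation not shown) keeps heap allocations 16 byte aligned too.
    __declspec(align(16)) class PlayerCat : public AlignedNew<PlayerCat>
    {
    public:
        void Update()
        {
            const float cFriction = 0.99f;

            // Load both values from memory into SIMD registers, once each.
            XMVECTOR pos = XMLoadFloat3A(&mPosition);
            XMVECTOR vel = XMLoadFloat3A(&mVelocity);

            // Do the math in registers, then store the results back to memory.
            XMStoreFloat3A(&mPosition, pos + vel);
            XMStoreFloat3A(&mVelocity, vel * cFriction);
        }

    private:
        XMFLOAT3A mPosition;
        XMFLOAT3A mVelocity;
    };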
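
The AlignedNew helper is not shown in this post.  As a rough idea of what a helper like that could look like, here is a hypothetical sketch.  AlignedNewSketch and AlignedThing are made-up names and this is not the original implementation: it simply over-allocates with malloc and rounds the pointer up to the required alignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical CRTP helper: classes deriving from it get heap allocations
// aligned to alignof(TDerived). Not the AlignedNew used in the example above.
template<typename TDerived>
struct AlignedNewSketch
{
    static void* operator new(std::size_t size)
    {
        static_assert(alignof(TDerived) >= alignof(void*),
                      "sketch assumes at least pointer alignment");
        const std::size_t alignment = alignof(TDerived);

        // Over-allocate: room to round up, plus a slot to remember the raw pointer.
        void* raw = std::malloc(size + alignment + sizeof(void*));
        if (!raw)
            throw std::bad_alloc();

        // Skip past the slot, then round up to the next multiple of alignment.
        std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
        std::uintptr_t aligned =
            (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);

        // Stash the raw pointer just below the aligned block so delete can find it.
        reinterpret_cast<void**>(aligned)[-1] = raw;
        return reinterpret_cast<void*>(aligned);
    }

    static void operator delete(void* p)
    {
        if (p)
            std::free(reinterpret_cast<void**>(p)[-1]);
    }
};

// Example type: 16 byte aligned on the stack via alignas, and on the heap via the helper.
struct alignas(16) AlignedThing : public AlignedNewSketch<AlignedThing>
{
    float values[4];
};
```

A production version would more likely lean on a platform facility such as _aligned_malloc, and C++17 compilers can honor over-aligned types in new expressions on their own, so treat this purely as an illustration of the bookkeeping involved.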

Using SimpleMath, the same math is, well, a little more simple :-)

    #include "SimpleMath.h"

    using namespace DirectX::SimpleMath;


    class PlayerCat
    {
    public:
        void Update()
        {
            const float cFriction = 0.99f;

            // The Vector3 operators take care of the SIMD loads and stores internally.
            mPosition += mVelocity;
            mVelocity *= cFriction;
        }

    private:
        Vector3 mPosition;
        Vector3 mVelocity;
    };

Here is the x86 SSE code generated for the DirectXMath version of the Update method:

     movaps      xmm2,xmmword ptr [ecx+10h]
     movaps      xmm1,xmmword ptr [ecx]
     andps       xmm2,xmmword ptr [?g_XMMask3@DirectX@@3UXMVECTORI32@1@B]
     andps       xmm1,xmmword ptr [?g_XMMask3@DirectX@@3UXMVECTORI32@1@B]
     movaps      xmm0,xmmword ptr [__xmm@3f7d70a43f7d70a43f7d70a43f7d70a4]
     addps       xmm1,xmm2
     mulps       xmm0,xmm2
     movq        mmword ptr [ecx],xmm1
     shufps      xmm1,xmm1,0AAh
     movss       dword ptr [ecx+8],xmm1
     movq        mmword ptr [ecx+10h],xmm0
     shufps      xmm0,xmm0,0AAh
     movss       dword ptr [ecx+18h],xmm0
     ret

The SimpleMath version generates slightly more than twice as many machine instructions:

     movss       xmm2,dword ptr [ecx]
     movss       xmm0,dword ptr [ecx+4]
     movss       xmm1,dword ptr [ecx+0Ch]
     unpcklps    xmm2,xmm0
     movss       xmm0,dword ptr [ecx+8]
     movlhps     xmm2,xmm0
     movss       xmm0,dword ptr [ecx+10h]
     unpcklps    xmm1,xmm0
     movss       xmm0,dword ptr [ecx+14h]
     movlhps     xmm1,xmm0
     addps       xmm2,xmm1
     movss       dword ptr [ecx],xmm2
     movaps      xmm0,xmm2
     shufps      xmm0,xmm2,55h
     movss       dword ptr [ecx+4],xmm0
     shufps      xmm2,xmm2,0AAh
     movss       dword ptr [ecx+8],xmm2
     movss       xmm1,dword ptr [ecx+0Ch]
     movss       xmm0,dword ptr [ecx+10h]
     unpcklps    xmm1,xmm0
     movss       xmm0,dword ptr [ecx+14h]
     movlhps     xmm1,xmm0
     mulps       xmm1,xmmword ptr [__xmm@3f7d70a43f7d70a43f7d70a43f7d70a4]
     movaps      xmm0,xmm1
     movss       dword ptr [ecx+0Ch],xmm1
     shufps      xmm0,xmm1,55h
     shufps      xmm1,xmm1,0AAh
     movss       dword ptr [ecx+10h],xmm0
     movss       dword ptr [ecx+14h],xmm1
     ret

Most of this difference is because I was able to use aligned loads and stores in the DirectXMath version, while the SimpleMath code must do extra work to handle memory locations that might not be properly aligned.  Also note how the SimpleMath version loads the mVelocity value from memory into SIMD registers twice, while the extra control offered by DirectXMath allowed me to do this just once.
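
The repeated load is easy to see in a toy model.  The sketch below uses hypothetical stand-in types and functions (Float3, Reg4, Load, Store, WrappedVec3 and friends are not DirectXMath or SimpleMath names) and simply counts logical loads and stores for the two styles of Update.  It says nothing about what a particular compiler will actually emit.

```cpp
// Hypothetical stand-ins: Float3 plays the role of a memory storage type,
// Reg4 the role of a SIMD register type.
struct Float3 { float x, y, z; };
struct Reg4   { float lane[4]; };

// Counters for logical loads and stores (not real machine instructions).
int gLoads = 0;
int gStores = 0;

Reg4 Load(const Float3& s)
{
    ++gLoads;
    return Reg4{ { s.x, s.y, s.z, 0.0f } };
}

void Store(Float3& d, const Reg4& r)
{
    ++gStores;
    d.x = r.lane[0]; d.y = r.lane[1]; d.z = r.lane[2];
}

Reg4 Add(const Reg4& a, const Reg4& b)
{
    return Reg4{ { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1],
                   a.lane[2] + b.lane[2], a.lane[3] + b.lane[3] } };
}

Reg4 Scale(const Reg4& a, float s)
{
    return Reg4{ { a.lane[0] * s, a.lane[1] * s, a.lane[2] * s, a.lane[3] * s } };
}

// Wrapper style: every operator loads its operands and stores its result.
struct WrappedVec3 : public Float3
{
    WrappedVec3(float ix, float iy, float iz) : Float3{ ix, iy, iz } {}

    WrappedVec3& operator+=(const WrappedVec3& rhs)
    {
        Store(*this, Add(Load(*this), Load(rhs)));
        return *this;
    }

    WrappedVec3& operator*=(float s)
    {
        Store(*this, Scale(Load(*this), s));
        return *this;
    }
};

// Velocity is loaded twice: once by +=, once by *=.
void UpdateWrapped(WrappedVec3& pos, WrappedVec3& vel, float friction)
{
    pos += vel;
    vel *= friction;
}

// Explicit style: each value is loaded once and reused from the register.
void UpdateExplicit(Float3& pos, Float3& vel, float friction)
{
    Reg4 p = Load(pos);
    Reg4 v = Load(vel);
    Store(pos, Add(p, v));
    Store(vel, Scale(v, friction));
}
```

In this model UpdateWrapped performs three loads and two stores, while UpdateExplicit performs two loads and two stores, which mirrors the extra mVelocity load visible in the SimpleMath disassembly.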

But hey, sometimes performance isn't the most important goal.  If you care more about optimizing for developer efficiency, SimpleMath could be for you.

 
