Introduction

Recently I was doing some testing into SSE2 vectorization. I was curious as to how it worked syntactically and what kind of performance I can expect.

SSE2 works by providing SIMD operations. This allows a single operation to manipulate lots of data. In the case of SSE2, we can utilize 128-bit registers which can process 4 single-precision floating point numbers at a time. At best in theory this gives a 4x speedup. However, there are other factors such as loading data into the registers and storing out the results which means in practice sse2 vectorization will provide less than 4x speedup.

My initial tests were producing strange results, though. I was getting some speedups pushing 5x and higher. So let's investigate what's going on.

Testing Setup

Here's my basic test setup. I'm using Visual Studio 2010.

Note that this code won't work as-is in gcc or other compilers.

The sse2 vectorized code

void simd_test(const size_t length)
{
    using namespace std;
    vector<double> times;
    for(size_t count = 0; count < TIMES; ++count)
    {
        float *vec1 = (float*) _aligned_malloc(length * sizeof(float), 16);
        float *vec2 = (float*) _aligned_malloc(length * sizeof(float), 16);
        float *result = (float*) _aligned_malloc(length * sizeof(float), 16);
        for(int i = 0; i < length; ++i)
        {
            vec1[i] = i;
            vec2[i] = -i;
        }
        LARGE_INTEGER start, end;
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&start);
        __m128* vec1_sse2 = (__m128*) vec1;
        __m128* vec2_sse2 = (__m128*) vec2;
        __m128* result_sse2 = (__m128*) result;
        for(size_t i = 0; i < length / 4; ++i)
        {
#if COMPLEXITY == 0
            // basic add
            *result_sse2 = _mm_mul_ps(*vec1_sse2, *vec2_sse2);
#elif COMPLEXITY == 1
            // distance
            *result_sse2 = _mm_add_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2));
            *result_sse2 = _mm_sqrt_ps(*result_sse2);
#else
            *result_sse2 = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2), *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2));
            *result_sse2 = _mm_add_ps(*result_sse2, _mm_mul_ps(*vec1_sse2, *vec2_sse2));
            *result_sse2 = _mm_sqrt_ps(*result_sse2);
            *result_sse2 = _mm_sqrt_ps(*result_sse2);
#endif
            ++vec1_sse2;
            ++vec2_sse2;
            ++result_sse2;
        }
        QueryPerformanceCounter(&end);
        times.push_back((end.QuadPart - start.QuadPart) * 1000. / (double) freq.QuadPart);

        _aligned_free(vec1);
        _aligned_free(vec2);
        _aligned_free(result);
    }
    cout << "sse2 aligned." << endl;
    double min_time = *std::min_element(times.begin(), times.end());
    double max_time = *std::max_element(times.begin(), times.end());
    std::sort(times.begin(), times.end());
    double median_time = times[times.size() / 2];
    cout << "min: " << min_time << " max_time: " << max_time << " median: " << median_time << endl;
}

The standard un-vectorized code

void regular_test(const size_t length)
{
    using namespace std;
    vector<double> times;
    for(size_t count = 0; count < TIMES; ++count)
    {
        float *vec1 = (float*) _aligned_malloc(length * sizeof(float), 16);
        float *vec2 = (float*) _aligned_malloc(length * sizeof(float), 16);
        float *result = (float*) _aligned_malloc(length * sizeof(float), 16);
        for(int i = 0; i < length; ++i)
        {
            vec1[i] = i;
            vec2[i] = -i;
        }
        LARGE_INTEGER start, end;
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&start);
        for(size_t i = 0; i < length; ++i)
        {
#if COMPLEXITY == 0
            // add
            result[i] = vec1[i] + vec2[i];
#elif COMPLEXITY == 1
            // distance
            result[i] = vec1[i] * vec1[i] + vec2[i] * vec2[i];
            result[i] = sqrtf(result[i]);
#else
            result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i];
            result[i] = sqrtf(result[i]);
            result[i] = sqrtf(result[i]);
#endif
        }
        QueryPerformanceCounter(&end);
        times.push_back((end.QuadPart - start.QuadPart) * 1000. / (double) freq.QuadPart);

        _aligned_free(vec1);
        _aligned_free(vec2);
        _aligned_free(result);
    }
    cout << "regular." << endl;
    double min_time = *std::min_element(times.begin(), times.end());
    double max_time = *std::max_element(times.begin(), times.end());
    
    std::sort(times.begin(), times.end());
    double median_time = times[times.size() / 2];
    cout << "min: " << min_time << " max_time: " << max_time << " median: " << median_time << endl;
}

The goal was to make sure the code was as similar as possible. I used a preprocessor define to allow me to change the computational complexity, ranging from a simple add, to a distance computation, and a random computation. The code is run multiple times to ensure that external factors are limited. I ran it 32 times just to get a good sample size.

I also ran the benchmarks over a variety of data sizes.

int main(void)
{
    using namespace std;
    for(size_t size = 4096; size <= 16777216; size *= 2)
    {
        cout << "size: " << size << endl;
        simd_test(size);
        regular_test(size);
        cout << endl;
    }

    cout << "done" << endl;
    cin.ignore();
    return 0;
}

Initial Results

Here are the initial test results using COMPLEXITY = 2.

Times in milliseconds.

size: 4096
sse2 aligned.
min: 0.0144926 max_time: 0.0153984 median: 0.0144926
regular.
min: 0.122735 max_time: 0.175723 median: 0.123187

size: 8192
sse2 aligned.
min: 0.03442 max_time: 0.208332 median: 0.0348729
regular.
min: 0.206973 max_time: 0.767657 median: 0.207879

size: 16384
sse2 aligned.
min: 0.0579706 max_time: 0.770375 median: 0.0584235
regular.
min: 0.414399 max_time: 0.683419 median: 0.415758

size: 32768
sse2 aligned.
min: 0.116394 max_time: 0.219201 median: 0.116847
regular.
min: 0.830157 max_time: 1.23369 median: 0.834686

size: 65536
sse2 aligned.
min: 0.233241 max_time: 0.316574 median: 0.234147
regular.
min: 1.66167 max_time: 2.23051 median: 1.67662

size: 131072
sse2 aligned.
min: 0.546644 max_time: 1.08332 median: 0.570648
regular.
min: 3.42026 max_time: 3.67434 median: 3.45106

size: 262144
sse2 aligned.
min: 1.12499 max_time: 1.68703 median: 1.22282
regular.
min: 6.89442 max_time: 7.41208 median: 7.02757

size: 524288
sse2 aligned.
min: 2.32969 max_time: 2.71465 median: 2.42525
regular.
min: 13.8183 max_time: 20.84 median: 14.4487

size: 1048576
sse2 aligned.
min: 4.82514 max_time: 5.74044 median: 4.94924
regular.
min: 28.0206 max_time: 30.0473 median: 28.6325

size: 2097152
sse2 aligned.
min: 9.76759 max_time: 11.4356 median: 10.0072
regular.
min: 55.9955 max_time: 60.2586 median: 57.779

size: 4194304
sse2 aligned.
min: 19.5642 max_time: 23.1543 median: 21.1434
regular.
min: 113.14 max_time: 120.474 median: 115.781

size: 8388608
sse2 aligned.
min: 40.5033 max_time: 50.6785 median: 43.6292
regular.
min: 233.114 max_time: 245.116 median: 237.776

size: 16777216
sse2 aligned.
min: 80.6615 max_time: 86.9463 median: 81.7471
regular.
min: 464.804 max_time: 491.374 median: 476.572

Figure 1. The speedup of SSE2 vectorization

Wait, I thought SSE2 vectorization can only provide at best a 4x speedup. What's going on here?

Investigation

My first thoughts were that there might not have been hardware acceleration when using the CMath sqrt function. To check this theory, I took a look at the dissassembly.

SSE2 Vectorization Disassembly

            *result_sse2 = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2), *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2));
01171111  movaps      xmm2,xmmword ptr [eax]  
01171114  movaps      xmm1,xmmword ptr [edx+eax]  
01171118  mulps       xmm1,xmm1  
0117111B  movaps      xmm0,xmm2  
0117111E  mulps       xmm0,xmm2  
01171121  mulps       xmm0,xmm2  
01171124  addps       xmm0,xmm1  
01171127  movaps      xmmword ptr [ecx+eax],xmm0  
            *result_sse2 = _mm_add_ps(*result_sse2, _mm_mul_ps(*vec1_sse2, *vec2_sse2));
0117112B  movaps      xmm1,xmmword ptr [edx+eax]  
0117112F  mulps       xmm1,xmmword ptr [eax]  
01171132  addps       xmm1,xmm0  
            *result_sse2 = _mm_sqrt_ps(*result_sse2);
01171135  sqrtps      xmm0,xmm1  
            *result_sse2 = _mm_sqrt_ps(*result_sse2);
01171138  sqrtps      xmm0,xmm0  
0117113B  movaps      xmmword ptr [ecx+eax],xmm0

Standard un-vectorized Disassembly

            result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i];
01171490  movss       xmm0,dword ptr [eax]  
01171494  movss       xmm2,dword ptr [ecx+eax]  
01171499  cvtps2pd    xmm1,xmm0  
            result[i] = sqrtf(result[i]);
            result[i] = sqrtf(result[i]);
0117149C  cvtps2pd    xmm4,xmm0  
0117149F  cvtps2pd    xmm0,xmm0  
011714A2  cvtps2pd    xmm3,xmm2  
011714A5  mulsd       xmm1,xmm1  
011714A9  mulsd       xmm1,xmm4  
011714AD  mulsd       xmm3,xmm3  
011714B1  cvtps2pd    xmm2,xmm2  
011714B4  mulsd       xmm2,xmm0  
011714B8  xorps       xmm0,xmm0  
011714BB  addsd       xmm1,xmm3  
011714BF  addsd       xmm1,xmm2  
011714C3  cvtpd2ps    xmm0,xmm1  
011714C7  cvtps2pd    xmm0,xmm0  
011714CA  sqrtsd      xmm0,xmm0  
011714CE  cvtpd2ps    xmm0,xmm0  
011714D2  cvtss2sd    xmm0,xmm0  
011714D6  sqrtsd      xmm0,xmm0  
011714DA  cvtpd2ps    xmm0,xmm0  
011714DE  movss       dword ptr [edx+eax],xmm0  
011714E3  add         eax,4  
011714E6  dec         esi  
011714E7  jne         regular_test+0D0h (1171490h)

Well, line 20 shows that the program is using the hardware sqrt function so that's not the main issue.

However, take a look at all of those cvtps2pd and cvtpd2ps commands! All of my intermediates are being converted to double precision, all the math is being performed with double precision, and the results are casted back to single precision. It's no wonder the code is noticably slower!

Fixing the Issue

So how do we fix the issue? Luckily for me, this mainly comes down to a compiler setting. Under Project Properties > Configuration Properties > C/C++ > Code Generation I just need to set the Floating Point Model to Fast (/fp:fast).

Taking another look at the disassembly and this looks much better:

            result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i];
00F11490  movss       xmm1,dword ptr [eax]  
00F11494  movss       xmm0,dword ptr [ecx+eax]  
00F11499  movaps      xmm3,xmm0  
00F1149C  mulss       xmm3,xmm0  
00F114A0  movaps      xmm2,xmm1  
00F114A3  mulss       xmm2,xmm1  
00F114A7  mulss       xmm2,xmm1  
00F114AB  mulss       xmm0,xmm1  
00F114AF  addss       xmm2,xmm3  
00F114B3  addss       xmm2,xmm0  
            result[i] = sqrtf(result[i]);
00F114B7  sqrtss      xmm0,xmm2  
            result[i] = sqrtf(result[i]);
00F114BB  sqrtss      xmm0,xmm0  
00F114BF  movss       dword ptr [edx+eax],xmm0  
00F114C4  add         eax,4  
00F114C7  dec         esi  
00F114C8  jne         regular_test+0D0h (0F11490h)

And another look at the results:

Times in milliseconds.

size: 4096
sse2 aligned.
min: 0.030344 max_time: 0.0683872 median: 0.0307969
regular.
min: 0.0579706 max_time: 122987 median: 0.0584235

size: 8192
sse2 aligned.
min: 0.0289853 max_time: 0.248639 median: 0.0289853
regular.
min: 0.116394 max_time: 0.242299 median: 0.116847

size: 16384
sse2 aligned.
min: 0.0579706 max_time: 0.0611408 median: 0.0584235
regular.
min: 0.233241 max_time: 0.477351 median: 0.234147

size: 32768
sse2 aligned.
min: 0.116394 max_time: 0.156702 median: 0.116847
regular.
min: 0.467388 max_time: 0.792566 median: 0.469652

size: 65536
sse2 aligned.
min: 0.233241 max_time: 0.481427 median: 0.2346
regular.
min: 0.93387 max_time: 1.1259 median: 0.942928

size: 131072
sse2 aligned.
min: 0.546191 max_time: 0.928435 median: 0.557967
regular.
min: 1.94564 max_time: 2.39219 median: 1.98323

size: 262144
sse2 aligned.
min: 1.13496 max_time: 1.61638 median: 1.20878
regular.
min: 3.95242 max_time: 4.45513 median: 3.9891

size: 524288
sse2 aligned.
min: 2.37 max_time: 2.98322 median: 2.45424
regular.
min: 7.95148 max_time: 8.76216 median: 8.30066

size: 1048576
sse2 aligned.
min: 4.87904 max_time: 9.01306 median: 5.0294
regular.
min: 16.1742 max_time: 18.9614 median: 17.4233

size: 2097152
sse2 aligned.
min: 9.88398 max_time: 12.1344 median: 11.0724
regular.
min: 32.4169 max_time: 41.6229 median: 34.4182

size: 4194304
sse2 aligned.
min: 19.6149 max_time: 21.9179 median: 20.4586
regular.
min: 65.1077 max_time: 69.9886 median: 67.2517

size: 8388608
sse2 aligned.
min: 39.6559 max_time: 43.4222 median: 40.9526
regular.
min: 131.373 max_time: 559018 median: 134.09

size: 16777216
sse2 aligned.
min: 81.2548 max_time: 93.9051 median: 85.3752
regular.
min: 264.099 max_time: 291.895 median: 274.135

Figure 2. The speedup of SSE2 vectorization, with a fair /fp:fast comparison

That looks much better! There are a few border-line results with very small data sizes, but I think it's reasonable to assume that's due to external factors.

Note: Funny thing that this issue only shows up when compiling under x86/Win32. Switching over to x64 and with /fp:precise or /fp:fast the same code is generated. I don't know if this is a bug or not, I'm not that familiar with the details of Visual Studio and the Visual-C++ compiler.

Helloworld922's Blog

Pages

Tuesday, December 11, 2012

SSE2 Vectorization Benchmarking Problem