Introduction
Recently I was doing some testing into SSE2 vectorization. I was curious as to how it worked syntactically and what kind of performance I can expect.
SSE2 works by providing SIMD operations. This allows a single operation to manipulate lots of data. In the case of SSE2, we can utilize 128-bit registers which can process 4 single-precision floating point numbers at a time. At best in theory this gives a 4x speedup. However, there are other factors such as loading data into the registers and storing out the results which means in practice sse2 vectorization will provide less than 4x speedup.
My initial tests were producing strange results, though. I was getting some speedups pushing 5x and higher. So let's investigate what's going on.
Testing Setup
Here's my basic test setup. I'm using Visual Studio 2010.
The sse2 vectorized code
void simd_test(const size_t length) { using namespace std; vector<double> times; for(size_t count = 0; count < TIMES; ++count) { float *vec1 = (float*) _aligned_malloc(length * sizeof(float), 16); float *vec2 = (float*) _aligned_malloc(length * sizeof(float), 16); float *result = (float*) _aligned_malloc(length * sizeof(float), 16); for(int i = 0; i < length; ++i) { vec1[i] = i; vec2[i] = -i; } LARGE_INTEGER start, end; LARGE_INTEGER freq; QueryPerformanceFrequency(&freq); QueryPerformanceCounter(&start); __m128* vec1_sse2 = (__m128*) vec1; __m128* vec2_sse2 = (__m128*) vec2; __m128* result_sse2 = (__m128*) result; for(size_t i = 0; i < length / 4; ++i) { #if COMPLEXITY == 0 // basic add *result_sse2 = _mm_mul_ps(*vec1_sse2, *vec2_sse2); #elif COMPLEXITY == 1 // distance *result_sse2 = _mm_add_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2)); *result_sse2 = _mm_sqrt_ps(*result_sse2); #else *result_sse2 = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2), *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2)); *result_sse2 = _mm_add_ps(*result_sse2, _mm_mul_ps(*vec1_sse2, *vec2_sse2)); *result_sse2 = _mm_sqrt_ps(*result_sse2); *result_sse2 = _mm_sqrt_ps(*result_sse2); #endif ++vec1_sse2; ++vec2_sse2; ++result_sse2; } QueryPerformanceCounter(&end); times.push_back((end.QuadPart - start.QuadPart) * 1000. / (double) freq.QuadPart); _aligned_free(vec1); _aligned_free(vec2); _aligned_free(result); } cout << "sse2 aligned." << endl; double min_time = *std::min_element(times.begin(), times.end()); double max_time = *std::max_element(times.begin(), times.end()); std::sort(times.begin(), times.end()); double median_time = times[times.size() / 2]; cout << "min: " << min_time << " max_time: " << max_time << " median: " << median_time << endl; }
The standard un-vectorized code
void regular_test(const size_t length) { using namespace std; vector<double> times; for(size_t count = 0; count < TIMES; ++count) { float *vec1 = (float*) _aligned_malloc(length * sizeof(float), 16); float *vec2 = (float*) _aligned_malloc(length * sizeof(float), 16); float *result = (float*) _aligned_malloc(length * sizeof(float), 16); for(int i = 0; i < length; ++i) { vec1[i] = i; vec2[i] = -i; } LARGE_INTEGER start, end; LARGE_INTEGER freq; QueryPerformanceFrequency(&freq); QueryPerformanceCounter(&start); for(size_t i = 0; i < length; ++i) { #if COMPLEXITY == 0 // add result[i] = vec1[i] + vec2[i]; #elif COMPLEXITY == 1 // distance result[i] = vec1[i] * vec1[i] + vec2[i] * vec2[i]; result[i] = sqrtf(result[i]); #else result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i]; result[i] = sqrtf(result[i]); result[i] = sqrtf(result[i]); #endif } QueryPerformanceCounter(&end); times.push_back((end.QuadPart - start.QuadPart) * 1000. / (double) freq.QuadPart); _aligned_free(vec1); _aligned_free(vec2); _aligned_free(result); } cout << "regular." << endl; double min_time = *std::min_element(times.begin(), times.end()); double max_time = *std::max_element(times.begin(), times.end()); std::sort(times.begin(), times.end()); double median_time = times[times.size() / 2]; cout << "min: " << min_time << " max_time: " << max_time << " median: " << median_time << endl; }
The goal was to make sure the code was as similar as possible. I used a preprocessor define to allow me to change the computational complexity, ranging from a simple add, to a distance computation, and a random computation. The code is run multiple times to ensure that external factors are limited. I ran it 32 times just to get a good sample size.
I also ran the benchmarks over a variety of data sizes.
int main(void) { using namespace std; for(size_t size = 4096; size <= 16777216; size *= 2) { cout << "size: " << size << endl; simd_test(size); regular_test(size); cout << endl; } cout << "done" << endl; cin.ignore(); return 0; }
Initial Results
Here are the initial test results using COMPLEXITY = 2.
Times in milliseconds. size: 4096 sse2 aligned. min: 0.0144926 max_time: 0.0153984 median: 0.0144926 regular. min: 0.122735 max_time: 0.175723 median: 0.123187 size: 8192 sse2 aligned. min: 0.03442 max_time: 0.208332 median: 0.0348729 regular. min: 0.206973 max_time: 0.767657 median: 0.207879 size: 16384 sse2 aligned. min: 0.0579706 max_time: 0.770375 median: 0.0584235 regular. min: 0.414399 max_time: 0.683419 median: 0.415758 size: 32768 sse2 aligned. min: 0.116394 max_time: 0.219201 median: 0.116847 regular. min: 0.830157 max_time: 1.23369 median: 0.834686 size: 65536 sse2 aligned. min: 0.233241 max_time: 0.316574 median: 0.234147 regular. min: 1.66167 max_time: 2.23051 median: 1.67662 size: 131072 sse2 aligned. min: 0.546644 max_time: 1.08332 median: 0.570648 regular. min: 3.42026 max_time: 3.67434 median: 3.45106 size: 262144 sse2 aligned. min: 1.12499 max_time: 1.68703 median: 1.22282 regular. min: 6.89442 max_time: 7.41208 median: 7.02757 size: 524288 sse2 aligned. min: 2.32969 max_time: 2.71465 median: 2.42525 regular. min: 13.8183 max_time: 20.84 median: 14.4487 size: 1048576 sse2 aligned. min: 4.82514 max_time: 5.74044 median: 4.94924 regular. min: 28.0206 max_time: 30.0473 median: 28.6325 size: 2097152 sse2 aligned. min: 9.76759 max_time: 11.4356 median: 10.0072 regular. min: 55.9955 max_time: 60.2586 median: 57.779 size: 4194304 sse2 aligned. min: 19.5642 max_time: 23.1543 median: 21.1434 regular. min: 113.14 max_time: 120.474 median: 115.781 size: 8388608 sse2 aligned. min: 40.5033 max_time: 50.6785 median: 43.6292 regular. min: 233.114 max_time: 245.116 median: 237.776 size: 16777216 sse2 aligned. min: 80.6615 max_time: 86.9463 median: 81.7471 regular. min: 464.804 max_time: 491.374 median: 476.572
Wait, I thought SSE2 vectorization can only provide at best a 4x speedup. What's going on here?
Investigation
My first thoughts were that there might not have been hardware acceleration when using the CMath sqrt function. To check this theory, I took a look at the dissassembly.
SSE2 Vectorization Disassembly
*result_sse2 = _mm_add_ps(_mm_mul_ps(_mm_mul_ps(*vec1_sse2, *vec1_sse2), *vec1_sse2) , _mm_mul_ps(*vec2_sse2, *vec2_sse2)); 01171111 movaps xmm2,xmmword ptr [eax] 01171114 movaps xmm1,xmmword ptr [edx+eax] 01171118 mulps xmm1,xmm1 0117111B movaps xmm0,xmm2 0117111E mulps xmm0,xmm2 01171121 mulps xmm0,xmm2 01171124 addps xmm0,xmm1 01171127 movaps xmmword ptr [ecx+eax],xmm0 *result_sse2 = _mm_add_ps(*result_sse2, _mm_mul_ps(*vec1_sse2, *vec2_sse2)); 0117112B movaps xmm1,xmmword ptr [edx+eax] 0117112F mulps xmm1,xmmword ptr [eax] 01171132 addps xmm1,xmm0 *result_sse2 = _mm_sqrt_ps(*result_sse2); 01171135 sqrtps xmm0,xmm1 *result_sse2 = _mm_sqrt_ps(*result_sse2); 01171138 sqrtps xmm0,xmm0 0117113B movaps xmmword ptr [ecx+eax],xmm0
Standard un-vectorized Disassembly
result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i]; 01171490 movss xmm0,dword ptr [eax] 01171494 movss xmm2,dword ptr [ecx+eax] 01171499 cvtps2pd xmm1,xmm0 result[i] = sqrtf(result[i]); result[i] = sqrtf(result[i]); 0117149C cvtps2pd xmm4,xmm0 0117149F cvtps2pd xmm0,xmm0 011714A2 cvtps2pd xmm3,xmm2 011714A5 mulsd xmm1,xmm1 011714A9 mulsd xmm1,xmm4 011714AD mulsd xmm3,xmm3 011714B1 cvtps2pd xmm2,xmm2 011714B4 mulsd xmm2,xmm0 011714B8 xorps xmm0,xmm0 011714BB addsd xmm1,xmm3 011714BF addsd xmm1,xmm2 011714C3 cvtpd2ps xmm0,xmm1 011714C7 cvtps2pd xmm0,xmm0 011714CA sqrtsd xmm0,xmm0 011714CE cvtpd2ps xmm0,xmm0 011714D2 cvtss2sd xmm0,xmm0 011714D6 sqrtsd xmm0,xmm0 011714DA cvtpd2ps xmm0,xmm0 011714DE movss dword ptr [edx+eax],xmm0 011714E3 add eax,4 011714E6 dec esi 011714E7 jne regular_test+0D0h (1171490h)
Well, line 20 shows that the program is using the hardware sqrt function so that's not the main issue.
However, take a look at all of those cvtps2pd and cvtpd2ps commands! All of my intermediates are being converted to double precision, all the math is being performed with double precision, and the results are casted back to single precision. It's no wonder the code is noticably slower!
Fixing the Issue
So how do we fix the issue? Luckily for me, this mainly comes down to a compiler setting. Under Project Properties > Configuration Properties > C/C++ > Code Generation I just need to set the Floating Point Model to Fast (/fp:fast).
Taking another look at the disassembly and this looks much better:
result[i] = vec1[i] * vec1[i] * vec1[i] + vec2[i] * vec2[i] + vec1[i] * vec2[i]; 00F11490 movss xmm1,dword ptr [eax] 00F11494 movss xmm0,dword ptr [ecx+eax] 00F11499 movaps xmm3,xmm0 00F1149C mulss xmm3,xmm0 00F114A0 movaps xmm2,xmm1 00F114A3 mulss xmm2,xmm1 00F114A7 mulss xmm2,xmm1 00F114AB mulss xmm0,xmm1 00F114AF addss xmm2,xmm3 00F114B3 addss xmm2,xmm0 result[i] = sqrtf(result[i]); 00F114B7 sqrtss xmm0,xmm2 result[i] = sqrtf(result[i]); 00F114BB sqrtss xmm0,xmm0 00F114BF movss dword ptr [edx+eax],xmm0 00F114C4 add eax,4 00F114C7 dec esi 00F114C8 jne regular_test+0D0h (0F11490h)
And another look at the results:
Times in milliseconds. size: 4096 sse2 aligned. min: 0.030344 max_time: 0.0683872 median: 0.0307969 regular. min: 0.0579706 max_time: 122987 median: 0.0584235 size: 8192 sse2 aligned. min: 0.0289853 max_time: 0.248639 median: 0.0289853 regular. min: 0.116394 max_time: 0.242299 median: 0.116847 size: 16384 sse2 aligned. min: 0.0579706 max_time: 0.0611408 median: 0.0584235 regular. min: 0.233241 max_time: 0.477351 median: 0.234147 size: 32768 sse2 aligned. min: 0.116394 max_time: 0.156702 median: 0.116847 regular. min: 0.467388 max_time: 0.792566 median: 0.469652 size: 65536 sse2 aligned. min: 0.233241 max_time: 0.481427 median: 0.2346 regular. min: 0.93387 max_time: 1.1259 median: 0.942928 size: 131072 sse2 aligned. min: 0.546191 max_time: 0.928435 median: 0.557967 regular. min: 1.94564 max_time: 2.39219 median: 1.98323 size: 262144 sse2 aligned. min: 1.13496 max_time: 1.61638 median: 1.20878 regular. min: 3.95242 max_time: 4.45513 median: 3.9891 size: 524288 sse2 aligned. min: 2.37 max_time: 2.98322 median: 2.45424 regular. min: 7.95148 max_time: 8.76216 median: 8.30066 size: 1048576 sse2 aligned. min: 4.87904 max_time: 9.01306 median: 5.0294 regular. min: 16.1742 max_time: 18.9614 median: 17.4233 size: 2097152 sse2 aligned. min: 9.88398 max_time: 12.1344 median: 11.0724 regular. min: 32.4169 max_time: 41.6229 median: 34.4182 size: 4194304 sse2 aligned. min: 19.6149 max_time: 21.9179 median: 20.4586 regular. min: 65.1077 max_time: 69.9886 median: 67.2517 size: 8388608 sse2 aligned. min: 39.6559 max_time: 43.4222 median: 40.9526 regular. min: 131.373 max_time: 559018 median: 134.09 size: 16777216 sse2 aligned. min: 81.2548 max_time: 93.9051 median: 85.3752 regular. min: 264.099 max_time: 291.895 median: 274.135
That looks much better! There are a few border-line results with very small data sizes, but I think it's reasonable to assume that's due to external factors.
Note: Funny thing that this issue only shows up when compiling under x86/Win32. Switching over to x64 and with /fp:precise or /fp:fast the same code is generated. I don't know if this is a bug or not, I'm not that familiar with the details of Visual Studio and the Visual-C++ compiler.
No comments :
Post a Comment