Introduction
Recently I was doing some testing with SSE2 vectorization. I was curious how it worked syntactically and what kind of performance I could expect.
SSE2 works by providing SIMD (single instruction, multiple data) operations, which allow a single instruction to operate on several data elements at once. In the case of SSE2, we can use 128-bit registers, each of which holds 4 single-precision floating-point numbers that are processed together.
In theory, this gives at best a 4x speedup. In practice, other costs such as loading data into the registers and storing the results back out mean that SSE2 vectorization will deliver less than a 4x speedup.
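To make the load, operate, store pattern concrete, here is a minimal sketch in C that adds two float arrays, once with a plain scalar loop and once four elements at a time using intrinsics. The function names `add_scalar` and `add_sse` are made up for illustration and are not the code from my tests. The sketch assumes an x86 compiler that ships `<emmintrin.h>`, and it uses unaligned loads and stores so the arrays need no special alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also brings in the SSE float operations */

/* Scalar baseline (hypothetical helper): one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Vectorized version (hypothetical helper): four additions per loop iteration. */
void add_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;

    /* Main loop: process full groups of 4 floats. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b + i);   /* load 4 floats from b */
        __m128 vsum = _mm_add_ps(va, vb);  /* 4 additions in one instruction */
        _mm_storeu_ps(out + i, vsum);      /* store 4 results */
    }

    /* Tail loop: handle the 0 to 3 leftover elements one at a time. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Note that each group of four additions still needs two loads and one store, which is the overhead that keeps the practical speedup below the theoretical 4x.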
My initial tests produced strange results, though: some of the speedups were reaching 5x and higher. So let's investigate what's going on.