I am benchmarking vectorization on macOS with the following Intel Core i7 processor:

    $ sysctl -n machdep.cpu.brand_string
    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

My MacBook Pro is a mid-2014 model.
I tried different compiler flags for vectorization; the three instruction sets that interest me are SSE, AVX and AVX2.
For my benchmark, I add the corresponding elements of 2 arrays and store each sum in a third array (array sizes vary from 32^3 to 32^5 elements).
Note that these arrays hold values of type `double`.
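For reference, the operation being vectorized can be written as a plain scalar loop. This is only a minimal sketch, not code taken from main_benchmark.c; the name `addition_array_scalar` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: c[i] = a[i] + b[i], one double per iteration.
   Hypothetical helper, not taken from main_benchmark.c. */
void addition_array_scalar(int size, const double *a, const double *b, double *c)
{
    assert(size >= 0);
    for (int i = 0; i < size; i++) {
        c[i] = a[i] + b[i];
    }
}
```

GCC enables auto-vectorization at `-O3`, so a baseline like this one may already be compiled to SIMD instructions unless `-fno-tree-vectorize` is passed. Whether that affects the speedups measured here is an assumption worth checking.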
Here are the functions used in my benchmark code:

**1) First, with SSE vectorization:**

    #elif __SSE__
    #include <x86intrin.h>
    #define ALIGN 16
    void addition_array(int size, double *a, double *b, double *c)
    {
        int i;
        // Main loop
        for (i=size-1; i>=0; i-=2)
        {
            // Intrinsic SSE syntax
            const __m128d x = _mm_load_pd(a); // Load two x elements
            const __m128d y = _mm_load_pd(b); // Load two y elements
            const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
            _mm_store_pd(c, sum); // Store two sum elements

            // Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
            a += 2;
            b += 2;
            c += 2;
        }
    }

**2) Second, with AVX256 vectorization:**

    #ifdef __AVX__
    #include <immintrin.h>
    #define ALIGN 32
    void addition_array(int size, double *a, double *b, double *c)
    {
        int i;
        // Main loop
        for (i=size-1; i>=0; i-=4)
        {
            // Intrinsic AVX syntax
            const __m256d x = _mm256_load_pd(a); // Load four x elements
            const __m256d y = _mm256_load_pd(b); // Load four y elements
            const __m256d sum = _mm256_add_pd(x, y); // Compute four sum elements
            _mm256_store_pd(c, sum); // Store four sum elements

            // Increment pointers by 4 since AVX256 vectorizes on 256 bits = 32 bytes = 4*sizeof(double)
            a += 4;
            b += 4;
            c += 4;
        }
    }

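`_mm_load_pd` and `_mm256_load_pd` require their pointer argument to be aligned to 16 and 32 bytes respectively, which is what `ALIGN` is for. Below is a minimal sketch of an allocation that satisfies this, assuming a POSIX system that provides `posix_memalign`. The helper name `alloc_aligned_doubles` is hypothetical, and main_benchmark.c may allocate its arrays differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN 32  /* 32 bytes for AVX; 16 would be enough for SSE */

/* Allocate `count` doubles whose start address is a multiple of ALIGN.
   Hypothetical helper; returns NULL on failure, release with free(). */
double *alloc_aligned_doubles(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, ALIGN, count * sizeof(double)) != 0) {
        return NULL;
    }
    assert(((uintptr_t)p % ALIGN) == 0);
    return (double *)p;
}
```

Because the loops above advance each pointer by a whole register width (2 or 4 doubles), the pointers stay aligned on every iteration as long as they start out aligned.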
For SSE vectorization, I expect a speedup of about 2, because a 128-bit register (16 bytes) holds 2 doubles (2 * sizeof(double)).
The results I get for SSE vectorization are shown in the following figure:
I think these results are valid because the speedup is around a factor of 2.
For AVX256, I get the following figure:
For AVX256 vectorization, I expect a speedup of about 4, because a 256-bit register (32 bytes) holds 4 doubles (4 * sizeof(double)).
But as you can see, I still get a speedup of about `2` and not `4`.
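The expected factors of 2 and 4 are simply the register width divided by the size of a `double`. The sketch below spells out that arithmetic, together with the number of bytes one pass over the three arrays touches for the sizes used here. Both function names are hypothetical, and the ideal factor is an upper bound on paper, not a prediction of measured speedup.

```c
#include <assert.h>
#include <stddef.h>

/* Number of doubles that fit in one SIMD register of the given width.
   128 bits (SSE) gives 2, 256 bits (AVX) gives 4. */
size_t doubles_per_register(size_t register_bits)
{
    assert(register_bits % (8 * sizeof(double)) == 0);
    return register_bits / (8 * sizeof(double));
}

/* Bytes touched by one pass over a, b and c, each holding n doubles. */
size_t working_set_bytes(size_t n)
{
    return 3 * n * sizeof(double);
}
```

With 8-byte doubles, `working_set_bytes` gives 768 KiB for 32^3 elements, 24 MiB for 32^4 and 768 MiB for 32^5, which may be useful context when comparing the runs against the cache sizes of the processor.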
I don't understand why I get the same speedup with SSE and AVX vectorization.
Does it come from the compilation flags, from my processor model, or from something else? I don't know.
Here are the compilation command lines I used for all the results above:
**For SSE:**

    gcc-mp-4.9 -O3 -msse main_benchmark.c -o vectorizedExe

**For AVX256:**

    gcc-mp-4.9 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe

The entire code is available at: http://beulu.com/test_vectorization/main_benchmark.c.txt
and the shell script used for benchmarking is at: http://beulu.com/test_vectorization/run_benchmark
Could anyone tell me why I get the same speedup with SSE and AVX (i.e. a factor of 2 in both cases)?
Thanks for your help.