AVX Optimizations and Performance: VisualStudio vs GCC
Greetings, I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation...
Bloated Instruction counts in SDE as compared with that from HW PMC 0xC0
I have noted in multiple (infrequent, but frequent enough) circumstances that the instruction counts for execution of a binary in SDE and that reported by PMC 0xC0 differ by ORDERS of magnitude....
Poor Code Gen of FMA3 instructions in SPEC FP 06 using Intel 14.0.0 compiler...
I have compiled a SPEC FP 06 using the Intel 14.0.0 compiler suite. I've observed great performance but upon looking at the code gen distributions through SDE, I note that only about 0.1% of the...
AVX-512 is a big step forward - but repeating past mistakes!
AVX512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set....
Studying Intel TSX Performance: strange results
Dear all, I have studied Intel TSX performance: its abort cases and how it compares with a spin lock. The study, with reference to source code, is available at...
mem address directly from SSE/AVX register
Hello, I would like to make a suggestion. Very often [otherwise well vectorizable] algorithms require reading/writing from/to mem addresses which are calculated per-channel (reading from table, sampling...
Intel® Software Development Emulator, Release 6.7
Hello, we just released version 6.7 of the Intel® Software Development Emulator. It is available here: http://www.intel.com/software/sde It includes: Debugging with GDB is now supported with Intel®...
MOVNTI and alignment for real mode
In the SDM rev. 48, vol. 2A, page 3-546, in the description of the exceptions for the MOVNTI instruction in real mode, it is specified that the instruction can generate #GP if a memory operand is...
Intel® Software Development Emulator, Release 6.12
Hello, we just released version 6.12 of the Intel® Software Development Emulator. It is available here: http://www.intel.com/software/sde It includes: Support for Mac OS X version 10.9. Improved the TSX...
Instruction set extensions programming reference, revision 17
An updated instruction set extensions programming reference, revision 17, has been posted here. It includes information about: Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions; Intel®...
Are there any books about SIMD (SSE, AVX and so on) optimization?
Can someone please recommend a few books on program optimization? I use multithreading and SIMD to improve the performance of the program. I always learn SIMD through websites, and ask questions in...
Latest ASM compiler other than Intel C and C++ Compilers
Hi, I am trying to code my application in assembly to run on x86. Please suggest a suitable compiler that supports all SSE4.2 assembly instructions (other than the Intel Compiler). If any links which...
unaligned loads avx-128 vs. -256
I just saw that my cases using _mm256_loadu_ps show better performance than _mm_loadu_ps on corei7-4, where the latter was faster on earlier AVX platforms (in part due to the ability of ICL/icc to...
Will AVX-512 replace the need for dedicated GPUs?
I do not expect it to replace high end graphics cards, and it will likely be less power-efficient than a dedicated GPU (integrated or discrete). As far as I can tell, performance-wise it will easily...
ICPC 13.0.2 generates scalar load instead of packed load
Hi all, I'm a little puzzled about the generated assembly code for this little piece of Cilk code: void gemv(const float* restrict A[4], const float *restrict x, float * restrict y){...
gather instructions and the size of indexes for a given base GPR size
Hi, I have a simple question. When performing address computations, the size of the BASE and the INDEX are required to be the same. I presumed this was the case in the GATHER instructions.. but I...
FMA manipulation of register’s content for XMM, YMM and ZMM register sets
Hello, there wasn’t a typical introduction thread, so since it’s my first post I thought I’d introduce myself. My name is Mile (yes, like the measuring unit) and I’m a student. I’m a noob in this area. I’m...
Get _mm_alignr_epi8 functionality on 256-bit vector registers (AVX2)
Hello, I'm porting an application from SSE to AVX2 and KNC. I have some _mm_alignr_epi8 intrinsics. While I just had to replace this intrinsic by the _mm512_alignr_epi32 intrinsic for KNC (by the way, I...
How to clear the upper 128 bits of __m256 value?
How can I clear the upper 128 bits of m2: __m256i m2 = _mm256_set1_epi32(2); __m128i m1 = _mm_set1_epi32(1); m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2)); m2 =...
Different ways to turn an AoS into an SoA
Hi, I'm trying to implement a permutation that turns an AoS (where the structure has 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, and without using gather operations, to find out if it is worth...