Why is FMA slower than SSE here?
I am optimizing an app that computes correlation coefficients many times. The loops were easy to vectorize, but there are also some calculations made outside of them. I tried to partially optimize them using...
AVX add slow due to vinsertf128
The CPU is a 3820, with ICL 14.0 and VS2013; the variables are doubles. auto newvel = velocitiesy[i] + force; That line is slow because the vinsertf128 instruction in this case has a high CPI of over 3.2, this is the...
[XED] how to encode mov instruction
Hello, all. I am trying to encode mov and call instructions on CentOS7, but encountered some errors. The source code is as below. #ifdef __X86_64__ #define XED_MMODE...
AVX512 suboptimal intrinsics compilation
I'm looking into what the Intel compiler makes out of AVX512 intrinsics (latest trial compiler, downloaded a few weeks ago). There are several strange things I notice, to...
Go programs (even an empty one) hang on exit
Hi, if I compile an empty Go program and run it under SDE64 7.49 on Linux, the program does not exit unless I send it multiple signals (4 or 5 SIGINT or SIGQUIT seems to do the trick). By "empty" I mean...
Parallelization + Vectorization using OpenMP in Sandy Bridge
Hi, I would like to ask a question about parallelization + vectorization: 1) Is it possible to implement parallelization + vectorization at the same time (i.e. access AVX in a Sandy Bridge processor using...
Code scales poorly with AVX
This code scales poorly with AVX on my Sandy Bridge. How can I make it more vectorizer friendly: for (auto i = 0; i < pcount; i += 2){ for (auto j = 0; j < pcount; j += 2){ if (i == j) continue;...
Is xend treated as a full memory barrier?
I've started attempting to learn the RTM extensions. The most common examples I can find online use them to implement a mutex or concurrent lock. Often they are similar to: #include...
_mm_prefetch usage
Hi, I couldn't find an answer to this question and it might be silly but does _mm_prefetch need vzeroupper if mixed with AVX or AVX2 code since it is an SSE intrinsic and non-vex instruction? I am...
mitigating permute costs in AVX 256?
Hello, I'm investigating conversion of a number of compute kernels from AVX 128 to AVX 256 and would appreciate any guidance which might be available on getting a small number of operations on port 0+1...
How to speed up this code?
Hello all, many thanks to all contributors to my past question. Crazy things happen: 2 years ago I was internally moved to UI & Communication development to speed things up there :) So my...
CPI rate blows up
Hi, I'm trying to solve the last question here (sadly without an answer so far) myself. Now I have some ratings from VTune Amplifier and see some strange results as well (the assembler code is generated by VS...
Cannot access compiler intrinsics for logarithm in Visual Studio
Hello, I cannot use the compiler intrinsics related to logarithms in either Visual Studio 2013 or 2015. I tried to use _mm_log_ps and it was not found. I used the "immintrin.h" header file. I looked...
Question about latency
AVXOP xmm0, xmm0, xmm1 Hi, years ago I read and heard various mystical things about latencies caused by register choice when using AVX, and why it is better to use AVX instead of SSE -...
Random slow downs with AVX2 code.
I wrote a subroutine mostly using AVX2 and AVX compiler intrinsics. I used some SSE instructions too, but I did set the enhanced instruction set to AVX2 in the project settings of Visual Studio. My...
E5-1650 v4, What are the AVX 'Base' and 'Turbo' Speeds?
Hi, I'm trying to determine if (what appears to be) unexpected (below base frequency) throttling on my new system is being caused by AVX usage when I run various stress programs like Prime-95 and the...
Question about performance difference SSE4/AVX vs. AVX2 with dual-channel vs....
Hi, today I have an interesting question about your experience (not only theoretical improvement) with the code performance difference on SSE4/AVX with a dual-channel memory board vs. AVX2 with quad-channel...
Slightly OT, but maybe somebody has an idea.
Hi, (my question about the ISA extension is near the bottom of the post) today I found an old piece of code, did some time profiling and .... nearly fell off my chair. What happened? This very small piece...
Why is ‘_mm512d load/store’ intrinsic changed to vmovups not vmovupd?
In my application, speed is very important, so I ran Intel Advisor on it and found that there are some type conversions. I think it is weird, because there are some float type but I...
Skylake Xeon and AVX-512VL
Hi all, please excuse my ignorance but I am just wondering if the Skylake Xeon processor has been released to the market yet? As I need to use the AVX-512VL (not AVX-512F or others) instruction set, and...