LDDQU vs. MOVDQU guidelines
Hi, I'm wondering what the guidelines are for using the LDDQU vs. MOVDQU instructions on the latest and future Intel CPUs. I know that during the NetBurst era LDDQU was supposed to be a more efficient way of loading...
The memory ordering semantics of mfence versus those of locked instructions
Even after many years of the existence of the mfence instruction (and even more time with the lock prefix), and a fairly careful study of the system programming manual, something still isn't clear to...
AVX-512 release date
Hello, do you know when this extension will be released for Intel Core processors? I have been working since 2014 on a 3D engine that does its 3D matrix calculations entirely on the CPU, and I currently use AVX with its 256 bit...
gcc not finding _mm256_storeu2_m128i
Hi, I am developing code where I need to use the intrinsic function _mm256_storeu2_m128i. Although I have included the necessary header file and compilation flag, gcc is not able to find the function. What might...
Using AVX opcodes slows my proc
Strange, but the AVX version of this code is 20 times slower than the XMM version. CPU: i7-6950X 3.0GHz, running Win10/x64. Any idea how to run this code properly? The IDA Pro disassembly shows that the code is right. No...
KUNPCK* instructions behavior in SDM and Intrinsics Guide
Hi, I wonder what the behavior of the KUNPCKBW/KUNPCKWD/KUNPCKDQ instructions is. In the SDM, the description of the instructions implies that they interleave individual bits of the input registers. This is...
Error in pseudo-code for RDPMC in SWDM Volume 2
I am pretty sure there is a typo (and an inconsistency in notation) in the pseudo-code for the RDPMC instruction in Volume 2 of the SW Developer's Manual. (I am working from document 325383-067, May...
what are the performance implications of using vmovups and vmovapd...
Hi all, I see two instructions that perform virtually the same operation, vmovups and vmovapd, as per the Intel Intrinsics Guide...
Confusion in behavior of _mm256_load_ps and _mm256_loadu_ps intrinsics
Hi all, I performed a quick test to understand the behavior of the _mm256_load_ps and _mm256_loadu_ps SIMD intrinsics, and the result is quite unexpected. I am wondering if this is a bug by...
I understand Why SSE is slower than ANSI C
Hi, I wrote a very simple C program some days ago. I tried to optimize my program's code with SSE, but I found that the SSE version is slower than plain C. This program is very important for my work; if you can help...
RDTSC to measure performance of small # of FP calculations
Hey there, I found my problem in an old topic here: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/306222 but I cannot understand which solution was correct, so I decided to repeat the question. In...
When will Snow Ridge be available?
Hi, I'm interested in a few new instructions that will be available in Snow Ridge, but there is very little information about it, and Google does not help much either. I wonder when these instructions will...
MWAIT is not improving performance, and why does my machine get stuck?
Hi, I'm writing a simple kernel module to test the monitor/mwait instructions on my machine, which has an i7-7700K processor. I use a char[64] for each core, so false wakeups should be minimized. I was...
Intrinsic functions _rdtsc and _rdtscp
Hello, there is an intrinsic _rdtsc according to [1]. The questions are: 1- What is the unit of the output? It is an unsigned number. Is that nanoseconds? Clock cycles? ... 2- Why is there a form _rdtscp...
Disabling HW prefetcher
Hi, with _mm_clflush() I flushed an array from all cache levels. Next, I measure two accesses with __rdtsc(). While I know the distance between the two accesses is larger than the cache line size, e.g. 80...
Determining wake up reason for MWAIT
Hello, I'm trying to figure out how one can check the reason for an MWAIT wakeup. I know there are several reasons for MWAIT to wake up, including a write to the monitored address (of course,...
could not decode some pattern of vgatherdps
The byte code of `vgatherdps zmm0{k1}, [rax + zmm0]` is 62F27D49920400. But: > xed -d 62F27D49920400 > 62F27D49920400 > ERROR: GATHER_REGS Could not decode at offset: 0x0 PC: 0x0:...
Incorrect links in the Architectures Software Developer’s Manual
Hi, there are a couple of incorrect links (references) to "Figure 6-4. Stack Usage on Transfers to Interrupt and Exception-Handling Routines" in the Intel® 64 and IA-32 Architectures Software...
Unable to generate Vectorization report in icc/icpc compilers.(icc...
When I executed the icc command, I got the following error message: $ icc -vec-report2 p1.c icc: command line remark #10148: option '-vec-report2' not supported. I am also unable to pass flags for SSE and AVX. What...
Skylake documentation bug
This is for Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-039 December 2017, Page...