Channel: Intel® Software - Intel ISA Extensions

LDDQU vs. MOVDQU guidelines


Hi,

I'm wondering what the guidelines are for using the LDDQU vs. MOVDQU instructions on the latest and future Intel CPUs.

I know that during the NetBurst era, LDDQU was supposed to be a more efficient way of loading unaligned data when the data is not going to be modified soon. Later, in the Core architectures, MOVDQU was updated to become equivalent to LDDQU. The general guideline was therefore to use LDDQU: it would be no worse than MOVDQU, and on older CPUs it would be faster.

However, in Agner Fog's latest instruction tables for Skylake, I can see that LDDQU has a one-cycle-longer latency than MOVDQU, which leads to the following questions:

1. Does this mean that LDDQU is no longer equivalent to MOVDQU? If so, what is the difference?

2. Is this discrepancy an unfortunate (mis-)feature of the Skylake architecture that is intended to be "fixed" in future architectures, or is the change permanent?

3. What are the guidelines for choosing one instruction over the other? I'm interested in modern architectures (say, Haswell and later) as well as future CPU architectures.

Thanks.

 

 

 


The memory ordering semantics of mfence versus those of locked instructions


Even after many years of the existence of the mfence instruction (and even more time with the lock prefix), and a fairly careful study of the system programming manual, something still isn't clear to me.

Both mfence and locked instructions have memory ordering effects, generally ensuring sequentially consistent semantics and preventing any reordering across them at least with respect to normal accesses for write-back (WB) memory regions. Are there any cases, however, where the actual, documented or guaranteed memory ordering semantics differ between them? For example, when using non-temporal operations on WB memory regions? When using WC or WT or other types of memory regions other than WB (possibly also mixed with accesses to WB regions)?

The system programming guide doesn't really provide a precise enough treatment of the topic: section 8.2 deals with memory ordering, but it largely limits itself to the case of WB memory regions and doesn't handle non-temporal (streaming) operations in a comprehensive way. Various other sections touch on the other cases, and some mention that mfence may be used for ordering (e.g., to flush write-combining buffers when dealing with WC memory regions) - but they don't say that only mfence may be used (leaving open the possibility that lock-prefixed instructions also work in this capacity). Conversely, other locations mention only lock-prefixed instructions for ordering.

So the question is still outstanding: does mfence provide ordering guarantees in any case that a lock-prefixed instruction doesn't? Alternately, and less likely, does a lock-prefixed instruction provide ordering guarantees in any case that mfence doesn't?

 

AVX-512 release date


Hello,

Do you know when this extension will be released for Intel Core CPUs?

I have been working since 2014 on a 3D engine that does all of its 3D matrix calculation on the CPU, and I currently use AVX with its 256-bit registers.

The performance is around 5-15 fps for 1 million vertices.

I heard about the AVX extension that operates on 512-bit registers back in 2014, and I see that it has now been released on Xeon CPUs.

But because I am not working with Xeon CPUs, I am waiting for it on the Intel Core line.

I haven't worked on this project since 2015, because 5-15 fps is too low to make something good with the engine, and I am waiting for AVX-512 to get back to it.

 

gcc not finding _mm256_storeu2_m128i


Hi,

I am developing code where I need to use the intrinsic function _mm256_storeu2_m128i. Although I have included the necessary header file and compilation flag, gcc is not able to find the function. What might be the reason?

Thanks in advance

Using AVX opcodes slows down my code


Strange, but the AVX version of this code is 20 times slower than the XMM version.
CPU: i7-6950X 3.0 GHz, running on Win10/x64. Any idea how to run this code properly?
IDA Pro disassembly shows that the code is right - no errors.

procedure ScanLineVec256(X0, X1: Integer; P: TVertexD);
asm
  .NOFRAME

  sub X1, X0
  inc X1

  movq2dq xmm3, mm6
  movd xmm11, r12d
  shufps xmm11, xmm11, 00b

  //vmovdqu ymm10, [R8]
  db $C4, $41, $7E, $6F, $10

 @X:

  add r10, 4
  add r11, 4

  //vaddpd ymm10, ymm10, ymm13
  db $C4, $41, $2D, $58, $D5

  //vandpd ymm0, ymm10, ymm15

  db $C4, $C1, $2D, $54, $C7
  //vxorpd ymm0, ymm0, ymm15
  db $C4, $C1, $7D, $57, $C7
  //vptest ymm0, ymm0
  db $C4, $E2, $7D, $17, $C0

  jz @Inside
  dec X1
  jnz @X
  ret

  @Inside:

  //vmovdqa ymm1, ymm12
  db $C5, $7D, $7F, $E1

  //vmulpd ymm1, ymm1, ymm10
  db $C4, $C1, $75, $59, $CA

  //vmovdqa ymm4, ymm1
  db $C5, $FD, $7F, $CC

  //vmulpd ymm1, ymm1, ymm14
  db $C4, $C1, $75, $59, $CE

  //Extract (X+Y)
  //vextractf128 xmm2, ymm1, 01b
  db $C4, $E3, $7D, $19, $CA, $01
  //(X+Y)+Z
  addsd xmm2, xmm1
  psrldq xmm1, 8
  addsd xmm1, xmm2

  movq xmm0, R13
  divsd xmm0, xmm1
  cvtsd2ss xmm0, xmm0

  comiss xmm0, dword ptr [r10]
  jb @Below
  dec X1
  jnz @X
  ret

 @Below:
  movd dword ptr [r10], xmm0
  shufps xmm0, xmm0, 00b

  //vcvtpd2ps xmm4, ymm4
  db $C5, $FD, $5A, $E4

  movd dword ptr [r11], xmm0

  dec X1
  jnz @X

end;

 

KUNPCK* instructions behavior in SDM and Intrinsics Guide


Hi,

I wonder what the behavior of the KUNPCKBW/KUNPCKWD/KUNPCKDQ instructions is. In the SDM, the descriptions of the instructions imply that they interleave individual bits of the input registers. This is especially so for KUNPCKWD, whose description is missing the word "masks".

At the same time, the pseudo-code of the operations indicates that the instructions place the SRC1 bits above the SRC2 bits without interleaving. The Intel Intrinsics Guide also contains this pseudo-code.

Based on my previous experience with the unpack instructions in SSE and AVX, I would expect the KUNPCK* instructions to interleave individual bits, but in that case the pseudo-code is incorrect. Is this the case? If not, it would be better to update the instruction descriptions to make it clear that they do not interleave individual bits.

 

Error in pseudo-code for RDPMC in SWDM Volume 2


I am pretty sure there is a typo (and an inconsistency in notation) in the pseudo-code for the RDPMC instruction in Volume 2 of the SW Developer's Manual.  (I am working from document 325383-067, May 2018.)

The first piece of the pseudo-code is for Intel processors that support architectural performance monitoring, and is trying to show how bit 30 of the counter number in %ecx is used to determine whether the remainder of the bits in %ecx are used to select a fixed-function counter number or a programmable counter number.

IF (ECX[30] = 1 and ECX[29:0] in valid fixed-counter range)
        EAX ← IA32_FIXED_CTR(ECX)[30:0];
        EDX ← IA32_FIXED_CTR(ECX)[MSCB:32];
ELSE IF (ECX[30] = 0 and ECX[29:0] in valid general-purpose counter range)
        EAX ← PMC(ECX[30:0])[31:0];
        EDX ← PMC(ECX[30:0])[MSCB:32];

Note the difference in syntax between the pairs of assignments to EAX and EDX -- the second pair looks correct.  The first pair uses an ambiguous notation, but in the first statement the "[30:0]" is incorrect whether it is in reference to the bits of ECX (which should be [29:0]) or in reference to the bits of the counter (which should be [31:0]).

Using the second pair as a guide to the syntax, the first pair should probably be:

        EAX ← IA32_FIXED_CTR(ECX[29:0])[31:0];
        EDX ← IA32_FIXED_CTR(ECX[29:0])[MSCB:32];

This confusion is repeated in the first pair of assignments in the second section of pseudo-code ("Intel Core 2 Duo processor family [...]") and in the first pair of assignments in the third section of pseudo-code ("P6 family processors [...]").

What are the performance implications of using the vmovups and vmovapd instructions, respectively?


Hi all,

I see two instructions that perform virtually the same operation - vmovups and vmovapd, as per the Intel Intrinsics Guide (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3,3...) - except with respect to the expectation of memory alignment.

However, I am very interested in understanding the performance implications of using one of the above vs. the other.

The Intel developer's guide doesn't give much information about this (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-...)

Basically, it only states:

"Software may see performance penalties when unaligned accesses cross cacheline boundaries, so reasonable attempts to align commonly used data sets should continue to be pursued."

Is there some resource someone could point me to with more detailed information on this topic?

Thanks,

Aketh


Confusion about the behavior of the _mm256_load_ps and _mm256_loadu_ps intrinsics


Hi all,

I performed a quick test to understand the behavior of the _mm256_load_ps and _mm256_loadu_ps SIMD intrinsics, and the behavior is quite unexpected.

I am wondering if this is a bug by any chance?

When I try to load a register from an unaligned address with _mm256_load_ps, I expect to encounter a general-protection exception (this, of course, should not happen with _mm256_loadu_ps).
However, I see no such exception when using the aligned-load intrinsic. For instance, in the code below, I would clearly expect an exception to be thrown on the second iteration.

for(i = 0; i < size ; i+=1)
        {
                t0 = _mm256_load_ps(&a[i]);
                t1 = _mm256_load_ps(&b[i]);
                t2 = _mm256_add_ps(t0, t1);
                _mm256_store_ps(&c[i], t2);
        }

This seems to be the case irrespective of whether the a, b, c arrays are aligned or unaligned.

Is there any documentation I could refer to that explains this behavior and the performance implications of such unaligned accesses?

Attached below is the full code

Thanks,

Aketh

Attachment: SIMD_intrinsics.c (text/x-csrc, 851 bytes)

I don't understand why SSE is slower than ANSI C


Hi, I wrote a very simple C program some days ago. I tried to optimize my program with SSE, but I found that the SSE version is slower than the plain C one. This program is very important for my work; if you can help me optimize the code quickly, I will be very happy.

 

code:

 

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <conio.h>

typedef __declspec(align(16)) float vec3_t[3];

inline void vec_normalize_sse(vec3_t vec)
{
    _asm {
        mov esi, vec
        movups xmm0, [esi]
        movaps xmm1, xmm0        ; register-to-register copies need no unaligned form
        mulps xmm1, xmm1

        movaps xmm2, xmm1
        shufps xmm2, xmm1, 0xe1
        movaps xmm3, xmm1
        shufps xmm3, xmm1, 0xc6
        addps xmm1, xmm2
        addps xmm1, xmm3

        shufps xmm1, xmm1, 0x00
        sqrtps xmm1, xmm1
        divps xmm0, xmm1

        movups [esi], xmm0
    }
}

inline void vec_normalize_c(vec3_t vec)
{
    float len;

    len = vec[0]*vec[0] + vec[1]*vec[1] + vec[2]*vec[2];
    len = (float)sqrt(len);
    len = 1.0f / len;
    vec[0] *= len;
    vec[1] *= len;
    vec[2] *= len;
}

int main()
{
    int i, s, e, count;
    vec3_t vec;

    count = 1000000;

    vec[0] = 1.0f; vec[1] = 2.0f; vec[2] = 3.0f;
    s = clock();
    for (i = 0; i < count; i++) {
        vec[0] += 0.1f; vec[1] += 0.1f; vec[2] += 0.1f;
        vec_normalize_sse(vec);
    }
    e = clock();
    printf("sse = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);

    vec[0] = 1.0f; vec[1] = 2.0f; vec[2] = 3.0f;
    s = clock();
    for (i = 0; i < count; i++) {
        vec[0] += 0.1f; vec[1] += 0.1f; vec[2] += 0.1f;
        vec_normalize_c(vec);
    }
    e = clock();
    printf("c = %d, %f, %f, %f\n", e - s, vec[0], vec[1], vec[2]);

    getch();
    return 0;
}

 


RDTSC to measure performance of small # of FP calculations


Hey there, I found my problem in an old topic here:

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/306222

but I cannot understand which solution was correct, so I decided to repeat the question.

In response to the original question, I suggest that on late PIV hardware (Northwood and Prescott core machines) you have little chance of getting reliable timings for a short instruction sequence, for a variety of reasons.

In the Intel staff responses it has already been mentioned that the first iteration is almost always slower than later iterations, but there is another factor that has always affected timings under ring-3 access in 32-bit Windows versions: with higher-privileged processes able to interfere with lower-privilege operations, you will generally get at least a few percent variation on small samples, and it gets worse as the sample gets smaller.

You can reduce this effect by setting the process priority to high or time-critical, but you will not escape it under ring-3 access. I have found in practice that for real-time testing you need a duration of over half a second before the deviation comes down to within a percent or two.

What I would suggest is that you isolate the code under test in a separate assembler module and write code of this type:

push esi
push edi

mov esi, large_number
mov edi, 1
align 16
@@:
; your code to time here
sub esi, edi
jnz @B

pop edi
pop esi

Adjust the immediate "large_number" so that the code you are timing runs for over half a second (over 1 second is better), set your process priority high enough to reduce the higher-privilege interference, and you should start to get timings with around 1% or lower variation.

Two trailing comments. First, the next-generation Intel cores will behave differently, on a scale something like the differences between the PIII and PIV processors, so be careful not to lock yourself into one architecture. Second, as far as I remember, the x87 FP instruction range, while still available on current core hardware, is being superseded by much faster SSE/2/3 instructions - so if your target hardware is late enough to support them, you will probably get a big performance gain by using the later instructions.

Regards,


When will SnowRidge be available?


Hi,

I'm interested in a few new instructions that will be available in Snow Ridge, but there is very little information about it, and Google does not help much either.

I wonder when these instructions will show up in shipping CPUs, and when can I expect the CPU to be available? Thanks.

MWAIT is not improving performance, and why does my machine get stuck?


Hi, I'm writing a simple kernel module to test the monitor/mwait instructions on my machine, which has an i7-7700K processor. I use a char[64] for each core, so false wakeups should be minimized.

I was expecting two things: first, a low enough wakeup latency; second, a (maybe slight) performance improvement on the other cores. However, the result kind of surprises me, as neither of the two is fully satisfied.

For the wakeup latency, I get 1200 cycles (with threads pinned to different cores), which is only a few hundred cycles less than an IPI. Even though the latency is not very low, it is comparatively lower anyway, so it is OK.

As for the remaining cores, the performance is not improving at all. I use stress-ng (compiled from the latest source) with the following command.

Result when core 5 and core 7 are in mwait

$ ./stress-ng --matrix 4 -t 10 --taskset 0,1,2,3 --metrics  # core 5 and core 7 is in mwait state, while core 4 and core 6 unaffected
stress-ng: info:  [7809] dispatching hogs: 4 matrix
stress-ng: info:  [7809] successful run completed in 10.00s
stress-ng: info:  [7809] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [7809]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [7809] matrix           136206     10.00     39.98      0.00     13620.61      3406.85

Below is the result in the normal case, when no core is in mwait. There is even a slight degradation (if I use -c 4, the cpu bogo-ops degradation is more obvious than with matrix):

$ ./stress-ng --matrix 4 -t 10 --taskset 0,1,2,3 --metrics
stress-ng: info:  [7893] dispatching hogs: 4 matrix
stress-ng: info:  [7893] successful run completed in 10.00s
stress-ng: info:  [7893] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [7893]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [7893] matrix           137242     10.00     39.99      0.00     13724.22      3431.91

I suspect that mwait might get woken up too frequently by some events, but I don't see how to disable some of them. I tried to set the extensions and hints to non-zero values, but that either causes a #GP or hangs my PC.

 

=============== Here are my questions ================

1. Why is mwait not bringing a performance improvement - am I using it in the wrong way?

2. Even if I only mwait on a single core (e.g., core 5), the whole machine's disk IO seems completely 'stuck'. When I switch to a new git repo in oh-my-zsh (which scans the repo automatically), it gets 'stuck', my Firefox gets 'stuck', then everything is 'stuck' and I have to reboot my machine. By 'stuck' I mean that I can click things and switch tabs in the GUI, but they do not respond. Does anyone know why this would happen? Did I miss anything?

3. The MONITOR instruction allows extensions and hints, but where can I find these extensions and hints? Has Intel disabled these extensions since the Pentium 4?

Thanks!

Zihan

Intrinsic functions _rdtsc and _rdtscp


Hello
There is an intrinsic _rdtsc according to [1]. The questions are:

1- What is the unit of the output? It is an unsigned number. Is it nanoseconds? Clock cycles? ...
2- Why is there a form, _rdtscp [2], that takes an address as an argument? I don't understand that. I want to get the timestamp; what is the purpose of supplying an address for that?

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=406...
[2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=406...

Disabling HW prefetcher


Hi

With _mm_clflush(), I flushed an array from all cache levels. Next, I measure two accesses with __rdtsc(). Although I know the distance between the two accesses is larger than the cache line size (e.g., 80 bytes apart), the TSC for the first access looks like a miss (which is correct), while the TSC for the second element looks like a hit (which is wrong).

It seems that the HW stride prefetcher brings in the second element. Is there any way to force the processor not to prefetch?

 


Determining wake up reason for MWAIT


Hello,

I'm trying to figure out how one can check the reason for an MWAIT wakeup.

I know there are several reasons for MWAIT to wake up, including a write to the monitored address (of course - and ruling out this specific reason is easy), interrupts, faults, and signals.

What I'm trying to figure out is exactly how you can know which fault/interrupt/signal woke you up.

It is not written in the manuals or anywhere else.

Thanks in advance.

xed could not decode some patterns of vgatherdps


The byte code of `vgatherdps zmm0{k1}, [rax + zmm0]` is 62F27D49920400.

But
>xed -d 62F27D49920400
>62F27D49920400
>ERROR: GATHER_REGS Could not decode at offset: 0x0 PC: 0x0: [62F27D499204000000000000000000]

The reason for this error may be that the index and destination registers are the same, which results in a #UD fault.
But the -d option is not meant to emulate the byte code but to decode one instruction, so I expect xed to show the disassembly.
What do you think?

Version of xed:
>Copyright (C) 2017, Intel Corporation. All rights reserved.
>XED version: [8.15.0-6-gb862fe0]

Incorrect links in the Architectures Software Developer’s Manual


Hi,

There are a couple of incorrect links (references) to "Figure 6-4. Stack Usage on Transfers to Interrupt and Exception-Handling Routines" in the Intel® 64 and IA-32 Architectures Software Developer’s Manual.

Under "6.12.1 Exception- or Interrupt-Handler Procedures" there are a couple of links to Figure 6-4, as below, but they are actually linked to "Table 6-4. Interrupt and Exception Classes".

The processor then saves the current state of the EFLAGS, CS, and EIP registers on the new stack (see Figures 6-4).

and

a. The processor saves the current state of the EFLAGS, CS, and EIP registers on the current stack (see Figures 6-4).

Also, such links typically use the singular, e.g., (see Figure 9-1), when they refer to a single figure, regardless of whether the figure contains multiple illustrations. Please consider changing this (it makes searching slightly more difficult).

 

Note that I am looking at "Order Number: 325462-067US May 2018" revision.

Unable to generate a vectorization report with the icc/icpc compilers (icc -vec-report1 p1.c)


When I executed the icc command, I got the following error message:

$ icc -vec-report2 p1.c

icc: command line remark #10148: option '-vec-report2' not supported

I am also unable to pass flags for SSE/AVX.

What is the cause of this error? Why is it not generating a report and instead showing an error that -vec-report2 is not supported?
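For what it's worth, newer icc versions replaced the -vec-reportN family with the unified opt-report options, so something along these lines should work (flags from memory - check `icc -help` on your version):

```shell
# vectorization report (written to p1.optrpt by default):
icc -qopt-report=2 -qopt-report-phase=vec p1.c

# target ISA flags on icc use -x / -m, e.g.:
icc -xCORE-AVX2 p1.c
icc -msse4.2 p1.c
```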

 

Skylake documentation bug



This is for Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number: 248966-039 December 2017, Page 2-6.

 

In Figure 2-3 (CPU Core Pipeline Functionality of the Skylake Microarchitecture), Port 6 correctly lists Int Shft as a function. However, Port 0 does not list Int Shft. My evidence that this is a docubug is that Int Shft is listed for both port 0 and port 6 for Haswell in Figure 2-4 (CPU Core Pipeline Functionality of the Haswell Microarchitecture). Also, Agner Fog's empirically derived instruction tables list SHR/SHL/SAR r,i as using p06 on Skylake.
