Channel: Intel® Software - Intel ISA Extensions

Convert bytes to nibbles


I'm looking to optimize this loop:

framesize /= 2;
for (auto i = 0; i < framesize; ++i){
	buf0[i] = static_cast<unsigned char>(buf0[i * 2] >> 4 & 0x0f | buf0[i * 2 + 1] & 0xf0);
}

It reduces an array of bytes to nibbles.

The generated assembly seems rather extensive for this seemingly simple operation. Is there a more efficient way of doing this?
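
For reference, a hand-written SSE2 equivalent might look something like this (an untested sketch; it assumes framesize, after the division, is a multiple of 16 and that buf0 can be read 32 bytes at a time):

#include <emmintrin.h>  // SSE2

// Processes 32 input bytes -> 16 output bytes per iteration, in place like the scalar loop.
void pack_nibbles_sse2(unsigned char* buf0, int framesize){
	const __m128i lo4 = _mm_set1_epi16(0x000F);
	const __m128i hi4 = _mm_set1_epi16(0x00F0);
	for (int i = 0; i < framesize; i += 16){
		__m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf0 + 2 * i));
		__m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf0 + 2 * i + 16));
		// Each 16-bit word holds an (even, odd) byte pair; build the packed byte in the low half.
		a = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(a, 4), lo4),
		                 _mm_and_si128(_mm_srli_epi16(a, 8), hi4));
		b = _mm_or_si128(_mm_and_si128(_mm_srli_epi16(b, 4), lo4),
		                 _mm_and_si128(_mm_srli_epi16(b, 8), hi4));
		// Values are 0..255, so the unsigned saturation in packus never triggers.
		_mm_storeu_si128(reinterpret_cast<__m128i*>(buf0 + i), _mm_packus_epi16(a, b));
	}
}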


Array Registers


Hi,

Some time ago the BOINC project RakeSearch released its software as open source. I took it and am now trying to optimize it. During this work I got an idea I call "Array Registers". The idea is simple: registers are faster than memory, so a small array placed in a dedicated Array Register should be faster than one stored in memory. Intel CPUs already have XMM/YMM/ZMM registers, which look like natural candidates to become Array Registers. However, they would have to support additional operations. At a minimum, programs should be able to get and set a single element at a non-constant index:

array_get(arr_reg, index):
  return arr_reg[index];

array_set(arr_reg, index, value):
  arr_reg[index] = value;

Various dedicated arithmetic and logic operations would also help to speed things up, e.g. addition:

array_add(arr_reg, index, value):
  arr_reg[index] = arr_reg[index] + value;

The next step is to create a vector version of this. "Get" is just a shuffle. With "Set" you have to deal with possibly duplicated indices: when they are unique, this is a shuffle too; when there are duplicates, you lose data. Overall the "Set" operation does not look very useful. More helpful would be various arithmetic and logic operations, but of course they would be much harder to implement. Intel engineers have probably discussed something like this a lot, since they decided to create the AVX-512CD ISA.
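
As an illustration of the "get is just a shuffle" point, this is roughly how a variable-index get can be done with existing AVX2 instructions (my own untested sketch):

#include <immintrin.h>

// Treat a __m256i as an 8-element int array and read element `index`
// without a round trip through memory: VPERMD broadcasts the wanted lane.
inline int array_get(__m256i arr_reg, int index)
{
  __m256i idx = _mm256_set1_epi32(index);
  __m256i v   = _mm256_permutevar8x32_epi32(arr_reg, idx);  // lane 0 = arr_reg[index & 7]
  return _mm_cvtsi128_si32(_mm256_castsi256_si128(v));
}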

These new instructions would help HPC by speeding up code which cannot otherwise be vectorized. There is also a great deal of legacy code that relies only on compiler optimizations; it would benefit from this as well.

Older versions of Intel Intrinsics Guide: data-X.X.X.xml file


Dear Intel community,

Is it possible to obtain older versions of the intrinsics specification file data-X.X.X.xml? At the time of writing, the latest version available is 3.4 [1]. The reason for this request is that we created a tool that automatically generates domain-specific languages (DSLs) from the XML specification file and uses those DSLs to generate SIMD code, providing full support for vectorization [2]. We would like to test our generator and see how robust it is across different versions of the XML specification.

So far I have been able to salvage:

  1. data-3.2.2.xml 
  2. data-3.3.1.xml
  3. data-3.3.11.xml
  4. data-3.3.14.xml
  5. data-3.3.16.xml
  6. data-3.4.xml

But ideally, I would like to obtain all versions.

Greetings,
Alen Stojanov

[1] https://software.intel.com/sites/landingpage/IntrinsicsGuide/files/data-3.4.xml
[2] https://github.com/ivtoskov/lms-intrinsics

AVX512 missing intrinsics


Hello,

The following functions seem not to be available in AVX-512.

__m512 _mm512_blendv_ps(__m512 a, __m512 b, __m512 mask)

 __m512 _mm512_cmp_ps(__m512 a, __m512 b, int comp)

int _mm512_movemask_ps(__m512 a)

Will they be available soon? Or is there an alternative to them?
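
For context, this is roughly what I expect the mask-register based replacements to look like (my assumption, untested); _mm512_cmp_ps_mask seems to be the AVX-512 counterpart of the compare:

#include <immintrin.h>

// blendv: build a __mmask16 (e.g. with _mm512_cmp_ps_mask) and blend with it.
inline __m512 blendv_ps_512(__m512 a, __m512 b, __mmask16 mask)
{
    return _mm512_mask_blend_ps(mask, a, b);   // takes b where the mask bit is set
}

// movemask: extract the sign bits by reinterpreting as integers.
inline int movemask_ps_512(__m512 a)
{
    __mmask16 m = _mm512_cmplt_epi32_mask(_mm512_castps_si512(a), _mm512_setzero_si512());
    return (int)m;
}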

Thanks a lot!

 

Is the guide of Gather/Scatter of AVX512 wrong?


I found the following from 

https://software.intel.com/en-us/node/523826

 

_mm512_i32scatter_epi32

extern void __cdecl _mm512_i32scatter_epi32(void* base_addr, __m512i a, __m512i vindex, _MM_DOWNCONV_EPI32_ENUM downconv, int scale, int hint);

Scatters int32 from a into memory using 32-bit indices. 32-bit elements are stored at addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale).

 

When I tried _mm512_i32scatter_epi32(dst_ptr, Vdata, Vindex, _MM_DOWNCONV_EPI32_NONE, 4, 1);  

The compilation reports an error saying there are too many arguments.

Then I found another manual from

http://hpc.ipp.ac.cn/wp-content/uploads/2015/12/documentation_2016/en/co...

extern void __cdecl _mm512_i32scatter_epi32(void* mv, __m512i index, __m512i v1, int scale);

And this works. Not only does it have a different number of arguments, but the order of the two __m512i arguments is also reversed.
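
For completeness, this is the form of the call that compiles (index vector before the data vector, matching the second prototype):

#include <immintrin.h>

void scatter_example(int* dst_ptr, __m512i Vindex, __m512i Vdata)
{
    // dst_ptr[Vindex[i]] = Vdata[i] for each of the 16 lanes; scale 4 = sizeof(int).
    _mm512_i32scatter_epi32(dst_ptr, Vindex, Vdata, 4);
}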

I'm using ICC 17.0.4.

I'm wondering what is wrong with the gather/scatter intrinsics here?

SSE and AVX behavior with aligned/unaligned instructions


We've learned that if the compiler emits an aligned SSE memory move instruction for an unaligned address, it will cause a SEGV. Will the same occur with AVX? Or, in the case of AVX, is the extent of the resulting behavior merely undesirable performance?

AVX512-VBMI2: VPSHLDV masks its shift count preventing use as a blend


Is it too late to suggest a change to AVX512_VBMI2 for Ice Lake?

VPSHLDV (and the W / Q versions) would potentially have more uses (or save a blend instruction) if they allowed shift counts large enough to take the entire element from SRC2, instead of being limited to keeping at least one bit from the DST vector.  The current definition in the "future extensions" PDF (October 2017) is:

tmp ← concat(DEST.dword[j], SRC2.dword[j]) << (tsrc3 & 31)

(Or & 15 for the VPSHL/RVW,  & 63 for VPSHL/RVQ)

This is inconsistent with regular vector shifts, which don't mask their count: e.g. AVX2 and AVX512F VPSLLVD can zero elements with a shift count of 32 or higher (vpcmpeqd xmm0, xmm0, xmm0 / vpsllvd xmm0, xmm0, xmm0 produces all-zeros; the same holds for MMX/SSE2/AVX/... (V)PSLLD).

It is consistent with scalar integer SHLD, but arguably the vector version benefits more from having some elements able to produce SRC2, or even SRC2 left-shifted.

I don't have any particular application in mind; maybe some applications benefit from the implicit masking and would otherwise need a VPANDD.  I'm picturing a case where you have a constant vector of shift counts to get different windows for different elements, and for some elements it's useful to have a count of zero, and others it's useful to have a count of 32.  Maybe there aren't any real use cases like that, or few enough that you don't mind forcing them to use an extra blend instruction if it saves transistors implementing this instruction.

For VPSHRDV, which gives you DEST.dword[j] ← concat(SRC2.dword[j], DEST.dword[j]) >> (tsrc3 & 31), you can keep elements of DEST with merge masking, or keep elements of SRC2 with a count of zero.  (But obviously for consistency, if VPSHLDV changes, then VPSHRDV should change, too, along with the non-V versions!)

Update the SDE MSVS debugger install kit to support VS2017?


When will the SDE debugger for MS Visual Studio be upgraded from VS2015 to VS2017? I currently use VS2017 to compile the AVX code and VS2015, with the SDE debugger installed, to debug. My CPU supports AVX, but not AVX2 or AVX-512. Using VS2015 to debug works fine for AVX and AVX2 code; however, it does not work with AVX-512 code, which VS2015 does not understand. I am able to run to breakpoints, but I cannot single-step through the code.


AVX-512 VBMI2: why no vector version of _pext_u32()?


BMI2 has a scalar instruction for bit extraction, exposed as _pext_u32(). I may just have missed it, but I'm not seeing the vector equivalent of it in VBMI2. It would be very helpful to have this.

thanks

 

 

If the frequency is set to the P_STATE 1, why AVX-512 is not running to its base frequency?


Hello,

I'm testing some SIMD instructions on a Skylake Gold 6148 @ 2.40GHz (20 cores) in a node with two sockets (40 cores). I'm filling all cores with a program executing AVX-512 ADDs, MULs and FMADDs, and also setting each core's frequency to P_STATE 1 (base frequency, in this case 2.40GHz).
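
For reference, the per-core kernel is essentially a tight loop of 512-bit FMAs, multiplies and adds, something like this sketch (not the exact code; the constants are placeholders):

#include <immintrin.h>

double stress_avx512(long long iterations)
{
    __m512d a = _mm512_set1_pd(1.000000001);
    __m512d b = _mm512_set1_pd(0.999999999);
    __m512d c = _mm512_set1_pd(0.5);
    for (long long i = 0; i < iterations; ++i) {
        a = _mm512_fmadd_pd(a, b, c);   // FMADD
        b = _mm512_mul_pd(b, c);        // MUL
        c = _mm512_add_pd(c, a);        // ADD
    }
    // Keep the result live so the loop is not optimized away.
    return _mm_cvtsd_f64(_mm512_castpd512_pd128(a));
}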

My question is: if the AVX-512 base frequency is 1.6GHz and I'm making intense use of these AVX-512 instructions, why do I get frequency measurements around 2.2GHz? I assumed I would see measurements around 1.6GHz, because I set P_STATE 1, which means no turbo, and 2.2GHz could be considered the AVX-512 turbo frequency.

Thank you,

Jordi.

Parallel dependence in bitmap scaling code


This method copies a portion of a large bitmap (RGB8), flips it top to bottom into a new image, and also downscales the bitmap by an integer multiplier.

It's for real-time rendering of images too big for most applications, and I need to figure out where the parallel dependence is and how I can optimize it.

__declspec(dllexport) void Copy(unsigned char* __restrict src, const long long sst,
                                unsigned char* __restrict dst, const long long vst,
                                const long long count, const long long zmul){
	if (zmul <= 1){
		// No downscaling: copy count rows, walking src backwards to flip vertically.
		for (auto i = 0; i < count; ++i){
			memcpy(dst + i*vst, src - i*sst, vst);
		}
	} else{
		const auto st = (sst + vst)*zmul;   // rewinds the row's horizontal advance and moves src back zmul source rows (vertical flip)
		const auto zmsq = zmul*zmul;        // source pixels averaged per output pixel
		const auto zmul3 = zmul * 3;        // horizontal step in bytes (zmul RGB8 pixels)
		for (auto i = 0; i < count; ++i, src -= st){
			for (auto j = 0; j < vst; j += 3, src += zmul3){
				// Box filter: sum a zmul x zmul block of source pixels per channel.
				unsigned int r = 0, g = 0, b = 0;
				for (auto k = 0; k < zmul; ++k) {
					for (auto l = 0; l < zmul; ++l) {
						r += src[k*3 + l*sst];
						g += src[k*3 + 1 + l*sst];
						b += src[k*3 + 2 + l*sst];
					}
				}
				dst[i*vst + j] = r / zmsq;
				dst[i*vst + j + 1] = g / zmsq;
				dst[i*vst + j + 2] = b / zmsq;
			}
		}
	}
}

 

How to Reduce CAL (Function Call Interrupts ) on x86_64 architectures in /proc/interrupts


Hi all,

I am trying to run a DPDK application on an x86_64 machine. As part of this I am doing core isolation and assigning one application to the isolated core.

With core isolation in place, the interrupt counts stop incrementing on the isolated core for every interrupt type except CAL (function call interrupts).

I searched on Google, but did not find enough information on CAL.

While checking for CAL in the books Understanding the Linux Kernel and Professional Linux Kernel Architecture, I did not find any information on how CAL interrupts are generated.

Can you please explain what causes CAL interrupts to be generated, and how to stop/control them, on x86_64 (IA64 or AMD64 ISA)?

Thanks & Regards

Satish.G

Histogram examples using AVX-512 CD in Dec 2017 Optimization Ref Manual are wrong?


The December 2017 Optimization Reference Manual has two sections describing how to use the new AVX-512 conflict detection instructions for histogram calculation.

There are numerous issues with Section 15.16.1 and the example code in Example 15-18. In particular, the code in the conflict_loop segment just doesn't work as described.

 

    vmovaps zmm4, all_1 // {1, 1, …, 1}
    vmovaps zmm5, all_negative_1
    vmovaps zmm6, all_31
    vmovaps zmm7, all_bins_minus_1
    mov ebx, num_inputs
    mov r10, pInput
    mov r15, pHistogram
histogram_loop:
    vpandd zmm3, [r10+rcx*4], zmm7
    vpconflictd zmm0, zmm3
    kxnorw k1, k1, k1
    vmovaps zmm2, zmm4
    vpxord zmm1, zmm1, zmm1
    vpgatherdd zmm1{k1}, [r15+zmm3*4]
    vptestmd k1, zmm0, zmm0
    kortestw k1, k1
    je update
    vplzcntd zmm0, zmm0
    vpsubd zmm0, zmm6, zmm0
conflict_loop:
    vpermd zmm8{k1}{z}, zmm2, zmm0
    vpermd zmm0{k1}, zmm0, zmm0
    vpaddd zmm2{k1}, zmm2, zmm8
    vpcmpd k1, 4, zmm5, zmm0
    kortestw k1, k1
    jne conflict_loop
update:
    vpaddd zmm0, zmm2, zmm1
    kxnorw k1, k1, k1
    addq rcx, 16
    vpscatterdd [r15+zmm3*4]{k1}, zmm0
    cmpl ebx, ecx
    jb histogram_loop

Section 17.2.3 has another example of using AVX-512 CD intrinsics. This one is also incorrect, but less so. The two `vpsubd zmm1, zmm1, zmm5` instructions are a mistake -- only one should be necessary. Additionally, no indication is given of the values at the following addresses: [rip+0x185c], [rip+0x1884], [rip+0x18ba]. The loop termination logic in Resolve_conflicts seems redundant, too.

Top:
    vmovups zmm4, [rsp+rdx*4+0x40]
    vpxord zmm1, zmm1, zmm1
    kmovw k2, k1
    vpconflictd zmm2, zmm4
    vpgatherdd zmm1{k2}, [rax+zmm4*4]
    vptestmd k0, zmm2, [rip+0x185c]
    kmovw ecx, k0
    vpaddd zmm3, zmm1, zmm0
    test ecx, ecx
    jz <No_conflicts>
    vmovups zmm1, [rip+0x1884]
    vptestmd k0, zmm2, [rip+0x18ba]
    vplzcntd zmm5, zmm2
    xor bl, bl
    kmovw ecx, k0
    vpsubd zmm1, zmm1, zmm5
    vpsubd zmm1, zmm1, zmm5

Resolve_conflicts:
    vpbroadcastd zmm5, ecx
    kmovw k2, ecx
    vpermd zmm3{k2}, zmm1, zmm3
    vpaddd zmm3{k2}, zmm3, zmm0
    vptestmd k0{k2}, zmm5, zmm2
    kmovw esi, k0
    and ecx, esi
    jz <No_conflicts>
    add bl, 0x1
    cmp bl, 0x10
    jb <Resolve_conflicts>

No_conflicts:
    kmovw k2, k1
    vpscatterdd [rax+zmm4*4]{k2}, zmm3
    add edx, 0x10
    cmp edx, 0x400
    jb <Top>

I've managed to massage the second example into something that appears to work, as follows:

 

include ksamd64.inc

OP_EQ   equ 0
OP_NEQ  equ 4

;
; Define constant variables.
;

ZMM_ALIGN equ 64
YMM_ALIGN equ 32
XMM_ALIGN equ 16

_DATA$00 SEGMENT PAGE 'DATA'

        align   ZMM_ALIGN
        public  AllOnes
AllOnes         dd  16  dup (1)

        align   ZMM_ALIGN
        public  AllNegativeOnes
AllNegativeOnes dd  16  dup (-1)

        align   ZMM_ALIGN
        public  AllBinsMinusOne
AllBinsMinusOne dd  16  dup (254)

        align   ZMM_ALIGN
        public  AllThirtyOne
AllThirtyOne    dd  16  dup (31)

        align   ZMM_ALIGN
Input1544    dd   5,  3, 3,  1,  8,  2, 50, 1,  0,  7,  6,  4,  9, 3, 10,  3
Permute1544  dd  -1, -1, 1, -1, -1, -1, -1, 3, -1, -1, -1, -1, -1, 2, -1, 13
Conflict1544 dd   0,  0, 2,  0,  0,  0,  0, 8,  0,  0,  0,  0,  0, 6,  0, 8198
Counts1544   dd   1,  1, 2,  1,  1,  1,  1, 2,  1,  1,  1,  1,  1, 3,  1,  4

        public  Input1544
        public  Permute1544
        public  Conflict1544
        public  Counts1544

Input1544v2  dd   5,  3, 3,  1,  8,  2, 50, 1,  0,  7,  6,  4,  9, 3, 10,  3
             dd   5,  3, 3,  1,  8,  2, 50, 1,  0,  7,  6,  4,  9, 3, 10,  3

_DATA$00 ends

        NESTED_ENTRY Histo1710, _TEXT$00

;
; Begin prologue.  Allocate stack space and save non-volatile registers.
;

        alloc_stack LOCALS_SIZE

        save_reg    rbp, Locals.SavedRbp        ; Save non-volatile rbp.
        save_reg    rbx, Locals.SavedRbx        ; Save non-volatile rbx.
        save_reg    rdi, Locals.SavedRdi        ; Save non-volatile rdi.
        save_reg    rsi, Locals.SavedRsi        ; Save non-volatile rsi.
        save_reg    r12, Locals.SavedR12        ; Save non-volatile r12.
        save_reg    r13, Locals.SavedR13        ; Save non-volatile r13.
        save_reg    r14, Locals.SavedR14        ; Save non-volatile r14.
        save_reg    r15, Locals.SavedR15        ; Save non-volatile r15.

        save_xmm128 xmm6, Locals.SavedXmm6      ; Save non-volatile xmm6.
        save_xmm128 xmm7, Locals.SavedXmm7      ; Save non-volatile xmm7.
        save_xmm128 xmm8, Locals.SavedXmm8      ; Save non-volatile xmm8.
        save_xmm128 xmm9, Locals.SavedXmm9      ; Save non-volatile xmm9.
        save_xmm128 xmm10, Locals.SavedXmm10    ; Save non-volatile xmm10.
        save_xmm128 xmm11, Locals.SavedXmm11    ; Save non-volatile xmm11.
        save_xmm128 xmm12, Locals.SavedXmm12    ; Save non-volatile xmm12.
        save_xmm128 xmm13, Locals.SavedXmm13    ; Save non-volatile xmm13.
        save_xmm128 xmm14, Locals.SavedXmm14    ; Save non-volatile xmm14.
        save_xmm128 xmm15, Locals.SavedXmm15    ; Save non-volatile xmm15.

        END_PROLOGUE

        mov     Locals.HomeRcx[rsp], rcx                ; Home rcx.
        mov     Locals.HomeRdx[rsp], rdx                ; Home rdx.
        mov     Locals.HomeR8[rsp], r8                  ; Home r8.
        mov     Locals.HomeR9[rsp], r9                  ; Home r9.

        vmovntdqa       zmm28, zmmword ptr [AllOnes]
        vmovntdqa       zmm29, zmmword ptr [AllNegativeOnes]
        vmovntdqa       zmm30, zmmword ptr [AllBinsMinusOne]
        vmovntdqa       zmm31, zmmword ptr [AllThirtyOne]

        mov     rax, rdx
        xor     rdx, rdx

        lea     r10, Input1544v2

Top:
        ;vmovups zmm4, [rsp+rdx*4+0x40]
        vmovntdqa   zmm4, zmmword ptr [r10]
        add     r10, 40h

        ;vmovups zmm4, [rsp+rdx*4+0x40]
        vpxord zmm1, zmm1, zmm1

;
; kmovw k2, k1
;
;   What's k1?!  Assume it's all 1s for now given that it's fed into vpgatherdd.
;

        kxnorw k1, k1, k1

        kmovw k2, k1

        vpconflictd zmm2, zmm4

        vpgatherdd zmm1{k2}, [rax+zmm4*4]

;
; vptestmd k0, zmm2, [rip+0x185c]
;
;   What's [rip+0x185c]?  Guess -1 as it's being used to compare the vpconflictd
;   result, then determining if there are conflicts.
;
        ;vptestmd k0, zmm2, [rip+0x185c]
        vptestmd k0, zmm2, zmm29 ; Test against AllNegativeOnes

        kmovw ecx, k0

;
; vpaddd zmm3, zmm1, zmm0
;
;   What's zmm0?  Assume all 1s, so use zmm28.
;

        vmovaps zmm0, zmm28

        ;vpaddd zmm3, zmm1, zmm0
        vpaddd zmm3, zmm1, zmm0

        test ecx, ecx

        jz No_conflicts

;
; vmovups zmm1, [rip+0x1884]
;
;   What's [rip+0x1884]?
;
;   Try:
;       - AllThirtyOne (31).
;       ;- AllOnes (zmm28)
;

        ;vmovups zmm1, [rip+0x1884]
        vmovaps zmm1, zmm31

;
; vptestmd k0, zmm2, [rip+0x18ba]
;
;   What's [rip+0x18ba]?
;
;   Try:
;
;       - AllNegativeOnes
;

        ;vptestmd k0, zmm2, [rip+0x18ba]
        vptestmd k0, zmm2, zmm29

        vplzcntd zmm5, zmm2

        xor bl, bl

        kmovw ecx, k0

;
; XXX: why two vpsubds here?
;

        vpsubd zmm1, zmm1, zmm5
        ;vpsubd zmm1, zmm1, zmm5

Resolve_conflicts:
        vpbroadcastd zmm5, ecx
        kmovw k2, ecx
        ; The vpermd doesn't appear to have any effect.
        ;vpermd zmm3{k2}, zmm1, zmm3
        vpaddd zmm3{k2}, zmm3, zmm0
        vptestmd k0{k2}, zmm5, zmm2
        kmovw esi, k0
        and ecx, esi
        jz No_conflicts
        add bl, 1h
        cmp bl, 10h
        jb Resolve_conflicts

No_conflicts:
        kmovw k2, k1
        vpscatterdd [rax+zmm4*4]{k2}, zmm3
        add edx, 10h
        cmp edx, 20h
        jb Top


;
; Indicate success.
;

        mov rax, 1

;
; Restore non-volatile registers.
;

Th199:
        mov             rbp,   Locals.SavedRbp[rsp]
        mov             rbx,   Locals.SavedRbx[rsp]
        mov             rdi,   Locals.SavedRdi[rsp]
        mov             rsi,   Locals.SavedRsi[rsp]
        mov             r12,   Locals.SavedR12[rsp]
        mov             r13,   Locals.SavedR13[rsp]
        mov             r14,   Locals.SavedR14[rsp]
        mov             r15,   Locals.SavedR15[rsp]

        movdqa          xmm6,  Locals.SavedXmm6[rsp]
        movdqa          xmm7,  Locals.SavedXmm7[rsp]
        movdqa          xmm8,  Locals.SavedXmm8[rsp]
        movdqa          xmm9,  Locals.SavedXmm9[rsp]
        movdqa          xmm10, Locals.SavedXmm10[rsp]
        movdqa          xmm11, Locals.SavedXmm11[rsp]
        movdqa          xmm12, Locals.SavedXmm12[rsp]
        movdqa          xmm13, Locals.SavedXmm13[rsp]
        movdqa          xmm14, Locals.SavedXmm14[rsp]
        movdqa          xmm15, Locals.SavedXmm15[rsp]

;
; Begin epilogue.  Deallocate stack space and return.
;

        add     rsp, LOCALS_SIZE
        ret


        NESTED_END Histo1710, _TEXT$00

I'm confused as to the purpose of the vpermd instructions in both examples.  Even in the latter example, which actually works, I can remove vpermd and it has no effect on the histogram calculation.  I believe the mask update logic takes care of the "conflict permutation" referred to in section 15.
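
For what it's worth, here is my own intrinsics-level reconstruction of what I believe the conflict loop is meant to do (an untested sketch; note that it does rely on the permutes to walk the duplicate chains):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Histogram n int32 inputs into hist; bins must be a power of two and n a multiple of 16.
void histogram_avx512cd(const int32_t* input, int32_t* hist, size_t n, int32_t bins)
{
    const __m512i vone  = _mm512_set1_epi32(1);
    const __m512i vmask = _mm512_set1_epi32(bins - 1);
    const __m512i vneg1 = _mm512_set1_epi32(-1);
    const __m512i v31   = _mm512_set1_epi32(31);

    for (size_t i = 0; i < n; i += 16) {
        __m512i idx    = _mm512_and_epi32(_mm512_loadu_si512(input + i), vmask);
        __m512i conf   = _mm512_conflict_epi32(idx);           // bit k of lane j set if idx[k] == idx[j], k < j
        __m512i counts = vone;                                  // per-lane increment, grows for duplicates
        __m512i old    = _mm512_i32gather_epi32(idx, hist, 4);  // current bin values

        __mmask16 todo = _mm512_test_epi32_mask(conf, conf);    // lanes that have an earlier duplicate
        if (todo) {
            // perm[j] = index of the nearest earlier lane with the same bin, or -1 if none.
            __m512i perm = _mm512_sub_epi32(v31, _mm512_lzcnt_epi32(conf));
            do {
                __m512i add = _mm512_maskz_permutexvar_epi32(todo, perm, counts);
                counts = _mm512_mask_add_epi32(counts, todo, counts, add);
                perm   = _mm512_mask_permutexvar_epi32(perm, todo, perm, perm);  // follow the chain
                todo   = _mm512_mask_cmpneq_epi32_mask(todo, perm, vneg1);
            } while (todo);
        }

        // For duplicated bins the highest lane is written last and carries the full count.
        _mm512_i32scatter_epi32(hist, idx, _mm512_add_epi32(old, counts), 4);
    }
}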

Can someone review both sections and provide some insight?

How to get the FLOP number of an application?


Hi:

I wrote a simple application to test its FLOP count using SDE; the code is as follows:

#include <stdio.h>
#include <stdlib.h>

float addSelf(float a,float b)
{
    return  a + b;
}
int main()
{
    float a = 10.0;
    float b = 7.0;
    int i = 0;
    float c = 0.0;
    for(i = 0; i < 999;i++)
    {
        c = addSelf(a,b);
    }
    printf("c = %f\n",c);
    return 0;
}

The processor is an i7-7500U, the OS is Windows 10, and the IDE is Code::Blocks. I downloaded the SDE package "sde-external-8.16.0-2018-01-30-win" and ran it with the command: sde -mix -- application.exe. The output file is "sde-mix-out.txt". I searched for "elements_fp" in the file but found nothing, and I searched for "FMA" and found nothing either. Does that mean there is no floating-point calculation in this application? Obviously that's impossible.

Excuse me, what is the problem?

Attachment: sde-mix-out.txt (773.47 KB)

Performance delays - programming with QNan and Denormals


The floating-point spectrum holds some special numbers, such as QNaN (quiet not-a-number), SNaN (signalling not-a-number) and denormalised numbers.

On the pro side, QNaN may be used to tag special cases, such as missing data: you set any unknown item to QNaN, and any arithmetic calculation with this missing item will result in QNaN. So, if result #YY is QNaN, you know it is based on missing data. The NaN property is sticky. Other special numbers may find uses, too.
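
A tiny example of the sticky behaviour (illustrative sketch):

#include <cmath>
#include <cstdio>

int main()
{
    double missing = std::nan("");      // quiet NaN used as a "missing data" tag
    double x = 3.0 * missing + 1.0;     // any arithmetic involving the NaN stays NaN
    std::printf("isnan(x) = %d\n", std::isnan(x) ? 1 : 0);   // prints 1: the tag is sticky
}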

MATLAB uses QNaNs to tag missing data. NaNs come as doubles (64-bit numbers) or floats (32-bit numbers). There are many QNaNs: with doubles, 12 bits (the all-ones exponent plus the quiet bit) define the QNaN condition, while the remaining 51 mantissa bits are available as a user-defined tag.

On the con side, both NaNs and denormal numbers may cause significant delays; they are special conditions.

My question:

    Are MOV instructions - MOVSD, MOVAPD, MOVHPD, etc. - delayed by NaNs and denormals?

   Likewise:

    Suppose an ORPD / ANDPD / XORPD operation created a NaN / denormal / other special value. Is there any penalty?

Why does it matter?

In the ideal case, one should not touch any unverified data. But many situations involve partial data. Data sets may hold results from many sources - some including width/height/depth, some just volume, some only weight. For some uses, the weight suffices. Why throw away this data? If asked for total length, only data from the first source is acceptable. But for total weight, if the density is given, all three sets are good. That would mean coding a different loop for each result, each loop with many internal cases.

It is much simpler to code a single loop and throw out (later) the results carrying the 'invalid' tag - those that are QNaNs. Likewise with other special numbers. Hence, the performance questions do matter.


Vector processing needs better NAN propagation


Allow me to start a discussion of NAN propagation, why SIMD code needs it, and some problems that need to be solved.

There are two ways of detecting floating point errors: (1) fault trapping and (2) propagation of INF and NAN.

Fault trapping can cause inconsistent behavior in SIMD code. Consider, for example, a loop over an array of 16 floats where the last value generates a fault, and suppose we are relying on fault trapping to detect such faults. In linear code, we would have the trap in the last iteration of the loop. If we use ZMM registers, we will have all 16 values in a single register and the trap happens in the first (and only) iteration of the loop. If the loop has any side effects, for example writing something to a file, then the program will produce different outputs depending on whether the code is vectorized or not. There may be further complications if multiple faults happen in the same vector instruction. Multiple faults in the same instruction generate only one trap, even if the faults are of different types, e.g. 0/0 -> NAN, and 1/0 -> INF. This means that we can have a different number of traps depending on the size of the vector registers. With bigger vector registers we will have fewer trap events when multiple faults happen simultaneously.

If we want program behavior to be the same for linear and vectorized code - and the same for different vector register sizes - then it may be better to rely on NAN propagation than on fault traps. A further benefit of NAN propagation is that it is faster than fault trapping.

However, there are a few problems with NAN propagation that need to be solved. But first an introduction for those readers who are not familiar with NAN propagation:

Numerical errors such as 0/0 and sqrt(-1) in floating point code generate a special kind of value called NAN (Not a Number). A computation involving a NAN generates a NAN, so that the NAN will propagate through a sequence of calculations to the final result. A NAN can contain a so-called payload which is a sequence of bits that can contain arbitrary information about the error. A calculation error such as 0/0 gives zero payload on most CPUs, but function libraries and custom code may put useful diagnostic information into the payload. The NAN and its payload will propagate through a sequence of calculations to the final result, where it can be detected.

However, there is a problem when two NANs with different payloads are combined, for example in an expression like a+b. The IEEE 754 floating point standard says that when two NANs are combined, the result should be one of the NANs, but the standard does not say which one. Most CPUs, including Intel's, just propagate the first of the two NANs. In other words, NAN1 + NAN2 = NAN1. This is a problem because we do not know which of the two operands the compiler puts first. If one compiler codes an expression as a+b and another compiler makes it b+a then we may get different end results. The result is not predictable or reproducible when you compile the same code with different compilers or different optimization options.
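
A small test program makes the problem concrete (a sketch; which payload survives depends on the operand order the compiler happens to choose):

#include <cstdint>
#include <cstdio>
#include <cstring>

// Build a quiet NaN with a given payload in the low mantissa bits.
static double nan_with_payload(uint64_t payload)
{
    uint64_t bits = 0x7FF8000000000000ull | (payload & 0x0007FFFFFFFFFFFFull);
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

int main()
{
    volatile double nan1 = nan_with_payload(1);   // volatile: force the addition at run time
    volatile double nan2 = nan_with_payload(2);
    double sum = nan1 + nan2;
    uint64_t bits;
    std::memcpy(&bits, &sum, sizeof bits);
    // On current Intel hardware this prints the payload of whichever NaN the compiler put first.
    std::printf("surviving payload: %llu\n",
                (unsigned long long)(bits & 0x0007FFFFFFFFFFFFull));
}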

This problem needs a solution. a+b and b+a must give the same results if we want NAN payload propagation to give consistent and reproducible results. There are several possible solutions if we want the combination of two NANs to be consistent:

  1. Make a bitwise OR of the two NAN payloads. This solution is simple, but it limits the amount of information you can have in the payload to one bit for each error type. Some people want to include information in the payload about where the fault occurred, and this will not work when payloads are OR'ed.
  2. Propagate the biggest of the two payloads. The advantage of this is that you can define priorities and propagate the NAN with the highest priority. The disadvantage is that the other payload is lost.
  3. Propagate the smallest of the two payloads. This has the disadvantage that it will propagate empty payloads containing no information.
  4. Make a new unique payload. This is a complicated solution and it doesn't make it easier to find the cause of the error.
  5. Generate a trap when two NANs with different payloads are combined. This makes it possible to detect both payloads, but it defeats the purpose of using NAN propagation rather than fault trapping.

The conclusion is that solution 2 should be preferred. Propagate the biggest of the two payloads. This solution will not violate the IEEE 754 standard and it will not break existing applications, but it requires a change in the CPU hardware.

A revision of the IEEE 754 standard is on the way. I have had a long discussion with the people who are writing the revised standard, but they do not want to change anything about NAN1+NAN2. The forthcoming revision will still be undecided about this problem.

Now, I think that Intel has the chance to fix this problem. As the market leader in SIMD processors, you are in a position to propose a solution. If you make a decision and publish it, it is likely to become a de facto standard.

This is the reason why I am asking this question. Is it feasible to change the hardware so that the combination of two NANs will propagate the one with the highest payload? And what is the time frame for such a change?

There is another problem with NAN propagation, namely the max and min functions. The current version of the IEEE 754 standard says the maximum of a NAN and a normal number is the normal number. So the max and min functions will not propagate a NAN. This problem will be solved in the forthcoming revision of the standard. A new set of maximum and minimum functions will be defined that are sure to output a NAN if one of the inputs is a NAN.

The x86 instructions MAXSD etc. do not follow the unfortunate old standard but propagate the last (second) operand if one is a NAN. This is useful because it is equivalent to the common high-level language expression a > b ? a : b. You probably need to define a new set of max and min instructions that generate a NAN if any of the inputs is a NAN. These new instructions will be needed to match high-level language functions that follow the forthcoming revision of the standard. The existing instructions (MAXSD etc.) will still be needed for matching the high-level language expression a > b ? a : b.

Support for saturation and addition instruction in AVX-512


Hi 

Is there a saturating addition intrinsic _mm512_adds_epi32 in AVX-512? There is already _mm512_adds_epi16, but I could not find the same for 32-bit integers.
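
In case it helps the discussion, a saturating 32-bit add can be emulated with existing AVX-512F intrinsics along these lines (my own untested sketch):

#include <immintrin.h>
#include <cstdint>

__m512i adds_epi32_emulated(__m512i a, __m512i b)
{
    __m512i sum = _mm512_add_epi32(a, b);
    // Signed overflow occurred where a and b share a sign but sum has a different sign.
    __m512i ovf_bits = _mm512_and_epi32(_mm512_xor_epi32(sum, a), _mm512_xor_epi32(sum, b));
    __mmask16 ovf = _mm512_cmplt_epi32_mask(ovf_bits, _mm512_setzero_si512());
    // Saturate towards INT32_MIN for negative inputs, INT32_MAX otherwise.
    __mmask16 neg = _mm512_cmplt_epi32_mask(a, _mm512_setzero_si512());
    __m512i sat = _mm512_mask_blend_epi32(neg, _mm512_set1_epi32(INT32_MAX),
                                               _mm512_set1_epi32(INT32_MIN));
    return _mm512_mask_blend_epi32(ovf, sum, sat);
}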

Additionally, how do we add two 32-bit integer vectors without losing the carry using AVX-512 instructions? As I understand it, _mm512_add_epi32 adds two vectors but stores only the lower 32 bits, ignoring any carry. Since the addition of two 32-bit numbers can be a 33-bit number, how do we accommodate that with the AVX-512 instructions? Any help and reference to sample code is highly appreciated.

Thanks 

 

Immediate operands for SSE instructions?


Hi,

I was wondering why the common floating-point SSE instructions (e.g. movps/ss, addps/ss, mulps/ss) don't have variants that take immediate operands.
As it currently stands, any constant value must be loaded from memory (slow, even if cached) or, if used inside a loop,
copied to a register before the loop, which consumes one register.

Is this difficult to implement in hardware, or is there another problem?
Considering that the immediate versions of the integer instructions are more efficient to use than loading from memory,
the same would probably be true for hypothetical immediate-operand SSE instructions as well?

Of course I am talking about scalar operands (vector ones are too large to be immediate),
and probably only 32-bit (single-precision floats, not doubles). But that would still be much better than nothing.
Even in vectorized code it is very common to use constant factors that are the same for all channels,
or the same additive (bias) constant added to all channels. So I think it would be very useful for the vector instructions too.
(They could take a 32-bit immediate operand, which is then replicated to all channels.)

Possible errors in instruction semantics


Dear Team,

I would like to report some disparities between instruction specifications as documented in the Intel 64 and IA-32 Architectures Software Developer's Manual, Vol. 2, and the actual execution behaviour.

Bug Report 1: vpsravd %xmm3, %xmm2, %xmm1

 

Semantics as per the above manual:

%ymm1 : 0x0₁₂₈ ∘ (%ymm2[127:96] sign_shift_right (0x0₂₇ ∘ %ymm3[100:96])) ∘
                 (%ymm2[95:64]  sign_shift_right %ymm3[95:64]) ∘
                 (%ymm2[63:32]  sign_shift_right %ymm3[63:32]) ∘
                 (%ymm2[31:0]   sign_shift_right %ymm3[31:0])

 

** ∘ is the concatenate symbol here.

 

Note that the first term, (%ymm2[127:96] sign_shift_right (0x0₂₇ ∘ %ymm3[100:96])), has only 5 bits selected from %ymm3.

But the actual execution behaviour seems to use all 32 bits from %ymm3, i.e., (%ymm2[127:96] sign_shift_right %ymm3[127:96]).

 

The following is the pseudo code from manual

VPSRAVD (VEX.128 version)
COUNT_0 = SRC2[31 : 0]
(* Repeat Each COUNT_i for the 2nd through 4th dwords of SRC2*)
COUNT_3 = SRC2[100 : 96]; //<------------------------------------- Possibly a bug
DEST[31:0] = SignExtend(SRC1[31:0] >> COUNT_0);
(* Repeat shift operation for 2nd through 4th dwords *)
DEST[127:96] = SignExtend(SRC1[127:96] >> COUNT_3);
DEST[MAXVL-1:128] = 0;

I expect the bold portion above is a bug; it should be SRC2[127 : 96].

Test Input (in Hex):

%ymm2:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - **80 00 00 00** - 00 00 00 00 - 00 00 00 00 - 00 00 00 00

%ymm3:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  - **00 00 00 20** - 00 00 00 00  - 00 00 00 00  - 00 00 00 00

As per the manual, we should select just 5 bits from `00 00 00 20`, whereas the hardware execution semantics use all 32 bits.

 

Output as per manual:

 0x0₁₂₈ ∘ (0x80000000₃₂ sign_shift_right 0x0₃₂) ∘
          (0x0₃₂ sign_shift_right 0x0₃₂) ∘
          (0x0₃₂ sign_shift_right 0x0₃₂) ∘
          (0x0₃₂ sign_shift_right 0x0₃₂)

Output as per actual Intel hardware (Intel(R) Xeon(R) CPU E3-1505M):

 0x0₆₄ ∘ 0x0₆₄ ∘ 0xffffffff00000000₆₄ ∘ 0x0₆₄
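
A minimal intrinsics test with the same input (my own sketch; compiled with AVX2 enabled) reproduces the hardware behaviour above:

#include <immintrin.h>
#include <cstdio>

int main()
{
    // dword lane 3 = 0x80000000, shift count for lane 3 = 32 (0x20), all other lanes zero.
    __m128i v = _mm_set_epi32((int)0x80000000, 0, 0, 0);
    __m128i c = _mm_set_epi32(32, 0, 0, 0);
    __m128i r = _mm_srav_epi32(v, c);              // VPSRAVD xmm, xmm, xmm
    alignas(16) unsigned int out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), r);
    // Per the manual's 5-bit pseudocode the count would be 0 and lane 3 would stay 0x80000000;
    // the hardware instead honours the full 32-bit count, so lane 3 prints ffffffff here.
    std::printf("%08x %08x %08x %08x\n", out[3], out[2], out[1], out[0]);
}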

 

The same probable typo appears in the pseudocode for these instructions:

VPSLLVD (VEX.128 version)
VPSLLVD (VEX.256 version)
VPSLLVQ (VEX.256 version)
VPSRAVD (VEX.256 version)

Also, there seems to be a typo in the  description text of VPSRAVW/VPSRAVD/VPSRAVQ. There are two paragraphs starting with "The count values..."; the second one should be deleted.

Bug Report 2: packsswb

There seems to be a bug in the descriptive text:

If the signed doubleword value is beyond the range of an unsigned word (i.e. greater than 7FH or less than 80H), ...

 

In my opinion, the description should say "range of a signed word" instead.

Enabling Mon feature using IA32_MISC_ENABLES


Hi everyone,

During the past few days I have been reading the Intel manual about the MONITOR/MWAIT instructions, and as I understand it, this feature can be disabled and re-enabled using the IA32_MISC_ENABLE MSR.

I have exactly the problem described in this topic on my MacBook Pro.

The problem is that the MONITOR feature is reported as disabled when I check it in Windows, but enabled when checking it in OS X (on a dual-boot system, not a virtual machine).

I checked this by using MacCPUID, where CPUID(EAX=01H, ECX=0):ECX[3] is set in OS X, and by using GNU Win32 CPUID, which says the feature is not enabled, so it seems my OS has turned this feature off.

I also checked this by using the following assembly in Windows:

0:  48 31 c0                xor    rax,rax
3:  48 c7 c0 01 00 00 00    mov    rax,0x1
a:  0f a2                   cpuid

Then I found that bit 3 of ECX is zero, so MONITOR is reported as disabled.

In the above topic, one of the answers suggests enabling this feature by changing the MSR this way:

0:  b9 a0 01 00 00          mov    ecx,0x1a0
5:  0f 32                   rdmsr
7:  83 e2 04                and    edx,0x4
a:  25 89 18 c5 00          and    eax,0xc51889
f:  0d 00 00 04 00          or     eax,0x40000
14: 0f 30                   wrmsr

So I ran the above code in an x64 Windows driver, tested bit 3 of ECX by running CPUID with EAX=1H again, and saw that the bit is still zero.

I'm using an i7-6820HQ, which is a Skylake (6th Generation Intel Core).
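
For reference, the same MSR update can be expressed with compiler intrinsics inside the driver (an untested sketch; it must run at ring 0):

#include <intrin.h>

void TryEnableMonitorMwait()
{
    // IA32_MISC_ENABLE is MSR 0x1A0; bit 18 is "ENABLE MONITOR FSM".
    unsigned __int64 misc = __readmsr(0x1A0);
    __writemsr(0x1A0, misc | (1ull << 18));

    // Re-check CPUID.01H:ECX[3] (MONITOR/MWAIT support).
    int regs[4];
    __cpuid(regs, 1);
    bool monitorEnabled = (regs[2] & (1 << 3)) != 0;
    (void)monitorEnabled;
}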

So my questions are:

  • Is it correct to enable the MONITOR feature by using the above code?
  • Is it even possible to enable this feature when the OS is already loaded and the CPU is in protected mode?