Channel: Intel® Software - Intel ISA Extensions

Software Development Emulator (Mix Histogramming Tool - Interpretation)


I need to understand the file generated by the SDE command: path-to-kit/sde64 -mix -omix path-to-file -- user-application

I searched for documentation without success. Can somebody help me?


FIPS 140-2 and Intel cryptographic implementations


Is there any information regarding FIPS 140-2 certification of any of Intel's cryptographic implementations?  Just curious.  I did not find anything online in a brief search.  Thank you for any light you can shed!

New instruction proposal: memcmp-like int compare


One of the most used functions in our app works like the C memcmp function for various data types. For integer types up to 32 bits it is enough to cast them to a wider type and subtract; for int64 we have to use the expression "a < b ? -1 : a > b". Recently I looked for a dedicated instruction that provides such functionality, but did not find any. Please add a new instruction that compares two integers and returns a negative/zero/positive value.
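For reference, a minimal sketch of the workaround described above (the function names are mine, not from our code base):

#include <cstdint>

// For types up to 32 bits: widen and subtract; the sign of the
// difference is the comparison result.
inline int64_t compare_i32(int32_t a, int32_t b)
{
    return static_cast<int64_t>(a) - static_cast<int64_t>(b);
}

// For int64 there is no wider type to widen into, so explicit
// comparisons are needed (the expression quoted above).
inline int compare_i64(int64_t a, int64_t b)
{
    return a < b ? -1 : a > b;
}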

AVX-512 tzcnt


Hi @ all,

I am currently working with AVX-512 on a KNL. I wanted to test one of the new conflict detection functions (_mm512_conflict_epi32). The result of this call is a zero-extended bit vector. To use it for further computations, I want a vectorized trailing-zero count, but I only found a vectorized leading-zero count. _tzcnt_u32 and _mm_tzcnt_32 work on scalar types. Is it right that I have to perform the trailing-zero count at scalar level, or does anyone know a vectorized way (maybe by reversing the bit order and performing a leading-zero count afterwards)? Thanks for your effort!
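One possible direction (a minimal sketch only, assuming AVX-512F plus AVX-512CD as on KNL; the function name is mine): emulate a per-lane trailing-zero count from the vectorized leading-zero count via the identity tzcnt(x) = 32 - lzcnt(~x & (x - 1)), which also yields 32 for zero lanes like the scalar instruction.

#include <immintrin.h>

/* Emulated per-lane trailing-zero count for 32-bit lanes,
   built on the AVX-512CD vectorized lzcnt. */
static inline __m512i mm512_tzcnt_epi32_emulated(__m512i x)
{
    const __m512i one = _mm512_set1_epi32(1);
    const __m512i c32 = _mm512_set1_epi32(32);
    /* ~x & (x - 1): a mask of all bits below the lowest set bit of x */
    __m512i below = _mm512_andnot_epi32(x, _mm512_sub_epi32(x, one));
    /* tzcnt(x) = 32 - lzcnt(below) */
    return _mm512_sub_epi32(c32, _mm512_lzcnt_epi32(below));
}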

Sincerely yours

Using scatterstore to interleave two (sparse) 512-bit vector registers


Hi @ all,

I have two 512-bit vector registers and one __mmask16. The two registers contain (sparse) data:

idx:       15  14  13  12  11  10  09  08  07  06  05  04  03  02  01  00
zmmVidx:  120 112 104 096 088 080 072 064 056 048 040 032 024 016 008 000
zmm0:      00  11  00  00  10  00  00  00  00  00  01  00  00  00  00  00
zmm1:      00  00  A0  00  00  B0  00  00  00  00  00  F0  00  00  00  00
mask1:      0   1   0   0   1   0   0   0   0   0   1   0   0   0   0   0
mask2:      0   0   1   0   0   1   0   0   0   0   0   1   0   0   0   0

I want to interleave the two vector registers and store them contiguously into 32 bytes of memory, starting with zmm0:

idx:  ... 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
mem:  ... 00 00 00 00 00 00 00 00 00 00 A0 11 B0 10 F0 01

I achieved that behaviour with the following code:

/* ... */
__m512i   zmm2  = _mm512_maskz_compress_epi32( mask1, zmm0 );
__m512i   zmm3  = _mm512_maskz_compress_epi32( mask2, zmm1 );
/* store mask: lanes where the compressed zmm3 is non-zero (4 means != ) */
__mmask16 mask3 = _mm512_cmp_epi32_mask( zmm3, _mm512_setzero_epi32(), 4 );

_mm512_mask_i32scatter_epi32( mem,                      mask3, zmmVidx, zmm2, 1 );
_mm512_mask_i32scatter_epi32( mem + sizeof( uint32_t ), mask3, zmmVidx, zmm3, 1 );

I first align the sparse data contiguously in the vector register and store them afterwards. Is it possible to directly perform a masked interleave to memory, so that one can avoid the scatter store and use a contiguous store operation?
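One possible alternative, as a hedged sketch only (it assumes at most 8 surviving elements per compressed register so the interleaved result fits into a single zmm; the index vector and the popcount logic are mine, not from the code above): zip the two compressed registers in-register with VPERMT2D and issue one contiguous masked store.

__m512i zipIdx = _mm512_set_epi32( 23, 7, 22, 6, 21, 5, 20, 4,
                                   19, 3, 18, 2, 17, 1, 16, 0 );
/* indices 0..15 pick lanes from zmm2, indices 16..31 pick lanes from zmm3 */
__m512i zipped = _mm512_permutex2var_epi32( zmm2, zipIdx, zmm3 );
/* number of survivors per source; both masks have the same popcount here */
unsigned  n         = _mm_popcnt_u32( mask1 );
__mmask16 storeMask = (__mmask16)( ( 1u << ( 2 * n ) ) - 1 );
_mm512_mask_storeu_epi32( mem, storeMask, zipped );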

 

Sincerely yours

Costs of scatter store


I'm playing around with the new AVX-512 instruction sets and I am trying to understand how they work and how one can use them.

What I am trying to do is interleave specific data, selected by a mask. My little benchmark loads x*32 bytes of aligned data from memory into two vector registers and compresses them using a dynamically created mask (fig. 1). The resulting vector registers are scattered into memory so that the two vector registers are interleaved (fig. 2).

 

Figure 1: Compressing the two data vector registers using the same dynamically created mask.

 

Figure 2: Scatter store to interleave the compressed data.

My code looks like the following:

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   /* Initialize a vector register containing zeroes to get the store mask */
   __m512i zeroVec     = _mm512_setzero_epi32();
   /* Load data (64-byte aligned) */
   __m512i dataVec_1   = _mm512_load_epi32( data );
   __m512i dataVec_2   = _mm512_load_epi32( data + 16 );
   /* Compress the data */
   __m512i compVec_1   = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2   = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );

   /* Get the store mask by comparing the compressed register with the zero register (4 means !=) */
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );

   /* Interleave the selected data */
   _mm512_mask_i32scatter_epi32(
      result,
      maskStore,
      vindex,
      compVec_1,
      1
   );
   _mm512_mask_i32scatter_epi32(
      result + 1,
      maskStore,
      vindex,
      compVec_2,
      1
   );
}

I compiled everything with

-O3 -march=knl -lmemkind -mavx512f -mavx512pf

I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store, I repeated the measurement with different values of maskCompress. I expected some kind of dependence between the execution time and the number of set bits in maskCompress, but I observed that the tests needed roughly the same time to execute.

I did a little research and came across this: Instruction latency of AVX-512. According to the given link, the latencies of the instructions used are constant. But to be honest, I am a little bit confused by this behaviour.

I know that this is not the usual way, but I have 3 questions related to this topic, and I am hopeful that someone can help me out.

  1. Why does a masked store with only one set bit need the same time as a masked store where all bits are set?

  2. Does anyone have experience with this, or is there good documentation for understanding the behaviour of the AVX-512 scatter store?

  3. Is there an easier or more performant way to interleave two vector registers?

Thanks for your help!

Sincerely

Pin tool AVX scalar/vector identification


Hi,

Is there a way in Pin (a Pin API function) to find out whether a specific instruction is the vector (packed) or the scalar version of an instruction? I am able to find out the ISA class/category/extension using the Pin API, but that doesn't tell me whether or not the instruction is truly a SIMD instruction (operates on 256 bits vs. 64 bits). For instance, VMULPD and VMULSD are both AVX, but not necessarily both "true" vector instructions.

Currently I am looking at the largest operand size and classifying based on the returned value, but I am not sure whether that is a valid approach.

I looked for something like INS_IsScalar() / INS_IsVector() in the Pin API documentation but couldn't find anything similar.

Greatly appreciate your help and pointers.

-Mark

How to change Intel BIOS Settings


Hello, everyone!

I want to know how I can change my BIOS settings, and also how I can update the BIOS on my system.

Thanks.


Check arrays for equality with SIMD


What's the fastest way to check two unsigned char arrays of arbitrary (runtime-determined) size for equality in C++?

I'm using Visual Studio 2017 and Intel Compiler 2017.
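A minimal sketch of one common approach (my own function name; it assumes AVX2 is available and handles the tail with plain memcmp), not a benchmarked answer:

#include <immintrin.h>
#include <cstring>
#include <cstddef>

bool arrays_equal_avx2(const unsigned char* a, const unsigned char* b, std::size_t n)
{
    std::size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        __m256i eq = _mm256_cmpeq_epi8(va, vb);
        if (_mm256_movemask_epi8(eq) != -1)   // some lane differs
            return false;
    }
    return std::memcmp(a + i, b + i, n - i) == 0;  // remaining tail bytes
}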

Binary operations per clock cycle in Intel Xeon Phi processors.


To see the acceleration of XNOR-nets on CPUs, I have been reading a paper which claims that most CPUs execute 64 binary operations per clock cycle; the speedup is calculated accordingly.

 

To calculate the speedup of an XNOR-net, I need to know how many binary operations per clock cycle KNL processors can execute. How can I find this information for a given CPU?

 

Does AVX-512 imply that 512 bitwise operations are possible every clock cycle?

If this is indeed correct, can you suggest some material I can use as a reference to code bitwise convolution operations that take advantage of the Intel architecture?
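As a starting point, here is a minimal sketch (my own names; it assumes AVX-512F, and the population count is done with scalar POPCNT since KNL does not offer a vectorized popcount) of the XNOR-plus-popcount kernel that binary convolutions reduce to:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Counts matching bits between a and b: popcount(a XNOR b) over n_words
   64-bit words, n_words being a multiple of 8. */
uint64_t xnor_dot(const uint64_t* a, const uint64_t* b, size_t n_words)
{
    uint64_t matches = 0;
    for (size_t i = 0; i < n_words; i += 8) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        /* one instruction performs 512 bit-operations */
        __m512i x  = _mm512_xor_epi64(va, vb);
        uint64_t tmp[8];
        _mm512_storeu_si512(tmp, x);
        for (int k = 0; k < 8; ++k)
            matches += 64 - _mm_popcnt_u64(tmp[k]);  /* equal bits = 64 - popcount(XOR) */
    }
    return matches;
}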

 

Thank you!

Why is AVX code slower than native C++ code?


Good day, dear experts.
I have the following AVX and native code:

__forceinline double dotProduct_2(const double* u, const double* v)
{
	_mm256_zeroupper();

	__m256d xy         = _mm256_mul_pd(_mm256_load_pd(u), _mm256_load_pd(v));
	/* lanes 0/1 hold u0*v0+u1*v1, lanes 2/3 hold u2*v2+u3*v3 */
	__m256d temp       = _mm256_hadd_pd(xy, xy);
	/* add the two 128-bit halves; both lanes then hold the full sum */
	__m128d dotproduct = _mm_add_pd(_mm256_extractf128_pd(temp, 0), _mm256_extractf128_pd(temp, 1));

	return dotproduct.m128d_f64[0]; /* MSVC-specific element access */
}
__forceinline double dotProduct_1(const D3& a, const D3& b)
{
	return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

And the respective test code:
 

	std::cout << res_1 << " " << res_2 << " " << res_3 << '\n';
	{
		std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

		for (int i = 0; i < (1 << 30); ++i)
		{
			zx_1 += dotProduct_1(aVx[i % 10000], aVx[(i + 1) % 10000]);
		}
		std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

		std::cout << "NAIVE : "<< std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
	}

	{

		std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
		for (int i = 0; i < (1 << 30); ++i)
		{
			zx_2 += dotProduct_2(&aVx[i % 10000][0], &aVx[(i + 1) % 10000][0]);
		}

		std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();

		std::cout << "AVX : "<< std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << '\n';
	}

	std::cout << math::min2(zx_1, zx_2) << " " << zx_1 << " " << zx_2;

Well, all of the data is aligned to 32 bytes (D3 with __declspec... and the aVx array with _mm_malloc()).

And, as I can see, the native variant is equal to or faster than the AVX variant. I can't understand it. Is this normal behaviour? I thought AVX was supposed to be 'super fast'. If not, how can I optimize it? I compile with MSVC 2015 (x64), with /arch:AVX. My hardware is an Intel i7-4750HQ (Haswell).
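One hedged suggestion, as a sketch only (it reuses the names aVx and zx_2 from the test above, assumes only the sum of all dot products matters, and assumes FMA is available on Haswell and enabled, e.g. with /arch:AVX2): keep a vector accumulator across the loop and do the horizontal reduction once at the end, so the hadd/extract sequence is not paid on every iteration.

	__m256d acc = _mm256_setzero_pd();
	for (int i = 0; i < (1 << 30); ++i)
	{
		__m256d u = _mm256_load_pd(&aVx[i % 10000][0]);
		__m256d v = _mm256_load_pd(&aVx[(i + 1) % 10000][0]);
		acc = _mm256_fmadd_pd(u, v, acc); /* vertical multiply-add only */
	}
	/* one horizontal reduction after the loop */
	__m128d lo  = _mm256_castpd256_pd128(acc);
	__m128d hi  = _mm256_extractf128_pd(acc, 1);
	__m128d sum = _mm_add_pd(lo, hi);
	sum = _mm_add_sd(sum, _mm_unpackhi_pd(sum, sum));
	zx_2 += _mm_cvtsd_f64(sum);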

Which processor families support RDPID instruction?


Which processors support the RDPID instruction?

How do I programmatically check if there is support for RDPID?
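A minimal sketch of the programmatic check (assuming a GCC/Clang-style <cpuid.h>; on MSVC the __cpuidex intrinsic plays the same role): RDPID support is enumerated by CPUID leaf 7, sub-leaf 0, ECX bit 22.

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper; __get_cpuid_count needs a reasonably recent compiler */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 22)))
        printf("RDPID supported\n");
    else
        printf("RDPID not supported\n");
    return 0;
}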

Thanks.

David

Where is _mm512_i32extgather_epi32 defined?


I would like to use _mm512_i32extgather_epi32.

The Intrinsics Guide states that I need to include "immintrin.h", but it is not in that file. It could have been a typo, so I searched the other intrinsics header files, but nothing came up.

This page (https://software.intel.com/en-us/node/523513) states that "[t]his intrinsic only applies to Intel® Many Integrated Core Architecture (Intel® MIC Architecture)." However, the Intrinsics Guide states that it is AVX512F. I need this specific one because I need the _MM_UPCONV_EPI32_UINT8 up-conversion.

Question: where can I find this intrinsic?

Regards, Henk.

Does Non-Temporal Load/Store Hints have different behavior for WC, WB memory type?


I'm a little confused about the non-temporal hints described below; could any expert comment?

A non-temporal hint is used for scenarios where data may be used only once; it can help avoid cache pollution.

But:

1. From the Intel Software Developer's Manual, it seems that non-temporal hints mostly apply to memory regions of the WC (write-combining) type, but the WC type is inherently uncacheable. Why do we still need NT hints?

2. What is the detailed behavior when we use NT hints for loads and stores?

3. What is the difference when the target memory type is WB versus WC?

Ref:

1. Intel SDM Volume 2, MOVNT* instruction descriptions.

2. https://software.intel.com/en-us/forums/intel-isa-extensions/topic/597075

3. I also did some experiments on an i5-2400 with the WB memory type; the NT store hint seems to work as desired, but the NT load has no effect.
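For reference, a minimal sketch of what such an experiment uses at the intrinsic level (not the actual test code; it assumes SSE2 for the NT store and SSE4.1 for MOVNTDQA):

#include <stddef.h>
#include <emmintrin.h>   /* _mm_stream_si128, _mm_sfence */
#include <smmintrin.h>   /* _mm_stream_load_si128 */

void nt_copy(const __m128i* src, __m128i* dst, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        __m128i v = _mm_stream_load_si128((__m128i*)&src[i]); /* MOVNTDQA: NT load hint */
        _mm_stream_si128(&dst[i], v);                         /* MOVNTDQ: NT store, bypasses the cache */
    }
    _mm_sfence(); /* NT stores are weakly ordered; fence before the data is consumed elsewhere */
}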

 

Running _mm512_popcnt_epi32 on colfax using icc?


How can I run _mm512_popcnt_epi32 on the Colfax KNL 7210? Does the VPOPCNTD instruction need to be enabled? How can I do this?

I also need to include <iostream> in the C file; however, I am using icc. How can I do this?

#include <stdio.h>
#include <immintrin.h>

int main(void){
        /* initialize the inputs so the AND has defined operands */
        __m512i k = _mm512_set1_epi32(0x0F0F0F0F);
        __m512i b = _mm512_set1_epi32(0x00FF00FF);
        __m512i c = _mm512_and_epi32(k, b);

        /* _mm512_popcnt_epi32 returns a __m512i with 16 per-lane counts,
           not a single scalar total */
        __m512i len = _mm512_popcnt_epi32(c);

        int out[16];
        _mm512_storeu_si512(out, len);
        for (int i = 0; i < 16; ++i)
                printf("%d ", out[i]);
        printf("\n");
        return 0;
}

I have another question. Does the function _mm512_popcnt_epi32 return a pointer to an array of 16 integers denoting the population count of each of the 16 integers packed into the __m512i data type? Is assigning this to an int incorrect? I thought that the function returned the total popcount of all 512 bits in the __m512i data type.

I am running the code above using the following submission script:

cd ~/benchmarking/

icc matmul.c -o mat.out -xMIC-AVX512

./mat.out

Is there something I am missing here?

Thank you!


Understanding Bitwise Intrinsics


Hello, 

This is a very simple question, but I have just started out with the AVX-512 ISA.

_mm512_xor_epi64: this intrinsic takes two __m512i values, which I am led to believe are each packed with 8 int64 values, and it returns a __m512i containing the 8 integers that result from the lane-wise bitwise XOR.

For instance, suppose the __m512i A is packed with the int64 values (2,1,4,5,2,2,3,1), and B is packed with (4,2,1,3,5,2,1,3).

Then running _mm512_xor_epi64(A, B) will return a __m512i which contains (xor(2,4), xor(1,2), xor(4,1), xor(5,3), xor(2,5), xor(2,2), xor(3,1), xor(1,3)), where xor() is the bitwise XOR of the binary forms of the two operands. Isn't it?

Does extern __m512i __cdecl _mm512_load_epi64(void const* mt); allow me to pack an allocated array of 8 int64 values into a __m512i, provided mt is a pointer to such an array?

Is there an intrinsic to unpack such values back to memory? I would really appreciate some help in the right direction.
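A minimal sketch of the round trip (assuming AVX-512F, an ICC/GCC-style alignment attribute, and the example values from above):

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t a[8] __attribute__((aligned(64))) = {2, 1, 4, 5, 2, 2, 3, 1};
    int64_t b[8] __attribute__((aligned(64))) = {4, 2, 1, 3, 5, 2, 1, 3};
    int64_t r[8] __attribute__((aligned(64)));

    __m512i va = _mm512_load_epi64(a);        /* pack 8 int64 into a register */
    __m512i vb = _mm512_load_epi64(b);
    __m512i vr = _mm512_xor_epi64(va, vb);    /* lane-wise bitwise XOR */
    _mm512_store_epi64(r, vr);                /* unpack back to memory */

    for (int i = 0; i < 8; ++i)
        printf("%lld ", (long long)r[i]);     /* prints: 6 3 5 6 7 0 2 2 */
    printf("\n");
    return 0;
}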

Transpose for 4x4 Matrix(double)


Hello, everyone

I have a problem with a data transpose for a 4x4 matrix of double-precision data.

I have used _MM_TRANSPOSE4_PS to transpose float data before, but that handles 4x4 single-precision data.

I did not find a similar macro for 4x4 double-precision data.

Now I use AVX instructions to implement the transpose as follows:

    vunpcklpd   ymm4, ymm0, ymm1
    vunpckhpd   ymm5, ymm0, ymm1
    vunpcklpd   ymm6, ymm2, ymm3
    vunpckhpd   ymm7, ymm2, ymm3
    vperm2f128  ymm0, ymm4, ymm6, 020H
    vperm2f128  ymm1, ymm5, ymm7, 020H
    vperm2f128  ymm2, ymm4, ymm6, 031H
    vperm2f128  ymm3, ymm5, ymm7, 031H

I can implement the transpose with the instructions above, but I found that vperm2f128's efficiency is poor, so my transpose is very slow!

So, I want to know how to optimize the transpose for double.

Would anyone like to tell me how to do this?
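One possible alternative, as a sketch only (it assumes the matrix can be loaded from memory; the function and variable names are mine): build the intermediate registers with 128-bit loads plus vinsertf128, whose memory form is cheap on most microarchitectures, and finish with unpacks, avoiding vperm2f128 entirely.

#include <immintrin.h>

/* Transpose a 4x4 matrix of doubles: out[j][i] = in[i][j]. */
void transpose4x4d(const double in[4][4], double out[4][4])
{
    /* combine rows 0/2 and 1/3 in 128-bit halves */
    __m256d t0 = _mm256_insertf128_pd(_mm256_castpd128_pd256(_mm_loadu_pd(&in[0][0])), _mm_loadu_pd(&in[2][0]), 1); /* a0 a1 c0 c1 */
    __m256d t1 = _mm256_insertf128_pd(_mm256_castpd128_pd256(_mm_loadu_pd(&in[1][0])), _mm_loadu_pd(&in[3][0]), 1); /* b0 b1 d0 d1 */
    __m256d t2 = _mm256_insertf128_pd(_mm256_castpd128_pd256(_mm_loadu_pd(&in[0][2])), _mm_loadu_pd(&in[2][2]), 1); /* a2 a3 c2 c3 */
    __m256d t3 = _mm256_insertf128_pd(_mm256_castpd128_pd256(_mm_loadu_pd(&in[1][2])), _mm_loadu_pd(&in[3][2]), 1); /* b2 b3 d2 d3 */

    _mm256_storeu_pd(&out[0][0], _mm256_unpacklo_pd(t0, t1)); /* a0 b0 c0 d0 */
    _mm256_storeu_pd(&out[1][0], _mm256_unpackhi_pd(t0, t1)); /* a1 b1 c1 d1 */
    _mm256_storeu_pd(&out[2][0], _mm256_unpacklo_pd(t2, t3)); /* a2 b2 c2 d2 */
    _mm256_storeu_pd(&out[3][0], _mm256_unpackhi_pd(t2, t3)); /* a3 b3 c3 d3 */
}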

Complex multiply–accumulate


Hi everyone,

I need to do a multiply–multiply–accumulate operation over complex arrays.

More exactly, I need a <-- a + alpha * c * d, with a, c and d complex values and alpha a real.

It is a key point of my algorithm, and I have run many tests trying to find the best solution. However, it still looks very slow to me (computing the FFT is faster!).

Currently, I use code based on intrinsics. Attached is the version for AVX. I suppose any solution can be extended easily to SSE and/or AVX-512.

 

	__m256 alphaVect = _mm256_set1_ps(Alpha);

	for (int i = 0; i < 2*size/8; i++)
	{
		__m256 C_vec = _mm256_loadu_ps(C + i*8);
		__m256 D_vec = _mm256_loadu_ps(D + i*8);

		/* complex multiply C*D via moveldup/movehdup */
		__m256 CD  = _mm256_mul_ps(_mm256_moveldup_ps(D_vec), C_vec);
		__m256 CCD = _mm256_mul_ps(_mm256_movehdup_ps(D_vec), C_vec);
		CCD = _mm256_shuffle_ps(CCD, CCD, _MM_SHUFFLE(2,3,0,1));

		/* even lanes: re*re - im*im, odd lanes: im*re + re*im */
		__m256 valCD = _mm256_addsub_ps(CD, CCD);
	#if __FMA__
		__m256 val = _mm256_fmadd_ps(valCD, alphaVect, _mm256_loadu_ps(dst + i*8)); /* a*b+c */
	#else
		__m256 val = _mm256_add_ps(_mm256_mul_ps(valCD, alphaVect), _mm256_loadu_ps(dst + i*8));
	#endif
		_mm256_storeu_ps(dst + i*8, val);
	}
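A hedged variation that might be worth trying (it assumes FMA is available, as in the #if branch above): fold the first multiply and the add/sub into a single _mm256_fmaddsub_ps, saving one instruction per iteration. The loop body would then look like this:

		__m256 C_vec = _mm256_loadu_ps(C + i*8);
		__m256 D_vec = _mm256_loadu_ps(D + i*8);
		__m256 CCD   = _mm256_mul_ps(_mm256_movehdup_ps(D_vec), C_vec);
		CCD = _mm256_shuffle_ps(CCD, CCD, _MM_SHUFFLE(2,3,0,1));
		/* even lanes: re*re - im*im, odd lanes: im*re + re*im */
		__m256 valCD = _mm256_fmaddsub_ps(_mm256_moveldup_ps(D_vec), C_vec, CCD);
		__m256 val   = _mm256_fmadd_ps(valCD, alphaVect, _mm256_loadu_ps(dst + i*8));
		_mm256_storeu_ps(dst + i*8, val);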

 

Does someone have a better idea or solution?

If it is possible to do this operation with IPP (like the multiplication ippsMulPack_32f, ...), MKL, etc., I could not find the solution.

AVX512 auto-vectorization on i9-7900X


Hello,

I wrote a small piece of code to test the auto-vectorization options of the icc compiler. The code sums two arrays (vectors) of doubles and stores the result in a third array. I compiled it with -xCORE-AVX2 and with -xCORE-AVX512 and expected some portion of the theoretical 2x maximum speedup, but what I saw was almost the same execution time for both versions (sometimes even worse). At first I thought the cause was the size of the arrays, which don't fit in L1, but when I repeated the runs with arrays of 256 numbers I got the same result. I noticed the same effect with a series of other scientific apps, and I just can't accept that for every single app I tried I can't get any speedup at all (unless the app was precompiled somewhere).

So I checked the objdump of the AVX-512 binary for my toy program and noticed that there was no usage of zmm registers in the hot loop. However, when I used an online compiler explorer (https://godbolt.org/#) I could see beautiful AVX-512 code, which I expected to see on my machine as well. The funny thing is that I even use a more recent version of the icc compiler, and I just can't get it to produce that code. What could be the issue here? Are there any hints I have to give the compiler? That doesn't seem to bother the online compiler I tested. I tried different pragmas, the trip counts are known in advance, the arrays are aligned, and qopt-report says it vectorized the code.

Any suggestion would be helpful.

Here is the C code for the toy app.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <immintrin.h>   /* _mm_malloc */

#define N 32768
#define MRUNS 100000000

void init_arrays(double *a, double *b, int n)
{
    int i;
    for(i=0; i<n; i++)
    {
        a[i] = rand();
        b[i] = rand();
    }
}

int main(int argc, char* argv[])
{
    double *a, *b, *c;
    int i, j;
    double sum = 0.0;
    srand(time(NULL));
    a = (double*) _mm_malloc(N * sizeof(double), 64);
    b = (double*) _mm_malloc(N * sizeof(double), 64);
    c = (double*) _mm_malloc(N * sizeof(double), 64);
    init_arrays(a, b, N);

    for(i=0; i<MRUNS; i++)
    {
        for(j=0; j<N; j++)
        {
            c[j] = a[j]+b[j];
        }
    }

    for(i=0; i<N; i++)
        sum += c[i];
    printf("%f\n", sum);   /* keep the result live so the loops are not optimized away */
    return 0;
}

Here is a portion of the assembly generated by ICC on my machine:

..B1.12:                        # Preds ..B1.12 ..B1.11
                                # Execution count [3.28e+10]
..L12:
                # optimization report
                # LOOP WAS UNROLLED BY 4
                # LOOP WAS VECTORIZED
                # VECTORIZATION SPEEDUP COEFFECIENT 6.402344
                # VECTOR TRIP COUNT IS KNOWN CONSTANT
                # VECTOR LENGTH 4
                # MAIN VECTOR TYPE: 64-bits floating point
        vmovupd   (%r12,%rcx,8), %ymm0                          #36.20
        vmovupd   32(%r12,%rcx,8), %ymm2                        #36.20
        vmovupd   64(%r12,%rcx,8), %ymm4                        #36.20
        vmovupd   96(%r12,%rcx,8), %ymm6                        #36.20
        vaddpd    (%rbx,%rcx,8), %ymm0, %ymm1                   #36.25
        vaddpd    32(%rbx,%rcx,8), %ymm2, %ymm3                 #36.25
        vaddpd    64(%rbx,%rcx,8), %ymm4, %ymm5                 #36.25
        vaddpd    96(%rbx,%rcx,8), %ymm6, %ymm7                 #36.25
        vmovupd   %ymm1, (%rax,%rcx,8)                          #36.13
        vmovupd   %ymm3, 32(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm5, 64(%rax,%rcx,8)                        #36.13
        vmovupd   %ymm7, 96(%rax,%rcx,8)                        #36.13
        addq      $16, %rcx                                     #34.9
        cmpq      $32768, %rcx                                  #34.9
        jb        ..B1.12       # Prob 99%                      #34.9

And here is the assembly for the same loop generated by the online ICC compiler.

vmovups  zmm0, ZMMWORD PTR [r12+rcx*8]
vmovups  zmm2, ZMMWORD PTR [r12+rcx*8+0x40]
vmovups  zmm4, ZMMWORD PTR [r12+rcx*8+0x80]
vmovups  zmm6, ZMMWORD PTR [r12+rcx*8+0xc0]
vaddpd   zmm1, zmm0, ZMMWORD PTR [rbx+rcx*8]
vaddpd   zmm3, zmm2, ZMMWORD PTR [rbx+rcx*8+0x40]
vaddpd   zmm5, zmm4, ZMMWORD PTR [rbx+rcx*8+0x80]
vaddpd   zmm7, zmm6, ZMMWORD PTR [rbx+rcx*8+0xc0]
vmovupd  ZMMWORD PTR [rax+rcx*8], zmm1
vmovupd  ZMMWORD PTR [rax+rcx*8+0x40], zmm3
vmovupd  ZMMWORD PTR [rax+rcx*8+0x80], zmm5
vmovupd  ZMMWORD PTR [rax+rcx*8+0xc0], zmm7
add      rcx, 0x20
cmp      rcx, 0x8000
jb       400bf0 <main+0xd0>

I use ICC 17.0.3 and compiler explorer uses 17.0.0.
My CPU is a 10-core i9-7900X. Linux kernel 14.10.0-35-generic. 

AVX vs. SSE: expect to see a larger speedup


What kind of speedup should AVX be expected to give over SSE for vectorized addition? I would expect a 1.5x to 1.7x speedup.
There are more details in this stackoverflow post if someone wants points.
https://stackoverflow.com/questions/47115510/avx-vs-sse-expect-to-see-a-...
-AG


// add int vectors via AVX
__attribute__((noinline))
void add_iv_avx(__m256i *restrict a, __m256i *restrict b, __m256i *restrict out, int N) {

    __m256i *x = __builtin_assume_aligned(a, 32);
    __m256i *y = __builtin_assume_aligned(b, 32);
    __m256i *z = __builtin_assume_aligned(out, 32);

    const int loops = N / 8; // 8 is number of int32 in __m256i
    for(int i=0; i < loops; i++) {
        _mm256_store_si256(&z[i], _mm256_add_epi32(x[i], y[i]));
    }
}

// add int vectors via SSE; https://en.wikipedia.org/wiki/Restrict
__attribute__((noinline))
void add_iv_sse(__m128i *restrict a, __m128i *restrict b, __m128i *restrict out, int N) {

    __m128i *x = __builtin_assume_aligned(a, 16);
    __m128i *y = __builtin_assume_aligned(b, 16);
    __m128i *z = __builtin_assume_aligned(out, 16);

    const int loops = N / sizeof(int);
    for(int i=0; i < loops; i++) {
        //out[i]= _mm_add_epi32(a[i], b[i]); // this also works
        _mm_storeu_si128(&z[i], _mm_add_epi32(x[i], y[i]));
    }
}
