Channel: Intel® Software - Intel ISA Extensions
Viewing all 685 articles

Vectorization - Speedup expected for SSE and AVX


I am doing a benchmark about vectorization on macOS with the following i7 processor:

$ sysctl -n machdep.cpu.brand_string

    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz

My MacBook Pro is from mid-2014.

I tried different flag options for vectorization: the three that interest me are SSE, AVX, and AVX2.

For my benchmark, I add each element of 2 arrays and store the sum in a third array (array sizes vary from 32^3 to 32^5)

Note that I am working with the `double` type for these arrays.

Here are the functions used in my benchmark code:

**1) First, with SSE vectorization:**

#elif defined(__SSE__)

#include <x86intrin.h>

#define ALIGN 16

void addition_array(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop (assumes size is a multiple of 2)
    for (i = size - 1; i >= 0; i -= 2)
    {
        // Intrinsic SSE syntax
        const __m128d x = _mm_load_pd(a);     // Load two x elements
        const __m128d y = _mm_load_pd(b);     // Load two y elements
        const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
        _mm_store_pd(c, sum);                 // Store two sum elements

        // Advance pointers by 2 since an SSE vector is 128 bits = 16 bytes = 2*sizeof(double)
        a += 2;
        b += 2;
        c += 2;
    }
}
 

**2) Second, with AVX (256-bit) vectorization:**

#ifdef __AVX__

#include <immintrin.h>

#define ALIGN 32

void addition_array(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop (assumes size is a multiple of 4)
    for (i = size - 1; i >= 0; i -= 4)
    {
        // Intrinsic AVX syntax
        const __m256d x = _mm256_load_pd(a);     // Load four x elements
        const __m256d y = _mm256_load_pd(b);     // Load four y elements
        const __m256d sum = _mm256_add_pd(x, y); // Compute four sum elements
        _mm256_store_pd(c, sum);                 // Store four sum elements

        // Advance pointers by 4 since an AVX vector is 256 bits = 32 bytes = 4*sizeof(double)
        a += 4;
        b += 4;
        c += 4;
    }
}

For SSE vectorization, I expect a speedup of around 2, since each SSE vector processes 128 bits = 16 bytes = 2 * sizeof(double) at a time.

The results I get for SSE vectorization are shown in the following figure:

So, I think these results are valid, because the speedup is around a factor of 2.

Now for AVX, I get the following figure:

For AVX vectorization, I expect a speedup of around 4, since each AVX vector processes 256 bits = 32 bytes = 4 * sizeof(double) at a time.

But as you can see, I still get a factor of 2, not 4, for the speedup.

I don't understand why I get the same speedup with SSE and AVX vectorization.

Does it come from the compilation flags, from my processor model, ...? I don't know.

Here are the compilation command lines I used for all the results above:

**For SSE :**

    gcc-mp-4.9 -O3 -msse main_benchmark.c -o vectorizedExe

**For AVX256 :**
 
    gcc-mp-4.9 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe

The entire code is available at http://beulu.com/test_vectorization/main_benchmark.c.txt

and the benchmarking shell script at http://beulu.com/test_vectorization/run_benchmark

Could anyone tell me why I get the same speedup (a factor of 2) for both SSE and AVX?

Thanks for your help


Calculate Miss rate of L2 cache given global and L1 miss rates


If I have a global miss rate across all caches of 5.41% and an L1 miss rate of 9.13%, how can I calculate what the second-level cache miss rate must be?

I found a reference saying that the local miss rate equals misses in a cache divided by the total number of memory accesses to that cache (miss rate L2), and the global miss rate equals misses in a cache divided by the total number of memory accesses generated by the CPU (miss rate L1 × miss rate L2).

So if I divide 5.41% / 9.13%, I get 0.5926, i.e. about 59.26%, for my second-level cache. Is this assumption correct?

RPL, CPL and DPL question


Good day.

In Intel SDM vol. 3 / 5.6 "PRIVILEGE LEVEL CHECKING WHEN ACCESSING DATA SEGMENTS" we can read:

Before the processor loads a segment selector into a segment register, it performs a privilege check (see Figure 5-4) by comparing the privilege levels of the currently running program or task (the CPL), the RPL of the segment selector, and the DPL of the segment's segment descriptor. The processor loads the segment selector into the segment register if the DPL is numerically greater than or equal to both the CPL and the RPL. Otherwise, a general-protection fault is generated and the segment register is not loaded.

So, if we have code that runs at privilege level 3, we can't do

mov ax, 28h  ; selector 28h points to a descriptor with DPL=0
mov ds, ax

because it would lead to #GP, as the CPL=3, RPL=0, DPL=0.

But then in the manual we find this:

It is important to note that the RPL of a segment selector for a data segment is under software control. For example, an application program running at a CPL of 3 can set the RPL for a data-segment selector to 0.

How is this possible?
Or am I misunderstanding something?

Thanks in advance.

SSE/AVX/FMA Unexpected Test Results


Hello,

I created some computations using SSE, AVX, and FMA to test the computation time (in assembly language). I wrote the code in both 32-bit and 64-bit versions.

After creating a DLL with the functions, I used MATLAB to test it. The computation performs matrix-vector multiplication in both versions, M*V and V*M. The test matrix size was 4096x4096.

However, the test results surprised me a little: I see almost no performance boost from AVX/FMA. Even the 64-bit code performs almost the same.

Why is that?

* Code and results are in the attachment.
** Just ignore the notes; they are in my native language (in the result sheet, ZLAVA means V*M and SPRAVA means M*V).

Thanks for any explanation.

 


3D X-Point programming versus OS context switching


Two related questions with regard to Intel's 3D XPoint memory and the behavior of SFENCE and CLWB:

1) If one uses SFENCE, but not CLWB, what synchronization/ordering guarantees does 3D XPoint technology make?
2) Given that the OS can move a thread between cores, what, if anything, does CLWB guarantee about the ordering of writes made by other cores in the system?

Thanks in advance!


Intel Software Development Emulator for Itanium architecture



Popcount emulation for x64 process - RAM memory limit


Hello,

I'm trying to use the Intel Software Development Emulator to emulate the popcount instruction for my CPU (Core 2 Quad Q6600). I need this to run the game Quantum Break, which requires popcount.

The game runs, but very, very slowly. I noticed that no matter what, QuantumBreak.exe uses at most around 2.5 GB of my RAM (I have 8 GB), so that could be the cause of the huge slowdown. As far as I know, 2.5 GB is roughly the limit for a single 32-bit process on Windows, but why does this limitation occur when both the game and my system are x64? I tried Windows 7 x64 and Windows 10 x64 with the same results.

To check whether it's a game problem/bug, I used SDE with a different game (GTA V). Normally it uses around 5 GB of RAM, but under SDE again only 2.5 GB, so there is some kind of limit.

How can this be bypassed? I read -help and -help-long but haven't found anything useful among those switches.

Any help would be much appreciated.

Best Regards

SDE Support for Knight's Mill


Hi,

I'm trying to use SDE to emulate the KNM chip. The release notes for SDE 7.58.0 say this should be possible with the -knm flag. However, when running

sde -knm -- ./binary_file

I get the following error:

An error was detected processing the command line.
Add "-long-help" to the Intel(R) SDE command line to see the options knobs.
BAD OPTION: -knm

Is this a simple mistake on my part, or has KNM support not been fully implemented? Thanks very much.


State of AVX 512 on Skylake-X


As has been stated on a number of review sites, AVX-512 performance on the 6- and 8-core Skylake-X parts is compromised; only on the 10-core part is the present hardware fully enabled.
Would Intel be so kind as to provide in-depth detail on what the performance difference means?
From the vague information available, it seems one of the two (three?) AVX-512 ports is disabled (port 5).
Can we get more detailed information on which ports are used for AVX-512?
Which AVX-512 instructions can those ports execute, and do they have 512-bit data paths to the registers/cache?
How is AVX-512 gather affected on the 6/8-core versus the 10-core parts?
A similar drawing to the one below for AVX2 would be appreciated.

 

 

 


Intel's Assembler


Hi,

Does Intel provide a [macro] assembler? Or, what compiler is best to integrate with System Studio? Or, more generally, what are the best developer tools for working with the Intel AVX2 / AVX-512 extensions on i7 / i9?

I'm a novice here... sorry if this is trivial.

 

Compilers/IDE


Hi,

May I ask: which compilers/IDEs are better for AVX2 / AVX-512 code? And what source code style is preferable: C/C++ with intrinsics? Inline assembler? A macro assembler?

This is a real problem for me: I am starting a new project, and the price of a wrong decision might be high (I'm a novice with this stuff).

 

Software Development Emulator With Intel TXT support


Hi, could someone please help me? I am trying to develop a bootloader using Intel TXT / TPM 1.2, and I want to test my application.

1. How can I debug my loader using Intel SDE? Is it possible to debug a loader (like with QEMU)?

2. What TPM should I use with SDE, and how can I connect a TPM to Intel SDE to look into its internal registers?

 


Data conversion from scalar to vector


To begin with, I am not a very professional programmer; I am just a random guy thrown into this unprepared, so every effort to elaborate on the solution will be appreciated.

I am currently working on importing some features from another code base into mine. The other code uses AVX extension data types, which confuses me because I am very new to these AVX extensions. All my data is stored in scalar variables, as complex<double>, but the other code uses __m512d, so I am wondering how I can convert a complex<double> variable to a __m512d variable. I am writing a wrapper that calls functions in the other code; I will bring my data into the wrapper as scalars, and I need to convert them into vectors so they can be passed as arguments to the other code's functions. I want to keep every line of the other code as intact as possible, so I need to plug my double values into vector variables somehow.

Thanks.

 

Intel please update hard copy manuals @ lulu.com


Intel, please update your printable manuals at lulu.com. Currently that site has a mix of old and new with some duplicates. Thank you.

Some Questions About New Arch (Memory Protection Keys)


Hello,

I would like to do some research based on the new Memory Protection Keys architecture feature, which the Linux kernel also supports, so I need a machine to run some tutorials and experiments on.

However, I cannot find any machines that support it except the newest Xeon Phi, which is too expensive.

Could anyone recommend a machine that supports it (and is not too expensive), or a cloud platform that offers this feature and can be rented for a while?

Best


#SS on inter-privilege IRET


I am talking about the latest SDM, rev. 63, here. In the description of the #SS 'Stack Fault' exception, it is stated that the exception is raised when:

· A not-present stack segment is detected when attempting to load the SS register. This violation can occur during the execution of a task switch, a CALL instruction to a different privilege level, a return to a different privilege level, an LSS instruction, or a MOV or POP instruction to the SS register.

Experiments on real hardware demonstrate that this exception can occur when an inter-privilege IRET is executed and pops an %ss segment register value that references a not-present descriptor. A small issue is that the explanation above does not list IRET explicitly, but it arguably falls under the 'return to a different privilege level' case.

A much more serious problem is that the IRET instruction documentation lists #NP (segment not present) for this situation:

#NP(selector) If the return code or stack segment is not present.

In particular, it seems that QEMU emulates #NP, following the documentation.

Of course, I may be wrong in interpreting the experiment results, but it would be useful to clarify either the IRET instruction specification and/or the #SS description.

C++ runtime use SSE/SSE2


I see that the Intel compiler 17.0 (in the Release configuration) translates C runtime functions (like "strlen" or "memcpy") into inline functions that use SSE/SSE2.

1) Is it possible to take a look at these functions? I am sure they are pretty good as samples; maybe somebody knows?

2) How efficient is using SSE2 for something as trivial as "strlen"?

 

Any docs or manuals about MPK(memory protection keys)


Hello guys,

Since I am interested in the new hardware feature MPK (Memory Protection Keys), I am wondering: are there any docs or manuals about MPK?

I have tried searching for MPK, but I couldn't find anything except an LWN article (https://lwn.net/Articles/643797/) and a kernel doc.

I would appreciate any responses and suggestions!

Best wishes!

 

About MONITOR and MWAIT


Hi,

I wanted to ask why these two instructions are privileged. Unless I am missing something, I can see no reason for that.

The [only] purpose of these two instructions is to save power. They don't do anything that could have security consequences, and they have no effect on the architectural state of the processor whatsoever.

Disallowing them for ring-3 code only means that certain [spinlock] algorithms that run in user space cannot be as power-efficient as in ring 0. But isn't it in Intel's best interest for their processors to be as power-efficient as possible in all situations?

Software Development Emulator for Windows XP


Hello,

I'm desperately looking for a build of Intel SDE that works on Windows XP 32-bit, which I believe is version 5.38. Unfortunately, it is nowhere to be found. Could anybody please share this version as soon as possible?

Thanks in advance
