The problem with VZEROUPPER comes up again now that the recommendation for the Knights Landing processor is the opposite of the recommendation for previous processors.
The history is this: the extension of the vector registers from 128 to 256 bits caused a problem when legacy Windows device drivers saved only the lower 128 bits of the new 256-bit registers. This problem was solved in a rather complex way. The Sandy Bridge processor could switch between a VEX state with full 256-bit registers and a non-VEX state where all the 256-bit registers were split into two 128-bit parts. The switch between these states had a cost of 70 clock cycles. The VZEROUPPER instruction avoids the cost of this state transition by clearing the upper half of all the registers. Alternatively, one can use VZEROALL to clear the registers entirely. Code has to execute VZEROUPPER after any code that uses 256-bit registers if there is any chance that the subsequent code contains non-VEX vector instructions. The recommendation from Intel was to use VZEROUPPER in AVX code before any call or return to an ABI-compliant function with unknown VEX status. The problems were discussed at length in this thread: https://software.intel.com/en-us/forums/intel-isa-extensions/topic/301853
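To make the old rule concrete, here is a minimal sketch in C of AVX code that clears the upper register halves before calling a function of unknown VEX status. The names sum_then_scale, sum_then_scale_avx and legacy_scale are hypothetical and chosen only for illustration. The sketch assumes GCC or Clang on an x86 target, because the target attribute and __builtin_cpu_supports are compiler extensions.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical stand-in for a library function of unknown VEX status.
   In a real program this could contain legacy (non-VEX) SSE instructions. */
static float legacy_scale(float x) { return x * 0.5f; }

/* Sum n floats using 256-bit AVX registers, then pass the sum on to
   legacy_scale. The target attribute (a GCC/Clang extension) enables AVX
   for this one function only. */
__attribute__((target("avx")))
static float sum_then_scale_avx(const float *a, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));

    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; k++)
        sum += lanes[k];
    for (; i < n; i++)              /* remaining elements, fewer than 8 */
        sum += a[i];

    _mm256_zeroupper();             /* VZEROUPPER before leaving the AVX code */
    return legacy_scale(sum);
}

/* Run the AVX version only if the CPU supports AVX; otherwise use scalar code. */
float sum_then_scale(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (__builtin_cpu_supports("avx"))
        return sum_then_scale_avx(a, n);

    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return legacy_scale(sum);
}
```

For example, calling sum_then_scale on the ten values 1 to 10 sums them to 55 and returns half of that. Whether the explicit _mm256_zeroupper() call is still desirable depends on the processor, which is the question raised in the rest of this post.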
This recommendation is still included in the Optimization Reference Manual. However, the same manual says that VZEROUPPER is not recommended on the new Knights Landing processor. (The book Intel Xeon Phi Coprocessor High-Performance Programming, Knights Landing Edition, 2nd ed., Elsevier, 2016, written by three Intel developers, says the same.)
There is obviously a need to clarify these conflicting messages, now that the vector registers are extended further from 256 to 512 bits.
My own observations are these:
- The following processors have expensive state transitions and cheap VZEROUPPER and VZEROALL: Sandy Bridge, Ivy Bridge, Haswell, and Broadwell. VZEROUPPER is needed for performance reasons on these processors.
- There is no expensive state transition on the later Intel processors: Skylake and Knights Landing.
- There is no expensive state transition on AMD processors.
- VZEROUPPER and VZEROALL are expensive on Knights Landing. I have measured 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
It appears that VZEROUPPER is no longer needed on processors later than Broadwell, and that it is harmful on the first processor to support AVX512 (Knights Landing).
Since VZEROUPPER and VZEROALL affect only registers zmm0-zmm15, and not zmm16-zmm31, maybe we can avoid the need for these instructions by using only zmm16-zmm31.
In order to reach a new set of recommendations, I would like the Intel people to please answer these questions:
- Is VZEROUPPER needed after AVX512 code that uses only registers zmm16-zmm31?
- Will VZEROUPPER be needed for performance reasons on any processor that supports AVX512?
- Will VZEROUPPER be needed for performance reasons on any future Intel processor?
If the answers to these questions are no, then I may propose the following guidelines:
- AVX code should use VZEROUPPER before calling a library function or other function of unknown VEX status only on processors that support AVX but not AVX512.
- A function library may have CPU dispatching with the following branches: (a) for processors that support SSE but not AVX, use non-VEX instructions; (b) for processors that support AVX but not AVX512, use VEX code and end with VZEROUPPER if any 256-bit registers have been used; (c) for processors that support AVX512, use VEX or EVEX code and do not use VZEROUPPER.
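The dispatching rule proposed above can be sketched as a small decision function in C. This is only an illustration of the proposal, and it is valid only if the answers to the questions above turn out to be no. The names branch_t, dispatch_t and choose_branch are hypothetical. The feature flags are passed in as parameters so that the logic does not depend on any particular CPU detection method.

```c
#include <assert.h>
#include <stdbool.h>

/* The three proposed branches of a CPU-dispatched function library. */
typedef enum {
    BRANCH_SSE,      /* (a) SSE but not AVX: non-VEX instructions     */
    BRANCH_AVX,      /* (b) AVX but not AVX512: VEX code              */
    BRANCH_AVX512    /* (c) AVX512: VEX or EVEX code                  */
} branch_t;

typedef struct {
    branch_t branch;           /* which code version to run           */
    bool end_with_vzeroupper;  /* whether that version should end with
                                  VZEROUPPER                           */
} dispatch_t;

/* Select a branch from the CPU feature flags. used_ymm tells whether the
   AVX branch uses any 256-bit registers. */
static dispatch_t choose_branch(bool has_avx, bool has_avx512, bool used_ymm)
{
    /* A processor with AVX512 is assumed to support AVX as well. */
    assert(!has_avx512 || has_avx);

    dispatch_t d;
    if (has_avx512) {
        d.branch = BRANCH_AVX512;
        d.end_with_vzeroupper = false;     /* proposed: never on AVX512 CPUs */
    } else if (has_avx) {
        d.branch = BRANCH_AVX;
        d.end_with_vzeroupper = used_ymm;  /* only if ymm registers were used */
    } else {
        d.branch = BRANCH_SSE;
        d.end_with_vzeroupper = false;     /* non-VEX code needs no VZEROUPPER */
    }
    return d;
}
```

With GCC or Clang, the flags could for example be obtained from __builtin_cpu_supports("avx") and __builtin_cpu_supports("avx512f"). Other compilers would need their own CPUID-based detection.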
Do you think these guidelines will work? It is important that we reach a useful set of recommendations now that people are beginning to write AVX512 code.