Hi,
I'm trying to implement a permutation that turns an AoS (where each structure holds 4 floats) into an SoA, using SSE, AVX, AVX2 and KNC, and without using gather operations, to find out whether it is worth it.
For example, using KNC, I would like to start from 4 zmm registers:
{A0, A1, ... A15}
{B0, B1, ... B15}
{C0, C1, ... C15}
{D0, D1, ... D15}
and end up with something like:
{A0, A4, A8, A12, B0, B4, B8, B12, C0, C4, C8, C12, D0, D4, D8, D12}
{A1, A5, A9, ...}
{A2, A6, A10, ...}
{A3, A7, A11, ...}
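To make the target layout unambiguous, here is a scalar C sketch of the mapping I have in mind. It is only a reference for checking results, not a SIMD implementation, and the function name and the flat 64-float buffers are my own assumptions: register k (A, B, C, D) is modelled as in[16*k .. 16*k+15], and output register r as out[16*r .. 16*r+15].

```c
#include <stddef.h>

/* Scalar reference for the permutation above (hypothetical helper).
 * in:  64 floats, i.e. 16 structures of 4 floats (the 4 input registers).
 * out: 64 floats, i.e. 4 output registers, one per structure field.
 * Output register r, lane 4*k + j, takes input register k, lane 4*j + r. */
static void aos_to_soa_ref(const float *in, float *out)
{
    for (size_t r = 0; r < 4; ++r)          /* field index / output register */
        for (size_t k = 0; k < 4; ++k)      /* input register: A, B, C, D    */
            for (size_t j = 0; j < 4; ++j)  /* structure within that register */
                out[16 * r + 4 * k + j] = in[16 * k + 4 * j + r];
}
```

Whatever shuffle sequence is used on each architecture should produce the same output as this loop.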
Since the permutation instructions differ significantly between architectures and I would rather not reinvent the wheel, I would be glad if someone could point me to information about this, or share their knowledge.
Thank you in advance.