Z15 VX s390x Linux on z support #317

Closed
edelsohn opened this issue Jul 22, 2020 · 37 comments

Comments

@edelsohn
Collaborator

Port Sleef to IBM Z15 VX architecture supporting s390x Linux on Z compiled with both GCC and Clang. Achieve equivalent speedup to x86-64 and AArch64 appropriate for the z VX 128 bit vector width.

With a clone of the GitHub 290x branch from a few days ago, built with GCC on Fedora, I see the following failures:
The following tests FAILED:
8/26 Test #18: roundtriptest2ddp_4_4 ............***Failed 0.10 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 17549
transpose MT(measured): 85036
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 17549
transpose MT(loaded): 85036
complex : NG (0.315141)

  Start 19: roundtriptest2ddp_8_8

19/26 Test #19: roundtriptest2ddp_8_8 ............***Failed 0.05 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 23537
transpose MT(measured): 6460
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 23537
transpose MT(loaded): 6460
complex : NG (0.301525)

  Start 20: roundtriptest2ddp_10_10

20/26 Test #20: roundtriptest2ddp_10_10 ..........***Failed 0.32 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 41145
transpose MT(measured): 10379
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 41145
transpose MT(loaded): 10379
complex : NG (0.306744)

  Start 21: roundtriptest2ddp_5_15

21/26 Test #21: roundtriptest2ddp_5_15 ...........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 4(ST) 2(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 38633
transpose MT(measured): 22803
Path(random) :2(ST) 2(ST) 4(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :3(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 38633
transpose MT(loaded): 22803
complex : NG (0.309993)

  Start 23: roundtriptest2dsp_4_4

23/26 Test #23: roundtriptest2dsp_4_4 ............***Failed 0.13 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 11867
transpose MT(measured): 120471
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 11867
transpose MT(loaded): 120471
complex : NG (0.707168)

  Start 24: roundtriptest2dsp_8_8

24/26 Test #24: roundtriptest2dsp_8_8 ............***Failed 0.03 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 15258
transpose MT(measured): 4248
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 15258
transpose MT(loaded): 4248
complex : NG (0.659438)

  Start 25: roundtriptest2dsp_10_10

25/26 Test #25: roundtriptest2dsp_10_10 ..........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 24401
transpose MT(measured): 6481
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 24401
transpose MT(loaded): 6481
complex : NG (0.661858)

  Start 26: roundtriptest2dsp_5_15

26/26 Test #26: roundtriptest2dsp_5_15 ...........***Failed 0.26 sec
Path(random) :4(ST) 2(ST) 3(ST) 4(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 26029
transpose MT(measured): 13875
Path(random) :4(ST) 2(ST) 2(ST) 3(ST) 3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :1(ST) 4(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 26029
transpose MT(loaded): 13875
complex : NG (0.663971)

@shibatch
Owner

This seems to be a bug in gcc.
I added a workaround.

@barkovv

barkovv commented Aug 24, 2020

@shibatch Are you going to add VX support in this library?

@edelsohn
Collaborator Author

There is Z15 VX support in the #291 Add_s390x_support_rebased branch.

@shibatch
Owner

@edelsohn Can I merge #291?

@edelsohn
Collaborator Author

We definitely would like this merged. With the current, updated sources, I am seeing more failures. I am testing on Fedora with GCC 10.1.
***Exception: Illegal
20% tests passed, 24 tests failed out of 30

@shibatch
Owner

Since you are seeing ***Exception: Illegal, it seems that your computer does not support the ZVECTOR2 extension.

I built gcc-10.2.0 on the LinuxONE VM, and did the test with it. There is no problem.

As you see, it passes the tests on travis.

https://travis-ci.org/github/shibatch/sleef/jobs/719496316

@edelsohn
Collaborator Author

My colleague tested with the master branch and successfully tested on both z14 and z15. As another developer mentioned, the failures seem to be a conflict with FFTW. Thanks for the great initial implementation.

@shibatch
Owner

I added a DISABLE_FFTW option (#327).
I hope this option will solve the problem.

@Andreas-Krebbel
Contributor

> This seems to be a bug in gcc.
> I added a workaround.

Could you please elaborate on what the GCC bug is? I would like to have a look.

@Andreas-Krebbel
Contributor

Hi,

David asked me to have a look at the Z specific changes. The implementation looks very good to me. Thanks for looking into this. Here are a few comments/questions from my side:

  1. Building with -march=z15 makes a difference since we have a hardware vector conversion between float and int then. In my microbenchmarks this makes a difference e.g. with Sleef_sinf4_u35. We probably would require another build variant for this then.

  2. On IBM Z the vector compares are defined to always set all bits in an element to either ones or zeros depending on the comparison result. So if I understand these operations correctly they could be replaced like this (as it is done on Power):
    -static INLINE int vtestallones_i_vo64(vopmask g) { return g[0] == 0xffffffffffffffffLL && g[1] == 0xffffffffffffffffLL; }
    -static INLINE int vtestallones_i_vo32(vopmask g) { return g[0] == 0xffffffffffffffffLL && g[1] == 0xffffffffffffffffLL; }
    +static INLINE int vtestallones_i_vo32(vopmask g) { return vec_all_ne(g, (vopmask){ 0 }); }
    +static INLINE int vtestallones_i_vo64(vopmask g) { return vec_all_ne(g, (vopmask){ 0 }); }

Otherwise the compiler falls back to scalar AND instructions to implement this.

There are also a few other operations on comparison results which could potentially be simplified I think.

  3. zvector and zvector2 are not ideal as names for the cpu facilities. I would prefer the names which are also shown in /proc/cpuinfo. That would be vx for the z13 vector instruction set, vxe for z14, and vxe2 for z15. Renaming this would be quite some effort but I think it would be good to match the facility names which are used elsewhere.

  4. I'm wondering about what the sleef requirements are for min/max wrt NaN and Inf handling? Our hardware instructions provide several different modes and we have to make sure that this matches.
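The reasoning in item 2 can be checked portably. The sketch below uses the GCC/Clang generic vector extension as a stand-in for `vopmask`; the type `mask2` and both function names are hypothetical and not part of Sleef. It assumes that every lane is either all ones or all zeros, which is what the IBM Z vector compares are said to guarantee above. Under that assumption, "every lane is all ones" and "no lane equals zero" agree.

```c
/* Hypothetical stand-in for Sleef's vopmask: two 64-bit lanes,
   using the GCC/Clang generic vector extension. */
typedef unsigned long long mask2 __attribute__((vector_size(16)));

/* Lane-by-lane check, in the style of the lines removed in the diff. */
static int all_ones_scalar(mask2 g) {
  return g[0] == ~0ULL && g[1] == ~0ULL;
}

/* "No lane is zero": the condition vec_all_ne(g, (vopmask){ 0 }) expresses. */
static int all_nonzero(mask2 g) {
  return g[0] != 0 && g[1] != 0;
}
```

The two functions only agree for canonical compare results. A lane such as `1` is nonzero but not all ones, so the replacement is safe only where the mask is known to come from a vector compare.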

@shibatch
Owner

Hello,

As for 1, in order to add support for z14 and z15, I need access to computers with those architectures.

As for 4, sleef basically follows the specification of math functions in the C99 standard.
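For reference, this is a minimal scalar sketch of what C99 specifies for `fmin` when a NaN is involved: if exactly one argument is a NaN, the other argument is returned. The names `is_nan` and `fmin_c99_sketch` are hypothetical, and the thread does not state whether Sleef's internal vector min/max helpers need exactly this behaviour.

```c
/* NaN is the only value that compares unequal to itself. */
static int is_nan(double x) {
  return x != x;
}

/* C99 fmin semantics: a single NaN operand is treated as missing data,
   so the other operand is returned. Two NaNs give a NaN.
   Infinities need no special case; ordinary comparison handles them. */
static double fmin_c99_sketch(double a, double b) {
  if (is_nan(a)) return b;
  if (is_nan(b)) return a;
  return a < b ? a : b;
}
```

A hardware min/max mode matches this only if it also returns the non-NaN operand; a mode that propagates NaN would give a different result for `fmin_c99_sketch(NaN, 1.0)`.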

@edelsohn
Collaborator Author

@shibatch The system to which you have access through the LinuxONE Community Cloud is z15.

@shibatch
Owner

The main problem in implementing z15 support is lack of a reference manual.
I don't even know what new intrinsics are available on z14 and z15.
Could you point a good reference if you know?

@edelsohn
Collaborator Author

The z15 z/Architecture Principles of Operation is available online.

The vector intrinsics are defined in the GCC vecintrin.h header file.

@shibatch
Owner

How can I know which instructions are the new instructions only available on z14 and z15?
There is no clear correspondence between mnemonics and intrinsics either.

@shibatch
Owner

shibatch commented Sep 24, 2020

@Andreas-Krebbel The following is the source code for reproducing the bug.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

#define real double
#define RESTRICT __restrict__
#define ALIGNED(x) __attribute__((aligned(x)))
#define LOG2BS 4

#define BS (1 << LOG2BS)
#define TRANSPOSE_BLOCK(y2) do {                                        \
    for(int x2=y2+1;x2<BS;x2++) {                                       \
      element_t r = *(element_t *)&row[y2].r[x2*2+0];                   \
      *(element_t *)&row[y2].r[x2*2+0] = *(element_t *)&row[x2].r[y2*2+0]; \
      *(element_t *)&row[x2].r[y2*2+0] = r;                             \
    }} while(0)

static void transpose(real *RESTRICT ALIGNED(256) d, real *RESTRICT ALIGNED(256) s, const int log2n, const int log2m) {
  typedef struct { real r[BS*2]; } row_t;
  typedef struct { real r0, r1; } element_t;

  for(int y=0;y<(1 << log2n);y+=BS) {
    for(int x=0;x<(1 << log2m);x+=BS) {
      row_t row[BS];
      for(int y2=0;y2<BS;y2++) {
        row[y2] = *(row_t *)&s[(((y+y2) << log2m)+x)*2];
      }

      TRANSPOSE_BLOCK( 0); TRANSPOSE_BLOCK( 1);
      TRANSPOSE_BLOCK( 2); TRANSPOSE_BLOCK( 3);
      TRANSPOSE_BLOCK( 4); TRANSPOSE_BLOCK( 5);
      TRANSPOSE_BLOCK( 6); TRANSPOSE_BLOCK( 7);
      TRANSPOSE_BLOCK( 8); TRANSPOSE_BLOCK( 9);
      TRANSPOSE_BLOCK(10); TRANSPOSE_BLOCK(11);
      TRANSPOSE_BLOCK(12); TRANSPOSE_BLOCK(13);
      TRANSPOSE_BLOCK(14); TRANSPOSE_BLOCK(15);

      for(int y2=0;y2<BS;y2++) {
        *(row_t *)&d[(((x+y2) << log2n)+y)*2] = row[y2];
      }
    }
  }
}

int main(int argc, char **argv) {
  int n = 5;
  double *s = memalign(256, sizeof(double) * 2 * (1 << n) * (1 << n));
  double *d = memalign(256, sizeof(double) * 2 * (1 << n) * (1 << n));

  double *p = s;
  int cnt = 1;
  for(int y=0;y<(1 << n);y++) {
    for(int x=0;x<(1 << n);x++) {
      *p++ = cnt++;
      *p++ = cnt++;
    }
  }

  transpose(d, s, n, n);

  p = d;
  for(int y=0;y<(1 << n);y++) {
    for(int x=0;x<(1 << n);x++) {
      int n0 = (int)*p++, n1 = (int)*p++;
      printf("(%03x, %03x) ", n0, n1);
    }
    printf("\n");
  }
}
[s390x]~/sleef$ gcc -O2 bug.c
[s390x]~/sleef$ ./a.out > gcc-O2.out
[s390x]~/sleef$ gcc -O0 bug.c
[s390x]~/sleef$ ./a.out > gcc-O0.out
[s390x]~/sleef$ clang bug.c
[s390x]~/sleef$ ./a.out > clang-O0.out
[s390x]~/sleef$ diff gcc-O0.out clang-O0.out
[s390x]~/sleef$ diff gcc-O2.out gcc-O0.out
13,16c13,16
< (019, 01a) (059, 05a) (099, 09a) (0d9, 0da) (119, 11a) (159, 15a) (199, 19a) (1d9, 1da) (219, 21a) (259, 25a) (299, 29a) (2d9, 2da) (319, 31a) (31b, 31c) (31d, 31e) (31f, 320) (419, 41a) (459, 45a) (499, 49a) (4d9, 4da) (519, 51a) (559, 55a) (599, 59a) (5d9, 5da) (619, 61a) (659, 65a) (699, 69a) (6d9, 6da) (719, 71a) (71b, 71c) (71d, 71e) (71f, 720)
...

[s390x]~/sleef$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/s390x-redhat-linux/8/lto-wrapper
Target: s390x-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-gnu-indirect-function --with-long-double-128 --with-arch=z13 --with-tune=z14 --enable-decimal-float --build=s390x-redhat-linux
Thread model: posix
gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)

@Andreas-Krebbel
Contributor

Andreas-Krebbel commented Sep 24, 2020

> The main problem in implementing z15 support is lack of a reference manual.

Sorry for the confusion. I didn't want to ask for implementing full z15 support right now. It would be sufficient to just provide a way to compile with -march=z15. That way the float-int conversion code as in:

static INLINE vint2 vrint_vi2_vf(vfloat vf) {
  vf = vrint_vf_vf(vf);
  return (vint) { vf[0], vf[1], vf[2], vf[3] };
}

becomes just:

vfisb   %v24,%v24,4,4
vcfeb   %v24,%v24,0,5

instead of: (when compiled with -march=z14)

    vfisb   %v0,%v24,4,4
    vzero   %v24
    vlgvf   %r1,%v0,0
    vlvgf   %v6,%r1,0
    cfebr   %r4,5,%f6
    vlgvf   %r1,%v0,1
    vlvgf   %v4,%r1,0
    vlgvf   %r1,%v0,2
    cfebr   %r3,5,%f4
    vlvgf   %v2,%r1,0
    vlgvf   %r1,%v0,3
    cfebr   %r2,5,%f2
    vlvgf   %v0,%r1,0
    cfebr   %r1,5,%f0
    vlvgf   %v24,%r4,0
    vlvgf   %v24,%r3,1
    vlvgf   %v24,%r2,2
    vlvgf   %v24,%r1,3
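
The source pattern behind both listings is a lane-by-lane conversion written as a vector literal, which the compiler lowers to whatever the target offers. A portable sketch of that pattern, using the GCC/Clang generic vector extension, might look as follows. The types `vfloat4` and `vint4` and the function name are hypothetical stand-ins, and the rounding step (`vrint_vf_vf`) is left out, so this version truncates toward zero.

```c
/* Hypothetical stand-ins for Sleef's vfloat and vint types. */
typedef float vfloat4 __attribute__((vector_size(16)));
typedef int   vint4   __attribute__((vector_size(16)));

/* Convert each float lane to int. A C float-to-int conversion
   truncates toward zero. Which instructions are emitted depends on
   the target flags (for example -march=z14 versus -march=z15). */
static vint4 truncate_lanes(vfloat4 vf) {
  return (vint4){ (int)vf[0], (int)vf[1], (int)vf[2], (int)vf[3] };
}
```

Compiling such a function with `-S` under different `-march` values is one way to compare the generated code, as was done for the listings above.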

> I don't even know what new intrinsics are available on z14 and z15.
> Could you point a good reference if you know?

The builtins are documented for the XL compiler here:
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.cbcpx01/vectorbltin.htm

The builtins added with z15 are:

vec_float int->float, unsigned->float
vec_signed now also for float->int
vec_unsigned now also for float->unsigned int
vec_revb - vector byte swaps
vec_reve - vector element swaps
vec_sldb - shift left double by bit
vec_srdb - shift right double by bit
vec_search_string_cc - substring search
vec_search_string_until_zero_cc - substring search for 0 terminated strings

@shibatch
Owner

shibatch commented Oct 6, 2020

Implementing the requested feature is almost done, but it seems that clang-10 has a bug in handling orderedness of comparison. Because of this bug, it passes the tests only if optimizations are turned off.

@Andreas-Krebbel
Contributor

Great, Thanks! Could you please extract a testcase for the clang 10 issue so that we can have a look?

@shibatch
Owner

shibatch commented Oct 6, 2020

Anyway, I made PR #343.
I will make a testcase tomorrow.

@shibatch
Owner

shibatch commented Oct 7, 2020

Below is the testcase.

[s390x]~$ cat bug.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t vreinterpret_vm_vf(float vf) { union { float vf; uint64_t vm; } cnv; cnv.vm = 0; cnv.vf = vf; return cnv.vm; }
static float vreinterpret_vf_vm(uint64_t vm) { union { float vf; uint64_t vm; } cnv; cnv.vm = vm; return cnv.vf; }
static uint64_t vcast_vm_vo(uint32_t o) { return (uint64_t)o | (((uint64_t)o) << 32); }
static uint64_t vor_vm_vo32_vm(uint32_t x, uint64_t y)       { return vcast_vm_vo(x) | y; }

static uint64_t vsignbit_vm_vf(float f) {
  return vreinterpret_vm_vf(f) & vreinterpret_vm_vf(-0.0f);
}

static float vmulsign_vf_vf_vf(float x, float y) {
  return vreinterpret_vf_vm(vreinterpret_vm_vf(x) ^ vsignbit_vm_vf(y));
}

float xtest(float y, float x) {
  return vreinterpret_vf_vm(vor_vm_vo32_vm(x != x ? ~(uint32_t)0 : 0, vreinterpret_vm_vf(vmulsign_vf_vf_vf(0, y))));
}

int main(int argc, char **argv) {
  float vf1 = atof(argv[1]);
  float vf2 = atof(argv[2]);

  printf("t = %.20g\n", xtest(vf1, vf2));
}
[s390x]~$ clang-10 -march=z13 -O2 -fno-strict-aliasing bug.c
[s390x]~$ ./a.out 0 nan
t = 0
[s390x]~$ gcc -march=z13 -O2 -fno-strict-aliasing bug.c
[s390x]~$ ./a.out 0 nan
t = -nan

@shibatch
Owner

shibatch commented Oct 8, 2020

@Andreas-Krebbel Is it okay to merge PR #343?

@Andreas-Krebbel
Contributor

Hi,

I've tested the z15 build variant. Works fine for me. I see the float-int conversion instructions appearing in the code.

What about the proposed vtestallones changes above? Do you think we could get rid of some of the scalar compares by using the vec_all_* builtins?

@shibatch
Owner

@Andreas-Krebbel vec_all_* builtins are now added.

@Andreas-Krebbel
Contributor

A fix for the clang issue has been posted now:
https://reviews.llvm.org/D89389

@Andreas-Krebbel
Contributor

After building your latest version I can't find the functions without suffix anymore e.g. Sleef_sind2_u10. In sleef.h there is only:

sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe(SLEEF_VECTOR_DOUBLE);
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxenofma(SLEEF_VECTOR_DOUBLE);
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe2(SLEEF_VECTOR_DOUBLE); 
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe2nofma(SLEEF_VECTOR_DOUBLE);

Was that change intended?

@Andreas-Krebbel
Contributor

I see a vxe2 function in sleef.h:
Sleef_sinf4_u10vxe2nofma

But there doesn't appear to be an implementation in the shared library.

@shibatch
Owner

Since there are two supported extensions, I have to add a dispatcher to choose between them, in order to add functions without suffix. The problem is how to test them. I can add them without testing.

As for vxe2nofma functions, could you check if testing for those functions is correctly executed? There should be a test named iutvxe2nofma. Is it executed?

It is a bit hard to read the log at travis, but it is executed.
https://travis-ci.org/github/shibatch/sleef/jobs/735272236#L30211

@Andreas-Krebbel
Contributor

A dispatcher could use the glibc getauxval function to check whether the current hardware supports vxe or vxe2. I can help with testing the dispatcher on other CPU levels.
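
A minimal sketch of such a check is shown below. It is not the dispatcher that was later added to Sleef. The names `variant_t`, `select_variant`, `detect_variant` and the `SKETCH_HWCAP_*` macros are hypothetical. The bit values are assumed to match `HWCAP_S390_VXRS_EXT` and `HWCAP_S390_VXRS_EXT2` from the Linux s390 headers and should be verified against the headers on the target system.

```c
/* Assumed to equal HWCAP_S390_VXRS_EXT (vxe) and HWCAP_S390_VXRS_EXT2
   (vxe2); repeated here so the selection logic compiles on any host. */
#define SKETCH_HWCAP_VXE  (1UL << 13)
#define SKETCH_HWCAP_VXE2 (1UL << 15)

typedef enum { VARIANT_NONE, VARIANT_VXE, VARIANT_VXE2 } variant_t;

/* Pick the most capable variant advertised in the hwcap bits. */
static variant_t select_variant(unsigned long hwcap) {
  if (hwcap & SKETCH_HWCAP_VXE2) return VARIANT_VXE2;
  if (hwcap & SKETCH_HWCAP_VXE)  return VARIANT_VXE;
  return VARIANT_NONE;
}

#if defined(__linux__) && defined(__s390x__)
#include <sys/auxv.h>
/* On s390x Linux, read the real capability bits from the auxiliary vector. */
static variant_t detect_variant(void) {
  return select_variant(getauxval(AT_HWCAP));
}
#endif
```

Keeping the selection logic separate from the `getauxval` call makes it possible to exercise every branch on a single machine by passing synthetic bit patterns.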

I've checked with gcc and clang that the vxe2 test is correctly executed.

The travis/before_script.s390x-gcc.sh appears to lack the "-DENFORCE_VXE2=TRUE" so far.

@Andreas-Krebbel
Contributor

Andreas-Krebbel commented Oct 23, 2020

I was just having a look at the GCC problem you were seeing. It looks like a problem in the code to me. The same data is accessed through pointers to the two incompatible types row_t and element_t. The compiler assumes that these accesses cannot refer to the same underlying object and reorders them. The GCC option -Wstrict-aliasing=2 is useful to detect such problems.

> @Andreas-Krebbel The following is the source code for reproducing the bug.
> ...
> #define BS (1 << LOG2BS)
> #define TRANSPOSE_BLOCK(y2) do {                                        \
>     for(int x2=y2+1;x2<BS;x2++) {                                       \
>       element_t r = *(element_t *)&row[y2].r[x2*2+0];                   \
>       *(element_t *)&row[y2].r[x2*2+0] = *(element_t *)&row[x2].r[y2*2+0]; \
>       *(element_t *)&row[x2].r[y2*2+0] = r;                             \
>     }} while(0)
> static void transpose(real *RESTRICT ALIGNED(256) d, real *RESTRICT ALIGNED(256) s, const int log2n, const int log2m) {
>   typedef struct { real r[BS*2]; } row_t;
>   typedef struct { real r0, r1; } element_t;

As a quick fix you could mark these two types as may alias like this (in fact marking just one of them like this should suffice):

typedef struct { real r[BS*2]; } __attribute__ ((may_alias)) row_t;
typedef struct { real r0, r1; } __attribute__ ((may_alias)) element_t;

Another way would be to use element_t to declare the row_t member. This would avoid the type casting in TRANSPOSE_BLOCK entirely:

typedef struct { real r0, r1; } element_t;
typedef struct { element_t r[BS]; } row_t;

  for(int y=0;y<(1 << log2n);y+=BS) {
    for(int x=0;x<(1 << log2m);x+=BS) {
      row_t row[BS];
      for(int y2=0;y2<BS;y2++) {
        row[y2] = *(row_t *)&s[(((y+y2) << log2m)+x)*2];
      }

      TRANSPOSE_BLOCK( 0); TRANSPOSE_BLOCK( 1);
      TRANSPOSE_BLOCK( 2); TRANSPOSE_BLOCK( 3);
      TRANSPOSE_BLOCK( 4); TRANSPOSE_BLOCK( 5);
      TRANSPOSE_BLOCK( 6); TRANSPOSE_BLOCK( 7);
      TRANSPOSE_BLOCK( 8); TRANSPOSE_BLOCK( 9);
      TRANSPOSE_BLOCK(10); TRANSPOSE_BLOCK(11);
      TRANSPOSE_BLOCK(12); TRANSPOSE_BLOCK(13);
      TRANSPOSE_BLOCK(14); TRANSPOSE_BLOCK(15);

      for(int y2=0;y2<BS;y2++) {
        *(row_t *)&d[(((x+y2) << log2n)+y)*2] = row[y2];
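
A reduced sketch of the second suggestion, with `row_t` declared in terms of `element_t` so that the swap needs no casts, might look like this. It is an illustration under several assumptions and not the code that went into Sleef: the function name `transpose_sketch` is hypothetical, `BS` is reduced to 4, both dimensions must be multiples of `BS`, and the block loads and stores use `memcpy` instead of pointer casts.

```c
#include <string.h>

#define BS 4  /* block size; the reproducer uses 16 */

typedef struct { double r0, r1; } element_t;   /* one complex value */
typedef struct { element_t r[BS]; } row_t;     /* one row of a block */

/* Transpose a (1 << log2n) x (1 << log2m) matrix of complex values,
   one BS x BS block at a time. */
static void transpose_sketch(double *d, const double *s, int log2n, int log2m) {
  for (int y = 0; y < (1 << log2n); y += BS) {
    for (int x = 0; x < (1 << log2m); x += BS) {
      row_t row[BS];
      /* Load one block from the source. */
      for (int y2 = 0; y2 < BS; y2++)
        memcpy(&row[y2], &s[(((y + y2) << log2m) + x) * 2], sizeof(row_t));
      /* Swap across the diagonal; elements are accessed as element_t only. */
      for (int y2 = 0; y2 < BS; y2++) {
        for (int x2 = y2 + 1; x2 < BS; x2++) {
          element_t t = row[y2].r[x2];
          row[y2].r[x2] = row[x2].r[y2];
          row[x2].r[y2] = t;
        }
      }
      /* Store the transposed block at the mirrored position. */
      for (int y2 = 0; y2 < BS; y2++)
        memcpy(&d[(((x + y2) << log2n) + y) * 2], &row[y2], sizeof(row_t));
    }
  }
}
```

Because every element is read and written through the same type, the compiler has no aliasing assumption it could use to reorder the swaps.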

@shibatch
Owner

Okay, I will add -fno-strict-aliasing in the next patch.

@Andreas-Krebbel
Contributor

> Okay, I will add -fno-strict-aliasing in the next patch.

-fno-strict-aliasing prevents many optimizations globally. I think fixing this locally would be better performance-wise.

@shibatch
Owner

shibatch commented Oct 23, 2020

There are actually many places where type-punning is used.
I have to properly address this anyway.
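
One common aliasing-safe way to express this kind of type-punning is to copy the bytes with `memcpy`, which compilers typically turn into a plain register move. The sketch below is only an illustration of that technique; the function names are hypothetical, and the thread does not say which approach was eventually used in Sleef.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit integer without
   violating the strict aliasing rules. */
static uint32_t bits_of_float(float f) {
  uint32_t u;
  memcpy(&u, &f, sizeof u);
  return u;
}

/* The inverse reinterpretation. */
static float float_of_bits(uint32_t u) {
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/* Flip the sign of x when y is negative, by XOR-ing in y's sign bit. */
static float mulsign_sketch(float x, float y) {
  uint32_t sign = bits_of_float(y) & bits_of_float(-0.0f);
  return float_of_bits(bits_of_float(x) ^ sign);
}
```

This keeps strict aliasing optimizations enabled for the rest of the translation unit, in line with the preference for a local fix over `-fno-strict-aliasing`.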

@shibatch
Owner

shibatch commented Nov 7, 2020

It is already almost done.

@shibatch
Owner

shibatch commented Nov 9, 2020

@Andreas-Krebbel I added dispatchers. Please check if the test passes on Z13 or Z14 computers.

@Andreas-Krebbel
Contributor

Thanks! I did run some tests.
Testsuite on z15 is clean.

On z14 I see fails with the vxe2 tests. That's expected.
The following tests FAILED:
10 - iutvxe2 (Failed)
11 - iutyvxe2 (Failed)
12 - iutivxe2 (Failed)
14 - iutvxe2nofma (Failed)
15 - iutyvxe2nofma (Failed)
16 - iutivxe2nofma (Failed)

There are plenty of fails on z13. But that's fair as well, since it wasn't intended to have support for z13 at all.
The following tests FAILED:
22 - iutpurecfma_scalar (Failed)
23 - iutypurecfma_scalar (Failed)
24 - iutipurecfma_scalar (Failed)
26 - iutdsp128 (Failed)
28 - naivetestdp_2 (ILLEGAL)
29 - naivetestdp_3 (ILLEGAL)
30 - naivetestdp_4 (ILLEGAL)
31 - naivetestdp_5 (ILLEGAL)
32 - naivetestdp_10 (ILLEGAL)
34 - naivetestsp_2 (ILLEGAL)
35 - naivetestsp_3 (ILLEGAL)
36 - naivetestsp_4 (ILLEGAL)
37 - naivetestsp_5 (ILLEGAL)
38 - naivetestsp_10 (ILLEGAL)
39 - roundtriptest1ddp_12 (ILLEGAL)
40 - roundtriptest1ddp_16 (ILLEGAL)
41 - roundtriptest1dsp_12 (ILLEGAL)
42 - roundtriptest1dsp_16 (ILLEGAL)
48 - roundtriptest2dsp_2_2 (ILLEGAL)
49 - roundtriptest2dsp_4_4 (ILLEGAL)
50 - roundtriptest2dsp_8_8 (ILLEGAL)
51 - roundtriptest2dsp_10_10 (ILLEGAL)
52 - roundtriptest2dsp_5_15 (ILLEGAL)

I also did some testing with sinf4. The dispatching works as expected. On z15 the vxe2 variant is used showing a nice performance benefit. On z14 it falls back to the vxe version.

I'm fine with merging the patches.

Thanks a lot for the great work on this!

shibatch added a commit that referenced this issue Nov 10, 2020
This patch adds z15 VX support, following issue #317.
It also changes the suffix zvector2 to vxe and vxe2.
In order to enable z15 VXE2 support, it requires clang-10.
However, it seems that clang-9 and later have a bug in handling orderedness of comparison.
Thus, it passes the tests only if optimizations are turned off.

Co-authored-by: shibatch <[email protected]>
@shibatch
Owner

Merged. Please create a new issue if you find a problem.
