Z15 VX s390x Linux on z support #317

edelsohn · 2020-07-22T13:37:16Z

Port Sleef to the IBM Z15 VX architecture, supporting s390x Linux on Z compiled with both GCC and Clang. The goal is a speedup equivalent to x86-64 and AArch64, as appropriate for the 128-bit z VX vector width.

With a clone of the GitHub 290x branch from a few days ago, built with GCC on Fedora, I see the following failures:
The following tests FAILED:
8/26 Test #18: roundtriptest2ddp_4_4 ............***Failed 0.10 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 17549
transpose MT(measured): 85036
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 17549
transpose MT(loaded): 85036
complex : NG (0.315141)

  Start 19: roundtriptest2ddp_8_8

19/26 Test #19: roundtriptest2ddp_8_8 ............***Failed 0.05 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 23537
transpose MT(measured): 6460
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 23537
transpose MT(loaded): 6460
complex : NG (0.301525)

  Start 20: roundtriptest2ddp_10_10

20/26 Test #20: roundtriptest2ddp_10_10 ..........***Failed 0.32 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 41145
transpose MT(measured): 10379
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 41145
transpose MT(loaded): 10379
complex : NG (0.306744)

  Start 21: roundtriptest2ddp_5_15

21/26 Test #21: roundtriptest2ddp_5_15 ...........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 4(ST) 2(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 38633
transpose MT(measured): 22803
Path(random) :2(ST) 2(ST) 4(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :3(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 38633
transpose MT(loaded): 22803
complex : NG (0.309993)

  Start 23: roundtriptest2dsp_4_4

23/26 Test #23: roundtriptest2dsp_4_4 ............***Failed 0.13 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 11867
transpose MT(measured): 120471
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 11867
transpose MT(loaded): 120471
complex : NG (0.707168)

  Start 24: roundtriptest2dsp_8_8

24/26 Test #24: roundtriptest2dsp_8_8 ............***Failed 0.03 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 15258
transpose MT(measured): 4248
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 15258
transpose MT(loaded): 4248
complex : NG (0.659438)

  Start 25: roundtriptest2dsp_10_10

25/26 Test #25: roundtriptest2dsp_10_10 ..........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 24401
transpose MT(measured): 6481
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 24401
transpose MT(loaded): 6481
complex : NG (0.661858)

  Start 26: roundtriptest2dsp_5_15

26/26 Test #26: roundtriptest2dsp_5_15 ...........***Failed 0.26 sec
Path(random) :4(ST) 2(ST) 3(ST) 4(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 26029
transpose MT(measured): 13875
Path(random) :4(ST) 2(ST) 2(ST) 3(ST) 3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :1(ST) 4(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 26029
transpose MT(loaded): 13875
complex : NG (0.663971)


shibatch · 2020-08-20T06:16:02Z

This seems to be a bug in gcc.
I added a workaround.

barkovv · 2020-08-24T17:17:06Z

@shibatch Are you going to add VX support in this library?

edelsohn · 2020-08-24T17:20:34Z

There is Z15 VX support in the #291 Add_s390x_support_rebased branch.

shibatch · 2020-08-24T23:52:27Z

@edelsohn Can I merge #291?

edelsohn · 2020-08-25T18:40:11Z

We definitely would like this merged. With the current, updated sources, I am seeing more failures. I am testing on Fedora with GCC 10.1.
***Exception: Illegal
20% tests passed, 24 tests failed out of 30

shibatch · 2020-08-26T00:48:12Z

Since you are seeing ***Exception: Illegal, it seems that your computer does not support the ZVECTOR2 extension.

I built gcc-10.2.0 on the LinuxONE VM and ran the tests with it. There is no problem.

As you can see, it passes the tests on Travis.

https://travis-ci.org/github/shibatch/sleef/jobs/719496316

edelsohn · 2020-08-26T17:06:14Z

My colleague tested with the master branch and successfully tested on both z14 and z15. As another developer mentioned, the failures seem to be a conflict with FFTW. Thanks for the great initial implementation.

shibatch · 2020-08-27T03:38:51Z

I added a DISABLE_FFTW option (#327).
I hope this option will solve the problem.

Andreas-Krebbel · 2020-09-16T10:31:46Z

> This seems to be a bug in gcc.
> I added a workaround.

Could you please elaborate on what the GCC bug is? I would like to have a look.

Andreas-Krebbel · 2020-09-21T11:02:30Z

Hi,

David asked me to have a look at the Z specific changes. The implementation looks very good to me. Thanks for looking into this. Here are a few comments/questions from my side:

1. Building with -march=z15 makes a difference since we have a hardware vector conversion between float and int then. In my microbenchmarks this makes a difference e.g. with Sleef_sinf4_u35. We probably would require another build variant for this then.
2. On IBM Z the vector compares are defined to always set all bits in an element to either ones or zeros depending on the comparison result. So if I understand these operations correctly they could be replaced like this (as it is done on Power):
-static INLINE int vtestallones_i_vo64(vopmask g) { return g[0] == 0xffffffffffffffffLL && g[1] == 0xffffffffffffffffLL; }
-static INLINE int vtestallones_i_vo32(vopmask g) { return g[0] == 0xffffffffffffffffLL && g[1] == 0xffffffffffffffffLL; }
+static INLINE int vtestallones_i_vo32(vopmask g) { return vec_all_ne(g, (vopmask){ 0 }); }
+static INLINE int vtestallones_i_vo64(vopmask g) { return vec_all_ne(g, (vopmask){ 0 }); }

   Otherwise the compiler falls back to scalar AND instructions to implement this.

   There are also a few other operations on comparison results which could potentially be simplified, I think.

3. zvector and zvector2 are not ideal as names for the CPU facilities. I would prefer the names which are also shown in /proc/cpuinfo. That would be vx for the z13 vector instruction set, vxe for z14, and vxe2 for z15. Renaming this would be quite some effort but I think it would be good to match the facility names which are used elsewhere.
4. I'm wondering what the sleef requirements are for min/max with respect to NaN and Inf handling. Our hardware instructions provide several different modes and we have to make sure that this matches.

shibatch · 2020-09-21T12:45:14Z

Hello,

As for 1, in order to add support for z14 and z15, I need access to computers with those architectures.

As for 4, sleef basically follows the specification of math functions in the C99 standard.

edelsohn · 2020-09-21T12:49:49Z

@shibatch The system to which you have access through the LinuxONE Community Cloud is z15.

shibatch · 2020-09-23T23:27:33Z

The main problem in implementing z15 support is the lack of a reference manual.
I don't even know what new intrinsics are available on z14 and z15.
Could you point me to a good reference if you know one?

edelsohn · 2020-09-24T01:27:41Z

The z15 z/Architecture Principles of Operation is available online.

The vector intrinsics are defined in the GCC vecintrin.h header file.

shibatch · 2020-09-24T01:52:01Z

How can I tell which instructions are new and only available on z14 and z15?
There is no clear correspondence between mnemonics and intrinsics either.

shibatch · 2020-09-24T03:00:17Z

@Andreas-Krebbel The following is the source code for reproducing the bug.

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

#define real double
#define RESTRICT __restrict__
#define ALIGNED(x) __attribute__((aligned(x)))
#define LOG2BS 4

#define BS (1 << LOG2BS)
#define TRANSPOSE_BLOCK(y2) do {                                        \
    for(int x2=y2+1;x2<BS;x2++) {                                       \
      element_t r = *(element_t *)&row[y2].r[x2*2+0];                   \
      *(element_t *)&row[y2].r[x2*2+0] = *(element_t *)&row[x2].r[y2*2+0]; \
      *(element_t *)&row[x2].r[y2*2+0] = r;                             \
    }} while(0)

static void transpose(real *RESTRICT ALIGNED(256) d, real *RESTRICT ALIGNED(256) s, const int log2n, const int log2m) {
  typedef struct { real r[BS*2]; } row_t;
  typedef struct { real r0, r1; } element_t;

  for(int y=0;y<(1 << log2n);y+=BS) {
    for(int x=0;x<(1 << log2m);x+=BS) {
      row_t row[BS];
      for(int y2=0;y2<BS;y2++) {
        row[y2] = *(row_t *)&s[(((y+y2) << log2m)+x)*2];
      }

      TRANSPOSE_BLOCK( 0); TRANSPOSE_BLOCK( 1);
      TRANSPOSE_BLOCK( 2); TRANSPOSE_BLOCK( 3);
      TRANSPOSE_BLOCK( 4); TRANSPOSE_BLOCK( 5);
      TRANSPOSE_BLOCK( 6); TRANSPOSE_BLOCK( 7);
      TRANSPOSE_BLOCK( 8); TRANSPOSE_BLOCK( 9);
      TRANSPOSE_BLOCK(10); TRANSPOSE_BLOCK(11);
      TRANSPOSE_BLOCK(12); TRANSPOSE_BLOCK(13);
      TRANSPOSE_BLOCK(14); TRANSPOSE_BLOCK(15);

      for(int y2=0;y2<BS;y2++) {
        *(row_t *)&d[(((x+y2) << log2n)+y)*2] = row[y2];
      }
    }
  }
}

int main(int argc, char **argv) {
  int n = 5;
  double *s = memalign(256, sizeof(double) * 2 * (1 << n) * (1 << n));
  double *d = memalign(256, sizeof(double) * 2 * (1 << n) * (1 << n));

  double *p = s;
  int cnt = 1;
  for(int y=0;y<(1 << n);y++) {
    for(int x=0;x<(1 << n);x++) {
      *p++ = cnt++;
      *p++ = cnt++;
    }
  }

  transpose(d, s, n, n);

  p = d;
  for(int y=0;y<(1 << n);y++) {
    for(int x=0;x<(1 << n);x++) {
      int n0 = (int)*p++, n1 = (int)*p++;
      printf("(%03x, %03x) ", n0, n1);
    }
    printf("\n");
  }
}

[s390x]~/sleef$ gcc -O2 bug.c
[s390x]~/sleef$ ./a.out > gcc-O2.out
[s390x]~/sleef$ gcc -O0 bug.c
[s390x]~/sleef$ ./a.out > gcc-O0.out
[s390x]~/sleef$ clang bug.c
[s390x]~/sleef$ ./a.out > clang-O0.out
[s390x]~/sleef$ diff gcc-O0.out clang-O0.out
[s390x]~/sleef$ diff gcc-O2.out gcc-O0.out
13,16c13,16
< (019, 01a) (059, 05a) (099, 09a) (0d9, 0da) (119, 11a) (159, 15a) (199, 19a) (1d9, 1da) (219, 21a) (259, 25a) (299, 29a) (2d9, 2da) (319, 31a) (31b, 31c) (31d, 31e) (31f, 320) (419, 41a) (459, 45a) (499, 49a) (4d9, 4da) (519, 51a) (559, 55a) (599, 59a) (5d9, 5da) (619, 61a) (659, 65a) (699, 69a) (6d9, 6da) (719, 71a) (71b, 71c) (71d, 71e) (71f, 720)
...

[s390x]~/sleef$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/s390x-redhat-linux/8/lto-wrapper
Target: s390x-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --disable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --disable-libmpx --enable-gnu-indirect-function --with-long-double-128 --with-arch=z13 --with-tune=z14 --enable-decimal-float --build=s390x-redhat-linux
Thread model: posix
gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)

Andreas-Krebbel · 2020-09-24T06:35:57Z

> The main problem in implementing z15 support is lack of a reference manual.

Sorry for the confusion. I didn't want to ask for implementing full z15 support right now. It would be sufficient to just provide a way to compile with -march=z15. That way the float-int conversion code as in:

static INLINE vint2 vrint_vi2_vf(vfloat vf) {
  vf = vrint_vf_vf(vf);
  return (vint) { vf[0], vf[1], vf[2], vf[3] };
}

becomes just:

vfisb   %v24,%v24,4,4
vcfeb   %v24,%v24,0,5

instead of the following (when compiled with -march=z14):

    vfisb   %v0,%v24,4,4
    vzero   %v24
    vlgvf   %r1,%v0,0
    vlvgf   %v6,%r1,0
    cfebr   %r4,5,%f6
    vlgvf   %r1,%v0,1
    vlvgf   %v4,%r1,0
    vlgvf   %r1,%v0,2
    cfebr   %r3,5,%f4
    vlvgf   %v2,%r1,0
    vlgvf   %r1,%v0,3
    cfebr   %r2,5,%f2
    vlvgf   %v0,%r1,0
    cfebr   %r1,5,%f0
    vlvgf   %v24,%r4,0
    vlvgf   %v24,%r3,1
    vlvgf   %v24,%r2,2
    vlvgf   %v24,%r1,3

> I don't even know what new intrinsics are available on z14 and z15.
> Could you point a good reference if you know?

The builtins are documented for the XL compiler here:
https://www.ibm.com/support/knowledgecenter/SSLTBW_2.4.0/com.ibm.zos.v2r4.cbcpx01/vectorbltin.htm

The builtins added with z15 are:

vec_float - int->float, unsigned->float
vec_signed - now also for float->int
vec_unsigned - now also for float->unsigned int
vec_revb - vector byte swaps
vec_reve - vector element swaps
vec_sldb - shift left double by bit
vec_srdb - shift right double by bit
vec_search_string_cc - substring search
vec_search_string_until_zero_cc - substring search for 0 terminated strings

shibatch · 2020-10-06T11:42:35Z

Implementing the requested feature is almost done, but it seems that clang-10 has a bug in handling orderedness of comparison. Because of this bug, it passes the tests only if optimizations are turned off.

Andreas-Krebbel · 2020-10-06T11:44:30Z

Great, Thanks! Could you please extract a testcase for the clang 10 issue so that we can have a look?

shibatch · 2020-10-06T12:07:28Z

Anyway, I made PR #343.
I will make a testcase tomorrow.

shibatch · 2020-10-07T01:12:13Z

Below is the testcase.

[s390x]~$ cat bug.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static uint64_t vreinterpret_vm_vf(float vf) { union { float vf; uint64_t vm; } cnv; cnv.vm = 0; cnv.vf = vf; return cnv.vm; }
static float vreinterpret_vf_vm(uint64_t vm) { union { float vf; uint64_t vm; } cnv; cnv.vm = vm; return cnv.vf; }
static uint64_t vcast_vm_vo(uint32_t o) { return (uint64_t)o | (((uint64_t)o) << 32); }
static uint64_t vor_vm_vo32_vm(uint32_t x, uint64_t y)       { return vcast_vm_vo(x) | y; }

static uint64_t vsignbit_vm_vf(float f) {
  return vreinterpret_vm_vf(f) & vreinterpret_vm_vf(-0.0f);
}

static float vmulsign_vf_vf_vf(float x, float y) {
  return vreinterpret_vf_vm(vreinterpret_vm_vf(x) ^ vsignbit_vm_vf(y));
}

float xtest(float y, float x) {
  return vreinterpret_vf_vm(vor_vm_vo32_vm(x != x ? ~(uint32_t)0 : 0, vreinterpret_vm_vf(vmulsign_vf_vf_vf(0, y))));
}

int main(int argc, char **argv) {
  float vf1 = atof(argv[1]);
  float vf2 = atof(argv[2]);

  printf("t = %.20g\n", xtest(vf1, vf2));
}
[s390x]~$ clang-10 -march=z13 -O2 -fno-strict-aliasing bug.c
[s390x]~$ ./a.out 0 nan
t = 0
[s390x]~$ gcc -march=z13 -O2 -fno-strict-aliasing bug.c
[s390x]~$ ./a.out 0 nan
t = -nan

shibatch · 2020-10-08T02:35:43Z

@Andreas-Krebbel Is it okay to merge PR #343?

Andreas-Krebbel · 2020-10-09T08:17:04Z

Hi,

I've tested the z15 build variant. Works fine for me. I see the float-int conversion instructions appearing in the code.

What about the proposed vtestallones changes above? Do you think we could get rid of some of the scalar compares by using the vec_all_* builtins?

shibatch · 2020-10-13T08:45:35Z

@Andreas-Krebbel vec_all_* builtins are now added.

Andreas-Krebbel · 2020-10-14T13:04:07Z

A fix for the clang issue has been posted now:
https://reviews.llvm.org/D89389

Andreas-Krebbel · 2020-10-14T15:26:29Z

After building your latest version I can't find the functions without a suffix anymore, e.g. Sleef_sind2_u10. In sleef.h there are only:

sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe(SLEEF_VECTOR_DOUBLE);
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxenofma(SLEEF_VECTOR_DOUBLE);
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe2(SLEEF_VECTOR_DOUBLE); 
sleef.h:IMPORT CONST SLEEF_VECTOR_DOUBLE Sleef_sind2_u10vxe2nofma(SLEEF_VECTOR_DOUBLE);

Was that change intended?

Andreas-Krebbel · 2020-10-14T15:40:37Z

I see a vxe2 function in sleef.h:
Sleef_sinf4_u10vxe2nofma

But there doesn't appear to be an implementation in the shared library.

shibatch · 2020-10-14T23:49:50Z

Since there are two supported extensions, I have to add a dispatcher to choose between them, in order to add functions without suffix. The problem is how to test them. I can add them without testing.

As for vxe2nofma functions, could you check if testing for those functions is correctly executed? There should be a test named iutvxe2nofma. Is it executed?

It is a bit hard to read the log at travis, but it is executed.
https://travis-ci.org/github/shibatch/sleef/jobs/735272236#L30211

Andreas-Krebbel · 2020-10-20T07:40:05Z

A dispatcher could use the glibc getauxval function to check whether the current hardware supports vxe or vxe2. I can help with testing the dispatcher on other CPU levels.

I've checked with gcc and clang that the vxe2 test is correctly executed.

The travis/before_script.s390x-gcc.sh appears to lack the "-DENFORCE_VXE2=TRUE" so far.

Andreas-Krebbel · 2020-10-23T07:44:13Z

I was just having a look at the GCC problem you were seeing. It looks like a problem in the code to me. The same data is accessed through pointers to the two incompatible types row_t and element_t. The compiler assumes that these accesses cannot refer to the same underlying object and reorders them. The GCC option -Wstrict-aliasing=2 is useful to detect such problems.

> @Andreas-Krebbel The following is the source code for reproducing the bug.
...
#define BS (1 << LOG2BS)
#define TRANSPOSE_BLOCK(y2) do {                                        \
    for(int x2=y2+1;x2<BS;x2++) {                                       \
      element_t r = *(element_t *)&row[y2].r[x2*2+0];                   \
      *(element_t *)&row[y2].r[x2*2+0] = *(element_t *)&row[x2].r[y2*2+0]; \
      *(element_t *)&row[x2].r[y2*2+0] = r;                             \
    }} while(0)

static void transpose(real *RESTRICT ALIGNED(256) d, real *RESTRICT ALIGNED(256) s, const int log2n, const int log2m) {
  typedef struct { real r[BS*2]; } row_t;
  typedef struct { real r0, r1; } element_t;

As a quick fix you could mark these two types as may alias like this (in fact marking just one of them like this should suffice):

typedef struct { real r[BS*2]; } __attribute__ ((may_alias)) row_t;
typedef struct { real r0, r1; } __attribute__ ((may_alias)) element_t;

Another way would be to use element_t to declare the row_t member. This would avoid the type casting in TRANSPOSE_BLOCK entirely:

typedef struct { real r0, r1; } element_t;
typedef struct { element_t r[BS]; } row_t;

  for(int y=0;y<(1 << log2n);y+=BS) {
    for(int x=0;x<(1 << log2m);x+=BS) {
      row_t row[BS];
      for(int y2=0;y2<BS;y2++) {
        row[y2] = *(row_t *)&s[(((y+y2) << log2m)+x)*2];
      }

      TRANSPOSE_BLOCK( 0); TRANSPOSE_BLOCK( 1);
      TRANSPOSE_BLOCK( 2); TRANSPOSE_BLOCK( 3);
      TRANSPOSE_BLOCK( 4); TRANSPOSE_BLOCK( 5);
      TRANSPOSE_BLOCK( 6); TRANSPOSE_BLOCK( 7);
      TRANSPOSE_BLOCK( 8); TRANSPOSE_BLOCK( 9);
      TRANSPOSE_BLOCK(10); TRANSPOSE_BLOCK(11);
      TRANSPOSE_BLOCK(12); TRANSPOSE_BLOCK(13);
      TRANSPOSE_BLOCK(14); TRANSPOSE_BLOCK(15);

      for(int y2=0;y2<BS;y2++) {
        *(row_t *)&d[(((x+y2) << log2n)+y)*2] = row[y2];
      }
    }
  }
}

shibatch · 2020-10-23T08:07:32Z

Okay, I will add -fno-strict-aliasing in the next patch.

Andreas-Krebbel · 2020-10-23T08:31:35Z

> Okay, I will add -fno-strict-aliasing in the next patch.

-fno-strict-aliasing prevents many optimizations globally. I think fixing this locally would be better performance-wise.

shibatch · 2020-10-23T10:30:58Z

There are actually many places where type-punning is used.
I have to properly address this anyway.

shibatch · 2020-11-07T10:59:55Z

It is already almost done.

shibatch · 2020-11-09T07:59:41Z

@Andreas-Krebbel I added dispatchers. Please check if the test passes on Z13 or Z14 computers.

Andreas-Krebbel · 2020-11-10T10:53:34Z

Thanks! I did run some tests.
Testsuite on z15 is clean.

On z14 I see failures in the vxe2 tests. That's expected.
The following tests FAILED:
10 - iutvxe2 (Failed)
11 - iutyvxe2 (Failed)
12 - iutivxe2 (Failed)
14 - iutvxe2nofma (Failed)
15 - iutyvxe2nofma (Failed)
16 - iutivxe2nofma (Failed)

There are plenty of failures on z13. But that's fair as well since it wasn't intended to have support for z13 at all.
The following tests FAILED:
22 - iutpurecfma_scalar (Failed)
23 - iutypurecfma_scalar (Failed)
24 - iutipurecfma_scalar (Failed)
26 - iutdsp128 (Failed)
28 - naivetestdp_2 (ILLEGAL)
29 - naivetestdp_3 (ILLEGAL)
30 - naivetestdp_4 (ILLEGAL)
31 - naivetestdp_5 (ILLEGAL)
32 - naivetestdp_10 (ILLEGAL)
34 - naivetestsp_2 (ILLEGAL)
35 - naivetestsp_3 (ILLEGAL)
36 - naivetestsp_4 (ILLEGAL)
37 - naivetestsp_5 (ILLEGAL)
38 - naivetestsp_10 (ILLEGAL)
39 - roundtriptest1ddp_12 (ILLEGAL)
40 - roundtriptest1ddp_16 (ILLEGAL)
41 - roundtriptest1dsp_12 (ILLEGAL)
42 - roundtriptest1dsp_16 (ILLEGAL)
48 - roundtriptest2dsp_2_2 (ILLEGAL)
49 - roundtriptest2dsp_4_4 (ILLEGAL)
50 - roundtriptest2dsp_8_8 (ILLEGAL)
51 - roundtriptest2dsp_10_10 (ILLEGAL)
52 - roundtriptest2dsp_5_15 (ILLEGAL)

I also did some testing with sinf4. The dispatching works as expected. On z15 the vxe2 variant is used showing a nice performance benefit. On z14 it falls back to the vxe version.

I'm fine with merging the patches.

Thanks a lot for the great work on this!

This patch adds z15 VX support, following issue #317. It also changes the suffix zvector2 to vxe and vxe2. In order to enable z15 VXE2 support, it requires clang-10. However, it seems that clang-9 and later have a bug in handling orderedness of comparison. Thus, it passes the tests only if optimizations are turned off. Co-authored-by: shibatch <[email protected]>

shibatch · 2020-11-11T02:06:14Z

Merged. Please create a new issue if you find a problem.

shibatch mentioned this issue Oct 6, 2020

Add z15 support #343

Merged

shibatch closed this as completed Nov 11, 2020
