-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Z15 VX s390x Linux on z support #317
Comments
This seems to be a bug in gcc. |
@shibatch Are you going to add VX support in this library? |
There is Z15 VX support in the #291 Add_s390x_support_rebased branch. |
We definitely would like this merged. With the current, updated sources, I am seeing more failures. I am testing on Fedora with GCC 10.1. |
Since you are seeing I built gcc-10.2.0 on the LinuxONE VM, and did the test with it. There is no problem. As you see, it passes the tests on travis. |
My colleague tested with the master branch and successfully tested on both z14 and z15. As another developer mentioned, the failures seem to be a conflict with FFTW. Thanks for the great initial implementation. |
I added DISABLE_FFTW option. #327 |
Could you please elaborate on what the GCC bug is! I would like to have a look. |
Hi, David asked me to have a look at the Z specific changes. The implementation looks very good to me. Thanks for looking into this. Here are a few comments/questions from my side:
Otherwise the compiler falls back to scalar AND instructions to implement this. There are also a few other operations on comparison results which could potentially be simplified I think.
|
Hello, As for 1, In order to add support for z14 and z15, I need access to computers with those architectures. As for 4, sleef basically follows the specification of math functions in the C99 standard. |
@shibatch The system to which you have access through the LinuxONE Community Cloud is z15. |
The main problem in implementing z15 support is lack of a reference manual. |
The z15 z/Architecture Principles of Operation is available online. The vector intrinsics are defined in the GCC vecintrin.h header file. |
How can I know which instructions are the new instructions only available on z14 and z15? |
@Andreas-Krebbel The following is the source code for reproducing the bug.
|
Sorry for the confusion. I didn't want to ask for implementing full z15 support right now. It would be sufficient to just provide a way to compile with -march=z15. That way the float-int conversion code as in:
becomes just:
instead of: (when compiled with -march=z14)
The builtins are documented for the XL compiler here: The builtins added with z15 are: vec_float int->float, unsigned->float |
Implementing the requested feature is almost done, but it seems that clang-10 has a bug in handling orderedness of comparison. Because of this bug, it passes the tests only if optimizations are turned off. |
Great, Thanks! Could you please extract a testcase for the clang 10 issue so that we can have a look? |
Anyway, I made PR #343. |
Below is the testcase.
|
@Andreas-Krebbel Is it okay to merge PR #343? |
Hi, I've tested the z15 build variant. Works fine for me. I see the float-int conversion instructions appearing in the code. What about the proposed vtestallones changes above? Do you think we could get rid of some of the scalar compares by using the vec_all_* builtins? |
@Andreas-Krebbel vec_all_* builtins are now added. |
A fix for the clang issue has been posted now: |
After building your latest version I can't find the functions without suffix anymore e.g. Sleef_sind2_u10. In sleef.h there is only:
Was that change intended? |
I see vxe2 function in sleef.h: But there doesn't appear to be an implementation in the shared library. |
Since there are two supported extensions, I have to add a dispatcher to choose between them, in order to add functions without suffix. The problem is how to test them. I can add them without testing. As for vxe2nofma functions, could you check if testing for those functions is correctly executed? There should be a test named iutvxe2nofma. Is it executed? It is a bit hard to read the log at travis, but it is executed. |
A dispatcher could use the glibc getauxval function to check whether the current hardware support vxe or vxe2. I can help testing the dispatcher on other CPU levels. I've checked with gcc and clang that the vxe2 test is correctly executed. The travis/before_script.s390x-gcc.sh appears to lack the "-DENFORCE_VXE2=TRUE" so far. |
I was just having a look at the GCC problem you were seeing. It looks like a problem in the code to me. The same data is accessed through pointers to the two incompatible types row_t and element_t. The compiler assumes that these accesses cannot refer to the same underlying object and reorders them. The GCC option -Wstrict-aliasing=2 is useful to detect such problems.
As a quick fix you could mark these two types as may alias like this (in fact marking just one of them like this should suffice):
Another way would be to use element_t to declare the row_t member. This would avoid the type casting in TRANSPOSE_BLOCK entirely:
|
Okay, I will add -fno-strict-aliasing in the next patch. |
-fno-strict-aliasing prevents many optimizations globally. I think fixing this locally would be better performance-wise. |
There are actually many places where type-punning is used. |
It is already almost done. |
@Andreas-Krebbel I added dispatchers. Please check if the test passes on Z13 or Z14 computers. |
Thanks! I did run some tests. On z14 I see fails with the vxe2 tests. That's expected. There are plenty of fails on z13. But that's fair as well since it wasn't intended have support for z13 at all. I also did some testing with sinf4. The dispatching works as expected. On z15 the vxe2 variant is used showing a nice performance benefit. On z14 it falls back to the vxe version. I'm fine with merging the patches. Thanks a lot for the great work on this! |
This patch adds z15 VX support, following issue #317. It also changes the suffix zvector2 to vxe and vxe2. In order to enable z15 VXE2 support, it requires clang-10. However, it seems that clang-9 and later have a bug in handling orderedness of comparison. Thus, it passes the tests only if optimizations are turned off. Co-authored-by: shibatch <[email protected]>
Merged. Please create a new issue if you find a problem. |
Port Sleef to IBM Z15 VX architecture supporting s390x Linux on Z compiled with both GCC and Clang. Achieve equivalent speedup to x86-64 and AArch64 appropriate for the z VX 128 bit vector width.
With Github 290x branch clone from a few days ago on Fedora built with GCC, I see the following failures:
The following tests FAILED:
8/26 Test #18: roundtriptest2ddp_4_4 ............***Failed 0.10 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 17549
transpose MT(measured): 85036
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 17549
transpose MT(loaded): 85036
complex : NG (0.315141)
19/26 Test #19: roundtriptest2ddp_8_8 ............***Failed 0.05 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 23537
transpose MT(measured): 6460
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 23537
transpose MT(loaded): 6460
complex : NG (0.301525)
20/26 Test #20: roundtriptest2ddp_10_10 ..........***Failed 0.32 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 41145
transpose MT(measured): 10379
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 41145
transpose MT(loaded): 10379
complex : NG (0.306744)
21/26 Test #21: roundtriptest2ddp_5_15 ...........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 4(ST) 2(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(measured): 38633
transpose MT(measured): 22803
Path(random) :2(ST) 2(ST) 4(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit double
Path(random) :3(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit double
transpose NoMT(loaded): 38633
transpose MT(loaded): 22803
complex : NG (0.309993)
23/26 Test #23: roundtriptest2dsp_4_4 ............***Failed 0.13 sec
Path(random) :3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 11867
transpose MT(measured): 120471
Path(random) :1(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 11867
transpose MT(loaded): 120471
complex : NG (0.707168)
24/26 Test #24: roundtriptest2dsp_8_8 ............***Failed 0.03 sec
Path(random) :2(ST) 1(ST) 2(ST) 3(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 15258
transpose MT(measured): 4248
Path(random) :3(ST) 2(ST) 2(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 15258
transpose MT(loaded): 4248
complex : NG (0.659438)
25/26 Test #25: roundtriptest2dsp_10_10 ..........***Failed 0.27 sec
Path(random) :2(ST) 1(ST) 4(ST) 1(ST) 2(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 24401
transpose MT(measured): 6481
Path(random) :2(ST) 3(ST) 4(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 24401
transpose MT(loaded): 6481
complex : NG (0.661858)
26/26 Test #26: roundtriptest2dsp_5_15 ...........***Failed 0.26 sec
Path(random) :4(ST) 2(ST) 3(ST) 4(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :2(ST) 1(ST) 1(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(measured): 26029
transpose MT(measured): 13875
Path(random) :4(ST) 2(ST) 2(ST) 3(ST) 3(ST) 1(ST)
ISA : GCC Vector Extension 128 bit float
Path(random) :1(ST) 4(ST)
ISA : GCC Vector Extension 128 bit float
transpose NoMT(loaded): 26029
transpose MT(loaded): 13875
complex : NG (0.663971)
The text was updated successfully, but these errors were encountered: