New failing tests in ATDM debug builds of Trilinos due to KOKKOS_ENABLE_DEBUG=ON being set #2471
Comments
I'm OK with the Anasazi test getting disabled on this platform. As far as I know, ATDM customers don't even need Anasazi. |
I'll disable the failing Amesos2 KLU tests for now when KOKKOS_ENABLE_DEBUG is on, until a fix is ready. Configuring and building right now. |
The details of the failing tests:
is shown in the details below. All of these failures look to be caused by the enable of
Looking at these failures, and at the new commits pulled shown at:
I don't see any commits to Kokkos itself that would account for these new failures. Therefore, I think that the option
@trilinos/amesos2, @trilinos/kokkos, @trilinos/kokkos-kernels, and @trilinos/panzer developers,
Can we get these failures cleaned up pretty quickly? If we don't, we are going to spam developers with CDash error emails every day (which can't happen). If we can't get these cleaned up by, say, tomorrow night, I can demote these
DETAILS
Now to dig into these failing tests and see why they fail and to confirm that it was the enable of
A) Amesos2_KLU2_UnitTests_MPI_2:
https://testing.sandia.gov/cdash/testDetails.php?test=45969470&build=3469152
shows the failure:
Looking at:
the only build where this test is also failing is the build
B) KokkosCore_UnitTest_Cuda_MPI_1:
https://testing.sandia.gov/cdash/testDetails.php?test=45967252&build=3469077
shows the failure:
Looking at the query:
this test only runs in the ATDM builds of Trilinos and only fails in the
C) KokkosKernels_sparse_cuda_MPI_1:
https://testing.sandia.gov/cdash/testDetails.php?test=45967928&build=3469089
shows the failure:
Looking at the query:
this test only runs in the ATDM builds of Trilinos and only fails in the
D) PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4:
https://testing.sandia.gov/cdash/testDetails.php?test=45973239&build=3469238
shows the failure:
Looking at the query:
this test runs in many builds of Trilinos but only fails in the ATDM |
Temporary fix to address issue trilinos#2471 by disabling failing tests when KOKKOS_ENABLE_DEBUG=ON until failure is triaged and fixed.
@bartlettroscoe PR #2472 merged - disables the failing Amesos2 KLU tests when debugging is enabled. |
The |
How do we fix that? Note that when you turn the debug check off, it seems to run just fine. Is this a false check for this system?
Thanks! |
So that just leaves the failing Panzer test showing the debug checking failure:
@rppawlo or @jmgate, is there some Panzer developer that can look into fixing this debug-mode check failure? |
@bartlettroscoe I suspect the Panzer failure has the same root cause as the KokkosKernels failure |
@ibaned, so that means that your fix and push will likely fix that Panzer test too then? Great! Thanks! That means that all of these failures will likely get cleaned up pretty soon then. |
See #2471 and kokkos/kokkos#1503 Out-of-bounds access when sorting an empty bin.
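For readers unfamiliar with this class of bug, here is a minimal C++ sketch of guarding against an empty bin before sorting it. It is not the Kokkos sorting code and not the actual change in kokkos/kokkos#1503. The function name `sort_within_bins` and the offsets layout are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Kokkos API): sorts the values inside each bin.
// Bin b owns the half-open index range [offsets[b], offsets[b + 1]).
void sort_within_bins(std::vector<int>& values,
                      const std::vector<std::size_t>& offsets) {
  for (std::size_t b = 0; b + 1 < offsets.size(); ++b) {
    const std::size_t begin = offsets[b];
    const std::size_t end = offsets[b + 1];
    // An empty bin has begin == end. std::sort would tolerate that, but a
    // hand-written kernel that reads values[begin] first would index past
    // the bin (and possibly past the array), so skip empty bins explicitly.
    if (begin >= end) continue;
    std::sort(values.begin() + begin, values.begin() + end);
  }
}
```

With offsets `{0, 3, 3, 5}` the middle bin is empty and is skipped, while the first and last bins are sorted in place. A bounds-checking debug build is exactly what makes a missing guard of this kind visible.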
@bartlettroscoe The fix is in pull request #2476 |
Looks like the merge of PR #2476 fixed the failing tests
But it did not fix the failing test
As a reminder, that test shows the failure:
@ibaned, any idea how to fix this last failing test? Why does this test pass just fine when ? |
How do we get this Kokkos kernel to stop doing that? Note that this test also fails in the same way on 'white' as shown at:
which shows:
Does the Kokkos team run this same unit test on other CUDA platforms and does it pass there? If so, where can we see the results for that? |
We could disable the test for K80 architectures.... |
@swbova just saw this fail on P100s as well with KOKKOS_DEBUG on. |
This test requests a hardcoded number of 32 CUDA threads per warp, but with debugging enabled the CUDA kernel uses too many registers and can only run on 16 threads per warp max. [kokkos/kokkos#1514, kokkos/kokkos#1513, #2471]
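As an illustration of the problem described in this commit message, the sketch below clamps a requested launch width to whatever limit the kernel can actually run with, instead of hardcoding 32. This is only an alternative way to think about the failure; it is not what PR #2494 did (per the issue status, that PR disabled the unit test). The function name `choose_launch_width` and its parameters are hypothetical, and how the limit is queried in a real build is left out.

```cpp
#include <algorithm>
#include <stdexcept>

// Hypothetical helper (not a Kokkos API): returns a launch width that never
// exceeds the limit the kernel supports in the current build.
// `requested` is what the test asks for (for example 32) and `max_supported`
// is the limit in effect (for example 16 when debug checks raise register use).
int choose_launch_width(int requested, int max_supported) {
  if (requested <= 0 || max_supported <= 0) {
    throw std::invalid_argument("launch widths must be positive");
  }
  return std::min(requested, max_supported);
}
```

With the numbers from the commit message, `choose_launch_width(32, 16)` yields 16, so the request would fit in a debug build and stay at 32 wherever the limit allows it.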
Pull request #2494 should fix the failing Kokkos unit test. |
The test
Therefore, I believe that this issue is resolved. I am not going to mark this issue with the new "Disabled Tests" label because this was a very targeted disable, and this one unit test just does not seem to be written in a way that works on GPUs with debug checking turned on. Therefore, I am just going to close this. Now every Trilinos user and developer will have Kokkos debug-mode checking turned on by default when they configure with
@ibaned, thanks for all of your help in getting these failures cleaned up! Closing as complete! |
Just to wrap things up, note that the test |
CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/amesos2, @trilinos/anasazi, @trilinos/panzer
Next Action Status
PR #2476 fixed two of the tests on 3/30/2018, and PR #2494 disabled a single unit test on 4/3/2018 that is not appropriate to run on GPUs.
Description
As shown in the query:
several tests are timing out today and failing in the ATDM -debug builds of Trilinos:
Amesos2_KLU2_UnitTests_MPI_2
Anasazi_Epetra_ModalSolversTester_MPI_4
KokkosCore_UnitTest_Cuda_MPI_1
KokkosKernels_sparse_cuda_MPI_1
PanzerMiniEM_MiniEM-BlockPrec_Augmentation_MPI_4
The failing tests from the above query, and the platforms on which they fail, are shown in the table below:
Table of failing tests
Except for the failing Anasazi tests in the build Trilinos-atdm-white-ride-cuda-debug (which I will write another GitHub issue for), all of these tests (even the timeouts) seem to be failing because the debug-mode checks enabled by KOKKOS_ENABLE_DEBUG=ON (see #2439) are failing and throwing exceptions. In the case of the failing test Amesos2_KLU2_UnitTests_MPI_2, for example, it shows:
This exception causes a hang and a timeout in some cases and fails quickly and aborts in other cases. (So much for assuming that one MPI process throwing an exception will bring down an MPI job in all cases.)
Many of these builds have been promoted to the "ATDM" CDash group/track and therefore triggered CDash error emails today. Therefore, this must get fixed quickly if possible (or we will need to demote these builds again).
Steps to Reproduce
One can log onto white (SON) or ride (SRN) and then reproduce the build and tests as described at:
I just reproduced many of these failures on 'white' using
This showed the test results:
The timeout of the test PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 was also seen in #2446. Not sure why that test timed out when run locally but not in the driver jobs. But otherwise, this one build reproduced all of the failing tests shown on CDash except for the test Anasazi_Epetra_ModalSolversTester_MPI_4 (which does not look to be related to KOKKOS_ENABLE_DEBUG=ON).
Related Issues