Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lbgpu node vel #2878

Merged
merged 10 commits into from
Jul 11, 2019
Merged

Lbgpu node vel #2878

merged 10 commits into from
Jul 11, 2019

Conversation

fweik
Copy link
Contributor

@fweik fweik commented May 31, 2019

Description of changes:

  • Factored out velocity getter, and obey boundary velocity for
    both interpolation schemes,
  • Removed some globals
  • Leak less

@fweik fweik requested a review from KaiSzuttor May 31, 2019 18:51
@codecov
Copy link

codecov bot commented May 31, 2019

Codecov Report

Merging #2878 into python will decrease coverage by <1%.
The diff coverage is n/a.

Impacted file tree graph

@@          Coverage Diff           @@
##           python   #2878   +/-   ##
======================================
- Coverage      82%     82%   -1%     
======================================
  Files         525     525           
  Lines       26807   26807           
======================================
- Hits        22015   22013    -2     
- Misses       4792    4794    +2
Impacted Files Coverage Δ
src/core/electrostatics_magnetostatics/p3m.cpp 85% <0%> (-1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 326c261...e416e7e. Read the comment docs.

@fweik
Copy link
Contributor Author

fweik commented Jun 3, 2019

@mkuron do you have seen the rocm error before? (..### HCC STATUS_CHECK Error: HSA_STATUS_ERROR_INVALID_ISA (0x100f) at file:mcwamp_hsa.cpp line:1193, looks like a toolchain issue to me)

@mkuron
Copy link
Member

mkuron commented Jun 3, 2019

No idea. Could be some variation of an out-of-memory error or out-of-registers.

@KaiSzuttor
Copy link
Member

/**
   * The instruction set architecture is invalid.
   */
  HSA_STATUS_ERROR_INVALID_ISA = 0x100F,

https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/faa3ae51388517353afcdaf9c16621f879ef0a59/src/inc/hsa.h#L196-L199

@KaiSzuttor
Copy link
Member

actually I don't understand why the compilation is not terminated, there are a number of compile errors:

/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:104:37: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
        int_pow<2 * cao>(Utils::sinc(meshi.x * nx) * Utils::sinc(meshi.y * ny) *
                                    ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:104:37: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:104:65: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
        int_pow<2 * cao>(Utils::sinc(meshi.x * nx) * Utils::sinc(meshi.y * ny) *
                                                                ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:104:65: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:105:37: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
                         Utils::sinc(meshi.z * nz));
                                    ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:105:37: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:151:39: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
          U2 = pow((double)Utils::sinc(meshi.x * nmx) *
                                      ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:151:39: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:152:35: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
                       Utils::sinc(meshi.y * nmy) * Utils::sinc(meshi.z * nmz),
                                  ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:152:35: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:152:64: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
                       Utils::sinc(meshi.y * nmy) * Utils::sinc(meshi.z * nmz),
                                                               ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:152:64: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:210:39: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
          U2 = pow((double)Utils::sinc(meshi.x * nmx) *
                                      ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:210:39: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:211:35: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
                       Utils::sinc(meshi.y * nmy) * Utils::sinc(meshi.z * nmz),
                                  ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:211:35: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:211:64: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
                       Utils::sinc(meshi.y * nmy) * Utils::sinc(meshi.z * nmz),
                                                               ^
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:211:64: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ik_i'
/builds/espressomd/espresso/src/core/electrostatics_magnetostatics/p3m_gpu_error_cuda.cu:273:39: error:  'Utils::sinc':  no overloaded function has restriction specifiers that are compatible with the ambient context 'p3m_k_space_error_gpu_kernel_ad_i'
          U2 = pow((double)Utils::sinc(meshi.x * nmx) *
                                      ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
20 errors generated.

@mkuron
Copy link
Member

mkuron commented Jun 6, 2019

Good find. So this is actually something like an unresolved symbol error.

actually I don't understand why the compilation is not terminated

That‘s a known hipcc bug. It doesn‘t produce nonzero exit codes if compilation failed. Usually that‘s not a problem because the object file would subsequently be missing and CMake terminates anyway, but in the current case, an (incomplete?) object file seems to have already been written. That issue was fixed last month: ROCm/HIP#1117.

@jngrad
Copy link
Member

jngrad commented Jun 11, 2019

Concerning the no overloaded function has restriction specifiers compiler error, the explanation can be found in ROCm/HIP#374 (comment). The error message disappears when using #if defined(__CUDACC__) or defined(__HIPCC__) instead of

@fweik fweik changed the title Lbgpu node vel WIP: Lbgpu node vel Jun 12, 2019
@fweik
Copy link
Contributor Author

fweik commented Jun 12, 2019

@KaiSzuttor there is a merge regression now that reintroduces the ek_params stuff?

@KaiSzuttor
Copy link
Member

I'll try again...

*/
__global__ void integrate(LB_nodes_gpu n_a, LB_nodes_gpu n_b, LB_rho_v_gpu *d_v,
LB_node_force_density_gpu node_f,
EK_parameters *ek_parameters_gpu,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here e.g.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was intended by this PR, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, those are not needed and were removed in the master after this branch was forked of

@KaiSzuttor
Copy link
Member

KaiSzuttor commented Jun 12, 2019

was there before merge:

EK_parameters *ek_parameters_gpu,

@fweik
Copy link
Contributor Author

fweik commented Jun 12, 2019

Still this reverts #2877 which it should not.

@KaiSzuttor
Copy link
Member

puh, then maybe you should merge...

@fweik
Copy link
Contributor Author

fweik commented Jun 12, 2019

@KaiSzuttor

@fweik fweik changed the title WIP: Lbgpu node vel Lbgpu node vel Jun 12, 2019
@fweik
Copy link
Contributor Author

fweik commented Jul 9, 2019

This is too difficult to debug for me, giving up.

@fweik fweik closed this Jul 9, 2019
float *boundary_velocity = nullptr;
int *boundary_node_list = nullptr;
int *boundary_index_list = nullptr;
size_t size_of_boundindex = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These variables shadow globals, and these globals are not even used anymore. It's possible that that was the cause of the HSA_STATUS_ERROR_INVALID_ISA you were seeing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That‘s just a generic error code. It is used in many places for which no specific error code has been defined.

@KaiSzuttor
Copy link
Member

@fweik it's only the rocm test that fails... maybe we should not throw away this PR just because of that

@fweik
Copy link
Contributor Author

fweik commented Jul 9, 2019

Well you know where to find the code. I'm done with this.

@mkuron
Copy link
Member

mkuron commented Jul 9, 2019

This is where the error comes from: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/e2d1950bd8fc83dd5b8352a41b829a7c57fc1073/src/core/runtime/amd_aql_queue.cpp#L820. It supposedly means "Out of VGPRs".

On CUDA, an out-of-registers condition would cause an overflow into global memory -- slowing things down, but still working. I guess we don't have infrastructure in place that would detect this kind of performance regression on CUDA though. So it's plausible that a function requiring a very large number of registers would be broken on AMD. Adding -Xptxas=-vto the CUDA flags reveals a few functions that are close to the limit (63 on sm_30 and 255 on later ones) and have probably already overflowed into global memory.

I think AMD can deal with 255 VGPRs too, but it's unclear to me how to measure how many of them a kernel is using. Maybe this pull request just crosses the limit. To reduce register consumption, large kernels would need to be broken up into multiple smaller ones.

@mkuron
Copy link
Member

mkuron commented Jul 9, 2019

It's the call to thrust::transform(..., lb_get_interpolated_velocity_gpu<27>(...)). thrust::transform(..., lb_get_interpolated_velocity_gpu<8>(...)) works fine. #2982 fixes the issue.

bors bot added a commit that referenced this pull request Jul 10, 2019
2982: Reduce excessive loop unrolling in lbgpu velocity interpolation r=KaiSzuttor a=mkuron

This caused excessive register usage, especially when combined with thrust. Issue discovered by @fweik in #2878.

It turns out that this is a problem for CUDA too, it just exhibits a different behavior. Instead of crashing like on HIP, CUDA just produces a large binary and slower code. In a perfect world, the compiler should display a warning, but I guess neither AMD nor Nvidia operate in a perfect world.

Co-authored-by: Michael Kuron <[email protected]>
@KaiSzuttor KaiSzuttor reopened this Jul 10, 2019
@jngrad
Copy link
Member

jngrad commented Jul 10, 2019

@KaiSzuttor the webhook didn't trigger CI for 81b9c33

@KaiSzuttor
Copy link
Member

bors r+

bors bot added a commit that referenced this pull request Jul 11, 2019
2878: Lbgpu node vel r=KaiSzuttor a=fweik

Description of changes:
 - Factored out velocity getter, and obey boundary velocity for
   both interpolation schemes,
- Removed some globals
- Leak less


Co-authored-by: Florian Weik <[email protected]>
Co-authored-by: RudolfWeeber <[email protected]>
Co-authored-by: Kai Szuttor <[email protected]>
Co-authored-by: Kai Szuttor <[email protected]>
@bors
Copy link
Contributor

bors bot commented Jul 11, 2019

Build succeeded

@bors bors bot merged commit e416e7e into espressomd:python Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants