Ptx assembly aborted due to errors #58491
It has to do with the #include <cstdio>
__global__ void foo_kernel()
{
printf("%s", __func__);
}
void foo()
{
foo_kernel<<<10, 1>>>();
}
|
Bisecting brings me here: 7aa1fa0 |
https://godbolt.org/z/8bMYcf1z7 The debug info directive that ptxas does not like is on line 655:
It should've been A work-around would be to disable GPU-side debug info with |
Thanks for the quick help! Will try the workaround :) |
Is this issue solved? I am encountering this issue with clang and llvm 17.0.6 |
I'm encountering a similar issue with |
Minimal repro:
Removing |
It looks like another case of LLVM generating symbol names with a dot in it and sneaking through our attempts to normalize such names:
The variable itself does have Switching to line-only debug info would work around the issue, too. |
Looked into this quite a bit. It seems the name gets embedded in a debug DIE during the After spending already too much time looking into this and not understanding enough about the guts of the LLVM debug information infrastructure I took the easy way out: Generate pre-defined lvalue names without dots
`.` should be converted to `_$_` by the nvptx-assign-valid-global-names pass as `ptxas` doesn't support dots.
But during the ASMPrinter initialization the global variable name gets embedded in a debug DIE.
There somehow end up being two different `MCSymbol`s for the global variable with only the main one being renamed.
Bug: https://github.com/llvm/llvm-project/issues/58491
--- a/clang/lib/CodeGen/CGExpr.cpp
+++ b/clang/lib/CodeGen/CGExpr.cpp
@@ -3277,7 +3277,12 @@ LValue CodeGenFunction::EmitPredefinedLV
FnName = FnName.substr(1);
StringRef NameItems[] = {
PredefinedExpr::getIdentKindName(E->getIdentKind()), FnName};
- std::string GVName = llvm::join(NameItems, NameItems + 2, ".");
+ std::string GVName;
+ if (CGM.getLangOpts().CUDA && CGM.getLangOpts().CUDAIsDevice) {
+ GVName = llvm::join(NameItems, NameItems + 2, "_$_");
+ } else {
+ GVName = llvm::join(NameItems, NameItems + 2, ".");
+ }
if (auto *BD = dyn_cast_or_null<BlockDecl>(CurCodeDecl)) {
std::string Name = std::string(SL->getString());
if (!Name.empty()) {
|
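To make the renaming discussed in the commit message above concrete, here is a minimal sketch of the dot-to-`_$_` substitution. The helper name `replaceDots` is hypothetical and this is not the code of the nvptx-assign-valid-global-names pass; it only illustrates the string transformation that the thread says `ptxas` needs.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper (not LLVM's implementation): rewrite every '.' in a
// symbol name to "_$_", since ptxas does not accept dots in identifiers.
std::string replaceDots(std::string_view name) {
  std::string out;
  out.reserve(name.size());
  for (char c : name) {
    if (c == '.')
      out += "_$_";  // substitution named in the commit message
    else
      out += c;      // every other character is copied unchanged
  }
  return out;
}
```

Under this sketch, a name of the form `__func__.foo` would become `__func___$_foo`. The bug in this thread is that the copy of the name embedded in debug info does not go through an equivalent step.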
I think we've dealt with a similar issue in the dwarf debug info before. Let me see if I can find it. |
I think I had 2e7e097 in mind, but it may not be helpful here as it was dealing with the concept of private prefixes. Here the symbol which causes the problem is a I believe we did discuss invalid symbol issues in the past, but I do not think it ever went anywhere.
Back to figuring out how to fix this instance.
Oh, well. Looks like we may need to do it the hard way and teach @alexey-bataev Would you happen to have any idea on what would be the best way to get DWARF's symbol references mangled the same way we mangle other symbols in NVPTX? |
I always thought that we need to handle it in the frontend. But it is only my thought, feel free to discard it. |
Avoiding such symbols in the front-end would avoid some of the issues (granted, including this one), but a symbol with a dot may still materialize within LLVM itself. Granted, it may not happen often in practice. It's also possible that such symbol cloning would not be affected by this issue (e.g. if, unlike this case, debug info would point to the same MCSymbol for the cleaned-up name). Here are the options I see:
@dwblaikie If we rename a global symbol how hard is that to find and update references to the symbol from debug info. I suspect we already do that somewhere in LLVM. Can you point me in the right direction? |
I'm not sure there are existing instances of this (as you say, ABI would mostly make it impossible to change symbol names effectively). But if you want to try it: the DISubprogram attached to the function, if it has the mangled name, should be updated. (Maybe it doesn't; maybe it just depends on the actual symbol name of the llvm::Function, in which case you wouldn't have to do anything for debug info.) |
The
I managed to write something to reach that instruction, but not how to reach the |
Sorry, I'm not following that last comment - the DISubprogram is the same one from the Function and from the DILocation. I take it this renaming isn't done at the IR level, OK - so it's not about updating the DISubprogram itself to match a change to the Function, but later than that. Sure enough then - |
I guess I was aiming at the fact that you can't get the |
Ah, yes, DILocations aren't accessible top-down from the DISubprogram, only bottom-up from the DISubprogram's Function's instructions. |
Hello, I have a similar issue with LLVM 18.1.8 and CUDA 12.5. Is that expected? The failing line is
and the message:
It is also failing with main branch:
|
Unfortunately, the issue is still there, and we still do not have a good fix. Disabling GPU-side dwarf debug info with |
@Artem-B Would you object to applying this workaround to master until someone actually dives into the guts of the DI subsystem to find the bug? |
Doing it on clang side would depend on the name mangling implementation details in NVPTX back-end. I think a better approach would be to try intercepting printouts of |
Fine by me. Thanks |
Hello, we are facing the same issue. Was the workaround already merged into master? If yes, what is the version number, for reference? Thank you |
This problem is not fixed yet. You may work around by disabling GPU-side debug info with |
@Artem-B Thank you for the response |
Until now, debug info printed symbol names as-is, which resulted in invalid PTX when the symbols contained characters that are invalid in PTX, e.g. `__PRETTY_FUNCTION.something`.
Debug info is somewhat disconnected from the symbols themselves, so the regular "NVPTXAssignValidGlobalNames" pass can't easily fix them. As "plan B", this patch catches the printout of debug symbols and fixes them as needed.
One gotcha is that the same code path is used to print the names of debug info sections. Those section names start with '.debug'. The dot in those names is nominally illegal in PTX, but debug section names with a dot are accepted as a special case.
The downside of this change is that if someone ever has a `.debug*` symbol that needs to be referred to from the debug info, that label will be passed through as-is and will still produce broken PTX output. If/when we run into a case where we need it to work, we could consider passing through only specific debug section names, or add a mechanism allowing us to tell section names apart from regular symbols.
Fixes llvm#58491
…113216) Until now debug info was printing the symbols names as-is and that resulted in invalid PTX when the symbols contained characters that are invalid for PTX. E.g. `__PRETTY_FUNCTION.something` Debug info is somewhat disconnected from the symbols themselves, so the regular "NVPTXAssignValidGlobalNames" pass can't easily fix them. As the "plan B" this patch catches printout of debug symbols and fixes them, as needed. One gotcha is that the same code path is used to print the names of debug info sections. Those section names do start with a '.debug'. The dot in those names is nominally illegal in PTX, but the debug section names with a dot are accepted as a special case. The downside of this change is that if someone ever has a `.debug*` symbol that needs to be referred to from the debug info, that label will be passed through as-is, and will still produce broken PTX output. If/when we run into a case where we need it to work, we could consider only passing through specific debug section names, or add a mechanism allowing us to tell section names apart from regular symbols. Fixes #58491
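The behavior described in this commit message can be modeled with a small standalone sketch. All names here (`looksLikeDebugSection`, `printableDebugSymbolName`) are hypothetical, and the set of characters treated as valid (alphanumerics, `_`, `$`) and the `_$_` replacement are assumptions borrowed from the earlier discussion in this thread; the actual patch may differ.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Hypothetical check: names starting with ".debug" are assumed to be debug
// section names, which ptxas accepts as a special case despite the dot.
bool looksLikeDebugSection(std::string_view name) {
  return name.substr(0, 6) == ".debug";
}

// Hypothetical model of "catch the printout and fix the name as needed".
// Assumption: only [A-Za-z0-9_$] are valid; anything else becomes "_$_".
std::string printableDebugSymbolName(std::string_view name) {
  if (looksLikeDebugSection(name))
    return std::string(name);  // passed through as-is
  std::string out;
  for (char c : name) {
    bool valid = std::isalnum(static_cast<unsigned char>(c)) != 0 ||
                 c == '_' || c == '$';
    if (valid)
      out += c;
    else
      out += "_$_";
  }
  return out;
}
```

In this model `__PRETTY_FUNCTION.something` is rewritten while `.debug_info` is left alone. It also reproduces the stated downside: a regular symbol that happens to start with `.debug` is indistinguishable from a section name and would pass through unchanged.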
I'm now getting this warning with clang 18.1.8:
But the fix from #113216 came much later than this release 🤷 |
It's hard to tell why you get this warning without the complete command line. Are you by any chance passing it to a plain C++ compilation? If so, the warning would be expected, as the |
@Artem-B I'm passing it to if(CMAKE_CUDA_COMPILER_ID STREQUAL "Clang")
set(CMAKE_CUDA_FLAGS_DEBUG "-g -Xarch_device -g0")
endif() |
What is the complete compiler command line, with all the options, that produces the warning? The warning is likely benign, but it indicates that a GPU-specific option has been passed to a compiler invocation that does not do any GPU-side compilation. It's easy enough to silence, with the downside of potentially silencing other unexpectedly ignored options if/when you run into them. The right way to handle it is to make sure that |
Should be this one:
Any idea how to do it in CMake? |
Ignoring the warning, or adding I'm not particularly familiar with cmake's CUDA-related plumbing, but you may look for the compilation-specific subset of flags. |
Hi!
We are bumping Clang to commit 1ae33bf, and we find that it crashes building CUDA code with this error trace:
Is this a known problem?