Conflicting SPIR-V versions when linking to atomic-ops.spir #868

Closed
fcharras opened this issue Jan 11, 2023 · 8 comments · Fixed by #1103
Labels: user (User submitted issue)

Comments

@fcharras

fcharras commented Jan 11, 2023

I'm having trouble using atomics from atomic_ops.spir. I get the following error message:

...<elided traceback>...

popenargs = (['spirv-link', '--allow-partial-linkage', '-o', '/tmp/tmpcp_knjn_/2-linked-spirv', '/tmp/tmpcp_knjn_/1-generated-spirv', '/opt/venv/lib/python3.9/site-packages/numba_dpex/ocl/atomics/atomic_ops.spir'],)
kwargs = {}, retcode = 1
cmd = ['spirv-link', '--allow-partial-linkage', '-o', '/tmp/tmpcp_knjn_/2-linked-spirv', '/tmp/tmpcp_knjn_/1-generated-spirv', '/opt/venv/lib/python3.9/site-packages/numba_dpex/ocl/atomics/atomic_ops.spir']

    def check_call(*popenargs, **kwargs):
        """Run command with arguments.  Wait for command to complete.  If
        the exit code was zero then return, otherwise raise
        CalledProcessError.  The CalledProcessError object will have the
        return code in the returncode attribute.
    
        The arguments are the same as for the call function.  Example:
    
        check_call(["ls", "-l"])
        """
        retcode = call(*popenargs, **kwargs)
        if retcode:
            cmd = kwargs.get("args")
            if cmd is None:
                cmd = popenargs[0]
>           raise CalledProcessError(retcode, cmd)
E           subprocess.CalledProcessError: Command '['spirv-link', '--allow-partial-linkage', '-o', '/tmp/tmpcp_knjn_/2-linked-spirv', '/tmp/tmpcp_knjn_/1-generated-spirv', '/opt/venv/lib/python3.9/site-packages/numba_dpex/ocl/atomics/atomic_ops.spir']' returned non-zero exit status 1.

/opt/pyenv/versions/3.9.16/lib/python3.9/subprocess.py:373: CalledProcessError

error: 1: Conflicting SPIR-V versions: 1.4 (input modules 1 through 1) vs 1.0 (input module 2).

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/numba_dpex/spirv_generator.py", line 137, in __del__
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpcp_knjn_/2-linked-spirv'

What could cause such a version mismatch? I'm trying to get a minimal reproducer, but the error does not seem to trigger for all atomics calls. I will update.

I'm using a custom numba-dpex build based on 0.19.0, with an up-to-date environment (2023 oneAPI releases, dpctl >= 0.14.1dev1).

(I don't think there are differences between my build environment and the runtime environment. I'm using the spirv-tools binaries from the Ubuntu jammy repositories.)

For GPU, the error can be circumvented by using native atomics.

Edit: the bug seems to come down to this: the atomic_ops.spir binary has a SPIR-V version that is fixed at build time, while the JIT can, in some cases, emit a different SPIR-V version for the kernels. Modules with different versions are not compatible, so the linker fails. In my case, the SPIR-V version of atomic_ops.spir is 1.0, and I can fix the bug by passing --spirv-max-version 1.0 to the llvm-spirv call at https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/spirv_generator.py#L83 . I cannot explain, however, why llvm-spirv suddenly starts outputting SPIR-V 1.3 for some of my kernels 🤔
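For anyone who wants to check which version each module carries before linking, here is a minimal sketch. It relies only on the SPIR-V physical layout (word 0 is the magic number 0x07230203, word 1 encodes the version as 0x00MMmm00). The function names `spirv_version`, `read_spirv_version` and `has_version_conflict` are made up for this illustration and are not part of numba-dpex or spirv-tools.

```python
import struct

SPIRV_MAGIC = 0x07230203  # first word of every SPIR-V module


def spirv_version(header):
    """Return (major, minor) decoded from the first 8 bytes of a module."""
    if len(header) < 8:
        raise ValueError("need at least 8 bytes of SPIR-V header")
    # The magic number also reveals the endianness of the file.
    for fmt in ("<II", ">II"):
        magic, version = struct.unpack(fmt, header[:8])
        if magic == SPIRV_MAGIC:
            # Version word layout, high byte to low byte: 0 | major | minor | 0
            return (version >> 16) & 0xFF, (version >> 8) & 0xFF
    raise ValueError("not a SPIR-V module (bad magic number)")


def read_spirv_version(path):
    """Hypothetical helper: decode the version of a module stored on disk."""
    with open(path, "rb") as f:
        return spirv_version(f.read(8))


def has_version_conflict(headers):
    """True if the given module headers do not all share one version."""
    return len({spirv_version(h) for h in headers}) > 1


if __name__ == "__main__":
    jit_kernel = struct.pack("<II", SPIRV_MAGIC, 0x00010400)  # SPIR-V 1.4
    atomic_ops = struct.pack("<II", SPIRV_MAGIC, 0x00010000)  # SPIR-V 1.0
    print(spirv_version(jit_kernel), spirv_version(atomic_ops))
    print(has_version_conflict([jit_kernel, atomic_ops]))
```

Calling `read_spirv_version` on the generated module and on atomic_ops.spir should show the same two versions that spirv-link reports in its error message.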

@fcharras
Author

fcharras commented Jan 11, 2023

It turns out that it works if I add the argument --spirv-max-version 1.0 to the llvm-spirv calls at https://github.com/IntelPython/numba-dpex/blob/main/numba_dpex/spirv_generator.py#L83 (see the edit in the OP).
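As a rough sketch of what that change amounts to, the snippet below builds an llvm-spirv argument list with the version cap. The function name `llvm_spirv_cmd`, its parameters and the file names are assumptions for illustration; the actual code in spirv_generator.py is not reproduced here, and the flag is spelled exactly as in the comment above.

```python
def llvm_spirv_cmd(bitcode_path, spirv_path, max_version="1.0", binary="llvm-spirv"):
    """Build the argument list for an llvm-spirv call capped at max_version.

    Hypothetical helper: the real numba-dpex call site may pass other flags.
    """
    return [
        binary,
        "--spirv-max-version", max_version,  # flag as quoted in this thread
        "-o", spirv_path,
        bitcode_path,
    ]


if __name__ == "__main__":
    cmd = llvm_spirv_cmd("kernel.bc", "kernel.spv")
    print(" ".join(cmd))
    # To actually run it (requires llvm-spirv on PATH):
    #   import subprocess
    #   subprocess.check_call(cmd)
```

Capping at 1.0 matches the version of atomic_ops.spir reported in this issue; a differently built atomic_ops.spir would presumably need a different cap.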

@diptorupd
Contributor

@fcharras Can you share the environment details? Our internal CI build server now reproduces the issue reliably, but I am unable to do so on my development system.

@fcharras
Author

fcharras commented Jan 21, 2023

I didn't try to reproduce it with the conda install. Here are the details of our custom development build, where it can be replicated:

  • Everything runs in an Ubuntu 22.04 base Docker container with drivers, ghcr.io/intel/llvm/ubuntu2204_intel_drivers

  • level_zero drivers installed from https://github.com/oneapi-src/level-zero/

  • Intel numpy==1.22.3, Intel scipy==1.7.3, python==3.9

  • Build process:

    • dpctl==0.14.1dev1, dpnp==0.11.0, numba_dpex==0.19.0 built from source using dpcpp after activating the oneAPI basekit 2023, which was installed using the oneAPI online bash installer. Each one is built from source after the previous one has been installed in the environment.

  • For the runtime (the oneAPI basekit is not activated):

    • packages are installed with pip, along with mkl-dpcpp
    • spirv-tools and spirv-headers are installed from the Ubuntu repositories
    • the llvm-spirv binary is extracted from the oneAPI basekit 2023 and manually added to $PATH
    • the lib folder from the Python venv is added to LD_LIBRARY_PATH
    • using OCL_ICD_FILENAMES_RESET=1 OCL_ICD_FILENAMES=libintelocl.so
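Since llvm-spirv and spirv-link come from different sources in this setup, it may help to confirm which binaries are actually picked up at runtime. A small sketch using only the standard library follows; the helper name `locate_tools` is made up for this example.

```python
import shutil


def locate_tools(names=("llvm-spirv", "spirv-link")):
    """Map each tool name to the path found on $PATH, or None if missing."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in locate_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```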

The image with everything installed is available on Docker Hub:

docker pull jjerphan/numba_dpex_dev:latest

A guide to loading into the container is here, and the Dockerfile is there.

That said, I'd be surprised if this can't be reproduced with the conda install, for which we have instructions here.

Also, simple kernels don't trigger the issue. It probably only triggers when using advanced features (local or private memory, dpex funcs, or some combination of those...).

@fcharras
Author

Until you find the adequate fix for the issue, here is our workaround

@diptorupd
Contributor

@fcharras I have applied your workaround for the time being to main. I am following up with our C++ compiler team to look for a better fix.

@fcharras
Author

Thank you for the follow-up. I'd be curious to know whether there are any performance implications of using one SPIR-V version rather than another (or why not just use the latest).

@diptorupd
Contributor

@fcharras The performance implications are worrying me too. I am reaching out to our SPIR-V experts in the dpcpp team. I will update once I hear back.

diptorupd mentioned this issue Jul 26, 2023

@ZzEeKkAa
Contributor

ZzEeKkAa commented Aug 2, 2023

Fixed in #1103

ZzEeKkAa closed this as completed Aug 2, 2023