
Bump to PyTorch 2.0.0 #165

Merged 3 commits on Apr 4, 2023
Conversation

Tobias-Fischer
Contributor

@Tobias-Fischer Tobias-Fischer commented Apr 2, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

@conda-forge-webservices
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@Tobias-Fischer
Contributor Author

Fixes #151

@Tobias-Fischer Tobias-Fischer marked this pull request as draft April 2, 2023 21:10
@Tobias-Fischer
Contributor Author

Still struggling with some issues.

  1. For the non-CUDA build on linux-64, at least locally, the run dependencies of the pytorch package are not pulled in by the pytorch-cpu package (which is really weird), and this causes the build to fail because sympy cannot be found when importing torch.

  2. Locally the cuda build fails with

conda.CondaMultiError: The package for cudnn located at /home/conda/feedstock_root/build_artifacts/pkg_cache/cudnn-8.4.1.50-hed8a83a_0
appears to be corrupted. The path 'bin/.cudnn-post-link.sh'
specified in the package manifest cannot be found.

The package for cudnn located at /home/conda/feedstock_root/build_artifacts/pkg_cache/cudnn-8.4.1.50-hed8a83a_0
appears to be corrupted. The path 'include/cudnn.h'
specified in the package manifest cannot be found.
  3. Have not tested the osx builds.

Let's see what happens in CI.

@Tobias-Fischer
Contributor Author

It seems like this is working quite well :)

Unfortunately linux-aarch64 errors with:

ImportError: /home/conda/feedstock_root/build_artifacts/pytorch-recipe_1680469918528/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: _ZNK6google8protobuf7Message11GetTypeNameEv

Any ideas where this issue might be from @conda-forge/pytorch-cpu @conda-forge/libprotobuf @conda-forge/protobuf? It seems like it could be related to the gcc version (NVIDIA-AI-IOT/torch2trt#53)? I can only find this issue on conda-forge but it does not seem to apply here (mixing defaults and conda-forge channels): conda-forge/paraview-feedstock#23

Several issues hint towards problems with onnx/caffe2.
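One way to investigate this kind of undefined-symbol failure is to probe the library directly. A minimal stdlib sketch follows; it uses libc's printf as a stand-in target, since probing libtorch_cpu.so for the mangled protobuf symbol requires the built artifact:

```python
import ctypes

# Sketch: check whether a loaded shared object resolves a given symbol.
# We probe the running process's own namespace (CDLL(None)) for "printf"
# as a stand-in; on the failing build one would dlopen libtorch_cpu.so
# and look up the mangled protobuf symbol from the traceback instead.
def has_symbol(lib: ctypes.CDLL, symbol: str) -> bool:
    try:
        getattr(lib, symbol)  # raises AttributeError if unresolved
        return True
    except AttributeError:
        return False

lib = ctypes.CDLL(None)  # handle to the current process's symbols
print(has_symbol(lib, "printf"))          # True on Linux/macOS
print(has_symbol(lib, "no_such_symbol"))  # False
```

On the actual artifact, `nm -D libtorch_cpu.so | grep GetTypeName` would also show whether the protobuf symbol is undefined (`U`) or provided (`T`).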

@hmaarrfk
Contributor

hmaarrfk commented Apr 3, 2023

My guess is that somebody is re-exporting the symbols publicly. It may be the vendored onnx?

@Tobias-Fischer Tobias-Fischer marked this pull request as ready for review April 3, 2023 09:33
@Tobias-Fischer
Contributor Author

Looks like this is now ready for review. Could someone else please test a cuda build locally?

@hmaarrfk
Contributor

hmaarrfk commented Apr 3, 2023

I'm building now. About 6 hours per build, so about 12 hours to go.

Do we have a "test" script for GPU usage?

I'm kinda out of creative stamina for the day, so if you have an idea I would be all ears.

@Tobias-Fischer
Contributor Author

Thanks @hmaarrfk!

We can test the gpu builds with

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1

@hmaarrfk
Contributor

hmaarrfk commented Apr 3, 2023

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> print(torch.__version__)
2.0.0.post200

is the post expected?

@hmaarrfk
Contributor

hmaarrfk commented Apr 3, 2023

Builds starting to upload https://anaconda.org/mark.harfouche/pytorch/files if people want a more thorough test.

@Tobias-Fischer
Contributor Author

There are a few instances of post200 when searching on the PyTorch repo: https://github.com/search?q=repo%3Apytorch%2Fpytorch+post200&type=issues

Not sure what it refers to.

@hmaarrfk
Contributor

hmaarrfk commented Apr 3, 2023

Of course, pytorch needs to have a custom version numbering system ^_^

https://github.com/pytorch/pytorch/blob/73b06a0268bb89c09a86f16fa0f72818baa4b250/tools/generate_torch_version.py#L51

They use the CONDA_BUILD_NUMBER and add it to the version number.

I can't fully trace it, but it seems to be related to the build number.

Let's maybe flag this as an issue, but I don't want to patch too much at this stage.
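As a rough sketch of the behavior described above, assuming the suffix simply comes from appending the build number as a PEP 440 post-release segment (the function name is illustrative, not pytorch's actual code):

```python
# Illustrative reconstruction of how a ".postN" suffix can arise from
# the conda build number, per the generate_torch_version.py logic
# linked above. This is a sketch, not pytorch's actual implementation.
def torch_style_version(base: str, build_number: int) -> str:
    if build_number > 1:
        return f"{base}.post{build_number}"
    return base

print(torch_style_version("2.0.0", 200))  # 2.0.0.post200
print(torch_style_version("2.0.0", 0))    # 2.0.0
```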

@Tobias-Fischer
Contributor Author

I don't think it's a big deal, is it? If you really wanted to, we could change https://github.com/Tobias-Fischer/pytorch-cpu-feedstock/blob/5f00d8033d7cc23eb9831c74dc038fe3e4047562/recipe/build_pytorch.sh#LL56 to 0 instead.

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

I think it is ok as is.
Ah, thank you. I searched pytorch's repo for that statement but forgot that maybe we just set it...

Eventually, I think somebody's version check will be broken, but we can deal with it later.

@Tobias-Fischer
Contributor Author

So far no one has complained, and we've had the same situation in past conda-forge builds:

>>> torch.__version__
'1.13.0.post200'

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

Yeah. I agree. I think it's fine.

Just waiting for the builds at this stage.

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

Linux CUDA log files

log_files.zip

@hmaarrfk hmaarrfk merged commit 1a3257e into conda-forge:main Apr 4, 2023
@h-vetinari
Member

I thought that 2.0 needed Triton as a backend, or is that optional...?

@ngam
Contributor

ngam commented Apr 4, 2023

The post.xxx has always been there. I believe it was due to building from a clone of the git repo...

Any comment on triton btw? Are we building with all new 2.x capabilities?

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

Hm, sorry, I forgot to test for triton. Is there a "test" you would like me to try on a GPU?

@ngam
Contributor

ngam commented Apr 4, 2023

Not sure, I haven't been using GPUs for a few months now, but I can test when I get a chance in a few weeks. Let's keep this on our radar for 2.0.1 (unless someone complains before then).

Note "torchtriton" in the uploads from the PyTorch channel:

linux-64/pytorch-2.0.0-py3.8_cuda11.8_cudnn8.7.0_0.tar.bz2

No Description

Uploaded Fri Mar 10 00:57:19 2023
md5 checksum fc92239ea8aa4ba12cd8305a1505f78a
arch x86_64
build py3.8_cuda11.8_cudnn8.7.0_0
constrains cpuonly <0
depends blas * mkl, filelock, jinja2, mkl >=2018, networkx, python >=3.8,<3.9.0a0, pytorch-cuda >=11.8,<11.9, pytorch-mutex 1.0 cuda, sympy, torchtriton 2.0.0, typing_extensions
has_prefix True
license BSD 3-Clause
license_family BSD
machine x86_64
operatingsystem linux
platform linux
subdir linux-64
target-triplet x86_64-any-linux
timestamp 1678406617283

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

Linux compiling just takes "time" but it is easy to start, so if somebody wants to add triton support, I can rebuild linux.

@Tobias-Fischer
Contributor Author

  • I am a bit confused about what depends on what. As far as I can see, pytorch depends on torchtriton, which in turn seems to depend on pytorch.
  • Is there any difference between torchtriton and triton?
  • We already package an old version of triton in conda-forge and have a PR open for version 2.0.0 (triton v2.0.0 triton-feedstock#2). Would this 2.0.0 version be suitable as dependency for here?

@Tobias-Fischer
Contributor Author

Another issue seems to be that I can't install a recent version of torchvision and pytorch=2 side-by-side; I managed to kill mamba ;)

> mamba create -n pytorch21 pytorch=2 torchvision=0.14
python: /home/conda/feedstock_root/build_artifacts/mamba-split_1680002410624/work/libmamba/src/core/satisfiability_error.cpp:1767: mamba::{anonymous}::TreeExplainer::node_t mamba::{anonymous}::TreeExplainer::concat_nodes(const std::vector<long unsigned int>&): Assertion `std::all_of( ids.begin(), ids.end(), [&](auto id) { return m_pbs.graph().node(ids.front()).index() == m_pbs.graph().node(id).index(); } )' failed.

If I don't specify the torchvision version, it pulls in a very old torchvision which is definitely not compatible...

@hmaarrfk
Contributor

hmaarrfk commented Apr 4, 2023

pytorch run-exports itself, so torchvision has to be rebuilt against the new version.
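For context, run exports are declared in a recipe's meta.yaml; a hedged sketch of the mechanism (illustrative pin, not necessarily the feedstock's actual recipe):

```yaml
# Illustrative run_exports stanza for a conda recipe: packages built
# against this one automatically gain a matching runtime pin, which is
# why a torchvision built against pytorch 1.x cannot co-install with 2.0.
build:
  run_exports:
    - {{ pin_subpackage('pytorch', max_pin='x.x') }}
```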

@ngam
Contributor

ngam commented Apr 5, 2023

Let's move the discussion to #166 to better keep track. The issue with torchvision should be resolved with the migrator or with a manual rebuild.
