forked from pytorch/builder
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Refiled] IFU-main-2023-07-31 #36
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* add 12.1 workflow for docker image build * add github workflow * update cuDNN to 8.8.1 and location for archive
* Do not use ftp `s#ftp://#https://#` * Remove no-longer relevant comment
* add magma build for CUDA12.1 * copy and fix CMake.patch; drop sm_37 for CUDA 12.1
* remove CUDA 11.6 builds * remove more 11.6 builds
* enable nightly CUDA 12.1 builds * fix version typo
* Remove special case for Python 3.11 * Remove install torch script
* Windows CUDA 12.1 changes * add CUDA version checks for Windows MAGMA builds * use magma branch without fermi arch
And add `12.1` to the matrix Test plan: `conda build . -c nvidia` and observe https://anaconda.org/malfet/pytorch-cuda/files
As 10.9 was release a decade ago and for that reason yet not supported C++17 standard. Similar to pytorch/pytorch#99857
To fix builds, though we should really target 11.0 at the very least
* Fix nvjitlink inclusion in 12.1 wheels * Fix typo
As it should be part of the AMI
* Fix wheel validations * Try using upgrade flag instead * try uninstall * test * Try using python3 * use python3 vs python for validation * Fix windows vs other os python execution * Uninstall fix
…torch#1444) More arm64 changes test run under environment sleep 15min allow investigate add sleep test test Test test test Arm64 use python fix test testing test tests testing test test
Use [`nvidia/cuda:11.4.3-devel-centos7`](https://hub.docker.com/layers/nvidia/cuda/11.4.3-devel-centos7/images/sha256-e2201a4954dfd65958a6f5272cd80b968902789ff73f26151306907680356db8?context=explore) because `nvidia/cuda:10.2-devel-centos7` was deleted in accordance with [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md): > After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned. Also delete redundant DEVTOOLSET=7 clause
Followup after pytorch#1446 CUDA-10.2 and moreover CUDA-9.2 docker images are gone per [Nvidia's Container Support Policy](https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md): > After a period of Six Months time, the EOL tags WILL BE DELETED from Docker Hub and Nvidia GPU Cloud (NGC). This deletion ensures unsupported tags (and image layers) are not left lying around for customers to continue using after they have long been abandoned. Also, as all our Docker script install CUDA toolkit anyway, what's the point of using `nvidia/cuda` images at all instead of `centos:7`/`ubuntu:18.04` that former are based on, according to https://gitlab.com/nvidia/container-images/cuda/-/blob/master/dist/11.4.3/centos7/base/Dockerfile Explicitly install `g++` to `libtorch/Docker` base image, as it's needed by `patchelf` Please note, that `libtorch/Docker` can not be completed without buildkit, as `rocm` step depends on `python3` which is not available in `cpu` image
Not sure, what weird version of `wget` is getting installed, but attempt to download https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2 fails with: ``` --2023-07-06 03:18:38-- https://anaconda.org/pytorch/magma-cuda121/2.6.1/download/linux-64/magma-cuda121-2.6.1-1.tar.bz2 Resolving anaconda.org (anaconda.org)... 104.17.93.24, 104.17.92.24, 2606:4700::6811:5d18, ... Connecting to anaconda.org (anaconda.org)|104.17.93.24|:443... connected. ERROR: cannot verify anaconda.org's certificate, issued by ‘/C=US/O=Let's Encrypt/CN=E1’: Issued certificate has expired. To connect to anaconda.org insecurely, use `--no-check-certificate'. ``` Also, switch from NVIDIA container to a stock `centos:7` one, to make containers slimmer and fit on standard GitHub Actions runners.
And add `nvcc` to path Regression introduced by pytorch#1447 when NVIDIA image was dropped in favor of base `centos` image
As NNC is dead, and llvm dependency has not been updated in last 4 years First step towards fixing pytorch/pytorch#103756
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers has only one version of CUDA, there is no need to select any Nor setup `LD_LIBRARY_PATH` as it does not match the setup users might have on their system (but keep it for libtorch tests) Should fix crash due to different minor version of cudnn installed in docker container and specified as dependency to a small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
This reverts commit ed9a2ae.
* Rebuild docker images on release * Include with-push
I.e. applying the same changes as in pytorch@4a7ed14 to libtorch docker builds
This is a reland of pytorch#1451 with an important to branches filter: entries in multi-line array definition should start with `-` otherwise it were attempting to match branch name `main\nrelease/*` I.e. just copy-n-paste example from https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#using-filters Test plan: actionlint .github/workflows/build-manywheel-images.yml Original PR description: Rebuild docker images on release builds. It should also tag images for release here: https://github.com/pytorch/builder/blob/3fc310ac21c9ede8d0ce13ec71096820a41eb9f4/conda/build_docker.sh#L58-L60 This is first step in pinning docker images for release.
As [`pytorch/manylinux-builder`](https://hub.docker.com/r/pytorch/manylinux-builder) containers has only one version of CUDA, there is no need to select any Nor setup `LD_LIBRARY_PATH` as it does not match the setup users might have on their system (but keep it for libtorch tests for now) Should fix crash due to different minor version of cudnn installed in docker container and specified as dependency to a small wheel package, seen here https://github.com/pytorch/pytorch/actions/runs/5478547018/jobs/9980463690
* Update manywheel and libtorch images to rocm5.6 * Add MIOpen branch for ROCm5.6
* Add msccl-algorithms directory to PyTorch wheel * Bundle msccl-algorithms into wheel * Use correct src path for msccl-algorithms (cherry picked from commit 95b5af3) * Add hipblaslt dependency for ROCm5.6 onwards * Update build_all_docker.sh to ROCm5.6
…#1462) * Fix lapack missing and armcl update * update ARMCL version
…pdate msccl path for ROCm5.7 (cherry picked from commit 36c10cc)
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Re-attempt for #35
Mistakenly hit "Squash and merge" when it should have been a regular "Merge" to maintain individual commits.
This IFU PR brings in the following notable changes for ROCm:
Using -complete base images for CentOS and Ubuntu
Removing install_rocm.sh step from Dockerfiles because they're not needed if using the -complete images
A fix for erroneous logic regarding msccl-algorithm files not being included for ROCm5.6 or above.
Tested successfully for wheels via http://rocmhead.amd.com:8080/job/pytorch/job/dev/job/manylinux_rocm_wheels/243 using ROCm5.7 RC1 (build 7) and PyTorch 2.0