
[AMD][Build] Porting dockerfiles from the ROCm/vllm fork #11777

Open
wants to merge 12 commits into base: main
Conversation

gshtras
Contributor

@gshtras gshtras commented Jan 6, 2025

An attempt to unify the build process with how it is done in ROCm/vllm.
The build process is split into:

  • Building the required dependency libraries using Dockerfile.rocm_base. This step is not needed by the end user; it is done by AMD.
  • Building vLLM itself on top of the resulting base image.

In addition to no longer rebuilding the libraries each time, the new base image is much smaller (7GB on Docker Hub vs. 18GB previously), which makes the build process much faster (~10 minutes).
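The split described above can be sketched as two plain docker build invocations. This is an illustrative sketch only: the image tags and the BASE_IMAGE build-arg name are assumptions, not the PR's exact interface.

```shell
# Step 1: build the dependency base image (PyTorch, hipBLASLt, Flash-Attention, ...).
# End users normally skip this; AMD builds and publishes the result.
docker build -f Dockerfile.rocm_base -t rocm-vllm-base:latest .

# Step 2: build vLLM itself on top of the prebuilt base (the fast, ~10 minute part).
docker build -f Dockerfile.rocm --build-arg BASE_IMAGE=rocm-vllm-base:latest -t vllm-rocm .
```

With the libraries baked into the base image, step 2 only compiles vLLM itself, which is what makes the rebuild cycle fast.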


github-actions bot commented Jan 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI runs, a small and essential subset of tests that quickly catches errors. You can run other CI tests on top of it by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation ci/build labels Jan 6, 2025
@gshtras gshtras force-pushed the dockerfile_unification branch from 8016502 to a309129 Compare January 6, 2025 19:44
@mgoin mgoin self-requested a review January 6, 2025 21:19
Signed-off-by: Gregory Shtrasberg <[email protected]>
Dockerfile.rocm: two outdated review threads, resolved
Signed-off-by: Gregory Shtrasberg <[email protected]>
@DaBossCoda

nice

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jan 12, 2025

mergify bot commented Jan 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @gshtras.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 12, 2025
@mergify mergify bot removed the needs-rebase label Jan 13, 2025
Signed-off-by: Gregory Shtrasberg <[email protected]>
@robertgshaw2-redhat
Collaborator

@SageMoore - could you take a quick look through this?

Contributor

@SageMoore SageMoore left a comment


Looks reasonable to me. I assume the configs are copied over from the rocm/vllm-dev container and are expected to be faster than what we have on main? Do you have any performance results that you can share?

Dockerfile.rocm_base: review thread, resolved
```dockerfile
ARG FA_BRANCH
ARG FA_REPO
RUN git clone ${PYTORCH_REPO} pytorch
RUN cd pytorch && git checkout ${PYTORCH_BRANCH} && \
```
(diff context from Dockerfile.rocm_base; the RUN command continues past this excerpt)
Contributor


Does this version of pytorch support ROCm 6.3? If not, would it make more sense to pull in the 6.2 wheel from PyPI instead of building from source?

Contributor Author


This is the same base image, and therefore the same combination of versions, used in the AMD weekly dockers: rocm/vllm-dev:main (built out of ROCm/vllm.git).
The 6.2 wheel is an all-in-one wheel that brings with it an entire ROCm's worth of .so files, including the old, less performant hipBLASLt.
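The "all-in-one wheel" point can be checked directly. The commands below are an illustrative sketch, not part of this PR: the index URL is PyTorch's public ROCm wheel index, and the download path is arbitrary.

```shell
# Download the ROCm PyTorch wheel without installing it or its dependencies.
pip download torch --no-deps --index-url https://download.pytorch.org/whl/rocm6.2 -d /tmp/torch-rocm

# A wheel is a zip archive: list the shared objects bundled inside
# (hipBLASLt, rocBLAS, MIOpen, and friends all ship in the wheel).
unzip -l /tmp/torch-rocm/torch-*.whl | grep '\.so'
```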

@gshtras
Contributor Author

gshtras commented Jan 13, 2025

Looks reasonable to me. I assume the configs are copied over from the rocm/vllm-dev container and are expected to be faster than what we have on main? Do you have any performance results that you can share?

Not sure which configs you mean. The change to the Triton tuning configs in this PR is due to the Triton version change: Triton 3.2+ deprecated num_stages: 0 in favor of num_stages: 2.
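As a sketch of that migration: the config directory below is illustrative, and this sed-based rewrite is my assumption about the mechanical change, not the PR's actual diff.

```shell
# Triton 3.2+ deprecated num_stages: 0 in favor of num_stages: 2,
# so rewrite any tuning-config JSON files that still use the old value.
CONFIG_DIR="${CONFIG_DIR:-vllm/model_executor/layers/fused_moe/configs}"  # illustrative path
for cfg in "$CONFIG_DIR"/*.json; do
  [ -f "$cfg" ] || continue                      # glob matched nothing; skip
  sed -i 's/"num_stages": *0/"num_stages": 2/g' "$cfg"
done
```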

@mgoin
Member

mgoin commented Jan 14, 2025

It looks like the CI isn't able to find the container?

@gshtras
Contributor Author

gshtras commented Jan 14, 2025

It looks like the CI isn't able to find the container?

All GPU nodes of the AMD CI are currently down due to a network issue in the compute cluster used there.
It looks like the container build is the only job that succeeded, as it does not use GPU nodes.

@gshtras gshtras force-pushed the dockerfile_unification branch from 1092866 to 5c36cb8 Compare January 16, 2025 20:52
@gshtras
Contributor Author

gshtras commented Jan 17, 2025

To unbreak the CI we need vllm-project/buildkite-ci#57

Labels
ci/build · documentation (Improvements or additions to documentation) · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm
6 participants