
Workflow for AutoTP #4961

Open
delock wants to merge 36 commits into master
Conversation

@delock commented Jan 16, 2024 (Collaborator)

This PR adds a new, extensible workflow for automatic tensor parallelism (https://www.deepspeed.ai/tutorials/automatic-tensor-parallelism/). The workflow aims to provide a way to validate AutoTP for LLM models.

@delock commented Jan 17, 2024 (Collaborator, Author)

The specific error below occurs because the container is not created with the CAP_SYS_NICE capability. I'll check the additional flags I use for the container and post them here.

set_mempolicy: Operation not permitted
setting membind: Operation not permitted

@delock commented Jan 17, 2024 (Collaborator, Author)

On my system the Docker container needs to be started with the SYS_NICE capability, using the following flag:

  --cap-add SYS_NICE

I'm not sure how to turn this on for the DeepSpeed runner.

Without this capability, we have to remove the --bind_cores_to_rank flag, but this would significantly slow down the test. @mrwyattii, what's your thinking on this? We could remove --bind_cores_to_rank to let the workflow run first, then work on how to enable the SYS_NICE capability. Would that work?

@delock commented Jan 19, 2024 (Collaborator, Author)

The proper behavior of DeepSpeed --bind_cores_to_rank is to bind memory to the NUMA node only if the system allows it. This makes DeepSpeed behave more gracefully in a Docker environment. The latest fix in DeepSpeed has been verified on my own runner, both with and without the SYS_NICE capability.
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581455004/job/20649083143
https://github.com/delock/DeepSpeedSYCLSupport/actions/runs/7581918228/job/20650446510

@delock commented Jan 22, 2024 (Collaborator, Author)

Hi @loadams, the blocking issue for this PR has been resolved. Can you help restart the workflow? Thanks!

@delock commented Jan 22, 2024 (Collaborator, Author)

@tjruwase Thanks! The AutoTP workflow currently passes. One thing I'm not sure about is whether the downloaded checkpoint will be preserved across runs; downloading it is the most time-consuming part of this workflow. I'll need some comments (i.e., which directory on the runner persists?) or will observe another run to see whether the checkpoint is preserved.

@tjruwase commented (Contributor)

@delock, it is great to see the CI now passing.

I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

@delock commented Jan 24, 2024 (Collaborator, Author)

@mrwyattii @loadams it would be great if there is a link showing how persistence is handled on this runner.

> @delock, it is great to see the CI now passing.
>
> I think @mrwyattii or @loadams would be the best to answer questions about the checkpoint.

@loadams commented Jan 24, 2024 (Contributor)

> @mrwyattii @loadams it would be great if there is a link showing how persistence is handled on this runner.

I know @mrwyattii and I still need to leave feedback on this PR, but an example of where things are on the blob storage is here; I'm not sure that's the best example, but it shows persisting a larger download/install.

@delock commented Jan 29, 2024 (Collaborator, Author)

> I know @mrwyattii and I still need to leave feedback on this PR, but an example of where things are on the blob storage is here; I'm not sure that's the best example, but it shows persisting a larger download/install.

Thanks for the suggestion @loadams. Looking at the usage of '/blob' in DeepSpeed workflows, I found I need to use the default value of TRANSFORMERS_CACHE. Let me make the change and see if it persists.

@delock commented Jan 31, 2024 (Collaborator, Author)

Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

@loadams commented Feb 5, 2024 (Contributor)

> Hi @loadams, can you help start the workflow? The model checkpoint path has been moved to the persistent storage as suggested.

Apologies, I was out, but it should be running now.

@delock commented Feb 6, 2024 (Collaborator, Author)

> Apologies, I was out, but it should be running now.

Thanks! The failure in the workflow should be due to a version mismatch between PyTorch (2.2.0) and Intel Extension for PyTorch (2.1). The recent failure in the cpu-inference workflow is likely caused by the same issue. An upcoming release of Intel Extension for PyTorch should fix it; let me ping you when the new version is released.
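This kind of mismatch can be caught early with a small version check. A minimal sketch; the helper name is made up, and version strings are assumed to follow `major.minor.patch` with an optional local suffix such as `+cpu`:

```python
def versions_compatible(torch_version, ipex_version):
    """Return True when the two packages agree on major.minor (e.g. 2.2.x with 2.2.x)."""
    def major_minor(version):
        # strip local suffixes such as "+cpu" before splitting
        parts = version.split("+")[0].split(".")
        return int(parts[0]), int(parts[1])
    return major_minor(torch_version) == major_minor(ipex_version)
```

A workflow could call this with the installed package versions and fail fast with a clear message instead of a confusing runtime error.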

@delock commented Feb 6, 2024 (Collaborator, Author)

@loadams Intel Extension for PyTorch 2.2 was released today. Restarting the workflow should resolve the failure.
https://pypi.org/project/intel-extension-for-pytorch/


@delock commented Feb 28, 2024 (Collaborator, Author)

@loadams The Falcon 7b model is not supported by DeepSpeed AutoTP yet. I updated the workflow to test Baichuan 7b instead. Can you help restart the workflow? Thanks!

@delock commented Mar 13, 2024 (Collaborator, Author)

Hi @loadams, the command line for the Baichuan model has been changed to fix the test error. The reason is that the Baichuan model contains remote code, so trust_remote_code needs to be set to true. Can you help restart the workflow? Thanks!
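Models that ship custom modeling code, such as Baichuan, require `trust_remote_code=True` when loading the tokenizer and model with `transformers`. A minimal sketch of how the loader kwargs might be assembled; the helper function is hypothetical:

```python
def loader_kwargs(trust_remote_code=False):
    """Assemble kwargs for AutoTokenizer/AutoModel from_pretrained calls."""
    kwargs = {"padding_side": "left"}
    if trust_remote_code:
        # required for repositories that ship custom modeling code (e.g. Baichuan)
        kwargs["trust_remote_code"] = True
    return kwargs
```

Usage would look like `AutoTokenizer.from_pretrained(model_name, **loader_kwargs(trust_remote_code=True))`.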

@delock commented Mar 16, 2024 (Collaborator, Author)

Hi @loadams @tjruwase, can you help start this workflow? Thanks!

@delock commented Mar 28, 2024 (Collaborator, Author)

Hi @loadams, I see the environment issue should have been fixed. Can you help restart the workflow? Thanks!

@loadams commented Mar 28, 2024 (Contributor)

> Hi @loadams, I see the environment issue should have been fixed. Can you help restart the workflow? Thanks!

@delock - yes, apologies that took so long.

@delock commented Apr 1, 2024 (Collaborator, Author)

@loadams I ran these two tests in my local environment and they didn't take so long. Can you help run this workflow again to see whether it is reproducible? Thanks!

@loadams commented Apr 1, 2024 (Contributor)

> @loadams I ran these two tests in my local environment and they didn't take so long. Can you help run this workflow again to see whether it is reproducible? Thanks!

Re-running now.

@delock commented Apr 8, 2024 (Collaborator, Author)

Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

@loadams commented Apr 8, 2024 (Contributor)

> Hi @loadams, I tried running these UTs in my environment and didn't see this timeout. Since the CPU UTs are already covered by the cpu-torch-latest workflow, I removed the unit tests from this workflow to focus on the AutoTP test only. I also removed the dependency on oneCCL and use stock PyTorch to better focus on AutoTP functionality. Can you help start the workflow? Thanks!

Done.

@delock commented Apr 9, 2024 (Collaborator, Author)

Regarding the Baichuan model failure: I see the test pass in my local environment with exactly the same arguments. From the failed log in the workflow I see a 'file not found' error when acquiring a lock. I suspect HF_HOME had not been properly set; I will point HF_HOME to /blob/ and try again.

runner/_work/DeepSpeed/DeepSpeed/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 43, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left", trust_remote_code=trust_remote_code)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 829, in from_pretrained
    return tokenizer_class.from_pretrained(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2047, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 398, in cached_file
    resolved_file = hf_hub_download(
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
    return fn(*args, **kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1451, in hf_hub_download
    with WeakFileLock(lock_path):
  File "/opt/conda/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/huggingface_hub/utils/_fixes.py", line 83, in WeakFileLock
    lock.acquire()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/filelock/_api.py", line 262, in acquire
    self._acquire()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/filelock/_unix.py", line 44, in _acquire
    os.fchmod(fd, self._context.mode)
FileNotFoundError: [Errno 2] No such file or directory
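One way to recover from interrupted downloads like the one above is to clear stale `*.lock` files from the cache before retrying. This is a hypothetical cleanup sketch, not part of huggingface_hub; the function name and behavior are assumptions:

```python
import pathlib

def remove_stale_locks(cache_dir):
    """Delete leftover *.lock files from interrupted hf_hub downloads."""
    removed = []
    for lock in pathlib.Path(cache_dir).rglob("*.lock"):
        lock.unlink()  # safe only when no download is currently in flight
        removed.append(str(lock))
    return removed
```

Note the caveat in the comment: such a cleanup should run only when no concurrent job is downloading into the same cache.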

@delock commented Apr 9, 2024 (Collaborator, Author)

@loadams HF_HOME has been pointed to /blob/hf_home. Can you help start the workflow to see whether the 'lock file not found' issue has been fixed? Thanks!

@delock commented Apr 10, 2024 (Collaborator, Author)

Hi @loadams, after reading the error log I suspect the Baichuan model under TRANSFORMERS_CACHE is corrupted. I unset TRANSFORMERS_CACHE since we set HF_HOME for this model. I also added a peek at TRANSFORMERS_CACHE and HF_HOME in case a manual cleanup is needed. Can you help start the workflow? Thanks!
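Such a "peek" can be as simple as reporting which cache locations are in effect. A hypothetical sketch of the resolution order: when `TRANSFORMERS_CACHE` is unset, transformers derives its cache from `HF_HOME` (the exact default subdirectory varies by transformers version, so treat the fallback path below as an assumption):

```python
import os

def effective_cache_dirs(env=None):
    """Report the HF cache locations a job would use, approximating the env resolution."""
    env = os.environ if env is None else env
    hf_home = env.get("HF_HOME",
                      os.path.join(os.path.expanduser("~"), ".cache", "huggingface"))
    # when TRANSFORMERS_CACHE is unset, transformers falls back to a dir under HF_HOME
    transformers_cache = env.get("TRANSFORMERS_CACHE", os.path.join(hf_home, "hub"))
    return hf_home, transformers_cache
```

Printing these two paths at the start of the workflow makes it obvious which directory to inspect or clean up when a cached model is corrupted.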

@delock commented Apr 11, 2024 (Collaborator, Author)

@loadams @tjruwase the latest error in Baichuan model AutoTP is very weird. It complains about a lock file not found or an attribute not found, which I cannot reproduce locally. This suggests there is probably some corrupted state in the hf_hub downloaded data.

Currently the bloom and opt model AutoTP tests are consistently passing. Can we merge this baseline first, then add new AutoTP model validation in a follow-up PR? It might take a while to debug this issue. I'll submit a commit disabling the Baichuan test first.
