Speed up imports and add a CI #2845
Conversation
Let me know how I can improve this tool further so we can then get it going for anyone at HF who wants to use it 🤗
Thanks for adding this workflow to detect regressions in import time.
I have a couple of questions and comments, please check.
Regarding tuna, I did a quick check, and from my understanding it is mostly a package that lightly processes the profile data generated by Python's cProfile module and adds visualization similar to snakeviz. Also note that the last commit is already almost 1.5 years old and that read_import_profile comes from a private module. For this reason, I wonder if it wouldn't make more sense to vendor this function in your lib and get rid of the tuna dependency completely.
sorted_data = sort_nodes_by_total_time(data)
paths_above_threshold = get_paths_above_threshold(sorted_data, 0.1, max_depth=7)
err_msg += f"\n{convert_list_to_string(paths_above_threshold)}"
self.assertLess(pct_more, 1.2, err_msg)
So IIUC, we want to ensure that the import doesn't add more than 20% on top of the PyTorch import. Did you check how robust this is? Should this test perhaps do multiple imports and then average? If so, should we throw away the first timing, as it may require a warm up (disk cache?) and thus make the first import unduly slow (and thus make the test pass trivially)?
Correct.
We specifically care about the warmup, because otherwise it will lead to slowdowns via accelerate launch.
I'll check multiple calls, but it shouldn't matter, because everything is run via subprocess, so it shouldn't rely on disk caching.
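For reference, a minimal sketch of subprocess-based import timing (the helper name mirrors run_import_time used in the tests, but the body here is illustrative, not the PR's implementation):

```python
import subprocess
import sys


def run_import_time(statement: str) -> str:
    """Run `statement` in a fresh interpreter with -X importtime.

    Each call spawns a new process, so Python's in-process module cache
    (sys.modules) never carries over between measurements; only OS-level
    disk caching could affect repeated runs.
    """
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        capture_output=True,
        text=True,
        check=True,
    )
    # -X importtime writes its cumulative per-module timing table to stderr.
    return result.stderr


profile = run_import_time("import torch")
print(profile.splitlines()[-1])  # the last line holds the top-level module's total
```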
from .megatron_lm import prepare_scheduler as megatron_lm_prepare_scheduler

if is_megatron_lm_available():
For my understanding, this change is unrelated, right?
It's fully related; see the megatron-related graph in the PR description (in the test examples). It was a source of slowdown.
Ah, I missed that, I had to scroll right to see the megatron part :D So it's not related to implementing the new workflow, but it is related to what was actually increasing the startup time.
Yes exactly. Can't fully do one without the other for good baselines, hence why they are both in this PR 😅
The check name is somewhat misleading, since a user may have megatron installed (i.e. available) but not be telling accelerate to use it, in which case it shouldn't load it. Perhaps call it is_megatron_needed?
For the available logic - shouldn't it fail anyway long before this code if it's not available?
What I'm trying to communicate is that, for the sake of imports, needed and available are orthogonal to each other.
No, it is available and has always been available, else we'd call it needed or configured as you say.
requires_ is for tests: we know we are using it, so we use it.
_available simply means bring it into the environment if it is available.
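For context, a generic sketch of what an _available-style check usually looks like (not accelerate's exact implementation; the package names are just examples):

```python
import importlib.metadata


def _is_package_available(dist_name: str) -> bool:
    # Purely an environment question: is the distribution installed at all?
    # It says nothing about whether the user configured accelerate to use it.
    try:
        importlib.metadata.version(dist_name)
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


print(_is_package_available("torch"))          # True if torch is installed
print(_is_package_available("megatron-core"))  # example third-party package
```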
It's aggressive importing, yes, which, as noted in my last comment, I will aim to fix when I can dedicate some time to refactoring our entire import schema in accelerate.
But the first thing the is_megatron_lm_available check does is check whether it's configured to be used, or am I misreading it?

accelerate/src/accelerate/utils/imports.py
Lines 223 to 224 in 67b2943

def is_megatron_lm_available():
    if str_to_bool(os.environ.get("ACCELERATE_USE_MEGATRON_LM", "False")) == 1:

So it's not available - it's is_wanted_and_available.
@BenjaminBossan what's missing here is the initial tuna check, which is why it's required (see the workflow). We can eventually look at gutting it out, sure. However, I think having both the visual option for further debugging and the condensed output here is valuable, which is why we should still use tuna. Edit: I may be wrong here, sorry.
@BenjaminBossan fully removed the requirement for tuna.
import torch.distributed.checkpoint as dist_cp
from torch.distributed.checkpoint.default_planner import DefaultSavePlanner
from torch.distributed.fsdp.fully_sharded_data_parallel import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.fully_sharded_data_parallel import StateDictType
This feels somewhat yucky - it's going to run the same imports many times over during the life of a trainer, and more so because you replicate this exact import code multiple times below in similar functions.
Why not move these imports into their own file and call the imports only if they haven't been loaded yet?
Moreover, it should be safe to assume that if a user configured accelerate to use FSDP, they'll need all these save/load FSDP functions - so from a performance point of view, all of them can be moved into a single place and loaded once, outside of these functions, if the user asked for FSDP - similar to how you did it with megatron in this PR.
Agreed, it's gross for now. I plan on working with @LysandreJik on adding a lazy-import style next week to fix all of these performance issues fully, which can include the solution you suggest (as we have to do this in multiple areas, which is not ideal).
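A rough sketch of what a lazy-import style could look like, using PEP 562's module-level __getattr__ (a generic pattern, not accelerate's eventual implementation; the module path in the comment is hypothetical):

```python
# Contents of a hypothetical submodule, e.g. accelerate/utils/_torch_dist.py
import importlib
from typing import Any

_LAZY_ATTRS = {
    # attribute name -> module that actually provides it
    "FullyShardedDataParallel": "torch.distributed.fsdp.fully_sharded_data_parallel",
    "StateDictType": "torch.distributed.fsdp.fully_sharded_data_parallel",
    "DefaultSavePlanner": "torch.distributed.checkpoint.default_planner",
}


def __getattr__(name: str) -> Any:
    # Called only when `name` isn't already defined here, so the heavy torch
    # submodules are imported the first time something actually asks for them.
    if name in _LAZY_ATTRS:
        module = importlib.import_module(_LAZY_ATTRS[name])
        value = getattr(module, name)
        globals()[name] = value  # cache it so __getattr__ isn't hit again
        return value
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```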
Will need to think of an ideal folder structure, but I'll start with some small version of this in this PR, since we're touching those imports now (specifically sectioning out to a different file)
Also, the difference with megatron is that we check for a package being installed. So while we can do something similar for DeepSpeed, we cannot do this for FSDP (the check doesn't actually look at configuration, only at whether a package is available in the env; nor should we check based on configuration alone, IMO, as some libraries do not rely on this).
But it should be enough to check whether FSDP is configured to be used - configured? Import. Not configured? Don't import.
The dependency chain could be:
this_component_is_needed_to_continue -> do we have it installed -> if so, import the needed bits.
i.e. the logic that decides whether or not to import something is driven by the configuration of accelerate and not by the availability of a 3rd party package.
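A small sketch of that chain, with hypothetical helper names (the env var handling is an assumption for illustration, not accelerate's actual API):

```python
import importlib.util
import os


def is_fsdp_wanted() -> bool:
    # "Needed to continue": driven by accelerate's configuration,
    # e.g. an environment variable set by `accelerate launch` (assumed here).
    return os.environ.get("ACCELERATE_USE_FSDP", "false").lower() in ("1", "true")


def is_torch_installed() -> bool:
    # "Do we have it installed": purely an environment question.
    # find_spec on a top-level package does not import it.
    return importlib.util.find_spec("torch") is not None


# configured -> installed -> import the needed bits
if is_fsdp_wanted() and is_torch_installed():
    from torch.distributed.fsdp.fully_sharded_data_parallel import (
        FullyShardedDataParallel as FSDP,
        StateDictType,
    )
```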
Hmm, but imports are cached in Python, so the 2nd time this is called, they don't add any overhead. Or am I missing something?
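A tiny standalone illustration of that caching behavior (not from the PR):

```python
import sys
import time

start = time.perf_counter()
import decimal  # first import: the module is executed and cached in sys.modules
first = time.perf_counter() - start

start = time.perf_counter()
import decimal  # second import: just a sys.modules lookup, essentially free
second = time.perf_counter() - start

print(f"first: {first * 1e6:.0f} us, cached: {second * 1e6:.0f} us")
print("decimal" in sys.modules)  # True
```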
Sure, but this just feels weird.
Imports don't belong in methods/functions, and surely not when you repeat the same imports multiple times in multiple functions. Nothing stops you from doing that, of course.
Imports are sort of a singleton thing, and while we occasionally see an odd import inside some function, it's usually an exception and happens more often in throw-away code.
Overall this LGTM, thanks for working on this and the tuna-interpreter package.
I have some comments, but none are blockers. Regarding the local imports, personally I don't think it's a huge anti-pattern, so this can be left for later IMO.
One question I have is how test_imports.py is supposed to be run - just manually by devs, or by some automatic job? If the latter, I'd consider running the checks multiple times and averaging the results (perhaps also discarding outliers) to make the job more robust. If it's manual, then that's not necessary.
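If it ends up running automatically, a sketch of what averaging with outlier rejection could look like (the helper names are illustrative, not the PR's code):

```python
import statistics
import subprocess
import sys
import time


def time_single_import(statement: str) -> float:
    # Time one import in a fresh interpreter so in-process module caching can't help.
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", statement], check=True)
    return time.perf_counter() - start


def robust_import_time(statement: str, runs: int = 5) -> float:
    # Average several runs, discarding the single slowest one (warm-up / outlier).
    timings = sorted(time_single_import(statement) for _ in range(runs))
    return statistics.mean(timings[:-1])
```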
tests/test_imports.py
def setUpClass(cls):
    super().setUpClass()
    output = run_import_time("import torch")
    with open(cls.tmpdir / "pytorch_results.log", "w") as f:
Since you now control read_import_profile, you could make it accept an iterable as input instead of a file name; that way you can get rid of the indirection of writing the outputs to a file first.
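For example, a hypothetical signature along these lines (not the actual tuna-interpreter API):

```python
from pathlib import Path
from typing import Iterable, List, Union


def read_import_profile(source: Union[str, Path, Iterable[str]]) -> List[str]:
    # Hypothetical: accept either a file path or an iterable of lines
    # (e.g. captured stderr split into lines), so no temp file is needed.
    if isinstance(source, (str, Path)):
        with open(source) as f:
            return f.readlines()
    return list(source)
```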
Thanks for making the import faster!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
🔥
What does this PR do?
This PR introduces a CI which utilizes cProfile timings and my tuna-interpreter library to perform time-based import tests.
It was reported by @stas00 that we were taking far too long to do basic things like accelerate launch, and tuna can help visualize why by creating import graphs that direct us to what is taking too long.
I wrote a small library called tuna-interpreter that aims to take the best parts of tuna and work them into something parse-able that lets us run CIs off of it.
After using the tool, we can see a decrease of ~68%.
How it works:
In its current form, we go based off of a baseline torch import, since Accelerate relies on torch no matter what, BUT we should be no more than ~20% slower than the torch import overall. Anything more and we have some slip-up or timing problem.
An example test looks like the following:
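(A simplified, self-contained sketch of the idea; the real test uses run_import_time and the tuna-interpreter helpers shown in the review snippets above, and the timing helper below is illustrative.)

```python
import subprocess
import sys
import time
import unittest


def time_import(statement: str) -> float:
    # Run the import in a fresh interpreter so nothing is pre-cached in-process.
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", statement], check=True)
    return time.perf_counter() - start


class ImportSpeedTest(unittest.TestCase):
    def test_accelerate_vs_torch(self):
        torch_time = time_import("import torch")
        accelerate_time = time_import("import accelerate")
        pct_more = accelerate_time / torch_time
        # Fail if importing accelerate costs more than ~20% on top of torch.
        self.assertLess(pct_more, 1.2, f"accelerate import is {pct_more:.2f}x torch")


if __name__ == "__main__":
    unittest.main()
```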
Where essentially we:
- use import_time to run a particular Python import, called via -c in a subprocess;
- set max_depth so it is enough to get your imports out of the slowdown trace and show what external libraries you are really calling.

An example failure clearly shows which module chain had the slowdown.
If there are specific issues with using tuna-interpreter, please let me know; it's a very quickly hacked-together-but-working library for what we are doing, and I'm open to improving it further after we battle-test it.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @BenjaminBossan @sayakpaul @Titus-von-Koeller @ArthurZucker @ydshieh @LysandreJik