
Fix loading universal checkpoint for BF16_Optimizer #5211

Closed
wants to merge 6 commits

Conversation

mosheisland
Contributor

PR #5104 (Remove optimizer step on initialization) breaks loading a universal checkpoint for BF16_Optimizer.
This is because universal checkpoint loading attempts to load the optimizer states into the lp._hp_mapping.optim_state dictionary before they have been initialized (by step).

As a workaround, when loading a universal checkpoint, perform a step and initialize the hp params' optimizer states before loading from the universal checkpoint files.
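The ordering problem can be sketched with a small toy model. This is not the actual DeepSpeed BF16_Optimizer code; the names (`LazyOptimizer`, `load_universal_state`) are hypothetical, and the state layout stands in for the real per-parameter Adam state:

```python
# Toy model of the ordering bug: per-parameter optimizer state is created
# lazily on the first step(), so loading a checkpoint into those state
# slots before any step() has run fails. All names here are illustrative.

class LazyOptimizer:
    """Optimizer whose per-parameter state is allocated on first step()."""

    def __init__(self, params):
        self.params = params
        self.state = {}  # empty until step() runs

    def step(self):
        # First step allocates the state slots (e.g. exp_avg for Adam).
        for p in self.params:
            self.state.setdefault(p, {"exp_avg": 0.0, "exp_avg_sq": 0.0})

def load_universal_state(opt, checkpoint):
    # Mirrors loading that writes into *existing* state slots; it raises
    # KeyError if step() has not yet initialized them.
    for p, saved in checkpoint.items():
        opt.state[p].update(saved)

opt = LazyOptimizer(["w"])
ckpt = {"w": {"exp_avg": 0.5, "exp_avg_sq": 0.25}}

try:
    load_universal_state(opt, ckpt)  # fails: state never initialized
except KeyError:
    print("load before step fails")

opt.step()                           # workaround: dummy step first
load_universal_state(opt, ckpt)      # now succeeds
print(opt.state["w"]["exp_avg"])     # 0.5
```

The workaround in this PR follows the same shape: force the state-allocating step before the universal checkpoint files are read.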


Signed-off-by: Moshe Island <[email protected]>
@tjruwase
Contributor

@misland-habana, I think we have fixed the underlying issue. Can you please check if this PR is still needed? Thanks.

@mosheisland
Contributor Author

> @misland-habana, I think we have fixed the underlying issue. Can you please check if this PR is still needed? Thanks.

@tjruwase, sorry for the late reply (vacation...). I have re-tested with your fix for the underlying issue, and it works for me.
I don't think this PR is needed anymore.
