
[BNB] integrate StableEmbedding into VocabParallelEmbedding logic #182

Merged: 5 commits merged into main from bnb-stable-embed on Nov 10, 2021

Conversation

@stas00 (Contributor) commented Nov 7, 2021

This PR merges bnb.StableEmbedding's logic into the custom Megatron-LM's VocabParallelEmbedding

  • move the bnb library check to args
  • undo some changes from [Feature] Porting bitsandbytes to meg-deepspeed #144 - only the word embedding should be touched for BNB, the rest should remain untouched
  • implement xavier_uniform_tensor_parallel_ - a custom version of torch.nn.init.xavier_uniform_ - that correctly adjusts for the full embedding dimension while being applied to the partitioned one
  • merge the rest of the logic - GlobalOptimManager registration and the norm at the end of forward (see the simplified sketch below)
  • adjust the bnb test to run with at least TP=2 (prioritized over PP>1), since VocabParallelEmbedding requires TP>1

Fixes #180
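For readers unfamiliar with bnb.StableEmbedding, the following is a minimal, simplified sketch (not the PR's actual Megatron-LM code, and without the tensor-parallel partitioning) of the two behaviours being merged into the embedding: 32-bit optimizer state for the embedding weight via GlobalOptimManager, and a LayerNorm applied at the end of forward. The class name StableishEmbedding is hypothetical.

import torch
import torch.nn.functional as F

try:
    from bitsandbytes.optim import GlobalOptimManager
    HAVE_BNB = True
except ImportError:
    HAVE_BNB = False

class StableishEmbedding(torch.nn.Module):  # hypothetical name, illustration only
    def __init__(self, num_embeddings, embedding_dim, use_bnb_optimizer=True):
        super().__init__()
        # plain (unpartitioned) embedding weight; the PR applies the same ideas
        # to Megatron's vocab-partitioned weight
        self.weight = torch.nn.Parameter(torch.empty(num_embeddings, embedding_dim))
        torch.nn.init.xavier_uniform_(self.weight)
        self.norm = torch.nn.LayerNorm(embedding_dim)
        if use_bnb_optimizer and HAVE_BNB:
            # keep the optimizer state for this weight in 32 bits even when the
            # rest of the model uses an 8-bit optimizer
            GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
            GlobalOptimManager.get_instance().register_parameters(self.weight)

    def forward(self, input_ids):
        out = F.embedding(input_ids, self.weight)
        return self.norm(out)  # norm at the end of forward, as in bnb.StableEmbedding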

Comment on lines +135 to +152
def xavier_uniform_tensor_parallel_(tensor, gain=1., tp_degree=1):
    r"""
    This is a modified torch.nn.init.xavier_uniform_ with changes to support
    an embedding partitioned on the vocab-size dim under tensor parallelism.

    Additional args:
    - tp_degree: degree of tensor parallelism

    Note: the code assumes all partitions are equal in size
    """
    # receptive_field_size=1 as dim==2, so we don't need init._calculate_fan_in_and_fan_out
    fan_out, fan_in = tensor.shape
    fan_out *= tp_degree  # tp splits on the num_embeddings dim

    std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
    a = math.sqrt(3.0) * std  # calculate uniform bounds from standard deviation

    return torch.nn.init._no_grad_uniform_(tensor, -a, a)
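As an aside, here is a quick illustrative check of what the fan_out *= tp_degree correction buys, given the function above (this snippet is not part of the PR; vocab_size, hidden, and tp are made-up sizes): the per-shard init bounds match the bounds torch.nn.init.xavier_uniform_ would use on the full, unsharded weight, assuming equal partitions.

import math
import torch

vocab_size, hidden, tp = 1024, 64, 4

# bound xavier_uniform_ would use on the full (unsharded) weight
full_bound = math.sqrt(3.0) * math.sqrt(2.0 / (vocab_size + hidden))

# initialize one tensor-parallel shard with the function above
shard = torch.empty(vocab_size // tp, hidden)
xavier_uniform_tensor_parallel_(shard, tp_degree=tp)

# without the fan_out *= tp_degree correction the shard would be drawn from a wider
# range, sqrt(3) * sqrt(2 / (vocab_size/tp + hidden))
assert shard.abs().max() <= full_bound + 1e-6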
@stas00 (Contributor, Author):

@TimDettmers, does it look OK to you?


Comment on lines 212 to 216
if args.use_bnb_optimizer:
    from bitsandbytes.optim import GlobalOptimManager
    # XXX: ok doing it for the shard?
    GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
    GlobalOptimManager.get_instance().register_parameters(self.weight)
@stas00 (Contributor, Author):

@TimDettmers, does this look OK to you?

@stas00 (Contributor, Author):

Elsewhere you mentioned:

Override the config with optim_bits=32 for each shard first and then register all the parameters (can be done with register_parameters(model.parameters())) before parameters are transferred to cuda

So should I not do:

GlobalOptimManager.get_instance().register_parameters(self.weight)

Here is where we already pass all the params to the optim:

optimizer = adam_optimizer(param_groups,
                           lr=args.lr,
                           weight_decay=args.weight_decay,
                           betas=(args.adam_beta1, args.adam_beta2),
                           eps=args.adam_eps)


The order is correct, but I am not sure about the location. If use_cpu_initialization is False then the weight matrix is transferred to the GPU and the registration requires the buffer to be on the CPU. So the bnb config alteration should be above that.

@stas00 (Contributor, Author) commented Nov 10, 2021:

use_cpu_initialization is False in our case, and it seems to work.

So the bnb config alteration should be above that.

Please define "above that"? We currently have these 4 logical steps (leaving aside the if use_cpu_initialization branch):

1. self.weight = create weight on gpu
2. init_weight
3. GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
4. GlobalOptimManager.get_instance().register_parameters(self.weight)

We can't register the param before it's created.

@TimDettmers commented:

This looks good to me; the only thing that might need to be changed is to do the registration/override before casting the weight to CUDA.
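To make the suggested ordering concrete, here is a minimal sketch (not the PR diff itself; the sizes and the Adam8bit choice are illustrative): override/register the embedding weight while it is still on CPU, and only then cast it to CUDA and build the optimizer.

import torch
import bitsandbytes as bnb
from bitsandbytes.optim import GlobalOptimManager

weight = torch.nn.Parameter(torch.empty(1024, 64))  # 1. create the weight (on CPU)
torch.nn.init.xavier_uniform_(weight)               # 2. initialize it

mgr = GlobalOptimManager.get_instance()
mgr.override_config(weight, 'optim_bits', 32)       # 3. force 32-bit optimizer state
mgr.register_parameters(weight)                     # 4. register before the CUDA cast

weight.data = weight.data.cuda()                    # 5. only now move to the GPU

optimizer = bnb.optim.Adam8bit([weight], lr=1e-4)   # embedding state stays in 32 bits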

@stas00 merged commit a34ca7f into main on Nov 10, 2021
@stas00 deleted the bnb-stable-embed branch on Nov 10, 2021 at 19:41