[BNB] integrate StableEmbedding into VocabParallelEmbedding logic #182

Conversation
```python
import math
import torch


def xavier_uniform_tensor_parallel_(tensor, gain=1., tp_degree=1):
    r"""
    This is a modified torch.nn.init.xavier_uniform_ with changes to support
    an embedding partitioned on the vocab-size dim under tensor parallelism.

    Additional args:
        - tp_degree: degree of tensor parallelism

    Note: the code assumes all partitions are equal in size.
    """
    # receptive_field_size=1 as dim==2, so we don't need init._calculate_fan_in_and_fan_out
    fan_out, fan_in = tensor.shape
    fan_out *= tp_degree  # tp splits on the num_embeddings dim

    std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
    a = math.sqrt(3.0) * std  # calculate uniform bounds from standard deviation

    return torch.nn.init._no_grad_uniform_(tensor, -a, a)
```
@TimDettmers, does it look OK to you?
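(Aside: a minimal usage sketch of the helper above. The shard and hidden sizes are made up for illustration, and it assumes `xavier_uniform_tensor_parallel_` as defined in the diff is in scope.)

```python
import math
import torch

tp_degree = 4
# hypothetical shard: a ~50k vocab split 4 ways, hidden size 1024
shard = torch.empty(12565, 1024)

# each rank initializes its shard with bounds computed from the *full* vocab size,
# so the result matches what xavier_uniform_ would do on the unpartitioned table
xavier_uniform_tensor_parallel_(shard, tp_degree=tp_degree)

# the bound used internally: a = sqrt(3) * gain * sqrt(2 / (fan_in + fan_out * tp_degree))
fan_out, fan_in = shard.shape
a = math.sqrt(3.0) * math.sqrt(2.0 / (fan_in + fan_out * tp_degree))
assert shard.abs().max() <= a
```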
megatron/mpu/layers.py (outdated)
```python
if args.use_bnb_optimizer:
    from bitsandbytes.optim import GlobalOptimManager
    # XXX: ok doing it for the shard?
    GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
    GlobalOptimManager.get_instance().register_parameters(self.weight)
```
@TimDettmers, does this look OK to you?
Elsewhere you mentioned:

> Override the config with optim_bits=32 for each shard first and then register all the parameters (can be done with `register_parameters(model.parameters())`) before parameters are transferred to cuda.

So should I not do the following here?

```python
GlobalOptimManager.get_instance().register_parameters(self.weight)
```
Here is where we already pass all the params to the optimizer:

Megatron-DeepSpeed/megatron/optimizer/__init__.py, lines 68 to 72 at 2d9744f:

```python
optimizer = adam_optimizer(param_groups,
                           lr=args.lr,
                           weight_decay=args.weight_decay,
                           betas=(args.adam_beta1, args.adam_beta2),
                           eps=args.adam_eps)
```
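(Aside: a self-contained sketch of the centralized pattern quoted above: register all parameters while they are still on CPU, override the embedding's config, then move to GPU and create the 8-bit optimizer. The toy model and hyper-parameters are placeholders, not Megatron's, and `bnb.optim.Adam8bit` is only assumed to stand in for whatever `adam_optimizer` resolves to when `args.use_bnb_optimizer` is set.)

```python
import torch
import bitsandbytes as bnb

# toy stand-in for the real model; the sizes are arbitrary
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 64),
    torch.nn.Linear(64, 10),
)

mng = bnb.optim.GlobalOptimManager.get_instance()
mng.register_parameters(model.parameters())             # register while params are still on CPU
mng.override_config(model[0].weight, 'optim_bits', 32)  # keep 32-bit state for the embedding

model = model.cuda()
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, betas=(0.9, 0.95), eps=1e-8)
```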
The order is correct, but I am not sure about the location. If `use_cpu_initialization` is False, then the weight matrix is transferred to the GPU, and the registration requires the buffer to be on the CPU. So the bnb config alteration should be above that.
`use_cpu_initialization` is False in our case, and it seems to work.

> So the bnb config alteration should be above that.

Please define "above that"? We currently have these 4 logical steps (removing the `if use_cpu_initialization`):

1. `self.weight` = create weight on GPU
2. init weight
3. `GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)`
4. `GlobalOptimManager.get_instance().register_parameters(self.weight)`

We can't register the param before it's created.
This looks good to me; the only thing that would potentially need to be changed is to do the registration/override before casting the weight to CUDA.
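(A minimal sketch of the ordering this thread converges on, i.e. the 4 steps above with the override/registration done before the weight is cast to CUDA. The sizes are illustrative, the explicit `.cuda()` call stands in for Megatron's own initialization path, and `xavier_uniform_tensor_parallel_` is the helper from the diff above.)

```python
import torch
from bitsandbytes.optim import GlobalOptimManager

# illustrative shard sizes
num_embeddings_per_partition, embedding_dim, tp_degree = 12565, 1024, 4

# 1. create the shard's weight on CPU
weight = torch.nn.Parameter(torch.empty(num_embeddings_per_partition, embedding_dim))

# 2. initialize it (using the TP-aware xavier helper defined earlier)
xavier_uniform_tensor_parallel_(weight, tp_degree=tp_degree)

# 3. + 4. override the bnb config and register the param while it is still on CPU
GlobalOptimManager.get_instance().override_config(weight, 'optim_bits', 32)
GlobalOptimManager.get_instance().register_parameters(weight)

# only now cast the weight to the GPU
weight.data = weight.data.cuda()
```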
This PR merges `bnb.StableEmbedding`'s logic into the custom Megatron-LM `VocabParallelEmbedding`:

- `xavier_uniform_tensor_parallel_` - a custom version of `torch.nn.init.xavier_uniform_` - that correctly adjusts for the full embedding dimension while being applied to the partitioned one
- `GlobalOptimManager` registration and `norm` at the end of `forward`
- `VocabParallelEmbedding` requires TP>1

Fixes #180
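For context, below is a simplified sketch of what `bnb.StableEmbedding` layers on top of a plain `torch.nn.Embedding` (xavier init, 32-bit optimizer state via `GlobalOptimManager`, and a LayerNorm applied to the output). It illustrates the behavior being merged, not the actual bitsandbytes implementation; the real class may wire the 32-bit override slightly differently.

```python
import torch
import torch.nn.functional as F
from bitsandbytes.optim import GlobalOptimManager


class SimplifiedStableEmbedding(torch.nn.Embedding):
    """Sketch of the behavior this PR folds into VocabParallelEmbedding."""

    def __init__(self, num_embeddings, embedding_dim):
        super().__init__(num_embeddings, embedding_dim)
        torch.nn.init.xavier_uniform_(self.weight)     # xavier init
        self.norm = torch.nn.LayerNorm(embedding_dim)  # norm at the end of forward
        # keep 32-bit Adam state for the embedding even when an 8-bit optimizer is used
        GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
        GlobalOptimManager.get_instance().register_parameters(self.weight)

    def forward(self, input_ids):
        return self.norm(F.embedding(input_ids, self.weight))
```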