
[BNB] integrate StableEmbedding into VocabParallelEmbedding logic #182

Merged: 5 commits merged into main from bnb-stable-embed on Nov 10, 2021

Conversation

@stas00 (Contributor) commented Nov 7, 2021

This PR merges bnb.StableEmbedding's logic into the custom Megatron-LM's VocabParallelEmbedding

  • move the bnb library check to args
  • undo some changes from [Feature] Porting bitsandbytes to meg-deepspeed #144 - only the word embedding should be touched for BNB, the rest should remain untouched
  • implement xavier_uniform_tensor_parallel_ - a custom version of torch.nn.init.xavier_uniform_ - that correctly adjusts for the full embedding dimension while being applied to the partitioned one
  • merge the rest of the logic - GlobalOptimManager registration and the norm at the end of forward (see the simplified sketch below)
  • adjust the bnb test to run with at least TP=2 (prioritized over PP>1), since VocabParallelEmbedding requires TP>1

Fixes #180
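For readers unfamiliar with bnb.StableEmbedding, the following is a minimal, simplified sketch (not the PR's actual Megatron-LM code, and without the tensor-parallel partitioning) of the two behaviours being merged into the embedding: 32-bit optimizer state for the embedding weight via GlobalOptimManager, and a LayerNorm applied at the end of forward. The class name StableishEmbedding is hypothetical.

import torch
import torch.nn.functional as F

try:
    from bitsandbytes.optim import GlobalOptimManager
    HAVE_BNB = True
except ImportError:
    HAVE_BNB = False

class StableishEmbedding(torch.nn.Module):  # hypothetical name, illustration only
    def __init__(self, num_embeddings, embedding_dim, use_bnb_optimizer=True):
        super().__init__()
        # plain (unpartitioned) embedding weight; the PR applies the same ideas
        # to Megatron's vocab-partitioned weight
        self.weight = torch.nn.Parameter(torch.empty(num_embeddings, embedding_dim))
        torch.nn.init.xavier_uniform_(self.weight)
        self.norm = torch.nn.LayerNorm(embedding_dim)
        if use_bnb_optimizer and HAVE_BNB:
            # keep the optimizer state for this weight in 32 bits even when the
            # rest of the model uses an 8-bit optimizer
            GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
            GlobalOptimManager.get_instance().register_parameters(self.weight)

    def forward(self, input_ids):
        out = F.embedding(input_ids, self.weight)
        return self.norm(out)  # norm at the end of forward, as in bnb.StableEmbedding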

Comment on lines +135 to +152
def xavier_uniform_tensor_parallel_(tensor, gain=1., tp_degree=1):
    r"""
    This is a modified torch.nn.init.xavier_uniform_ with changes to support
    an embedding partitioned on the vocab-size dim under tensor parallelism.

    Additional args:
    - tp_degree: degree of tensor parallelism

    Note: the code assumes all partitions are equal in size
    """
    # receptive_field_size=1 as dim==2, so we don't need init._calculate_fan_in_and_fan_out
    fan_out, fan_in = tensor.shape
    fan_out *= tp_degree  # tp splits on the num_embeddings dim

    std = gain * math.sqrt(2.0 / float(fan_in + fan_out))
    a = math.sqrt(3.0) * std  # calculate uniform bounds from standard deviation

    return torch.nn.init._no_grad_uniform_(tensor, -a, a)
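As an aside, here is a quick illustrative check of what the fan_out *= tp_degree correction buys, given the function above (this snippet is not part of the PR; vocab_size, hidden, and tp are made-up sizes): the per-shard init bounds match the bounds torch.nn.init.xavier_uniform_ would use on the full, unsharded weight, assuming equal partitions.

import math
import torch

vocab_size, hidden, tp = 1024, 64, 4

# bound xavier_uniform_ would use on the full (unsharded) weight
full_bound = math.sqrt(3.0) * math.sqrt(2.0 / (vocab_size + hidden))

# initialize one tensor-parallel shard with the function above
shard = torch.empty(vocab_size // tp, hidden)
xavier_uniform_tensor_parallel_(shard, tp_degree=tp)

# without the fan_out *= tp_degree correction the shard would be drawn from a wider
# range, sqrt(3) * sqrt(2 / (vocab_size/tp + hidden))
assert shard.abs().max() <= full_bound + 1e-6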
@stas00 (Contributor, Author):

@TimDettmers, does it look OK to you?


Comment on lines 212 to 216
if args.use_bnb_optimizer:
    from bitsandbytes.optim import GlobalOptimManager
    # XXX: ok doing it for the shard?
    GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
    GlobalOptimManager.get_instance().register_parameters(self.weight)
@stas00 (Contributor, Author):

@TimDettmers, does this look OK to you?

@stas00 (Contributor, Author):

Elsewhere you mentioned:

Override the config with optim_bits=32 for each shard first and then register all the parameters (can be done with register_parameters(model.parameters())) before parameters are transferred to cuda

So should I not do:

GlobalOptimManager.get_instance().register_parameters(self.weight)

Here is where we already pass all the params to the optim:

optimizer = adam_optimizer(param_groups,
                           lr=args.lr,
                           weight_decay=args.weight_decay,
                           betas=(args.adam_beta1, args.adam_beta2),
                           eps=args.adam_eps)


The order is correct, but I am not sure about the location. If use_cpu_initialization is False then the weight matrix is transferred to the GPU and the registration requires the buffer to be on the CPU. So the bnb config alteration should be above that.

@stas00 (Contributor, Author) commented Nov 10, 2021:

use_cpu_initialization is False in our case, and it seems to work.

So the bnb config alteration should be above that.

Please define "above that"? We currently have these 4 logical steps (leaving aside the if use_cpu_initialization branch):

1. self.weight = create weight on gpu
2. init_weight
3. GlobalOptimManager.get_instance().override_config(self.weight, 'optim_bits', 32)
4. GlobalOptimManager.get_instance().register_parameters(self.weight)

We can't register the param before it's created.

@TimDettmers commented:

This looks good to me; the only thing that might need to be changed is to do the registration/override before casting the weight to CUDA.
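To make the suggested ordering concrete, here is a minimal sketch (not the PR diff itself; the sizes and the Adam8bit choice are illustrative): override/register the embedding weight while it is still on CPU, and only then cast it to CUDA and build the optimizer.

import torch
import bitsandbytes as bnb
from bitsandbytes.optim import GlobalOptimManager

weight = torch.nn.Parameter(torch.empty(1024, 64))  # 1. create the weight (on CPU)
torch.nn.init.xavier_uniform_(weight)               # 2. initialize it

mgr = GlobalOptimManager.get_instance()
mgr.override_config(weight, 'optim_bits', 32)       # 3. force 32-bit optimizer state
mgr.register_parameters(weight)                     # 4. register before the CUDA cast

weight.data = weight.data.cuda()                    # 5. only now move to the GPU

optimizer = bnb.optim.Adam8bit([weight], lr=1e-4)   # embedding state stays in 32 bits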

@stas00 merged commit a34ca7f into main on Nov 10, 2021
@stas00 deleted the bnb-stable-embed branch on Nov 10, 2021 at 19:41