autoTP for fused qkv weight #3844
Conversation
We conducted some tests.
@RezaYazdaniAminabadi @molly-smith
src_split = torch.split(input, shape[0] // 3, dim=0)
qkv_split = [torch.split(src_s, shape[0] // 3 // mp_size, dim=0) for src_s in src_split]
split_fusedqkv = [
split_fusedqkv (the cat operations) could be made a separate method, since there may be other variations using the same logic flow. @inkcherry?
Also, the changes look good. @molly-smith @RezaYazdaniAminabadi, could you please review and provide suggestions?
Yes, I agree. Since we use the same logic for TP when the inference kernels are enabled, we can reuse this there as well.
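A minimal sketch of what such a reusable helper could look like, based on the split/cat logic in the diff above; the function name and signature are hypothetical, not the merged DeepSpeed API:

```python
# Hypothetical helper, illustrating the split + cat flow from the diff above.
import torch


def shard_glm_type_fused_qkv(fused_qkv: torch.Tensor, mp_size: int, rank: int) -> torch.Tensor:
    """Shard a GLM-style fused qkv weight (q1..qn, k1..kn, v1..vn along dim 0)
    so that the per-rank layout still reads as q-block, k-block, v-block."""
    rows = fused_qkv.shape[0]
    # Split the fused weight into its q, k and v blocks.
    src_split = torch.split(fused_qkv, rows // 3, dim=0)
    # Split each block across tensor-parallel ranks.
    qkv_split = [torch.split(src, rows // 3 // mp_size, dim=0) for src in src_split]
    # Re-fuse this rank's q, k and v slices (the "cat operations" mentioned above).
    return torch.cat([slices[rank] for slices in qkv_split], dim=0)
```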
setattr(child, "replaced", True)
return LinearLayer(weight=data.to(get_accelerator().current_device_name()), bias=bias_data)

def require_tp_fused_qkvw(name):
@inkcherry, can we please move all this logic into a separate file, so that we can then refactor the tp-sharding logic based on it? Thanks.
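For context, a minimal sketch of what a name-based check like require_tp_fused_qkvw (from the diff above) might look like once moved into such a utility file; the pattern list below is an assumption, not the actual DeepSpeed list:

```python
# Assumed sketch; the real check and its recognized module names may differ.
def require_tp_fused_qkvw(name: str) -> bool:
    """Return True if this module/parameter name looks like a fused qkv weight
    that needs layout-aware tensor-parallel sharding."""
    fused_qkvw_name_patterns = ['qkv_proj', 'query_key_value']  # hypothetical patterns
    return any(pattern in name for pattern in fused_qkvw_name_patterns)
```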
Thanks @inkcherry for the great PR!
Thanks for your suggestion :) The file has been modified.
Thanks @inkcherry for this contribution. It would be nice to update the supported/unsupported model lists in docs/_tutorials/automatic-tensor-parallelism.md and maybe add one of these fused qkv weight models to a unit test.
Sure, I placed it in this PR.
* autoTP for fused qkv weight
* fix format
* clean up
* clean up
* clean up
* update
* make logic flow to util and move to file
* fix formatting
* remove empty line

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Reza Yazdani <[email protected]>
Co-authored-by: Jeff Rasley <[email protected]>
Bloom won't work with this since it's not taking mp_size into account.
This PR makes autoTP work with some fused-qkv models.
After TP, the per-rank qkv weight layout should be consistent with the no-TP layout, and the heads should be correctly divided and arranged, so that the model can correctly parse the qkv matrix during inference.
However, models fuse qkv in different ways, and we have found three types so far. For now, the types are temporarily named after the models they appear in.
Assuming num_heads = n, q_x represents the weight of the x-th q head, shape = (hidden_dim, hidden_dim).
Bloom (huggingface):
q1, k1, v1, q2, k2, v2,..., qn, kn, vn
GLM or WebGLM (huggingface):
q1, q2, q3,.. qn, k1, k2, k3,...,kn, v1, v2, v3,..., vn
Codegen (huggingface):
q1, q2,..., q(n/4), k1, k2,..., k(n/4), v1, v2,..., v(n/4), q(n/4+1),...,q(n/2), k(n/4+1),...,k(n/2), v(n/4+1),...,v(n/2),...
At first, we attempted autoTP inference with GLM and got incorrect results, with no errors or warnings. We think adding a warning for unrecognized fused qkv types would help users troubleshoot this kind of problem.
Other models that use fused qkv can also work by matching an existing fused type or adding a new one.
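To make the three layouts concrete, here is an illustrative sketch of layout-aware sharding with a warning for unrecognized fused types; the function name, type names, and fallback behavior are assumptions for illustration, not the exact DeepSpeed implementation:

```python
# Illustrative sketch only; names and fallback behavior are hypothetical.
import warnings

import torch


def shard_fused_qkv(weight: torch.Tensor, fused_type: str, mp_size: int, rank: int) -> torch.Tensor:
    rows = weight.shape[0]
    if fused_type == "glmtype":
        # q1..qn | k1..kn | v1..vn: shard each of the three blocks, then re-fuse,
        # as in the helper sketched earlier in this thread.
        q, k, v = torch.split(weight, rows // 3, dim=0)
        shards = [torch.split(t, t.shape[0] // mp_size, dim=0)[rank] for t in (q, k, v)]
        return torch.cat(shards, dim=0)
    if fused_type == "bloomtype":
        # q1,k1,v1,q2,k2,v2,...: heads are already interleaved, so each rank takes a
        # contiguous chunk of heads -- the split must use mp_size, not a fixed 3.
        return torch.split(weight, rows // mp_size, dim=0)[rank]
    if fused_type == "codegentype":
        # q1..q(n/4), k1..k(n/4), v1..v(n/4), q(n/4+1), ...: each quarter block needs
        # its own q/k/v split; omitted here to keep the sketch short.
        raise NotImplementedError("codegen-type sharding needs per-block handling; see the PR")
    # Unrecognized layouts would otherwise fail silently with wrong results, so warn.
    warnings.warn(f"Unrecognized fused qkv type '{fused_type}'; naive sharding may give "
                  "incorrect inference results.")
    return torch.split(weight, rows // mp_size, dim=0)[rank]
```

The common requirement across all branches is the one stated above: the per-rank weight must still parse as a valid fused qkv matrix for the model's own attention code, as if no TP were applied.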