[feat] Dropout(Activation(x+bias)), now with partial BW fusion #144
Conversation
Force-pushed from a18bda0 to 147e2c9
Force-pushed from c7ab5b0 to 8dea61d
[Benchmark plots: Dropout_Bias_False FW+BW torch.float16 Act: gelu / Dropout_Bias_True FW+BW torch.float16 Act: squared_relu / Dropout_Bias_True FW torch.float16 Act: squared_relu]
Interested @suchenzang? This took a while to get right. Some speed for small tensors should be recoverable, I didn't play with the settings too much. edit: old plots, see below for up to date numbers
Oh man, coming to xFormers to shop for parts is great. @blefaudeux these numbers are on V100s or A100s?
8dea61d
to
190bbaf
Compare
ahah, I missed you @suchenzang! This is on an ampere laptop, working with what I have around. 400GB/s is the max bandwidth, so basically there's not much to win on the inference side. The reported training number is not exact GB-wise (there are operations in the middle that are not counted), but it could well be that the scheduling brings back another 10-20%; these are very raw numbers. It should be at parity with pytorch accuracy-wise, no compromises here, just fusing the kernels. I'll also pull in @Chillee since he has an NVFuser solution for the same part that you could test out (we were wondering whether xformers could host that as well); he was actually the incentive for me to revisit this (he's got very good numbers!)
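As a rough way to sanity-check bandwidth numbers like these, the effective GB/s of a memory-bound op can be estimated as bytes read plus bytes written divided by the measured runtime. The sketch below is illustrative only (stand-in op, arbitrary sizes, assumes a CUDA device), not the benchmark harness behind the plots.

```python
import torch

def effective_bandwidth_gbps(fn, x, n_iters=100):
    # Warm up so one-time costs do not pollute the timing
    for _ in range(10):
        fn(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    stop = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(n_iters):
        fn(x)
    stop.record()
    torch.cuda.synchronize()
    ms_per_iter = start.elapsed_time(stop) / n_iters

    # One read of x and one write of the output, ignoring anything in between
    bytes_moved = 2 * x.numel() * x.element_size()
    return bytes_moved / (ms_per_iter * 1e-3) / 1e9

x = torch.randn(2048, 4096, device="cuda", dtype=torch.float16)
gbps = effective_bandwidth_gbps(lambda t: torch.nn.functional.dropout(t, 0.1), x)
print(f"~{gbps:.0f} GB/s, to compare against the card's peak (e.g. ~440 GB/s GDDR6)")
```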
[Benchmark plot: Dropout_Bias_True FW torch.float32 Act: None]
edit: old plot, see below for up to date numbers
cc @min-xu-ai, in case it helps to have a look at a Triton kernel for Mevo
I'll clean up the PR, sorry for all the extra changes
190bbaf
to
73b7663
Compare
I just pushed a cleaned-up version, should be better
```diff
@@ -25,42 +25,48 @@
 class _dropout(torch.autograd.Function):
     @staticmethod
     @custom_fwd(cast_inputs=torch.float16)
-    def forward(ctx, x, p, bias, activation, activation_grad):
+    def forward(ctx, x, p, bias, activation, activation_grad, trainable_bias):
```
The trainable bias (or not) was not properly handled before: the bias was always assumed to be trainable, which is mostly true but not always.
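For readers following along, here is a minimal PyTorch-level sketch of what the `trainable_bias` flag changes, i.e. only reducing the gradient over the rows when the bias actually needs it. This is illustrative only, not the actual Triton-backed `_dropout` implementation.

```python
import torch
from torch.cuda.amp import custom_bwd, custom_fwd


class _dropout_sketch(torch.autograd.Function):
    # Illustrative: dropout(activation(x + bias)) with an optional bias gradient

    @staticmethod
    @custom_fwd(cast_inputs=torch.float16)
    def forward(ctx, x, p, bias, activation, activation_grad, trainable_bias):
        z = x if bias is None else x + bias
        a = activation(z) if activation is not None else z
        keep = (torch.rand(a.shape, device=a.device) > p).to(a.dtype)
        ctx.save_for_backward(keep, z)
        ctx.p = p
        ctx.activation_grad = activation_grad
        ctx.trainable_bias = trainable_bias and bias is not None
        return a * keep / (1.0 - p)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad_out):
        keep, z = ctx.saved_tensors
        grad = grad_out * keep / (1.0 - ctx.p)      # dropout backward
        if ctx.activation_grad is not None:
            grad = grad * ctx.activation_grad(z)    # activation backward
        # Only reduce over the leading dims when the bias is actually trainable
        grad_bias = grad.flatten(0, -2).sum(dim=0) if ctx.trainable_bias else None
        # One gradient per forward input: x, p, bias, activation, activation_grad, trainable_bias
        return grad, None, grad_bias, None, None, None
```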
Force-pushed from 157789b to 7f3718d
using less seeds
tiling + vertical seeds
Computing the FW and BW per tile over M
workaround atomics, reintroduce locks, should not be too bad
yet another take, partial sum
better scheduling defaults, improves across the board
giving atomic add a go
back to the locks, completely fuse the BW
Force-pushed from 7f3718d to 16808ea
Force-pushed from 89b92ba to 7221f7e
…oblem for small buffers and BW
Force-pushed from 7221f7e to 93668d5
…e and correct masks
Probably spent too long on that; it was a side project and I lost a lot of time getting back into the proper context each time (+ a couple of small hiccups with Triton along the way). Now with good enough perf I think: there's one case which is not faster than pytorch (really small buffers + gelu + fp16), everything else is significantly faster and the FW speed almost doubled with this PR. I updated all the graphs; keep in mind when comparing that the previous ones were with a V100 (HBM memory, something like 900GB/s max bandwidth) and the new ones are with a 3080 laptop (GDDR6 memory, something like 440GB/s max bandwidth). I think this could be revisited with a newer Triton, or with NVFuser/functorch (if it's not too much work a PR would be great @Chillee, if the speed is there OOB I would definitely take it!). Up for review, and next is #153, probably much more impactful if it is doable.
testing a training run with microGPT, looks like something is wrong, the loss plateaus..
fixed, no impact on perf. Checking with loss curves right now that everything is fine, but it should be the case. I've improved the unit test to catch that; basically p=0.5 was correct but p=0.1 was not.
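The kind of check that catches this is a statistical one over several drop probabilities, not just p=0.5. A self-contained sketch, with a plain PyTorch reference standing in for the fused kernel under test:

```python
import pytest
import torch


def reference_dropout(x, p):
    # Plain PyTorch reference: mask, then rescale by 1/(1-p)
    keep = (torch.rand(x.shape, device=x.device) > p).to(x.dtype)
    return x * keep / (1.0 - p)


# In the real test the fused kernel would be exercised here; the reference
# stands in so that this sketch runs on its own.
@pytest.mark.parametrize("p", [0.1, 0.3, 0.5, 0.7])
def test_drop_rate_matches_p(p):
    x = torch.ones(4096, 4096)
    y = reference_dropout(x, p)

    # The observed drop rate should track p across the whole range,
    # which is exactly what a p=0.5-only check would miss.
    drop_rate = (y == 0).float().mean().item()
    assert abs(drop_rate - p) < 0.01

    # Kept values must be rescaled by 1/(1-p) so the expectation is preserved
    kept = y[y != 0]
    assert torch.allclose(kept, torch.full_like(kept, 1.0 / (1.0 - p)), atol=1e-3)
```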
Force-pushed from d631179 to ee9ea6f
this is against the previous main (blue, still FusedMLP) and against MLP (red, pure pytorch). The scale is log, which emphasizes the small differences; without it you cannot distinguish one curve from the other. The end test (sampled text) looked good. In both fused cases there's a small but measurable difference with pure pytorch though; I'm not sure why, except that maybe the AMP execution is not the same (in the FusedMLP case inputs are cast to fp16 and remain so over the layer, in the pytorch non-fused case maybe not everything is fp16)
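A small illustration of the AMP difference being speculated about (assumes a CUDA device): an op wrapped with `custom_fwd(cast_inputs=torch.float16)` sees fp16 end to end, while under plain autocast each individual op picks its own dtype, so parts of the non-fused path can stay in fp32.

```python
import torch
from torch.cuda.amp import custom_bwd, custom_fwd


class WholeBlockFp16(torch.autograd.Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float16)
    def forward(ctx, x):
        print("inside the custom op:", x.dtype)  # float16, for the whole body
        return torch.nn.functional.gelu(x + 1.0)

    @staticmethod
    @custom_bwd
    def backward(ctx, grad_out):
        return grad_out


x = torch.randn(16, 16, device="cuda")  # fp32 input
with torch.autocast("cuda", dtype=torch.float16):
    y_fused_style = WholeBlockFp16.apply(x)
    # Non-fused path: autocast decides per op, and neither add nor gelu is on
    # the fp16 cast list, so this stays in the input dtype (fp32 here).
    y_eager = torch.nn.functional.gelu(x + 1.0)

print("custom op output:", y_fused_style.dtype)  # torch.float16
print("eager output:", y_eager.dtype)            # torch.float32
```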
So the only mildly worrisome thing here is seeing some divergence as training progresses. It seems like that delta shows up a bit vs the blue. If the two converge to different points, then we have a problem :(
totally, it's very strange because there's no shortcut in the implementation, it's supposed to give the same results. I've tried AMP/fp16, it's not that, same results. Now it could just be about the seed, I'll check that next
hmm, testing seeds: it does change the result a bit, but still. With this PR we're using fewer seeds within the random number generation and generating more numbers out of them, but off the top of my head that's the only measurable difference with the previous take; the rest is just a different kernel architecture
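To make the "fewer seeds, more numbers per seed" point concrete, here is a toy illustration in plain PyTorch (not the kernel's actual counter-based RNG): one seed drives the dropout decisions for an entire column.

```python
import torch

M, N, p = 1024, 8, 0.1

# One seed per column instead of reseeding per tile / per element
col_seeds = torch.randint(0, 2**31 - 1, (N,))
keep = torch.empty(M, N, dtype=torch.bool)
for j, seed in enumerate(col_seeds.tolist()):
    g = torch.Generator().manual_seed(seed)
    # A single seed produces the M random draws for the whole column
    keep[:, j] = torch.rand(M, generator=g) > p

# The concern raised above: if the per-seed stream is of poor quality,
# columns can end up correlated, which hurts training even if the average
# keep rate looks right.
print("per-column keep rate:", keep.float().mean(dim=0))
```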
Discussing with Phil: it could be that the RNG is not good enough, checking that next
could well be the reason; the fused dropout on main/ is fine. I checked whether there was a pattern in the dropout, but nothing obvious on a heatmap. Dropping that for now. It could be that this (generating fewer seeds) was the main reason for a very fast FW, but it cannot come at the expense of training accuracy
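The heatmap check is easy to reproduce. A sketch (placeholder dropout, illustrative sizes) that averages the keep-mask over many trials and plots it: a good RNG gives a flat map around 1-p, while row or column stripes would point at correlated seeds.

```python
import matplotlib.pyplot as plt
import torch


def placeholder_dropout(x, p):
    # Stand-in for the kernel being inspected
    keep = (torch.rand(x.shape, device=x.device) > p).to(x.dtype)
    return x * keep / (1.0 - p)


p = 0.1
x = torch.ones(256, 256)
keep_freq = torch.zeros_like(x)

n_trials = 500
for _ in range(n_trials):
    keep_freq += (placeholder_dropout(x, p) != 0).float()
keep_freq /= n_trials

plt.imshow(keep_freq.numpy(), vmin=0.8, vmax=1.0, cmap="viridis")
plt.colorbar(label="keep frequency (expected ~ 1 - p)")
plt.title("Dropout keep-frequency heatmap")
plt.savefig("dropout_heatmap.png")
```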
Use configurations from real models
What does this PR do?
This was a long time in the making: fusing the BW part of the activation/bias/dropout kernel. Not quite perfect, but in some places the speed goes really bananas (like 3x or 4x the naive calls).
Fusing this implied flipping the whole problem upside down: the seeds have to be per column, and the kernels (FW and BW) also work that way. This allows us to fuse the bias gradient computation, since it's a sum over that direction.
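For reference, this is the unfused computation that the kernel replaces, written as separate PyTorch ops; it also shows why the bias gradient is a reduction over the rows, which is the property the per-column seed layout exploits:

```python
import torch

def unfused_reference(x, bias, activation, p, training=True):
    z = x + bias                                                    # 1) bias add
    a = activation(z)                                               # 2) activation (gelu, squared relu, ...)
    return torch.nn.functional.dropout(a, p=p, training=training)   # 3) dropout

M, N = 1024, 4096
x = torch.randn(M, N, requires_grad=True)
bias = torch.zeros(N, requires_grad=True)

y = unfused_reference(x, bias, torch.nn.functional.gelu, p=0.1)
y.sum().backward()

# The bias gradient is a column-wise sum of the upstream gradient chain,
# i.e. a reduction over M, which the fused BW kernel can accumulate per tile.
assert bias.grad.shape == (N,)
```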
TODO:
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.