[DRAFT][Blocked] Mem efficient attention - FW pass #162

blefaudeux · 2021-12-22T00:58:11Z

What does this PR do?

First take for #161: this PR only covers the forward pass, so no training is possible yet. The method is described here; the gist is that you compute the attention with your current best guess for the softmax renormalization, save your offset, and correct post hoc.
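To make the idea concrete, here is a minimal NumPy sketch of the "best guess, save the offset, correct afterwards" scheme. It is not the Triton kernel from this PR: the function names are hypothetical, the usual 1/sqrt(d) scaling is left out, and the correction is applied tile by tile rather than in whatever form the kernel ends up using.

```python
import numpy as np


def reference_attention(q, k, v):
    # Vanilla attention: materializes the full (M, N) score matrix.
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def chunked_attention(q, k, v, block_n=4):
    # Streaming variant (illustrative): only an (M, block_n) tile of scores
    # exists at any time.
    m, n = q.shape[0], k.shape[0]
    running_max = np.full((m, 1), -np.inf)  # current best guess for the softmax offset
    denom = np.zeros((m, 1))                # running sum of exp(scores - running_max)
    acc = np.zeros((m, v.shape[1]))         # running un-normalized output
    for start in range(0, n, block_n):
        s = q @ k[start:start + block_n].T
        new_max = np.maximum(running_max, s.max(axis=1, keepdims=True))
        # Rescale everything accumulated under the old offset to the new one.
        correction = np.exp(running_max - new_max)
        p = np.exp(s - new_max)
        denom = denom * correction + p.sum(axis=1, keepdims=True)
        acc = acc * correction + p @ v[start:start + block_n]
        running_max = new_max
    return acc / denom


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((10, 16))
v = rng.standard_normal((10, 5))
print(np.allclose(chunked_attention(q, k, v), reference_attention(q, k, v)))
```

The two functions should agree up to floating-point error, while the chunked one never holds more than one tile of scores.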

TODO:

check parity all along (buggy for now)
possibly change the scheduling to improve L2 rate (may not be a limiting factor really)
handle bias
change the way we handle batch dimension ?
normalize in the kernel if there's only ever one tile over N
tiling case is broken by a factor of kN
cc @ptillet

Before submitting

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

blefaudeux · 2021-12-22T01:39:15Z

xformers/triton/k_mem_efficient_attention.py

+        v_ptrs = V + rn[:, None] * L + rl_i[None, :]    # (BLOCK_N, BLOCK_L)
+        v = tl.load(v_ptrs, mask=((rn[:, None] < N) & (rl_i[None, :] < L)), other=0.0)
+
+        qkv = tl.dot(exp_acc, v).to(tl.float32)         # (BLOCK_M, BLOCK_L)


@ptillet without the .to(tl.float32) cast this crashes, for instance; it's not really obvious to me why.

blefaudeux · 2021-12-22T22:28:44Z

Quick update: after working around a bug in the Triton compiler, the PoC is there and runs, but it is not shippable as is. There is a lot of perf potential: the FW pass could actually be faster than a vanilla implementation while using a lot less memory. The BW pass will always be a little slower, but the end result could well be worth it.

ptillet · 2021-12-22T23:14:08Z

Yep there's definitely a bug in the compiler. The holiday break seems like the right time to rewrite how Triton handles data layout. I think there's a lot of potential for this kind of fused attention function to be very performant and memory-efficient.

blefaudeux · 2022-01-10T05:36:57Z

Keeping the branch up but closing the PR: I cannot do much on this topic at the moment, as it depends on upstream fixes in Triton.

blefaudeux · 2022-03-14T20:19:17Z

xformers/triton/k_mem_efficient_attention.py

+    BLOCK_N = min(triton.next_power_of_2(N), 1024)  # increase the ceiling to save more memory
+    BLOCK_L = 8
+
+    tiles_n = triton.cdiv(N, BLOCK_N)


cc @dianaml0
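For readers following the snippet above, here is a small pure-Python sketch of how the block size and tile count fall out of N. The two helpers are stand-ins written for illustration and are assumed to behave like `triton.next_power_of_2` and `triton.cdiv`; the `tiling` function and its `ceiling` parameter are hypothetical names.

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n, for n >= 1 (illustrative stand-in).
    return 1 << (n - 1).bit_length()


def cdiv(a: int, b: int) -> int:
    # Ceiling division (illustrative stand-in).
    return (a + b - 1) // b


def tiling(n: int, ceiling: int = 1024):
    # Mirrors the two lines in the diff: cap the block size, then count tiles.
    block_n = min(next_power_of_2(n), ceiling)
    tiles_n = cdiv(n, block_n)
    return block_n, tiles_n


for n in (384, 1500, 4096):
    print(n, tiling(n))
```

With the 1024 ceiling, N = 384 fits in a single tile of 512, while N = 4096 is split into 4 tiles of 1024. Raising the ceiling reduces the number of tiles over N.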

ptillet · 2022-03-14T20:28:40Z

Hey! FYI mem-efficient attention is becoming a bigger priority for us. The bug is pretty deep inside the compiler, but I am seriously considering diving in and taking care of it. I have more time to address stability issues in Triton.

blefaudeux · 2022-03-15T01:09:20Z

> Hey! FYI mem-efficient attention is becoming a bigger priority for us. The bug is pretty deep inside the compiler, but I am seriously considering diving in and taking care of it. I have more time to address stability issues in Triton.

I just updated this old branch, and I'm getting an IndexError: map::at at compilation time with the latest dev package. Let me know when there's something ready to check out, @ptillet!

WIP, this is promising

a12b351

facebook-github-bot added the CLA Signed label Dec 22, 2021

minor lint + unit test, needs fixing and better handling of the smaller case -whole line in kernel-

18744d8

blefaudeux force-pushed the mem_efficient_fw branch from fc02cc8 to 18744d8 Compare December 22, 2021 01:27

blefaudeux linked an issue Dec 22, 2021 that may be closed by this pull request

[feat] Add a fast implementation of Rabe and Staats algorithm (mem efficient attention) on GPU #161

Closed

blefaudeux force-pushed the mem_efficient_fw branch from 141b484 to 3f8b95f Compare December 22, 2021 04:33

WIP fusing the normalization, kernel still super buggy and a bit long

c91f376

blefaudeux force-pushed the mem_efficient_fw branch from 3f8b95f to c91f376 Compare December 22, 2021 05:42

blefaudeux added 2 commits December 22, 2021 10:16

working around compiler bug, getting to something imperfect but which mostly works

0329d4f

Add a unit test to check memory use, needs improvements..

6fe0903

blefaudeux force-pushed the mem_efficient_fw branch from a47ef5b to 6fe0903 Compare December 22, 2021 18:43

adding a benchmark, could do with more work

6ef7309

blefaudeux mentioned this pull request Jan 8, 2022

[improvement] lower Favor+causal memory consumption #105

Open

blefaudeux changed the title ~~[DRAFT] Mem efficient attention - FW pass~~ [DRAFT][Blocked] Mem efficient attention - FW pass Jan 10, 2022

blefaudeux closed this Jan 10, 2022

xwhan pushed a commit to xwhan/xformers that referenced this pull request Feb 8, 2022

Merge pull request facebookresearch#162 from fairinternal/diana_fix

5494e1a

Small Fix, order in which tensors passed to attention
