[feat] Add a fast implementation of Rabe and Staats' algorithm (memory-efficient attention) on GPU #161
Comments
Another reference impl can be found here -- same caveats as outlined above.
I've started something. It feels like some of the logic would need to change a bit to make sense at the kernel level, at least in Triton. In particular, it's hard to sequence things from outside a kernel, and reproducing the exact logic from the paper would mean large intermediate buffers (if the computation is tiled), which diminishes the interest a lot. The best approach seems to be a kernel that owns a whole row of the attention matrix, processing a couple of rows at a time to help with data-fetch reuse.
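To make that concrete, here is a rough, untested sketch (in plain PyTorch rather than Triton, with made-up names and block sizes) of the per-row-block logic such a kernel would own: a block of query rows streams over the key/value chunks while keeping a running max and denominator (online softmax), so the full attention matrix is never materialized.

```python
import torch

def row_block_attention(q_block, k, v, key_chunk=128):
    # q_block: (rows, d) -- the handful of query rows one kernel instance owns
    # k, v:    (n, d)    -- full key / value tensors, visited chunk by chunk
    rows, d = q_block.shape
    scale = d ** -0.5
    acc = torch.zeros_like(q_block)                           # running weighted sum of V
    running_max = q_block.new_full((rows, 1), float("-inf"))  # running row-wise max
    denom = q_block.new_zeros(rows, 1)                        # running softmax denominator
    for start in range(0, k.shape[0], key_chunk):
        k_c = k[start:start + key_chunk]
        v_c = v[start:start + key_chunk]
        scores = (q_block @ k_c.T) * scale                    # (rows, chunk)
        chunk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(running_max, chunk_max)
        # rescale the previous accumulators to the new max before adding this chunk
        correction = torch.exp(running_max - new_max)
        p = torch.exp(scores - new_max)
        acc = acc * correction + p @ v_c
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        running_max = new_max
    return acc / denom
```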
Thanks!
🚀 Feature
Implement https://arxiv.org/pdf/2112.05682v2.pdf using Triton
Motivation
There are existing implementations in PyTorch, but they're bound to be a little slow. It's actually not that much work to write this in Triton, so let's give it a shot. Given the forward speed (should be similar to normal attention, without the memory cost) and the expected backward speed (about 60% of vanilla attention), it feels like a compromise that many would use.
Pitch
The required kernel is actually not far from some of the kernels we already have, at least for the forward pass. The chunk strategy proposed by the paper is fairly classic in that field, nothing out of the ordinary (see for instance), so it's bound to be pretty fast if implemented correctly.
Alternatives
At least support a pure PyTorch variant in xformers?
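For reference, a minimal pure-PyTorch sketch of the chunked scheme from the paper could look like the following. Names and chunk sizes are illustrative, and the gradient checkpointing over the inner loop that the paper uses to keep the backward pass memory-efficient is omitted for brevity.

```python
import torch

def memory_efficient_attention(q, k, v, q_chunk=1024, k_chunk=1024):
    # q, k, v: (n, d); computes softmax(q @ k.T / sqrt(d)) @ v
    # without ever materializing the full (n, n) attention matrix.
    n, d = q.shape
    scale = d ** -0.5
    out = []
    for qs in range(0, n, q_chunk):
        q_c = q[qs:qs + q_chunk] * scale
        chunk_vals, chunk_lse = [], []
        for ks in range(0, n, k_chunk):
            s = q_c @ k[ks:ks + k_chunk].T                    # (q_chunk, k_chunk) scores
            lse = torch.logsumexp(s, dim=-1, keepdim=True)    # per-row log-normalizer
            chunk_vals.append(torch.exp(s - lse) @ v[ks:ks + k_chunk])
            chunk_lse.append(lse)
        lse = torch.cat(chunk_lse, dim=-1)                    # (q_chunk, n_key_chunks)
        weights = torch.softmax(lse, dim=-1)                  # weight of each key chunk
        vals = torch.stack(chunk_vals, dim=0)                 # (n_key_chunks, q_chunk, d)
        out.append(torch.einsum("qc,cqd->qd", weights, vals))
    return torch.cat(out, dim=0)
```

Each key chunk contributes a partial value sum plus its log-normalizer, and the per-chunk results are recombined with a softmax over those log-normalizers, so peak memory scales with the chunk sizes rather than with the full sequence length.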