Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvFuser integration for operation fusion #352

Closed
wants to merge 0 commits into from

Conversation

yuanandonly
Copy link
Contributor

@yuanandonly yuanandonly commented Jul 7, 2022

What does this PR do?

**redirect to #357 **

Uses AOTAutograd and NVfuser to created fused operator patterns layers

  • Bias + Activation + Dropout
  • Bias + Dropout + Residual
  • Bias + Dropout + Residual + Layernorm

Integrated BiasActivationDropout pattern into MLP feedforward

Wrote benchmarking against pytorch eager and triton fused patterns

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • [] N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 7, 2022
@yuanandonly yuanandonly requested a review from Chillee July 7, 2022 18:26
Copy link
Contributor

@dianaml0 dianaml0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the initial work here @yuanandonly! Left a few comments

xformers/components/nvfuser/__init__.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/__init__.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/bias_act_dropout.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/bias_act_dropout.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/base.py Outdated Show resolved Hide resolved
xformers/components/feedforward/mlp.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/bias_act_dropout.py Outdated Show resolved Hide resolved
xformers/components/feedforward/mlp.py Outdated Show resolved Hide resolved
xformers/benchmarks/benchmark_nvfuser.py Outdated Show resolved Hide resolved
@blefaudeux
Copy link
Contributor

Nice, and the perf is super good ! This is pretty great @yuanandonly, thanks already !

I would agree with @dianaml0 and it will be simpler for you actually, probably better not to register a new layer type ? It can be automatically (conditionally on nvfuser being present and in the cases which are just faster) applied to the existing layers, that would be even better I feel -less complexity and automatically improving the situation for all users-. You can still expose a macro flag, in the form of a bool that users can flip -possibly at runtime-, for debugging for instance

Copy link
Contributor

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

On top of what Diana and Benjamin mentioned, I have also a couple of comments.

xformers/components/nvfuser/bias_relu_dropout.py Outdated Show resolved Hide resolved
xformers/components/nvfuser/bias_relu_dropout.py Outdated Show resolved Hide resolved
Copy link

@Chillee Chillee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage of AOTAutograd looks fine to me.

I do wonder about the API though. One of the nice things about using AOTAutograd is that it gives the user a lot of flexibility in how to construct their function. It seems like this API sacrifices that advantage.

Also, it might be worth trying out the TorchInductor backend as well.

xformers/components/nvfuser/bias_act_dropout.py Outdated Show resolved Hide resolved
@yuanandonly
Copy link
Contributor Author

@Chillee What sort of flexibility are you suggesting for the API? Do you mean keeping the nvfuser component more general instead of having different components for the different fused patterns?

from xformers.components import LayerNormStyle


def _fn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can pull out a general method that takes in a function and outputs the nvfused version. Then we can use that to provide these various fusions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could remove some redundancy in these files

@yuanandonly yuanandonly requested a review from dianaml0 July 19, 2022 16:41
@yuanandonly yuanandonly requested review from fmassa and dianaml0 July 22, 2022 21:11
@yuanandonly yuanandonly force-pushed the op_fusion_functorch branch 3 times, most recently from 0e3275d to 7c7f6de Compare July 25, 2022 18:48
@yuanandonly yuanandonly marked this pull request as ready for review July 25, 2022 19:15
@yuanandonly yuanandonly requested a review from Chillee July 25, 2022 19:15
tests/test_nvfuser.py Outdated Show resolved Hide resolved
tests/test_nvfuser.py Outdated Show resolved Hide resolved
and layer_norm_style == LayerNormStyle.Post
):
pytest.skip(
"Layer norm style doesn't apply, the same relevant params already tested once."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we skip here? Should this error out if a user tries it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the layernormstyle argument doesn't apply to the other patterns, so when layer_norm_style == LayerNormStyle.Pre all the possible configurations have already been tested once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should this just be if fused_pattern != NVFusedBiasDropoutResLayerNorm ?

tests/test_nvfuser.py Outdated Show resolved Hide resolved
tests/test_nvfuser.py Outdated Show resolved Hide resolved
tests/test_nvfuser.py Outdated Show resolved Hide resolved
tests/test_nvfuser.py Outdated Show resolved Hide resolved
xformers/__init__.py Outdated Show resolved Hide resolved
@yuanandonly yuanandonly force-pushed the op_fusion_functorch branch 2 times, most recently from 62551fb to c774755 Compare July 27, 2022 00:47
Copy link
Contributor

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes Chris!

I've left some more comments. I think that ideally we would want the nvfuser model to be a drop-in replacement for the vanilla model, which would imply that the weight layout should be the same. I understand this might complicate a few things for now, so let's not make this a merge-blocking requirement, but it would be good to get this fixed.

tests/test_nvfuser.py Outdated Show resolved Hide resolved
_is_triton_available: bool = torch.cuda.is_available()

# Set to true to utilize functorch
_is_functorch_available: bool = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the desired way to let users have the nvfuser path enabled? Will they need to clone and modify xformers for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think currently we have it as something that users will need to modify. I haven't looked into it a great deal, perhaps users who haven't cloned xformers can do something like python -c "import xformers; xformers._is_functorch_available=True;

Comment on lines 48 to 55
nn.Linear(
in_features=dim_model, out_features=dim_mlp, bias=False
), # bias is handled in the next layer
NVFusedBiasActivationDropout(
p=dropout,
bias_shape=dim_mlp if bias else None,
activation=activation,
),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that the bias live in a different module means that you won't be able to easily swap to use nvfuser on a pre-trained model which used the standard layers. This is IMO a drawback which would be good to be solved.

Using this optimization means that this is not just a transparent change to the user, and the snippet like I proposed before with state_dict won't work out of the box.

xformers/components/nvfuser/bias_dropout_res.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@yuanandonly yuanandonly force-pushed the op_fusion_functorch branch from 42a7ae9 to d9199f0 Compare July 27, 2022 19:01
HOWTO.md Outdated Show resolved Hide resolved
HOWTO.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@yuanandonly yuanandonly force-pushed the op_fusion_functorch branch 3 times, most recently from 7a1519c to 2a5772b Compare July 28, 2022 07:16
@yuanandonly yuanandonly force-pushed the op_fusion_functorch branch from 2a5772b to 9e27c8f Compare July 28, 2022 07:51
fmassa pushed a commit that referenced this pull request Aug 10, 2022
Co-authored-by: danthe3rd <danthe3rd>
fmassa added a commit that referenced this pull request Aug 25, 2022
* Enable masking in memory-efficient attention (#333)

* Add attention bias in memory-efficient attention

* Add gradient for attn_mask support

* Add CPU implementation

* clang-format

* Add benchmark scripts

* Add extra loop in benchmarks

* Move zeros array out of helper function

* clang-format

* Enable dropout in memory-efficient attention (#334)

* Merge compute_scaling_coeffs and update_scaling_coeffs into a single function

It wasn't needed to break it in two functions to begin with

* Add CUDA implementation for dropout

* clang-format

* Make p be drop probability

* Only CUDA supports dropout

* Add benchmarks

* Remove unused variables

* Fix test

* Cleanups and comments

* Fix masking corner case when full block is masked (#339)

* Add cutlass 2.9 - 858c735856a7f17bd33fe438ec76d3c9f0234e7f

* Option to load from shared memory for PredicatedTileIterator

* Add cutlass include dir

* Ignore files in third-party for flake8/coverage

* third-party -> third_party

* Address comments

* Revert some un-needed mods

* Add attention_forward_generic.cu

* Add tests

* Fix duplicate calculations on baseline for mem efficient transformers

* Always run all linters in CI

* clang-format attention_forward_generic.cu

* Benchmark: Add possibility to compare benchmarks

* [isort] Ignore third_party

* black autoformat

* Black again + ignore third_party properly

* black

* Fix memory leak between the 2 benchmarks in backward

* Exclude third_party/ without using pyproject.toml as it imposes isolated build which is a pain

* Remove progress bar when finished

* mypy

* flake8

* Save results to shared folder in home location

* run black

* clang-format with 'run-clang-format.py'

* Fix cutlass build for arch>=75

* Set tests precision for gradient more accurately

* Fix precision margin

* Revert changes to black

* [feat] Fix importing xformers when not built (#351)

authored-by: danthe3rd <[email protected]>

* Update black to 22.3.0

* Tweak precision for mem_eff_attention test

* mem-efficient impl for f16 (#352)

Co-authored-by: danthe3rd <danthe3rd>

* Add support for f16 with tensorcores [sm70/sm75/sm80] (#354)

* Add support for f16 with tensorcores

* sm75 minimum for tensorcores

* Run tests with CUDA_LAUNCH_BLOCKING=1

* Support sm70 properly

* Disable tensorcore when not correctly aligned - and use 32bit accessors

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: danthe3rd <[email protected]>

* Optimize backward of memory-efficient attention by ~20% (#355)

* Optimize backward by 15% by using equivalent formulation

* Unify everything into single kernel

* Remove unused implementation

* clang-format

* Remove unused tensor

* Display results as we progress during benchmark (#357)

Co-authored-by: danthe3rd <danthe3rd>

* RFC: Ops dispatch (#356)

* Ops dispatch

* CI: Fix doc build

* memory_efficient_attention raises when no implementation is available

* type: ignore

* Fix torch.device/str comparison

* Make mypy happy

Co-authored-by: danthe3rd <[email protected]>
Co-authored-by: danthe3rd <danthe3rd>

* [A100/f32] Use TensorCores for Q.K_t matmul with FastF32 (#358)

* Use TensorCores for MM0 on Float as well

* Use MultiStage MMA when available - change to FastF32 rather than FastF16

* Better alignment calculation

* Just use regular f32, no fastf32

* Hackfix to handle alignment

* HeuristicsMM0 -> GemmTypeQK

* No longer use f16 for matmul

* Add some doc

* Typo

* Fix build <sm80

* Alignment check based on current device compute capability

* Use TORCH_INTERNAL_ASSERT

Co-authored-by: danthe3rd <danthe3rd>

* FlashAttention implem and dispatch (#360)

* FlashAttention implem WIP

* Fix flashattention forward+backward

* Fix forward/backward for FlashAttention

* Enable tests (more permissive) for f16 backward

* Fix CI

* flashattn only supports Sm75 and above

* Fix CI2

* Disable K=128 when below sm80 for flashattn

Co-authored-by: danthe3rd <danthe3rd>

* Misc performance improvements for generic mem-efficient attention (#361)

* 3% speedup by calculating mi from registers

* Also compute m_prime/s_prime and exponentiate from registers

* Support for Simt tiles

* Fix TensorOp for V100

* Fix for A100

* Fix Simt alignment calculation

* clang-format

* WarpReduction before atomic call for Simt

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: danthe3rd <[email protected]>

* Update flashattention to support bf16 (#363)

* Update flashattention to support bf16

* bfloat16 only on sm80 and above

Co-authored-by: danthe3rd <danthe3rd>

* Flashattn causal (#364)

* Implement causal memory-efficient attention with FlashAttention

* Update benchmarks

* Fix mypy

Co-authored-by: danthe3rd <danthe3rd>

* Option to disable flashattention (long to build) (#362)

* Option to disable flashattention (long to build)

* Update setup.py

Co-authored-by: danthe3rd <danthe3rd>

* Remove code duplicate in attention_scaling_coefs_updater.h (#367)

Co-authored-by: danthe3rd <danthe3rd>

* Update .gitmodules (#366)

* MemoryEff attention forward: Properly fuse matmul and enable TensorCores on the second matmul (#368)

* Generic backwards

* Guard backward to sm75 only

* bounds checking for gradV

* clang-format

* Fused gemm working for Sm80/Sm75 f16/f32

* WIP

* Volta TensorOp for f16

* Working on A100 again

* SIMT working

* Code cleanup 1

* Code cleanup2

* BUGFIX for shared memory limit

* Remove code

* clang-format

* Remove code again

* Remove draft of backward

* Enforce alignment for fp16

* Fix tests

* Fix constraint on seq length when not using tensorcores

* Fix alignment requirements for V100/tensorcores

* Clang-format

* Update xformers/components/attention/csrc/cuda/attention_forward_generic.cu

Co-authored-by: Francisco Massa <[email protected]>

* Address comments from fmassa

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: danthe3rd <[email protected]>
Co-authored-by: Francisco Massa <[email protected]>

* Update install instructions with submodule (#365)

* Generic backward implem with cutlass (#371)

* Old bw code

* P100: gradV working

* gk/gq working (at least for small values of M, and on P100/f16)

* Further restrict supported values for bw

* Fix storage into smem for Simt

* More tooling for pruint/debug

* Remove tests we dont need for now

* Tests pass on P100 :D

* 4 warps per block

* Restraint on q length

* Use tensorcores on V100 for f16

* Support dynamic smem for bw

* Handle alignment and different dtype/arch

* Fix NaNS by initializing shared memory

* bw.py

* Fix launch bounds

* Faster 'computeDi'

* minus_lse can operate on arrays

* Output number of regs used etc...

* Code cleanup

* Hackfix for alignment check during forward

* zFill to avoid nans in Sm80 + fix launch bounds

* COde cleanup1

* clang-format

* Fix tests

* Add benchmark for K=64

Co-authored-by: danthe3rd <[email protected]>
Co-authored-by: danthe3rd <danthe3rd>

* Cutlass as submodule (#375)

* Make cutlass be back at 858c735856a7f17bd33fe438ec76d3c9f0234e7f

* Remove cutlass

* Update submodules

* Add submodule (properly)

* spaces / tab

* Make submodule init be recursive

* Fix bad rebase

* Bump tolerance for backward (#377)

* Add verbose flag to CI builds (#376)

* Add verbose flag to CI builds

* Spurious change to rebuild cache

* Add ninja

* Ninja wasn't visible before, install through conda

* Debugging

* Source env

* One more try

* Forgot to uncomment a line

* Another try

* Cleanup

* Fix for FlashAttention dispatch

It requires device capability >= 7.5

* Remove generated file

* Address some reviewer feedback

Remove unused function and typo fix

* Perf improvement on backward (#378)

* Fast again on V100

* Fix correctness - missing syncthreads

* Get rid of AttentionInfo

Co-authored-by: danthe3rd <[email protected]>

Co-authored-by: danthe3rd <[email protected]>
Co-authored-by: dan_the_3rd <[email protected]>
bertmaher pushed a commit to bertmaher/xformers that referenced this pull request Dec 20, 2024
Co-authored-by: danthe3rd <danthe3rd>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. ongoing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants