Update on "Switch CI to pytorch 1.13"
Closes #515
Closes #514

Note:
The `static_argnums` argument to `memory_efficient_fusion` has been removed, so some code had to be updated
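As an illustration of the kind of update this requires (a hedged sketch, not the exact change in this commit; `bias_gelu` and its `approximate` argument are made up for the example), static arguments can be bound with `functools.partial` before calling `memory_efficient_fusion`:

```python
from functools import partial

import torch
from functorch.compile import memory_efficient_fusion


def bias_gelu(x, bias, approximate="none"):  # hypothetical function to fuse
    return torch.nn.functional.gelu(x + bias, approximate=approximate)


# Previously, non-tensor arguments could be marked as static:
#   fused = memory_efficient_fusion(bias_gelu, static_argnums=(2,))
# Now, bind them up front and fuse the resulting callable instead:
fused = memory_efficient_fusion(partial(bias_gelu, approximate="tanh"))
```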

[ghstack-poisoned]
danthe3rd committed Nov 14, 2022
2 parents 985792f + 3362adf commit 9b8a17d
Showing 89 changed files with 793 additions and 547 deletions.
4 changes: 3 additions & 1 deletion BENCHMARKS.md
@@ -17,7 +17,7 @@ Some examples, generated with `python3 xformers/benchmarks/benchmark_encoder.py

## Benchmark the core sparse attention mechanisms

`python3 xformers./benchmarks/benchmark_core.py` will measure the speed of the core sparse attention mechanism. The current numbers are as follows (times in microseconds (us)):
`python3 xformers/benchmarks/benchmark_core.py` will measure the speed of the core sparse attention mechanism. The current numbers are as follows (times in microseconds (us)):

| | **matmul_with_mask** | | **softmax** | | **bmm** | |
| ---------------------- | --------------------- | ---------------------- | --------------------- | ---------------------- | --------------------- | ---------------------- |
@@ -37,6 +37,8 @@ Some examples, generated with `python3 xformers/benchmarks/benchmark_encoder.py

## Triton layers

Please note that, as of November 2022, these layers are not optimized for typical production GPUs (they have not been developed for some time and were mostly tested on a laptop GPU), and better performance is probably possible with minor changes, as other libraries have demonstrated since xFormers was released.

### Fused softmax

You can reproduce these numbers locally by running `python3 xformers/benchmarks/benchmark_triton_softmax.py`. The units are GB/s. These results are for a laptop nVidia 3080, Triton 2.0 and PyTorch 1.12.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -6,6 +6,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## TBD
### Fixed
- Updated triton dependency [#418]

### Added

160 changes: 52 additions & 108 deletions README.md
@@ -7,8 +7,8 @@
<br/><!--
![PyPI](https://img.shields.io/pypi/v/xformers)
![PyPI - License](https://img.shields.io/pypi/l/xformers)
-->
[![Documentation Status](https://github.com/facebookresearch/xformers/actions/workflows/gh-pages.yml/badge.svg)](https://github.com/facebookresearch/xformers/actions/workflows/gh-pages.yml/badge.svg)
-->
[![CircleCI](https://circleci.com/gh/facebookresearch/xformers.svg?style=shield)](https://app.circleci.com/pipelines/github/facebookresearch/xformers/)
[![Codecov](https://codecov.io/gh/facebookresearch/xformers/branch/main/graph/badge.svg?token=PKGKDR4JQM)](https://codecov.io/gh/facebookresearch/xformers)
[![black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
@@ -19,115 +19,52 @@
-->
--------------------------------------------------------------------------------

## Description

xFormers is a modular and field-agnostic library to flexibly generate transformer architectures from interoperable and optimized building blocks. These blocks are not limited to xFormers and can also be cherry-picked as the user sees fit.

## Getting started

The full [documentation](https://facebookresearch.github.io/xformers/) contains instructions for getting started, deep dives and tutorials about the various APIs.
If in doubt, please check out the [HOWTO](HOWTO.md). Only some general considerations are laid out in the README.
## xFormers - Toolbox to Accelerate Research on Transformers

For recent changes, you can have a look at the [changelog](CHANGELOG.md).
xFormers is:
- **Customizable building blocks**: Independent/customizable building blocks that can be used without boilerplate code. The components are domain-agnostic and xFormers is used by researchers in vision, NLP and more.
- **Research first**: xFormers contains bleeding-edge components that are not yet available in mainstream libraries like PyTorch.
- **Built with efficiency in mind**: Because speed of iteration matters, components are as fast and memory-efficient as possible. xFormers contains its own CUDA kernels, but dispatches to other libraries when relevant.

## Installing xFormers

### Installation

To install xFormers, it is recommended to use a dedicated virtual environment, as is often the case with Python, for instance through `python-virtualenv` or `conda`.
PyTorch must be installed. Using conda, for example:
* **(RECOMMENDED) Using binaries**: We provide binaries for Linux and recent PyTorch versions. Install xFormers with conda:

```bash
conda create --name xformers python=3.10
conda activate xformers
conda install -c pytorch -c conda-forge cudatoolkit=11.6 pytorch=1.12.1
conda install xformers -c xformers/label/dev
```

*Please note that PyTorch 1.12 or newer is required.*

There are two ways you can install xFormers locally:

<details><summary> Conda dev packages </summary><p>

There are regular builds of xformers as it is developed on the `main` branch.
To use these, you must be on Linux and have a conda environment with Python 3.9 or 3.10, CUDA 11.3 or 11.6, and PyTorch 1.12.1.
You can install the latest with

```bash
conda install xformers -c xformers/label/dev
```

</p></details>

<details><summary> Build from source (dev mode) </summary><p>

These commands will fetch the latest version of the code and then install xFormers from source.
If you want to build the sparse attention CUDA kernels, please make sure that the next point is covered prior to running these instructions.

```bash
git clone [email protected]:facebookresearch/xformers.git
git submodule update --init --recursive
conda create --name xformer_env python=3.8
conda activate xformer_env
cd xformers
pip install -r requirements.txt
pip install -e .
# or, for OSX
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ pip install -e .
```

</p></details>

### Installing custom (non-pytorch) parts
* **From source**: Alternatively, if no binaries are available (for instance for Windows), you can also install from source:

<details><summary> Sparse attention kernels </summary><p>

Installing the CUDA-based sparse attention kernels may require extra care, as this mobilizes the CUDA toolchain. As a reminder, these kernels are built when you run `pip install -e .` and the CUDA build chain (the NVCC compiler) is available. Rebuilding can be done, for instance, via `python3 setup.py clean && python3 setup.py develop`, or similarly by wiping the `build` folder and re-running `pip install -e .`.

Some advice related to building these CUDA-specific components, tentatively addressing common pitfalls. Please make sure that:

* NVCC and the current CUDA runtime match. Depending on your setup, you may be able to change the CUDA runtime with `module unload cuda; module load cuda/xx.x`, possibly also `nvcc`
* the version of GCC that you're using matches the current NVCC capabilities
* the `TORCH_CUDA_ARCH_LIST` env variable is set to the architectures that you want to support. A suggested setup (slow to build but comprehensive) is `export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;8.0;8.6"`

</p></details>

<details><summary> Triton </summary><p>

Some parts of xFormers use [Triton](http://www.triton-lang.org), and will only be exposed if Triton is installed and a compatible GPU is present (an NVIDIA GPU with tensor cores). If Triton was not installed as part of the testing procedure, you can install it directly by running `pip install triton`. You can optionally test that the installation is successful by running one of the Triton-related benchmarks, for instance `python3 xformers/benchmarks/benchmark_triton_softmax.py`.

Triton will cache the compiled kernels to `/tmp/triton` by default. If this becomes an issue, this path can be specified through the `TRITON_CACHE_DIR` environment variable.
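For example, to point the cache at a persistent location (the path below is only an example) before re-running a Triton benchmark:

```bash
# Keep Triton's compiled-kernel cache outside /tmp (example path)
export TRITON_CACHE_DIR="$HOME/.cache/triton"
python3 xformers/benchmarks/benchmark_triton_softmax.py
```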

</p></details>
```bash
# (Optional) Makes the build much faster
pip install ninja
# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
# (this can take dozens of minutes)
```

<details><summary> AOTAutograd/NVFuser </summary><p>
* **pip wheels**: There is no updated package available on pip; please install from conda or from source

Some parts of xFormers use AOT Autograd from the [FuncTorch](https://pytorch.org/functorch/stable/) library, and will only be exposed if FuncTorch is installed and a compatible GPU is present. If FuncTorch was not installed as part of the testing procedure, you can install it directly through pip.

```bash
pip install functorch
```
## Results

Once installed, set the flag `_is_functorch_available = True` in `xformers/__init__.py`. You can optionally test that the installation is successful by running one of the functorch-related benchmarks, for instance `python3 xformers/benchmarks/benchmark_nvfuser.py`.
**Memory-efficient MHA**
![Benchmarks for ViTS](./docs/plots/mha/mha_vit.png)
*Setup: A100 on f16, measured total time for a forward+backward pass*

If you are importing the xFormers library in a script, you can modify the flag as such:
Note that this is exact attention, not an approximation, obtained just by calling [`xformers.ops.memory_efficient_attention`](https://facebookresearch.github.io/xformers/components/ops.html#xformers.ops.memory_efficient_attention).
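A minimal usage sketch, assuming a CUDA GPU, half-precision inputs and the `(batch, seq_len, num_heads, head_dim)` layout; the shapes below are purely illustrative:

```python
import torch
import xformers.ops as xops

# Illustrative shapes: batch=2, seq_len=1024, heads=8, head_dim=64
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.half)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.half)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.half)

# Exact attention, computed without materializing the full attention matrix
out = xops.memory_efficient_attention(q, k, v)  # same shape as q
```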

```python
import xformers
xformers._is_functorch_available = True
```
**More benchmarks**

</p></details>
xFormers provides many components, and more benchmarks are available in [BENCHMARKS.md](BENCHMARKS.md).

### Testing the installation
### (Optional) Testing the installation

This will run a benchmark of the attention mechanisms exposed by xFormers, and generate a runtime and memory plot.
If this concludes without errors, the installation is successful. This step is optional, and requires some extra dependencies: `pip install -r requirements-benchmark.txt`.

Once this is done, you can run this particular benchmark as follows:
This command will provide information on an xFormers installation, and what kernels are built/available:

```python
python3 xformers/benchmarks/benchmark_encoder.py --activations relu --plot -emb 256 -bs 32 -heads 16
python -m xformers.info
```

## Using xFormers
@@ -147,6 +84,8 @@ Models are thus not implemented in monolithic files, which are typically complic
### Repo map

```bash
├── ops # Functional operators
└ ...
├── components # Parts zoo, any of which can be used directly
│ ├── attention
│ │ └ ... # all the supported attentions
@@ -156,11 +95,7 @@ Models are thus not implemented in monolithic files, which are typically complic
│ │ └ ... # all the supported positional embeddings
│ ├── activations.py #
│ └── multi_head_dispatch.py # (optional) multihead wrap
├── factory # Build model programmatically
│ ├── block_factory.py # (optional) helper to programmatically generate layers
│ └── model_factory.py # (optional) helper to programmatically generate models
|
├── benchmarks
│ └ ... # A lot of benchmarks that you can use to test some parts
└── triton
@@ -258,26 +193,33 @@ Patrick et al., 2021](https://arxiv.org/abs/2106.05392)*

1. Many attention mechanisms, interchangeable
2. Optimized building blocks, beyond PyTorch primitives
1. sparse attention
2. block-sparse attention
3. fused softmax
4. fused linear layer
5. fused layer norm
6. fused dropout(activation(x+bias))
1. Memory-efficient exact attention - up to 10x faster
2. sparse attention
3. block-sparse attention
4. fused softmax
5. fused linear layer
6. fused layer norm
7. fused dropout(activation(x+bias))
8. fused SwiGLU
3. Benchmarking and testing tools
1. [micro benchmarks](BENCHMARKS.md)
2. transformer block benchmark
3. [LRA](xformers/benchmarks/LRA/README.md), with SLURM suppot
3. [LRA](xformers/benchmarks/LRA/README.md), with SLURM support
4. Programmatic and sweep-friendly layer and model construction
1. Compatible with hierarchical Transformers, like Swin or Metaformer
5. Hackable
1. Not using monolithic CUDA kernels, composable building blocks
2. Using [Triton](https://triton-lang.org/) for some optimized parts, explicit, pythonic and user-accessible
3. Native support for SquaredReLU (on top of ReLU, LeakyReLU, GeLU, ...), extensible activations

### FAQ ?
### Install troubleshooting


* NVCC and the current CUDA runtime match. Depending on your setup, you may be able to change the CUDA runtime with `module unload cuda; module load cuda/xx.x`, possibly also `nvcc`
* the version of GCC that you're using matches the current NVCC capabilities
* the `TORCH_CUDA_ARCH_LIST` env variable is set to the architectures that you want to support. A suggested setup (slow to build but comprehensive) is `export TORCH_CUDA_ARCH_LIST="6.0;6.1;6.2;7.0;7.2;7.5;8.0;8.6"`
* If the build from source runs out of memory, it's possible to reduce the parallelism of ninja with `MAX_JOBS` (e.g. `MAX_JOBS=2`), as in the sketch below
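A consolidated sketch of such a source build, with example values only (adjust the architecture list and job count to your setup):

```bash
# Example: build xFormers from source for a couple of architectures only,
# with ninja parallelism capped to limit memory usage during compilation
export TORCH_CUDA_ARCH_LIST="8.0;8.6"
MAX_JOBS=2 pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
```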

We've tried to collect a relatively exhaustive list of explanations in the [HOWTO](HOWTO.md).

### License

@@ -288,11 +230,11 @@ xFormers has a BSD-style license, as found in the [LICENSE](LICENSE) file.
If you use xFormers in your publication, please cite it by using the following BibTeX entry.

``` bibtex
@Misc{xFormers2021,
author = {Benjamin Lefaudeux and Francisco Massa and Diana Liskovich and Wenhan Xiong and Vittorio Caggiano and Sean Naren and Min Xu and Jieru Hu and Marta Tintore and Susan Zhang},
@Misc{xFormers2022,
author = {Benjamin Lefaudeux and Francisco Massa and Diana Liskovich and Wenhan Xiong and Vittorio Caggiano and Sean Naren and Min Xu and Jieru Hu and Marta Tintore and Susan Zhang and Patrick Labatut and Daniel Haziza},
title = {xFormers: A modular and hackable Transformer modelling library},
howpublished = {\url{https://github.com/facebookresearch/xformers}},
year = {2021}
year = {2022}
}
```

@@ -308,3 +250,5 @@ The following repositories are used in xFormers, either in close to original for
* [Nystromformer](https://github.com/mlpen/Nystromformer)
* [FairScale](https://github.com/facebookresearch/fairscale/)
* [Pytorch Image Models](https://github.com/rwightman/pytorch-image-models)
* [CUTLASS](https://github.com/nvidia/cutlass)
* [Flash-Attention](https://github.com/HazyResearch/flash-attention)
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_gelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_leaky_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_none.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_smelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_BW_squared_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_gelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_leaky_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_none.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_smelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp16_FW_squared_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_BW_gelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_BW_leaky_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_BW_none.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_BW_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_BW_squared_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_gelu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_leaky_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_none.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_relu.png
Binary file modified docs/plots/fused_linear/FusedLinear_fp32_FW_squared_relu.png
Binary file modified docs/plots/fused_softmax/Softmax_Bandwidth_FW_BW_fp16.png
Binary file modified docs/plots/fused_softmax/Softmax_Bandwidth_FW_BW_fp32.png
Binary file modified docs/plots/fused_softmax/Softmax_Bandwidth_FW_fp16.png
Binary file modified docs/plots/fused_softmax/Softmax_Bandwidth_FW_fp32.png
Binary file modified docs/plots/layer_norm/LayerNorm_FW+BW_torch.float16.png
Binary file modified docs/plots/layer_norm/LayerNorm_FW+BW_torch.float32.png
Binary file modified docs/plots/layer_norm/LayerNorm_FW_torch.float16.png
Binary file modified docs/plots/layer_norm/LayerNorm_FW_torch.float32.png
Binary file removed docs/plots/mha/MHA_FW+bw_torch.float16.png
Binary file removed docs/plots/mha/MHA_FW_torch.float16.png
Binary file removed docs/plots/mha/MHA_FW_torch.float32.png
Binary file added docs/plots/mha/mha_vit.png
4 changes: 1 addition & 3 deletions docs/requirements.txt
@@ -1,8 +1,6 @@
recommonmark==0.5.0
docutils==0.17.1
sphinx==3.2.1
sphinx_rtd_theme==0.4.3
sphinxcontrib-programoutput==0.16
sphinx==5.0.0
git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
torch>=1.6.0
numpy>=1.19.5
1 change: 1 addition & 0 deletions docs/source/components/index.rst
Expand Up @@ -4,6 +4,7 @@ API Reference
.. toctree::
:maxdepth: 2

ops
attentions
feedforward
position_embedding
8 changes: 8 additions & 0 deletions docs/source/components/ops.rst
@@ -0,0 +1,8 @@
Operators
======================

.. automodule:: xformers.ops
:members:
:show-inheritance:
:imported-members:
:member-order: bysource
3 changes: 1 addition & 2 deletions requirements-test.txt
@@ -27,5 +27,4 @@ hydra-core >= 1.1
fairscale >= 0.4.5

# Dependency for fused layers, optional
triton == 2.0.0.dev20221105
networkx
triton==2.0.0.dev20221105
7 changes: 6 additions & 1 deletion tests/test_mem_eff_attention.py
@@ -202,6 +202,7 @@ def create_tensors(
k,
kv,
*,
requires_grad=False,
attn_bias_type=None,
fmt: str = "BMK",
):
@@ -231,6 +232,7 @@ def create_tensors(
dispatch = xformers.ops.AttentionOpDispatch.from_arguments(
query=query, key=key, value=value, attn_bias=attn_bias
)
dispatch.requires_grad = requires_grad
if not op.supports(dispatch):
# Ensure we free memory to avoid OOMs
del query, key, value, attn_bias
@@ -498,7 +500,10 @@ def test_backward(
kv,
) = op_device_dtype_B_Mq_Mkv_H_K_Kv
query, key, value, attn_bias = create_tensors(
*op_device_dtype_B_Mq_Mkv_H_K_Kv, attn_bias_type=attn_bias_type, fmt=fmt
*op_device_dtype_B_Mq_Mkv_H_K_Kv,
requires_grad=True,
attn_bias_type=attn_bias_type,
fmt=fmt,
)
qkv = None

47 changes: 47 additions & 0 deletions tests/test_triton_basics.py
@@ -131,3 +131,50 @@ def test_sum_strided_asserts():
with pytest.raises(AssertionError):
# This kernel expects 2D tensors, assert to prevent misuse
sum_2d_dim_0(a)

@triton.jit
def k_rand(X, Y, SEED_X, SEED_Y, stride_x, stride_y, N: tl.constexpr):
# fmt: on
"""
Check the random number generation
"""

row = tl.program_id(0)

# Generate random numbers with seed A
rand_offsets = tl.arange(0, N)
seed_x = tl.load(SEED_X + row)
randx, _, _, _ = tl.randint4x(seed_x, rand_offsets)

rand_offsets = tl.arange(0, N)
seed_y = tl.load(SEED_Y + row)
randy, _, _, _ = tl.randint4x(seed_y, rand_offsets)

# Move to this row
tl.store(X + row * stride_x + tl.arange(0, N), randx)
tl.store(Y + row * stride_y + tl.arange(0, N), randy)

def test_rand():
# Check that the random generator used in triton works fine
torch.random.manual_seed(0)
x = torch.zeros((512, 32), device=torch.device("cuda"), dtype=torch.int32)
y = torch.zeros((512, 32), device=torch.device("cuda"), dtype=torch.int32)

M, N = x.shape

seeds_x = torch.randint(65536, (M,), device=x.device)
seeds_y = torch.randint(65536, (M,), device=x.device)

assert not torch.allclose(seeds_x, seeds_y)

# enqueue kernels, one per line
# fmt: off
k_rand[(M,)](
x, y,
seeds_x, seeds_y,
x.stride(0), y.stride(0),
N,
)
# fmt: on

assert not torch.allclose(x, y)