
Add feature knobs to FBGEMM #3290

Open
q10 wants to merge 7 commits into main from export-D65174905
Conversation

q10 (Contributor) commented Oct 29, 2024

Summary: Add feature knobs to FBGEMM

Differential Revision: D65174905

facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D65174905

netlify bot commented Oct 29, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b9e3dc1
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/672176b0bcfa4700089b5e5a
😎 Deploy Preview: https://deploy-preview-3290--pytorch-fbgemm-docs.netlify.app

MatzeB and others added 7 commits October 29, 2024 16:58
Summary:
X-link: facebookresearch/FBGEMM#219

- Use `new[]` and `std::unique_ptr<float[]>` instead of `std::vector`. This avoids an unnecessary indirection and means less work for the compiler, which no longer has to eliminate all the dynamic resizing code that isn't used here anyway.
- Also allocate a stack buffer of size 256 and use it instead of a heap allocation when block sizes are small enough.

I hope to add specialization mechanisms for common block_sizes later. When specialized for a fixed block_size, having a stack allocation is very helpful, as it allows the compiler to promote `buf` completely into vector registers (and skip all the related memory operations).
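As a rough illustration of the allocation strategy described above (the buffer size of 256 matches the summary, but the function and variable names here are made up, not the actual FBGEMM kernel):

```
#include <cstdint>
#include <memory>

// Illustrative sketch: use a fixed-size stack buffer for small blocks and a
// single std::unique_ptr<float[]> heap allocation otherwise, instead of a
// std::vector with its unused resizing machinery.
void accumulate_rows(const float* input, float* output, int64_t block_size) {
  constexpr int64_t kStackBufSize = 256;
  float stack_buf[kStackBufSize];
  std::unique_ptr<float[]> heap_buf;

  float* buf = stack_buf;
  if (block_size > kStackBufSize) {
    heap_buf.reset(new float[block_size]);
    buf = heap_buf.get();
  }

  for (int64_t i = 0; i < block_size; ++i) {
    buf[i] = input[i];   // accumulate into buf ...
    output[i] = buf[i];  // ... then write the result out
  }
}
```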

Differential Revision: D62591607
Summary:
X-link: facebookresearch/FBGEMM#221

Checking that `data_size >= 0` early outside the loops means that the compiler can use a single (unsigned) comparison for the `if (idx < 0 || idx >= data_size)` checks within the loops (instead of 2 signed comparisons).
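A hedged sketch of why the early check helps (illustrative function, not the actual kernel): once `data_size >= 0` is established, the two-sided signed check can be folded into one unsigned comparison.

```
#include <cstdint>

// Illustrative only: with data_size known to be non-negative, the check
// (idx < 0 || idx >= data_size) is equivalent to a single unsigned comparison,
// because negative idx values wrap to very large unsigned values.
bool all_indices_valid(const int64_t* indices, int64_t n, int64_t data_size) {
  if (data_size < 0) {
    return false;  // checked once, outside the loop
  }
  for (int64_t i = 0; i < n; ++i) {
    const int64_t idx = indices[i];
    if (static_cast<uint64_t>(idx) >= static_cast<uint64_t>(data_size)) {
      return false;
    }
  }
  return true;
}
```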

Differential Revision: D62591655
Summary:
X-link: facebookresearch/FBGEMM#222

Cleanup/rework `EmbeddingSpMDMAutovec.cc` code:

- Fix strict-aliasing violations: when reading from a `uint8_t *input` pointer, we have to use `memcpy` to extract float/float16 values for "bias" and "scale"; direct pointer casts violate strict-aliasing rules, which is undefined behavior and can lead to misoptimization or lost performance (compilers may be more conservative in alias analysis if they detect violations). The additional `memcpy`s are optimized away by any modern compiler, while still giving it enough semantic information about the potentially unaligned, aliasing memory addresses involved (see the sketch after this list).
- Drop most `#pragma omp` annotations. In most places the patterns are so obvious that the compiler vectorizes the code anyway when beneficial.
- Make sure loop iteration variables have the same type as their upper bounds. This way the compiler doesn't need to prove that a loop variable with a smaller bit width does not overflow.
- Factor out address calculation performed on `input` and consistently store addresses in `uint8_t*`. This deduplicates some common address arithmetic and, in the assembly I inspected, occasionally makes the compiler generate slightly better address operations.
- Harmonize handling of weights across the implementations.
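A minimal sketch of the `memcpy` idiom from the first bullet (the layout and names are illustrative, not the exact FBGEMM row format):

```
#include <cstdint>
#include <cstring>

// Illustrative only: extract a float "scale" and "bias" from a byte buffer.
// The memcpy calls are optimized away by modern compilers, but unlike a
// reinterpret_cast they are well-defined even for unaligned, aliasing
// addresses.
inline void read_scale_bias(const std::uint8_t* input, float& scale,
                            float& bias) {
  std::memcpy(&scale, input, sizeof(float));
  std::memcpy(&bias, input + sizeof(float), sizeof(float));
  // *reinterpret_cast<const float*>(input) would violate strict aliasing here.
}
```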

Differential Revision: D61048245
Summary:
Some refactoring and cleanup for the `GenerateEmbeddingXXX` functions.

- Drop unnecessary `else` for `ifs` that end in a return statement, like: `if (...) { return; } else if (...) { return; }`.
- Consistently surround asmjit usage with `#if CPUINFO_ARCH_X86 || CPUINFO_ARCH_X86_64`.
- Introduce `FBGEMM_AUTOVEC_AVAILABLE` and consistently surround `xxx_autovec` usage with it (see the sketch after this list).
- Avoid some cases that duplicated autovec/ref kernel calls.
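A hedged sketch of the guard pattern from the list above; only the macros `CPUINFO_ARCH_X86`, `CPUINFO_ARCH_X86_64`, and `FBGEMM_AUTOVEC_AVAILABLE` come from the summary, everything else (names, signature) is hypothetical:

```
#include <functional>

// Hypothetical kernel type; the real generators take many more parameters.
using KernelFn = std::function<void(const float* in, float* out, int n)>;

KernelFn generate_kernel_asmjit();   // placeholder declarations
KernelFn generate_kernel_autovec();
KernelFn generate_kernel_ref();

// Illustrative dispatch: asmjit only on x86, autovec only when compiled in,
// and the reference kernel as the final fallback, each path guarded once.
KernelFn generate_kernel() {
#if CPUINFO_ARCH_X86 || CPUINFO_ARCH_X86_64
  return generate_kernel_asmjit();
#elif defined(FBGEMM_AUTOVEC_AVAILABLE)
  return generate_kernel_autovec();
#else
  return generate_kernel_ref();
#endif
}
```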

Differential Revision: D62982082
Summary:
Change the API in `EmbeddingSpMDMAutovec.h` to expose `GenerateEmbeddingXXX_autovec` functions to mirror the style used in `FbgemmEmbedding.h`.

This is in preparation for allowing the autovec code to select among multiple specialized functions ahead of time.

Differential Revision: D62984078
Summary:
To have auto-vectorized code be competitive with asmjit, we need to specialize the generic code for some fixed parameters. We cannot specialize at runtime, so this introduces a framework to specialize for a given set of parameters at compile time and choose between the existing specializations at runtime.

The framework added here allows specifying lines like the following for a given function.
Each parameter can be set to `var` to leave it unspecialized, or to `fixed(C)` to create a specialized version with that parameter set to the constant value `C`. Example:

```
SPECIALIZE(
      /*BIT_RATE=*/fixed(2),
      /*BLOCK_SIZE=*/var,
      /*HAS_WEIGHT=*/fixed(true),
      /*NORMALIZE_BY_LENGTHS=*/var,
      /*PREFETCH=*/var,
      /*IS_WEIGHT_POSITIONAL=*/var,
      /*USE_OFFSETS=*/var,
      /*OUTPUT_STRIDE=*/fixed(int64_t{-1}),
      /*INPUT_STRIDE=*/fixed(int64_t{-1}),
      /*SCALE_BIAS_LAST=*/fixed(true),
      /*NO_BAG=*/fixed(false),
      /*IS_BF16_OUT=*/var,
      /*IS_BF16_IN=*/var)
```

This diff introduces some exemplary specializations for `GenerateEmbeddingSpMDMWithStrides_autovec` and `GenerateEmbeddingSpMDMNBitWithStrides_autovec`, specializing them for bit_rate 2, 4 and block sizes 32, 64, 128.

This framework should make it easy to tune for common use-cases in production by specializing the commonly used parameters, or to conserve code size by removing specializations.
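A hedged sketch of the general idea behind such a framework (plain C++ templates, not the actual SPECIALIZE macro machinery): parameters left as `var` stay runtime values, parameters that are `fixed` become compile-time constants, and a runtime switch picks among the pre-built specializations.

```
#include <cstdint>

// Illustrative only: BLOCK_SIZE == -1 plays the role of "var" (use the runtime
// value), any other value plays the role of "fixed(C)" (a compile-time
// constant the optimizer can exploit, e.g. for full unrolling).
template <int64_t BLOCK_SIZE>
void kernel(const float* in, float* out, int64_t runtime_block_size) {
  const int64_t bs = (BLOCK_SIZE == -1) ? runtime_block_size : BLOCK_SIZE;
  for (int64_t i = 0; i < bs; ++i) {
    out[i] = in[i] * 2.0f;
  }
}

// Runtime selection among the specializations built ahead of time.
void run(const float* in, float* out, int64_t block_size) {
  switch (block_size) {
    case 32:  kernel<32>(in, out, block_size);  break;
    case 64:  kernel<64>(in, out, block_size);  break;
    case 128: kernel<128>(in, out, block_size); break;
    default:  kernel<-1>(in, out, block_size);  break;  // generic fallback
  }
}
```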

Differential Revision: D62984408
Summary:
X-link: facebookresearch/FBGEMM#389


Add feature knobs to FBGEMM

Differential Revision: D65174905
q10 force-pushed the export-D65174905 branch from 74af1b4 to b9e3dc1 on October 29, 2024 at 23:58
facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D65174905

facebook-github-bot (Contributor) commented:

Hi @q10!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient, and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
