
Add feature knobs to FBGEMM #3290

Open
q10 wants to merge 7 commits into main from export-D65174905
Conversation

q10 (Contributor) commented Oct 29, 2024

Summary: Add feature knobs to FBGEMM

Differential Revision: D65174905

facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D65174905

netlify bot commented Oct 29, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b9e3dc1
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/672176b0bcfa4700089b5e5a
😎 Deploy Preview: https://deploy-preview-3290--pytorch-fbgemm-docs.netlify.app

MatzeB and others added 7 commits October 29, 2024 16:58
Summary:
X-link: facebookresearch/FBGEMM#219

- Use `new[]` and `std::unique_ptr<float[]>` instead of `std::vector`. This avoids an unnecessary indirection and means less work for the compiler, which no longer has to eliminate all the dynamic resizing code that isn't used here anyway.
- Also allocate a stack buffer of size 256 and use it instead of a heap allocation when block sizes are small enough.

I hope to add specialization mechanisms for common block_sizes later. When specialized for a fixed block_size, having a stack allocation is very helpful, as it allows the compiler to promote `buf` completely into vector registers (and skip all the related memory operations).
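As a rough illustration of the allocation strategy described above (the buffer size of 256 matches the summary, but the function and variable names here are made up, not the actual FBGEMM kernel):

```
#include <cstdint>
#include <memory>

// Illustrative sketch: use a fixed-size stack buffer for small blocks and a
// single std::unique_ptr<float[]> heap allocation otherwise, instead of a
// std::vector with its unused resizing machinery.
void accumulate_rows(const float* input, float* output, int64_t block_size) {
  constexpr int64_t kStackBufSize = 256;
  float stack_buf[kStackBufSize];
  std::unique_ptr<float[]> heap_buf;

  float* buf = stack_buf;
  if (block_size > kStackBufSize) {
    heap_buf.reset(new float[block_size]);
    buf = heap_buf.get();
  }

  for (int64_t i = 0; i < block_size; ++i) {
    buf[i] = input[i];   // accumulate into buf ...
    output[i] = buf[i];  // ... then write the result out
  }
}
```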

Differential Revision: D62591607
Summary:
X-link: facebookresearch/FBGEMM#221

Checking that `data_size >= 0` early outside the loops means that the compiler can use a single (unsigned) comparison for the `if (idx < 0 || idx >= data_size)` checks within the loops (instead of 2 signed comparisons).
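A hedged sketch of why the early check helps (illustrative function, not the actual kernel): once `data_size >= 0` is established, the two-sided signed check can be folded into one unsigned comparison.

```
#include <cstdint>

// Illustrative only: with data_size known to be non-negative, the check
// (idx < 0 || idx >= data_size) is equivalent to a single unsigned comparison,
// because negative idx values wrap to very large unsigned values.
bool all_indices_valid(const int64_t* indices, int64_t n, int64_t data_size) {
  if (data_size < 0) {
    return false;  // checked once, outside the loop
  }
  for (int64_t i = 0; i < n; ++i) {
    const int64_t idx = indices[i];
    if (static_cast<uint64_t>(idx) >= static_cast<uint64_t>(data_size)) {
      return false;
    }
  }
  return true;
}
```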

Differential Revision: D62591655
Summary:
X-link: facebookresearch/FBGEMM#222

Cleanup/rework `EmbeddingSpMDMAutovec.cc` code:

- Fix strict-aliasing violations: when reading from a `uint8_t *input` pointer, we have to use `memcpy` to extract float/float16 values for "bias" and "scale"; direct pointer casts violate strict-aliasing rules, which is undefined behavior and can lead to misoptimization or lost performance (compilers may be more conservative in alias analysis if they detect violations). The additional `memcpy`s are optimized away by any modern compiler, while still giving it enough semantic information about the potentially unaligned, aliasing memory addresses involved (see the sketch after this list).
- Drop most `#pragma omp` annotations. In most places the patterns are so obvious that the compiler vectorizes the code anyway when beneficial.
- Make sure loop iteration variables have the same type as their upper bounds. This way the compiler doesn't need to prove that a loop variable with a smaller bit width does not overflow.
- Factor out address calculation performed on `input` and consistently store addresses in `uint8_t*`. This deduplicates some common address arithmetic and, in the assembly I inspected, occasionally makes the compiler generate slightly better address operations.
- Harmonize handling of weights across the implementations.
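A minimal sketch of the `memcpy` idiom from the first bullet (the layout and names are illustrative, not the exact FBGEMM row format):

```
#include <cstdint>
#include <cstring>

// Illustrative only: extract a float "scale" and "bias" from a byte buffer.
// The memcpy calls are optimized away by modern compilers, but unlike a
// reinterpret_cast they are well-defined even for unaligned, aliasing
// addresses.
inline void read_scale_bias(const std::uint8_t* input, float& scale,
                            float& bias) {
  std::memcpy(&scale, input, sizeof(float));
  std::memcpy(&bias, input + sizeof(float), sizeof(float));
  // *reinterpret_cast<const float*>(input) would violate strict aliasing here.
}
```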

Differential Revision: D61048245
Summary:
Some refactoring and cleanup for the `GenerateEmbeddingXXX` functions.

- Drop unnecessary `else` for `ifs` that end in a return statement, like: `if (...) { return; } else if (...) { return; }`.
- Consistently surround asmjit usage with `#if CPUINFO_ARCH_X86 || CPUINFO_ARCH_X86_64`.
- Introduce `FBGEMM_AUTOVEC_AVAILABLE` and consistently surround `xxx_autovec` usage with it (see the sketch after this list).
- Avoid some cases that duplicated autovec/ref kernel calls.
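A hedged sketch of the guard pattern from the list above; only the macros `CPUINFO_ARCH_X86`, `CPUINFO_ARCH_X86_64`, and `FBGEMM_AUTOVEC_AVAILABLE` come from the summary, everything else (names, signature) is hypothetical:

```
#include <functional>

// Hypothetical kernel type; the real generators take many more parameters.
using KernelFn = std::function<void(const float* in, float* out, int n)>;

KernelFn generate_kernel_asmjit();   // placeholder declarations
KernelFn generate_kernel_autovec();
KernelFn generate_kernel_ref();

// Illustrative dispatch: asmjit only on x86, autovec only when compiled in,
// and the reference kernel as the final fallback, each path guarded once.
KernelFn generate_kernel() {
#if CPUINFO_ARCH_X86 || CPUINFO_ARCH_X86_64
  return generate_kernel_asmjit();
#elif defined(FBGEMM_AUTOVEC_AVAILABLE)
  return generate_kernel_autovec();
#else
  return generate_kernel_ref();
#endif
}
```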

Differential Revision: D62982082
Summary:
Change the API in `EmbeddingSpMDMAutovec.h` to expose `GenerateEmbeddingXXX_autovec` functions to mirror the style used in `FbgemmEmbedding.h`.

This is in preparation for allowing the autovec code to select among multiple specialized functions ahead of time.

Differential Revision: D62984078
Summary:
To have auto-vectorized code be competitive with asmjit, we need to specialize the generic code for some fixed parameters. We cannot specialize at runtime, so this introduces a framework to specialize for a given set of parameters at compile time and choose between the existing specializations at runtime.

The framework added here allows specifying lines like the following for a given function.
Each parameter can be set to `var` to leave it unspecialized, or to `fixed(C)` to create a specialized version with that parameter set to the constant value `C`. Example:

```
SPECIALIZE(
      /*BIT_RATE=*/fixed(2),
      /*BLOCK_SIZE=*/var,
      /*HAS_WEIGHT=*/fixed(true),
      /*NORMALIZE_BY_LENGTHS=*/var,
      /*PREFETCH=*/var,
      /*IS_WEIGHT_POSITIONAL=*/var,
      /*USE_OFFSETS=*/var,
      /*OUTPUT_STRIDE=*/fixed(int64_t{-1}),
      /*INPUT_STRIDE=*/fixed(int64_t{-1}),
      /*SCALE_BIAS_LAST=*/fixed(true),
      /*NO_BAG=*/fixed(false),
      /*IS_BF16_OUT=*/var,
      /*IS_BF16_IN=*/var)
```

This diff introduces some exemplary specializations for `GenerateEmbeddingSpMDMWithStrides_autovec` and `GenerateEmbeddingSpMDMNBitWithStrides_autovec`, specializing them for bit_rate 2, 4 and block sizes 32, 64, 128.

This framework should make it easy to tune for common use-cases in production by specializing the commonly used parameters, or to conserve code size by removing specializations.
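A hedged sketch of the general idea behind such a framework (plain C++ templates, not the actual SPECIALIZE macro machinery): parameters left as `var` stay runtime values, parameters that are `fixed` become compile-time constants, and a runtime switch picks among the pre-built specializations.

```
#include <cstdint>

// Illustrative only: BLOCK_SIZE == -1 plays the role of "var" (use the runtime
// value), any other value plays the role of "fixed(C)" (a compile-time
// constant the optimizer can exploit, e.g. for full unrolling).
template <int64_t BLOCK_SIZE>
void kernel(const float* in, float* out, int64_t runtime_block_size) {
  const int64_t bs = (BLOCK_SIZE == -1) ? runtime_block_size : BLOCK_SIZE;
  for (int64_t i = 0; i < bs; ++i) {
    out[i] = in[i] * 2.0f;
  }
}

// Runtime selection among the specializations built ahead of time.
void run(const float* in, float* out, int64_t block_size) {
  switch (block_size) {
    case 32:  kernel<32>(in, out, block_size);  break;
    case 64:  kernel<64>(in, out, block_size);  break;
    case 128: kernel<128>(in, out, block_size); break;
    default:  kernel<-1>(in, out, block_size);  break;  // generic fallback
  }
}
```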

Differential Revision: D62984408
Summary:
X-link: facebookresearch/FBGEMM#389


Add feature knobs to FBGEMM

Differential Revision: D65174905
q10 force-pushed the export-D65174905 branch from 74af1b4 to b9e3dc1 on October 29, 2024 at 23:58
facebook-github-bot (Contributor) commented:

This pull request was exported from Phabricator. Differential Revision: D65174905

facebook-github-bot (Contributor) commented:

Hi @q10!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient, and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
