[float8] improve eager numerics for dynamic scales and gets on par with torch.compile #904
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/904. No failures as of commit 3d0da20 with merge base b149edb.
@@ -42,6 +42,9 @@ def amax_to_scale(
         float8_dtype: The float8 dtype.
         orig_dtype: The original dtype of the tensor.
     """
+    # Preserve precision in amax-to-scale conversion
+    # and ensure on-par numerics with torch.compile
+    amax = amax.to(torch.float64)
Upcast amax in amax_to_scale instead of tensor_to_amax for two reasons (a minimal placement sketch follows):
- we can still do the bfloat16 all-reduce for amax
- it is safer for delayed scaling, since it won't change the dtype of amax_buffer
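A minimal placement sketch of that reasoning, with illustrative names rather than the actual torchao API: the upcast lives in amax_to_scale, so amax itself stays in the tensor's dtype (e.g. bf16) for the all-reduce and for a delayed-scaling amax_buffer, and only the scale math runs in float64.

import torch
import torch.distributed as dist

def tensor_to_amax_sketch(x: torch.Tensor) -> torch.Tensor:
    amax = x.abs().max()  # stays in x.dtype, e.g. bf16
    if dist.is_available() and dist.is_initialized():
        # delayed scaling can still all-reduce amax in bf16
        dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    return amax

def amax_to_scale_sketch(amax: torch.Tensor, float8_dtype: torch.dtype = torch.float8_e4m3fn) -> torch.Tensor:
    amax = amax.to(torch.float64)  # upcast happens only here, after any communication
    scale = torch.finfo(float8_dtype).max / torch.clamp(amax, min=1e-12)
    return scale.to(torch.float32)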
could you share why the upcasting happens?
I can look into inductor more to see how it achieves fp64.
torch.compile actually upcasts to float32 with tl.load(in_ptr0 + (x0), None).to(tl.float32). Upcasting to float64 further helps because torch.compile and eager show different numerics for 1.0 / float32 (but the same numerics for float64).
The float32 numeric difference can be verified with:
import torch

def upcast_reciprocal(inp: torch.Tensor):
    return inp.reciprocal()

# a float32 value whose reciprocal rounds differently in eager vs compile
inp = torch.full([], 0.00817871093750000000, device="cuda", dtype=torch.float32)
eager_scale = upcast_reciprocal(inp)
compile_scale = torch.compile(upcast_reciprocal)(inp)
fp64_ground_truth = inp.to(torch.float64).reciprocal()
# fires: eager and compile produce different float32 reciprocals
assert torch.equal(eager_scale, compile_scale), f"{eager_scale=} vs {compile_scale=}, {fp64_ground_truth=}"
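As a small follow-up to the snippet above, and leaning on the claim that float64 agrees between eager and compile, the same comparison can be repeated in float64 (exact behavior may depend on the PyTorch/inductor version):

inp64 = inp.to(torch.float64)
eager64 = upcast_reciprocal(inp64)
compile64 = torch.compile(upcast_reciprocal)(inp64)
# per the comment above, eager and compile agree at float64
assert torch.equal(eager64, compile64), f"{eager64=} vs {compile64=}"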
test/float8/test_base.py (outdated)
float8_config,
gemm_input_role=GemmInputRole.WEIGHT,
)
assert torch.equal(float8_eager._scale, float8_compile._scale)
Without the PR, the numerics look like the following: eager _scale=106.5000 vs compile _scale=106.1925...
After the PR, eager is also 106.1925...
I am a bit surprised that we have to use fp64 in eager. Without the upcast to fp64, what is the dtype being used?
Eager respects the param dtype: if the param is bf16, amax is bf16 and the scale compute is bf16. The scale gets upcast to fp32 only at the very end.
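A minimal sketch of that pre-PR eager dtype flow (illustrative only, not the torchao code): every intermediate follows the param dtype, and only the final scale is cast to float32.

import torch

param = torch.randn(16, 16, dtype=torch.bfloat16)
amax = param.abs().max()                                  # bf16
scale_bf16 = torch.finfo(torch.float8_e4m3fn).max / amax  # scale math runs in bf16
scale = scale_bf16.to(torch.float32)                      # upcast only at the very end
print(amax.dtype, scale_bf16.dtype, scale.dtype)          # bfloat16 bfloat16 float32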
Converting to draft, as upcasting to torch.float32 in eager should be enough. Digging into inductor, as there might be nuances around how bfloat16 is handled.
@@ -163,7 +163,8 @@ def forward(
         DTensor Invariant: DTensor must always be the outer most tensor subclass
         """
-        tensor_scaled = tensor * scale
+        # scale is float32 thus upcasting tensor to match
+        tensor_scaled = tensor.to(torch.float32) * scale
Without upcasting, the eager numeric is like -157.00000000000000000000, while compile is like -157.06507873535156250000.
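A minimal sketch of the mechanism with made-up values (not the exact torchao code path; the -157.0 vs -157.065... numbers above came from a real tensor): near 157 the bf16 grid spacing is 1.0, so a multiply carried out at bf16 precision snaps the product to a whole number, while upcasting to float32 first keeps the fractional part.

import torch

t = torch.tensor(-1.2265625, dtype=torch.bfloat16)  # hypothetical element of `tensor`
scale = torch.tensor(127.94, dtype=torch.float32)   # hypothetical fp32 scale

bf16_product = t * scale.to(torch.bfloat16)   # multiply at bf16 precision -> -157.0
fp32_product = t.to(torch.float32) * scale    # upcast first, as in this PR -> about -156.93
print(bf16_product.item(), fp32_product.item())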
torch.compile upcasts the tensor ahead of time; see tmp0 = tl.load(in_ptr0 + (x0), None).to(tl.float32) in the following output code:
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 24576
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = tl.full([XBLOCK], True, tl.int1)
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), None).to(tl.float32)
test/float8/test_base.py (outdated)
torch.float16,
],
)
def test_dynamic_scale_parity(self, dtype: torch.dtype):
nit: move to test_compile.py since this is testing compile vs eager?
torchao/float8/float8_tensor.py (outdated)
@@ -163,7 +163,8 @@ def forward(
         DTensor Invariant: DTensor must always be the outer most tensor subclass
         """
-        tensor_scaled = tensor * scale
+        # scale is float32 thus upcasting tensor to match
can we make this comment contain the context? something like
# Note: when the line below is compiled with `torch.compile`, `tensor` is automatically upcasted to `float32` to multiply with the scale
# In order to match numerics between eager and compile, we upcast manually here.
torchao/float8/float8_utils.py (outdated)
@@ -42,6 +42,8 @@ def amax_to_scale(
         float8_dtype: The float8 dtype.
         orig_dtype: The original dtype of the tensor.
     """
+    # _scaled_mm requires float32 scale
nit: can we describe in more detail why we are upcasting here
torchao/float8/fsdp_utils.py (outdated)
@@ -59,17 +59,17 @@ def precompute_float8_dynamic_scale_for_fsdp(module: nn.Module) -> None:
         return

     # inf-norm is equivalent to max(abs(w))
-    max_weights = torch._foreach_norm(weights, ord=math.inf)  # Partial
+    max_weights = torch._foreach_norm(weights, ord=math.inf, dtype=torch.float64)  # Partial
add comment to describe upcasting
improved comment
thanks for finding this, it would be great to have numerics match! At a high level, looks good. Could we add two more things,
Thanks for the feedback. Will address them and publish for review.
Differential Revision: D63048850 Pull Request resolved: pytorch#912
Differential Revision: D62394341 Pull Request resolved: pytorch#897
* Add compile tests to test suite. Summary: This is a follow-up PR addressing pytorch#839 (comment). We can add more compiler-related tests in the future. Next: refactor a bit to use the quantize_ API directly; use the test suite in existing API tests. Test Plan: python torchao/testing/utils.py
* rename
* add result check
Differential Revision: D62711903 Pull Request resolved: pytorch#948
* [float8] all-reduce amax on dp mesh instead of global pg
* liner
* improve comments
* move hp tensor inside if
* linter (×5)
This PR adds int8 dynamic quant + bsr support. Changes:
* Use i8i8 -> bf16 matmul to maintain accuracy
* Added a block sparse layout type to AffineQuantizedTensor + check/impl
* Cleaned up benchmark.py script and added a single-line `benchmark.sh` file for acceleration numbers
* Updated eval.py and added a single-line `evaluate.sh` file for accuracy numbers
* Lots of lint formatting and README updates
* torch.compile now working and is correct
Summary: download and convert scripts needed to be updated alongside model.py config files. Test Plan: python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Meta-Llama-3.1-70B/model.pth
Differential Revision: D62711909 Pull Request resolved: pytorch#953
Verified perf is neutral and made sure all CI passes. For perf, I ran
Looking into the output code, the difference is mainly
Addressed feedback and this is ready for review.
torchao/float8/fsdp_utils.py (outdated)
# keep consistent with float8_utils.amax_to_scale
# torch.compile and eager show different numerics for 1.0 / float32,
# upcast to float64 to ensure same numeric between compile and eager
max_weights = torch._foreach_norm(weights, ord=math.inf, dtype=torch.float64)  # Partial
dtype=torch.float64 only changes the accumulation dtype? If there is no noticeable cost to this, I wonder if we should be doing this in more places 🤔
Good question! Actually I just updated the code to do _foreach_norm in the original precision and do the float64 upcasting before calculating scales. That ensures a consistent implementation between precompute and float8_utils.amax_to_scale.
Back to your question: I checked ForeachReduceOp.cu and it's dispatching to lpnorm_cleanup<scalar_t, NormType::LInf, out_t>. Not sure what's inside lpnorm_cleanup, but inf-norm is just max(abs), so I'm not sure they accumulate.
https://github.com/pytorch/pytorch/blob/a28b40fa74470058ca57d77652b9601bece2f4d5/aten/src/ATen/native/cuda/ForeachReduceOp.cu#L534-L535C19
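As a quick sanity check of the "inf-norm is just max(abs)" point, a standalone sketch (not the torchao code; whether dtype= only affects the accumulation is the open question above):

import math
import torch

# for ord=inf, _foreach_norm reduces each tensor to max(abs(w)),
# so there is little accumulation to speak of
weights = [torch.randn(8, 8) for _ in range(3)]
norms = torch._foreach_norm(weights, ord=math.inf)
manual = [w.abs().max() for w in weights]
assert all(torch.equal(a, b) for a, b in zip(norms, manual))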
torch.manual_seed(42)
hp_tensor1 = torch.randn(16, 16, device="cuda", dtype=dtype)
hp_tensor2 = hp_tensor1.detach().clone()
float8_config = Float8LinearConfig(
This should be an object of type LinearMMConfig; I'm actually kind of surprised passing in Float8LinearConfig works :(
Good catch! I switched to LinearMMConfig. Float8LinearConfig was working because I did not call a matmul that requires access to self._linear_mm_config.
torchao/float8/fsdp_utils.py (outdated)
max_weights = torch._foreach_norm(weights, ord=math.inf)  # Partial
amax_tensor = torch.stack(max_weights)  # Partial
# clamp is dispatched through DTensor
# it will issue a single all-reduce
amax_tensor = torch.clamp(amax_tensor, EPS)  # Replicate
scale_tensor = torch.finfo(torch.float8_e4m3fn).max / amax_tensor  # Replicate
scale_tensor = torch.finfo(torch.float8_e4m3fn).max / amax_tensor.to(torch.float64)  # Replicate
Optional nit: upcast separately to make it easier to read:
amax_tensor = amax_tensor.to(torch.float64)
scale_tensor = ... / amax_tensor
updated as suggested
[float8] improve eager numerics for dynamic scales and gets on par with torch.compile (pytorch#904). Commit history:
* [float8] improve eager numerics for dynamic scales
* leave torch.linalg.vector_norm for another PR
* cuda
* remove _data and investigate
* remove _data comment
* upcast to float32 is enough
* explain why float32
* _data parity
* handle sm8.9
* fix transformer unit test
* print if error
* Add tutorial for trainable tensor subclass (pytorch#908): an example of implementing a trainable tensor subclass that wraps quantized data, extending the existing `MyDTypeTensor` with a differentiable constructor, a backward pass for ops of interest (e.g. torch.nn.functional.linear), and handling of optimizer ops (e.g. aten.add, aten.add_). Test Plan: python tutorials/developer_api_guide/my_trainable_tensor_subclass.py
* Introducing 1-bit quantization for Llama in torchchat (pytorch#910). Differential Revision: D63052325. Pull Request resolved: pytorch#911
* Rename Floating point to fp8 (pytorch#909)
* [float8] fix typo in bitwise_identical unit test (pytorch#918)
* Adding example for quantized tensor + tensor parallelism (pytorch#785): shows how a quantized tensor subclass can work with DTensor (https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md), aiming for on-par performance with a custom tensor-parallel llama2 implementation via DTensor + AffineQuantizedTensor + torch.compile. Test Plan: torchrun --standalone --nnodes=1 --nproc-per-node=4 tutorials/developer_api_guide/tensor_parallel.py. Co-authored-by: Ke Wen
* rename cuda mode -> gpu mode (pytorch#925)
* Add workaround to recover the perf for quantized vit in torch.compile (pytorch#926): temporary fix for a perf drop in quantized vit caused by pytorch#898 (comment); ideally the tensor subclass check failure in torch.compile (https://github.com/pytorch/pytorch/blob/e4d294221b140fdbb49a64f297bc60c9fcc2f80e/torch/nn/modules/activation.py#L1286) should be fixed instead. Test Plan: python tutorials/quantize_vit/run_vit_b_quant.py
* clean up device checks in float8 unit test files (pytorch#923): some CUDA device capability checks in the test files did not make sense; CI should now skip fewer tests since it only has CUDA capability 8, 9
* [low-bit optim] Change 8-bit and FP8 optim block size from 2048 to 256 to match new bnb v0.44 (pytorch#927)
* Float8 autoquant weight only (pytorch#866)
* Fix failing FP6 benchmark (pytorch#931)
* Remove two if statements in fp8 padding (pytorch#935). Reviewed By: vkuzo. Differential Revision: D63051205
* [Distributed] Improve sharding example (pytorch#937)
* Add composable QAT quantizer (pytorch#938): a utility for users who wish to apply multiple QAT quantizers to their models; an embedding QAT quantizer that composes with the existing linear QAT quantizers is expected next. Test Plan: python test/quantization/test_qat.py -k test_composable_qat_quantizer
* resolve conflict with latest main. Differential Revision: D63048850. Pull Request resolved: pytorch#912
* Add torchchat quantizer. Differential Revision: D62394341. Pull Request resolved: pytorch#897
* Add compile tests to test suite (pytorch#906): follow-up to pytorch#839 (comment); refactors to use the quantize_ API directly and reuses the test suite in existing API tests. Test Plan: python torchao/testing/utils.py
* Fix up CMakeLists and reorganize some code locations. Differential Revision: D62711903. Pull Request resolved: pytorch#948
* [float8] all-reduce amax on dp mesh instead of global pg (pytorch#933)
* int8 dynamic quant + bsr support (pytorch#821): use i8i8 -> bf16 matmul to maintain accuracy, add a block sparse layout type to AffineQuantizedTensor, clean up benchmark.py/eval.py with single-line `benchmark.sh` and `evaluate.sh` scripts, lint/README updates, torch.compile now working and correct
* fixing some issues with our support for 70/405B models (pytorch#941): download and convert scripts updated alongside model.py config files. Test Plan: python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Meta-Llama-3.1-70B/model.pth
* Update INT8 mixed-precision training test to be less flaky (pytorch#950)
* Add executorch parallel. Differential Revision: D62711909. Pull Request resolved: pytorch#953
* test CI
* better comment on why upcasting
* control seed
* move unit test to test_compile
* fix typo
* float64 upcasting after allreduce
* use LinearMMConfig

Co-authored-by: andrewor14, Vaishnavi Gupta, Apurva Jain, Jerry Zhang, Ke Wen, Mark Saroufim, Vasiliy Kuznetsov, Thien Tran, Tobias van der Werff, Shuqi Yang, Scott Roy, Jesse Cai, HDCharles
Turning on/off float8 all-gather shows different numerics: #873

Root cause: hp_tensor_to_float8_dynamic has different numerics in eager vs compile. torch.compile upcasts to float32 (tmp0 = tl.load(in_ptr0 + (x0), None).to(tl.float32) in the generated triton kernels), while eager computes in the original dtype. This PR does float64 upcasting at the beginning of the scale calculation.

Unit test:
TORCH_LOGS="output_code" pytest -s test/float8/test_base.py -k test_dynamic_scale_parity
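Putting the pieces together, a minimal sketch of the fixed eager dynamic-scaling path described above (illustrative only, not the exact torchao functions; the eps value is assumed):

import torch

def cast_to_float8_dynamic_sketch(x: torch.Tensor, float8_dtype: torch.dtype = torch.float8_e4m3fn):
    # float64 upcasting at the beginning of the scale calculation
    amax = x.abs().max().to(torch.float64)
    scale = (torch.finfo(float8_dtype).max / torch.clamp(amax, min=1e-12)).to(torch.float32)
    # upcast the tensor to float32 before scaling, matching what torch.compile generates
    x_scaled = x.to(torch.float32) * scale
    return x_scaled.to(float8_dtype), scale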