Lowbit #1070
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1070
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@Jack-Khuu adding you here to gauge your thoughts on integrating the new torchao kernels into torchchat's quantization API (also added you on the torchao diff D62394341). Some cleanup is still needed. Overall, ARM CPU performance of the new kernels on llama3.1-8B is competitive across surfaces.
In addition, ExecuTorch export works (P1577157356), but running the exported *.pte file hits missing-op errors because we haven't written the ExecuTorch wrappers for the kernels yet (P1577172744). One callout: model loading in torchchat's generate.py is slow. The AOTI C++ runner loads the *.so and runs the model instantly, and it feels much, much smoother than the Python version (which takes ~30 sec to load the model). We should fix this experience.
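For concreteness, here is a minimal sketch of driving the new quantizer directly from Python. The import path and the `quantize(model)` interface are assumptions (modeled on Int8DynActInt4WeightQuantizer) until D62394341 lands; the kwargs mirror the torchchat diff below.

```python
import torch
import torch.nn as nn

# Hypothetical import path; an assumption until D62394341 lands.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

model = nn.Sequential(nn.Linear(256, 256))  # stand-in for the real transformer

quantizer = Int8DynActLowbitWeightQuantizer(
    device="cpu",                 # the new kernels target ARM CPU
    precision=torch.float32,
    bitwidth=4,                   # low-bit weight width
    groupsize=256,
    has_weight_zeros=False,
    squeeze_unsqueeze_dim0=True,  # as in the torchchat diff below
)
# Assumed to follow the same Quantizer interface as Int8DynActInt4WeightQuantizer.
model = quantizer.quantize(model)
```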
Thanks for the PR and tag. The numbers are looking great, and the API you're showing is in line with how we've been leveraging AO. Like you mentioned in the diff, it's worth connecting with the AO folks about the tensor subclass story.
Let me know if you need help with refactoring or integrating this into torchchat |
I'll add you as a reviewer on the integration PRs. Hopefully they will be ready soon. |
@@ -0,0 +1 @@
85d03de43160328eaf350e7ec3877d3d7b57da50
TODO: update to commit hash after D62394341 lands.
echo $pwd

cp -R /Users/scroy/fbsource/fbcode/pytorch/ao .
# git clone https://github.com/pytorch/ao.git
TODO: uncomment after D62394341 lands and commit hash is updated.
@@ -395,6 +395,13 @@ def decode_n_tokens(
)
input_pos += 1
break
if _i == 1: |
Remove this before final landing. It was added to get a more accurate tokens/sec measurement during development, especially for torch.compile.
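For context, the measurement idea is: start the tokens/sec clock only after the first decoded token, so torch.compile's one-time compilation cost doesn't skew the steady-state number. A minimal sketch (hypothetical helper; `decode_one_token` stands in for torchchat's per-token decode step):

```python
import time

def steady_state_tokens_per_sec(decode_one_token, num_tokens: int) -> float:
    # Hypothetical helper, not torchchat's actual code. The first
    # iteration triggers torch.compile's one-time compilation, so the
    # clock starts at i == 1 and only post-warmup tokens are counted.
    assert num_tokens >= 2
    t0 = 0.0
    for i in range(num_tokens):
        if i == 1:
            t0 = time.perf_counter()  # skip the warmup/compile iteration
        decode_one_token()
    elapsed = time.perf_counter() - t0
    return (num_tokens - 1) / elapsed  # tokens decoded after warmup
```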
Overall I think your plan for getting the custom ops in and supporting the quants seems good to me
device=device,
precision=precision,
bitwidth=q_kwargs.get("bitwidth", 4),
groupsize=q_kwargs.get("groupsize", 128),
has_weight_zeros=q_kwargs.get("has_weight_zeros", False),
squeeze_unsqueeze_dim0=True,
nit: We can probably curry Int8DynActLowbitWeightQuantizer to match the same args as the other AO quant classes
I modeled the args after Int8DynActInt4WeightQuantizer, and used "groupsize" instead of "group_size" to match that experience.
Int8DynActInt4WeightQuantizer doesn't have bitwidth or has_weight_zeros as concepts, though; happy to rename those to whatever you think is best.
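If we go the currying route, a sketch of what that could look like with functools.partial (names and import path are illustrative, not final):

```python
import torch
from functools import partial

# Assumed import path, as in the earlier sketch.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

# Pre-bind the kwargs Int8DynActInt4WeightQuantizer doesn't have, so the
# remaining surface (device, precision, groupsize) matches the other AO
# quant classes.
Int8DynAct4BitWeightQuantizer = partial(
    Int8DynActLowbitWeightQuantizer,
    bitwidth=4,
    has_weight_zeros=False,
    squeeze_unsqueeze_dim0=True,
)

quantizer = Int8DynAct4BitWeightQuantizer(
    device="cpu", precision=torch.float32, groupsize=128
)
```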
@@ -124,3 +124,54 @@ install_executorch_libs() {
  install_executorch_python_libs $1
}

clone_torchao() {
We have a pin for 0.5.0 that you might be able to piggyback off of. Can save you some effort.
This is cloning torchao from source, but the pin looks like it is from pip?
We can decide on it when your diff lands.
The big thing here is that AO doesn't have Mac nightlies, so I'm fine with us moving back to direct clones for Mac.
Summary: Pull Request resolved: pytorch#897

This diff adds a quantizer for the new torchao kernels, similar to the Int8DynActInt4WeightQuantizer quantizer in torchchat (imported from torchao.quantization.quant_api). See the draft torchchat PR (pytorch/torchchat#1070) for how this can integrate with torchchat's quantization API.

I confirmed that models quantized with this are compatible with eager, compile, AOTI, and export to ExecuTorch in torchchat. They do not run on ExecuTorch because we still have not written an ExecuTorch kernel wrapper.

jerryzh168: this does not use the new subclass API, and that is something I'd like to discuss further with you. I'll set up a sync with you this week, but I wanted to have some API on the table to ground the discussion. We do not currently have the required C++ methods implemented to support the new subclass API (e.g., we cannot unpack the packed weights from Python; they are instead unpacked inline in the kernel). From a torchchat user's perspective, I do not think this is important, but I'd like to discuss further.

Reviewed By: digantdesai

Differential Revision: D62394341