Lowbit #1070

Closed · metascroy wants to merge 8 commits

Conversation

@metascroy (Contributor, Author)

No description provided.


@pytorch-bot (bot) commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1070

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Aug 28, 2024
@metascroy (Contributor, Author)

@Jack-Khuu adding you here to gauge your thoughts on the integration of the new torchao kernels into torchchat's quantization API (I also added you on the torchao diff D62394341). Some cleanup is still needed.

Overall, ARM CPU performance of the new kernels on llama3.1-8B is competitive across surfaces:

  • 9.6 tokens/sec in torchchat generate.py eager (P1577139816)
  • 17.9 tokens/sec in torchchat generate.py compile (P1577141735)
  • 17.1 tokens/sec in torchchat generate.py AOTI (P1577169271)
  • 18.6 tokens/sec in torchchat AOTI C++ runner (P1577229019)

In addition, ExecuTorch export works (P1577157356), but running the ExecuTorch *.pte file leads to missing op errors because we haven't written the ExecuTorch wrappers for the kernels yet (P1577172744).

One callout I want to make is the slow model-loading performance in torchchat's generate.py. The AOTI C++ runner loads the *.so and runs the model instantly, and it feels much, much smoother than the Python version (which takes ~30 sec to load the model). We should fix this experience.
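For concreteness, here is a rough sketch of how the quantizer is driven on the torchchat side. The argument names come from this PR's diff; the import path and the quantize() method are assumptions made by analogy with Int8DynActInt4WeightQuantizer, not a landed API:

```python
import torch

# Sketch only: import path assumed for illustration; the quantizer itself
# is added in the torchao diff D62394341 referenced above.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

# Toy stand-in for a real model; any module tree containing nn.Linear works.
model = torch.nn.Sequential(torch.nn.Linear(4096, 4096))

quantizer = Int8DynActLowbitWeightQuantizer(
    device="cpu",
    precision=torch.float32,
    bitwidth=4,              # weight bit width used by the new lowbit kernels
    groupsize=128,           # quantization group size
    has_weight_zeros=False,  # symmetric weight quantization when False
)
# quantize() is assumed by analogy with Int8DynActInt4WeightQuantizer:
# it swaps eligible Linear modules for quantized equivalents in place.
model = quantizer.quantize(model)
```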

@Jack-Khuu (Contributor)

Thanks for the PR and tag. The numbers are looking great, and the API you're showing is in line with how we've been leveraging AO.

Like you mentioned in the Diff, it's worth connecting with the AO folks about the tensor subclass story.

@Jack-Khuu (Contributor)

Let me know if you need help with refactoring or integrating this into torchchat

@metascroy (Contributor, Author)

> Let me know if you need help with refactoring or integrating this into torchchat

I'll add you as a reviewer on the integration PRs. Hopefully they will be ready soon.

@@ -0,0 +1 @@
85d03de43160328eaf350e7ec3877d3d7b57da50
@metascroy (Contributor, Author)

TODO: update to commit hash after D62394341 lands.

pwd

cp -R /Users/scroy/fbsource/fbcode/pytorch/ao .
# git clone https://github.com/pytorch/ao.git
@metascroy (Contributor, Author)

TODO: uncomment after D62394341 lands and commit hash is updated.

@@ -395,6 +395,13 @@ def decode_n_tokens(
)
input_pos += 1
break
if _i == 1:
@metascroy (Contributor, Author)

Remove this before final landing. It was added to get a more accurate tokens/sec measurement during development, especially for torch.compile.
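For context, the measurement trick here amounts to starting the clock after the first decoded token, so torch.compile's one-time compilation cost doesn't distort the steady-state tokens/sec figure. A minimal, self-contained sketch of the idea (the decode_one_token callable is hypothetical, not torchchat's actual function):

```python
import time

def steady_state_tokens_per_sec(decode_one_token, num_tokens: int) -> float:
    # Decode the first token outside the timed region: with torch.compile,
    # this iteration includes compilation and would skew the average.
    decode_one_token()
    start = time.perf_counter()
    for _ in range(num_tokens - 1):
        decode_one_token()
    elapsed = time.perf_counter() - start
    return (num_tokens - 1) / elapsed  # steady-state tokens/sec
```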

@Jack-Khuu (Contributor) left a comment

Overall I think your plan for getting the custom ops in and supporting the quants seems good to me

Comment on lines +97 to +102
device=device,
precision=precision,
bitwidth=q_kwargs.get("bitwidth", 4),
groupsize=q_kwargs.get("groupsize", 128),
has_weight_zeros=q_kwargs.get("has_weight_zeros", False),
squeeze_unsqueeze_dim0=True,
@Jack-Khuu (Contributor)

nit: We can probably curry Int8DynActLowbitWeightQuantizer to match the same args as the other AO quant classes

@metascroy (Contributor, Author)

I modeled the args after Int8DynActInt4WeightQuantizer, and used "groupsize" instead of "group_size" to match the existing experience.

Int8DynActInt4WeightQuantizer doesn't have bitwidth or has_weight_zeros as concepts, though; I'm happy to rename those to whatever you think is best.
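If it helps, the currying could be as small as a functools.partial alias. A sketch, assuming Int8DynActLowbitWeightQuantizer keeps the signature shown in this diff (the alias name is made up, and the import path is assumed for illustration):

```python
from functools import partial

import torch
# Import path assumed for illustration; the quantizer is added in D62394341.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

# Pre-bind the lowbit-only arguments so the resulting callable exposes the
# same argument surface as the other AO quantizer classes.
Int8DynActLowbit4WeightQuantizer = partial(
    Int8DynActLowbitWeightQuantizer,
    bitwidth=4,
    has_weight_zeros=False,
    squeeze_unsqueeze_dim0=True,
)

# Callers now use the familiar device/precision/groupsize signature:
quantizer = Int8DynActLowbit4WeightQuantizer(
    device="cpu", precision=torch.float32, groupsize=128
)
```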

@@ -124,3 +124,54 @@ install_executorch_libs() {

install_executorch_python_libs $1
}

clone_torchao() {
@Jack-Khuu (Contributor)

We have a pin for 0.5.0 that you might be able to piggyback off of. It could save you some effort:

#1136

@metascroy (Contributor, Author)

This is cloning torchao from source, but the pin looks like it is from pip?

@Jack-Khuu (Contributor)

We can decide on it when your diff lands.

The big thing here is that AO doesn't have Mac nightlies, so I'm fine with us moving back to direct clones for Mac.
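For what it's worth, a source clone can still honor a commit pin. A minimal sketch of what clone_torchao() might look like (the pin-file path is hypothetical; the hash it holds would be the one checked in by this PR):

```bash
clone_torchao() {
  # Sketch only: clone torchao from source and check out a pinned commit so
  # Mac builds (which have no AO nightlies) stay reproducible.
  local pin
  pin=$(cat install/.pins/torchao-pin.txt)  # pin-file path is an assumption
  git clone https://github.com/pytorch/ao.git
  (cd ao && git checkout "${pin}")
}
```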

metascroy added a commit to metascroy/ao that referenced this pull request Sep 16, 2024
Summary:
This diff adds a quantizer for the new torchao kernels that is similar to the Int8DynActInt4WeightQuantizer quantizer in torchchat (imported from torchao.quantization.quant_api). See the draft torchchat PR (pytorch/torchchat#1070) for how this can integrate with torchchat's quantization API.

I confirmed that models quantized with this are compatible with eager, compile, AOTI, and export to ExecuTorch in torchchat.  They do not run on ExecuTorch because we still have not written an ExecuTorch kernel wrapper.

@jerryzh168: this does not use the new subclass API, and this is something I'd like to discuss further with you. I'll set up a sync with you this week, but I wanted to have some API on the table to ground the discussion.

We do not currently have the required C++ methods implemented to support the new subclass API (e.g., we cannot unpack the packed weights from python; they are instead unpacked inline in the kernel).  From a torchchat user's perspective, I do not think this is important, but I'd like to discuss further.

Differential Revision: D62394341
metascroy added further commits to metascroy/ao that referenced this pull request on Sep 24–25, 2024, with the same summary (Pull Request resolved: pytorch#897; Reviewed By: digantdesai; Differential Revision: D62394341).
@metascroy closed this Sep 27, 2024