Lowbit #1070
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1070
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@Jack-Khuu adding you here to gauge your thoughts on integrating the new torchao kernels into torchchat's quantization API (also added you on the torchao diff D62394341). Some cleanup is still needed. Overall, ARM CPU performance of the new kernels on llama3.1-8B is competitive across surfaces.
In addition, ExecuTorch export works (P1577157356), but running the exported *.pte file hits missing-op errors because we haven't written the ExecuTorch wrappers for the kernels yet (P1577172744). One callout: model loading in torchchat's generate.py is slow. The AOTI C++ runner loads the *.so and runs the model instantly, and it feels much, much smoother than the Python version (which takes ~30 sec to load the model). We should fix this experience.
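For concreteness, here is a minimal sketch of driving the new quantizer directly from Python. The import path and the `quantize(model)` interface are assumptions (modeled on Int8DynActInt4WeightQuantizer) until D62394341 lands; the kwargs mirror the torchchat diff below.

```python
import torch
import torch.nn as nn

# Hypothetical import path; an assumption until D62394341 lands.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

model = nn.Sequential(nn.Linear(256, 256))  # stand-in for the real transformer

quantizer = Int8DynActLowbitWeightQuantizer(
    device="cpu",                 # the new kernels target ARM CPU
    precision=torch.float32,
    bitwidth=4,                   # low-bit weight width
    groupsize=256,
    has_weight_zeros=False,
    squeeze_unsqueeze_dim0=True,  # as in the torchchat diff below
)
# Assumed to follow the same Quantizer interface as Int8DynActInt4WeightQuantizer.
model = quantizer.quantize(model)
```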
Thanks for the PR and tag. The numbers are looking great, and the API you're showing is in line with how we've been leveraging AO. Like you mentioned in the diff, it's worth connecting with the AO folks about the tensor subclass story.
Let me know if you need help with refactoring or integrating this into torchchat |
I'll add you as a reviewer on the integration PRs. Hopefully they will be ready soon. |
@@ -0,0 +1 @@
85d03de43160328eaf350e7ec3877d3d7b57da50
TODO: update to commit hash after D62394341 lands.
echo $pwd

cp -R /Users/scroy/fbsource/fbcode/pytorch/ao .
# git clone https://github.com/pytorch/ao.git
TODO: uncomment after D62394341 lands and commit hash is updated.
@@ -395,6 +395,13 @@ def decode_n_tokens(
)
input_pos += 1
break
if _i == 1: |
Remove this before final landing. It was added to get a more accurate tokens/sec measurement during development, especially for torch.compile.
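For context, the measurement idea is: start the tokens/sec clock only after the first decoded token, so torch.compile's one-time compilation cost doesn't skew the steady-state number. A minimal sketch (hypothetical helper; `decode_one_token` stands in for torchchat's per-token decode step):

```python
import time

def steady_state_tokens_per_sec(decode_one_token, num_tokens: int) -> float:
    # Hypothetical helper, not torchchat's actual code. The first
    # iteration triggers torch.compile's one-time compilation, so the
    # clock starts at i == 1 and only post-warmup tokens are counted.
    assert num_tokens >= 2
    t0 = 0.0
    for i in range(num_tokens):
        if i == 1:
            t0 = time.perf_counter()  # skip the warmup/compile iteration
        decode_one_token()
    elapsed = time.perf_counter() - t0
    return (num_tokens - 1) / elapsed  # tokens decoded after warmup
```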
Overall I think your plan for getting the custom ops in and supporting the quants seems good to me
device=device,
precision=precision,
bitwidth=q_kwargs.get("bitwidth", 4),
groupsize=q_kwargs.get("groupsize", 128),
has_weight_zeros=q_kwargs.get("has_weight_zeros", False),
squeeze_unsqueeze_dim0=True,
nit: We can probably curry Int8DynActLowbitWeightQuantizer to match the same args as the other AO quant classes
I modeled the args after Int8DynActInt4WeightQuantizer, and used "groupsize" instead of "group_size" to match that experience.
Int8DynActInt4WeightQuantizer doesn't have bitwidth or has_weight_zeros as concepts, though; happy to rename those to whatever you think is best.
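If we go the currying route, a sketch of what that could look like with functools.partial (names and import path are illustrative, not final):

```python
import torch
from functools import partial

# Assumed import path, as in the earlier sketch.
from torchao.experimental.quant_api import Int8DynActLowbitWeightQuantizer

# Pre-bind the kwargs Int8DynActInt4WeightQuantizer doesn't have, so the
# remaining surface (device, precision, groupsize) matches the other AO
# quant classes.
Int8DynAct4BitWeightQuantizer = partial(
    Int8DynActLowbitWeightQuantizer,
    bitwidth=4,
    has_weight_zeros=False,
    squeeze_unsqueeze_dim0=True,
)

quantizer = Int8DynAct4BitWeightQuantizer(
    device="cpu", precision=torch.float32, groupsize=128
)
```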
@@ -124,3 +124,54 @@ install_executorch_libs() {
  install_executorch_python_libs $1
}

clone_torchao() {
We have a pin for 0.5.0 that you might be able to piggyback off of. Can save you some effort.
This is cloning torchao from source, but the pin looks like it is from pip?
We can decide on it when your diff lands.
The big thing here is that AO doesn't have Mac nightlies, so I'm fine with us moving back to direct clones for Mac.
Summary: Pull Request resolved: pytorch#897

This diff adds a quantizer for the new torchao kernels, similar to the Int8DynActInt4WeightQuantizer quantizer in torchchat (imported from torchao.quantization.quant_api). See the draft torchchat PR (pytorch/torchchat#1070) for how this can integrate with torchchat's quantization API.

I confirmed that models quantized with this are compatible with eager, compile, AOTI, and export to ExecuTorch in torchchat. They do not run on ExecuTorch because we still have not written an ExecuTorch kernel wrapper.

jerryzh168: this does not use the new subclass API, and that is something I'd like to discuss further with you. I'll set up a sync with you this week, but I wanted to have some API on the table to ground the discussion. We do not currently have the required C++ methods implemented to support the new subclass API (e.g., we cannot unpack the packed weights from Python; they are instead unpacked inline in the kernel). From a torchchat user's perspective, I do not think this is important, but I'd like to discuss further.

Reviewed By: digantdesai

Differential Revision: D62394341