Fix llama conversion with smooth quant #1650
Closed
This PR fixes a few errors that appear when following the SmoothQuant section of the README (https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md#smoothquant) at the current latest commit.
Note: the first commit looks quite obvious (although I'm not sure how this could have worked before). I'm less sure about the second; I was just going by the error messages during engine conversion, and there might be a better place for the fix, so feel free to treat that part as a bug report instead. I verified that an engine built this way produces reasonable outputs with the expected performance. I tested on Mistral 7B (mistral-7b-v0.1-instruct), but I assume other Llama 2 and Llama 3 models should also work (I haven't gotten to Llama 3 yet).
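For context, this is roughly the flow from the README's SmoothQuant section that I was running when the errors appeared. The model and output directories are placeholders from my setup, so substitute your own:

```bash
# Convert the HF checkpoint with SmoothQuant enabled (alpha = 0.5),
# following examples/llama/README.md#smoothquant.
python3 convert_checkpoint.py \
    --model_dir ./mistral-7b-v0.1-instruct \
    --output_dir ./tllm_checkpoint_1gpu_sq \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel

# Build the engine from the quantized checkpoint.
trtllm-build \
    --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
    --output_dir ./engine_1gpu_sq \
    --gemm_plugin float16
```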