Support beam search & parallel generation #7
LGTM in general. Left some small comments.
    probs: torch.Tensor,
    p: torch.Tensor,
) -> torch.Tensor:
    # TODO(woosuk): Optimize.
Maybe it's faster to simply mask out the tokens whose cumulative probability is smaller than top_p (example code)?
Thanks for pointing me to the code. I don't think that implementation would be noticeably more efficient than ours, because it involves two softmax operations rather than one.
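For readers following this thread, here is a minimal sketch of the masking idea applied directly to probabilities, so that only one softmax is needed upstream. It is not the code from this PR or from the linked example: the function name `top_p_filter` is hypothetical, and it uses plain Python lists rather than `torch.Tensor` to keep the logic explicit. It assumes `probs` is already a valid distribution (non-negative, summing to roughly 1).

```python
from typing import List


def top_p_filter(probs: List[float], p: float) -> List[float]:
    """Keep the highest-probability tokens until their mass reaches p.

    Hypothetical helper for illustration only. Assumes `probs` is a valid
    probability distribution with at least one non-zero entry.
    """
    # Token indices sorted by descending probability.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    kept = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        # The mass accumulated *before* this token decides whether it is
        # kept, so the most likely token always survives.
        if cumulative >= p:
            break
        kept[i] = probs[i]
        cumulative += probs[i]

    # Renormalize the surviving probabilities; no second softmax is needed.
    total = sum(kept)
    return [x / total for x in kept]
```

With `probs = [0.5, 0.3, 0.15, 0.05]` and `p = 0.7`, the two most likely tokens are kept and the rest are zeroed before renormalization. A tensor version would presumably replace the loop with a sort and a cumulative sum, but that is not shown here.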
This PR adds support for beam search and parallel generation (i.e., n > 1).
NOTE: The correctness is only checked for beam search, but not for random sampling methods.
Tested models:
Tested GPUs: