
New GEMM kernels for weight-only quantization #2090

Merged · 225 commits · Aug 19, 2024
Conversation

@lzhangzz (Collaborator) commented on Jul 19, 2024

f16*u4g128 vs. cublasGemmEx f16*f16, both using HMMA with f32 accumulators, on 32 weight matrices from 8 models ranging from 7B to 72B:

[Figure: benchmark results across the 32 weight matrices]

  • sm90 features are not used yet
  • the sm70 tensor-core and fp16 paths share a pipeline
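For reference, here is a minimal sketch of the cublasGemmEx f16*f16 baseline with f32 accumulation that the new kernels are compared against. The shapes, layout, and the plain GEMM setup below are illustrative assumptions, not taken from the actual benchmark harness.

```cpp
// Minimal sketch of the f16*f16 cuBLAS baseline with f32 accumulation.
// Sizes are illustrative; error handling is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int m = 4096, n = 128, k = 4096;  // assumed shapes, not from the benchmark

    half *A, *B, *C;
    cudaMalloc(&A, sizeof(half) * m * k);
    cudaMalloc(&B, sizeof(half) * k * n);
    cudaMalloc(&C, sizeof(half) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // alpha/beta are f32 because the compute type is CUBLAS_COMPUTE_32F.
    float alpha = 1.f, beta = 0.f;

    // Column-major GEMM: C(m,n) = A(m,k) * B(k,n), f16 in/out, f32 accumulators.
    // On Volta and newer GPUs cuBLAS dispatches this to HMMA tensor-core kernels.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_16F, m,
                 CUBLAS_COMPUTE_32F,
                 CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```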

@lvhan028 (Collaborator) commented:

The main branch has an issue with OpenGVLab/InternVL2-40B-inner-4bits; it is unrelated to this PR. @zhulinJulia24

@zhyncs (Collaborator) commented on Aug 16, 2024

Verified LMDeploy fp16 and AWQ on H100 with @lzhangzz; both are blazing fast.

@zhyncs (Collaborator) commented on Aug 16, 2024

TODO:

  • Set the default max-batch-size to 256 when VRAM > 40 GB (a sketch follows this list)
  • Resolve the runtime issue on SM90 by disabling certain code sections
  • Consider supporting H100-specific features for quantized inference (can be implemented in follow-up PRs)
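A minimal sketch of how the first item could be decided at startup, using the device's total memory as the threshold. The helper name and the 128 fallback are hypothetical, not LMDeploy's actual API.

```cpp
// Hypothetical helper (not LMDeploy's actual API): choose the default
// max batch size from the device's total VRAM, per the TODO above.
#include <cuda_runtime.h>
#include <cstdio>

int default_max_batch_size(int device_id) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device_id);
    const double total_gib = static_cast<double>(prop.totalGlobalMem) / (1ull << 30);
    // 256 when the card has more than ~40 GB of VRAM (e.g. A100/H100 80 GB);
    // otherwise fall back to a smaller default (128 is an assumption here).
    return total_gib > 40.0 ? 256 : 128;
}

int main() {
    printf("default max_batch_size = %d\n", default_max_batch_size(0));
    return 0;
}
```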

@zhyncs (Collaborator) left a review comment:


Overall LGTM; we just need to resolve the minor H100 issues mentioned above.

Labels: enhancement (New feature or request)
5 participants