[RFC] TensorRT Model Optimizer - Product Roadmap #108
Comments
ONNX quantization methods that don't require calibration data would be amazing.
Can you elaborate on trtllm-serve? Are there any integrations planned with the Triton TensorRT-LLM backend?
Would like to see 4-bit quantization and multi-GPU solutions for diffusion models (especially DiT models like SD3). Thank you for the awesome work in advance! A 4-bit solution for reference: https://github.com/mit-han-lab/nunchaku
If users just want to test the latency of the model, they can skip the calibration data and modelopt.onnx.quantization will use random data to calibrate the model, but the resulting accuracy will be unrealistic. Real calibration data is required for PTQ to reach acceptable accuracy.
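As a side note for readers, the workflow described in the comment above might look roughly like the sketch below. The `quantize` entry point and its keyword names follow the public `modelopt.onnx.quantization` API as we understand it; treat them as assumptions and check the ModelOpt documentation for the exact signature.

```python
# Hedged sketch: INT8 PTQ of an ONNX model with ModelOpt.
# Keyword names below are assumptions; consult the ModelOpt ONNX docs.
import numpy as np
from modelopt.onnx.quantization import quantize

# Real, preprocessed calibration inputs are needed for acceptable accuracy.
# "calibration_batch.npy" is a hypothetical file of such inputs.
calib = np.load("calibration_batch.npy")

quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data=calib,   # omit to fall back to random calibration (latency testing only)
    output_path="model.quant.onnx",
)
```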
Hi @wxsms, thanks for the question. We have multi-GPU DiT work ongoing with the PyTorch deployment. We might include that in the early access of in-framework deployment, depending on progress. For deployment with NVIDIA TensorRT, there is ongoing work for FLUX, but no timeline to share yet.
TensorRT-LLM is developing a new hosting feature called trtllm-serve.
TensorRT Model Optimizer - Product Roadmap
TensorRT Model Optimizer (ModelOpt)’s north star is to be the best-in-class model optimization toolkit, providing inference speedups with ease of use on NVIDIA platforms while preserving model accuracy.
In striving for this, our roadmap and development are guided by a consistent set of product strategies.
In the following, we outline our key investment areas and upcoming features. All of them are subject to change, and we’ll update this document regularly. Our goal in sharing the roadmap is to increase visibility into ModelOpt's direction and upcoming features. We welcome questions and feedback in this thread, and feature requests in GitHub Issues 😊.
1. FP4 inference on NVIDIA Blackwell
The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. At the general availability of Blackwell (early 2025), ModelOpt will provide FP4 recipes and quantization techniques.
In our internal research, ModelOpt has achieved near-lossless results with both PTQ and QAT for Nemotron-4-340B, and minimal accuracy loss for Llama 3.1-405B on Blackwell. Currently, we prioritize models such as the Llama family, Mistral, and FLUX in our FP4 work. All optimized recipes will be publicly available in this repo.
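For readers wanting a feel for what an FP4 PTQ recipe looks like in code, here is a minimal sketch using ModelOpt's PyTorch quantization API. The FP4 config name (`NVFP4_DEFAULT_CFG`) and the example model id are assumptions; substitute whatever config and model the release you use actually ships.

```python
# Minimal FP4 PTQ sketch with modelopt.torch.quantization (config name assumed).
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model, not prescriptive
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def forward_loop(m):
    # Run a small calibration set through the model so ModelOpt can collect
    # activation statistics; a real recipe would use a few hundred samples.
    inputs = tokenizer("Calibration sample for FP4 PTQ.", return_tensors="pt")
    m(**inputs)

# Quantize weights and activations to FP4 using the default NVFP4 recipe.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
```

QAT follows the same `mtq.quantize` call, with a subsequent short finetuning loop to recover accuracy lost to quantization.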
2. Model optimization techniques
2.1 Model compression algorithms
ModelOpt collaborates with NVIDIA and external research labs to continuously develop and integrate state-of-the-art compression techniques into our library for faster inference, with several recent focus areas in active development.
2.2 Fast decoding techniques for LLM and VLM
ModelOpt works with TensorRT-LLM and vLLM to streamline fast-decoding model finetuning (in Hugging Face, NVIDIA NeMo, and Megatron-LM) and deployment, both for endpoint serving and on edge devices; a hedged finetuning sketch follows below.
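As an illustration of that finetuning entry point, below is a hedged sketch of attaching Medusa draft heads to a Hugging Face model with `modelopt.torch.speculative`. The mode name, config keys, and model id are assumptions drawn from ModelOpt's speculative-decoding example; verify them against the release you install.

```python
# Hedged sketch: convert a Hugging Face LLM into a Medusa model before finetuning.
import modelopt.torch.speculative as mtsp
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

medusa_config = {
    "medusa_num_heads": 4,   # number of extra draft heads (assumed key name)
    "medusa_num_layers": 1,  # layers per draft head (assumed key name)
}
mtsp.convert(model, [("medusa", medusa_config)])

# The converted model is then finetuned as usual (Hugging Face Trainer, NeMo, or
# Megatron-LM) so the draft heads learn to predict tokens several steps ahead.
```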
2.3 Techniques for diffusers
ModelOpt will continue to invest in accelerating image-generation inference across several areas.
3. Developer Productivity
3.1 Open-sourcing
To provide extensibility and transparency, we plan to open-source ModelOpt starting with v0.23 (January 2025). This will enable advanced developers to experiment with custom calibration algorithms or contribute the latest techniques. Users can also add support for new models or non-standard dtypes on their own, and benefit from improved debuggability and transparency. Open-sourcing ModelOpt will enable deeper integration with popular open-source frameworks across the ecosystem.
3.2 Ready-to-deploy quantized checkpoints
For developers who have limited GPU resources for quantizing large models, or who prefer to skip the quantization step, we offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection, ready for TensorRT-LLM and vLLM deployment. FP8 checkpoints of the Llama 3.1 family are already available, with FLUX, diffusion, and Medusa-trained checkpoints, and more, coming soon.
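Consuming one of these prequantized checkpoints can be as simple as the sketch below. The model id is just an example from the collection, and the explicit `quantization` flag is an assumption; recent vLLM versions can often detect the ModelOpt quantization format directly from the checkpoint.

```python
# Hedged sketch: serve a prequantized FP8 checkpoint with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # example checkpoint from the collection
    quantization="modelopt",                   # may be auto-detected in newer vLLM versions
)
outputs = llm.generate(["What does FP8 quantization change?"],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```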
4. Choice of Deployment
4.1 Popular Community Frameworks
To offer greater flexibility, we’ve been investing in support for popular inference and serving frameworks such as vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example on the Hugging Face Model Hub, with more model support planned.
4.2 In-Framework Deployment
The ModelOpt team is enabling a deployment path within native PyTorch, with early access targeted for Q1 2025. This decouples model build and compilation from the runtime and offers several benefits.
Internally, we’ve validated this path for FP8 on NVIDIA H100 and FP4 on NVIDIA Blackwell with Llama 3.1 models. We are extending it to broader model types such as Mixtral and other MoE models and FLUX, along with LoRA support.
5. Expand Support Matrix
5.1 Data types
Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging dtypes such as FP6 and sub-4-bit formats. Our focus is to further speed up GenAI inference with minimal to no impact on model fidelity.
5.2 Model Support
We partner with major foundation-model providers for day-0 model support (i.e., a quantization recipe ready at a new model's launch). We’ll continue to expand LLM support (see the upcoming-releases list below for details), invest more in multi-modal LLMs (vision, video, audio, image generation, and action), and potentially expand to SSM-based models based on user interest.
5.3 Platform & Other Support
ModelOpt's explicit quantization will be part of upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for automotive customers. ModelOpt also has planned support for ONNX FP4 on DRIVE Thor.
In Q4 2024, ModelOpt added formal support for Windows (see ModelOpt-Windows), targeting Windows RTX PC systems with tight integration into the Windows ecosystem, including Microsoft DirectML and Microsoft Olive. It currently supports quantization techniques such as INT4 AWQ, and we’ll expand to more techniques suitable for Windows.
Upcoming releases
We'll do our best to provide visibility into our upcoming releases. Details are subject to change, and this list is not comprehensive.
- EAGLE speculative decoding model finetuning and quantization
- SVDQuant
- Real quantization support
- Advanced FP8 KV cache quantization
- MCore/NeMo speculative decoding and quantization workflows
- Megatron-Core/NeMo export to ready-to-deploy quantized checkpoints
- Nemotron 4 in Hugging Face
- Other top models
- DeepSeek-v2 FP8
- Other top models