
[RFC] TensorRT Model Optimizer - Product Roadmap #108
Open · hchings opened this issue Nov 21, 2024 · 6 comments

hchings (Collaborator) commented Nov 21, 2024

TensorRT Model Optimizer - Product Roadmap

TensorRT Model Optimizer (ModelOpt)’s north star is to be the best-in-class model optimization toolkit: one that delivers inference speedups on NVIDIA platforms with ease of use while preserving model accuracy.

In striving for this, our roadmap and development follow these product strategies:

  1. Provide the best recipes in the ecosystem through software-hardware co-design on NVIDIA platforms. Since ModelOpt’s launch, we’ve been delivering speedups of 50% to ~5x on top of existing runtime and compiler optimizations on NVIDIA GPUs, with minimal impact on model accuracy (see Latest News for some of our benchmarks).
  2. Provide a one-stop shop for SOTA optimization methods (quantization, distillation, sparsity, pruning, speculative decoding, etc.) with easy-to-use APIs that let developers chain different methods with reproducibility.
  3. Provide transparency and extensibility, making it easy for developers and researchers to innovate and contribute to our library.
  4. Tightly integrate into the Deep Learning inference and training ecosystem, beyond NVIDIA’s in-house stacks. Offer many-to-many optimization by supporting popular third-party frameworks like vLLM for deployment, in addition to TensorRT-LLM and TensorRT.

In the following, we outline our key investment areas and upcoming features. All are subject to change, and we’ll update this doc regularly. Our goal in sharing this roadmap is to increase visibility into ModelOpt's direction and upcoming features. We welcome questions and feedback in this thread and feature requests in GitHub Issues 😊.

1. FP4 inference on NVIDIA Blackwell

The NVIDIA Blackwell platform powers a new era of computing with FP4 AI inference capabilities. At the general availability of Blackwell (early 2025), ModelOpt will provide FP4 recipes and quantization techniques:

  1. For developers who require lossless or near-lossless FP4 quantization, ModelOpt offers Quantization Aware Training (QAT), which makes the neural network more resilient to quantization. ModelOpt QAT already works with NVIDIA Megatron, NVIDIA NeMo, native PyTorch training, and Hugging Face Trainer. It’ll soon support Megatron’s upcoming custom FSDP.
  2. For developers with a more flexible accuracy threshold, ModelOpt offers PTQ (weight-and-activation or weight-only) and our proprietary AutoQuantize for FP4 inference. AutoQuantize automates per-layer quantization format selection to minimize model accuracy loss (see the sketch below).

In our internal research, ModelOpt has achieved near-lossless results with both PTQ and QAT for Nemotron-4-340B, and minimal accuracy loss for Llama3.1-405B on Blackwell. We currently prioritize models such as the Llama family, Mistral, and FLUX in our FP4 work. All optimized recipes will be publicly available in this repo.
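
A minimal sketch of the PTQ-then-QAT flow described above, using the existing modelopt.torch.quantization API. The model, calib_dataloader, and trainer objects are user-provided placeholders, and the FP4 config name is not final, so FP8 is shown:

```python
import modelopt.torch.quantization as mtq

# Quantization config: the Blackwell FP4 recipe ships at GA, so the FP4 config name
# is still an assumption; existing formats (e.g., mtq.FP8_DEFAULT_CFG or
# mtq.INT4_AWQ_CFG) follow the same pattern.
config = mtq.FP8_DEFAULT_CFG

def forward_loop(model):
    # Run a small, representative calibration set so activation ranges are collected.
    for batch in calib_dataloader:  # user-provided DataLoader (assumption)
        model(**batch)

# PTQ: insert quantizers into the model and calibrate them.
model = mtq.quantize(model, config, forward_loop)

# QAT: the quantized model remains a regular PyTorch module, so continuing to
# fine-tune it with the existing training loop (Hugging Face Trainer, NeMo,
# Megatron-LM, or native PyTorch) recovers accuracy lost to quantization.
trainer.train()  # placeholder for whichever training loop is already in use
```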

2. Model optimization techniques

2.1 Model compression algorithms

ModelOpt collaborates with NVIDIA research and external research labs to continuously develop and integrate state-of-the-art techniques into our library for faster inference. Our recent focus areas include:

  • Advanced PTQ methods (e.g., SVDQuant, QuaRot, SpinQuant)
  • QAT with distillation, a proven path for FP4 inference
  • Attention sparsity (e.g., SnapKV, DuoAttention)
  • AutoQuantize improvements (e.g., more fine-grained format selection and more weight and activation combinations)
  • New token-efficient pruning and distillation methods
  • Infrastructure to support general rotation and smoothing (target ModelOpt v0.25)

2.2 Fast decoding techniques for LLM and VLM

ModelOpt works with TensorRT-LLM and vLLM to streamline fast-decoding model finetuning (Hugging Face, NVIDIA NeMo, and Megatron-LM) and deployment, for both endpoint serving and edge devices. Our focus areas include:

  • Integrated draft models: Medusa, ReDrafter, and EAGLE (see the sketch after this list).
  • Standalone draft model training through pruning and knowledge distillation (e.g., Llama-3.2 1B/3B).
  • Quantization-aware training that supports FP8 and FP4 on the NVIDIA Blackwell platform.
  • Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.
  • Hosting pretrained checkpoints for popular models such as the Llama-3.1 and Nemotron families in the Hugging Face ModelOpt collection.
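
As an example of the integrated draft-model path, converting a base LLM to a Medusa model before finetuning looks roughly like this; the head/layer counts below are illustrative and the exact config keys may evolve:

```python
import modelopt.torch.speculative as mtsp

# Illustrative Medusa settings; appropriate head/layer counts depend on the base model.
medusa_config = {"medusa_num_heads": 2, "medusa_num_layers": 1}

# Attach Medusa draft heads to a Hugging Face / NeMo / Megatron-LM base model.
model = mtsp.convert(model, [("medusa", medusa_config)])

# Finetune with the usual training loop so the draft heads learn to predict future
# tokens, then export the model for TensorRT-LLM or vLLM serving.
```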

2.3 Techniques for diffusers

ModelOpt will continue to accelerate image generation inference by investing in these areas:

  • Quantization: Expand model support for INT8/FP8/FP4 PTQ and QAT, e.g., the FLUX model series (see the sketch after this list).
  • Caching: Add more training-free and lightweight finetuning-based caching techniques with user-friendly APIs (previous work: Cache Diffusion).
  • Deployment: Improve ease of use of the deployment pipelines, including adding multi-GPU support.
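
For instance, PTQ of a diffusers pipeline reuses the same mtq flow on the denoising backbone; the prompts, model id, and FP8 choice below are illustrative:

```python
import torch
from diffusers import FluxPipeline
import modelopt.torch.quantization as mtq

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

def forward_loop(_):
    # Calibrate by running a few representative prompts through the pipeline.
    for prompt in ["a photo of a cat", "a watercolor mountain landscape"]:
        pipe(prompt, num_inference_steps=20)

# Quantize the denoising transformer (FP8 shown; FP4 is planned per the roadmap).
mtq.quantize(pipe.transformer, mtq.FP8_DEFAULT_CFG, forward_loop)
```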

3. Developer Productivity

3.1 Open-sourcing

To provide extensibility and transparency, we plan to open source ModelOpt starting with v0.23 (Jan 2025). This will enable advanced developers to experiment with custom calibration algorithms or contribute the latest techniques. Users will also be able to add model support or non-standard dtypes themselves, and benefit from improved debuggability and transparency. Open-sourcing ModelOpt will enable deeper integration with popular open-source frameworks across the ecosystem.

3.2 Ready-to-deploy quantized checkpoints

For developers who have limited GPU resources for quantizing large models or prefer to skip the quantization step, we offer quantized checkpoints of popular models in the Hugging Face Model Optimizer collection, ready for TensorRT-LLM and vLLM deployment. FP8 checkpoints of the Llama-3.1 family are already available, with FLUX, other diffusion, and Medusa-trained checkpoints coming soon.
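
For example, an FP8 checkpoint from the collection can be loaded directly in vLLM; the repo id and the quantization flag below are assumptions based on the current collection and vLLM's ModelOpt support, so check the collection page for the exact published names:

```python
from vllm import LLM, SamplingParams

# Assumed repo id from the Hugging Face Model Optimizer collection.
llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

outputs = llm.generate(
    ["Explain FP8 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```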

4. Choice of Deployment

4.1 Popular Community Frameworks

To offer greater flexibility, we’ve been investing in support for popular inference and serving frameworks like vLLM and SGLang, in addition to seamless integration with the NVIDIA AI software ecosystem. We currently provide an initial workflow for vLLM deployment and an example on the Hugging Face model hub, with more model support planned.

4.2 In-Framework Deployment

The ModelOpt team is enabling a path for deployment within native PyTorch, with early access targeted for Q1 2025. This decouples model build and compile from the runtime and offers several benefits:

  1. When optimizing inference performance or exploring new model compression techniques, ModelOpt users can quickly prototype in the PyTorch runtime with native PyTorch APIs to evaluate performance gains. Once satisfied, they can transition to the TensorRT-LLM runtime as the final step to maximize performance (see the sketch below).
  2. For models not yet supported by TensorRT-LLM, or applications that do not need ultra-fast inference speeds, users get out-of-the-box performance improvements within native PyTorch.

Internally, we’ve validated this path for FP8 on NVIDIA H100s and FP4 on NVIDIA Blackwell with Llama 3.1 models. We’re extending it to broader model types such as Mixtral and other MoE models and FLUX, along with LoRA support.
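
A rough sketch of the prototype-in-PyTorch, then deploy-to-TensorRT-LLM flow from item 1 above. The evaluate() and forward_loop names are user-provided placeholders, and the export call reflects the existing modelopt.torch.export API, which may evolve alongside the in-framework path:

```python
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_tensorrt_llm_checkpoint

# 1) Prototype: quantize and evaluate directly in the native PyTorch runtime.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
evaluate(model)  # user-provided evaluation loop in PyTorch

# 2) Deploy: once accuracy and latency look good, export a TensorRT-LLM checkpoint
#    and build the engine with TensorRT-LLM to maximize performance.
export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",                 # model family; "llama" as an example
    dtype=torch.bfloat16,
    export_dir="llama31_fp8_trtllm_ckpt",
)
```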

5. Expand Support Matrix

5.1 Data types

Alongside our existing supported dtypes, we’ve recently added MXFP4 support and will soon expand to emerging dtypes such as FP6 and sub-4-bit formats. Our focus is to further speed up GenAI inference with minimal to no impact on model fidelity.

5.2 Model Support

We partner with major foundation model providers for day-0 model support (i.e., a quantization recipe ready at new model launch). We’ll continue to expand LLM support (see the roadmap table below for details), invest more in multi-modal LLMs (vision, video, audio, image generation, and action), and potentially expand to SSM-based models based on user interest.

5.3 Platform & Other Support

ModelOpt's explicit quantization will be part of upcoming NVIDIA DriveOS releases. We recently added an end-to-end BEVFormer INT8 example in NVIDIA DL4AGX, with more model support coming soon for automotive customers. ModelOpt also plans ONNX FP4 support for DRIVE Thor.

In Q4 2024, ModelOpt added formal support for Windows (see ModelOpt-Windows), targeting Windows RTX PC systems with tight integration with the Windows ecosystem, such as Microsoft DirectML and Microsoft Olive. It currently supports quantization methods such as INT4 AWQ, and we’ll expand to more techniques suited to Windows.

Upcoming releases

We'll do our best to provide visibility into our upcoming releases. Details are subject to change and this table is not comprehensive.

|  | ModelOpt v0.21 (released in Nov) | ModelOpt v0.23 (Jan 2025) |
| --- | --- | --- |
| Blackwell Inference |  | PTQ and QAT recipes for FP4 inference |
| Feature Improvements | - ONNX perf/acc tuning<br>- EAGLE speculative decoding model finetuning and quantization | - In-framework (PyTorch) deployment (early access)<br>- SVDQuant<br>- Real quantization support<br>- Advanced FP8 KV cache quantization<br>- MCore/NeMo speculative decoding and quantization workflows |
| Developer Productivity |  | - Target to go open source<br>- Megatron-Core/NeMo export to ready-to-deploy quantized checkpoints |
| Model Support | - Nemotron 340B FP8 (ongoing)<br>- Nemotron 4 in Hugging Face<br>- Other top models | - Codestral Mamba 7B or Mamba2 FP8<br>- DeepSeek-v2 FP8<br>- Other top models |
| Platform Support & Ecosystem | - Official Windows support with ONNX INT4 |  |
ogencoglu commented

ONNX quantization methods that don't require calibration data would be amazing.


kshitizgupta21 commented Nov 23, 2024

> Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.

Can you elaborate on trtllm-serve? Are there any integrations planned with Triton TensorRT-LLM backend?


wxsms commented Nov 25, 2024

Would like to see 4-bit quantization and multi-GPU solutions for diffusion models (especially DiT models like SD3). Thank you for the awesome work in advance!

A 4-bit solution for reference: https://github.com/mit-han-lab/nunchaku

riyadshairi979 (Collaborator) commented Nov 25, 2024

> ONNX quantization methods that don't require calibration data would be amazing.

If users just want to test the latency of the model, they can skip the calibration data and modelopt.onnx.quantization will calibrate with random data, but the resulting accuracy will be unrealistic. Real calibration data is required by PTQ for acceptable model accuracy.
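
For illustration, a minimal sketch with the modelopt.onnx.quantization Python entry point; the argument names here are indicative, so check the ModelOpt docs for the exact signature and the accepted calibration-data formats:

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Latency-only experiment: without calibration data, random data is used internally,
# so the quantized model's accuracy is not meaningful.
quantize(onnx_path="model.onnx", quantize_mode="int8", output_path="model.quant.onnx")

# Realistic accuracy: pass a representative calibration set (here a saved numpy array
# of model inputs; the file name is hypothetical).
calib_data = np.load("calib_inputs.npy")
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data=calib_data,
    output_path="model.quant.onnx",
)
```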

hchings (Collaborator, Author) commented Nov 26, 2024

> Would like to see 4-bit quantization and multi-GPU solutions for diffusion models (especially DiT models like SD3). Thank you for the awesome work in advance!

Hi @wxsms, thanks for the question. We have multi-GPU DiT work ongoing with the PyTorch deployment, and we might include it in the early access of in-framework deployment depending on progress. For deployment with NVIDIA TensorRT, there's ongoing work for FLUX but no timeline to share yet.

ChenhanYu commented

> Out-of-the-box deployment with trtllm-serve, NVIDIA NIM, and vLLM serve.
>
> Can you elaborate on trtllm-serve? Are there any integrations planned with Triton TensorRT-LLM backend?

TensorRT-LLM is developing a new hosting feature called trtllm-serve, which should be similar to vllm serve. We are still working with TRT-LLM on the details. Regarding the Triton TensorRT-LLM backend, it supports the conventional TRT-LLM checkpoints (which can be generated by modelopt.torch.export).
