BentoML - v1.1.0
🍱 We're thrilled to announce the release of BentoML v1.1.0, our first minor version update since the milestone v1.0.
- Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
- Official gRPC Support: We've transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
- Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
- Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffusers models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries.
- Enhanced Model Version Management: Improved model version management lets you configure model versions flexibly and synchronize them with your remote model store.
🦾 We are also excited to announce the launch of OpenLLM v0.2.0, featuring support for Llama 2 models.
- GPU and CPU Support: Running Llama is supported on both GPU and CPU.
- Model variations and parameter sizes: Supports all model weights and parameter sizes on Hugging Face, including:
  - meta-llama/llama-2-70b-chat-hf
  - meta-llama/llama-2-13b-chat-hf
  - meta-llama/llama-2-7b-chat-hf
  - meta-llama/llama-2-70b-hf
  - meta-llama/llama-2-13b-hf
  - meta-llama/llama-2-7b-hf
  - openlm-research/open_llama_7b_v2
  - openlm-research/open_llama_3b_v2
  - openlm-research/open_llama_13b
  - huggyllama/llama-65b
  - huggyllama/llama-30b
  - huggyllama/llama-13b
  - huggyllama/llama-7b
- Users can use any weights on Hugging Face (e.g. TheBloke/Llama-2-13B-chat-GPTQ), custom weights from a local path (e.g. /path/to/llama-1), or fine-tuned weights, as long as they adhere to LlamaModelForCausalLM.
- Stay tuned for fine-tuning capabilities in OpenLLM: Fine-tuning for various Llama 2 models will be added in a future release. In the meantime, try the experimental script for fine-tuning Llama 2 with QLoRA in the OpenLLM playground:
python -m openllm.playground.llama2_qlora --help
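As a rough illustration of the model sources described above (a Hugging Face ID such as TheBloke/Llama-2-13B-chat-GPTQ, or a local path such as /path/to/llama-1), the sketch below tells the two apart and assembles a launch command. It assumes the `openllm start llama --model-id ...` command form from the OpenLLM README; the helper names `classify_model_ref` and `build_start_command` are hypothetical and not part of OpenLLM, and the classification rule is a simplification for illustration only.

```python
from __future__ import annotations

import shlex


def classify_model_ref(ref: str) -> str:
    """Return 'local' for filesystem-style paths and 'hub' for 'org/name' IDs.

    Hypothetical helper: a simplified rule for illustration, not OpenLLM's own logic.
    """
    if ref.startswith(("/", "./", "../", "~")):
        return "local"
    parts = ref.split("/")
    if len(parts) == 2 and all(parts):
        return "hub"
    raise ValueError(f"not a recognizable model reference: {ref!r}")


def build_start_command(ref: str) -> list[str]:
    """Build the argv for starting a Llama server with the given weights.

    Assumes the `openllm start llama --model-id <ref>` CLI form.
    """
    classify_model_ref(ref)  # raises ValueError for malformed references
    return ["openllm", "start", "llama", "--model-id", ref]


if __name__ == "__main__":
    for ref in (
        "meta-llama/llama-2-7b-chat-hf",
        "TheBloke/Llama-2-13B-chat-GPTQ",
        "/path/to/llama-1",
    ):
        # Print the kind of reference and a copy-pasteable shell command.
        print(classify_model_ref(ref), "->", shlex.join(build_start_command(ref)))
```

The script only prints the commands; whether a given set of weights actually loads still depends on it being compatible with the Llama architecture, as noted above.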