Releases: bentoml/BentoML

BentoML - v1.1.0

24 Jul 20:34
2ab6de7

🍱 We're thrilled to announce the release of BentoML v1.1.0, our first minor version update since the milestone v1.0.

  • Backward Compatibility: Rest assured that this release maintains full API backward compatibility with v1.0.
  • Official gRPC Support: We've transitioned gRPC support in BentoML from experimental to official status, expanding your toolkit for high-performance, low-latency services.
  • Ray Integration: Ray is a popular open-source compute framework that makes it easy to scale Python workloads. BentoML integrates natively with Ray Serve to enable users to deploy Bento applications in a Ray cluster without modifying code or configuration.
  • Enhanced Hugging Face Transformers and Diffusers Support: All Hugging Face Diffusers models and pipelines can be seamlessly imported and integrated into BentoML applications through the Transformers and Diffusers framework libraries; see the sketch after this list.
  • Enhanced Model Version Management: Enjoy greater flexibility with the improved model version management, enabling flexible configuration and synchronization of model versions with your remote model store.
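
For example, on the Transformers side, a Hugging Face pipeline can be saved to the model store and later converted into a Runner. A minimal sketch using the bentoml.transformers.save_model and get().to_runner() APIs that also appear in the v1.0.17 notes below:

    import bentoml
    import transformers

    # Save a Hugging Face pipeline to the local BentoML model store
    summarizer = transformers.pipeline("summarization")
    bentoml.transformers.save_model("summarizer", summarizer)

    # Load it back as a Runner for use inside a bentoml.Service
    summarizer_runner = bentoml.transformers.get("summarizer:latest").to_runner()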

🦾 We are also excited to announce the launch of OpenLLM v0.2.0 featuring support for Llama 2 models.

  • GPU and CPU Support: Running Llama 2 is supported on both GPU and CPU.

  • Model variations and parameter sizes: Supports all model weights and parameter sizes available on Hugging Face.

    meta-llama/llama-2-70b-chat-hf
    meta-llama/llama-2-13b-chat-hf
    meta-llama/llama-2-7b-chat-hf
    meta-llama/llama-2-70b-hf
    meta-llama/llama-2-13b-hf
    meta-llama/llama-2-7b-hf
    openlm-research/open_llama_7b_v2
    openlm-research/open_llama_3b_v2
    openlm-research/open_llama_13b
    huggyllama/llama-65b
    huggyllama/llama-30b
    huggyllama/llama-13b
    huggyllama/llama-7b

    Users can use any weights on Hugging Face (e.g. TheBloke/Llama-2-13B-chat-GPTQ), custom weights from a local path (e.g. /path/to/llama-1), or fine-tuned weights, as long as they adhere to LlamaModelForCausalLM.

  • Stay tuned for fine-tuning capabilities in OpenLLM: Fine-tuning various Llama 2 models will be added in a future release. Try the experimental script for fine-tuning Llama 2 with QLoRA in the OpenLLM playground.

    python -m openllm.playground.llama2_qlora --help
    

BentoML - v1.0.22

12 Jun 20:44
89e5fda

🍱 The BentoML v1.0.22 release brings a list of highly anticipated updates.

  • Added support for Pydantic 2 for better validation performance; see the validation sketch after this list.

  • Added support for CUDA 12 versions in builds and containerization.

  • Introduced service lifecycle events, allowing custom logic to be added at on_deployment, on_startup, and on_shutdown. State can be managed using the context (ctx) variable during the on_startup and on_shutdown events and during request serving in the API.

    @svc.on_deployment
    def on_deployment():
      # Custom logic that runs once at deployment time
      pass
    
    @svc.on_startup
    def on_startup(ctx: bentoml.Context):
      # Create shared state when a worker starts
      ctx.state["object_key"] = create_object()
    
    @svc.on_shutdown
    def on_shutdown(ctx: bentoml.Context):
      # Clean up the shared state when a worker shuts down
      cleanup_state(ctx.state["object_key"])
    
    @svc.api(input=bentoml.io.JSON(), output=bentoml.io.JSON())
    def predict(input_data, ctx: bentoml.Context):
      # Access the state created in on_startup while serving requests
      obj = ctx.state["object_key"]
      ...
  • Added support for traffic control for both the API Server and Runners. Timeout and maximum concurrency can now be configured through the configuration file.

    api_server:
      traffic:
        timeout: 10 # API Server request timeout in seconds
        max_concurrency: 32 # Maximum number of concurrent requests in the API Server
    
    runners:
      iris:
        traffic:
          timeout: 10 # Runner request timeout in seconds
          max_concurrency: 32 # Maximum number of concurrent requests in the Runner
  • Improved bentoml push performance for large Bentos.
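
As a sketch of the Pydantic support above, a Pydantic model can be attached to the JSON IO descriptor to validate incoming requests (the service name and fields below are illustrative):

    import bentoml
    import pydantic
    from bentoml.io import JSON

    class IrisFeatures(pydantic.BaseModel):
        sepal_length: float
        sepal_width: float
        petal_length: float
        petal_width: float

    svc = bentoml.Service("iris_classifier_svc")

    # Requests are validated against the Pydantic model before reaching the API function
    @svc.api(input=JSON(pydantic_model=IrisFeatures), output=JSON())
    def classify(features: IrisFeatures) -> dict:
        # model_dump() is the Pydantic 2 serialization method
        return {"received": features.model_dump()}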

🚀 One more thing: the team is delighted to unveil our latest endeavor, OpenLLM. This innovative project allows you to effortlessly build with state-of-the-art open-source or fine-tuned large language models.

  • Supports all variants of Flan-T5, Dolly V2, StarCoder, Falcon, StableLM, and ChatGLM out of the box. Fully customizable with model-specific arguments.

    openllm start [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
  • Exposes the familiar BentoML APIs and transforms LLMs seamlessly into Runners; see the Service sketch after this list.

    llm_runner = openllm.Runner("dolly-v2")
  • Builds LLM application into the Bento format that can be deployed to BentoCloud or containerized into OCI images.

    openllm build [falcon | flan_t5 | dolly_v2 | chatglm | stablelm | starcoder]
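
As referenced above, a rough sketch of wiring an OpenLLM Runner into a BentoML Service may look like the following; the generate method and its return value are assumptions, so check the OpenLLM README for the exact interface:

    import bentoml
    import openllm
    from bentoml.io import Text

    llm_runner = openllm.Runner("dolly-v2")

    svc = bentoml.Service("llm-service", runners=[llm_runner])

    @svc.api(input=Text(), output=Text())
    async def prompt(input_text: str) -> str:
        # Assumes the runner exposes a `generate` method returning generated text
        answer = await llm_runner.generate.async_run(input_text)
        return str(answer)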

Our dedicated team is working hard to pioneer more integrations of advanced models in upcoming releases of OpenLLM. Stay tuned for further developments.

BentoML - v1.0.20

10 May 01:14
7f7be71

🍱 BentoML v1.0.20 is released with improved usability and compatibility features.

  • Production Mode by Default: The bentoml serve command now runs with the --production option by default. This change is made to simulate production behavior during development. The --reload option will continue to work as expected. To achieve the previous serving behavior, use --development instead.

  • Optional Dependency for OpenTelemetry Exporter: The opentelemetry-exporter-otlp-proto-http dependency has been moved from a required dependency to an optional one to address a protobuf dependency incompatibility issue. ⚠️ If you are currently using the Model Monitoring and Inference Data Collection feature, you must install the package with the monitor-otlp option from this release onwards to include the necessary dependency.

    pip install "bentoml[monitor-otlp]"
  • OpenTelemetry Trace ID Configuration Option: A new configuration option has been added to return the OpenTelemetry Trace ID in the response. This feature is particularly helpful when tracing has not been initialized in the upstream caller, but the caller still wishes to log the Trace ID in case of an error.

    api_server:
      http:
        response:
          trace_id: True
  • Start from a Service: Added the ability to start a server from a bentoml.Service object. This is helpful for troubleshooting a project in a development environment where no Bentos have been built yet.

    import bentoml
    
    # import the Service defined in `/clip_api_service/service.py` file
    from clip_api_service.service import svc 
    
    if __name__ == "__main__":
      # start a server:
      server = bentoml.HTTPServer(svc)
      server.start(blocking=False)
      client = server.get_client()
      client.predict(..)

What's Changed

New Contributors

Full Changelog: v1.0.19...v1.0.20

BentoML - v1.0.19

26 Apr 23:52
afe9660

🍱 BentoML v1.0.19 is released with enhanced GPU utilization and expanded ML framework support.

  • Optimized GPU resource utilization: Enabled scheduling of multiple instances of the same runner using the workers_per_resource scheduling strategy configuration. The following configuration allows scheduling 2 instances of the “iris” runner per GPU instance. workers_per_resource is 1 by default.

    runners:
      iris:
        resources:
          nvidia.com/gpu: 1
        workers_per_resource: 2
  • New ML framework support: We've added support for EasyOCR and Detectron2 to our growing list of supported ML frameworks.

  • Enhanced runner communication: Implemented PEP 574 out-of-band pickling to improve runner communication by eliminating memory copying, resulting in better performance and efficiency; a standalone illustration follows this list.

  • Backward compatibility for Hugging Face Transformers: Resolved compatibility issues with Hugging Face Transformers versions prior to v4.18, ensuring a seamless experience for users with older versions.
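
The out-of-band pickling mentioned above builds on Python's pickle protocol 5 (PEP 574). As a standalone illustration of the mechanism, not BentoML's internal implementation, large buffers can be transferred alongside the pickle stream without being copied into it:

    import pickle

    import numpy as np

    arr = np.ones((1024, 1024), dtype="float32")

    # With protocol 5, large buffers are handed to the callback instead of
    # being copied into the pickle byte stream.
    buffers = []
    payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

    # The receiver supplies the out-of-band buffers back when loading.
    restored = pickle.loads(payload, buffers=buffers)
    assert np.array_equal(arr, restored)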

⚙️ With the release of Kubeflow 1.7, BentoML now has native integration with Kubeflow, allowing developers to leverage BentoML's cloud-native components. Previously, developers were limited to exporting and deploying a Bento as a single container. With this integration, models trained in Kubeflow can easily be packaged, containerized, and deployed to a Kubernetes cluster as microservices. This architecture enables the individual models to run in their own pods, utilizing the most optimal hardware for their respective tasks and enabling independent scaling.

💡 With each release, we consistently update our blog, documentation and examples to empower the community in harnessing the full potential of BentoML.

What's Changed

New Contributors

Full Changelog: v1.0.18...v1.0.19

BentoML - v1.0.18

14 Apr 10:59
52f7863

🍱 BentoML v1.0.18 brings a new way of creating the server and client natively from Python.

  • Start an HTTP or gRPC server and client asynchronously with a context manager.

    import numpy as np

    from bentoml import HTTPServer

    server = HTTPServer("iris_classifier:latest", production=True, port=3000)
    
    # Start the server in a separate process and connect to it using a client
    with server.start() as client:
        res = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))
  • Start an HTTP or gRPC server synchronously.

    server = HTTPServer("iris_classifier:latest", production=True, port=3000)
    server.start(blocking=True)
  • As always, a client can be created and connected to a running server.

    from bentoml.client import Client

    client = Client.from_url("http://localhost:3000")
    res = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))

What's Changed

  • chore(deps): bump coverage[toml] from 7.2.2 to 7.2.3 by @dependabot in #3746
  • bugs: Fix an f-string bug in Tranformers framework. by @ssheng in #3753
  • chore(deps): bump pytest from 7.2.2 to 7.3.0 by @dependabot in #3751
  • chore(deps): bump bufbuild/buf-setup-action from 1.16.0 to 1.17.0 by @dependabot in #3750
  • fix: BufferError when pushing model to BentoCloud by @aarnphm in #3737
  • chore: remove codecov dependencies by @aarnphm in #3754
  • feat: implement new serve API by @sauyon in #3696
  • examples: Add a client example to quickstart by @ssheng in #3752

Full Changelog: v1.0.17...v1.0.18

BentoML - v1.0.17

06 Apr 20:55
09cf0f4

🍱 We are excited to announce the release of BentoML v1.0.17, which includes support for 🤗 Hugging Face Transformers pre-trained instances. Prior to this release, only pipelines could be saved and loaded using the bentoml.transformers APIs. However, based on the community's demand to work with pre-trained models, tokenizers, preprocessors, etc., without pipelines, we have expanded the capabilities of the bentoml.transformers APIs. With this release, all pre-trained instances can be saved and loaded into either built-in Transformers framework runners or custom runners. This update opens up new possibilities for users to work with pre-trained models, and we are thrilled to see what the community will create using this feature. To learn more, visit the BentoML Transformers framework documentation.

  • Pre-trained models and instances, such as tokenizers, preprocessors, and feature extractors, can also be saved as standalone models using the bentoml.transformers.save_model API.

    import bentoml
    from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor
    
    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
    model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    
    bentoml.transformers.save_model("speecht5_tts_processor", processor)
    bentoml.transformers.save_model("speecht5_tts_model", model, signatures={"generate_speech": {"batchable": False}})
    bentoml.transformers.save_model("speecht5_tts_vocoder", vocoder)
  • Pre-trained models and instances can be run either independently as Transformers framework runners or jointly in a custom runner. To use pre-trained models and instances as individual framework runners, simply get the model references and convert them to runners using the to_runner method.

    import bentoml
    import torch
    
    from bentoml.io import Text, NumpyNdarray
    from datasets import load_dataset
    
    processor_runner = bentoml.transformers.get("speecht5_tts_processor").to_runner()
    model_runner = bentoml.transformers.get("speecht5_tts_model").to_runner()
    vocoder_runner = bentoml.transformers.get("speecht5_tts_vocoder").to_runner()
    embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
    speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    
    svc = bentoml.Service("text2speech", runners=[proccessor_runner, model_runner, vocoder_runner])
    
    @svc.api(input=Text(), output=NumpyNdarray())
    def generate_speech(inp: str):
        inputs = processor_runner.run(text=inp, return_tensors="pt")
        speech = model_runner.generate_speech.run(input_ids=inputs["input_ids"], speaker_embeddings=speaker_embeddings, vocoder=vocoder_runner.run)
        return speech.numpy()
  • To use the pre-trained models and instances together in a custom runner, use the bentoml.transformers.get API to get the model references and load them in a custom runner. The pre-trained instances can then be used for inference in the custom runner.

    import bentoml
    import torch
    
    from datasets import load_dataset
    
    processor_ref = bentoml.models.get("speecht5_tts_processor:latest")
    model_ref = bentoml.models.get("speecht5_tts_model:latest")
    vocoder_ref = bentoml.models.get("speecht5_tts_vocoder:latest")
    
    class SpeechT5Runnable(bentoml.Runnable):
    
        def __init__(self):
            self.processor = bentoml.transformers.load_model(processor_ref)
            self.model = bentoml.transformers.load_model(model_ref)
            self.vocoder = bentoml.transformers.load_model(vocoder_ref)
            self.embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
            self.speaker_embeddings = torch.tensor(self.embeddings_dataset[7306]["xvector"]).unsqueeze(0)
    
        @bentoml.Runnable.method(batchable=False)
        def generate_speech(self, inp: str):
            inputs = self.processor(text=inp, return_tensors="pt")
            speech = self.model.generate_speech(inputs["input_ids"], self.speaker_embeddings, vocoder=self.vocoder)
            return speech.numpy()
    
    text2speech_runner = bentoml.Runner(SpeechT5Runnable, name="speecht5_runner", models=[processor_ref, model_ref, vocoder_ref])
    svc = bentoml.Service("talk_gpt", runners=[text2speech_runner])
    
    @svc.api(input=bentoml.io.Text(), output=bentoml.io.NumpyNdarray())
    async def generate_speech(inp: str):
        return await text2speech_runner.generate_speech.async_run(inp)

What's Changed

  • feat(containerize): caching pip/conda installation layers by @smidm in #3673
  • docs(batching): update docs to 503 by @sauyon in #3677
  • chore(deps): bump ruff from 0.0.255 to 0.0.256 by @dependabot in #3676
  • fix(type): annotate PdSeries with pandas-stubs by @aarnphm in #3466
  • chore(dispatcher): refactor out training code by @sauyon in #3663
  • fix: makes containerize for triton examples to all amd64 by @aarnphm in #3678
  • chore(deps): bump coverage[toml] from 7.2.1 to 7.2.2 by @dependabot in #3679
  • revert: "chore(dispatcher): refactor out training code (#3663)" by @sauyon in #3680
  • doc: add more links to Bentoml/examples by @larme in #3631
  • perf: serialization optimization by @larme in #3606
  • examples: Kubeflow by @ssheng in #3656
  • chore(deps): bump pytest-asyncio from 0.20.3 to 0.21.0 by @dependabot in #3688
  • chore(deps): bump ruff from 0.0.256 to 0.0.257 by @dependabot in #3689
  • chore(deps): bump imageio from 2.26.0 to 2.26.1 by @dependabot in #3690
  • chore(deps): bump yamllint from 1.29.0 to 1.30.0 by @dependabot in #3694
  • fix: remove duplicate dependabot check for pip by @aarnphm in #3691
  • chore(deps): bump ruff from 0.0.257 to 0.0.258 by @dependabot in #3699
  • docs: Update the Kubeflow example by @ssheng in #3703
  • chore(deps): bump ruff from 0.0.258 to 0.0.259 by @dependabot in #3709
  • docs: add link to pyfilesystem plugins by @sauyon in #3716
  • docs: Kubeflow integration documentation by @ssheng in #3704
  • docs: replace load_runner() to get().to_runner() by @KimSoungRyoul in #3715
  • chore(deps): bump imageio from 2.26.1 to 2.27.0 by @dependabot in #3720
  • fix(readme): format markdown table by @aarnphm in #3722
  • fix: copy files before running setup_script by @aarnphm in #3713
  • chore: remove experimental warning for bentoml.metrics by @aarnphm in #3725
  • ci: temporary disable coverage by @aarnphm in #3726
  • chore(deps): bump ruff from 0.0.259 to 0.0.260 by @dependabot in #3734
  • chore(deps): bump tritonclient[all] from 2.31.0 to 2.32.0 by @dependabot in #3730
  • fix(type): bentoml.container.build should accept multiple image_tag by @pmayd in #3719
  • chore(deps): bump bufbuild/buf-setup-action from 1.15.1 to 1.16.0 by @dependabot in #3738
  • feat: add query params to request context by @sauyon in #3717
  • chore(dispatcher): use attr class instead of a tuple by @sauyon in #3731
  • fix: Make it so the configured max_batch_size is respected when batching inference requests together by @RShang97 in #3741
  • feat(transformers): pretrained protocol support by @aarnphm in #3684
  • fix(tests): broken CI by @aarnphm in #3742
  • chore(deps): bump ruff from 0.0.260 to 0.0.261 by @dependabot in #3744
  • docs: Transformers documentation on pre-trained instances support by @ssheng in #3745

New Contributors

Full Changelog: v1.0.16...v1.0.17

BentoML - v1.0.16

14 Mar 21:03
f503a68

🍱 The BentoML v1.0.16 release is here, featuring the introduction of the bentoml.triton framework. With this integration, BentoML now supports running NVIDIA Triton Inference Server as a Runner. See the Triton Inference Server documentation to learn more!

  • Triton Inference Server can be configured as a Runner in BentoML with its model repository and CLI arguments specified as parameters.

    import bentoml
    
    triton_runner = bentoml.triton.Runner(
        "triton_runner",
        model_repository="s3://bucket/path/to/model_repository",
        cli_args=["--load-model=torchscript_yolov5s", "--model-control-mode=explicit"],
    )
  • Models served by the Triton Inference Server Runner can be called as methods on the runner handle, both synchronously and asynchronously; see the note after this list.

    @svc.api(
        input=bentoml.io.Image.from_sample("./data/0.png"), output=bentoml.io.NumpyNdarray()
    )
    async def bentoml_torchscript_mnist_infer(im: Image) -> NDArray[t.Any]:
        arr = np.array(im) / 255.0
        arr = np.expand_dims(arr, (0, 1)).astype("float32")
        InferResult = await triton_runner.torchscript_mnist.async_run(arr)
        return InferResult.as_numpy("OUTPUT__0")
  • Build bentos and containerize images with Triton Runners by specifying the nvcr.io/nvidia/tritonserver base image in bentofile.yaml.

    service: service:svc
    include:
      - /model_repository
      - /data/*.png
      - /*.py
    exclude:
      - /__pycache__
      - /venv
      - /train.py
      - /build_bento.py
      - /containerize_bento.py
    python:
      packages:
        - bentoml[triton]
    docker:
      base_image: nvcr.io/nvidia/tritonserver:22.12-py3
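
As noted in the bullet above, model methods on the Triton runner handle can also be invoked synchronously. A minimal sketch, reusing arr and triton_runner from the examples above and the run method used by runner handles elsewhere in these notes:

    # Synchronous variant of the async call shown above (same arguments, blocking result)
    result = triton_runner.torchscript_mnist.run(arr)
    output = result.as_numpy("OUTPUT__0")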

💡 If you are an existing Triton user, the integration provides simpler ways to add custom logic in Python, deploy distributed multi-model inference graphs, unify model management across different ML frameworks and workflows, and standardize the model packaging format with versioning and collaboration features. If you are an existing BentoML user, the integration improves runner efficiency and throughput under high load, thanks to Triton's efficient C++ runtime.

What's Changed

New Contributors

Full Changelog: v1.0.15...v1.0.16

BentoML - v1.0.15

16 Feb 01:31
a61379a

🍱 The BentoML v1.0.15 release is here, featuring the introduction of the bentoml.diffusers framework.

  • Learn more about the capabilities of the bentoml.diffusers framework in the Creating Stable Diffusion 2.0 Service With BentoML And Diffusers blog and BentoML Diffusers example project.

  • Import a diffusion model with the bentoml.diffusers.import_model API.

    import bentoml
    
    bentoml.diffusers.import_model(
        "sd2",
        "stabilityai/stable-diffusion-2",
    )
  • Create a text2img service using a Stable Diffusion 2.0 model runner with the familiar to_runner API from the bentoml.diffusers framework; an example client call follows this list.

    import bentoml
    from bentoml.io import JSON, Image
    
    bento_model = bentoml.diffusers.get("sd2:latest")
    stable_diffusion_runner = bento_model.to_runner()
    
    svc = bentoml.Service("stable_diffusion_v2", runners=[stable_diffusion_runner])
    
    @svc.api(input=JSON(), output=Image())
    def txt2img(input_data):
        images, _ = stable_diffusion_runner.run(**input_data)
        return images[0]
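
With the service above running locally, the endpoint can be called from Python using the client API shown in the v1.0.18 notes. This is a sketch that assumes the client converts the Image output back into a PIL image; the prompt payload is illustrative:

    from bentoml.client import Client

    client = Client.from_url("http://localhost:3000")

    # Keyword arguments in the JSON payload are forwarded to the diffusion pipeline
    image = client.txt2img({"prompt": "a photo of an astronaut riding a horse"})
    image.save("output.png")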

🍱 Fixed an incompatibility introduced in starlette==0.25.0 that resulted in the type MultiPartMessage not being found in starlette.formparsers.

ImportError: cannot import name 'MultiPartMessage' from 'starlette.formparsers' (/opt/miniconda3/envs/bentoml/lib/python3.10/site-packages/starlette/formparsers.py)

What's Changed

New Contributors

Full Changelog: v1.0.14...v1.0.15

BentoML - v1.0.14

08 Feb 22:41
9a6dc93

🍱 Fixed the backward incompatibility introduced in starlette version 0.24.0. Upgrade BentoML to v1.0.14 if you encounter an error related to content_type like the one below.

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/service_app.py", line 305, in api_func
    input_data = await api.input.from_http_request(request)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/io_descriptors/multipart.py", line 208, in from_http_request
    reqs = await populate_multipart_requests(request)
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/utils/formparser.py", line 188, in populate_multipart_requests
    form = await multipart_parser.parse()
  File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/utils/formparser.py", line 158, in parse
    multipart_file = UploadFile(
TypeError: __init__() got an unexpected keyword argument 'content_type'

BentoML - v1.0.13

20 Jan 03:52
4d2fd62

🍱 BentoML v1.0.13 is released featuring a preview of batch inference with Spark.

  • Run the batch inference job using the bentoml.batch.run_in_spark() method. This method takes the bento, the API name, the Spark DataFrame containing the input data, and the Spark session as parameters, and it returns a DataFrame containing the results of the batch inference job.

    import bentoml
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # `df` is a Spark DataFrame holding the input data for the API
    # (loaded here from a hypothetical CSV file for illustration)
    df = spark.read.csv("input.csv", header=True, inferSchema=True)
    
    # Import the bento from a repository or get the bento from the bento store
    bento = bentoml.import_bento("s3://bentoml/quickstart")
    
    # Run the run_in_spark function with the bento, the API name, the DataFrame, and the Spark session
    results_df = bentoml.batch.run_in_spark(bento, "classify", df, spark)
  • Internally, what happens when you run run_in_spark is as follows:

    • First, the bento is distributed to the cluster. Note that if the bento has already been distributed, i.e. you have already run a computation with that bento, this step is skipped.
    • Next, a process function is created, which starts a BentoML server on each of the Spark workers, then uses a client to process all the data. This is done so that the workers take advantage of the batch processing features of the BentoML server. PySpark pickles this process function and dispatches it, along with the relevant data, to the workers.
    • Finally, the function is evaluated on the given dataframe. Once all methods that the user defined in the script have been executed, the data is returned to the master node.

⚠️ The bentoml.batch API may undergo incompatible changes until general availability is announced in a later minor version release.
🥂 Shout out to jeffthebear, KimSoungRyoul, Robert Fernandez, Marco Vela, Quan Nguyen, and y1450 from the community for their contributions in this release.

What's Changed

New Contributors

Full Changelog: v1.0.12...v1.0.13