docs: Update Triton documentation and examples (#3668)
ssheng authored Mar 14, 2023
1 parent eaa6218 commit 42a62d8
Showing 7 changed files with 44 additions and 75 deletions.
11 changes: 11 additions & 0 deletions docs/source/integrations/triton.rst
@@ -408,6 +408,17 @@ HTTP/REST APIs is disabled by default, though it can be enabled when creating th

Additionally, BentoML will allocate a random port for the gRPC/HTTP server, so any ``grpc-port`` or ``http-port`` options passed to the Runner's ``cli_args`` will be omitted.

Adaptive Batching
^^^^^^^^^^^^^^^^^

:ref:`Adaptive batching <guides/batching:Adaptive Batching>` is a feature supported by BentoML runners that allows for efficient batch size selection during inference. However, it's important to note that this feature is not compatible with ``TritonRunner``.
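
For contrast, adaptive batching on a regular (non-Triton) BentoML runner is enabled when the model is saved, by marking a signature as batchable. The snippet below is a minimal sketch; the model name, file path, and signature values are illustrative rather than taken from this project.

.. code-block:: python

    import bentoml
    import onnx

    # Illustrative only: load an ONNX model from a hypothetical local file.
    model = onnx.load("yolov5s.onnx")

    # Marking the "run" signature as batchable lets the runner created from this
    # model use BentoML's adaptive batching; batch_dim is the axis along which
    # concurrent requests are concatenated.
    bentoml.onnx.save_model(
        "onnx_yolov5s",
        model,
        signatures={"run": {"batchable": True, "batch_dim": 0}},
    )

    runner = bentoml.onnx.get("onnx_yolov5s:latest").to_runner()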

``TritonRunner`` is designed as a standalone Triton server, so the adaptive batching logic in BentoML runners is not invoked when ``TritonRunner`` is used.

Fortunately, Triton supports its own solution for efficient batching called `dynamic batching <https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#scheduling-and-batching>`_.
Similar to adaptive batching, dynamic batching also allows for the selection of the optimal batch size during inference. To use dynamic batching in Triton, relevant settings can be specified in the
`model configuration <https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#model-configuration>`_ file.
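
As an illustration, the relevant portion of a model's ``config.pbtxt`` might look like the following sketch; the batch sizes and queue delay shown are placeholder values rather than recommendations.

.. code-block:: protobuf

    # Maximum batch size Triton will form for this model.
    max_batch_size: 8

    # Enable the dynamic batcher: requests are grouped into preferred batch sizes,
    # waiting at most the given delay before a batch is dispatched.
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }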

.. admonition:: 🚧 Help us improve the integration!

    This integration is still in its early stages, and we are looking for feedback and contributions to make it even better!
33 changes: 10 additions & 23 deletions examples/triton/onnx/README.md
@@ -22,6 +22,16 @@ triton_runner = bentoml.triton.Runner(
)
```

CLI arguments can be passed through the `cli_args` argument of `bentoml.triton.Runner`:

```python
triton_runner = bentoml.triton.Runner(
    "triton-runners",
    model_repository="s3://path/to/model_repository",
    cli_args=["--load-model=torchscript_yolov5s", "--model-control-mode=explicit"],
)
```

An example of an inference API:

```python
@@ -60,29 +70,6 @@ docker:
> `tritonserver` is currently only supported with the `--production` flag. Make sure
> to have the `tritonserver` binary available in PATH if running locally.

To pass Triton arguments to `serve`, use `--triton-options ARG=VALUE[, VALUE]`:

```bash
bentoml serve --production --triton-options log-verbose=True
```

or via `bentoml.serve`:

```python
import bentoml
server = bentoml.serve(
    bento,
    server_type='grpc',
    production=True,
    triton_args=[
        "model-control-mode=explicit",
        "load-model=onnx_yolov5s",
    ],
)
```

To find out more about BentoML Runner architecture, see
[our latest documentation](https://docs.bentoml.org/en/latest/concepts/runner.html#)

1 change: 1 addition & 0 deletions examples/triton/onnx/train.py
@@ -29,6 +29,7 @@
        raise bentoml.exceptions.NotFound(
            "'override=True', overriding previously saved weights/conversions."
        )
    print(f"{bento_model_name} already exists. Skipping...")
except bentoml.exceptions.NotFound:
    ModelProto = onnx.load(MODEL_FILE.with_suffix(".onnx").__fspath__())
    onnx_checker.check_model(ModelProto)
36 changes: 10 additions & 26 deletions examples/triton/pytorch/README.md
@@ -22,6 +22,16 @@ triton_runner = bentoml.triton.Runner(
)
```

CLI arguments can be passed through the `cli_args` argument of `bentoml.triton.Runner`:

```python
triton_runner = bentoml.triton.Runner(
    "triton-runners",
    model_repository="s3://path/to/model_repository",
    cli_args=["--load-model=torchscript_yolov5s", "--model-control-mode=explicit"],
)
```

An example of an inference API:

```python
@@ -57,32 +67,6 @@ docker:
base_image: nvcr.io/nvidia/tritonserver:22.12-py3
```
> `tritonserver` is currently only supported with the `--production` flag. Make sure
> to have the `tritonserver` binary available in PATH if running locally.

To pass Triton arguments to `serve`, use `--triton-options ARG=VALUE[, VALUE]`:

```bash
bentoml serve --production --triton-options log-verbose=True
```

or via `bentoml.serve`:

```python
import bentoml
server = bentoml.serve(
    bento,
    server_type='grpc',
    production=True,
    triton_args=[
        "model-control-mode=explicit",
        "load-model=pytorch_yolov5s",
    ],
)
```

To find out more about BentoML Runner architecture, see
[our latest documentation](https://docs.bentoml.org/en/latest/concepts/runner.html#)
1 change: 1 addition & 0 deletions examples/triton/pytorch/train.py
@@ -33,6 +33,7 @@
        raise bentoml.exceptions.NotFound(
            "'override=True', overriding previously saved weights/conversions."
        )
    print(f"{bento_model_name} already exists. Skipping...")
except bentoml.exceptions.NotFound:
    print(
        "Saved model:",
36 changes: 10 additions & 26 deletions examples/triton/tensorflow/README.md
@@ -22,6 +22,16 @@ triton_runner = bentoml.triton.Runner(
)
```

CLI arguments can be passed through the `cli_args` argument of `bentoml.triton.Runner`:

```python
triton_runner = bentoml.triton.Runner(
    "triton-runners",
    model_repository="s3://path/to/model_repository",
    cli_args=["--load-model=torchscript_yolov5s", "--model-control-mode=explicit"],
)
```

An example of an inference API:

```python
@@ -57,32 +67,6 @@ docker:
base_image: nvcr.io/nvidia/tritonserver:22.12-py3
```
> `tritonserver` is currently only supported with the `--production` flag. Make sure
> to have the `tritonserver` binary available in PATH if running locally.

To pass Triton arguments to `serve`, use `--triton-options ARG=VALUE[, VALUE]`:

```bash
bentoml serve --production --triton-options log-verbose=True
```

or via `bentoml.serve`:

```python
import bentoml
server = bentoml.serve(
    bento,
    server_type='grpc',
    production=True,
    triton_args=[
        "model-control-mode=explicit",
        "load-model=tensorflow_yolov5s",
    ],
)
```

To find out more about BentoML Runner architecture, see
[our latest documentation](https://docs.bentoml.org/en/latest/concepts/runner.html#)
1 change: 1 addition & 0 deletions examples/triton/tensorflow/train.py
@@ -27,6 +27,7 @@
        raise bentoml.exceptions.NotFound(
            "'override=True', overriding previously saved weights/conversions."
        )
    print(f"{bento_model_name} already exists. Skipping...")
except bentoml.exceptions.NotFound:
    _, metadata = load_traced_script()
    model = tf.saved_model.load(
