Trained models are deployed in ModelMesh Serving via Predictors. These represent a stable service endpoint behind which the underlying model can change.
Models must reside on shared storage. Currently, only S3-based storage is supported, but support for other types will follow.
Note that model data residing at a particular path within a given storage instance is assumed to be immutable. Different versions of the same logical model are treated at the base level as independent models and must reside at different paths. In particular, where a given model server/runtime natively supports the notion of versioning (such as Nvidia Triton or TensorFlow Serving), the provided path should not point to the top of a (pseudo-)directory structure containing multiple versions. Instead, point to the subdirectory corresponding to a specific version.
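For example, a Triton-style model repository might be laid out in the bucket like this (the names below are purely illustrative):

my-model/config.pbtxt
my-model/1/model.savedmodel/...
my-model/2/model.savedmodel/...

In that case the Predictor's path should point at a specific version subdirectory such as my-model/1, not at my-model itself.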
The ModelMesh Serving instance should be installed in the desired namespace. See install docs for more details.
If ModelMesh Serving was installed using the install script with the --quickstart argument, a locally deployed MinIO instance should be available. Check the storage-config secret to confirm its connection details:
$ kubectl get secret storage-config -o json
There should be a secret key called localMinIO that looks like:
{
  "type": "s3",
  "access_key": "AKIAIOSFODNN7EXAMPLE",
  "secret_access_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
  "endpoint_url": "http://minio:9000",
  "default_bucket": "modelmesh-example-models"
}
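To inspect the decoded value yourself, one option (assuming a shell with base64 available) is to pull the key out of the secret with a JSONPath query:

# secret data is base64-encoded; on some platforms the decode flag is -d or -D
$ kubectl get secret storage-config -o jsonpath='{.data.localMinIO}' | base64 --decode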
The config/example-predictors directory contains Predictor manifests for many of the example models. For a list of available models, see the example models documentation.
Here we are deploying an sklearn model located at sklearn/mnist-svm.joblib within the MinIO storage.
# Pulled from sample config/example-predictors/example-mlserver-sklearn-mnist-predictor.yaml
$ kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: Predictor
metadata:
  name: example-mnist-predictor
spec:
  modelType:
    name: sklearn
  path: sklearn/mnist-svm.joblib
  storage:
    s3:
      secretKey: localMinIO
EOF
predictor.serving.kserve.io/example-mnist-predictor created
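Alternatively, if you have the modelmesh-serving repository checked out locally, the same manifest can be applied straight from the sample file referenced in the comment above:

$ kubectl apply -f config/example-predictors/example-mlserver-sklearn-mnist-predictor.yaml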
Note that localMinIO is the name of the secret key verified in the previous step.
For more details, see the Predictor Spec page.
Once the Predictor is created, mlserver runtime pods are automatically started to load and serve it.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
modelmesh-serving-mlserver-0.x-658b7dd689-46nwm 0/3 ContainerCreating 0 2s
modelmesh-controller-568c45b959-nl88c 1/1 Running 0 11m
$ kubectl get predictors
NAME TYPE AVAILABLE ACTIVEMODEL TARGETMODEL TRANSITION AGE
example-mnist-predictor sklearn true Loading UpToDate 60s
$ kubectl get predictor example-mnist-predictor -o=jsonpath='{.status.grpcEndpoint}'
grpc://modelmesh-serving:8033
The states should reflect immediate availability, but may take some seconds to move from Loading to Loaded.
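To follow this transition without re-running the command, the resource can be watched:

# streams updates until interrupted (Ctrl+C)
$ kubectl get predictor example-mnist-predictor -w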
Inference requests for this Predictor received prior to loading completion will block until loading completes.
See the Predictor Status section for details of how to interpret the different states.
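The printed columns correspond to fields on the Predictor's status, so the states can also be queried directly; the field names below assume the current Predictor CRD schema:

$ kubectl get predictor example-mnist-predictor -o jsonpath='{.status.activeModelState} {.status.transitionStatus}{"\n"}'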
Note
When ScaleToZero is enabled, the first Predictor assigned to the Triton runtime may be stuck in the Pending state for some time while the Triton pods are being created. The Triton image is large and may take a while to download.
The built-in runtimes implement the gRPC protocol of the KServe Predict API Version 2. The .proto file for this API can be downloaded from KServe's repo or from the modelmesh-serving repository at fvt/proto/kfs_inference_v2.proto.
To send an inference request, configure your gRPC client to point to address modelmesh-serving:8033 and construct a request to the model using the ModelInfer RPC, setting the name of the Predictor as the model_name field in the ModelInferRequest message.
Here is an example of how to do this using the command-line tool grpcurl:
Port-forward to access the runtime service:
# access via localhost:8033
$ kubectl port-forward service/modelmesh-serving 8033
Forwarding from 127.0.0.1:8033 -> 8033
Forwarding from [::1]:8033 -> 8033
In a separate terminal window, send an inference request using the proto file from fvt/proto or one that you have locally. Note that you have to set the model_name in the data payload to the name of the deployed Predictor.
$ grpcurl -plaintext -proto fvt/proto/kfs_inference_v2.proto localhost:8033 list
inference.GRPCInferenceService
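grpcurl can also describe the service, which prints its RPC methods (including ModelInfer) as defined in the supplied proto file:

$ grpcurl -plaintext -proto fvt/proto/kfs_inference_v2.proto localhost:8033 describe inference.GRPCInferenceService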
# run inference
# with below input, expect output to be 8
$ grpcurl -plaintext -proto fvt/proto/kfs_inference_v2.proto -d '{ "model_name": "example-mnist-predictor", "inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": { "fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0] }}]}' localhost:8033 inference.GRPCInferenceService.ModelInfer
{
  "modelName": "example-mnist-predictor__ksp-7702c1b55a",
  "outputs": [
    {
      "name": "predict",
      "datatype": "FP32",
      "shape": [
        "1"
      ],
      "contents": {
        "fp32Contents": [
          8
        ]
      }
    }
  ]
}
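Because the request body is fairly large, it may be more convenient to keep it in a file and pipe it in; grpcurl reads the request from stdin when -d @ is given (mnist-request.json is a hypothetical file containing the JSON payload shown above):

$ grpcurl -plaintext -proto fvt/proto/kfs_inference_v2.proto -d @ localhost:8033 inference.GRPCInferenceService.ModelInfer < mnist-request.json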
Changes can be made to the Predictor's spec, such as changing the target storage and/or model, without interrupting the inference service. The Predictor will continue to use the prior spec/model until the new one is loaded and ready.
Below, we are changing the Predictor to use a completely different model. In practice, the schema of the Predictor's model would remain consistent across updates, even if the model type or ML framework changes.
$ kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: Predictor
metadata:
  name: example-mnist-predictor
spec:
  modelType:
    name: tensorflow
  # Note updated model type and location
  path: tensorflow/mnist.savedmodel
  storage:
    s3:
      secretKey: localMinIO
EOF
predictor.serving.kserve.io/example-mnist-predictor configured
$ kubectl get predictors
NAME TYPE AVAILABLE ACTIVEMODEL TARGETMODEL TRANSITION AGE
example-mnist-predictor tensorflow true Loaded Loading InProgress 10m
The "transition" state of the Predictor will be InProgress
while waiting for the new backing model to be ready,
and return to UpToDate
once the transition is complete.
$ kubectl get predictors
NAME TYPE AVAILABLE ACTIVEMODEL TARGETMODEL TRANSITION AGE
example-mnist-predictor tensorflow true Loaded UpToDate 31m
If there is a problem loading the new model (for example, it does not exist at the specified path), the transition state will change to BlockedByFailedLoad, but the service will remain available. The active model state will still show as Loaded, and the Predictor will continue to serve the prior model.
$ kubectl get predictors
NAME TYPE AVAILABLE ACTIVEMODEL TARGETMODEL TRANSITION AGE
example-mnist-predictor tensorflow true Loaded Failed BlockedByFailedLoad 20m
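To see why the load failed, inspect the Predictor's status directly; the full resource includes details about the most recent load failure (the exact status field layout may vary between versions):

$ kubectl get predictor example-mnist-predictor -o yaml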