LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
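As a rough sketch of what "serving" means here, the following minimal example uses LMDeploy's Python pipeline API. It assumes lmdeploy is installed with GPU support, and the model ID is a placeholder for any chat model the toolkit supports, not something prescribed by this listing.

# Minimal LMDeploy serving sketch (assumes `pip install lmdeploy` and a CUDA GPU).
# The model ID below is an assumption; substitute any model LMDeploy supports.
from lmdeploy import pipeline

pipe = pipeline("internlm/internlm2_5-7b-chat")
responses = pipe(["What does an inference server do?"])
print(responses[0].text)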
Serving Example of CodeGen-350M-Mono-GPTJ on Triton Inference Server with Docker and Kubernetes
Deploy KoGPT with Triton Inference Server
Tutorial on how to deploy a scalable autoregressive causal language model (a transformer) using NVIDIA Triton Inference Server.
This repository is a code sample for serving large language models (LLMs) on a Google Kubernetes Engine (GKE) cluster with GPUs, running NVIDIA Triton Inference Server with the FasterTransformer backend.
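The repositories above all revolve around sending inference requests to Triton, so a hedged client-side sketch may help. It uses the official tritonclient HTTP API; the model name ("fastertransformer"), the tensor names, and their dtypes are assumptions that must match the deployed model's config.pbtxt, and the extra length tensors reflect what FasterTransformer GPT-style models typically expect.

# Sketch of querying a Triton model over HTTP
# (assumes `pip install tritonclient[http]` and a server on localhost:8000).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical token IDs; a real client would run the model's tokenizer first.
input_ids = np.array([[1, 2, 3, 4]], dtype=np.uint32)
input_lengths = np.array([[4]], dtype=np.uint32)        # tokens per request
request_output_len = np.array([[16]], dtype=np.uint32)  # tokens to generate

# Tensor names and dtypes are assumptions tied to the model's config.pbtxt.
inputs = []
for name, arr in [("input_ids", input_ids),
                  ("input_lengths", input_lengths),
                  ("request_output_len", request_output_len)]:
    tensor = httpclient.InferInput(name, list(arr.shape), "UINT32")
    tensor.set_data_from_numpy(arr)
    inputs.append(tensor)

result = client.infer(model_name="fastertransformer", inputs=inputs)
print(result.as_numpy("output_ids"))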