LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
English Docs | 中文文档 | Blogs
- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
- Dynamic Batch: enables dynamic batch scheduling of requests.
- FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
- Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the illustrative sketch after this list).
- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
- Int8KV Cache: nearly doubles the token capacity of the KV cache. Currently only LLaMA models are supported.
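To make the Token Attention idea concrete, below is a purely illustrative Python sketch (not LightLLM's actual implementation) of a token-wise KV cache pool: memory is handed out one token slot at a time, so requests of very different lengths never pay for padding, and finished requests return exactly the slots they used.

```python
# Purely illustrative sketch of token-wise KV cache management (not LightLLM code):
# the cache is a pool of per-token slots, so each request occupies exactly as many
# slots as it has tokens and returns them when it finishes.
from typing import Dict, List


class TokenKVCachePool:
    def __init__(self, total_token_slots: int):
        # Indices into a hypothetical preallocated KV tensor, one slot per token.
        self.free_slots: List[int] = list(range(total_token_slots))
        self.request_slots: Dict[str, List[int]] = {}

    def alloc(self, request_id: str, num_tokens: int) -> List[int]:
        if num_tokens > len(self.free_slots):
            raise MemoryError("not enough free token slots")
        slots = [self.free_slots.pop() for _ in range(num_tokens)]
        self.request_slots.setdefault(request_id, []).extend(slots)
        return slots

    def free(self, request_id: str) -> None:
        self.free_slots.extend(self.request_slots.pop(request_id, []))


# Two requests with very different lengths share the pool with zero padding waste.
pool = TokenKVCachePool(total_token_slots=16)
pool.alloc("req-A", 3)    # short prompt
pool.alloc("req-B", 10)   # long prompt
pool.free("req-A")        # req-A finishes; its 3 slots return to the pool
```

A router built on top of such a pool can admit new requests whenever enough token slots are free, which is the role the High-performance Router plays above.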
The following table lists the supported models, along with any special launch arguments they require and related notes. These arguments are passed when starting the LightLLM server (an illustrative launch example follows the table).
| Model Name | Comments |
|---|---|
| BLOOM | None |
| LLaMA | None |
| LLaMA V2 | None |
| StarCoder | None |
| Qwen-7b | --eos_id 151643 --trust_remote_code |
| ChatGLM2-6b | --trust_remote_code |
| InternLM-7b | --trust_remote_code |
| InternVL-Chat | --eos_id 32007 --trust_remote_code (Phi3) or --eos_id 92542 --trust_remote_code (InternLM2) |
| Qwen-VL | None |
| Qwen-VL-Chat | None |
| Qwen2-VL | --eos_id 151645 --trust_remote_code, and run pip install git+https://github.com/huggingface/transformers |
| Llava-7b | None |
| Llava-13b | None |
| Mixtral | None |
| Stablelm | --trust_remote_code |
| MiniCPM | None |
| Phi-3 | Only supports Mini and Small |
| CohereForAI | None |
| DeepSeek-V2-Lite | --data_type bfloat16 |
| DeepSeek-V2 | --data_type bfloat16 |
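As an illustration of where these arguments go, here is a minimal sketch of launching the API server for Qwen-7b from the table above. The `--model_dir` path is a placeholder, and the remaining flags (`--host`, `--port`, `--tp`, `--max_total_token_num`) follow the launch example from the project's documentation; adjust them to your model and hardware.

```python
# Illustrative launch of the LightLLM API server for Qwen-7b, including the
# special arguments listed in the table (--eos_id 151643 --trust_remote_code).
# The model directory is a placeholder (e.g. a path mounted with -v in docker run).
import subprocess

subprocess.run([
    "python", "-m", "lightllm.server.api_server",
    "--model_dir", "/data/Qwen-7b",          # placeholder model path
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tp", "1",
    "--max_total_token_num", "120000",
    "--eos_id", "151643",                    # from the table above
    "--trust_remote_code",                   # from the table above
], check=True)
```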
Use lightllm with docker
Pull the latest image:
docker pull ghcr.io/modeltc/lightllm:main
To start a container with GPU support and port mapping:
docker run -it --gpus all -p 8080:8080 \
--shm-size 1g -v your_local_path:/data/ \
ghcr.io/modeltc/lightllm:main /bin/bash
Note: If multiple GPUs are used, `--shm-size` in `docker run` command should be increased.
Alternatively, you can build the docker image or install from source with pip.
LightLLM provides LLM inference services with state-of-the-art throughput via its efficient request router and Token Attention.
We provide examples of launching the LightLLM service and querying the model (via console and Python) for both text-only and multimodal models.
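For example, the sketch below queries a running server from Python. It assumes the server listens on port 8080 (as mapped in the `docker run` command above) and exposes the `/generate` route used in the project's examples; the prompt and sampling parameters are illustrative.

```python
# Minimal sketch: query a running LightLLM server over HTTP.
# Assumes the server listens on localhost:8080 and serves the /generate route
# from the project's examples; adjust host, port, and parameters as needed.
import requests

data = {
    "inputs": "What is AI?",
    "parameters": {
        "do_sample": False,
        "max_new_tokens": 128,
    },
}

response = requests.post("http://localhost:8080/generate", json=data)
response.raise_for_status()
print(response.json())
```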
- Note: the additional parameters for multimodal models (`--enable_multimodal`, `--cache_capacity`) require a larger `--shm-size`. If lightllm is run with `--tp > 1`, the visual model runs on GPU 0. Input images are passed as a list of dicts like `{'type': 'url'/'base64', 'data': xxx}`. The special image tag is `<img></img>` for Qwen-VL (`<image>` for Llava); the length of `data["multimodal_params"]["images"]` should equal the number of image tags, which can be 0, 1, 2, ... (see the example request below).
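For instance, a multimodal request for Qwen-VL might look like the sketch below: one `<img></img>` tag in the prompt matched by one entry in `multimodal_params["images"]`. The endpoint, port, and image URL are assumptions; adapt them to your deployment.

```python
# Illustrative multimodal request for Qwen-VL: the single <img></img> tag in the
# prompt is matched by exactly one entry in multimodal_params["images"].
# Endpoint, port, and image URL are placeholders.
import requests

data = {
    "inputs": "<img></img> Describe this picture.",
    "parameters": {"max_new_tokens": 128},
    "multimodal_params": {
        "images": [
            {"type": "url", "data": "https://example.com/example.jpg"},  # placeholder image
        ]
    },
}

response = requests.post("http://localhost:8080/generate", json=data)
print(response.json())
```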
Please refer to the documentation for more information.
LightLLM provides high-throughput services. A performance comparison between LightLLM and vLLM is shown here: measured against vLLM 0.1.2, LightLLM achieved 2x higher throughput.
Please refer to the FAQ for more information.
We welcome any cooperation and contribution. If you have a project that requires LightLLM's support, please contact us via email or create a pull request.
- LazyLLM: Easiest and laziest way to build multi-agent LLM applications.
  Once you have installed `lightllm` and `lazyllm`, you can use the following code to build your own chatbot:
  from lazyllm import TrainableModule, deploy, WebModule
  # The model will be downloaded automatically if you have an internet connection
  m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
  WebModule(m).start().wait()
Documents: https://lazyllm.readthedocs.io/
For further information and discussion, join our discord server.
This repository is released under the Apache-2.0 license.
We learned a lot from the following projects when developing LightLLM.