LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.
English Docs | 中文文档 | Blogs
- Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
- Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
- Dynamic Batch: enables dynamic batch scheduling of requests.
- FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
- Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
- Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the illustrative sketch after this list).
- High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
- Int8KV Cache: nearly doubles the token capacity of the KV cache. Currently only LLaMA models are supported.
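To make the Token Attention idea concrete, below is a purely illustrative Python sketch (not LightLLM's actual implementation) of a token-wise KV cache pool: memory is handed out one token slot at a time, so requests of very different lengths never pay for padding, and finished requests return exactly the slots they used.

```python
# Purely illustrative sketch of token-wise KV cache management (not LightLLM code):
# the cache is a pool of per-token slots, so each request occupies exactly as many
# slots as it has tokens and returns them when it finishes.
from typing import Dict, List


class TokenKVCachePool:
    def __init__(self, total_token_slots: int):
        # Indices into a hypothetical preallocated KV tensor, one slot per token.
        self.free_slots: List[int] = list(range(total_token_slots))
        self.request_slots: Dict[str, List[int]] = {}

    def alloc(self, request_id: str, num_tokens: int) -> List[int]:
        if num_tokens > len(self.free_slots):
            raise MemoryError("not enough free token slots")
        slots = [self.free_slots.pop() for _ in range(num_tokens)]
        self.request_slots.setdefault(request_id, []).extend(slots)
        return slots

    def free(self, request_id: str) -> None:
        self.free_slots.extend(self.request_slots.pop(request_id, []))


# Two requests with very different lengths share the pool with zero padding waste.
pool = TokenKVCachePool(total_token_slots=16)
pool.alloc("req-A", 3)    # short prompt
pool.alloc("req-B", 10)   # long prompt
pool.free("req-A")        # req-A finishes; its 3 slots return to the pool
```

A router built on top of such a pool can admit new requests whenever enough token slots are free, which is the role the High-performance Router plays above.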
The following table lists the supported models, along with any special launch arguments they require and related notes. These arguments are passed when starting the LightLLM server (an illustrative launch example follows the table).
| Model Name | Comments |
|---|---|
| BLOOM | None |
| LLaMA | None |
| LLaMA V2 | None |
| StarCoder | None |
| Qwen-7b | --eos_id 151643 --trust_remote_code |
| ChatGLM2-6b | --trust_remote_code |
| InternLM-7b | --trust_remote_code |
| InternVL-Chat | --eos_id 32007 --trust_remote_code (Phi3) or --eos_id 92542 --trust_remote_code (InternLM2) |
| Qwen-VL | None |
| Qwen-VL-Chat | None |
| Qwen2-VL | --eos_id 151645 --trust_remote_code, and run pip install git+https://github.com/huggingface/transformers |
| Llava-7b | None |
| Llava-13b | None |
| Mixtral | None |
| Stablelm | --trust_remote_code |
| MiniCPM | None |
| Phi-3 | Only supports Mini and Small |
| CohereForAI | None |
| DeepSeek-V2-Lite | --data_type bfloat16 |
| DeepSeek-V2 | --data_type bfloat16 |
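As an illustration of where these arguments go, here is a minimal sketch of launching the API server for Qwen-7b from the table above. The `--model_dir` path is a placeholder, and the remaining flags (`--host`, `--port`, `--tp`, `--max_total_token_num`) follow the launch example from the project's documentation; adjust them to your model and hardware.

```python
# Illustrative launch of the LightLLM API server for Qwen-7b, including the
# special arguments listed in the table (--eos_id 151643 --trust_remote_code).
# The model directory is a placeholder (e.g. a path mounted with -v in docker run).
import subprocess

subprocess.run([
    "python", "-m", "lightllm.server.api_server",
    "--model_dir", "/data/Qwen-7b",          # placeholder model path
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tp", "1",
    "--max_total_token_num", "120000",
    "--eos_id", "151643",                    # from the table above
    "--trust_remote_code",                   # from the table above
], check=True)
```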
Use lightllm with docker
Pull the latest image:
docker pull ghcr.io/modeltc/lightllm:main
To start a container with GPU support and port mapping:
docker run -it --gpus all -p 8080:8080 \
--shm-size 1g -v your_local_path:/data/ \
ghcr.io/modeltc/lightllm:main /bin/bash
Note: If multiple GPUs are used, `--shm-size` in `docker run` command should be increased.
Alternatively, you can build the docker image or install from source with pip.
LightLLM provides LLM inference services with state-of-the-art throughput via its efficient request router and Token Attention.
We provide examples of launching the LightLLM service and querying the model (via console and Python) for both text-only and multimodal models.
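For example, the sketch below queries a running server from Python. It assumes the server listens on port 8080 (as mapped in the `docker run` command above) and exposes the `/generate` route used in the project's examples; the prompt and sampling parameters are illustrative.

```python
# Minimal sketch: query a running LightLLM server over HTTP.
# Assumes the server listens on localhost:8080 and serves the /generate route
# from the project's examples; adjust host, port, and parameters as needed.
import requests

data = {
    "inputs": "What is AI?",
    "parameters": {
        "do_sample": False,
        "max_new_tokens": 128,
    },
}

response = requests.post("http://localhost:8080/generate", json=data)
response.raise_for_status()
print(response.json())
```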
- Note: the additional parameters for multimodal models (`--enable_multimodal`, `--cache_capacity`) require a larger `--shm-size`. If lightllm is run with `--tp > 1`, the visual model runs on GPU 0. Input images are passed as a list of dicts like `{'type': 'url'/'base64', 'data': xxx}`. The special image tag is `<img></img>` for Qwen-VL (`<image>` for Llava); the length of `data["multimodal_params"]["images"]` should equal the number of image tags, which can be 0, 1, 2, ... (see the example request below).
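For instance, a multimodal request for Qwen-VL might look like the sketch below: one `<img></img>` tag in the prompt matched by one entry in `multimodal_params["images"]`. The endpoint, port, and image URL are assumptions; adapt them to your deployment.

```python
# Illustrative multimodal request for Qwen-VL: the single <img></img> tag in the
# prompt is matched by exactly one entry in multimodal_params["images"].
# Endpoint, port, and image URL are placeholders.
import requests

data = {
    "inputs": "<img></img> Describe this picture.",
    "parameters": {"max_new_tokens": 128},
    "multimodal_params": {
        "images": [
            {"type": "url", "data": "https://example.com/example.jpg"},  # placeholder image
        ]
    },
}

response = requests.post("http://localhost:8080/generate", json=data)
print(response.json())
```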
Please refer to the documentation for more information.
LightLLM provides high-throughput services. A performance comparison between LightLLM and vLLM is shown here: measured against vLLM 0.1.2, LightLLM achieved 2x higher throughput.
Please refer to the FAQ for more information.
We welcome any cooperation and contribution. If you have a project that requires LightLLM's support, please contact us via email or create a pull request.
- LazyLLM: Easiest and laziest way to build multi-agent LLM applications.
  Once you have installed `lightllm` and `lazyllm`, you can use the following code to build your own chatbot:
  from lazyllm import TrainableModule, deploy, WebModule
  # The model will be downloaded automatically if you have an internet connection
  m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
  WebModule(m).start().wait()
Documents: https://lazyllm.readthedocs.io/
For further information and discussion, join our discord server.
This repository is released under the Apache-2.0 license.
We learned a lot from the following projects when developing LightLLM.