LightLLM


LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.

English Docs | 中文文档 | Blogs

Features

  • Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
  • Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
  • Dynamic Batch: enables dynamic batch scheduling of requests.
  • FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
  • Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
  • Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference (see the sketch after this list).
  • High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
  • Int8KV Cache: nearly doubles the token capacity of the KV cache; currently only supported for LLaMA.
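
The sketch below illustrates the general idea behind token-wise KV cache management. It is an illustrative toy, not LightLLM's actual implementation: the class and names are invented here to show how per-token cache slots can be handed out and reclaimed with no padding or over-reservation.

```python
# Illustrative toy (not LightLLM's internal API): a fixed pool of per-token
# KV cache slots is allocated up front and handed out / reclaimed per token.
from collections import deque


class TokenSlotPool:
    def __init__(self, max_total_tokens):
        # Each slot corresponds to the KV cache of exactly one token.
        self.free_slots = deque(range(max_total_tokens))
        self.request_slots = {}  # request_id -> list of slot indices

    def alloc(self, request_id, num_tokens):
        # Hand out exactly as many slots as new tokens, no padding.
        if num_tokens > len(self.free_slots):
            raise MemoryError("not enough free token slots")
        slots = [self.free_slots.popleft() for _ in range(num_tokens)]
        self.request_slots.setdefault(request_id, []).extend(slots)
        return slots

    def free(self, request_id):
        # Return all of a finished request's slots to the pool.
        self.free_slots.extend(self.request_slots.pop(request_id, []))


pool = TokenSlotPool(max_total_tokens=8)
pool.alloc("req-1", 3)  # prefill: 3 prompt tokens
pool.alloc("req-1", 1)  # decode: one new token per step
pool.free("req-1")      # all slots reclaimed, zero waste
```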

Supported Model List

The following table lists the supported models, along with any special launch arguments they require and related notes (an example of passing these arguments follows the table).

| Model Name | Comments |
|------------|----------|
| BLOOM | None |
| LLaMA | None |
| LLaMA V2 | None |
| StarCoder | None |
| Qwen-7b | --eos_id 151643 --trust_remote_code |
| ChatGLM2-6b | --trust_remote_code |
| InternLM-7b | --trust_remote_code |
| InternVL-Chat | --eos_id 32007 --trust_remote_code (Phi3) or --eos_id 92542 --trust_remote_code (InternLM2) |
| Qwen-VL | None |
| Qwen-VL-Chat | None |
| Qwen2-VL | --eos_id 151645 --trust_remote_code, and run pip install git+https://github.com/huggingface/transformers |
| Llava-7b | None |
| Llava-13b | None |
| Mixtral | None |
| Stablelm | --trust_remote_code |
| MiniCPM | None |
| Phi-3 | Only supports Mini and Small |
| CohereForAI | None |
| DeepSeek-V2-Lite | --data_type bfloat16 |
| DeepSeek-V2 | --data_type bfloat16 |
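
As a concrete illustration of passing the arguments above, the following sketch composes a launch command for the lightllm.server.api_server entry point. Treat it as a hedged example: the model path and token budget are placeholders, and you should check the docs for the flags that match your deployment.

```python
# Hedged sketch: build a LightLLM launch command that includes the per-model
# arguments from the table above (here, Qwen-7b). The model path and
# --max_total_token_num value are placeholders, not recommendations.
import subprocess

model_dir = "/data/Qwen-7B"  # placeholder path, e.g. mounted into the container
extra_args = ["--eos_id", "151643", "--trust_remote_code"]  # from the table

cmd = [
    "python", "-m", "lightllm.server.api_server",
    "--model_dir", model_dir,
    "--host", "0.0.0.0",
    "--port", "8080",
    "--tp", "1",
    "--max_total_token_num", "120000",
    *extra_args,
]
subprocess.run(cmd, check=True)
```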

Get started

Installation

Use lightllm with docker.

docker pull ghcr.io/modeltc/lightllm:main

To start a container with GPU support and port mapping:

docker run -it --gpus all -p 8080:8080                  \
        --shm-size 1g -v your_local_path:/data/         \
        ghcr.io/modeltc/lightllm:main /bin/bash
Note: if multiple GPUs are used, the `--shm-size` value in the `docker run` command should be increased.

Alternatively, you can build the docker image or install from source with pip.

Quick Start

LightLLM provides LLM inference services with state-of-the-art throughput via its efficient request router and Token Attention.

We provide examples to launch the LightLLM service and query the model (via console and python) for both text and multimodal models.
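
For example, once a text model service is listening on port 8080, it can be queried over HTTP roughly as follows. This is a minimal sketch: the /generate endpoint and payload shape follow the linked guides, while the prompt and sampling values are placeholders.

```python
# Minimal sketch of querying a running LightLLM text service over HTTP.
# The prompt and sampling parameters below are placeholders.
import json

import requests

url = "http://localhost:8080/generate"
headers = {"Content-Type": "application/json"}
payload = {
    "inputs": "What is AI?",
    "parameters": {
        "do_sample": False,
        "ignore_eos": False,
        "max_new_tokens": 128,
    },
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```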

  • Quick Start

  • Text Model Service

  • Multimodal Model Service

    Note: the additional parameters for multimodal models (--enable_multimodal, --cache_capacity) require a larger --shm-size. If LightLLM is run with --tp > 1, the visual model runs on GPU 0. The input images field is a list of dicts of the form {'type': 'url' or 'base64', 'data': xxx}. The special image tag is <img></img> for Qwen-VL (<image> for Llava); the length of data["multimodal_params"]["images"] must match the number of image tags, which can be 0, 1, 2, ... A request sketch follows this list.
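
Building on the note above, a multimodal request (here for Qwen-VL) can be assembled roughly as follows. This is a sketch that only uses the conventions stated in the note (one <img></img> tag per entry in multimodal_params["images"]); the image URL and prompt are placeholders.

```python
# Sketch of a multimodal query for Qwen-VL: one <img></img> tag in the prompt
# per entry in multimodal_params["images"]. The image URL and prompt are
# placeholders.
import json

import requests

url = "http://localhost:8080/generate"
headers = {"Content-Type": "application/json"}
payload = {
    "inputs": "<img></img> Describe this picture.",
    "parameters": {"max_new_tokens": 128},
    "multimodal_params": {
        "images": [
            {"type": "url", "data": "https://example.com/cat.jpg"},
            # or: {"type": "base64", "data": "<base64-encoded image bytes>"}
        ],
    },
}
response = requests.post(url, headers=headers, data=json.dumps(payload))
print(response.json())
```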

Other

Please refer to the documentation for more information.

Performance

LightLLM provides high-throughput serving. The performance comparison between LightLLM and vLLM is shown here: against vLLM 0.1.2, we achieved up to 2x higher throughput.

FAQ

Please refer to the FAQ for more information.

Projects using lightllm

We welcome any cooperation and contributions. If your project requires LightLLM's support, please contact us via email or create a pull request.

  1. LazyLLM: the easiest and laziest way to build multi-agent LLM applications.

    Once you have installed lightllm and lazyllm, you can use the following code to build your own chatbot:

    from lazyllm import TrainableModule, deploy, WebModule
    # The model will be downloaded automatically if you have an internet connection
    m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
    WebModule(m).start().wait()

    Documents: https://lazyllm.readthedocs.io/

Star History

Star History Chart

Community

For further information and discussion, join our discord server.

License

This repository is released under the Apache-2.0 license.

Acknowledgement

We learned a lot from the following projects when developing LightLLM.
