diff --git a/docs/router/router.md b/docs/router/router.md index c1c70f2bf64..695902c3a02 100644 --- a/docs/router/router.md +++ b/docs/router/router.md @@ -7,14 +7,14 @@ The router is a independent Python package, and it can be used as a drop-in repl ## Installation ```bash -pip install sglang-router +$ pip install sglang-router ``` Detailed usage of the router can be found in [launch_router](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang_router/launch_router.py) and [launch_server](https://github.com/sgl-project/sglang/blob/main/rust/py_src/sglang/launch_server.py). Also, you can directly run the following command to see the usage of the router. ```bash -python -m sglang_router.launch_server --help -python -m sglang_router.launch_router --help +$ python -m sglang_router.launch_server --help +$ python -m sglang_router.launch_router --help ``` The router supports two working modes: @@ -27,7 +27,7 @@ The router supports two working modes: This will be a drop-in replacement for the existing `--dp-size` arguement of SGLang Runtime. Under the hood, it uses multi-processes to launch multiple workers, wait for them to be ready, then connect the router to all workers. ```bash -python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1 +$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 1 ``` After the server is ready, you can directly send requests to the router as the same way as sending requests to each single worker. @@ -47,12 +47,62 @@ print(response.json()) This is useful for multi-node DP. First, launch workers on multiple nodes, then launch a router on the main node, and connect the router to all workers. 
```bash -python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2 +$ python -m sglang_router.launch_router --worker-urls http://worker_url_1 http://worker_url_2 ``` -## Strategies +## Dynamic Scaling APIs -### Cache-Aware Load-Balancing Router +We offer `/add_worker` and `/remove_worker` APIs to dynamically add or remove workers from the router. + +- `/add_worker` + +Usage: + +```bash +$ curl -X POST "http://localhost:30000/add_worker?url=http://worker_url_1" +``` + +Example: + +```bash +$ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30001 +$ curl -X POST "http://localhost:30000/add_worker?url=http://127.0.0.1:30001" +Successfully added worker: http://127.0.0.1:30001 +``` + +- `/remove_worker` + +Usage: + +```bash +$ curl -X POST "http://localhost:30000/remove_worker?url=http://worker_url_1" +``` + +Example: + +```bash +$ curl -X POST "http://localhost:30000/remove_worker?url=http://127.0.0.1:30001" +Successfully removed worker: http://127.0.0.1:30001 +``` + +Note: + +- For the cache-aware router, the worker will be removed from the tree and the queues. + +## Fault Tolerance + +We provide retry-based fault tolerance. + +1. If a request to a worker fails `max_worker_retries` times, the router removes that worker and moves on to the next worker. +2. If the total number of retries exceeds `max_total_retries`, the router returns an error. + +Note: + +- `max_worker_retries` is 3 and `max_total_retries` is 6 by default. + +## Routing Strategies + +#### Cache-Aware Load-Balancing Router The native router combines two strategies to optimize both cache utilization and request distribution: diff --git a/rust/README.md b/rust/README.md index 84a8e8fb1d0..617bca5405f 100644 --- a/rust/README.md +++ b/rust/README.md @@ -2,115 +2,13 @@ SGLang router is a standalone module implemented in Rust to achieve data parallelism across SGLang instances. 
-## Installation +## User docs -```bash -pip install sglang-router -``` - -## Usage -The router offers two modes: - -### 1. Co-launch workers and router -This will be a drop-in replacement for the existing `--dp-size`. This part of code will be moved into sglang core. -Under the hood, it uses multi-processes to launch multiple sglang workers, wait for them to be healthy, then launch the router. - -```bash -$ python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 8 -``` - -### 2. Launch only router -This is useful for multi-node DP. You can launch workers on different nodes, then connect the router to them. - -```bash -$ python -m sglang_router.launch_router --worker-urls http://worker1:8000 http://worker2:8000 - -$ python -m sglang_router.launch_router --help -usage: launch_router.py [-h] [--host HOST] [--port PORT] [--worker-urls WORKER_URLS [WORKER_URLS ...]] - [--policy {random,round_robin,cache_aware}] [--cache-threshold CACHE_THRESHOLD] - [--balance-abs-threshold BALANCE_ABS_THRESHOLD] [--balance-rel-threshold BALANCE_REL_THRESHOLD] - [--eviction-interval EVICTION_INTERVAL] [--max-tree-size MAX_TREE_SIZE] - -options: - -h, --help show this help message and exit - --host HOST Host address to bind the router server (default: 127.0.0.1) - --port PORT Port number to bind the router server (default: 30000) - --worker-urls WORKER_URLS [WORKER_URLS ...] 
- List of worker URLs (e.g., http://worker1:8000 http://worker2:8000) (default: None) - --policy {random,round_robin,cache_aware} - Load balancing policy to use (default: cache_aware) - --cache-threshold CACHE_THRESHOLD - Cache threshold (0.0-1.0) for cache-aware routing (default: 0.5) - --balance-abs-threshold BALANCE_ABS_THRESHOLD - Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 32) - --balance-rel-threshold BALANCE_REL_THRESHOLD - Load balancing is triggered when (max_load - min_load) > abs_threshold AND max_load > min_load * rel_threshold (default: 1.0001) - --eviction-interval EVICTION_INTERVAL - Interval in seconds between cache eviction operations (default: 60) - --max-tree-size MAX_TREE_SIZE - Maximum size of the approximation tree for cache-aware routing (default: 16777216) -``` - -## Strategy - -### Cache-Aware Load-Balancing Router - -This router combines two strategies to optimize both cache utilization and request distribution: - -1. Cache-Aware Routing (Approximate Tree) -2. Load-Balancing Routing (Shortest Queue with Balance Thresholds) +Please check https://sgl-project.github.io/router/router.html -The router dynamically switches between these strategies based on load conditions: -- Uses load balancing when the system is imbalanced -- Uses cache-aware routing when the system is balanced +## Developer docs -A system is considered imbalanced if both conditions are met: -1. (max_load - min_load) > balance_abs_threshold -2. max_load > balance_rel_threshold * min_load - -#### 1. Cache-Aware Routing (Approximate Tree) -This strategy maintains an approximate radix tree for each worker based on request history, -eliminating the need for direct cache state queries. The tree stores raw text characters -instead of token IDs to avoid tokenization overhead. 
- -Process: -- For each request, find the worker with the highest prefix match -- If match rate > cache_threshold: - - Route to the worker with highest match (likely has relevant data cached) -- If match rate ≤ cache_threshold: - - Route to the worker with smallest tree size (most available cache capacity) -- Background maintenance: - - Periodically evict least recently used leaf nodes to prevent memory overflow - -#### 2. Load-Balancing (Shortest Queue) -This strategy tracks pending request counts per worker and routes new requests -to the least busy worker when the system is detected to be imbalanced. This helps -maintain optimal load distribution across workers. - -### Configuration Parameters - -1. `cache_threshold`: (float, 0.0 to 1.0, default: 0.5) - - Minimum prefix match ratio to use highest-match routing - - Below this threshold, routes to worker with most available cache space - -2. `balance_abs_threshold`: (integer, default: 32) - - Absolute difference threshold for load imbalance detection - - System is potentially imbalanced if (max_load - min_load) > abs_threshold - -3. `balance_rel_threshold`: (float, default: 1.0001) - - Relative ratio threshold for load imbalance detection - - System is potentially imbalanced if max_load > min_load * rel_threshold - - Used in conjunction with abs_threshold to determine final imbalance state - -4. `eviction_interval`: (integer, default: 60) - - Interval in seconds between LRU eviction cycles for the approximate trees - - Background thread periodically evicts least recently used nodes to maintain tree size - -5. `max_tree_size`: (integer, default: 16777216) - - Maximum nodes per tree - - When exceeded, LRU leaf nodes are evicted during the next eviction cycle - -## Development +### Prerequisites - Rust and Cargo installed @@ -134,7 +32,7 @@ cargo --version #### 1. Build Rust Project ```bash -cargo build +$ cargo build ``` #### 2. 
Build Python Binding @@ -142,13 +40,19 @@ cargo build ##### Option A: Build and Install Wheel 1. Build the wheel package: ```bash -pip install setuptools-rust wheel build -python -m build +$ pip install setuptools-rust wheel build +$ python -m build ``` 2. Install the generated wheel: ```bash -pip install +$ pip install +``` + +If you want one handy command to do build + install for every change you make: + +```bash +$ python -m build && pip install --force-reinstall dist/*.whl ``` ##### Option B: Development Mode @@ -158,7 +62,7 @@ For development purposes, you can install the package in editable mode: Warning: Using editable python binding can suffer from performance degradation!! Please build a fresh wheel for every update if you want to test performance. ```bash -pip install -e . +$ pip install -e . ``` **Note:** When modifying Rust code, you must rebuild the wheel for changes to take effect. diff --git a/rust/src/server.rs b/rust/src/server.rs index ded7bb56528..7d0d23ccde1 100644 --- a/rust/src/server.rs +++ b/rust/src/server.rs @@ -118,7 +118,7 @@ async fn remove_worker( None => return HttpResponse::BadRequest().finish(), }; data.router.remove_worker(&worker_url); - HttpResponse::Ok().finish() + HttpResponse::Ok().body(format!("Successfully removed worker: {}", worker_url)) } pub struct ServerConfig {
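
Review note: the retry policy described in the new "Fault Tolerance" docs section can be hard to picture from prose alone. Below is a minimal Python sketch of one way to read it. It is not the router's Rust implementation. The function name `route_with_retries`, the `send` callback, and the "next worker in list order" selection are hypothetical and chosen only for illustration; the sketch also assumes that every failed attempt counts toward `max_total_retries` and that the error is raised once that many failures have accumulated.

```python
from typing import Callable, List, Optional


def route_with_retries(
    workers: List[str],
    send: Callable[[str], Optional[str]],
    max_worker_retries: int = 3,  # default stated in the docs
    max_total_retries: int = 6,   # default stated in the docs
) -> str:
    """Send one request, retrying per worker and across workers.

    `send(url)` is a hypothetical callback that returns the response on
    success and None on failure. `workers` is mutated in place: a worker
    that fails `max_worker_retries` times is removed from the list.
    """
    total_failures = 0
    while workers:
        worker = workers[0]  # assumption: the next worker is simply the list head
        for _ in range(max_worker_retries):
            # Assumption: stop once the global retry budget is used up.
            if total_failures >= max_total_retries:
                raise RuntimeError("exceeded max_total_retries")
            result = send(worker)
            if result is not None:
                return result
            total_failures += 1
        # This worker exhausted its per-worker budget: drop it and move on.
        workers.pop(0)
    raise RuntimeError("no workers left")
```

With the defaults, two consecutive unhealthy workers use up the whole budget (3 + 3 failures), so a request that has already failed six times returns an error even if a third worker is still registered.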