[Bug] (v0.3.6.post2) Output degradation when using structured output #2216

Open · Quang-elec44 opened this issue Nov 27, 2024 · 14 comments

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

The results with and without a JSON schema are different, while those generated from the vllm server (v0.6.4.post1) remain the same.

Reproduction

How to start sglang server

services:
  llm-sglang-dev:
    image: lmsysorg/sglang:latest
    container_name: llm-sglang-dev
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: <my-hf-token>
    ports:
      - "8007:8007"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    env_file: 
      - .env
    command: >
      python3 -m sglang.launch_server
      --model Qwen/Qwen2.5-7B-Instruct-AWQ
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --served-model-name gpt-4o
      --tensor-parallel-size 1
      --mem-fraction-static 0.4
      --random-seed 42
      --enable-p2p-check
      --show-time-cost
      --quantization awq_marlin
      --grammar-backend xgrammar
      --enable-cache-report
      --context-length 2048

How to start vllm server

services:
  llm-vllm:
    image: vllm/vllm-openai:latest
    container_name: llm-vllm
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: <my-hf-token>
    ports:
      - "8007:8007"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0']
            capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --max-model-len 16382
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.8
      --served-model-name gpt-4o
      --seed 42
      --disable-log-requests
      --enable-prefix-caching
      --model Qwen/Qwen2.5-7B-Instruct-AWQ

Python script

import json
import openai

from pydantic import BaseModel


client = openai.OpenAI(
    base_url="http://localhost:8007/v1",
    api_key="Lizai@54321"
)

class Players(BaseModel):
    names: list[str]
    

class Model(BaseModel):
    name: str
    number_of_parameters: str
    number_of_max_tokens: str
    architecture: list[str]


class Usage(BaseModel):
    use_case: list[str]
    license: str


class Schema(BaseModel):
    model: Model
    usage: Usage


document = """We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>"""


template = """{
    "model": {
        "name": "",
        "number_of_parameters": "",
        "number_of_max_tokens": "",
        "architecture": []
    },
    "usage": {
        "use_case": [],
        "licence": ""
    }
}"""

schema = json.dumps(json.loads(template))

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Task: Extract precise values for each field in the provided output schema from the given document.\nInstructions:\n-Do not hallucinate, paraphrase, or modify the extracted values.\n- If a field has no corresponding value in the document, leave it as an empty string (\"\").\n\nDocument:\n{document}\nSchema:\n{schema}"}
    ],
    temperature=0.0,
    max_tokens=256,
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": Schema.__name__,
                "schema": Schema.model_json_schema()
            }
        }
    }
)
print(completion.choices[0].message.content)

Results without json_schema

vllm

{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "licence": "Apache 2.0"
  }
}

sglang

{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "licence": "Apache 2.0"
  }
}

Results with json_schema

vllm

{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA) for faster inference", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "license": "Apache 2.0"
  }
}

sglang (xgrammar backend)

{"model": {"name": "Mistral 7B", "number_of_parameters": "7–billion", "number_of_max_tokens": "", "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]}, "usage": {"use_case": [], "license": "Apache 2.0"}}

sglang (outlines backend)

{"model": {"name": "Mistral 7B", "number_of_parameters": "7–billion", "number_of_max_tokens": "", "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]}, "usage": {"use_case": [], "license": "Apache 2.0"}}

Environment

Python: 3.10.15 (main, Sep  7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A10G
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.120
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.7
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
psutil: 6.1.0
pydantic: 2.10.1
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.55.1
anthropic: 0.39.0
NVIDIA Topology: 
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-47    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM
ulimit soft: 1048576
zhyncs (Member) commented Nov 27, 2024

cc @merrymercy

merrymercy (Contributor) commented

Thanks for reporting this. I can reproduce the error. The problem is that this model works better with multi-line style JSON, but the default argument in sglang uses single-line style. We can fix this for both the outlines and xgrammar backends.

Fix for outlines backend (works)

You can add --constrained-json-whitespace-pattern "[\n\t ]*" when you launch the server. See also #1438

python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct-AWQ --port 8007 --quantization awq_marlin  --grammar-backend outlines --constrained-json-whitespace-pattern "[\n\t ]*"

Output

{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "license": "Apache 2.0"
  }
}

Fix for xgrammar backend (does not work)

Try this commit dd4482e

python3 -m sglang.launch_server --model Qwen/Qwen2.5-7B-Instruct-AWQ --port 8007 --quantization awq_marlin  --grammar-backend xgrammar

Output

{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": []
  },
  "usage": {
    "use_case": [],
    "license": "Apache 2.0"
  }
}

However, this fix does not work. I think there are some subtle details in how it handles whitespace, @Ubospica.

arunpatala commented

I am also facing a weird issue when using xgrammar as the backend. Not sure if this is related.

I am using document prefix caching to do multiple extractions at the same time. Some of them use structured JSON output, and some output plain text. When using xgrammar via sgl.gen with json_schema, the plain-text outputs change and sometimes do not even terminate. The weird thing is that the plain-text generation is not using json_schema.

With outlines, it works as expected.

Isn't sgl.gen with json_schema and xgrammar supported yet?
I can provide more information to reproduce if necessary (but it is custom code).

Thanks,

Swipe4057 commented Dec 1, 2024

I also observe in my JSON output tests, with around 1,500 requests, that the xgrammar backend allows infinite generation. It also does not adhere to the API max_tokens limit and continues generating until memory overflow occurs, after which the generation is abruptly cut off. This does not happen with outlines. My test also mixes requests with and without a JSON schema in the same batch.

merrymercy (Contributor) commented

@arunpatala @Swipe4057 Any minimal reproducible examples would be very helpful here. The grammar developers (@Ubospica) are ready to help if the bugs can be easily reproduced.

Quang-elec44 (Author) commented

@merrymercy Hi, thanks for your advice. I pulled the latest version v0.4.0 and added --constrained-json-whitespace-pattern "[\n\t ]*". Both the xgrammar and outlines backends output the same result:
'{"model": {"name": "Mistral 7B", "number_of_parameters": "7–billion", "number_of_max_tokens": "", "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]}, "usage": {"use_case": [], "license": "Apache 2.0"}}'

Ubospica (Contributor) commented Dec 4, 2024

Hi @Quang-elec44, thanks for pointing that out!

For XGrammar, we found the reason: XGrammar requires the LLM to generate strictly formatted JSON. This is not strictly formatted JSON (the array is compressed onto one line):

"usage": {
  "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"]
}

This is strictly formatted:

"usage": {
  "use_case": [
    "superior performance and efficiency", 
    "reasoning", 
    "mathematics", 
    "code generation", 
    "following instructions"
  ]
}

However, this strict requirement sometimes makes the LLM's output quality deteriorate. In this case, the LLM will generate

"usage": {
  "use_case": []
}

which is still strictly formatted, but not meaningful.

We will relax this restriction and allow non-strictly formatted JSON in an upcoming version to ensure output quality.
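
To make the two styles concrete, here is a minimal standard-library sketch (the data below is made up for illustration): json.dumps produces the compressed single-line style by default, while indent=2 produces the strict multi-line style with every array element on its own line.

import json

data = {"use_case": ["reasoning", "mathematics"], "license": "Apache 2.0"}

# Compressed single-line style (not "strict"):
print(json.dumps(data))

# Strict multi-line style; each array element gets its own line:
print(json.dumps(data, indent=2))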

Swipe4057 commented

@merrymercy It seems I was able to understand this in a bit more detail; the issue isn't with xgrammar, but rather that my LLM doesn't generate stop tokens under certain parameters.

arunpatala commented

Is there a way to format the JSON data for finetuning so that it follows the format xgrammar expects?

remixer-dec commented Dec 6, 2024

After upgrading SGLang, JSON schema completely stopped working for me:

response_format.json_schema
  Input should be a valid dictionary or instance of JsonSchemaResponseFormat [type=model_type, input_value='{\n  "$schema": "http://...\n  "type": "object"\n}', input_type=str]

It looks like it is no longer possible to pass the schema as a string. OK, fine, I pass it as an object, but I still get an error:
response_format.json_schema.name
  Field required [type=missing, input_value={'$schema': 'http://json-...ema#', 'type': 'object'}, input_type=dict]
Why does a schema require a name?

Even after adding the name, the schema was fully ignored and the generated output did not follow the schema at all, with no errors in the logs.
P.S. This happens even if I switch --grammar-backend between outlines and xgrammar.
P.P.S. I provide both schema and json_schema inside response_format for compatibility with another runtime.

UPD: It seems that in my case the issue was that the schema should be nested inside the json_schema object rather than being the json_schema value itself; whoops, I totally missed this change.
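
For anyone hitting the same validation errors, a minimal sketch of the request shape that works, matching the reproduction script earlier in this thread (Schema is the Pydantic model defined there):

# The schema dict goes under response_format.json_schema.schema, with the
# required name field alongside it; passing the schema dict (or a JSON string)
# directly as json_schema triggers the validation errors quoted above.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "Schema",
        "schema": Schema.model_json_schema(),
    },
}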

Ubospica (Contributor) commented Dec 8, 2024

The XGrammar problem mentioned above (#2216 (comment)) is solved in XGrammar v0.1.6, and SGLang has been updated accordingly (#2390). This should solve the issue mentioned by @Quang-elec44.
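
For pip-installed setups, upgrading should pick up the fix (assuming the fixed versions above are the ones published to PyPI):

pip install --upgrade "xgrammar>=0.1.6" sglang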

arunpatala commented

I tried to reproduce the error when mixing structured generation with normal generation using sgl forks.
The example task is to extract details from the abstract of a paper.

abstract = """
Computer Science > Computer Vision and Pattern Recognition
[Submitted on 5 Dec 2024]
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
Vinayak Gupta, Yunze Man, Yu-Xiong Wang
Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge. Existing dynamic 4D generation methods typically rely on distilling knowledge from pre-trained 3D generative models, often fine-tuned on synthetic object datasets. Consequently, the resulting scenes tend to be object-centric and lack photorealism. While text-to-video models can generate more realistic scenes with motion, they often struggle with spatial understanding and provide limited control over camera viewpoints during rendering. To address these limitations, we present PaintScene4D, a novel text-to-4D scene generation framework that departs from conventional multi-view generative models in favor of a streamlined architecture that harnesses video generative models trained on diverse real-world datasets. Our method first generates a reference video using a video generation model, and then employs a strategic camera array selection for rendering. We apply a progressive warping and inpainting technique to ensure both spatial and temporal consistency across multiple viewpoints. Finally, we optimize multi-view images using a dynamic renderer, enabling flexible camera control based on user preferences. Adopting a training-free architecture, our PaintScene4D efficiently produces realistic 4D scenes that can be viewed from arbitrary trajectories. The code will be made publicly available. Our project page is at this https URL
Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2412.04471 [cs.CV]
 	(or arXiv:2412.04471v1 [cs.CV] for this version)
 
https://doi.org/10.48550/arXiv.2412.04471
Focus to learn more
Submission history
From: Yunze Man [view email]
[v1] Thu, 5 Dec 2024 18:59:57 UTC (7,775 KB)

"""

from pydantic import BaseModel
from typing import List, Dict

class SubmissionHistory(BaseModel):
    version: str
    date: str
    file_size: str

class Metadata(BaseModel):
    comments: str
    submission_history: SubmissionHistory
    doi: str
    arxiv_id: str

class Abstract(BaseModel):
    title: str
    authors: List[str]
    submission_date: str
    categories: List[str]
    summary: str
    metadata: Metadata

import sglang as sgl


abstract_prompt = """
**ABSTRACT**
{}

"""

abstract_instruction = """
**INSTRUCTION:**

"Parse the provided abstract and metadata of a research paper into JSON format with the following structure:

1. Include the `title`, `authors`, `submission_date`, and `categories` as direct fields.
2. Exclude the `abstract` field.
3. Add a `summary` field that concisely explains the core idea, methodology, and significance of the paper.
4. Retain a `metadata` field containing any additional details such as `comments`, `submission_history`, `doi`, and `arxiv_id`.

Ensure the JSON is well-structured and adheres to the specified format."

"""

    

@sgl.function
def gets(s, abstract, keys):
    s += sgl.user_begin()
    s += abstract_prompt.format(abstract) 
    forks = s.fork(1 + len(keys))
    forks[0] += abstract_instruction + sgl.user_end()
    forks[0] += sgl.assistant(sgl.gen("response", json_schema=Abstract.schema_json(), max_tokens=1024, temperature=0.0))
    for k,f in zip(keys, forks[1:]):
        f += f"Extract the following field: {k}" + sgl.user_end()
        f += sgl.assistant(sgl.gen("response", max_tokens=1024, 
                                   temperature=0.0))
    forks.join()
    s["return1"] = forks[0]["response"]
    s["return2"] = [f["response"] for f in forks[1:]]

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

keys = ['title', 'authors', 'summary']

states = gets.run(abstract, keys)

for i,k in zip(keys, states["return2"]):
    print(i, ":")
    print(k)

I am running sglang server with

python3 -m sglang.launch_server \
        --context-length 8192 \
        --served-model-name model \
        --host 0.0.0.0 --port 30000 \
        --mem-fraction-static 0.85 \
        --max-running-requests 64 \
        --model-path meta-llama/Llama-3.2-3B-Instruct \
        --grammar-backend xgrammar

I tried both xgrammar v0.1.5 and v0.1.6 with the latest main branch in Docker.

I get the following output for outlines at temperatures 0.0 and 0.25:

# outlines temperature=0.0

--------
title :
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
--------
authors :
The authors of the paper are:

1. Vinayak Gupta
2. Yunze Man
3. Yu-Xiong Wang
--------
summary :
Here is the extracted summary:

PaintScene4D: Consistent 4D Scene Generation from Text Prompts



#outlines temperature=0.25

--------
title :
PaintScene4D: Consistent 4D Scene Generation from Text Prompts
--------
authors :
The authors of the paper are:

1. Vinayak Gupta
2. Yunze Man
3. Yu-Xiong Wang
--------
summary :
Here is the extracted summary:

**PaintScene4D: Consistent 4D Scene Generation from Text Prompts**

Recent advances in diffusion models have revolutionized 2D and 3D content creation, yet generating photorealistic dynamic 4D scenes remains a significant challenge.

But the output of xgrammar, especially at lower temperatures, is not generated properly.



#xgrammar temperature=0

--------
title :
PaintScene4D: Consistent 4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--------
authors :
The authors of the paper are:

1.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--------
summary :
Here is the extracted summary:

PaintScene4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


xgrammar temperature=0.25

--------
title :
assistant

PaintScene4D: Consistent 4D Scene Generation from Text Prompts
--------
authors :
assistant

The authors of the paper are:

1. Vinayak Gupta
2. Yunze Man
3. Yu-Xiong Wang
--------
summary :


Also, xgrammar sometimes generates the "assistant" keyword at temperature 0.25, or generates nothing at all. The structured output seems to be all right for both grammar backends.

Hope this helps to reproduce the bug. Let me know if any other information is needed.

Ubospica (Contributor) commented Dec 9, 2024

@arunpatala Thanks for producing the reproducible script! I think this is a problem in the mask application process, where the mask is applied to non-structured requests. We will fix that problem soon.
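
For context, grammar backends constrain decoding by masking disallowed vocabulary entries in each step's logits. A purely illustrative sketch of the bug class described here (not SGLang's actual code; the grammar_mask attribute is hypothetical):

import torch

def apply_grammar_masks(logits: torch.Tensor, requests: list) -> torch.Tensor:
    # logits has shape [batch_size, vocab_size], one row per request.
    for i, req in enumerate(requests):
        mask = getattr(req, "grammar_mask", None)  # hypothetical per-request attribute
        # Only structured requests carry a mask; applying a mask to the
        # plain-text requests in the same batch would corrupt their sampling.
        if mask is not None:
            logits[i][~mask] = float("-inf")
    return logits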

arunpatala commented

@Ubospica thanks for looking into it.
