Motivation.
To leverage the vLLM V1 architecture change, we propose a new integration approach for the Neuron backend that integrates seamlessly with vLLM, while maintaining high performance and treating prefix caching as a first-class feature.
Background
(ref: #8779)
vLLM is on a path toward 1/ full support for torch.compile, 2/ turning on chunked prefill, prefix caching, and speculative decoding by default, and 3/ supporting more than 60 model variants. Meanwhile, the current Neuron backend is supported via the transformers-neuronx library, which has limited support for the combination of these features.
To support a wide range of model variants, vLLM maintains a modular design around the vllm.model_executor.layers module. This enables model developers to easily contribute support for new model variants to vLLM. For instance, the Mistral team released the pixtral-large model weights and brought pixtral-large model support to vLLM with Pixtral (vllm-project#8377).
Proposed Change.
Embracing torch.compile support
As part of the Neuron SDK 2.21 release, we are able to support torch.compile with the openxla backend. For instance, we can implement copy_blocks as a function compiled through torch.compile.
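A minimal sketch of this idea, assuming the KV cache is laid out as [num_blocks, block_size, num_kv_heads, head_size] and that block_mapping holds (src, dst) block-id pairs; the actual signature in the Neuron backend may differ:

import torch

@torch.compile(backend="openxla")
def copy_blocks(
    key_cache: torch.Tensor,      # [num_blocks, block_size, num_kv_heads, head_size]
    value_cache: torch.Tensor,    # same layout as key_cache
    block_mapping: torch.Tensor,  # [num_pairs, 2] of (src, dst) block ids
) -> None:
    # Copy whole KV-cache blocks from src slots to dst slots in place.
    src, dst = block_mapping[:, 0], block_mapping[:, 1]
    key_cache[dst] = key_cache[src]
    value_cache[dst] = value_cache[src]

Expressing cache-management ops like this lets torch.compile trace and lower them through the openxla backend for Neuron.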
Build neuron attention backend with NKI
We may build an NKI-based flash-attention kernel with a paged KV cache as part of the vllm.attention.ops module. This is similar to the Triton-based flash-attention in vLLM (ref: triton_flash_attention.py). A reference for the computation such a kernel performs is sketched below.
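As a reference for what such a kernel computes (not the NKI kernel itself), here is a pure-PyTorch sketch of decode-time attention over a paged KV cache; the tensor layouts and the num_kv_heads == num_heads assumption are illustrative only:

import torch

def ref_paged_attention(
    query: torch.Tensor,         # [num_seqs, num_heads, head_size]
    key_cache: torch.Tensor,     # [num_blocks, block_size, num_heads, head_size]
    value_cache: torch.Tensor,   # [num_blocks, block_size, num_heads, head_size]
    block_tables: torch.Tensor,  # [num_seqs, max_blocks_per_seq], int block ids
    context_lens: torch.Tensor,  # [num_seqs]
    scale: float,
) -> torch.Tensor:
    num_seqs, num_heads, head_size = query.shape
    block_size = key_cache.shape[1]
    out = torch.empty_like(query)
    for i in range(num_seqs):
        ctx_len = int(context_lens[i])
        num_blocks = (ctx_len + block_size - 1) // block_size
        blocks = block_tables[i, :num_blocks]
        # Gather this sequence's keys/values from the paged cache and
        # trim the padding in the last block.
        k = key_cache[blocks].reshape(-1, num_heads, head_size)[:ctx_len]
        v = value_cache[blocks].reshape(-1, num_heads, head_size)[:ctx_len]
        attn = torch.einsum("hd,thd->ht", query[i], k) * scale
        probs = attn.softmax(dim=-1)
        out[i] = torch.einsum("ht,thd->hd", probs, v)
    return out

A real NKI kernel would tile this computation over blocks and fuse the softmax, in the spirit of flash attention.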
Introduce forward_neuron into vllm.model_executor.layers module
As with many other backends in vLLM, the native kernel may not be highly performant on a specific hardware backend. vLLM builds and maintains the default behavior in the forward_native function, while hardware-specific optimizations can be enabled through a forward_xxx function.
We can reuse the performant components in the neuronx_distributed package to further improve performance, as in the RMSNorm example below.
from typing import Optional, Tuple, Union
import torch
from vllm.model_executor.custom_op import CustomOp

@CustomOp.register("rms_norm")
class RMSNorm(CustomOp):
    """Root mean square normalization.
    Computes x -> w * x / sqrt(E[x^2] + eps) where w is the learned weight.
    """

    def forward_neuron(
        self,
        x: torch.Tensor,
        residual: Optional[torch.Tensor] = None,
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
        # Fall back to the portable implementation if the fused op is unavailable.
        from neuronx_distributed.ops import NeuronFusedRMSNorm
        if NeuronFusedRMSNorm is None:
            return self.forward_native(x, residual)
        if residual is not None:
            orig_shape = x.shape
            residual += x.view(residual.shape)
            x = NeuronFusedRMSNorm.apply(residual, self.weight, self.variance_epsilon)
            return x.view(orig_shape), residual
        x = NeuronFusedRMSNorm.apply(x, self.weight, self.variance_epsilon)
        return x
Development Progress