vllm_module package#

Submodules#

vllm_module.vllm_module module#

class vllm_module.vllm_module.VLLMModule(project: str, *, node_selector: Dict[str, str] | None = None, name: str = 'vllm', image: str = 'vllm/vllm-openai:latest', model: str = 'Qwen/Qwen2.5-Omni-3B', gpus: int = 1, mem: str = '10G', port: int = 8000, dtype: str = 'auto', uvicorn_log_level: str = 'info', max_tokens: int = 500)[source]#

Bases: object

This module provides a lightweight wrapper for deploying a vLLM (OpenAI-compatible) large language model server as an MLRun application runtime.

The VLLMModule is responsible for:

- Creating an MLRun application runtime based on a vLLM container image
- Configuring GPU resources, memory limits, and Kubernetes node selection
- Launching the model using vllm serve with configurable runtime flags
- Supporting multi-GPU inference via tensor parallelism
- Automatically configuring shared memory (/dev/shm) when using multiple GPUs
- Exposing an OpenAI-compatible API (e.g. /v1/chat/completions) for inference
- Providing a simple Python interface for deployment and invocation from Jupyter notebooks

The module is designed to be used in Jupyter notebooks and MLRun pipelines, allowing users to deploy and test large language models on Kubernetes with minimal configuration.
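A minimal usage sketch follows. It assumes that the runtime returned by get_runtime() exposes MLRun's usual deploy() entry point and that the service address and request payload are illustrative placeholders, not values defined by this module:

```python
import requests

from vllm_module.vllm_module import VLLMModule

# Configure the wrapper: MLRun project name, model, GPU count, memory limit, and port.
vllm = VLLMModule(
    "my-project",
    model="Qwen/Qwen2.5-Omni-3B",
    gpus=1,
    mem="10G",
    port=8000,
)

# Optionally forward extra `vllm serve` flags before building the runtime.
vllm.add_args(["--max-model-len", "4096"])

# Build the MLRun application runtime and deploy it to the cluster.
runtime = vllm.get_runtime()
runtime.deploy()  # assumes the standard MLRun runtime deploy() method

# Query the OpenAI-compatible endpoint once the server is up.
base_url = "http://<vllm-service-address>:8000"  # replace with the deployed service URL
resp = requests.post(
    f"{base_url}/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-Omni-3B",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 100,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```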

add_args(extra_args: List[str])[source]#
get_runtime()[source]#
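These methods have no docstrings in the source; the sketch below shows how they are likely used together, assuming add_args() appends raw vllm serve flags to the launch command and get_runtime() returns the configured MLRun application runtime before deployment:

```python
from vllm_module.vllm_module import VLLMModule

# Multi-GPU configuration: per the class description, the wrapper enables
# tensor parallelism and shared-memory (/dev/shm) setup when gpus > 1.
vllm = VLLMModule(
    "my-project",
    name="vllm-large",
    model="Qwen/Qwen2.5-Omni-3B",
    gpus=2,
    mem="24G",
)

# Forward additional `vllm serve` flags; these are standard vLLM options.
vllm.add_args(["--gpu-memory-utilization", "0.9", "--enforce-eager"])

# Retrieve the underlying MLRun application runtime so the usual MLRun APIs
# (node selection, environment variables, deploy) can still be applied to it.
runtime = vllm.get_runtime()
print(type(runtime))
```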

Module contents#