vLLM Module with MLRun#
This notebook shows how to configure and deploy a vLLM OpenAI-compatible server as an MLRun application runtime, and then how to send a chat request to the deployed vLLM server.
import mlrun
Prerequisite#
At least one GPU is required to run this notebook.
What this notebook does#
In this notebook we will:
Create or load an MLRun project
Import a custom vLLM module from the MLRun Hub
Deploy a vLLM OpenAI-compatible server as an MLRun application runtime
Configure deployment parameters such as model, GPU count, memory, node selector, port, and log level
Invoke the deployed service using the /v1/chat/completions endpoint
Parse the response and extract only the assistant's generated text
By the end of this notebook, you will have a working vLLM deployment that can be queried directly from a Jupyter notebook using OpenAI-style APIs.
For more information, see the vLLM documentation.
1. Create an MLRun project#
In this section we create or load an MLRun project that will own the deployed vLLM application runtime.
project = mlrun.get_or_create_project(name="vllm-module", context="./", user_project=True)
2. Import the vLLM module from the MLRun Hub#
In this section we import the vLLM module from the MLRun Hub so we can instantiate VLLMModule and deploy it as an application runtime.
vllm = mlrun.import_module("hub://vllm-module")
3. Deploy the vLLM application runtime#
Configure the vLLM deployment parameters and deploy the application.
The returned address is the service URL for the application runtime.
# Initialize the vLLM app
vllm_module = vllm.VLLMModule(
    project=project,
    node_selector={"alpha.eksctl.io/nodegroup-name": "added-gpu"},
    name="qwen-vllm",
    image="vllm/vllm-openai:latest",
    model="Qwen/Qwen2.5-Omni-3B",
    gpus=1,
    mem="10G",
    port=8000,
    dtype="auto",
    uvicorn_log_level="info",
    max_tokens=501,
)
# Deploy the vLLM app
addr = vllm_module.vllm_app.deploy(with_mlrun=True)
addr
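As a quick sanity check, you can verify that the server came up before sending chat requests. The following is a minimal sketch, assuming the returned address is an HTTP URL (including the scheme) that is reachable from this notebook; the vLLM OpenAI-compatible server exposes a /health endpoint and lists its served models under /v1/models.
import requests

# Sanity check (sketch): assumes `addr` is reachable from the notebook environment
health = requests.get(f"{addr}/health", timeout=30)
print("health status:", health.status_code)

models = requests.get(f"{addr}/v1/models", timeout=30).json()
print("served models:", [m["id"] for m in models["data"]])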
4. Get the runtime handle#
Fetch the runtime object and invoke the service using app.invoke(...).
# Optional: get_runtime() returns the MLRun application runtime handle
app = vllm_module.get_runtime()
5. Send a chat request for testing#
Call the OpenAI-compatible endpoint /v1/chat/completions, parse the JSON response, and print only the assistant message text.
body = {
    "model": vllm_module.model,
    "messages": [
        {"role": "user", "content": "what are the 3 countries with the most gpu as far as you know"}
    ],
    "max_tokens": vllm_module.max_tokens,  # start smaller for testing
}

resp = app.invoke(path="/v1/chat/completions", body=body)
assistant_text = resp["choices"][0]["message"]["content"]

print("\nassistant:\n")
print(assistant_text.strip())
assistant:
As of the most commonly cited estimates, the three countries with the largest GPU capacity for AI workloads are the United States, China, and India.
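If the service URL is reachable directly from your environment, you can also query the deployment with the official openai Python client instead of app.invoke. This is a sketch under those assumptions; it requires the openai package to be installed and passes a placeholder API key, since vLLM does not enforce one by default.
from openai import OpenAI

# Alternative client (sketch): assumes `addr` is reachable and the openai package is installed
client = OpenAI(base_url=f"{addr}/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model=vllm_module.model,
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=100,
)
print(completion.choices[0].message.content)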