vLLM Module with MLRun#

This notebook shows how to configure and deploy a vLLM OpenAI-compatible server as an MLRun application runtime, and then demonstrates how to send a chat request to the deployed server.

import mlrun

Prerequisite#

  • At least one GPU is required to run this notebook.

What this notebook does#

In this notebook we will:

  • Create or load an MLRun project

  • Import a custom vLLM module from the MLRun Hub

  • Deploy a vLLM OpenAI-compatible server as an MLRun application runtime

  • Configure deployment parameters such as model, GPU count, memory, node selector, port, and log level

  • Invoke the deployed service using the /v1/chat/completions endpoint

  • Parse the response and extract only the assistant’s generated text

By the end of this notebook, you will have a working vLLM deployment that can be queried directly from a Jupyter notebook using OpenAI-style APIs.

For more information, see the vLLM documentation.

1. Create an MLRun project#

In this section we create or load an MLRun project that will own the deployed vLLM application runtime.

project = mlrun.get_or_create_project(name="vllm-module", context="", user_project=True)
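Because user_project=True, MLRun appends the current username to the project name. A quick check of the resolved name (plain attribute access, nothing vLLM-specific) helps when locating the project later in the UI:

# The resolved project name includes the username suffix, e.g. "vllm-module-<username>"
print(project.name)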

2. Import the vLLM module from the MLRun Hub#

In this section we import the vLLM module from the MLRun Hub so we can instantiate VLLMModule and deploy it as an application runtime.

vllm = mlrun.import_module("hub://vllm-module")
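If you want to see which configuration parameters the module exposes before deploying, standard Python introspection works on the imported module. This is a generic sketch; the exact attributes depend on the hub module version you imported:

# Inspect the imported module and the VLLMModule constructor signature
import inspect

print([name for name in dir(vllm) if not name.startswith("_")])
print(inspect.signature(vllm.VLLMModule))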

3. Deploy the vLLM application runtime#

Configure the vLLM deployment parameters and deploy the application.

The returned address is the service URL for the application runtime.

# Initialize the vLLM app
vllm_module = vllm.VLLMModule(
    project=project,
    node_selector={"alpha.eksctl.io/nodegroup-name": "added-gpu"},
    name="qwen-vllm",
    image="vllm/vllm-openai:latest",
    model="Qwen/Qwen2.5-Omni-3B",
    gpus=1,
    mem="10G",
    port=8000,
    dtype="auto",
    uvicorn_log_level="info",
    max_tokens=501,
)

# Deploy the vLLM app
addr = vllm_module.vllm_app.deploy(with_mlrun=True)
addr
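As an optional sanity check, you can query the server's OpenAI-compatible /v1/models endpoint using the returned address. The snippet below is a minimal sketch that assumes the address is reachable from the notebook; depending on your setup, addr may or may not already include the http:// scheme.

import requests

base_url = addr if addr.startswith("http") else f"http://{addr}"
models = requests.get(f"{base_url}/v1/models", timeout=30).json()
print([m["id"] for m in models.get("data", [])])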

4. Get the runtime handle#

Fetch the runtime object and invoke the service using app.invoke(...).

# Optional: get_runtime() returns the MLRun application runtime handle
app = vllm_module.get_runtime()

5. Send a chat request for testing#

Call the OpenAI compatible endpoint /v1/chat/completions, parse the JSON response, and print only the assistant message text.

body = {
    "model": vllm_module.model,
    "messages": [{"role": "user", "content": "what are the 3 countries with the most gpu as far as you know"}],
    "max_tokens": vllm_module.max_tokens,     # start smaller for testing
}

resp = app.invoke(path="/v1/chat/completions", body=body)
assistant_text = resp["choices"][0]["message"]["content"]

print("\nassistant:\n")
print(assistant_text.strip())
assistant:

As of the most commonly cited estimates, the three countries with the largest GPU capacity for AI workloads are the United States, China, and India.
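As an alternative to app.invoke(...), the service can also be called with the official openai client pointed at the deployment address. The snippet below is a sketch that assumes the address is reachable from the notebook and that the server was deployed without an API key (vLLM then accepts any placeholder key):

from openai import OpenAI

base_url = addr if addr.startswith("http") else f"http://{addr}"
client = OpenAI(base_url=f"{base_url}/v1", api_key="EMPTY")  # placeholder key; no auth configured

chat = client.chat.completions.create(
    model=vllm_module.model,
    messages=[{"role": "user", "content": "Summarize what MLRun does in one sentence."}],
    max_tokens=64,
)
print(chat.choices[0].message.content)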