Interactive, vendor-neutral guide for practitioners

From First Token to Production Scale

GPU time is expensive, and the serving layer can waste much of it. Your app feels slow; your cluster stalls after a handful of users. This guide covers LLM inference mechanics and where vLLM changes the picture.

Why does the first token take so long, and what controls it?
Why does streaming stutter when more users connect?
Where does GPU memory go?
How do teams keep latency promises under real traffic?

Source snapshot: official docs and engineering posts reviewed in April 2026. [1] [2] [3] [4]

How to read this guide

Pick the path that fits your role. Every section is self-contained, so you can skip ahead or stop early.

Business & Decision-Makers

Understand costs, capabilities, and when vLLM fits.

  1. The Problem — why inference is hard
  2. Capabilities — models, features, hardware
  3. TCO & ROI — self-host vs. API economics
  4. Adoption Playbook — POC to production

Platform & Ops Engineers

Deploy, tune, and operate vLLM in production.

  1. Deployment — Kubernetes, parallelism
  2. Observability — metrics, tracing, troubleshooting
  3. Hardening — security and resilience checklists
  4. Tuning — performance knobs and benchmarking

Deep-Dive & Research

Understand the engine internals and scheduling.

  1. PagedAttention — memory management
  2. Batching — scheduler mechanics
  3. Architecture — V1 engine internals
  4. Tuning — latency vs throughput tradeoffs
01

The Inference Challenge Start here

Organizations buy expensive GPU hardware. Apps still feel slow and top out after a small number of concurrent users. Fast responses and efficient GPU use pull on the same memory and compute budget. This section names that tradeoff and how vLLM addresses it.

When your application uses an AI model to answer questions, write text, or process documents, every response requires GPU compute time. vLLM is an open-source inference and serving engine built for high-throughput, memory-efficient LLM serving. Teams adopt it when they want more control over hardware choice, deployment shape, and cost model than a managed API provides. [10]

How an AI response reaches the user

User asks a question: chat, API call, or app request
Application sends prompt: your code calls the vLLM API
vLLM processes on GPU: prefill, then decode token by token
Response streams back: words appear as they are generated

The sections below explain what happens at each stage and where vLLM makes the difference.

Outcomes Over Benchmarks

Teams care about how fast the first token arrives and how smoothly the stream feels. They also track prompt reuse, multimodal inputs, structured outputs, and fairness on shared clusters. A single tokens per second figure rarely tells that whole story.

Prefill and Decode Are Different Jobs

Prefill is prompt-heavy and compute-bound. Decode adds one token at a time while the user watches; it is often memory-bandwidth-bound. A serving stack balances both phases continuously. [4]
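A rough sketch of why decode tends to be memory-bandwidth-bound: each decode step must stream essentially all model weights from GPU memory to produce one token, so single-sequence tokens per second is capped by bandwidth divided by model size. The numbers below (an 8B-parameter model in FP16 on a GPU with roughly 3.35 TB/s of HBM bandwidth) are illustrative assumptions, not measurements:

```python
# Back-of-envelope decode ceiling: every decode step reads ~all weights once.
PARAMS = 8e9             # assumed 8B-parameter model
BYTES_PER_PARAM = 2      # FP16
HBM_BANDWIDTH = 3.35e12  # assumed ~3.35 TB/s (H100-class)

weight_bytes = PARAMS * BYTES_PER_PARAM
tokens_per_sec_ceiling = HBM_BANDWIDTH / weight_bytes  # batch size 1

print(f"~{tokens_per_sec_ceiling:.0f} tokens/s ceiling for one sequence")
```

Batching amortizes those weight reads across many sequences, which is why continuous batching raises throughput on the same hardware. Prefill, by contrast, processes the whole prompt in parallel and saturates compute instead.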

The Key-Value (KV) Cache Becomes the Real Budget

Long contexts, high concurrency, and large models all squeeze KV cache space. Design choices decide whether GPU memory holds useful tokens or loses capacity to fragmentation, eviction pressure, or preemption.
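To see why the KV cache dominates the budget, the per-token cost can be sketched directly from model shape. The Llama-3-8B-like dimensions below are assumptions for illustration:

```python
# Per-token KV cache: one K and one V vector per layer per KV head.
NUM_LAYERS = 32      # assumed Llama-3-8B-like shape
NUM_KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128
DTYPE_BYTES = 2      # FP16

kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
print(kv_bytes_per_token)                     # 131072 bytes = 128 KiB per token

seq_len = 4096
print(kv_bytes_per_token * seq_len / 2**20)   # 512.0 MiB for one full-context sequence
```

At half a gigabyte per full-length sequence, a few dozen concurrent requests consume tens of gigabytes, which is why PagedAttention's block-level allocation and reuse matter so much.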

Quick overview

vLLM at a Glance

vLLM is an open-source inference and serving engine for high-throughput, memory-efficient LLM workloads. PagedAttention tightens KV memory use. [9] Continuous batching keeps accelerators busy under mixed load. The runtime also ships prefix caching, structured-output support, multimodal model support, multi-GPU parallelism, and OpenAI-compatible serving. [13] [15] [14] [12] [7]

Faster Responses

Targets lower TTFT and steadier streaming by reducing KV waste and prioritizing decode work under mixed load. [9] [8]

More Users, Same Hardware

PagedAttention and continuous batching increase useful work per GPU by reducing KV waste and keeping batches moving under uneven traffic. [9] [17]

Production-Ready

Run on Kubernetes with documented KServe integrations, published security guidance, and vendor-supported deployment options. [16] [28] [24]

Cost impact: GPU-hour pricing varies widely by accelerator, region, and contract. The useful comparison is your own measured cost per token or cost per request under representative load, not a generic market average. [21] [23]

Why teams switch to self-hosted inference

Potentially lower cost per output: when utilization is high, higher throughput per GPU can improve cost per token or cost per request. [21] [23]
Deployment control: self-hosting keeps prompts and responses on infrastructure you operate, subject to your own network and storage controls. [28]
Capacity-based cost model: self-hosted costs are driven mainly by the capacity you provision and the utilization you achieve, not a provider's per-token meter. [31] [21]
Model and hardware flexibility: one serving layer can cover many supported open-weight model families and multiple documented hardware backends. [14] [10]
02

What Production Requires Start here

Interactive serving needs a short time to first token (TTFT), steady streaming gaps between tokens (inter-token latency), and predictable latency under concurrent load. The mental model below ties together prefill, decode, and caching, and shows how vLLM addresses each requirement.

Applying the mental model

Building on the prefill/decode tradeoff above, prompt length, concurrency, prefix reuse, multimodal inputs, and latency targets usually decide whether batching, KV memory, parallelism, or disaggregated serving matters most. The topics below cover each technique and when it applies to your workload.


Disaggregated prefill / decode

What it is

Run prefill and decode on separate vLLM instances and hand off KV cache so long prefills do not queue behind decode work.

Why it matters

Splitting tiers keeps long prefills from delaying decode batches that used to share the same serving workers. [2] [4]

How vLLM Compares

vLLM

Open-source serving engine focused on high-throughput inference, broad model support, OpenAI-compatible APIs, and features such as PagedAttention, unified scheduling, and prefix caching. [10] [14] [13]

TGI

Hugging Face documents TGI as being in maintenance mode and recommends newer engines such as vLLM or SGLang for future serving-stack work where they fit. [20]

TensorRT-LLM

NVIDIA’s TensorRT-LLM stack: builder and runtime focused on CUDA GPUs, fused kernels, and detailed KV cache and compilation controls for teams already standardized on NVIDIA inference. [3]

SGLang

SGLang is an active open-source serving stack whose official materials emphasize RadixAttention, structured generation, and gateway/routing features. [18]

Is vLLM Right for You?

These questions come up in most vLLM fit and sizing reviews.

Latency target

Is the user experience dominated by first token, smooth streaming, or offline throughput?

Prompt profile

Are prompts short and repetitive, or long, bursty, and retrieval-heavy?

Workload mix

Is this mostly chat, long-form generation, multimodal requests, or structured/tool-driven flows?

Deployment boundary

Single node, shared cluster, air-gapped enterprise, or multi-node fleet with strict operational controls?


03

What vLLM Can Do Start here

The supported-models documentation covers native text, multimodal, and pooling architectures for chat, code, vision, audio, embedding, scoring, and retrieval workloads. [14] What follows is a concise feature matrix and hardware map for serving.

Coverage Snapshot

Dense chat and code models

The most common deployment target: frontier open models, chat assistants, copilots, and code generation stacks.

Llama Qwen Mistral Granite

Sparse Mixture-of-Experts (MoE) families

Sparse MoE stacks need explicit expert-parallel layout and memory planning so routing overhead does not erase the parameter savings.

Mixtral DeepSeek Expert parallelism

Multimodal request paths

Multimodal traffic changes preprocessing and puts more pressure on memory and time to first token than text-only chat does.

LLaVA Qwen2-VL Pixtral

Embedding, scoring, and retrieval

Pooling, embedding, and ranking endpoints sit beside chat completions for Retrieval-Augmented Generation (RAG) and search backends.

Embedding Scoring Pooling

Plan around dense text, MoE, multimodal inputs, or pooling instead of assuming a single model name tells the whole story.

Feature Matrix

Feature Description Key Benefit What this means for you
Quantization GPTQ, AWQ, FP8, BitsAndBytes, GGUF, TorchAO, and related schemes are documented in the project and surrounding vllm-project materials [10] Run larger models on fewer GPUs Run the same model on less expensive hardware
LoRA Adapters Runtime loading and unloading of LoRA weights [32] Serve many fine-tuned variants from one base One deployment serves multiple customized models
Structured Output JSON schema constraints plus regex, choice, grammar, and structural-tag guards [15] [6] Safer integration with tools and downstream systems The model's output is constrained to a format your code can parse
Prefix Caching Automatic hash-based KV block reuse for shared prefixes [13] Better TTFT for repeated system prompts and shared context Repeated instructions (system prompts) are processed once instead of every time
Multimodal Vision-language and audio-capable models [14] Multimodal and text paths in one serving stack Process images, audio, and text with the same server
CUDA Graphs Captured execution graphs for repeat decode shapes and other stable execution paths [10] Lower per-step launch overhead on stable workloads Faster responses on steady, predictable traffic
Advanced features
Feature Description Key Benefit
Speculative Decoding EAGLE, MTP, draft, n-gram, PARD, suffix, and MLP-style speculators [34] Lower wall-clock latency when acceptance rate is healthy
Expert / MoE Support Serve sparse MoE architectures such as Mixtral and DeepSeek families [14] [12] Use hardware more efficiently for large sparse models
Disaggregated Prefill experimental Separate prompt-ingest and decode pools when workloads justify it Protect TTFT-sensitive traffic under prompt-heavy load
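As a concrete illustration of the structured-output row above, a JSON-schema-constrained chat request can be sketched as below. This is a sketch, not a definitive API reference: the guided_json field name follows the guided-decoding parameters documented for earlier vLLM releases, so check the structured-output docs [15] for the field names your version expects; the schema and model name are placeholders.

```python
import json

# Hypothetical ticket-triage schema; the server constrains decoding to match it.
schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
    },
    "required": ["summary", "priority"],
}

body = {
    "model": "my-served-model",  # placeholder served model name
    "messages": [{"role": "user", "content": "Triage: checkout page returns 500s."}],
    "guided_json": schema,  # field name from earlier vLLM guided-decoding docs
}

print(json.dumps(body, indent=2))  # POST this to /v1/chat/completions
```

The payoff is the "what this means for you" column above: the response body is guaranteed to parse against the schema, so downstream code needs no retry-and-repair loop.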

Hardware Support Across the vLLM Organization

CUDA, ROCm, CPU, and XPU support ships from the main repository; accelerator-specific code also lives in the separate vllm-project repositories linked below. [10]

Mainline platforms

Core

NVIDIA CUDA

Mainline GPU path for high-throughput serving across common enterprise and hyperscaler fleets.

repo: vllm
Core

AMD ROCm

ROCm-backed deployments for AMD accelerator fleets, including MI-series hardware.

repo: vllm
Core

Intel XPU

Intel Data Center and Arc GPUs through the XPU backend in the main repo plus vllm-xpu-kernels; typical installs use a source build (Intel publishes Docker images).

repo: vllm
Core

CPU Family

Linux x86_64 and Arm AArch64 ship prebuilt CPU wheels. Apple Silicon and IBM Z CPU targets are experimental and need source builds; Apple GPU inference uses the vllm-metal plugin in the grid below.

repo: vllm
Plugin

Google TPU

Cloud TPU support via the tpu-inference / vllm-tpu plugin package, integrated with the main vLLM install.

repo: tpu-inference
Plugin

AWS Neuron

Trainium and Inferentia support via a dedicated vLLM organization plugin for AWS deployments.

repo: vllm-neuron

Organization plugins and platform projects

Plugin

Ascend NPU

Community-maintained vLLM organization plugin for Huawei Ascend deployments.

repo: vllm-ascend
Plugin

Intel Gaudi

Dedicated org plugin for Gaudi/HPU environments and their operator + runtime differences.

repo: vllm-gaudi
Plugin

IBM Spyre

Community-maintained vLLM organization plugin for IBM Spyre AIU acceleration.

repo: vllm-spyre
Plugin

Apple Silicon / Metal

Community-maintained Metal plugin for Mac workflows and Apple unified-memory serving experiments.

repo: vllm-metal
04

TCO & ROI Business focus

The economics of self-hosted inference depend on infrastructure cost, operational overhead, and what an API provider charges for the same workload. Use representative benchmarking instead of generic break-even claims. [21] [31]

Self-hosted inference trades metered API pricing for capacity you manage. Break-even depends on measured throughput, utilization, infrastructure cost, and operations. GuideLLM and vLLM benchmarks provide the throughput side of that equation, but the answer is still workload-specific. [21] [23] [31]

GPU Infrastructure

Raw GPU-hour cost is only one input. The more useful metric is an effective cost per token or per request: take your real hourly or amortized infrastructure cost and divide it by measured throughput under representative prompts and concurrency. [21] [23]

Operational Overhead

Running your own inference adds deployment, monitoring, upgrades, and support work. The vLLM production-stack and OpenShift AI materials show reference patterns for autoscaling, observability, packaging, and cluster deployment; include that work in the business case. [22] [24]

API Pricing at Scale

Commercial APIs publish per-token prices. Compare those published prices directly against your self-hosted cost model instead of assuming one universal traffic threshold where the economics flip. [31] [21]

Inputs for a Cost Model

Input How to Measure Why It Matters
Infrastructure cost Use your real hourly GPU rate, reserved-instance rate, or amortized on-prem cost. This is the numerator in any self-hosted cost-per-token calculation.
Measured throughput Benchmark on representative prompts, outputs, and concurrency with GuideLLM or vLLM tools. Throughput determines how much useful work that infrastructure cost actually buys. [21] [23]
Utilization and duty cycle Estimate how often the GPUs are busy versus idle across your real traffic pattern. Idle capacity can dominate the economics of spiky or low-volume workloads.
Operational overhead Include deployment, monitoring, patching, on-call, and support contracts. Self-hosting is not only a hardware decision. [22] [24]
API pricing baseline Use the provider’s published token pricing for the model and tier you are actually comparing against. This is the baseline alternative to self-hosting. [31]

A practical self-hosted comparison is: effective cost per million tokens = hourly or amortized infrastructure cost / measured tokens per second / 3,600 × 1,000,000. Use representative workloads, not toy prompts, when you collect the throughput number. [21] [23]
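That formula fits in a few lines of Python. The $8/hour rate and 2,500 tokens/s throughput below are placeholder numbers for illustration, not benchmarks:

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """Effective self-hosted cost per 1M tokens at full utilization."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Placeholder inputs: an $8/hr GPU sustaining 2,500 output tokens/s.
print(f"${cost_per_million_tokens(8.0, 2500):.2f} per 1M tokens")  # ~$0.89
```

Divide the result by your average utilization to account for idle capacity, then compare against the provider's published per-token price for the same model tier.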

Workload patterns

How Different Workloads Change the Math

The same engine can look cheap or expensive depending on duty cycle, prompt shape, and latency targets. Use workload-specific benchmarks to decide which side of the trade-off matters more.

Workload Profile Important Variables What to Measure First Likely Outcome
RAG Chatbot Prefix reuse, burstiness, TTFT and ITL targets, traffic predictability TTFT, ITL, prefix-cache hit rate, and average utilization Can go either way; self-hosting gets more compelling as demand and reuse become steady
Batch Document Processing Long prompts, high duty cycle, queue tolerance, sustained throughput Output tokens/sec, queue time, and effective cost per token Often the easiest case for self-hosting because utilization can stay high
Code Assistant Low-latency SLOs, long contexts, multi-turn locality, peak concurrency P95 TTFT, P99 ITL, context-length distribution, and required headroom Dedicated self-hosted capacity can make sense when latency SLOs and demand are both steady

GuideLLM supports synthetic or custom datasets plus rate types such as constant, concurrency, Poisson, and sweep-style tests. Use those modes to build a benchmark that actually resembles your production traffic. [23]

When self-hosting does not save money

Self-hosting is not always the right call. Managed APIs often win when traffic is sporadic, when proprietary model access matters more than infrastructure control, or when GPU operations overhead would dominate the savings. Keep the decision empirical: measure representative throughput, compare it to published API pricing, and include support and operations in the model. [21] [31]

How to measure your own cost-per-token

Divide your hourly or amortized infrastructure cost by measured throughput (tokens/sec) to get cost per million tokens, then compare that against the provider’s published token pricing for the model tier you are using. The Performance Tuning section covers representative benchmarking methodology with GuideLLM in more detail. [21] [31]

05

Adoption Playbook Business focus

A structured path from evaluation to production. Each phase maps to documented vLLM deployment modes, benchmarking steps, and support options.

You can start with vllm serve <model> for local evaluation and then move into Kubernetes-based deployment patterns documented by vLLM and KServe. Three phases, each with concrete exit criteria, take you from a single-GPU proof of concept to production. [7] [16]

Data Residency

Do prompts and responses need to stay on your network? Self-hosted vLLM lets inference traffic stay on infrastructure you control, but the security guide also notes that multi-node communications are insecure by default and should be isolated on trusted networks. [28]

Hardware Compatibility

The vLLM project documents support for CUDA, ROCm, Intel XPU, CPU, and additional accelerator plugins in the wider vllm-project organization. If you already run modern accelerator fleets, there is usually a documented path to start evaluating. [10]

Team Expertise

Complexity ranges from a one-liner (vllm serve <model>) to multi-node tensor/pipeline/data parallelism. Managed platforms like Red Hat OpenShift AI handle the infrastructure layer. [24]

Phased rollout

From Evaluation to Production in Three Phases

Phase 1 (POC): single GPU, one model, internal users
Phase 2 (Pilot): K8s deployment, traffic shadow, SLO baseline
Phase 3 (Production): multi-model, autoscaling, monitoring, handoff
Phase vLLM Mode Key Activity Exit Criteria
1 — POC Offline/batch via LLM class or vllm serve on a single GPU Validate model quality, measure baseline throughput with GuideLLM [23] Model produces acceptable outputs; throughput justifies self-hosting per TCO analysis
2 — Pilot Kubernetes Deployment with health probes, GPU limits, shared memory volume Shadow production traffic; define TTFT and ITL SLOs using GuideLLM sweep mode [21] Meets latency SLOs at target concurrency; ops team comfortable with monitoring
3 — Production vLLM production-stack or KServe with autoscaling, observability, and routing [22] [16] Multi-model serving, LoRA adapters, and supportable operational ownership [24] Uptime SLO met; cost model reviewed against measured throughput and current API pricing baseline [31]

Stakeholder Map

Each role owns a different piece of the deployment. Get them aligned before Phase 2.

Platform / Infra

GPU provisioning, Kubernetes integration, Helm charts, shared memory and storage configuration. Concerned with resource limits, node autoscaling, and multi-tenant isolation.

Security

Data residency, endpoint protection (reverse proxy, API-key limitations), network isolation for inter-node ZMQ and PyTorch Distributed traffic, and pre-production hardening checks. [28]

ML / Data Science

Model selection from the supported-model catalog, quantization options, accuracy validation, and LoRA adapter management. [14]

Procurement / Finance

TCO analysis from the previous section, support contracts, GPU lease-vs-buy decisions, and measurable success criteria. [24]

Success criteria template

Tie your go/no-go decision to metrics vLLM exposes:

  • Latency SLOs: TTFT < X ms, ITL < Y ms, measured via vllm:time_to_first_token_seconds and vllm:inter_token_latency_seconds Prometheus histograms [25]
  • Cost reduction: tokens/sec per infrastructure dollar, compared against published API pricing. Measured via GuideLLM throughput benchmarks [21] [31]
  • Uptime: /health endpoint availability meets your target, paired with platform probes and alerting [28]
  • Operational readiness: Monitoring dashboards active, on-call runbook documented, graceful shutdown tested (--shutdown-timeout) [7] [22]
06

Deployment Options Start here

vLLM runs the same engine family on one GPU, behind an HTTP server, or on Kubernetes. That gives teams one serving stack with familiar model and runtime flags across those shapes instead of maintaining separate engines per environment. [7] [16]

Enterprise deployment

Kubernetes-Native Serving

Upstream docs cover running vLLM on Kubernetes with KServe (Hugging Face serving runtime or LLMInferenceService). [16] OpenShift AI documents KServe-based model serving; you apply the same integration patterns there when exposing vLLM as a served model. [24]

Optimized Models

Many organizations publish quantized or otherwise optimized checkpoints on Hugging Face. Treat those as starting points for evaluation rather than automatic production approval, and validate them on your own prompts, latency targets, and safety requirements.

Security and Compliance

vLLM is built for throughput; production security requires layers around it. The Production Hardening section covers reverse proxy setup, network isolation, SSRF protection, FIPS-sensitive deployment considerations, and the full pre-production security checklist. [28]

Offline

Batch Inference

The LLM class runs generation in-process with no server overhead.

  • Ideal for ETL, evals, and backfills
  • No network hop between caller and engine
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/gemma-4-31B-it-FP8-block")
params = SamplingParams(temperature=0.8, top_p=0.95)

outputs = llm.generate(["Explain Kubernetes in one paragraph."], params)
print(outputs[0].outputs[0].text)
Online

API Server

One command starts an OpenAI-compatible HTTP server with chat, completions, and model-listing endpoints; embedding and pooling flows are documented separately in the supported-model and pooling docs. [7] [14]

  • Drop-in for any OpenAI client
  • Continuous batching with streaming
vllm serve RedHatAI/gemma-4-31B-it-FP8-block
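As a sketch of what a client sends to that server, the request below targets the assumed default localhost:8000 endpoint using only the standard library; any OpenAI SDK builds the same body against the /v1 base URL. An Authorization: Bearer header is only needed if the server was started with --api-key.

```python
import json
import urllib.request

body = {
    "model": "RedHatAI/gemma-4-31B-it-FP8-block",
    "messages": [{"role": "user", "content": "Explain Kubernetes in one paragraph."}],
    "stream": True,  # server answers with server-sent events as tokens are generated
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # assumed default host and port
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would stream the response; omitted here.
```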
Kubernetes

Scalable Deployment

Deploy on Kubernetes or OpenShift with GPU node selectors, resource limits, and room to grow into multi-node topologies.

  • Best fit for shared clusters and platform teams
  • Pairs with autoscaling and service meshes
  • Good handoff point into OpenShift AI and KServe workflows
# Minimal example only; add probes, shared memory, auth, and model storage for production.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["vllm", "serve",
          "RedHatAI/gemma-4-31B-it-FP8-block"]
        resources:
          limits:
            nvidia.com/gpu: 1

Parallelism Strategies

Each strategy below lists topology, fit, tradeoffs, and the matching CLI flags from the serving docs. [12]

Tensor Parallelism

Split each layer across GPUs when the model fits on one node but not on one device.

Best when: single node, fast interconnect, latency-sensitive serving
Tradeoff: more collectives and tighter node-level coupling
Pairs with: quantization or PP when the model still outgrows one node
Flag: --tensor-parallel-size N. Watch interconnect bandwidth and collective overhead.

Scaling Patterns

Combine parallelism flags with load balancing and routing for production scale. [12]

Pattern What It Does When to Use
Parallelism flags -tp, -pp, -dp plus expert and context parallel Scale a single model across multiple GPUs. See the parallelism chooser above [12]
Multi-API-server --api-server-count N runs multiple frontend processes sharing one engine. Auto-configures PROMETHEUS_MULTIPROC_DIR for shared metrics [12] When the API frontend is the bottleneck (high request parsing overhead, many concurrent connections)
vLLM Router Production-grade Rust load balancer with consistent hashing for KV cache reuse, prefill/decode disaggregation support, K8s service discovery, and circuit breakers [29] Multi-replica deployments needing intelligent request routing and cache affinity
llm-d Red Hat’s KV-cache-aware routing layer for multi-replica vLLM on OpenShift [30] Multi-turn workloads where routing to the replica holding the conversation’s KV cache avoids recomputation

Multi-Model Serving

Pattern How Tradeoff
Served name aliases --served-model-name name1 name2 — one engine, multiple API model IDs [7] No overhead; useful for API compatibility or gradual migration
Runtime LoRA adapters --lora-modules at startup, plus /v1/load_lora_adapter and /v1/unload_lora_adapter at runtime (requires VLLM_ALLOW_RUNTIME_LORA_UPDATING=1) [32] Share one base model across fine-tuned variants; adapter count limited by --max-loras and GPU memory
Separate deployments + routing One Deployment per model, fronted by vLLM Router or llm-d [29] [30] Full isolation; requires more GPUs but simplifies per-model scaling and upgrades
Model storage on Kubernetes
  • PVC-based cache: Mount a PersistentVolumeClaim at /root/.cache/huggingface so the model is downloaded once and persisted across pod restarts. Shown in official KServe and Kubernetes-oriented deployment examples [16]
  • ModelCar packaging: Package the model into a container image for reproducible, versioned deployments. Documented by Red Hat for OpenShift AI workflows [24]
  • Init-container download: Use an init container to pull the model before the vLLM container starts, decoupling download time from the readiness probe window. This is a common Kubernetes pattern when model download time would otherwise dominate readiness.
07

Observability & Troubleshooting Ops focus

vLLM exposes Prometheus metrics on the /metrics endpoint of the API server, supports OpenTelemetry tracing, and ships pre-built Grafana dashboards. The metrics below cover queue health, memory pressure, latency percentiles, and request outcomes. [25] [5] [27]

Prometheus OpenTelemetry Grafana
In plain English

Queue depth tells you whether traffic is backing up. KV cache usage warns you before memory pressure triggers evictions. TTFT and ITL are the latency numbers your users feel. The troubleshooting tree below pairs each symptom with the metric to check and the flag to change. [25] [8]

Key Operational Metrics

These metrics are defined in vLLM’s PrometheusStatLogger and documented in the official Production Metrics page. [25]

Metric Type What It Means Alert When
vllm:num_requests_running Gauge Requests currently in model execution batches Sustained at --max-num-seqs ceiling
vllm:num_requests_waiting Gauge Requests queued for scheduling (capacity + deferred) Growing queue (>0 sustained)
vllm:kv_cache_usage_perc Gauge KV cache utilization (1.0 = 100%) >0.9 — approaching preemption threshold
vllm:num_preemptions Counter Cumulative preemptions (requests evicted from KV cache) Rate >0 — requests being recomputed
vllm:time_to_first_token_seconds Histogram Time from request arrival to first generated token p99 exceeds your TTFT SLO
vllm:inter_token_latency_seconds Histogram Gap between consecutive output tokens (streaming smoothness) p99 exceeds your ITL SLO
vllm:e2e_request_latency_seconds Histogram Total request duration from arrival to final token p99 exceeds end-to-end SLO
vllm:request_queue_time_seconds Histogram Time a request spent in the WAITING state before scheduling Growing queuing latency
vllm:prefix_cache_hits / queries Counter Prefix cache token hits vs. queries Low hit ratio = wasted KV memory
vllm:request_success Counter Count of successfully processed requests Success rate drops against your normal request volume baseline
Advanced metrics (flag-gated)
  • --kv-cache-metrics enables vllm:kv_block_lifetime_seconds, vllm:kv_block_idle_before_evict_seconds, and vllm:kv_block_reuse_gap_seconds — useful for diagnosing cache churn and sizing
  • --enable-mfu-metrics enables vllm:estimated_flops_per_gpu_total — Model Flops Utilization for hardware efficiency analysis
  • Speculative decoding metrics (vllm:spec_decode_num_accepted_tokens, acceptance rate by draft position) are registered automatically when speculative decoding is configured
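The latency percentiles above are derived from cumulative histogram buckets: PromQL's histogram_quantile() finds the bucket that crosses the target rank and interpolates linearly inside it. The sketch below shows that interpolation on made-up TTFT buckets; it ignores Prometheus edge cases such as empty or single-bucket histograms:

```python
import math

def histogram_quantile(q, buckets):
    """buckets: ascending (upper_bound_seconds, cumulative_count) pairs, last bound inf."""
    rank = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cap at the highest finite bucket bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return prev_bound

# Made-up vllm:time_to_first_token_seconds buckets over 100 requests.
ttft = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.50, ttft))  # 0.1 -> median TTFT
print(histogram_quantile(0.99, ttft))  # 1.0 -> p99 TTFT
```

One practical consequence: a reported p99 can only fall on or between bucket boundaries, so sparse traffic makes tail latencies move in visible steps rather than smoothly.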

Dashboard Blueprint

Four Grafana rows built from the metrics above. The vLLM Monitoring Dashboards docs and the production-stack provide pre-built versions. [27] [22]

Row 1: Throughput

rate(vllm:generation_tokens[1m]) and rate(vllm:prompt_tokens[1m]). Shows decode and prefill throughput in tokens per second.

Row 2: Latency Percentiles

histogram_quantile(0.99, ...) over vllm:time_to_first_token_seconds, inter_token_latency_seconds, and e2e_request_latency_seconds. The three numbers users feel.

Row 3: KV Cache Pressure

vllm:kv_cache_usage_perc, rate(vllm:num_preemptions[5m]), and prefix cache hit ratio. Warns before memory pressure causes evictions.

Row 4: Queue Depth

vllm:num_requests_running, vllm:num_requests_waiting, and num_requests_waiting_by_reason (split by capacity vs. deferred).

OpenTelemetry Tracing

vLLM ships an official OTLP tracing example and documents trace export through --otlp-traces-endpoint plus related environment variables. [5]

Flag / Env Var What It Does
--otlp-traces-endpoint <URL> Sends traces to an OTLP collector (gRPC by default, or HTTP/protobuf via OTEL_EXPORTER_OTLP_TRACES_PROTOCOL)
--collect-detailed-traces {model,worker,all} Enables per-module spans for model execution or worker-level tracing (requires OTLP endpoint)
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL Set to http/protobuf for HTTP export instead of the default grpc

With an OTLP collector configured, request traces can flow into Jaeger or any other OTLP-compatible backend you already run. [5]

Troubleshooting Decision Tree

Each branch pairs a Prometheus metric with the configuration flag to change.

TTFT is high
  1. Check vllm:num_requests_waiting — is traffic queueing? If yes, scale replicas or increase --max-num-seqs
  2. Check vllm:request_prompt_tokens histogram — are prompts unusually long? Long prefills are compute-bound
  3. Consider increasing --tensor-parallel-size to split prefill across GPUs [12]
  4. Tune chunked prefill via --max-num-batched-tokens to bound the per-step prefill budget [8]
OOM or preemptions
  1. Check vllm:kv_cache_usage_perc — is it above 0.9?
  2. Lower --gpu-memory-utilization (default 0.9) to leave headroom for activation memory
  3. Reduce --max-model-len to cap KV cache per sequence
  4. Enable quantization (--quantization awq, gptq, or fp8) to shrink model weights and free memory for KV [8]
Streaming stutters (high ITL)
  1. Check vllm:inter_token_latency_seconds p99 for spikes
  2. Check vllm:iteration_tokens_total — large batches per step increase per-token decode time
  3. Lower --max-num-batched-tokens to reduce batch size, trading throughput for smoother streaming [8]
Requests timing out
  1. Check vllm:request_queue_time_seconds for growing queue latency
  2. Inspect vllm:num_requests_waiting_by_reason — is the bottleneck capacity (not enough GPU) or deferred (transient constraints like LoRA budget)?
  3. For capacity: scale replicas or increase --max-num-seqs
  4. For deferred: check LoRA adapter limits or KV transfer backpressure
Low prefix cache hit rate
  1. Compute hit ratio: rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m])
  2. Enable prefix caching if not already on: --enable-prefix-caching [13]
  3. Structure prompts so system messages and common context appear at the beginning (cache matches from the prefix)
  4. Use session affinity or the vLLM Router’s consistent hashing to route repeat users to the same replica [29]
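Behind the OOM-or-preemptions branch above is simple arithmetic: whatever --gpu-memory-utilization grants, minus model weights, is the KV pool, and that pool divided by per-sequence KV size bounds concurrency. All numbers below are illustrative assumptions (an 80 GB GPU, 16 GB of weights, 128 KiB of KV per token):

```python
def max_full_length_seqs(gpu_bytes, gpu_mem_util, weight_bytes,
                         kv_bytes_per_token, max_model_len):
    """Upper bound on concurrent full-context sequences the KV pool can hold."""
    kv_pool = gpu_bytes * gpu_mem_util - weight_bytes
    return int(kv_pool // (kv_bytes_per_token * max_model_len))

# Illustrative: 80 GB GPU, --gpu-memory-utilization 0.9, 16 GB of weights,
# 128 KiB KV per token, --max-model-len 4096.
print(max_full_length_seqs(80e9, 0.9, 16e9, 131072, 4096))  # 104
```

The flags in the branch map straight onto the arguments: lowering --max-model-len or quantizing the weights raises the bound directly. The real engine also reserves activation memory and allocates KV in blocks, so treat this as a ceiling, not a prediction.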

For health probe configuration, graceful shutdown, resource limits, and logging controls, see the Resilience Checklist in Production Hardening.

08

Production Hardening Ops focus

A pre-production checklist drawn from vLLM’s security guide, official Kubernetes examples, and source code. [28]

In plain English

vLLM runs out of the box. Running it safely in production takes deliberate configuration. Each item below links to the flag or doc that implements it. [28] [7]

Security Checklist

Items are documented in vLLM’s security guide plus the official server arguments and deployment examples. [28] [7] [16]

Item Why How
Deploy behind a reverse proxy --api-key only protects /v1 endpoints. Endpoints like /invocations and /pooling are unprotected; /pause and /resume are also unprotected but only exist when VLLM_SERVER_DEV_MODE=1 is set. The security guide states: “The most effective approach is to deploy vLLM behind a reverse proxy.” [28] nginx, Envoy, or a Kubernetes Gateway that allowlists only the endpoints you need and adds rate limiting
API key authentication Bearer token auth for /v1 endpoints --api-key <token> or VLLM_API_KEY env var. Note: only protects /v1 path prefix; auth middleware skips other paths
TLS termination Encrypt client-to-server traffic --ssl-keyfile, --ssl-certfile, --ssl-ca-certs. For TLS 1.2 cipher control: --ssl-ciphers
FIPS-sensitive environments Some regulated deployments need approved cryptographic settings and validation. Use the documented hashing and TLS-related settings from the security guide, then validate the full deployment with your compliance team before claiming FIPS readiness. [28]
Network isolation All inter-node traffic (PyTorch Distributed, KV cache transfer, ZMQ between API server and engine core) is insecure and unencrypted by default Deploy on an isolated network segment. Use Kubernetes NetworkPolicies. Set VLLM_HOST_IP to a specific interface. Configure firewalls to block all ports except the API server
Disable dev mode VLLM_SERVER_DEV_MODE=1 exposes /collective_rpc (arbitrary RPC execution), cache resets, and sleep endpoints Never set VLLM_SERVER_DEV_MODE=1 in production. Never enable profiler endpoints (--profiler-config) in production
SSRF protection Malicious users can supply URLs targeting internal services or cloud metadata endpoints --allowed-media-domains <domains> to restrict media URL fetching. Set VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 to prevent redirect bypass
Request parameter limits The n parameter can cause resource exhaustion if set very high Set VLLM_MAX_N_SEQUENCES to a deployment-appropriate cap for public-facing traffic rather than leaving the risk unbounded. [28]

Resilience Checklist

Item Why How
Resource limits and requests Prevent noisy-neighbor issues and ensure GPU scheduling K8s: set resources.limits (CPU, memory, nvidia.com/gpu) and resources.requests. Mount /dev/shm as emptyDir: { medium: Memory, sizeLimit: "2Gi" } for tensor parallel shared memory [16]
Health endpoint GET /health gives the orchestrator a simple liveness and readiness signal. Use it for both liveness and readiness probes, and tune initial delays and thresholds for the model load time you actually observe.
Graceful shutdown Without a drain window, restarts can interrupt in-flight requests. Set --shutdown-timeout N to give the server time to finish or drain active work before exit. [7]
Logging controls Production logs should be informative without leaking prompts or overwhelming storage --max-log-len to bound logged prompt/output size. --disable-access-log-for-endpoints /health,/metrics to suppress probe noise. --enable-log-requests is off by default [7]

For scaling patterns, multi-model serving, and model storage options, see the Deployment Options section.

09

PagedAttention Technical deep-dive

The core idea: manage KV cache with the same block-and-table structure as virtual memory, so variable-length sequences and concurrent requests use a fixed GPU memory budget more efficiently. [9]

In plain English

When a model is answering multiple questions at once, it needs to remember context for each conversation. The straightforward approach reserves a big chunk of GPU memory up front for each request and wastes whatever goes unused. PagedAttention organizes that memory more efficiently—like how your computer's operating system manages RAM—so the same GPU can handle more conversations simultaneously without running out of space. [9]

Allocator showdown

Same workload, two allocation patterns

The steps below replay a short burst of requests, comparing fixed contiguous reservations against fixed-size blocks drawn from a shared pool only as each sequence grows.

Step 1 / 4

Naive Contiguous

Contiguous pre-allocation

vs

PagedAttention

On-demand block pool
Logical block tables
Physical block pool

Each request's logical KV blocks map into a shared physical block pool on the GPU, just as an OS maps virtual pages to RAM frames. [9] Sequences grow, share prefix storage, and release blocks on completion without copying KV tensors.

Grow without relocating Share prefix KV blocks Return blocks to the pool on completion Lower KV cache waste
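A toy allocator (not vLLM's implementation) can make the block-table idea concrete: each request maps logical blocks to physical block ids, blocks come from a shared free pool only as the sequence grows, and completion returns them without copying any KV tensors:

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative; vLLM also defaults to 16)

class BlockPool:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))      # physical block ids on the "GPU"
        self.tables: dict[str, list[int]] = {}   # request id -> logical block table

    def append_token(self, req_id: str, seq_len: int) -> None:
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.tables.setdefault(req_id, [])
        if seq_len > len(table) * BLOCK_SIZE:
            if not self.free:
                raise MemoryError("pool exhausted: preempt or reject")
            table.append(self.free.pop())        # any free block; no contiguity needed

    def release(self, req_id: str) -> None:
        """Return all blocks to the pool on completion; no KV data is moved."""
        self.free.extend(self.tables.pop(req_id, []))

pool = BlockPool(num_blocks=12)
for t in range(1, 40):                           # request A grows to 39 tokens
    pool.append_token("A", t)
print(len(pool.tables["A"]), "blocks used")      # 3 blocks used (ceil(39/16))
pool.release("A")
print(len(pool.free), "blocks free")             # 12 blocks free
```

The contiguous alternative would reserve space for the maximum context up front; here, a 39-token sequence holds exactly three blocks and frees them the moment it finishes.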
Animated walkthrough

Allocate on demand

Req A arrives. Blocks land wherever space exists. No contiguous reservation needed.

Research Paper: Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention", SOSP 2023. arXiv:2309.06180
10

Continuous Batching & Chunked Prefill Technical deep-dive

Iteration-level schedulers run one forward pass per round and can change which requests are in the batch, instead of holding a fixed batch until every request finishes. [17] [8]

In plain English

Older systems process requests in fixed groups: if one user's question takes longer, everyone else in the group waits. Continuous batching lets vLLM swap finished requests out and new requests in on every processing step, so the GPU stays busy and no single slow response blocks the rest. Chunked prefill breaks very long prompts into smaller pieces so they do not monopolize the processor while other users are waiting for their next word. [17] [8]

Scheduler simulator

Continuous Batching Lab

Scrub through scheduler steps and compare static batching versus vLLM-style continuous admission.

Step 1 / 5

Static Batching

Batch-locked
Prefill Req A Long prompt prefill
Decode Req B Chat decode
Decode Req C Chat decode
Queued Req D Waiting to enter
vs

Continuous Batching

Iteration-level
Prefill Req A Long prompt prefill
Decode Req B Chat decode
Decode Req C Chat decode
Queued Req D Ready next step
Req D is queued in both panels. Iteration-level scheduling admits D on the next step when a slot frees; batch-locked scheduling waits until the current batch completes.
Budget-sharing explainer

Chunked Prefill

Long prompts span several steps so decode rows still advance in the same batch.

Req A has a 256-token prompt, but the scheduler only processes 64 tokens of it this step, leaving room for Reqs B and C to keep streaming their responses.

Req A Chunk 1 / 4 · long doc summary
64 tok
Req B Streaming · chat reply
20 tok
Req C Streaming · chat reply
20 tok
Req D Queued · new question
Req E Queued · code completion
Req F Queued · translation
104 / 128 tokens

This diagram uses a fixed step budget of 128 tokens (illustrative only). vLLM applies the same idea through max_num_batched_tokens, often in the thousands. The scheduler fills that cap with a mix of prefill and decode work. [8]

Without chunking: Req A’s 256-token prompt would consume the entire 128-token budget for 2 consecutive steps. Reqs B and C would not decode during those steps, and Reqs D through F would wait longer, increasing Time to First Token (TTFT) for queued work.

Chunking caps how much prefill runs per step, so decode is less likely to sit idle behind a long prompt. When prefill dominates load, operators can split the two phases across separate pools—see Disaggregated Prefill in the Modern Requirements section. [8] [2]
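The budget-sharing rule above can be sketched in a few lines, using the diagram's illustrative 128-token cap rather than real vLLM defaults (the real knob is max_num_batched_tokens, and the real scheduler is considerably more involved):

```python
BUDGET = 128  # illustrative per-step token cap (vLLM's knob: max_num_batched_tokens)

def plan_step(decode_reqs: int, prefill_remaining: int) -> tuple[int, int]:
    """Each decoding request needs exactly 1 token; leftover budget funds one prefill chunk."""
    decode_tokens = min(decode_reqs, BUDGET)
    prefill_chunk = min(prefill_remaining, BUDGET - decode_tokens)
    return decode_tokens, prefill_chunk

remaining, chunks = 256, []          # a long prompt, chunked across steps
while remaining > 0:
    decode, chunk = plan_step(decode_reqs=40, prefill_remaining=remaining)
    chunks.append(chunk)
    remaining -= chunk
print(chunks)  # [88, 88, 80] — decode keeps its 40 tokens on every step
```

Without the cap, the same prompt would consume two whole steps of budget and stall every decoding request for those steps; with it, prefill takes one extra step but streaming never pauses.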

11

Architecture Deep Dive Technical deep-dive

From "what happens when I send a prompt?" to the V1 engine internals. Built for people new to inference. [11]

In plain English

vLLM is structured like a well-run kitchen. An API server takes orders (your prompts), a scheduler decides which orders to cook next, and worker processes run the model on the GPU. Separating these roles lets vLLM handle many users at once without one slow request blocking the others. This section walks through each piece from the outside in. [11]

Start here

What happens when you send a prompt?

Serving a prompt means turning text into token IDs, running the model on the GPU to produce more tokens, and streaming decoded text back as tokens arrive. The rest of the vLLM layout speeds up and overlaps those stages. [11]

1

Prepare

Text → tokens. The tokenizer converts your prompt into numbers. Images and audio get encoded too.

2

Generate

In the basic autoregressive case, the model runs once per output token. This is a loop: every new token depends on the one before it. The KV cache stores past computations so only the new token needs processing.

loop
3

Stream

Each new token is converted back to text and streamed to the user in real time.

Most serving cost and latency sit in step 2. The architecture below shows how vLLM runs that loop efficiently under concurrency, building on the KV cache and PagedAttention concepts covered earlier. [11] [9]
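The three stages reduce to a short loop. The sketch below is a toy, not vLLM's API: ToyModel stands in for a real LLM, and its "KV cache" is just the list of tokens seen so far, so each decode step appends one entry instead of reprocessing the whole history:

```python
class ToyModel:
    """Stand-in for a real model; 'predicts' (token + 1) mod 5."""
    eos_id = 0
    def prefill(self, prompt_ids):
        return list(prompt_ids)                  # "KV cache": tokens processed so far
    def decode_one(self, token, kv_cache):
        kv_cache = kv_cache + [token]            # append, rather than recompute history
        return (token + 1) % 5, kv_cache

def generate(prompt_ids, model, max_new_tokens=32):
    """Toy autoregressive loop: one forward pass per new token."""
    kv_cache = model.prefill(prompt_ids)         # stages 1-2: process the prompt once
    out, token = [], prompt_ids[-1]
    for _ in range(max_new_tokens):
        token, kv_cache = model.decode_one(token, kv_cache)
        out.append(token)                        # stage 3: a server would stream here
        if token == model.eos_id:                # stop on the toy end-of-sequence token
            break
    return out

print(generate([1, 2, 3], ToyModel()))  # [4, 0]
```

Everything in the architecture below exists to run many copies of this loop concurrently without the per-request bookkeeping wasting GPU time.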

Interactive walkthrough

A prompt arrives

The user sends "The quick brown fox," four tokens. The model needs to continue this sequence, but it can't just look at the text. It must process every token through the full neural network to build an internal representation.

Inside vLLM V1

The Engine Core Loop

The autoregressive loop above is what the engine core drives at high frequency: a tight schedule, execute, update cycle on every step: [11]

1

Schedule

Picks which requests get GPU time this step and assigns a token budget. Each request tracks how many tokens it has computed vs. how many it needs. No rigid prefill/decode split. [8]

vllm/v1/core/sched/scheduler.py
2

Execute

Sends the scheduled batch to GPU workers via the executor. Each worker runs the model’s forward pass using fused attention backends (FlashAttention, FlashInfer, etc.) and CUDA graphs against the paged KV cache. [11]

vllm/v1/worker/gpu_model_runner.py
3

Update

Applies sampled tokens from the model output, records progress in the scheduler, frees finished requests, and streams detokenized text to the user. Then the loop repeats. [11]

vllm/v1/engine/core.py
Simplified from EngineCore.step()
scheduler_output = self.scheduler.schedule()
future = self.model_executor.execute_model(scheduler_output, non_block=True)
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
model_output = future.result()
engine_core_outputs = self.scheduler.update_from_output(
    scheduler_output, model_output
)

Order matches EngineCore.step(): after future.result(), the real code may call sample_tokens(grammar_output) if the model output is None, and it runs _process_aborts_queue() before update_from_output, inside logging context managers. The non_block=True call overlaps GPU work with grammar bitmask preparation. See vllm/v1/engine/core.py for the full branch logic. [11]

Full picture

V1 Multi-Process Architecture

V1 splits work across OS processes: the API server runs tokenization and streams responses while engine cores and GPU workers run the scheduler and forwards, connected with ZMQ as in the upstream architecture doc. That separation keeps CPU-side string work off the engine core’s busy loop; process isolation also limits how far a failure in one role spreads. [11]

API Edge Engine Core GPU Execution

The Cast of Characters

Each component has a specific job in the engine loop. Click to expand.

The Input Processor tokenizes text (and encodes images or audio when the model path requires it) in the API server process. The Output Processor detokenizes new tokens and streams them back as SSE events. Both run in the API process, which communicates with the engine core via ZMQ, so tokenization and detokenization stay off the core GPU scheduling path. [11]

Why it matters: Tokenization and detokenization are CPU-bound. Running them in a separate process from the GPU scheduler prevents them from adding latency to the critical engine loop.

vllm/v1/engine/input_processor.py vllm/v1/engine/output_processor.py

Decides which requests get GPU time each step. Tracks num_computed_tokens vs num_tokens per request with no rigid prefill/decode split, so chunked prefill, prefix caching, and speculative decoding share one scheduler. [8]

Why it matters: Scheduling policy and token budgets shape both TTFT and inter-token latency.

vllm/v1/core/sched/scheduler.py

Manages the block pool you saw in the PagedAttention section. It allocates and frees fixed-size blocks and enables prefix caching by hashing block contents. [13]

Why it matters: KV cache is the #1 memory consumer. This manager is why vLLM can serve more concurrent requests than naive implementations.

vllm/v1/core/kv_cache_manager.py

Runs the actual torch.nn.Module forward pass on each GPU. It prepares input tensors, replays CUDA graphs for speed, and coordinates with attention backends (FlashAttention, FlashInfer) to read and write the paged KV cache. [11]

Why it matters: This is where GPU time is spent. CUDA graph capture eliminates CPU launch overhead, making each decode step faster.

vllm/v1/worker/gpu_model_runner.py

Abstracts how scheduled work reaches GPU workers. ParallelConfig defaults to UniProc when world_size == 1 and to Multiproc (mp) for typical multi-GPU setups; Ray is used for multi-node or Ray-backed deployments. All implementations still expose the same execute_model surface to the engine core. [11]

Why it matters: EngineCore.step() stays the same; changing parallelism swaps the executor implementation behind model_executor.

vllm/v1/executor/

The scheduler orchestrates the concepts from the PagedAttention and Batching sections: it assigns tokens from a per-step budget with no rigid prefill/decode split, enabling chunked prefills, prefix caching, speculative decoding, and mixed batches in one loop. When KV cache space is exhausted, the scheduler preempts the lowest-priority request (V1 default: recompute). [8] For prefix caching details, see the design doc.

Illustrative Process Calculator

The architecture diagram above maps to real OS processes. This calculator shows a common V1 topology under the default assumption that API-server count tracks data-parallel replicas; adjust the inputs to see how process count changes under that model. [11] [12]

Splits each layer across GPUs. Use when a single model layer doesn’t fit on one GPU.
Splits layers into stages across GPUs. Use to scale beyond one node’s GPU interconnect.
Runs independent model replicas for higher throughput. Each replica handles its own requests.
API Servers (default)1
Engine Cores1
GPU Workers4
Total Processes6
Total GPUs4
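The calculator's arithmetic, under the same simplifying assumption (one API server and one engine core per data-parallel replica, plus TP × PP GPU workers per replica; real vLLM spawning logic has more cases):

```python
def v1_process_count(tp: int = 1, pp: int = 1, dp: int = 1) -> dict[str, int]:
    """Rough V1 process tally for the topology model used above. Illustrative only."""
    gpu_workers = tp * pp * dp          # one worker process per GPU
    return {
        "api_servers": dp,              # assumption: API servers track DP replicas
        "engine_cores": dp,
        "gpu_workers": gpu_workers,
        "total_processes": dp + dp + gpu_workers,
        "total_gpus": gpu_workers,
    }

print(v1_process_count(tp=4))  # the example above: 1 + 1 + 4 = 6 processes, 4 GPUs
```

Changing dp multiplies every row, which is why data parallelism scales throughput but not the maximum model size a single replica can hold.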
12

Performance Tuning Technical deep-dive

The tuning lab shows how concurrency and token budget affect throughput, latency, and KV pressure. [8]

In plain English

More simultaneous users raise overall throughput but can slow individual responses. The sliders below let you explore that tradeoff, and the table lists the flags that control it. [8] [21]

Latency vs Throughput Lab

Sliders are illustrative: they show which levers usually lift throughput, which smooth streaming, and which add preemption or OOM risk.

Throughput High
TTFT Medium
Streaming ITL Medium
KV Pressure Controlled

Balanced shared cluster

This profile is tuned for a shared chat service: enough token budget to keep throughput healthy, enough headroom to avoid constant preemption, and enough prefix reuse to reward repeated scaffolding.

  • Raise concurrency when requests are short and similar.
  • Lower token budget if chat streaming feels sticky or TTFT drifts.
  • Protect memory headroom when context length or multimodal assets vary.

Key Tuning Parameters

Lever Primary Effect Tradeoff How to explain it
--gpu-memory-utilization More GPU memory for the model executor (weights, runtime, and KV cache); range (0, 1], default 0.9 [8] Less memory headroom for spikes and variability Raise only when the workload is stable and OOM risk is well understood.
--max-num-seqs More sequences per scheduler step More contention for KV capacity and scheduler budget Best for short requests; lower it when contexts are long or heterogeneous.
--max-num-batched-tokens More work per scheduler step Larger values generally improve throughput and TTFT; smaller values often improve ITL when chunked prefill is on (vLLM V1 default when supported) Tune against your TTFT, ITL, and prefill mix; vLLM docs recommend trying values above 8192 for throughput on smaller models on large GPUs.
--max-model-len Maximum sequence length (prompt plus generated tokens) the engine allows KV grows with actual length, but a higher cap raises worst-case per-request memory Set the cap to the longest context you must serve; unused headroom still shapes KV reservation planning.
--tensor-parallel-size Fits larger models across GPUs More inter-GPU communication Use the smallest TP value that fits comfortably on the available hardware.
--quantization Lower memory footprint and cost Model-specific accuracy and compatibility tradeoffs Measure accuracy and latency on representative prompts; effects vary by scheme and kernel.
--enable-prefix-caching / --no-enable-prefix-caching If omitted, vLLM enables prefix caching when the loaded model supports it; pass --no-enable-prefix-caching to force it off [8] Extra cache bookkeeping with little value when prompts are unique Disable when reuse is rare; pass --enable-prefix-caching if you need to turn it back on after disabling.
--scheduling-policy Admission fairness and priority behavior Different tail-latency profiles under contention priority orders waiting work by request priority (lower numeric values first, then arrival time); fcfs is strict arrival order. Use priority when the API assigns priorities and you want that order under load.
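To reason about --max-model-len and --gpu-memory-utilization together, a back-of-envelope KV estimate helps. The formula below is the standard per-token KV size (the leading 2 covers K and V); the example model shape is hypothetical, Llama-style:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes for one token: K and V tensors across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical 8B-class model: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token)                        # 131072 bytes = 128 KiB per token

# One sequence at --max-model-len 8192:
print(per_token * 8192 / 2**30, "GiB")  # 1.0 GiB of KV for a single full-length request
```

Multiply the full-length figure by your worst-case concurrency and compare it against the KV budget left after weights to see whether a given --max-model-len and --max-num-seqs pairing is plausible before benchmarking.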

Measurement Tips

  • Separate TTFT, inter-token latency, and total throughput instead of reporting only one aggregate score.
  • Warmup, cache state, and request arrival pattern materially change results for dynamic batching systems.
  • Use a real prompt-length distribution, not a single toy prompt, when comparing engines.
  • Always record hardware, precision, context limit, and key scheduler settings alongside the results. [21] [23]
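The first tip in practice: report TTFT and ITL percentiles side by side instead of one aggregate. A minimal sketch with made-up per-request measurements (the nearest-rank percentile here is a simple stand-in for whatever your benchmark harness reports):

```python
import statistics

def pctl(samples: list[float], q: float) -> float:
    """Nearest-rank percentile; adequate for benchmark summaries."""
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, round(q / 100 * len(s)) - 1))
    return s[idx]

# Hypothetical per-request measurements from one run (seconds):
ttft = [0.21, 0.25, 0.30, 0.28, 0.95, 0.24, 0.26, 0.31, 0.29, 1.40]
itl  = [0.030, 0.032, 0.031, 0.045, 0.033, 0.030, 0.034, 0.120, 0.031, 0.032]

for name, xs in [("TTFT", ttft), ("ITL", itl)]:
    print(f"{name}: median={statistics.median(xs):.3f}s p95={pctl(xs, 95):.3f}s")
```

In this made-up run the medians look healthy while both p95 values are dominated by a couple of outliers, which is exactly the signal a single averaged score would hide.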

Practical Benchmarking with GuideLLM

The test dataset is the most overlooked tuning variable. Synthetic or toy prompts lead to misleading conclusions because vLLM performance depends on incoming request shapes and arrival patterns. [21]

  • Build test datasets that match your production input/output token distributions, prompt variability, and repeated text patterns (vLLM optimizes recurring prefixes).
  • Use GuideLLM to move from synthetic load testing to production reality: configurable request shapes, varying concurrency patterns, and captured TTFT / ITL / throughput metrics. [21]
  • Define P95 and P99 targets for TTFT and ITL before tuning, then sweep concurrency to find where those targets break. Adjust one lever at a time and re-measure.

Parallelism Strategy

For guidance on choosing between TP, PP, DP, EP, and CP, see the interactive parallelism chooser in the Deployment section. The --tensor-parallel-size row above is the most common tuning entry point. [4]

13

Quick Reference

Commands, HTTP API paths, and links to cited sources.

CLI

# Serve a model
vllm serve <model>

# Common options
vllm serve <model> \
  --tensor-parallel-size 4 \
  --max-model-len 4096

# Benchmark
vllm bench serve --model <model>

Commands above are drawn from the official serve and benchmark CLI docs. [7] [33]

OpenAI-Compatible API

# Chat completions
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<model>", "messages": [
    {"role": "user", "content": "Hello!"}
  ]}'

# List models
curl http://localhost:8000/v1/models

Paths follow the documented OpenAI-compatible server surface. [7]

Source Notes

  1. Aleksa Gordić (vLLM Blog), Inside vLLM: Anatomy of a High-Throughput LLM Inference System.
  2. vLLM docs, Disaggregated Prefilling (experimental).
  3. NVIDIA, Introducing New KV Cache Reuse Optimizations in NVIDIA TensorRT-LLM.
  4. Engineering at Meta, Scaling LLM Inference: Innovations in Tensor Parallelism, Context Parallelism, and Expert Parallelism.
  5. vLLM docs, Setup OpenTelemetry POC (OTLP tracing example, exporter protocol, Jaeger walkthrough).
  6. OpenAI API docs, Structured model outputs.
  7. vLLM docs, Server Arguments (vllm serve, OpenAI-compatible server flags, operational controls).
  8. vLLM docs, Optimization and Tuning (chunked prefill defaults, preemption, tuning parameters).
  9. Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP 2023.
  10. vLLM project, GitHub repository and README (project scope, feature overview, hardware and plugin links).
  11. vLLM docs, Architecture Overview (V1 engine roles, process topology, API server / engine core / worker split).
  12. vLLM docs, Parallelism and Scaling; see also Data Parallel Deployment, Expert Parallel Deployment, and Context Parallel Deployment.
  13. vLLM docs, Automatic Prefix Caching.
  14. vLLM docs, Supported Models.
  15. vLLM docs, Structured Outputs.
  16. vLLM docs, Deploying with KServe.
  17. Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models, OSDI 2022.
  18. SGLang project, GitHub repository.
  19. Artificial Analysis, Open Source Models.
  20. Hugging Face docs, Text Generation Inference (maintenance mode notice and downstream engine recommendations).
  21. Trevor Royer (Red Hat Developer), Practical strategies for vLLM performance tuning.
  22. vLLM Project, High Performance and Easy Deployment of vLLM in K8S with vLLM production-stack.
  23. Red Hat Developer, GuideLLM: Evaluate LLM deployments for real-world inference.
  24. Red Hat Developer, Optimize and deploy LLMs for production with OpenShift AI.
  25. vLLM Docs, Production Metrics.
  26. vLLM Docs, Prometheus and Grafana.
  27. vLLM Docs, Monitoring Dashboards.
  28. vLLM Docs, Security.
  29. vLLM Project, vLLM Router: A High-Performance and Prefill/Decode Aware Load Balancer for Large-scale Serving.
  30. Red Hat Developer, Accelerate multi-turn LLM workloads on OpenShift AI with llm-d intelligent routing.
  31. OpenAI, API Pricing.
  32. vLLM docs, LoRA Adapters.
  33. vLLM docs, vllm bench serve.
  34. vLLM docs, Speculative Decoding.