GPU time is expensive, and the serving layer can waste much of it. Your app feels slow; your cluster stalls after a handful of users. This guide covers LLM inference mechanics and where vLLM changes the picture.
Source snapshot: official docs and engineering posts reviewed in April 2026. [1] [2] [3] [4]
Pick the path that fits your role. Every section is self-contained, so you can skip ahead or stop early.
Understand costs, capabilities, and when vLLM fits.
Deploy, tune, and operate vLLM in production.
Understand the engine internals and scheduling.
Organizations buy expensive GPU hardware. Apps still feel slow and top out after a small number of concurrent users. Fast responses and efficient GPU use pull on the same memory and compute budget. This section names that tradeoff and how vLLM addresses it.
When your application uses an AI model to answer questions, write text, or process documents, every response requires GPU compute time. vLLM is an open-source inference and serving engine built for high-throughput, memory-efficient LLM serving. Teams adopt it when they want more control over hardware choice, deployment shape, and cost model than a managed API provides. [10]
The sections below explain what happens at each stage and where vLLM makes the difference.
Teams care about how fast the first token arrives and how smoothly the stream feels. They also track prompt reuse, multimodal inputs, structured outputs, and fairness on shared clusters. A single tokens-per-second figure rarely tells the whole story.
Prefill is prompt-heavy and compute-bound. Decode adds one token at a time while the user watches; it is often memory-bandwidth-bound. A serving stack balances both phases continuously. [4]
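The compute-versus-bandwidth split can be sanity-checked with a back-of-envelope calculation. The sketch below uses hypothetical numbers (a 7B-parameter model in FP16) and ignores KV-cache traffic; it is illustrative, not a benchmark:

```python
# Back-of-envelope: why prefill is compute-bound and decode is
# memory-bandwidth-bound. Hypothetical 7B-parameter model in FP16.
params = 7e9
bytes_per_param = 2  # FP16

def arithmetic_intensity(tokens_per_step: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Roughly 2 * params FLOPs per token; the full weight set is read
    once per step regardless of how many tokens are in the batch.
    """
    flops = 2 * params * tokens_per_step
    bytes_moved = params * bytes_per_param  # weights read once per step
    return flops / bytes_moved

prefill = arithmetic_intensity(2048)  # 2048-token prompt in one pass
decode = arithmetic_intensity(1)      # one new token per step per sequence

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# Decode's ~1 FLOP/byte is far below what GPUs need to stay busy,
# which is why batching many decode requests together matters.
```

The gap between the two numbers is the whole motivation for continuous batching: a single decode stream cannot keep the arithmetic units fed.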
Long context, high concurrency, and large models all squeeze KV cache space. Design choices decide whether GPU memory holds useful tokens or loses capacity to fragmentation, eviction pressure, or preemption.
vLLM is an open-source inference and serving engine for high-throughput, memory-efficient LLM workloads. PagedAttention tightens KV memory use. [9] Continuous batching keeps accelerators busy under mixed load. The runtime also ships prefix caching, structured-output support, multimodal model support, multi-GPU parallelism, and OpenAI-compatible serving. [13] [15] [14] [12] [7]
Targets lower TTFT and steadier streaming by reducing KV waste and prioritizing decode work under mixed load. [9] [8]
Interactive serving needs a short time to first token (TTFT), steady streaming gaps between tokens (inter-token latency), and predictable latency under concurrent load. The mental model below ties together prefill, decode, and caching, and shows how vLLM addresses each requirement.
Building on the prefill/decode tradeoff above, prompt length, concurrency, prefix reuse, multimodal inputs, and latency targets usually decide whether batching, KV memory, parallelism, or disaggregated serving matter most. Use the topic tabs below to explore each technique and when it applies to your workload.
Open-source serving engine focused on high-throughput inference, broad model support, OpenAI-compatible APIs, and features such as PagedAttention, unified scheduling, and prefix caching. [10] [14] [13]
Hugging Face documents TGI as being in maintenance mode and recommends newer engines such as vLLM or SGLang for future serving-stack work where they fit. [20]
NVIDIA’s TensorRT-LLM stack: builder and runtime focused on CUDA GPUs, fused kernels, and detailed KV cache and compilation controls for teams already standardized on NVIDIA inference. [3]
SGLang is an active open-source serving stack whose official materials emphasize RadixAttention, structured generation, and gateway/routing features. [18]
These questions cover most vLLM fit and sizing reviews.
Is the user experience dominated by first token, smooth streaming, or offline throughput?
Are prompts short and repetitive, or long, bursty, and retrieval-heavy?
Is this mostly chat, long-form generation, multimodal requests, or structured/tool-driven flows?
Single node, shared cluster, air-gapped enterprise, or multi-node fleet with strict operational controls?
Expand a card for the full answer.
The supported-models documentation covers native text, multimodal, and pooling architectures for chat, code, vision, audio, embedding, scoring, and retrieval workloads. [14] What follows is a concise feature matrix and hardware map for serving.
The most common deployment target: frontier open models, chat assistants, copilots, and code generation stacks.
Sparse MoE stacks need explicit expert-parallel layout and memory planning so routing overhead does not erase the parameter savings.
Multimodal traffic changes preprocessing and pushes memory and time to first token harder than text-only chat.
Pooling, embedding, and ranking endpoints sit beside chat completions for Retrieval-Augmented Generation (RAG) and search backends.
Plan around dense text, MoE, multimodal inputs, or pooling instead of assuming a single model name tells the whole story.
| Feature | Description | Key Benefit | What this means for you |
|---|---|---|---|
| Quantization | GPTQ, AWQ, FP8, BitsAndBytes, GGUF, TorchAO, and related schemes are documented in the project and surrounding vllm-project materials [10] | Run larger models on fewer GPUs | Run the same model on less expensive hardware |
| LoRA Adapters | Runtime loading and unloading of LoRA weights [32] | Serve many fine-tuned variants from one base | One deployment serves multiple customized models |
| Structured Output | JSON schema constraints plus regex, choice, grammar, and structural-tag guards [15] [6] | Safer integration with tools and downstream systems | The model's output is constrained to a format your code can parse |
| Prefix Caching | Automatic hash-based KV block reuse for shared prefixes [13] | Better TTFT for repeated system prompts and shared context | Repeated instructions (system prompts) are processed once instead of every time |
| Multimodal | Vision-language and audio-capable models [14] | Multimodal and text paths in one serving stack | Process images, audio, and text with the same server |
| CUDA Graphs | Captured execution graphs for repeat decode shapes and other stable execution paths [10] | Lower per-step launch overhead on stable workloads | Faster responses on steady, predictable traffic |
| Feature | Description | Key Benefit |
|---|---|---|
| Speculative Decoding | EAGLE, MTP, draft, n-gram, PARD, suffix, and MLP-style speculators [34] | Lower wall-clock latency when acceptance rate is healthy |
| Expert / MoE Support | Serve sparse MoE architectures such as Mixtral and DeepSeek families [14] [12] | Use hardware more efficiently for large sparse models |
| Disaggregated Prefill (experimental) | Separate prompt-ingest and decode pools when workloads justify it | Protect TTFT-sensitive traffic under prompt-heavy load |
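The "healthy acceptance rate" caveat on speculative decoding can be made concrete. Under the standard independence assumption from the speculative-decoding literature (not vLLM's internal accounting), drafting k tokens with per-token acceptance probability p yields an expected 1 + p + p² + … + pᵏ tokens per target-model step:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens produced per target-model forward pass when
    drafting k tokens with independent per-token acceptance rate p.
    Geometric series: 1 + p + p^2 + ... + p^k."""
    return sum(p ** i for i in range(k + 1))

# With 80% acceptance and 4 draft tokens, each expensive target-model
# step yields ~3.36 tokens instead of 1; at 30% acceptance, only ~1.43.
print(round(expected_tokens_per_step(0.8, 4), 2))
print(round(expected_tokens_per_step(0.3, 4), 2))
```

When acceptance is low, the draft model's overhead can exceed the savings, which is why the table conditions the latency win on a healthy acceptance rate.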
CUDA, ROCm, CPU, and XPU installers target the main repository; accelerator-specific code also ships in separate vllm-project repositories linked below. [10]
Mainline GPU path for high-throughput serving across common enterprise and hyperscaler fleets.
repo: vllm
ROCm-backed deployments for AMD accelerator fleets, including MI-series hardware.
repo: vllm
Intel Data Center and Arc GPUs through the XPU backend in the main repo plus vllm-xpu-kernels; typical installs use a source build (Intel publishes Docker images).
repo: vllm
Linux x86_64 and Arm AArch64 ship prebuilt CPU wheels. Apple Silicon and IBM Z CPU targets are experimental and need source builds; Apple GPU inference uses the vllm-metal plugin in the grid below.
repo: vllm
Cloud TPU support via the tpu-inference / vllm-tpu plugin package, integrated with the main vLLM install.
repo: tpu-inference
Trainium and Inferentia support via a dedicated vLLM organization plugin for AWS deployments.
repo: vllm-neuron
Community-maintained vLLM organization plugin for Huawei Ascend deployments.
repo: vllm-ascend
Dedicated org plugin for Gaudi/HPU environments and their operator + runtime differences.
repo: vllm-gaudi
Community-maintained vLLM organization plugin for IBM Spyre AIU acceleration.
repo: vllm-spyre
Community-maintained Metal plugin for Mac workflows and Apple unified-memory serving experiments.
repo: vllm-metal
The economics of self-hosted inference depend on infrastructure cost, operational overhead, and what an API provider charges for the same workload. Use representative benchmarking instead of generic break-even claims. [21] [31]
Self-hosted inference trades metered API pricing for capacity you manage. Break-even depends on measured throughput, utilization, infrastructure cost, and operations. GuideLLM and vLLM benchmarks provide the throughput side of that equation, but the answer is still workload-specific. [21] [23] [31]
Raw GPU-hour cost is only one input. The more useful metric is an effective cost per token or per request: take your real hourly or amortized infrastructure cost and divide it by measured throughput under representative prompts and concurrency. [21] [23]
| Input | How to Measure | Why It Matters |
|---|---|---|
| Infrastructure cost | Use your real hourly GPU rate, reserved-instance rate, or amortized on-prem cost. | This is the numerator in any self-hosted cost-per-token calculation. |
| Measured throughput | Benchmark on representative prompts, outputs, and concurrency with GuideLLM or vLLM tools. | Throughput determines how much useful work that infrastructure cost actually buys. [21] [23] |
| Utilization and duty cycle | Estimate how often the GPUs are busy versus idle across your real traffic pattern. | Idle capacity can dominate the economics of spiky or low-volume workloads. |
| Operational overhead | Include deployment, monitoring, patching, on-call, and support contracts. | Self-hosting is not only a hardware decision. [22] [24] |
| API pricing baseline | Use the provider’s published token pricing for the model and tier you are actually comparing against. | This is the baseline alternative to self-hosting. [31] |
A practical self-hosted comparison is: effective cost per million tokens = hourly or amortized infrastructure cost / measured tokens per second / 3,600 × 1,000,000. Use representative workloads, not toy prompts, when you collect the throughput number. [21] [23]
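That formula is easy to script. The inputs below are placeholders, not real GPU or API pricing:

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """Effective $/1M tokens: infrastructure cost divided by measured
    throughput, scaled from seconds to hours and tokens to millions."""
    return hourly_cost_usd / tokens_per_sec / 3600 * 1_000_000

# Hypothetical: a $4.00/hr GPU sustaining a measured 1,000 tokens/sec
# under representative prompts and concurrency.
print(round(cost_per_million_tokens(4.00, 1000), 2))  # ≈ 1.11 ($/1M tokens)
```

Compare the result against the provider's published per-token pricing for the same model tier, and remember to fold in utilization and operational overhead from the table above.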
Self-hosting is not always the right call. Managed APIs often win when traffic is sporadic, when proprietary model access matters more than infrastructure control, or when GPU operations overhead would dominate the savings. Keep the decision empirical: measure representative throughput, compare it to published API pricing, and include support and operations in the model. [21] [31]
Divide your hourly or amortized infrastructure cost by measured throughput (tokens/sec) to get cost per million tokens, then compare that against the provider’s published token pricing for the model tier you are using. The Performance Tuning section covers representative benchmarking methodology with GuideLLM in more detail. [21] [31]
A structured path from evaluation to production. Each phase maps to documented vLLM deployment modes, benchmarking steps, and support options.
You can start with vllm serve <model> for local evaluation and then move into Kubernetes-based deployment patterns documented by vLLM and KServe. Three phases, each with concrete exit criteria, take you from a single-GPU proof of concept to production. [7] [16]
Do prompts and responses need to stay on your network? Self-hosted vLLM lets inference traffic stay on infrastructure you control, but the security guide also notes that multi-node communications are insecure by default and should be isolated on trusted networks. [28]
The vLLM project documents support for CUDA, ROCm, Intel XPU, CPU, and additional accelerator plugins in the wider vllm-project organization. If you already run modern accelerator fleets, there is usually a documented path to start evaluating. [10]
Complexity ranges from a one-liner (vllm serve <model>) to multi-node tensor/pipeline/data parallelism. Managed platforms like Red Hat OpenShift AI handle the infrastructure layer. [24]
Each role owns a different piece of the deployment. Get them aligned before Phase 2.
GPU provisioning, Kubernetes integration, Helm charts, shared memory and storage configuration. Concerned with resource limits, node autoscaling, and multi-tenant isolation.
Data residency, endpoint protection (reverse proxy, API-key limitations), network isolation for inter-node ZMQ and PyTorch Distributed traffic, and pre-production hardening checks. [28]
Model selection from the supported-model catalog, quantization options, accuracy validation, and LoRA adapter management. [14]
TCO analysis from the previous section, support contracts, GPU lease-vs-buy decisions, and measurable success criteria. [24]
Tie your go/no-go decision to metrics vLLM exposes:
- vllm:time_to_first_token_seconds and vllm:inter_token_latency_seconds Prometheus histograms stay within your latency targets [25]
- /health endpoint availability meets your target, paired with platform probes and alerting [28]
- Graceful shutdown works as configured (--shutdown-timeout) [7] [22]

vLLM runs the same engine family on one GPU, behind an HTTP server, or on Kubernetes. That gives teams one serving stack with familiar model and runtime flags across those shapes instead of maintaining separate engines per environment. [7] [16]
Upstream docs cover running vLLM on Kubernetes with KServe (Hugging Face serving runtime or LLMInferenceService). [16] OpenShift AI documents KServe-based model serving; you apply the same integration patterns there when exposing vLLM as a served model. [24]
Many organizations publish quantized or otherwise optimized checkpoints on Hugging Face. Treat those as starting points for evaluation rather than automatic production approval, and validate them on your own prompts, latency targets, and safety requirements.
vLLM is built for throughput; production security requires layers around it. The Production Hardening section covers reverse proxy setup, network isolation, SSRF protection, FIPS-sensitive deployment considerations, and the full pre-production security checklist. [28]
The LLM class runs generation in-process with no server overhead.
from vllm import LLM, SamplingParams

# Load the model once; subsequent generate() calls reuse it in-process.
llm = LLM(model="RedHatAI/gemma-4-31B-it-FP8-block")
params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Explain Kubernetes in one paragraph."], params)
print(outputs[0].outputs[0].text)
One command starts an OpenAI-compatible HTTP server with chat, completions, and model-listing endpoints; embedding and pooling flows are documented separately in the supported-model and pooling docs. [7] [14]
vllm serve RedHatAI/gemma-4-31B-it-FP8-block
Deploy on Kubernetes or OpenShift with GPU node selectors, resource limits, and room to grow into multi-node topologies.
# Minimal example only; add probes, shared memory, auth, and model storage for production.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["vllm", "serve",
                  "RedHatAI/gemma-4-31B-it-FP8-block"]
        resources:
          limits:
            nvidia.com/gpu: 1
Each strategy below lists topology, fit, tradeoffs, and the matching CLI flags from the serving docs. [12]
Split each layer across GPUs when the model fits on one node but not on one device.
--tensor-parallel-size N
Watch interconnect bandwidth and collective overhead.
Combine parallelism flags with load balancing and routing for production scale. [12]
| Pattern | What It Does | When to Use |
|---|---|---|
| Parallelism flags | `-tp`, `-pp`, `-dp` plus expert and context parallel | Scale a single model across multiple GPUs. See the parallelism chooser above [12] |
| Multi-API-server | `--api-server-count N` runs multiple frontend processes sharing one engine. Auto-configures `PROMETHEUS_MULTIPROC_DIR` for shared metrics [12] | When the API frontend is the bottleneck (high request parsing overhead, many concurrent connections) |
| vLLM Router | Production-grade Rust load balancer with consistent hashing for KV cache reuse, prefill/decode disaggregation support, K8s service discovery, and circuit breakers [29] | Multi-replica deployments needing intelligent request routing and cache affinity |
| llm-d | Red Hat’s KV-cache-aware routing layer for multi-replica vLLM on OpenShift [30] | Multi-turn workloads where routing to the replica holding the conversation’s KV cache avoids recomputation |
| Pattern | How | Tradeoff |
|---|---|---|
| Served name aliases | `--served-model-name name1 name2` — one engine, multiple API model IDs [7] | No overhead; useful for API compatibility or gradual migration |
| Runtime LoRA adapters | `--lora-modules` at startup, plus `/v1/load_lora_adapter` and `/v1/unload_lora_adapter` at runtime (requires `VLLM_ALLOW_RUNTIME_LORA_UPDATING=1`) [32] | Share one base model across fine-tuned variants; adapter count limited by `--max-loras` and GPU memory |
| Separate deployments + routing | One Deployment per model, fronted by vLLM Router or llm-d [29] [30] | Full isolation; requires more GPUs but simplifies per-model scaling and upgrades |
- Mount persistent model storage at /root/.cache/huggingface so the model is downloaded once and persisted across pod restarts. Shown in official KServe and Kubernetes-oriented deployment examples [16]

vLLM exposes Prometheus metrics on the /metrics endpoint of the API server, supports OpenTelemetry tracing, and ships pre-built Grafana dashboards. The metrics below cover queue health, memory pressure, latency percentiles, and request outcomes. [25] [5] [27]
Queue depth tells you whether traffic is backing up. KV cache usage warns you before memory pressure triggers evictions. TTFT and ITL are the latency numbers your users feel. The troubleshooting tree below pairs each symptom with the metric to check and the flag to change. [25] [8]
These metrics are defined in vLLM’s PrometheusStatLogger and documented in the official Production Metrics page. [25]
| Metric | Type | What It Means | Alert When |
|---|---|---|---|
| `vllm:num_requests_running` | Gauge | Requests currently in model execution batches | Sustained at `--max-num-seqs` ceiling |
| `vllm:num_requests_waiting` | Gauge | Requests queued for scheduling (capacity + deferred) | Growing queue (>0 sustained) |
| `vllm:kv_cache_usage_perc` | Gauge | KV cache utilization (1.0 = 100%) | >0.9 — approaching preemption threshold |
| `vllm:num_preemptions` | Counter | Cumulative preemptions (requests evicted from KV cache) | Rate >0 — requests being recomputed |
| `vllm:time_to_first_token_seconds` | Histogram | Time from request arrival to first generated token | p99 exceeds your TTFT SLO |
| `vllm:inter_token_latency_seconds` | Histogram | Gap between consecutive output tokens (streaming smoothness) | p99 exceeds your ITL SLO |
| `vllm:e2e_request_latency_seconds` | Histogram | Total request duration from arrival to final token | p99 exceeds end-to-end SLO |
| `vllm:request_queue_time_seconds` | Histogram | Time a request spent in the WAITING state before scheduling | Growing queuing latency |
| `vllm:prefix_cache_hits` / `queries` | Counter | Prefix cache token hits vs. queries | Low hit ratio = wasted KV memory |
| `vllm:request_success` | Counter | Count of successfully processed requests | Success rate drops against your normal request volume baseline |
- --kv-cache-metrics enables vllm:kv_block_lifetime_seconds, vllm:kv_block_idle_before_evict_seconds, and vllm:kv_block_reuse_gap_seconds — useful for diagnosing cache churn and sizing
- --enable-mfu-metrics enables vllm:estimated_flops_per_gpu_total — Model FLOPs Utilization for hardware efficiency analysis
- Speculative decoding metrics (vllm:spec_decode_num_accepted_tokens, acceptance rate by draft position) are registered automatically when speculative decoding is configured

Four Grafana rows built from the metrics above. The vLLM Monitoring Dashboards docs and the production-stack provide pre-built versions. [27] [22]
rate(vllm:generation_tokens[1m]) and rate(vllm:prompt_tokens[1m]). Shows decode and prefill throughput in tokens per second.
histogram_quantile(0.99, ...) over vllm:time_to_first_token_seconds, inter_token_latency_seconds, and e2e_request_latency_seconds. The three numbers users feel.
vllm:kv_cache_usage_perc, rate(vllm:num_preemptions[5m]), and prefix cache hit ratio. Warns before memory pressure causes evictions.
vllm:num_requests_running, vllm:num_requests_waiting, and num_requests_waiting_by_reason (split by capacity vs. deferred).
vLLM ships an official OTLP tracing example and documents trace export through --otlp-traces-endpoint plus related environment variables. [5]
| Flag / Env Var | What It Does |
|---|---|
| `--otlp-traces-endpoint <URL>` | Sends traces to an OTLP collector (gRPC by default, or HTTP/protobuf via `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`) |
| `--collect-detailed-traces {model,worker,all}` | Enables per-module spans for model execution or worker-level tracing (requires OTLP endpoint) |
| `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` | Set to `http/protobuf` for HTTP export instead of the default `grpc` |
With an OTLP collector configured, request traces can flow into Jaeger or any other OTLP-compatible backend you already run. [5]
Each branch pairs a Prometheus metric with the configuration flag to change.
**High TTFT**
- Check `vllm:num_requests_waiting` — is traffic queueing? If yes, scale replicas or increase `--max-num-seqs`
- Check the `vllm:request_prompt_tokens` histogram — are prompts unusually long? Long prefills are compute-bound:
  - Increase `--tensor-parallel-size` to split prefill across GPUs [12]
  - Tune `--max-num-batched-tokens` to bound the per-step prefill budget [8]

**Preemptions and memory pressure**
- Check `vllm:kv_cache_usage_perc` — is it above 0.9?
  - Adjust `--gpu-memory-utilization` (default 0.9) to leave headroom for activation memory
  - Lower `--max-model-len` to cap KV cache per sequence
  - Use quantization (`--quantization awq`, `gptq`, or `fp8`) to shrink model weights and free memory for KV [8]

**Choppy streaming**
- Check `vllm:inter_token_latency_seconds` p99 for spikes
- Check `vllm:iteration_tokens_total` — large batches per step increase per-token decode time
  - Lower `--max-num-batched-tokens` to reduce batch size, trading throughput for smoother streaming [8]

**Requests queueing**
- Watch `vllm:request_queue_time_seconds` for growing queue latency
- Check `vllm:num_requests_waiting_by_reason` — is the bottleneck capacity (not enough GPU) or deferred (transient constraints like LoRA budget)? If capacity, scale replicas or increase `--max-num-seqs`

**Low prefix cache hit ratio**
- Compute `rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m])`
- If repeated system prompts are common, confirm `--enable-prefix-caching` is on [13]

For health probe configuration, graceful shutdown, resource limits, and logging controls, see the Resilience Checklist in Production Hardening.
A pre-production checklist drawn from vLLM’s security guide, official Kubernetes examples, and source code. [28]
vLLM runs out of the box. Running it safely in production takes deliberate configuration. Each item below links to the flag or doc that implements it. [28] [7]
Items are documented in vLLM’s security guide plus the official server arguments and deployment examples. [28] [7] [16]
| Item | Why | How |
|---|---|---|
| Deploy behind a reverse proxy | `--api-key` only protects `/v1` endpoints. Endpoints like `/invocations` and `/pooling` are unprotected; `/pause` and `/resume` are also unprotected but only exist when `VLLM_SERVER_DEV_MODE=1` is set. The security guide states: “The most effective approach is to deploy vLLM behind a reverse proxy.” [28] | nginx, Envoy, or a Kubernetes Gateway that allowlists only the endpoints you need and adds rate limiting |
| API key authentication | Bearer token auth for `/v1` endpoints | `--api-key <token>` or `VLLM_API_KEY` env var. Note: only protects the `/v1` path prefix; auth middleware skips other paths |
| TLS termination | Encrypt client-to-server traffic | `--ssl-keyfile`, `--ssl-certfile`, `--ssl-ca-certs`. For TLS 1.2 cipher control: `--ssl-ciphers` |
| FIPS-sensitive environments | Some regulated deployments need approved cryptographic settings and validation. | Use the documented hashing and TLS-related settings from the security guide, then validate the full deployment with your compliance team before claiming FIPS readiness. [28] |
| Network isolation | All inter-node traffic (PyTorch Distributed, KV cache transfer, ZMQ between API server and engine core) is insecure and unencrypted by default | Deploy on an isolated network segment. Use Kubernetes NetworkPolicies. Set `VLLM_HOST_IP` to a specific interface. Configure firewalls to block all ports except the API server |
| Disable dev mode | `VLLM_SERVER_DEV_MODE=1` exposes `/collective_rpc` (arbitrary RPC execution), cache resets, and sleep endpoints | Never set `VLLM_SERVER_DEV_MODE=1` in production. Never enable profiler endpoints (`--profiler-config`) in production |
| SSRF protection | Malicious users can supply URLs targeting internal services or cloud metadata endpoints | `--allowed-media-domains <domains>` to restrict media URL fetching. Set `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to prevent redirect bypass |
| Request parameter limits | The `n` parameter can cause resource exhaustion if set very high | Set `VLLM_MAX_N_SEQUENCES` to a deployment-appropriate cap for public-facing traffic rather than leaving the risk unbounded. [28] |
| Item | Why | How |
|---|---|---|
| Resource limits and requests | Prevent noisy-neighbor issues and ensure GPU scheduling | K8s: set resources.limits (CPU, memory, nvidia.com/gpu) and resources.requests. Mount /dev/shm as emptyDir: { medium: Memory, sizeLimit: "2Gi" } for tensor parallel shared memory [16] |
| Health endpoint | `GET /health` gives the orchestrator a simple liveness and readiness signal. | Use it for both liveness and readiness probes, and tune initial delays and thresholds for the model load time you actually observe. |
| Graceful shutdown | Without a drain window, restarts can interrupt in-flight requests. | Set --shutdown-timeout N to give the server time to finish or drain active work before exit. [7] |
| Logging controls | Production logs should be informative without leaking prompts or overwhelming storage | --max-log-len to bound logged prompt/output size. --disable-access-log-for-endpoints /health,/metrics to suppress probe noise. --enable-log-requests is off by default [7] |
For scaling patterns, multi-model serving, and model storage options, see the Deployment Options section.
The core idea: manage KV cache with the same block-and-table structure as virtual memory, so variable-length sequences and concurrent requests use a fixed GPU memory budget more efficiently. [9]
When a model is answering multiple questions at once, it needs to remember context for each conversation. The straightforward approach reserves a big chunk of GPU memory up front for each request and wastes whatever goes unused. PagedAttention organizes that memory more efficiently—like how your computer's operating system manages RAM—so the same GPU can handle more conversations simultaneously without running out of space. [9]
The steps below replay a short burst of requests, contrasting fixed contiguous reservations with drawing fixed-size blocks from a pool only as each sequence grows.
Each request's logical KV blocks map into a shared physical block pool on the GPU, just as an OS maps virtual pages to RAM frames. [9] Sequences grow, share prefix storage, and release blocks on completion without copying KV tensors.
Req A arrives. Blocks land wherever space exists. No contiguous reservation needed.
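The block-table idea can be sketched as a toy allocator. This is a simplification of what vLLM's KV cache manager does (names and sizes here are illustrative, not vLLM's actual code):

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM uses a similar fixed block size)

class BlockPool:
    """Toy physical block pool: sequences draw blocks on demand and
    return them on completion, so no contiguous reservation is needed."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def append_token(self, req_id: str, pos: int) -> None:
        table = self.block_tables.setdefault(req_id, [])
        if pos % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free.pop())  # lands wherever space exists
        # a real engine writes K/V tensors into table[-1] at offset pos % BLOCK_SIZE

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks without copying KV data."""
        self.free.extend(self.block_tables.pop(req_id))

pool = BlockPool(num_blocks=64)
for pos in range(40):                    # a 40-token sequence grows token by token
    pool.append_token("req-A", pos)
print(len(pool.block_tables["req-A"]))   # 3 blocks (ceil(40/16)), not a worst-case reservation
```

The per-request block table is the analogue of a page table: logical block i of a sequence can live at any physical index, and freeing is just returning indices to the pool.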
Iteration-level schedulers run one forward pass per round and can change which requests are in the batch, instead of holding a fixed batch until every request finishes. [17] [8]
Older systems process requests in fixed groups: if one user's question takes longer, everyone else in the group waits. Continuous batching lets vLLM swap finished requests out and new requests in on every processing step, so the GPU stays busy and no single slow response blocks the rest. Chunked prefill breaks very long prompts into smaller pieces so they do not monopolize the processor while other users are waiting for their next word. [17] [8]
Scrub through scheduler steps and compare static batching versus vLLM-style continuous admission.
Long prompts span several steps so decode rows still advance in the same batch.
Req A has a 256-token prompt, but the scheduler only processes 64 tokens of it this step, leaving room for Reqs B and C to keep streaming their responses.
This diagram uses a fixed step budget of 128 tokens (illustrative only). vLLM applies the same idea through max_num_batched_tokens, often in the thousands. The scheduler fills that cap with a mix of prefill and decode work. [8]
Chunking caps how much prefill runs per step, so decode is less likely to sit idle behind a long prompt. When prefill dominates load, operators can split the two phases across separate pools—see Disaggregated Prefill in the Modern Requirements section. [8] [2]
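A stripped-down version of that per-step token budget looks like the sketch below. It is illustrative only — vLLM's real scheduler (vllm/v1/core/sched/scheduler.py) handles far more cases — but it shows decode work admitted first and long prompts chunked into the leftover budget:

```python
MAX_NUM_BATCHED_TOKENS = 128  # per-step budget (vLLM's is often in the thousands)

def schedule_step(requests: list[dict]) -> dict[str, int]:
    """One scheduler iteration: assign each request a slice of the token
    budget. Decode requests need 1 token each; prefill requests take
    whatever budget remains, chunked across steps for long prompts."""
    budget = MAX_NUM_BATCHED_TOKENS
    plan: dict[str, int] = {}
    # Decode first: every running request advances by one token.
    for r in requests:
        remaining = r["num_tokens"] - r["computed"]
        if r["computed"] >= r["prompt_len"] and remaining > 0 and budget > 0:
            plan[r["id"]] = 1
            budget -= 1
    # Then prefill: chunk long prompts into the leftover budget.
    for r in requests:
        remaining_prefill = r["prompt_len"] - r["computed"]
        if remaining_prefill > 0 and budget > 0:
            chunk = min(remaining_prefill, budget)
            plan[r["id"]] = chunk
            budget -= chunk
    return plan

reqs = [
    {"id": "A", "prompt_len": 256, "num_tokens": 300, "computed": 0},  # new long prompt
    {"id": "B", "prompt_len": 8, "num_tokens": 40, "computed": 20},    # mid-decode
    {"id": "C", "prompt_len": 8, "num_tokens": 40, "computed": 12},    # mid-decode
]
print(schedule_step(reqs))  # {'B': 1, 'C': 1, 'A': 126}: decode keeps streaming, prefill is chunked
```

Because decode rows are admitted before prefill, request A's 256-token prompt spans multiple steps instead of stalling B's and C's streams — the chunked-prefill behavior described above.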
From "what happens when I send a prompt?" to the V1 engine internals. Built for people new to inference. [11]
vLLM is structured like a well-run kitchen. An API server takes orders (your prompts), a scheduler decides which orders to cook next, and worker processes run the model on the GPU. Separating these roles lets vLLM handle many users at once without one slow request blocking the others. This section walks through each piece from the outside in. [11]
Serving a prompt means turning text into token IDs, running the model on the GPU to produce more tokens, and streaming decoded text back as tokens arrive. The rest of the vLLM layout speeds up and overlaps those stages. [11]
Text → tokens. The tokenizer converts your prompt into numbers. Images and audio get encoded too.
In the basic autoregressive case, the model runs once per output token. This is a loop: every new token depends on the one before it. The KV cache stores past computations so only the new token needs processing.
Each new token is converted back to text and streamed to the user in real time.
Most serving cost and latency sit in step 2. The architecture below shows how vLLM runs that loop efficiently under concurrency, building on the KV cache and PagedAttention concepts covered earlier. [11] [9]
The user sends "The quick brown fox," four tokens. The model needs to continue this sequence, but it can't just look at the text. It must process every token through the full neural network to build an internal representation.
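The prefill-then-decode shape of step 2 can be sketched as a loop. The `forward` function here is a toy stand-in for a neural network forward pass; the point is the call pattern, not the math:

```python
def generate(prompt_tokens: list[int], max_new_tokens: int) -> tuple[list[int], int]:
    """Toy autoregressive loop. forward_calls counts model invocations:
    one pass over the whole prompt (prefill), then one per new token
    (decode), with kv_cache standing in for cached past computations."""
    kv_cache: list[int] = []
    forward_calls = 0

    def forward(new_tokens: list[int]) -> int:
        nonlocal forward_calls
        forward_calls += 1
        kv_cache.extend(new_tokens)     # cache state for every token seen so far
        return sum(kv_cache) % 50_000   # toy stand-in for "predict next token"

    next_tok = forward(prompt_tokens)   # prefill: all prompt tokens in one pass
    out = [next_tok]
    for _ in range(max_new_tokens - 1): # decode: one token per pass
        next_tok = forward([next_tok])
        out.append(next_tok)
    return out, forward_calls

tokens, calls = generate([464, 2068, 7586, 21831], max_new_tokens=8)
print(calls)  # 8: one prefill pass + seven decode passes
```

Only the new token is processed on each decode pass because the KV cache already holds everything before it — which is exactly why decode is cheap in compute but expensive in memory traffic.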
The autoregressive loop above is what the engine core drives at high frequency: a tight schedule, execute, update cycle on every step: [11]
Picks which requests get GPU time this step and assigns a token budget. Each request tracks how many tokens it has computed vs. how many it needs. No rigid prefill/decode split. [8]
vllm/v1/core/sched/scheduler.py
Sends the scheduled batch to GPU workers via the executor. Each worker runs the model’s forward pass using fused attention backends (FlashAttention, FlashInfer, etc.) and CUDA graphs against the paged KV cache. [11]
vllm/v1/worker/gpu_model_runner.py
Applies sampled tokens from the model output, records progress in the scheduler, frees finished requests, and streams detokenized text to the user. Then the loop repeats. [11]
vllm/v1/engine/core.py
scheduler_output = self.scheduler.schedule()  # pick this step's batch
future = self.model_executor.execute_model(scheduler_output, non_block=True)
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # overlaps with GPU work
model_output = future.result()  # wait for the forward pass to finish
engine_core_outputs = self.scheduler.update_from_output(
    scheduler_output, model_output
)
Order matches EngineCore.step(): after future.result(), the real code may call sample_tokens(grammar_output) if the model output is None, and it runs _process_aborts_queue() before update_from_output, inside logging context managers. The non_block=True call overlaps GPU work with grammar bitmask preparation. See vllm/v1/engine/core.py for the full branch logic. [11]
V1 splits work across OS processes: the API server runs tokenization and streams responses while engine cores and GPU workers run the scheduler and forwards, connected with ZMQ as in the upstream architecture doc. That separation keeps CPU-side string work off the engine core’s busy loop; process isolation also limits how far a failure in one role spreads. [11]
Each component has a specific job in the engine loop. Click to expand.
The Input Processor tokenizes text (and encodes images or audio when the model path requires it) in the API server process. The Output Processor detokenizes new tokens and streams them back as SSE events. Both run in the API process, which communicates with the engine core via ZMQ, so tokenization and detokenization stay off the core GPU scheduling path. [11]
Why it matters: Tokenization and detokenization are CPU-bound. Running them in a separate process from the GPU scheduler prevents them from adding latency to the critical engine loop.
vllm/v1/engine/input_processor.py
vllm/v1/engine/output_processor.py
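The separation can be sketched with a worker and a queue. This is an illustration of the principle only: vLLM uses separate OS processes connected over ZMQ, while the sketch below uses a thread, and the tiny vocabulary is invented.

```python
import queue
import threading

# Illustrative sketch: the "engine loop" only moves token ids; all string
# work happens in a separate worker, off the scheduling hot path.
token_q = queue.Queue()
chunks = []

def detokenizer():
    vocab = {1: "Hello", 2: ",", 3: " world"}   # hypothetical vocabulary
    while True:
        tok = token_q.get()
        if tok is None:                  # sentinel: stream finished
            break
        chunks.append(vocab[tok])        # would be streamed as an SSE event

worker = threading.Thread(target=detokenizer)
worker.start()
for tok in (1, 2, 3):                    # engine loop emits raw ids
    token_q.put(tok)
token_q.put(None)
worker.join()
print("".join(chunks))                   # Hello, world
```

Swapping the thread for a process and the queue for a ZMQ socket gives the failure-isolation benefit the text describes: a crash in string handling cannot take down the scheduler.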
Decides which requests get GPU time each step. Tracks num_computed_tokens vs num_tokens per request with no rigid prefill/decode split, so chunked prefill, prefix caching, and speculative decoding share one scheduler. [8]
Why it matters: Scheduling policy and token budgets shape both TTFT and inter-token latency.
vllm/v1/core/sched/scheduler.py
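The core bookkeeping can be sketched in a few lines. This is a simplification of the idea, not vLLM's scheduler: it ignores KV-capacity checks and priorities, and just splits one per-step token budget across requests by their computed-vs-needed counts.

```python
# Sketch of a token-budget scheduling step (illustrative, not vLLM code).
def schedule_step(requests, token_budget):
    """requests: dicts with 'num_computed_tokens' and 'num_tokens'."""
    plan = {}
    for i, req in enumerate(requests):
        remaining = req["num_tokens"] - req["num_computed_tokens"]
        grant = min(remaining, token_budget)
        if grant > 0:
            plan[i] = grant              # tokens this request computes now
            token_budget -= grant
        if token_budget == 0:
            break
    return plan

reqs = [
    {"num_computed_tokens": 99, "num_tokens": 100},  # mid-decode: needs 1
    {"num_computed_tokens": 0,  "num_tokens": 600},  # new prompt: prefill
]
print(schedule_step(reqs, token_budget=512))  # {0: 1, 1: 511}
```

One step mixes a single decode token and a 511-token prefill chunk; the 600-token prompt finishes its prefill on a later step. That is chunked prefill falling out of the budget model rather than being a special case.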
Manages the block pool you saw in the PagedAttention section. It allocates and frees fixed-size blocks and enables prefix caching by hashing block contents. [13]
Why it matters: KV cache is the #1 memory consumer. This manager is why vLLM can serve more concurrent requests than naive implementations.
vllm/v1/core/kv_cache_manager.py
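A minimal block-pool sketch shows the mechanism, under stated simplifications: the block size, hashing scheme, and class below are illustrative rather than vLLM's implementation. Chaining each block's hash to its predecessor mirrors the idea that a cached block's identity covers its full prefix, so only true prefix matches share storage.

```python
import hashlib

BLOCK = 4  # tokens per physical block; vLLM's default differs

class BlockPool:
    """Illustrative fixed-size block allocator with prefix caching."""
    def __init__(self):
        self.blocks = {}    # chained content hash -> physical block id
        self.next_id = 0

    def allocate(self, tokens):
        ids, prev = [], ""
        for i in range(0, len(tokens), BLOCK):
            chunk = tuple(tokens[i:i + BLOCK])
            if len(chunk) < BLOCK:           # partial block: never shared
                ids.append(self.next_id)
                self.next_id += 1
                continue
            prev = hashlib.sha256((prev + repr(chunk)).encode()).hexdigest()
            if prev not in self.blocks:      # miss: allocate a new block
                self.blocks[prev] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[prev])    # hit: reuse the existing block
        return ids

pool = BlockPool()
a = pool.allocate(list(range(8)))            # two full blocks
b = pool.allocate(list(range(8)) + [99])     # shared prefix + one new block
print(a, b)   # [0, 1] [0, 1, 2] — the prefix blocks are shared
```

The second request reuses both prefix blocks, which is exactly the win prefix caching delivers for repeated system prompts and scaffolding.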
Runs the actual torch.nn.Module forward pass on each GPU. It prepares input tensors, replays CUDA graphs for speed, and coordinates with attention backends (FlashAttention, FlashInfer) to read and write the paged KV cache. [11]
Why it matters: This is where GPU time is spent. CUDA graph capture eliminates CPU launch overhead, making each decode step faster.
vllm/v1/worker/gpu_model_runner.py
Abstracts how scheduled work reaches GPU workers. ParallelConfig defaults to UniProc when world_size == 1 and to Multiproc (mp) for typical multi-GPU setups; Ray is used for multi-node or Ray-backed deployments. All implementations still expose the same execute_model surface to the engine core. [11]
Why it matters: EngineCore.step() stays the same; changing parallelism swaps the executor implementation behind model_executor.
vllm/v1/executor/
The scheduler orchestrates the concepts from the PagedAttention and Batching sections: it assigns tokens from a per-step budget with no rigid prefill/decode split, enabling chunked prefills, prefix caching, speculative decoding, and mixed batches in one loop. When KV cache space is exhausted, the scheduler preempts the lowest-priority request (V1 default: recompute). [8] For prefix caching details, see the design doc.
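Preemption by recompute can be sketched as follows. The function, field names, and priority convention here are hypothetical simplifications: the point is that the victim's KV blocks are freed and its progress counter resets, so it re-prefills from scratch when rescheduled.

```python
# Illustrative preemption-by-recompute sketch (not vLLM code).
def preempt_for_space(running, waiting, blocks_needed, free_blocks):
    while free_blocks < blocks_needed and running:
        # Assume higher numeric priority value = lower importance.
        victim = max(running, key=lambda r: r["priority"])
        running.remove(victim)
        free_blocks += victim["blocks"]       # its KV blocks go back to the pool
        victim["blocks"] = 0
        victim["num_computed_tokens"] = 0     # recompute: prefill again later
        waiting.insert(0, victim)             # rejoin the queue at the front
    return free_blocks

running = [{"id": "a", "priority": 0, "blocks": 6, "num_computed_tokens": 48},
           {"id": "b", "priority": 9, "blocks": 4, "num_computed_tokens": 32}]
waiting = []
free = preempt_for_space(running, waiting, blocks_needed=3, free_blocks=0)
print(free, waiting[0]["id"])   # 4 b
```

Recompute trades repeated prefill work for simplicity: no KV state has to be copied off the GPU, but a preempted long prompt pays its prefill cost again.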
The architecture diagram above maps to real OS processes. This calculator shows a common V1 topology under the default assumption that API-server count tracks data-parallel replicas; adjust the inputs to see how process count changes under that model. [11] [12]
The tuning lab shows how concurrency and token budget affect throughput, latency, and KV pressure. [8]
More simultaneous users raise overall throughput but can slow individual responses. The sliders below let you explore that tradeoff, and the table lists the flags that control it. [8] [21]
Sliders are illustrative: they show which levers usually lift throughput, which smooth streaming, and which add preemption or OOM risk.
This profile is tuned for a shared chat service: enough token budget to keep throughput healthy, enough headroom to avoid constant preemption, and enough prefix reuse to reward repeated scaffolding.
| Lever | Primary Effect | Tradeoff | How to explain it |
|---|---|---|---|
| --gpu-memory-utilization | More GPU memory for the model executor (weights, runtime, and KV cache); range (0, 1], default 0.9 [8] | Less memory headroom for spikes and variability | Raise only when the workload is stable and OOM risk is well understood. |
| --max-num-seqs | More sequences per scheduler step | More contention for KV capacity and scheduler budget | Best for short requests; lower it when contexts are long or heterogeneous. |
| --max-num-batched-tokens | More work per scheduler step | Larger values generally improve throughput and TTFT; smaller values often improve ITL when chunked prefill is on (vLLM V1 default when supported) | Tune against your TTFT, ITL, and prefill mix; vLLM docs recommend trying values above 8192 for throughput on smaller models on large GPUs. |
| --max-model-len | Maximum sequence length (prompt plus generated tokens) the engine allows | KV grows with actual length, but a higher cap raises worst-case per-request memory | Set the cap to the longest context you must serve; unused headroom still shapes KV reservation planning. |
| --tensor-parallel-size | Fits larger models across GPUs | More inter-GPU communication | Use the smallest TP value that fits comfortably on the available hardware. |
| --quantization | Lower memory footprint and cost | Model-specific accuracy and compatibility tradeoffs | Measure accuracy and latency on representative prompts; effects vary by scheme and kernel. |
| --enable-prefix-caching / --no-enable-prefix-caching | If omitted, vLLM enables prefix caching when the loaded model supports it; pass --no-enable-prefix-caching to force it off [8] | Extra cache bookkeeping with little value when prompts are unique | Disable when reuse is rare; pass --enable-prefix-caching to turn it back on after disabling. |
| --scheduling-policy | Admission fairness and priority behavior | Different tail-latency profiles under contention | priority orders waiting work by request priority (lower numeric values first, then arrival time); fcfs is strict arrival order. Use priority when the API assigns priorities and you want that order under load. |
The test dataset is the most overlooked tuning variable. Synthetic or toy prompts lead to misleading conclusions because vLLM performance depends on incoming request shapes and arrival patterns. [21]
For guidance on choosing between TP, PP, DP, EP, and CP, see the interactive parallelism chooser in the Deployment section. The --tensor-parallel-size row above is the most common tuning entry point. [4]
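A back-of-envelope KV sizing calculation ties several of these flags together. The formula below is the standard per-token KV footprint (2 for K and V, times layers, KV heads, head dimension, and dtype width); the model shape and the 20 GiB budget are hypothetical examples, so read the real values from your model config and measured free memory.

```python
# Back-of-envelope KV cache sizing (standard formula, not a vLLM API).
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # 2 = one K and one V tensor per layer; dtype_bytes=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model with grouped-query attention (8 KV heads).
per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget_gib = 20   # KV budget left after weights, per --gpu-memory-utilization
tokens = budget_gib * 1024**3 // per_tok
print(per_tok, tokens)   # 131072 bytes/token -> 163840 tokens of KV capacity
```

Dividing that token capacity by --max-model-len gives a worst-case concurrent-sequence count, which is a useful sanity check against --max-num-seqs before load testing.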
You finished the path. Switch to see the guide from a different angle, or show everything.
Understand costs, capabilities, and when vLLM fits.
Deploy, tune, and operate vLLM in production.
Understand the engine internals and scheduling.
Commands, HTTP API paths, and links to cited sources.
# Serve a model
vllm serve <model>
# Common options
vllm serve <model> \
--tensor-parallel-size 4 \
--max-model-len 4096
# Benchmark
vllm bench serve --model <model>
Commands above are drawn from the official serve and benchmark CLI docs. [7] [33]
# Chat completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model>", "messages": [
{"role": "user", "content": "Hello!"}
]}'
# List models
curl http://localhost:8000/v1/models
Paths follow the documented OpenAI-compatible server surface. [7]
The cited sources cover the vllm serve CLI, OpenAI-compatible server flags, and operational controls.