GPU time is expensive, and the serving layer can waste much of it. Your app feels slow; your cluster stalls after a handful of users. This guide covers LLM inference mechanics and where vLLM changes the picture.
Source snapshot: official docs and engineering posts reviewed in April 2026. [1] [2] [3] [4]
Pick the path that fits your role. Every section is self-contained, so you can skip ahead or stop early.
Understand costs, capabilities, and when vLLM fits.
Deploy, tune, and operate vLLM in production.
Understand the engine internals and scheduling.
Organizations buy expensive GPU hardware. Apps still feel slow and top out after a small number of concurrent users. Fast responses and efficient GPU use pull on the same memory and compute budget. This section names that tradeoff and how vLLM addresses it.
When your application uses an AI model to answer questions, write text, or process documents, every response requires GPU compute time. vLLM is an open-source inference and serving engine built for high-throughput, memory-efficient LLM serving. Teams adopt it when they want more control over hardware choice, deployment shape, and cost model than a managed API provides. [10]
The sections below explain what happens at each stage and where vLLM makes the difference.
Teams care about how fast the first token arrives and how smoothly the stream feels. They also track prompt reuse, multimodal inputs, structured outputs, and fairness on shared clusters. A single tokens-per-second figure rarely tells the whole story.
Prefill is prompt-heavy and compute-bound. Decode adds one token at a time while the user watches; it is often memory-bandwidth-bound. A serving stack balances both phases continuously. [4]
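The compute-versus-bandwidth split can be sanity-checked with a back-of-envelope calculation. The sketch below uses hypothetical numbers (a 7B-parameter model in FP16) and ignores KV-cache traffic; it is illustrative, not a benchmark:

```python
# Back-of-envelope: why prefill is compute-bound and decode is
# memory-bandwidth-bound. Hypothetical 7B-parameter model in FP16.
params = 7e9
bytes_per_param = 2  # FP16

def arithmetic_intensity(tokens_per_step: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Roughly 2 * params FLOPs per token; the full weight set is read
    once per step regardless of how many tokens are in the batch.
    """
    flops = 2 * params * tokens_per_step
    bytes_moved = params * bytes_per_param  # weights read once per step
    return flops / bytes_moved

prefill = arithmetic_intensity(2048)  # 2048-token prompt in one pass
decode = arithmetic_intensity(1)      # one new token per step per sequence

print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# Decode's ~1 FLOP/byte is far below what GPUs need to stay busy,
# which is why batching many decode requests together matters.
```

The gap between the two numbers is the whole motivation for continuous batching: a single decode stream cannot keep the arithmetic units fed.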
Long context, high concurrency, and large models all squeeze KV cache space. Design choices decide whether GPU memory holds useful tokens or loses capacity to fragmentation, eviction pressure, or preemption.
vLLM is an open-source inference and serving engine for high-throughput, memory-efficient LLM workloads. PagedAttention tightens KV memory use. [9] Continuous batching keeps accelerators busy under mixed load. The runtime also ships prefix caching, structured-output support, multimodal model support, multi-GPU parallelism, and OpenAI-compatible serving. [13] [15] [14] [12] [7]
Targets lower TTFT and steadier streaming by reducing KV waste and prioritizing decode work under mixed load. [9] [8]
Interactive serving needs a short time to first token (TTFT), steady streaming gaps between tokens (inter-token latency), and predictable latency under concurrent load. The mental model below ties together prefill, decode, and caching, and shows how vLLM addresses each requirement.
Building on the prefill/decode tradeoff above, prompt length, concurrency, prefix reuse, multimodal inputs, and latency targets usually decide whether batching, KV memory, parallelism, or disaggregated serving matter most. Use the topic tabs below to explore each technique and when it applies to your workload.
Open-source serving engine focused on high-throughput inference, broad model support, OpenAI-compatible APIs, and features such as PagedAttention, unified scheduling, and prefix caching. [10] [14] [13]
Hugging Face documents TGI as being in maintenance mode and recommends newer engines such as vLLM or SGLang for future serving-stack work where they fit. [20]
NVIDIA’s TensorRT-LLM stack: builder and runtime focused on CUDA GPUs, fused kernels, and detailed KV cache and compilation controls for teams already standardized on NVIDIA inference. [3]
SGLang is an active open-source serving stack whose official materials emphasize RadixAttention, structured generation, and gateway/routing features. [18]
These questions cover most vLLM fit and sizing reviews.
Is the user experience dominated by first token, smooth streaming, or offline throughput?
Are prompts short and repetitive, or long, bursty, and retrieval-heavy?
Is this mostly chat, long-form generation, multimodal requests, or structured/tool-driven flows?
Single node, shared cluster, air-gapped enterprise, or multi-node fleet with strict operational controls?
Expand a card for the full answer.
The supported-models documentation covers native text, multimodal, and pooling architectures for chat, code, vision, audio, embedding, scoring, and retrieval workloads. [14] What follows is a concise feature matrix and hardware map for serving.
The most common deployment target: frontier open models, chat assistants, copilots, and code generation stacks.
Sparse MoE stacks need explicit expert-parallel layout and memory planning so routing overhead does not erase the parameter savings.
Multimodal traffic changes preprocessing and pushes memory and time to first token harder than text-only chat.
Pooling, embedding, and ranking endpoints sit beside chat completions for Retrieval-Augmented Generation (RAG) and search backends.
Plan around dense text, MoE, multimodal inputs, or pooling instead of assuming a single model name tells the whole story.
| Feature | Description | Key Benefit | What this means for you |
|---|---|---|---|
| Quantization | GPTQ, AWQ, FP8, BitsAndBytes, GGUF, TorchAO, and related schemes are documented in the project and surrounding vllm-project materials [10] | Run larger models on fewer GPUs | Run the same model on less expensive hardware |
| LoRA Adapters | Runtime loading and unloading of LoRA weights [32] | Serve many fine-tuned variants from one base | One deployment serves multiple customized models |
| Structured Output | JSON schema constraints plus regex, choice, grammar, and structural-tag guards [15] [6] | Safer integration with tools and downstream systems | The model's output is constrained to a format your code can parse |
| Prefix Caching | Automatic hash-based KV block reuse for shared prefixes [13] | Better TTFT for repeated system prompts and shared context | Repeated instructions (system prompts) are processed once instead of every time |
| Multimodal | Vision-language and audio-capable models [14] | Multimodal and text paths in one serving stack | Process images, audio, and text with the same server |
| CUDA Graphs | Captured execution graphs for repeat decode shapes and other stable execution paths [10] | Lower per-step launch overhead on stable workloads | Faster responses on steady, predictable traffic |
| Feature | Description | Key Benefit |
|---|---|---|
| Speculative Decoding | EAGLE, MTP, draft, n-gram, PARD, suffix, and MLP-style speculators [34] | Lower wall-clock latency when acceptance rate is healthy |
| Expert / MoE Support | Serve sparse MoE architectures such as Mixtral and DeepSeek families [14] [12] | Use hardware more efficiently for large sparse models |
| Disaggregated Prefill (experimental) | Separate prompt-ingest and decode pools when workloads justify it | Protect TTFT-sensitive traffic under prompt-heavy load |
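The "healthy acceptance rate" caveat on speculative decoding can be made concrete. Under the standard independence assumption from the speculative-decoding literature (not vLLM's internal accounting), drafting k tokens with per-token acceptance probability p yields an expected 1 + p + p² + … + pᵏ tokens per target-model step:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens produced per target-model forward pass when
    drafting k tokens with independent per-token acceptance rate p.
    Geometric series: 1 + p + p^2 + ... + p^k."""
    return sum(p ** i for i in range(k + 1))

# With 80% acceptance and 4 draft tokens, each expensive target-model
# step yields ~3.36 tokens instead of 1; at 30% acceptance, only ~1.43.
print(round(expected_tokens_per_step(0.8, 4), 2))
print(round(expected_tokens_per_step(0.3, 4), 2))
```

When acceptance is low, the draft model's overhead can exceed the savings, which is why the table conditions the latency win on a healthy acceptance rate.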
CUDA, ROCm, CPU, and XPU installers target the main repository; accelerator-specific code also ships in separate vllm-project repositories linked below. [10]
Mainline GPU path for high-throughput serving across common enterprise and hyperscaler fleets.
repo: vllm
ROCm-backed deployments for AMD accelerator fleets, including MI-series hardware.
repo: vllm
Intel Data Center and Arc GPUs through the XPU backend in the main repo plus vllm-xpu-kernels; typical installs use a source build (Intel publishes Docker images).
repo: vllm
Linux x86_64 and Arm AArch64 ship prebuilt CPU wheels. Apple Silicon and IBM Z CPU targets are experimental and need source builds; Apple GPU inference uses the vllm-metal plugin in the grid below.
repo: vllm
Cloud TPU support via the tpu-inference / vllm-tpu plugin package, integrated with the main vLLM install.
repo: tpu-inference
Trainium and Inferentia support via a dedicated vLLM organization plugin for AWS deployments.
repo: vllm-neuron
Community-maintained vLLM organization plugin for Huawei Ascend deployments.
repo: vllm-ascend
Dedicated org plugin for Gaudi/HPU environments and their operator + runtime differences.
repo: vllm-gaudi
Community-maintained vLLM organization plugin for IBM Spyre AIU acceleration.
repo: vllm-spyre
Community-maintained Metal plugin for Mac workflows and Apple unified-memory serving experiments.
repo: vllm-metal
The economics of self-hosted inference depend on infrastructure cost, operational overhead, and what an API provider charges for the same workload. Use representative benchmarking instead of generic break-even claims. [21] [31]
Self-hosted inference trades metered API pricing for capacity you manage. Break-even depends on measured throughput, utilization, infrastructure cost, and operations. GuideLLM and vLLM benchmarks provide the throughput side of that equation, but the answer is still workload-specific. [21] [23] [31]
Raw GPU-hour cost is only one input. The more useful metric is an effective cost per token or per request: take your real hourly or amortized infrastructure cost and divide it by measured throughput under representative prompts and concurrency. [21] [23]
| Input | How to Measure | Why It Matters |
|---|---|---|
| Infrastructure cost | Use your real hourly GPU rate, reserved-instance rate, or amortized on-prem cost. | This is the numerator in any self-hosted cost-per-token calculation. |
| Measured throughput | Benchmark on representative prompts, outputs, and concurrency with GuideLLM or vLLM tools. | Throughput determines how much useful work that infrastructure cost actually buys. [21] [23] |
| Utilization and duty cycle | Estimate how often the GPUs are busy versus idle across your real traffic pattern. | Idle capacity can dominate the economics of spiky or low-volume workloads. |
| Operational overhead | Include deployment, monitoring, patching, on-call, and support contracts. | Self-hosting is not only a hardware decision. [22] [24] |
| API pricing baseline | Use the provider’s published token pricing for the model and tier you are actually comparing against. | This is the baseline alternative to self-hosting. [31] |
A practical self-hosted comparison is: effective cost per million tokens = hourly or amortized infrastructure cost / measured tokens per second / 3,600 × 1,000,000. Use representative workloads, not toy prompts, when you collect the throughput number. [21] [23]
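That formula is easy to script. The inputs below are placeholders, not real GPU or API pricing:

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_sec: float) -> float:
    """Effective $/1M tokens: infrastructure cost divided by measured
    throughput, scaled from seconds to hours and tokens to millions."""
    return hourly_cost_usd / tokens_per_sec / 3600 * 1_000_000

# Hypothetical: a $4.00/hr GPU sustaining a measured 1,000 tokens/sec
# under representative prompts and concurrency.
print(round(cost_per_million_tokens(4.00, 1000), 2))  # ≈ 1.11 ($/1M tokens)
```

Compare the result against the provider's published per-token pricing for the same model tier, and remember to fold in utilization and operational overhead from the table above.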
Self-hosting is not always the right call. Managed APIs often win when traffic is sporadic, when proprietary model access matters more than infrastructure control, or when GPU operations overhead would dominate the savings. Keep the decision empirical: measure representative throughput, compare it to published API pricing, and include support and operations in the model. [21] [31]
Divide your hourly or amortized infrastructure cost by measured throughput (tokens/sec) to get cost per million tokens, then compare that against the provider’s published token pricing for the model tier you are using. The Performance Tuning section covers representative benchmarking methodology with GuideLLM in more detail. [21] [31]
A structured path from evaluation to production. Each phase maps to documented vLLM deployment modes, benchmarking steps, and support options.
You can start with vllm serve <model> for local evaluation and then move into Kubernetes-based deployment patterns documented by vLLM and KServe. Three phases, each with concrete exit criteria, take you from a single-GPU proof of concept to production. [7] [16]
Do prompts and responses need to stay on your network? Self-hosted vLLM lets inference traffic stay on infrastructure you control, but the security guide also notes that multi-node communications are insecure by default and should be isolated on trusted networks. [28]
The vLLM project documents support for CUDA, ROCm, Intel XPU, CPU, and additional accelerator plugins in the wider vllm-project organization. If you already run modern accelerator fleets, there is usually a documented path to start evaluating. [10]
Complexity ranges from a one-liner (vllm serve <model>) to multi-node tensor/pipeline/data parallelism. Managed platforms like Red Hat OpenShift AI handle the infrastructure layer. [24]
Each role owns a different piece of the deployment. Get them aligned before Phase 2.
GPU provisioning, Kubernetes integration, Helm charts, shared memory and storage configuration. Concerned with resource limits, node autoscaling, and multi-tenant isolation.
Data residency, endpoint protection (reverse proxy, API-key limitations), network isolation for inter-node ZMQ and PyTorch Distributed traffic, and pre-production hardening checks. [28]
Model selection from the supported-model catalog, quantization options, accuracy validation, and LoRA adapter management. [14]
TCO analysis from the previous section, support contracts, GPU lease-vs-buy decisions, and measurable success criteria. [24]
Tie your go/no-go decision to metrics vLLM exposes:
- vllm:time_to_first_token_seconds and vllm:inter_token_latency_seconds Prometheus histograms stay within your latency targets [25]
- /health endpoint availability meets your target, paired with platform probes and alerting [28]
- Graceful shutdown works as configured (--shutdown-timeout) [7] [22]

vLLM runs the same engine family on one GPU, behind an HTTP server, or on Kubernetes. That gives teams one serving stack with familiar model and runtime flags across those shapes instead of maintaining separate engines per environment. [7] [16]
Upstream docs cover running vLLM on Kubernetes with KServe (Hugging Face serving runtime or LLMInferenceService). [16] OpenShift AI documents KServe-based model serving; you apply the same integration patterns there when exposing vLLM as a served model. [24]
Many organizations publish quantized or otherwise optimized checkpoints on Hugging Face. Treat those as starting points for evaluation rather than automatic production approval, and validate them on your own prompts, latency targets, and safety requirements.
vLLM is built for throughput; production security requires layers around it. The Production Hardening section covers reverse proxy setup, network isolation, SSRF protection, FIPS-sensitive deployment considerations, and the full pre-production security checklist. [28]
The LLM class runs generation in-process with no server overhead.
from vllm import LLM, SamplingParams

# Load the model once; subsequent generate() calls reuse it in-process.
llm = LLM(model="RedHatAI/gemma-4-31B-it-FP8-block")
params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["Explain Kubernetes in one paragraph."], params)
print(outputs[0].outputs[0].text)
One command starts an OpenAI-compatible HTTP server with chat, completions, and model-listing endpoints; embedding and pooling flows are documented separately in the supported-model and pooling docs. [7] [14]
vllm serve RedHatAI/gemma-4-31B-it-FP8-block
Deploy on Kubernetes or OpenShift with GPU node selectors, resource limits, and room to grow into multi-node topologies.
# Minimal example only; add probes, shared memory, auth, and model storage for production.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["vllm", "serve",
                  "RedHatAI/gemma-4-31B-it-FP8-block"]
        resources:
          limits:
            nvidia.com/gpu: 1
Each strategy below lists topology, fit, tradeoffs, and the matching CLI flags from the serving docs. [12]
Split each layer across GPUs when the model fits on one node but not on one device.
--tensor-parallel-size N
Watch interconnect bandwidth and collective overhead.
Combine parallelism flags with load balancing and routing for production scale. [12]
| Pattern | What It Does | When to Use |
|---|---|---|
| Parallelism flags | `-tp`, `-pp`, `-dp` plus expert and context parallel | Scale a single model across multiple GPUs. See the parallelism chooser above [12] |
| Multi-API-server | `--api-server-count N` runs multiple frontend processes sharing one engine. Auto-configures `PROMETHEUS_MULTIPROC_DIR` for shared metrics [12] | When the API frontend is the bottleneck (high request parsing overhead, many concurrent connections) |
| vLLM Router | Production-grade Rust load balancer with consistent hashing for KV cache reuse, prefill/decode disaggregation support, K8s service discovery, and circuit breakers [29] | Multi-replica deployments needing intelligent request routing and cache affinity |
| llm-d | Red Hat’s KV-cache-aware routing layer for multi-replica vLLM on OpenShift [30] | Multi-turn workloads where routing to the replica holding the conversation’s KV cache avoids recomputation |
| Pattern | How | Tradeoff |
|---|---|---|
| Served name aliases | `--served-model-name name1 name2` — one engine, multiple API model IDs [7] | No overhead; useful for API compatibility or gradual migration |
| Runtime LoRA adapters | `--lora-modules` at startup, plus `/v1/load_lora_adapter` and `/v1/unload_lora_adapter` at runtime (requires `VLLM_ALLOW_RUNTIME_LORA_UPDATING=1`) [32] | Share one base model across fine-tuned variants; adapter count limited by `--max-loras` and GPU memory |
| Separate deployments + routing | One Deployment per model, fronted by vLLM Router or llm-d [29] [30] | Full isolation; requires more GPUs but simplifies per-model scaling and upgrades |
- Mount persistent model storage at /root/.cache/huggingface so the model is downloaded once and persisted across pod restarts. Shown in official KServe and Kubernetes-oriented deployment examples [16]

vLLM exposes Prometheus metrics on the /metrics endpoint of the API server, supports OpenTelemetry tracing, and ships pre-built Grafana dashboards. The metrics below cover queue health, memory pressure, latency percentiles, and request outcomes. [25] [5] [27]
Queue depth tells you whether traffic is backing up. KV cache usage warns you before memory pressure triggers evictions. TTFT and ITL are the latency numbers your users feel. The troubleshooting tree below pairs each symptom with the metric to check and the flag to change. [25] [8]
These metrics are defined in vLLM’s PrometheusStatLogger and documented in the official Production Metrics page. [25]
| Metric | Type | What It Means | Alert When |
|---|---|---|---|
| `vllm:num_requests_running` | Gauge | Requests currently in model execution batches | Sustained at `--max-num-seqs` ceiling |
| `vllm:num_requests_waiting` | Gauge | Requests queued for scheduling (capacity + deferred) | Growing queue (>0 sustained) |
| `vllm:kv_cache_usage_perc` | Gauge | KV cache utilization (1.0 = 100%) | >0.9 — approaching preemption threshold |
| `vllm:num_preemptions` | Counter | Cumulative preemptions (requests evicted from KV cache) | Rate >0 — requests being recomputed |
| `vllm:time_to_first_token_seconds` | Histogram | Time from request arrival to first generated token | p99 exceeds your TTFT SLO |
| `vllm:inter_token_latency_seconds` | Histogram | Gap between consecutive output tokens (streaming smoothness) | p99 exceeds your ITL SLO |
| `vllm:e2e_request_latency_seconds` | Histogram | Total request duration from arrival to final token | p99 exceeds end-to-end SLO |
| `vllm:request_queue_time_seconds` | Histogram | Time a request spent in the WAITING state before scheduling | Growing queuing latency |
| `vllm:prefix_cache_hits` / `queries` | Counter | Prefix cache token hits vs. queries | Low hit ratio = wasted KV memory |
| `vllm:request_success` | Counter | Count of successfully processed requests | Success rate drops against your normal request volume baseline |
- --kv-cache-metrics enables vllm:kv_block_lifetime_seconds, vllm:kv_block_idle_before_evict_seconds, and vllm:kv_block_reuse_gap_seconds — useful for diagnosing cache churn and sizing
- --enable-mfu-metrics enables vllm:estimated_flops_per_gpu_total — Model FLOPs Utilization for hardware efficiency analysis
- Speculative decoding metrics (vllm:spec_decode_num_accepted_tokens, acceptance rate by draft position) are registered automatically when speculative decoding is configured

Four Grafana rows built from the metrics above. The vLLM Monitoring Dashboards docs and the production-stack provide pre-built versions. [27] [22]
rate(vllm:generation_tokens[1m]) and rate(vllm:prompt_tokens[1m]). Shows decode and prefill throughput in tokens per second.
histogram_quantile(0.99, ...) over vllm:time_to_first_token_seconds, inter_token_latency_seconds, and e2e_request_latency_seconds. The three numbers users feel.
vllm:kv_cache_usage_perc, rate(vllm:num_preemptions[5m]), and prefix cache hit ratio. Warns before memory pressure causes evictions.
vllm:num_requests_running, vllm:num_requests_waiting, and num_requests_waiting_by_reason (split by capacity vs. deferred).
vLLM ships an official OTLP tracing example and documents trace export through --otlp-traces-endpoint plus related environment variables. [5]
| Flag / Env Var | What It Does |
|---|---|
| `--otlp-traces-endpoint <URL>` | Sends traces to an OTLP collector (gRPC by default, or HTTP/protobuf via `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL`) |
| `--collect-detailed-traces {model,worker,all}` | Enables per-module spans for model execution or worker-level tracing (requires OTLP endpoint) |
| `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` | Set to `http/protobuf` for HTTP export instead of the default `grpc` |
With an OTLP collector configured, request traces can flow into Jaeger or any other OTLP-compatible backend you already run. [5]
Each branch pairs a Prometheus metric with the configuration flag to change.
**High TTFT**
- Check `vllm:num_requests_waiting` — is traffic queueing? If yes, scale replicas or increase `--max-num-seqs`
- Check the `vllm:request_prompt_tokens` histogram — are prompts unusually long? Long prefills are compute-bound:
  - Increase `--tensor-parallel-size` to split prefill across GPUs [12]
  - Tune `--max-num-batched-tokens` to bound the per-step prefill budget [8]

**Preemptions and memory pressure**
- Check `vllm:kv_cache_usage_perc` — is it above 0.9?
  - Adjust `--gpu-memory-utilization` (default 0.9) to leave headroom for activation memory
  - Lower `--max-model-len` to cap KV cache per sequence
  - Use quantization (`--quantization awq`, `gptq`, or `fp8`) to shrink model weights and free memory for KV [8]

**Choppy streaming**
- Check `vllm:inter_token_latency_seconds` p99 for spikes
- Check `vllm:iteration_tokens_total` — large batches per step increase per-token decode time
  - Lower `--max-num-batched-tokens` to reduce batch size, trading throughput for smoother streaming [8]

**Requests queueing**
- Watch `vllm:request_queue_time_seconds` for growing queue latency
- Check `vllm:num_requests_waiting_by_reason` — is the bottleneck capacity (not enough GPU) or deferred (transient constraints like LoRA budget)? If capacity, scale replicas or increase `--max-num-seqs`

**Low prefix cache hit ratio**
- Compute `rate(vllm:prefix_cache_hits[5m]) / rate(vllm:prefix_cache_queries[5m])`
- If repeated system prompts are common, confirm `--enable-prefix-caching` is on [13]

For health probe configuration, graceful shutdown, resource limits, and logging controls, see the Resilience Checklist in Production Hardening.
A pre-production checklist drawn from vLLM’s security guide, official Kubernetes examples, and source code. [28]
vLLM runs out of the box. Running it safely in production takes deliberate configuration. Each item below links to the flag or doc that implements it. [28] [7]
Items are documented in vLLM’s security guide plus the official server arguments and deployment examples. [28] [7] [16]
| Item | Why | How |
|---|---|---|
| Deploy behind a reverse proxy | `--api-key` only protects `/v1` endpoints. Endpoints like `/invocations` and `/pooling` are unprotected; `/pause` and `/resume` are also unprotected but only exist when `VLLM_SERVER_DEV_MODE=1` is set. The security guide states: “The most effective approach is to deploy vLLM behind a reverse proxy.” [28] | nginx, Envoy, or a Kubernetes Gateway that allowlists only the endpoints you need and adds rate limiting |
| API key authentication | Bearer token auth for `/v1` endpoints | `--api-key <token>` or `VLLM_API_KEY` env var. Note: only protects the `/v1` path prefix; auth middleware skips other paths |
| TLS termination | Encrypt client-to-server traffic | `--ssl-keyfile`, `--ssl-certfile`, `--ssl-ca-certs`. For TLS 1.2 cipher control: `--ssl-ciphers` |
| FIPS-sensitive environments | Some regulated deployments need approved cryptographic settings and validation. | Use the documented hashing and TLS-related settings from the security guide, then validate the full deployment with your compliance team before claiming FIPS readiness. [28] |
| Network isolation | All inter-node traffic (PyTorch Distributed, KV cache transfer, ZMQ between API server and engine core) is insecure and unencrypted by default | Deploy on an isolated network segment. Use Kubernetes NetworkPolicies. Set `VLLM_HOST_IP` to a specific interface. Configure firewalls to block all ports except the API server |
| Disable dev mode | `VLLM_SERVER_DEV_MODE=1` exposes `/collective_rpc` (arbitrary RPC execution), cache resets, and sleep endpoints | Never set `VLLM_SERVER_DEV_MODE=1` in production. Never enable profiler endpoints (`--profiler-config`) in production |
| SSRF protection | Malicious users can supply URLs targeting internal services or cloud metadata endpoints | `--allowed-media-domains <domains>` to restrict media URL fetching. Set `VLLM_MEDIA_URL_ALLOW_REDIRECTS=0` to prevent redirect bypass |
| Request parameter limits | The `n` parameter can cause resource exhaustion if set very high | Set `VLLM_MAX_N_SEQUENCES` to a deployment-appropriate cap for public-facing traffic rather than leaving the risk unbounded. [28] |
| Item | Why | How |
|---|---|---|
| Resource limits and requests | Prevent noisy-neighbor issues and ensure GPU scheduling | K8s: set resources.limits (CPU, memory, nvidia.com/gpu) and resources.requests. Mount /dev/shm as emptyDir: { medium: Memory, sizeLimit: "2Gi" } for tensor parallel shared memory [16] |
| Health endpoint | `GET /health` gives the orchestrator a simple liveness and readiness signal. | Use it for both liveness and readiness probes, and tune initial delays and thresholds for the model load time you actually observe. |
| Graceful shutdown | Without a drain window, restarts can interrupt in-flight requests. | Set --shutdown-timeout N to give the server time to finish or drain active work before exit. [7] |
| Logging controls | Production logs should be informative without leaking prompts or overwhelming storage | --max-log-len to bound logged prompt/output size. --disable-access-log-for-endpoints /health,/metrics to suppress probe noise. --enable-log-requests is off by default [7] |
For scaling patterns, multi-model serving, and model storage options, see the Deployment Options section.
The core idea: manage KV cache with the same block-and-table structure as virtual memory, so variable-length sequences and concurrent requests use a fixed GPU memory budget more efficiently. [9]
When a model is answering multiple questions at once, it needs to remember context for each conversation. The straightforward approach reserves a big chunk of GPU memory up front for each request and wastes whatever goes unused. PagedAttention organizes that memory more efficiently—like how your computer's operating system manages RAM—so the same GPU can handle more conversations simultaneously without running out of space. [9]
The steps below replay a short burst of requests, contrasting fixed contiguous reservations with drawing fixed-size blocks from a pool only as each sequence grows.
Each request's logical KV blocks map into a shared physical block pool on the GPU, just as an OS maps virtual pages to RAM frames. [9] Sequences grow, share prefix storage, and release blocks on completion without copying KV tensors.
Req A arrives. Blocks land wherever space exists. No contiguous reservation needed.
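The block-table idea can be sketched as a toy allocator. This is a simplification of what vLLM's KV cache manager does (names and sizes here are illustrative, not vLLM's actual code):

```python
BLOCK_SIZE = 16  # tokens per KV block (vLLM uses a similar fixed block size)

class BlockPool:
    """Toy physical block pool: sequences draw blocks on demand and
    return them on completion, so no contiguous reservation is needed."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids

    def append_token(self, req_id: str, pos: int) -> None:
        table = self.block_tables.setdefault(req_id, [])
        if pos % BLOCK_SIZE == 0:          # current block full: grab a new one
            table.append(self.free.pop())  # lands wherever space exists
        # a real engine writes K/V tensors into table[-1] at offset pos % BLOCK_SIZE

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks without copying KV data."""
        self.free.extend(self.block_tables.pop(req_id))

pool = BlockPool(num_blocks=64)
for pos in range(40):                    # a 40-token sequence grows token by token
    pool.append_token("req-A", pos)
print(len(pool.block_tables["req-A"]))   # 3 blocks (ceil(40/16)), not a worst-case reservation
```

The per-request block table is the analogue of a page table: logical block i of a sequence can live at any physical index, and freeing is just returning indices to the pool.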
Iteration-level schedulers run one forward pass per round and can change which requests are in the batch, instead of holding a fixed batch until every request finishes. [17] [8]
Older systems process requests in fixed groups: if one user's question takes longer, everyone else in the group waits. Continuous batching lets vLLM swap finished requests out and new requests in on every processing step, so the GPU stays busy and no single slow response blocks the rest. Chunked prefill breaks very long prompts into smaller pieces so they do not monopolize the processor while other users are waiting for their next word. [17] [8]
Scrub through scheduler steps and compare static batching versus vLLM-style continuous admission.
Long prompts span several steps so decode rows still advance in the same batch.
Req A has a 256-token prompt, but the scheduler only processes 64 tokens of it this step, leaving room for Reqs B and C to keep streaming their responses.
This diagram uses a fixed step budget of 128 tokens (illustrative only). vLLM applies the same idea through max_num_batched_tokens, often in the thousands. The scheduler fills that cap with a mix of prefill and decode work. [8]
Chunking caps how much prefill runs per step, so decode is less likely to sit idle behind a long prompt. When prefill dominates load, operators can split the two phases across separate pools—see Disaggregated Prefill in the Modern Requirements section. [8] [2]
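A stripped-down version of that per-step token budget looks like the sketch below. It is illustrative only — vLLM's real scheduler (vllm/v1/core/sched/scheduler.py) handles far more cases — but it shows decode work admitted first and long prompts chunked into the leftover budget:

```python
MAX_NUM_BATCHED_TOKENS = 128  # per-step budget (vLLM's is often in the thousands)

def schedule_step(requests: list[dict]) -> dict[str, int]:
    """One scheduler iteration: assign each request a slice of the token
    budget. Decode requests need 1 token each; prefill requests take
    whatever budget remains, chunked across steps for long prompts."""
    budget = MAX_NUM_BATCHED_TOKENS
    plan: dict[str, int] = {}
    # Decode first: every running request advances by one token.
    for r in requests:
        remaining = r["num_tokens"] - r["computed"]
        if r["computed"] >= r["prompt_len"] and remaining > 0 and budget > 0:
            plan[r["id"]] = 1
            budget -= 1
    # Then prefill: chunk long prompts into the leftover budget.
    for r in requests:
        remaining_prefill = r["prompt_len"] - r["computed"]
        if remaining_prefill > 0 and budget > 0:
            chunk = min(remaining_prefill, budget)
            plan[r["id"]] = chunk
            budget -= chunk
    return plan

reqs = [
    {"id": "A", "prompt_len": 256, "num_tokens": 300, "computed": 0},  # new long prompt
    {"id": "B", "prompt_len": 8, "num_tokens": 40, "computed": 20},    # mid-decode
    {"id": "C", "prompt_len": 8, "num_tokens": 40, "computed": 12},    # mid-decode
]
print(schedule_step(reqs))  # {'B': 1, 'C': 1, 'A': 126}: decode keeps streaming, prefill is chunked
```

Because decode rows are admitted before prefill, request A's 256-token prompt spans multiple steps instead of stalling B's and C's streams — the chunked-prefill behavior described above.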
From "what happens when I send a prompt?" to the V1 engine internals. Built for people new to inference. [11]
vLLM is structured like a well-run kitchen. An API server takes orders (your prompts), a scheduler decides which orders to cook next, and worker processes run the model on the GPU. Separating these roles lets vLLM handle many users at once without one slow request blocking the others. This section walks through each piece from the outside in. [11]
Serving a prompt means turning text into token IDs, running the model on the GPU to produce more tokens, and streaming decoded text back as tokens arrive. The rest of the vLLM layout speeds up and overlaps those stages. [11]
Text → tokens. The tokenizer converts your prompt into numbers. Images and audio get encoded too.
In the basic autoregressive case, the model runs once per output token. This is a loop: every new token depends on the one before it. The KV cache stores past computations so only the new token needs processing.
Each new token is converted back to text and streamed to the user in real time.
Most serving cost and latency sit in step 2. The architecture below shows how vLLM runs that loop efficiently under concurrency, building on the KV cache and PagedAttention concepts covered earlier. [11] [9]
The user sends "The quick brown fox," four tokens. The model needs to continue this sequence, but it can't just look at the text. It must process every token through the full neural network to build an internal representation.
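The prefill-then-decode shape of step 2 can be sketched as a loop. The `forward` function here is a toy stand-in for a neural network forward pass; the point is the call pattern, not the math:

```python
def generate(prompt_tokens: list[int], max_new_tokens: int) -> tuple[list[int], int]:
    """Toy autoregressive loop. forward_calls counts model invocations:
    one pass over the whole prompt (prefill), then one per new token
    (decode), with kv_cache standing in for cached past computations."""
    kv_cache: list[int] = []
    forward_calls = 0

    def forward(new_tokens: list[int]) -> int:
        nonlocal forward_calls
        forward_calls += 1
        kv_cache.extend(new_tokens)     # cache state for every token seen so far
        return sum(kv_cache) % 50_000   # toy stand-in for "predict next token"

    next_tok = forward(prompt_tokens)   # prefill: all prompt tokens in one pass
    out = [next_tok]
    for _ in range(max_new_tokens - 1): # decode: one token per pass
        next_tok = forward([next_tok])
        out.append(next_tok)
    return out, forward_calls

tokens, calls = generate([464, 2068, 7586, 21831], max_new_tokens=8)
print(calls)  # 8: one prefill pass + seven decode passes
```

Only the new token is processed on each decode pass because the KV cache already holds everything before it — which is exactly why decode is cheap in compute but expensive in memory traffic.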
The autoregressive loop above is what the engine core drives at high frequency: a tight schedule, execute, update cycle on every step: [11]
Picks which requests get GPU time this step and assigns a token budget. Each request tracks how many tokens it has computed vs. how many it needs. No rigid prefill/decode split. [8]
vllm/v1/core/sched/scheduler.py
Sends the scheduled batch to GPU workers via the executor. Each worker runs the model’s forward pass using fused attention backends (FlashAttention, FlashInfer, etc.) and CUDA graphs against the paged KV cache. [11]
vllm/v1/worker/gpu_model_runner.py
Applies sampled tokens from the model output, records progress in the scheduler, frees finished requests, and streams detokenized text to the user. Then the loop repeats. [11]
vllm/v1/engine/core.py
scheduler_output = self.scheduler.schedule()  # pick this step's batch
future = self.model_executor.execute_model(scheduler_output, non_block=True)
grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)  # overlaps with GPU work
model_output = future.result()  # wait for the forward pass to finish
engine_core_outputs = self.scheduler.update_from_output(
    scheduler_output, model_output
)
Order matches EngineCore.step(): after future.result(), the real code may call sample_tokens(grammar_output) if the model output is None, and it runs _process_aborts_queue() before update_from_output, inside logging context managers. The non_block=True call overlaps GPU work with grammar bitmask preparation. See vllm/v1/engine/core.py for the full branch logic. [11]
V1 splits work across OS processes: the API server runs tokenization and streams responses while engine cores and GPU workers run the scheduler and forwards, connected with ZMQ as in the upstream architecture doc. That separation keeps CPU-side string work off the engine core’s busy loop; process isolation also limits how far a failure in one role spreads. [11]
Each component has a specific job in the engine loop. Click to expand.
The Input Processor tokenizes text (and encodes images or audio when the model path requires it) in the API server process. The Output Processor detokenizes new tokens and streams them back as SSE events. Both run in the API process, which communicates with the engine core via ZMQ, so tokenization and detokenization stay off the core GPU scheduling path. [11]
Why it matters: Tokenization and detokenization are CPU-bound. Running them in a separate process from the GPU scheduler prevents them from adding latency to the critical engine loop.
vllm/v1/engine/input_processor.py
vllm/v1/engine/output_processor.py
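The separation can be sketched with a worker and a queue. This is an illustration of the principle only: vLLM uses separate OS processes connected over ZMQ, while the sketch below uses a thread, and the tiny vocabulary is invented.

```python
import queue
import threading

# Illustrative sketch: the "engine loop" only moves token ids; all string
# work happens in a separate worker, off the scheduling hot path.
token_q = queue.Queue()
chunks = []

def detokenizer():
    vocab = {1: "Hello", 2: ",", 3: " world"}   # hypothetical vocabulary
    while True:
        tok = token_q.get()
        if tok is None:                  # sentinel: stream finished
            break
        chunks.append(vocab[tok])        # would be streamed as an SSE event

worker = threading.Thread(target=detokenizer)
worker.start()
for tok in (1, 2, 3):                    # engine loop emits raw ids
    token_q.put(tok)
token_q.put(None)
worker.join()
print("".join(chunks))                   # Hello, world
```

Swapping the thread for a process and the queue for a ZMQ socket gives the failure-isolation benefit the text describes: a crash in string handling cannot take down the scheduler.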
Decides which requests get GPU time each step. Tracks num_computed_tokens vs num_tokens per request with no rigid prefill/decode split, so chunked prefill, prefix caching, and speculative decoding share one scheduler. [8]
Why it matters: Scheduling policy and token budgets shape both TTFT and inter-token latency.
vllm/v1/core/sched/scheduler.py
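The core bookkeeping can be sketched in a few lines. This is a simplification of the idea, not vLLM's scheduler: it ignores KV-capacity checks and priorities, and just splits one per-step token budget across requests by their computed-vs-needed counts.

```python
# Sketch of a token-budget scheduling step (illustrative, not vLLM code).
def schedule_step(requests, token_budget):
    """requests: dicts with 'num_computed_tokens' and 'num_tokens'."""
    plan = {}
    for i, req in enumerate(requests):
        remaining = req["num_tokens"] - req["num_computed_tokens"]
        grant = min(remaining, token_budget)
        if grant > 0:
            plan[i] = grant              # tokens this request computes now
            token_budget -= grant
        if token_budget == 0:
            break
    return plan

reqs = [
    {"num_computed_tokens": 99, "num_tokens": 100},  # mid-decode: needs 1
    {"num_computed_tokens": 0,  "num_tokens": 600},  # new prompt: prefill
]
print(schedule_step(reqs, token_budget=512))  # {0: 1, 1: 511}
```

One step mixes a single decode token and a 511-token prefill chunk; the 600-token prompt finishes its prefill on a later step. That is chunked prefill falling out of the budget model rather than being a special case.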
Manages the block pool you saw in the PagedAttention section. It allocates and frees fixed-size blocks and enables prefix caching by hashing block contents. [13]
Why it matters: KV cache is the #1 memory consumer. This manager is why vLLM can serve more concurrent requests than naive implementations.
vllm/v1/core/kv_cache_manager.py
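A minimal block-pool sketch shows the mechanism, under stated simplifications: the block size, hashing scheme, and class below are illustrative rather than vLLM's implementation. Chaining each block's hash to its predecessor mirrors the idea that a cached block's identity covers its full prefix, so only true prefix matches share storage.

```python
import hashlib

BLOCK = 4  # tokens per physical block; vLLM's default differs

class BlockPool:
    """Illustrative fixed-size block allocator with prefix caching."""
    def __init__(self):
        self.blocks = {}    # chained content hash -> physical block id
        self.next_id = 0

    def allocate(self, tokens):
        ids, prev = [], ""
        for i in range(0, len(tokens), BLOCK):
            chunk = tuple(tokens[i:i + BLOCK])
            if len(chunk) < BLOCK:           # partial block: never shared
                ids.append(self.next_id)
                self.next_id += 1
                continue
            prev = hashlib.sha256((prev + repr(chunk)).encode()).hexdigest()
            if prev not in self.blocks:      # miss: allocate a new block
                self.blocks[prev] = self.next_id
                self.next_id += 1
            ids.append(self.blocks[prev])    # hit: reuse the existing block
        return ids

pool = BlockPool()
a = pool.allocate(list(range(8)))            # two full blocks
b = pool.allocate(list(range(8)) + [99])     # shared prefix + one new block
print(a, b)   # [0, 1] [0, 1, 2] — the prefix blocks are shared
```

The second request reuses both prefix blocks, which is exactly the win prefix caching delivers for repeated system prompts and scaffolding.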
Runs the actual torch.nn.Module forward pass on each GPU. It prepares input tensors, replays CUDA graphs for speed, and coordinates with attention backends (FlashAttention, FlashInfer) to read and write the paged KV cache. [11]
Why it matters: This is where GPU time is spent. CUDA graph capture eliminates CPU launch overhead, making each decode step faster.
vllm/v1/worker/gpu_model_runner.py
Abstracts how scheduled work reaches GPU workers. ParallelConfig defaults to UniProc when world_size == 1 and to Multiproc (mp) for typical multi-GPU setups; Ray is used for multi-node or Ray-backed deployments. All implementations still expose the same execute_model surface to the engine core. [11]
Why it matters: EngineCore.step() stays the same; changing parallelism swaps the executor implementation behind model_executor.
vllm/v1/executor/
The scheduler orchestrates the concepts from the PagedAttention and Batching sections: it assigns tokens from a per-step budget with no rigid prefill/decode split, enabling chunked prefills, prefix caching, speculative decoding, and mixed batches in one loop. When KV cache space is exhausted, the scheduler preempts the lowest-priority request (V1 default: recompute). [8] For prefix caching details, see the design doc.
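Preemption by recompute can be sketched as follows. The function, field names, and priority convention here are hypothetical simplifications: the point is that the victim's KV blocks are freed and its progress counter resets, so it re-prefills from scratch when rescheduled.

```python
# Illustrative preemption-by-recompute sketch (not vLLM code).
def preempt_for_space(running, waiting, blocks_needed, free_blocks):
    while free_blocks < blocks_needed and running:
        # Assume higher numeric priority value = lower importance.
        victim = max(running, key=lambda r: r["priority"])
        running.remove(victim)
        free_blocks += victim["blocks"]       # its KV blocks go back to the pool
        victim["blocks"] = 0
        victim["num_computed_tokens"] = 0     # recompute: prefill again later
        waiting.insert(0, victim)             # rejoin the queue at the front
    return free_blocks

running = [{"id": "a", "priority": 0, "blocks": 6, "num_computed_tokens": 48},
           {"id": "b", "priority": 9, "blocks": 4, "num_computed_tokens": 32}]
waiting = []
free = preempt_for_space(running, waiting, blocks_needed=3, free_blocks=0)
print(free, waiting[0]["id"])   # 4 b
```

Recompute trades repeated prefill work for simplicity: no KV state has to be copied off the GPU, but a preempted long prompt pays its prefill cost again.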
The architecture diagram above maps to real OS processes. This calculator shows a common V1 topology under the default assumption that API-server count tracks data-parallel replicas; adjust the inputs to see how process count changes under that model. [11] [12]
The tuning lab shows how concurrency and token budget affect throughput, latency, and KV pressure. [8]
More simultaneous users raise overall throughput but can slow individual responses. The sliders below let you explore that tradeoff, and the table lists the flags that control it. [8] [21]
Sliders are illustrative: they show which levers usually lift throughput, which smooth streaming, and which add preemption or OOM risk.
This profile is tuned for a shared chat service: enough token budget to keep throughput healthy, enough headroom to avoid constant preemption, and enough prefix reuse to reward repeated scaffolding.
| Lever | Primary Effect | Tradeoff | How to explain it |
|---|---|---|---|
| --gpu-memory-utilization | More GPU memory for the model executor (weights, runtime, and KV cache); range (0, 1], default 0.9 [8] | Less memory headroom for spikes and variability | Raise only when the workload is stable and OOM risk is well understood. |
| --max-num-seqs | More sequences per scheduler step | More contention for KV capacity and scheduler budget | Best for short requests; lower it when contexts are long or heterogeneous. |
| --max-num-batched-tokens | More work per scheduler step | Larger values generally improve throughput and TTFT; smaller values often improve ITL when chunked prefill is on (vLLM V1 default when supported) | Tune against your TTFT, ITL, and prefill mix; vLLM docs recommend trying values above 8192 for throughput on smaller models on large GPUs. |
| --max-model-len | Maximum sequence length (prompt plus generated tokens) the engine allows | KV grows with actual length, but a higher cap raises worst-case per-request memory | Set the cap to the longest context you must serve; unused headroom still shapes KV reservation planning. |
| --tensor-parallel-size | Fits larger models across GPUs | More inter-GPU communication | Use the smallest TP value that fits comfortably on the available hardware. |
| --quantization | Lower memory footprint and cost | Model-specific accuracy and compatibility tradeoffs | Measure accuracy and latency on representative prompts; effects vary by scheme and kernel. |
| --enable-prefix-caching / --no-enable-prefix-caching | If omitted, vLLM enables prefix caching when the loaded model supports it; pass --no-enable-prefix-caching to force it off [8] | Extra cache bookkeeping with little value when prompts are unique | Disable when reuse is rare; pass --enable-prefix-caching to turn it back on after disabling. |
| --scheduling-policy | Admission fairness and priority behavior | Different tail-latency profiles under contention | priority orders waiting work by request priority (lower numeric values first, then arrival time); fcfs is strict arrival order. Use priority when the API assigns priorities and you want that order under load. |
The test dataset is the most overlooked tuning variable. Synthetic or toy prompts lead to misleading conclusions because vLLM performance depends on incoming request shapes and arrival patterns. [21]
For guidance on choosing between TP, PP, DP, EP, and CP, see the interactive parallelism chooser in the Deployment section. The --tensor-parallel-size row above is the most common tuning entry point. [4]
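A back-of-envelope KV sizing calculation ties several of these flags together. The formula below is the standard per-token KV footprint (2 for K and V, times layers, KV heads, head dimension, and dtype width); the model shape and the 20 GiB budget are hypothetical examples, so read the real values from your model config and measured free memory.

```python
# Back-of-envelope KV cache sizing (standard formula, not a vLLM API).
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # 2 = one K and one V tensor per layer; dtype_bytes=2 assumes fp16/bf16.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Hypothetical 32-layer model with grouped-query attention (8 KV heads).
per_tok = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
budget_gib = 20   # KV budget left after weights, per --gpu-memory-utilization
tokens = budget_gib * 1024**3 // per_tok
print(per_tok, tokens)   # 131072 bytes/token -> 163840 tokens of KV capacity
```

Dividing that token capacity by --max-model-len gives a worst-case concurrent-sequence count, which is a useful sanity check against --max-num-seqs before load testing.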
You finished the path. Switch to see the guide from a different angle, or show everything.
Understand costs, capabilities, and when vLLM fits.
Deploy, tune, and operate vLLM in production.
Understand the engine internals and scheduling.
Commands, HTTP API paths, and links to cited sources.
# Serve a model
vllm serve <model>
# Common options
vllm serve <model> \
--tensor-parallel-size 4 \
--max-model-len 4096
# Benchmark
vllm bench serve --model <model>
Commands above are drawn from the official serve and benchmark CLI docs. [7] [33]
# Chat completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "<model>", "messages": [
{"role": "user", "content": "Hello!"}
]}'
# List models
curl http://localhost:8000/v1/models
Paths follow the documented OpenAI-compatible server surface. [7]
The cited sources cover the vllm serve CLI, OpenAI-compatible server flags, and operational controls.