vLLM Production Deployment: A No-Bullshit Configuration Guide From Someone Who Burned Their Fingers

Let me be blunt. If you think deploying vLLM is just `pip install vllm && python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct`, you’re going to have a bad time. I’ve seen teams burn weeks on this because they treated vLLM like a black box.

It’s not. And the defaults? They’re designed for your laptop, not production.

Default Configs Are a Trap

The default --max-model-len and --gpu-memory-utilization values will get you running on a single GPU. That’s it. Nothing more.

Worst case I’ve seen: a team running a 70B model on 4x A100s set max_num_seqs to 256 because they thought it was just a concurrency limit. They didn’t realize that every one of those sequences holds KV cache while it runs, so the setting is a memory decision too. Their service crashed every 5 minutes from OOM.

Here’s the core tension vLLM forces you to navigate: higher throughput means batching more sequences together, but each running sequence holds KV cache that grows with its length, up to max_model_len in the worst case. Crank max_num_seqs too high, and worst-case KV cache demand blows past your GPU memory budget.
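To put numbers on that tension, here’s a back-of-the-envelope sketch. The function names are mine, and the model shape (32 layers, 8 KV heads, head dimension 128) is what I’m assuming for Llama-3.1-8B. It computes the worst case where every sequence fills its whole context, which real traffic rarely does, so treat it as an upper bound rather than a prediction.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache bytes needed for ONE token across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    dtype_bytes=2 assumes fp16/bf16 KV cache.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def worst_case_kv_gib(max_num_seqs: int, max_model_len: int,
                      bytes_per_token: int) -> float:
    """Upper bound: every sequence in the batch is at full context length."""
    return max_num_seqs * max_model_len * bytes_per_token / 2**30


if __name__ == "__main__":
    # Assumed shape for an 8B Llama-style model with grouped-query attention.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    for seqs in (16, 32, 64):
        gib = worst_case_kv_gib(seqs, max_model_len=8192, bytes_per_token=per_token)
        print(f"max_num_seqs={seqs:>3}: worst-case KV cache = {gib:.0f} GiB")
```

If the worst-case number is several times your free GPU memory, either max_num_seqs or max_model_len is set more generously than the hardware can back up.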

That Config Calculator? Actually Useful

Someone on r/mlops built a vLLM Configuration Calculator back in early June. I was skeptical, but I tested it against my setup — a 7B model on a single A10G. It recommended max_num_seqs=32. I had been running max_num_seqs=64 by gut feel.

The result: P99 latency dropped 65%. Throughput took a hit, sure, but the service actually stayed up. Worth the trade-off.

My workflow now: run the calculator for a baseline, then load test and tune from there. Saves me at least 80% of initial config time.

Continuous Batching Isn’t Magic

vLLM’s killer feature is continuous batching. But here’s what nobody tells you: if your request lengths vary wildly, efficiency tanks.

Picture this: your app handles both “translate this sentence” short requests and “summarize this paper” long ones. vLLM’s scheduler doesn’t sort requests by length, so when the gap is large, short requests get stuck waiting for long prefill phases to finish.

What I do: split the service into two endpoints. One for short contexts (--max-model-len 2048), one for long ones (--max-model-len 32768). Each instance now has tight request length distributions, and batch efficiency stays high.
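The routing side of that split can be tiny. Here’s a minimal sketch, assuming you already have a token count for the prompt (from your tokenizer) and a cap on generated tokens. The endpoint URLs and the pick_endpoint name are hypothetical; only the two length limits come from the setup above.

```python
SHORT_LIMIT = 2048    # matches --max-model-len on the short-context instance
LONG_LIMIT = 32768    # matches --max-model-len on the long-context instance

# Hypothetical service addresses; substitute your own.
ENDPOINTS = {
    "short": "http://vllm-short:8000/v1",
    "long": "http://vllm-long:8000/v1",
}


def pick_endpoint(prompt_tokens: int, max_new_tokens: int) -> str:
    """Route by total context need: prompt plus the generation budget."""
    total = prompt_tokens + max_new_tokens
    if total <= SHORT_LIMIT:
        return ENDPOINTS["short"]
    if total <= LONG_LIMIT:
        return ENDPOINTS["long"]
    raise ValueError(f"request needs {total} tokens, above the {LONG_LIMIT} limit")


if __name__ == "__main__":
    print(pick_endpoint(prompt_tokens=120, max_new_tokens=256))
    print(pick_endpoint(prompt_tokens=9000, max_new_tokens=1024))
```

Routing on prompt plus generation budget, rather than prompt alone, keeps a short prompt with a huge output cap from landing on the short instance and getting rejected.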

Prefix Caching: Know When to Use It

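It’s off by default in the vLLM 0.6 releases; newer releases may turn it on for you, so check your version. To enable it explicitly, pass --enable-prefix-caching. And when you turn it on? Expect KV cache usage to sit higher, because cached blocks stay resident, indexed through a hash table, until they get evicted.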

But for multi-turn conversations or RAG pipelines where prompts share common prefixes? It saves 30-50% of prefill time. The catch: the first request with a given prefix gets no speedup, because the cache has to be populated before anything can hit it. Don’t use this for cold-start-heavy serverless deployments.
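If you want intuition for why shared prefixes are cheap to detect, here’s a toy sketch of block-level prefix hashing. It illustrates the general idea only and is not vLLM’s actual implementation; the block size of 16 and both function names are my assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; an assumption for this illustration


def block_hashes(token_ids: list, block_size: int = BLOCK_SIZE) -> list:
    """Hash each FULL block of tokens, chaining in the previous block's hash.

    Chaining means a block's hash depends on everything before it, so two
    prompts share a block hash only if they share the entire prefix up to
    and including that block. Trailing partial blocks are not hashed.
    """
    hashes = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def shared_prefix_blocks(a: list, b: list) -> int:
    """Count leading blocks whose KV entries could be reused between a and b."""
    count = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        count += 1
    return count


if __name__ == "__main__":
    system_prompt = list(range(40))           # stand-in token ids
    turn_1 = system_prompt + [1000, 1001]
    turn_2 = system_prompt + [2000] * 10
    n = shared_prefix_blocks(turn_1, turn_2)
    print(f"{n} shared blocks, {n * BLOCK_SIZE} tokens of prefill skipped")
```

Note that only whole blocks count: a 40-token shared prefix with 16-token blocks reuses 32 tokens, not 40.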

The GPU Memory Utilization Question

--gpu-memory-utilization defaults to 0.9. Here’s how I set it:

Low-latency workloads (real-time chat): 0.85. Leave headroom for spikes.
High-throughput workloads (batch inference): 0.95. Push it.
Never set it to 1.0: You need memory for CUDA kernels, temporary tensors, and other overhead. I learned this the hard way when a 0.98 setting caused the very first request to OOM.
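Here’s roughly how I reason about what a given setting leaves for KV cache. This is a simplified sketch: it treats the setting as a cap on the instance’s total footprint (weights, activation peak, and KV cache together), and the function name and the example numbers (24 GiB card, 13 GiB of weights, 1.5 GiB activation peak) are assumptions of mine, not measurements.

```python
def kv_cache_budget_gib(total_gib: float, utilization: float,
                        weights_gib: float, activation_gib: float) -> float:
    """Memory left for KV cache after weights and peak activations.

    utilization is the fraction of total GPU memory the server may claim.
    Returns 0.0 when weights plus activations already exceed the cap.
    """
    budget = total_gib * utilization - weights_gib - activation_gib
    return max(budget, 0.0)


if __name__ == "__main__":
    # Assumed numbers for a 7B bf16 model on a 24 GiB card.
    for util in (0.85, 0.90, 0.95):
        left = kv_cache_budget_gib(24.0, util, weights_gib=13.0, activation_gib=1.5)
        print(f"gpu_memory_utilization={util:.2f}: ~{left:.1f} GiB for KV cache")
```

On a small card, the jump from 0.85 to 0.95 is a big relative change in KV cache space, which is why the setting matters so much for throughput.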

That DeepLearning.AI Course? Worth 3 Hours

Cedric Clyburn’s vLLM course on DeepLearning.AI dropped in early June. I spent 3 hours on it and learned more than from the official docs. It covers continuous batching memory management internals, prefix caching implementation details, and — most importantly — how to use vLLM’s profiling mode to see per-step memory allocation.

The docs on profiling are garbage. The course actually walks you through it.

Best Practices Quick Reference

Parameter	Recommended Value	Context	Gotcha
`max_num_seqs`	16-64	Depends on GPU memory and model size	Too high → KV cache OOM
`gpu_memory_utilization`	0.85-0.95	0.85 for low latency, 0.95 for throughput	1.0 will crash
`max_model_len`	Match your actual needs	Don’t use default, set based on longest prompt	Wasteful if too high
`enable_prefix_caching`	Scenario-dependent	Good for multi-turn/RAG	Bad for cold starts
`tensor_parallel_size`	Match GPU count	70B needs 4+ A100s	Cross-node TP has high latency
`dtype`	`bfloat16`	Most scenarios	`float32` doubles memory
`kv_cache_dtype`	`fp8` (if supported)	H100 priority	Limited benefit on A100
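To keep these settings in one reviewable place, I like generating the launch command from code. Below is a minimal sketch; build_vllm_args is a hypothetical helper of mine, and the defaults are just the table’s suggestions. The flags are the dashed CLI forms of the parameters in the table, but confirm them against --help for your installed version.

```python
import shlex


def build_vllm_args(model: str, *, max_num_seqs: int,
                    gpu_memory_utilization: float, max_model_len: int,
                    tensor_parallel_size: int = 1, dtype: str = "bfloat16",
                    kv_cache_dtype: str = "auto",
                    enable_prefix_caching: bool = False) -> list:
    """Build an argv list for the OpenAI-compatible vLLM server."""
    # Guard against the "set it to 1.0" mistake before the server ever starts.
    if not 0.0 < gpu_memory_utilization < 1.0:
        raise ValueError("gpu_memory_utilization must be strictly between 0 and 1")
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--max-num-seqs", str(max_num_seqs),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--dtype", dtype,
        "--kv-cache-dtype", kv_cache_dtype,
    ]
    if enable_prefix_caching:
        args.append("--enable-prefix-caching")
    return args


if __name__ == "__main__":
    argv = build_vllm_args(
        "meta-llama/Llama-3.1-8B-Instruct",
        max_num_seqs=32,
        gpu_memory_utilization=0.85,
        max_model_len=8192,
        enable_prefix_caching=True,
    )
    print(shlex.join(argv))  # paste into a shell or hand argv to subprocess.run
```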

FAQ

Q: vLLM vs TGI — which one should I use?

A: Depends on your pain tolerance. vLLM’s memory management is more aggressive — you can pack more concurrent requests into the same GPU memory. TGI is more stable and predictable under load. My rule: vLLM for 80% of cases, TGI when I need rock-solid latency guarantees.

Q: What’s the largest model I can run on a single GPU?

A: With 24GB (RTX 4090), a 7B model at 4-bit quantization works, but concurrency caps at 4-8. For 70B, you need at least 4x A100 80GB with tensor parallelism.
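The arithmetic behind those numbers is simple enough to sketch. This counts weights only; it ignores quantization metadata, KV cache, and activations, so real usage is higher. The function name is mine.

```python
def weights_gib(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    print(f"7B  @ 4-bit : {weights_gib(7, 4):.1f} GiB")
    print(f"7B  @ 16-bit: {weights_gib(7, 16):.1f} GiB")
    print(f"70B @ 16-bit: {weights_gib(70, 16):.1f} GiB")
```

Whatever is left after the weights is what you have for KV cache, which is why a model that technically fits can still cap out at single-digit concurrency.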

Q: Does vLLM support multi-node?

A: Yes, but latency takes a hit. Cross-node NCCL communication is the bottleneck. I’d recommend keeping tensor parallelism (TP) within a single node, and using data parallelism (DP) or just separate services across nodes.

Q: How do I monitor vLLM performance?

A: Hit the /metrics endpoint (Prometheus format). Watch these: vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:request_prompt_tokens. We built a Grafana dashboard around these and caught three OOM issues within the first hour.
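If you want a quick look before wiring up Prometheus, a few lines of Python will do. This is a rough sketch: it assumes the server listens on localhost:8000, that lines carry no trailing timestamp, and it sums samples that share a metric name across label sets, which is only meaningful for gauges and counters. The sample text is made up for illustration, and metric names vary between vLLM versions.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Parse Prometheus text format into {metric_name: summed_value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop the {label="..."} part
        values[name] = values.get(name, 0.0) + float(raw_value)
    return values


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict:
    """Fetch and parse the server's metrics; the URL is an assumption."""
    with urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Made-up sample in the same text format, for trying the parser offline.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 12.0
vllm:num_requests_waiting{model_name="demo"} 3.0
"""

if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print(metrics["vllm:num_requests_running"], metrics["vllm:num_requests_waiting"])
```

A waiting count that stays above zero while the running count sits at your max_num_seqs is the first sign the instance is saturated.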

Q: Is speculative decoding worth it?

A: On H100s running small models (<13B), --speculative-model can give you 1.5-2x throughput. On A100s? Not really. Plus you need to set up a draft model and tune --num-speculative-tokens. Skip it until you’ve exhausted other optimizations.

One Last Thing

The vLLM community on Reddit (r/vllm, r/mlops) is more useful than the official docs. The docs lag behind the codebase. I’ve gotten 90% of my config knowledge from Reddit threads and that calculator someone built.

Don’t expect to nail the config on your first try. Get it running. Monitor. Tune. Repeat. That’s the only path that works.