Most LLMs need specific configuration to run properly on vLLM. Default settings work for some models, but many require custom tokenization, attention mechanisms, or feature flags. Without the right settings, workers may fail to load or produce incorrect outputs.
When deploying a model, check its Hugging Face README and the vLLM documentation for required settings.
## Environment variables
vLLM is configured using command-line flags. On Runpod, set these as environment variables instead.
Strip the leading dashes, convert the flag name to uppercase, and replace any remaining dashes with underscores.
For example:
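```
--tool-call-parser mistral
```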
Becomes:
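```
TOOL_CALL_PARSER=mistral
```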
### Example: Deploying Mistral
CLI command:
```bash
vllm serve mistralai/Ministral-8B-Instruct-2410 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --enable-auto-tool-choice \
  --tool-call-parser mistral
```
Equivalent Runpod environment variables:
| Variable | Value |
|---|---|
| MODEL_NAME | mistralai/Ministral-8B-Instruct-2410 |
| TOKENIZER_MODE | mistral |
| CONFIG_FORMAT | mistral |
| LOAD_FORMAT | mistral |
| ENABLE_AUTO_TOOL_CHOICE | true |
| TOOL_CALL_PARSER | mistral |
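If you need to convert a longer command, the renaming rule is easy to script. The sketch below is a hypothetical helper (not part of the Runpod or vLLM tooling) that applies the same conversion to the flags from the command above:

```python
# Hypothetical helper: convert vLLM CLI flags to Runpod-style environment
# variable names by stripping leading dashes, replacing remaining dashes with
# underscores, and uppercasing. The positional model argument maps to
# MODEL_NAME separately, and store-true flags map to the value "true".
def flags_to_env(flags: dict[str, str]) -> dict[str, str]:
    return {
        flag.lstrip("-").replace("-", "_").upper(): value
        for flag, value in flags.items()
    }

mistral_flags = {
    "--tokenizer_mode": "mistral",
    "--config_format": "mistral",
    "--load_format": "mistral",
    "--enable-auto-tool-choice": "true",
    "--tool-call-parser": "mistral",
}

for name, value in flags_to_env(mistral_flags).items():
    print(f"{name}={value}")
# TOKENIZER_MODE=mistral
# CONFIG_FORMAT=mistral
# LOAD_FORMAT=mistral
# ENABLE_AUTO_TOOL_CHOICE=true
# TOOL_CALL_PARSER=mistral
```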
## Model-specific configurations
Recommended environment variables for popular model families. Check your model’s documentation for exact requirements.
| Model family | Example model | Key environment variables | Notes |
|---|---|---|---|
| Qwen3 | Qwen/Qwen3-8B | ENABLE_AUTO_TOOL_CHOICE=true TOOL_CALL_PARSER=hermes | For AWQ/GPTQ versions, set QUANTIZATION accordingly. |
| OpenChat | openchat/openchat-3.5-0106 | None required | Use CUSTOM_CHAT_TEMPLATE if default templates produce poor results. |
| Gemma | google/gemma-3-1b-it | None required | Requires HF_TOKEN. Set DTYPE=bfloat16 for best results. |
| DeepSeek-R1 | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B | REASONING_PARSER=deepseek_r1 | Enables reasoning mode for chain-of-thought outputs. |
| Phi-4 | microsoft/Phi-4-mini-instruct | None required | ENFORCE_EAGER=true can resolve initialization issues on older CUDA versions. |
| Llama 3 | meta-llama/Llama-3.2-3B-Instruct | TOOL_CALL_PARSER=llama3_json ENABLE_AUTO_TOOL_CHOICE=true | Use MAX_MODEL_LEN to prevent KV cache from exceeding GPU VRAM. |
| Mistral | mistralai/Ministral-8B-Instruct-2410 | TOKENIZER_MODE=mistral CONFIG_FORMAT=mistral LOAD_FORMAT=mistral TOOL_CALL_PARSER=mistral ENABLE_AUTO_TOOL_CHOICE=true | Mistral models require specialized tokenizers. |
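For instance, a quantized Qwen3 deployment would combine the table's tool-calling settings with a QUANTIZATION value. The repository name below is illustrative; confirm it matches the exact AWQ or GPTQ build you deploy:

```
MODEL_NAME=Qwen/Qwen3-8B-AWQ
QUANTIZATION=awq
ENABLE_AUTO_TOOL_CHOICE=true
TOOL_CALL_PARSER=hermes
```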
## GPU selection
vLLM pre-allocates GPU memory for its KV cache, so you need more VRAM than the minimum required to load the model weights.
### VRAM estimation
- FP16/BF16: 2 bytes per parameter.
- INT8: 1 byte per parameter.
- INT4 (AWQ/GPTQ): 0.5 bytes per parameter.
- KV cache: vLLM reserves 10-30% of remaining VRAM for concurrent requests.
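As a rough sketch of this arithmetic in Python: the bytes-per-parameter figures come from the list above, while the extra headroom factor is an assumed stand-in for the KV cache reservation, so treat the output as a ballpark figure only.

```python
# Back-of-the-envelope VRAM estimate for serving a model with vLLM.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, dtype: str = "bf16",
                     kv_headroom: float = 0.20) -> float:
    """Weight memory plus an assumed ~20% headroom for the KV cache."""
    weights_gb = params_billion * BYTES_PER_PARAM[dtype]
    return weights_gb * (1.0 + kv_headroom)

# An 8B model in BF16: ~16 GB of weights, ~19 GB with headroom,
# which lands in the small-model tier of the table below.
print(f"{estimate_vram_gb(8, 'bf16'):.1f} GB")
```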
| Model size | Recommended GPUs | VRAM |
|---|---|---|
| Small (<10B) | RTX 4090, A6000, L4 | 16-24 GB |
| Medium (10B-30B) | A6000, L40S | 32-48 GB |
| Large (30B-70B) | A100, H100, B200 | 80-180 GB |
### Troubleshooting memory issues
- OOM errors: Lower GPU_MEMORY_UTILIZATION from 0.90 to 0.85, or reduce MAX_MODEL_LEN.
- Context window limits: More context means more KV cache. A 7B model that OOMs at 32k context often runs fine at 16k.
- Limited VRAM: Use quantized models (AWQ/GPTQ) to reduce memory by 50-75%.
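For example, a memory-constrained deployment might combine these mitigations; the values below are illustrative starting points rather than requirements:

```
GPU_MEMORY_UTILIZATION=0.85
MAX_MODEL_LEN=16384
```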
## Additional resources