## Deployment issues

### Worker fails to start
If your worker fails to start or initialize:
- Check logs: View endpoint logs in the Runpod console for error messages.
- Verify local testing: Ensure your handler works in local testing before deploying.
- Check dependencies: Verify all dependencies are installed in your Docker image.
- GPU compatibility: Ensure your Docker image is compatible with the selected GPU type.
- Input format: Verify your input format matches what your handler expects.
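The "verify local testing" step above can be exercised before deploying by calling your handler directly with a sample job payload. The handler body and input shape below are hypothetical placeholders for your own code:

```python
# Minimal sketch of exercising a handler locally before deploying.
# The handler logic and input fields are illustrative placeholders.

def handler(job):
    """Echo-style handler: expects {"input": {"prompt": ...}}."""
    prompt = job["input"]["prompt"]
    return {"output": prompt.upper()}

if __name__ == "__main__":
    # Simulate the job payload the endpoint would deliver to your worker.
    test_job = {"id": "local-test", "input": {"prompt": "hello"}}
    print(handler(test_job))
```

If the handler misbehaves here, it will also fail on the endpoint, so this catches input-format and dependency problems cheaply.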
### Worker initializes but fails on requests
| Issue | Solution |
|---|---|
| Input validation errors | Add input validation in your handler and check logs for the expected format |
| Missing dependencies | Verify all required packages are in your Dockerfile |
| Model loading failures | Check GPU memory requirements and model path |
| Permission errors | Ensure files are readable and directories are writable |
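One way to surface the input validation errors in the first row is to check required fields up front and return a structured error instead of raising. The field names here are illustrative, not part of any Runpod contract:

```python
REQUIRED_FIELDS = ["prompt"]  # illustrative; match your handler's actual contract

def handler(job):
    job_input = job.get("input") or {}
    missing = [f for f in REQUIRED_FIELDS if f not in job_input]
    if missing:
        # Returning an "error" key fails the job with a readable message in the logs.
        return {"error": f"Missing required input field(s): {', '.join(missing)}"}
    return {"output": f"Processed: {job_input['prompt']}"}
```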
## Job issues

### Jobs stuck in queue
If jobs remain in the `IN_QUEUE` state for extended periods:
- No workers available: Check if `max_workers` is set appropriately.
- Workers throttled: Your endpoint may be hitting rate limits. Check the Workers tab for throttled workers.
- Cold start delays: First requests after an idle period require worker initialization. Consider increasing `min_workers` or enabling FlashBoot.
### Jobs timing out
| Cause | Solution |
|---|---|
| Processing takes too long | Increase `executionTimeout` in your job policy |
| Model loading too slow | Use model caching or bake models into your image |
| TTL too short | Set `ttl` to cover both queue time and execution time |
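The policy values above travel with the request body. A sketch of a `/run` payload, assuming a `policy` object with `executionTimeout` and `ttl` in milliseconds; verify the field names against the current API reference before relying on them:

```python
# Sketch of a /run request body carrying an execution policy.
# The "policy" field names and millisecond units are assumptions to verify.
payload = {
    "input": {"prompt": "hello"},
    "policy": {
        "executionTimeout": 600_000,  # 10 minutes of execution time
        "ttl": 3_600_000,             # 1 hour: must cover queue + execution time
    },
}
```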
### Jobs failing
Check the job status response for error details. Common causes:
- Handler exceptions: Unhandled exceptions in your handler code. Add try/catch blocks and return structured errors.
- OOM (Out of Memory): Model or batch size exceeds GPU memory. Reduce batch size or use a larger GPU.
- Timeout: Job exceeded execution timeout. Increase timeout or optimize processing.
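The handler exceptions in the first bullet can be caught and converted into structured errors, so the job status carries a useful message instead of an opaque failure. A minimal sketch, where `process` stands in for your own logic:

```python
import traceback

def process(job_input):
    """Placeholder for your actual processing logic."""
    if "prompt" not in job_input:
        raise ValueError("missing 'prompt'")
    return job_input["prompt"][::-1]

def handler(job):
    try:
        return {"output": process(job["input"])}
    except Exception as exc:
        # Returning an "error" key fails the job with a readable message;
        # the traceback helps when reading endpoint logs later.
        return {"error": str(exc), "traceback": traceback.format_exc()}
```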
## Cold start issues

### Slow cold starts
Cold start time includes container startup, model loading, and initialization. To reduce cold starts:
- Use model caching: Store models on network volumes instead of downloading on each start.
- Enable FlashBoot: Use FlashBoot for faster container initialization.
- Optimize image size: Use smaller base images and remove unnecessary dependencies.
- Initialize outside handler: Load models at module level, not inside the handler function.
```python
# Good: Load the model once at startup
model = load_model()

def handler(job):
    return model.predict(job["input"])
```

```python
# Bad: Load the model on every request
def handler(job):
    model = load_model()  # Slow!
    return model.predict(job["input"])
```
### Too many cold starts
If you’re seeing frequent cold starts:
- Increase idle timeout: Set a longer `idle_timeout` to keep workers warm between requests.
- Set minimum workers: Configure `min_workers` > 0 to maintain warm workers.
- Check traffic patterns: Sporadic traffic causes more cold starts than steady traffic.
## Logging issues

### Missing logs
If logs aren’t appearing in the console:
- Check throttling: Excessive logging triggers throttling. Reduce log verbosity.
- Verify output streams: Ensure you’re writing to stdout/stderr, not just files.
- Check worker status: Logs only appear for successfully initialized workers.
- Retention period: Logs older than 90 days are automatically removed.
### Log throttling
To avoid log throttling:
- Reduce log verbosity in production.
- Use structured logging for efficiency.
- Store detailed logs on network volumes instead of console output.
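Structured logging, as suggested above, keeps each event to one compact stdout line that log pipelines can parse. A minimal sketch using only the standard library:

```python
import json
import sys
import time

def log_event(level, message, **fields):
    """Emit one JSON object per line to stdout (not to a local file),
    and return the record so callers can inspect it."""
    record = {"ts": time.time(), "level": level, "msg": message, **fields}
    sys.stdout.write(json.dumps(record) + "\n")
    return record

log_event("info", "job started", job_id="abc123")
```

One line per event with a bounded set of fields is both cheaper than verbose multi-line prints and easier to filter in the console.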
## vLLM-specific issues

### OOM errors
If your vLLM worker runs out of memory:
- Lower `GPU_MEMORY_UTILIZATION` from 0.90 to 0.85.
- Reduce `MAX_MODEL_LEN` to limit the context window.
- Use a GPU with more VRAM.
### Model not loading
| Issue | Solution |
|---|---|
| Model not found | Verify `MODEL_NAME` matches the Hugging Face model ID exactly |
| Gated model access denied | Set `HF_TOKEN` with a token that has access to the model |
| Incompatible model | Check the list of vLLM supported models |
### OpenAI API errors
| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key | Verify `RUNPOD_API_KEY` is correct |
| 404 Not Found | Wrong endpoint URL | Use the format `https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1` |
| Connection refused | Endpoint not ready | Wait for workers to initialize |
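The 404 row above usually comes down to the base URL format. A small helper that builds it from the endpoint ID, using the URL pattern from the table (the helper itself and the commented client usage are illustrative):

```python
def openai_base_url(endpoint_id: str) -> str:
    """Build the OpenAI-compatible base URL for a Runpod endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

# Assumed usage with the official OpenAI Python client:
#   from openai import OpenAI
#   client = OpenAI(api_key=RUNPOD_API_KEY,
#                   base_url=openai_base_url("your-endpoint-id"))
```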
## Load balancing endpoint issues

### "No workers available" error
This means workers didn’t initialize in time. Common causes:
- First request: Workers need time to start. Retry the request. (See Handling cold starts for more information.)
- All workers busy: Increase `max_workers` to handle more concurrent requests.
- Workers crashing: Check logs for initialization errors.
### Requests not reaching workers
Verify your HTTP server is:
- Listening on port 8000 (or the port specified in your configuration).
- Binding to `0.0.0.0`, not `127.0.0.1`.
- Returning proper HTTP responses.
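The checklist above boils down to binding your server to `0.0.0.0:8000` and answering with well-formed HTTP. A stdlib-only sketch; the response body and route handling are illustrative, so match whatever paths your configuration actually expects:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Return a proper HTTP response so the load balancer
        # sees the worker as healthy.
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

if __name__ == "__main__":
    # Bind to 0.0.0.0 (all interfaces), not 127.0.0.1,
    # so traffic from outside the container reaches the worker.
    HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```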
## Getting help
If you’re still experiencing issues:
- Check endpoint logs for detailed error messages.
- SSH into workers using SSH access to debug in real-time.
- Review metrics in the Metrics tab to identify patterns.
- Contact support at help@runpod.io with your endpoint ID and error details.