Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike queue-based endpoints, which process requests sequentially, load balancing distributes requests across your worker pool for lower latency. You can create custom REST endpoints accessible via a unique URL.
- Build a worker: Create and deploy a load balancing worker.
- vLLM load balancer: Deploy vLLM with load balancing.
Load balancing vs. queue-based endpoints
Queue-based endpoints
With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`.
These endpoints are better suited to tasks that can be processed asynchronously, and they guarantee that every request is processed, much as TCP guarantees packet delivery in networking.
Load balancing endpoints (new)
Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework, such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs. These endpoints are ideal for real-time applications and streaming, but they provide no queuing mechanism for request backlog, similar to UDP’s behavior in networking.
Endpoint type comparison table
| Aspect | Load balancing | Queue-based |
|---|---|---|
| Request flow | Direct to worker HTTP server | Through queueing system |
| Implementation | Custom HTTP server (FastAPI, Flask, etc.) | Handler function |
| API flexibility | Custom URL paths, any HTTP capability | Fixed /run and /runsync endpoints |
| Backpressure | Drops requests when overloaded | Queue buffering |
| Latency | Lower (single-hop) | Higher (queue + worker) |
| Error handling | No built-in retry | Automatic retries |
Worker comparison
The sketches below contrast the two patterns; handler names, routes, and payload fields are illustrative. Queue-based worker (traditional):
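```python
import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    # Do the work and return a JSON-serializable result.
    return {"output": f"Processed: {prompt}"}

runpod.serverless.start({"handler": handler})
```
Load balancing worker (a minimal FastAPI sketch; any HTTP framework works, and the `/generate` route and request model here are examples, not a required contract):
```python
import os

from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.get("/ping")
def ping():
    # Health check endpoint polled by the load balancer.
    return Response(status_code=200)

@app.post("/generate")
def generate(request: GenerateRequest):
    return {"output": f"Processed: {request.prompt}"}

if __name__ == "__main__":
    import uvicorn
    # Runpod routes traffic to the port set in the PORT environment variable.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "80")))
```
Once deployed, the worker's custom routes are reachable directly at URLs like https://ENDPOINT_ID.api.runpod.ai/ping and https://ENDPOINT_ID.api.runpod.ai/generate.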
Health checks
Workers must expose a `/ping` endpoint on the `PORT_HEALTH` port. The load balancer periodically checks this endpoint:
| Response code | Status |
|---|---|
| 200 | Healthy |
| 204 | Initializing |
| Other | Unhealthy |
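For example, a worker can report 204 while its model is still loading and 200 once it is ready. A minimal sketch, assuming a FastAPI app and an illustrative `model_ready` flag standing in for your own readiness check:
```python
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = False  # Flip to True once your model has finished loading.

@app.get("/ping")
def ping():
    # 200 = healthy, 204 = still initializing (see the table above).
    return Response(status_code=200 if model_ready else 204)
```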
When calculating endpoint metrics, Runpod measures a load balancing worker's cold start time as the interval from when `/ping` first returns 204 to when it first returns 200.
Environment variables
| Variable | Default | Description |
|---|---|---|
| PORT | 80 | Main application server port |
| PORT_HEALTH | Same as PORT | Health check endpoint port |
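In code, a worker might read these variables as follows (a small sketch; note that `PORT_HEALTH` falls back to the main port when not set separately):
```python
import os

port = int(os.environ.get("PORT", "80"))
# PORT_HEALTH defaults to the main application port when unset.
health_port = int(os.environ.get("PORT_HEALTH", str(port)))
```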
Timeouts and limits
| Limit | Value |
|---|---|
| Request timeout | 2 min (no worker available) |
| Processing timeout | 5.5 min (per request) |
| Payload limit | 30 MB (request and response) |
Handling cold starts
When workers are initializing, you may get “no workers available” errors. Implement retry logic to handle this, as in the sketch below:
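A minimal sketch with exponential backoff, assuming a requests-based client and that the endpoint returns HTTP 503 when no worker is available (adjust the status check to match your deployment):
```python
import time

import requests

def call_with_retries(url, payload, max_retries=5):
    """POST to a load balancing endpoint, retrying while workers cold start."""
    for attempt in range(max_retries):
        # Keep the client timeout within the 2 min request timeout above.
        response = requests.post(url, json=payload, timeout=120)
        if response.status_code != 503:  # Assumed "no workers available" code.
            return response
        time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"No workers available after {max_retries} retries")
```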
When to use load balancing endpoints
Use load balancing endpoints when you need:
- Direct access to your model’s HTTP server.
- Internal batching systems (like vLLM).
- Non-JSON payloads.
- Multiple endpoints within a single worker.
- Lower latency for real-time applications.