Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike queue-based endpoints, which process requests sequentially, load balancing distributes requests across your worker pool for lower latency. You can create custom REST endpoints accessible via a unique URL.
- Build a worker: Create and deploy a load balancing worker.
- vLLM load balancer: Deploy vLLM with load balancing.
Load balancing vs. queue-based endpoints
Queue-based endpoints
With queue-based endpoints, requests are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`.
These endpoints are better suited to tasks that can be processed asynchronously, and they guarantee that every request is processed, much as TCP guarantees packet delivery in networking.
Load balancing endpoints (new)
Load balancing endpoints send requests directly to workers without queuing. You can use any HTTP framework, such as FastAPI or Flask, and define custom URL paths and API contracts to suit your specific needs. These endpoints are ideal for real-time applications and streaming, but they provide no queuing mechanism for request backlog, similar to UDP’s behavior in networking.
Endpoint type comparison table
| Aspect | Load balancing | Queue-based |
|---|---|---|
| Request flow | Direct to worker HTTP server | Through queueing system |
| Implementation | Custom HTTP server (FastAPI, Flask, etc.) | Handler function |
| API flexibility | Custom URL paths, any HTTP capability | Fixed /run and /runsync endpoints |
| Backpressure | Drops requests when overloaded | Queue buffering |
| Latency | Lower (single-hop) | Higher (queue + worker) |
| Error handling | No built-in retry | Automatic retries |
Worker comparison
The sketches below contrast the two patterns; handler names, routes, and payload fields are illustrative. Queue-based worker (traditional):
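```python
import runpod

def handler(job):
    # job["input"] holds the JSON payload sent to /run or /runsync.
    prompt = job["input"].get("prompt", "")
    # Do the work and return a JSON-serializable result.
    return {"output": f"Processed: {prompt}"}

runpod.serverless.start({"handler": handler})
```
Load balancing worker (a minimal FastAPI sketch; any HTTP framework works, and the `/generate` route and request model here are examples, not a required contract):
```python
import os

from fastapi import FastAPI, Response
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str

@app.get("/ping")
def ping():
    # Health check endpoint polled by the load balancer.
    return Response(status_code=200)

@app.post("/generate")
def generate(request: GenerateRequest):
    return {"output": f"Processed: {request.prompt}"}

if __name__ == "__main__":
    import uvicorn
    # Runpod routes traffic to the port set in the PORT environment variable.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", "80")))
```
Once deployed, the worker's custom routes are reachable directly at URLs like https://ENDPOINT_ID.api.runpod.ai/ping and https://ENDPOINT_ID.api.runpod.ai/generate.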
Health checks
Workers must expose a `/ping` endpoint on the `PORT_HEALTH` port. The load balancer periodically checks this endpoint:
| Response code | Status |
|---|---|
| 200 | Healthy |
| 204 | Initializing |
| Other | Unhealthy |
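For example, a worker can report 204 while its model is still loading and 200 once it is ready. A minimal sketch, assuming a FastAPI app and an illustrative `model_ready` flag standing in for your own readiness check:
```python
from fastapi import FastAPI, Response

app = FastAPI()
model_ready = False  # Flip to True once your model has finished loading.

@app.get("/ping")
def ping():
    # 200 = healthy, 204 = still initializing (see the table above).
    return Response(status_code=200 if model_ready else 204)
```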
When calculating endpoint metrics, Runpod measures a load balancing worker's cold start time as the interval from when `/ping` first returns 204 to when it first returns 200.
Environment variables
| Variable | Default | Description |
|---|---|---|
| PORT | 80 | Main application server port |
| PORT_HEALTH | Same as PORT | Health check endpoint port |
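In code, a worker might read these variables as follows (a small sketch; note that `PORT_HEALTH` falls back to the main port when not set separately):
```python
import os

port = int(os.environ.get("PORT", "80"))
# PORT_HEALTH defaults to the main application port when unset.
health_port = int(os.environ.get("PORT_HEALTH", str(port)))
```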
Timeouts and limits
| Limit | Value |
|---|---|
| Request timeout | 2 min (no worker available) |
| Processing timeout | 5.5 min (per request) |
| Payload limit | 30 MB (request and response) |
Handling cold starts
When workers are initializing, you may get “no workers available” errors. Implement retry logic to handle this, as in the sketch below:
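A minimal sketch with exponential backoff, assuming a requests-based client and that the endpoint returns HTTP 503 when no worker is available (adjust the status check to match your deployment):
```python
import time

import requests

def call_with_retries(url, payload, max_retries=5):
    """POST to a load balancing endpoint, retrying while workers cold start."""
    for attempt in range(max_retries):
        # Keep the client timeout within the 2 min request timeout above.
        response = requests.post(url, json=payload, timeout=120)
        if response.status_code != 503:  # Assumed "no workers available" code.
            return response
        time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"No workers available after {max_retries} retries")
```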
When to use load balancing endpoints
Use load balancing endpoints when you need:
- Direct access to your model’s HTTP server.
- Internal batching systems (like vLLM).
- Non-JSON payloads.
- Multiple endpoints within a single worker.
- Lower latency for real-time applications.