
Runpod offers custom pricing plans for large-scale and enterprise workloads. Contact our sales team to learn more.
Serverless offers pay-per-second pricing with no upfront costs. You’re billed from when a worker starts until it fully stops, rounded up to the nearest second.
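The round-up rule can be sketched in a few lines. This is an illustrative calculation only (the function name and example runtime are hypothetical; the per-second rate is the A100 flex rate from the table below):

```python
import math

def request_cost(runtime_seconds: float, rate_per_second: float) -> float:
    """Bill from worker start until full stop, rounded up to the nearest second."""
    billed_seconds = math.ceil(runtime_seconds)
    return billed_seconds * rate_per_second

# A worker that runs for 83.4 seconds on an A100 flex worker ($0.00076/s)
# is billed for 84 seconds: 84 * $0.00076 = ~$0.064.
cost = request_cost(83.4, 0.00076)
```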

Worker types

| | Flex workers | Active workers |
|---|---|---|
| Behavior | Scale to zero when idle | Always running (24/7) |
| Pricing | Standard per-second rate | 20–30% discount |
| Best for | Variable workloads, cost optimization | Consistent traffic, low-latency requirements |

GPU pricing

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
For the latest pricing, visit the Runpod pricing page.
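Because active workers run 24/7 at a discounted rate while flex workers bill only while running, there is a utilization level above which an active worker becomes cheaper. A rough sketch of that break-even point, using the A100 rates from the table above (the helper function is illustrative, not part of any Runpod API):

```python
def breakeven_utilization(flex_rate: float, active_rate: float) -> float:
    """Fraction of wall-clock time a flex worker must be busy before an
    always-on active worker at the discounted rate costs less."""
    return active_rate / flex_rate

# A100: $0.00076/s flex vs $0.00060/s active.
# Above roughly 79% utilization, an active worker is the cheaper option.
util = breakeven_utilization(0.00076, 0.00060)
```

This ignores cold-start and idle-timeout overhead, which push the real break-even point somewhat lower for bursty traffic.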

What you’re billed for

Your total cost includes compute time and storage:
| Cost component | Description | Rate |
|---|---|---|
| Compute | GPU time while workers run | See pricing table above |
| Container disk | Worker storage (billed in 5-minute intervals) | ~$0.10/GB/month |
| Network volume | Shared persistent storage | $0.07/GB/month (< 1 TB), $0.05/GB/month (> 1 TB) |
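A quick sketch of the network volume rate tiers. Note one assumption: this treats the pricing as tiered (the lower rate applies only to capacity beyond 1 TB); if Runpod instead applies a single flat rate to the whole volume, the second branch would differ:

```python
def network_volume_monthly_cost(size_gb: float) -> float:
    """Estimated monthly cost of a network volume.

    Assumes tiered pricing: the first 1 TB (1000 GB here) at $0.07/GB/month,
    capacity beyond that at $0.05/GB/month. Whether the lower rate applies
    to the whole volume or only the excess is an assumption, not confirmed
    by the rate table.
    """
    tier_gb = 1000
    if size_gb <= tier_gb:
        return size_gb * 0.07
    return tier_gb * 0.07 + (size_gb - tier_gb) * 0.05

# 500 GB -> $35/month; 2000 GB -> $70 + $50 = $120/month under this assumption.
```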

Compute cost breakdown

Workers incur charges during three phases:
  1. Start time: Initializing the container and loading models into GPU memory. Minimize with FlashBoot or model caching.
  2. Execution time: Processing requests. Set execution timeouts to prevent runaway jobs.
  3. Idle timeout duration: The time a worker remains active (running) after completing a request, waiting for additional requests before scaling down (default: 5 seconds). Configure in endpoint settings.
For high-volume workloads with significant storage needs, use network volumes to share data across workers and reduce per-worker storage costs.

Account limits

Spend limit: Default limit of $80/hour across all resources. Contact support to increase.

Billing support

If you believe you’ve been billed incorrectly, contact support and include the following information in your ticket:
  • Endpoint ID
  • Request ID (if applicable)
  • Approximate time of the issue