
Runpod offers custom pricing plans for large-scale and enterprise workloads. Contact our sales team to learn more.
Serverless offers pay-per-second pricing with no upfront costs. You’re billed from when a worker starts until it fully stops, rounded up to the nearest second.
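The round-up rule can be sketched in a few lines. This is an illustrative calculation only (the function name and example runtime are hypothetical; the per-second rate is the A100 flex rate from the table below):

```python
import math

def request_cost(runtime_seconds: float, rate_per_second: float) -> float:
    """Bill from worker start until full stop, rounded up to the nearest second."""
    billed_seconds = math.ceil(runtime_seconds)
    return billed_seconds * rate_per_second

# A worker that runs for 83.4 seconds on an A100 flex worker ($0.00076/s)
# is billed for 84 seconds: 84 * $0.00076 = ~$0.064.
cost = request_cost(83.4, 0.00076)
```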

Worker types

| | Flex workers | Active workers |
|---|---|---|
| Behavior | Scale to zero when idle | Always running (24/7) |
| Pricing | Standard per-second rate | 20–30% discount |
| Best for | Variable workloads, cost optimization | Consistent traffic, low-latency requirements |

GPU pricing

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
For the latest pricing, visit the Runpod pricing page.
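Because active workers run 24/7 at a discounted rate while flex workers bill only while running, there is a utilization level above which an active worker becomes cheaper. A rough sketch of that break-even point, using the A100 rates from the table above (the helper function is illustrative, not part of any Runpod API):

```python
def breakeven_utilization(flex_rate: float, active_rate: float) -> float:
    """Fraction of wall-clock time a flex worker must be busy before an
    always-on active worker at the discounted rate costs less."""
    return active_rate / flex_rate

# A100: $0.00076/s flex vs $0.00060/s active.
# Above roughly 79% utilization, an active worker is the cheaper option.
util = breakeven_utilization(0.00076, 0.00060)
```

This ignores cold-start and idle-timeout overhead, which push the real break-even point somewhat lower for bursty traffic.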

What you’re billed for

Your total cost includes compute time and storage:
| Cost component | Description | Rate |
|---|---|---|
| Compute | GPU time while workers run | See pricing table above |
| Container disk | Worker storage (billed in 5-minute intervals) | ~$0.10/GB/month |
| Network volume | Shared persistent storage | $0.07/GB/month (< 1 TB), $0.05/GB/month (> 1 TB) |
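A quick sketch of the network volume rate tiers. Note one assumption: this treats the pricing as tiered (the lower rate applies only to capacity beyond 1 TB); if Runpod instead applies a single flat rate to the whole volume, the second branch would differ:

```python
def network_volume_monthly_cost(size_gb: float) -> float:
    """Estimated monthly cost of a network volume.

    Assumes tiered pricing: the first 1 TB (1000 GB here) at $0.07/GB/month,
    capacity beyond that at $0.05/GB/month. Whether the lower rate applies
    to the whole volume or only the excess is an assumption, not confirmed
    by the rate table.
    """
    tier_gb = 1000
    if size_gb <= tier_gb:
        return size_gb * 0.07
    return tier_gb * 0.07 + (size_gb - tier_gb) * 0.05

# 500 GB -> $35/month; 2000 GB -> $70 + $50 = $120/month under this assumption.
```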

Compute cost breakdown

Workers incur charges during three phases:
  1. Start time: Initializing the container and loading models into GPU memory. Minimize with FlashBoot or model caching.
  2. Execution time: Processing requests. Set execution timeouts to prevent runaway jobs.
  3. Idle timeout duration: The time a worker remains active (running) after completing a request, waiting for additional requests before scaling down (default: 5 seconds). Configure in endpoint settings.
For high-volume workloads with significant storage needs, use network volumes to share data across workers and reduce per-worker storage costs.

Account limits

Spend limit: Default limit of $80/hour across all resources. Contact support to increase.

Billing support

If you believe you’ve been billed incorrectly, contact support and include the following information in your ticket:
  • Endpoint ID
  • Request ID (if applicable)
  • Approximate time of the issue