Claude API Rate Limits: Handle 429 Errors Cleanly
How Claude API rate limits work, why 429s happen, and a retry pattern that handles Retry-After correctly. Same pattern works across all 5 vendors.
What Rate Limits Actually Are
Rate limits are guardrails. Every API has them — Claude, GPT, Gemini, Grok, DeepSeek, all of them. They cap how many requests or tokens you can push through in a given window. Hit the cap and you get a 429 Too Many Requests response.
It's not a bug. It's the API saying "slow down a sec." The job of your code is to handle it gracefully and try again at a sane pace.
This post covers how Claude rate limits work in practice, the cleanest retry pattern, and why the same logic works for all 5 vendors when you run through aiapi.cheap.
For the official spec on what Anthropic measures and how, their rate limits doc is the source.
How Rate Limits Work in Practice
The limits are usually two-part:

- Requests per minute (RPM): how many calls you can make in the window
- Tokens per minute (TPM): how much text (input plus output) those calls can carry

You can hit either one. Sending 10 huge requests can trigger the token cap before the request cap, even when your request count looks tiny.
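Here's a quick back-of-the-envelope in Python to show the interaction. The caps below are made-up numbers purely for illustration, since exact figures vary by plan:

rpm_cap = 60          # hypothetical request cap per minute
tpm_cap = 100_000     # hypothetical token cap per minute

tokens_per_request = 15_000
requests = 10

total_tokens = requests * tokens_per_request   # 150,000
print(requests > rpm_cap)       # False: far under the request cap
print(total_tokens > tpm_cap)   # True: the token cap trips first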
Limits scale with plan tier. Higher tier = more headroom for sustained workloads. The exact numbers vary by provider and we don't surface specific RPM/TPM numbers in our UI — just know that fair-use limits scale with your plan, and Pro has more headroom than Basic.
When you exceed a limit, the response includes:

- A 429 status
- A Retry-After header (sometimes) telling you how long to wait, in seconds

Your code should read both signals and back off accordingly.
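To make that concrete, here's what reading both signals looks like with a raw httpx call. The exact path and required headers depend on the proxy setup, so treat this as a sketch rather than a spec:

import httpx

resp = httpx.post(
    "https://aiapi.cheap/api/proxy/v1/messages",  # path the Anthropic SDK would hit; illustrative
    headers={"x-api-key": "sk-aic-your-key-here"},
    json={
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": "hello"}],
    },
)

if resp.status_code == 429:
    # Retry-After is optional; fall back to your own backoff when it's absent
    retry_after = resp.headers.get("retry-after")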
Why You Hit 429 (Even When You Think You Shouldn't)
A few common patterns that cause it:

- Batch jobs that fire every request at once instead of pacing them
- Retry loops without backoff, which turn one 429 into a storm of them
- Several services (frontend, workers, cron) sharing a single key, and its single limit
- Large prompts that burn through the token cap long before the request cap

The fix in every case is the same: don't fire as fast as the network can carry. Pace yourself.
A Clean Retry Pattern
The pattern that handles 429s well does three things:
1. Reads Retry-After if the API provides it
2. Uses exponential backoff with jitter as a fallback
3. Has a max attempt count so it doesn't retry forever
Here's a working version using the Anthropic SDK on aiapi.cheap:
import time
import random

import anthropic

client = anthropic.Anthropic(
    api_key="sk-aic-your-key-here",
    base_url="https://aiapi.cheap/api/proxy",
)

def call_with_retry(messages, model="claude-sonnet-4-6", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.messages.create(
                model=model,
                max_tokens=1024,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            # Out of attempts: surface the error instead of retrying forever
            if attempt == max_attempts - 1:
                raise
            # Honor the Retry-After header if the API sent one
            retry_after = e.response.headers.get("retry-after")
            if retry_after is not None:
                wait = float(retry_after)
            else:
                # Exponential backoff with jitter as the fallback
                wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"429 hit. Attempt {attempt + 1}/{max_attempts}, waiting {wait:.1f}s")
            time.sleep(wait)

Why each piece matters:
- Retry-After is the server telling you exactly how long to wait; when it's present, trust it over your own math.
- Exponential backoff (2 ** attempt) doubles the wait on each failure, giving the API room to recover.
- The + random.uniform(0, 1) jitter is critical when you have many clients. Without it, all your workers retry at exactly the same instant and re-hammer the API. With jitter, retries spread out.
- The max attempt count means a persistent failure surfaces as an error instead of an infinite loop.

Same Pattern, All 5 Vendors
The nice thing about running through aiapi.cheap is that this exact retry logic works for every vendor. The proxy returns standard 429 status codes regardless of whether the model is Claude, GPT, Gemini, Grok, or DeepSeek.
If you use the OpenAI SDK to talk to our universal endpoint:
from openai import OpenAI, RateLimitError
import time
import random

client = OpenAI(
    api_key="sk-aic-your-key-here",
    base_url="https://aiapi.cheap/api/proxy",
)

def call_with_retry(messages, model="gpt-4o", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model=model,
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Same backoff logic as the Anthropic version
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)

Same structure. Different SDK, different model name. The retry logic doesn't change.
This matters because it means you can A/B test models — try Claude for one batch, GPT for another, Gemini for the next — without rewriting your error handling for each one.
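A minimal usage sketch of that idea, assuming the universal endpoint accepts each vendor's model ID as described above (the Gemini ID below is a placeholder; check the docs for exact names):

messages = [{"role": "user", "content": "Summarize this record."}]

# Same wrapper, same error handling, different model per batch
for model in ["claude-sonnet-4-6", "gpt-4o", "gemini-model-id-here"]:  # last ID is a placeholder
    response = call_with_retry(messages, model=model)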
Don't Just Retry. Slow Down First.
Retries are a band-aid. The real fix for chronic 429s is to stop firing requests faster than the API can accept them.
A few patterns that prevent 429s:
Pace your batch jobs
If you're processing 10,000 records, don't spawn 10,000 parallel calls. Use a worker pool with a sane concurrency limit:
from concurrent.futures import ThreadPoolExecutor

# call_api is your per-record wrapper (e.g., call_with_retry on a single record)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_api, records))

Start with a small concurrency (5-10) and turn it up only if you're not hitting limits.
Add a token bucket
For sustained loads, a simple rate limiter on the client side keeps you under the cap. There are libraries for this in every language; you don't need to write your own.
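If you want to see the mechanics anyway, here's a minimal sketch of the idea. The TokenBucket class and the numbers are illustrative, not from any particular library:

import time
import threading

class TokenBucket:
    """Client-side limiter: allows `rate` calls per second, bursting up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                shortfall = (1 - self.tokens) / self.rate
            time.sleep(shortfall)  # wait outside the lock, then re-check

# Roughly 5 requests per second across all threads sharing this bucket
bucket = TokenBucket(rate=5, capacity=5)

def call_api(record):
    bucket.acquire()  # blocks until a token is available
    ...  # make the actual request here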
Don't share keys across services
If your frontend, your worker, and your cron all use the same key, they share the same rate limit. Give each service its own key (you can have many keys per account on aiapi.cheap) so they don't fight each other.
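A low-tech way to enforce that is one environment variable per service. The variable names here are made up for illustration:

import os

import anthropic

# The worker deployment sets only its own key; frontend and cron get their own
client = anthropic.Anthropic(
    api_key=os.environ["AIAPI_KEY_WORKER"],  # e.g., AIAPI_KEY_FRONTEND, AIAPI_KEY_CRON elsewhere
    base_url="https://aiapi.cheap/api/proxy",
)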
When To Upgrade Plan Instead of Retrying
If you're consistently hitting limits even with proper backoff and pacing, the answer isn't more retry logic — it's more headroom. Pro tier on aiapi.cheap has a higher fair-use ceiling than Basic, and at $19 lifetime it pays for itself the first time it saves you from a Saturday-morning incident.
Decision rule: if 429s happen during normal traffic (not just bursts), upgrade. If they only happen during spikes, fix the spikes with backoff and pacing.
Quick Recap
- Honor Retry-After if the server sends it; otherwise use exponential backoff with jitter
- Cap retries with a max attempt count so real failures surface
- Pace batch jobs, rate-limit sustained loads, and give each service its own key
- If 429s show up during normal traffic, not just spikes, upgrade your plan

For the full list of error codes (not just 429), see the error codes post. Setup details and supported models are in the docs.
Handle 429 well and your app runs smoothly. Skip it and your users see errors at the worst times. The pattern above is short and it works — drop it in once and forget about rate limits.