Claude API Rate Limits: Handle 429 Errors Cleanly

How Claude API rate limits work, why 429s happen, and a retry pattern that handles Retry-After correctly. Same pattern works across all 5 vendors.

What Rate Limits Actually Are

Rate limits are guardrails. Every API has them — Claude, GPT, Gemini, Grok, DeepSeek, all of them. They cap how many requests or tokens you can push through in a given window. Hit the cap and you get a 429 Too Many Requests response.

It's not a bug. It's the API saying "slow down a sec." The job of your code is to handle it gracefully and try again at a sane pace.

This post covers how Claude rate limits work in practice, the cleanest retry pattern, and why the same logic works for all 5 vendors when you run through aiapi.cheap.

For the official spec on what Anthropic measures and how, their rate limits doc is the source.

How Rate Limits Work in Practice

The limits are usually two-part:

  • Requests per minute — how many calls you can make
  • Tokens per minute — how much data flows through (input + output)

You can hit either one. Sending 10 huge requests can trip the token cap long before the request cap, even when your request count looks safely under the limit.

Limits scale with plan tier. Higher tier = more headroom for sustained workloads. The exact numbers vary by provider and we don't surface specific RPM/TPM numbers in our UI — just know that fair-use limits scale with your plan, and Pro has more headroom than Basic.

When you exceed a limit, the response includes:

  • HTTP 429 status
  • Sometimes a `Retry-After` header telling you how many seconds to wait
  • An error body with details

Your code should read both signals and back off accordingly. The sketch below shows what those signals look like on the wire.
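
If you want to poke at those signals without an SDK, a raw HTTP call makes them visible. Here's a minimal sketch using the `requests` library; the exact proxy path and auth header are assumptions about how the SDK composes its requests, so treat both as illustrative:

    import requests

    resp = requests.post(
        "https://aiapi.cheap/api/proxy/v1/messages",  # assumed path (the SDK builds it from base_url)
        headers={"x-api-key": "sk-aic-your-key-here"},  # assumed auth header
        json={
            "model": "claude-sonnet-4-6",
            "max_tokens": 64,
            "messages": [{"role": "user", "content": "ping"}],
        },
    )

    if resp.status_code == 429:
        # Retry-After is optional; headers.get() returns None when it's absent
        print("Retry-After:", resp.headers.get("retry-after"))
        print("Error body:", resp.text)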

Why You Hit 429 (Even When You Think You Shouldn't)

A few common patterns that cause it:

  • No throttle on a parallel script. You spin up 50 workers, they all fire at once, half get 429.
  • Sharing a key across services. Frontend, backend cron, batch job — all using the same key, all hitting limits in the same minute.
  • Big batch jobs. Iterating over a thousand records and calling the API for each one with no spacing.
  • Bursty webhooks. Your service receives a flood of webhook events and tries to process them all immediately.

The fix in every case is the same: don't fire requests as fast as the network can carry them. Pace yourself, even if it's just a fixed sleep between calls (see the sketch below).
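
A minimal pacing sketch (`records` and `process_record` are hypothetical stand-ins for your own data and API call):

    import time

    for record in records:
        process_record(record)  # one API call per record
        time.sleep(2.0)         # ~30 requests/minute; tune to your tier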

A Clean Retry Pattern

The pattern that handles 429s well does three things:

1. Reads `Retry-After` if the API provides it

2. Uses exponential backoff with jitter as a fallback

3. Has a max attempt count so it doesn't retry forever

Here's a working version using the Anthropic SDK on aiapi.cheap:

    import time
    import random
    import anthropic
    
    client = anthropic.Anthropic(
        api_key="sk-aic-your-key-here",
        base_url="https://aiapi.cheap/api/proxy",
    )
    
    def call_with_retry(messages, model="claude-sonnet-4-6", max_attempts=5):
        for attempt in range(max_attempts):
            try:
                return client.messages.create(
                    model=model,
                    max_tokens=1024,
                    messages=messages,
                )
            except anthropic.RateLimitError as e:
                if attempt == max_attempts - 1:
                    raise
    
                # Honor the Retry-After header (seconds) if the server sent one
                retry_after = e.response.headers.get("retry-after")
                if retry_after is not None:
                    wait = float(retry_after)
                else:
                    # Exponential backoff with jitter
                    wait = (2 ** attempt) + random.uniform(0, 1)
    
                print(f"429 hit. Attempt {attempt + 1}/{max_attempts}, waiting {wait:.1f}s")
                time.sleep(wait)
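
Usage is a drop-in replacement for a bare `client.messages.create` call; the retries are invisible to the caller:

    response = call_with_retry([{"role": "user", "content": "Summarize this ticket."}])
    print(response.content[0].text)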

Why each piece matters:

  • `Retry-After` first. The server told you exactly how long to wait. Use that.
  • Exponential backoff. The wait doubles each attempt — 1s, 2s, 4s, 8s — so a persistent problem gets progressively more breathing room.
  • Jitter. That `+ random.uniform(0, 1)` is critical when you have many clients. Without it, all your workers retry at exactly the same instant and re-hammer the API. With jitter, retries spread out.
  • Max attempts. Don't retry forever. After 5 tries, surface the error to the caller.

Same Pattern, All 5 Vendors

The nice thing about running through aiapi.cheap is that this exact retry logic works for every vendor. The proxy returns standard 429 status codes regardless of whether the model is Claude, GPT, Gemini, Grok, or DeepSeek.

If you use the OpenAI SDK to talk to our universal endpoint:

    from openai import OpenAI, RateLimitError
    import time, random
    
    client = OpenAI(
        api_key="sk-aic-your-key-here",
        base_url="https://aiapi.cheap/api/proxy",
    )
    
    def call_with_retry(messages, model="gpt-4o", max_attempts=5):
        for attempt in range(max_attempts):
            try:
                return client.chat.completions.create(
                    model=model,
                    messages=messages,
                )
            except RateLimitError as e:
                if attempt == max_attempts - 1:
                    raise
                # Same logic: honor Retry-After if present, else backoff with jitter
                retry_after = e.response.headers.get("retry-after")
                wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)

Same structure. Different SDK, different model name. The retry logic doesn't change.

This matters because it means you can A/B test models — try Claude for one batch, GPT for another, Gemini for the next — without rewriting your error handling for each one.
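
As a quick sketch, here's the same wrapper routing one prompt through two of the model IDs used in this post (assuming the universal endpoint accepts both through the chat-completions shape):

    prompt = [{"role": "user", "content": "Classify this support ticket."}]

    for model in ("claude-sonnet-4-6", "gpt-4o"):
        resp = call_with_retry(prompt, model=model)  # identical error handling for both
        print(model, "->", resp.choices[0].message.content)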

Don't Just Retry. Slow Down First.

Retries are a band-aid. The real fix for chronic 429s is to stop firing requests faster than the API can accept them.

A few patterns that prevent 429s:

Pace your batch jobs

If you're processing 10,000 records, don't spawn 10,000 parallel calls. Use a worker pool with a sane concurrency limit:

    from concurrent.futures import ThreadPoolExecutor

    # Cap in-flight requests at 10, no matter how many records there are
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(call_api, records))

Start with a small concurrency (5-10) and turn it up only if you're not hitting limits.

Add a token bucket

For sustained loads, a simple rate limiter on the client side keeps you under the cap. There are libraries for this in every language; you don't need to write your own.
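
If you're curious what those libraries do under the hood, here's a minimal single-threaded sketch of the idea (real libraries add thread safety and burst configuration; this is illustrative, not production code):

    import time

    class TokenBucket:
        """Allow up to `rate` calls per second, with bursts up to `capacity`."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate
            self.capacity = capacity
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            # Refill based on elapsed time, then spend one token (waiting if empty)
            while True:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                time.sleep((1 - self.tokens) / self.rate)

    bucket = TokenBucket(rate=0.5, capacity=5)  # ~30 requests/minute, bursts of 5
    # call bucket.acquire() before every API request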

Don't share keys across services

If your frontend, your worker, and your cron all use the same key, they share the same rate limit. Give each service its own key (you can have many keys per account on aiapi.cheap) so they don't fight each other.
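
One way to keep keys separated is an environment variable per service; the variable name below is hypothetical:

    import os
    import anthropic

    # Each service reads its own key, e.g. AIC_KEY_WORKER vs AIC_KEY_CRON
    client = anthropic.Anthropic(
        api_key=os.environ["AIC_KEY_WORKER"],  # hypothetical env var name
        base_url="https://aiapi.cheap/api/proxy",
    )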

When To Upgrade Plan Instead of Retrying

If you're consistently hitting limits even with proper backoff and pacing, the answer isn't more retry logic — it's more headroom. Pro tier on aiapi.cheap has a higher fair-use ceiling than Basic, and at $19 lifetime it pays for itself the first time it saves you from a Saturday-morning incident.

Decision rule: if 429s happen during normal traffic (not just bursts), upgrade. If they only happen during spikes, fix the spikes with backoff and pacing.

Quick Recap

  • Rate limits are guardrails; 429 is the signal you've hit them
  • Read Retry-After if the server sends it; otherwise use exponential backoff with jitter
  • Cap your retry attempts; don't loop forever
  • Pace batch jobs with a worker pool; don't fire all parallel
  • Same retry pattern works for all 5 vendors through aiapi.cheap
  • Upgrade plan tier if limits hit during normal traffic

For the full list of error codes (not just 429), see the error codes post. Setup details and supported models are in the docs.

Handle 429s well and your app runs smoothly. Skip the handling and your users see errors at the worst times. The pattern above is short and it works — drop it in once and forget about rate limits.