
Claude API Rate Limits (2026): Handling 429s, Backoff, and Queues

// FILED: API Engineering  // DATE: APR 28, 2026  // SLUG: /blog/claude-api-rate-limit-strategies-2026.html

A 429 from the Claude API means you've hit a rate limit. The response includes a retry-after header that tells you exactly when to retry. Most developers ignore this header and implement exponential backoff instead—which is the wrong strategy for Anthropic's rate limit design and can make throughput worse, not better.

This post covers Anthropic's rate limit structure as of April 2026, the correct retry pattern, practical queue implementations for high-volume applications, and the specific failure modes you'll encounter with each model tier.

What Anthropic's rate limits actually are

Anthropic uses three limit types, all operating simultaneously. You can hit any of them independently:

  1. Requests per minute (RPM) — how many API calls you can make in a 60-second window
  2. Tokens per minute (TPM) — total tokens (input + output) across all requests in a 60-second window
  3. Tokens per day (TPD) — cumulative token usage in a 24-hour period

Your limits depend on your usage tier. Anthropic automatically promotes accounts through tiers based on spend history. As of April 2026:

Tier            | Criteria                | Claude Sonnet RPM | Sonnet TPM | Haiku RPM
Build (Tier 1)  | New account, any spend  | 5                 | 25,000     | 50
Scale (Tier 2)  | $100 spend + 7 days     | 50                | 100,000    | 2,000
Growth (Tier 3) | $500 spend + 14 days    | 1,000             | 500,000    | 5,000
Scale (Tier 4)  | $2,000 spend + 14 days  | 2,000             | 1,000,000  | 10,000

Tier 1 limits are severe. 5 requests per minute on Sonnet means you can run one request every 12 seconds on average. For a developer building a batch processing tool, this is the single biggest constraint in the early stages of a project. Haiku's higher RPM on Tier 1 (50 RPM) is why many developers use Haiku for early-stage testing and switch to Sonnet at Tier 2.
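
As a worked example: a 1,000-document batch job at one request per document needs at least 1,000 / 5 = 200 minutes (about 3.3 hours) of wall-clock time at Tier 1 on Sonnet, versus 1,000 / 50 = 20 minutes at Tier 2, before TPM even enters the picture.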

What the 429 response actually contains

When you hit a rate limit, the response looks like this:

HTTP/1.1 429 Too Many Requests
retry-after: 37
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z
anthropic-ratelimit-tokens-limit: 100000
anthropic-ratelimit-tokens-remaining: 48221
anthropic-ratelimit-tokens-reset: 2026-04-28T14:23:00Z

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded: requests"
  }
}

The headers tell you exactly what you need:

retry-after: 37   // seconds until you can retry
anthropic-ratelimit-requests-remaining: 0   // no requests left this window
anthropic-ratelimit-tokens-remaining: 48221   // tokens still available
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z   // exact reset time

In this example you hit the RPM limit (0 requests remaining) but still have token budget (48,221 tokens remaining). Waiting 37 seconds and retrying will succeed. Implementing exponential backoff here would wait longer than necessary and reduce throughput for no benefit.
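
A minimal sketch of that decision logic, assuming the headers have been copied into a plain dict (the SDK exposes them on the error's response object, as shown in the next section):

def classify_429(headers: dict) -> tuple[str, float]:
    """Return (which limit was hit, seconds to wait) from 429 response headers."""
    wait = float(headers.get("retry-after", 60))
    if headers.get("anthropic-ratelimit-requests-remaining") == "0":
        return "rpm", wait      # request budget exhausted; token budget may remain
    if headers.get("anthropic-ratelimit-tokens-remaining") == "0":
        return "tpm", wait      # token budget exhausted this window
    return "unknown", wait      # e.g. daily limit; trust retry-after either way

With the example headers above, classify_429 returns ("rpm", 37.0): wait 37 seconds, then retry.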

The correct retry pattern: header-first, not exponential

For Claude API rate limits specifically, the correct retry strategy is:

  1. Check the retry-after header on a 429 response
  2. Wait exactly that many seconds (add 1 second for clock jitter)
  3. Retry the request
  4. If still 429 (unusual), apply exponential backoff starting from the retry-after base

Here's a Python implementation:

import anthropic
import time
import random

client = anthropic.Anthropic()

def make_request_with_retry(prompt: str, max_retries: int = 5):
    retries = 0
    backoff_base = 1.0

    while retries < max_retries:
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response

        except anthropic.RateLimitError as e:
            retries += 1
            if retries >= max_retries:
                raise

            # Read the retry-after header if available
            retry_after = None
            if hasattr(e, 'response') and e.response is not None:
                retry_after_str = e.response.headers.get('retry-after')
                if retry_after_str:
                    retry_after = float(retry_after_str)

            if retry_after is not None:
                # Use the header value + small jitter
                wait = retry_after + random.uniform(0, 1)
            else:
                # Fall back to exponential backoff
                wait = backoff_base * (2 ** (retries - 1)) + random.uniform(0, 1)

            print(f"Rate limited. Waiting {wait:.1f}s before retry {retries}/{max_retries}")
            time.sleep(wait)

        except anthropic.APIStatusError as e:
            # Non-rate-limit errors: don't retry
            raise

For TypeScript/Node.js applications:

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function makeRequestWithRetry(
  prompt: string,
  maxRetries = 5
): Promise<Anthropic.Message> {
  let retries = 0;

  while (retries < maxRetries) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-5',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        retries++;
        if (retries >= maxRetries) throw err;

        const retryAfter = err.headers?.['retry-after'];
        const waitMs = retryAfter
          ? (parseFloat(retryAfter) + Math.random()) * 1000
          : Math.pow(2, retries) * 1000 + Math.random() * 1000;

        console.log(`Rate limited. Waiting ${(waitMs/1000).toFixed(1)}s`);
        await new Promise(resolve => setTimeout(resolve, waitMs));
      } else {
        throw err;
      }
    }
  }
  throw new Error('Max retries exceeded');
}

Queue patterns for batch workloads

Retry logic handles individual request failures. For applications that need to process hundreds or thousands of requests (document processing, code review pipelines, batch analysis), you need a queue pattern with rate-awareness built in at dispatch time.

Pattern 1: Token bucket queue (simple)

Track your own request count and sleep until the window resets if you're approaching the limit:

import time
from collections import deque

class RateLimitedClient:
    def __init__(self, rpm_limit: int = 50, tpm_limit: int = 100_000):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.request_times = deque()  # timestamps of recent requests
        self.token_usage = deque()    # (timestamp, token_count) pairs

    def _clean_window(self):
        now = time.time()
        cutoff = now - 60
        while self.request_times and self.request_times[0] < cutoff:
            self.request_times.popleft()
        while self.token_usage and self.token_usage[0][0] < cutoff:
            self.token_usage.popleft()

    def _wait_if_needed(self, estimated_tokens: int):
        self._clean_window()

        # Wait for RPM headroom
        if len(self.request_times) >= self.rpm_limit:
            oldest = self.request_times[0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()

        # Wait for TPM headroom
        current_tokens = sum(t for _, t in self.token_usage)
        if current_tokens + estimated_tokens > self.tpm_limit:
            oldest = self.token_usage[0][0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()

    def request(self, prompt: str, estimated_tokens: int = 1000):
        self._wait_if_needed(estimated_tokens)
        now = time.time()
        self.request_times.append(now)

        response = make_request_with_retry(prompt)

        # Record actual usage
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.token_usage.append((now, actual_tokens))
        return response
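
A short usage sketch, assuming Tier 2 limits and the make_request_with_retry function defined earlier; the prompts are placeholders:

# hypothetical workload: 500 sequential summarization prompts
limiter = RateLimitedClient(rpm_limit=50, tpm_limit=100_000)
prompts = [f"Summarize document {i}" for i in range(500)]

for prompt in prompts:
    response = limiter.request(prompt, estimated_tokens=1500)
    print(response.content[0].text[:80])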

Pattern 2: Anthropic Batch API (for throughput over latency)

For workloads where you don't need immediate responses, the Anthropic Batch API is the correct tool. It accepts up to 10,000 requests per batch, processes them asynchronously over up to 24 hours, and charges 50% less per token than the real-time API. Rate limits are much more generous for batch requests.

import anthropic

client = anthropic.Anthropic()

# Create a batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)

print(f"Batch created: {batch.id}")
# Poll batch.results_url or subscribe to webhook for completion

Batch API is the right answer for: document analysis pipelines, nightly summarization jobs, large-scale code review automation, and any use case where results are needed within 24 hours rather than within seconds.
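
Collecting results is a two-step flow: poll the batch until processing ends, then stream the per-request results. A minimal sketch using the Python SDK's batch methods (verify the method names against your installed SDK version):

import time

# Poll until the batch leaves the in_progress state
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)  # batches can take minutes to hours; poll gently

# Stream per-request results, matched back via the custom_id set at creation
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text[:80])
    else:
        print(entry.custom_id, "did not succeed:", entry.result.type)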

Per-model rate limit behavior

Rate limits are enforced per model family, which regularly catches developers by surprise: exhausting Sonnet's RPM or TPM doesn't touch Haiku's budget, so a pipeline that routes its cheap triage work to Haiku keeps moving even while Sonnet is throttled.

Caching as a rate limit strategy

Prompt caching deserves its own section because it's underused. When you mark the top of your prompt (system prompt, large context document) as cacheable, Anthropic stores the KV representation of those tokens for 5 minutes. Subsequent requests that hit the cache are billed at roughly 10% of the base input price for the cached tokens, which also cuts how quickly you burn through your TPM budget.

For a 200-page document you're analyzing with multiple prompts, caching the document means only the first request costs full input tokens. Requests 2 through N pay 10% for the document. On 50 requests at 200,000 input tokens each, that's the difference between 10M tokens and roughly 1.2M tokens of TPM consumption (one full-price request, then 49 at a tenth of the document's size).

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document_text,
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=[{"role": "user", "content": specific_question}]
)
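
Whether the cache actually helped shows up in the usage block of each response; these are the same fields behind the cache hit rate metric in the next section. A quick check on the request above:

u = response.usage
print("uncached input tokens:", u.input_tokens)
print("cache writes:", u.cache_creation_input_tokens)   # the first request pays these
print("cache reads:", u.cache_read_input_tokens)        # later requests within the 5-minute window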

What to monitor in production

Four metrics worth tracking per deployment:

  1. 429 rate — percentage of requests that result in a rate limit error. Above 5% is a sign your queue isn't managing the limit correctly.
  2. TPM utilization — ratio of tokens used to tokens limit. Read anthropic-ratelimit-tokens-remaining from response headers and track it over time (see the sketch after this list).
  3. Retry latency — p50/p95 of the wait time from initial 429 to successful retry. This tells you whether your backoff is calibrated to the actual retry-after values you're seeing.
  4. Cache hit rate — if you're using prompt caching, usage.cache_read_input_tokens / total input tokens. A hit rate below 60% on a system-prompt-heavy application suggests the cache is expiring too frequently.
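
You don't need a 429 to read the rate limit headers. A minimal sketch using the Python SDK's raw-response wrapper (assuming a reasonably recent SDK version):

raw = client.messages.with_raw_response.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "ping"}],
)

remaining = int(raw.headers.get("anthropic-ratelimit-tokens-remaining"))
limit = int(raw.headers.get("anthropic-ratelimit-tokens-limit"))
print(f"TPM utilization this window: {1 - remaining / limit:.0%}")

message = raw.parse()  # the usual Message object, if you also need the content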

Septim Vault: API key and credential management for Claude workflows

If you're building multi-environment Claude API integrations and juggling API keys across projects, Septim Vault is a key-management toolkit for Claude Code workflows. It keeps credentials out of your codebase, out of your shell history, and out of version control. Pay once.

Get Septim Vault — $29 →

Requesting a rate limit increase

If you've hit Tier 4 limits and still need more capacity, Anthropic has an enterprise rate limit request form in the console. In practice, the Batch API at 50% cost and 10,000 requests per batch handles most high-volume use cases without needing an increase. Rate limit increases take 5–10 business days to process and aren't guaranteed.

For applications that need real-time throughput above Tier 4 limits, the correct answer is usually architectural: distribute requests across multiple API keys (separate Anthropic accounts), implement model cascading (Haiku for filtering, Sonnet for generation), or use the Batch API for the non-time-sensitive portion of the workload.
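
A minimal sketch of the cascading idea; the model IDs, prompts, and YES/NO triage rule are illustrative, not a recommendation of specific model names:

def cascade(client, items: list[str]) -> list:
    """Cheap Haiku triage pass, then Sonnet only for the items that survive."""
    results = []
    for item in items:
        triage = client.messages.create(
            model="claude-haiku-4-5",   # placeholder fast-tier model ID
            max_tokens=5,
            messages=[{"role": "user", "content": f"Answer YES or NO: is this worth a detailed review?\n\n{item}"}],
        )
        if "YES" not in triage.content[0].text.upper():
            continue  # rejected cheaply: no Sonnet RPM or TPM spent
        results.append(client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            messages=[{"role": "user", "content": f"Write a detailed review of:\n\n{item}"}],
        ))
    return results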