# Claude API Rate Limits (2026): Handling 429s, Backoff, and Queues
A 429 from the Claude API means you've hit a rate limit. The response includes a retry-after header that tells you exactly when to retry. Most developers ignore this header and implement exponential backoff instead—which is the wrong strategy for Anthropic's rate limit design and can make throughput worse, not better.
This post covers Anthropic's rate limit structure as of April 2026, the correct retry pattern, practical queue implementations for high-volume applications, and the specific failure modes you'll encounter with each model tier.
## What Anthropic's rate limits actually are
Anthropic uses three limit types, all operating simultaneously. You can hit any of them independently:
- Requests per minute (RPM) — how many API calls you can make in a 60-second window
- Tokens per minute (TPM) — total tokens (input + output) across all requests in a 60-second window
- Tokens per day (TPD) — cumulative token usage in a 24-hour period
Your limits depend on your usage tier. Anthropic automatically promotes accounts through tiers based on spend history. As of April 2026:
| Tier | Criteria | Claude Sonnet RPM | Sonnet TPM | Haiku RPM |
|---|---|---|---|---|
| Build (Tier 1) | New account, any spend | 5 | 25,000 | 50 |
| Scale (Tier 2) | $100 spend + 7 days | 50 | 100,000 | 2,000 |
| Growth (Tier 3) | $500 spend + 14 days | 1,000 | 500,000 | 5,000 |
| Enterprise (Tier 4) | $2,000 spend + 14 days | 2,000 | 1,000,000 | 10,000 |
Tier 1 limits are severe. 5 requests per minute on Sonnet means you can run one request every 12 seconds on average. For a developer building a batch processing tool, this is the single biggest constraint in the early stages of a project. Haiku's higher RPM on Tier 1 (50 RPM) is why many developers use Haiku for early-stage testing and switch to Sonnet at Tier 2.
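Those RPM figures translate directly into a minimum spacing between dispatches. A quick sketch of the arithmetic, using the tier numbers from the table above:

```python
def min_dispatch_interval(rpm: int) -> float:
    """Seconds you must wait between requests to stay under an RPM limit."""
    return 60.0 / rpm

# Tier 1 Sonnet: 5 RPM -> one request every 12 seconds
print(min_dispatch_interval(5))   # 12.0
# Tier 1 Haiku: 50 RPM -> one request every 1.2 seconds
print(min_dispatch_interval(50))  # 1.2
```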
## What the 429 response actually contains
When you hit a rate limit, the response looks like this:
```http
HTTP/1.1 429 Too Many Requests
retry-after: 37
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z
anthropic-ratelimit-tokens-limit: 100000
anthropic-ratelimit-tokens-remaining: 48221
anthropic-ratelimit-tokens-reset: 2026-04-28T14:23:00Z

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded: requests"
  }
}
```
The headers tell you exactly what you need:
```
anthropic-ratelimit-requests-remaining: 0                  // no requests left this window
anthropic-ratelimit-tokens-remaining: 48221                // tokens still available
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z   // exact reset time
```
In this example you hit the RPM limit (0 requests remaining) but still have token budget (48,221 tokens remaining). Waiting 37 seconds and retrying will succeed. Implementing exponential backoff here would wait longer than necessary and reduce throughput for no benefit.
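The decision logic can be isolated in a small helper, a sketch assuming the header names shown above, that picks the wait time from a 429's headers and falls back to exponential backoff only when `retry-after` is missing:

```python
def wait_from_headers(headers: dict, attempt: int, base: float = 1.0) -> float:
    """Prefer the server's retry-after value; fall back to exponential backoff."""
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return float(retry_after) + 1.0  # +1s margin for clock jitter
    return base * (2 ** (attempt - 1))

print(wait_from_headers({"retry-after": "37"}, attempt=1))  # 38.0
print(wait_from_headers({}, attempt=3))                     # 4.0
```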
## The correct retry pattern: header-first, not exponential
For Claude API rate limits specifically, the correct retry strategy is:
- Check the `retry-after` header on the 429 response
- Wait exactly that many seconds (add one second for clock jitter)
- Retry the request
- If you still get a 429 (unusual), apply exponential backoff starting from the `retry-after` base
Here's a Python implementation:
```python
import anthropic
import time
import random

client = anthropic.Anthropic()

def make_request_with_retry(prompt: str, max_retries: int = 5):
    retries = 0
    backoff_base = 1.0
    while retries < max_retries:
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response
        except anthropic.RateLimitError as e:
            retries += 1
            if retries >= max_retries:
                raise
            # Read the retry-after header if available
            retry_after = None
            if hasattr(e, 'response') and e.response is not None:
                retry_after_str = e.response.headers.get('retry-after')
                if retry_after_str:
                    retry_after = float(retry_after_str)
            if retry_after is not None:
                # Use the header value + small jitter
                wait = retry_after + random.uniform(0, 1)
            else:
                # Fall back to exponential backoff
                wait = backoff_base * (2 ** (retries - 1)) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s before retry {retries}/{max_retries}")
            time.sleep(wait)
        except anthropic.APIStatusError:
            # Non-rate-limit errors: don't retry
            raise
```
For TypeScript/Node.js applications:
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function makeRequestWithRetry(
  prompt: string,
  maxRetries = 5
): Promise<Anthropic.Message> {
  let retries = 0;
  while (retries < maxRetries) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-5',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        retries++;
        if (retries >= maxRetries) throw err;
        // Depending on SDK version, headers may be a Headers object (use .get)
        const retryAfter = err.headers?.['retry-after'];
        const waitMs = retryAfter
          ? (parseFloat(retryAfter) + Math.random()) * 1000
          : Math.pow(2, retries) * 1000 + Math.random() * 1000;
        console.log(`Rate limited. Waiting ${(waitMs / 1000).toFixed(1)}s`);
        await new Promise(resolve => setTimeout(resolve, waitMs));
      } else {
        throw err;
      }
    }
  }
  throw new Error('Max retries exceeded');
}
```
## Queue patterns for batch workloads
Retry logic handles individual request failures. For applications that need to process hundreds or thousands of requests (document processing, code review pipelines, batch analysis), you need a queue pattern with rate-awareness built in at dispatch time.
### Pattern 1: Sliding-window rate tracking (simple)
Track your own request count and sleep until the window resets if you're approaching the limit:
```python
import time
from collections import deque

class RateLimitedClient:
    def __init__(self, rpm_limit: int = 50, tpm_limit: int = 100_000):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.request_times = deque()  # timestamps of recent requests
        self.token_usage = deque()    # (timestamp, token_count) pairs

    def _clean_window(self):
        now = time.time()
        cutoff = now - 60
        while self.request_times and self.request_times[0] < cutoff:
            self.request_times.popleft()
        while self.token_usage and self.token_usage[0][0] < cutoff:
            self.token_usage.popleft()

    def _wait_if_needed(self, estimated_tokens: int):
        self._clean_window()
        # Wait for RPM headroom
        if len(self.request_times) >= self.rpm_limit:
            oldest = self.request_times[0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()
        # Wait for TPM headroom
        current_tokens = sum(t for _, t in self.token_usage)
        if current_tokens + estimated_tokens > self.tpm_limit:
            oldest = self.token_usage[0][0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()

    def request(self, client, prompt: str, estimated_tokens: int = 1000):
        self._wait_if_needed(estimated_tokens)
        now = time.time()
        self.request_times.append(now)
        # Reuses the retry helper defined earlier
        response = make_request_with_retry(prompt)
        # Record actual usage
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.token_usage.append((now, actual_tokens))
        return response
```
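The `estimated_tokens` argument needs a pre-flight guess before the API reports actual usage. A crude heuristic is enough for headroom checks, since the class corrects itself with real usage afterwards. The ~4-characters-per-token figure is a commonly cited ballpark for English text, not an exact tokenizer count:

```python
def estimate_tokens(prompt: str, expected_output: int = 1024) -> int:
    """Crude pre-flight estimate: ~4 chars/token for input, plus expected output."""
    return max(1, len(prompt) // 4) + expected_output

print(estimate_tokens("x" * 4000))  # 2024 (about 1000 input + 1024 output)
```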
### Pattern 2: Anthropic Batch API (for throughput over latency)
For workloads where you don't need immediate responses, the Anthropic Batch API is the correct tool. It accepts up to 10,000 requests per batch, processes them asynchronously over up to 24 hours, and charges 50% less per token than the real-time API. Rate limits are much more generous for batch requests.
```python
import anthropic

client = anthropic.Anthropic()

# Create a batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)

print(f"Batch created: {batch.id}")
# Poll batch.results_url or subscribe to a webhook for completion
```
Batch API is the right answer for: document analysis pipelines, nightly summarization jobs, large-scale code review automation, and any use case where results are needed within 24 hours rather than within seconds.
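Since a single batch tops out at 10,000 requests, larger jobs need to be split. A minimal chunking sketch, with the request shape mirroring the example above:

```python
def build_batches(prompts, model="claude-sonnet-4-5",
                  max_tokens=1024, batch_size=10_000):
    """Split a prompt list into Batch-API-sized lists of request dicts."""
    batches = []
    for start in range(0, len(prompts), batch_size):
        batches.append([
            {
                "custom_id": f"request-{start + i}",
                "params": {
                    "model": model,
                    "max_tokens": max_tokens,
                    "messages": [{"role": "user", "content": p}],
                },
            }
            for i, p in enumerate(prompts[start:start + batch_size])
        ])
    return batches

batches = build_batches([f"prompt {n}" for n in range(25_000)])
print([len(b) for b in batches])  # [10000, 10000, 5000]
```

Each sub-list can then be submitted as its own `client.messages.batches.create(requests=...)` call; the `custom_id` values stay globally unique so results can be stitched back together.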
## Per-model rate limit behavior
Rate limits are enforced per model family. A few things that catch developers by surprise:
- Haiku has dramatically higher RPM at every tier. If you're bottlenecked on RPM rather than token quality, using Haiku for classification/routing and Sonnet only for generation is a common throughput pattern.
- TPM limits count input tokens on streaming requests at send time, not at completion. If you're streaming a long-context request, the input tokens are deducted from your TPM budget immediately, not after the stream finishes.
- Opus has separate (lower) limits. If your application uses `claude-opus-4`, check the tier table for Opus specifically; it's not the same as Sonnet's limits.
- Cache reads don't count against TPM. Prompt caching with `cache_control: {"type": "ephemeral"}` means the cached portion of a prompt doesn't consume TPM when it's a cache hit. For large-context applications, this is the single most important rate-limit optimization.
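The Haiku-for-routing pattern in the first bullet can be sketched with injected callables. The `classify` and generator functions here are hypothetical stand-ins (in practice, `classify` would be a Haiku call and `generate_strong` a Sonnet call); the routing logic is the point, not the API calls:

```python
def cascade(prompt: str, classify, generate_cheap, generate_strong):
    """Route to the cheap model unless the classifier flags the prompt as hard."""
    if classify(prompt) == "complex":
        return generate_strong(prompt)
    return generate_cheap(prompt)

# Stub example: treat anything over 100 chars as "complex"
result = cascade(
    "short question",
    classify=lambda p: "complex" if len(p) > 100 else "simple",
    generate_cheap=lambda p: "haiku-answer",
    generate_strong=lambda p: "sonnet-answer",
)
print(result)  # haiku-answer
```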
## Caching as a rate limit strategy
Prompt caching deserves its own section because it's underused. When you mark the top of your prompt (system prompt, large context document) as cacheable, Anthropic stores the KV representation of those tokens for 5 minutes. Subsequent requests that hit the cache:
- Cost 10% of the full input token price
- Don't count against TPM for the cached portion
- Complete faster (cached tokens skip the attention computation)
For a 200-page document you're analyzing with multiple prompts, caching the document means only the first request pays full price for the document's input tokens. Requests 2 through N pay 10% for the cached portion. On 50 requests at 200,000 input tokens each, that's the difference between 10M input tokens and roughly 1.2M tokens' worth of input cost (200,000 at full price plus 49 × 20,000 in cache reads).
```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document_text,
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=[{"role": "user", "content": specific_question}]
)
```
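The savings arithmetic can be sketched as a one-liner, assuming cache reads bill at 10% of the input token price as described earlier:

```python
def cached_input_tokens(doc_tokens: int, n_requests: int,
                        read_rate: float = 0.10) -> float:
    """Billed-input equivalent with caching: first request pays full price,
    the rest pay the cache-read rate for the document portion."""
    return doc_tokens + (n_requests - 1) * doc_tokens * read_rate

print(cached_input_tokens(200_000, 50))  # 1180000.0, vs 10,000,000 uncached
```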
## What to monitor in production
Four metrics worth tracking per deployment:
- 429 rate — percentage of requests that result in a rate limit error. Above 5% is a sign your queue isn't managing the limit correctly.
- TPM utilization — ratio of tokens used to the token limit. Read `anthropic-ratelimit-tokens-remaining` from response headers and track it over time.
- Retry latency — p50/p95 of the wait time from initial 429 to successful retry. This tells you whether your backoff is calibrated to the actual `retry-after` values you're seeing.
- Cache hit rate — if you're using prompt caching, `usage.cache_read_input_tokens` / total input tokens. A hit rate below 60% on a system-prompt-heavy application suggests the cache is expiring too frequently.
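The cache hit rate metric is simple to compute from the usage fields mentioned above (treating cache reads plus uncached input as the total):

```python
def cache_hit_rate(cache_read_tokens: int, uncached_input_tokens: int) -> float:
    """Fraction of input tokens served from cache; 0.0 when there's no traffic."""
    total = cache_read_tokens + uncached_input_tokens
    return cache_read_tokens / total if total else 0.0

print(cache_hit_rate(180_000, 20_000))  # 0.9
```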
## Septim Vault: API key and credential management for Claude workflows
If you're building multi-environment Claude API integrations and juggling API keys across projects, Septim Vault is a key-management toolkit for Claude Code workflows. Keeps credentials out of your codebase, out of your shell history, and out of version control. Pay once.
## Requesting a rate limit increase
If you've hit Tier 4 limits and still need more capacity, Anthropic has an enterprise rate limit request form in the console. In practice, the Batch API at 50% cost and 10,000 requests per batch handles most high-volume use cases without needing an increase. Rate limit increases take 5–10 business days to process and aren't guaranteed.
For applications that need real-time throughput above Tier 4 limits, the correct answer is usually architectural: distribute requests across multiple API keys (separate Anthropic accounts), implement model cascading (Haiku for filtering, Sonnet for generation), or use the Batch API for the non-time-sensitive portion of the workload.
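The multi-key distribution idea can be sketched as a round-robin dispatcher. The key strings here are hypothetical placeholders; each key would belong to a separate account with its own independent limits:

```python
from itertools import cycle

class KeyRotator:
    """Round-robin across API keys so each account's RPM budget is used evenly."""
    def __init__(self, api_keys):
        self._keys = cycle(api_keys)

    def next_key(self) -> str:
        return next(self._keys)

rotator = KeyRotator(["sk-ant-key-a", "sk-ant-key-b", "sk-ant-key-c"])
print([rotator.next_key() for _ in range(4)])
# ['sk-ant-key-a', 'sk-ant-key-b', 'sk-ant-key-c', 'sk-ant-key-a']
```

Each `next_key()` result would be passed as the `api_key` for that request's client; combine this with per-key sliding-window tracking (Pattern 1) so one hot key doesn't burn its budget while others sit idle.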