Claude API Rate Limits (2026): Handling 429s, Backoff, and Queues
A 429 from the Claude API means you've hit a rate limit. The response includes a retry-after header that tells you exactly when to retry. Most developers ignore this header and implement exponential backoff instead—which is the wrong strategy for Anthropic's rate limit design and can make throughput worse, not better.
This post covers Anthropic's rate limit structure as of April 2026, the correct retry pattern, practical queue implementations for high-volume applications, and the specific failure modes you'll encounter with each model tier.
What Anthropic's rate limits actually are
Anthropic uses three limit types, all operating simultaneously. You can hit any of them independently:
- Requests per minute (RPM) — how many API calls you can make in a 60-second window
- Tokens per minute (TPM) — total tokens (input + output) across all requests in a 60-second window
- Tokens per day (TPD) — cumulative token usage in a 24-hour period
Your limits depend on your usage tier. Anthropic automatically promotes accounts through tiers based on spend history. As of April 2026:
| Tier | Criteria | Claude Sonnet RPM | Sonnet TPM | Haiku RPM |
|---|---|---|---|---|
| Build (Tier 1) | New account, any spend | 5 | 25,000 | 50 |
| Scale (Tier 2) | $100 spend + 7 days | 50 | 100,000 | 2,000 |
| Growth (Tier 3) | $500 spend + 14 days | 1,000 | 500,000 | 5,000 |
| Tier 4 | $2,000 spend + 14 days | 2,000 | 1,000,000 | 10,000 |
Tier 1 limits are severe. 5 requests per minute on Sonnet means you can run one request every 12 seconds on average. For a developer building a batch processing tool, this is the single biggest constraint in the early stages of a project. Haiku's higher RPM on Tier 1 (50 RPM) is why many developers use Haiku for early-stage testing and switch to Sonnet at Tier 2.
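That 60 / RPM arithmetic is worth encoding once rather than recomputing per tier. A minimal pacing helper (illustrative; `min_interval_seconds` is my name, not an SDK function):

```python
def min_interval_seconds(rpm_limit: int) -> float:
    """Smallest average gap between requests that stays under an RPM limit."""
    return 60.0 / rpm_limit

# Tier 1 Sonnet: one request every 12 seconds
print(min_interval_seconds(5))   # 12.0
# Tier 2 Sonnet: one request every 1.2 seconds
print(min_interval_seconds(50))  # 1.2
```

Sleeping this long between dispatches is the crudest possible rate limiter, but it is often enough for a single-worker batch script at Tier 1.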
What the 429 response actually contains
When you hit a rate limit, the response looks like this:
```
HTTP/1.1 429 Too Many Requests
retry-after: 37
anthropic-ratelimit-requests-limit: 50
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z
anthropic-ratelimit-tokens-limit: 100000
anthropic-ratelimit-tokens-remaining: 48221
anthropic-ratelimit-tokens-reset: 2026-04-28T14:23:00Z

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Rate limit exceeded: requests"
  }
}
```
The headers tell you exactly what you need:
```
anthropic-ratelimit-requests-remaining: 0                 // no requests left this window
anthropic-ratelimit-tokens-remaining: 48221               // tokens still available
anthropic-ratelimit-requests-reset: 2026-04-28T14:23:00Z  // exact reset time
```
In this example you hit the RPM limit (0 requests remaining) but still have token budget (48,221 tokens remaining). Waiting 37 seconds and retrying will succeed. Implementing exponential backoff here would wait longer than necessary and reduce throughput for no benefit.
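A small helper that pulls the relevant fields out of a 429 response's headers makes the retry decision explicit (a sketch; the header names come from the example above, and `parse_ratelimit_headers` is a name I'm introducing):

```python
def parse_ratelimit_headers(headers: dict) -> dict:
    """Extract retry timing and remaining budget from anthropic-ratelimit-* headers."""
    def _int(name: str):
        value = headers.get(name)
        return int(value) if value is not None else None

    retry_after = headers.get("retry-after")
    return {
        "retry_after": float(retry_after) if retry_after is not None else None,
        "requests_remaining": _int("anthropic-ratelimit-requests-remaining"),
        "tokens_remaining": _int("anthropic-ratelimit-tokens-remaining"),
    }
```

Feeding it the headers above tells you at a glance which limit you hit and how long to sleep.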
The correct retry pattern: header-first, not exponential
For Claude API rate limits specifically, the correct retry strategy is:
- Check the `retry-after` header on the 429 response
- Wait exactly that many seconds (add 1 second for clock jitter)
- Retry the request
- If the retry still returns a 429 (unusual), apply exponential backoff starting from the `retry-after` base
Here's a Python implementation:
```python
import anthropic
import time
import random

client = anthropic.Anthropic()

def make_request_with_retry(prompt: str, max_retries: int = 5):
    retries = 0
    backoff_base = 1.0
    while retries < max_retries:
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            return response
        except anthropic.RateLimitError as e:
            retries += 1
            if retries >= max_retries:
                raise
            # Read the retry-after header if available
            retry_after = None
            if hasattr(e, 'response') and e.response is not None:
                retry_after_str = e.response.headers.get('retry-after')
                if retry_after_str:
                    retry_after = float(retry_after_str)
            if retry_after is not None:
                # Use the header value + small jitter
                wait = retry_after + random.uniform(0, 1)
            else:
                # Fall back to exponential backoff
                wait = backoff_base * (2 ** (retries - 1)) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s before retry {retries}/{max_retries}")
            time.sleep(wait)
        except anthropic.APIStatusError:
            # Non-rate-limit errors: don't retry
            raise
```
For TypeScript/Node.js applications:
```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

async function makeRequestWithRetry(
  prompt: string,
  maxRetries = 5
): Promise<Anthropic.Messages.Message> {
  let retries = 0;
  while (retries < maxRetries) {
    try {
      return await client.messages.create({
        model: 'claude-sonnet-4-5',
        max_tokens: 1024,
        messages: [{ role: 'user', content: prompt }],
      });
    } catch (err) {
      if (err instanceof Anthropic.RateLimitError) {
        retries++;
        if (retries >= maxRetries) throw err;
        const retryAfter = err.headers?.['retry-after'];
        const waitMs = retryAfter
          ? (parseFloat(retryAfter) + Math.random()) * 1000
          : Math.pow(2, retries) * 1000 + Math.random() * 1000;
        console.log(`Rate limited. Waiting ${(waitMs / 1000).toFixed(1)}s`);
        await new Promise(resolve => setTimeout(resolve, waitMs));
      } else {
        throw err;
      }
    }
  }
  throw new Error('Max retries exceeded');
}
```
Queue patterns for batch workloads
Retry logic handles individual request failures. For applications that need to process hundreds or thousands of requests (document processing, code review pipelines, batch analysis), you need a queue pattern with rate-awareness built in at dispatch time.
Pattern 1: Sliding-window queue (simple)
Track your own request count and sleep until the window resets if you're approaching the limit:
```python
import time
from collections import deque

class RateLimitedClient:
    def __init__(self, rpm_limit: int = 50, tpm_limit: int = 100_000):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.request_times = deque()  # timestamps of recent requests
        self.token_usage = deque()    # (timestamp, token_count) pairs

    def _clean_window(self):
        now = time.time()
        cutoff = now - 60
        while self.request_times and self.request_times[0] < cutoff:
            self.request_times.popleft()
        while self.token_usage and self.token_usage[0][0] < cutoff:
            self.token_usage.popleft()

    def _wait_if_needed(self, estimated_tokens: int):
        self._clean_window()
        # Wait for RPM headroom
        if len(self.request_times) >= self.rpm_limit:
            oldest = self.request_times[0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()
        # Wait for TPM headroom
        current_tokens = sum(t for _, t in self.token_usage)
        if self.token_usage and current_tokens + estimated_tokens > self.tpm_limit:
            oldest = self.token_usage[0][0]
            wait = 60 - (time.time() - oldest) + 0.1
            if wait > 0:
                time.sleep(wait)
            self._clean_window()

    def request(self, prompt: str, estimated_tokens: int = 1000):
        self._wait_if_needed(estimated_tokens)
        now = time.time()
        self.request_times.append(now)
        # Delegates to the retry wrapper defined earlier
        response = make_request_with_retry(prompt)
        # Record actual usage
        actual_tokens = response.usage.input_tokens + response.usage.output_tokens
        self.token_usage.append((now, actual_tokens))
        return response
```
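The sliding-window wait arithmetic inside `_wait_if_needed` can be isolated as a pure function, which makes it testable without actually sleeping (a sketch; `window_wait` is an illustrative name, not part of the class above):

```python
from collections import deque

def window_wait(timestamps: deque, limit: int, now: float, window: float = 60.0) -> float:
    """Seconds to wait before one more request fits inside the rolling window."""
    # Drop entries that have aged out of the window
    while timestamps and timestamps[0] < now - window:
        timestamps.popleft()
    if len(timestamps) < limit:
        return 0.0  # headroom available, no wait needed
    # Otherwise wait until the oldest entry ages out of the window
    return timestamps[0] + window - now
```

The same function serves both the RPM and TPM checks if you weight entries accordingly, and keeping it pure makes off-by-one errors in the window math easy to catch in unit tests.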
Pattern 2: Anthropic Batch API (for throughput over latency)
For workloads where you don't need immediate responses, the Anthropic Batch API is the correct tool. It accepts up to 10,000 requests per batch, processes them asynchronously over up to 24 hours, and charges 50% less per token than the real-time API. Rate limits are much more generous for batch requests.
```python
import anthropic

client = anthropic.Anthropic()

# Create a batch
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"request-{i}",
            "params": {
                "model": "claude-sonnet-4-5",
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}]
            }
        }
        for i, prompt in enumerate(prompts)
    ]
)

print(f"Batch created: {batch.id}")
# Poll batch.results_url or subscribe to a webhook for completion
```
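When webhooks aren't an option, a minimal polling loop looks like this (a sketch; it assumes the SDK's `client.messages.batches.retrieve` and `client.messages.batches.results` methods and the `"ended"` processing status, so verify against your SDK version):

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: float = 30.0):
    """Block until a message batch finishes, then return its result entries."""
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            # results() yields one entry per request, keyed by custom_id
            return list(client.messages.batches.results(batch_id))
        time.sleep(poll_seconds)
```

A 30-second poll interval is plenty for jobs that run over hours; polling faster just burns requests against your real-time limits.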
Batch API is the right answer for: document analysis pipelines, nightly summarization jobs, large-scale code review automation, and any use case where results are needed within 24 hours rather than within seconds.
Per-model rate limit behavior
Rate limits are enforced per model family. A few things that catch developers by surprise:
- Haiku has dramatically higher RPM at every tier. If you're bottlenecked on RPM rather than model quality, using Haiku for classification/routing and Sonnet only for generation is a common throughput pattern.
- TPM limits count input tokens on streaming requests at send time, not at completion. If you're streaming a long-context request, the input tokens are deducted from your TPM budget immediately, not after the stream finishes.
- Opus has separate (lower) limits. If your application uses `claude-opus-4`, check the tier table for Opus specifically; its limits are not the same as Sonnet's.
- Cache reads don't count against TPM. Prompt caching with `cache_control: {"type": "ephemeral"}` means the cached portion of a prompt doesn't consume TPM when it's a cache hit. For large-context applications, this is the single most important rate-limit optimization.
Caching as a rate limit strategy
Prompt caching deserves its own section because it's underused. When you mark the top of your prompt (system prompt, large context document) as cacheable, Anthropic stores the KV representation of those tokens for 5 minutes. Subsequent requests that hit the cache:
- Cost 10% of the full input token price
- Don't count against TPM for the cached portion
- Complete faster (cached tokens skip the prefill computation)
For a 200-page document you're analyzing with multiple prompts, caching the document means only the first request costs full input tokens. Requests 2 through N pay 10% for the document. On 50 requests at 200,000 input tokens each, that's the difference between 10M tokens and roughly 1.2M token-equivalents (200,000 at full price once, then 49 requests at 20,000 each).
```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": large_document_text,
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=[{"role": "user", "content": specific_question}]
)
```
What to monitor in production
Four metrics worth tracking per deployment:
- 429 rate — percentage of requests that result in a rate limit error. Above 5% is a sign your queue isn't managing the limit correctly.
- TPM utilization — ratio of tokens used to the token limit. Read `anthropic-ratelimit-tokens-remaining` from response headers and track it over time.
- Retry latency — p50/p95 of the wait time from initial 429 to successful retry. This tells you whether your backoff is calibrated to the actual `retry-after` values you're seeing.
- Cache hit rate — if you're using prompt caching, `usage.cache_read_input_tokens` / total input tokens. A hit rate below 60% on a system-prompt-heavy application suggests the cache is expiring too frequently.
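The first metric needs nothing more than two counters. A minimal in-process sketch (`RateLimitMetrics` is an illustrative name; a real deployment would export this to your metrics backend):

```python
class RateLimitMetrics:
    """Tracks the share of requests that come back 429."""
    def __init__(self):
        self.total = 0
        self.limited = 0

    def record(self, status_code: int):
        self.total += 1
        if status_code == 429:
            self.limited += 1

    @property
    def rate_429(self) -> float:
        return self.limited / self.total if self.total else 0.0
```

Call `record()` on every response and alert when `rate_429` crosses the 5% threshold mentioned above.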
Septim Vault: API key and credential management for Claude workflows
If you're building multi-environment Claude API integrations and juggling API keys across projects, Septim Vault is a key-management toolkit for Claude Code workflows. It keeps credentials out of your codebase, out of your shell history, and out of version control. Pay once.
Requesting a rate limit increase
If you've hit Tier 4 limits and still need more capacity, Anthropic has an enterprise rate limit request form in the console. In practice, the Batch API at 50% of the cost and 10,000 requests per batch handles most high-volume use cases without needing an increase. Rate limit increases take 5–10 business days to process and aren't guaranteed.
For applications that need real-time throughput above Tier 4 limits, the correct answer is usually architectural: distribute requests across multiple API keys (separate Anthropic accounts), implement model cascading (Haiku for filtering, Sonnet for generation), or use the Batch API for the non-time-sensitive portion of the workload.
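A minimal round-robin dispatcher for the multi-key approach (a sketch; in practice each entry would be an `anthropic.Anthropic(api_key=...)` client, and each key must belong to a separate account for the limits to be genuinely independent):

```python
import itertools

class ClientPool:
    """Rotates across pre-built clients, one per API key."""
    def __init__(self, clients):
        self._cycle = itertools.cycle(list(clients))

    def next_client(self):
        return next(self._cycle)
```

Pair each pooled client with its own retry wrapper and sliding-window limiter from earlier in the post, since each account carries its own RPM and TPM budgets.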