Is Claude getting worse? What we learned when Anthropic admitted it.
For six weeks, developers said Claude got dumber. Anthropic said it didn't. On April 23, Anthropic published an internal post-mortem confirming three engineering changes degraded coding output between March 4 and April 16, then quietly pushed fixes and reset everyone's usage limits as a concession. Here is what changed, who got hit, and what to do when your business runs on one AI vendor.
What developers were saying
Starting in early March, threads on Hacker News and r/ClaudeAI began compiling the same observation: Claude Code was producing shorter answers, missing obvious steps, and abandoning context mid-task. Some users blamed their own prompt drift. Others insisted the model itself had quietly changed.
Anthropic's public stance for the next six weeks was that no model swap had occurred. The status page showed green. Internal benchmarks reportedly came back flat. The community kept producing receipts — side-by-side outputs, identical prompts, demonstrably worse responses — and Anthropic kept saying nothing was wrong.
"The status page is a function of what you measure. If your benchmark is API uptime, and the regression is in the shape of the output, you can be honestly green and quietly broken at the same time." — Septim Labs · April 25, 2026
What actually changed
The April 23 post-mortem identified three independent changes that compounded over six weeks. None was a model retrain. All three were operational adjustments that lived above the model layer.
Default reasoning budget for Claude Code dropped from high to medium as a cost-control measure. On simple tasks, indistinguishable. On multi-step coding work, the model started skipping intermediate planning steps and going straight to output.
A bug in the session-state handler cleared mid-session thinking history at certain context-window thresholds. The model would lose its own work-in-progress reasoning halfway through a task and have to reconstruct it from output history alone — visibly worse on long PRs.
A latency-improvement feature added an aggressive truncation rule — a cap of roughly 25 words on a category of intermediate responses Claude uses internally to plan. Output quality on synthesis tasks collapsed because the model ran out of room to think before answering.
Each one was a reasonable engineering tradeoff. Together they produced a six-week regression that Anthropic could not detect on its own benchmarks because each change individually fell within the noise floor.
What Anthropic did next
The fix landed in three rolling deploys between April 16 and April 23. Reasoning effort returned to high by default. The session-state bug was patched. The 25-word cap was removed. As a goodwill move, Anthropic reset all subscriber usage limits for the billing month — every Pro and Max user got their May allotments back, plus April's. No one asked for this; it was a concession.
"Three changes shipped between March 4 and March 22 collectively degraded Claude Code output quality on multi-step tasks. We did not catch it because each change passed our individual rollout gates. Our aggregate quality benchmark did not flag the trend until a community-published reproducer surfaced on April 16."
The exact text is mirrored in the April 24 Fortune piece and the April 22 Register write-up, both linked in sources.
Why this matters if Claude is your only AI vendor
If your stack runs entirely through Anthropic — Claude Code in your editor, Claude API in your product, the Anthropic dashboard for billing — six weeks of degradation is six weeks of degradation. Every PR your reviewer agent looked at. Every customer-facing reply your assistant generated. Every test your debug agent diagnosed.
The community caught it. The reproducer that broke the silence was a community artifact, not Anthropic's internal QA. That is the structural lesson here — and it generalizes to every vendor, not just Anthropic. When your business depends on one model behaving consistently, you need the same posture you would have toward any infrastructure dependency: independent verification, version-pinned reproducibility, an exit ramp.
Three concrete things we changed at Septim Labs after April 23
- Pinned a regression suite of our own tasks. Twenty repeatable Claude Code prompts whose outputs we sample weekly and diff against last quarter's baseline. The reproducer doesn't need to be impressive — it needs to be ours, and we need to look at it.
- Routed cost-sensitive paths through smaller models. Sub-tasks that don't need full Claude reasoning (formatting, classification, schema validation) now run on cheaper Anthropic tiers or alternate vendors. When the flagship model regresses, only the workload that actually needed it is affected.
- Captured every output we ship to a customer. Versioned, dated, replayable. If a customer reports the system "got worse" three weeks later, we have receipts.
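The first item above, the pinned regression suite, can be sketched as a tiny harness. This is a minimal sketch under assumptions: `run_prompt` is a hypothetical stand-in for whatever model client you use (stubbed here so the harness runs offline), and the 0.8 similarity threshold is an illustrative default, not a recommendation.

```python
import difflib
import hashlib
from pathlib import Path

def run_prompt(prompt: str) -> str:
    # Hypothetical stand-in for your real model client; stubbed for the sketch.
    return f"stub output for: {prompt}"

def baseline_path(baseline_dir: Path, prompt: str) -> Path:
    # Stable filename per prompt, derived from a content hash.
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    return baseline_dir / f"{key}.txt"

def check_regression(prompt: str, baseline_dir: Path, threshold: float = 0.8) -> bool:
    """Compare today's output to the stored baseline for this prompt.
    Returns True when similarity drops below the threshold (a drift signal).
    The first run records a baseline and reports no regression."""
    current = run_prompt(prompt)
    path = baseline_path(baseline_dir, prompt)
    if not path.exists():
        baseline_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(current)
        return False
    baseline = path.read_text()
    similarity = difflib.SequenceMatcher(None, baseline, current).ratio()
    return similarity < threshold
```

A character-level ratio is crude; the point is that the suite is yours, run weekly, with a baseline you control. Swap in whatever diff metric matches your outputs.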
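The capture-everything item can be as small as an append-only JSONL log, one record per customer-facing output, dated and tagged with the model string so any response is replayable later. The function names and record fields here are our own illustrative choices, not any vendor's API.

```python
import datetime
import json
from pathlib import Path

def log_output(log_file: Path, prompt: str, output: str, model: str) -> None:
    """Append one shipped output as a JSON line: timestamped and model-tagged."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,   # pin the exact model identifier you called
        "prompt": prompt,
        "output": output,
    }
    with log_file.open("a") as f:
        f.write(json.dumps(record) + "\n")

def replay(log_file: Path) -> list[dict]:
    """Read every shipped output back, e.g. for a side-by-side re-run."""
    return [json.loads(line) for line in log_file.read_text().splitlines()]
```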
The bigger pattern
The tools you ship to customers should not depend on a single vendor's promise that their model is unchanged. The model will change. The vendor will ship operational tweaks above the model layer that they consider safe and that you experience as quality drift. The status page will stay green.
This is not a takedown of Anthropic specifically. They published a post-mortem, fixed the issue, refunded usage. That is the responsible-vendor pattern. The lesson is that even responsible vendors produce drift, and the only durable defense is your own measurement.
"If your customer can tell, your provider's benchmark didn't." — internal note · written during the regression, April 14, 2026
What to do this week
- Run a regression diff. Take ten of your last quarter's Claude Code outputs. Re-run the same prompts on the current model. If today's outputs are visibly thinner, you have your own evidence.
- Cap usage on the cost-sensitive paths. Even a soft per-call budget catches runaway reasoning loops before billing day.
- Save outputs. The cheapest tier on Anthropic still costs more than disk. Log it.
- Have one parallel vendor. Not for fail-over — for sanity-check. When Claude shifts, you want to know whether GPT or Gemini noticed too. If your tasks regress on Claude but not on the others, that is signal.
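The soft per-call budget from the list above can be a few lines of guard code in front of your client. A minimal sketch: `BudgetGuard` is a hypothetical helper, and the per-token prices are illustrative placeholders, not real Anthropic rates.

```python
class BudgetGuard:
    """Soft spend limits per call and per day; raises before billing day does."""

    def __init__(self, max_usd_per_call: float, max_usd_per_day: float):
        self.max_call = max_usd_per_call
        self.max_day = max_usd_per_day
        self.spent_today = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_in: float = 0.003,    # illustrative rate, not a real price
               usd_per_1k_out: float = 0.015) -> float:
        """Record one call's cost; raise if the call or the day blows its cap."""
        cost = (input_tokens / 1000) * usd_per_1k_in \
             + (output_tokens / 1000) * usd_per_1k_out
        if cost > self.max_call:
            raise RuntimeError(f"single call cost ${cost:.4f} exceeds per-call cap")
        self.spent_today += cost
        if self.spent_today > self.max_day:
            raise RuntimeError(f"daily spend ${self.spent_today:.4f} exceeds daily cap")
        return cost
```

The per-call cap is what catches a runaway reasoning loop: one call that tries to burn 100k tokens fails loudly instead of landing on the invoice.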
Cap your Claude bill before the next time something quietly breaks.
Septim Rescue is a $299 emergency intervention for Claude bills that already spiked. Septim Vault is the $89 lifetime kit we use to harden our own MCP and agent setups against runaway costs. Both are one-time purchases. Both are yours forever.