
Claude Code Cost Optimization: 25-Item Checklist

At a glance: caching -55% · model -60% · context -40% · batch -50% · hooks -30%
Filed: Cost Control · April 28, 2026 · /blog/claude-code-cost-optimization-checklist-2026.html

Claude Code's billing is fully transparent—every token costs something, and the cost shows up immediately in your Anthropic console. That transparency is useful because it makes optimization concrete: each item on this checklist has a measurable effect on your bill. There's no "might help" in token math.

These 25 items are organized by category. High-impact items appear first within each category. The savings estimates are based on real usage patterns, not theoretical maximums. Items marked "Low complexity" can be done in under an hour. Items marked "High complexity" require architectural changes.

How to use this checklist

Run through each category against your current setup. Every item you haven't implemented is money left on the table. The total addressable savings varies by workload, but the median developer using Claude Code at moderate intensity (3–5 hours/day) can reduce monthly API spend by 40–60% by completing all 25 items.

// 1. Prompt caching · 6 items
01

Enable cache_control on your system prompt

If you use the API directly, wrap your system prompt (or any large context block repeated across requests) in a cache_control: {"type": "ephemeral"} block. Cache reads cost 10% of the normal input price (the initial cache write costs 25% more). On a 10,000-token system prompt repeated 50 times per day, that's roughly 450,000 input tokens' worth of spend avoided per day.

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": your_large_system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_message}]
)
-55% input tokens · low complexity
02

Cache documents before querying them multiple times

If you're running multiple prompts against the same document (a codebase section, a spec, a PDF), cache the document on the first request. Every subsequent request that hits the cache pays 10% of the normal input price for the document. With the 25% cache-write premium on the first request, you break even by the second request and come out further ahead with every one after that.

-45% repeat-doc workflows · low
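A minimal sketch of this pattern, assuming an `anthropic` client and a `document_text` string are available; the model name and system text are illustrative:

```python
# Sketch: ask several questions against one cached document.
# The cache breakpoint goes on the large, stable block so every
# follow-up question reads the document at the cached rate.
def build_request(document_text: str, question: str) -> dict:
    """Build kwargs for client.messages.create with the document cached."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "Answer questions about the document."},
            {
                "type": "text",
                "text": document_text,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }

# Usage (requires an API key):
# client = anthropic.Anthropic()
# for q in ["Summarize section 2.", "List all API endpoints."]:
#     response = client.messages.create(**build_request(spec_text, q))
```

As long as the queries arrive within the cache TTL, only the first one pays the full document price.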
03

Monitor cache hit rate from response headers

Read usage.cache_read_input_tokens from every API response. If your cache hit rate is below 60% for a system-prompt-heavy application, the cache is expiring before you can use it. The ephemeral cache lasts 5 minutes; make sure your requests are arriving within that window.

diagnostic · hit rate · low
04

Keep cached content at the top of the prompt

The cache is keyed on the content and its position. If you put dynamic content (user message, current date) before your cached system prompt, the cache won't hit. Put the static, large block first. Put the dynamic content last.

enables caching · prompt structure · low
05

Use extended cache (1-hour TTL) for large stable contexts

The standard ephemeral cache lasts 5 minutes. If your context is large (a full codebase index) and changes infrequently, Anthropic offers extended caching with a 1-hour TTL. Writes cost more (2x the base input price, vs. 1.25x for the 5-minute cache) but reads stay at 10%, so it pays off for contexts above 100K tokens that are reused across a long session.

-30% large context · medium
06

Cache CLAUDE.md content in long Claude Code sessions

In Claude Code sessions, the CLAUDE.md content is prepended to every message. Claude Code applies prompt caching to this prefix automatically, but cache reads still cost 10% of the normal price, so a 5,000-token CLAUDE.md still bills roughly 500 tokens' worth per turn. Keep CLAUDE.md lean, and consider splitting project-specific context into a separate file that's referenced only when needed rather than injected into every turn.

-20% session tokens · low
// 2. Model selection · 5 items
07

Use Haiku for classification and routing tasks

Claude 3.5 Haiku costs $0.80 per million input tokens vs. $3 for Sonnet 4.5. For tasks that are fundamentally pattern-matching (classify this error, categorize this issue, does this text match these criteria), Haiku produces equivalent quality at roughly a quarter of the price. Audit your subagents: anything with a 3-turn max and a classification-style output should run on Haiku.

-73% per token · model switch
08

Use Sonnet only when reasoning matters

Sonnet is worth the price for: code review, security audit, multi-step reasoning, anything that requires synthesizing conflicting information. It's not worth the price for: documentation generation, changelog writing, structured data extraction, or anything with a deterministic format.

-60% mixed workloads · audit
09

Set max_tokens conservatively per agent

The API bills for tokens generated, not tokens requested. But setting a high max_tokens when you don't need it means Claude may generate more than necessary. For structured outputs (JSON, YAML, tables), a lower max_tokens also forces Claude to be more concise. Audit each agent's actual output length and set max_tokens to 120% of the p95 observed output.

-15% output tokens · medium
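The p95 rule can be sketched as a small helper. This assumes you log output token counts per agent yourself; the 1.2 headroom factor mirrors the 120% rule above:

```python
import math

# Derive a per-agent max_tokens from logged output lengths:
# 120% of the p95 observed output, rounded up.
def suggested_max_tokens(observed: list[int], headroom: float = 1.2) -> int:
    ranked = sorted(observed)
    # Nearest-rank p95: smallest value covering 95% of the sample.
    idx = max(0, math.ceil(0.95 * len(ranked)) - 1)
    return math.ceil(ranked[idx] * headroom)
```

Re-run it periodically; if an agent's outputs drift longer, the cap should follow rather than truncate.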
10

Use streaming for long outputs; cancel early if needed

If you're using streaming, you can cancel mid-stream when you have enough output. On the API, partial streaming responses are billed for tokens generated so far, not the full max_tokens. For applications where you often need only the first part of a long output, streaming + early cancel can reduce output token costs by 40–70%.

-40% output tokens · high
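A sketch of the early-cancel consumer. With the Anthropic SDK, exiting the `with client.messages.stream(...)` block closes the connection, so generation past your cancel point is not billed; the stop marker here is illustrative:

```python
# Accumulate streamed text, stopping as soon as a marker appears.
def take_until(chunks, stop_marker: str) -> str:
    buf = []
    for chunk in chunks:
        buf.append(chunk)
        if stop_marker in "".join(buf):
            break  # stop consuming; the caller then closes the stream
    return "".join(buf)

# Usage (requires an API key):
# with client.messages.stream(model="claude-sonnet-4-5", max_tokens=4096,
#                             messages=[...]) as stream:
#     text = take_until(stream.text_stream, "\n## References")
```

This works for any workflow where a recognizable delimiter separates the part you need from the part you don't.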
11

Avoid Opus for tasks Sonnet handles equally well

Opus costs $15 per million input tokens—5x Sonnet. The quality difference between Opus and Sonnet is significant for open-ended creative work and complex multi-step reasoning. For code tasks, structured output, and most developer workflows, Sonnet matches Opus quality at one-fifth the price. Benchmark before defaulting to Opus.

-80% vs Opus · benchmark first
// 3. Context management · 5 items
12

Run /compact before long sessions drift past 50K tokens

Claude Code's /compact command summarizes the session context and replaces it with a compressed version. A 100K-token session becomes a 5K-token summary. The quality loss is minimal for task continuity; the cost saving is significant. Run it every 2 hours on active sessions.

-80% context tokens · low
13

Use Grep and Read instead of letting Claude explore the codebase

When Claude explores a codebase without direction, it reads many files to understand context. Directing it to the relevant files first ("read app/api/users.ts and the User model") reduces context by an order of magnitude. Use Grep to find relevant files before asking Claude to read them.

-50% exploration tasks · medium
14

Keep CLAUDE.md under 300 lines

Every CLAUDE.md line is tokens prepended to every message in a Claude Code session. At a typical ~15 tokens per line, a 3,000-line CLAUDE.md adds roughly 45,000 tokens to every turn; a 300-line one adds roughly 4,500. The 3000-line CLAUDE.md post covers how to structure it for minimal token spend without losing coverage.

-30% session tokens · medium
15

Scope subagent tool access to only what they need

A subagent with access to all tools will use them. A subagent with access to only [Read, Grep] can't spin up a Bash process and load a 10MB log file into context. Tool restriction is a cost guardrail and a security control simultaneously.

-25% per agent · low
16

Pass diffs, not full files, to review agents

When running a code review agent, pass the output of git diff HEAD~1 rather than the full file contents. A 2,000-line file with 40 changed lines costs you the full file's tokens if you pass the file, and only a few percent of that if you pass the diff. For review workflows, the diff is almost always sufficient.

-90% review tasks · medium
// 4. Batch and async patterns · 5 items
17

Use the Batch API for any non-time-sensitive workload

The Anthropic Batch API costs 50% less per token than the real-time API. It accepts up to 100,000 requests per batch and processes them within 24 hours (most batches finish much sooner). If your use case doesn't need a response in under 60 seconds, the Batch API is the correct choice. Document analysis, test generation, changelog writing: all batch-eligible.

-50% all tokens · medium
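A sketch of building the batch payload. The custom_id + params request shape matches the Message Batches endpoint; the model alias and prompts are illustrative:

```python
# Map {custom_id: prompt} to Message Batches request entries.
# custom_id is how you match results back to inputs later.
def build_batch(prompts: dict[str, str],
                model: str = "claude-3-5-haiku-latest") -> list[dict]:
    return [
        {
            "custom_id": cid,
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for cid, prompt in prompts.items()
    ]

# Usage (requires an API key):
# batch = client.messages.batches.create(requests=build_batch(prompts))
# ...poll the batch status, then fetch results keyed by custom_id.
```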
18

Deduplicate requests before sending to the API

If your application might send the same prompt twice (identical user query, same document analysis), check your request against a local hash before calling the API. A SHA-256 hash of (model + system_prompt + user_message) identifies duplicates. Cache the response keyed to the hash. A 5% duplicate rate in a high-volume application is significant savings over a month.

variable · dedup ratio · medium
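A minimal in-memory sketch of the dedup cache; `call_api` stands in for whatever function actually hits the API, and a production version would use a persistent store with a TTL:

```python
import hashlib
import json

_response_cache: dict[str, str] = {}

# SHA-256 over (model, system, user) identifies duplicate requests.
def request_key(model: str, system: str, user: str) -> str:
    payload = json.dumps([model, system, user], separators=(",", ":"))
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, system: str, user: str, call_api) -> str:
    """Return a cached response, calling the API only on a miss."""
    key = request_key(model, system, user)
    if key not in _response_cache:
        _response_cache[key] = call_api(model, system, user)
    return _response_cache[key]
```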
19

Batch similar requests into a single multi-part prompt

If you need to perform the same operation on 20 documents (summarize, classify, extract), one multi-document request often costs less than 20 single-document requests because the system prompt is paid once. Test this against your actual token math—very large batches can exceed context limits and force splitting anyway.

-25% batch overhead · medium
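One way to sketch the multi-document prompt; the `<doc>` delimiter convention is illustrative, not required by the API:

```python
# Fold N documents into a single prompt so the per-request overhead
# (system prompt, instructions) is paid once instead of N times.
def multi_doc_prompt(docs: dict[str, str], instruction: str) -> str:
    parts = [instruction, ""]
    for name, text in docs.items():
        parts += [f'<doc name="{name}">', text, "</doc>", ""]
    parts.append("Answer once per doc, labeled by name.")
    return "\n".join(parts)
```

Keep an eye on the combined size: the savings disappear if the batch has to be split to fit the context window anyway.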
20

Implement request coalescing for identical concurrent queries

In high-traffic applications, multiple users may trigger the same underlying API call simultaneously (same report, same analysis). Coalescing means that while one such request is in-flight, subsequent identical requests wait for its response and share it instead of issuing their own. The saved API calls scale with your traffic spike patterns.

variable · concurrent traffic · high
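An asyncio sketch of single-process coalescing; `fetch` stands in for the real API call, and a multi-process deployment would need a shared in-flight registry instead:

```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}

# If a call for `key` is already running, wait on its future and
# share the result; otherwise run the call and publish the result.
async def coalesced(key: str, fetch):
    if key in _inflight:
        return await _inflight[key]
    fut = asyncio.get_running_loop().create_future()
    _inflight[key] = fut
    try:
        result = await fetch()
        fut.set_result(result)
        return result
    except Exception as exc:
        fut.set_exception(exc)
        raise
    finally:
        del _inflight[key]
```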
21

Schedule batch jobs during off-peak hours for Batch API priority

Batch API processing time varies with Anthropic's load. Submitting batches during low-traffic hours (UTC 02:00–08:00) typically yields faster completion without any additional cost. For batches with a 24-hour window, submitting at midnight and receiving results by morning is a reliable pattern.

faster turnaround · scheduling · low
// 5. Guardrails and limits · 4 items
22

Set per-session and per-day budget limits via PreToolUse hooks

A PreToolUse hook runs before every tool call. A 30-line hook that reads your session's cumulative cost from ~/.claude/projects/ and halts if it exceeds $10 prevents Tokenocalypse scenarios. The hook fires before the API call leaves your machine; there is no earlier or harder enforcement point.

prevents runaway · hard cap · medium
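A sketch of such a hook, assuming you maintain the session's cumulative spend yourself in a small state file (the SPEND_FILE path and the $10 cap are illustrative). In Claude Code hooks, exit code 2 blocks the pending tool call and returns stderr to Claude:

```python
#!/usr/bin/env python3
import json
import pathlib
import sys

SPEND_FILE = pathlib.Path.home() / ".claude" / "session_spend.json"
CAP_USD = 10.0

def over_budget(spend_usd: float, cap: float = CAP_USD) -> bool:
    return spend_usd >= cap

def main() -> int:
    json.load(sys.stdin)  # hook payload; consumed but unused here
    spend = 0.0
    if SPEND_FILE.exists():
        spend = json.loads(SPEND_FILE.read_text()).get("usd", 0.0)
    if over_budget(spend):
        print(f"Session budget cap reached (${spend:.2f})", file=sys.stderr)
        return 2  # exit code 2 blocks the tool call
    return 0

# Entry point when installed as a hook script:
# sys.exit(main())
```

Register the script under a PreToolUse matcher in your hook settings and every tool call passes through it first.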
23

Set max_turns limits on all subagents

A subagent without a max_turns limit can run indefinitely. Set max_turns: 10 on most agents and max_turns: 5 on agents with simple, bounded tasks. A runaway subagent at 50 turns costs 5-10x what a well-bounded one does on the same task.

-60% runaway prevention · low
24

Log and alert on cost anomalies, not just monthly totals

Monthly billing alerts catch Tokenocalypse events after the damage is done. Daily cost alerts (email or Slack webhook when daily spend exceeds 2x baseline) catch them in time to intervene. Anthropic's console supports daily spend threshold alerts. Set them.

early warning · monitoring · low
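The 2x-baseline check itself is a few lines. This sketch assumes a date-ordered list of daily costs you collect yourself (exported from the console or a usage tool); wiring the result into a Slack webhook POST is left to your alerting setup:

```python
from statistics import median

# Flag the latest day if it exceeds `factor` times the median of
# all prior days (the trailing baseline).
def is_anomaly(daily_usd: list[float], factor: float = 2.0) -> bool:
    if len(daily_usd) < 2:
        return False
    baseline = median(daily_usd[:-1])
    return daily_usd[-1] > factor * baseline
```

A median baseline is deliberately insensitive to one earlier spike, so a single past Tokenocalypse doesn't mask the next one.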
25

Kill zombie sessions before they accumulate

A Claude Code session left open but unattended will still bill when a subagent makes a tool call. Check active sessions with claude sessions list and kill sessions you're not actively using. On a machine shared between developers, zombie sessions are a significant and invisible cost source.

variable · session hygiene · low

Where to start

If you're going to do only five of these this week, do: 01 (enable prompt caching), 07 (switch classification tasks to Haiku), 12 (run /compact regularly), 22 (set budget limit hooks), and 23 (set max_turns on every agent). Those five address the highest-impact categories and take less than two hours combined to implement.

The remaining 20 items are worth working through over the next month. Run ccusage total before and after each category to measure the actual impact on your workload. The numbers in this post are estimates; your actual savings will depend on your specific usage patterns.

Septim Drills: 47 exercises including hook configuration and cost guardrails

Items 22 and 23 above (PreToolUse hooks and max_turns) require writing hook scripts and YAML agent configs. Septim Drills includes 47 structured exercises that walk through both, with real examples from production Claude Code workflows. Pay once.

Get Septim Drills — $29 →