· cost control · checklist · अप्रैल 2026 ·

Claude Code Cost Optimization: 25-Item Checklist

// FILED Cost Control// DATE 28 अप्रैल, 2026// SLUG /blog/claude-code-cost-optimization-checklist-2026.htmlcite this →

28 अप्रैल, 2026 को published · Septim Labs · 15 मिनट का read

Claude Code की billing पूरी तरह transparent है — हर token की एक cost है, और वो cost तुरंत आपके Anthropic console में दिखती है. यह transparency इसलिए useful है क्योंकि वह optimization को concrete बनाती है: इस checklist के हर item का आपके bill पर एक measurable effect है. Token math में कोई "might help" नहीं होता.

ये 25 items category के हिसाब से organized हैं. हर category में पहले high-impact items आते हैं. Savings estimates असली usage patterns पर based हैं, theoretical maximums पर नहीं. "Low complexity" वाले items एक घंटे से कम में हो जाते हैं. "High complexity" वाले architectural changes माँगते हैं.

इस checklist को कैसे use करें

हर category को अपने current setup के against चलाइए. हर वो item जो आपने implement नहीं किया, वो table से जाते हुए पैसे हैं. कुल addressable savings workload के हिसाब से बदलती है, लेकिन moderate intensity (दिन में 3–5 घंटे) पर Claude Code use करने वाला median developer सभी 25 items complete करके अपना monthly API spend 40–60% तक कम कर सकता है.

// 1. Prompt caching · 6 items

अपने system prompt पर cache_control enable करें

अगर आप API को directly use कर रहे हैं, तो अपने system prompt (या किसी भी बड़े context block को जो requests में बार-बार आता है) को cache_control: {"type": "ephemeral"} block में wrap करें. Cached tokens की cost uncached input tokens की 10% होती है. 10,000-token system prompt को दिन में 50 बार repeat करें, तो रोज़ 4.5M tokens की बचत.

system=[{
  "type": "text",
  "text": your_large_system_prompt,
  "cache_control": {"type": "ephemeral"}
}]

-55%input tokens · low complexity

एक ही document पर बार-बार query करने से पहले उसे cache करें

अगर आप एक ही document पर कई prompts चला रहे हैं (codebase का एक section, एक spec, एक PDF), तो पहले request पर document cache कर लीजिए. हर अगला request जो cache hit करेगा, उसके लिए document की 10% cost पड़ेगी. Break-even 2 requests पर है; 3 या उससे ज़्यादा पर यह pay off करता है.

-45%repeat-doc workflows · low

Response headers से cache hit rate monitor करें

हर API response से usage.cache_read_input_tokens पढ़िए. अगर system-prompt-heavy application पर आपकी cache hit rate 60% से नीचे है, तो cache use होने से पहले ही expire हो रहा है. Ephemeral cache 5 मिनट चलता है; पक्का कीजिए कि आपके requests उस window में आ रहे हैं.

diagnostichit rate · low

Cached content को prompt के top पर रखें

Cache content और उसकी position पर keyed होता है. अगर आप dynamic content (user message, current date) cached system prompt से पहले डाल देते हैं, तो cache hit नहीं होगा. Static, बड़ा block पहले रखिए. Dynamic content सबसे आख़िर में.

enables cachingprompt structure · low

बड़े stable contexts के लिए extended cache (1-hour TTL) use करें

Standard ephemeral cache 5 मिनट चलता है. अगर आपका context बड़ा है (पूरा codebase index) और कम बदलता है, तो Anthropic 1-hour TTL के साथ extended caching offer करता है — cache-write cost थोड़ी ज़्यादा, लेकिन per-hour cache-read cost कम. 100K tokens से ऊपर के contexts के लिए worth है.

-30%large context · medium

लंबे Claude Code sessions में CLAUDE.md content cache करें

Claude Code sessions में CLAUDE.md content हर message के साथ prepend होता है. अगर आपकी CLAUDE.md 5,000 tokens की है, तो हर turn पर 5,000 tokens bill होते हैं. CLAUDE.md को lean रखिए, और project-specific context को एक अलग file में निकालने पर विचार कीजिए — जो ज़रूरत पड़ने पर ही reference हो, हर turn में inject न हो.

-20%session tokens · low

// 2. Model selection · 5 items

Classification और routing tasks के लिए Haiku use करें

Haiku 3.5 की cost है $0.25 per million input tokens, जबकि Sonnet 4.5 की है $3. जो tasks मूलतः pattern-matching हैं (इस error को classify करो, इस issue को categorize करो, क्या यह text इन criteria से match करता है), उन पर Haiku Sonnet जैसी quality देता है — एक-बारहवीं price पर. अपने subagents audit कीजिए — जिनका max 3-turn हो और output classification-style हो, वे सब Haiku पर run होने चाहिए.

-92%per token · model switch

Sonnet सिर्फ़ तब use करें जब reasoning ज़रूरी हो

Sonnet इन कामों के लिए worth है: code review, security audit, multi-step reasoning, और कुछ भी जिसमें conflicting information को synthesize करना हो. यह इन कामों के लिए worth नहीं है: documentation generation, changelog writing, structured data extraction, या ऐसी कोई भी चीज़ जिसका format deterministic हो.

-60%mixed workloads · audit

हर agent के लिए max_tokens conservatively set करें

API generated tokens का bill करती है, requested tokens का नहीं. लेकिन ज़रूरत न होने पर भी max_tokens high रखने का मतलब है Claude ज़रूरत से ज़्यादा generate कर सकता है. Structured outputs (JSON, YAML, tables) के लिए कम max_tokens Claude को ज़्यादा concise होने पर मजबूर भी करता है. हर agent की असली output length audit कीजिए और max_tokens को p95 observed output का 120% set कीजिए.

-15%output tokens · medium

लंबे outputs के लिए streaming use करें; ज़रूरत पड़ने पर जल्दी cancel करें

अगर आप streaming use कर रहे हैं, तो जब enough output मिल जाए तो mid-stream cancel कर सकते हैं. API पर partial streaming responses अब तक generated tokens के लिए bill होते हैं, full max_tokens के लिए नहीं. ऐसी applications में जहाँ अक्सर एक लंबे output का सिर्फ़ पहला हिस्सा चाहिए, streaming + early cancel output token costs 40–70% तक कम कर सकता है.

-40%output tokens · high

जो tasks Sonnet अच्छी तरह handle कर लेता है, उन पर Opus मत use कीजिए

Opus की cost है $15 per million input tokens — Sonnet से 5x. Open-ended creative work और complex multi-step reasoning के लिए Opus और Sonnet की quality में significant फ़र्क है. लेकिन code tasks, structured output, और अधिकतर developer workflows के लिए Sonnet — एक-पाँचवीं price पर — Opus quality match कर लेता है. Opus को default बनाने से पहले benchmark कीजिए.

-80%vs Opus · benchmark first

// 3. Context management · 5 items

लंबे sessions के 50K tokens पार करने से पहले /compact चलाइए

Claude Code का /compact command session context को summarize करता है और उसे एक compressed version से replace कर देता है. एक 100K-token session 5K-token summary बन जाती है. Task continuity के लिए quality loss minimal होता है; cost saving significant. Active sessions पर हर 2 घंटे में चलाइए.

-80%context tokens · low

Claude को codebase explore करने देने के बजाय Grep और Read use कीजिए

जब Claude बिना direction के codebase explore करता है, तो वह context समझने के लिए कई files पढ़ता है. उसे पहले relevant files की ओर भेजना ("app/api/users.ts और User model पढ़ो") context को एक order of magnitude तक कम कर देता है. Claude से पढ़ने को कहने से पहले relevant files खोजने के लिए Grep use कीजिए.

-50%exploration tasks · medium

CLAUDE.md को 300 lines से कम रखिए

CLAUDE.md की हर line एक token है जो Claude Code session के हर message के साथ prepend होती है. एक 3,000-line CLAUDE.md हर turn में ~4,500 tokens जोड़ती है. एक 300-line CLAUDE.md ~450 tokens जोड़ती है. 3000-line CLAUDE.md वाली post समझाती है — coverage खोए बिना minimal token spend के लिए इसे कैसे structure करें.

-30%session tokens · medium

Subagent का tool access सिर्फ़ उतना रखें जितना उसे चाहिए

जिस subagent को सारे tools का access है, वह उन सबको use करेगा. जिस subagent के पास सिर्फ़ [Read, Grep] है, वह न Bash process spin up कर सकता है, न 10MB की log file context में load कर सकता है. Tool restriction एक साथ cost guardrail और security control दोनों है.

-25%per agent · low

Review agents को पूरी files नहीं, diffs भेजिए

Code review agent चलाते समय, पूरे file contents के बजाय git diff HEAD~1 का output भेजिए. 40 बदली lines वाली एक 2,000-line file — पूरी file भेजने पर 2,000 tokens, सिर्फ़ diff भेजने पर 200 tokens. Review workflows के लिए diff लगभग हमेशा काफ़ी होती है.

-90%review tasks · medium

// 4. Batch और async patterns · 5 items

हर non-time-sensitive workload के लिए Batch API use कीजिए

Anthropic की Batch API real-time API से per token 50% सस्ती है. Per batch 10,000 तक requests accept करती है, 24 घंटे में process. अगर आपके use case को 60 seconds से कम में response नहीं चाहिए, तो Batch API ही सही choice है. Document analysis, test generation, changelog writing — सब batch-eligible हैं.

-50%all tokens · medium

API पर भेजने से पहले requests deduplicate कीजिए

अगर आपकी application एक ही prompt दो बार भेज सकती है (एक जैसी user query, एक ही document analysis), तो API call करने से पहले request को एक local hash के against check कीजिए. (model + system_prompt + user_message) का SHA-256 hash duplicates की पहचान करता है. Response को hash के साथ key करके cache कीजिए. एक high-volume application में 5% duplicate rate महीने में significant savings है.

variablededup ratio · medium

एक जैसे requests को एक multi-part prompt में batch कीजिए

अगर आपको 20 documents पर एक ही operation करना है (summarize, classify, extract), तो एक multi-document request आम तौर पर 20 single-document requests से सस्ती पड़ती है — क्योंकि system prompt एक बार ही pay होता है. इसे अपने actual token math के against test कीजिए — बहुत बड़े batches context limits cross कर सकते हैं और वैसे भी split करना पड़ सकता है.

-25%batch overhead · medium

एक जैसी concurrent queries के लिए request coalescing implement कीजिए

High-traffic applications में कई users एक ही underlying API call को एक साथ trigger कर सकते हैं (एक ही report, एक ही analysis). Coalescing का मतलब: जब एक request in-flight है, तो अगली एक जैसी requests पहले response का wait करती हैं और उसे share करती हैं. यह आपके traffic spike patterns के अनुपात में API calls बचाता है.

variableconcurrent traffic · high

Batch API priority के लिए off-peak hours में batch jobs schedule कीजिए

Batch API processing time Anthropic के load के साथ बदलता है. Low-traffic hours (UTC 02:00–08:00) में batches submit करना आम तौर पर बिना additional cost के तेज़ completion देता है. 24-hour window वाले batches के लिए, midnight पर submit करना और सुबह तक results पाना एक reliable pattern है.

faster turnaroundscheduling · low

// 5. Guardrails और limits · 4 items

PreToolUse hooks के ज़रिए per-session और per-day budget limits set कीजिए

PreToolUse hook हर tool call से पहले चलता है. एक 30-line hook जो ~/.claude/projects/ से session की cumulative cost पढ़ता है और $10 cross करने पर halt कर देता है — Tokenocalypse scenarios को रोकता है. Hook आपकी machine छोड़ने से पहले API call पर fire होता है; इससे ज़्यादा soft enforcement point और कोई नहीं है.

prevents runawayhard cap · medium

सभी subagents पर max_turns limits set कीजिए

बिना max_turns limit वाला subagent अनिश्चित काल तक चल सकता है. अधिकतर agents पर max_turns: 10 set कीजिए, और simple, bounded tasks वाले agents पर max_turns: 5. 50 turns पर भागता subagent वही task पूरा करने में well-bounded वाले से 5-10x ज़्यादा cost देता है.

-60%runaway prevention · low

सिर्फ़ monthly totals नहीं, cost anomalies पर भी log और alert कीजिए

Monthly billing alerts Tokenocalypse events को तब पकड़ते हैं जब damage हो चुका होता है. Daily cost alerts (email या Slack webhook जब daily spend baseline का 2x cross करे) उन्हें time पर पकड़ते हैं ताकि intervene किया जा सके. Anthropic का console daily spend threshold alerts support करता है. इन्हें set कीजिए.

early warningmonitoring · low

Zombie sessions accumulate होने से पहले उन्हें kill कीजिए

एक Claude Code session जो खुला तो है लेकिन unattended है, फिर भी bill करता रहेगा जब कोई subagent tool call करेगा. claude sessions list से active sessions check कीजिए और जो आप actively use नहीं कर रहे, उन्हें kill कीजिए. कई developers के बीच shared machine पर zombie sessions एक significant और invisible cost source हैं.

variablesession hygiene · low

शुरुआत कहाँ से करें

अगर इस हफ़्ते आप इनमें से सिर्फ़ पाँच ही करने वाले हैं, तो ये कीजिए: 01 (prompt caching enable करें), 07 (classification tasks Haiku पर switch करें), 12 (नियमित /compact चलाएँ), 22 (budget limit hooks set करें), और 23 (हर agent पर max_turns set करें). ये पाँच highest-impact categories address करते हैं और मिलकर implement करने में दो घंटे से कम लगते हैं.

बाकी 20 items अगले महीने में पूरा करने लायक हैं. हर category से पहले और बाद में ccusage total चलाइए ताकि आपके workload पर असली impact माप सकें. इस post के numbers estimates हैं; आपकी actual savings आपके specific usage patterns पर निर्भर करेंगी.

Septim Drills: 47 exercises — इसमें hook configuration और cost guardrails शामिल

ऊपर वाले items 22 और 23 (PreToolUse hooks और max_turns) के लिए hook scripts और YAML agent configs लिखने पड़ते हैं. Septim Drills में 47 structured exercises हैं जो दोनों से गुज़रते हैं — production Claude Code workflows के असली examples के साथ. Pay once.

Septim Drills लें — $29 →