Cutting AI Coding Agent Token Burn 75%+
Tokens are dollars, context pressure, and cache-miss penalties. Here's a four-layer strategy I use to cut AI coding agent token usage by 75% or more across Claude Code, Gemini CLI, Cursor, and Aider.
Introduction
Token usage is the silent tax on AI-assisted development. Every verbose git status, every polite "Sure! I'd be happy to help" preamble, every redundant context reload eats your budget twice: once at the API meter, and again as context-window pressure that degrades reasoning quality.
After instrumenting my own workflow across Claude Code, Gemini CLI, Cursor, and Aider, I cut average session token burn by roughly 75% without losing technical accuracy. This post documents the four-layer strategy, the tools that implement each layer, and the cases where you should never compress.
Why Tokens Matter Beyond Cost
Cost is the obvious reason to care, but it is not the most important one. Three forces compound:
- Dollars: Premium models price between $3-15 per million input tokens and $15-75 per million output. A noisy session can burn $5-20 in minutes.
- Context window pressure: Even a 1M-token Opus or 2M-token Gemini window degrades reasoning as it fills. Models trained on shorter sequences attend less effectively to the middle of long contexts.
- Cache-miss penalty: Anthropic's prompt cache has a 5-minute TTL by default, refreshed each time the cached prefix is read. Idle past it and your next turn pays full input price plus latency. Gemini's default 1-hour TTL is more forgiving but still bounded.
Bloated input also produces bloated output. Models mirror the verbosity of their context. Trim aggressively at every layer and the agent itself becomes terser.
The Four-Layer Strategy
The mental model is simple: tokens flow through four checkpoints, and you can compress at each one.
| Layer | What flows | Compression target |
|---|---|---|
| Input | Tool output, file contents, shell results | Strip noise before model sees it |
| Output | Model response text | Drop articles, filler, hedging |
| Context | Cross-session state | Persist via memory files, never re-explain |
| Automation | Repetitive command rewrites | Hooks intercept transparently |
Each layer is independent. You can adopt one or all four. Gains stack multiplicatively.
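To see how per-layer gains compound, here is a rough sketch. It assumes each layer removes a fixed fraction of whatever tokens survive the previous layers, which is a simplification, and the 30% figures are hypothetical placeholders rather than measurements.

```python
from math import prod


def combined_reduction(layer_reductions):
    """Overall fraction of tokens removed when each layer removes a
    fraction of the tokens that survive the previous layers."""
    remaining = prod(1.0 - r for r in layer_reductions)
    return 1.0 - remaining


# Hypothetical per-layer reductions: input, output, context, automation.
example = [0.30, 0.30, 0.30, 0.30]
print(f"{combined_reduction(example):.1%}")  # prints 76.0%
```

Under this model no single layer has to do heroic work: four modest reductions are enough to cross the 75% mark.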
Layer 1 — Input Compression
The biggest single waste in most sessions is tool output. A single npm install can dump 2000+ lines of progress noise. git log --stat on a large repo, a verbose Cargo build, or a Maven dependency tree can each consume 5-10K tokens that contain almost no signal.
The Generic Approach
Wrap noisy commands. The simplest version is shell aliases:
alias gst='git status -sb'
alias gl='git log --oneline -20'
alias ni='npm install --no-audit --no-fund --silent'
This works but stops there. The AI agent still calls the underlying command directly, bypassing your alias.
Input Filtering Per Tool
Claude Code — RTK (Rust Token Killer): a CLI proxy that auto-filters common dev commands. Installed once, configured via Claude Code hooks, it transparently rewrites git status to rtk git status before execution. Typical savings: 60-90% on dev ops.
# Install verification
rtk --version
which rtk
# Analytics
rtk gain # Token savings summary
rtk gain --history # Command-by-command history
rtk discover # Scan transcripts for missed opportunities
rtk proxy <cmd> # Bypass filtering, raw output (debugging)
The hook-based rewrite is the key insight: the agent does not know it's being filtered. Zero prompt engineering required.
⚠️ Name collision warning: there is a separate reachingforthejack/rtk (Rust Type Kit) on crates.io. If rtk gain returns "command not found" after install, you likely have the wrong binary.
Aider: /run executes a shell command and offers to add its output to the chat, so anything you accept is sent to the model. Use --no-stream and explicit flags like git log --oneline rather than running unbounded commands. .aiderignore excludes paths from the repo map.
Cursor: .cursorrules can instruct the agent to prefer git status -s over git status. Cursor's terminal integration respects shell aliases when the agent reads output.
Gemini CLI: custom tool definitions in ~/.gemini/ can wrap shell calls with filtering logic. Less mature ecosystem than RTK, but the pattern is the same.
What to Filter
The high-value targets are universal:
- Package manager output (npm, yarn, pnpm, pip, cargo, maven)
- Build tools (webpack, vite, tsc, cargo build)
- Git operations on large repos (log, diff, blame on long files)
- Test runner verbose modes (jest, pytest, go test)
- Docker build output
A reasonable rule: if a command produces more than 100 lines and you only care about the last 20, filter it.
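That rule can be sketched as a small filter. This is a minimal illustration, not how RTK or any other tool works internally; the function name and the 100/20 defaults are my own choices, taken from the rule of thumb above.

```python
def filter_output(text: str, max_lines: int = 100, keep_last: int = 20) -> str:
    """Return text unchanged if it is short; otherwise keep only the
    last `keep_last` lines, preceded by a marker noting what was dropped."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    dropped = len(lines) - keep_last
    marker = f"[... {dropped} lines omitted ...]"
    return "\n".join([marker] + lines[-keep_last:])


# Simulated noisy build log with 500 lines.
noisy = "\n".join(f"line {i}" for i in range(500))
print(filter_output(noisy).splitlines()[0])  # prints [... 480 lines omitted ...]
```

Keeping the marker line matters: the model should know output was truncated so it can ask for the raw version when it needs it.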
Layer 2 — Output Compression
The model's response is the other direction of the same problem. Defaults across providers err on the side of being polite, thorough, and structured. That is excellent for end users and expensive for developers who already know the context.
Compression Pattern
Drop articles, filler, pleasantries, and hedging. Keep technical substance exact. Fragments are fine when unambiguous. Code blocks, error messages, and security warnings stay in full prose.
Before:
"Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by a stale token in the authentication middleware. Specifically, the expiration check appears to be using a less-than operator when it should be using less-than-or-equal."
After:
"Bug in auth middleware. Token expiry check uses < not <=. Fix at auth.ts:42."
Same information, roughly 75% fewer tokens.
Output Compression Per Tool
Claude Code — Caveman mode: a skill installed via ~/.claude/skills/. Activates per-session or globally. Supports intensity levels:
- lite: drop filler only
- full: fragments OK, short synonyms
- ultra: telegram-style, maximum compression
- wenyan-lite|full|ultra: Classical Chinese inspired variants
Trigger with /caveman or natural language ("be brief", "less tokens"). Auto-suspends for security warnings and destructive operation confirmations.
Cursor: rules file with explicit instructions:
You are talking to a senior engineer. Be terse.
No preamble. No trailing summaries. No hedging.
Code blocks normal. Errors quoted exact.
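Aider: --no-pretty only disables colorized terminal output, so it tidies logs but does not change what the model generates. For terse-mode instructions, load a conventions file read-only with --read. Note that --message-file sends a single message and then exits, so it is not a persistent system prompt.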
Gemini CLI: systemInstruction field on the request, plus thinkingBudget parameter on Gemini 2.5 to cap reasoning tokens. Structured output via responseSchema forces shorter completions when schema permits.
When to Suspend Compression
This list is universal across tools. Compression off for:
- Security warnings
- Destructive operation confirmations (rm -rf, DROP TABLE, force push)
- Multi-step sequences where fragment order risks misread
- Commit messages and PR descriptions
- Error messages quoted from the system
- Code review comments where nuance matters
Most compression skills auto-detect these cases. Verify that detection yourself before trusting it on production-affecting work.
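If you want your own guard rather than relying on a skill's auto-detection, a heuristic check might look like the sketch below. The pattern list is hypothetical and deliberately small; it covers only the examples named above and is not what any particular compression skill ships with.

```python
import re

# Hypothetical patterns for text that should never be compressed.
SUSPEND_PATTERNS = [
    r"\brm\s+-(?=\w*r)(?=\w*f)\w+",       # rm -rf, rm -fr, rm -Rf
    r"\bdrop\s+table\b",                  # SQL table drops
    r"\bgit\s+push\b.*(--force|-f\b)",    # force pushes
    r"\bsecurity\s+warning\b",            # explicit security warnings
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in SUSPEND_PATTERNS]


def should_suspend_compression(text: str) -> bool:
    """True if the text mentions something that deserves full prose."""
    return any(pattern.search(text) for pattern in _COMPILED)


print(should_suspend_compression("About to run rm -rf build/"))  # prints True
print(should_suspend_compression("git status -sb"))              # prints False
```

A regex list like this will miss cases, so treat it as a floor. When in doubt, fall back to full prose.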
Layer 3 — Persistent Memory
The third layer addresses a different waste: re-explaining yourself every session. If you tell the agent your role, preferences, and project context at the start of every conversation, you are paying for that context each time and contributing to cache fragmentation.
Memory Pattern
A persistent memory directory holds small topic files. An index file lists them. The agent loads the index every session and pulls specific files only when relevant.
The categories that pay off:
- User: role, expertise, communication preferences
- Feedback: rules learned from past corrections, with the why
- Project: ongoing work, deadlines, motivations not derivable from code
- Reference: pointers to external systems (Linear, Grafana, Slack channels)
What does not belong: code patterns, architecture, file paths, debugging recipes. Those are derivable from the current repository state. Memory is for the things git log and grep cannot tell you.
Memory Storage Per Tool
Claude Code: ~/.claude/projects/<project-slug>/memory/ with MEMORY.md as the index. Topic files use frontmatter:
---
name: feedback-no-coauthor-trailer
description: User dislikes Claude attribution in commits
metadata:
  type: feedback
---

Omit Co-Authored-By trailer from all commits in this project.

**Why:** User considers it noise; prefers clean conventional commits.
**How to apply:** Never append the trailer; this overrides default behavior.
MEMORY.md is one line per entry, under ~200 lines total to fit in startup context.
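Keeping the index in sync by hand gets tedious, so here is a minimal sketch of generating it from the topic files' frontmatter. The index line format, the function names, and the choice to raise past 200 lines are my assumptions, not a format Claude Code prescribes. The parser handles only flat key: value pairs and avoids a YAML dependency.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse `key: value` pairs from a leading --- block.
    Nested keys are flattened, and keys without a value are skipped."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def build_index(topic_files: dict, max_lines: int = 200) -> str:
    """Build one index line per topic file. `topic_files` maps a
    filename to that file's text; MEMORY.md itself is ignored."""
    entries = []
    for filename in sorted(topic_files):
        if filename == "MEMORY.md":
            continue
        meta = parse_frontmatter(topic_files[filename])
        name = meta.get("name", filename)
        entries.append(f"- {name} ({filename}): {meta.get('description', '')}")
    if len(entries) > max_lines:
        raise ValueError(f"index has {len(entries)} lines; prune below {max_lines}")
    return "\n".join(entries)
```

In practice you would fill topic_files by reading each .md file in the memory directory and write the result back to MEMORY.md.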
Cursor: .cursorrules at the repo root, or .cursor/rules/ with glob-scoped rule files. Cursor auto-applies rules whose globs match the active file.
Aider: .aider.conf.yml for persistent configuration plus the repo map, which Aider builds automatically from your codebase to inject relevant symbols on demand.
Continue.dev: .continuerc.json with context providers. The @codebase provider is the closest analog to a memory system, though it indexes code rather than preferences.
Gemini CLI: GEMINI.md at repo root, mirroring the CLAUDE.md pattern. Plus context caching API for explicit reuse of large stable prefixes.
The Stale-Memory Problem
Memory rots. A file path you saved three months ago may have moved. A flag you noted may have been removed. Before acting on a memory that names a specific function, file, or flag: verify it still exists. Treat memory as a hypothesis to check, not a fact to assert.
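The file-path part of that check is easy to automate. The sketch below is a rough heuristic under my own assumptions: it treats any token containing a slash and a file extension as a repo-relative path, which will produce both false positives and misses. Functions and flags still need a grep.

```python
import re
from pathlib import Path

# Heuristic: a run of path characters ending in ".ext".
PATH_PATTERN = re.compile(r"[\w./-]+\.\w+")


def stale_paths(memory_text: str, repo_root: Path) -> list:
    """Return path-like tokens in a memory note that no longer
    exist under repo_root, sorted alphabetically."""
    candidates = {m for m in PATH_PATTERN.findall(memory_text) if "/" in m}
    return sorted(c for c in candidates if not (repo_root / c).exists())


note = "Expiry check lives in src/auth.ts; legacy flow was in src/old/login.ts."
print(stale_paths(note, Path(".")))
```

Anything this returns is a prompt to re-verify or delete the memory entry, not proof that the note is wrong.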
Layer 4 — Hooks and Automation
The final layer is invisible. The other three layers all require some agent participation. Hooks intercept before the agent sees anything, making compression zero-overhead.
Claude Code Hooks
Defined in ~/.claude/settings.json. Two relevant event types:
{
  "hooks": {
    "UserPromptSubmit": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "rtk-rewrite-prompt" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "rtk-rewrite-bash" }
        ]
      }
    ]
  }
}
The PreToolUse hook rewrites git status to rtk git status before execution. The agent never sees the rewrite, and the user never sees the original. Pure transparent compression.
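For intuition, the decision logic inside a rewrite hook could look like the sketch below. This is not RTK's implementation. The allowlist is hypothetical, the event shape (tool_name, tool_input.command) reflects my understanding of what Claude Code passes to PreToolUse hooks on stdin, and the returned updated_command key is a placeholder for whatever output schema the hook documentation actually requires.

```python
import json
import shlex

# Hypothetical allowlist of commands worth routing through the proxy.
FILTERED_COMMANDS = {"git", "npm", "cargo", "pytest"}


def rewrite_command(command: str, proxy: str = "rtk") -> str:
    """Prefix an allowlisted command with the proxy; leave others alone."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: do not touch the command
        return command
    if not tokens or tokens[0] == proxy or tokens[0] not in FILTERED_COMMANDS:
        return command
    return f"{proxy} {command}"


def handle_event(event: dict) -> dict:
    """Return a rewrite for Bash tool calls, or {} to leave the call as-is."""
    if event.get("tool_name") != "Bash":
        return {}
    command = event.get("tool_input", {}).get("command", "")
    rewritten = rewrite_command(command)
    if rewritten == command:
        return {}
    return {"updated_command": rewritten}  # placeholder output shape


sample = {"tool_name": "Bash", "tool_input": {"command": "git status"}}
print(json.dumps(handle_event(sample)))  # prints {"updated_command": "rtk git status"}
```

Only the first token is inspected, so compound commands such as cd repo && git status pass through unfiltered. A real hook would also need to read the event from stdin and emit its answer in the documented format.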
Cross-Tool Equivalents
Cursor: rules with auto-attach globs achieve a similar effect for context injection, though there is no equivalent for command rewriting.
Aider: pre/post-commit hooks can run linters or filters, but the request-flow is not as openly extensible.
Gemini CLI: custom tool definitions allow pre-execution filtering. The plumbing is heavier than Claude Code hooks but functionally equivalent.
Generic shell layer: regardless of agent, direnv, shell aliases, and git hooks operate below the agent layer and apply universally.
Context Window Math Per Agent
The compression calculus shifts based on context window size and cache economics:
| Agent | Context | Cache TTL | Cache Discount |
|---|---|---|---|
| Claude Code | 200K (1M Opus) | 5 min | ~90% off |
| Gemini CLI | 1M-2M | 1 hr default, configurable | ~75% off |
| Cursor | per-model | per-model | per-model |
| Aider | per-model | per-model | per-model |
The practical implications:
- Claude Code: 5-minute cache TTL makes aggressive trimming critical. A 7-minute idle wipes your cache and your next turn pays full price.
- Gemini CLI: 1M-2M context tolerates more bloat, but per-token cost still matters. Longer cache TTL means stable prefixes pay off more.
- Cursor and Aider: model-agnostic, so the math depends on which provider you point them at.
Bigger context windows are not a substitute for compression. They just shift the failure mode from "context overflow" to "expensive context plus degraded reasoning in the middle."
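The cache economics are easier to feel with numbers. The sketch below prices the input side of one turn under simplifying assumptions: a flat per-token input price, a fixed discount on cached prefix reads, and no surcharge for writing the cache. The $3 per million price and the 100K-token prefix are illustrative values, not a quote from any price sheet.

```python
def turn_input_cost(prefix_tokens: int, new_tokens: int, price_per_mtok: float,
                    cache_discount: float, idle_minutes: float,
                    ttl_minutes: float) -> float:
    """Dollar cost of one turn's input. The stable prefix is billed at the
    discounted rate only if the turn arrives before the cache TTL expires."""
    cache_hit = idle_minutes < ttl_minutes
    prefix_rate = price_per_mtok * (1 - cache_discount) if cache_hit else price_per_mtok
    return (prefix_tokens * prefix_rate + new_tokens * price_per_mtok) / 1_000_000


# 100K-token stable prefix, 2K new tokens, $3/M input, 90% discount, 5-minute TTL.
warm = turn_input_cost(100_000, 2_000, 3.0, 0.90, idle_minutes=3, ttl_minutes=5)
cold = turn_input_cost(100_000, 2_000, 3.0, 0.90, idle_minutes=7, ttl_minutes=5)
print(f"warm: ${warm:.3f}  cold: ${cold:.3f}")  # prints warm: $0.036  cold: $0.306
```

Under these assumptions a single coffee break makes the next turn about 8.5 times more expensive, which is why a smaller prefix helps even when the cache is working.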
Measuring Savings
You cannot improve what you do not measure. Each tool has different instrumentation:
RTK (Claude Code):
rtk gain # Aggregate savings
rtk gain --history # Per-command breakdown
rtk discover # Scan transcripts for missed opportunities
Anthropic API direct: response includes usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, usage.cache_creation_input_tokens. Track cache hit rate over time.
Gemini API: response includes usageMetadata with promptTokenCount, candidatesTokenCount, cachedContentTokenCount, totalTokenCount. Cached tokens billed at 25% of normal rate.
Generic: tiktoken (OpenAI's tokenizer, so only an approximation for other providers' models) or Anthropic's count_tokens API can measure a prompt before sending. Useful for budgeting individual requests.
The metrics that matter:
- Input-to-output ratio: if input is 50x your output, you are over-loading context. Target 5-10x.
- Cache hit rate: above 70% means your prefix is stable. Below 30% means you are re-paying for the same context repeatedly.
- Tokens per task completion: track end-to-end. A "cheap" session that never finishes the task is more expensive than an expensive session that ships.
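The first two metrics can be computed from the Anthropic usage fields listed above. This sketch assumes input_tokens counts only uncached input, so total input is the sum of the three input fields, and it defines cache hit rate as cached reads over total input. Both are my working definitions, so adjust them if your provider reports usage differently.

```python
def session_metrics(usage: dict) -> dict:
    """Compute input-to-output ratio and cache hit rate from a usage dict."""
    fresh = usage.get("input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
    cache_write = usage.get("cache_creation_input_tokens", 0)
    output = usage.get("output_tokens", 0)
    total_input = fresh + cache_read + cache_write
    return {
        "input_to_output_ratio": total_input / output if output else float("inf"),
        "cache_hit_rate": cache_read / total_input if total_input else 0.0,
    }


# Hypothetical usage totals for one session.
usage = {
    "input_tokens": 2_000,
    "cache_read_input_tokens": 40_000,
    "cache_creation_input_tokens": 8_000,
    "output_tokens": 5_000,
}
print(session_metrics(usage))  # prints {'input_to_output_ratio': 10.0, 'cache_hit_rate': 0.8}
```

Summing the fields across every response in a session, rather than looking at a single turn, gives a steadier picture of both numbers.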
When NOT to Compress
The four-layer strategy assumes you are doing routine engineering work. Some categories deserve their full prose:
- Security reviews: nuance, hedging, and explicit threat modeling are the work. Compression hides risk.
- Destructive operations: confirmations for rm -rf, force push, database drops, secret rotation. Verbose by design.
- Code review comments: the audience is a teammate, not you. Standard prose, standard tone.
- Commit messages and PR descriptions: written once, read many times by future you and others. Optimize for clarity, not tokens.
- Documentation and onboarding: the audience needs context you have already internalized.
- Multi-step destructive sequences: when fragment ordering could be misread, write normal prose.
A good rule: if a human other than you will read the output, default to standard prose. Compress only when you are the sole audience.
A Realistic Workflow
The compounding effect across layers is the point. A single optimization at one layer might cut 20% of tokens; all four together routinely cross 75%.
A typical optimized session on Claude Code:
- RTK hook rewrites verbose commands transparently (input layer)
- Caveman mode keeps responses terse (output layer)
- Memory directory loads project context once, persists across sessions (context layer)
- Hooks apply both rewrites automatically (automation layer)
Result: a session that used to burn 200K tokens now burns 40-50K, with no measurable loss in technical quality.
Across other tools the names change but the strategy holds: filter input, compress output, persist memory, automate the boring parts.
Closing
Token efficiency is not about being cheap. It is about preserving the context window for actual work, keeping the cache warm so latency stays low, and reducing the cognitive overhead of wading through verbose model output.
The investment is hours, not days. Install one CLI proxy, write one rules file, set up one memory directory. The savings start on the next session and compound from there.
Repo links:
- RTK (Rust Token Killer): GitHub
- Caveman skill: ~/.claude/skills/caveman/
- Cursor rules examples: .cursorrules in any well-tuned repo
- Aider config: .aider.conf.yml examples in Aider docs
- Gemini CLI: GEMINI.md pattern
For the production-readiness counterpart to this post, see AI Coding CLI Workflow: From Prompt Chaos to Engineering Rigor. That post covers what to do once your tokens are under control.
Written by Erik Yuntantyo, Software Engineer