Cutting AI Coding Agent Token Burn 75%+

Introduction

Token usage is the silent tax on AI-assisted development. Every verbose git status, every polite "Sure! I'd be happy to help" preamble, every redundant context reload eats your budget twice: once at the API meter, and again as context-window pressure that degrades reasoning quality.

After instrumenting my own workflow across Claude Code, Gemini CLI, Cursor, and Aider, I cut average session token burn by roughly 75% without losing technical accuracy. This post documents the four-layer strategy, the tools that implement each layer, and the cases where you should never compress.

Why Tokens Matter Beyond Cost

Cost is the obvious reason to care, but it is not the most important one. Three forces compound:

Dollars: Premium models charge roughly $3-15 per million input tokens and $15-75 per million output tokens. A noisy session can burn $5-20 in minutes.
Context window pressure: Even a 1M-token Opus or ~1M-token Gemini window degrades reasoning as it fills. Models trained on shorter sequences attend less effectively to the middle of long contexts.
Cache-miss penalty: Anthropic's prompt cache has a 5-minute default TTL (a longer option exists). Idle past it and your next turn pays full input price plus latency. Gemini's default 1-hour TTL is more forgiving but still bounded.
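To make the cache-miss penalty concrete, here is a rough sketch of one turn's input cost with a warm versus a cold cache. The price and the 90% cached-read discount are illustrative assumptions rather than a quoted rate card, and the sketch ignores any surcharge for writing the cache.

```python
# Illustrative numbers only (assumptions, not a quoted rate card).
INPUT_PRICE_PER_MTOK = 3.00   # dollars per million uncached input tokens
CACHE_READ_DISCOUNT = 0.90    # cached reads assumed to be ~90% cheaper


def turn_input_cost(prefix_tokens: int, new_tokens: int, cache_warm: bool) -> float:
    """Dollar cost of one turn's input, with the stable prefix cached or not."""
    per_token = INPUT_PRICE_PER_MTOK / 1_000_000
    if cache_warm:
        # A warm cache bills the prefix at the discounted read rate.
        prefix_cost = prefix_tokens * per_token * (1 - CACHE_READ_DISCOUNT)
    else:
        # A cold cache bills the whole prefix at full input price.
        prefix_cost = prefix_tokens * per_token
    return prefix_cost + new_tokens * per_token


warm = turn_input_cost(150_000, 2_000, cache_warm=True)
cold = turn_input_cost(150_000, 2_000, cache_warm=False)
print(f"warm: ${warm:.3f}  cold: ${cold:.3f}  ratio: {cold / warm:.1f}x")
```

Under these assumptions a single idle gap makes the next turn roughly nine times more expensive on input, which is why the size of the stable prefix matters as much as the TTL.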

Bloated input also produces bloated output. Models mirror the verbosity of their context. Trim aggressively at every layer and the agent itself becomes terser.

The Four-Layer Strategy

The mental model is simple: tokens flow through four checkpoints, and you can compress at each one.

Layer	What flows	Compression target
Input	Tool output, file contents, shell results	Strip noise before model sees it
Output	Model response text	Drop articles, filler, hedging
Context	Cross-session state	Persist via memory files, never re-explain
Automation	Repetitive command rewrites	Hooks intercept transparently

Each layer is independent. You can adopt one or all four. Gains stack multiplicatively.
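A quick sketch of what "stack multiplicatively" means in practice. The per-layer percentages are hypothetical, and the model is simplified: it treats every layer as cutting a fraction of whatever the previous layers left, even though input and output are separate token streams in a real session.

```python
from functools import reduce


def combined_reduction(per_layer_cuts: list[float]) -> float:
    """Overall fraction saved when each layer removes a fraction of what remains."""
    remaining = reduce(lambda acc, cut: acc * (1.0 - cut), per_layer_cuts, 1.0)
    return 1.0 - remaining


# Hypothetical per-layer cuts: input, output, context, automation.
cuts = [0.40, 0.30, 0.25, 0.20]
print(f"{combined_reduction(cuts):.0%}")
```

Four equal 20% cuts come to about 59% under this model, so reaching 75% needs at least one layer (usually input filtering) to cut substantially more than the others.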

Layer 1 — Input Compression

The biggest single waste in most sessions is tool output. A single npm install can dump 2000+ lines of progress noise. git log --stat on a large repo, a verbose Cargo build, or a Maven dependency tree can each consume 5-10K tokens that contain almost no signal.

The Generic Approach

Wrap noisy commands. The simplest version is shell aliases:

alias gst='git status -sb'
alias gl='git log --oneline -20'
alias ni='npm install --no-audit --no-fund --silent'

This helps when you type the commands yourself, but no further: the AI agent still invokes the underlying command directly, bypassing your aliases.

Input Filtering Per Tool

Claude Code — RTK (Rust Token Killer): a CLI proxy that auto-filters common dev commands. Installed once, configured via Claude Code hooks, it transparently rewrites git status to rtk git status before execution. Typical savings: 60-90% on dev ops.

# Install verification
rtk --version
which rtk

# Analytics
rtk gain              # Token savings summary
rtk gain --history    # Command-by-command history
rtk discover          # Scan transcripts for missed opportunities
rtk proxy <cmd>       # Bypass filtering, raw output (debugging)

The hook-based rewrite is the key insight: the agent does not know it's being filtered. Zero prompt engineering required.

⚠️ Name collision warning: there is a separate reachingforthejack/rtk (Rust Type Kit) on crates.io. If rtk gain returns "command not found" after install, you likely have the wrong binary.

Aider: output from /run can be added to the chat, where it counts as input. Prefer bounded commands such as git log --oneline over unbounded ones. Note that --no-stream only changes how responses are displayed, not how many tokens they use. .aiderignore excludes paths from the repo map.

Cursor: .cursorrules can instruct the agent to prefer git status -s over git status. Cursor's terminal integration respects shell aliases when the agent reads output.

Gemini CLI: custom tool definitions in ~/.gemini/ can wrap shell calls with filtering logic. Less mature ecosystem than RTK, but the pattern is the same.

What to Filter

The high-value targets are universal:

Package manager output (npm, yarn, pnpm, pip, cargo, maven)
Build tools (webpack, vite, tsc, cargo build)
Git operations on large repos (log, diff, blame on long files)
Test runner verbose modes (jest, pytest, go test)
Docker build output

A reasonable rule: if a command produces more than 100 lines and you only care about the last 20, filter it.
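That rule of thumb is easy to sketch as a filter. This is a minimal illustration, not how RTK or any other tool implements it; the function name and the omission marker are my own.

```python
def filter_output(text: str, threshold: int = 100, keep: int = 20) -> str:
    """Return short output unchanged; otherwise keep only the last `keep` lines."""
    lines = text.splitlines()
    if len(lines) <= threshold:
        return text
    dropped = len(lines) - keep
    # Leave a marker so the model knows output was trimmed, not missing.
    return "\n".join([f"[... {dropped} lines omitted ...]"] + lines[-keep:])


# Simulated noisy command output: 2000 progress lines.
noisy = "\n".join(f"progress {i}" for i in range(2000))
print(filter_output(noisy))
```

Keeping the tail suits installers and builds, where the summary and errors land at the end. For test runners you would more likely keep failures wherever they appear.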

Layer 2 — Output Compression

The model's response is the other direction of the same problem. Defaults across providers err on the side of being polite, thorough, and structured. That is excellent for end users and expensive for developers who already know the context.

Compression Pattern

Drop articles, filler, pleasantries, and hedging. Keep technical substance exact. Fragments are fine when unambiguous. Code blocks, error messages, and security warnings stay in full prose.

Before:

"Sure! I'd be happy to help you with that. The issue you're experiencing is likely caused by a stale token in the authentication middleware. Specifically, the expiration check appears to be using a less-than operator when it should be using less-than-or-equal."

After:

"Bug in auth middleware. Token expiry check uses < not <=. Fix at auth.ts:42."

Same information, roughly 75% fewer tokens.
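To sanity-check a before/after pair like this one, a crude word count is enough to see the order of magnitude. The sketch below uses whitespace-separated words as a stand-in for tokens, which is an assumption; a real tokenizer will give a different figure.

```python
def approx_tokens(text: str) -> int:
    """Crude proxy for token count: whitespace-separated words."""
    return len(text.split())


BEFORE = (
    "Sure! I'd be happy to help you with that. "
    "The issue you're experiencing is likely caused by a stale token in the "
    "authentication middleware. "
    "Specifically, the expiration check appears to be using a less-than "
    "operator when it should be using less-than-or-equal."
)
AFTER = "Bug in auth middleware. Token expiry check uses < not <=. Fix at auth.ts:42."

saved = 1 - approx_tokens(AFTER) / approx_tokens(BEFORE)
print(f"{approx_tokens(BEFORE)} -> {approx_tokens(AFTER)} words, {saved:.0%} saved")
```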

Output Compression Per Tool

Claude Code — Caveman mode: a skill installed via ~/.claude/skills/. Activates per-session or globally. Supports intensity levels:

lite — drop filler only
full — fragments OK, short synonyms
ultra — telegram-style, maximum compression
wenyan-lite|full|ultra — Classical Chinese inspired variants

Trigger with /caveman or natural language ("be brief", "less tokens"). Auto-suspends for security warnings and destructive operation confirmations.

Cursor: rules file with explicit instructions:

You are talking to a senior engineer. Be terse.
No preamble. No trailing summaries. No hedging.
Code blocks normal. Errors quoted exact.

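Aider: --no-pretty turns off colorized terminal output, which tidies logs but does not change what the model generates. For terse-mode instructions, load a conventions file with --read. --message-file sends a single message and exits rather than setting a system prompt.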

Gemini CLI: systemInstruction field on the request, plus thinkingBudget parameter on Gemini 2.5 to cap reasoning tokens. Structured output via responseSchema forces shorter completions when schema permits.

When to Suspend Compression

This list is universal across tools. Compression off for:

Security warnings
Destructive operation confirmations (rm -rf, DROP TABLE, force push)
Multi-step sequences where fragment order risks misread
Commit messages and PR descriptions
Error messages quoted from the system
Code review comments where nuance matters

Most compression skills auto-detect these cases. Verify that yours does before trusting it on production-affecting work.
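If you want a check of your own rather than relying on a skill's auto-detection, a pattern match over the pending message is a reasonable starting point. The patterns below are hypothetical examples covering a few items from the list above; they are not taken from any particular tool.

```python
import re

# Hypothetical patterns; extend for your own stack.
SUSPEND_PATTERNS = [
    r"\brm\s+-(rf|fr)\b",                  # recursive force delete
    r"\bdrop\s+(table|database)\b",        # destructive SQL
    r"\bgit\s+push\b.*(--force|\s-f\b)",   # force push
    r"\bsecurity\s+warning\b",             # explicit security warning text
]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in SUSPEND_PATTERNS]


def should_suspend_compression(text: str) -> bool:
    """True when the text mentions something that deserves full prose."""
    return any(pattern.search(text) for pattern in _COMPILED)


print(should_suspend_compression("About to run: git push --force origin main"))
print(should_suspend_compression("git status -sb"))
```

A false positive only costs a few extra tokens, while a false negative can hide a destructive step, so it pays to err toward matching too much.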

Layer 3 — Persistent Memory

The third layer addresses a different waste: re-explaining yourself every session. If you tell the agent your role, preferences, and project context at the start of every conversation, you are paying for that context each time and contributing to cache fragmentation.

Memory Pattern

A persistent memory directory holds small topic files. An index file lists them. The agent loads the index every session and pulls specific files only when relevant.

The categories that pay off:

User: role, expertise, communication preferences
Feedback: rules learned from past corrections, with the why
Project: ongoing work, deadlines, motivations not derivable from code
Reference: pointers to external systems (Linear, Grafana, Slack channels)

What does not belong: code patterns, architecture, file paths, debugging recipes. Those are derivable from the current repository state. Memory is for the things git log and grep cannot tell you.

Memory Storage Per Tool

Claude Code: ~/.claude/projects/<project-slug>/memory/ with MEMORY.md as the index. Topic files use frontmatter:

---
name: feedback-no-coauthor-trailer
description: User dislikes Claude attribution in commits
metadata:
  type: feedback
---

Omit Co-Authored-By trailer from all commits in this project.

**Why:** User considers it noise; prefers clean conventional commits.
**How to apply:** Never append the trailer; this overrides default behavior.

MEMORY.md is one line per entry, under ~200 lines total to fit in startup context.
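A small sketch of how an index line could be derived from a topic file's frontmatter. The parser is deliberately minimal (top-level key: value pairs only, standard library only), and the "- filename: description" index format is my assumption. All that is specified here is one line per entry and a cap of about 200 lines.

```python
MAX_INDEX_LINES = 200  # rough cap so the index fits in startup context


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse top-level `key: value` pairs between the first two `---` lines.

    Nested keys (indented lines) and keys without a value are skipped.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line.startswith((" ", "\t")) or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.strip():
            fields[key.strip()] = value.strip()
    return fields


def index_line(filename: str, fields: dict[str, str]) -> str:
    """One index entry per topic file (hypothetical format)."""
    return f"- {filename}: {fields.get('description', '(no description)')}"


topic = """---
name: feedback-no-coauthor-trailer
description: User dislikes Claude attribution in commits
metadata:
  type: feedback
---

Omit Co-Authored-By trailer from all commits in this project.
"""

entries = [index_line("feedback-no-coauthor-trailer.md", parse_frontmatter(topic))]
print("\n".join(entries))
print("index within budget:", len(entries) <= MAX_INDEX_LINES)
```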

Cursor: .cursorrules at the repo root, or .cursor/rules/ with glob-scoped rule files. Cursor auto-applies rules whose globs match the active file.

Aider: .aider.conf.yml for persistent configuration plus the repo map, which Aider builds automatically from your codebase to inject relevant symbols on demand.

Continue.dev: .continuerc.json with context providers. The @codebase provider is the closest analog to a memory system, though it indexes code rather than preferences.

Gemini CLI: GEMINI.md at repo root, mirroring the CLAUDE.md pattern. The context caching API adds explicit reuse of large stable prefixes.

The Stale-Memory Problem

Memory rots. A file path you saved three months ago may have moved. A flag you noted may have been removed. Before acting on a memory that names a specific function, file, or flag: verify it still exists. Treat memory as a hypothesis to check, not a fact to assert.
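The "verify it still exists" step can be partly mechanical. Below is a minimal sketch that pulls path-like tokens out of a memory note and reports the ones missing from the repository. The regular expression is a rough assumption about what a path looks like and will miss or over-match in edge cases; functions and flags would need a grep-style check instead.

```python
import re
import tempfile
from pathlib import Path

# Rough pattern for relative, path-like tokens such as src/auth.ts (an assumption).
PATH_RE = re.compile(r"\b[\w.-]+(?:/[\w.-]+)+\b")


def stale_paths(memory_text: str, repo_root: Path) -> list[str]:
    """Return path-like tokens in a memory note that no longer exist under repo_root."""
    candidates = sorted(set(PATH_RE.findall(memory_text)))
    return [p for p in candidates if not (repo_root / p).exists()]


# Demo against a throwaway directory standing in for a repository.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src").mkdir()
    (root / "src" / "auth.ts").write_text("")
    note = "Expiry check lives in src/auth.ts; old helper was in lib/tokens/expiry.ts."
    print(stale_paths(note, root))
```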

Layer 4 — Hooks and Automation

The final layer is invisible. The other three layers all require some agent participation. Hooks intercept before the agent sees anything, making compression zero-overhead.

Claude Code Hooks

Defined in ~/.claude/settings.json. Two relevant event types:

{
  "hooks": {
    "UserPromptSubmit": [
      {
        "matcher": "*",
        "hooks": [
          { "type": "command", "command": "rtk-rewrite-prompt" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "rtk-rewrite-bash" }
        ]
      }
    ]
  }
}

The PreToolUse hook rewrites git status to rtk git status before execution. The agent never sees the rewrite, and the user never sees the original. Pure transparent compression.
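For a sense of what such a rewrite hook does internally, here is a sketch of the core logic. It is not RTK's actual hook. The command list is hypothetical, and the event and response field names (tool_name, tool_input, hookSpecificOutput, updatedInput) reflect my understanding of the Claude Code hook schema, so check them against the current hooks documentation before relying on them.

```python
import json

# Commands worth routing through the proxy (hypothetical list).
REWRITE_PREFIXES = ("git status", "git log", "git diff", "npm install")


def rewrite_command(command: str) -> str:
    """Prefix known-noisy commands with `rtk`; leave everything else alone."""
    stripped = command.lstrip()
    if stripped.startswith("rtk "):
        return command  # already proxied
    if any(stripped == p or stripped.startswith(p + " ") for p in REWRITE_PREFIXES):
        return "rtk " + stripped
    return command


def handle(raw_event: str) -> str:
    """Map a PreToolUse event (JSON text) to a hook response (JSON text, or "")."""
    event = json.loads(raw_event)
    if event.get("tool_name") != "Bash":
        return ""
    tool_input = event.get("tool_input", {})
    original = tool_input.get("command", "")
    rewritten = rewrite_command(original)
    if rewritten == original:
        return ""  # nothing to change; let the call through untouched
    # Response shape is an assumption about the hook schema.
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "allow",
            "updatedInput": {**tool_input, "command": rewritten},
        }
    })


sample = json.dumps({"tool_name": "Bash", "tool_input": {"command": "git log --stat"}})
print(handle(sample))
```

Wired up as a hook command, the script would read the event from standard input and print the result of handle. Keeping the rewrite in a pure function makes it easy to test without launching the agent.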

Cross-Tool Equivalents

Cursor: rules with auto-attach globs achieve a similar effect for context injection, though there is no equivalent for command rewriting.

Aider: pre/post-commit hooks can run linters or filters, but the request-flow is not as openly extensible.

Gemini CLI: custom tool definitions allow pre-execution filtering. The plumbing is heavier than Claude Code hooks but functionally equivalent.

Generic shell layer: regardless of agent, direnv, shell aliases, and git hooks operate below the agent layer and apply universally.

Context Window Math Per Agent

The compression calculus shifts based on context window size and cache economics:

Agent	Context	Cache TTL	Cache Discount
Claude Code	200K (1M Opus)	5 min	~90% off
Gemini CLI	~1M	1 hr default, configurable	~75% off
Cursor	per-model	per-model	per-model
Aider	per-model	per-model	per-model

The practical implications:

Claude Code: 5-minute cache TTL makes aggressive trimming critical. A 7-minute idle wipes your cache and your next turn pays full price.
Gemini CLI: the ~1M context tolerates more bloat, but per-token cost still matters. Longer cache TTL means stable prefixes pay off more.
Cursor and Aider: model-agnostic, so the math depends on which provider you point them at.

Bigger context windows are not a substitute for compression. They just shift the failure mode from "context overflow" to "expensive context plus degraded reasoning in the middle."

Measuring Savings

You cannot improve what you do not measure. Each tool has different instrumentation:

RTK (Claude Code):

rtk gain              # Aggregate savings
rtk gain --history    # Per-command breakdown
rtk discover          # Scan transcripts for missed opportunities

Anthropic API direct: response includes usage.input_tokens, usage.output_tokens, usage.cache_read_input_tokens, usage.cache_creation_input_tokens. Track cache hit rate over time.

Gemini API: response includes usageMetadata with promptTokenCount, candidatesTokenCount, cachedContentTokenCount, totalTokenCount. Cached tokens billed at 25% of normal rate.

Generic: tiktoken (OpenAI tokenizer) or Anthropic's count_tokens API can measure a prompt before sending. Useful for budgeting individual requests.

The metrics that matter:

Input-to-output ratio: if input is 50x your output, you are over-loading context. Target 5-10x.
Cache hit rate: above 70% means your prefix is stable. Below 30% means you are re-paying for the same context repeatedly.
Tokens per task completion: track end-to-end. A "cheap" session that never finishes the task is more expensive than an expensive session that ships.
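The first two metrics can be computed directly from the Anthropic usage fields named above. This sketch assumes that input_tokens counts only uncached input, so total input is the sum of the three input fields; confirm that against the API documentation for your SDK version.

```python
def session_metrics(usage: dict[str, int]) -> dict[str, float]:
    """Compute input-to-output ratio and cache hit rate from a usage record."""
    fresh = usage.get("input_tokens", 0)
    cache_read = usage.get("cache_read_input_tokens", 0)
    cache_write = usage.get("cache_creation_input_tokens", 0)
    output = usage.get("output_tokens", 0)
    total_input = fresh + cache_read + cache_write
    return {
        "input_to_output_ratio": total_input / output if output else float("inf"),
        "cache_hit_rate": cache_read / total_input if total_input else 0.0,
    }


# Made-up usage record for illustration.
usage = {
    "input_tokens": 2_000,
    "output_tokens": 1_000,
    "cache_read_input_tokens": 7_000,
    "cache_creation_input_tokens": 1_000,
}
print(session_metrics(usage))
```

Summing these per session rather than per request gives a steadier picture, since the first request of a session always pays to create the cache.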

When NOT to Compress

The four-layer strategy assumes you are doing routine engineering work. Some categories deserve their full prose:

Security reviews: nuance, hedging, and explicit threat modeling are the work. Compression hides risk.
Destructive operations: confirmations for rm -rf, force push, database drops, secret rotation. Verbose by design.
Code review comments: the audience is a teammate, not you. Standard prose, standard tone.
Commit messages and PR descriptions: written once, read many times by future you and others. Optimize for clarity, not tokens.
Documentation and onboarding: the audience needs context you have already internalized.
Multi-step destructive sequences: when fragment ordering could be misread, write normal prose.

A good rule: if a human other than you will read the output, default to standard prose. Compress only when you are the sole audience.

A Realistic Workflow

The compounding effect across layers is the point. A single optimization at one layer might cut 20% of tokens; all four together routinely cross 75%.

A typical optimized session on Claude Code:

RTK hook rewrites verbose commands transparently (input layer)
Caveman mode keeps responses terse (output layer)
Memory directory loads project context once, persists across sessions (context layer)
Hooks apply both rewrites automatically (automation layer)

Result: a session that used to burn 200K tokens now burns 40-50K, with no measurable loss in technical quality.

Across other tools the names change but the strategy holds: filter input, compress output, persist memory, automate the boring parts.

Closing

Token efficiency is not about being cheap. It is about preserving the context window for actual work, keeping the cache warm so latency stays low, and reducing the cognitive overhead of wading through verbose model output.

The investment is hours, not days. Install one CLI proxy, write one rules file, set up one memory directory. The savings start on the next session and compound from there.

Repo links:

RTK (Rust Token Killer): GitHub
Caveman skill: ~/.claude/skills/caveman/
Cursor rules examples: .cursorrules in any well-tuned repo
Aider config: .aider.conf.yml examples in Aider docs
Gemini CLI: GEMINI.md pattern

For the production-readiness counterpart to this post, see AI Coding CLI Workflow: From Prompt Chaos to Engineering Rigor. That post covers what to do once your tokens are under control. And for the mindset underneath all of it, see why I treat AI as agile assistance, not autopilot.