AI / Workflow / Engineering / Productivity

AI Coding CLI Workflow: From Prompt Chaos to Engineering Rigor

AI CLIs like Claude Code, Gemini CLI, and Cursor accelerate delivery, but they default to happy-path code. Here's the workflow I use to wrap them in production-grade engineering discipline.

9 min read

Introduction

Terminal-based AI coding agents like Claude Code, Gemini CLI, Qwen CLI, and Cursor accelerate development by translating natural language into working code. But they share a predictable flaw: they optimize for speed and completeness over safety. Without guardrails, they generate happy-path implementations that skip error handling, omit test coverage, ignore observability, and bypass rollback strategies.

The solution is not to restrict the AI, but to structure the interaction. This workflow enforces engineering discipline at every stage, ensuring AI output meets production standards before it reaches version control.

Initial Setup: Anchor AI Behavior & Tool Syntax

Most CLI agents support auto-loading project context files at the start of each session (verify your tool's version for exact behavior). To anchor the AI, place one of the following in your repository root:

  • CLAUDE.md (Claude Code)
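  • .cursorrules (Cursor; newer versions also read rule files under .cursor/rules/)
  • GEMINI.md (Gemini CLI default), QWEN.md (Qwen CLI default), or AGENTS.md (custom agents and other tools configured to read it)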

Include your tech stack, coding standards, and architectural preferences in the file. This prevents repetitive prompting and ensures consistent behavior across fresh sessions.
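A quick way to confirm that a repository is actually anchored is to check for these files before starting a session. The sketch below is illustrative only: `find_context_files` and the `CONTEXT_FILES` tuple are hypothetical names, and the file list is an assumption you should adjust to whatever your tool and version load.

```python
from pathlib import Path

# File names taken from the list above; your tool or version may expect others.
CONTEXT_FILES = ("CLAUDE.md", ".cursorrules", "AGENTS.md")


def find_context_files(repo_root: str) -> list[str]:
    """Return the known context files that exist and are non-empty in repo_root."""
    root = Path(repo_root)
    found = []
    for name in CONTEXT_FILES:
        path = root / name
        # An empty file anchors nothing, so treat it as missing.
        if path.is_file() and path.stat().st_size > 0:
            found.append(name)
    return found
```

An empty result is a signal to create the context file before prompting, rather than repeating the same instructions in every session.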

💡 Pro tip: Paste the baseline directive below directly into your context file and omit it from individual phase prompts. This saves ~150 tokens/turn and prevents duplication drift.

Baseline Directive for Context File:

You are a senior production engineer. Always prioritize security, observability, and rollback safety over speed. Never skip error handling, tests, or structured logging. Output diffs, not raw files. Ask clarifying questions before assuming edge cases.

Tool-Specific Execution Notes:

  • Claude Code: Use plan mode (/plan or Shift+Tab, depending on your version) to lock the agent into design mode. Reference files with @path to inject exact context.
  • Cursor: Use @workspace or @file to scope the agent. Enable Agent Mode for iterative file generation.
  • Gemini CLI / Qwen CLI: Use inline @ references (for example, @PLAN.md) to load files into the prompt. Flags for context files, strict modes, and sandboxing vary by tool and version, so check your CLI's --help output before relying on one.

Required Upstream Artifacts:

The workflow assumes pre-existing or stakeholder-provided reference material — user stories, acceptance criteria, business rules, regulatory constraints, and any existing schema or API contracts. Place these in a dedicated directory (e.g., docs/references/) and pass them by @path when prompting Phase 1.

The AI synthesizes PLAN.md from architectural intent plus these upstream references. It should not invent requirements, but when a reference is missing the agent will guess, and those guesses surface as gaps during the Phase 2 audit. Providing the inputs up front is cheaper than backfilling them later.

Typical reference set:

  • Product requirements document (PRD) or feature brief
  • User stories with acceptance criteria
  • Business rules and compliance constraints (PCI, HIPAA, GDPR, UU PDP, etc.)
  • Existing schema, API contracts, or integration boundaries (when extending a system)
  • Style guide or design tokens (when UI is in scope)

Tiered Workflow Paths

Not every system requires the same rigor. Select a path based on impact and risk:

| Path | Scope | Phases Required | Approval Gate |
|---|---|---|---|
| Full | Customer-facing, financial, or data-critical systems | 1 → 2 → 3 → 4 → 5 → 6 | Security scans, ≥80% coverage, peer review, rollback dry-run |
| Lite | Internal tools, admin dashboards, low-risk MVPs | 1 → 3 → 4 → 5 → 6 | Lint pass, basic unit tests, peer review |
| Emergency | Hotfixes, incident mitigation | 3 → 5 → 6 (compressed) | Targeted test, security scan, post-incident review |
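The path table can also live in code so that scripts and checklists agree on which phases apply. This is a minimal sketch: the phase sequences are copied from the table, while `select_path`, its three boolean inputs, and the order in which they are checked are assumptions made for illustration.

```python
# Phase sequences copied from the tiered-path table.
PATH_PHASES = {
    "full": [1, 2, 3, 4, 5, 6],
    "lite": [1, 3, 4, 5, 6],
    "emergency": [3, 5, 6],
}


def select_path(customer_facing: bool, data_critical: bool, is_hotfix: bool) -> str:
    """Pick a workflow path. The decision order here is an assumption."""
    if is_hotfix:
        return "emergency"
    if customer_facing or data_critical:
        return "full"
    return "lite"


def phases_for(path: str) -> list[int]:
    """Return the ordered phase numbers required for a path."""
    return PATH_PHASES[path]
```

For example, `phases_for(select_path(False, False, False))` yields the Lite sequence, which skips the Phase 2 audit.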

Phase 1: Architecture & Planning

Define the system before generating code. AI performs best when given explicit boundaries, compliance requirements, and a clear separation between design and implementation.

We are building a production-ready application.
Stack: [specify]. Constraints: [specify]. Compliance/Security: [if applicable].

Generate a PLAN.md covering:
1. Architecture and core components
2. Database schema and migration strategy
3. Authentication and authorization pattern
4. Error handling and structured logging
5. Testing pyramid (unit, integration, e2e)
6. CI/CD and deployment strategy
7. Observability (metrics, tracing, alerting)
8. Known risks and rollback plan

Do not write code. Focus on technical design that can be audited. Output in Markdown format.

Phase 2: Critical Audit

Open a fresh session to eliminate confirmation bias. Treat the new context as an independent reliability and security auditor. The agent must only produce an audit report without modifying the original plan.

Read PLAN.md in the root. Audit it from a production-readiness perspective.
Use these benchmarks: 12-Factor App, OWASP Top 10, SRE fundamentals.

Identify:
- Security and data privacy gaps
- Missing test coverage strategy
- Single points of failure
- Deployment and rollback risks

Output as AUDIT.md. Do not modify PLAN.md. Provide concrete recommendations, not theoretical advice.

Phase 3: Task Breakdown

Consolidate PLAN.md and AUDIT.md back into your primary session. Convert the validated architecture into an executable checklist. Each item must contain acceptance criteria and rollback instructions.

Based on PLAN.md and AUDIT.md, create TASKS.md.
Format per task:
- Task ID: T-001
  Scope: ...
  Acceptance Criteria: ...
  Test Strategy: ...
  Dependencies: ...
  Rollback Step: ...
  Estimated Complexity: Low, Med, or High

Prioritize by dependency and risk. Maximum 15 initial tasks. Ready for incremental execution.
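Because each task must carry acceptance criteria and rollback instructions, it helps to check a generated task entry mechanically before accepting it. The sketch below assumes one task per text block with `Field: value` lines, as in the format above. `missing_fields` is a hypothetical helper, not part of any CLI.

```python
import re

# Field labels copied from the per-task format above.
REQUIRED_FIELDS = (
    "Task ID",
    "Scope",
    "Acceptance Criteria",
    "Test Strategy",
    "Dependencies",
    "Rollback Step",
    "Estimated Complexity",
)


def missing_fields(task_block: str) -> list[str]:
    """Return required field labels that are absent or have an empty value."""
    missing = []
    for field in REQUIRED_FIELDS:
        # "Field: value" at the start of a line, optionally behind a "- " bullet;
        # the value must begin on the same line as the label.
        pattern = rf"^\s*(?:-\s*)?{re.escape(field)}:[ \t]*\S"
        if not re.search(pattern, task_block, flags=re.MULTILINE):
            missing.append(field)
    return missing
```

A task that comes back with `Rollback Step` in the result should be sent back to the agent before Phase 5 begins.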

Phase 4: Documentation Consolidation

Once TASKS.md locks scope, generate durable handoff documentation. The AI consumes the validated plan, audit, and task list, then produces two artifacts split by audience.

Why after task breakdown, not before:

  • PLAN.md already covers architecture, schema, and auth — Phase 2 validates them.
  • Pre-breakdown docs drift every time tasks pivot. Post-breakdown docs reflect locked scope.
  • Functional and Technical docs are deliverables, not planning inputs. They serve Product, QA, Support, and future engineers — not the audit gate.

Read PLAN.md, AUDIT.md, and TASKS.md.

Generate FUNCTIONAL.md:
- User stories grouped by feature
- Acceptance criteria per story
- Business rules and constraints
- Edge cases and error states
- Audience: Product, QA, Support

Generate TECHNICAL.md:
- System architecture diagram (mermaid)
- Database schema and relationships
- API contracts (endpoints, payloads, error codes)
- Authentication and authorization flow
- Observability surface (logs, metrics, traces)
- Rollback and recovery procedures
- Audience: Engineering, SRE

Flag any inconsistency between PLAN.md and TASKS.md before generating.
Output as two separate Markdown files. Do not modify upstream artifacts.

💡 Tip: Treat these as living documents. Re-run this prompt after any scope change in TASKS.md so the docs match the shipped state. Store them alongside PLAN.md and AUDIT.md in your documentation directory.

Phase 5: Atomic Execution & Pass/Fail Gates

AI CLI agents lose focus when handling large features, so break implementation into atomic commits. Each prompt should reference a single task ID, enforce diff-only output, and require validation before moving to the next item.

Context Budget & Resume Pattern:

  • Cap sessions at ~80% of the model's context window.

  • Before rotating, generate CONTEXT_SUMMARY.md. Use this skeleton to anchor the next session:

    ## Current Progress: [Task/Phase completed]
    ## Open Decisions: [Unresolved architectural or logic choices]
    ## Next Steps: [Exact next task ID and file targets for new session]

  • Resume new sessions by passing CONTEXT_SUMMARY.md as the initial context anchor.
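The budget rule and the summary skeleton are simple enough to script. The sketch below is a rough illustration: the 4-characters-per-token ratio is a common heuristic rather than a real tokenizer, and all three function names are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Very rough token estimate; assumes about 4 characters per token."""
    return len(text) // chars_per_token


def should_rotate(used_tokens: int, context_window: int, cap_percent: int = 80) -> bool:
    """True once usage reaches the cap (80% by default) of the context window."""
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return used_tokens * 100 >= cap_percent * context_window


def context_summary(progress: str, open_decisions: str, next_steps: str) -> str:
    """Render the CONTEXT_SUMMARY.md skeleton described above."""
    return (
        f"## Current Progress: {progress}\n"
        f"## Open Decisions: {open_decisions}\n"
        f"## Next Steps: {next_steps}\n"
    )
```

If your CLI reports exact token usage, pass that number to `should_rotate` instead of the estimate.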

💡 Tip: Most CLIs display token usage in the session header. If yours doesn't, cap planning prompts to ~3k tokens and execution prompts to ~2k to stay safely under truncation thresholds. For a deeper dive on cutting token burn across CLI agents, see Cutting AI Coding Agent Token Burn 75%+.

Execute TASKS.md item [Task ID].
Follow these rules:
- Implement only the requested scope
- Include unit and integration tests matching the test strategy
- Apply structured logging and proper error handling
- Output changes as unified diffs only
- Provide verification steps before marking the task complete

Do not modify unrelated files. Wait for human review before proceeding.

💡 CLI Tip: If your agent prints diffs to stdout, redirect them: ai-cli "Execute T-001" > t001.patch before running git apply --check.

Measurable Pass/Fail Gates:

  • Test Coverage: ≥80% for new/modified files (Full), ≥60% (Lite)
  • Security Scans: 0 critical/high vulnerabilities
  • Lint & Type Check: 0 blocking errors
  • Manual Approval: Required PR review from 1 senior engineer
  • Rollback Script: Present and tested in staging
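These gates are only measurable if something actually evaluates them. The following sketch shows one way to turn the list above into a pass/fail check. The thresholds come from the list, but `GateInputs`, its field names, and `failed_gates` are hypothetical, and gathering the inputs from your CI tooling is left out.

```python
from dataclasses import dataclass

# Coverage thresholds from the gate list above, keyed by workflow path.
COVERAGE_THRESHOLD = {"full": 80.0, "lite": 60.0}


@dataclass
class GateInputs:
    coverage_percent: float          # coverage for new or modified files
    critical_or_high_vulns: int      # findings from the security scan
    blocking_lint_errors: int        # lint and type-check errors that block merge
    senior_approvals: int            # approving PR reviews from senior engineers
    rollback_tested_in_staging: bool


def failed_gates(inputs: GateInputs, path: str = "full") -> list[str]:
    """Return the names of gates that fail; an empty list means all gates pass."""
    failures = []
    if inputs.coverage_percent < COVERAGE_THRESHOLD[path]:
        failures.append("test coverage")
    if inputs.critical_or_high_vulns > 0:
        failures.append("security scan")
    if inputs.blocking_lint_errors > 0:
        failures.append("lint and type check")
    if inputs.senior_approvals < 1:
        failures.append("manual approval")
    if not inputs.rollback_tested_in_staging:
        failures.append("rollback script")
    return failures
```

Returning the list of failed gates, rather than a single boolean, makes the CI output say which gate blocked the merge.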

Phase 6: Human Code Review & Merge

AI output is draft code until validated by human judgment. After passing all gates:

  • Run git diff or use your IDE's PR view to inspect every change line-by-line.
  • Verify that tests cover the new logic and that no regressions were introduced.
  • Check for hardcoded values, missing error paths, or over-engineered abstractions.
  • Squash or merge via protected branch rules. Never skip peer review for production branches.

Safe Review & Enforceable Git Rules

Verbal guardrails fail under pressure. Enforce AI behavior through deterministic commands and repository policies.

Review AI Changes Safely: Modern AI CLIs edit files directly in your working tree and can read git context natively. Instead of manual patch application, use:

# 1. Review all AI-modified files before staging
git diff

# 2. Stage only verified changes
git add -p  # Interactive staging to review hunks individually

# 3. Commit with clear, scoped messages
git commit -m "feat: implement T-001 with tests and error handling"

Enforceable Git Hooks & Policies:

  • Local Pre-commit Guard: Run this one-liner before committing to block untested AI diffs (it assumes Jest as the test runner): files=$(git diff --name-only --diff-filter=d HEAD); npm run lint && { [ -z "$files" ] || npm test -- --findRelatedTests --passWithNoTests $files; } Comparing against HEAD includes staged files, the filter drops deleted paths, and the empty check skips the test run when nothing has changed. (Replace with your stack's equivalent: ./mvnw verify or gradle check for Java, ruff check . && pytest for Python, or go vet ./... && go test ./... for Go.) This validates changes locally without complex setup. For mature teams, migrate to .pre-commit-config.yaml with automated linting, secret scanning, and coverage gates.
  • Branch Protection Rules: Require status checks (CI, coverage, security scans) and pull request approvals before merging. Disable force-push and direct commits to main/release branches.
  • AI Execution Constraint: Configure CLI wrappers to disable native git commit/git push commands. Route all state changes through human-reviewed patch application.
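The AI execution constraint can be approximated with a small command filter inside whatever wrapper launches the agent's shell commands. This is a sketch under assumptions: how your CLI lets you intercept commands is tool-specific, `git_subcommand` and `is_blocked` are hypothetical names, and the list of git global options that take a value is not exhaustive.

```python
from typing import Optional

BLOCKED_SUBCOMMANDS = {"commit", "push"}
# Global git options that consume the following argument (not exhaustive).
OPTIONS_WITH_VALUE = {"-C", "-c", "--git-dir", "--work-tree", "--namespace"}


def git_subcommand(argv: list[str]) -> Optional[str]:
    """Return the git subcommand in argv, or None if argv is not a git call."""
    if not argv or argv[0] != "git":
        return None
    i = 1
    while i < len(argv):
        arg = argv[i]
        if arg in OPTIONS_WITH_VALUE:
            i += 2  # skip the option and its value
        elif arg.startswith("-"):
            i += 1  # skip a standalone flag such as --no-pager
        else:
            return arg
    return None


def is_blocked(argv: list[str]) -> bool:
    """True when the command would commit or push through git."""
    return git_subcommand(argv) in BLOCKED_SUBCOMMANDS
```

A filter like this is a convenience, not a security boundary: aliases and shell indirection can bypass it, so branch protection remains the enforceable control.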

Provenance & Audit Logging

AI-generated code must be traceable for compliance and incident response.

  • Log every CLI interaction using tee or structured logging: your-ai-cli-command | tee logs/ai-session-$(date +%s).log
  • Include prompt version, model name, context files loaded, and output diffs in each log entry.
  • Store PLAN.md, AUDIT.md, TASKS.md, FUNCTIONAL.md, TECHNICAL.md, and CONTEXT_SUMMARY.md in a dedicated docs/ai-audit/ directory. Commit alongside code changes for full traceability.
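A structured log entry is easier to query during an incident than raw terminal output. The sketch below builds one JSON line per interaction with the fields listed above. The field names, the added SHA-256 of the diff, and the `build_log_entry` helper are all assumptions for illustration, not an established schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def build_log_entry(
    prompt_version: str,
    model_name: str,
    context_files: list[str],
    diff_text: str,
    now: Optional[datetime] = None,
) -> str:
    """Return one JSON line describing a single AI CLI interaction."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = {
        "timestamp": timestamp,
        "prompt_version": prompt_version,
        "model": model_name,
        "context_files": sorted(context_files),
        # The hash lets you confirm later that a committed diff matches the log.
        "diff_sha256": hashlib.sha256(diff_text.encode("utf-8")).hexdigest(),
        "diff": diff_text,
    }
    return json.dumps(entry, sort_keys=True)
```

Appending each returned line to a file under logs/ gives a JSON Lines audit trail that can be committed or shipped to your log store.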

Quick-Reference: Production Hardening Rules

| Area | Standard | AI Behavior | Human Responsibility |
|---|---|---|---|
| Security | OWASP Top 10, least-privilege access, secret rotation | Rejects hardcoded credentials, enforces input validation | Reviews auth flows, validates access policies, scans dependencies |
| Observability | Structured JSON logs, distributed tracing, health endpoints | Injects correlation IDs, adds span propagation, formats logs | Defines SLOs, configures alerting thresholds, reviews dashboards |
| Testing | Unit, integration, and e2e coverage; failure-path validation | Generates mocks, asserts edge cases, verifies test execution | Reviews test coverage, validates flaky tests, approves suites |
| Deployment | Incremental releases, immutable build artifacts, rollback scripts | Outputs infrastructure diffs, flags configuration drift | Approves deployments, runs dry-runs, monitors rollout metrics |
| Execution | Atomic commits, feature flags, single-scope changes | Outputs unified diffs, isolates changes, blocks multi-step merges | Applies patches, runs CI, reviews PRs, triggers merge |

Reality Check: What AI Cannot Replace

AI amplifies engineering effort; it does not replace engineering judgment. Always perform manual review before deploying:

  • Auth and payment flows frequently miss race conditions, idempotency guarantees, and token rotation logic.
  • Data migrations are often irreversible; schema changes can cause downtime or silent corruption.
  • Infrastructure and secrets require environment-specific tuning. AI-generated configurations expose systems if deployed without validation.
  • Edge cases like memory leaks, graceful shutdowns, rate limiting, and exponential backoff demand human stress testing.
  • Compliance frameworks (like data privacy or financial regulations) require architectural and legal verification that AI cannot guarantee.

Never deploy without automated security scans, load testing, secret rotation validation, and rollback dry-runs. Treat AI CLI agents as disciplined implementors guided by senior engineering oversight. You remain the final authority on technical trade-offs and production safety.

Conclusion

Production readiness does not emerge from prompt sophistication. It emerges from explicit context, audited architecture, security guardrails injected from day one, version-controlled planning, and incremental execution with measurable acceptance criteria. By treating AI CLI agents as structured engineering collaborators rather than autonomous code generators, you reclaim predictability, reduce technical debt, and ship systems that survive real-world traffic.

If you're applying this workflow to a real system and want a second pair of eyes on your architecture plan, reach out to compare notes—I'm always happy to share stack-specific adjustments that keep AI output reliably shippable.

Written by Erik Yuntantyo · Software Engineer