Agentic Coding Field Manual
I was skeptical about AI coding until I hurt my hand and had to find a better way to keep shipping. I didn’t want to fall into “vibe coding” - just accepting whatever the model spits out and hoping it works. So I distilled a workflow from what I’ve learned and what others have proven - one that keeps you in control without sacrificing speed.
That’s what this manual is about.
Who This Is For
Engineers who ship production software and want to move faster without losing their judgment. Not hobbyists. Not people who want AI to “build my app.”
If you don’t already have opinions about linting, type systems, and test strategy, start there first. This manual assumes you do.
The Core Problem
The biggest risk of AI coding isn’t bad code - linters catch that. It’s loss of agency. A gradual, barely noticeable surrender of control over your own codebase. It shows up in five ways:
You stop thinking. The model always has an answer, so you stop forming your own. Architectural instincts atrophy. Six months in, you can’t design a system without prompting an LLM first.
You stop reading. Diffs get accepted with a glance. You build on unreviewed code, then ask the AI to fix what breaks. You become a passenger in your own codebase.
You get average output. The model produces the statistical mean of its training data. Default configs, tutorial patterns, mock-heavy tests that pass but verify nothing. Without constraints, everything converges to mediocre.
You lose context. Long sessions degrade. The model contradicts earlier decisions, introduces subtle inconsistencies, and drifts from the plan. Each individual diff looks fine; the aggregate doesn’t.
You mistake speed for quality. Shipping faster feels like shipping better. Design thinking gets skipped because “I can always regenerate it.” But regeneration without judgment is just churn.
Every chapter in this manual targets at least one of these. The goal is the same throughout: you stay in control.
The Mental Model
AI is a fast typist with encyclopedic knowledge and no stake in your system. It produces statistically average code - the mean of everything it’s seen. Left unconstrained, you get the average of every tutorial, every Stack Overflow answer, every GitHub repo in its training data.
Your job is to constrain it: tell it what to build, how to build it, think alongside it on the hard problems, and reject anything that doesn’t meet your standards.
You drive. AI types.
01 - Tools
Requirements
Your AI coding tool needs three things:
- Plan mode - Separate thinking from writing. You align on the approach before code is produced.
- LSP integration - The model sees type errors, completions, and diagnostics. Same information your editor has.
- Conversation forking - Explore an alternative without losing your current thread.
Two tools check all boxes:
- OpenCode - Open source, any model. My preference.
- Claude Code - Anthropic’s CLI.
OpenCode Setup
Install OpenCode, then update ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"model": "openai/gpt-5.3-codex",
"tui": {
"scroll_speed": 3,
"scroll_acceleration": {
"enabled": true
},
"diff_style": "auto"
},
"tools": {
"lsp": true
},
"mcp": {
"context7": {
"type": "remote",
"url": "https://mcp.context7.com/mcp",
"headers": {
"CONTEXT7_API_KEY": "{env:CONTEXT7_API_KEY}"
}
},
"playwright": {
"type": "local",
"enabled": true,
"command": [
"bunx",
"@playwright/mcp@latest"
]
}
}
}
Add env vars to your .bashrc, .zshrc, or config.fish (if you are one of those guys):
export OPENCODE_EXPERIMENTAL_LSP_TOOL=true
export CONTEXT7_API_KEY=<your-key-here>
Get a Context7 API key at context7.com.
On first run, OpenCode prompts for a provider. Pick either OpenAI or Anthropic. I personally recommend Codex-5.3 for coding because it does what I say (Heisenberg, BB), and Claude for creative writing or design exploration - it’s like a kid who wants to do everything on the first shot and try creative stuff.
What’s an LSP? The Language Server Protocol is what powers your editor’s autocompletion, go-to-definition, and diagnostics. With LSP integration, the model gets the same type information and diagnostics your editor does - real errors, not hallucinated ones. Watch this explainer by TJ DeVries if you’re unfamiliar.
MCP Servers
The config above includes two MCP servers:
- Context7 - Documentation lookup. The model queries real, up-to-date library docs instead of hallucinating APIs from training data. This is non-negotiable - without it, the model invents function signatures that don’t exist.
- Playwright - Browser automation. Takes screenshots of UI changes for visual verification. More on this in the UI appendix.
You can add more MCP servers as needed. The point is: give the model access to real data sources instead of letting it guess.
Keep it lean. Every MCP server or Skill you add bloats the system prompt and eats into the model’s context window. The more you pile on, the dumber the model gets - less room to actually reason about your problem. Only keep what you actively use. If you’re not using it, rip it out.
02 - Project Setup
Three things to set up before AI touches your codebase: scaffolding, guardrails, and AGENTS.md.
Scaffold It Yourself
Let the AI scaffold your project and you get the lowest common denominator of every tutorial it’s ever seen - Create React App patterns in 2026, default ESLint configs, boilerplate folder structures nobody actually wants.
Do it yourself:
- Create the base with your preferred tooling
- Set up a single validation command that covers linting, formatting, and type-checking - plus a separate test command
- Configure your linter with your preferences
- Configure your type checker with your preferences
The AI joins after the foundation is set. It works within your system, not its statistical average.
Guardrail Tooling
The best way to prevent AI from writing bad code: make it impossible at the tooling level.
The AI can’t merge garbage if the linter rejects it.
The examples below use TypeScript and Biome, but the pattern applies to any language and toolchain. Swap in your linter, your type checker, your runner.
Linter at Max Strictness
Your linter is the first line of defense. Crank it to maximum strictness - whatever tool you use. Here’s an example with Biome:
{
"assist": {
"enabled": true,
"actions": {
"recommended": true,
"source": {
"organizeImports": "on",
"useSortedKeys": "on",
"useSortedProperties": "on"
}
}
},
"linter": {
"enabled": true,
"rules": {
"recommended": true,
"complexity": {
"noExcessiveCognitiveComplexity": "warn",
"noExcessiveLinesPerFunction": {
"level": "warn",
"options": { "maxLines": 100, "skipBlankLines": true }
},
"useMaxParams": "error"
},
"style": {
"noEnum": "error",
"noDefaultExport": "error",
"noNestedTernary": "error",
"useConsistentTypeDefinitions": {
"level": "error",
"options": { "style": "type" }
},
"useForOf": "error"
},
"suspicious": {
"noConsole": {
"level": "error",
"options": {
"allow": ["assert", "error", "info", "warn"]
}
},
"noEvolvingTypes": "warn",
"useErrorMessage": "warn"
}
}
},
"overrides": [
{
"includes": ["**/*.config.{js,cjs,mjs,ts,cts,mts}"],
"linter": {
"rules": {
"style": { "noDefaultExport": "off" }
}
}
}
]
}
This is a subset - a real config covers 50+ rules across complexity, style, correctness, and suspicious categories.
Note the overrides block: config files need default exports, so you loosen that specific rule for those files only. Start strict, loosen deliberately.
Type System at Full Strictness
If your language has a type system, use its strictest settings. The AI loves to take shortcuts around type safety - strict mode forces it to deal with reality. Here’s an example with TypeScript - not just "strict": true, go further:
{
"compilerOptions": {
"strict": true,
"noUncheckedIndexedAccess": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"exactOptionalPropertyTypes": true
}
}
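To see what the extra strictness buys you, here’s a small sketch of what `noUncheckedIndexedAccess` forces (variable names are illustrative): indexed access yields `T | undefined`, so the model can’t pretend an element is always there.

```typescript
// With noUncheckedIndexedAccess enabled, indexing returns
// string | undefined instead of string - the compiler rejects
// first.toUpperCase() until the undefined case is handled.
const names: string[] = ["ada", "alan"]

const first = names[0] // type: string | undefined, not string

if (first !== undefined) {
  console.log(first.toUpperCase()) // narrowed to string - safe
}
```

The model will happily write `names[0].toUpperCase()` without the flag. With it, that’s a compile error it has to fix.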
The One-Command Validation Pattern
Wrap all your validation into a single command. One command, zero ambiguity - if it passes, the code meets your standards regardless of who wrote it.
In a TypeScript project with Biome, that might look like:
{
"scripts": {
"check": "biome check",
"typecheck": "tsc --noEmit",
"lazycheck": "bun run check && bun run typecheck",
"test": "bun test"
}
}
After every AI change, run your validation command. The tool and the name don’t matter. What matters is that it’s one command, it covers everything, and you run it every time.
Tooling catches structural problems. For everything else - style, workflow, architectural boundaries - you need to tell the model how you work.
AGENTS.md
AGENTS.md is the system prompt for your repo. It’s always in the model’s context window - tools like OpenCode and Cursor read it automatically. Claude Code reads CLAUDE.md instead, because Anthropic wasn’t going to use a name they didn’t invent. Same concept, different filename - just symlink it and move on.
You don’t need one on day one. Start without it. When the model makes the same mistake twice - wrong branch, bad error handling pattern, style you hate - that’s when you add a rule. AGENTS.md grows from friction, not from planning.
Write It By Hand, Keep It Short
Tools like OpenCode offer /init to generate one. Don’t. Auto-generated files are too verbose, not constraint-first, and miss your actual preferences. You know your codebase better than any model.
Keep it under 100 lines. The shorter it is, the more the model follows it. A 300-line AGENTS.md gets skimmed and mostly ignored - same as humans with long READMEs.
Constraint-First
Tell the model what it can’t do, not why. Short imperatives, not explanations. The example at the end of this section shows the style - every line is a constraint or a rule, nothing is a rationale.
Patterns That Earn Their Place
Only add rules that address patterns the AI actually gets wrong. Common ones:
- Style violations - The model defaults to its training distribution. State your preferences explicitly.
- Tool-specific commands - Exact commands. The model will guess wrong otherwise.
- Architectural boundaries - What not to touch, where tests live, which branch to use.
- Error handling style - Without guidance, every function gets wrapped in try/catch.
- When to ask for help - Without this, the model guesses and plows forward.
Example: Root AGENTS.md
- Prefer automation: execute requested actions without confirmation unless
blocked by missing info or safety/irreversibility.
- Assume senior-level knowledge. No explanations for standard practices.
Terse, scannable, constraint-first.
- Branch from `staging`.
- Always rebase, never merge.
- After changes, run: `bun run lazycheck`.
- Always check if local server is running to use that instead of spawning
new processes.
- Don't ignore linting or type errors - fix them, or ask for help if unsure.
- If build mode requires human input (a decision, question, or manual step),
pause and request it explicitly, then continue once provided.
- Don't turn off rules without a very good reason. Bring it up for discussion
instead of bypassing it.
## Style Guide
- Avoid `try`/`catch` where possible
- Prefer single word variable names where possible
- Rely on type inference; avoid explicit annotations unless necessary
for exports or clarity
- Type casting is a last resort - prefer type guards and runtime checks
- When uncertain about a library, use `context7` tools
- If using webfetch, look for llms.txt in the page's root first
## Testing
- Avoid mocks as much as possible
- Test actual implementation, do not duplicate logic into tests
- Prefer package-level tests over workspace-level tests
## Plan Mode
- Make the plan extremely concise. Sacrifice grammar for brevity.
- End each plan with unresolved questions, if any.
About 30 lines. Covers style, workflow, testing philosophy, and plan mode behavior. The Plan Mode section is borrowed from Matt Pocock.
Monorepo: Root + Leaf
Single repo? Skip this section.
In a monorepo, AGENTS.md files cascade:
- Root `AGENTS.md` - Cross-cutting rules. Git workflow, code style, testing philosophy. Applies everywhere.
- Leaf `AGENTS.md` (per package) - Inherits root, adds package-specific constraints.
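On disk, the cascade might look like this (package names are illustrative):

```
AGENTS.md            <-- root: cross-cutting rules
packages/
  web/
    AGENTS.md        <-- leaf: web-specific constraints
  api/
    AGENTS.md        <-- leaf: api-specific constraints
```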
A leaf looks like:
- Inherit global rules from root `AGENTS.md`.
- Use semantic tokens for colors to support dark mode and themes;
avoid hardcoded values.
- No arbitrary Tailwind values - use design tokens unless explicitly required.
- UI changes must be verified with screenshots using playwright MCP tools.
## Patterns
### Component Decision Tree (strict order)
1. Shadcn component exists? -> use it
2. Reusable (3+ uses)? -> `src/components/ui/`
3. Section-specific? -> `src/components/[section]/`
4. One-off layout? -> semantic HTML
The first line - “Inherit global rules from root AGENTS.md” - is critical. Without it, the model may ignore root-level constraints when working inside a package.
Keep leaves even shorter than root. Package-specific rules only. If a rule applies everywhere, it belongs in root.
03 - The Loop
Plan. Execute. Test. Commit. Repeat.
The goal: never lose control of what’s happening. This is the workflow that separates agentic coding from vibe coding.
Plan ──> Execute ──> Test ──> Commit
  ^                              |
  └──────────────────────────────┘
Plan
Use plan mode. Think through the approach together before any code is written.
In OpenCode, toggle between plan and build with Tab.
A good plan prompt:
I need to add rate limiting to the API endpoints.
Requirements: per-user, 100 req/min, 429 response with retry-after header.
Plan the implementation.
The model proposes an approach. You review. You adjust. You agree. Then you execute.
Plans should be a numbered list of steps, not an essay. If the plan is longer than 10 lines, the scope is too big - break it down.
What Makes a Good Plan
- Concise - a list, not a narrative
- Scoped - one feature, one bug, one refactor
- Specific about files and functions it will touch
- Ends with unresolved questions if any remain
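For the rate-limiting prompt above, a plan that meets these criteria might look like this (file paths are illustrative):

```
1. Add rate-limit middleware in src/middleware/rateLimit.ts
2. Per-user counter, sliding 60s window, limit 100
3. Over limit -> 429 + Retry-After header
4. Wire middleware into the API router
5. Tests: under limit passes, over limit 429, window resets

Unresolved: in-memory counters OK, or shared store for multi-instance?
```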
Execute
Switch to build mode. The AI writes code that matches the agreed plan.
Key discipline: the model implements what was agreed, nothing more. If it starts freelancing - adding features you didn’t ask for, refactoring nearby code, “improving” things - intervene. Don’t wait until it’s finished. This is the most common drift point.
Steering Mid-Execution
You don’t have to wait until the model finishes to course-correct. Watch what it’s doing - the files it opens, the tool calls it makes, the hunks appearing in the diff. If something looks wrong, send a message. The model will read it and adjust.
Soft redirect. “We’re adding rate limiting to the API, not refactoring the middleware - back on track.” No need to stop or undo. Just nudge.
Assist. If the model is struggling with a library or API - looping on the same error, guessing at function signatures - point it to the right docs. Drop a link to the documentation, an llms.txt file, or a specific code example. The model works better with real references than with its training data.
Scope anchoring. When the model drifts, restate the original scope and ask why: “The plan was to update the user service. Why are you modifying the payment module? Revert that and stick to user service.” The model lost the thread - remind it, and get an explanation if the detour wasn’t obvious.
Hard stop. If it’s too far gone, stop and revert. In OpenCode: /undo. Or use git. Don’t spend time untangling a mess - reset and re-prompt with tighter constraints.
Fork. Different from correcting mistakes. Fork when you want to explore an alternative without losing your current state - “what if we used Redis instead of in-memory caching” or “let’s try a different schema design.” Forking is for exploration, not damage control.
The CI Gate
The model runs your validation command and tests as part of execution - not as a separate step you do by hand. When checks fail, let the model fix its own mistakes. Don’t hand-edit AI output.
But don’t just watch passively either. Each failure is information:
- Was the prompt unclear? Tighten the next one.
- Is this a recurring pattern? Add a rule to AGENTS.md.
- Did the linter or type checker miss it? Strengthen your CI gates.
If the model can’t self-correct after two or three attempts, the plan was wrong. Go back to plan mode with a fresh session.
Test
CI passing means the code is structurally sound. It doesn’t mean the feature works.
This step is yours. Run the feature. Click through the UI. Hit the endpoint. Check the actual behavior against what you intended. This is where you catch what automation can’t - the code that passes every check but doesn’t actually do what you need.
If something’s off, go back to Execute with a specific correction: “The rate limiter returns 200 instead of 429 when the limit is exceeded.” Give the model concrete behavior to fix, not vague feedback.
Commit
Ship it. Then start the cycle again.
Rules:
- Small, atomic commits. One logical change per commit. Don’t let the AI create monster diffs touching 20 files for what should be a 3-file change.
- Read the diff before committing. Every time. No exceptions. This is where “engineer stops thinking” starts - skipping diff review.
- Always rebase, never merge. Clean history. The model doesn’t care, but you will when debugging at 2am.
Context Management
Model output quality degrades as the conversation grows. The context window fills up, older instructions get diluted, and the model starts contradicting its earlier decisions.
When to Kill and Start Fresh
- The model starts ignoring your AGENTS.md rules
- Output is increasingly sloppy or repetitive
- It loops on the same error without making progress
- You’re 15+ messages deep and quality is dropping
Fresh context = fresh start. Don’t cling to a degraded session. Starting over with a clear prompt is faster than fighting a confused model.
Why Not Compact?
Most tools offer a way to summarize or compact the conversation (OpenCode: /compact, Claude Code: /compact). In theory, this frees up context without losing the thread. In practice, compaction degrades quality significantly - the model loses nuance, forgets constraints, and makes worse decisions on the summarized context than it would with a clean start.
Always prefer a fresh session over a compacted one. A new conversation with a clear prompt outperforms a compressed version of a long, degraded one.
Context Budgeting
Think of each conversation as a budget. You have a finite amount of useful context before quality drops. Spend it on:
- The plan (cheap - a few lines)
- Execution (the bulk of it)
- Error correction (expensive - each round adds noise)
If error correction is eating your budget, the plan was wrong. Go back to step 1 with a fresh session.
04 - Testing
No Mocks
The stance: never mock. Test actual implementations against real dependencies.
If you can’t test something without mocking, that’s a design problem - your boundaries are in the wrong place. Fix the design, don’t paper over it with jest.mock().
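Here’s a minimal sketch of what “fix the design” means in practice - all names are hypothetical. Instead of importing the database module directly (which forces `jest.mock`), the signup logic takes its dependency as a parameter, so tests can pass a real implementation:

```typescript
// The logic depends on an interface, not a concrete module import.
type UserStore = {
  insert: (email: string) => void
  find: (email: string) => string | undefined
}

// Returns false for duplicate signups, true otherwise.
const signup = (store: UserStore, email: string): boolean => {
  if (store.find(email) !== undefined) return false // already registered
  store.insert(email)
  return true
}

// A real in-memory implementation - not a mock. It has actual behavior.
const memoryStore = (): UserStore => {
  const users = new Set<string>()
  return {
    insert: (email) => { users.add(email) },
    find: (email) => (users.has(email) ? email : undefined),
  }
}

const store = memoryStore()
console.log(signup(store, "a@b.com")) // first signup succeeds
console.log(signup(store, "a@b.com")) // duplicate rejected
```

The in-memory store isn’t a mock - it’s a real implementation you can assert against. Swap in the production store at the composition root and nothing else changes.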
Why This Matters for Agentic Coding
AI defaults to mock-heavy tests. Every model has been trained on millions of test files full of jest.mock, sinon.stub, and unittest.mock. Left unconstrained, you get:
jest.mock("$/lib/db")
jest.mock("$/lib/auth")
jest.mock("$/lib/email")
test("user signup", () => {
// mocked db, mocked auth, mocked email
// testing literally nothing
})
Coverage goes up. Confidence doesn’t. This is coverage theater.
The model will generate these tests eagerly because they’re easy to write, always pass, and look productive. That’s exactly why they’re dangerous.
🌶️ Hot take: Code coverage is a vanity metric. 90% coverage means nothing if your app breaks the moment all the pieces run together. Unit tests with mocks tell you each piece works in isolation - great, but your users don’t run your app in isolation. What matters is whether the system works end-to-end with real data, real dependencies, and real failure modes. If your test suite is green and your app is broken, your test suite is lying to you.
What To Do Instead
Test real behavior with real dependencies:
- Databases - Test against the real database engine you use in production. Use testcontainers or a dedicated test instance with proper setup and teardown. If your tests break during cleanup, good - that’s a problem you’d rather find now than in production. Stop treating your database like a sacred artifact only gods can touch. In the real world, it’s under constant pressure from users trying to break things.
- API endpoints - Spin up the actual server. Hit real routes. Assert real responses.
- Side effects - If the function sends an email, test against a real SMTP test server, not a mock.
test("user signup flow", async () => {
const db = await createTestDb() // real database instance
const app = createApp({ db }) // real app instance
const res = await app.request("/signup", {
method: "POST",
body: JSON.stringify({ email: "test@test.com", password: "secure123" }),
})
expect(res.status).toBe(201)
// verify the user actually exists in the real database
const user = await db.query("SELECT * FROM users WHERE email = ?", ["test@test.com"])
expect(user).toBeDefined()
})
Slower than mocks. Tests something real.
Don’t Duplicate Logic Into Tests
Another AI habit - reimplementing business logic inside the test to assert against:
// Bad: duplicating the pricing calculation in the test
test("calculates price", () => {
const price = 100
const tax = price * 0.21 // duplicated logic
const total = price + tax // duplicated logic
expect(calc(100)).toBe(total) // testing nothing
})
// Good: testing behavior against known values
test("calculates price with tax", () => {
expect(calc(100)).toBe(121)
})
Test behavior, not implementation.
Package-Level Tests
Prefer tests scoped to a single package over workspace-level integration tests.
If a test requires importing from multiple packages, it probably belongs in its own package. This keeps test scope small and helps the AI - less context to hold, better output.
packages/
auth/
src/
tests/ <-- tests for auth only
billing/
src/
tests/ <-- tests for billing only
integration/
tests/ <-- tests that cross package boundaries
Prompting for Good Tests
Add this to your AGENTS.md (see Chapter 02):
- Avoid mocks as much as possible
- Test actual implementation, do not duplicate logic into tests
- Prefer package-level tests over workspace-level tests
Then when asking for tests, be explicit:
Write integration tests for the auth flow.
No mocks. Real SQLite database. Test signup -> verify -> login.
Without that instruction, the model defaults to what it’s seen the most: mocked unit tests.
05 - Anti-Patterns
When NOT to Delegate to AI
There’s a difference between letting AI decide and using AI to think harder. The anti-pattern isn’t involving AI in hard problems - it’s handing the wheel over on the problems where your judgment matters most.
- Architecture decisions - Don’t prompt “design my system” and accept what comes back. Do use plan mode to pressure-test your own design - surface tradeoffs, challenge assumptions, explore alternatives you hadn’t considered. The model is a sparring partner, not the architect. You decide.
- Security-critical code - Auth flows, encryption, access control. AI can write these - the danger isn’t who produces the code, it’s accepting it without understanding the end-to-end architecture and reasoning behind every decision. One subtle bug is a breach, and the model will confidently produce subtle bugs. Walk through the implementation step by step. Challenge the threat model. Poke at edge cases. If you can’t explain why every line exists, it doesn’t ship.
- Performance-sensitive paths - Hot loops, memory layout, cache behavior. AI writes readable code, not fast code. But it’s useful as a companion while you reason about cache lines, allocation patterns, or algorithmic complexity - ask it to challenge your assumptions or enumerate cases you might be missing.
- Novel algorithms - If the solution doesn’t exist in training data, the model will hallucinate something plausible. Don’t let it generate the algorithm. Do use it to rubber-duck your approach, validate your reasoning at each step, and catch logical gaps while you work through the problem yourself.
The common thread: you do the thinking. AI makes your thinking more rigorous. The moment you stop reasoning and start accepting, you’ve delegated the one thing that can’t be delegated.
The Acceptance Spiral
The most dangerous pattern in agentic coding:
- AI writes code. You glance at it. Looks fine. Accept.
- AI writes more on top. You glance. Accept.
- You now have 500 lines of unreviewed code as your foundation.
- Something breaks. You don’t understand the code because you never read it.
- You ask AI to fix it. It adds more unreviewed code.
- You are now a passenger.
This is how engineers stop thinking. It doesn’t happen in one moment - it’s a gradual surrender of agency, one lazy diff review at a time.
Break the cycle: read every diff before committing. Every one. If you can’t explain what changed and why, you don’t commit it.
Common Failure Modes
Plausible nonsense - Code that looks correct, passes type checks, and has subtle logic errors. AI is excellent at producing code that seems right. The type system catches structural issues; it doesn’t catch “this algorithm doesn’t actually do what we need.”
Over-abstraction - AI loves creating abstractions: factories, wrappers, base classes, utility functions, generic helpers. Most of the time, the concrete solution is better. When you see the model creating AbstractHandlerFactory, intervene.
Ignoring existing patterns - Your codebase handles errors one way. AI adds a new function doing it differently because that’s what it’s seen more often in training data. AGENTS.md mitigates this but doesn’t eliminate it. You still need to review.
Cargo-culting popular patterns - The model gravitates toward whatever has the most representation in its training data, regardless of whether it fits your context. You get Redux in a project that doesn’t need state management, or a full ORM when raw queries would do.
Scope creep - You ask for one change. The model “helpfully” refactors three nearby files. Each refactor is individually reasonable but collectively they’ve changed the semantics of code you didn’t ask it to touch.
How to Review AI Output
Read diffs, not files. What changed matters more than what exists.
Checklist:
- Does it follow AGENTS.md constraints?
- Did it modify files you didn’t ask it to touch?
- Are there new abstractions that aren’t justified?
- Does the approach match what you agreed in the plan phase?
- Would you approve this in a code review from a human?
That last question is the filter. If a junior dev submitted this PR, would you merge it? Apply the same standard.
The Three-Attempt Rule
If the AI can’t produce acceptable output in three tries for the same task, stop. Write it yourself.
Some tasks are faster by hand. Spending 30 minutes prompt-engineering a solution you could write in 10 isn’t “being productive with AI” - it’s the sunk cost fallacy with extra steps.
Know when to take the wheel.
Appendix - UI With Real Designs
AI is actually good at generating UI now. That’s not the problem. The problem is that most products already have a design language - a designer delivered screens in Figma, there’s an existing component library, there are color tokens and spacing scales that need to be respected. The challenge isn’t getting the AI to produce something that looks nice. It’s getting it to produce something that looks like yours.
This appendix covers two levels of constraint for turning real designs into code with AI assistance. They’re not mutually exclusive - use tokens as the foundation and screenshots for specific implementations.
Screenshot-Driven
Use this when a designer has delivered a specific screen or layout and you need to implement it faithfully. Good for full pages, one-off sections, and anything where the visual reference is the source of truth.
- Designer delivers in Figma
- Export the relevant screen or component as a screenshot
- Feed the screenshot to the agent with constraints:
Here's the design for the settings page [attach screenshot].
Implement it using existing components from the design system.
Use semantic color tokens - no hardcoded values.
No arbitrary Tailwind values.
Check the component tree before creating new components.
Strengths: Fast. Low ceremony. Works well for unique layouts where the design is already final.
Limitations: The model interprets the screenshot, so spacing and sizing won’t be pixel-perfect. Expect to iterate on details.
Token-Driven
Use this when your product has a design system and you want the AI to stay on-brand by default. Good for component systems, reusable UI, and anything that needs theme support.
- Designer maintains a token system in Figma (colors, spacing, typography)
- Export tokens to your codebase as CSS variables or Tailwind theme config
- AI builds components constrained to those tokens
The constraint is built into the tooling - the model can’t use bg-blue-500 if your Tailwind config only exposes bg-primary and bg-surface. The design system enforces consistency, not discipline.
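A sketch of what that looks like in a hypothetical `tailwind.config.ts` - the token names (`primary`, `surface`, `danger`) and CSS variable names are illustrative:

```typescript
// Defining theme.colors (not theme.extend.colors) replaces Tailwind's
// default palette entirely. Utilities like bg-blue-500 are never
// generated, so the model physically cannot use them - only
// bg-primary, bg-surface, bg-danger.
const config = {
  theme: {
    colors: {
      primary: "var(--color-primary)",
      surface: "var(--color-surface)",
      danger: "var(--color-danger)",
    },
    // spacing and typography tokens would live here too
  },
}

export default config
```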
Add to your frontend package’s AGENTS.md:
- Use semantic tokens for colors to support dark mode and themes;
avoid hardcoded values.
- No arbitrary Tailwind values - use design tokens unless explicitly required.
Strengths: Consistent output. Theme support built in. Less iteration needed.
Limitations: Requires upfront investment in the token system. Designer and engineer need to agree on the token structure.
Verifying UI Changes
Use the Playwright MCP server (configured in Chapter 01) to close the feedback loop.
Add to your AGENTS.md:
- UI changes must be verified with screenshots using playwright MCP tools.
The model takes a screenshot after implementation, compares against the design, and iterates. This catches:
- Wrong tokens or colors
- Layout mismatches
- Missing responsive states
- Broken component composition
The model reviews its own output visually before you do. Most obvious issues get fixed before the diff reaches you.
Implementing a Design
Don’t let the model one-shot a full screen from a screenshot. It will inline everything into a single monolithic component - hardcoded values, duplicated patterns, no reuse. Instead, follow this workflow:
- Start from the screenshot. Look at the design. What are the building blocks?
- Inventory check. Do you already have the components needed? Check your component library, your existing custom components, your tokens.
- Build what’s missing. If the design needs components you don’t have, create them first - in isolation, before assembling the full screen.
- Then assemble. Compose the screen from existing + newly created components. Semantic HTML is for layout and glue code only, not for things that should be components.
The key: build the blocks first, then compose. This prevents the model from generating a wall of JSX with everything inlined.
When the model encounters a UI need during implementation, it should follow this priority:
1. Component library already has it? -> use it
2. Reusable (3+ uses)? -> src/components/ui/
3. Section-specific? -> src/components/[section]/
4. One-off layout? -> semantic HTML
Add this to your frontend AGENTS.md:
- Before implementing a design, inventory existing components that match.
- Build missing reusable components before assembling the full screen.
- Use semantic tokens for all values. No hardcoded colors, spacing, or typography.
- Semantic HTML is for layout and glue code only - not for repeating UI patterns.
Without these constraints, the model creates new components for everything. You end up with 40 button variants when your component library already handles it.