# Plan Authoring Patterns for Automated Build Harnesses

**Date:** 2026-03-09
**Context:** Post-mortem on a harness run that produced incomplete results, contrasted with a separate run that achieved zero-debug perfection.
**Reviewed by:** Opus 4.6 consultation (17 recommendations integrated below)

---

## The Two Data Points

### Data Point 1: The Failure (Workbench Refinement, tonight)

A 9-phase plan with Phase 5 as the implementation phase:

```markdown
## Phase 5: Implement Workbench Refinements

Apply the top-priority recommendations from SYNTHESIS.md to the Manual Workbench code.

Read SYNTHESIS.md first to determine what to implement.

Validation:
python3 -c "import ast; ast.parse(open('editors/review_server.py').read())" && \
node -c editors/manual.js && echo "Syntax OK"
```

SYNTHESIS.md contained 5 priority tiers (P0-P4), each a distinct feature. The sub-agent implemented P0 and P1, stopped, passed syntax validation, and the harness moved on. The user expected all 5 features. Got 2.

### Data Point 2: The Success (Coil Editor, same week)

A consult → adversarial review → detailed spec pipeline produced a 2,647-line `OVERNIGHT_BUILD_SPEC.md` with exact file contents, phase boundaries, and validation gates. The harness built it in 74 minutes across 8 phases with zero debug cycles. Every feature was implemented exactly as specified.

### What separated them

The failed plan delegated scope decisions to the sub-agent ("read the synthesis, figure out what to build"). The successful spec told the sub-agent exactly what to produce, file by file, function by function.

---

## Failure Mode Taxonomy

### 1. The Indirection Problem (what happened tonight)

**Pattern:** Phase says "read Document X and implement what it recommends."

**Why it fails:** The sub-agent gets a fresh context. It reads Document X, sees 5 recommendations, and must decide: all of them? Top 3? Just the critical ones? The phrase "top-priority" is subjective. The harness principle says agents should NOT make architectural decisions — but this plan forced one.

**Fix:** Never ask a sub-agent to interpret a dynamic document for scope. The plan author (or a spec-generation step) must resolve the indirection before the harness runs. If Phase 4 produces SYNTHESIS.md, then Phase 5's spec should enumerate exactly which items to implement — even if that means the plan can't be fully written until Phase 4 completes.

**Implication for skill design:** This is where a `/spec` step would sit. After consultation and synthesis, but before harness execution. The spec resolves all indirection into concrete instructions.

### 2. The Fat Phase

**Pattern:** One phase contains multiple independent features.

**Why it fails:** Sub-agents have limited context windows. A phase with 5 features means the agent must hold all 5 in mind, implement them in the right order, handle interactions between them, and not forget any. Even if the spec is detailed, cognitive load causes drops.

**Rule of thumb:** One *coherent deliverable* per phase — not necessarily one file or one function. A deliverable is something that can be validated as a unit. "The Widget model exists and can be CRUD'd through the API" is one deliverable even though it touches 4 files. "The Widget model AND the Dashboard AND the export function" is three deliverables. (Opus correction: "one feature per phase" was too rigid — a type definition, schema, CRUD functions, and API route for one model are one conceptual unit.)

**The "if you removed this phase, would the others still make sense?" test:** If yes, it should be a separate phase.

### 3. The Weak Gate

**Pattern:** Validation is just a syntax check (`ast.parse`, `node --check`, `tsc --noEmit`).

**Why it fails:** Syntax checks verify "does this code parse?" not "does this code do what the spec asked?" The harness marks a phase as PASS when validation exits 0. If validation doesn't check for the feature's existence, incomplete implementations pass silently.

**Fix:** Each phase needs feature-specific validation. Not brittle string matching, but structural checks:

```bash
# BAD: Just syntax
python3 -c "import ast; ast.parse(open('file.py').read())"

# BETTER: Syntax + feature existence
python3 -c "import ast; ast.parse(open('file.py').read())" && \
grep -q 'def resolve_shot' editors/review_server.py && \
grep -q 'mw-btn-resolve' editors/manual-workbench.html && \
echo "Phase OK"

# BEST: Functional test
python3 editors/review_server.py --self-test && \
curl -s http://localhost:8430/api/manual/resolve -X POST -d '{}' | \
python3 -c "import sys,json; d=json.load(sys.stdin); assert 'error' in d or 'ok' in d"
```

The ideal validation ladder: syntax → structural (key functions/elements exist) → functional (API responds correctly).

**Critical rule (from Opus):** Validation identifiers must be specified in the phase spec. If you're going to grep for `resolve-btn`, the spec must say "use id `resolve-btn`." The validation and the spec must be co-authored — they're two views of the same contract.

### 4. The Context Starvation

**Pattern:** Sub-agent gets the phase spec but not enough context about the existing codebase.

**Why it fails:** A fresh agent told to "add a RESOLVE button to the detail view" doesn't know what the detail view looks like, what CSS class namespace to use, what event patterns exist, or how the existing API endpoints work. It guesses, and guesses wrong.

**Fix:** The spec must include:
- Exact file paths to modify
- Relevant existing code (or instructions to read specific functions)
- Naming conventions (CSS namespace, function naming patterns)
- API contracts (endpoint URL, request/response format)
- What already exists from prior phases

The Coil spec included literal file contents — the agent's job was essentially to write them to disk. That's the extreme end, but it worked.

### 5. The Phantom Dependency

**Pattern:** Phase N depends on Phase M's output, but the plan doesn't say so. The sub-agent for Phase N doesn't know Phase M exists.

**Why it fails:** Fresh agents have no memory of prior phases. If Phase 3 created a utility function that Phase 5 should use, Phase 5's agent will create a duplicate or use a different pattern unless explicitly told.

**Fix:** Every phase spec should include a "What already exists" section listing files and key functions from prior phases. Not the full contents — just enough for the agent to `Read` the relevant files.

### 6. The Scope Leak

**Pattern:** Phase spec says "implement feature X" but the agent also refactors surrounding code, adds error handling, or "improves" adjacent functions.

**Why it fails:** Extra changes create merge conflicts with future phases, introduce untested behavior, and waste time on work nobody asked for.

**Fix:** Scope boundaries should be **permissive for necessary integration, restrictive for discretionary changes.** Instead of "do NOT touch function X," use "do NOT refactor, restructure, or rename existing code. You MAY add parameters or extend existing interfaces if necessary for the new feature to integrate." (Opus correction: overly rigid "do NOT touch" rules force agents into horrible workarounds when a one-line parameter addition is genuinely needed.)

```markdown
## Scope boundary
- Do NOT refactor or rename existing functions
- You MAY extend function signatures if needed for integration
- Do NOT add error handling, polish, or enhancements beyond what's specified
- Implement exactly what is specified — plain and functional is correct
```

### 7. Agent Creativity (from Opus review)

**Pattern:** Agent told to "add a sidebar" also adds animation, search, custom icons, and collapsible behavior — none of which were in the spec.

**Why it fails:** This isn't scope leak (modifying existing code) — it's *scope inflation* (adding unrequested features to new code). The extras introduce bugs, increase complexity, and cause downstream phases to encounter unexpected code.

**Fix:** Add to sub-agent prompts: "Implement exactly what is specified. Do not add features, polish, or enhancements beyond the spec. Plain and functional is correct."

### 8. The Rollback Problem (from Opus review)

**Pattern:** Phase 5 fails. Debug agent "fixes" it by modifying a Phase 4 file. Phase 4 was already PASS, but now its code has been silently corrupted. The harness doesn't re-validate Phase 4.

**Fix:** After any debug agent touches a file from a prior phase, re-validate that prior phase. This should be a harness behavior, not just a guideline.

### 9. Idempotency (from Opus review)

**Pattern:** Harness crashes mid-Phase 5 and resumes. Phase 5 re-runs from scratch, but files are partially written. The sub-agent reads partially-implemented code and produces confused results.

**Fix:** Phase specs should either be idempotent (agent can re-run on partial files) or the harness should capture a git snapshot before each phase and restore on retry. The git-stash approach is cleaner but requires the project to be in a git repo.

---

## What a Good Phase Spec Looks Like

```markdown
## Phase 5: RESOLVE Button + Inline Failure Pills

### What to build
Add a RESOLVE button and inline failure type pills to the Manual Workbench detail view.

### Files to modify
- `editors/manual-workbench.html` — Add pill row and button to detail section
- `editors/manual.js` — Add click handlers, state tracking, API call
- `editors/styles/manual.css` — Style the pills and button
- `editors/review_server.py` — Update resolve endpoint to accept overrides

### Detailed requirements

**HTML (manual-workbench.html):**
- Add a `div.mw-detail-pills` row inside `#mw-detail` section, above the prompt
- 5 pill buttons: COMPOSITION, ARTIFACTS, MOTION, SAFETY FILTER, CHARACTER
- Each pill has `data-type` attribute matching the failure type key
- Add a RESOLVE button (`#mw-detail-resolve`, class `mw-btn-resolve`) in `.mw-detail-actions`
- RESOLVE starts disabled, enables when a failure pill is selected

**JS (manual.js):**
- Track `_detailSelectedFailureType` state (null initially)
- Pill click toggles `.active` class, updates state, enables/disables RESOLVE
- RESOLVE click: POST to `/api/manual/resolve` with shot_id, failure_type, current prompt, current model
- On success: update triage card badge, show toast, move to next unresolved shot

**CSS (manual.css):**
- `.mw-detail-pills` — flex row, gap 8px
- `.mw-pill.active` — bright background matching failure type color
- `.mw-btn-resolve` — green accent, disabled state grayed out
- Follow existing `mw-*` namespace convention

**Server (review_server.py):**
- `_api_manual_resolve`: Accept optional `manual_overrides` dict in payload
- Store overrides in `gate_results.manual_overrides`
- Store `new_prompt` and `new_model` in the fix entry if provided

### What already exists
- Detail view with target/output comparison, editable prompt, model selector
- `/api/manual/resolve` endpoint exists but doesn't accept overrides
- CSS uses `mw-*` namespace, dark theme, accent colors per failure type
- `MODEL_HINTS` object in manual.js maps model IDs to style guidance

### Scope boundary
- Do NOT modify the triage grid rendering (that's Phase 6)
- Do NOT modify canvas/main.js or dailies.js (that's Phase 6)
- Do NOT wire QUICK RETRY (that's Phase 8)

### Validation
```bash
python3 -c "import ast; ast.parse(open('editors/review_server.py').read())" && \
node --check editors/manual.js && \
grep -q 'mw-detail-pills' editors/manual-workbench.html && \
grep -q 'mw-btn-resolve' editors/manual-workbench.html && \
grep -q '_detailSelectedFailureType' editors/manual.js && \
grep -q 'manual_overrides' editors/review_server.py && \
echo "Phase 5 OK"
```
```

Compare this to the failed Phase 5 spec (7 lines, no detail, syntax-only validation). The difference is specificity at every level: what to build, where to build it, what NOT to touch, and how to verify it was built.

---

## The Spec Generation Gap

`/consult` produces decisions: "do X, not Y, prioritize Z." Those decisions still need translation into harness-ready specs. That translation requires:

1. **Codebase awareness** — reading the actual files to determine exact insertion points, function names, CSS classes
2. **Phase decomposition** — splitting features into build-sized units with proper dependency ordering
3. **Validation design** — writing checks that verify feature existence without being brittle
4. **Cross-phase dependency mapping** — which phases produce artifacts that later phases consume
5. **Scope boundaries** — explicit "do NOT touch" lists that prevent scope leak

This translation is non-trivial. The Coil spec took significant effort to produce but resulted in a flawless build. The workbench plan skipped this step and paid for it.

### Where this sits in the pipeline

```
/consult (decisions)  →  /spec (translation)  →  /harness (execution)
     ↓                        ↓                       ↓
  SYNTHESIS.md          BUILD_SPEC.md            build-log.md
  "what to do"          "exactly how"            "what happened"
```

The middle step is what was missing tonight. The harness ran on a plan that was still at the "what to do" level of abstraction.

### The open question

Is `/spec` worth formalizing as a skill, or is it better done ad hoc? Arguments both ways:

**For a skill:**
- The translation has a repeatable structure (read synthesis → read codebase → emit phases)
- Phase sizing rules, validation patterns, and scope boundaries are formulaic
- It enforces discipline — easy to skip the detailed spec step when you're eager to build
- The one time the full pipeline ran (Coil), it produced extraordinary results

**Against a skill:**
- One data point. The Coil build might have succeeded because the project was well-suited to automation, not because the spec format was magic.
- Spec generation requires deep codebase knowledge — the kind of thing that's hard to formalize into rules
- An ad hoc spec written by someone who understands the project might be better than a formulaic one
- Over-formalizing the creative step (deciding phase boundaries, writing validation) might produce worse specs than just asking "write me a harness plan for these changes"

**Opus's recommendation:** Formalize as a **checklist + template + verify** framework, not full auto-generation. `/spec` reads consultation output and codebase, generates a *skeleton* (phase headings, file lists, validation stubs, dependency graph), flags areas needing human input, the human fills gaps, then `/spec --verify` checks internal consistency (imports match exports, all referenced files exist or are created by a prior phase, validation commands are syntactically valid). The verification step alone would have caught tonight's failure — it would have flagged that Phase 5 has no enumerated deliverables and no feature-specific validation.

---

## The Dynamic Plan Problem

When Phase 4 generates SYNTHESIS.md and Phase 5 must implement it, the plan author can't write Phase 5's detailed spec in advance. Three approaches (from Opus review):

### Option A: Two-Stage Harness Run (best for unattended builds)

1. **Harness Run 1:** Phases 1-4 (analysis, consultation, synthesis)
2. **Human reviews SYNTHESIS.md, writes detailed Phase 5+ spec**
3. **Harness Run 2:** Phases 5+ (implementation)

Cleanest solution. Respects "agents should not make scope decisions." Downside: requires human intervention between runs — not fully unattended.

### Option B: Spec-Generation Phase (risky)

Phase 4.5 reads SYNTHESIS.md and produces a detailed implementation spec. Phase 5 reads that spec instead of SYNTHESIS.md. Tempting but dangerous — you've just moved the indirection problem one step back. The spec-generation agent might produce a vague spec.

**If you do this**, validate the generated spec structurally:
```bash
python3 -c "
import re
spec = open('BUILD_SPEC.md').read()
assert '## Phase' in spec, 'No phase headings'
assert len(re.findall(r'### Files', spec)) >= 2, 'Missing file lists'
assert 'Validation:' in spec, 'No validation section'
print('Spec structure OK')
"
```

### Option C: Pre-Enumerate in the Plan (best for attended builds)

If the plan author knows the *categories* (even if not specifics), pre-constrain the scope:

```markdown
## Phase 5: Implement Synthesis Recommendations

Read SYNTHESIS.md. Implement ALL items in the following tiers:
- P0 (Critical): ALL items.
- P1 (High): ALL items.
- P2 (Medium): ALL items.
- P3 and below: SKIP.

For EACH item: note the item name, identify files to modify, make the change, verify it works.
```

Removes scope ambiguity ("ALL items in P0-P2") while still allowing the synthesis to be dynamic.

**The meta-rule:** If a phase depends on a prior phase's dynamic output, the plan must specify an **explicit completeness criterion.** Not "apply the top-priority recommendations" (subjective) but "implement ALL P0 and P1 items" (objective).

---

## The Spec Detail Spectrum (from Opus review)

Not every build needs a 2,647-line spec. Match detail level to risk and context:

| Level | What the spec contains | When to use | Effort |
|-------|----------------------|-------------|--------|
| **Maximum** (Coil-style) | Exact file contents, function signatures, literal code | Unattended overnight builds, high-risk features | Hours |
| **High** | Exact file paths, function names, type signatures, behavioral requirements. Agent fills in implementation logic. | Complex features where architecture matters | 30-60 min |
| **Medium** | Feature requirements, file paths, naming conventions, integration points. Agent makes implementation decisions within constraints. | Simpler features, low-risk changes | 15-30 min |
| **Low** (the failure) | "Read this and figure it out." | Never in a harness context | 5 min |

The economic argument: each debug cycle costs ~5 minutes (agent dispatch + validation + retry). A detailed spec that prevents 3-5 debug cycles across a build "pays for itself" if it takes less than 15-25 minutes to add that detail. For complex builds, the payoff is much higher because blocked phases cascade.

---

## Harness Skill Updates Needed (from Opus review)

Two changes to the harness skill itself:

1. **Change the ambiguity default.** Currently: "If the spec is ambiguous, implement the simplest reasonable interpretation." This directly caused tonight's failure. **Change to:** "If the spec is ambiguous, implement the most complete reasonable interpretation." Over-building is caught by validation; under-building is not.

2. **Add to the sub-agent prompt template:** "Implement exactly what is specified. Do not add unrequested features or enhancements. If the spec lists N items, implement all N items."

3. **Give debug agents the build diff.** Currently debug agents get the error output and the original spec. Adding the `git diff` of what the build agent changed would significantly improve debug accuracy — the debug agent would know whether a line was intentionally written or was a typo.

---

## Phase Ordering Heuristics (from Opus review)

When decomposing features into phases, build in this order:

1. **Scaffolding first** — config files, directory structure, dependencies
2. **Types/interfaces before implementations** — get contracts right
3. **Leaf modules before integration** — build bottom-up so each phase validates independently
4. **Tests alongside their implementation** — not as a separate phase at the end
5. **UI/integration last** — these depend on everything else

---

## Cross-Phase Contract Verification (from Opus review)

Before the harness runs, the spec should be machine-checkable for consistency: every import referenced in Phase N must be exported in some Phase M where M < N. This could be a `/spec --verify` step. Without it, Phase 5 might reference a function that Phase 3's spec calls something different, causing a guaranteed failure that wastes a build cycle.

---

---

## Plan Authoring Checklist

Before running `/harness` on a plan, verify:

### Phase sizing
- [ ] Each phase has ONE clear deliverable (not 2+ independent features)
- [ ] No phase modifies more than 4-5 files
- [ ] Each phase is completable by a sub-agent in one session (~15-30 min)
- [ ] If you can split a phase without creating a dependency, split it

### Specificity
- [ ] Every phase lists exact file paths to create or modify
- [ ] Every phase describes what to build at the function/component level
- [ ] No phase says "read Document X and decide what to implement"
- [ ] All indirection is resolved before harness execution

### Scope boundaries
- [ ] Every phase has scope boundaries (permissive for integration, restrictive for discretionary changes)
- [ ] No phase asks the agent to make architectural decisions
- [ ] Refactoring is only requested when explicitly needed
- [ ] Sub-agent prompt includes "implement exactly what is specified, no extras"

### Context
- [ ] Every phase lists what already exists from prior phases
- [ ] Naming conventions are specified (CSS namespaces, function patterns)
- [ ] API contracts are documented (endpoints, request/response shapes)

### Validation
- [ ] Validation goes beyond syntax checks
- [ ] Validation checks for feature existence (structural grep)
- [ ] Validation checks for functional correctness where possible (API tests, curl)
- [ ] Each validation criterion maps to a specific requirement in the phase spec
- [ ] Validation identifiers (IDs, class names, function names) are specified in the spec itself
- [ ] Validation is included in the spec body (not just at the end) so the agent sees it before building

### Dependencies
- [ ] Cross-phase dependencies are documented
- [ ] No phase assumes the agent knows about prior phases' work
- [ ] Files shared across phases are explicitly called out
- [ ] Cross-phase contracts are consistent (exports match imports across phases)

### Dynamic inputs
- [ ] No phase says "read Document X and decide what to implement"
- [ ] Phases depending on prior phase output have explicit completeness criteria ("ALL items" not "top items")
- [ ] If the build is unattended, consider two-stage harness (analysis → human review → implementation)

---

## Rules for an Architect/Spec Skill

If formalizing spec generation, these rules should be embedded:

1. **Resolve all indirection.** If a prior step produced a synthesis document, the spec must enumerate every item from it — not reference it by name.

2. **One coherent deliverable per phase.** The unit of work is something validatable as a unit. "The Widget model with CRUD and API route" is one deliverable (even though it's 4 files). "The Widget AND the Dashboard AND the export" is three deliverables.

3. **Validation must verify intent, not just syntax.** Every phase needs at least one check that proves the feature exists. Co-author validation identifiers with the spec — if you'll grep for `resolve-btn`, the spec must say "use id `resolve-btn`."

4. **Scope boundaries are permissive for integration, restrictive for discretion.** Allow necessary parameter additions and interface extensions. Forbid refactoring, renaming, and unrequested enhancements.

5. **Context is explicit.** Don't assume the agent knows the codebase. List the files it needs to read, the patterns it should follow, and the functions that already exist. But be mindful of context window limits — for later phases in large builds, summarize prior work rather than listing everything.

6. **Phase order minimizes rework.** Scaffolding → types/interfaces → leaf modules → integration → UI. Tests alongside their implementation, not at the end.

7. **The spec is the contract.** Once written, the harness executes it literally. If the spec is wrong, the build is wrong. Invest time in the spec, not in hoping the agent "figures it out."

8. **Default to completeness.** When agents face ambiguity, they should implement the most complete interpretation, not the simplest. Over-building is caught by validation; under-building is not.

9. **Cross-phase contracts must be consistent.** Every import in Phase N must be exported by some Phase M where M < N. A `/spec --verify` step should catch these before the harness runs.

10. **Debug agents get the diff.** Pass the `git diff` of build agent changes to debug agents alongside the error output. Knowing what was intentionally written vs. what was a typo dramatically improves debug accuracy.
