# Overnight Build: Workbench Refinement + Consult Skill Rewrite

**Status:** Approved
**Date:** 2026-03-09
**Working directory:** /Users/joeturnerlin/Dropbox/CLAUDE_PROJECTS/starsend
**Validation:** Python syntax check on .py files, Node syntax check on .js files, server restart + curl tests

Two parallel tracks: (A) consult Gemini and Opus on Manual Workbench UX architecture, then implement findings; (B) research AI council patterns and rewrite the /consult skill.

---

## Phase 1: Build Consultation Context Bundle

Package all Manual Workbench code into a single context document for Gemini and Opus consultations.

**Files to create:**
- `consultations/manual_workbench_ux/context.md`
- `consultations/manual_workbench_ux/round_1_prompt.md`

**What to include in context.md:**
1. Project overview (from CLAUDE.md — solo filmmaker production tool, Recoil Visual Pipeline)
2. Full contents of these files:
   - `editors/manual-workbench.html`
   - `editors/manual.js`
   - `editors/styles/manual.css`
   - `editors/tabs/canvas/previz.js` (the MANUAL button addition)
   - `editors/tabs/dailies.js` (the M key escalation)
   - Review server endpoints: grep for all `_api_manual_*` and `_api_enhance_prompt` methods from `editors/review_server.py`
3. The model profiles enrichment config from `config/model_profiles.json`
4. JT's key questions (listed below)

**Key questions for consultants:**
1. Should the Manual Workbench remain a standalone page at `/manual`, or should it be integrated as a 6th tab in the Production Console? What are the tradeoffs?
2. How should information flow between the workbench and the pipeline? Currently: flag → triage → detail → fix → resolve → return to pipeline. Is this the right flow?
3. The current detail view has: target vs output comparison, editable metadata (shot type, camera, action), editable prompt with ENHANCE (model-specific rewriting), model selector with per-model hints, failure tagging modal. Is this the right set of controls? Too many? Too few?
4. The reconciliation zone (drop files → link to shots → tag failures) — is drag-and-drop the right pattern for a solo-dev tool? Or is there a simpler approach?
5. How should stills vs video be handled? Same workbench? Different views? The user just added still models to the dropdown but the underlying pipeline is different.
6. What would make this "the most elegant and intuitive solution" for a filmmaker who needs to diagnose and fix generation failures?

**round_1_prompt.md should include:**
- The consultant persona (technical consultant reviewing a production tool)
- The full context bundle
- The 6 key questions above
- Instruction to be concrete — propose specific UX changes, not abstract advice

**Validation:** `test -f consultations/manual_workbench_ux/context.md && test -f consultations/manual_workbench_ux/round_1_prompt.md && echo "PASS"`
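
The bundle described in this phase could be assembled with a small script. The following is a minimal sketch, not the project's actual tooling: `build_bundle`, its parameters, and the section layout are hypothetical, and the extraction of the `_api_manual_*` endpoints from `editors/review_server.py` is left out.

```python
from pathlib import Path

# File list taken from the "What to include" section above.
BUNDLE_FILES = [
    "editors/manual-workbench.html",
    "editors/manual.js",
    "editors/styles/manual.css",
    "editors/tabs/canvas/previz.js",
    "editors/tabs/dailies.js",
    "config/model_profiles.json",
]


def build_bundle(root: Path, overview: str, questions: list[str]) -> str:
    """Concatenate the overview, each source file, and the key questions."""
    parts = ["# Manual Workbench Context Bundle", "", "## Project overview", overview, ""]
    for rel in BUNDLE_FILES:
        path = root / rel
        # Record missing files instead of failing, so gaps are visible in the bundle.
        body = path.read_text(encoding="utf-8") if path.is_file() else "(file not found)"
        parts += [f"## {rel}", "", "~~~", body, "~~~", ""]
    parts.append("## Key questions")
    parts += [f"{i}. {q}" for i, q in enumerate(questions, start=1)]
    return "\n".join(parts) + "\n"
```

Writing the returned string to `consultations/manual_workbench_ux/context.md` would satisfy the first half of the validation check.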

---

## Phase 2: Gemini 3-Round Consultation

Run a 3-round consultation with Gemini 3.1 Pro on the Manual Workbench architecture.

**Round 1:** Send the round_1_prompt.md through `tools/consult.py`:
```bash
python3 tools/consult.py \
  --context consultations/manual_workbench_ux/context.md \
  --prompt consultations/manual_workbench_ux/round_1_prompt.md \
  --output consultations/manual_workbench_ux/gemini_round_1.md
```

**Round 2:** Read Gemini's round 1 response. Write Claude's reply at `consultations/manual_workbench_ux/claude_round_2_gemini.md` addressing:
- Agreements and pushbacks on Gemini's recommendations
- Follow-up questions on the most impactful suggestions
- Probe specifically on: standalone vs tab, reconciliation UX, model-specific workflows

Then send to Gemini:
```bash
python3 tools/consult.py \
  --context consultations/manual_workbench_ux/context.md \
  --prompt consultations/manual_workbench_ux/claude_round_2_gemini.md \
  --output consultations/manual_workbench_ux/gemini_round_2.md \
  --prior consultations/manual_workbench_ux/gemini_round_1.md
```

**Round 3:** Read Gemini's round 2. Write Claude's convergence reply at `consultations/manual_workbench_ux/claude_round_3_gemini.md`:
- Push for final concrete recommendations
- Ask for priority ordering of changes
- Request an "if you could only change 3 things" answer

Then send to Gemini:
```bash
python3 tools/consult.py \
  --context consultations/manual_workbench_ux/context.md \
  --prompt consultations/manual_workbench_ux/claude_round_3_gemini.md \
  --output consultations/manual_workbench_ux/gemini_round_3.md \
  --prior consultations/manual_workbench_ux/gemini_round_1.md,consultations/manual_workbench_ux/gemini_round_2.md
```

**Validation:** `test -f consultations/manual_workbench_ux/gemini_round_3.md && echo "PASS"`
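
The three commands above follow one pattern: each round adds every earlier Gemini response to `--prior`. A minimal sketch of that pattern, using only the flags shown in this phase; the helper name `round_command` is hypothetical.

```python
BASE = "consultations/manual_workbench_ux"

# Prompt file per round, as named in this phase.
PROMPTS = {
    1: "round_1_prompt.md",
    2: "claude_round_2_gemini.md",
    3: "claude_round_3_gemini.md",
}


def round_command(round_no: int) -> list[str]:
    """Build the tools/consult.py argument list for one Gemini round."""
    cmd = [
        "python3", "tools/consult.py",
        "--context", f"{BASE}/context.md",
        "--prompt", f"{BASE}/{PROMPTS[round_no]}",
        "--output", f"{BASE}/gemini_round_{round_no}.md",
    ]
    # Rounds after the first pass every earlier Gemini response, comma-separated.
    prior = [f"{BASE}/gemini_round_{n}.md" for n in range(1, round_no)]
    if prior:
        cmd += ["--prior", ",".join(prior)]
    return cmd
```

Each list can be handed to `subprocess.run(..., check=True)` once the prompt file for that round has been written.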

---

## Phase 3: Opus 3-Round Consultation

Run a 3-round consultation with Claude Opus 4.6 via sub-agents on the same questions.

**Important:** Do NOT call `tools/consult.py` for Opus. Use the Agent tool to spawn sub-agents as independent consultants.

**Round 1:** Spawn an Agent with:
- The consultant persona (same as Gemini prompt)
- The full context bundle content (read from context.md)
- The 6 key questions
- Instruction to output ONLY the consultation response

Write the agent's response to `consultations/manual_workbench_ux/opus_round_1.md`.

**Round 2:** Read Opus round 1. Write Claude's reply at `consultations/manual_workbench_ux/claude_round_2_opus.md`. Spawn a new Agent with:
- Prior round transcript (opus_round_1.md + claude_round_2_opus.md)
- Push on the same follow-up areas as the Gemini track

Write the response to `consultations/manual_workbench_ux/opus_round_2.md`.

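**Round 3:** Read Opus round 2. Write Claude's convergence reply at `consultations/manual_workbench_ux/claude_round_3_opus.md`. Spawn a new Agent with all prior rounds (`opus_round_1.md`, `claude_round_2_opus.md`, `opus_round_2.md`, `claude_round_3_opus.md`). Write the response to `consultations/manual_workbench_ux/opus_round_3.md`.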

**Validation:** `test -f consultations/manual_workbench_ux/opus_round_3.md && echo "PASS"`
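
Because each Opus sub-agent starts fresh, the prior rounds have to be passed in explicitly and in conversation order. A minimal sketch of that bookkeeping, assuming the file names used in this phase; `prior_files` and `build_transcript` are hypothetical helpers.

```python
from pathlib import Path

BASE = Path("consultations/manual_workbench_ux")


def prior_files(round_no: int) -> list[Path]:
    """Files to hand the sub-agent for a given round, in conversation order."""
    files = []
    for n in range(1, round_no):
        files.append(BASE / f"opus_round_{n}.md")             # consultant's answer
        files.append(BASE / f"claude_round_{n + 1}_opus.md")  # Claude's reply to it
    return files


def build_transcript(turns: list[tuple[str, str]]) -> str:
    """Join (label, body) pairs into one transcript with a heading per turn."""
    return "\n\n".join(f"## {label}\n\n{body.strip()}" for label, body in turns)
```

Round 1 gets an empty list, so the first sub-agent sees only the persona, the context bundle, and the questions.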

---

## Phase 4: Pairwise Scoring & Synthesis

Compare the Gemini and Opus consultation outputs. Score each on a rubric, then synthesize the best recommendations.

**Scoring rubric (1-5 each):**
1. **Specificity** — Are recommendations concrete and implementable, or vague?
2. **Architectural coherence** — Do the recommendations fit the existing system?
3. **UX insight** — Does the consultant understand filmmaker workflow?
4. **Feasibility** — Can changes be implemented by a solo dev in reasonable time?
5. **Innovation** — Are there genuinely novel ideas, or just obvious suggestions?

**Process:**
1. Read all 3 Gemini round files and all 3 Opus round files
2. Score each engine on the 5 criteria
3. Identify areas of agreement (high confidence — both engines say the same thing)
4. Identify areas of disagreement (need human judgment or further investigation)
5. Create a ranked list of recommendations with source attribution
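
Step 2 and the Scoring table in SYNTHESIS.md can be produced mechanically once the 1-5 scores are chosen. The sketch below is one way to do that; `scoring_table` is a hypothetical helper, and the scores themselves still come from reading the six round files.

```python
CRITERIA = ["Specificity", "Architectural coherence", "UX insight", "Feasibility", "Innovation"]


def scoring_table(gemini: dict[str, int], opus: dict[str, int]) -> str:
    """Render the Scoring table for SYNTHESIS.md from 1-5 rubric scores."""
    rows = ["| Criterion | Gemini | Opus | Winner |", "|-----------|--------|------|--------|"]
    for name in CRITERIA:
        g, o = gemini[name], opus[name]
        if not (1 <= g <= 5 and 1 <= o <= 5):
            raise ValueError(f"{name}: scores must be between 1 and 5")
        winner = "Gemini" if g > o else "Opus" if o > g else "Tie"
        rows.append(f"| {name} | {g} | {o} | {winner} |")
    g_total = sum(gemini[c] for c in CRITERIA)
    o_total = sum(opus[c] for c in CRITERIA)
    rows.append(f"| **Total** | {g_total} | {o_total} | |")
    return "\n".join(rows)
```

The total row is an addition for convenience; the per-criterion winners matter more than the sum, since the criteria are not weighted.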

**Files to create:**
- `consultations/manual_workbench_ux/SCORING.md` — rubric scores + justification
- `consultations/manual_workbench_ux/SYNTHESIS.md` — final recommendations, ranked

**SYNTHESIS.md structure:**
```markdown
# Manual Workbench Architecture — Synthesis

## Scoring
| Criterion | Gemini | Opus | Winner |
|-----------|--------|------|--------|
| ... | ... | ... | ... |

## Agreed Recommendations (both engines)
{numbered list}

## Gemini-Only Recommendations
{with assessment of whether to adopt}

## Opus-Only Recommendations
{with assessment of whether to adopt}

## Disagreements
{with resolution}

## Implementation Priority
1. {highest impact, lowest effort}
2. ...
3. ...

## Changes NOT to make
{recommendations that were considered but rejected, with reasoning}
```

**Validation:** `test -f consultations/manual_workbench_ux/SYNTHESIS.md && echo "PASS"`
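
The `test -f` check confirms the file exists but not that it follows the structure above. A stricter check, sketched here with a hypothetical `missing_headings` helper, looks for each required section heading.

```python
REQUIRED_HEADINGS = [
    "## Scoring",
    "## Agreed Recommendations (both engines)",
    "## Gemini-Only Recommendations",
    "## Opus-Only Recommendations",
    "## Disagreements",
    "## Implementation Priority",
    "## Changes NOT to make",
]


def missing_headings(markdown: str) -> list[str]:
    """Return the required section headings that do not appear as a full line."""
    lines = {line.strip() for line in markdown.splitlines()}
    return [h for h in REQUIRED_HEADINGS if h not in lines]
```

An empty result means every section is present; Phase 5 depends on "Implementation Priority" in particular.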

---

## Phase 5: Implement Workbench Refinements

Apply the top-priority recommendations from SYNTHESIS.md to the Manual Workbench code.

**Read SYNTHESIS.md first** to determine what to implement. Do NOT guess — only implement what the synthesis recommends.

**Files likely to be modified:**
- `editors/manual-workbench.html`
- `editors/manual.js`
- `editors/styles/manual.css`
- `editors/review_server.py` (endpoints)
- Possibly `editors/tabs/canvas/previz.js` or `editors/tabs/canvas/main.js`

**Constraints:**
- Do NOT break existing Production Console functionality
- Do NOT change the data model (gate_results.manual_escalated pattern)
- Keep the lightweight aesthetic the user explicitly praised
- Bump cache version on all modified assets
- Restart the review server and test the affected endpoints with curl after changes
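
The cache-version bump can be scripted. The sketch below assumes assets are referenced with a `?v=<integer>` query string (for example `manual.js?v=12`); that format is an assumption, not something this plan confirms, so check how `editors/manual-workbench.html` actually versions its assets before using it. `bump_cache_versions` is a hypothetical helper.

```python
import re

# Assumption: assets are referenced as e.g. manual.js?v=12 in the HTML.
VERSION_RE = re.compile(r"(?P<asset>[\w./-]+\.(?:js|css))\?v=(?P<ver>\d+)")


def bump_cache_versions(html: str, modified: set[str]) -> str:
    """Increment ?v=N for every referenced asset whose file name is in `modified`."""
    def bump(match: re.Match) -> str:
        asset = match.group("asset")
        if asset.rsplit("/", 1)[-1] not in modified:
            return match.group(0)  # untouched asset: leave its version alone
        return f"{asset}?v={int(match.group('ver')) + 1}"
    return VERSION_RE.sub(bump, html)
```

Only assets named in `modified` change, which keeps unmodified files cached in the browser.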

**Validation:**
```bash
python3 -c "import ast; ast.parse(open('editors/review_server.py').read())" && \
node -c editors/manual.js && \
echo "Syntax OK"
```

---

## Phase 6: Research AI Council Patterns

Research best practices for multi-LLM consultation, debate, and ensemble decision-making.

**Search topics:**
1. "LLM debate" / "multi-LLM discussion" — papers on using multiple models to improve reasoning
2. "LLM-as-judge" — using one model to evaluate another's output
3. "Society of Mind" AI architectures — Minsky-inspired multi-agent approaches
4. "Constitutional AI" and "AI safety via debate" — Anthropic's self-critique and revision against written principles, and (separately) debate between models in front of a judge
5. "Mixture of Experts" consultation patterns — not MoE architecture, but using multiple expert models
6. "AI ensemble methods" — combining outputs from multiple models
7. "Adversarial collaboration" in AI — models that argue opposing positions
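8. "Chain-of-Verification" — a model drafts an answer, generates verification questions, answers them independently, then revises; also cross-model variants where a second model verifies the first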
9. Practical frameworks: LangChain multi-agent, CrewAI, AutoGen council patterns

**Output:** Create `consultations/ai_council_research/research_notes.md` with:
- Key findings from each source
- Patterns that map to JT's use cases
- Effectiveness data (which patterns actually improve output quality?)
- Practical implementation notes

**JT's stated use cases for the consult skill:**
1. Quick 1-round consultation (single engine) — "just ask Gemini a quick question"
2. 3-round convergent consultation (single engine) — current default
3. Dual-engine parallel with pairwise scoring — what we're doing in this build
4. Adversarial debate — models argue opposing positions, then a judge scores
5. Council mode (3+ models) — for critical architectural decisions

**Validation:** `test -f consultations/ai_council_research/research_notes.md && echo "PASS"`

---

## Phase 7: Design Consult Skill Modes

Based on Phase 6 research, design the new /consult skill with multiple consultation modes.

**Read the research notes from Phase 6 first.**

**Read the current skill:** `~/.claude/skills/consult/SKILL.md`

**Design document:** Create `consultations/ai_council_research/skill_design.md` with:

1. **Mode taxonomy** — Each mode with:
   - Name and short description
   - When to use it
   - Number of rounds
   - Number of engines
   - Cost estimate
   - Example invocation

2. **CLI interface design** — How the user invokes each mode:
   ```
   /consult "topic"                          # default: 3-round single engine
   /consult --quick "topic"                  # 1-round, single engine
   /consult --dual "topic"                   # 3+3 rounds, Gemini + Opus, pairwise scoring
   /consult --adversarial "topic"            # 2 models argue, judge scores
   /consult --council "topic"                # 3+ models, majority vote
   ```

3. **Output format** for each mode — What files get created, what the synthesis looks like

4. **Engine selection** — Which models for which modes, cost/quality tradeoffs

5. **Scoring rubric** — Standard rubric used across all comparison modes
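
The flag semantics in item 2 (one optional mode flag, with the bare invocation keeping today's behavior) can be pinned down with a small parser. This is only an illustration of the intended rules: the skill itself is a SKILL.md prompt rather than a Python CLI, and `parse_consult_args` is a hypothetical name.

```python
import argparse


def parse_consult_args(argv: list[str]) -> argparse.Namespace:
    """Parse a /consult invocation; with no mode flag, fall back to the 3-round default."""
    parser = argparse.ArgumentParser(prog="/consult")
    parser.add_argument("topic")
    modes = parser.add_mutually_exclusive_group()
    for flag in ("quick", "dual", "adversarial", "council"):
        # Each flag stores its own name in args.mode; at most one may be given.
        modes.add_argument(f"--{flag}", dest="mode", action="store_const", const=flag)
    parser.set_defaults(mode="default")
    return parser.parse_args(argv)
```

Making the flags mutually exclusive means a combination such as `--quick --dual` is rejected rather than silently resolved, which is one reasonable choice; the design document may decide otherwise.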

**Constraints:**
- Keep it practical for a solo dev — no modes that cost >$5 per consultation
- Keep backward compatibility with current `/consult` invocations
- The default behavior should be the same as today (3-round Gemini)
- New modes are opt-in via flags
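
The $5 ceiling can be checked per mode once a per-call cost is known. In the sketch below the per-call price is a caller-supplied input, not a real figure, and the model of one call per engine per round ignores extras such as a scoring or judging pass; `Mode` and `within_budget` are hypothetical names.

```python
from dataclasses import dataclass

COST_CAP_USD = 5.00  # from the constraint above


@dataclass(frozen=True)
class Mode:
    name: str
    rounds: int
    engines: int

    def calls(self) -> int:
        """One model call per engine per round."""
        return self.rounds * self.engines


def within_budget(mode: Mode, usd_per_call: float) -> bool:
    """True if the mode's estimated cost stays at or under the cap."""
    return mode.calls() * usd_per_call <= COST_CAP_USD
```

For example, `Mode("dual", rounds=3, engines=2)` makes six calls, so it fits the cap only if the average call costs about $0.83 or less.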

**Validation:** `test -f consultations/ai_council_research/skill_design.md && echo "PASS"`

---

## Phase 8: Rewrite Consult Skill

Rewrite the /consult skill based on the design document from Phase 7.

**Read the design document from Phase 7 first.**
**Read the current skill:** `~/.claude/skills/consult/SKILL.md`

**Files to modify:**
- `~/.claude/skills/consult/SKILL.md` — Complete rewrite with new modes

**The rewritten skill must:**
1. Support all modes designed in Phase 7
2. Keep backward compatibility (bare `/consult "topic"` works the same as today)
3. Include clear documentation for each mode
4. Include the execution protocol for each mode (step-by-step, like the current skill)
5. Include the scoring rubric for comparison modes
6. Include cost estimates for each mode
7. Include examples

**Constraints:**
- Do NOT delete the existing Gemini consultation protocol — extend it
- The skill file must be self-contained (no external dependencies beyond tools/consult.py)
- Keep the file under 500 lines (the current one is ~250)

**Validation:** `f=~/.claude/skills/consult/SKILL.md; test -f "$f" && grep -q quick "$f" && grep -q dual "$f" && grep -q adversarial "$f" && grep -q council "$f" && echo "PASS"`
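
A fuller check, covering every mode name and the 500-line constraint, could look like the sketch below. `check_skill` is a hypothetical helper; it tests for the mode names as plain substrings, which is a weak signal that a mode is documented, not proof.

```python
MODES = ("quick", "dual", "adversarial", "council")
MAX_LINES = 500


def check_skill(text: str) -> list[str]:
    """Return a list of problems; an empty list means the skill file passes."""
    problems = [f"mode not mentioned: {m}" for m in MODES if m not in text]
    line_count = len(text.splitlines())
    if line_count >= MAX_LINES:
        problems.append(f"file has {line_count} lines; limit is under {MAX_LINES}")
    return problems
```

Run it on the contents of `~/.claude/skills/consult/SKILL.md` after the rewrite and treat any returned item as a failure.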

---

## Phase 9: Update tools/consult.py

Update the Python consultation tool to support any new features needed by the rewritten skill.

**Read the skill design from Phase 7 and the rewritten skill from Phase 8.**
**Read the current tool:** `tools/consult.py`

**Potential changes:**
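- Support for the `--prior` flag to include prior round transcripts (Phase 2 already passes it, including a comma-separated list in Round 3, so confirm it exists and accepts that form)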
- Support for `--mode` flag if the tool needs mode-awareness
- Better error handling and retry logic
- Token counting improvements
- Any new flags needed by the new skill modes
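
For the retry item, a generic wrapper with exponential backoff is usually enough. The sketch below is independent of how `tools/consult.py` is currently written; `with_retry` and its parameters are hypothetical, and a real version would likely catch only the API client's transient error types rather than every exception.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.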

**If no changes are needed** (the skill handles everything through prompt engineering and sub-agents), create a brief `consultations/ai_council_research/tools_assessment.md` explaining why no changes were needed.

**Validation:** `python3 -c "import ast; ast.parse(open('tools/consult.py').read())" && echo "PASS"`