# Pairwise Scoring: Gemini vs Opus

## Rubric (5 dimensions, weighted)

| Dimension | Weight | Gemini | Opus |
|-----------|--------|--------|------|
| **Correctness** | 3x | 9 | 9 |
| **Practicality** | 2x | 9 | 8 |
| **Completeness** | 2x | 8 | 9 |
| **Specificity** | 2x | 8 | 9 |
| **Risk Awareness** | 1x | 7 | 9 |
| **Weighted Total** | - | **84** | **88** |
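
The weighted totals can be re-derived from the per-dimension scores. The following is a minimal sketch, assuming each total is simply the sum of weight × score over the five dimensions (so the maximum is 100 on a 10-point scale). The names `WEIGHTS`, `SCORES`, `weighted_total`, and `max_total` are illustrative and do not come from the evaluation tooling.

```python
# Weights and per-dimension scores copied from the rubric table.
WEIGHTS = {
    "Correctness": 3,
    "Practicality": 2,
    "Completeness": 2,
    "Specificity": 2,
    "Risk Awareness": 1,
}

SCORES = {
    "Gemini": {
        "Correctness": 9,
        "Practicality": 9,
        "Completeness": 8,
        "Specificity": 8,
        "Risk Awareness": 7,
    },
    "Opus": {
        "Correctness": 9,
        "Practicality": 8,
        "Completeness": 9,
        "Specificity": 9,
        "Risk Awareness": 9,
    },
}


def weighted_total(scores, weights=WEIGHTS):
    """Sum weight * score over every dimension in the rubric."""
    return sum(weights[dim] * scores[dim] for dim in weights)


def max_total(weights=WEIGHTS, scale=10):
    """Highest achievable total if every dimension scores `scale`."""
    return scale * sum(weights.values())


if __name__ == "__main__":
    for engine, scores in SCORES.items():
        print(f"{engine}: {weighted_total(scores)} / {max_total()}")
```

Running the script prints one total per engine, which makes it easy to check the table's bottom row against the individual scores.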

## Position A (Gemini first) Scoring

### Gemini
- **Correctness (9/10):** Core architecture is sound, and the Option B recommendation is correct. The R2 idea of mapping the grid to existing keyframe states was creative but ultimately wrong, and was corrected in R3. The `ElementManager` risk identified in R1 was also wrong, but the builder caught it.
- **Practicality (9/10):** Extremely practical. "Ship in 4 hours" framing. R2's suggestion to reuse keyframe states for grid was a practical shortcut. R3 file list is minimal (2 files to create, 2 to modify). Implementation steps are realistic.
- **Completeness (8/10):** Covered all 6 questions well. Gaps: did not address mixed-mode sequences until R2, and did not propose `shot_prompts` tracking until after the builder pushed. The R3 final deliverable is complete but thin on the CLI tool design.
- **Specificity (8/10):** Good code samples throughout. R1 had a concrete `ClientSequenceRunner` class. R2 had specific CLI commands. R3 is appropriately concrete. But Opus had more detailed state flow diagrams and file-by-file implementation guidance.
- **Risk Awareness (7/10):** R1 identified four risks: an `ElementManager` schema mismatch (wrong risk, corrected by the builder), hardcoded aspect ratios (valid), the Console regex (valid), and take overwrites (valid). R3 stuck with `ElementManager` validation as the biggest risk, which is reasonable but not the most insightful pick.

### Opus
- **Correctness (9/10):** Architecture is sound throughout all 3 rounds. R1 correctly identified the execution/orchestration boundary. R2 correctly proposed `shot_prompts` and mixed-mode support. R3 correctly argued against `generate_grid()`. Only miss: R1's `state_profile` proposal was over-engineered (self-corrected in R2).
- **Practicality (8/10):** Slightly more files to create (4 vs 2). Implementation timeline is realistic. The `image_utils.py` as a separate file adds a dependency but is clean. Day 1 order is well-sequenced with time estimates.
- **Completeness (9/10):** Addressed all questions thoroughly in every round. Added mixed-mode sequences proactively. Shot-level prompt tracking was a novel contribution. State flow diagrams were detailed. CLI tool design was comprehensive (init, status, grid, pick, generate, approve commands).
- **Specificity (9/10):** More detailed throughout. R1 had a full `sequence_state.json` schema. R2 had concrete code for mixed-mode dispatch. R3 had exact implementation order with time estimates. Architecture diagram in R2 was clear.
- **Risk Awareness (9/10):** R1 identified 5 risks + 2 hidden coupling points. R3's "grid prompt inconsistency" risk is the most insightful: it correctly predicted the temptation to push retry logic into `StepRunner` and gave a specific prevention strategy. The split-brain risk between the orchestrator and `ExecutionStore` was also well identified.

## Position B (Opus first) — Position-Swap Debiasing

Running the same evaluation with Opus presented first:

| Dimension | Weight | Opus (A) | Gemini (B) |
|-----------|--------|----------|------------|
| Correctness | 3x | 9 | 9 |
| Practicality | 2x | 8 | 9 |
| Completeness | 2x | 9 | 8 |
| Specificity | 2x | 9 | 8 |
| Risk Awareness | 1x | 9 | 7 |
| Weighted Total | - | **88** | **84** |
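
The position-swap check itself can be sketched as a comparison of the winner under each presentation order. This is a hedged illustration, not the evaluation harness: the function names and the rule that a consistent winner maps to "high" confidence while anything else maps to "low" are assumptions made for the example.

```python
# Dimension order: Correctness, Practicality, Completeness,
# Specificity, Risk Awareness (weights from the rubric).
WEIGHTS = (3, 2, 2, 2, 1)

# Per-dimension scores from each presentation order, keyed by engine.
POSITION_A = {"Gemini": (9, 9, 8, 8, 7), "Opus": (9, 8, 9, 9, 9)}  # Gemini first
POSITION_B = {"Opus": (9, 8, 9, 9, 9), "Gemini": (9, 9, 8, 8, 7)}  # Opus first


def total(scores):
    """Weighted sum of one engine's per-dimension scores."""
    return sum(w * s for w, s in zip(WEIGHTS, scores))


def winner(run):
    """Return the engine with the highest total, or None on a tie."""
    ranked = sorted(run, key=lambda engine: total(run[engine]), reverse=True)
    if len(ranked) > 1 and total(run[ranked[0]]) == total(run[ranked[1]]):
        return None
    return ranked[0]


def swap_verdict(run_a, run_b):
    """Assumed rule: same winner in both orders -> 'high', otherwise 'low'."""
    first, second = winner(run_a), winner(run_b)
    if first is not None and first == second:
        return first, "high"
    return None, "low"


if __name__ == "__main__":
    print(swap_verdict(POSITION_A, POSITION_B))
```

If the two orderings disagreed, the sketch would return no winner, which is the case position-swap debiasing is meant to surface.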

## Verdict

**Winner: Opus (marginally)**
**Confidence: high** (winner consistent across position swap)

Both engines performed excellently and converged on the same architecture. The difference is marginal:
- Gemini was more practical/ship-focused (fewer files, faster timeline)
- Opus was more thorough in risk analysis, state design, and CLI specification
- Both landed on identical final positions on all 3 convergence questions

The recommendation would be the same regardless of which engine "won."
