# Evaluation Pipeline

> The key insight: **Decomposed evaluation with structured rubrics is more reliable than holistic judgment.**

Asking "which option is better?" produces answers that vary from run to run. Asking "which option has clearer thematic integration?" produces far more consistent ones.

---

## Pipeline Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     OPTIONS FROM SPECIALISTS                     │
│                    (2-3 options per agent)                       │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGE 1: BINARY GATES                                          │
│  Pass/fail on objective criteria                                 │
│  → Eliminate options that violate hard constraints               │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGE 2: RUBRIC SCORING                                        │
│  Score each dimension 1-10 with calibration examples             │
│  → Rank options by composite score                               │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGE 3: PAIRWISE COMPARISON                                   │
│  Head-to-head with reasoning-before-judgment                     │
│  → Select winner from top candidates                             │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│  STAGE 4: ADVERSARIAL REVIEW (Optional)                         │
│  Advocate A vs Advocate B with Judge                             │
│  → Final validation for high-stakes decisions                    │
└─────────────────────────────┬───────────────────────────────────┘
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                     WINNER + REASONING                           │
│              (Presented to user for approval)                    │
└─────────────────────────────────────────────────────────────────┘
```

---

## Stage 1: Binary Gates

**Purpose:** Eliminate options that violate objective constraints.

**Method:** Checklist verification. Any failure = option eliminated.

**See:** `binary_gates.md` for full criteria.

**Output:** List of surviving options (may be reduced from input).

**If all options fail:** Report failures, request regeneration from specialist.
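
A minimal sketch of the checklist logic, assuming each option is a dict with a `name` key and each gate is a function returning `None` on pass or a failure reason on fail. The gate `require_theme_statement` and the `theme_rationale` field are hypothetical; the real criteria live in `binary_gates.md`.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A gate returns None when the option passes, or a human-readable failure reason.
Gate = Callable[[dict], Optional[str]]


def require_theme_statement(option: dict) -> Optional[str]:
    # Hypothetical gate: the option must say how it serves the theme.
    if not option.get("theme_rationale"):
        return "Missing required element (no thematic connection stated)"
    return None


def run_binary_gates(
    options: List[dict], gates: List[Gate]
) -> Tuple[List[dict], Dict[str, List[str]]]:
    """Return (survivors, failures). Any single gate failure eliminates an option."""
    survivors: List[dict] = []
    failures: Dict[str, List[str]] = {}
    for option in options:
        reasons = [r for r in (gate(option) for gate in gates) if r is not None]
        if reasons:
            failures[option["name"]] = reasons
        else:
            survivors.append(option)
    return survivors, failures


options = [
    {"name": "Option A", "theme_rationale": "Embodies trust across divides"},
    {"name": "Option B"},  # no thematic connection stated
]
survivors, failures = run_binary_gates(options, [require_theme_statement])
if not survivors:
    print("GATE FAILURE: All options failed binary gates.")
```

An empty `survivors` list is the signal to report `failures` back to the specialist and request regeneration.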

---

## Stage 2: Rubric Scoring

**Purpose:** Score surviving options on multiple dimensions.

**Method:**
1. For each dimension, include rubric definition + calibration examples
2. Score 1-10 with explicit reasoning
3. Self-consistency check: does reasoning support score?
4. Calculate composite score (weighted average)

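**Dimensions:**
- Thematic Integration (25%)
- Dramatic Potential (20%)
- Character Coherence (20%)
- Audience Engagement (20%)
- Execution Difficulty (15%)
- Unpredictability (15%, newly added)

With the sixth dimension added, the weights sum to 115%. Treat them as relative weights and divide the weighted sum by 1.15 so the composite stays on the 1-10 scale.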

**See:** `rubrics/` directory for calibrated rubrics.

**Note on Unpredictability:** This 6th dimension measures whether options contain surprising elements that subvert genre expectations. See `rubrics/unpredictability.md` for calibration.

**Output:** Ranked list of options with scores and reasoning.
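
The composite calculation can be sketched as below. Because the listed weights sum to 115 rather than 100, this sketch divides by the total weight; that normalization is an assumption, though it reproduces the composites shown for Options A and B in the Example Run. The dictionary keys are illustrative names, not identifiers from the rubric files.

```python
from typing import Dict

# Relative weights taken from the dimension list; key names are illustrative.
WEIGHTS: Dict[str, float] = {
    "thematic_integration": 25,
    "dramatic_potential": 20,
    "character_coherence": 20,
    "audience_engagement": 20,
    "execution_difficulty": 15,
    "unpredictability": 15,
}


def composite_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of 1-10 dimension scores, normalized by total weight."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(scores[dim] * w for dim, w in weights.items())
    return weighted_sum / total_weight


# Option A's scores from the Example Run.
option_a = {
    "thematic_integration": 6,
    "dramatic_potential": 7,
    "character_coherence": 7,
    "audience_engagement": 6,
    "execution_difficulty": 8,
    "unpredictability": 5,
}
print(round(composite_score(option_a), 1))  # 6.5
```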

---

## Stage 3: Pairwise Comparison

**Purpose:** Select winner from top candidates through head-to-head comparison.

**Method:**
1. Take top 2-3 options from Stage 2
2. Run pairwise comparison with reasoning-before-judgment
3. Optional: Ensemble voting (run 3-5 times, majority wins)

**Key Technique:** Force reasoning BEFORE stating preference. The verdict then has to follow from the stated analysis, which makes repeated runs agree more often.

**See:** `pairwise_comparison.md` for methodology and prompt templates.

**Output:** Winner with detailed reasoning.
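
Ensemble voting over one pair might look like the sketch below. The `judge` callable is a stand-in for whatever model call runs the pairwise prompt; its signature (two option texts in, `"A"` or `"B"` out) is an assumption for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Tuple


def ensemble_pairwise(
    option_a: str,
    option_b: str,
    judge: Callable[[str, str], str],
    runs: int = 5,
) -> Tuple[Optional[str], Dict[str, int]]:
    """Run the same comparison `runs` times; return (winner, vote counts).

    The winner is None when the votes are tied, which can only happen
    with an even number of runs.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes: Counter = Counter()
    for _ in range(runs):
        choice = judge(option_a, option_b)
        if choice not in ("A", "B"):
            raise ValueError(f"judge returned unexpected choice: {choice!r}")
        votes[choice] += 1
    counts = {"A": votes["A"], "B": votes["B"]}
    if counts["A"] == counts["B"]:
        return None, counts
    return ("A" if counts["A"] > counts["B"] else "B"), counts
```

Using an odd number of runs (3 or 5) avoids ties by construction.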

---

## Stage 4: Adversarial Review (Optional)

**Purpose:** Final validation for high-stakes decisions.

**Method:**
1. Advocate A makes strongest case FOR option A
2. Advocate B makes strongest case FOR option B
3. Judge evaluates arguments, selects more convincing
4. Winner is the option whose advocate was more persuasive

**When to Use:**
- Major structural decisions (act breaks, key plot points)
- Irreversible choices (protagonist identity, ending direction)
- Close calls from Stage 3 (margin < 10%)

**See:** `adversarial_review.md` for methodology.

**Output:** Final winner with adversarial analysis.
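
The "When to Use" rules can be expressed as a small predicate. The source does not say how the 10% margin is measured, so this sketch assumes it is the relative gap between the two finalists' composite scores; the `high_stakes` flag is a hypothetical input covering the structural and irreversible cases.

```python
def needs_adversarial_review(
    top_score: float,
    runner_up_score: float,
    high_stakes: bool = False,
    margin_threshold: float = 0.10,
) -> bool:
    """Decide whether to run Stage 4.

    Assumption: "margin" is (top - runner_up) / top on composite scores.
    """
    if high_stakes:
        return True
    if top_score <= 0:
        raise ValueError("top_score must be positive")
    margin = (top_score - runner_up_score) / top_score
    return margin < margin_threshold


print(needs_adversarial_review(7.2, 6.8))  # True: gap is about 5.6%
print(needs_adversarial_review(7.5, 6.5))  # False: gap is about 13.3%
```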

---

## Reliability Boosters

### 1. Calibration Shots
Always include 2-3 examples in rubric prompts:
```
"Here's an example of a 4/10 on thematic integration: [example]
Here's a 7/10: [example]
Now score the following..."
```
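
One way to assemble such a prompt programmatically, as a sketch: the function name, parameters, and exact wording are illustrative, and the calibration examples themselves would come from the `rubrics/` directory.

```python
from typing import List, Tuple


def build_rubric_prompt(
    dimension: str,
    rubric: str,
    calibration: List[Tuple[int, str]],
    candidate: str,
) -> str:
    """Build a scoring prompt with 2-3 calibration examples, lowest score first."""
    if not 2 <= len(calibration) <= 3:
        raise ValueError("provide 2-3 calibration examples")
    lines = [f"Dimension: {dimension}", f"Rubric: {rubric}", ""]
    for score, example in sorted(calibration, key=lambda pair: pair[0]):
        lines.append(f"Here's an example of a {score}/10 on {dimension}: {example}")
    lines += [
        "",
        "Now score the following option from 1 to 10, giving your reasoning first:",
        candidate,
    ]
    return "\n".join(lines)


prompt = build_rubric_prompt(
    "thematic integration",
    "How directly does the option embody the story's central question?",
    [(7, "[example]"), (4, "[example]")],
    "[option text]",
)
print(prompt)
```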

### 2. Reasoning-Before-Judgment
Force analysis before conclusion:
```
1. First, analyze strengths of Option A
2. Then, analyze strengths of Option B
3. Finally, state your choice
```
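
The ordering can also be enforced when parsing the response. The sketch below assumes the prompt asks for the verdict on the final line as `CHOICE: A` or `CHOICE: B`; that output convention is an illustration, not the template from `pairwise_comparison.md`.

```python
import re

PAIRWISE_TEMPLATE = """Compare the two options below.

1. First, analyze strengths of Option A
2. Then, analyze strengths of Option B
3. Finally, state your choice on the last line as: CHOICE: A or CHOICE: B

Option A:
{a}

Option B:
{b}
"""


def parse_choice(response: str) -> str:
    """Return "A" or "B", rejecting responses that skip or front-load the verdict."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("no reasoning before the choice")
    if any(ln.startswith("CHOICE:") for ln in lines[:-1]):
        raise ValueError("choice stated before the reasoning was finished")
    match = re.fullmatch(r"CHOICE:\s*([AB])", lines[-1])
    if match is None:
        raise ValueError("last line is not a valid CHOICE line")
    return match.group(1)


prompt = PAIRWISE_TEMPLATE.format(a="[option A text]", b="[option B text]")
print(parse_choice("A has clear stakes.\nB is more original.\nCHOICE: B"))  # B
```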

### 3. Ensemble Voting
For close decisions, run pairwise comparison 3-5 times and take majority.

### 4. Self-Consistency Checks
After generating a score, verify reasoning supports it. If mismatch, re-evaluate.

### 5. Decomposition
Break "is this good?" into 5-8 specific questions. Aggregate scores.

---

## Cost Optimization

| Stage | Recommended Model | Rationale |
|-------|-------------------|-----------|
| Binary Gates | Haiku | Objective checks, low complexity |
| Rubric Scoring | Sonnet | Nuanced judgment with reasoning |
| Pairwise Comparison | Sonnet | Analysis and comparison |
| Adversarial Review | Opus | High-stakes, maximum reliability |

**Estimated cost per decision:**
- Simple decision (no adversarial): ~$0.05-0.10
- Complex decision (with adversarial): ~$0.20-0.50

---

## Integration Points

### Input
Options from specialist agents (Character, World, Plot), each with:
- The proposal itself
- How it serves the theme
- How it creates dramatic tension
- Potential complications/trade-offs

### Output
Winner with:
- Full proposal content
- Composite score breakdown
- Pairwise reasoning (if applicable)
- Adversarial analysis (if applicable)

### State Updates
After user approval, winner is recorded in `decisions_log.json`:
```json
{
  "decision_id": "asi-bridge-001",
  "objective": "Choose anchor type for protagonist",
  "timestamp": "2026-01-14T...",
  "options_considered": 3,
  "options_eliminated": 0,
  "scores": {
    "option_a": 7.2,
    "option_b": 6.8,
    "option_c": 8.1
  },
  "winner": "option_c",
  "reasoning": "The Mirror anchor type directly embodies the thematic question...",
  "user_approved": true
}
```
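
A minimal sketch of the write step, assuming `decisions_log.json` holds a JSON array of records shaped like the one above. The function name and the array layout are assumptions; the source only shows a single record.

```python
import json
from pathlib import Path


def record_decision(record: dict, log_path: str = "decisions_log.json") -> None:
    """Append an approved decision to the log, creating the file if needed."""
    if not record.get("user_approved"):
        raise ValueError("only user-approved decisions are recorded")
    path = Path(log_path)
    # Assumption: the log file is a JSON array of decision records.
    log = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    log.append(record)
    path.write_text(json.dumps(log, indent=2), encoding="utf-8")
```

Refusing unapproved records keeps the log aligned with the rule that state only changes after user approval.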

---

## Error Handling

### All options eliminated in Stage 1
```
GATE FAILURE: All options failed binary gates.

Failures:
- Option A: Contradicts established canon (ASI cannot read minds)
- Option B: Missing required element (no thematic connection stated)
- Option C: Logic chain broken (effect precedes cause)

ACTION: Return to specialist with feedback, request regeneration.
```

### Tie in Stage 3
Run ensemble voting (3-5 iterations). If still tied, escalate to adversarial review.

### Self-consistency failure
Re-score the dimension. If persistent mismatch, flag for human review.
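
The re-score loop might be sketched as follows. Both callables are hypothetical stand-ins: `score_fn` produces a score with its reasoning, and `check_fn` decides whether the reasoning supports the score. The retry limit of 3 is an assumption.

```python
from typing import Callable, Tuple


def score_with_consistency_check(
    score_fn: Callable[[], Tuple[int, str]],
    check_fn: Callable[[int, str], bool],
    max_attempts: int = 3,
) -> dict:
    """Re-score until reasoning and score agree; flag for human review otherwise."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    score, reasoning = 0, ""
    for attempt in range(1, max_attempts + 1):
        score, reasoning = score_fn()
        if check_fn(score, reasoning):
            return {
                "score": score,
                "reasoning": reasoning,
                "attempts": attempt,
                "needs_human_review": False,
            }
    # Persistent mismatch: keep the last result but flag it.
    return {
        "score": score,
        "reasoning": reasoning,
        "attempts": max_attempts,
        "needs_human_review": True,
    }
```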

---

## Example Run

**Objective:** Choose anchor type for protagonist in ASI-BRIDGE

**Input:** 3 options from Character Agent

**Stage 1: Binary Gates**
```
Option A (Cub): ✓ Pass all gates
Option B (Ghost): ✓ Pass all gates
Option C (Mirror): ✓ Pass all gates
```

**Stage 2: Rubric Scoring**
```
                        A      B      C
Thematic Integration   6      5      9
Dramatic Potential     7      6      8
Character Coherence    7      8      7
Audience Engagement    6      5      8
Execution Difficulty   8      7      5
Unpredictability       5      4      7

Composite:            6.5    5.8    7.5
```

**Stage 3: Pairwise Comparison**
```
A vs C: C wins (stronger thematic embodiment)
B vs C: C wins (more dramatic potential)
Winner: C (The Mirror)
```

**Stage 4: Adversarial Review** (high-stakes decision)
```
Advocate C: "The Mirror is the only anchor type that directly embodies
the thematic question of trust across cognitive divides..."

Advocate A: "The Cub provides emotional stakes without the risk of a
non-human relationship feeling cold..."

Judge: Advocate C's argument is more compelling because the theme is
the core of the story. The Mirror's risks are worth the thematic payoff.

Final Winner: C (The Mirror)
```

**Output to User:**
```
═══════════════════════════════════════════════════════════════
EVALUATION COMPLETE: Anchor Type Selection
═══════════════════════════════════════════════════════════════

RECOMMENDATION: THE MIRROR (The ASI itself)

SCORE: 7.5/10 composite
  • Thematic Integration: 9/10 (directly embodies the question)
  • Dramatic Potential: 8/10 (trust with non-human creates tension)
  • Character Coherence: 7/10 (fits isolated protagonist)
  • Audience Engagement: 8/10 (unique relationship creates ache)
  • Execution Difficulty: 5/10 (harder to write, higher payoff)
  • Unpredictability: 7/10

PAIRWISE RESULT: Won against both alternatives

ADVERSARIAL REVIEW: Passed (stronger thematic argument)

───────────────────────────────────────────────────────────────
Do you approve this selection? [Y/N/Discuss]
═══════════════════════════════════════════════════════════════
```