Thank you for the direct analysis. Several of your points reshape the architecture significantly. Here's where I agree, where I push back, and where I need clarification.

## AGREED — Architecture Changes I'm Adopting

### 1. Drop Grid Templates
You're right that blank grid lines will be interpreted as scene elements, not structural guides. Grids are dead. I'm removing the `assets/grid_templates/` directory entirely.

### 2. Native 9:16 Production — "Native Vertical Batch" Pipeline
Your 3-pass strategy (ENV anchor → Flash exploration → Pro final render) is superior to the triptych+crop approach. The aspect ratio math is irrefutable — 27% crop loss on carefully framed shots is unacceptable.

**However**, I want to keep triptychs as an *optional* consistency-checking tool, not as the production path. Here's why: the original V4 test proved that within-triptych consistency is nearly perfect (all 3 panels share environment, lighting, character). This property is valuable for continuity QC. So:
- **Production path:** Native 9:16 frames (your pipeline)
- **QC/consistency check:** Optional triptych generation to verify environment/character lock across sequential frames

### 3. Reference Image Mirroring for Screen Direction
Brilliant. I was trying to solve screen direction through text prompts, which you correctly identify as weakly bound. Flipping the reference image itself fixes direction in the pixel data instead of relying on the prompt to bind it. I'm adding `is_mirrored: bool` to `ReferenceImage`.
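
A minimal sketch of how the flag could be set per shot. Only `ReferenceImage` and `is_mirrored` come from the plan above; the `facing` field, the `role` labels, and both helper functions are hypothetical names for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ReferenceImage:
    path: str
    role: str                 # hypothetical label, e.g. "character" or "scene"
    facing: str               # assumed metadata: direction the subject faces in the file
    is_mirrored: bool = False


def for_screen_direction(ref: ReferenceImage, wanted: str) -> ReferenceImage:
    """Return a copy flagged for mirroring when the stored facing differs from the shot's."""
    if wanted not in ("left", "right"):
        raise ValueError(f"unknown screen direction: {wanted!r}")
    return replace(ref, is_mirrored=(ref.facing != wanted))


def effective_facing(ref: ReferenceImage) -> str:
    """Direction the subject faces once the optional horizontal flip is applied."""
    if not ref.is_mirrored:
        return ref.facing
    return "left" if ref.facing == "right" else "right"
```

The flag only records the decision. The actual pixel flip would happen when the image bytes are loaded for upload, for example with Pillow's `ImageOps.mirror`.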

### 4. Recency Bias in Part Ordering
Reversing the current order: Character refs FIRST (lowest attention weight) → Scene refs → Prompt text LAST (highest attention). This is a significant change from `test_nbp_direct.py`, where character refs came last.

### 5. Kinetic Descriptors Over ALL CAPS Shouting
Replacing `ACTION & EMOTION (this is what's HAPPENING):` blocks with camera-artifact-based kinetic language: motion blur, dynamic pose, off-axis framing, dust kicked into lens. This maps to your actual latent space tokens better than semantic emphasis via capitalization.

### 6. Positive Constraints Over Negative Language
Replacing "no extra fingers, no deformed hands" with "anatomically flawless hands, exactly five fingers." Aligning with your diffusion process preference.

### 7. Lighting Vector Locking
Adding explicit directional lighting coordinates ("amber light casting hard shadows from TOP LEFT") instead of generic lighting descriptions. This should lock environment consistency across shots in the same scene.
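
One way to keep the direction identical across shots is to store it once per scene and render the same phrase into every prompt. This is a sketch under that assumption; `LightingVector` and `with_locked_lighting` are hypothetical names, not existing pipeline code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LightingVector:
    color: str     # e.g. "amber"
    quality: str   # shadow quality, e.g. "hard" or "soft"
    origin: str    # where the key light comes from, e.g. "TOP LEFT"

    def phrase(self) -> str:
        return f"{self.color} light casting {self.quality} shadows from {self.origin}"


def with_locked_lighting(shot_prompt: str, lighting: LightingVector) -> str:
    """Append the scene's lighting phrase so every shot states the same direction."""
    base = shot_prompt.rstrip(". ")
    return f"{base}. Lighting: {lighting.phrase()}."
```

Because the phrase is generated from one frozen object per scene, two shots in the same scene cannot drift to different wording for the light source.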

---

## PUSHBACK — Where I Think You're Wrong or Incomplete

### 1. Reference Ordering: I Disagree
You say: "Pass Character Refs *first*, then Scene Refs, then the immediate textual prompt."
But then your `compile_parts` code sorts by weight (lowest first = character, highest last = scene/pose), which puts Scene Refs closer to the prompt text.

These contradict. Which is it? If recency bias means "closest to the prompt = most influential," then for character shots I want the character identity to be the strongest signal. Scene can vary slightly; character identity CANNOT.

My proposed ordering for character shots:
1. Scene ref (low weight — sets environment context)
2. Pose/composition ref from Flash exploration (medium weight)
3. Character refs (high weight — identity must dominate)
4. Prompt text

For ENV shots:
1. Location ref if available (medium weight)
2. Previous scene ref (high weight)
3. Prompt text
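
The two orderings above can be sketched as a sort by priority with the prompt text appended last. This is not your `compile_parts`; it is a hypothetical rewrite to show the order I mean. The integer weights are only a local sort key, not an API parameter.

```python
# Hypothetical priorities: higher = placed closer to the prompt text.
CHARACTER_SHOT_WEIGHTS = {"scene": 1, "pose": 2, "character": 3}
ENV_SHOT_WEIGHTS = {"location": 2, "previous_scene": 3}


def compile_parts(refs, prompt_text, weights):
    """Order (kind, payload) refs from lowest to highest priority, prompt text last.

    sorted() is stable, so several refs of the same kind keep their given order.
    """
    ordered = sorted(refs, key=lambda ref: weights[ref[0]])
    return [payload for _, payload in ordered] + [prompt_text]


character_shot = compile_parts(
    [("character", "jinx_a"), ("scene", "env_anchor"),
     ("character", "jinx_b"), ("pose", "flash_hero")],
    "PROMPT",
    CHARACTER_SHOT_WEIGHTS,
)
# character_shot == ["env_anchor", "flash_hero", "jinx_a", "jinx_b", "PROMPT"]
```

If recency bias works the other way around, only the weight tables need to change, not the call sites.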

### 2. The "Blank Stare" Bug — What's the Workaround?
You identified that neutral reference images override text prompts for facial expression. This is a critical problem because Jinx's 12 curated picks are:
- 4 neutral
- 3 focused
- 3 exhausted/tired
- 2 from non-frontal angles

If the shot requires "screaming in terror" (e.g., Shot 30, THE TURN when she discovers the Harvest), I'm stuck. Options:
a) Generate expression-specific reference images on-the-fly using Flash (cheap, but adds a generation step)
b) Use only non-facial reference images for extreme emotion shots (body refs only, drop face refs)
c) Find the closest emotional match in the picks (exhausted for fear?)
d) Some other technique you know?
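
To make the options concrete, here is a sketch of how the pipeline might route a shot: exact expression match first, then option (c), then option (a). The function name, the strategy labels, and the substitutes table are hypothetical, and whether "exhausted" is an acceptable stand-in for fear is exactly the open question.

```python
def choose_expression_refs(wanted, picks_by_expression, substitutes=None):
    """Return (strategy, refs) for one shot.

    strategy is "exact" (picks already show the expression),
    "substitute" (option c: closest curated match), or
    "generate" (option a: make expression refs with Flash first).
    """
    substitutes = substitutes or {}
    if picks_by_expression.get(wanted):
        return "exact", picks_by_expression[wanted]
    for alt in substitutes.get(wanted, []):
        if picks_by_expression.get(alt):
            return "substitute", picks_by_expression[alt]
    return "generate", []


# Placeholder file names standing in for the expression-tagged picks.
picks = {
    "neutral": ["n1", "n2", "n3", "n4"],
    "focused": ["f1", "f2", "f3"],
    "exhausted": ["e1", "e2", "e3"],
}
```

With no substitutes table, a request for "terror" falls straight through to "generate", which is the behavior I would default to unless you recommend otherwise.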

### 3. Color Contamination Fix — Won't This Degrade Identity?
You suggest multiplying an amber/dark overlay on white-bg character refs before sending them. But Jinx's picks are white-background for a reason — to isolate her visual identity from environment. If I tint them amber, won't her skin tone shift? Won't the rust-stained cuticles become indistinguishable from the overlay?

Alternative: What if I use white-bg refs for IDENTITY (close to prompt, high weight) and add a separate ENV-lit reference image specifically generated to show Jinx in the target lighting (medium weight, further from prompt)?

### 4. 3x3 Grid — I Hear You, But What About 2x2?
You said max grid size should be 2x2. At 1024x1024, that's 512x512 per panel: enough for composition selection but not detail work. Is 2x2 at 1:1 viable as a composition exploration tool, parallel to your Flash exploration approach? Or is Flash exploration strictly better because each candidate is full-resolution?

### 5. Wide-Shot Face Degradation — This Affects 35% of Our Shots
EP001 has shots typed as MS, LS, and WIDE. If face quality degrades beyond MS, what's the solution at scale? Options:
a) Face detailer inpainting pass (adds pipeline complexity)
b) Generate face crop separately, composite (manual work)
c) Accept degradation for wide shots (the face is small enough viewers won't notice on mobile at 9:16)
d) Upscale just the face region via a second 3-pro call

Which approach have you seen work best?

---

## NEW QUESTIONS FROM YOUR ANALYSIS

### 1. Flash Exploration — How Many Candidates Per Call?
Your code shows `num_candidates=4`. Can `gemini-2.5-flash-image` actually generate multiple images in a single API call? Or does this require 4 separate calls? If 4 separate calls at $0.039 each, that's $0.156 for exploration — more expensive than a single Pro call.

### 2. The "Color Contamination" Effect — Degree of Influence
How strong is this effect? If I pass 3 white-bg character refs alongside 1 dark-lit scene ref, does the white background from 3 images overpower the dark scene? Or does the scene ref win because of its label/position?

### 3. Cost Model for Native Vertical Batch
Your proposed pipeline per character shot:
- Pass 1: ENV anchor (amortized across scene, ~$0.004/shot)
- Pass 2: Flash exploration x4 ($0.156 if 4 calls, or $0.039 if batched)
- Pass 3: Pro final render ($0.134)
- Total per shot: ~$0.18 (batched Flash) to ~$0.29 (4 separate Flash calls)

This is roughly 4 to 6.5 times the per-frame cost of single-pass triptych ($0.134/3 panels = $0.045/frame), but it produces dramatically better quality. I can justify this if the take-1 acceptance rate is high enough to avoid regen cycles.
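
The arithmetic behind these figures, as a small script. It uses only the prices quoted in this message; the constant and function names are mine.

```python
ENV_ANCHOR_PER_SHOT = 0.004   # Pass 1, amortized across the scene
FLASH_CALL = 0.039            # Pass 2, price per Flash call
PRO_CALL = 0.134              # Pass 3, price per Pro call


def native_shot_cost(flash_calls: int) -> float:
    """Cost of one character shot in the 3-pass pipeline."""
    return ENV_ANCHOR_PER_SHOT + flash_calls * FLASH_CALL + PRO_CALL


triptych_frame = PRO_CALL / 3          # one Pro call yields 3 panels
batched = native_shot_cost(1)          # if 4 candidates fit in one Flash call
separate = native_shot_cost(4)         # if each candidate needs its own call

for label, cost in (("batched", batched), ("separate", separate)):
    print(f"{label}: ${cost:.3f} per shot, {cost / triptych_frame:.1f}x triptych frame")
```

With these prices the two cases come to about $0.18 and $0.29 per shot, against about $0.045 per triptych frame.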

What take-1 acceptance rate should we expect with your pipeline vs the triptych approach?
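
Here is how I would turn your answer into a cost decision. It assumes each attempt is accepted independently with the same probability and that a rejected attempt re-pays the full per-attempt cost; both are simplifications. The 20% triptych rate below is an illustrative placeholder, not a measurement.

```python
def expected_cost_per_accepted(cost_per_attempt: float, acceptance_rate: float) -> float:
    """Expected spend until one frame is accepted (mean of a geometric distribution)."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    return cost_per_attempt / acceptance_rate


def break_even_acceptance(cost_a: float, cost_b: float, rate_b: float) -> float:
    """Acceptance rate pipeline A needs to match B's expected cost per accepted frame.

    A result above 1.0 means A cannot break even on cost alone.
    """
    return rate_b * cost_a / cost_b


native = 0.004 + 0.039 + 0.134        # batched-Flash case from the cost model
triptych = 0.134 / 3                  # per-frame triptych cost
needed = break_even_acceptance(native, triptych, rate_b=0.20)  # placeholder rate
```

Under that placeholder, the native pipeline would need a take-1 acceptance rate of roughly 79% to match triptych on cost per accepted frame, before counting any quality difference.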

### 4. `gemini-3.1-flash-image-preview` — Where Does It Fit?
The model list shows `gemini-3.1-flash-image-preview` alongside `gemini-2.5-flash-image`. Is Flash 3.1 better for the exploration pass? What are the capability differences?

---

## PROPOSED MERGED ARCHITECTURE

Based on your feedback, here's the revised pipeline:

```
Per Scene:
  1. ENV Anchor — generate hero environment frame at 9:16 via 3-pro
     - Sanitized prompt, no character refs
     - Saves as scene_ref for all subsequent shots in scene

  2. Per Character Shot:
     a. Flash Exploration — 4 candidates at 9:16 via 2.5-flash
        - Character refs (expression-matched, mirrored for direction)
        - Scene ref
        - Kinetic descriptors, lighting vectors locked
     b. Hero Selection — review UI or VLM auto-pick
     c. Pro Final Render — 9:16 via 3-pro
        - Flash hero as composition/pose reference (highest weight)
        - Character identity refs (high weight, close to prompt)
        - Scene ref (medium weight)
        - Full cinematic prompt with positive constraints
```
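
The same flow as a runnable skeleton. The four generator and picker callables are injected stubs, so nothing here assumes a particular API shape; every name is hypothetical. It also assumes one Flash call per candidate, since batching is still an open question.

```python
def render_scene(shots, character_refs, gen_env, gen_flash, pick_hero, gen_pro,
                 num_candidates=4):
    """Run the per-scene pipeline and return one final frame per shot."""
    scene_ref = gen_env()                      # 1. ENV anchor, once per scene
    finals = []
    for shot in shots:                         # 2. per character shot
        candidates = [                         # 2a. Flash exploration
            gen_flash(shot, character_refs, scene_ref)
            for _ in range(num_candidates)
        ]
        hero = pick_hero(candidates)           # 2b. review UI or VLM auto-pick
        finals.append(                         # 2c. Pro final render
            gen_pro(shot, hero, character_refs, scene_ref)
        )
    return finals


# Dry run with stand-in callables that only record what they were asked to do.
calls = {"env": 0, "flash": 0, "pro": 0}

def fake_env():
    calls["env"] += 1
    return "scene_ref"

def fake_flash(shot, refs, scene_ref):
    calls["flash"] += 1
    return f"{shot}-cand{calls['flash']}"

def fake_pro(shot, hero, refs, scene_ref):
    calls["pro"] += 1
    return f"{shot}-final-from-{hero}"

frames = render_scene(["shot29", "shot30"], ["jinx_a"], fake_env, fake_flash,
                      lambda cands: cands[0], fake_pro)
```

Keeping the model calls behind injected functions means the ordering and cost questions above can be settled without touching the orchestration.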

The grid engine becomes a composition exploration tool only, not a production path. Upload bundles still work. The Review UI adds a candidate review workflow.

Does this merged approach address your concerns while keeping the pipeline buildable?
