# Visual Pipeline Innovations

> Current as of: 2026-03-04 (verify before relying on this for the current architecture)

**Status:** Conceptual / Pre-Implementation
**Date:** 2026-02-06
**Updated:** 2026-03-04

These are novel pipeline techniques developed during Leviathan visual R&D. Potential IP.

> **March 2026 Status Note:** Some techniques below reference the LoRA training pipeline (Z-Image, Flux2, fal.ai) which was **eliminated Feb 27, 2026** in favor of native frontier model generation (Gemini 3 Pro / NBP). The underlying concepts (Decisive Moment, identity anchoring, etc.) remain valid and may be revisited if LoRA pipelines become viable again. The current production stack is Flash 3.1 (previz) → NBP (keyframes) → Kling/SeedDance/Veo (video).

---

## 1. Decisive Moment Generation (Mid-Frame-First Keyframing)

### The Problem

Traditional AI video pipelines generate a FIRST frame (start of action) and a LAST frame (end of action), then interpolate between them. But the most cinematic frame in any action sequence is the MIDDLE — the decisive moment. A fist at full extension. A panel half-torn from the wall, metal screaming. The peak of effort.

When you generate from the first frame ("she reaches for the panel"), you get a static pose. When you generate the decisive moment ("she wrenches the panel mid-pull, body torqued, rust cascading"), you get cinema.

### The Technique

**Invert the generation order.** Start from the peak:

1. **Generate the HERO frame** — the decisive moment, peak action, maximum tension. This is the frame that would be in the trailer. Use cinematic prose prompts with active verbs: "wrenches," "spins," "slams," "catches." This frame has the most visual energy and emotional information.

2. **Derive the FIRST frame** — the anticipation. Same character, same wardrobe, same location/lighting DNA, but the verb shifts to preparation: "braces against," "fingers finding the seam," "eyes locked on." Generated with matching seed/style to maintain visual consistency.

3. **Derive the LAST frame** — the aftermath. The result of the action: "stumbles back as the panel clatters free," "catches her breath, wiring exposed," "spins to face what she found." The emotional payoff.

4. **Interpolate with WAN FLF.** Feed all three frames into the WanFirstMiddleLastFrame extension. The AI video model creates smooth motion through the full arc: anticipation → peak → aftermath.

### Why This Works

- **Better creative direction.** Cinematographers and storyboard artists think in decisive moments. "What's THE shot?" Not "what's the first frame?"
- **Higher quality generations.** Active verbs and mid-action poses produce more dynamic, cinematic images than static "about to" poses. Tested and confirmed: E-style action prompts dramatically outperform static descriptions.
- **Natural shot structure.** Every shot becomes a micro-story with three beats. The triplet prompt approach forces the storyboard agent to think in arcs, not stills.
- **Better use of FLF.** The WanFirstMiddleLastFrame extension is designed for exactly this — three keyframes with smooth interpolation. Using it with decisive-moment-first generation gives it the most information at the critical point.

### Prompt Architecture (Per Shot)

Each storyboard shot produces a **triplet** of prompts sharing common DNA:

```
COMMON DNA (shared across all 3):
- Character description + LoRA trigger
- Location/environment description
- Lighting conditions
- Camera model (Arri Alexa Mini LF)
- Film stock (Kodak Vision3 500T)
- Technical specs (shallow DOF, visible grain)

FIRST FRAME (anticipation):
- Verb: "braces," "reaches," "positions," "eyes locked on"
- Expression: tension, focus, calculation
- Body: coiled, preparing, weight shifting

HERO FRAME (decisive moment):
- Verb: "wrenches," "spins," "slams," "catches"
- Expression: effort, determination, raw exertion
- Body: full extension, mid-action, peak energy
- Motion cues: hair whipping, particles frozen, blur on extremities

LAST FRAME (aftermath):
- Verb: "stumbles back," "catches breath," "stares at"
- Expression: relief, discovery, realization
- Body: releasing, settling, reacting to result
```
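
The triplet structure above can be sketched in Python. This is a minimal illustration, not the project's storyboard agent: the `Beat` class, `build_triplet`, and the sentence layout of the rendered prompt are all hypothetical names and choices made for this example. It only shows the shared DNA being written once and the three prompts being produced in hero-first order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beat:
    """One frame of the triplet: verb phrase plus expression and body cues."""
    verb: str
    expression: str
    body: str


def build_triplet(common_dna: str, first: Beat, hero: Beat, last: Beat) -> dict[str, str]:
    """Return the three prompts keyed in generation order: hero, first, last."""
    def render(beat: Beat) -> str:
        # Hypothetical sentence layout; only the shared prefix matters here.
        return (f"{common_dna} The character {beat.verb}. "
                f"Expression: {beat.expression}. Body: {beat.body}.")
    # Dicts preserve insertion order, so iteration yields hero -> first -> last.
    return {"hero": render(hero), "first": render(first), "last": render(last)}


dna = ("A wiry young salvager in a rusted corridor. Shot on Arri Alexa Mini LF, "
       "Kodak Vision3 500T, shallow DOF, visible grain.")
prompts = build_triplet(
    dna,
    first=Beat("braces against the panel", "tension, focus", "coiled, weight shifting"),
    hero=Beat("wrenches the panel mid-pull", "raw exertion", "full extension, hair whipping"),
    last=Beat("stumbles back as the panel clatters free", "relief, discovery", "releasing, settling"),
)
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping the DNA in one variable means a wardrobe or lighting change cannot accidentally land in only one of the three prompts.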

### Implementation Notes

- Tested on Klein (local Flux 2 distilled) with E-style prompts. Action verbs ("wrenches mid-pull," "spins mid-turn") dramatically outperform static descriptions. See `leviathan/storyboards/pose_test_results.html`.
- The WanFirstMiddleLastFrame extension (ComfyUI custom node) takes 3 reference images + 3 CLIP vision encodes. `middle_position` param defaults to 0.5.
- Key WAN FLF setting: `low_noise_influence = 0` to prevent flicker.
- Vertical (9:16) works better for character-focused shots.
- 81 frames at 960x544 is the tested sweet spot; 121 stretches compute.

---

## 2. Virtual Multi-Camera Grid Generation

### The Problem

In live-action filmmaking, multi-camera setups capture the same moment from 2-4 angles simultaneously. This gives the editor perfect coverage — every angle is the same take, same lighting, same performance. Cutting between them is seamless.

AI image generation produces one frame per generation. If you generate the same scene from 4 different angles in 4 separate generations, you get 4 images with subtly different lighting, character proportions, wardrobe details, and atmosphere. They don't cut together cleanly.

### The Technique

**Generate all angles in a single image, then split.**

1. **Prompt a 2x2 grid** of the same scene moment from 4 different camera positions:
   - **Top-left:** Close-up (face, emotion, performance)
   - **Top-right:** Wide shot (geography, blocking, environment)
   - **Bottom-left:** Over-the-shoulder (relationship, POV)
   - **Bottom-right:** Low angle (power dynamic, scale)

2. **Single generation pass.** Because all 4 frames come from the same noise seed, the same model pass, and the same generation context, they are expected to stay consistent in:
   - Lighting and color temperature
   - Character appearance and proportions
   - Wardrobe details
   - Environmental detail and atmosphere
   - Mood and tone

3. **Auto-split the quadrants.** Trivial image processing — crop at the midpoints. Each quadrant becomes its own frame.

4. **Upscale via Gemini NanobananaPro** (or similar). Each quadrant starts at reduced resolution (e.g., 512x512 from a 1024x1024 grid). AI upscaling recovers detail and brings each to production resolution.

5. **Edit between angles** like a real multi-cam shoot. Close-up for the emotional beat, wide for the geography, OTS for the relationship moment, low angle for the power shift. Perfect consistency between cuts.
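
The auto-split in step 3 comes down to four crop boxes. The sketch below computes them in plain Python, assuming the grid has no gutters or borders between quadrants (the source does not say whether it does). `quadrant_boxes` is a hypothetical helper; the tuples use the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so they could be passed to it directly.

```python
def quadrant_boxes(width: int, height: int) -> dict[str, tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) crop boxes for a 2x2 grid split at the midpoints."""
    mid_x, mid_y = width // 2, height // 2
    return {
        "CU": (0, 0, mid_x, mid_y),            # top-left: close-up
        "WIDE": (mid_x, 0, width, mid_y),      # top-right: wide shot
        "OTS": (0, mid_y, mid_x, height),      # bottom-left: over-the-shoulder
        "LOW": (mid_x, mid_y, width, height),  # bottom-right: low angle
    }


boxes = quadrant_boxes(1024, 1024)
for angle, (left, upper, right, lower) in boxes.items():
    print(angle, (left, upper, right, lower), f"{right - left}x{lower - upper}")
```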

### Extended Pipeline Integration

Each quadrant can independently be fed into:
- **WAN I2V** for video generation (each angle becomes its own video clip)
- **Decisive Moment triplet** (each angle gets first/hero/last treatment)
- **RIFE frame interpolation** for framerate upscaling

The combination of Grid + Decisive Moment means: for one scene moment, you generate:
- 1 grid image = 4 angles
- Each angle × 3 frames (first/hero/last) = 12 frames
- Each triplet → WAN FLF = 4 video clips
- Editor assembles coverage from 4 angles of continuous motion

**This is a virtual multi-camera shoot from minimal generation.**

### Prompt Architecture

```
"A 2x2 grid showing the same moment from four camera angles.

Top-left: Extreme close-up of [character], [expression], [emotion].
Shot on Arri Alexa Mini LF with 100mm anamorphic macro lens.

Top-right: Wide shot of [character] in [location], [action], [blocking].
Shot on Arri Alexa Mini LF with 32mm anamorphic lens, deep focus.

Bottom-left: Over-the-shoulder from behind [character B] looking at [character A],
[action], [relationship detail]. Medium focal length, shallow depth of field.

Bottom-right: Low angle looking up at [character], [power/scale detail],
[environment above]. Dutch tilt, dramatic perspective.

[Shared: Kodak Vision3 500T, visible grain, chiaroscuro lighting,
practical light sources, photorealistic skin texture.]"
```

### Open Questions

- **Does Flux 2 Dev handle grid prompts well?** Midjourney returns a 2x2 grid by default (four variations of one prompt, not four angles). The Flux/SD community has grid techniques, but quality varies. Needs testing.
- **Minimum resolution per quadrant?** If generating 1024x1024, each quadrant is 512x512 — may be too low even with upscaling. May need 1536x1536 or 2048x2048 base generation.
- **Prompt format for grids.** Some models respond to "2x2 grid" or "four-panel layout." Others need more structural language. Needs testing per model.
- **Consistency within the grid.** Does the model actually maintain character consistency across quadrants, or does each quadrant drift? The hypothesis is that shared generation context prevents drift — but this is unproven.
- **Does this compound with LoRA?** If LoRA provides character identity lock, the grid's consistency advantage shifts to environment/lighting/wardrobe consistency. LoRA + Grid may be the strongest consistency combo.

### Automation Pipeline

```
generate_grid(scene_prompt, seed)
  → split_quadrants(grid_image) → [CU, WIDE, OTS, LOW]
  → upscale_each(gemini_nanobananapro) → [CU_hires, WIDE_hires, OTS_hires, LOW_hires]
  → for each angle:
      generate_triplet(first_prompt, hero_prompt, last_prompt)
      → wan_flf(first, hero, last) → video_clip
  → assemble_edit(CU_clip, WIDE_clip, OTS_clip, LOW_clip)
```
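
As a rough illustration of the data flow above, the stages can be wired together with injected callables. Everything here is a sketch under assumptions: the function name `run_grid_pipeline`, the stage signatures, and the choice to derive each triplet from the upscaled angle image are hypothetical, and the lambdas in the demo are string stubs rather than real model calls.

```python
from typing import Callable, Sequence


def run_grid_pipeline(
    scene_prompt: str,
    seed: int,
    generate_grid: Callable[[str, int], object],
    split_quadrants: Callable[[object], Sequence[object]],
    upscale: Callable[[object], object],
    generate_triplet: Callable[[object], tuple[object, object, object]],
    wan_flf: Callable[[object, object, object], object],
    assemble_edit: Callable[[list], object],
):
    """Run grid -> split -> upscale -> triplet -> FLF -> edit, one clip per angle."""
    grid = generate_grid(scene_prompt, seed)
    clips = []
    for quadrant in split_quadrants(grid):  # CU, WIDE, OTS, LOW
        hires = upscale(quadrant)
        first, hero, last = generate_triplet(hires)
        clips.append(wan_flf(first, hero, last))
    return assemble_edit(clips)


# Stub stages so the flow can be traced without any model access.
result = run_grid_pipeline(
    "corridor scene", 42,
    generate_grid=lambda prompt, seed: f"grid({prompt},{seed})",
    split_quadrants=lambda grid: ["CU", "WIDE", "OTS", "LOW"],
    upscale=lambda q: q + "_hires",
    generate_triplet=lambda img: (img + ":first", img + ":hero", img + ":last"),
    wan_flf=lambda first, hero, last: f"clip[{first}|{hero}|{last}]",
    assemble_edit=lambda clips: clips,
)
print(result)
```

Injecting the stages keeps the orchestration testable and makes it easy to swap a local model for a cloud endpoint.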

### Related Art

- **Midjourney default output** is a 2x2 grid, but all 4 are the same prompt/angle — no multi-cam.
- **"4-panel comic" SD technique** uses grid prompts but typically for sequential narrative, not simultaneous angles.
- **Multi-view diffusion models** (Zero123++, SV3D) generate multiple views of a 3D object — different goal (3D reconstruction, not cinematography).
- **None of these combine grid generation + auto-split + upscale + video generation + editorial assembly.** That pipeline is novel.

---

## 3. Cinematic Prose Prompting (E-Style)

### Discovery

Through systematic A/B testing of 8 prompt strategies on Klein (local Flux 2 distilled), a specific prompt style dramatically outperformed all others for cinematic keyframe generation. We call it "E-style" after the winning test variant.

### The Formula

**~150-180 words of cinematic prose** structured as:

1. **Active verb opening** — "A wiry young salvager wrenches a corroded panel off the wall mid-pull" (NOT "a woman stands in a corridor")
2. **Physical detail cascade** — hair, sweat, effort, gear, all in motion
3. **Environment as character** — the setting responds to the action (rust cascading, light swinging, particles disturbed)
4. **Wardrobe as story** — "patched cargo pants reinforced at the knees, a leather salvage harness darkened with machine oil" — each detail implies history
5. **Expression as decision** — "the expression of someone who does not stop once committed" (NOT "determined face")
6. **Camera technical block** — "Shot on Arri Alexa Mini LF with anamorphic Panavision C-series glass, shallow depth of field"
7. **Film stock and texture** — "Kodak Vision3 500T film stock, visible grain, chiaroscuro lighting"
8. **Motion cues** — "motion blur on the falling rust particles" — gives the still image implied movement
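
The eight-part formula lends itself to a small assembler with a length check. The sketch below is illustrative only: the component keys, `build_e_style_prompt`, and `word_count_ok` are hypothetical names, and the 150-180 word window is simply the target stated above.

```python
E_STYLE_ORDER = [
    "action", "physical_detail", "environment", "wardrobe",
    "expression", "camera", "film_stock", "motion_cues",
]


def build_e_style_prompt(parts: dict[str, str]) -> str:
    """Join the eight components in formula order; fail loudly if one is missing."""
    missing = [key for key in E_STYLE_ORDER if not parts.get(key)]
    if missing:
        raise ValueError(f"missing E-style components: {missing}")
    return " ".join(parts[key].strip() for key in E_STYLE_ORDER)


def word_count_ok(prompt: str, low: int = 150, high: int = 180) -> bool:
    """Check the prompt against the target word window."""
    return low <= len(prompt.split()) <= high


parts = {
    "action": "A wiry young salvager wrenches a corroded panel off the wall mid-pull.",
    "physical_detail": "Hair whips across her face, sweat streaking through the grime.",
    "environment": "Rust cascades from the seam as the work light swings overhead.",
    "wardrobe": "Patched cargo pants reinforced at the knees, a leather salvage harness darkened with machine oil.",
    "expression": "The expression of someone who does not stop once committed.",
    "camera": "Shot on Arri Alexa Mini LF with anamorphic Panavision C-series glass, shallow depth of field.",
    "film_stock": "Kodak Vision3 500T film stock, visible grain, chiaroscuro lighting.",
    "motion_cues": "Motion blur on the falling rust particles.",
}
prompt = build_e_style_prompt(parts)
print(len(prompt.split()), "words; within target window:", word_count_ok(prompt))
```

The demo prompt is deliberately short, so the check reports it as under length; a production prompt would expand each component.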

### Why It Works

- **Cinema cameras in the prompt** (Arri Alexa Mini LF, not Sony A7IV) trigger the model's training data associations with high-budget cinematography — different lens rendering, different color science, different depth of field characteristics.
- **Active verbs** force the model to compose dynamically rather than posing a figure.
- **Narrative prose** (not keywords) lets the model understand spatial relationships and cause-effect chains.
- **Film stock reference** (Kodak 500T) biases toward grain, warmth, and analog texture vs. digital cleanness.

### What It Beat

| Style | Words | Result |
|-------|-------|--------|
| Short keywords | 20 | Generic, documentary feel |
| BFL structured | 77 | Detailed but stiff, 3-arm artifact |
| BFL + HEX colors | 121 | Darker/moodier but still posed |
| **E-style cinematic** | **187** | **Winner — cinematic, dynamic, atmospheric** |
| Front-loaded | 64 | Runner-up, clean but less energy |
| Camera model (Sony) | 55 | Softer, more "real" but prosumer look |
| Film still | 36 | Indie quality, simple |

### Key Insight

**The BFL official guide says 30-80 words.** E-style uses 150-180. The official recommendation is wrong for cinematic content — or more precisely, it's optimized for general-purpose image generation, not for production keyframes that need to feel like they were pulled from a movie.

---

## 4. Triptych Strip Generation (Tested 2026-02-06)

### Discovery

When generating the decisive moment triplet as 3 separate images (Innovation #1), character consistency breaks — different hair, different face, different physics of the action. But when generating all 3 frames in a **single horizontal strip** (like the multi-camera grid), consistency is dramatically better because all frames share the same generation context.

### The Technique

Prompt a single wide image (e.g., 1536x912) containing 3 vertical panels laid out left to right:

```
Left panel:   ANTICIPATION — preparation verb state
Center panel: PEAK ACTION — decisive moment verb state
Right panel:  AFTERMATH — result verb state
```

All three panels share common DNA (character, wardrobe, location, lighting, camera) written once at the top of the prompt. Only the verb/action/expression changes per panel.

### Test Results (Klein, 2026-02-06)

| Variant | Dimensions | Prompt Style | Result |
|---------|------------|-------------|--------|
| **A_wide** | 1536x768 | "Horizontal triptych" | **Winner.** Photographic, clean 3-panel split, consistent character, action arc reads clearly |
| B_wide | 1536x768 | "Storyboard strip / comic strip" | Went illustrated/comic-book style. Wrong for cinematic pipeline |
| **A_tall** | 1536x912 | "Horizontal triptych" | Same quality, taller panels give more vertical room. Better action physics |
| B_sq | 1024x1024 | "Storyboard strip" | Comic style again. "Storyboard" language triggers illustration mode |

**Key finding:** The word "triptych" keeps it photographic. The words "storyboard strip" or "comic strip" trigger illustration/comic-book rendering. For cinematic output, always use "triptych" framing language.

**Consistency improvement:** Night and day vs separate generations. Same face, same hair, same wardrobe, same corridor, same lighting across all 3 panels. The shared generation context solves the drift problem that made the individual-frame triplet unusable.

### Prompt Template

```
"A horizontal triptych of three vertical panels showing a continuous action sequence,
left to right, of [CHARACTER DESCRIPTION]. [WARDROBE]. [ENVIRONMENT].

Left panel — ANTICIPATION: [preparation verb state, coiled tension, setup]

Center panel — PEAK ACTION: [decisive moment, maximum exertion, hair flying, debris]

Right panel — AFTERMATH: [result, settling, discovery, emotional shift]

All three panels: Shot on Arri Alexa Mini LF with anamorphic [lens] glass.
Kodak Vision3 500T film stock, visible grain, chiaroscuro lighting,
consistent character and environment across all panels."
```
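
A minimal sketch of filling this template programmatically, with a guard for the wording that pushed the tests toward comic-book rendering. `build_triptych_prompt` and `ILLUSTRATION_TRIGGERS` are hypothetical names; the default lens string is borrowed from the E-style example earlier in this document and is only an assumption about what belongs in the `[lens]` slot.

```python
ILLUSTRATION_TRIGGERS = ("storyboard", "comic strip")


def build_triptych_prompt(character: str, wardrobe: str, environment: str,
                          anticipation: str, peak: str, aftermath: str,
                          lens: str = "Panavision C-series") -> str:
    """Fill the triptych template; reject wording that triggered illustration mode."""
    dash = "\u2014"  # the dash used in the source template
    prompt = (
        "A horizontal triptych of three vertical panels showing a continuous action "
        f"sequence, left to right, of {character}. {wardrobe}. {environment}.\n\n"
        f"Left panel {dash} ANTICIPATION: {anticipation}\n\n"
        f"Center panel {dash} PEAK ACTION: {peak}\n\n"
        f"Right panel {dash} AFTERMATH: {aftermath}\n\n"
        f"All three panels: Shot on Arri Alexa Mini LF with anamorphic {lens} glass. "
        "Kodak Vision3 500T film stock, visible grain, chiaroscuro lighting, "
        "consistent character and environment across all panels."
    )
    hits = [term for term in ILLUSTRATION_TRIGGERS if term in prompt.lower()]
    if hits:
        raise ValueError(f"illustration-mode wording found: {hits}")
    return prompt


triptych_prompt = build_triptych_prompt(
    character="a wiry young salvager",
    wardrobe="Patched cargo pants, leather salvage harness",
    environment="A rusted maintenance corridor lit by a swinging work light",
    anticipation="she braces against the panel, fingers finding the seam",
    peak="she wrenches the panel mid-pull, hair flying, rust cascading",
    aftermath="she stumbles back as the panel clatters free, wiring exposed",
)
print(triptych_prompt)
```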

---

## 5. Two-Tier Triptych Workflow (The Full System)

### Concept

Combine the triptych strip (Innovation #4) with the decisive moment approach (Innovation #1) and the multi-camera grid (Innovation #2) into a complete production pipeline.

### Tier 1: Hero Triptych (Sequence Keyframes)

For a sequence of 3 consecutive story beats, generate a single triptych strip showing the **decisive moment** of each beat. This locks cross-shot consistency — the character looks the same across all 3 shots because they're one generation.

```
HERO TRIPTYCH (1 generation, 3 decisive moments):
┌─────────────┬─────────────┬─────────────┐
│  Shot 1     │  Shot 2     │  Shot 3     │
│  Peak of    │  Peak of    │  Peak of    │
│  action A   │  action B   │  action C   │
└─────────────┴─────────────┴─────────────┘
```

### Tier 2: First/Last Triptychs (Per-Shot Motion Anchors)

For each shot in the hero triptych, generate two more triptych strips:
- A **"firsts" triptych** — the anticipation moment of each shot
- A **"lasts" triptych** — the aftermath moment of each shot

```
FIRST TRIPTYCH (1 generation, 3 anticipation frames):
┌─────────────┬─────────────┬─────────────┐
│  Shot 1     │  Shot 2     │  Shot 3     │
│  Before     │  Before     │  Before     │
│  action A   │  action B   │  action C   │
└─────────────┴─────────────┴─────────────┘

LAST TRIPTYCH (1 generation, 3 aftermath frames):
┌─────────────┬─────────────┬─────────────┐
│  Shot 1     │  Shot 2     │  Shot 3     │
│  After      │  After      │  After      │
│  action A   │  action B   │  action C   │
└─────────────┴─────────────┴─────────────┘
```

### Split + Upscale + Video

After generation:
1. **Auto-split** each triptych into 3 individual frames (trivial crop at 1/3 and 2/3 marks)
2. **Gemini upscale** each frame to production resolution
3. **Assemble triplets** — each shot now has first/hero/last
4. **WAN FLF** interpolation per shot → 3 video clips
5. **Editorial assembly** — cut the sequence together
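
Steps 1 and 3 can be sketched as two small helpers: one computes the crop boxes at the 1/3 and 2/3 marks, the other regroups the three strips (firsts, heroes, lasts) into per-shot triplets. Both function names are hypothetical, and the split assumes equal-width panels with no borders.

```python
def third_boxes(width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Return (left, upper, right, lower) boxes for the left, center, and right panels."""
    x1, x2 = width // 3, (2 * width) // 3
    return [(0, 0, x1, height), (x1, 0, x2, height), (x2, 0, width, height)]


def assemble_triplets(firsts: list, heroes: list, lasts: list) -> list[tuple]:
    """Regroup three strips of per-shot panels into per-shot (first, hero, last) triplets."""
    if not (len(firsts) == len(heroes) == len(lasts)):
        raise ValueError("each strip must contain the same number of panels")
    return list(zip(firsts, heroes, lasts))


print(third_boxes(1536, 912))

# Placeholder frame names stand in for the split, upscaled images.
triplets = assemble_triplets(
    firsts=["s1_first", "s2_first", "s3_first"],
    heroes=["s1_hero", "s2_hero", "s3_hero"],
    lasts=["s1_last", "s2_last", "s3_last"],
)
for shot_number, triplet in enumerate(triplets, start=1):
    print(f"shot {shot_number}: {triplet}")
```

The regrouping is the step that is easy to get wrong by hand: each strip is organized by beat type, while WAN FLF needs the frames organized by shot.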

### Math

For a 3-shot sequence:
- 3 triptych generations = 9 frames total
- 3 WAN FLF video generations
- **Total: 6 generation calls** (3 image + 3 video)

For a 3-shot sequence with multi-camera coverage (4 angles each):
- 3 triptychs × 4 angles = 12 triptych generations = 36 frames
- 12 WAN FLF video generations
- **Total: 24 generation calls** for full multi-cam coverage of a 3-shot sequence

vs. traditional pipeline (no triptych, no grid):
- 3 shots × 3 frames × 4 angles = 36 separate image generations (with drift)
- 12 video generations
- **Total: 48 generation calls** with inconsistency between every frame
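
The arithmetic above can be reproduced with a short calculator. This is a sketch: the function names are hypothetical, and it assumes one strip covers up to 3 shots, so sequences longer than 3 shots need additional strips.

```python
import math


def triptych_calls(shots: int = 3, angles: int = 1) -> dict[str, int]:
    """Call counts for the triptych workflow (firsts, heroes, lasts strips per angle)."""
    strips = 3 * math.ceil(shots / 3) * angles
    videos = shots * angles
    # "frames" counts the panels actually used, one first/hero/last per shot and angle.
    return {"image": strips, "video": videos, "total": strips + videos,
            "frames": 3 * shots * angles}


def traditional_calls(shots: int = 3, angles: int = 1) -> dict[str, int]:
    """Call counts when every frame is its own image generation."""
    images = shots * 3 * angles
    videos = shots * angles
    return {"image": images, "video": videos, "total": images + videos, "frames": images}


print("triptych, single angle:", triptych_calls(3, 1))
print("triptych, 4 angles:    ", triptych_calls(3, 4))
print("traditional, 4 angles: ", traditional_calls(3, 4))
```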

### Panel Fix: Selective Inpainting

When one panel in a triptych is wrong (e.g., character facing wrong direction, wrong action physics):

1. **Mask the bad panel** — create a mask covering just that third of the triptych image
2. **Write a corrected prompt** for that panel's action/direction
3. **Inpaint** via ComfyUI `SetLatentNoiseMask` — regenerates only the masked region
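4. **Good panels are preserved.** Only the masked latent region is re-noised, and the surrounding context guides the regeneration. For strictly pixel-identical good panels, composite the regenerated panel back onto the original image, since a VAE encode/decode round trip can shift unmasked pixels slightly.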

This preserves the core advantage (shared-context consistency) while allowing targeted fixes. The mask is trivially computed: `left_third`, `center_third`, or `right_third` of the image dimensions.
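
A minimal sketch of the mask computation, in plain Python. The helper names are hypothetical, and the convention that 1 marks pixels to regenerate is an assumption; how the mask is loaded into ComfyUI and connected to `SetLatentNoiseMask` is not shown.

```python
PANEL_INDEX = {"left_third": 0, "center_third": 1, "right_third": 2}


def panel_mask_box(width: int, height: int, panel: str) -> tuple[int, int, int, int]:
    """Return the (left, upper, right, lower) box covering one third of the strip."""
    index = PANEL_INDEX[panel]
    left = (index * width) // 3
    right = ((index + 1) * width) // 3
    return (left, 0, right, height)


def panel_mask(width: int, height: int, panel: str) -> list[list[int]]:
    """Build a binary mask as rows of 0/1 values (1 = regenerate, assumed convention)."""
    left, _, right, _ = panel_mask_box(width, height, panel)
    row = [1 if left <= x < right else 0 for x in range(width)]
    return [row[:] for _ in range(height)]


print(panel_mask_box(1536, 912, "center_third"))

# A tiny 6x2 mask makes the layout easy to eyeball.
for row in panel_mask(6, 2, "center_third"):
    print(row)
```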

**Needs: Panel Fix Tool** — see Backlog. An HTML editor where the user selects which panel is wrong, optionally adjusts the prompt, clicks "Regenerate Panel," and the inpainting pipeline runs automatically.

---

## 6. Resolution and Scaling Limits

### Flux 2 Dev Resolution

| Resolution | Megapixels | Use Case | Quality |
|-----------|-----------|----------|---------|
| 1024x1024 | 1.0 MP | Native training resolution | Best quality |
| 768x1344 | 1.0 MP | Vertical (9:16) single frame | Best quality |
| 1536x768 | 1.2 MP | Triptych strip (3 panels) | Good — tested, works |
| 1536x912 | 1.4 MP | Taller triptych strip | Good — tested, works |
| 1024x1024 | 1.0 MP | 2x2 multi-camera grid | Good — tested, works (~512x512/panel) |
| 1536x1536 | 2.4 MP | Larger 2x2 grid (~768x768/panel) | Untested — worth trying |
| 2048x2048 | 4.2 MP | Max grid attempt | Likely artifacts/duplications |

### Panel Counts vs Resolution

| Panels | Layout | Min Image Size | Per-Panel Size | Notes |
|--------|--------|---------------|----------------|-------|
| 3 (triptych) | 1×3 horizontal | 1536x912 | ~512x912 | Tested, works well |
| 4 (grid) | 2×2 | 1024x1024 | ~512x512 | Tested, works well |
| 4 (grid) | 2×2 | 1536x1536 | ~768x768 | Untested, likely works |
| 9 (grid) | 3×3 | 3072x3072 | ~1024x1024 | Too high for Flux 2 |
| 9 (grid) | 3×3 | 1536x1536 | ~512x512 | Might work — tight |

**9-panel grid at 1536x1536** is the stretch goal. Each panel would be ~512x512 — same as our working 2x2 grid panels. The question is whether Flux 2 can coherently compose 9 distinct panels at that resolution.
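
The per-panel sizes and megapixel figures in the two tables follow from simple division, which the sketch below reproduces. The helper names are hypothetical, and the calculation assumes panels tile the image evenly with no gutters.

```python
def panel_size(width: int, height: int, cols: int, rows: int) -> tuple[int, int]:
    """Per-panel size when the image is tiled evenly into cols x rows panels."""
    return (width // cols, height // rows)


def megapixels(width: int, height: int) -> float:
    """Total image size in megapixels, rounded to one decimal place."""
    return round(width * height / 1_000_000, 1)


layouts = [
    ("triptych 1x3", 1536, 912, 3, 1),
    ("grid 2x2", 1024, 1024, 2, 2),
    ("grid 2x2", 1536, 1536, 2, 2),
    ("grid 3x3", 1536, 1536, 3, 3),
    ("grid 3x3", 3072, 3072, 3, 3),
]
for name, width, height, cols, rows in layouts:
    panel_w, panel_h = panel_size(width, height, cols, rows)
    print(f"{name:13} {width}x{height} ({megapixels(width, height)} MP) "
          f"-> {panel_w}x{panel_h} per panel")
```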

---

## Combination: The Full Pipeline

The innovations compound:

```
SEQUENCE (3 consecutive story beats)
  │
  ├─ E-style cinematic prose prompts (~150-180 words per panel)
  │
  ├─ TIER 1: Hero Triptych (Innovation #4)
  │   3 decisive moments in one strip → auto-split
  │
  ├─ TIER 2: First Triptych + Last Triptych (Innovation #4 × 2)
  │   3 anticipation frames in one strip → auto-split
  │   3 aftermath frames in one strip → auto-split
  │
  ├─ OPTIONAL: Multi-Camera Grid per shot (Innovation #2)
  │   4 angles of each decisive moment in one grid → auto-split
  │
  ├─ Gemini upscale all split frames
  │
  ├─ WAN FLF per shot (first → hero → last) → video clips
  │
  └─ Editorial assembly
      Multi-angle coverage of a complete action sequence
      with consistent character/environment across every frame
```

**Per 3-shot sequence:**
- Without multi-cam: 3 triptych gens + 3 WAN FLF = **6 total calls, 9 frames**
- With multi-cam: 12 triptych gens + 12 WAN FLF = **24 total calls, 36 frames**

**Selective panel inpainting** handles any individual frame that needs correction without regenerating the whole strip.

---

## 7. Gemini NBP Upscale + Split FLF Pipeline (Tested 2026-02-06)

### The Problem

WAN 2.2 FLF on fal.ai caps at 720p regardless of input resolution. But "garbage in, garbage out" still applies — higher quality source frames produce noticeably cleaner, more coherent video. Triptych-split panels (488x892 each) are below WAN's effective input quality threshold, and they carry white border artifacts from the splitting process.

Additionally, local 3-keyframe FLF (WanFirstMiddleLastFrame) is blocked by a LightX2V dependency and VAE mismatches, so we need 3-keyframe control from cloud endpoints that only support 2-keyframe FLF.

### The Techniques

**Gemini NBP Upscale:**
1. Crop 4px from each edge (removes triptych border artifacts)
2. Send to Gemini 2.5 Flash Image with "keep everything exactly the same, only increase resolution" prompt
3. Result: 488x892 → 768x1344 (or higher), preserving all facial features, colors, lighting, composition
4. ~7s per panel, essentially free (Gemini Flash pricing)
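
The crop in step 1 and the resulting scale factors can be worked out without any image library. This is a sketch with a hypothetical helper name; it does not call Gemini, and the box uses the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts.

```python
def border_crop_box(width: int, height: int, margin: int = 4) -> tuple[int, int, int, int]:
    """Return the box that trims `margin` pixels from every edge."""
    if width <= 2 * margin or height <= 2 * margin:
        raise ValueError("image is too small for the requested margin")
    return (margin, margin, width - margin, height - margin)


box = border_crop_box(488, 892)
cropped = (box[2] - box[0], box[3] - box[1])
target = (768, 1344)
scale_x = target[0] / cropped[0]
scale_y = target[1] / cropped[1]
print("crop box:", box)
print("cropped size:", cropped)
print(f"scale factors: x={scale_x:.2f}, y={scale_y:.2f}")
```

The horizontal and vertical factors come out slightly different, which suggests the upscaled frame is reframed or stretched a little to reach 9:16. That is an inference from the numbers, not something the tests recorded.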

**Split FLF (Two-Segment Approach):**
1. Generate three keyframes: FIRST (anticipation) + MIDDLE (peak action) + LAST (aftermath)
2. Upscale all three via Gemini NBP
3. Run two FLF calls: First→Middle (segment 1) + Middle→Last (segment 2)
4. The middle keyframe is the shared join point — character identity is anchored at the most visually complex moment
5. Each segment: 49 frames, 40 steps, 720p, 9:16, 16 fps
6. Total: ~78s for two segments (~39s each)
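
The two-segment plan can be sketched as follows. The helper names are hypothetical. The sketch also assumes that, when the two clips are joined, the first frame of segment 2 duplicates the last frame of segment 1 (both being the middle keyframe) and is dropped; the source does not say how the join is handled.

```python
def split_flf_segments(first: str, middle: str, last: str) -> list[tuple[str, str]]:
    """Return the (start, end) keyframe pairs for the two 2-keyframe FLF calls."""
    return [(first, middle), (middle, last)]


def joined_frame_count(frames_per_segment: int = 49, segments: int = 2,
                       drop_duplicate_join: bool = True) -> int:
    """Frames in the joined clip; optionally drop the repeated frame at each join."""
    total = frames_per_segment * segments
    if drop_duplicate_join:
        total -= segments - 1
    return total


segments = split_flf_segments("first.png", "middle.png", "last.png")
frames = joined_frame_count()
print("segments:", segments)
print("joined frames:", frames)
print(f"duration at 16 fps: {frames / 16:.2f} s")
```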

### Why Split FLF Solves Decoherence

Standard 2-keyframe FLF with a complex action (character turns, expression shifts, occlusion) often produces "decoherence" — the AI interpolates through the occluded moment and the character's face/body drifts. By anchoring identity at the exact moment of maximum complexity (the middle keyframe), each segment only needs to interpolate a simpler A→B motion.

### Test Results

| Test | Input | Output | Time | Quality |
|------|-------|--------|------|---------|
| Upscale (3 panels) | 488x892 each | 768x1344 each | 7s avg | Faithful — same face, same colors, sharper detail |
| Split FLF seg1 | First→Middle (upscaled) | 49 frames, 720p | 39s | Smooth anticipation→peak arc |
| Split FLF seg2 | Middle→Last (upscaled) | 49 frames, 720p | 39s | Smooth peak→aftermath arc |
| Standard FLF | First→Last (no middle) | 81 frames, 720p | 138s | Good but face drifts at occlusion point |

### What Doesn't Work

- **Image-trained LoRAs for video** → flickering/noise. Video LoRAs (I2V-trained) required for WAN 2.2.
- **Local 3-keyframe FLF without LightX2V** → blurry mess or solid color. The LightX2V acceleration LoRA is required for local 3KF.
- **No crop before upscale** → Gemini reproduces the white border artifacts at higher resolution.

### Implementation

- `tools/upscale_gemini.py` — Standalone utility + importable module
- `tools/generate_storyboard_keyframes.py` — Full pipeline: T2I → crop → upscale → split FLF
- Shots flagged as `wan_flf_reaction` in storyboard JSON get 3-keyframe split FLF treatment
- Shots flagged as `wan_i2v` get standard 2-keyframe FLF
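
The flag-based routing might look like the sketch below. The storyboard JSON schema is not shown in this document, so the `mode` and `keyframes` field names, the file names, and `plan_flf_calls` are all hypothetical; only the two flag values come from the notes above.

```python
import json


def plan_flf_calls(shot: dict) -> list[tuple[str, str]]:
    """Return the (start, end) keyframe pairs to send to a 2-keyframe FLF endpoint."""
    keyframes = shot["keyframes"]
    if shot["mode"] == "wan_flf_reaction":
        # 3-keyframe split FLF: two segments sharing the middle keyframe.
        return [(keyframes["first"], keyframes["middle"]),
                (keyframes["middle"], keyframes["last"])]
    if shot["mode"] == "wan_i2v":
        # Standard 2-keyframe FLF: a single segment.
        return [(keyframes["first"], keyframes["last"])]
    raise ValueError(f"unknown mode: {shot['mode']!r}")


storyboard = json.loads("""
[
  {"id": "s01", "mode": "wan_flf_reaction",
   "keyframes": {"first": "s01_first.png", "middle": "s01_mid.png", "last": "s01_last.png"}},
  {"id": "s02", "mode": "wan_i2v",
   "keyframes": {"first": "s02_first.png", "last": "s02_last.png"}}
]
""")
plans = {shot["id"]: plan_flf_calls(shot) for shot in storyboard}
for shot_id, calls in plans.items():
    print(shot_id, calls)
```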

---

*Last updated: 2026-03-04*
*Origin: Leviathan visual pipeline R&D, Klein prompt testing + fal.ai API testing*
