# FORMAT SPEC — Puzzle Box

> Canonical specification for 30-second mood/mystery microserial episodes.
> If this document contradicts another file, this document wins.
> Constants live in `CONSTANTS.md`. This file defines rules and structure.

---

## 1. Philosophy

The Puzzle Box does not compress a story into 30 seconds. It delivers a **moment** — a fragment of feeling and meaning that accumulates across episodes into a larger pattern. Each episode is a stanza, not a chapter. The series is the poem.

**Three influences, one aesthetic:**
- **Wong Kar-Wai** — Withhold emotional resolution. Neon, longing, fragmented time, what characters DON'T say.
- **Terrence Malick** — Withhold narrative clarity. Whispered memory, poetic fragmentation, beauty as violence.
- **JJ Abrams** — Withhold information. Unanswered questions as engagement engine. "What does it MEAN?" not "what happens NEXT?"

**The combined effect:** The poetics of withholding. The audience leans forward, wants more, replays what they saw.

**The validation question is not** "does this escalate tension?" **It is:** "does this deepen wanting?"

---

## 2. Episode Structure

Every 30-second episode follows this structure. Timing may flex by ±2 seconds. The sequence is rigid.

| Beat | Duration | Function |
|------|----------|----------|
| **ENTRY IMAGE** | 2-5s | Wordless visual. No text, no VO. The audience arrives in a feeling before language begins. Rain on neon, a hand on glass, an empty chair. This is the scroll-stopper. |
| **VOICE** | 16-20s | VO fragment over 3-4 visual shots. Overheard memory register. Carries THE FRAGMENT when present. Must be emotionally orthogonal to the visuals. |
| **LINGER** | 6-8s | Held closing image. No cut to black on a gasp. A face, a space, something unresolved but beautiful. The feeling equivalent of an ellipsis. Oracle vote overlays here on Exposure-final episodes. |
| **ORACLE** | 10s off-clock | The vote. Mood-fork. Appears on episodes 4, 8, 12, and 16 only. |
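
The timing table lends itself to a mechanical check. The following is a minimal sketch, assuming the listed ranges already absorb the ±2 second flex and that the three on-clock beats must total exactly 30 seconds. `BEAT_RANGES`, `BEAT_ORDER`, and `check_beats` are hypothetical names for illustration, not part of any existing tooling.

```python
# Hypothetical beat ranges in seconds, copied from the table above.
BEAT_RANGES = {
    "ENTRY IMAGE": (2, 5),
    "VOICE": (16, 20),
    "LINGER": (6, 8),
}
BEAT_ORDER = ["ENTRY IMAGE", "VOICE", "LINGER"]
EPISODE_LENGTH = 30  # seconds; ORACLE is off-clock and is not counted


def check_beats(beats):
    """Return a list of problems for a standard episode.

    `beats` is a list of (name, duration_seconds) tuples in playback order.
    """
    problems = []
    names = [name for name, _ in beats]
    if names != BEAT_ORDER:
        problems.append(f"sequence must be {BEAT_ORDER}, got {names}")
    for name, duration in beats:
        if name not in BEAT_RANGES:
            problems.append(f"unknown beat: {name}")
            continue
        low, high = BEAT_RANGES[name]
        if not low <= duration <= high:
            problems.append(f"{name} is {duration}s, expected {low}-{high}s")
    total = sum(duration for _, duration in beats)
    if total != EPISODE_LENGTH:
        problems.append(f"beats total {total}s, expected {EPISODE_LENGTH}s")
    return problems
```

An empty list means the episode fits the table. A 4s / 18s / 8s split passes, while 5s / 20s / 8s fails because it totals 33 seconds even though each beat is individually in range.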

### THE FRAGMENT

THE FRAGMENT is not a beat — it is a metadata annotation. Embedded in VOICE or expressed through image choice in ENTRY IMAGE or LINGER, one piece of the larger puzzle shifts. Not a reveal, not a twist — a recontextualization. Something from a prior episode now means something different because of what the audience is hearing or seeing.

THE FRAGMENT is what keeps the audience returning. It is the Abrams engine working through accumulation instead of withholding.

**Script notation:** In the narrative layer, annotate the fragment:
```
FRAGMENT: Recontextualizes Ep 3 — the empty chair.
  Original context: Abandonment.
  New context: Preservation. She kept his seat.
  Carrier: VOICE line "I kept his seat warm for a year."
```

### Episode 1 Exception

No prior episode exists. ENTRY IMAGE establishes the world's visual grammar — the palette, the texture, the atmosphere. VOICE introduces the narrator. THE FRAGMENT plants the first mystery seed instead of recontextualizing. LINGER establishes the first recurring image (this image will return in Episode 16).

### ERUPTION Episodes (FRACTURE Events)

Up to 3 episodes per series use the FRACTURE structure instead of the standard structure (the third, the Ghost FRACTURE, is optional):

| Beat | Duration | Function |
|------|----------|----------|
| **BREAK** | 5s | Sudden kinetic violence or rupture. No build-up. Interrupts the atmosphere. |
| **AFTERMATH** | 25s | Atmospheric aftermath. Silence. Faces. The cost. Wong Kar-Wai longing in the wake of violence. |

**Fallen Angels Rule:** The kinetic action happens in the first 5 seconds. The remaining 25 seconds are atmosphere. This solves the TikTok retention problem — the algorithm gets the scroll-stop in 3-5 seconds, then the audience stays for 25 seconds of mood.

**Placement:** One Primary FRACTURE in Exposure 2 (eps 5-6), one SHIFT FRACTURE in Exposure 3 (eps 9-10), and one optional Ghost FRACTURE anywhere in Exposures 2-4.

**Ghost FRACTURE:** An episode that builds toward eruption intensity — the rhythm accelerates, the sound design escalates — then retreats into stillness at the moment of expected release. The denied climax. Optional tool, not a mandate.
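
The placement windows above can be expressed as a small check. This is a sketch under stated assumptions: the episode windows are read directly from the Placement rule, and `check_fracture_placement` and its arguments are hypothetical names.

```python
# Episode windows taken from the Placement rule above (inclusive bounds).
PRIMARY_WINDOW = (5, 6)   # Primary FRACTURE, Exposure 2
SHIFT_WINDOW = (9, 10)    # SHIFT FRACTURE, Exposure 3
GHOST_WINDOW = (5, 16)    # optional Ghost FRACTURE, Exposures 2-4


def _inside(episode, window):
    """True when the episode number falls inside the inclusive window."""
    return window[0] <= episode <= window[1]


def check_fracture_placement(primary, shift, ghost=None):
    """Return a list of placement problems; `ghost` may be None."""
    problems = []
    if not _inside(primary, PRIMARY_WINDOW):
        problems.append(f"Primary FRACTURE at Ep {primary}, expected Eps 5-6")
    if not _inside(shift, SHIFT_WINDOW):
        problems.append(f"SHIFT FRACTURE at Ep {shift}, expected Eps 9-10")
    if ghost is not None and not _inside(ghost, GHOST_WINDOW):
        problems.append(f"Ghost FRACTURE at Ep {ghost}, expected Eps 5-16")
    return problems
```

The Ghost FRACTURE stays optional here: omitting it never produces a problem.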

---

## 3. Episode Script — Narrative Only

Episodes contain ONLY narrative content. Visual specifications are derived downstream by the Breakdown Agent reading the episode + FORMAT.md + characters.md.

**What belongs in the episode:**
- Metadata block
- FRAGMENT annotation
- THE MOMENT annotation
- Narrative script (VO text + minimal stage directions)
- WORLD VOTE or ORACLE

**What does NOT belong in the episode:**
- Shot sizes, lens choices, camera movement
- Lighting direction, color temperature
- Audio direction (ambient beds, SFX)
- Pipeline direction of any kind

The episode is the script. The breakdown is the shot list. The manifest is the render queue. These are separate documents with separate concerns.

---

## 4. VO Policy

### Frequency
VO appears in **10-12 of 16 episodes**. The remaining 4-6 episodes are silent or near-silent. Silence is structural: when the voice suddenly drops out, that absence is an event signaling that something has changed.

Silent episodes cluster in Exposure 1 (pure image mood-setting) and at key FRACTURE moments (when action replaces voice).

### Register: Overheard Memory
The narrator is telling you something they half-understand themselves. Less articulate than the images. Confused, grieving, confessing, or remembering. Never explaining.

**This is not narration.** It is not a character thinking aloud. It is not exposition. It is the sound of someone trying to make sense of something they lived through.

### The Orthogonality Rule

**VO must not describe the physical state of what's on screen.** It CAN reference the same subject but must add a layer the image can't provide — history, meaning, feeling, context. The gap between what we see and what we hear is where meaning lives.

- **BAD (captioning):** We see a phone on a counter. VO: "I keep the phone on the counter. Face down." — that's a subtitle track.
- **GOOD (orthogonal):** We see a phone on a counter. VO: "The number. It hasn't worked in two years. I watched them disconnect it. But it calls." — same subject, different layer. The image gives the object, the voice gives the history.
- **ALSO GOOD (fully disjoint):** We see rain on glass. VO talks about a memory of summer. — different subjects entirely.

Both approaches are valid. The test: cover the image and read the VO. Cover the VO and watch the image. If either one alone tells the full story, the other is redundant. If each tells HALF the story, you have orthogonality.

### Found-Document Variant
When VO quotes a diegetic document — a medical log, a black-box recording, a transcript, a confession, a readout — clinical register is permitted. The writer must answer: "What document is this, and who recorded it?" But found-document VO is a tool in the toolkit, not the default.

### Word Budget
- Maximum **60 words** per VO appearance
- Most VO lines should be 20-40 words
- VO-silent episodes: 0 words
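
The budget is easy to check automatically. The sketch below assumes words are counted by splitting on whitespace, which this spec does not define; `vo_word_status` and its labels are hypothetical.

```python
MAX_VO_WORDS = 60         # hard ceiling from the Word Budget
TYPICAL_RANGE = (20, 40)  # guideline for most VO lines


def vo_word_status(vo_text):
    """Return (word_count, label) for one VO appearance.

    Assumption: a "word" is any whitespace-separated token.
    """
    count = len(vo_text.split())
    if count == 0:
        return count, "silent"
    if count > MAX_VO_WORDS:
        return count, "over budget"
    if TYPICAL_RANGE[0] <= count <= TYPICAL_RANGE[1]:
        return count, "typical"
    return count, "allowed"
```

Under this counting rule, the GOOD example from the Orthogonality Rule ("The number. It hasn't worked in two years. I watched them disconnect it. But it calls.") comes to 16 words: within budget, though shorter than the typical range.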

### Prohibitions (VO)
- Never expository ("The station was built in...")
- Never synchronous with visuals ("She walked down the corridor")
- Never interior monologue ("She thought to herself...")
- Never a narrator ("In a world where...")
- Never flowery for its own sake ("The neon tears of the cybernetic dawn...")

---

## 5. Series Arc: Exposures × Sequences

Two layers working together. **Exposures** are the mood layer — what the audience *understands*. **Sequences** are the story layer — what *happens* in the plot. Each Exposure contains two sequences. Plot is the engine, mood is the car.

### The Nested Architecture

| Exposure | Seq | Episodes | Mood Layer (Exposure) | Story Layer (Sequence) |
|----------|-----|----------|-----------------------|------------------------|
| **THE MOOD** | 1 | 1-2 | Establish world, voice, central absence. The audience doesn't know what the story is. They know what it feels like. | **INCITING** — World-state boot / the disaster that locks them in |
| | 2 | 3-4 | By episode 4: a specific longing they can't name, one question forming without being stated. | **ESCALATION** — First attempt / makes things worse |
| **THE FRACTURE** | 3 | 5-6 | Counter-fragments complicate the mood. What felt melancholy now has an edge. Mystery emerges from dissonance. | **THE TOLL** — Physical cost / emotional-resource cost |
| | 4 | 7-8 | Something happened between these people, and the fragments contradict each other. | **MIDPOINT** — The big swing / the massive reversal |
| **THE SHIFT** | 5 | 9-10 | Recontextualization. Fragments rearrange. Images from Exposure 1 return with different meaning. VO may be unreliable. | **FALLOUT** — Surviving the midpoint / new desperate plan |
| | 6 | 11-12 | Stops being atmospheric, starts being *about something*. Mystery sharpens into a question the audience can almost articulate. | **ALL IS LOST** — Plan fails / lowest point |
| **THE TRUTH** | 7 | 13-14 | Emotional convergence, not plot resolution. Fragments assemble into understanding. | **THE RALLY** — Spark of hope / turning tables |
| | 8 | 15-16 | Final episode echoes Episode 1's ENTRY IMAGE. Mystery box opens — inside is an emotion, not a fact. | **CLIMAX** — Resolution + universe expansion |

### Push/Pushback Within Sequences
- **Odd episodes (Push):** Proactive — a choice, a discovery, an action
- **Even episodes (Pushback):** Reactive — the cost, the consequence, the reverberation

### Internal Exposure Shape (each group of 4)
1. Establish the Exposure's new element (Sequence A, Push)
2. Deepen / complicate (Sequence A, Pushback)
3. The Exposure's FRAGMENT turn — the recontextualization that defines this movement (Sequence B, Push)
4. Settle into new understanding; Oracle vote on Exposure-final episode (Sequence B, Pushback)
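
Because the architecture is strictly nested (4 episodes per Exposure, 2 per Sequence), an episode's position follows from its number alone. A minimal sketch, with `locate` as a hypothetical helper name:

```python
EXPOSURES = ["THE MOOD", "THE FRACTURE", "THE SHIFT", "THE TRUTH"]
SEQUENCES = ["INCITING", "ESCALATION", "THE TOLL", "MIDPOINT",
             "FALLOUT", "ALL IS LOST", "THE RALLY", "CLIMAX"]


def locate(episode):
    """Map an episode number (1-16) to its place in the nested architecture."""
    if not 1 <= episode <= 16:
        raise ValueError(f"episode must be 1-16, got {episode}")
    index = episode - 1
    return {
        "exposure": EXPOSURES[index // 4],   # four episodes per Exposure
        "sequence": SEQUENCES[index // 2],   # two episodes per Sequence
        "role": "Push" if episode % 2 == 1 else "Pushback",
        "oracle": episode % 4 == 0,          # Exposure-final episodes
    }
```

For example, `locate(7)` places Episode 7 in THE FRACTURE, in the MIDPOINT sequence, as a Push episode with no Oracle vote.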

---

## 6. Ending Taxonomy (Resonance Points)

Episodes end on resonance points — "what did that MEAN?" not "what happens NEXT?"

| Type | Mechanism | Example |
|------|-----------|---------|
| **RHYME** | Familiar image or gesture recurs with altered meaning. The audience recognizes the echo but not the implication. | Same hand on same vial — but now we know what it contains. |
| **WITHHOLD** | Cut away at the moment of revelation. Not a cliffhanger — more like a door closing. We know something was about to be understood. | Character opens mouth to speak. Cut to black. Next episode shows the listener's face. |
| **DISSONANCE** | VO and image contradict each other. The audience reconciles two competing truths. | VO: "She saved him." Image: his face, terrified of her. |
| **OBJECT** | Object or detail centered in final frame. No explanation. It will recur. The audience doesn't yet know why it matters. | A serial number on a blood bag. Just a number. Held 3 seconds. |
| **ABSENCE** | Episode ends on what is missing — empty room, vanished character, silence where a sound should be. Negative space becomes the question. | The medbay. Her equipment. She is not there. |

### Distribution Rules
- No type used consecutively
- Each Exposure should use at least 3 of the 5 types across its 4 episodes
- Each LINGER beat must be classifiable as exactly one type
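
The first two distribution rules can be checked over a list of ending types. This sketch treats the "at least 3 of 5" guideline as a reportable problem, which is stricter than the "should" wording above; `check_endings` is a hypothetical name.

```python
ENDING_TYPES = {"RHYME", "WITHHOLD", "DISSONANCE", "OBJECT", "ABSENCE"}


def check_endings(endings):
    """Check ending types listed one per episode, in episode order."""
    problems = []
    for number, ending in enumerate(endings, start=1):
        if ending not in ENDING_TYPES:
            problems.append(f"Ep {number}: unknown ending type {ending!r}")
        # Rule: no type used consecutively.
        if number > 1 and ending == endings[number - 2]:
            problems.append(f"Ep {number}: repeats {ending} from Ep {number - 1}")
    # Rule: each complete Exposure (group of 4) uses at least 3 types.
    for start in range(0, len(endings), 4):
        group = endings[start:start + 4]
        if len(group) == 4 and len(set(group)) < 3:
            problems.append(f"Exposure {start // 4 + 1}: fewer than 3 ending types")
    return problems
```

The third rule (each LINGER classifiable as exactly one type) is enforced implicitly by requiring a single known value per episode.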

---

## 7. Fragment Linkage Map

A series-level document tracking how each episode's FRAGMENT connects backward and forward. This is the puzzle-box engine — the system that ensures the mystery assembles rather than dissipates.

### Requirements
- Every episode in Exposures 2-4 (episodes 5-16) must have at least one backward linkage
- The referenced element must exist in the cited episode
- The new meaning must differ from the original context
- Fragment linkage entries are written during treatment and maintained during generation

### Format
```
Episode 7:
  Recontextualizes: Ep 3 — empty chair in café
  Original meaning: Abandonment
  New meaning: Preservation — she kept his seat
  Carrier: VOICE line "I kept his seat warm for a year"
  Ending type: RHYME (echoes Ep 3 LINGER)
```

### Validation
- Every episode 5-16 must have a linkage entry
- The source episode must be earlier than the current episode
- No single source episode should be referenced more than 3 times (prevents over-reliance; reported as a soft warning)
- By Episode 16, every planted OBJECT ending must have been referenced at least once
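
A sketch of how these checks might be automated, assuming the linkage map has been parsed into a dictionary. The entry keys (`source`, `original`, `new`) and the `check_linkage` function are hypothetical, and the requirement that the referenced element actually exists in the cited episode is not covered because it needs a human or a content-aware reader.

```python
from collections import Counter

MAX_REFERENCES_PER_SOURCE = 3


def check_linkage(linkage, object_plants=()):
    """Validate a Fragment Linkage Map.

    `linkage` maps an episode number to a list of entry dicts with the
    hypothetical keys "source", "original", and "new".
    `object_plants` lists episodes whose ending type is OBJECT.
    """
    problems = []
    source_counts = Counter()
    for episode in range(5, 17):  # episodes 5-16 need a backward linkage
        entries = linkage.get(episode, [])
        if not entries:
            problems.append(f"Ep {episode}: no linkage entry")
        for entry in entries:
            source_counts[entry["source"]] += 1
            if entry["source"] >= episode:
                problems.append(f"Ep {episode}: source Ep {entry['source']} is not earlier")
            if entry["original"] == entry["new"]:
                problems.append(f"Ep {episode}: new meaning equals original")
    for source, count in source_counts.items():
        if count > MAX_REFERENCES_PER_SOURCE:
            problems.append(f"Ep {source}: referenced {count} times (max 3)")
    for planted in object_plants:
        if source_counts[planted] == 0:
            problems.append(f"Ep {planted}: OBJECT ending never referenced")
    return problems
```

Returning a list of messages rather than raising on the first problem lets a treatment pass report every broken linkage in one run.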

---

## 8. Voting — The Dungeon Master System

**Every episode ends with a vote.** The audience is the co-creator — a dungeon master collaborating with the show's world builders. They don't control the characters. They shape the world the characters move through.

### Two Vote Tiers

| Tier | Frequency | Function | What the Audience Chooses |
|------|-----------|----------|--------------------------|
| **World Vote** | Every episode except Exposure-final ones (12 per series) | Audience shapes environment: what object, what sound, what detail appears next | What's behind the door, not whether to open it |
| **Oracle Vote** | Episodes 4, 8, 12, 16 (Exposure-final) | Mood-fork — sets emotional register for the next Exposure | What the story *means*, not what happens |

### World Votes (Non-Exposure-Final Episodes)

The audience populates the world. They choose textures, details, what's behind the next door. The characters react to whatever the audience built. The DM (the engine) weaves it into the narrative regardless.

**World Vote Design Rules:**
- Options are **environmental/sensory**, never character decisions
- Options are **concrete and tactile** — objects, sounds, textures, details
- Options should be **1-3 words maximum**
- Both options must produce interesting story possibilities
- Whatever the audience chooses becomes a detail that recurs in the FRAGMENT system — they are building the puzzle pieces
- **Magician's Choice applies** — both options converge on the same structural destination

**World Vote Examples:**
- "Behind his voice:" `[A] Rain` / `[B] Music`
- "What he left behind:" `[A] A cigarette, still warm` / `[B] A word traced in condensation`
- "The sound on the other end:" `[A] Breathing` / `[B] Glasses clinking`
- "The earring is shaped like:" `[A] A teardrop` / `[B] A key`

**World Vote Format in Scripts:**
```
## WORLD VOTE
**[DM_PROMPT]** Behind his voice:
* [A] Rain
* [B] Music
```

### Oracle Votes (Exposure-Final Episodes)

Structural mood-forks at episodes 4, 8, 12, and 16. These are the big votes — they set the emotional register for the next Exposure.

| Episode | Oracle Question Pattern | Effect |
|---------|----------------------|--------|
| 4 (end of MOOD) | "Is this ___ or ___?" | Determines Exposure 2's emotional register |
| 8 (end of FRACTURE) | "Is this ___ or ___?" | Determines Exposure 3's SHIFT nature |
| 12 (end of SHIFT) | "Is this ___ or ___?" | Determines Exposure 4's convergence |
| 16 (end of TRUTH) | "What remains?" | Seeds next series (if applicable) |

**Oracle Vote Design Rules:**
- Options must be **mutually exclusive** (creates factions, generates argument)
- Options should be **1-2 words maximum** (tweetable faction identity)
- Both options must be **genuinely ambiguous** — defensible readings, no correct answer
- Framing implies consequence for the next Exposure

**Oracle Vote Format in Scripts:**
```
## ORACLE
**[DM_PROMPT]** The voicemails. The empty seat. The earring.
* [A] Longing
* [B] Dread
```

### How Votes Feed the Fragment System

The audience's World Vote choices become recurring details in the FRAGMENT system. If the audience votes "Rain" behind his voice in Episode 1, the rain becomes a thread — it recurs, gains meaning, gets recontextualized. The audience is literally choosing the puzzle pieces that the series will assemble.

This means the treatment's Fragment Linkage Map must account for both vote outcomes at each World Vote. Each FRAGMENT entry should note: "If the audience voted A, this references X. If B, this references Y."

### Production Advantage
World Votes are easier to implement than plot-forks. The structural skeleton is fixed — the audience is choosing environmental details, not narrative direction. Pre-generate 2 versions of the NEXT episode's ENTRY IMAGE (one per option). The VO and LINGER can often remain the same, with only the visual details shifting.

Oracle mood-forks are prompt-level adjustments — changing emotional register, not narrative dependency chains.

### Variant Pre-Generation
- **World Votes:** Pre-generate 2 versions of the next episode's ENTRY IMAGE and relevant VOICE-period visuals (one per option). VO may remain identical.
- **Oracle Votes:** Pre-generate 2 complete versions of the next Exposure's first episode.

### DM_RESOLUTION Format
Injected before generating the episode following a vote:

**World Vote resolution:**
```
> [!DM_RESOLUTION]
> The audience voted: **RAIN** (over MUSIC).
> The next episode's ENTRY IMAGE must feature rain. This detail becomes a recurring element in the fragment system.
```

**Oracle Vote resolution:**
```
> [!DM_RESOLUTION]
> The audience voted: **LONGING** (over DREAD).
> The next Exposure's emotional register leans into ache and wanting, not threat.
```
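
Since both resolutions share one shape, the callout could be rendered from the vote result. A minimal sketch; `dm_resolution` is a hypothetical helper, and the consequence sentence is still written by hand.

```python
def dm_resolution(winner, loser, consequence):
    """Render a DM_RESOLUTION callout in the format shown above.

    `consequence` is the hand-written instruction for the next episode
    or Exposure; this helper only handles the fixed framing lines.
    """
    return "\n".join([
        "> [!DM_RESOLUTION]",
        f"> The audience voted: **{winner.upper()}** (over {loser.upper()}).",
        f"> {consequence}",
    ])


print(dm_resolution(
    "Rain",
    "Music",
    "The next episode's ENTRY IMAGE must feature rain.",
))
```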

---

## 9. Rhythm System

Each episode is tagged with one rhythm type that dictates visual pacing.

| Tag | Shots | Characteristics | Used For |
|-----|-------|----------------|----------|
| **SUSPENDED** | 1-2 | Near-static frames. Slow movement — smoke, rain, a turning head. VO carries the meaning. Time feels stopped. | ENTRY IMAGE beats, contemplative LINGERs, Exposure 1 episodes |
| **LAYERED** | 3-4 | Images that rhyme — cutting between two faces, two time periods, two versions of the same space. Meaning lives in the juxtaposition. | VOICE beats, symbol-heavy episodes, FRAGMENT-carrying episodes |
| **KINETIC** | 5-8 | Fast cuts where each cut connects to a prior image. Echoes detonating. | FRACTURE episodes only (BREAK beat) |
| **DRIFT** | 2-3 | Camera moves through space slowly — tracking through a corridor, pushing in on a detail, pulling back to reveal emptiness. Movement itself is the meaning. | Transitions, pure atmosphere episodes, AFTERMATH beats |

### Rhythm Rules
- Max **2 consecutive** episodes with the same rhythm tag
- KINETIC is reserved for FRACTURE episodes — never used in standard episodes
- Default episode rhythm: SUSPENDED → LAYERED within a single episode's beats
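
The first two rhythm rules can be sketched as a check over the per-episode tags. `check_rhythm` and its arguments are hypothetical names; the set of FRACTURE episodes is assumed to come from the series plan.

```python
def check_rhythm(tags, fracture_episodes):
    """Check rhythm tags listed one per episode, in episode order.

    `fracture_episodes` is a set of episode numbers using the FRACTURE structure.
    """
    problems = []
    for number, tag in enumerate(tags, start=1):
        # Rule: KINETIC is reserved for FRACTURE episodes.
        if tag == "KINETIC" and number not in fracture_episodes:
            problems.append(f"Ep {number}: KINETIC outside a FRACTURE episode")
        # Rule: at most 2 consecutive episodes share a tag.
        if number >= 3 and tag == tags[number - 2] == tags[number - 3]:
            problems.append(f"Ep {number}: third consecutive {tag}")
    return problems
```

A run of four identical tags is reported twice (at the third and fourth episode), so each offending episode is named.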

---

## 10. Emotional Beat Schedule

Fixed emotional beats pinned to specific episodes. Non-negotiable structural anchors.

| Episode | Beat | Function |
|---------|------|----------|
| 3 | THRESHOLD | First commitment to the mystery — character crosses a line |
| 8 | MIDPOINT | The mood fractures. What felt one way now has an edge. |
| 11 | VULNERABILITY | The mask slips. The narrator's reliability cracks. Quiet before the fall. |
| 12 | ALL_IS_LOST | Lowest point. The fragments seem to cancel each other out. Nothing coheres. |
| 15 | FRACTURE/RECONCILIATION | Connection tested. The audience's understanding is shaken and reformed. |
| 16 | RESOLUTION | Emotional convergence. The ENTRY IMAGE from Episode 1 returns with full weight. |

Tolerance: ±1 episode. Missing beats are hard failures in validation.
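
The schedule and its tolerance can be checked as follows. This is a sketch, assuming each beat's actual episode has already been extracted from the episode metadata; `check_emotional_beats` is a hypothetical name.

```python
# Target episodes copied from the schedule table above.
BEAT_SCHEDULE = {
    "THRESHOLD": 3,
    "MIDPOINT": 8,
    "VULNERABILITY": 11,
    "ALL_IS_LOST": 12,
    "FRACTURE/RECONCILIATION": 15,
    "RESOLUTION": 16,
}
TOLERANCE = 1  # episodes


def check_emotional_beats(placed):
    """`placed` maps a beat name to the episode where it actually lands."""
    problems = []
    for beat, target in BEAT_SCHEDULE.items():
        actual = placed.get(beat)
        if actual is None:
            problems.append(f"{beat}: missing (target Ep {target})")
        elif abs(actual - target) > TOLERANCE:
            problems.append(f"{beat}: at Ep {actual}, target Ep {target} ±{TOLERANCE}")
    return problems
```

Any returned problem corresponds to a hard failure under the rule above.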

---

## 11. THE MOMENT

Every episode has one image or line that stays — the single thing the audience carries with them between episodes. More like a poem's closing image than a plot turn.

THE MOMENT is not the dramatic beat. It is the **residue**. It is what the audience describes when they tell a friend about the show: "There's this shot where..."

In the script, annotate THE MOMENT:
```
THE MOMENT: Her hand resting beside the tourniquet. Not reaching for it. Just... beside it.
```

THE MOMENT is often but not always in the LINGER beat. It can live in VOICE (a line that haunts) or ENTRY IMAGE (a visual that won't leave).

---

## 12. Episode File Template

```markdown
# EP[NN] — [TITLE]

## Metadata
- Exposure: [MOOD/FRACTURE/SHIFT/TRUTH]
- Sequence: [INCITING/ESCALATION/TOLL/MIDPOINT/FALLOUT/ALL_IS_LOST/RALLY/CLIMAX]
- Push/Pushback: [Push/Pushback]
- Rhythm: [Suspended/Layered/Drift]
- VO: [Yes/No]
- Ending Type: [Rhyme/Withhold/Dissonance/Object/Absence]
- Emotional Beat: [beat name or "none"]

## FRAGMENT
- Recontextualizes: [Ep N — description]
- Original meaning: [what it meant before]
- New meaning: [what it means now]
- Carrier: [VOICE/ENTRY IMAGE/LINGER]

## THE MOMENT
[One sentence — the image or line that stays]

---

## NARRATIVE SCRIPT

### [00:00-00:04] ENTRY IMAGE

[Visual description. No words.]

### [00:04-00:22] VOICE

[VO: CHARACTER]
"Spoken VO text."

### [00:22-00:30] LINGER

[Final image. Held. Unresolved.]

## WORLD VOTE
**[DM_PROMPT]** [Environmental/sensory prompt, ≤ 15 words]
* [A] [concrete detail — 1-3 words]
* [B] [concrete detail — 1-3 words]

## ORACLE (Exposure-final episodes only — replaces WORLD VOTE on eps 4, 8, 12, 16)
**[DM_PROMPT]** [Vote text ≤ 15 words]
* [A] [option — 1-2 words]
* [B] [option — 1-2 words]
```

### FRACTURE Episode Template

```markdown
# EP[NN] — [TITLE] (FRACTURE)

## Metadata
[Same as above, with Rhythm: Kinetic]

---

## NARRATIVE SCRIPT

### [00:00-00:05] BREAK

[Sudden kinetic action. No build-up.]

### [00:05-00:30] AFTERMATH

[VO: CHARACTER] (if present)
"Spoken VO text."

[Atmospheric aftermath. Silence. Cost.]

```

---

## 13. Prohibitions

Absolute rules. If any of these appear in a script, it fails validation.

| Prohibition | Explanation |
|-------------|-------------|
| **Exposition** | Characters and VO cannot explain plot, lore, history, or feelings. |
| **Synchronous VO** | VO cannot describe on-screen action or visuals. |
| **Generic verbs** | "ran," "looked," "fought" — must be specific to character and environment. |
| **Interior monologue** | No "She thought to herself..." |
| **Narrator voice** | No omniscient narration. VO is always a specific person remembering. |
| **Abstraction in VO** | Never "she felt lonely." Always concrete: "she bought a pineapple because it would expire in two days." |
| **Cliffhangers** | No mid-action cuts demanding "what happens next." Endings are resonance points. |
| **Dialogue exposition** | No character explaining the situation to another character. |

---

## 14. Validation Checklist

### Hard Fails
- Narrative word count > 80
- Missing ENTRY IMAGE, VOICE, or LINGER beat (standard episodes)
- Missing BREAK or AFTERMATH beat (FRACTURE episodes)
- Missing ending type annotation
- Missing FRAGMENT annotation (episodes 5-16)
- VO describes current visual (orthogonality violation)
- Emotional beat missing at target episode (±1 tolerance)
- Exposition or interior monologue detected

### Soft Warnings
- Narrative word count < 40
- Rhythm tag matches previous 2 episodes
- Ending type matches previous episode
- THE MOMENT not annotated
- Fragment linkage references same source episode >3 times across series
- Generic action verbs detected

### Series-Level Validation
- All 16 episodes present
- 2-3 FRACTURE episodes placed (1 in Exposure 2, 1 in Exposure 3, 1 optional Ghost FRACTURE)
- Fragment linkage covers all episodes 5-16
- Every planted OBJECT ending referenced at least once by Episode 16
- Oracle votes present on episodes 4, 8, 12, 16
- Emotional beats hit within tolerance
- Thread tracker: all threads resolved by Episode 16
- No thread stale for >4 episodes

---

## 15. Visual Pipeline Notes

The Puzzle Box format plays to AI video generation strengths:
- **Fewer shots** (2-4 per standard episode vs 5-8 for Kill Box)
- **Simpler compositions** — faces, negative space, texture, single characters
- **Slower movement** — higher success rate on I2V generation
- **Recurring environments** — same spaces reused, reducing unique asset count
- **Atmosphere-heavy frames** — rain, neon, smoke = what models do well

### Compositional Priorities
- Close-ups and 3/4 profiles dominate
- Two-character frames are significant events (rare)
- Negative space is compositionally intentional, not empty
- Neon palette provides consistent visual language across episodes
- Same environment in different lighting = different mood, same asset

### Risks
- Longer holds expose artifacts — quality bar per frame is higher
- Face close-ups require likeness consistency across episodes
- Subtle/slow camera movement can be misinterpreted by models ("drift left" becomes "whip left")
