> **ARCHIVED 2026-03-02.** LoRA pipeline eliminated Feb 27, 2026. Retained for potential future
> re-adoption of open-source identity models. For the current visual pipeline, see
> `../PRODUCTION_PIPELINE_GUIDE.md` Phase 6.

# LoRA Training Best Practices

**Last updated:** 2026-02-15 (pipeline redesign: Qwen MA → NBP → SeedVR2, smart expression distribution)
**Applies to:** Flux 2 T2I, Z-Image T2I, WAN 2.2 Video LoRA training via fal.ai

This document is the canonical reference for all character LoRA training in the Recoil pipeline. The validation hook in `train_lora.py` checks against these rules.

---

## 1. The Redundancy Rule

**If a property is constant across ALL training images, do not caption it.**

The LoRA absorbs constants from the pixel data. Captioning constants wastes token budget and can create conflicting training signals.

| Constant (do NOT caption) | Variable (DO caption) |
|---------------------------|-----------------------|
| Style (photorealistic, cinematic) | Camera angle |
| Camera specs, film stock | Facial expression |
| Quality language ("8K", "detailed") | Environment / background |
| Identity features (face, body, scars) | Lighting setup |
| Default wardrobe appearance | Pose (if distinctive) |
| Hair style/color | |

**Why this matters:** T5-based models (Flux 2, Z-Image) learn text-to-image mappings. If every caption says "photorealistic cinematic" and every image is photorealistic cinematic, the model doesn't learn anything from those tokens — they become noise. Worse, if identity text is included, the model learns to reconstruct identity from text instead of the trigger word, weakening identity lock.

---

## 2. Caption Strategy

### Natural Language Format

All models in the Recoil pipeline use T5-based text encoders trained on natural language prose. Captions should be **sentences**, not comma-separated tag lists.

**Correct (natural language):**
```
KIANCHAR with a neutral expression, seen from above, camera looking down. An industrial corridor with worn metal walls. Cool overhead light.
```

**Wrong (comma tag list):**
```
KIANCHAR, neutral expression, high angle looking down, industrial corridor, cool overhead lighting, photorealistic cinematic, gritty documentary aesthetic
```

**Wrong (identity contamination):**
```
KIANCHAR, massive combat chassis with grey-blue armored plating, solid electric blue eyes, neutral expression
```

**Wrong (wardrobe state from breakdown.json):**
```
KIANCHAR, NO PHYSICAL BODY — Kian exists as data encoded across ship systems
```

### Model-Specific Caption Length

| Model | Word Count | Style | Notes |
|-------|-----------|-------|-------|
| **Flux 2** | 30-80 words | Natural language sentences | T5 trained on prose; varied sentence structure |
| **Z-Image** | 20-50 words | Concise natural language | Shorter = better for Z-Image's smaller encoder |
| **WAN 2.2** | 80-150 words | Detailed paragraphs | Include motion/camera movement for video LoRAs |

The `--target-model` flag on `train_lora.py prepare` automatically adjusts caption style and length.
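
A minimal sketch of how the word-count ranges in the table above could be checked. The function name `caption_length_ok` and the model keys are hypothetical; the actual check in `train_lora.py` may differ.

```python
# Inclusive word-count ranges copied from the table above.
CAPTION_WORD_RANGES = {
    "flux2": (30, 80),
    "z-image": (20, 50),
    "wan2.2": (80, 150),
}


def caption_length_ok(caption: str, target_model: str) -> bool:
    """Return True if the caption's word count fits the target model's range."""
    low, high = CAPTION_WORD_RANGES[target_model]
    return low <= len(caption.split()) <= high
```

Words are counted by whitespace splitting, so punctuation attached to a word does not add to the count.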

### Caption Content (All Models)

Each caption describes **only what varies** between training images:

1. **Trigger word** — always first (e.g., `KIANCHAR`)
2. **Expression** — what the face is doing
3. **Camera angle** — how the subject is framed
4. **Environment** — where the subject is
5. **Lighting** — what light source defines the mood

The caption builder uses varied sentence templates for natural language diversity. Template selection is deterministic per image (hash-based), so captions are reproducible.
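
The hash-based selection can be sketched as follows. This is an illustration only: the templates, `pick_template`, and `build_caption` are hypothetical names, and the real caption builder may hash different inputs.

```python
import hashlib

# Hypothetical templates; each one mentions only the variable attributes.
TEMPLATES = [
    "{trigger} with {expression}, {angle}. {environment}. {lighting}.",
    "{trigger}, {angle}, showing {expression}. The setting is {environment}. {lighting}.",
    "{trigger} in {environment}, {angle}. The face shows {expression}. {lighting}.",
]


def pick_template(image_name: str) -> str:
    """Choose a template from a stable hash of the file name.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would not be reproducible across runs.
    """
    digest = hashlib.sha256(image_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(TEMPLATES)
    return TEMPLATES[index]


def build_caption(image_name: str, **fields: str) -> str:
    """Fill the selected template; every template starts with the trigger word."""
    return pick_template(image_name).format(**fields)
```

Because the template index depends only on the file name, re-running the prepare step on the same dataset yields identical captions.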

---

## 3. Dataset Composition — The Keystone Approach

### Coverage Matrix

For a **25-30 image** character training set:

| Framing | Target % | In 30 images |
|---------|----------|-------------|
| Close-up / headshot | 33-40% | 10-12 |
| Medium / waist-up | 25-30% | 7-8 |
| Full body | 20-25% | 6-7 |
| Back / misc | 10% | 3-4 |

The `validate` subcommand reports actual coverage against these targets.
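
A rough sketch of such a coverage report, assuming each image has already been labeled with one of four framing keys. The key names, the `coverage_report` function, and the 3-point tolerance are assumptions, not the behavior of the real `validate` subcommand.

```python
from collections import Counter

# Target share of the dataset per framing, from the coverage matrix above.
COVERAGE_TARGETS = {
    "close_up": (0.33, 0.40),
    "medium": (0.25, 0.30),
    "full_body": (0.20, 0.25),
    "back_misc": (0.10, 0.10),
}


def coverage_report(framings: list[str], tolerance: float = 0.03) -> dict[str, tuple[float, bool]]:
    """Map each framing to (actual share, within target range +/- tolerance)."""
    counts = Counter(framings)
    total = len(framings)
    report = {}
    for framing, (low, high) in COVERAGE_TARGETS.items():
        share = counts[framing] / total if total else 0.0
        report[framing] = (share, low - tolerance <= share <= high + tolerance)
    return report
```

The tolerance exists because small datasets cannot hit the percentages exactly; one image in a 30-image set is already about 3.3 points.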

### Priority Order for Diversity

1. **Angle diversity** (most important for multi-angle identity)
2. **Framing diversity** (close-up, medium, full body)
3. **Expression diversity** (5-7 distinct types)
4. **Lighting diversity** (3+ setups)
5. **Background diversity** (5-10 environments — CRITICAL)

### Expression Guidelines

- Use **5-7 distinct expression types** in the final training set
- **Neutral should be most common** (~30-40% of final set)
- **No single non-neutral expression more than 5-6 times** in 25-30 images (neutral is the exception, at ~30-40% of the set)
- **Avoid theatrical extremes** that distort facial geometry (gaping mouth, howling). Moderate intensity works best for identity learning. Strong intensity is OK if the mouth shape is preserved.
- Expression families: anger, sadness, fear, exhaustion, resolve — with mild/moderate/strong gradients

### Expression Distribution Across Angles (Feb 14, 2026)

**Core finding: The LoRA learns identity; the base model already knows what expressions look like from every angle.** You are not teaching Flux 2 / Z-Image what a smile looks like from profile — it already knows. You are teaching it what *this character's face* looks like. Expression variation prevents the LoRA from accidentally suppressing the base model's expression range, but it does not need to appear from every angle.

**Academic support:** Ibarretxe-Bilbao et al. (2021) found frontal expressions were "significantly better recognized" than the same expressions in profile. Guo et al. (2015) found profile views "significantly decreased perceived intensity for ALL tested expressions." If humans can barely read expressions from profile, training images from those angles carry less expression signal for the model to learn from.

**Community consensus:** SeaArt's official training guide explicitly recommends: *"Expressions (recommended to use in only portraits or close views)."* The CivitAI Definitive Guide notes expressions are "semantic variation" — they add prompt-responsiveness, not identity information.

**The one risk — angle-expression coupling** — occurs when the LoRA learns that a specific expression always appears with a specific angle (e.g., "angry = frontal"). Mitigate by including 1-2 mild expressions at 3/4 angles to break the association, and by captioning angle and expression as independent attributes.

**Recommended distribution for a 25-30 image dataset:**

| Angle Category | Expressions | Rationale |
|----------------|------------|-----------|
| Frontal + close-ups (3 angles) | Full range (5 expressions) | Expressions most readable here; highest training signal |
| 3/4 views (2 angles) | Neutral + 1-2 mild | Breaks angle-expression coupling without adding noise |
| Profile, low, high (4 angles) | Neutral only, 2-pass (skip Pass 2) | Clean identity signal (nose, jaw, ear geometry) |
| Back angles (3 angles) | Neutral only, 2-pass (skip Pass 2) | Face not visible; teaches silhouette/hair |

This reduces pre-curation generation from ~51 images (12 angles × up to 5 expressions) to ~28 images, already close to the final training set target.

**Sources:**
- Ibarretxe-Bilbao et al. (2021), PubMed — Emotion Recognition of Facial Expressions Presented in Profile
- Guo et al. (2015), Acta Psychologica — Face in profile view reduces perceived facial expression intensity
- SeaArt: How To Create Dataset For Training (official guide)
- CivitAI: The Definitive Guide to Character LoRA Training

### Wardrobe Variation

- Use **3-5 different outfits** per character in the training set
- Caption wardrobe explicitly (e.g., "wearing a torn grey jumpsuit" vs "wearing chrome-plated officer's uniform") so the model learns clothing is a variable, not fused with identity
- **Without wardrobe variation, the LoRA bakes the outfit into the character's identity.** Generating the character in different clothes at inference becomes difficult or impossible.
- Flux-based models are especially sensitive to wardrobe baking — even 2 outfit changes significantly improve generalization
- If breakdown.json defines wardrobe states, distribute them across the training set weighted by episode count (primary wardrobe ~50%, others split evenly)
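
One way to turn the "primary ~50%, others split evenly" rule into per-outfit image counts is sketched below. It assumes the first outfit in the list is the primary one (for example, the one with the highest episode count); `allocate_wardrobe` is a hypothetical helper, not part of the pipeline.

```python
def allocate_wardrobe(outfits: list[str], total_images: int) -> dict[str, int]:
    """Give the first (primary) outfit about half the images, split the rest evenly."""
    if not outfits:
        raise ValueError("at least one outfit is required")
    primary, others = outfits[0], outfits[1:]
    if not others:
        return {primary: total_images}
    primary_count = round(total_images * 0.5)
    remaining = total_images - primary_count
    base, extra = divmod(remaining, len(others))
    allocation = {primary: primary_count}
    for i, outfit in enumerate(others):
        # Hand out any remainder one image at a time to the first few outfits.
        allocation[outfit] = base + (1 if i < extra else 0)
    return allocation
```

For three outfits and 30 images this yields 15 / 8 / 7, which always sums back to the dataset size.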

### Background Diversity

- Use **5-10 distinct environments** across the training set
- **All-white or studio-only backgrounds cause overfitting and "prompt inertia"** — the LoRA learns to always place the character on white, making it resist environment prompts at inference
- Caption backgrounds explicitly (e.g., "dark industrial corridor with worn metal walls")
- Rotate through: interior industrial, interior residential, exterior urban, exterior natural, atmospheric/fog, warm-lit, cool-lit, dramatic shadows, open space, confined space
- fal.ai does not support regularization images — compensate with in-dataset background diversity instead

### The Keystone Workflow

1. **JT manually generates 7-8 hero keystone images** in MidJourney (front, 3/4 L, 3/4 R, profile L, profile R, full body, close-up, back — **all neutral expression**)
2. **Place keystones in:** `[project]/visual/lora_candidates/[CHARACTER]/keystones/`
3. Optionally create `keystone_metadata.json` alongside the images, with angle/expression/lighting per image
4. `batch_generate_refs.py --lora-prep 50` generates AI variation candidates — **using 2-3 keystone images as identity references per candidate** (Gemini 2.5 Flash max 3 input images). Smart angle-based selection: always front keystone + closest-angle match + one more for 3D depth. Candidates grouped by angle (one-variable-at-a-time) for better identity consistency.
5. **LoRA picker:** select best 17-22 from AI batch to fill coverage gaps
6. **Final set:** ~8 keystones + ~17-22 AI picks = 25-30 with guaranteed coverage

**Why keystones matter:** Without keystones, each AI candidate is a different text-to-image interpretation of the character description — different face each time. The LoRA would be trained on mismatched identities, producing a blurry averaged-out character. Keystones give Gemini visual anchors so every candidate shares the same face geometry.
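
The smart angle-based selection from step 4 might look roughly like this. The yaw values, the keystone names, and `select_references` are all assumptions made for illustration. The third reference here is simply the next-closest keystone, which is only one plausible reading of "one more for 3D depth".

```python
# Hypothetical yaw per keystone in degrees: 0 = front, 90 = profile right,
# 180 = back, 270 = profile left.
KEYSTONE_YAW = {
    "front": 0,
    "three_quarter_right": 45,
    "profile_right": 90,
    "back": 180,
    "profile_left": 270,
    "three_quarter_left": 315,
}


def angular_distance(a: float, b: float) -> float:
    """Smallest absolute difference between two yaw angles, in degrees."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff)


def select_references(target_yaw: float, max_refs: int = 3) -> list[str]:
    """Front keystone first, then the remaining keystones closest to the target."""
    others = sorted(
        (name for name in KEYSTONE_YAW if name != "front"),
        key=lambda name: angular_distance(KEYSTONE_YAW[name], target_yaw),
    )
    return ["front"] + others[: max_refs - 1]
```

The default of three references matches the input-image limit noted in step 4.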

**Keystone priorities (most to least important):**

| Priority | Image | Why |
|----------|-------|-----|
| 1 | Front close-up (neutral) | The identity anchor |
| 2 | 3/4 right close-up (neutral) | Most common cinematic angle |
| 3 | 3/4 left close-up (neutral) | Proves symmetry |
| 4 | Profile left (neutral) | Side geometry — nose, jaw, brow |
| 5 | Profile right (neutral) | Confirms the other side |
| 6 | Full body front (neutral) | Body proportions, posture |
| 7 | Back / over-shoulder (neutral) | Hair, silhouette |
| 8 | Full body 3/4 (neutral) | Body in perspective |

**Keep keystones neutral.** Expressions are for AI candidates to explore — keystones establish identity geometry only.

### Keystone Metadata Format

Place `keystone_metadata.json` in the candidates directory alongside manually-generated images:

```json
{
  "existing_01_hero.jpeg": {
    "angle": "front",
    "expression": "neutral",
    "lighting": "soft_studio"
  },
  "existing_02_profile.jpeg": {
    "angle": "profile_left",
    "expression": "neutral",
    "lighting": "warm_ambient"
  }
}
```

Images without manifest metadata or keystone metadata receive trigger-only captions (safe — tells the model to associate the trigger with the visual identity).
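
The fallback behavior can be sketched as below, using the JSON layout shown above. The helper names and the caption wording are hypothetical; only the file name `keystone_metadata.json` and its three fields come from this document.

```python
import json
from pathlib import Path


def load_keystone_metadata(candidates_dir: Path) -> dict:
    """Read keystone_metadata.json if present, otherwise return an empty mapping."""
    path = candidates_dir / "keystone_metadata.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def caption_for(image_name: str, trigger: str, metadata: dict) -> str:
    """Build a caption from metadata, or fall back to the trigger word alone."""
    entry = metadata.get(image_name)
    if not entry:
        return trigger
    angle = entry["angle"].replace("_", " ")
    lighting = entry["lighting"].replace("_", " ")
    return (
        f"{trigger} with a {entry['expression']} expression, "
        f"{angle} view. {lighting.capitalize()} lighting."
    )
```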

---

## 3a. Candidate Generation Engines

LoRA training candidates can be generated from multiple AI engines, each with different strengths.

### Three-Pass Sequential Pipeline (Recommended)

The recommended approach chains three engines sequentially, each handling what it does best:

```
Hero Image
  → Pass 1: Qwen Multi-Angle (angle geometry from hero)
    → Pass 2: NBP / Gemini 3 Pro Image Preview (bg swap + expression, single input)
      → Pass 3: SeedVR2 (non-generative quality upscale)
        → Final candidate
```

| Pass | Engine | Purpose | Cost/img | Speed |
|------|--------|---------|----------|-------|
| 1 | Qwen Multi-Angle | Angle geometry (12 angles × 360°) | ~$0.035 | ~7-37s |
| 2 | NBP (Gemini 3 Pro Image Preview) | Background swap + expression + identity lock | ~$0.065 | ~20-40s |
| 3 | SeedVR2 | Non-generative quality upscale | ~$0.001 | ~5-10s |
| | **Total per angle (3-pass)** | | **~$0.101** | **~56s** |
| | **Total per angle (2-pass)** | | **~$0.036** | **~26s** |

**Why three passes:** No single engine handles all dimensions. Qwen nails angle geometry but can't change environments or expressions. NBP handles background swap, expression, and identity lock in a single pass — combining what previously required both Qwen Edit and a separate NBP polish step. SeedVR2 provides non-generative quality upscale without altering identity or composition. The chain exploits each engine's strength while reducing cost and pipeline complexity.

**Testing results (Feb 14-15, 2026):** Skipping Qwen Edit and letting NBP handle background + expression in a single pass produces better results. Dual-reference (sending the hero alongside the pipeline intermediate) caused NBP to pull head rotation back toward front, overriding the angle Qwen MA established. Key findings:
- Qwen MA provides accurate angle geometry including back/low/high views
- NBP handles bg swap + expression + identity lock in one pass, replacing both Qwen Edit and the old NBP polish step
- SeedVR2 upscales without generative artifacts, preserving identity faithfully

### NBP Skin Texture Prompting

NBP (Gemini 3 Pro Image Preview) responds well to explicit skin texture language. These prompts produce photographic skin vs the default airbrushed look:

**Positive prompts (add these):**
- "detailed skin texture with visible pores and natural variation"
- "every freckle, pore, and minor imperfection visible"
- "natural light revealing every pore, wrinkle, and slight flaw"
- "micro-texture, skin variation, organic skin tone with subtle imperfections"

**Negative prompts (add these):**
- "not airbrushed, not smoothed, not retouched"
- Avoid: plastic skin, waxy skin, smooth skin, poreless skin, glossy

**Photography reference (add these):**
- "shot on professional camera, DSLR 50mm lens"
- "Kodak Vision3 500T" (matches project film stock — use project-specific stock)

**Key insight:** Request flaws explicitly. AI models default to idealized skin. Prompting for freckles, asymmetry, pores, and natural variation overrides the smoothing bias.

**Lighting:** Do NOT hardcode lighting style (e.g., "warm amber tones"). Lighting should come from the prompt/breakdown data or the `--lighting` parameter in `engine_shootout.py`. Default: "cinematic lighting with modeling".

### Legacy Modes

The original hybrid pipeline modes are still available but not recommended:

- **Parallel (`--hybrid parallel`):** Qwen MA + Gemini Flash independently from hero. Issue: Gemini Flash can't hold angles (back_left → frontal).
- **Two-Pass (`--hybrid twopass`):** Qwen MA → Gemini Flash variations. Issue: Compounds artifacts, Gemini can't add expressions reliably.

fal.ai does not support regularization images for LoRA training. Compensate with in-dataset diversity (varied wardrobe, backgrounds, lighting) rather than external regularization sets.

**Full engine specs, API parameters, and architecture:** See [`candidate_generation_engines.md`](candidate_generation_engines.md).

**Quick comparison tool:** `engine_shootout.py --threepass` chains the three passes for a single angle/expression.

**Review tool:** `http://127.0.0.1:8420/shootout_reviewer.html?project=<name>&character=<CHAR>` — browse all runs, compare passes, mark winners, add notes.

### Pass 2 (Qwen Edit) — Legacy

**NOTE (Feb 15, 2026):** Qwen Edit has been removed from the recommended pipeline. NBP now handles background swap + expression in a single pass (Pass 2). Qwen Edit remains available in `engine_shootout.py` as a standalone engine for testing.

The fal.ai defaults for Qwen Edit lose facial detail in pipeline use. Optimized settings discovered Feb 14, 2026:

- **`guidance_scale: 3.5`** (down from 4.5) — lower = less aggressive editing, better identity lock. Official Qwen demo uses 4.0.
- **`num_inference_steps: 45`** (up from 28) — more denoising passes to resolve pores, iris, hair texture. Official demo uses 40-50.
- **`acceleration: "none"`** (was "regular") — disables step-skipping shortcuts that degrade fine detail.
- **`negative_prompt`** — `"blurry face, distorted features, deformed eyes, asymmetric face, smooth skin, plastic look"` steers away from facial degradation.
- **`output_format: "png"`** — lossless handoff to Pass 3.
- **No `strength` parameter exists** for Qwen Edit — it is not traditional img2img. Only levers are guidance, steps, acceleration, and prompt specificity.

See `candidate_generation_engines.md` → Engine 4 for full parameter reference and prompt engineering patterns.

### Pass 2 (NBP) — Expression Prompting Best Practices

**Use "emotion anchor + 2-3 physical descriptors" — not single words, not paragraphs.**

Research finding (Feb 14, 2026): Single-word emotion labels ("exhausted") leave the model guessing what that looks like. Full paragraphs of facial muscle anatomy over-constrain and produce unnatural results. The sweet spot is a short phrase that names the emotion and adds 2-3 concrete physical cues.

**Format:** `emotion — physical descriptor, physical descriptor, physical descriptor`

**Expression tiers for LoRA training:**

| Tier | Purpose | Examples |
|------|---------|---------|
| **Neutral** | Baseline identity (most images) | `neutral — relaxed features, steady gaze, lips closed naturally` |
| **Moderate** | Visible but natural range | `tired — slightly heavy eyelids, soft unfocused gaze, relaxed jaw` |
| | | `focused — narrowed eyes, set jaw, intent forward stare` |
| | | `wary — guarded gaze, slight tension around the mouth, watchful eyes` |
| **Intense** | Strong but not caricature (fewer images) | `exhausted — heavy-lidded eyes, slight frown, drained hollow gaze` |
| | | `furious — bared teeth, flared nostrils, intense glare with furrowed brow` |

**LoRA training mix:** Mostly neutral + moderate, with 1-2 intense for range. Don't overweight intense expressions or the LoRA learns exaggerated as the default.

**Key research points:**
- Gemini (NBP) "shines with descriptive language" — emotion + physical descriptors outperform single words
- Single-word labels have wildly varying model agreement: 87% for sadness but only 3% for fear (Pagan et al., 2025)
- Physical descriptors close the interpretation gap between what you mean and what the model generates
- For LoRA captions: use physical descriptions primarily ("heavy-lidded eyes") so the model learns visual-to-text mappings

**Batch generation:** `batch_threepass.py` runs the full 12-angle × multi-expression set with smart defaults:
```
python3 batch_threepass.py leviathan/ --character JINX              # full set (~$2.35, ~23 min)
python3 batch_threepass.py leviathan/ --character JINX --dry-run    # preview jobs with mode/env tags
python3 batch_threepass.py leviathan/ --character JINX --expressions neutral,moderate  # skip intense
python3 batch_threepass.py leviathan/ --character JINX --no-smart-back  # force 3-pass on back angles
python3 batch_threepass.py leviathan/ --character JINX --no-env-rotation  # use breakdown.json default env
```

**Smart defaults (Feb 15, 2026):**
- **Expression angles** (front, closeup_front, closeup_three_quarter) get full 5-expression range, 3-pass pipeline
- **Mild expression angles** (three_quarter_right, three_quarter_left) get 3 expressions (neutral + tired + focused), 3-pass pipeline — breaks angle-expression coupling
- **Neutral-only angles** (profile_right, profile_left, low_angle, high_angle) get neutral only, 2-pass (skip NBP)
- **Back angles** (back, back_left, back_right) automatically get neutral-only expression + 2-pass (Qwen MA → SeedVR2). Reason: Gemini rotates the head toward camera when given emotional prompts on back views, breaking the training data.
- **Environment rotation** cycles through 8 diverse environments (industrial, urban, natural, lab, etc.) across jobs. Prevents LoRA overfitting to a single backdrop.
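
As a sanity check on the smart defaults, the job count, cost, and runtime of a full run can be estimated from the per-angle totals in the pass table earlier in this section. This is a back-of-envelope sketch that assumes jobs run sequentially; the group names and `estimate_batch` are hypothetical.

```python
# Angle groups and expression counts from the smart defaults above.
ANGLE_GROUPS = {
    # group: (number of angles, expressions per angle, uses the NBP pass)
    "expression": (3, 5, True),
    "mild_expression": (2, 3, True),
    "neutral_only": (4, 1, False),
    "back": (3, 1, False),
}

# Per-image totals from the pass table in this section.
COST_3PASS, COST_2PASS = 0.101, 0.036  # USD
TIME_3PASS, TIME_2PASS = 56, 26        # seconds


def estimate_batch() -> tuple[int, float, float]:
    """Return (job count, total cost in USD, total minutes) for a full run."""
    jobs = 0
    cost = 0.0
    seconds = 0
    for angles, expressions, three_pass in ANGLE_GROUPS.values():
        n = angles * expressions
        jobs += n
        cost += n * (COST_3PASS if three_pass else COST_2PASS)
        seconds += n * (TIME_3PASS if three_pass else TIME_2PASS)
    return jobs, round(cost, 2), round(seconds / 60, 1)
```

The estimate works out to 28 jobs, about $2.37, and about 22.6 minutes, in line with the `~$2.35, ~23 min` quoted for the full set.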

### Pass 2 (NBP) — Identity Lock Strategy (Feb 14, 2026)

After reviewing a full 57-image batch, several identity drift issues were identified:
- **Facial geometry changes** — Gemini sometimes alters bone structure, jawline, eye shape
- **Soft/out-of-focus eyes** — iris detail lost, especially with side lighting
- **Skin texture breakdown** — smoothing/beautification despite texture prompts
- **Hair changes** — color or texture shifts between Pass 2 and Pass 3
- **Back-angle face rotation** — Gemini turns the character to face camera when given emotion + back angle

**Solutions implemented in `run_gemini()` (threepass mode):**

1. **Identity lock prompt lead-in:** "face identity locked — DO NOT generate a new face" as the first instruction
2. **Background swap + expression + identity lock in one pass:** The prompt separates **permanent skeletal structure** (skull shape, brow ridge, nose bridge, nose width, cheekbone position, chin shape, ear shape, eye spacing, eye size, iris color, skin tone, hair) from **temporary muscular movement** (brows furrowing, eyes narrowing/widening, nostrils flaring, lips curling/parting, jaw clenching/dropping). Lock the skeleton, free the muscles. NBP handles background replacement, expression application, and identity preservation simultaneously. This resolves the prior conflict where "DO NOT alter eye shape" contradicted expression requests like "heavy-lidded eyes."
3. **85mm f/1.8 lens cue:** "Sharp focus on eyes. Tack-sharp iris detail with visible iris fibers. Subtle catchlights." — mimics portrait photography convention that signals sharp eye rendering
4. **Anti-beautification block:** "DO NOT smooth, beautify, or stylize. No global smoothing. No airbrushing. Preserve pore texture, freckles, fine lines, and skin imperfections."
5. **Dual-reference removed (Feb 15, 2026):** Sending the original hero alongside the Pass 1 output caused NBP to pull head rotation back toward front, overriding the angle Qwen MA established. Single input preserves angle better.
6. **Scope:** NBP handles background swap, expression, and quality refinement in a single pass — no longer limited to enhancement-only since it now replaces both Qwen Edit and the old polish step

**Temperature:** Always 1.0 for Gemini image generation. Google warns against lowering temperature for image tasks.

---

## 3b. Background Rules: Training vs Inference

**Training data and inference references have opposite background requirements.** This is the single most important distinction in reference image preparation.

### For LoRA Training Data: Varied Backgrounds (REQUIRED)

Varied, real-world backgrounds prevent the LoRA from overfitting to a single environment. If all training images share the same background, the model fuses that background with the character's identity, creating "prompt inertia" — the LoRA resists environment prompts at inference.

- Use **5-10 distinct environments** (see §3 Background Diversity)
- Caption backgrounds explicitly so the model learns they're variable
- If source images have problematic backgrounds, replace with **randomized solid colors** (not all-white) as a preprocessing step — confirmed effective by Kohya/Musubi community testing
- fal.ai does not support regularization images — in-dataset diversity is the only mechanism

**Why it works:** LoRA training optimizes model weights to associate a trigger token with visual features. If backgrounds don't vary, background features get baked into the trigger's learned representation. Variation forces the model to isolate the invariant subject.

### For Inference-Time References: Clean/White Backgrounds (PREFERRED)

Reference images used as conditioning inputs during generation (Flux 2 multi-reference slots, IP-Adapter, video model character references) should use **clean, plain, or white backgrounds**. This is the strong consensus across all frontier models tested as of February 2026:

| Model | Ref Images | Background Recommendation | Source |
|-------|-----------|--------------------------|--------|
| **Kling 3.0 Elements** | 1-4 per element | White/neutral, "passport style" for faces | fal.ai, KlingAIO, Age of LLMs |
| **Sora 2** | 1 per API call | Clean/solid-color — reference style "bleeds" into output | OpenAI Cookbook |
| **Veo 3.1** | 2-4 images | Plain white or solid green | Skywork, Arsturn |
| **Seedance 2.0** | Up to 9 | Clean/simple — "busy backgrounds distract the model" | WaveSpeedAI, Flux-AI |
| **Hailuo S2V-01** | 1 image | Simple, non-distracting | Segmind, MiniMax |
| **Runway Gen-4** | Character refs | "Solid colors or simple gradients" | Runway Academy |
| **MidJourney --cref** | Character ref | "Plain background, studio lighting" | ImaginePro |
| **Wan 2.1 VACE** | Ref image | RMBG (background removal) explicitly recommended | Stable Diffusion Art |
| **Flux 2 Multi-Ref** | Up to 10 slots | "High-quality, clean references" | Together AI |

**Why it works:** Inference-time conditioning encodes the entire reference image into tokens that the model attends to during generation. The model cannot separate "subject tokens" from "background tokens" — background elements compete for attention and bleed into outputs. Clean backgrounds maximize the signal-to-noise ratio for identity.

**Style bleed is the primary risk.** Sora 2's documentation explicitly warns that the reference image's "color palette, lighting, and artistic style influence the entire video" — even when the prompt asks for a different setting. A contextual background injects those visual properties into every frame.

### Wan 2.1 VACE: Background Removal

The ComfyUI community explicitly recommends running **RMBG (background removal)** on reference images before feeding them to Wan VACE: "RMBG removes the background so VACE can lock onto the foreground subject more reliably, helping identity consistency across frames." This goes beyond white backgrounds to no background at all. Consider adding RMBG as a preprocessing step when routing through Wan for video generation.

### Summary

| Use Case | Background | Why |
|----------|-----------|-----|
| LoRA training data | **Varied** (5-10 environments) | Prevents overfitting; forces model to isolate subject |
| Inference reference sheets | **Clean/white** | Maximizes identity signal; prevents style bleed |
| Wan VACE video refs | **Removed** (RMBG) | VACE locks onto foreground more reliably |

In the Recoil pipeline: `batch_generate_refs.py` (standard refs) uses white backgrounds for inference conditioning. `batch_generate_refs.py --hybrid` or `--lora-prep` uses varied environmental backgrounds for LoRA training candidates. This distinction is enforced in code via conditional `bg_directive`.

---

## 4. Resolution Requirements

- **All images must be the same resolution.** Mixed resolutions cause training instability.
- **Recommended:** 1024x1024 for Flux 2 and Z-Image.
- **Auto-normalization:** `train_lora.py prepare --from-candidates` automatically detects mixed resolutions and normalizes with face-aware cropping. Originals are never overwritten; normalized copies go to `lora_training/`.
- **Face-aware crop:** For mismatched images, the prepare step finds the face region (brightness heuristic in upper frame), crops a square centered on it, and resizes to target. Falls back to upper-center crop if no face detected.
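
The crop geometry can be sketched in pure arithmetic. Face detection itself (the brightness heuristic) is not shown; `square_crop_box` is a hypothetical helper, and placing the fallback center at one third of the height is an assumed reading of "upper-center".

```python
from typing import Optional


def square_crop_box(width: int, height: int,
                    face_center: Optional[tuple[int, int]] = None) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for the largest square crop.

    The square is centered on the face when one is known, otherwise on the
    upper-center of the frame, and is clamped to stay inside the image.
    """
    side = min(width, height)
    if face_center is None:
        # Fallback: horizontally centered, biased toward the top third.
        face_center = (width // 2, height // 3)
    cx, cy = face_center
    left = min(max(cx - side // 2, 0), width - side)
    top = min(max(cy - side // 2, 0), height - side)
    return left, top, left + side, top + side
```

The returned box can be passed to an image library's crop call and then resized to the 1024x1024 target.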

---

## 5. Dataset Requirements

| Requirement | Flux 2 T2I | Z-Image T2I | WAN 2.2 Video |
|-------------|-----------|-------------|---------------|
| Optimal count | 20-30 images | 20-30 images | 10-15 clips (or images for I2V) |
| Resolution | Consistent (1024x1024) | Consistent | Match reference type |
| Format | PNG or JPEG + .txt captions | PNG or JPEG + .txt captions | Same for image training |

### Exclusion Criteria

- No watermarks or text overlays
- No compression artifacts (use original resolution PNGs)
- No near-duplicates (same angle + same expression + similar lighting)
- No face obstruction (masks, extreme angles where face is not visible), apart from the planned back-angle images in the coverage matrix
- No images where appearance contradicts the majority
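
The near-duplicate rule can be approximated by grouping images on their metadata. This sketch treats lighting as an exact-match label, which is stricter than the "similar lighting" wording above; `find_near_duplicates` and the metadata shape are assumptions.

```python
from collections import defaultdict


def find_near_duplicates(images: dict[str, dict[str, str]]) -> list[list[str]]:
    """Group image names that share the same angle, expression, and lighting."""
    groups = defaultdict(list)
    for name, meta in images.items():
        key = (meta["angle"], meta["expression"], meta["lighting"])
        groups[key].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Each returned group is a set of candidates from which only one image should survive curation.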

---

## 6. Training Parameters

| Parameter | Flux 2 | Z-Image | Z-Image Base | WAN 2.2 |
|-----------|--------|---------|--------------|---------|
| Steps | 800-1500 (~1000) | 1000-2000 | 1000-2000 | 800-1500 |
| Learning rate | 0.0001-0.00015 | 0.0001 | 0.0005 | 0.0007 |
| Rank | 32-48 (48 for faces) | Default | Default | N/A |
| Alpha | 2x rank | Default | Default | N/A |
| Output | 1 .safetensors | 1 .safetensors | 1 .safetensors | 2 .safetensors |
| create_masks | True | N/A | N/A | True |
| is_style | False | N/A (content) | N/A | False |
| use_face_detection | N/A | N/A | N/A | True |

### Notes

- Flux 2 default LR on fal (0.00005) is too low for character LoRAs. Use 0.0001-0.00015.
- Z-Image Base: no trigger_word API param. Trigger in caption files only, pass as `default_caption`.
- WAN 2.2 produces two LoRA files: high-noise (early denoising) and low-noise (late). Both needed.
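
The Flux 2 column of the table can be captured in a small guard function. The ranges come from the table; the dictionary keys `steps`, `learning_rate`, `rank`, and `alpha` are placeholder names and may not match the fal.ai request schema, so check the endpoint documentation before reusing them.

```python
def flux2_character_params(rank: int = 48, steps: int = 1000,
                           learning_rate: float = 0.0001) -> dict:
    """Build a Flux 2 character-LoRA parameter set, rejecting out-of-range values."""
    if not 32 <= rank <= 48:
        raise ValueError("rank should be 32-48 (48 for faces)")
    if not 800 <= steps <= 1500:
        raise ValueError("steps should be 800-1500")
    if not 0.0001 <= learning_rate <= 0.00015:
        raise ValueError("learning rate should be 0.0001-0.00015")
    return {
        "steps": steps,
        "learning_rate": learning_rate,
        "rank": rank,
        "alpha": 2 * rank,  # table: alpha = 2x rank
        "create_masks": True,
        "is_style": False,
    }
```

Passing the fal default learning rate of 0.00005 raises a `ValueError`, which mirrors the first note above.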

---

## 7. Common Mistakes

1. **Describing identity features in captions.** The model should learn identity from images. Captioning it makes the model rely on text, producing weaker identity lock.

2. **Including constant properties in captions.** Style tags, camera specs, quality language — if it's the same in every image, it wastes token budget and creates noise.

3. **Wardrobe states that contradict images.** breakdown.json wardrobe states are narrative ("NO PHYSICAL BODY"). Reference images always show a physical character.

4. **Using comma-separated tag lists.** T5 models are trained on natural language. Prose sentences produce better text-image alignment than tag lists.

5. **Training beyond convergence.** After loss plateaus (~800-1200 steps for Flux 2), additional training overfits the LoRA, collapsing outputs toward the training images.

6. **Insufficient diversity.** If 80% of images are front-facing neutral, the LoRA won't generalize. Use the coverage matrix.

7. **Mixed resolutions.** Causes training instability. Use auto-normalization or ensure all images match.

8. **Near-duplicate images.** Same angle + expression + lighting = redundancy without diversity.

9. **Using hero prompts as captions.** Hero prompts describe identity in detail — exactly what should NOT be in captions.

10. **Expression overlap.** "Intense" and "defiant" are nearly identical in execution. Use clearly distinct emotion families with intensity gradients.

11. **All-white studio backgrounds across all images.** Causes overfitting — the LoRA learns to always place the character on white and resists environment prompts at inference. Use 5-10 varied environments.

12. **Single wardrobe across all images.** The LoRA fuses the outfit with the character's identity, making it difficult to generate the character in different clothes. Use 3-5 wardrobe variations.

13. **Using only one generation model for all candidates.** Different engines have different strengths and failure modes. A multi-pass approach (Qwen for angles → NBP for background swap + expression → SeedVR2 for quality) produces better training data than any single engine.

14. **Using an existing LoRA to generate LoRA training data (circular dependency).** Z-Image Turbo with a character LoRA produces images that already encode the LoRA's learned biases. Training a new LoRA on these images amplifies those biases rather than learning from new visual information. Use LoRA-free engines (Qwen, NBP) for candidate generation.

---

## 8. fal.ai Specifics

### Endpoints

| Type | Endpoint | Cost |
|------|----------|------|
| Flux 2 T2I | `fal-ai/flux-lora-fast-training` | ~$3/1K steps |
| Z-Image Turbo | `fal-ai/z-image-trainer` | ~$0.85/1K steps |
| Z-Image Base | `fal-ai/z-image-base-trainer` | ~$0.85/1K steps |
| WAN 2.2 Video | `fal-ai/wan/v2.2/image-to-video/lora/training` | ~$2/1K steps |

### Dataset Upload

- ZIP containing images + matching `.txt` caption files
- Each `image_name.png` needs `image_name.txt`
- Upload the ZIP with `fal_client.upload_file()`, which returns a URL
- Pass that URL as `images_data_url` (Flux 2, WAN) or `image_data_url` (Z-Image)
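
The upload steps can be sketched as follows. `build_dataset_zip` and `upload_dataset` are hypothetical helpers, and the accepted image extensions are an assumption, since this guide does not list them. `fal_client.upload_file()` is the real client call and needs fal.ai credentials (typically the `FAL_KEY` environment variable). The argument names follow the list above.

```python
import zipfile
from pathlib import Path

# Assumption: this guide does not list accepted formats; adjust as needed.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def build_dataset_zip(dataset_dir, zip_path):
    """Zip every image with its same-stem .txt caption; fail if one is missing."""
    dataset_dir = Path(dataset_dir)
    images = sorted(
        p for p in dataset_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise ValueError(f"no images found in {dataset_dir}")

    missing = [p.name for p in images if not p.with_suffix(".txt").is_file()]
    if missing:
        raise FileNotFoundError(f"images without caption files: {missing}")

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for image in images:
            caption = image.with_suffix(".txt")
            # Store flat names so image_name.png sits next to image_name.txt.
            zf.write(image, arcname=image.name)
            zf.write(caption, arcname=caption.name)
    return Path(zip_path)


def upload_dataset(zip_path, is_z_image: bool):
    """Upload the ZIP and return the one request argument that carries its URL."""
    import fal_client  # third-party; imported lazily so zipping works without it

    url = fal_client.upload_file(str(zip_path))
    key = "image_data_url" if is_z_image else "images_data_url"
    return {key: url}
```

Failing on a missing caption file is deliberate: it surfaces an uncaptioned image before the upload rather than during training.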

### Auto-Captioning

fal.ai auto-captioning describes everything visible, including identity features, which is exactly what the Redundancy Rule (Section 1) forbids. Always write captions manually.

---

## 9. Pre-Training Checklist

Run `train_lora.py leviathan/ validate [CHARACTER]` to check these automatically.

- [ ] 15-25 curated images
- [ ] All same resolution (auto-normalization available)
- [ ] Coverage matrix: close-ups 33-40%, medium 25-30%, full body 20-25%, back 10%
- [ ] 5+ distinct camera angles
- [ ] 5-7 distinct expressions, neutral dominant (~40%), no non-neutral expression in more than 5-6 images
- [ ] 3+ wardrobe variations (outfits captioned explicitly)
- [ ] 5+ distinct backgrounds (no all-white sets)
- [ ] Mixed angle sources (three-pass or hybrid pipeline)
- [ ] No near-duplicates (same angle + expression combo)
- [ ] Captions are natural language (not comma tags), model-appropriate length
- [ ] Captions contain ONLY: trigger, expression, angle, environment, lighting (plus the outfit when it differs from the default wardrobe)
- [ ] Captions do NOT contain: style tags, identity features, wardrobe states, hero prompts
- [ ] No face obstruction, watermarks, or compression artifacts
- [ ] ZIP built and ready for upload
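
A few of the caption items above lend themselves to an automated lint. The sketch below is illustrative only and is not the logic in `train_lora.py`: the forbidden-term list is a small sample taken from the Section 1 table, and the tag-list test is a heuristic assumption (many commas, no sentence punctuation).

```python
import re

# Sample constant style/quality terms from the Section 1 table; extend per project.
FORBIDDEN_TERMS = ("photorealistic", "cinematic", "8k", "detailed", "film stock")


def lint_caption(caption: str, trigger: str) -> list[str]:
    """Return the checklist problems found in one caption."""
    problems = []
    text = caption.strip()

    if trigger not in text:
        problems.append("missing trigger word")

    lowered = text.lower()
    for term in FORBIDDEN_TERMS:
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            problems.append(f"contains constant style/quality term: {term!r}")

    # Heuristic (assumption): prose has sentence punctuation, tag lists do not.
    if text.count(",") >= 3 and not re.search(r"[.!?]", text):
        problems.append("looks like a comma-separated tag list")

    return problems


def lint_dataset(captions: dict[str, str], trigger: str,
                 min_images: int = 15, max_images: int = 25) -> dict[str, list[str]]:
    """Map each problem caption file to its issues; '_dataset' holds set-level issues."""
    report = {}
    if not min_images <= len(captions) <= max_images:
        report["_dataset"] = [
            f"{len(captions)} images; expected {min_images}-{max_images}"
        ]
    for name, caption in captions.items():
        problems = lint_caption(caption, trigger)
        if problems:
            report[name] = problems
    return report
```

The lint cannot detect identity descriptions or hero-prompt text in general, so those checklist items still need a human read of every caption.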

---

## Sources

### LoRA Training
- fal.ai documentation (Flux 2, Z-Image, WAN 2.2 training)
- CivitAI LoRA training guides (caption strategies, dataset curation, wardrobe variation)
- CivitAI community guide: "LoRA Training — Avoiding Wardrobe Baking" (3-5 outfits minimum)
- Apatero (fal.ai): Flux LoRA training guide (background diversity, regularization alternatives)
- RunDiffusion: "Advanced LoRA Training Tips" (dataset composition, background overfitting, varied backgrounds > white)
- SimpleTuner documentation (learning rate, rank, alpha)
- HuggingFace engineering notes (LoRA rank, alpha scaling)
- SeaArt: LoRA Image Training Guide (subject isolation via background variation)
- Kohya/Musubi-Tuner Issue #227: randomized solid-color backgrounds as overfitting workaround

### Candidate Generation Engines
- Google Gemini API documentation (Flash Image, Gemini 3 Pro Image Preview)
- fal.ai Qwen Image Edit 2511 + Multi-Angle LoRA endpoint documentation
- Recoil pipeline: Qwen vs Gemini comparison test (Feb 13, 2026) — angle accuracy, wardrobe fidelity benchmarks
- Recoil pipeline: Engine shootout + three-pass architecture test (Feb 14, 2026) — Qwen MA → Qwen Edit → NBP
- Recoil pipeline: Pipeline redesign — remove Qwen Edit, add SeedVR2, smart expression distribution (Feb 15, 2026)

### Skin Texture Prompting
- Media.io: Guide to realistic skin in AI-generated images
- 302.AI: 2025 model roundup — realistic portraiture techniques
- PXZ.ai: Negative prompts guide for photorealistic portraits
- Fiddl.art: Portrait prompt engineering — requesting natural imperfections

### Inference-Time Reference Best Practices (Feb 2026 survey)
- OpenAI Cookbook: Sora 2 Prompting Guide (style bleed from reference backgrounds)
- fal.ai / KlingAIO / Age of LLMs: Kling 3.0 Elements — white/neutral backgrounds, "passport style" for faces
- Skywork / Arsturn: Veo 3.1 Multi-Prompt Best Practices (plain white/green backgrounds)
- WaveSpeedAI / Flux-AI: Seedance 2.0 — clean/simple backgrounds
- Segmind / MiniMax: Hailuo S2V-01 Subject Reference (simple, non-distracting backgrounds)
- Runway Academy / Kristopher Dunham: Gen-4 character consistency (solid colors, simple gradients)
- ImaginePro: MidJourney --cref guide (plain background, studio lighting)
- Stable Diffusion Art: Wan 2.1 VACE Reference Tutorial (RMBG background removal recommended)
- Together AI: FLUX.2 Multi-Reference documentation (clean references)
- IP-Adapter Face Wiki (Tencent): face segmentation reduces background dependence
- CivitAI: AnimateDiff/IPAdapter/ControlNet background handling guide

### Recoil Pipeline Experience
- Jinx T2I (28 images, 1000 steps), Jinx WAN video (1000 steps), Kian T2I (Z-Image)
- Hybrid pipeline testing (Feb 13-14, 2026) — Qwen angle coverage + Gemini diversity
- Engine shootout (Feb 14, 2026) — 5-engine comparison, three-pass architecture validated
- NBP skin texture test (Feb 14, 2026) — confirmed pore/freckle detail with anti-airbrushing prompts
- Full batch test (Feb 14, 2026) — 60-job JINX batch, 57/60 succeeded. Identified: identity drift, eye softness, back-angle face rotation, environment sameness. Led to identity lock prompt, dual-reference, back-angle 2-pass strategy, environment rotation pool.
