# Pipeline: How All the Components Work Together

This document explains the end-to-end Ideogram 4 inference pipeline
conceptually. For the architecture spec and code pointers, see
[model_architecture.md](model_architecture.md).

## Overview

Ideogram 4 is a **flow-matching text-to-image model** built on a
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
components:

```
 ┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
 │  Qwen3-VL   │   │  Ideogram4           │   │  KL VAE      │   │           │
 │  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
 │  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
 └─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
     frozen              trainable                 frozen
```

## 1. Text Encoder — Qwen3-VL-8B-Instruct

The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
vision-language model, used in text-only mode (no vision inputs).

**What it does:**
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
  24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a
  multi-scale text representation.

**Why multi-layer extraction?** Different layers capture different levels of
abstraction: early layers encode surface-level token information, while later
layers encode deeper semantic meaning. Concatenating them gives the DiT access
to both kinds of information at once.

**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
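
The extraction and concatenation steps can be sketched as follows. This is a
minimal NumPy illustration in which random arrays stand in for real hidden
states. The function name `concat_layer_features`, the toy sizes, and the
convention that `hidden_states[i]` holds the output of layer `i` are
assumptions made for the example, not details taken from the Ideogram 4 code.

```python
import numpy as np

# Layer indices listed above (13 in total).
EXTRACT_LAYERS = (0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35)


def concat_layer_features(hidden_states, layers=EXTRACT_LAYERS):
    """Concatenate selected per-layer hidden states along the feature axis.

    hidden_states: sequence of arrays, each (batch, num_text_tokens, hidden_dim).
    Assumes hidden_states[i] is the output of layer i (hypothetical convention).
    """
    return np.concatenate([hidden_states[i] for i in layers], axis=-1)


# Toy sizes; the real hidden_dim is much larger.
batch, num_text_tokens, hidden_dim = 2, 5, 8
rng = np.random.default_rng(0)
hidden_states = [
    rng.standard_normal((batch, num_text_tokens, hidden_dim)) for _ in range(36)
]

features = concat_layer_features(hidden_states)
print(features.shape)  # (2, 5, 104), i.e. hidden_dim * 13
```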

## 2. DiT Backbone — Ideogram4Transformer

The core generative model is a 34-layer single-stream Diffusion Transformer.

### Sequence layout

Text tokens and image latent tokens are concatenated into one sequence and
processed through the same self-attention layers.

```
Sequence layout (per sample):

  ┌───────────────────┬────────────────────────┐
  │  text tokens      │  image latent tokens   │
  │  (up to 2048)     │  (grid_h × grid_w)     │
  └───────────────────┴────────────────────────┘
           ▲                    ▲
     Qwen3-VL features    noisy latents z_t
```

### Key components per block

- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
  positional encoding is 3-dimensional: for text tokens it uses a 1D position
  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
  coordinates. This lets text and image tokens coexist in a unified positional
  space.
- **SwiGLU MLP** — the feed-forward layer uses a gated linear unit with SiLU
  activation.
- **Adaptive Layer Norm (AdaLN)** conditions every layer on the current noise
  level: the scalar timestep `t` is embedded, and the embedding produces
  per-block scale and gate parameters.
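
One way to picture the unified positional space is to build the 3-axis
position IDs explicitly. The sketch below rests on several assumptions: the
function name, the choice to offset image coordinates so they start right
after the last text position, and the constant temporal index for a single
image are hypothetical conventions, not confirmed details of Ideogram 4.

```python
def build_position_ids(num_text_tokens, grid_h, grid_w):
    """Return one (temporal, height, width) triple per token in the sequence."""
    # Text tokens: a 1D position broadcast to all three axes.
    text_ids = [(i, i, i) for i in range(num_text_tokens)]

    # Image tokens: (temporal, height, width) grid coordinates.
    # Assumption: coordinates are offset to start after the text positions.
    offset = num_text_tokens
    image_ids = [
        (offset, offset + row, offset + col)
        for row in range(grid_h)
        for col in range(grid_w)
    ]
    return text_ids + image_ids


ids = build_position_ids(num_text_tokens=3, grid_h=2, grid_w=2)
for triple in ids:
    print(triple)
```

With 3 text tokens and a 2×2 grid this yields 7 triples: the text tokens share
one value across all axes, while the image tokens differ along the height and
width axes.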

### Flow matching

The model is trained with a **flow-matching** objective. Instead of predicting
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
defines the ODE:

```
dz/dt = v(z_t, t)
```

At inference time, we start from pure Gaussian noise `z_1` and integrate
backward to `z_0` (the clean image) using the Euler method:

```
z_{t-dt} = z_t - v(z_t, t) * dt
```
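
As a toy illustration of the integration, assume the straight-line path
`z_t = (1 - t) * z_0 + t * z_1`, which is common in flow matching (this
document does not state which path Ideogram 4 uses). The true velocity is then
the constant `z_1 - z_0`, and Euler steps from `t = 1` down to `t = 0` recover
`z_0`. All names below are hypothetical, and an exact velocity function stands
in for the DiT.

```python
def euler_integrate(velocity, z_start, timesteps):
    """Integrate dz/dt = velocity(z, t) along a decreasing list of times."""
    z = z_start
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        # s < t, so (s - t) is negative: we step backward in time.
        z = z + velocity(z, t) * (s - t)
    return z


z_0, z_1 = 2.0, -0.5  # toy "clean" value and "noise" value (scalars)


def velocity(z, t):
    # Exact velocity of the assumed straight-line path.
    return z_1 - z_0


num_steps = 4
timesteps = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 ... 0.0
z_final = euler_integrate(velocity, z_1, timesteps)
print(z_final)  # close to z_0 = 2.0
```

Because the velocity here is constant, Euler integration is exact for any
number of steps; a learned velocity field generally is not constant, which is
why the number of steps affects quality.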

### Noise schedule

The timestep distribution follows a **logit-normal schedule** parameterized by
`(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
different noise levels — higher `mu` shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:

```
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
```

where `base_pixels = 512 * 512`.
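
The resolution adjustment can be checked with a few lines of Python. This is a
sketch that assumes `log` means the natural logarithm; the function name
`adjust_mu` and the value `mu_base = 0.0` are placeholders chosen for the
example, not real preset values.

```python
import math

BASE_PIXELS = 512 * 512


def adjust_mu(mu_base, width, height, base_pixels=BASE_PIXELS):
    """Shift the logit-normal mean according to image resolution."""
    num_pixels = width * height
    return mu_base + 0.5 * math.log(num_pixels / base_pixels)


mu_base = 0.0  # hypothetical value, for illustration only
print(adjust_mu(mu_base, 512, 512))    # 0.0: no shift at the base resolution
print(adjust_mu(mu_base, 1024, 1024))  # ~0.693 (= ln 2): 4x the pixels
```

Under this formula, each quadrupling of the pixel count raises `mu` by `ln 2`.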

## 3. Classifier-Free Guidance (CFG)

At each sampling step, two forward passes are run through the DiT:

1. **Conditional (positive):** full text features + noisy image latents.
2. **Unconditional (negative):** zeroed text features + noisy image latents
   (image-only tokens, asymmetric CFG).

The guided velocity is a weighted combination:

```
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
```

where `gw` is the per-step guidance weight. With `gw > 1`, the combination
extrapolates from the unconditional prediction past the conditional one, which
amplifies the text-conditional signal and produces images that follow the
prompt more faithfully.
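
A small numeric sketch of the combination, using plain Python lists in place
of velocity tensors. The function name `cfg_combine` and the toy values are
made up for the example. The same expression can be written as
`v_uncond + gw * (v_cond - v_uncond)`, which makes the extrapolation explicit.

```python
def cfg_combine(v_cond, v_uncond, gw):
    """Blend conditional and unconditional velocities with guidance weight gw."""
    return [gw * c + (1 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, 0.5]
v_uncond = [0.0, 1.0]

print(cfg_combine(v_cond, v_uncond, gw=1))  # [1.0, 0.5]: purely conditional
print(cfg_combine(v_cond, v_uncond, gw=3))  # [3.0, -0.5]: pushed past v_cond
```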

**Asymmetric CFG:** The unconditional branch only processes image tokens (no
text padding), making it computationally cheaper than a full-sequence negative
pass.

**Per-step schedules:** The guidance weight can vary across steps. The
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
`gw=3` for the final 3 "polish" steps near `t=0`.
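
A per-step schedule of this shape can be represented as a plain list with one
weight per sampling step. The helper below is a hypothetical illustration (its
name and signature are not from the codebase); only the numbers 48, 45, 7, and
3 come from the preset description above.

```python
def two_stage_guidance(num_steps, main_gw, polish_gw, polish_steps):
    """Return one guidance weight per step, in sampling order (t=1 first)."""
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    return [main_gw] * (num_steps - polish_steps) + [polish_gw] * polish_steps


# Values quoted above for the V4_QUALITY_48 preset.
gw_schedule = two_stage_guidance(num_steps=48, main_gw=7, polish_gw=3, polish_steps=3)
print(len(gw_schedule))  # 48
print(gw_schedule[44:])  # [7, 3, 3, 3]
```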


## 4. VAE Decoder — KL Autoencoder

The denoised latent `z_0` is decoded to pixel space using a frozen KL
autoencoder.

**What it does:**
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
  `(batch, channels, grid_h * 2, grid_w * 2)`.
- **Denormalization:** Per-channel shift and scale are applied to undo the
  latent normalization used during training.
- **Decoding:** The VAE decoder maps latents to RGB pixels.
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
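
The unpatching and clipping steps can be sketched with NumPy arrays standing
in for tensors. Two points are assumptions: the per-token feature layout
`(channels, patch_row, patch_col)` and the rounding used when converting to
uint8. Denormalization and the decoder itself are omitted because their
parameters are not given here. Function names are illustrative.

```python
import numpy as np


def unpatch(tokens, grid_h, grid_w, channels, patch=2):
    """(B, grid_h * grid_w, channels * patch**2) -> (B, channels, H, W)."""
    batch = tokens.shape[0]
    # Assumed feature layout per token: (channels, patch_row, patch_col).
    x = tokens.reshape(batch, grid_h, grid_w, channels, patch, patch)
    # Reorder to (batch, channels, grid_h, patch_row, grid_w, patch_col).
    x = x.transpose(0, 3, 1, 4, 2, 5)
    return x.reshape(batch, channels, grid_h * patch, grid_w * patch)


def to_uint8(pixels):
    """Clamp decoder output to [-1, 1] and rescale to [0, 255] uint8."""
    pixels = np.clip(pixels, -1.0, 1.0)
    # Rounding to nearest is an assumption; an implementation may truncate.
    return np.round((pixels + 1.0) * 127.5).astype(np.uint8)


tokens = np.zeros((1, 64 * 64, 128), dtype=np.float32)
latents = unpatch(tokens, grid_h=64, grid_w=64, channels=32)
print(latents.shape)  # (1, 32, 128, 128)
print(to_uint8(np.array([-2.0, -1.0, 1.0, 2.0])).tolist())  # [0, 0, 255, 255]
```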

**Compression factor:** The autoencoder provides 8× spatial compression on each
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
is represented as a 64×64 grid of latent tokens, each with 128 channels
(32 base channels × 2² patch).
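
The arithmetic above can be written as a small helper. The function name and
the divisibility check are assumptions added for the example; the factors 8,
2, and 32 are the ones stated in this section.

```python
def latent_layout(height, width, vae_factor=8, patch=2, base_channels=32):
    """Return (grid_h, grid_w, channels_per_token) for an image size."""
    stride = vae_factor * patch  # 16 pixels per token along each axis
    # Assumption: sizes must divide evenly; real code may pad or crop instead.
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return height // stride, width // stride, base_channels * patch**2


grid_h, grid_w, channels = latent_layout(1024, 1024)
print(grid_h, grid_w, channels)  # 64 64 128
print(grid_h * grid_w)           # 4096 image tokens in the DiT sequence
```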

## Putting it all together

```python
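# Sketch of one generation call. `encode_text`, `dit`, and `vae_decode` are
# placeholder callables standing in for the real components.
import torch


def generate(prompt, encode_text, dit, vae_decode, timesteps, gw, grid_h, grid_w):
    # timesteps: num_steps + 1 decreasing values from 1.0 (noise) to 0.0 (clean)
    # gw: one guidance weight per step, in sampling order

    # 1. Encode text
    text_features = encode_text(prompt)  # (B, L_text, D)
    batch = text_features.shape[0]

    # 2. Initialize noise
    z = torch.randn(batch, grid_h * grid_w, 128)  # pure noise at t=1

    # 3. Euler integration from t=1 to t=0
    for step in range(len(timesteps) - 1):
        t, s = timesteps[step], timesteps[step + 1]  # current and next time, s < t

        # Conditional pass (text + image)
        v_cond = dit(text_features, z, t)

        # Unconditional pass (zeroed text features)
        v_uncond = dit(torch.zeros_like(text_features), z, t)

        # CFG combination
        v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

        # Euler step; (s - t) is negative because we integrate backward in time
        z = z + v * (s - t)

    # 4. Decode to pixels
    return vae_decode(z)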
```