# Pipeline: How All the Components Work Together

This document explains the end-to-end Ideogram 4 inference pipeline
conceptually. For the architecture spec and code pointers, see
[model_architecture.md](model_architecture.md).

## Overview

Ideogram 4 is a **flow-matching text-to-image model** built on a
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
components:

```
┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
│  Qwen3-VL   │   │      Ideogram4       │   │    KL VAE    │   │           │
│    Text     ├──►│  Transformer (DiT)   ├──►│     VAE      ├──►│   Image   │
│   Encoder   │   │   + Euler Sampler    │   │   Decoder    │   │           │
└─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
     frozen              trainable                 frozen
```

## 1. Text Encoder — Qwen3-VL-8B-Instruct

The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
vision-language model, used in text-only mode (no vision inputs).

**What it does:**
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
  24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a
  multi-layer text representation.

**Why multi-layer extraction?** Different layers capture different levels of
abstraction: early layers encode surface-level token information, while later
layers encode deeper semantic meaning. Concatenating them gives the DiT access
to the full spectrum.

**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.

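To make the layer selection and the output width concrete, here is a minimal pure-Python sketch of the concatenation for a single sample. The helper name `concat_hidden_states` and the toy `hidden_dim = 4` are assumptions for illustration only; in tensor code the same step would typically be a concatenation along the last axis (for example `torch.cat(..., dim=-1)`).

```python
# Layers 0, 3, ..., 33 plus the last layer, 35 (13 layers in total).
SELECTED_LAYERS = list(range(0, 36, 3)) + [35]


def concat_hidden_states(hidden_states, layers=SELECTED_LAYERS):
    """Concatenate per-token features from the selected layers.

    hidden_states[layer][token] is a list of floats for one sample.
    Returns one feature list per token of length hidden_dim * len(layers).
    """
    num_tokens = len(hidden_states[layers[0]])
    return [
        [x for layer in layers for x in hidden_states[layer][tok]]
        for tok in range(num_tokens)
    ]


# Toy input: 36 layers, 2 tokens, hidden_dim = 4 (illustrative value only).
# Every feature of layer i is set to float(i) so the origin stays visible.
toy_hidden_dim = 4
toy_states = [
    [[float(layer)] * toy_hidden_dim for _ in range(2)] for layer in range(36)
]
features = concat_hidden_states(toy_states)
print(len(SELECTED_LAYERS), len(features), len(features[0]))
```

Each token ends up with `hidden_dim * 13` features, matching the output shape above.
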
## 2. DiT Backbone — Ideogram4Transformer

The core generative model is a 34-layer single-stream Diffusion Transformer.

### Sequence layout

Text tokens and image latent tokens are concatenated into one sequence and
processed through the same self-attention layers.

```
Sequence layout (per sample):

┌───────────────────┬────────────────────────┐
│    text tokens    │  image latent tokens   │
│   (up to 2048)    │   (grid_h × grid_w)    │
└───────────────────┴────────────────────────┘
          ▲                     ▲
  Qwen3-VL features     noisy latents z_t
```

### Key components per block

- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
  positional encoding is 3-dimensional: for text tokens it uses a 1D position
  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
  coordinates. This lets text and image tokens coexist in a unified positional
  space.
- **SwiGLU MLP:** the feed-forward layer uses a gated linear unit with SiLU
  activation.
- **Adaptive Layer Norm (AdaLN):** the scalar timestep `t` is embedded and
  used to generate per-block scale and gate parameters. This conditions every
  layer on the current noise level.

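The 3-axis position scheme described for MRoPE can be sketched as follows. This only illustrates the idea of "1D broadcast for text, (temporal, height, width) for image tokens"; the function names and in particular the `offset` convention (image coordinates starting right after the last text position) are assumptions, not the model's actual indexing.

```python
def text_positions(num_text_tokens):
    """Text tokens: a 1D position copied onto all three axes."""
    return [(i, i, i) for i in range(num_text_tokens)]


def image_positions(grid_h, grid_w, offset=0):
    """Image tokens: (temporal, height, width) coordinates in row-major order.

    `offset` is a hypothetical shift that places the image grid after the
    text positions; a single image has a constant temporal coordinate.
    """
    return [
        (offset, offset + y, offset + x)
        for y in range(grid_h)
        for x in range(grid_w)
    ]


def build_positions(num_text_tokens, grid_h, grid_w):
    """Positions for the full [text | image] sequence of one sample."""
    text = text_positions(num_text_tokens)
    image = image_positions(grid_h, grid_w, offset=num_text_tokens)
    return text + image


positions = build_positions(num_text_tokens=3, grid_h=2, grid_w=2)
for p in positions:
    print(p)
```
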
### Flow matching

The model is trained with a **flow-matching** objective. Instead of predicting
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
defines the ODE:

```
dz/dt = v(z_t, t)
```

At inference time, we start from pure Gaussian noise `z_1` and integrate
backward to `z_0` (the clean image) using the Euler method with step size
`dt > 0`:

```
z_{t-dt} = z_t - v(z_t, t) * dt
```

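A toy example helps show the direction of integration. The sketch below assumes a straight-line path `z_t = (1 - t) * z_0 + t * z_1`, for which the velocity is the constant `z_1 - z_0`; this path and the stand-in `velocity` function are assumptions chosen so the result can be checked by hand, not the trained model.

```python
def euler_integrate(velocity, z_start, timesteps):
    """Integrate dz/dt = velocity(z, t) along a decreasing list of timesteps."""
    z = list(z_start)
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        v = velocity(z, t)
        # s < t, so (s - t) is negative and the update moves toward t = 0.
        z = [zi + vi * (s - t) for zi, vi in zip(z, v)]
    return z


z0_true = [0.5, -1.0, 2.0]   # the "clean" sample
z1 = [1.5, 0.25, -0.75]      # the "noise" sample at t = 1


def velocity(z, t):
    # For the straight-line path the velocity is z_1 - z_0 everywhere.
    return [b - a for a, b in zip(z0_true, z1)]


timesteps = [1 - i / 8 for i in range(9)]   # 1.0, 0.875, ..., 0.0
z0 = euler_integrate(velocity, z1, timesteps)
print(z0)
```

Because the velocity is constant here, Euler integration recovers `z0_true` regardless of the number of steps; with a learned velocity field the step count affects accuracy.
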
### Noise schedule

The timestep distribution follows a **logit-normal schedule** parameterized by
`(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
different noise levels: higher `mu` shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:

```
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
```

where `base_pixels = 512 * 512`.

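As a worked example of the resolution adjustment, the sketch below evaluates the formula and shows its effect on a set of timesteps. It assumes `log` is the natural logarithm, uses placeholder values `mu_base = 0.0` and `sigma = 1.0`, and builds timesteps from evenly spaced quantiles of the logit-normal distribution; that discretization (`logit_normal_timesteps`) is a hypothetical choice for illustration, not necessarily how the sampler constructs its schedule.

```python
import math
from statistics import NormalDist

BASE_PIXELS = 512 * 512


def adjusted_mu(mu_base, width, height):
    """mu_base + 0.5 * ln(num_pixels / base_pixels)."""
    return mu_base + 0.5 * math.log(width * height / BASE_PIXELS)


def logit_normal_timesteps(num_steps, mu, sigma):
    """Hypothetical discretization: sigmoid of evenly spaced normal quantiles."""
    normal = NormalDist(mu, sigma)
    steps = []
    for i in range(num_steps):
        u = (i + 0.5) / num_steps            # quantile strictly inside (0, 1)
        x = normal.inv_cdf(u)
        steps.append(1.0 / (1.0 + math.exp(-x)))
    return sorted(steps, reverse=True)       # from high noise to low noise


mu_512 = adjusted_mu(0.0, 512, 512)          # no shift at the base resolution
mu_1024 = adjusted_mu(0.0, 1024, 1024)       # 0.5 * ln(4) = ln(2)
steps_512 = logit_normal_timesteps(8, mu_512, 1.0)
steps_1024 = logit_normal_timesteps(8, mu_1024, 1.0)
print(mu_512, mu_1024)
print(sum(steps_512) / 8, sum(steps_1024) / 8)
```

Quadrupling the pixel count raises `mu` by `ln(2) ≈ 0.693`, and the average timestep moves toward `t = 1`, that is, toward higher noise.
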
## 3. Classifier-Free Guidance (CFG)

At each sampling step, two forward passes are run through the DiT:

1. **Conditional (positive):** full text features + noisy image latents.
2. **Unconditional (negative):** zeroed text features + noisy image latents
   (image-only tokens, asymmetric CFG).

The guided velocity is a weighted combination:

```
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
```

where `gw` is the per-step guidance weight. With `gw > 1`, the model amplifies
the text-conditional signal and suppresses the unconditional prediction,
producing images that follow the prompt more faithfully.

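The combination is easier to read as an extrapolation: it equals `v_unconditional + gw * (v_conditional - v_unconditional)`. A minimal sketch on plain Python lists (the function name `cfg_combine` and the sample values are illustrative assumptions):

```python
def cfg_combine(v_cond, v_uncond, gw):
    """Classifier-free guidance: gw * v_cond + (1 - gw) * v_uncond."""
    return [gw * c + (1 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, -2.0]
v_uncond = [0.5, 0.0]

same_as_cond = cfg_combine(v_cond, v_uncond, gw=1.0)    # no extra guidance
same_as_uncond = cfg_combine(v_cond, v_uncond, gw=0.0)  # text signal ignored
guided = cfg_combine(v_cond, v_uncond, gw=7.0)          # extrapolates past v_cond
print(same_as_cond, same_as_uncond, guided)
```

With `gw = 7`, each component moves seven times the conditional-minus-unconditional difference away from the unconditional prediction.
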
**Asymmetric CFG:** The unconditional branch only processes image tokens (no
text padding), making it computationally cheaper than a full-sequence negative
pass.

**Per-step schedules:** The guidance weight can vary across steps. The
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
`gw=3` for the final 3 "polish" steps near `t=0`.

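A per-step schedule of this kind can be represented as a plain list of weights. The helper below is a hypothetical sketch (the name `piecewise_guidance` and its parameters are not from the codebase); it lists weights in sampling order, so the first entry belongs to the first step near `t=1`. Code that indexes steps from the `t=0` end would need the list reversed.

```python
def piecewise_guidance(num_steps, main_gw, polish_gw, polish_steps):
    """Guidance weights in sampling order: main phase, then polish phase."""
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    return [main_gw] * (num_steps - polish_steps) + [polish_gw] * polish_steps


# Values taken from the V4_QUALITY_48 description: 45 steps at 7, then 3 at 3.
gw_schedule = piecewise_guidance(
    num_steps=48, main_gw=7.0, polish_gw=3.0, polish_steps=3
)
print(len(gw_schedule), gw_schedule[0], gw_schedule[-1])
```
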
## 4. VAE Decoder — KL Autoencoder

The denoised latent `z_0` is decoded to pixel space using a frozen KL
autoencoder.

**What it does:**
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
  `(batch, channels, grid_h * 2, grid_w * 2)`.
- **Denormalization:** Per-channel shift and scale are applied to undo the
  latent normalization used during training.
- **Decoding:** The VAE decoder maps latents to RGB pixels.
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.

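The unpatching and clipping steps can be sketched in pure Python for a single sample. Two details here are assumptions made for illustration: the order of values inside a token (channel-major, then row, then column within the 2×2 patch) and the rounding used when converting to 8-bit values. The function names are hypothetical.

```python
def unpatch(tokens, grid_h, grid_w, channels):
    """(grid_h * grid_w, channels * 4) -> (channels, grid_h * 2, grid_w * 2).

    Assumes each token stores its values as [channel][patch_row][patch_col].
    """
    out = [
        [[0.0] * (grid_w * 2) for _ in range(grid_h * 2)]
        for _ in range(channels)
    ]
    for idx, tok in enumerate(tokens):
        gy, gx = divmod(idx, grid_w)          # token position in the grid
        for c in range(channels):
            for dy in range(2):
                for dx in range(2):
                    out[c][gy * 2 + dy][gx * 2 + dx] = tok[c * 4 + dy * 2 + dx]
    return out


def to_uint8(x):
    """Clamp to [-1, 1], then rescale to an integer in [0, 255]."""
    x = max(-1.0, min(1.0, x))
    return int(round((x + 1.0) * 127.5))


# Two tokens side by side (grid 1 x 2), one channel, four values per token.
tokens = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
latent = unpatch(tokens, grid_h=1, grid_w=2, channels=1)
print(latent[0])
print(to_uint8(-3.0), to_uint8(1.0))
```
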
**Compression factor:** The autoencoder provides 8× spatial compression on each
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
is represented as a 64×64 grid of latent tokens, each with 128 channels
(32 base channels × 2² patch).

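The arithmetic generalizes to other resolutions. The helper below is an illustrative sketch (the name `latent_token_layout` is hypothetical, and the divisibility check is an assumption about valid input sizes); the factors 8, 2, and 32 come from the paragraph above.

```python
def latent_token_layout(height, width, vae_factor=8, patch=2, base_channels=32):
    """Return (grid_h, grid_w, token_channels) for an image of the given size."""
    stride = vae_factor * patch               # 16 pixels per token on each axis
    if height % stride or width % stride:
        raise ValueError(f"image size must be a multiple of {stride}")
    return height // stride, width // stride, base_channels * patch * patch


grid_h, grid_w, token_channels = latent_token_layout(1024, 1024)
num_tokens = grid_h * grid_w
print(grid_h, grid_w, token_channels, num_tokens)
```

For a 1024×1024 image this gives a 64×64 grid, so the DiT processes 4096 image tokens in addition to the text tokens.
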
## Putting it all together

```python
# Pseudocode for one generation call:
import torch

# 1. Encode text
text_features = qwen3_vl.encode(prompt)           # (B, L_text, D)

# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)          # pure noise at t=1

# 3. Euler integration from t=1 to t=0
for step in reversed(range(num_steps)):
    t = schedule(step)                            # current time
    # Next (smaller) time; the last step lands exactly on t=0.
    s = schedule(step - 1) if step > 0 else 0.0

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)

    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(torch.zeros_like(text_features), z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step (s - t is negative, so z moves toward t=0)
    z = z + v * (s - t)

# 4. Decode to pixels (unpatch, denormalize, decode, clip)
image = vae.decode(z)
```
