Initial commit: Ideogram 4 Prompt Builder
PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,183 @@
# Pipeline: How All the Components Work Together

This document explains the end-to-end Ideogram 4 inference pipeline
conceptually. For the architecture spec and code pointers, see
[model_architecture.md](model_architecture.md).

## Overview

Ideogram 4 is a **flow-matching text-to-image model** built on a
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
components:

```
┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
│  Qwen3-VL   │   │  Ideogram4           │   │  KL VAE      │   │           │
│  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
│  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
└─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
    frozen               trainable                frozen
```

## 1. Text Encoder: Qwen3-VL-8B-Instruct

The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
vision-language model, used in text-only mode (no vision inputs).

**What it does:**
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
  24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a
  multi-scale text representation.

**Why multi-layer extraction?** Different layers capture different levels of
abstraction: early layers encode surface-level token information, while later
layers encode deeper semantic meaning. Concatenating them gives the DiT access
to the full spectrum.

**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
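
The extraction step can be pictured with a small sketch. Plain Python lists stand in for tensors, `concat_hidden_states` is a hypothetical helper, and the hidden size of 4096 is an assumption used only to make the output width concrete; none of this is taken from the actual encoder code.

```python
# Layer indices listed above: every third layer, plus the final layer 35.
SELECTED_LAYERS = list(range(0, 36, 3)) + [35]


def concat_hidden_states(hidden_states, layers=SELECTED_LAYERS):
    """Concatenate one token's per-layer feature vectors along the feature axis.

    hidden_states is a hypothetical list indexed by layer; each entry is a
    list of floats of length hidden_dim.
    """
    features = []
    for layer in layers:
        features.extend(hidden_states[layer])
    return features


hidden_dim = 4096  # assumed hidden size, for illustration only
# Fake per-layer states for a single token: layer i is filled with the value i.
token_states = [[float(layer)] * hidden_dim for layer in range(36)]
token_features = concat_hidden_states(token_states)
print(len(SELECTED_LAYERS), len(token_features))
```

Under these assumptions each text token carries `13 * hidden_dim` features, which matches the last dimension of the output shape.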

## 2. DiT Backbone: Ideogram4Transformer

The core generative model is a 34-layer single-stream Diffusion Transformer.

### Sequence layout

Text tokens and image latent tokens are concatenated into one sequence and
processed through the same self-attention layers.

```
Sequence layout (per sample):

┌───────────────────┬────────────────────────┐
│   text tokens     │  image latent tokens   │
│   (up to 2048)    │  (grid_h × grid_w)     │
└───────────────────┴────────────────────────┘
         ▲                     ▲
  Qwen3-VL features     noisy latents z_t
```
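
To get a feel for the sequence sizes involved, here is a minimal sketch that counts tokens for one sample. It assumes one image token covers 16×16 pixels (8× from the autoencoder times the 2×2 patching described in the VAE section); `sequence_length` is a hypothetical helper, not part of the model code.

```python
def sequence_length(num_text_tokens, height, width,
                    max_text_tokens=2048, pixels_per_token=16):
    """Total tokens in the joint sequence: text tokens + grid_h * grid_w."""
    if num_text_tokens > max_text_tokens:
        raise ValueError("prompt exceeds the text token budget")
    if height % pixels_per_token or width % pixels_per_token:
        raise ValueError("image size must be a multiple of pixels_per_token")
    grid_h = height // pixels_per_token
    grid_w = width // pixels_per_token
    return num_text_tokens + grid_h * grid_w


# A 300-token prompt with a 1024x1024 image: 300 + 64 * 64 tokens.
print(sequence_length(300, 1024, 1024))
```

Because self-attention runs over the whole sequence, the image grid usually dominates the cost at high resolution.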

### Key components per block

- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
  positional encoding is 3-dimensional: for text tokens it uses a 1D position
  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
  coordinates. This lets text and image tokens coexist in a unified positional
  space.
- **SwiGLU MLP**: the feed-forward layer uses a gated linear unit with SiLU
  activation.
- **Adaptive Layer Norm (AdaLN)**: the scalar timestep `t` is embedded, and the
  embedding generates per-block scale and gate parameters. This conditions
  every layer on the current noise level.
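
The 3-axis position scheme can be sketched as follows. This is only an illustration of the idea: the temporal index of 0 for a single image, the row-major token order, and the absence of any offset between text and image positions are assumptions, and `position_ids` is a hypothetical helper.

```python
def position_ids(num_text_tokens, grid_h, grid_w):
    """Build (temporal, height, width) position triples for one sample."""
    # Text: a 1D position i, broadcast to all three axes.
    text = [(i, i, i) for i in range(num_text_tokens)]
    # Image: grid coordinates in row-major order, temporal index assumed 0.
    image = [(0, h, w) for h in range(grid_h) for w in range(grid_w)]
    return text + image


ids = position_ids(num_text_tokens=2, grid_h=2, grid_w=2)
for triple in ids:
    print(triple)
```

Each axis of the triple then drives its own slice of the rotary embedding, so text and image tokens share a single attention computation.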

### Flow matching

The model is trained with a **flow-matching** objective. Instead of predicting
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
defines the ODE:

```
dz/dt = v(z_t, t)
```

At inference time, we start from pure Gaussian noise `z_1` and integrate
backward in time to `z_0` (the clean image) using the Euler method. With a
positive step size `dt`, each step is:

```
z_{t-dt} = z_t - v(z_t, t) * dt
```
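
A toy example may help make the integration direction concrete. The sketch below uses a uniform time grid and a straight-line path `z_t = (1 - t) * z_0 + t * z_1`, whose velocity is the constant `z_1 - z_0`; both are assumptions chosen for illustration, and a lambda stands in for the DiT.

```python
def euler_sample(velocity, z_start, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=1 down to t=0 with Euler steps."""
    z = z_start
    for i in range(num_steps):
        t = 1.0 - i / num_steps          # current time
        s = 1.0 - (i + 1) / num_steps    # next time, with s < t
        # (s - t) is negative, so the update moves z toward t=0.
        z = z + velocity(z, t) * (s - t)
    return z


# Scalar "image" z_0 and scalar "noise" z_1 for the toy path.
z_0, z_1 = 2.0, -1.0
recovered = euler_sample(lambda z, t: z_1 - z_0, z_start=z_1, num_steps=8)
print(recovered)
```

With an exact, constant velocity the Euler steps land back on `z_0`; a learned velocity field is only approximate, which is why the number of steps affects quality.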

### Noise schedule

The timestep distribution follows a **logit-normal schedule** parameterized by
`(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
different noise levels: higher `mu` shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:

```
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
```

where `base_pixels = 512 * 512`.
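
As a worked example of the resolution adjustment, the sketch below evaluates the formula for a few square sizes. It assumes `log` is the natural logarithm and uses `mu_base = 0.0` as a placeholder, since the real default is not stated here. The median of a logit-normal distribution is `sigmoid(mu)`, which gives a quick read on where the timesteps concentrate.

```python
import math

BASE_PIXELS = 512 * 512


def adjusted_mu(mu_base, height, width):
    """mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)."""
    return mu_base + 0.5 * math.log(height * width / BASE_PIXELS)


def logit_normal_median(mu):
    """Median of a logit-normal distribution with location mu: sigmoid(mu)."""
    return 1.0 / (1.0 + math.exp(-mu))


mu_base = 0.0  # placeholder value, assumed for illustration
for side in (512, 1024, 2048):
    mu = adjusted_mu(mu_base, side, side)
    print(side, round(mu, 3), round(logit_normal_median(mu), 3))
```

Quadrupling the pixel count adds `ln 2` to `mu`, so under these assumptions the median timestep moves from 0.5 at 512×512 to about 0.667 at 1024×1024.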

## 3. Classifier-Free Guidance (CFG)

At each sampling step, two forward passes are run through the DiT:

1. **Conditional (positive):** full text features + noisy image latents.
2. **Unconditional (negative):** zeroed text features + noisy image latents
   (image-only tokens, asymmetric CFG).

The guided velocity is a weighted combination:

```
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
```

where `gw` is the per-step guidance weight. With
`gw > 1`, the model amplifies the text-conditional signal and suppresses the
unconditional prediction, producing images that follow the prompt more
faithfully.

**Asymmetric CFG:** The unconditional branch only processes image tokens (no
text padding), making it computationally cheaper than a full-sequence negative
pass.

**Per-step schedules:** The guidance weight can vary across steps. The
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
`gw=3` for the final 3 "polish" steps near `t=0`.
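
The combination and a per-step schedule can be sketched in a few lines. Lists of floats stand in for velocity tensors, `guide` is a hypothetical helper, and `guidance_schedule` only mirrors the 45 + 3 split described for `V4_QUALITY_48`; it is not read from the actual preset.

```python
def guide(v_cond, v_uncond, gw):
    """Blend conditional and unconditional velocities element-wise."""
    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]


# 45 steps at gw=7 followed by 3 steps at gw=3, in sampling order.
guidance_schedule = [7.0] * 45 + [3.0] * 3

v_cond, v_uncond = [1.0, 0.5], [0.5, 0.5]
strong = guide(v_cond, v_uncond, guidance_schedule[0])
polish = guide(v_cond, v_uncond, guidance_schedule[-1])
neutral = guide(v_cond, v_uncond, 1.0)
print(strong, polish, neutral)
```

The same formula can be read as `v_uncond + gw * (v_cond - v_uncond)`: `gw = 1` returns the conditional prediction unchanged, and larger weights extrapolate further along the direction in which the text changes the prediction.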

## 4. VAE Decoder: KL Autoencoder

The denoised latent `z_0` is decoded to pixel space using a frozen KL
autoencoder.

**What it does:**
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
  `(batch, channels, grid_h * 2, grid_w * 2)`.
- **Denormalization:** Per-channel shift and scale are applied to undo the
  latent normalization used during training.
- **Decoding:** The VAE decoder maps latents to RGB pixels.
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.

**Compression factor:** The autoencoder provides 8× spatial compression on each
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
is represented as a 64×64 grid of latent tokens, each with 128 channels
(32 base channels × 2² patch).
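
The shape arithmetic and the final pixel conversion can be checked with a short sketch. The helper names are hypothetical, and the exact rounding used when converting to uint8 is an assumption; only the factors (8×, 2×2, 32 channels) and the clamp range come from the description above.

```python
def latent_token_grid(height, width, vae_factor=8, patch=2):
    """Token grid (grid_h, grid_w) for an image of the given pixel size."""
    stride = vae_factor * patch  # 16 pixels per token along each axis
    if height % stride or width % stride:
        raise ValueError("image size must be a multiple of the total stride")
    return height // stride, width // stride


def token_channels(base_channels=32, patch=2):
    """Channels per token after folding a patch x patch block into channels."""
    return base_channels * patch * patch


def to_uint8(x):
    """Clamp a decoder output value to [-1, 1] and map it to 0..255."""
    x = max(-1.0, min(1.0, x))
    return int(round((x + 1.0) * 127.5))  # rounding rule is an assumption


print(latent_token_grid(1024, 1024), token_channels())
print(to_uint8(-1.0), to_uint8(1.0), to_uint8(3.0))
```

Non-square sizes follow the same rule, for example a 1024×768 image gives a 64×48 token grid.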

## Putting it all together

```python
import torch

# Pseudocode for one generation call. qwen3_vl, dit, vae, schedule, gw, and the
# shape variables stand in for the real components and settings.

# 1. Encode text
text_features = qwen3_vl.encode(prompt)      # (B, L_text, D)

# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)     # pure noise at t=1

# 3. Euler integration from t=1 to t=0
# step counts down, so t is the current time and s the next, lower one;
# the last iteration relies on schedule(-1) returning t=0.
for step in reversed(range(num_steps)):
    t = schedule(step)
    s = schedule(step - 1)

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)

    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(torch.zeros_like(text_features), z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step: (s - t) is negative, so z moves toward t=0
    z = z + v * (s - t)

# 4. Decode to pixels
image = vae.decode(z)
```