Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
@@ -0,0 +1,183 @@
+# Pipeline: How All the Components Work Together
+
+This document explains the end-to-end Ideogram 4 inference pipeline
+conceptually. For the architecture spec and code pointers, see
+[model_architecture.md](model_architecture.md).
+
+## Overview
+
+Ideogram 4 is a **flow-matching text-to-image model** built on a
+**single-stream DiT** (Diffusion Transformer). The pipeline has four main
+components:
+
+```
+ ┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
+ │  Qwen3-VL   │   │  Ideogram4           │   │  KL          │   │           │
+ │  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
+ │  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
+ └─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
+     frozen              trainable                 frozen
+```
+
+## 1. Text Encoder — Qwen3-VL-8B-Instruct
+
+The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+vision-language model, used in text-only mode (no vision inputs).
+
+**What it does:**
+- Tokenizes the prompt using the Qwen3 chat template.
+- Runs a forward pass through the 36-layer transformer.
+- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
+  24, 27, 30, 33, 35.
+- Concatenates these hidden states along the feature dimension, producing a
+  multi-level text representation.
+
+**Why multi-layer extraction?** Different layers capture different levels of
+abstraction — early layers encode surface-level token information, while later
+layers encode deeper semantic meaning. Concatenating them gives the DiT access
+to the full spectrum.
+
+**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
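+
+As a rough illustration of the shape arithmetic, the sketch below rebuilds the
+layer list and computes the concatenated feature width. `TAP_LAYERS` and
+`concat_width` are illustrative names, and the hidden size of 4096 is an
+assumption about the Qwen3-VL-8B text model rather than something stated in
+this document.
+
+```python
+# Layers tapped for hidden states, as listed above: every third layer plus the last.
+TAP_LAYERS = list(range(0, 36, 3)) + [35]
+
+def concat_width(hidden_dim: int, num_layers: int = len(TAP_LAYERS)) -> int:
+    """Feature width after concatenating the tapped hidden states."""
+    return hidden_dim * num_layers
+
+print(len(TAP_LAYERS))     # 13
+# hidden_dim=4096 is an assumed value, not stated in this document.
+print(concat_width(4096))  # 53248
+```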
+
+## 2. DiT Backbone — Ideogram4Transformer
+
+The core generative model is a 34-layer single-stream Diffusion Transformer.
+
+### Sequence layout
+
+Text tokens and image latent tokens are concatenated into one sequence and
+processed through the same self-attention layers.
+
+```
+Sequence layout (per sample):
+
+  ┌───────────────────┬────────────────────────┐
+  │  text tokens      │  image latent tokens   │
+  │  (up to 2048)     │  (grid_h × grid_w)     │
+  └───────────────────┴────────────────────────┘
+           ▲                    ▲
+     Qwen3-VL features    noisy latents z_t
+```
+
+### Key components per block
+
+- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
+  positional encoding is 3-dimensional: for text tokens it uses a 1D position
+  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
+  coordinates. This lets text and image tokens coexist in a unified positional
+  space.
+- **SwiGLU MLP:** the feed-forward layer uses a gated linear unit with SiLU
+  activation.
+- **Adaptive Layer Norm (AdaLN):** the scalar timestep `t` is embedded into a
+  vector, from which each block derives its own scale and gate parameters. This
+  conditions every layer on the current noise level.
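+
+The 3-axis position scheme can be sketched in a few lines. This is a minimal
+illustration, not the model's actual code: `mrope_position_ids` is a
+hypothetical helper, and starting the image block at offset `num_text` on every
+axis is an assumption made for the example.
+
+```python
+def mrope_position_ids(num_text: int, grid_h: int, grid_w: int):
+    """Return one (t, h, w) triple per token: text tokens first, then image tokens."""
+    # Text: a single 1D index repeated on all three axes.
+    ids = [(i, i, i) for i in range(num_text)]
+    # Image: (temporal, height, width) coordinates. The offset is an assumption.
+    offset = num_text
+    for row in range(grid_h):
+        for col in range(grid_w):
+            ids.append((offset, offset + row, offset + col))
+    return ids
+
+ids = mrope_position_ids(num_text=3, grid_h=2, grid_w=2)
+print(ids[:3])  # [(0, 0, 0), (1, 1, 1), (2, 2, 2)]
+print(ids[3:])  # [(3, 3, 3), (3, 3, 4), (3, 4, 3), (3, 4, 4)]
+```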
+
+### Flow matching
+
+The model is trained with a **flow-matching** objective. Instead of predicting
+noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
+defines the ODE:
+
+```
+dz/dt = v(z_t, t)
+```
+
+At inference time, we start from pure Gaussian noise `z_1` and integrate
+backward to `z_0` (the clean image) using the Euler method:
+
+```
+z_{t-dt} = z_t - v(z_t, t) * dt
+```
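+
+To make the backward integration concrete, here is a toy 1D sketch. It assumes
+a uniform time grid and a hand-written velocity function in place of the DiT;
+the real sampler uses the logit-normal schedule described below.
+
+```python
+def euler_sample(velocity, z1, num_steps):
+    """Integrate dz/dt = velocity(z, t) from t=1 down to t=0 with Euler steps."""
+    z = z1
+    for step in range(num_steps, 0, -1):
+        t = step / num_steps        # current time
+        s = (step - 1) / num_steps  # next, lower time
+        z = z + velocity(z, t) * (s - t)  # (s - t) < 0: stepping backward
+    return z
+
+# Toy problem: every path is the straight line z_t = x0 + t * (z1 - x0),
+# so the exact velocity at (z_t, t) is (z_t - x0) / t.
+x0 = 2.0
+toy_velocity = lambda z, t: (z - x0) / t
+
+result = euler_sample(toy_velocity, z1=-1.5, num_steps=8)
+print(abs(result - x0) < 1e-9)  # True
+```
+
+Because the toy velocity is constant along each straight path, Euler steps are
+exact here; a learned velocity field is only approximately straight, which is
+why the step count matters in practice.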
+
+### Noise schedule
+
+The timestep distribution follows a **logit-normal schedule** parameterized by
+`(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
+different noise levels — higher `mu` shifts more steps toward higher noise
+(important for high-resolution images). The schedule auto-adjusts for
+resolution:
+
+```
+mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
+```
+
+where `base_pixels = 512 * 512`.
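+
+The adjustment is easy to check numerically. The sketch below assumes `log` is
+the natural logarithm; `adjusted_mu` is an illustrative name and `mu_base=0.0`
+is an arbitrary example value.
+
+```python
+import math
+
+BASE_PIXELS = 512 * 512
+
+def adjusted_mu(mu_base: float, width: int, height: int) -> float:
+    """Shift the logit-normal mean by 0.5 * log(pixel ratio), natural log assumed."""
+    return mu_base + 0.5 * math.log((width * height) / BASE_PIXELS)
+
+print(adjusted_mu(0.0, 512, 512))              # 0.0
+print(round(adjusted_mu(0.0, 1024, 1024), 4))  # 0.6931
+```
+
+Doubling both sides quadruples the pixel count, so `mu` grows by
+`0.5 * log(4) = log(2)`.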
+
+## 3. Classifier-Free Guidance (CFG)
+
+At each sampling step, two forward passes are run through the DiT:
+
+1. **Conditional (positive):** full text features + noisy image latents.
+2. **Unconditional (negative):** zeroed text features + noisy image latents
+   (image-only tokens, asymmetric CFG).
+
+The guided velocity is a weighted combination:
+
+```
+v_guided = gw * v_conditional + (1 - gw) * v_unconditional
+```
+
+where `gw` is the per-step guidance weight. Setting `gw = 1` recovers the purely
+conditional prediction; with `gw > 1` the result is pushed past `v_conditional`,
+away from `v_unconditional`, producing images that follow the prompt more
+faithfully.
+
+**Asymmetric CFG:** The unconditional branch only processes image tokens (no
+text padding), making it computationally cheaper than a full-sequence negative
+pass.
+
+**Per-step schedules:** The guidance weight can vary across steps. The
+`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
+`gw=3` for the final 3 "polish" steps near `t=0`.
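+
+A minimal sketch of the combination and of a per-step schedule, using plain
+lists in place of tensors. `cfg_combine` and `gw_schedule` are illustrative
+names; the schedule lists weights in sampling order, following the
+`V4_QUALITY_48` description above.
+
+```python
+def cfg_combine(v_cond, v_uncond, gw):
+    """Guided velocity: gw * v_cond + (1 - gw) * v_uncond, elementwise."""
+    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]
+
+# 45 steps at gw=7, then 3 final "polish" steps at gw=3.
+gw_schedule = [7.0] * 45 + [3.0] * 3
+
+v_cond, v_uncond = [1.0, 0.0], [0.5, 0.5]
+print(cfg_combine(v_cond, v_uncond, 1.0))  # [1.0, 0.0]
+print(cfg_combine(v_cond, v_uncond, 3.0))  # [2.0, -1.0]
+```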
+
+
+## 4. VAE Decoder — KL Autoencoder
+
+The denoised latent `z_0` is decoded to pixel space using a frozen KL
+autoencoder.
+
+**What it does:**
+- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
+  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
+  `(batch, channels, grid_h * 2, grid_w * 2)`.
+- **Denormalization:** Per-channel shift and scale are applied to undo the
+  latent normalization used during training.
+- **Decoding:** The VAE decoder maps latents to RGB pixels.
+- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
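+
+The unpatching and clipping steps can be sketched with NumPy. Two details here
+are assumptions rather than facts from this document: the order in which a
+token stores its 2×2 patch (channel, then patch row, then patch column), and
+rounding to the nearest integer when converting to uint8.
+
+```python
+import numpy as np
+
+def unpatch(tokens: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
+    """(B, grid_h*grid_w, C*4) -> (B, C, grid_h*2, grid_w*2).
+
+    Assumes each token is laid out as (channel, patch_row, patch_col).
+    """
+    b, _, d = tokens.shape
+    c = d // 4
+    x = tokens.reshape(b, grid_h, grid_w, c, 2, 2)
+    x = x.transpose(0, 3, 1, 4, 2, 5)  # (B, C, grid_h, 2, grid_w, 2)
+    return x.reshape(b, c, grid_h * 2, grid_w * 2)
+
+def to_uint8(pixels: np.ndarray) -> np.ndarray:
+    """Clamp to [-1, 1], then map linearly onto 0..255."""
+    clipped = np.clip(pixels, -1.0, 1.0)
+    return np.round((clipped + 1.0) * 127.5).astype(np.uint8)
+
+latents = unpatch(np.zeros((1, 64 * 64, 128), dtype=np.float32), 64, 64)
+print(latents.shape)                                  # (1, 32, 128, 128)
+print(to_uint8(np.array([-2.0, 0.0, 1.0])).tolist())  # [0, 128, 255]
+```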
+
+**Compression factor:** The autoencoder provides 8× spatial compression on each
+axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
+is represented as a 64×64 grid of latent tokens, each with 128 channels
+(32 base channels × 2² patch).
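+
+The same arithmetic as a small helper, assuming image sides are multiples of
+16. `token_grid` and the constant names are illustrative.
+
+```python
+VAE_DOWNSAMPLE = 8   # spatial compression per axis
+PATCH = 2            # 2x2 latent patching in the DiT
+BASE_CHANNELS = 32   # VAE latent channels
+
+def token_grid(width: int, height: int):
+    """Return (grid_h, grid_w, channels_per_token) for an image size."""
+    stride = VAE_DOWNSAMPLE * PATCH  # 16 pixels per token along each axis
+    return height // stride, width // stride, BASE_CHANNELS * PATCH * PATCH
+
+grid_h, grid_w, channels = token_grid(1024, 1024)
+print(grid_h, grid_w, channels)  # 64 64 128
+print(grid_h * grid_w)           # 4096
+```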
+
+## Putting it all together
+
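+```python
+# Pseudocode for one generation call. `qwen3_vl`, `dit`, `vae`, `schedule`,
+# and `gw` are placeholders for the real components, so this does not run as-is.
+import torch
+
+# 1. Encode text
+text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)
+
+# 2. Initialize noise
+z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1
+
+# 3. Euler integration from t=1 to t=0.
+# Assumed convention: schedule(i) is the i-th time boundary, with
+# schedule(num_steps) = 1 and schedule(0) = 0.
+with torch.no_grad():
+    for step in reversed(range(num_steps)):
+        t = schedule(step + 1)  # current (noisier) time
+        s = schedule(step)      # next (cleaner) time, s < t
+
+        # Conditional pass (text + image)
+        v_cond = dit(text_features, z, t)
+
+        # Unconditional pass (image only, zeroed text)
+        v_uncond = dit(torch.zeros_like(text_features), z, t)
+
+        # CFG combination
+        v = gw[step] * v_cond + (1 - gw[step]) * v_uncond
+
+        # Euler step; (s - t) is negative because we integrate backward in time
+        z = z + v * (s - t)
+
+# 4. Decode to pixels (unpatch, denormalize, decode, clamp)
+image = vae.decode(z)
+```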