Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
@@ -0,0 +1,45 @@
+# Model Architecture
+
+```
+prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
+            │   
+            ▼
+    ┌──────────────────────────────────────────────────┐
+    │    Ideogram4Transformer                         │  
+    │  • 34 × Ideogram4TransformerBlock               │
+    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)   │
+    │      – Ideogram4MLP (SwiGLU)                    │
+    │      – AdaLN scale/gate from t-embedding         │
+    │  • Ideogram4FinalLayer                          │
+    └──────────────────────────────────────────────────┘
+            │  velocity prediction
+            ▼
+    Euler flow-matching sampler with asymmetric CFG
+            │  denoised image latents
+            ▼
+    VAE decode
+            │
+            ▼
+            PIL.Image
+```
+
+The transformer is a single-stream DiT: text tokens (concatenated Qwen3-VL
+hidden states from the layers listed above) and image latent tokens form one
+sequence, and each block applies an AdaLN scale and gate computed from the
+flow-matching timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so
+that text and image tokens share a unified positional space.
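The single-stream modulation described above can be sketched in plain Python. This is a minimal illustration, not the model's code: the names `rms_norm` and `adaln_block`, the `1 + scale` convention, the unweighted RMS norm before modulation, and the identity stand-in for attention/MLP are all assumptions made for the example.

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide a vector by its root-mean-square (no learned weight)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def adaln_block(tokens, scale, gate, sublayer):
    """One scale-modulated, gated residual sublayer over a whole sequence.

    tokens:   list of token vectors (text and image tokens together)
    scale:    per-channel scale, assumed to come from the timestep embedding
    gate:     per-channel gate, assumed to come from the timestep embedding
    sublayer: hypothetical callable standing in for attention or the MLP;
              it maps a list of vectors to a list of vectors
    """
    # Normalise each token, then apply the (1 + scale) modulation.
    normed = [[n * (1.0 + s) for n, s in zip(rms_norm(tok), scale)]
              for tok in tokens]
    mixed = sublayer(normed)
    # Gated residual: the gate controls how much of the sublayer is added.
    return [[t + g * m for t, g, m in zip(tok, gate, row)]
            for tok, row in zip(tokens, mixed)]

# Single stream: text tokens and image latent tokens share one sequence.
text_tokens = [[1.0, 2.0, 3.0, 4.0]]
image_tokens = [[0.5, -0.5, 0.5, -0.5], [2.0, 0.0, -2.0, 0.0]]
sequence = text_tokens + image_tokens

zero = [0.0] * 4
identity = lambda seq: seq

# With a zero gate the block leaves the residual stream untouched.
unchanged = adaln_block(sequence, scale=zero, gate=zero, sublayer=identity)
```

Because both token types sit in one list, the same scale and gate apply to text and image positions alike, which is what distinguishes a single-stream block from a dual-stream design with separate modulation per modality.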
+
+Model spec:
+
+| field             | value         |
+|-------------------|---------------|
+| `emb_dim`         | 4608          |
+| `num_layers`      | 34            |
+| `num_heads`       | 18            |
+| `intermediate`    | 12288         |
+| `adanln_dim`      | 512           |
+| `rope_theta`      | 5_000_000     |
+| `mrope_section`   | (24, 20, 20)  |
+| latent channels   | 32 × 2² = 128 |
+| max text tokens   | 2048          |
+| sampler           | Euler flow-matching, logit-normal schedule, asymmetric CFG |
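
The sampler row can be illustrated with a toy Euler integrator. This is a sketch under stated assumptions, not the project's sampler: it uses a uniform time grid instead of the logit-normal schedule, standard (symmetric) classifier-free guidance because the doc does not define the asymmetric variant, a time axis running from t=1 (noise) to t=0 (data), and a hypothetical `toy_velocity` in place of the transformer.

```python
def euler_sample(velocity, x, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=1 down to t=0 with Euler steps."""
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(ts, ts[1:]):
        v = velocity(x, t)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x

def cfg_velocity(v_cond, v_uncond, guidance):
    """Standard CFG: extrapolate from the unconditional to the conditional velocity."""
    def guided(x, t):
        c, u = v_cond(x, t), v_uncond(x, t)
        return [ui + guidance * (ci - ui) for ci, ui in zip(c, u)]
    return guided

def toy_velocity(target):
    """Exact velocity of a straight path ending at `target` (stand-in for the DiT)."""
    return lambda x, t: [(xi - ti) / t for xi, ti in zip(x, target)]

cond = toy_velocity([1.0, -1.0])    # hypothetical prompt-conditioned branch
uncond = toy_velocity([0.0, 0.0])   # hypothetical unconditional branch

# Start from an arbitrary "noise" point and integrate with guidance 2.0.
sample = euler_sample(cfg_velocity(cond, uncond, guidance=2.0),
                      x=[0.3, 0.7], num_steps=8)
```

In this toy setting the guided path ends at `uncond_target + guidance * (cond_target - uncond_target)`, which shows how guidance above 1.0 pushes the sample past the conditional target.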