Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
@@ -0,0 +1,45 @@
# Model Architecture
```
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │
            ▼
┌──────────────────────────────────────────────────┐
│ Ideogram4Transformer                             │
│ • 34 × Ideogram4TransformerBlock                 │
│     - Ideogram4Attention (QK-RMSNorm, MRoPE)     │
│     - Ideogram4MLP (SwiGLU)                      │
│     - AdaLN scale/gate from t-embedding          │
│ • Ideogram4FinalLayer                            │
└──────────────────────────────────────────────────┘
            │ velocity prediction
            ▼
Euler flow-matching sampler with asymmetric CFG
            │ denoised image latents
            ▼
VAE decode
            │
            ▼
PIL.Image
```
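The sampling stage at the bottom of the diagram can be illustrated with a small, dependency-free sketch. It is not the repository's sampler. The function names, the uniform timestep grid, the time direction (noise at `t = 0`, data at `t = 1`), and the toy velocity field are all assumptions made for illustration. The guidance shown is standard classifier-free guidance, because the asymmetric variant and the logit-normal schedule are not specified in this document.

```python
def cfg_velocity(v_cond, v_uncond, guidance_scale):
    """Standard classifier-free guidance: v_u + s * (v_c - v_u).

    Assumption: the asymmetric CFG named in the diagram is not reproduced here.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x, velocity_fn, timesteps, guidance_scale):
    """Integrate dx/dt = v(x, t) with explicit Euler steps over `timesteps`."""
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v_cond = velocity_fn(x, t_cur, conditional=True)
        v_uncond = velocity_fn(x, t_cur, conditional=False)
        v = cfg_velocity(v_cond, v_uncond, guidance_scale)
        dt = t_next - t_cur
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical stand-in for the transformer: a straight-line flow toward a
# fixed target, so the exact velocity at time t is (target - x) / (1 - t).
TARGET = [1.0, -2.0]


def toy_velocity(x, t, conditional):
    # The toy field ignores the conditioning flag; a real model would not.
    return [(tg - xi) / (1.0 - t) for tg, xi in zip(TARGET, x)]


# Uniform grid for illustration only; the real schedule is logit-normal.
steps = [0.0, 0.25, 0.5, 0.75, 1.0]
sample = euler_sample([0.0, 0.0], toy_velocity, steps, guidance_scale=4.0)
```

Because the toy field is a straight line, Euler integration reaches `TARGET` regardless of the step count. A learned velocity field is curved, which is why step count and schedule matter in practice.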
The transformer is a single-stream DiT: text tokens (the Qwen3-VL hidden states
extracted from the selected layers) and image latent tokens are concatenated
into one sequence. Each block is modulated by AdaLN scale/gate parameters
computed from the flow-matching timestep embedding. Attention uses QK-RMSNorm
and 3D MRoPE so that text and image tokens share a unified positional space.
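A minimal sketch of the single-stream layout and per-block modulation described above, in plain Python. This is an illustration under stated assumptions, not the model's code: `rms_norm` and `adaln_modulate` are hypothetical names, the `(1 + scale)` convention and the gated residual follow common DiT implementations, and the use of RMS normalization before modulation is assumed rather than taken from this document.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide a vector by its root-mean-square (no learned weight here)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def adaln_modulate(x, scale, gate, sublayer):
    """Gated residual update: x + gate * sublayer(norm(x) * (1 + scale)).

    `scale` and `gate` stand in for per-channel vectors that a real model
    would project from the timestep embedding (assumed convention).
    """
    h = [n * (1.0 + s) for n, s in zip(rms_norm(x), scale)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


# Single-stream layout: text and image tokens live in one sequence.
text_tokens = [[1.0, 2.0, 3.0, 4.0]]
image_tokens = [[0.5, -0.5, 0.5, -0.5], [2.0, 0.0, -2.0, 0.0]]
sequence = text_tokens + image_tokens


def identity(h):
    # Placeholder for the attention or MLP sublayer.
    return h


# With a zero gate the block is an exact identity on every token.
zero_gated = [
    adaln_modulate(tok, [0.0] * 4, [0.0] * 4, identity) for tok in sequence
]
```

The zero-gate case shows why gating is convenient: a block whose gate is zero leaves the token stream unchanged, and the same modulation applies to text and image tokens alike because they share one sequence.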
Model spec:
| field | value |
|-------------------|---------------|
| `emb_dim` | 4608 |
| `num_layers` | 34 |
| `num_heads` | 18 |
| `intermediate` | 12288 |
| `adanln_dim` | 512 |
| `rope_theta` | 5_000_000 |
| `mrope_section` | (24, 20, 20) |
| latent channels | 32 × 2² = 128 |
| max text tokens | 2048 |
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
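The table can be captured as a small config object, which also makes a few derived quantities explicit. This is a sketch only: the class name `Ideogram4Spec` and the fields `vae_channels` and `patch_size` are hypothetical, and reading "32 × 2²" as 32 VAE channels folded over a 2 × 2 patch is an assumption about what that row means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ideogram4Spec:
    """Values copied from the model spec table; the class name is hypothetical."""

    emb_dim: int = 4608
    num_layers: int = 34
    num_heads: int = 18
    intermediate: int = 12288
    adanln_dim: int = 512
    rope_theta: int = 5_000_000
    mrope_section: tuple = (24, 20, 20)
    max_text_tokens: int = 2048
    vae_channels: int = 32  # assumed meaning of the "32" in "32 × 2²"
    patch_size: int = 2     # assumed meaning of the "2²" factor

    @property
    def head_dim(self) -> int:
        # Per-head width implied by emb_dim and num_heads.
        return self.emb_dim // self.num_heads

    @property
    def latent_channels(self) -> int:
        # Channels per latent token after folding a patch into the channel axis.
        return self.vae_channels * self.patch_size ** 2


spec = Ideogram4Spec()
print(spec.head_dim, spec.latent_channels, sum(spec.mrope_section))
```

Keeping the derived values as properties rather than stored fields avoids the table and the code drifting apart if one of the base values changes.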