dimon a5c319a1fc Initial commit: Ideogram 4 Prompt Builder
PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00

# Model Architecture
```
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
                          │
                          ▼
┌──────────────────────────────────────────────────┐
│ Ideogram4Transformer                             │
│   • 34 × Ideogram4TransformerBlock               │
│       Ideogram4Attention (QK-RMSNorm, MRoPE)     │
│       Ideogram4MLP (SwiGLU)                      │
│       AdaLN scale/gate from t-embedding          │
│   • Ideogram4FinalLayer                          │
└──────────────────────────────────────────────────┘
                          │ velocity prediction
                          ▼
   Euler flow-matching sampler with asymmetric CFG
                          │ denoised image latents
                          ▼
                     VAE decode
                          │
                          ▼
                      PIL.Image
```
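
The sampling stage of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's sampler: the `velocity` callback signature, the time direction (0 = noise, 1 = data) and the plain classifier-free guidance mix are all assumptions, and the asymmetric CFG variant named in the diagram is not reproduced here. Any timestep list can be passed in, so a logit-normal schedule would plug in unchanged.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical model interface: velocity(x, t, cond) -> predicted velocity.
VelocityFn = Callable[[Vector, float, Optional[str]], Vector]


def guided_velocity(velocity: VelocityFn, x: Vector, t: float,
                    cond: str, scale: float) -> Vector:
    """Standard CFG mix (assumed): v_uncond + scale * (v_cond - v_uncond)."""
    v_cond = velocity(x, t, cond)
    v_uncond = velocity(x, t, None)
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity: VelocityFn, x: Vector, cond: str,
                 timesteps: Sequence[float], scale: float = 1.0) -> Vector:
    """Integrate dx/dt = v(x, t) with explicit Euler steps along `timesteps`."""
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v = guided_velocity(velocity, x, t_cur, cond, scale)
        x = [xi + (t_next - t_cur) * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the transformer: a straight-line flow towards TARGET.
TARGET = [0.5, 0.25]


def toy_velocity(x: Vector, t: float, cond: Optional[str]) -> Vector:
    # Points from the current sample to TARGET; ignores the prompt.
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


steps = [i / 4 for i in range(5)]          # 0.0, 0.25, 0.5, 0.75, 1.0
sample = euler_sample(toy_velocity, [1.0, -2.0], "a prompt", steps, scale=3.0)
```

With the toy velocity the last Euler step lands on `TARGET`, which makes the loop easy to sanity-check before a real model is swapped in.
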
The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states taken
from the tapped layers and concatenated) and image latent tokens are joined into
one sequence, and each block is modulated by AdaLN scale and gate values computed
from the flow-matching timestep embedding. Attention applies QK-RMSNorm and 3D
MRoPE, so text and image tokens share a unified positional space.
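
The single-stream layout and the scale/gate modulation can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the function names, the use of an RMS norm as the block pre-norm, and the `x + gate * f(norm(x) * (1 + scale))` form, which follows the common DiT convention rather than anything confirmed by this repository. The same `rms_norm` helper is also what a QK-RMSNorm would apply to each head's query and key vectors before the dot product.

```python
import math
from typing import Callable, List, Sequence

Token = List[float]


def rms_norm(x: Sequence[float], eps: float = 1e-6) -> Token:
    """Scale x so its root-mean-square is ~1 (learned weight omitted)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms for v in x]


def modulated_sublayer(x: Token, scale: Token, gate: Token,
                       sublayer: Callable[[Token], Token]) -> Token:
    """Assumed form: x + gate * sublayer(norm(x) * (1 + scale)), per token."""
    h = [n * (1.0 + s) for n, s in zip(rms_norm(x), scale)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


def identity(h: Token) -> Token:
    """Placeholder for attention or the SwiGLU MLP."""
    return h


# Single stream: text and image tokens sit in one list and share every block.
text_tokens = [[0.5, -1.0, 2.0, 0.0]]
image_tokens = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
sequence = text_tokens + image_tokens

# Stand-ins for projections of the timestep embedding (made-up values).
scale = [0.1, 0.1, 0.1, 0.1]
gate = [0.0, 0.0, 0.0, 0.0]   # a zero gate reduces the block to the identity

blocks_out = [modulated_sublayer(tok, scale, gate, identity) for tok in sequence]
```

Because the gate multiplies the whole sublayer output, a zero gate leaves every token unchanged, which is why gated designs of this kind are often initialised that way.
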

Model spec:

| field | value |
|-------------------|---------------|
| `emb_dim` | 4608 |
| `num_layers` | 34 |
| `num_heads` | 18 |
| `intermediate` | 12288 |
| `adanln_dim` | 512 |
| `rope_theta` | 5_000_000 |
| `mrope_section` | (24, 20, 20) |
| latent channels | 32 × 2² = 128 |
| max text tokens | 2048 |
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
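
The table can be mirrored in a small config object, which also makes a couple of derived quantities explicit. This is a sketch only: the class name `Ideogram4Spec` and the split of "32 × 2²" into `vae_channels` and `patch_size` are assumptions (read here as 32 VAE channels folded by a 2×2 patch), and `head_dim` assumes the embedding is divided evenly across heads.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ideogram4Spec:
    """Hypothetical container for the values in the spec table."""
    emb_dim: int = 4608
    num_layers: int = 34
    num_heads: int = 18
    intermediate: int = 12288
    adanln_dim: int = 512
    rope_theta: int = 5_000_000
    mrope_section: Tuple[int, int, int] = (24, 20, 20)
    vae_channels: int = 32        # assumed reading of "32 × 2²"
    patch_size: int = 2           # assumed reading of "32 × 2²"
    max_text_tokens: int = 2048

    @property
    def head_dim(self) -> int:
        """Per-head width, assuming an even split of emb_dim across heads."""
        if self.emb_dim % self.num_heads:
            raise ValueError("emb_dim must be divisible by num_heads")
        return self.emb_dim // self.num_heads

    @property
    def latent_channels(self) -> int:
        """Channels per image token after folding a patch into the channel axis."""
        return self.vae_channels * self.patch_size ** 2


spec = Ideogram4Spec()
print(spec.head_dim, spec.latent_channels)   # prints: 256 128
```
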