ideogram4-prompt-builder/docs/model_architecture.md
dimon a5c319a1fc Initial commit: Ideogram 4 Prompt Builder
PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00

Model Architecture

prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │   
            ▼
    ┌──────────────────────────────────────────────────┐
    │  Ideogram4Transformer                            │
    │  • 34 × Ideogram4TransformerBlock                │
    │       Ideogram4Attention (QK-RMSNorm, MRoPE)     │
    │       Ideogram4MLP (SwiGLU)                      │
    │       AdaLN scale/gate from t-embedding          │
    │  • Ideogram4FinalLayer                           │
    └──────────────────────────────────────────────────┘
            │  velocity prediction
            ▼
    Euler flow-matching sampler with asymmetric CFG
            │  denoised image latents
            ▼
    VAE decode
            │
            ▼
            PIL.Image
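
The sampling stage at the bottom of the diagram can be sketched as a plain Euler integration of the predicted velocity. This is a minimal sketch, not the project's sampler. It assumes the common rectified-flow convention (t = 1 is pure noise, t = 0 is data), uses a uniform time grid in place of the logit-normal schedule, and leaves out CFG, since neither the schedule's exact form nor the asymmetric CFG rule is spelled out in this document. `euler_flow_sample`, `velocity_fn`, and `make_toy_velocity` are hypothetical names, and the toy velocity stands in for the transformer.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def euler_flow_sample(
    velocity_fn: Callable[[Vector, float], Vector],
    noise: Sequence[float],
    num_steps: int = 8,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=1 (noise) down to t=0 with Euler steps."""
    x = list(noise)
    # Assumption: uniform grid 1.0 -> 0.0 (the real schedule is not reproduced here).
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t_cur, t_next in zip(times[:-1], times[1:]):
        v = velocity_fn(x, t_cur)      # stand-in for the transformer's prediction
        dt = t_next - t_cur            # negative: we move from noise toward data
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Sequence[float]) -> Callable[[Vector, float], Vector]:
    """Exact velocity for the linear path x_t = (1 - t) * target + t * noise."""
    def velocity(x: Vector, t: float) -> Vector:
        # On that path, noise - target == (x_t - target) / t for t > 0.
        return [(xi - ti) / t for xi, ti in zip(x, target)]
    return velocity


sample = euler_flow_sample(make_toy_velocity([1.0, -2.0]), noise=[0.3, 0.7], num_steps=4)
```

Because the toy velocity is exact for a straight path, `sample` lands on the target up to floating-point error. A learned velocity field would only approximate this, which is why the step count matters in practice.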

The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states concatenated across the extracted layers) and image latent tokens are joined into one sequence, and every block is modulated by AdaLN scale/gate values computed from the flow-matching timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE, so text and image tokens share a unified positional space.
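
The single-stream layout and the scale/gate modulation can be illustrated in a few lines of plain Python. This is a rough sketch under assumptions: the pre-normalisation (RMS norm here), the `1 + scale` form, and the names `rms_norm` and `adaln_block` are illustrative and not taken from the project's code. In the real model `scale` and `gate` would come from a projection of the timestep embedding, and `mixer` would be the attention or MLP sub-layer.

```python
import math
from typing import Callable, List

Token = List[float]


def rms_norm(x: Token, eps: float = 1e-6) -> Token:
    """Scale a token so its root-mean-square is roughly 1."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def adaln_block(
    tokens: List[Token],
    scale: Token,
    gate: Token,
    mixer: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Residual block: x + gate * mixer(norm(x) * (1 + scale)), per token."""
    modulated = [
        [n * (1.0 + s) for n, s in zip(rms_norm(tok), scale)] for tok in tokens
    ]
    mixed = mixer(modulated)
    return [
        [x + g * m for x, g, m in zip(tok, gate, mix)]
        for tok, mix in zip(tokens, mixed)
    ]


# Single stream: text and image tokens live in one sequence.
text_tokens = [[1.0, 2.0], [0.5, -0.5]]
image_tokens = [[0.0, 3.0]]
sequence = text_tokens + image_tokens

# With a zero gate the block is an identity, whatever the mixer does.
out = adaln_block(sequence, scale=[0.0, 0.0], gate=[0.0, 0.0], mixer=lambda xs: xs)
```

The same `scale` and `gate` apply to every token in the sequence, text and image alike, which is the point of conditioning a single stream on the timestep.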

Model spec:

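| field           | value                                                      |
|-----------------|------------------------------------------------------------|
| emb_dim         | 4608                                                       |
| num_layers      | 34                                                         |
| num_heads       | 18                                                         |
| intermediate    | 12288                                                      |
| adanln_dim      | 512                                                        |
| rope_theta      | 5_000_000                                                  |
| mrope_section   | (24, 20, 20)                                               |
| latent channels | 32 × 2² = 128                                              |
| max text tokens | 2048                                                       |
| sampler         | Euler flow-matching, logit-normal schedule, asymmetric CFG |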
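
For quick sanity checks, the numeric fields of the spec can be gathered in a small dataclass. This is a hypothetical container, not the project's config class: the name `Ideogram4Spec` is invented, and `vae_channels` and `patch_size` are one reading of "32 × 2²" (32 channels packed over a 2 × 2 patch). The derived `head_dim` simply divides `emb_dim` by `num_heads`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Ideogram4Spec:
    """Hypothetical holder for the values in the spec table."""
    emb_dim: int = 4608
    num_layers: int = 34
    num_heads: int = 18
    intermediate: int = 12288
    adanln_dim: int = 512
    rope_theta: int = 5_000_000
    mrope_section: Tuple[int, int, int] = (24, 20, 20)
    vae_channels: int = 32      # assumption: the "32" in "32 × 2²"
    patch_size: int = 2         # assumption: the "2" in "32 × 2²"
    max_text_tokens: int = 2048

    @property
    def head_dim(self) -> int:
        # Per-head width implied by emb_dim and num_heads.
        return self.emb_dim // self.num_heads

    @property
    def latent_channels(self) -> int:
        # Channels per latent token after patch packing.
        return self.vae_channels * self.patch_size ** 2


spec = Ideogram4Spec()
assert spec.emb_dim % spec.num_heads == 0
print(spec.head_dim, spec.latent_channels)
```

With the table's values this gives a head width of 256 and 128 latent channels per token, matching the "32 × 2² = 128" row.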