Files

T

dimon a5c319a1fc Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-13 16:36:27 +08:00

1.9 KiB

Raw Blame History

Model Architecture

prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │   
            ▼
    ┌──────────────────────────────────────────────────┐
    │    Ideogram4Transformer                         │  
    │  • 34 × Ideogram4TransformerBlock               │
    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)   │
    │      – Ideogram4MLP (SwiGLU)                    │
    │      – adaln scale/gate from t-embedding         │
    │  • Ideogram4FinalLayer                          │
    └──────────────────────────────────────────────────┘
            │  velocity prediction
            ▼
    Euler flow-matching sampler with asymmetric CFG
            │  denoised image latents
            ▼
    VAE decode
            │
            ▼
            PIL.Image

The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from the activation layers) and image latent tokens are concatenated into one sequence, modulated per-block by an AdaLN computed from the flow-matching timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and image tokens share a unified positional space.

Model spec:

field	value
`emb_dim`	4608
`num_layers`	34
`num_heads`	18
`intermediate`	12288
`adanln_dim`	512
`rope_theta`	5_000_000
`mrope_section`	(24, 20, 20)
latent channels	32 × 2² = 128
max text tokens	2048
sampler	Euler flow-matching, logit-normal schedule, asymmetric CFG

1.9 KiB Raw Blame History Unescape Escape

Model Architecture

1.9 KiB

Raw Blame History