docs/model_architecture.md

# Model Architecture

```
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │   
            ▼
    ┌──────────────────────────────────────────────────┐
    │    Ideogram4Transformer                          │
    │  • 34 × Ideogram4TransformerBlock                │
    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)    │
    │      – Ideogram4MLP (SwiGLU)                     │
    │      – AdaLN scale/gate from t-embedding         │
    │  • Ideogram4FinalLayer                           │
    └──────────────────────────────────────────────────┘
            │  velocity prediction
            ▼
    Euler flow-matching sampler with asymmetric CFG
            │  denoised image latents
            ▼
    VAE decode
            │
            ▼
    PIL.Image
```
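
The sampling stage of the diagram can be illustrated with a minimal, hedged sketch. It assumes time runs from `t = 1` (pure noise) to `t = 0` (clean latents), uses a uniform step grid rather than the logit-normal schedule, and applies standard classifier-free guidance as a stand-in, because the asymmetric variant is not specified here. The names `cfg_velocity`, `euler_sample`, and `toy_model` are hypothetical and not part of the codebase.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature: (latents, t, conditional) -> predicted velocity.
VelocityFn = Callable[[Sequence[float], float, bool], Vector]


def cfg_velocity(model: VelocityFn, x: Sequence[float], t: float,
                 guidance_scale: float) -> Vector:
    """Standard CFG: extrapolate from the unconditional to the conditional velocity."""
    v_cond = model(x, t, True)
    v_uncond = model(x, t, False)
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(model: VelocityFn, noise: Sequence[float],
                 num_steps: int = 20, guidance_scale: float = 1.0) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 1 down to t = 0 with explicit Euler steps."""
    x = list(noise)
    # Assumption: uniform grid; the real schedule may space the steps differently.
    ts = [1.0 - i / num_steps for i in range(num_steps + 1)]
    for t, t_next in zip(ts[:-1], ts[1:]):
        v = cfg_velocity(model, x, t, guidance_scale)
        x = [xi + (t_next - t) * vi for xi, vi in zip(x, v)]
    return x


def toy_model(target: Sequence[float]) -> VelocityFn:
    """Toy velocity field for the linear path x_t = (1 - t) * x0 + t * noise.

    The conditional branch points at `target`; the unconditional branch at zero.
    """
    def model(x: Sequence[float], t: float, conditional: bool) -> Vector:
        goal = target if conditional else [0.0] * len(target)
        return [(xi - gi) / t for xi, gi in zip(x, goal)]
    return model


if __name__ == "__main__":
    print(euler_sample(toy_model([1.0, -2.0]), noise=[0.3, 0.7], num_steps=10))
```

With `guidance_scale=1.0` the toy sampler ends at the conditional target (up to floating-point error). Larger scales push the result further along the direction from the unconditional target to the conditional one.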

The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states taken
from the selected layers and concatenated) and image latent tokens are joined
into one sequence. Each block modulates that sequence with AdaLN scale and gate
values computed from the flow-matching timestep embedding. Attention uses
QK-RMSNorm and 3D MRoPE, so text and image tokens share a unified positional
space.
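
As a rough illustration of the single-stream layout and the scale/gate modulation, here is a minimal pure-Python sketch. It assumes a weight-free RMSNorm before modulation and a residual update of the form `x + gate * sublayer(norm(x) * (1 + scale))`. The exact normalization and modulation formula used by the blocks are not stated here, and the names `single_stream`, `rms_norm`, and `adaln_sublayer` are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Token = List[float]


def single_stream(text_tokens: Sequence[Token], image_tokens: Sequence[Token]) -> List[Token]:
    """Join text and image tokens into one sequence (text first, by assumption)."""
    return list(text_tokens) + list(image_tokens)


def rms_norm(x: Sequence[float], eps: float = 1e-6) -> Token:
    """RMSNorm without a learned weight (an assumption for this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def adaln_sublayer(tokens: Sequence[Token], scale: Sequence[float], gate: Sequence[float],
                   sublayer: Callable[[List[Token]], List[Token]]) -> List[Token]:
    """Apply one AdaLN-modulated residual sublayer (attention or MLP).

    `scale` and `gate` are per-channel vectors that would be derived from the
    timestep embedding; here they are passed in directly.
    """
    modulated = [[v * (1.0 + s) for v, s in zip(rms_norm(tok), scale)] for tok in tokens]
    update = sublayer(modulated)
    return [[x + g * u for x, g, u in zip(tok, gate, upd)]
            for tok, upd in zip(tokens, update)]


if __name__ == "__main__":
    seq = single_stream([[1.0, 2.0]], [[3.0, 4.0], [0.5, 0.5]])
    # Identity stands in for attention/MLP; a zero gate leaves the sequence unchanged.
    print(adaln_sublayer(seq, scale=[0.0, 0.0], gate=[0.0, 0.0], sublayer=lambda t: t))
```

A zero gate turns the sublayer into a no-op, which is one reason gated residuals are popular in DiT-style blocks: the contribution of each sublayer can be scaled smoothly as a function of the timestep.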

Model spec:

| field             | value         |
|-------------------|---------------|
| `emb_dim`         | 4608          |
| `num_layers`      | 34            |
| `num_heads`       | 18            |
| `intermediate`    | 12288         |
| `adaln_dim`       | 512           |
| `rope_theta`      | 5_000_000     |
| `mrope_section`   | (24, 20, 20)  |
| latent channels   | 32 × 2² = 128 |
| max text tokens   | 2048          |
| sampler           | Euler flow-matching, logit-normal schedule, asymmetric CFG |
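
The spec can also be captured as a small config object, which makes derived quantities easy to check. The sketch below is illustrative only: the class name `ModelSpec` and the fields `vae_channels`, `patch_size`, and `max_text_tokens` are hypothetical, and it assumes "32 × 2²" means 32 VAE channels packed over a 2 × 2 patch.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelSpec:
    """Values copied from the spec table; names not in the table are hypothetical."""
    emb_dim: int = 4608
    num_layers: int = 34
    num_heads: int = 18
    intermediate: int = 12288
    rope_theta: float = 5_000_000.0
    mrope_section: Tuple[int, int, int] = (24, 20, 20)
    vae_channels: int = 32       # assumption: the "32" in "32 × 2²"
    patch_size: int = 2          # assumption: the "2" in "32 × 2²"
    max_text_tokens: int = 2048

    @property
    def head_dim(self) -> int:
        """Per-head width; emb_dim must divide evenly across the heads."""
        if self.emb_dim % self.num_heads != 0:
            raise ValueError("emb_dim must be divisible by num_heads")
        return self.emb_dim // self.num_heads

    @property
    def latent_channels(self) -> int:
        """Channels per image token after packing a patch_size x patch_size patch."""
        return self.vae_channels * self.patch_size ** 2


if __name__ == "__main__":
    spec = ModelSpec()
    print(spec.head_dim, spec.latent_channels)
```

With the table values, `head_dim` is 4608 / 18 = 256 and `latent_channels` is 32 × 4 = 128, matching the "latent channels" row.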