Initial commit: Ideogram 4 Prompt Builder
PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,45 @@
|
||||
# Model Architecture
|
||||
|
||||
```
|
||||
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
|
||||
│
|
||||
▼
|
||||
┌──────────────────────────────────────────────────┐
|
||||
│ Ideogram4Transformer │
|
||||
│ • 34 × Ideogram4TransformerBlock │
|
||||
│ – Ideogram4Attention (QK-RMSNorm, MRoPE) │
|
||||
│ – Ideogram4MLP (SwiGLU) │
|
||||
│ – adaln scale/gate from t-embedding │
|
||||
│ • Ideogram4FinalLayer │
|
||||
└──────────────────────────────────────────────────┘
|
||||
│ velocity prediction
|
||||
▼
|
||||
Euler flow-matching sampler with asymmetric CFG
|
||||
│ denoised image latents
|
||||
▼
|
||||
VAE decode
|
||||
│
|
||||
▼
|
||||
PIL.Image
|
||||
```
|
||||
|
||||
The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from
|
||||
the activation layers) and image latent tokens are concatenated into one
|
||||
sequence, modulated per-block by an AdaLN computed from the flow-matching
|
||||
timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and
|
||||
image tokens share a unified positional space.
|
||||
|
||||
Model spec:
|
||||
|
||||
| field | value |
|
||||
|-------------------|---------------|
|
||||
| `emb_dim` | 4608 |
|
||||
| `num_layers` | 34 |
|
||||
| `num_heads` | 18 |
|
||||
| `intermediate` | 12288 |
|
||||
| `adanln_dim` | 512 |
|
||||
| `rope_theta` | 5_000_000 |
|
||||
| `mrope_section` | (24, 20, 20) |
|
||||
| latent channels | 32 × 2² = 128 |
|
||||
| max text tokens | 2048 |
|
||||
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
|
||||
Reference in New Issue
Block a user