Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
@@ -0,0 +1,336 @@
+<p align="center"><a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><picture>
+  <source media="(prefers-color-scheme: dark)" srcset="assets/ideogram_logo_darkmode.svg">
+  <source media="(prefers-color-scheme: light)" srcset="assets/ideogram_logo.svg">
+  <img src="assets/ideogram_logo.svg" alt="Ideogram" width="500">
+</picture></a></p>
+
+<p align="center"><em>Ideogram 4: Open image model at the forefront of design</em></p>
+
+<p align="center">
+  <a href="https://ideogram.ai/blog/ideogram-4.0/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Blog-Post-orange" alt="Blog Post"></a>
+  <a href="https://github.com/ideogram-oss/ideogram4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github" alt="Code"></a>
+  <a href="https://huggingface.co/collections/ideogram-ai/ideogram-4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Model-HuggingFace-blue?logo=huggingface" alt="Model"></a>
+  <a href="https://developer.ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/API-developer.ideogram.ai-purple" alt="API"></a>
+  <a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Official%20Site-ideogram.ai-ff69b4" alt="Official Site"></a>
+</p>
+
+<p align="center">
+  <img src="assets/samples/collage_landscape.jpg" alt="A collage of Ideogram 4 samples spanning photorealism, illustration, typography, and poster design">
+</p>
+
+
+Ideogram 4 is **[Ideogram](https://ideogram.ai)'s first open-weight text-to-image model**. It is a **state-of-the-art foundation model trained from scratch** — not a fine-tune of any existing model. It introduces a new structured JSON prompting interface, with best-in-class multilingual text rendering, deep language understanding, explicit bounding-box layout and color-palette controls, and native 2k resolution images. The easiest way to try the model is online at **[ideogram.ai](https://ideogram.ai/)**.
+
+We believe openness drives innovation, and we invite the research community to innovate with us on the forefront of visual intelligence.
+
+## Table of Contents
+
+1. [News](#news)
+2. [Model Zoo](#model-zoo)
+3. [Performance](#performance)
+4. [Quick Start](#quick-start)
+5. [Model Summary](#model-summary)
+6. [Prompting Guide](#prompting-guide)
+7. [Documentation](#documentation)
+8. [Citation](#citation)
+
+## News
+
+* **[2026-06-03]** **Ideogram 4 released!** Inference code and weights
+  are now public, and our [technical blog post](https://ideogram.ai/blog/ideogram-4.0/) is live. See the
+  [Quick Start](#quick-start) section to generate your first image, or try the
+  model online at [ideogram.ai](https://ideogram.ai/).
+
+## Model Zoo
+
+| Model | Params | Weight Quantization | Supported Hardware | Diffusers Support | License |
+| :---  | :---:  | :---:        | :---:   | :---:   | :---:   |
+| **[Ideogram 4 (nf4)](https://huggingface.co/ideogram-ai/ideogram-4-nf4)** | 9.3B | nf4 | CUDA | Yes | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
+| **[Ideogram 4 (fp8)](https://huggingface.co/ideogram-ai/ideogram-4-fp8)** | 9.3B | fp8 | All | No | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
+
+We plan to support more quantizations in the future.
+
+
+## Performance
+
+We evaluate Ideogram 4 across third-party arenas and benchmarks, standard
+open-source benchmarks, and our own internal human-preference benchmark. Across
+all of them, **Ideogram 4 is the best open-weight image model by far, and sits
+at the frontier of design.**
+
+### Design Arena
+
+[Design Arena](https://www.designarena.ai/) is a third-party image Elo
+leaderboard focused specifically on design-oriented generation. On the overall
+board, Ideogram 4 is the top-ranked open-weight model, trailing only proprietary
+GPT and Gemini models:
+
+<p align="center">
+  <img src="assets/benchmarks/design_arena.png" alt="Design Arena overall image Elo leaderboard with Ideogram 4.0 as the top open-weight model">
+</p>
+
+Filtered to open-weight models only, Ideogram 4 leads by a commanding margin,
+well ahead of the next-best open model:
+
+<p align="center">
+  <img src="assets/benchmarks/design_arena2.png" alt="Design Arena open-weight image Elo leaderboard, with Ideogram 4.0 well ahead of all other open models">
+</p>
+
+### ContraLabs
+
+[ContraLabs](https://contralabs.com/research) ran a blind typography evaluation judged by
+ten professional designers from Contra's top-earning talent. Ideogram 4 leads on
+first-place win rate, picked as the best of four models 47.9% of the time
+overall — well ahead of Gemini 3.1 Flash Image Preview (Nano Banana 2) at 30.0%,
+FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%):
+
+<p align="center">
+  <img src="assets/benchmarks/contralabs_typography.png" alt="ContraLabs typography first-place win rate, with Ideogram v4 leading">
+</p>
+
+It also wins on practical usability: asked "Would you use this in real client
+work?", the same designers rated Ideogram 4 highest at 3.55 / 5 — significantly
+above Nano Banana 2 (2.84), Grok Imagine 1.0 (2.61), and FLUX.2 [max] (2.49):
+
+<p align="center">
+  <img src="assets/benchmarks/contralabs_typography2.png" alt="ContraLabs 'would you use this in real client work?' rating, with Ideogram v4 leading">
+</p>
+
+### LMArena
+
+On [LMArena](https://lmarena.ai/), a third-party text-to-image leaderboard that
+measures general-purpose text-to-image use cases, Ideogram is the top-ranked
+open-weight lab and a top-5 image generation lab overall — beaten only by giant
+companies with vastly larger budgets and resources:
+
+<p align="center">
+  <img src="assets/benchmarks/lmarena_benchmark.png" alt="LMArena text-to-image lab leaderboard with Ideogram">
+</p>
+
+### Ideogram internal eval
+
+For our internal human-preference benchmark, focused on graphic design and
+photography, we had graphic designers deeply familiar with professional design
+work do the rating blind. Bradley-Terry scores rank Ideogram 4 #2 overall —
+behind only GPT Image 2 medium — and the top open-weight model:
+
+<p align="center">
+  <img src="assets/benchmarks/ideogram_benchmark.png" alt="Ideogram internal design leaderboard with Ideogram 4.0">
+</p>
+
+### Open-source benchmarks
+
+On standard open-source benchmarks measuring core capabilities — layout control
+(7Bench), spatial reasoning and object fidelity (SpatialGenEval), text rendering
+(X-Omni OCR), and prompt alignment (Prism) — Ideogram 4 closes the gap to the
+leading closed-source models across every axis. On layout control (7Bench), it
+is significantly better than all closed-source models:
+
+<p align="center">
+  <img src="assets/benchmarks/opensource.png" alt="Five-axis capability radar comparing Ideogram 4.0 to leading closed-source models on layout control, spatial reasoning, object fidelity, prompt alignment, and text rendering">
+</p>
+
+At 9.3B parameters, Ideogram 4 delivers the best text rendering of any open-weight
+release we benchmarked — ahead of much larger models like Qwen-Image (20B),
+FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE):
+
+<p align="center">
+  <img src="assets/benchmarks/opensource2.png" alt="Parameter-efficiency scatter plot showing Ideogram 4.0 at 9.3B parameters leading all other open-weight models on text rendering">
+</p>
+
+
+## Quick Start
+
+### Install
+
+```bash
+pip install .
+```
+
+If you plan to modify the code, install in editable mode instead so changes
+under `src/ideogram4/` take effect without reinstalling:
+
+```bash
+pip install -e .
+```
+
+### Model access
+
+The model weights are **gated** on Hugging Face, so you must accept the gate and
+authenticate before the code can download them. Otherwise the download fails
+with a `401`/`403` error (`GatedRepoError`).
+
+1. Open the model page — [ideogram-ai/ideogram-4-nf4](https://huggingface.co/ideogram-ai/ideogram-4-nf4)
+   (or [ideogram-ai/ideogram-4-fp8](https://huggingface.co/ideogram-ai/ideogram-4-fp8)) — and click
+   **Agree and access repository** to accept the license gate.
+2. Create a Hugging Face access token at
+   [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and log in so the
+   download is authenticated:
+
+   ```bash
+   hf auth login
+   ```
+
+   Alternatively, export the token directly: `export HF_TOKEN="hf_..."`.
+
+### CLI
+
+The plain `--prompt` is rewritten into the structured JSON caption the model
+expects by a "magic prompt" LLM. By default this uses Ideogram's hosted
+magic-prompt API, which is **free** and does the expansion server-side (no local
+model or system prompt needed). It reads `IDEOGRAM_API_KEY` — get a key at
+https://developer.ideogram.ai/:
+
+```bash
+python run_inference.py \
+  --prompt "a ginger cat wearing a tiny wizard hat reading a spellbook" \
+  --output out.png \
+  --quantization "nf4" \
+  --magic-prompt-key "$IDEOGRAM_API_KEY"
+```
+
+You can also run the expansion through your own LLM provider, since one of our
+magic-prompt system prompts is **open source**. See the
+[Prompting Guide](docs/prompting.md#magic-prompt) for details.
+
+For the highest-quality images, set `--height 2048 --width 2048` and
+`--sampler-preset V4_QUALITY_48`.
+
+#### Safety screening with Hive
+
+Prompt and output safety screening is performed via [Hive](https://thehive.ai/).
+Sign up and create a Text Moderation key and a Visual Content Moderation key,
+then export them as `HIVE_TEXT_MODERATION_KEY` and `HIVE_VISUAL_MODERATION_KEY`
+(or pass them via `--hive-text-key` / `--hive-visual-key`).
+
+```bash
+python run_inference.py \
+  --prompt "an isometric illustration of a tiny city floating in the clouds" \
+  --output out.png \
+  --quantization "nf4" \
+  --magic-prompt-key "$IDEOGRAM_API_KEY" \
+  --hive-text-key "$HIVE_TEXT_MODERATION_KEY" \
+  --hive-visual-key "$HIVE_VISUAL_MODERATION_KEY"
+```
+
+For sampler presets, parameter reference, and optimization tips, see
+[docs/inference.md](docs/inference.md).
+
+## Model Summary
+
+Ideogram 4 is a **foundation model trained entirely from scratch**, not a
+fine-tune or distillation of any existing checkpoint. It is a flow-matching
+text-to-image model built on a **fully single-stream** Diffusion Transformer
+(DiT) architecture.
+
+**Architecture:**
+- **Fully single-stream DiT.** Text and image tokens are concatenated into one
+  unified sequence and processed through the same 34-layer transformer, with no
+  separate text or image branches. This enables deep cross-modal interaction at
+  every layer.
+- **Vision-language model as text encoder.** Instead of a text-only encoder
+  like CLIP or T5, Ideogram 4 uses
+  [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct),
+  a full vision-language model that provides far richer understanding of visual
+  concepts. Hidden states are extracted from **13 intermediate layers** and
+  concatenated, giving the model multi-scale semantic features ranging from
+  surface-level token information to deep compositional understanding.
+- **Dual-branch classifier-free guidance.** The conditional (positive) and
+  unconditional (negative) branches can be independently refined, enabling
+  separate control over prompt adherence and image quality.
+- **Flexible resolution.** Native support for any resolution from 256 to 2048
+  (multiples of 16), with aspect ratios up to 6:1. A single model handles
+  everything from square thumbnails to ultrawide banners, with the noise
+  schedule auto-adjusting per resolution.
+
+**Key Capabilities:**
+- **Extreme controllability.** Ideogram 4 is trained on structured JSON
+  captions, giving users unprecedented control over composition, style,
+  lighting, color palette, typography, and spatial layout, all from a single
+  prompt.
+- **State-of-the-art text rendering.** Ideogram 4 delivers best-in-class
+  in-image text generation (signage, logos, captions, watermarks, multi-line
+  text) with high fidelity directly from the prompt.
+- **Spatial layout control.** Bounding-box coordinates in the prompt allow
+  explicit placement of subjects, text elements, and background regions.
+- **Color palette conditioning.** Specify hex colors in the prompt to steer the
+  image's dominant color scheme.
+
+For full architecture details, see
+[docs/model_architecture.md](docs/model_architecture.md). For a walkthrough of
+how the pipeline components fit together, see
+[docs/pipeline.md](docs/pipeline.md).
+
+## Prompting Guide
+
+Ideogram 4 is trained exclusively on **structured JSON captions**. While
+plain-text prompts work, you will get the best results by providing a JSON
+object that follows our caption schema.
+
+
+Key points:
+
+- **Use JSON prompts** for maximum controllability: the model was trained on
+  them and understands the structure natively.
+- **Color palette conditioning:** specify a `color_palette` array of hex
+  colors in the style description to steer the image's color scheme.
+- **Aspect ratio flexibility:** Ideogram 4 supports a wide range of aspect
+  ratios (any multiple-of-16 resolution from 256 to 2048 on each side). This
+  is a key advantage for practical use: portraits, landscapes, banners,
+  phone wallpapers, social media formats, etc.
+- **Bounding-box layout:** specify `bbox` coordinates in the prompt to
+  explicitly place subjects, text elements, and background regions.
+- **Compositional control:** use `compositional_deconstruction` with bounding
+  boxes and per-element descriptions for precise spatial layout.
+
+
+**Why JSON-only training?** We train exclusively on JSON so that training
+and inference share a single, common prompt format. The training captions themselves are deliberately
+**extremely descriptive**: each JSON exhaustively describes everything in
+the image to maximize training efficiency. The more
+text-to-image relationships each caption pins down, the more grounded
+supervision the model extracts from a single training pair, rather than
+having to infer those relationships across many sparsely-captioned samples.
+
+**Why JSON at inference time?** Because the model was trained on captions
+that name every object explicitly, the most reliable way to get every
+requested object rendered is to mirror that pattern. Plain-text prompts still work, but
+won't perform as well since the model was only trained on structured JSON captions.
+
+**Don't want to write JSON by hand?** That's what *magic prompt* is for: it uses
+an LLM to expand a plain-text prompt into a full structured caption before
+generation, so you get JSON-quality results from a casual prompt. It runs by
+default in `run_inference.py` (see the [CLI](#cli) section).
+
+See [docs/prompting.md](docs/prompting.md) for a full guide.
+
+## Documentation
+
+| Document | Description |
+| :------- | :---------- |
+| [docs/prompting.md](docs/prompting.md) | How to write JSON prompts, color palette conditioning, aspect ratios |
+| [docs/inference.md](docs/inference.md) | Sampler presets, parameter reference, resolutions, optimization tips |
+| [docs/model_architecture.md](docs/model_architecture.md) | Architecture diagram, DiT spec, component details |
+| [docs/pipeline.md](docs/pipeline.md) | Conceptual pipeline walkthrough — how all components fit together |
+| [docs/development.md](docs/development.md) | Dev setup, pre-commit hooks, contributing |
+| [docs/safety.md](docs/safety.md) | Pre-training, post-training, and inference-time safety mitigations; how to report violations |
+
+## Citation
+
+If you find the provided code or models useful for your research, consider citing them as:
+
+
+```bibtex
+@misc{ideogram-4-2026,
+    author={Ideogram AI},
+    title={{Ideogram 4}},
+    year={2026},
+    howpublished={\url{https://ideogram.ai/blog/ideogram-4.0/}},
+}
+```
+
+## We're Hiring!
+
+We're looking for **Research Scientists** and **Research Engineers** to
+work on next-generation generative models and the products built on top of
+them. Interested candidates can apply at https://jobs.ashbyhq.com/ideogram
@@ -0,0 +1,58 @@
+# Development
+
+## Editable install
+
+We recommend installing into an isolated environment — the dependencies include several GB of CUDA-built wheels.
+
+```bash
+python -m venv .venv && source .venv/bin/activate
+```
+
+For development, install the package in editable mode so changes to the source
+tree are picked up without reinstalling:
+
+```bash
+pip install -e .
+```
+
+or with [`uv`](https://docs.astral.sh/uv/):
+
+```bash
+uv venv && source .venv/bin/activate
+```
+
+```bash
+uv pip install -e .
+```
+
+## Pre-commit hooks
+
+This repo uses [pre-commit](https://pre-commit.com/) to run lint, format, and
+type checks (`ruff`, `mypy`, etc.) before each commit.
+
+Install once per clone:
+
+```bash
+pip install pre-commit
+pre-commit install
+```
+
+`pre-commit install` registers a git hook in `.git/hooks/pre-commit`, so it
+requires the directory to be a git repo. The hooks now run automatically on
+`git commit` against staged files.
+
+To run the hooks manually against every file in the repo (useful right after
+the first install, or in CI):
+
+```bash
+pre-commit run --all-files
+```
+
+The first run downloads each hook's environment (ruff, mypy, etc.) into
+`~/.cache/pre-commit/` and may take a minute. Subsequent runs are fast.
+
+To bump pinned hook versions in `.pre-commit-config.yaml`:
+
+```bash
+pre-commit autoupdate
+```
@@ -0,0 +1,63 @@
+# Inference Reference
+
+Detailed parameters, sampler presets, supported resolutions, and optimization
+tips for Ideogram 4 inference.
+
+## Sampler Presets
+
+Named presets bundle a step count, per-step CFG schedule, schedule mean (`mu`),
+and schedule standard deviation (`std`) into a single flag:
+
+```bash
+python run_inference.py \
+  --prompt "a cat wearing a tiny top hat" \
+  --sampler-preset V4_QUALITY_48 \
+  --output out.png
+```
+
+| Preset | Steps | CFG schedule | `mu` | `std` |
+| :----- | :---: | :----------- | :--: | :---: |
+| `V4_QUALITY_48` | 48 | 45 steps @ gw=7, then 3 polish steps @ gw=3 | 0.0 | 1.5 |
+| `V4_DEFAULT_20` | 20 | 18 steps @ gw=7, then 2 polish steps @ gw=3 | 0.0 | 1.75 |
+| `V4_TURBO_12` | 12 | 11 steps @ gw=7, then 1 polish step @ gw=3 | 0.5 | 1.75 |
+
+`V4_QUALITY_48` is the default. Fewer steps trade quality for speed. The full
+registry lives in
+[`ideogram4.sampler_configs.PRESETS`](../src/ideogram4/sampler_configs.py); add a
+new entry there to define your own.
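
As a rough illustration of how a row of the table maps to per-step weights, the sketch below builds a schedule list in loop-index order, where index 0 is the final step (the convention noted for `guidance_schedule` in this document). `build_guidance_schedule` is a hypothetical helper written for this example, not a function in `ideogram4`, and the real preset objects may store the schedule differently.

```python
def build_guidance_schedule(num_steps, main_gw=7.0, polish_steps=3, polish_gw=3.0):
    """Per-step guidance weights in loop-index order (index 0 = final step)."""
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    # The polish steps run last, so they occupy the lowest loop indices.
    return [polish_gw] * polish_steps + [main_gw] * (num_steps - polish_steps)


# Mirrors the V4_QUALITY_48 row: 45 steps at gw=7, then 3 polish steps at gw=3.
schedule = build_guidance_schedule(48)
print(len(schedule), schedule[0], schedule[-1])
```

Passing `num_steps=12, polish_steps=1` would give the shape described for `V4_TURBO_12` in the same way.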
+
+## Key Parameters
+
+These are the keyword arguments accepted by `Ideogram4Pipeline.__call__`. The
+defaults below apply when you call `pipe(...)` directly; `run_inference.py`
+overrides `num_steps`, `guidance_schedule`, `mu`, and `std` from the chosen
+sampler preset (see above).
+
+| Parameter | Default | Notes |
+| :-------- | :-----: | :---- |
+| `height` / `width` | 1024 | Must be multiples of 16. Supported range: 256–2048. Aspect ratios up to 6:1 or 1:6. |
+| `num_steps` | 48 | More steps = higher quality. The `V4_QUALITY_48` preset (48 steps) is a good speed/quality trade-off. |
+| `guidance_scale` | 7.0 | Constant guidance weight used when no `guidance_schedule` is given. Higher = more prompt adherence, lower = more diversity. |
+| `guidance_schedule` | `None` | Optional per-step guidance weights (loop-index order: index 0 is the final step). Overrides `guidance_scale`. |
+| `mu` | 0.5 | Logit-normal schedule mean. Auto-adjusted for resolution. |
+| `std` | 1.0 | Logit-normal schedule standard deviation. |
+| `seed` | `None` | Set for reproducible results. |
+
+## Supported Resolutions
+
+Ideogram 4 natively supports any resolution where both height and width are
+multiples of 16, within the range 256–2048 (aspect ratios up to 6:1 or 1:6).
+
+| Use case | Resolution | Aspect ratio |
+| :------- | :--------: | :----------: |
+| Square | 1024 × 1024 | 1:1 |
+| Landscape | 1536 × 1024 | 3:2 |
+| Portrait | 1024 × 1536 | 2:3 |
+| Widescreen | 1920 × 1088 | ~16:9 |
+| Ultrawide | 2048 × 768 | ~21:9 |
+| Phone wallpaper | 1024 × 1792 | ~9:16 |
+| Social banner | 1600 × 400 | 4:1 |
+
+Resolution buckets use 16-pixel increments, giving fine-grained control over
+output dimensions.
+
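The constraints in this section can be checked before calling the pipeline. The following is a minimal sketch that only encodes the limits stated above (multiples of 16, 256 to 2048 per side, aspect ratio at most 6:1). `validate_resolution` is a hypothetical helper for illustration, and the pipeline may perform its own, different validation.

```python
def validate_resolution(height, width, lo=256, hi=2048, step=16, max_aspect=6.0):
    """Raise ValueError unless (height, width) meets the documented constraints."""
    for name, value in (("height", height), ("width", width)):
        if not lo <= value <= hi:
            raise ValueError(f"{name}={value} is outside {lo}-{hi}")
        if value % step:
            raise ValueError(f"{name}={value} is not a multiple of {step}")
    aspect = max(height, width) / min(height, width)
    if aspect > max_aspect:
        raise ValueError(f"aspect ratio {aspect:.2f}:1 exceeds {max_aspect:g}:1")
    return height, width


# The widescreen row from the table passes; 1920 x 1080 would not (1080 % 16 != 0).
validate_resolution(1088, 1920)
```
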
@@ -0,0 +1,45 @@
+# Model Architecture
+
+```
+prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
+            │   
+            ▼
+    ┌──────────────────────────────────────────────────┐
+    │    Ideogram4Transformer                          │
+    │  • 34 × Ideogram4TransformerBlock                │
+    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)    │
+    │      – Ideogram4MLP (SwiGLU)                     │
+    │      – adaln scale/gate from t-embedding         │
+    │  • Ideogram4FinalLayer                           │
+    └──────────────────────────────────────────────────┘
+            │  velocity prediction
+            ▼
+    Euler flow-matching sampler with asymmetric CFG
+            │  denoised image latents
+            ▼
+    VAE decode
+            │
+            ▼
+            PIL.Image
+```
+
+The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states
+extracted from the selected encoder layers) and image latent tokens are
+concatenated into one sequence, and each block is modulated by an AdaLN computed
+from the flow-matching timestep embedding. Attention uses QK-RMSNorm and 3D
+MRoPE so that text and image tokens share a unified positional space.
+
+Model spec:
+
+| field             | value         |
+|-------------------|---------------|
+| `emb_dim`         | 4608          |
+| `num_layers`      | 34            |
+| `num_heads`       | 18            |
+| `intermediate`    | 12288         |
+| `adanln_dim`      | 512           |
+| `rope_theta`      | 5_000_000     |
+| `mrope_section`   | (24, 20, 20)  |
+| latent channels   | 32 × 2² = 128 |
+| max text tokens   | 2048          |
+| sampler           | Euler flow-matching, logit-normal schedule, asymmetric CFG |
@@ -0,0 +1,183 @@
+# Pipeline: How All the Components Work Together
+
+This document explains the end-to-end Ideogram 4 inference pipeline
+conceptually. For the architecture spec and code pointers, see
+[model_architecture.md](model_architecture.md).
+
+## Overview
+
+Ideogram 4 is a **flow-matching text-to-image model** built on a
+**single-stream DiT** (Diffusion Transformer). The pipeline has four main
+components:
+
+```
+ ┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
+ │  Qwen3-VL   │   │  Ideogram4           │   │  KL          │   │           │
+ │  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
+ │  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
+ └─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
+     frozen              trainable                 frozen
+```
+
+## 1. Text Encoder — Qwen3-VL-8B-Instruct
+
+The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
+vision-language model, used in text-only mode (no vision inputs).
+
+**What it does:**
+- Tokenizes the prompt using the Qwen3 chat template.
+- Runs a forward pass through the 36-layer transformer.
+- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
+  24, 27, 30, 33, 35.
+- Concatenates these hidden states along the feature dimension, producing a
+  multi-scale text representation.
+
+**Why multi-layer extraction?** Different layers capture different levels of
+abstraction — early layers encode surface-level token information, while later
+layers encode deeper semantic meaning. Concatenating them gives the DiT access
+to the full spectrum.
+
+**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
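
To make the output shape concrete, here is a small NumPy sketch of the concatenation step using toy tensors in place of real encoder outputs. The layer list comes from the text above. The tiny `hidden_dim`, the fake per-layer states, and the assumption that the list is indexed directly by layer number are illustrative only and do not reflect the actual `ideogram4` encoder code.

```python
import numpy as np

# Layer indices listed above: every third layer from 0 to 33, plus the last (35).
LAYERS = list(range(0, 34, 3)) + [35]


def concat_hidden_states(hidden_states, layers=LAYERS):
    """Concatenate the selected per-layer states along the feature axis."""
    return np.concatenate([hidden_states[i] for i in layers], axis=-1)


# Toy stand-in for per-layer outputs: batch=2, tokens=5, hidden_dim=8.
batch, num_tokens, hidden_dim = 2, 5, 8
fake_states = [
    np.full((batch, num_tokens, hidden_dim), i, dtype=np.float32) for i in range(36)
]
features = concat_hidden_states(fake_states)
print(len(LAYERS), features.shape)  # 13 (2, 5, 104)
```

The last axis is `hidden_dim * 13`, matching the shape given above.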
+
+## 2. DiT Backbone — Ideogram4Transformer
+
+The core generative model is a 34-layer single-stream Diffusion Transformer.
+
+### Sequence layout
+
+Text tokens and image latent tokens are concatenated into one sequence and
+processed through the same self-attention layers.
+
+```
+Sequence layout (per sample):
+
+  ┌───────────────────┬────────────────────────┐
+  │  text tokens      │  image latent tokens   │
+  │  (up to 2048)     │  (grid_h × grid_w)     │
+  └───────────────────┴────────────────────────┘
+           ▲                    ▲
+     Qwen3-VL features    noisy latents z_t
+```
+
+### Key components per block
+
+- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
+  positional encoding is 3-dimensional: for text tokens it uses a 1D position
+  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
+  coordinates. This lets text and image tokens coexist in a unified positional
+  space.
+- **SwiGLU MLP:** the feed-forward layer uses a gated linear unit with SiLU
+  activation.
+- **Adaptive Layer Norm (AdaLN):** the scalar timestep `t` is embedded into a
+  vector that generates per-block scale and gate parameters. This conditions
+  every layer on the current noise level.
+
+### Flow matching
+
+The model is trained with a **flow-matching** objective. Instead of predicting
+noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
+defines the ODE:
+
+```
+dz/dt = v(z_t, t)
+```
+
+At inference time, we start from pure Gaussian noise `z_1` and integrate
+backward to `z_0` (the clean image) using the Euler method:
+
+```
+z_{t-dt} = z_t - v(z_t, t) * dt
+```
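
A minimal numeric sketch of this backward integration, using a toy velocity field instead of the DiT. It assumes a straight-line path `z_t = (1 - t) * z_0 + t * z_1`, whose velocity is the constant `z_1 - z_0`. That path, the uniform time grid, and the names `euler_sample` and `straight_line_velocity` are assumptions made for illustration, not the released sampler.

```python
Z0 = 2.0    # toy "clean" value the path ends at (t = 0)
Z1 = -1.0   # toy "noise" value the sampler starts from (t = 1)


def straight_line_velocity(z, t):
    # For z_t = (1 - t) * Z0 + t * Z1 the velocity dz/dt is constant.
    return Z1 - Z0


def euler_sample(velocity, z1, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=1 down to t=0 with Euler steps."""
    z = z1
    for k in range(num_steps):
        t = 1.0 - k / num_steps        # current time
        s = 1.0 - (k + 1) / num_steps  # next, smaller time
        # (s - t) is negative because the sampler moves toward t = 0.
        z = z + velocity(z, t) * (s - t)
    return z


print(euler_sample(straight_line_velocity, Z1, num_steps=20))  # close to Z0
```

With a constant velocity the Euler result matches the endpoint up to floating-point error. A learned velocity field generally curves, which is why the step count affects quality.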
+
+### Noise schedule
+
+The timestep distribution follows a **logit-normal schedule** parameterized by
+`(mu, std)`. The mean `mu` controls how much time the sampler spends at
+different noise levels: a higher `mu` shifts more steps toward higher noise
+(important for high-resolution images). The schedule auto-adjusts for
+resolution:
+
+```
+mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
+```
+
+where `base_pixels = 512 * 512`.
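
The adjustment is easy to evaluate by hand. The sketch below assumes `log` is the natural logarithm, which the text does not state, and `adjusted_mu` is a hypothetical name used only for this example.

```python
import math

BASE_PIXELS = 512 * 512


def adjusted_mu(mu_base, height, width):
    """Shift the schedule mean by half the log of the pixel-count ratio."""
    return mu_base + 0.5 * math.log(height * width / BASE_PIXELS)


for side in (512, 1024, 2048):
    print(side, round(adjusted_mu(0.0, side, side), 3))
# 512 0.0
# 1024 0.693
# 2048 1.386
```

Under that assumption, each doubling of both sides raises `mu` by `0.5 * ln(4)`, about 0.693.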
+
+## 3. Classifier-Free Guidance (CFG)
+
+At each sampling step, two forward passes are run through the DiT:
+
+1. **Conditional (positive):** full text features + noisy image latents.
+2. **Unconditional (negative):** zeroed text features + noisy image latents
+   (image-only tokens, asymmetric CFG).
+
+The guided velocity is a weighted combination:
+
+```
+v_guided = gw * v_conditional + (1 - gw) * v_unconditional
+```
+
+where `gw` is the per-step guidance weight. With
+`gw > 1`, the model amplifies the text-conditional signal and suppresses the
+unconditional prediction, producing images that follow the prompt more
+faithfully.
+
+**Asymmetric CFG:** The unconditional branch only processes image tokens (no
+text padding), making it computationally cheaper than a full-sequence negative
+pass.
+
+**Per-step schedules:** The guidance weight can vary across steps. The
+`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
+`gw=3` for the final 3 "polish" steps near `t=0`.
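
The combination rule can be tried on plain numbers. This sketch uses scalars in place of velocity tensors, and `cfg_combine` is an illustrative name rather than a function from the package.

```python
def cfg_combine(v_cond, v_uncond, gw):
    """Classifier-free guidance: weighted mix of the two velocity predictions."""
    return gw * v_cond + (1.0 - gw) * v_uncond


v_cond, v_uncond = 1.0, 0.4
print(cfg_combine(v_cond, v_uncond, 1.0))  # gw = 1 returns the conditional prediction
print(cfg_combine(v_cond, v_uncond, 7.0))  # gw > 1 extrapolates away from v_uncond
```

The same expression can be written as `v_uncond + gw * (v_cond - v_uncond)`, which shows that `gw` scales the difference between the two branches.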
+
+
+## 4. VAE Decoder — KL Autoencoder
+
+The denoised latent `z_0` is decoded to pixel space using a frozen KL
+autoencoder.
+
+**What it does:**
+- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
+  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
+  `(batch, channels, grid_h * 2, grid_w * 2)`.
+- **Denormalization:** Per-channel shift and scale are applied to undo the
+  latent normalization used during training.
+- **Decoding:** The VAE decoder maps latents to RGB pixels.
+- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
+
+**Compression factor:** The autoencoder provides 8× spatial compression on each
+axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
+is represented as a 64×64 grid of latent tokens, each with 128 channels
+(32 base channels × 2² patch).
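
The arithmetic above generalizes to any supported resolution. A small sketch using only the factors stated in this section (8× VAE compression, 2×2 patching, 32 base channels); `latent_token_grid` is a hypothetical helper for illustration.

```python
VAE_FACTOR = 8      # spatial compression of the autoencoder, per axis
PATCH = 2           # 2x2 patching in the DiT
BASE_CHANNELS = 32  # latent channels before patching


def latent_token_grid(height, width):
    """Return (grid_h, grid_w, num_tokens, channels_per_token) for an image size."""
    stride = VAE_FACTOR * PATCH  # 16 pixels per token along each axis
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    grid_h, grid_w = height // stride, width // stride
    return grid_h, grid_w, grid_h * grid_w, BASE_CHANNELS * PATCH**2


print(latent_token_grid(1024, 1024))  # (64, 64, 4096, 128)
print(latent_token_grid(768, 2048))   # (48, 128, 6144, 128)
```

The combined stride of 16 is also why heights and widths must be multiples of 16.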
+
+## Putting it all together
+
+```python
+# Pseudocode for one generation call:
+
+# 1. Encode text
+text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)
+
+# 2. Initialize noise
+z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1
+
+# 3. Euler integration from t=1 to t=0
+for step in reversed(range(num_steps)):
+    t = schedule(step)
+    s = schedule(step - 1)
+
+    # Conditional pass (text + image)
+    v_cond = dit(text_features, z, t)
+
+    # Unconditional pass (image only, zeroed text)
+    v_uncond = dit(zeros, z, t)
+
+    # CFG combination
+    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond
+
+    # Euler step
+    z = z + v * (s - t)
+
+# 4. Decode to pixels
+image = vae.decode(z)
+```
@@ -0,0 +1,362 @@
+# Prompting Guide
+
+Ideogram 4 is trained exclusively on **structured JSON captions**, passed to the
+model as serialized strings. While the model can accept plain-text prompts,
+providing a JSON object that follows the caption schema gives significantly
+better results, especially for controllability, spatial layout, and style
+fidelity.
+
+## Plain-text vs. JSON prompts
+
+You can pass in plain-text prompts directly to the model and it will work. The
+sampling parameters come from a named preset in `ideogram4.PRESETS` (the same
+ones `run_inference.py` exposes via `--sampler-preset`), unpacked into the
+`pipe()` call:
+
+```python
+from ideogram4 import PRESETS
+
+preset = PRESETS["V4_QUALITY_48"]
+images = pipe(
+  "a golden retriever on a skateboard",
+  height=1024,
+  width=1024,
+  num_steps=preset.num_steps,
+  guidance_schedule=preset.guidance_schedule,
+  mu=preset.mu,
+  std=preset.std,
+)
+```
+
+
+But for higher quality image generations and more control, pass a JSON string as the prompt:
+
+```python
+import json
+from ideogram4 import PRESETS
+
+caption = {
+  "high_level_description": "A golden retriever riding a skateboard down a sunny sidewalk.",
+  "style_description": {
+    "aesthetics": "warm, playful, vibrant",
+    "lighting": "bright afternoon sunlight, long soft shadows",
+    "photo": "shallow depth of field, eye-level, 85mm lens",
+    "medium": "photograph",
+    "color_palette": ["#F5C542", "#87CEEB", "#4A4A4A", "#FFFFFF", "#2E8B57"]
+  },
+  "compositional_deconstruction": {
+    "background": "A sun-drenched suburban sidewalk lined with green hedges and a white picket fence. Dappled light filters through overhead trees.",
+    "elements": [
+      {"type": "obj", "bbox": [200, 300, 800, 900], "desc": "A golden retriever with a fluffy coat, standing on a red skateboard with all four paws. Its tongue is out and ears are flapping in the wind."},
+      {"type": "obj", "bbox": [250, 750, 750, 950], "desc": "A worn red skateboard with black wheels rolling along the concrete sidewalk."}
+    ]
+  }
+}
+
+preset = PRESETS["V4_QUALITY_48"]
+images = pipe(
+  json.dumps(caption, separators=(",", ":"), ensure_ascii=False),
+  height=1024,
+  width=1024,
+  num_steps=preset.num_steps,
+  guidance_schedule=preset.guidance_schedule,
+  mu=preset.mu,
+  std=preset.std,
+)
+```
+
+## Magic prompt
+
+Writing these captions by hand is optional. *Magic prompt* uses an LLM to expand
+a plain-text prompt into a full structured caption for you, so you get the
+quality of a JSON prompt from a casual one. It is enabled by default in
+`run_inference.py`; you can also call it directly:
+
+```python
+import os
+from ideogram4 import ClaudeOpusMagicPromptV1, PRESETS
+
+magic = ClaudeOpusMagicPromptV1(api_key=os.environ["MAGIC_PROMPT_API_KEY"])
+caption = magic.expand("a golden retriever on a skateboard", aspect_ratio="1:1")
+preset = PRESETS["V4_QUALITY_48"]
+images = pipe(
+  caption,
+  height=1024,
+  width=1024,
+  num_steps=preset.num_steps,
+  guidance_schedule=preset.guidance_schedule,
+  mu=preset.mu,
+  std=preset.std,
+)
+```
+
+The package ships three configurations, registered by name in
+`ideogram4.MAGIC_PROMPTS` (the keys `run_inference.py` accepts via
+`--magic-prompt-model`):
+
+| Config class | Registry key | Backend |
+| :--- | :--- | :--- |
+| `Ideogram4MagicPromptV1` | `ideogram-4-v1` | Ideogram's hosted magic-prompt API (free; reads `IDEOGRAM_API_KEY`) |
+| `ClaudeOpusMagicPromptV1` | `claude-opus-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
+| `ClaudeSonnetMagicPromptV1` | `claude-sonnet-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
+
+`ideogram-4-v1` is the default and is **free**. It runs the expansion
+server-side, so there is no local model or system prompt involved — it just needs
+an Ideogram API key (get one at
+[developer.ideogram.ai](https://developer.ideogram.ai)). The `claude-*`
+configurations instead send one of our open-source system prompts to an OpenRouter model;
+select one with `--magic-prompt-model` and export `MAGIC_PROMPT_API_KEY`:
+
+```bash
+python run_inference.py \
+  --prompt "an isometric illustration of a tiny city floating in the clouds" \
+  --output out.png \
+  --quantization "nf4" \
+  --magic-prompt-model claude-opus-v1 \
+  --magic-prompt-key "$MAGIC_PROMPT_API_KEY"
+```
+
+See the README's [CLI](../README.md#cli) section for the rest of the flags.
+
+Our magic-prompt system prompts are **open source** (they ship in
+`src/ideogram4/magic_prompt_system_prompts/`), so you're also welcome to
+construct the caption with any system prompt and LLM of your choosing.
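+
+As a rough sketch of that do-it-yourself path: whichever LLM you use, its reply
+has to end up as a single JSON string before it reaches `pipe`. The helper below
+is hypothetical (`reply_to_prompt` is not part of `ideogram4`) and assumes the
+model returns one JSON object, optionally wrapped in a Markdown code fence. The
+LLM call itself is left out.
+
+```python
+import json
+
+FENCE = "`" * 3  # Markdown code-fence marker
+
+
+def reply_to_prompt(raw_reply):
+    """Turn an LLM reply holding one JSON caption into a compact prompt string."""
+    text = raw_reply.strip()
+    if text.startswith(FENCE):
+        # Drop the opening fence line (e.g. one tagged "json") and the closing fence.
+        text = text.split("\n", 1)[1] if "\n" in text else ""
+        text = text.rsplit(FENCE, 1)[0]
+    caption = json.loads(text)  # raises json.JSONDecodeError on malformed output
+    return json.dumps(caption, separators=(",", ":"), ensure_ascii=False)
+```
+
+`json.loads` and `json.dumps` both preserve key order, so the order the LLM
+produced is the order the pipeline sees.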
+
+**A few caveats:**
+
+- At Ideogram we've tested this magic prompt with **Claude Opus**. You're welcome
+  to implement your own `MagicPrompt` configurations and/or drive a different LLM
+  with our system prompt, but those paths aren't tested by us and quality may
+  vary.
+- The magic prompt shipped here is **not** the same magic prompt used in
+  production at [Ideogram.ai](https://ideogram.ai) — results will differ from the
+  hosted product (including the `ideogram-4-v1` API).
+
+## JSON caption schema
+
+> **Note:** Following this schema is **not required** — the model accepts any
+> string as a prompt. The schema below describes the exact structure the model
+> was trained on, and matching it minimizes train/eval mismatch so the model
+> generates closer to its full quality. Treat the "required" / "must" language
+> in the rest of this section as the format the [`CaptionVerifier`](../src/ideogram4/caption_verifier.py)
+> checks against, not as a hard pipeline constraint. Deviating from the schema
+> is allowed; it just means you're sampling outside the training distribution.
+
+The full caption schema has three top-level fields:
+
+1. `high_level_description` — optional string, but strongly recommended.
+2. `style_description` — optional object.
+3. `compositional_deconstruction` — **required** object.
+
+`compositional_deconstruction` must always be present. Within it, both
+`background` and `elements` are required.
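+
+A minimal sketch of the smallest caption that satisfies these rules, assuming
+top-level keys follow the order listed above. `minimal_caption` is a
+hypothetical helper, not part of the `ideogram4` package.
+
+```python
+import json
+
+
+def minimal_caption(background, elements, high_level_description=None):
+    """Build a caption dict with only the required fields (plus an optional summary)."""
+    caption = {}
+    if high_level_description is not None:
+        caption["high_level_description"] = high_level_description
+    # compositional_deconstruction is the only required top-level field;
+    # background must come before elements.
+    caption["compositional_deconstruction"] = {
+        "background": background,
+        "elements": list(elements),
+    }
+    return caption
+
+
+caption = minimal_caption(
+    background="A plain white studio backdrop.",
+    elements=[{"type": "obj", "desc": "A red ceramic mug on a wooden table."}],
+)
+prompt = json.dumps(caption, separators=(",", ":"), ensure_ascii=False)
+```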
+
+### `high_level_description`
+
+A one- or two-sentence summary of the entire image. Strongly recommended in every prompt.
+
+```json
+"high_level_description": "A medium-shot photograph of a barista pouring latte art in a cozy cafe."
+```
+
+### `style_description`
+
+Controls the visual style, lighting, medium, and color palette.
+
+`style_description` must contain **exactly one** of:
+
+- `photo` — for photographic captions (paired with `medium: "photograph"`).
+- `art_style` — for non-photographic captions (illustration, painting, 3D render, etc.).
+
+`aesthetics`, `lighting`, and `medium` are also required when `style_description` is present. `color_palette` is optional.
+
+**Key order is strict** and depends on which of `photo` / `art_style` is used:
+
+| Caption type | Required key order |
+| :----------- | :----------------- |
+| Photo (uses `photo`) | `aesthetics`, `lighting`, `photo`, `medium`, `color_palette` |
+| Non-photo (uses `art_style`) | `aesthetics`, `lighting`, `medium`, `art_style`, `color_palette` |
+
+`color_palette` is the only field in this list that may be omitted; if it is included it must remain in the final position.
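+
+Because Python dicts keep insertion order, one way to get the key order right is
+to build `style_description` through a small helper. The function below is a
+hypothetical sketch (it is not part of `ideogram4`) that follows the two
+orderings in the table above.
+
+```python
+def build_style_description(aesthetics, lighting, medium,
+                            photo=None, art_style=None, color_palette=None):
+    """Return a style_description dict with keys in the documented order."""
+    if (photo is None) == (art_style is None):
+        raise ValueError("provide exactly one of photo or art_style")
+    style = {"aesthetics": aesthetics, "lighting": lighting}
+    if photo is not None:
+        # Photo captions: photo comes before medium.
+        style["photo"] = photo
+        style["medium"] = medium
+    else:
+        # Non-photo captions: medium comes before art_style.
+        style["medium"] = medium
+        style["art_style"] = art_style
+    if color_palette is not None:
+        style["color_palette"] = list(color_palette)  # always last
+    return style
+
+
+photo_style = build_style_description(
+    "moody, cinematic", "low-key, deep shadows", "photograph", photo="35mm, f/1.4"
+)
+print(list(photo_style))  # ['aesthetics', 'lighting', 'photo', 'medium']
+```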
+
+Field descriptions:
+
+| Field | Type | Description |
+| :---- | :--- | :---------- |
+| `aesthetics` | string | Aesthetic keywords (e.g. "moody, cinematic, desaturated") |
+| `lighting` | string | Lighting description (e.g. "golden hour, rim light, dramatic shadows") |
+| `photo` | string | Camera/lens details for photographic outputs (e.g. "35mm, f/1.4, bokeh"). Use this OR `art_style`, not both. |
+| `medium` | string | Medium type: `"photograph"`, `"illustration"`, `"3d_render"`, `"painting"`, `"graphic_design"`, etc. |
+| `art_style` | string | Art style description for non-photo captions (e.g. "flat vector illustration, bold outlines"). Use this OR `photo`, not both. |
+| `color_palette` | list[str] | Hex color codes that steer the image's dominant colors. Up to 16 entries. |
+
+### `compositional_deconstruction`
+
+Provides fine-grained spatial control over the image layout using bounding
+boxes and per-element descriptions. Both fields below are required.
+
+| Field | Type | Description |
+| :---- | :--- | :---------- |
+| `background` | string | Description of the background/environment (required) |
+| `elements` | list[dict] | List of elements with optional bounding boxes (required) |
+
+`background` must come before `elements`.
+
+Each element in `elements` must follow a fixed **key order** depending on its
+type. `bbox` and `color_palette` are optional within an element; if present they
+must appear in the positions shown below.
+
+| Type | Required key order |
+| :--- | :----------------- |
+| `"obj"` | `type`, `bbox`, `desc`, `color_palette` |
+| `"text"` | `type`, `bbox`, `text`, `desc`, `color_palette` |
+
+Field descriptions:
+
+| Field | Type | Description |
+| :---- | :--- | :---------- |
+| `type` | string | `"obj"` for objects/subjects, `"text"` for in-image text |
+| `bbox` | list[int] | `[y_min, x_min, y_max, x_max]` in normalized `0–1000` coordinates (origin at top-left). Optional. |
+| `desc` | string | Detailed description of the element |
+| `text` | string | (only for `type: "text"`) The literal text to render |
+| `color_palette` | list[str] | Optional per-element palette. Up to 5 hex entries. |
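+
+If your layout starts out in pixel coordinates, a conversion along these lines
+produces the `bbox` format described above. `to_normalized_bbox` is a
+hypothetical helper; rounding to the nearest integer and clamping to the
+0 to 1000 range are assumptions, not documented behavior.
+
+```python
+def to_normalized_bbox(left, top, right, bottom, image_width, image_height):
+    """Convert a pixel box to [y_min, x_min, y_max, x_max] on a 0-1000 grid."""
+    def scale(value, size):
+        # Map a pixel coordinate onto the 0-1000 grid and clamp to its bounds.
+        return max(0, min(1000, round(value * 1000 / size)))
+
+    # Note the order: y (vertical) comes before x (horizontal).
+    return [
+        scale(top, image_height),
+        scale(left, image_width),
+        scale(bottom, image_height),
+        scale(right, image_width),
+    ]
+
+
+bbox = to_normalized_bbox(left=256, top=512, right=768, bottom=1024,
+                          image_width=1024, image_height=1024)
+print(bbox)  # [500, 250, 1000, 750]
+```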
+
+**Key ordering matters.** The model was trained on JSON with a consistent key
+order, so maintaining it improves generation quality. The pipeline runs
+[`CaptionVerifier`](../src/ideogram4/caption_verifier.py) on every prompt and emits
+warnings for unknown keys, missing required keys, or out-of-order keys.
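+
+The element key order can be handled the same way as any other ordered dict:
+insert keys in the documented sequence. A minimal sketch, with `make_element` as
+a hypothetical helper that is not part of `ideogram4`:
+
+```python
+def make_element(kind, desc, text=None, bbox=None, color_palette=None):
+    """Build an element dict with keys in the documented order for its type."""
+    if kind not in ("obj", "text"):
+        raise ValueError("kind must be 'obj' or 'text'")
+    if (kind == "text") != (text is not None):
+        raise ValueError("text is required for 'text' elements and only for them")
+    element = {"type": kind}
+    if bbox is not None:
+        element["bbox"] = list(bbox)  # [y_min, x_min, y_max, x_max]
+    if kind == "text":
+        element["text"] = text  # literal text, placed before desc
+    element["desc"] = desc
+    if color_palette is not None:
+        element["color_palette"] = list(color_palette)  # always last
+    return element
+
+
+title = make_element("text", "Bold white headline at the top.",
+                     text="SALE", bbox=[50, 100, 200, 900])
+print(list(title))  # ['type', 'bbox', 'text', 'desc']
+```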
+
+**Hex color format.** Colors in `color_palette` must be uppercase
+`#RRGGBB` strings (e.g. `#1B1B2F`, not `#1b1b2f` or `#fff`).
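+
+Colors copied from design tools are often lowercase or in three-digit shorthand.
+The sketch below normalizes them to the required form; `normalize_hex` is a
+hypothetical helper, and expanding shorthand (rather than rejecting it) is a
+choice made here for convenience.
+
+```python
+import re
+
+_HEX6 = re.compile(r"#[0-9A-Fa-f]{6}")
+_HEX3 = re.compile(r"#[0-9A-Fa-f]{3}")
+
+
+def normalize_hex(color):
+    """Return color as an uppercase #RRGGBB string, expanding #RGB shorthand."""
+    color = color.strip()
+    if _HEX3.fullmatch(color):
+        # "#abc" -> "#aabbcc"
+        color = "#" + "".join(ch * 2 for ch in color[1:])
+    if not _HEX6.fullmatch(color):
+        raise ValueError(f"not a hex color: {color!r}")
+    return color.upper()
+
+
+print(normalize_hex("#1b1b2f"))  # #1B1B2F
+print(normalize_hex("#fff"))     # #FFFFFF
+```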
+
+**Encoding.** When serializing with Python's `json` module, pass
+`separators=(",", ":")` for compact output and `ensure_ascii=False` so that
+non-ASCII characters are written literally instead of as `\uXXXX` escapes.
+`CaptionVerifier` warns when the raw text contains `\uXXXX` escapes but no
+literal non-ASCII characters.
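+
+To see what the two arguments change, compare the default `json.dumps` output
+with the recommended call. The element below is made up for illustration.
+
+```python
+import json
+
+element = {"type": "text", "text": "Café Zürich", "desc": "Hand-painted shop sign."}
+
+# Default settings: spaces after separators, non-ASCII escaped.
+escaped = json.dumps(element)
+# Recommended settings: compact separators, literal non-ASCII characters.
+compact = json.dumps(element, separators=(",", ":"), ensure_ascii=False)
+
+print(escaped)  # {"type": "text", "text": "Caf\u00e9 Z\u00fcrich", "desc": "Hand-painted shop sign."}
+print(compact)  # {"type":"text","text":"Café Zürich","desc":"Hand-painted shop sign."}
+```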
+
+## Color palette conditioning
+
+One of Ideogram 4's distinctive features is **color palette control**. By
+providing a `color_palette` array of hex colors in `style_description`, you
+can steer the dominant colors of the generated image.
+
+```json
+"style_description": {
+  "aesthetics": "moody, cinematic",
+  "lighting": "low-key, deep shadows",
+  "photo": "35mm, f/1.4",
+  "medium": "photograph",
+  "color_palette": ["#1B1B2F", "#162447", "#1F4068", "#E43F5A", "#F5F5F5"]
+}
+```
+
+Tips for effective color palette use:
+
+- **Up to 16 colors** in `style_description.color_palette` for the overall
+  image palette, and **up to 5 colors** per element in
+  `compositional_deconstruction.elements[*].color_palette`.
+- **Include background colors** — if you want a dark background, include the
+  dark hex in the palette.
+- **Contrast pairs** — include both your highlight and shadow colors for more
+  controlled lighting.
+- **Uppercase hex only** — `#RRGGBB` form, no shorthand.
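+
+The two size limits above are easy to check before sending a prompt. A minimal
+sketch, with `palette_problems` as a hypothetical helper that is independent of
+`CaptionVerifier`:
+
+```python
+STYLE_PALETTE_MAX = 16    # style_description.color_palette
+ELEMENT_PALETTE_MAX = 5   # elements[*].color_palette
+
+
+def palette_problems(caption):
+    """Return a list of messages for palettes that exceed the documented limits."""
+    problems = []
+    style_palette = caption.get("style_description", {}).get("color_palette", [])
+    if len(style_palette) > STYLE_PALETTE_MAX:
+        problems.append(
+            f"style palette has {len(style_palette)} colors (max {STYLE_PALETTE_MAX})"
+        )
+    elements = caption.get("compositional_deconstruction", {}).get("elements", [])
+    for index, element in enumerate(elements):
+        palette = element.get("color_palette", [])
+        if len(palette) > ELEMENT_PALETTE_MAX:
+            problems.append(
+                f"element {index} palette has {len(palette)} colors "
+                f"(max {ELEMENT_PALETTE_MAX})"
+            )
+    return problems
+```
+
+An empty list means both limits are respected; the function does not check the
+hex format itself.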
+
+### Example: warm sunset palette
+
+```json
+{
+  "high_level_description": "A lone sailboat on calm water at sunset.",
+  "style_description": {
+    "aesthetics": "serene, warm, golden hour",
+    "lighting": "golden hour backlighting, warm atmospheric haze",
+    "photo": "wide angle, f/8, long exposure",
+    "medium": "photograph",
+    "color_palette": ["#FF6B35", "#F7C59F", "#004E89", "#1A659E", "#2B2D42"]
+  },
+  "compositional_deconstruction": {
+    "background": "A calm ocean stretching to a low horizon, sky washed in orange and pink with thin wisps of cloud.",
+    "elements": [
+      {"type": "obj", "desc": "A single sailboat with a white triangular sail, silhouetted against the setting sun."}
+    ]
+  }
+}
+```
+
+
+### Example: corporate design palette
+
+```json
+{
+  "high_level_description": "A clean, modern business card layout for a tech company.",
+  "style_description": {
+    "aesthetics": "minimal, professional, geometric",
+    "lighting": "even, diffuse studio lighting",
+    "medium": "graphic_design",
+    "art_style": "flat vector design, generous whitespace, sans-serif typography",
+    "color_palette": ["#FFFFFF", "#F0F0F0", "#333333", "#0066FF", "#00CC88"]
+  },
+  "compositional_deconstruction": {
+    "background": "A solid off-white card surface with subtle paper texture.",
+    "elements": [
+      {"type": "text", "text": "ACME TECH", "desc": "Bold dark grey sans-serif company name across the upper third of the card."},
+      {"type": "text", "text": "hello@acme.tech", "desc": "Small blue sans-serif contact email near the bottom of the card."}
+    ]
+  }
+}
+```
+
+
+
+## Full example
+
+```json
+{
+  "high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
+  "style_description": {
+    "aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
+    "lighting": "overcast daylight, diffused, soft subtle shadows",
+    "photo": "shallow depth of field, sharp focus, eye-level, telephoto",
+    "medium": "photograph"
+  },
+  "compositional_deconstruction": {
+    "background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
+    "elements": [
+      {"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
+      {"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
+      {"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
+      {"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
+      {"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
+      {"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
+      {"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
+      {"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
+      {"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
+      {"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
+      {"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
+      {"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
+      {"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
+      {"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
+      {"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
+      {"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
+      {"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
+    ]
+  }
+}
+```
+
+## Safety filter
+
+NSFW prompts are blocked. Instead of an image, the model returns a gray screen
+with the text "Image blocked by safety filter". The filter's false-positive rate
+is higher for prompts that are not JSON-formatted. We are aware of this issue and
+may release an updated checkpoint to improve it.
+
+# Congratulations!
+
+You are now a certified Ideogram 4 prompter!
+
+With structured JSON captions, you have fine-grained control over composition,
+color palettes, typography, and spatial layout — capabilities that go far
+beyond what plain-text prompts can express!
+We'd love to see what you create :-)
+Share your results, experiments, and creative discoveries with the community,
+especially the unexpected ones. Tag us on social media or open a discussion on
+the repo. Happy generating!