Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
@@ -0,0 +1,15 @@
 # Python
 __pycache__/
 *.py[cod]
 # Generated at runtime / regenerated from code on first launch
 translations.json
 # User-specific and runtime state (not part of the application source)
 comfy_settings.json
 draft.json
 prompt_library.json
 prompt_previews/
 # Editor / agent
 .claude/
@@ -0,0 +1,159 @@
 # Ideogram 4 Prompt Builder
 **English** · [Русский](#ideogram-4-prompt-builder-ru)
 A desktop GUI (PyQt6) for building structured JSON captions for **Ideogram 4** and ComfyUI workflows, with a prompt library, reference-image canvas, localisation, light/dark themes, and direct generation through a ComfyUI server.
 ![English interface](eng-vlack.png)
 ## Run
 ```powershell
 python ideogram_prompt_builder.py
 ```
 Requires `PyQt6` (no other third-party dependencies):
 ```powershell
 pip install PyQt6
 ```
 ## What it builds
 Prompts follow the schema from `docs/prompting.md`:
 - `high_level_description`
 - `style_description` with either `photo` or `art_style`
 - `compositional_deconstruction.background`
 - `compositional_deconstruction.elements`
 - optional uppercase HEX color palettes
 - optional bounding boxes in normalized `0-1000` coordinates
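As an illustration of the last two items, the sketch below maps a pixel-space box to the `0-1000` range and checks that palette entries are uppercase HEX. The function names, the `(x_min, y_min, x_max, y_max)` ordering and the `#RRGGBB` form are assumptions made for this example; `docs/prompting.md` remains the authority on the actual field layout.

```python
import re

# Assumption: palette entries look like "#RRGGBB" with uppercase hex digits.
HEX_RE = re.compile(r"#[0-9A-F]{6}")


def is_valid_hex(color: str) -> bool:
    """Return True only for uppercase six-digit HEX colors such as "#FFAA00"."""
    return HEX_RE.fullmatch(color) is not None


def normalize_bbox(box_px, image_w, image_h, scale=1000):
    """Map a pixel-space (x_min, y_min, x_max, y_max) box to integers in 0..scale."""
    x0, y0, x1, y1 = box_px

    def to_unit(value, size):
        # Scale to the normalized range, then clamp to stay inside it.
        return max(0, min(scale, round(value * scale / size)))

    return (
        to_unit(x0, image_w),
        to_unit(y0, image_h),
        to_unit(x1, image_w),
        to_unit(y1, image_h),
    )


print(normalize_bbox((256, 128, 768, 896), 1024, 1024))  # (250, 125, 750, 875)
print([c for c in ["#FFAA00", "#ffaa00", "#12345"] if not is_valid_hex(c)])
```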
 Actions live in a menu bar (**File / Edit / Library / ComfyUI / View**) plus a slim toolbar (Generate, Undo/Redo, Save to library, Library, Copy) and the language/theme controls on the right. The right-hand panel is tabbed: **JSON** (output + validation) and **Result** (the generated image).
 ## Editing
 - Move and resize layout boxes directly with the mouse on the bbox canvas.
 - Palette fields accept comma-separated HEX, clickable swatches and a popup color picker, with a live `n/limit` counter and invalid-color highlighting.
 - **Undo / Redo** (`Ctrl+Z` / `Ctrl+Y`).
 - **Duplicate**, **reorder** (up/down) and add elements from **templates** (Character / Title text / Background object).
 - The validation list is clickable — clicking an element-specific message selects that element.
 - Text fields have a right-click translation menu (`Translate to RU` / `Translate to EN`, results cached).
 - Work is autosaved to `draft.json`; on the next launch you are offered to restore it.
 ## Reference image & zoom
 In the composition panel you can load a **reference image** (file or paste from clipboard) drawn under the bbox grid; the **grid scale** slider zooms the grid and the reference scales with it.
 ## Prompt library
 The **Library** menu saves the current caption (optionally with a preview image), updates the entry you loaded from, and opens the library browser, where you can:
 - search by name / tag / description and edit per-entry **tags**;
 - load any saved prompt back into the editor for reuse and editing;
 - attach a preview from a file or **paste it from the clipboard**, or remove it;
 - rename, delete entries, and view the preview + summary;
 - **export / import** the whole library (prompts + previews) as a single `.zip`.
 The library is stored in `prompt_library.json` next to the app, with preview images in `prompt_previews/` (created on first save).
 ## ComfyUI integration
 The **ComfyUI** menu connects the builder to a running ComfyUI server:
 - **ComfyUI settings** — host, port and HTTPS, with a *Test connection* button. Stored in `comfy_settings.json`.
 - **Check ComfyUI** — verifies that every model, sampler and custom node the bundled `ideogram4NSFWComfyui_v11.json` workflow needs is installed on the server, and lists anything missing.
 - **Generate in ComfyUI** — converts the bundled workflow to API format, injects the current compact JSON caption, submits it and retrieves the generated image. The result appears in the **Result** tab and can be saved to a file or into the library.
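For orientation, the submission step can be sketched with the standard library alone. ComfyUI accepts an API-format workflow as a `POST` to `/prompt` with a body of the form `{"prompt": ...}`. Everything else here is an assumption for illustration: the helper name, the node id `"6"`, the input name `"text"` and the one-node workflow are hypothetical stand-ins, not values from the bundled workflow. The function only builds the request and does not send it.

```python
import copy
import json
import urllib.request


def build_prompt_request(host, port, workflow_api, caption, node_id, input_name, https=False):
    """Build (but do not send) a POST /prompt request with the caption injected."""
    workflow = copy.deepcopy(workflow_api)  # leave the caller's workflow untouched
    # Inject the caption as compact JSON text into the chosen node input.
    workflow[node_id]["inputs"][input_name] = json.dumps(
        caption, ensure_ascii=False, separators=(",", ":")
    )
    scheme = "https" if https else "http"
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{scheme}://{host}:{port}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical one-node workflow, already in API format.
workflow_api = {"6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}}}
caption = {"high_level_description": "A red fox in snow"}
req = build_prompt_request("127.0.0.1", 8188, workflow_api, caption, "6", "text")
print(req.full_url, req.get_method())
```

Sending would then be a matter of `urllib.request.urlopen(req)` against a running server.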
 ## Appearance & localisation
 - **Theme** (View menu) toggles a light / dark theme.
 - The interface language is switched at runtime from the **Language** selector; the default is **English**.
 UI strings are loaded from `translations.json`, created on first run from bundled `en` / `ru` translations. To add a language, add a top-level key with the same string keys (and optionally a display name in `LANGUAGE_NAMES`). Missing keys fall back to English, then to the key name. Theme and language are saved in `comfy_settings.json`.
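The fallback order can be pictured with a small lookup helper. This is a sketch of the behaviour described above, not the app's actual code; the function name and the `menu.*` string keys are hypothetical.

```python
def translate(translations, lang, key):
    """Look up `key` in `lang`, fall back to English, then to the key itself."""
    for code in (lang, "en"):
        value = translations.get(code, {}).get(key)
        if value is not None:
            return value
    return key


# Shape assumed for translations.json: one top-level key per language.
translations = {
    "en": {"menu.file": "File", "menu.edit": "Edit"},
    "ru": {"menu.file": "Файл"},
}

print(translate(translations, "ru", "menu.file"))  # found in ru
print(translate(translations, "ru", "menu.edit"))  # falls back to en
print(translate(translations, "ru", "menu.view"))  # falls back to the key name
```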
 ## Compact JSON for ComfyUI
 The output can be copied in pretty or compact form. Compact JSON matches the recommended serialization style for inference and can be pasted into the Ideogram 4 prompt field in ComfyUI.
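The difference between the two forms comes down to serializer settings. A minimal sketch with Python's `json` module, assuming compact means no whitespace between tokens; the caption values, and the use of a plain string under `photo`, are placeholders for illustration:

```python
import json

# Placeholder caption; real captions follow the schema in docs/prompting.md.
caption = {
    "high_level_description": "A red fox in snow",
    "style_description": {"photo": "soft overcast light"},
}

# Pretty form: indented, easy to read and diff.
pretty = json.dumps(caption, ensure_ascii=False, indent=2)

# Compact form: no spaces after "," or ":", and no newlines.
compact = json.dumps(caption, ensure_ascii=False, separators=(",", ":"))

print(pretty)
print(compact)
```

Both strings parse back to the same object; only the whitespace differs.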
 ---
 <a name="ideogram-4-prompt-builder-ru"></a>
 # Ideogram 4 Prompt Builder (RU)
 [English](#ideogram-4-prompt-builder) · **Русский**
 Десктопное GUI-приложение (PyQt6) для сборки структурированных JSON-промтов для **Ideogram 4** и ComfyUI: с библиотекой промтов, холстом с референс-изображением, локализацией, светлой/тёмной темой и прямой генерацией через сервер ComfyUI.
 ![Русский интерфейс](ru-white.png)
 ## Запуск
 ```powershell
 python ideogram_prompt_builder.py
 ```
 Нужен только `PyQt6` (других сторонних зависимостей нет):
 ```powershell
 pip install PyQt6
 ```
 ## Что собирается
 Промты соответствуют схеме из `docs/prompting.md`:
 - `high_level_description`
 - `style_description` с одним из `photo` или `art_style`
 - `compositional_deconstruction.background`
 - `compositional_deconstruction.elements`
 - опциональные палитры HEX в верхнем регистре
 - опциональные bbox в нормализованных координатах `0-1000`
 Действия вынесены в меню (**Файл / Правка / Библиотека / ComfyUI / Вид**) плюс компактная панель инструментов (Сгенерировать, Отменить/Повторить, Сохранить в библиотеку, Библиотека, Копировать) и переключатели языка/темы справа. Правая панель — вкладки: **JSON** (вывод + валидация) и **Результат** (сгенерированное изображение).
 ## Редактирование
 - Перемещайте и масштабируйте рамки прямо мышью на холсте bbox.
 - Поля палитры принимают HEX через запятую, кликабельные образцы и всплывающий выбор цвета, со счётчиком `n/лимит` и подсветкой некорректных цветов.
 - **Отмена / Повтор** (`Ctrl+Z` / `Ctrl+Y`).
 - **Дублирование**, **изменение порядка** (вверх/вниз) и добавление элементов из **шаблонов** (Персонаж / Заголовок / Фоновый объект).
 - Список валидации кликабельный — клик по сообщению об элементе выделяет этот элемент.
 - У текстовых полей есть контекстное меню перевода (`Перевести на RU` / `Перевести на EN`, с кэшированием).
 - Работа автосохраняется в `draft.json`; при следующем запуске предлагается восстановить черновик.
 ## Референс-изображение и масштаб
 В панели композиции можно загрузить **референс-изображение** (из файла или вставить из буфера), которое рисуется под сеткой bbox; ползунок **масштаба сетки** увеличивает сетку, и референс масштабируется вместе с ней.
 ## Библиотека промтов
 Меню **Библиотека** сохраняет текущий промт (по желанию с превью), обновляет загруженную запись и открывает браузер библиотеки, где можно:
 - искать по имени / тегам / описанию и редактировать **теги** записи;
 - загрузить любой сохранённый промт обратно в редактор для повторного использования и правки;
 - прикрепить превью из файла или **вставить из буфера обмена**, либо убрать его;
 - переименовывать, удалять записи и просматривать превью + сводку;
 - **экспортировать / импортировать** всю библиотеку (промты + превью) одним `.zip`.
 Библиотека хранится в `prompt_library.json` рядом с приложением, превью — в `prompt_previews/` (создаются при первом сохранении).
 ## Интеграция с ComfyUI
 Меню **ComfyUI** связывает приложение с запущенным сервером ComfyUI:
 - **Настройки ComfyUI** — хост, порт и HTTPS, с кнопкой *Проверить соединение*. Хранятся в `comfy_settings.json`.
 - **Проверить ComfyUI** — проверяет, что все модели, семплеры и кастомные ноды, нужные встроенному workflow `ideogram4NSFWComfyui_v11.json`, установлены на сервере, и перечисляет отсутствующие.
 - **Сгенерировать в ComfyUI** — конвертирует встроенный workflow в API-формат, подставляет текущий compact JSON, отправляет запрос и получает изображение. Результат показывается во вкладке **Результат** и может быть сохранён в файл или в библиотеку.
 ## Внешний вид и локализация
 - **Тема** (меню Вид) переключает светлую / тёмную тему.
 - Язык интерфейса переключается на лету через селектор **Язык**; по умолчанию — английский.
 Строки интерфейса берутся из `translations.json`, который создаётся при первом запуске из встроенных переводов `en` / `ru`. Чтобы добавить язык, добавьте ключ верхнего уровня с тем же набором строк (и при желании отображаемое имя в `LANGUAGE_NAMES`). Отсутствующие ключи откатываются к английскому, затем к самому ключу. Тема и язык сохраняются в `comfy_settings.json`.
 ## Compact JSON для ComfyUI
 Вывод можно скопировать в pretty- или compact-виде. Compact JSON соответствует рекомендованной сериализации для инференса и вставляется в поле промта Ideogram 4 в ComfyUI.
@@ -0,0 +1,336 @@
 <p align="center"><a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><picture>
  <source media="(prefers-color-scheme: dark)" srcset="assets/ideogram_logo_darkmode.svg">
  <source media="(prefers-color-scheme: light)" srcset="assets/ideogram_logo.svg">
  <img src="assets/ideogram_logo.svg" alt="Ideogram" width="500">
 </picture></a></p>
 <p align="center"><em>Ideogram 4: Open image model at the forefront of design</em></p>
 <p align="center">
  <a href="https://ideogram.ai/blog/ideogram-4.0/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Blog-Post-orange" alt="Blog Post"></a>
  <a href="https://github.com/ideogram-oss/ideogram4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github" alt="Code"></a>
  <a href="https://huggingface.co/collections/ideogram-ai/ideogram-4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Model-HuggingFace-blue?logo=huggingface" alt="Model"></a>
  <a href="https://developer.ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/API-developer.ideogram.ai-purple" alt="API"></a>
  <a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Official%20Site-ideogram.ai-ff69b4" alt="Official Site"></a>
 </p>
 <p align="center">
  <img src="assets/samples/collage_landscape.jpg" alt="A collage of Ideogram 4 samples spanning photorealism, illustration, typography, and poster design">
 </p>
 Ideogram 4 is **[Ideogram](https://ideogram.ai)'s first open-weight text-to-image model**. It is a **state-of-the-art foundation model trained from scratch** — not a fine-tune of any existing model. It introduces a new structured JSON prompting interface, with best-in-class multilingual text rendering, deep language understanding, explicit bounding-box layout and color-palette controls, and native 2k resolution images. The easiest way to try the model is online at **[ideogram.ai](https://ideogram.ai/)**.
 We believe openness drives innovation, and we invite the research community to innovate with us on the forefront of visual intelligence.
 ## Table of Contents
 1. [News](#news)
 2. [Model Zoo](#model-zoo)
 3. [Performance](#performance)
 4. [Quick Start](#quick-start)
 5. [Model Summary](#model-summary)
 6. [Prompting Guide](#prompting-guide)
 7. [Documentation](#documentation)
 8. [Citation](#citation)
 ## News
 * **[2026-06-03]** **Ideogram 4 released!** Inference code and weights
  are now public, and our [technical blog post](https://ideogram.ai/blog/ideogram-4.0/) is live. See the
  [Quick Start](#quick-start) section to generate your first image, or try the
  model online at [ideogram.ai](https://ideogram.ai/).
 ## Model Zoo
 | Model | Params | Weight Quantization | Supported Hardware | Diffusers Support | License |
 | :---  | :---:  | :---:        | :---:   | :---:   | :---:   |
 | **[Ideogram 4 (nf4)](https://huggingface.co/ideogram-ai/ideogram-4-nf4)** | 9.3B | nf4 | CUDA | Yes | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
 | **[Ideogram 4 (fp8)](https://huggingface.co/ideogram-ai/ideogram-4-fp8)** | 9.3B | fp8 | All | No | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
 We plan to support more quantizations in the future.
 ## Performance
 We evaluate Ideogram 4 across third-party arenas and benchmarks, standard
 open-source benchmarks, and our own internal human-preference benchmark. Across
 all of them, **Ideogram 4 is the best open-weight image model by far, and sits
 at the frontier of design.**
 ### Design Arena
 [Design Arena](https://www.designarena.ai/) is a third-party image Elo
 leaderboard focused specifically on design-oriented generation. On the overall
 board, Ideogram 4 is the top-ranked open-weight model, trailing only proprietary
 GPT and Gemini models:
 <p align="center">
  <img src="assets/benchmarks/design_arena.png" alt="Design Arena overall image Elo leaderboard with Ideogram 4.0 as the top open-weight model">
 </p>
 Filtered to open-weight models only, Ideogram 4 leads by a commanding margin,
 well ahead of the next-best open model:
 <p align="center">
  <img src="assets/benchmarks/design_arena2.png" alt="Design Arena open-weight image Elo leaderboard, with Ideogram 4.0 well ahead of all other open models">
 </p>
 ### ContraLabs
 [ContraLabs](https://contralabs.com/research) ran a blind typography evaluation judged by
 ten professional designers from Contra's top-earning talent. Ideogram 4 leads on
 first-place win rate, picked as the best of four models 47.9% of the time
 overall — well ahead of Gemini 3.1 Flash Image Preview (Nano Banana 2) at 30.0%,
 FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%):
 <p align="center">
  <img src="assets/benchmarks/contralabs_typography.png" alt="ContraLabs typography first-place win rate, with Ideogram v4 leading">
 </p>
 It also wins on practical usability: asked "Would you use this in real client
 work?", the same designers rated Ideogram 4 highest at 3.55 / 5 — significantly
 above Nano Banana 2 (2.84), Grok Imagine 1.0 (2.61), and FLUX.2 [max] (2.49):
 <p align="center">
  <img src="assets/benchmarks/contralabs_typography2.png" alt="ContraLabs 'would you use this in real client work?' rating, with Ideogram v4 leading">
 </p>
 ### LMArena
 On [LMArena](https://lmarena.ai/), a third-party text-to-image leaderboard that
 measures general-purpose text-to-image use cases, Ideogram is the top-ranked
 open-weight lab and a top-5 image generation lab overall — beaten only by giant
 companies with vastly larger budgets and resources:
 <p align="center">
  <img src="assets/benchmarks/lmarena_benchmark.png" alt="LMArena text-to-image lab leaderboard with Ideogram">
 </p>
 ### Ideogram internal eval
 For our internal human-preference benchmark, focused on graphic design and
 photography, we had graphic designers deeply familiar with professional design
 work do the rating blind. Bradley-Terry scores rank Ideogram 4 #2 overall —
 behind only GPT Image 2 medium — and the top open-weight model:
 <p align="center">
  <img src="assets/benchmarks/ideogram_benchmark.png" alt="Ideogram internal design leaderboard with Ideogram 4.0">
 </p>
 ### Open-source benchmarks
 On standard open-source benchmarks measuring core capabilities — layout control
 (7Bench), spatial reasoning and object fidelity (SpatialGenEval), text rendering
 (X-Omni OCR), and prompt alignment (Prism) — Ideogram 4 closes the gap to the
 leading closed-source models across every axis. On layout control (7Bench), it
 is significantly better than all closed-source models:
 <p align="center">
  <img src="assets/benchmarks/opensource.png" alt="Five-axis capability radar comparing Ideogram 4.0 to leading closed-source models on layout control, spatial reasoning, object fidelity, prompt alignment, and text rendering">
 </p>
 At 9.3B parameters, Ideogram 4 delivers the best text rendering of any open-weight
 release we benchmarked — ahead of much larger models like Qwen-Image (20B),
 FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE):
 <p align="center">
  <img src="assets/benchmarks/opensource2.png" alt="Parameter-efficiency scatter plot showing Ideogram 4.0 at 9.3B parameters leading all other open-weight models on text rendering">
 </p>
 ## Quick Start
 ### Install
 ```bash
 pip install .
 ```
 If you plan to modify the code, install in editable mode instead so changes
 under `src/ideogram4/` take effect without reinstalling:
 ```bash
 pip install -e .
 ```
 ### Model access
 The model weights are **gated** on Hugging Face, so you must accept the gate and
 authenticate before the code can download them — otherwise the download fails
 with a `404` / `GatedRepoError`.
 1. Open the model page — [ideogram-ai/ideogram-4-nf4](https://huggingface.co/ideogram-ai/ideogram-4-nf4)
   (or [ideogram-ai/ideogram-4-fp8](https://huggingface.co/ideogram-ai/ideogram-4-fp8)) — and click
   **Agree and access repository** to accept the license gate.
 2. Create a Hugging Face access token at
   [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and log in so the
   download is authenticated:
   ```bash
   hf auth login
   ```
   Alternatively, export the token directly: `export HF_TOKEN="hf_..."`.
 ### CLI
 The plain `--prompt` is rewritten into the structured JSON caption the model
 expects by a "magic prompt" LLM. By default this uses Ideogram's hosted
 magic-prompt API, which is **free** and does the expansion server-side (no local
 model or system prompt needed). It reads `IDEOGRAM_API_KEY` — get a key at
 https://developer.ideogram.ai/:
 ```bash
 python run_inference.py \
  --prompt "a ginger cat wearing a tiny wizard hat reading a spellbook" \
  --output out.png \
  --quantization "nf4" \
  --magic-prompt-key "$IDEOGRAM_API_KEY"
 ```
 You can also run the expansion through your own LLM provider: one of our magic-prompt
 system prompts is **open source**. See the
 [Prompting Guide](docs/prompting.md#magic-prompt) for details.
 For the highest-quality images, set `--height 2048 --width 2048` and
 `--sampler-preset V4_QUALITY_48`.
 #### Safety screening with Hive
 Prompt and output safety screening is performed via [Hive](https://thehive.ai/).
 Sign up and create a Text Moderation key and a Visual Content Moderation key,
 then export them as `HIVE_TEXT_MODERATION_KEY` and `HIVE_VISUAL_MODERATION_KEY`
 (or pass them via `--hive-text-key` / `--hive-visual-key`).
 ```bash
 python run_inference.py \
  --prompt "an isometric illustration of a tiny city floating in the clouds" \
  --output out.png \
  --quantization "nf4" \
  --magic-prompt-key "$IDEOGRAM_API_KEY" \
  --hive-text-key "$HIVE_TEXT_MODERATION_KEY" \
  --hive-visual-key "$HIVE_VISUAL_MODERATION_KEY"
 ```
 For sampler presets, parameter reference, and optimization tips, see
 [docs/inference.md](docs/inference.md).
 ## Model Summary
 Ideogram 4 is a **foundation model trained entirely from scratch**, not a
 fine-tune or distillation of any existing checkpoint. It is a flow-matching
 text-to-image model built on a **fully single-stream** Diffusion Transformer
 (DiT) architecture.
 **Architecture:**
 - **Fully single-stream DiT.** Text and image tokens are concatenated into one
  unified sequence and processed through the same 34-layer transformer, with no
  separate text or image branches. This enables deep cross-modal interaction at
  every layer.
 - **Vision-language model as text encoder.** Instead of a text-only encoder
  like CLIP or T5, Ideogram 4 uses
  [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct),
  a full vision-language model that provides far richer understanding of visual
  concepts. Hidden states are extracted from **13 intermediate layers** and
  concatenated, giving the model multi-scale semantic features ranging from
  surface-level token information to deep compositional understanding.
 - **Dual-branch classifier-free guidance.** The conditional (positive) and
  unconditional (negative) branches can be independently refined, enabling
  separate control over prompt adherence and image quality.
 - **Flexible resolution.** Native support for any resolution from 256 to 2048
  (multiples of 16), with aspect ratios up to 6:1. A single model handles
  everything from square thumbnails to ultrawide banners, with the noise
  schedule auto-adjusting per resolution.
 **Key Capabilities:**
 - **Extreme controllability.** Ideogram 4 is trained on structured JSON
  captions, giving users unprecedented control over composition, style,
  lighting, color palette, typography, and spatial layout, all from a single
  prompt.
 - **State-of-the-art text rendering.** Ideogram 4 delivers best-in-class
  in-image text generation (signage, logos, captions, watermarks, multi-line
  text) with high fidelity directly from the prompt.
 - **Spatial layout control.** Bounding-box coordinates in the prompt allow
  explicit placement of subjects, text elements, and background regions.
 - **Color palette conditioning.** Specify hex colors in the prompt to steer the
  image's dominant color scheme.
 For full architecture details, see
 [docs/model_architecture.md](docs/model_architecture.md). For a walkthrough of
 how the pipeline components fit together, see
 [docs/pipeline.md](docs/pipeline.md).
 ## Prompting Guide
 Ideogram 4 is trained exclusively on **structured JSON captions**. While
 plain-text prompts work, you will get the best results by providing a JSON
 object that follows our caption schema.
 Key points:
 - **Use JSON prompts** for maximum controllability — the model was trained on
  them and understands the structure natively.
 - **Color palette conditioning** — specify a `colour_palette` array of hex
  colors in the style description to steer the image's color scheme.
 - **Aspect ratio flexibility** — Ideogram 4 supports a wide range of aspect
  ratios (any multiple-of-16 resolution from 256 to 2048 on each side). This
  is a key advantage for practical use: portraits, landscapes, banners,
  phone wallpapers, social media formats, etc.
 - **Bounding-box layout** — specify `bbox` coordinates in the prompt to
  explicitly place subjects, text elements, and background regions.
 - **Compositional control** — use `compositional_deconstruction` with bounding
  boxes and per-element descriptions for precise spatial layout.
 **Why JSON-only training?** We train exclusively on JSON so that training
 and inference share a single, common prompt format. The training captions themselves are deliberately
 **extremely descriptive**: each JSON exhaustively describes everything in
 the image to maximize training efficiency. The more
 text-to-image relationships each caption pins down, the more grounded
 supervision the model extracts from a single training pair, rather than
 having to infer those relationships across many sparsely-captioned samples.
 **Why JSON at inference time?** Because the model was trained on captions
 that name every object explicitly, the most reliable way to get every
 requested object rendered is to mirror that pattern. Plain-text prompts still work, but
 won't perform as well since the model was only trained on structured JSON captions.
 **Don't want to write JSON by hand?** That's what *magic prompt* is for: it uses
 an LLM to expand a plain-text prompt into a full structured caption before
 generation, so you get JSON-quality results from a casual prompt. It runs by
 default in `run_inference.py` (see the [CLI](#cli) section).
 See [docs/prompting.md](docs/prompting.md) for a full guide.
 ## Documentation
 | Document | Description |
 | :------- | :---------- |
 | [docs/prompting.md](docs/prompting.md) | How to write JSON prompts, color palette conditioning, aspect ratios |
 | [docs/inference.md](docs/inference.md) | Sampler presets, parameter reference, resolutions, optimization tips |
 | [docs/model_architecture.md](docs/model_architecture.md) | Architecture diagram, DiT spec, component details |
 | [docs/pipeline.md](docs/pipeline.md) | Conceptual pipeline walkthrough — how all components fit together |
 | [docs/development.md](docs/development.md) | Dev setup, pre-commit hooks, contributing |
 | [docs/safety.md](docs/safety.md) | Pre-training, post-training, and inference-time safety mitigations; how to report violations |
 ## Citation
 If you find the provided code or models useful for your research, consider citing them as:
 ```bibtex
@misc{ideogram-4-2026,
    author={Ideogram AI},
    title={{Ideogram 4}},
    year={2026},
    howpublished={\url{https://ideogram.ai/blog/ideogram-4.0/}},
 }
 ```
 ## We're Hiring!
 We're looking for **Research Scientists** and **Research Engineers** to
 work on next-generation generative models and the products built on top of
 them. Interested candidates, please apply at https://jobs.ashbyhq.com/ideogram
@@ -0,0 +1,58 @@
 # Development
 ## Editable install
 We recommend installing into an isolated environment — the dependencies include several GB of CUDA-built wheels.
 ```bash
 python -m venv .venv && source .venv/bin/activate
 ```
 For development, install the package in editable mode so changes to the source
 tree are picked up without reinstalling:
 ```bash
 pip install -e .
 ```
 or with [`uv`](https://docs.astral.sh/uv/):
 ```bash
 uv venv && source .venv/bin/activate
 ```
 ```bash
 uv pip install -e .
 ```
 ## Pre-commit hooks
 This repo uses [pre-commit](https://pre-commit.com/) to run lint, format, and
 type checks (`ruff`, `mypy`, etc.) before each commit.
 Install once per clone:
 ```bash
 pip install pre-commit
 pre-commit install
 ```
 `pre-commit install` registers a git hook in `.git/hooks/pre-commit`, so it
 requires the directory to be a git repo. The hooks now run automatically on
 `git commit` against staged files.
 To run the hooks manually against every file in the repo (useful right after
 the first install, or in CI):
 ```bash
 pre-commit run --all-files
 ```
 The first run downloads each hook's environment (ruff, mypy, etc.) into
 `~/.cache/pre-commit/` and may take a minute. Subsequent runs are fast.
 To bump pinned hook versions in `.pre-commit-config.yaml`:
 ```bash
 pre-commit autoupdate
 ```
@@ -0,0 +1,63 @@
 # Inference Reference
 Detailed parameters, sampler presets, supported resolutions, and optimization
 tips for Ideogram 4 inference.
 ## Sampler Presets
 Named presets bundle a step count, per-step CFG schedule, schedule mean (`mu`),
 and schedule standard deviation (`std`) into a single flag:
 ```bash
 python run_inference.py \
  --prompt "a cat wearing a tiny top hat" \
  --sampler-preset V4_QUALITY_48 \
  --output out.png
 ```
 | Preset | Steps | CFG schedule | `mu` | `std` |
 | :----- | :---: | :----------- | :--: | :---: |
 | `V4_QUALITY_48` | 48 | 45 steps @ gw=7, then 3 polish steps @ gw=3 | 0.0 | 1.5 |
 | `V4_DEFAULT_20` | 20 | 18 steps @ gw=7, then 2 polish steps @ gw=3 | 0.0 | 1.75 |
 | `V4_TURBO_12` | 12 | 11 steps @ gw=7, then 1 polish step @ gw=3 | 0.5 | 1.75 |
 `V4_QUALITY_48` is the default. Fewer steps trade quality for speed. The full
 registry lives in
 [`ideogram4.sampler_configs.PRESETS`](../src/ideogram4/sampler_configs.py); add a
 new entry there to define your own.
 ## Key Parameters
 These are the keyword arguments accepted by `Ideogram4Pipeline.__call__`. The
 defaults below apply when you call `pipe(...)` directly; `run_inference.py`
 overrides `num_steps`, `guidance_schedule`, `mu`, and `std` from the chosen
 sampler preset (see above).
 | Parameter | Default | Notes |
 | :-------- | :-----: | :---- |
 | `height` / `width` | 1024 | Must be multiples of 16. Supported range: 256–2048. Aspect ratios up to 6:1 or 1:6. |
 | `num_steps` | 48 | More steps = higher quality. The `V4_QUALITY_48` preset (48 steps) is a good speed/quality trade-off. |
 | `guidance_scale` | 7.0 | Constant guidance weight used when no `guidance_schedule` is given. Higher = more prompt adherence, lower = more diversity. |
 | `guidance_schedule` | `None` | Optional per-step guidance weights (loop-index order: index 0 is the final step). Overrides `guidance_scale`. |
 | `mu` | 0.5 | Logit-normal schedule mean. Auto-adjusted for resolution. |
 | `std` | 1.0 | Logit-normal schedule standard deviation. |
 | `seed` | `None` | Set for reproducible results. |
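To make the relationship between a preset's CFG schedule and `guidance_schedule` concrete, the sketch below expands a "main steps, then polish steps" description into a per-step list and reverses it into the loop-index order noted in the table (index 0 is the final step). The helper name is hypothetical, and it assumes the polish steps are the last ones sampled; the actual entries in `ideogram4.sampler_configs.PRESETS` may be stored differently.

```python
def build_guidance_schedule(main_steps, polish_steps, main_gw=7.0, polish_gw=3.0):
    """Return per-step guidance weights in loop-index order (index 0 = final step)."""
    # Sampling order: main steps first, polish steps last.
    sampling_order = [main_gw] * main_steps + [polish_gw] * polish_steps
    # Loop-index order is the reverse, so the polish weights come first.
    return sampling_order[::-1]


quality = build_guidance_schedule(45, 3)  # mirrors the V4_QUALITY_48 row
turbo = build_guidance_schedule(11, 1)    # mirrors the V4_TURBO_12 row
print(len(quality), quality[:4])
print(len(turbo), turbo[:2])
```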
 ## Supported Resolutions
 Ideogram 4 natively supports any resolution where both height and width are
 multiples of 16, within the range 256–2048 (aspect ratios up to 6:1 or 1:6).
 | Use case | Resolution | Aspect ratio |
 | :------- | :--------: | :----------: |
 | Square | 1024 × 1024 | 1:1 |
 | Landscape | 1536 × 1024 | 3:2 |
 | Portrait | 1024 × 1536 | 2:3 |
 | Widescreen | 1920 × 1088 | ~16:9 |
 | Ultrawide | 2048 × 768 | 8:3 |
 | Phone wallpaper | 1024 × 1792 | ~9:16 |
 | Social banner | 1600 × 400 | 4:1 |
 Resolution buckets use 16-pixel increments, giving fine-grained control over
 output dimensions.
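The constraints above are easy to check before launching a run. The following is a minimal sketch; the function names are hypothetical and not part of the `ideogram4` package.

```python
def is_supported_resolution(width, height, lo=256, hi=2048, step=16, max_ratio=6.0):
    """Check the documented rules: multiples of 16, 256-2048 per side, ratio up to 6:1."""
    for side in (width, height):
        if side % step != 0 or not lo <= side <= hi:
            return False
    return max(width, height) / min(width, height) <= max_ratio


def snap_to_step(value, lo=256, hi=2048, step=16):
    """Round a side length to the nearest multiple of `step`, clamped to [lo, hi]."""
    snapped = (value + step // 2) // step * step
    return max(lo, min(hi, snapped))


print(is_supported_resolution(1920, 1088))  # True
print(is_supported_resolution(1920, 1080))  # False: 1080 is not a multiple of 16
print(snap_to_step(1080))                   # 1088
```

This is why the widescreen row in the table uses 1088 rather than 1080 pixels.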
@@ -0,0 +1,45 @@
 # Model Architecture
 ```
 prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
            │   
            ▼
    ┌──────────────────────────────────────────────────┐
    │    Ideogram4Transformer                          │
    │  • 34 × Ideogram4TransformerBlock                │
    │      – Ideogram4Attention (QK-RMSNorm, MRoPE)    │
    │      – Ideogram4MLP (SwiGLU)                     │
    │      – adaln scale/gate from t-embedding         │
    │  • Ideogram4FinalLayer                           │
    └──────────────────────────────────────────────────┘
            │  velocity prediction
            ▼
    Euler flow-matching sampler with asymmetric CFG
            │  denoised image latents
            ▼
    VAE decode
            │
            ▼
            PIL.Image
 ```
 The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from
 the activation layers) and image latent tokens are concatenated into one
 sequence, modulated per-block by an AdaLN computed from the flow-matching
 timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and
 image tokens share a unified positional space.
 Model spec:
 | field             | value         |
 |-------------------|---------------|
 | `emb_dim`         | 4608          |
 | `num_layers`      | 34            |
 | `num_heads`       | 18            |
 | `intermediate`    | 12288         |
 | `adanln_dim`      | 512           |
 | `rope_theta`      | 5_000_000     |
 | `mrope_section`   | (24, 20, 20)  |
 | latent channels   | 32 × 2² = 128 |
 | max text tokens   | 2048          |
 | sampler           | Euler flow-matching, logit-normal schedule, asymmetric CFG |
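A few quantities follow directly from the table. The sketch below derives the per-head dimension and the latent channel count, and makes a rough parameter estimate for the transformer blocks. The estimate rests on assumptions not stated in the table: four square attention projections (q, k, v, out) and three SwiGLU matrices (gate, up, down) per block, ignoring norms, AdaLN, embeddings, the final layer and any biases.

```python
emb_dim = 4608
num_layers = 34
num_heads = 18
intermediate = 12288

# Per-head width implied by emb_dim and num_heads.
head_dim = emb_dim // num_heads

# Latent channels after 2x2 patching of a 32-channel latent.
latent_channels = 32 * 2 ** 2

# Rough per-block weight count (assumed layout, see the note above).
attn_params = 4 * emb_dim * emb_dim
mlp_params = 3 * emb_dim * intermediate
block_params = attn_params + mlp_params
total_block_params = num_layers * block_params

print(head_dim)                             # 256
print(latent_channels)                      # 128
print(round(total_block_params / 1e9, 2))   # 8.66 (billions, blocks only)
```

Under these assumptions the blocks alone come to roughly 8.7B weights, in the same ballpark as the 9.3B listed in the Model Zoo.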
@@ -0,0 +1,183 @@
 # Pipeline: How All the Components Work Together
 This document explains the end-to-end Ideogram 4 inference pipeline
 conceptually. For the architecture spec and code pointers, see
 [model_architecture.md](model_architecture.md).
 ## Overview
 Ideogram 4 is a **flow-matching text-to-image model** built on a
 **single-stream DiT** (Diffusion Transformer). The pipeline has four main
 components:
 ```
 ┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
 │  Qwen3-VL   │   │  Ideogram4           │   │  KL VAE      │   │           │
 │  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
 │  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
 └─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
     frozen              trainable                 frozen
 ```
 ## 1. Text Encoder — Qwen3-VL-8B-Instruct
 The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
 vision-language model, used in text-only mode (no vision inputs).
 **What it does:**
 - Tokenizes the prompt using the Qwen3 chat template.
 - Runs a forward pass through the 36-layer transformer.
 - **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
  24, 27, 30, 33, 35.
 - Concatenates these hidden states along the feature dimension, producing a
  multi-scale text representation.
 **Why multi-layer extraction?** Different layers capture different levels of
 abstraction — early layers encode surface-level token information, while later
 layers encode deeper semantic meaning. Concatenating them gives the DiT access
 to the full spectrum.
 **Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
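The concatenation step can be illustrated with a toy sketch in plain Python. It uses the 13 layer indices listed above; the function name, the nested-list layout, and the tiny `hidden_dim` of 4 are assumptions made for illustration, not the encoder's real API or sizes. The text does not specify whether index 0 is the embedding output or the first block, so the sketch simply indexes a list of per-layer outputs.

```python
# Layer indices from the list above; everything else here is a toy stand-in.
EXTRACT_LAYERS = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 35]


def concat_hidden_states(hidden_states, layers=EXTRACT_LAYERS):
    """Concatenate per-token features from the selected layers.

    hidden_states[layer][token] is one feature vector (a list of floats).
    Returns one vector per token of length hidden_dim * len(layers).
    """
    num_tokens = len(hidden_states[layers[0]])
    features = []
    for token in range(num_tokens):
        row = []
        for layer in layers:
            row.extend(hidden_states[layer][token])
        features.append(row)
    return features


# Toy input: 36 layers, 2 tokens, hidden_dim = 4 (each layer filled with its index).
toy = [[[float(layer)] * 4 for _ in range(2)] for layer in range(36)]
out = concat_hidden_states(toy)
print(len(out), len(out[0]))  # 2 tokens, 4 * 13 = 52 features each
```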
 ## 2. DiT Backbone — Ideogram4Transformer
 The core generative model is a 34-layer single-stream Diffusion Transformer.
 ### Sequence layout
 Text tokens and image latent tokens are concatenated into one sequence and
 processed through the same self-attention layers.
 ```
 Sequence layout (per sample):
  ┌───────────────────┬────────────────────────┐
  │  text tokens      │  image latent tokens   │
  │  (up to 2048)     │  (grid_h × grid_w)     │
  └───────────────────┴────────────────────────┘
           ▲                    ▲
     Qwen3-VL features    noisy latents z_t
 ```
 ### Key components per block
 - **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
  positional encoding is 3-dimensional: for text tokens it uses a 1D position
  broadcast to 3 axes; for image tokens it uses (temporal, height, width)
  coordinates. This lets text and image tokens coexist in a unified positional
  space.
 - **SwiGLU MLP** — the feed-forward layer uses a gated linear unit with SiLU
  activation.
 - **Adaptive Layer Norm (AdaLN)** — the timestep `t` is embedded as a scalar
  and generates per-block scale and gate parameters. This conditions every layer
  on the current noise level.
 ### Flow matching
 The model is trained with a **flow-matching** objective. Instead of predicting
 noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
 defines the ODE:
 ```
 dz/dt = v(z_t, t)
 ```
 At inference time, we start from pure Gaussian noise `z_1` and integrate
 backward to `z_0` (the clean image) using the Euler method:
 ```
 z_{t-dt} = z_t - v(z_t, t) * dt
 ```
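As a minimal sketch of this backward integration, the loop below runs Euler steps from `t=1` to `t=0` on a toy problem. The uniform time grid and the constant straight-line velocity are assumptions chosen so the result can be checked by hand; the real sampler calls the DiT for the velocity and uses the logit-normal schedule described below.

```python
def euler_sample(velocity, z1, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=1 down to t=0 with Euler steps."""
    z = list(z1)
    for i in range(num_steps):
        t = 1.0 - i / num_steps        # current time
        s = 1.0 - (i + 1) / num_steps  # next, smaller time
        v = velocity(z, t)
        # (s - t) is negative, so each update moves backward along the ODE.
        z = [zi + vi * (s - t) for zi, vi in zip(z, v)]
    return z


# Toy straight-line flow z_t = (1 - t) * z0 + t * z1, whose velocity is z1 - z0.
z0 = [0.5, -1.0, 2.0]     # stand-in for the clean sample
z1 = [1.5, 0.25, -0.75]   # stand-in for the starting noise


def toy_velocity(z, t):
    return [b - a for a, b in zip(z0, z1)]


result = euler_sample(toy_velocity, z1, num_steps=8)
print(result)  # recovers z0 for this toy flow
```

Because the toy velocity is constant, Euler integration is exact here; with a learned velocity field the step count controls the discretization error.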
 ### Noise schedule
 The timestep distribution follows a **logit-normal schedule** parameterized by
 `(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
 different noise levels — higher `mu` shifts more steps toward higher noise
 (important for high-resolution images). The schedule auto-adjusts for
 resolution:
 ```
 mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
 ```
 where `base_pixels = 512 * 512`.
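A small sketch of the resolution adjustment, assuming `log` is the natural logarithm (the formula above does not say which base is used). The function name and the placeholder `mu_base = 0.0` are illustrative, not values from the presets.

```python
import math

BASE_PIXELS = 512 * 512


def adjusted_mu(mu_base, height, width, base_pixels=BASE_PIXELS):
    """Shift the schedule mean by half the log of the pixel-count ratio."""
    return mu_base + 0.5 * math.log((height * width) / base_pixels)


# mu_base = 0.0 is a placeholder, used only to expose the shift itself.
for side in (512, 1024, 2048):
    print(side, round(adjusted_mu(0.0, side, side), 4))
```

Under that assumption, each doubling of the side length quadruples the pixel count and raises `mu` by `ln 2`, about 0.693.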
 ## 3. Classifier-Free Guidance (CFG)
 At each sampling step, two forward passes are run through the DiT:
 1. **Conditional (positive):** full text features + noisy image latents.
 2. **Unconditional (negative):** zeroed text features + noisy image latents
   (image-only tokens, asymmetric CFG).
 The guided velocity is a weighted combination:
 ```
 v_guided = gw * v_conditional + (1 - gw) * v_unconditional
 ```
 where `gw` is the per-step guidance weight. With
 `gw > 1`, the model amplifies the text-conditional signal and suppresses the
 unconditional prediction, producing images that follow the prompt more
 faithfully.
 **Asymmetric CFG:** The unconditional branch only processes image tokens (no
 text padding), making it computationally cheaper than a full-sequence negative
 pass.
 **Per-step schedules:** The guidance weight can vary across steps. The
 `V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
 `gw=3` for the final 3 "polish" steps near `t=0`.
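The combination rule and a two-phase schedule can be sketched as follows. The helper names and the plain-list representation are assumptions for illustration; they are not how `PRESETS` stores its `guidance_schedule`. The list is in sampling order, so index 0 is the first step at `t=1`.

```python
def cfg_combine(v_cond, v_uncond, gw):
    """Guided velocity: gw * v_cond + (1 - gw) * v_uncond, element-wise."""
    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]


def two_phase_schedule(num_steps, main_gw, polish_gw, polish_steps):
    """Guidance weights in sampling order: main phase, then final polish steps."""
    return [main_gw] * (num_steps - polish_steps) + [polish_gw] * polish_steps


# Mirrors the 45 + 3 split described for V4_QUALITY_48.
schedule = two_phase_schedule(48, main_gw=7.0, polish_gw=3.0, polish_steps=3)

v = cfg_combine([1.0, 0.0], [0.5, 0.5], schedule[0])
print(v)  # [4.0, -3.0]
```

The same expression can be written as `v_uncond + gw * (v_cond - v_uncond)`, which makes it clear that `gw = 1` returns the conditional prediction unchanged.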
 ## 4. VAE Decoder — KL Autoencoder
 The denoised latent `z_0` is decoded to pixel space using a frozen KL
 autoencoder.
 **What it does:**
 - **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
  input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
  `(batch, channels, grid_h * 2, grid_w * 2)`.
 - **Denormalization:** Per-channel shift and scale are applied to undo the
  latent normalization used during training.
 - **Decoding:** The VAE decoder maps latents to RGB pixels.
 - **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
 **Compression factor:** The autoencoder provides 8× spatial compression on each
 axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
 is represented as a 64×64 grid of latent tokens, each with 128 channels
 (32 base channels × 2² patch).
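The arithmetic above can be checked with a short helper. The function name is hypothetical, and the requirement that both sides divide evenly by 16 is an assumption of this sketch rather than a documented constraint.

```python
def latent_token_grid(height, width, vae_factor=8, patch=2, base_channels=32):
    """Return (grid_h, grid_w, channels_per_token) for an image size in pixels."""
    stride = vae_factor * patch  # 8x from the VAE, 2x from patching
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return height // stride, width // stride, base_channels * patch * patch


print(latent_token_grid(1024, 1024))  # (64, 64, 128)
print(latent_token_grid(1536, 1024))  # (96, 64, 128)
```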
 ## Putting it all together
 ```python
 # Pseudocode for one generation call:
 # 1. Encode text
 text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)
 # 2. Initialize noise
 z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1
 # 3. Euler integration from t=1 to t=0
 for step in reversed(range(num_steps)):
    t = schedule(step)
    s = schedule(step - 1)
    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)
    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(zeros, z, t)
    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond
    # Euler step
    z = z + v * (s - t)
 # 4. Decode to pixels
 image = vae.decode(z)
 ```
@@ -0,0 +1,362 @@
 # Prompting Guide
 Ideogram 4 is trained exclusively on **structured JSON captions** (serialized as strings). While the
 model can accept plain-text prompts, providing a JSON object that follows the
 caption schema gives significantly better results, especially for
 controllability, spatial layout, and style fidelity.
 ## Plain-text vs. JSON prompts
 You can pass plain-text prompts directly to the model and they will work. The
 sampling parameters come from a named preset in `ideogram4.PRESETS` (the same
 ones `run_inference.py` exposes via `--sampler-preset`), unpacked into the
 `pipe()` call:
 ```python
 from ideogram4 import PRESETS
 preset = PRESETS["V4_QUALITY_48"]
 images = pipe(
  "a golden retriever on a skateboard",
  height=1024,
  width=1024,
  num_steps=preset.num_steps,
  guidance_schedule=preset.guidance_schedule,
  mu=preset.mu,
  std=preset.std,
 )
 ```
 For higher-quality generations and more control, pass a JSON string as the prompt:
 ```python
 import json
 from ideogram4 import PRESETS
 caption = {
  "high_level_description": "A golden retriever riding a skateboard down a sunny sidewalk.",
  "style_description": {
    "aesthetics": "warm, playful, vibrant",
    "lighting": "bright afternoon sunlight, long soft shadows",
    "photo": "shallow depth of field, eye-level, 85mm lens",
    "medium": "photograph",
    "color_palette": ["#F5C542", "#87CEEB", "#4A4A4A", "#FFFFFF", "#2E8B57"]
  },
  "compositional_deconstruction": {
    "background": "A sun-drenched suburban sidewalk lined with green hedges and a white picket fence. Dappled light filters through overhead trees.",
    "elements": [
      {"type": "obj", "bbox": [200, 300, 800, 900], "desc": "A golden retriever with a fluffy coat, standing on a red skateboard with all four paws. Its tongue is out and ears are flapping in the wind."},
      {"type": "obj", "bbox": [250, 750, 750, 950], "desc": "A worn red skateboard with black wheels rolling along the concrete sidewalk."}
    ]
  }
 }
 preset = PRESETS["V4_QUALITY_48"]
 images = pipe(
  json.dumps(caption, separators=(",", ":"), ensure_ascii=False),
  height=1024,
  width=1024,
  num_steps=preset.num_steps,
  guidance_schedule=preset.guidance_schedule,
  mu=preset.mu,
  std=preset.std,
 )
 ```
 ## Magic prompt
 Writing these captions by hand is optional. *Magic prompt* uses an LLM to expand
 a plain-text prompt into a full structured caption for you, so you get the
 quality of a JSON prompt from a casual one. It is enabled by default in
 `run_inference.py`; you can also call it directly:
 ```python
 import os
 from ideogram4 import ClaudeOpusMagicPromptV1, PRESETS
 magic = ClaudeOpusMagicPromptV1(api_key=os.environ["MAGIC_PROMPT_API_KEY"])
 caption = magic.expand("a golden retriever on a skateboard", aspect_ratio="1:1")
 preset = PRESETS["V4_QUALITY_48"]
 images = pipe(
  caption,
  height=1024,
  width=1024,
  num_steps=preset.num_steps,
  guidance_schedule=preset.guidance_schedule,
  mu=preset.mu,
  std=preset.std,
 )
 ```
 The package ships three configurations, registered by name in
 `ideogram4.MAGIC_PROMPTS` (the keys `run_inference.py` accepts via
 `--magic-prompt-model`):
 | Config class | Registry key | Backend |
 | :--- | :--- | :--- |
 | `Ideogram4MagicPromptV1` | `ideogram-4-v1` | Ideogram's hosted magic-prompt API (free; reads `IDEOGRAM_API_KEY`) |
 | `ClaudeOpusMagicPromptV1` | `claude-opus-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
 | `ClaudeSonnetMagicPromptV1` | `claude-sonnet-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
 `ideogram-4-v1` is the default and is **free**. It runs the expansion
 server-side, so there is no local model or system prompt involved — it just needs
 an Ideogram API key (get one at
 [developer.ideogram.ai](https://developer.ideogram.ai)). The `claude-*`
 configurations instead send one of our open-source system prompts to an OpenRouter model;
 select one with `--magic-prompt-model` and export `MAGIC_PROMPT_API_KEY`:
 ```bash
 python run_inference.py \
  --prompt "an isometric illustration of a tiny city floating in the clouds" \
  --output out.png \
  --quantization "nf4" \
  --magic-prompt-model claude-opus-v1 \
  --magic-prompt-key "$MAGIC_PROMPT_API_KEY"
 ```
 See the README's [CLI](../README.md#cli) section for the rest of the flags.
 Our magic-prompt system prompts are **open source** (they ship in
 `src/ideogram4/magic_prompt_system_prompts/`), so you're also welcome to
 construct the caption with any system prompt and LLM of your choosing.
 **A few caveats:**
 - At Ideogram we've tested this magic prompt with **Claude Opus**. You're welcome
  to implement your own `MagicPrompt` configurations and/or drive a different LLM
  with our system prompt, but those paths aren't tested by us and quality may
  vary.
 - The magic prompt shipped here is **not** the same magic prompt used in
  production at [Ideogram.ai](https://ideogram.ai) — results will differ from the
  hosted product (including the `ideogram-4-v1` API).
 ## JSON caption schema
 > **Note:** Following this schema is **not required** — the model accepts any
 > string as a prompt. The schema below describes the exact structure the model
 > was trained on, and matching it minimizes train/eval mismatch so the model
 > generates closer to its full quality. Treat the "required" / "must" language
 > in the rest of this section as the format the [`CaptionVerifier`](../src/ideogram4/caption_verifier.py)
 > checks against, not as a hard pipeline constraint. Deviating from the schema
 > is allowed; it just means you're sampling outside the training distribution.
 The full caption schema has three top-level fields:
 1. `high_level_description` — optional string, but strongly recommended.
 2. `style_description` — optional object.
 3. `compositional_deconstruction` — **required** object.
 `compositional_deconstruction` must always be present. Within it, both
 `background` and `elements` are required.
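A minimal structural check for these top-level rules might look like the sketch below. It is a hypothetical helper written for illustration and is not the `CaptionVerifier` implementation; the message strings are made up.

```python
TOP_LEVEL_KEYS = (
    "high_level_description",
    "style_description",
    "compositional_deconstruction",
)


def check_top_level(caption):
    """Return a list of problems with the caption's top-level structure."""
    problems = [f"unknown key: {key}" for key in caption if key not in TOP_LEVEL_KEYS]
    deconstruction = caption.get("compositional_deconstruction")
    if not isinstance(deconstruction, dict):
        problems.append("missing required object: compositional_deconstruction")
    else:
        for key in ("background", "elements"):
            if key not in deconstruction:
                problems.append(f"missing required key: {key}")
    return problems


minimal = {"compositional_deconstruction": {"background": "A plain wall.", "elements": []}}
print(check_top_level(minimal))  # []
print(check_top_level({"style": {}}))
```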
 ### `high_level_description`
 A one- or two-sentence summary of the entire image. Strongly recommended in every prompt.
 ```json
 "high_level_description": "A medium-shot photograph of a barista pouring latte art in a cozy cafe."
 ```
 ### `style_description`
 Controls the visual style, lighting, medium, and color palette.
 `style_description` must contain **exactly one** of:
 - `photo` — for photographic captions (paired with `medium: "photograph"`).
 - `art_style` — for non-photographic captions (illustration, painting, 3D render, etc.).
 `aesthetics`, `lighting`, and `medium` are also required when `style_description` is present. `color_palette` is optional.
 **Key order is strict** and depends on which of `photo` / `art_style` is used:
 | Caption type | Required key order |
 | :----------- | :----------------- |
 | Photo (uses `photo`) | `aesthetics`, `lighting`, `photo`, `medium`, `color_palette` |
 | Non-photo (uses `art_style`) | `aesthetics`, `lighting`, `medium`, `art_style`, `color_palette` |
 `color_palette` is the only field in this list that may be omitted; if it is included it must remain in the final position.
 Field descriptions:
 | Field | Type | Description |
 | :---- | :--- | :---------- |
 | `aesthetics` | string | Aesthetic keywords (e.g. "moody, cinematic, desaturated") |
 | `lighting` | string | Lighting description (e.g. "golden hour, rim light, dramatic shadows") |
 | `photo` | string | Camera/lens details for photographic outputs (e.g. "35mm, f/1.4, bokeh"). Use this OR `art_style`, not both. |
 | `medium` | string | Medium type: `"photograph"`, `"illustration"`, `"3d_render"`, `"painting"`, `"graphic_design"`, etc. |
 | `art_style` | string | Art style description for non-photo captions (e.g. "flat vector illustration, bold outlines"). Use this OR `photo`, not both. |
 | `color_palette` | list[str] | Hex color codes that steer the image's dominant colors. Up to 16 entries. |
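
Since key order matters, it can help to rebuild `style_description` in the required order before serializing. The helper below is a hypothetical sketch that relies on Python dicts preserving insertion order (guaranteed since Python 3.7); it silently drops keys that are not in the tables above, which is a design choice of this sketch.

```python
PHOTO_ORDER = ["aesthetics", "lighting", "photo", "medium", "color_palette"]
ART_ORDER = ["aesthetics", "lighting", "medium", "art_style", "color_palette"]


def order_style_description(style):
    """Return a copy of `style` with keys in the order the schema expects."""
    has_photo, has_art = "photo" in style, "art_style" in style
    if has_photo == has_art:  # both present or both missing
        raise ValueError("need exactly one of 'photo' or 'art_style'")
    order = PHOTO_ORDER if has_photo else ART_ORDER
    missing = [k for k in order if k != "color_palette" and k not in style]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return {key: style[key] for key in order if key in style}


unordered = {"medium": "photograph", "photo": "35mm, f/1.4", "lighting": "soft", "aesthetics": "moody"}
print(list(order_style_description(unordered)))
# ['aesthetics', 'lighting', 'photo', 'medium']
```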
 ### `compositional_deconstruction`
 Provides fine-grained spatial control over the image layout using bounding
 boxes and per-element descriptions. Both fields below are required.
 | Field | Type | Description |
 | :---- | :--- | :---------- |
 | `background` | string | Description of the background/environment (required) |
 | `elements` | list[dict] | List of elements with optional bounding boxes (required) |
 `background` must come before `elements`.
 Each element in `elements` must follow a fixed **key order** depending on its
 type. `bbox` and `color_palette` are optional within an element; if present they
 must appear in the positions shown below.
 | Type | Required key order |
 | :--- | :----------------- |
 | `"obj"` | `type`, `bbox`, `desc`, `color_palette` |
 | `"text"` | `type`, `bbox`, `text`, `desc`, `color_palette` |
 Field descriptions:
 | Field | Type | Description |
 | :---- | :--- | :---------- |
 | `type` | string | `"obj"` for objects/subjects, `"text"` for in-image text |
 | `bbox` | list[int] | `[y_min, x_min, y_max, x_max]` in normalized `0–1000` coordinates (origin at top-left). Optional. |
 | `desc` | string | Detailed description of the element |
 | `text` | string | (only for `type: "text"`) The literal text to render |
 | `color_palette` | list[str] | Optional per-element palette. Up to 5 hex entries. |
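
If your layout is defined in pixels, a conversion to the normalized `bbox` format could look like this sketch. The function name is hypothetical, and rounding to the nearest integer and clamping to `0–1000` are assumptions; the schema only fixes the axis order and the coordinate range.

```python
def pixel_box_to_bbox(left, top, right, bottom, image_width, image_height):
    """Convert a pixel rectangle to [y_min, x_min, y_max, x_max] in 0-1000 units."""

    def normalize(value, size):
        return max(0, min(1000, round(1000 * value / size)))

    return [
        normalize(top, image_height),     # y_min
        normalize(left, image_width),     # x_min
        normalize(bottom, image_height),  # y_max
        normalize(right, image_width),    # x_max
    ]


# A box from (256, 128) to (768, 896) on a 1024x1024 canvas.
print(pixel_box_to_bbox(256, 128, 768, 896, 1024, 1024))  # [125, 250, 875, 750]
```

Note that the y coordinates come first, which is easy to get wrong when converting from `(x, y, w, h)` style boxes.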
 **Key ordering matters.** The model was trained on JSON with a consistent key
 order, so maintaining it improves generation quality. The pipeline runs
 [`CaptionVerifier`](../src/ideogram4/caption_verifier.py) on every prompt and emits
 warnings for unknown keys, missing required keys, or out-of-order keys.
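One way to keep element keys in order is to rebuild each element from the order tables above before serializing. This is a hypothetical helper, not part of the package; it drops keys that are not in the table and raises `KeyError` for an unknown `type`.

```python
ELEMENT_ORDER = {
    "obj": ["type", "bbox", "desc", "color_palette"],
    "text": ["type", "bbox", "text", "desc", "color_palette"],
}


def order_element(element):
    """Return a copy of `element` with keys in the order required for its type."""
    order = ELEMENT_ORDER[element["type"]]
    return {key: element[key] for key in order if key in element}


scrambled = {"desc": "Bold company name.", "text": "ACME TECH", "type": "text"}
print(list(order_element(scrambled)))  # ['type', 'text', 'desc']
```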
 **Hex color format.** Colors in `color_palette` must be uppercase
 `#RRGGBB` strings (e.g. `#1B1B2F`, not `#1b1b2f` or `#fff`).
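A quick check for this format, as a sketch (the helper name is illustrative and this is not the verifier's own code):

```python
import re

# '#' followed by exactly six uppercase hexadecimal digits.
HEX_COLOR = re.compile(r"#[0-9A-F]{6}")


def is_valid_hex(color):
    """True only for uppercase #RRGGBB strings."""
    return isinstance(color, str) and HEX_COLOR.fullmatch(color) is not None


for candidate in ("#1B1B2F", "#1b1b2f", "#fff"):
    print(candidate, is_valid_hex(candidate))
```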
 **Encoding.** When serializing with Python's `json` module, pass
 `separators=(",", ":")` and `ensure_ascii=False`.
 `CaptionVerifier` warns when it detects `\uXXXX` escapes with no literal
 non-ASCII characters in the raw text.
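The difference between the recommended settings and the `json.dumps` defaults is easy to see on a caption containing a non-ASCII character. The caption text here is made up for the example.

```python
import json

caption = {
    "compositional_deconstruction": {
        "background": "A café terrace at dusk.",
        "elements": [],
    }
}

# Recommended: compact separators, literal non-ASCII characters.
compact = json.dumps(caption, separators=(",", ":"), ensure_ascii=False)

# Defaults: spaces after separators and \uXXXX escapes for non-ASCII text.
default = json.dumps(caption)

print(compact)
print(default)
```

The default output escapes `é` as `\u00e9`, which is the pattern the warning above is looking for.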
 ## Color palette conditioning
 One of Ideogram 4's distinctive features is **color palette control**. By
 providing a `color_palette` array of hex colors in `style_description`, you
 can steer the dominant colors of the generated image.
 ```json
 "style_description": {
  "aesthetics": "moody, cinematic",
  "lighting": "low-key, deep shadows",
  "photo": "35mm, f/1.4",
  "medium": "photograph",
  "color_palette": ["#1B1B2F", "#162447", "#1F4068", "#E43F5A", "#F5F5F5"]
 }
 ```
 Tips for effective color palette use:
 - **Up to 16 colors** in `style_description.color_palette` for the overall
  image palette, and **up to 5 colors** per element in
  `compositional_deconstruction.elements[*].color_palette`.
 - **Include background colors** — if you want a dark background, include the
  dark hex in the palette.
 - **Contrast pairs** — include both your highlight and shadow colors for more
  controlled lighting.
 - **Uppercase hex only** — `#RRGGBB` form, no shorthand.
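
Palettes picked in design tools often arrive lowercase or in shorthand. The sketch below normalizes them to the expected form and enforces the size limits from the first tip; the helper is hypothetical, and expanding `#RGB` shorthand (rather than rejecting it) is a convenience choice of this sketch.

```python
HEX_DIGITS = set("0123456789ABCDEF")


def normalize_palette(colors, limit=16):
    """Uppercase colors, expand #RGB shorthand, and enforce the entry limit."""
    normalized = []
    for color in colors:
        color = color.strip().upper()
        if len(color) == 4 and color.startswith("#"):
            # '#FFF' -> '#FFFFFF'
            color = "#" + "".join(ch * 2 for ch in color[1:])
        if len(color) != 7 or color[0] != "#" or not set(color[1:]) <= HEX_DIGITS:
            raise ValueError(f"not a hex color: {color!r}")
        normalized.append(color)
    if len(normalized) > limit:
        raise ValueError(f"palette has {len(normalized)} colors, limit is {limit}")
    return normalized


print(normalize_palette(["#fff", "#1b1b2f"]))  # ['#FFFFFF', '#1B1B2F']
```

Pass `limit=5` when normalizing a per-element palette.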
 ### Example: warm sunset palette
 ```json
 {
  "high_level_description": "A lone sailboat on calm water at sunset.",
  "style_description": {
    "aesthetics": "serene, warm, golden hour",
    "lighting": "golden hour backlighting, warm atmospheric haze",
    "photo": "wide angle, f/8, long exposure",
    "medium": "photograph",
    "color_palette": ["#FF6B35", "#F7C59F", "#004E89", "#1A659E", "#2B2D42"]
  },
  "compositional_deconstruction": {
    "background": "A calm ocean stretching to a low horizon, sky washed in orange and pink with thin wisps of cloud.",
    "elements": [
      {"type": "obj", "desc": "A single sailboat with a white triangular sail, silhouetted against the setting sun."}
    ]
  }
 }
 ```
 ### Example: corporate design palette
 ```json
 {
  "high_level_description": "A clean, modern business card layout for a tech company.",
  "style_description": {
    "aesthetics": "minimal, professional, geometric",
    "lighting": "even, diffuse studio lighting",
    "medium": "graphic_design",
    "art_style": "flat vector design, generous whitespace, sans-serif typography",
    "color_palette": ["#FFFFFF", "#F0F0F0", "#333333", "#0066FF", "#00CC88"]
  },
  "compositional_deconstruction": {
    "background": "A solid off-white card surface with subtle paper texture.",
    "elements": [
      {"type": "text", "text": "ACME TECH", "desc": "Bold dark grey sans-serif company name across the upper third of the card."},
      {"type": "text", "text": "hello@acme.tech", "desc": "Small blue sans-serif contact email near the bottom of the card."}
    ]
  }
 }
 ```
 ## Full example
 ```json
 {
  "high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
  "style_description": {
    "aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
    "lighting": "overcast daylight, diffused, soft subtle shadows",
    "photo": "shallow depth of field, sharp focus, eye-level, telephoto",
    "medium": "photograph"
  },
  "compositional_deconstruction": {
    "background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
    "elements": [
      {"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
      {"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
      {"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
      {"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
      {"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
      {"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
      {"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
      {"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
      {"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
      {"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
      {"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
      {"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
      {"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
      {"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
      {"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
      {"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
      {"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
    ]
  }
 }
 ```
 ## Safety filter
 NSFW prompts are blocked. Instead of an image, the model returns a gray screen
 with the text "Image blocked by safety filter". The false-positive rate is
 higher for prompts that are not JSON. We are aware of this issue and may
 release an updated checkpoint to improve it.
 # Congratulations!
 You are now a certified Ideogram 4 prompter!
 With structured JSON captions, you have fine-grained control over composition,
 color palettes, typography, and spatial layout — capabilities that go far
 beyond what plain-text prompts can express!
 We'd love to see what you create :-)
 Share your results, experiments, and creative discoveries with the community,
 especially the unexpected ones. Tag us on social media or open a discussion on
 the repo. Happy generating!