Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
+15
@@ -0,0 +1,15 @@
# Python
__pycache__/
*.py[cod]
# Generated at runtime / regenerated from code on first launch
translations.json
# User-specific and runtime state (not part of the application source)
comfy_settings.json
draft.json
prompt_library.json
prompt_previews/
# Editor / agent
.claude/
+159
@@ -0,0 +1,159 @@
# Ideogram 4 Prompt Builder
**English** · [Русский](#ideogram-4-prompt-builder-ru)
A desktop GUI (PyQt6) for building structured JSON captions for **Ideogram 4** and ComfyUI workflows, with a prompt library, reference-image canvas, localisation, light/dark themes, and direct generation through a ComfyUI server.
![English interface](eng-vlack.png)
## Run
```powershell
python ideogram_prompt_builder.py
```
Requires `PyQt6` (no other third-party dependencies):
```powershell
pip install PyQt6
```
## What it builds
Prompts follow the schema from `docs/prompting.md`:
- `high_level_description`
- `style_description` with either `photo` or `art_style`
- `compositional_deconstruction.background`
- `compositional_deconstruction.elements`
- optional uppercase HEX color palettes
- optional bounding boxes in normalized `0-1000` coordinates
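As a rough illustration of the fields listed above, the following sketch assembles a caption as a plain Python dict and checks the two optional controls. The `#RRGGBB` form of the colors, the placement of `colour_palette` under `style_description`, the per-element keys (`description`, `bbox`), the `[x1, y1, x2, y2]` box order, and the helper names are all assumptions made for illustration; `docs/prompting.md` remains the authority on the schema.

```python
import re

# Assumption: palette entries look like '#RRGGBB' with uppercase digits.
HEX_RE = re.compile(r"#[0-9A-F]{6}")


def is_valid_hex(color: str) -> bool:
    """Return True for uppercase 6-digit HEX colors such as '#1A2B3C'."""
    return HEX_RE.fullmatch(color) is not None


def is_valid_bbox(bbox: list[int]) -> bool:
    """Check a box given in normalized 0-1000 coordinates.

    Assumes the (hypothetical) order [x1, y1, x2, y2] with x1 < x2 and y1 < y2.
    """
    if len(bbox) != 4 or not all(0 <= v <= 1000 for v in bbox):
        return False
    x1, y1, x2, y2 = bbox
    return x1 < x2 and y1 < y2


# Hypothetical caption: key names below the top level are assumptions.
caption = {
    "high_level_description": "A ginger cat reading a spellbook",
    "style_description": {
        "art_style": "storybook watercolor illustration",
        "colour_palette": ["#F2A65A", "#2E294E"],
    },
    "compositional_deconstruction": {
        "background": "a dim library lit by candles",
        "elements": [
            {"description": "ginger cat in a wizard hat", "bbox": [250, 300, 750, 950]},
        ],
    },
}
```

Checks of this kind are what the validation list in the **JSON** tab would plausibly report, although the app's actual rules may differ.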
Actions live in a menu bar (**File / Edit / Library / ComfyUI / View**) plus a slim toolbar (Generate, Undo/Redo, Save to library, Library, Copy) and the language/theme controls on the right. The right-hand panel is tabbed: **JSON** (output + validation) and **Result** (the generated image).
## Editing
- Move and resize layout boxes directly with the mouse on the bbox canvas.
- Palette fields accept comma-separated HEX, clickable swatches and a popup color picker, with a live `n/limit` counter and invalid-color highlighting.
- **Undo / Redo** (`Ctrl+Z` / `Ctrl+Y`).
- **Duplicate**, **reorder** (up/down) and add elements from **templates** (Character / Title text / Background object).
- The validation list is clickable — clicking an element-specific message selects that element.
- Text fields have a right-click translation menu (`Translate to RU` / `Translate to EN`, results cached).
- Work is autosaved to `draft.json`; on the next launch the app offers to restore it.
## Reference image & zoom
In the composition panel you can load a **reference image** (file or paste from clipboard) drawn under the bbox grid; the **grid scale** slider zooms the grid and the reference scales with it.
## Prompt library
The **Library** menu saves the current caption (optionally with a preview image), updates the entry you loaded from, and opens the library browser, where you can:
- search by name / tag / description and edit per-entry **tags**;
- load any saved prompt back into the editor for reuse and editing;
- attach a preview from a file or **paste it from the clipboard**, or remove it;
- rename, delete entries, and view the preview + summary;
- **export / import** the whole library (prompts + previews) as a single `.zip`.
The library is stored in `prompt_library.json` next to the app, with preview images in `prompt_previews/` (created on first save).
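The export / import round trip can be pictured with the standard `zipfile` module. This is only a sketch under assumptions: the archive layout (the JSON file at the root, previews under `prompt_previews/`) and the function names are hypothetical, and the application may pack its archive differently.

```python
import zipfile
from pathlib import Path


def export_library(app_dir: Path, archive: Path) -> list[str]:
    """Pack prompt_library.json and prompt_previews/ into one .zip.

    Returns the archive member names that were written.
    """
    written: list[str] = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        library = app_dir / "prompt_library.json"
        if library.is_file():
            zf.write(library, library.name)
            written.append(library.name)
        previews = app_dir / "prompt_previews"
        if previews.is_dir():
            for path in sorted(previews.iterdir()):
                if path.is_file():
                    name = f"prompt_previews/{path.name}"
                    zf.write(path, name)
                    written.append(name)
    return written


def import_library(archive: Path, app_dir: Path) -> None:
    """Unpack an archive produced by export_library into app_dir."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(app_dir)
```

Note that this naive import overwrites existing files; merging entries with an existing library would need extra logic.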
## ComfyUI integration
The **ComfyUI** menu connects the builder to a running ComfyUI server:
- **ComfyUI settings** — host, port and HTTPS, with a *Test connection* button. Stored in `comfy_settings.json`.
- **Check ComfyUI** — verifies that every model, sampler and custom node the bundled `ideogram4NSFWComfyui_v11.json` workflow needs is installed on the server, and lists anything missing.
- **Generate in ComfyUI** — converts the bundled workflow to API format, injects the current compact JSON caption, submits it and retrieves the generated image. The result appears in the **Result** tab and can be saved to a file or into the library.
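The generation step can be sketched with the standard library alone. The sketch assumes a workflow that is already in ComfyUI's API format (node ids mapping to `class_type` and `inputs`) and submits it to the server's `/prompt` endpoint. The `node_id` and `input_name` arguments are hypothetical, since they depend on the bundled workflow, and the helper names are illustrative rather than taken from the application.

```python
import json
import urllib.request


def base_url(host: str, port: int, https: bool = False) -> str:
    """Build the server root from the host / port / HTTPS settings."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}"


def inject_caption(workflow: dict, node_id: str, input_name: str, caption: dict) -> dict:
    """Return a copy of an API-format workflow with the compact caption set.

    node_id and input_name are hypothetical: they depend on the workflow.
    """
    patched = json.loads(json.dumps(workflow))  # cheap deep copy of JSON data
    patched[node_id]["inputs"][input_name] = json.dumps(
        caption, ensure_ascii=False, separators=(",", ":")
    )
    return patched


def queue_prompt(root: str, workflow: dict, timeout: float = 30.0) -> str:
    """POST the workflow to ComfyUI's /prompt endpoint and return the prompt id."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{root}/prompt", data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["prompt_id"]
```

The returned id can then be used to poll the server's history for the finished image; that part is omitted here.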
## Appearance & localisation
- **Theme** (View menu) toggles a light / dark theme.
- The interface language is switched at runtime from the **Language** selector; the default is **English**.
UI strings are loaded from `translations.json`, which is created on first run from the bundled `en` / `ru` translations. To add a language, add a top-level key containing the same string keys (and optionally a display name in `LANGUAGE_NAMES`). Missing keys fall back to English, then to the key name. Theme and language are saved in `comfy_settings.json`.
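The fallback order described above (requested language, then English, then the key itself) can be sketched as a small lookup. The function name and the in-memory shape of the translations (a dict of language code to string table) are assumptions based on the description of `translations.json`.

```python
def translate(translations: dict[str, dict[str, str]], lang: str, key: str) -> str:
    """Look up key in lang, then in English, then return the key itself."""
    for code in (lang, "en"):
        text = translations.get(code, {}).get(key)
        if text is not None:
            return text
    return key


# Hypothetical string tables for illustration only.
TABLE = {
    "en": {"menu.file": "File", "menu.view": "View"},
    "ru": {"menu.file": "Файл"},
}
```

With this table, a Russian lookup of `menu.view` falls back to the English string, and an unknown key comes back unchanged, so a missing translation never raises an error in the UI.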
## Compact JSON for ComfyUI
The output can be copied in pretty or compact form. Compact JSON matches the recommended serialization style for inference and can be pasted into the Ideogram 4 prompt field in ComfyUI.
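The difference between the two forms can be shown with `json.dumps`. This sketch assumes that "compact" means no whitespace between tokens and that non-ASCII text is left unescaped; the exact serializer settings used by the app are not stated here, and the `photo` value is a made-up example.

```python
import json

caption = {
    "high_level_description": "A cat in a hat",
    "style_description": {"photo": "35mm film"},
}

# Pretty form: indented, easy to read and diff.
pretty = json.dumps(caption, ensure_ascii=False, indent=2)

# Compact form: no spaces after ',' or ':', one line.
compact = json.dumps(caption, ensure_ascii=False, separators=(",", ":"))
```

Both strings parse back to the same object; only the whitespace differs.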
---
<a name="ideogram-4-prompt-builder-ru"></a>
# Ideogram 4 Prompt Builder (RU)
[English](#ideogram-4-prompt-builder) · **Русский**
Десктопное GUI-приложение (PyQt6) для сборки структурированных JSON-промтов для **Ideogram 4** и ComfyUI: с библиотекой промтов, холстом с референс-изображением, локализацией, светлой/тёмной темой и прямой генерацией через сервер ComfyUI.
![Русский интерфейс](ru-white.png)
## Запуск
```powershell
python ideogram_prompt_builder.py
```
Нужен только `PyQt6` (других сторонних зависимостей нет):
```powershell
pip install PyQt6
```
## Что собирается
Промты соответствуют схеме из `docs/prompting.md`:
- `high_level_description`
- `style_description` с одним из `photo` или `art_style`
- `compositional_deconstruction.background`
- `compositional_deconstruction.elements`
- опциональные палитры HEX в верхнем регистре
- опциональные bbox в нормализованных координатах `0-1000`
Действия вынесены в меню (**Файл / Правка / Библиотека / ComfyUI / Вид**) плюс компактная панель инструментов (Сгенерировать, Отменить/Повторить, Сохранить в библиотеку, Библиотека, Копировать) и переключатели языка/темы справа. Правая панель — вкладки: **JSON** (вывод + валидация) и **Результат** (сгенерированное изображение).
## Редактирование
- Перемещайте и масштабируйте рамки прямо мышью на холсте bbox.
- Поля палитры принимают HEX через запятую, кликабельные образцы и всплывающий выбор цвета, со счётчиком `n/лимит` и подсветкой некорректных цветов.
- **Отмена / Повтор** (`Ctrl+Z` / `Ctrl+Y`).
- **Дублирование**, **изменение порядка** (вверх/вниз) и добавление элементов из **шаблонов** (Персонаж / Заголовок / Фоновый объект).
- Список валидации кликабельный — клик по сообщению об элементе выделяет этот элемент.
- У текстовых полей есть контекстное меню перевода (`Перевести на RU` / `Перевести на EN`, с кэшированием).
- Работа автосохраняется в `draft.json`; при следующем запуске предлагается восстановить черновик.
## Референс-изображение и масштаб
В панели композиции можно загрузить **референс-изображение** (из файла или вставить из буфера), которое рисуется под сеткой bbox; ползунок **масштаба сетки** увеличивает сетку, и референс масштабируется вместе с ней.
## Библиотека промтов
Меню **Библиотека** сохраняет текущий промт (по желанию с превью), обновляет загруженную запись и открывает браузер библиотеки, где можно:
- искать по имени / тегам / описанию и редактировать **теги** записи;
- загрузить любой сохранённый промт обратно в редактор для повторного использования и правки;
- прикрепить превью из файла или **вставить из буфера обмена**, либо убрать его;
- переименовывать, удалять записи и просматривать превью + сводку;
- **экспортировать / импортировать** всю библиотеку (промты + превью) одним `.zip`.
Библиотека хранится в `prompt_library.json` рядом с приложением, превью — в `prompt_previews/` (создаются при первом сохранении).
## Интеграция с ComfyUI
Меню **ComfyUI** связывает приложение с запущенным сервером ComfyUI:
- **Настройки ComfyUI** — хост, порт и HTTPS, с кнопкой *Проверить соединение*. Хранятся в `comfy_settings.json`.
- **Проверить ComfyUI** — проверяет, что все модели, семплеры и кастомные ноды, нужные встроенному workflow `ideogram4NSFWComfyui_v11.json`, установлены на сервере, и перечисляет отсутствующие.
- **Сгенерировать в ComfyUI** — конвертирует встроенный workflow в API-формат, подставляет текущий compact JSON, отправляет запрос и получает изображение. Результат показывается во вкладке **Результат** и может быть сохранён в файл или в библиотеку.
## Внешний вид и локализация
- **Тема** (меню Вид) переключает светлую / тёмную тему.
- Язык интерфейса переключается на лету через селектор **Язык**; по умолчанию — английский.
Строки интерфейса берутся из `translations.json`, который создаётся при первом запуске из встроенных переводов `en` / `ru`. Чтобы добавить язык, добавьте ключ верхнего уровня с тем же набором строк (и при желании отображаемое имя в `LANGUAGE_NAMES`). Отсутствующие ключи откатываются к английскому, затем к самому ключу. Тема и язык сохраняются в `comfy_settings.json`.
## Compact JSON для ComfyUI
Вывод можно скопировать в pretty- или compact-виде. Compact JSON соответствует рекомендованной сериализации для инференса и вставляется в поле промта Ideogram 4 в ComfyUI.
+336
@@ -0,0 +1,336 @@
<p align="center"><a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><picture>
<source media="(prefers-color-scheme: dark)" srcset="assets/ideogram_logo_darkmode.svg">
<source media="(prefers-color-scheme: light)" srcset="assets/ideogram_logo.svg">
<img src="assets/ideogram_logo.svg" alt="Ideogram" width="500">
</picture></a></p>
<p align="center"><em>Ideogram 4: Open image model at the forefront of design</em></p>
<p align="center">
<a href="https://ideogram.ai/blog/ideogram-4.0/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Blog-Post-orange" alt="Blog Post"></a>
<a href="https://github.com/ideogram-oss/ideogram4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github" alt="Code"></a>
<a href="https://huggingface.co/collections/ideogram-ai/ideogram-4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Model-HuggingFace-blue?logo=huggingface" alt="Model"></a>
<a href="https://developer.ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/API-developer.ideogram.ai-purple" alt="API"></a>
<a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Official%20Site-ideogram.ai-ff69b4" alt="Official Site"></a>
</p>
<p align="center">
<img src="assets/samples/collage_landscape.jpg" alt="A collage of Ideogram 4 samples spanning photorealism, illustration, typography, and poster design">
</p>
Ideogram 4 is **[Ideogram](https://ideogram.ai)'s first open-weight text-to-image model**. It is a **state-of-the-art foundation model trained from scratch** — not a fine-tune of any existing model. It introduces a new structured JSON prompting interface, with best-in-class multilingual text rendering, deep language understanding, explicit bounding-box layout and color-palette controls, and native 2k resolution images. The easiest way to try the model is online at **[ideogram.ai](https://ideogram.ai/)**.
We believe openness drives innovation, and we invite the research community to innovate with us on the forefront of visual intelligence.
## Table of Contents
1. [News](#news)
2. [Model Zoo](#model-zoo)
3. [Performance](#performance)
4. [Quick Start](#quick-start)
5. [Model Summary](#model-summary)
6. [Prompting Guide](#prompting-guide)
7. [Documentation](#documentation)
8. [Citation](#citation)
## News
* **[2026-06-03]** **Ideogram 4 released!** Inference code and weights
are now public, and our [technical blog post](https://ideogram.ai/blog/ideogram-4.0/) is live. See the
[Quick Start](#quick-start) section to generate your first image, or try the
model online at [ideogram.ai](https://ideogram.ai/).
## Model Zoo
| Model | Params | Weight Quantization | Supported Hardware | Diffusers Support | License |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **[Ideogram 4 (nf4)](https://huggingface.co/ideogram-ai/ideogram-4-nf4)** | 9.3B | nf4 | CUDA | Yes | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
| **[Ideogram 4 (fp8)](https://huggingface.co/ideogram-ai/ideogram-4-fp8)** | 9.3B | fp8 | All | No | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
We plan to support more quantizations in the future.
## Performance
We evaluate Ideogram 4 across third-party arenas and benchmarks, standard
open-source benchmarks, and our own internal human-preference benchmark. Across
all of them, **Ideogram 4 is the best open-weight image model by far, and sits
at the frontier of design.**
### Design Arena
[Design Arena](https://www.designarena.ai/) is a third-party image Elo
leaderboard focused specifically on design-oriented generation. On the overall
board, Ideogram 4 is the top-ranked open-weight model, trailing only proprietary
GPT and Gemini models:
<p align="center">
<img src="assets/benchmarks/design_arena.png" alt="Design Arena overall image Elo leaderboard with Ideogram 4.0 as the top open-weight model">
</p>
Filtered to open-weight models only, Ideogram 4 leads by a commanding margin,
well ahead of the next-best open model:
<p align="center">
<img src="assets/benchmarks/design_arena2.png" alt="Design Arena open-weight image Elo leaderboard, with Ideogram 4.0 well ahead of all other open models">
</p>
### ContraLabs
[ContraLabs](https://contralabs.com/research) ran a blind typography evaluation judged by
ten professional designers from Contra's top-earning talent. Ideogram 4 leads on
first-place win rate, picked as the best of four models 47.9% of the time
overall — well ahead of Gemini 3.1 Flash Image Preview (Nano Banana 2) at 30.0%,
FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%):
<p align="center">
<img src="assets/benchmarks/contralabs_typography.png" alt="ContraLabs typography first-place win rate, with Ideogram v4 leading">
</p>
It also wins on practical usability: asked "Would you use this in real client
work?", the same designers rated Ideogram 4 highest at 3.55 / 5 — significantly
above Nano Banana 2 (2.84), Grok Imagine 1.0 (2.61), and FLUX.2 [max] (2.49):
<p align="center">
<img src="assets/benchmarks/contralabs_typography2.png" alt="ContraLabs 'would you use this in real client work?' rating, with Ideogram v4 leading">
</p>
### LMArena
On [LMArena](https://lmarena.ai/), a third-party text-to-image leaderboard that
measures general-purpose text-to-image use cases, Ideogram is the top-ranked
open-weight lab and a top-5 image generation lab overall — beaten only by giant
companies with vastly larger budgets and resources:
<p align="center">
<img src="assets/benchmarks/lmarena_benchmark.png" alt="LMArena text-to-image lab leaderboard with Ideogram">
</p>
### Ideogram internal eval
For our internal human-preference benchmark, focused on graphic design and
photography, we had graphic designers deeply familiar with professional design
work do the rating blind. Bradley-Terry scores rank Ideogram 4 #2 overall —
behind only GPT Image 2 medium — and the top open-weight model:
<p align="center">
<img src="assets/benchmarks/ideogram_benchmark.png" alt="Ideogram internal design leaderboard with Ideogram 4.0">
</p>
### Open-source benchmarks
On standard open-source benchmarks measuring core capabilities — layout control
(7Bench), spatial reasoning and object fidelity (SpatialGenEval), text rendering
(X-Omni OCR), and prompt alignment (Prism) — Ideogram 4 closes the gap to the
leading closed-source models across every axis. On layout control (7Bench), it
is significantly better than all closed-source models:
<p align="center">
<img src="assets/benchmarks/opensource.png" alt="Five-axis capability radar comparing Ideogram 4.0 to leading closed-source models on layout control, spatial reasoning, object fidelity, prompt alignment, and text rendering">
</p>
At 9.3B parameters, Ideogram 4 delivers the best text rendering of any open-weight
release we benchmarked — ahead of much larger models like Qwen-Image (20B),
FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE):
<p align="center">
<img src="assets/benchmarks/opensource2.png" alt="Parameter-efficiency scatter plot showing Ideogram 4.0 at 9.3B parameters leading all other open-weight models on text rendering">
</p>
## Quick Start
### Install
```bash
pip install .
```
If you plan to modify the code, install in editable mode instead so changes
under `src/ideogram4/` take effect without reinstalling:
```bash
pip install -e .
```
### Model access
The model weights are **gated** on Hugging Face, so you must accept the gate and
authenticate before the code can download them — otherwise the download fails
with a `404` / `GatedRepoError`.
1. Open the model page — [ideogram-ai/ideogram-4-nf4](https://huggingface.co/ideogram-ai/ideogram-4-nf4)
(or [ideogram-ai/ideogram-4-fp8](https://huggingface.co/ideogram-ai/ideogram-4-fp8)) — and click
**Agree and access repository** to accept the license gate.
2. Create a Hugging Face access token at
[huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and log in so the
download is authenticated:
```bash
hf auth login
```
Alternatively, export the token directly: `export HF_TOKEN="hf_..."`.
### CLI
The plain `--prompt` is rewritten into the structured JSON caption the model
expects by a "magic prompt" LLM. By default this uses Ideogram's hosted
magic-prompt API, which is **free** and does the expansion server-side (no local
model or system prompt needed). It reads `IDEOGRAM_API_KEY` — get a key at
https://developer.ideogram.ai/:
```bash
python run_inference.py \
--prompt "a ginger cat wearing a tiny wizard hat reading a spellbook" \
--output out.png \
--quantization "nf4" \
--magic-prompt-key "$IDEOGRAM_API_KEY"
```
You can also run the expansion through your own LLM provider: one of our magic-prompt
system prompts is **open source**. See the
[Prompting Guide](docs/prompting.md#magic-prompt) for details.
For the highest-quality images, set `--height 2048 --width 2048` and
`--sampler-preset V4_QUALITY_48`.
#### Safety screening with Hive
Prompt and output safety screening is performed via [Hive](https://thehive.ai/).
Sign up and create a Text Moderation key and a Visual Content Moderation key,
then export them as `HIVE_TEXT_MODERATION_KEY` and `HIVE_VISUAL_MODERATION_KEY`
(or pass them via `--hive-text-key` / `--hive-visual-key`).
```bash
python run_inference.py \
--prompt "an isometric illustration of a tiny city floating in the clouds" \
--output out.png \
--quantization "nf4" \
--magic-prompt-key "$IDEOGRAM_API_KEY" \
--hive-text-key "$HIVE_TEXT_MODERATION_KEY" \
--hive-visual-key "$HIVE_VISUAL_MODERATION_KEY"
```
For sampler presets, parameter reference, and optimization tips, see
[docs/inference.md](docs/inference.md).
## Model Summary
Ideogram 4 is a **foundation model trained entirely from scratch**, not a
fine-tune or distillation of any existing checkpoint. It is a flow-matching
text-to-image model built on a **fully single-stream** Diffusion Transformer
(DiT) architecture.
**Architecture:**
- **Fully single-stream DiT.** Text and image tokens are concatenated into one
unified sequence and processed through the same 34-layer transformer, with no
separate text or image branches. This enables deep cross-modal interaction at
every layer.
- **Vision-language model as text encoder.** Instead of a text-only encoder
like CLIP or T5, Ideogram 4 uses
[Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct),
a full vision-language model that provides far richer understanding of visual
concepts. Hidden states are extracted from **13 intermediate layers** and
concatenated, giving the model multi-scale semantic features ranging from
surface-level token information to deep compositional understanding.
- **Dual-branch classifier-free guidance.** The conditional (positive) and
unconditional (negative) branches can be independently refined, enabling
separate control over prompt adherence and image quality.
- **Flexible resolution.** Native support for any resolution from 256 to 2048
(multiples of 16), with aspect ratios up to 6:1. A single model handles
everything from square thumbnails to ultrawide banners, with the noise
schedule auto-adjusting per resolution.
**Key Capabilities:**
- **Extreme controllability.** Ideogram 4 is trained on structured JSON
captions, giving users unprecedented control over composition, style,
lighting, color palette, typography, and spatial layout, all from a single
prompt.
- **State-of-the-art text rendering.** Ideogram 4 delivers best-in-class
in-image text generation (signage, logos, captions, watermarks, multi-line
text) with high fidelity directly from the prompt.
- **Spatial layout control.** Bounding-box coordinates in the prompt allow
explicit placement of subjects, text elements, and background regions.
- **Color palette conditioning.** Specify hex colors in the prompt to steer the
image's dominant color scheme.
For full architecture details, see
[docs/model_architecture.md](docs/model_architecture.md). For a walkthrough of
how the pipeline components fit together, see
[docs/pipeline.md](docs/pipeline.md).
## Prompting Guide
Ideogram 4 is trained exclusively on **structured JSON captions**. While
plain-text prompts work, you will get the best results by providing a JSON
object that follows our caption schema.
Key points:
- **Use JSON prompts** for maximum controllability — the model was trained on
them and understands the structure natively.
- **Color palette conditioning** — specify a `colour_palette` array of hex
colors in the style description to steer the image's color scheme.
- **Aspect ratio flexibility** — Ideogram 4 supports a wide range of aspect
ratios (any multiple-of-16 resolution from 256 to 2048 on each side). This
is a key advantage for practical use: portraits, landscapes, banners,
phone wallpapers, social media formats, etc.
- **Bounding-box layout** — specify `bbox` coordinates in the prompt to
explicitly place subjects, text elements, and background regions.
- **Compositional control** — use `compositional_deconstruction` with bounding
boxes and per-element descriptions for precise spatial layout.
**Why JSON-only training?** We train exclusively on JSON so that training
and inference share a single, common prompt format. The training captions themselves are deliberately
**extremely descriptive**: each JSON exhaustively describes everything in
the image to maximize training efficiency. The more
text-to-image relationships each caption pins down, the more grounded
supervision the model extracts from a single training pair, rather than
having to infer those relationships across many sparsely-captioned samples.
**Why JSON at inference time?** Because the model was trained on captions
that name every object explicitly, the most reliable way to get every
requested object rendered is to mirror that pattern. Plain-text prompts still work, but
won't perform as well since the model was only trained on structured JSON captions.
**Don't want to write JSON by hand?** That's what *magic prompt* is for: it uses
an LLM to expand a plain-text prompt into a full structured caption before
generation, so you get JSON-quality results from a casual prompt. It runs by
default in `run_inference.py` (see the [CLI](#cli) section).
See [docs/prompting.md](docs/prompting.md) for a full guide.
## Documentation
| Document | Description |
| :------- | :---------- |
| [docs/prompting.md](docs/prompting.md) | How to write JSON prompts, color palette conditioning, aspect ratios |
| [docs/inference.md](docs/inference.md) | Sampler presets, parameter reference, resolutions, optimization tips |
| [docs/model_architecture.md](docs/model_architecture.md) | Architecture diagram, DiT spec, component details |
| [docs/pipeline.md](docs/pipeline.md) | Conceptual pipeline walkthrough — how all components fit together |
| [docs/development.md](docs/development.md) | Dev setup, pre-commit hooks, contributing |
| [docs/safety.md](docs/safety.md) | Pre-training, post-training, and inference-time safety mitigations; how to report violations |
## Citation
If you find the provided code or models useful for your research, consider citing them as:
```bibtex
@misc{ideogram-4-2026,
author={Ideogram AI},
title={{Ideogram 4}},
year={2026},
howpublished={\url{https://ideogram.ai/blog/ideogram-4.0/}},
}
```
## We're Hiring!
We're looking for **Research Scientists** and **Research Engineers** to
work on next-generation generative models and the products built on top of
them. Interested candidates, please apply at https://jobs.ashbyhq.com/ideogram
+58
@@ -0,0 +1,58 @@
# Development
## Editable install
We recommend installing into an isolated environment — the dependencies include several GB of CUDA-built wheels.
```bash
python -m venv .venv && source .venv/bin/activate
```
For development, install the package in editable mode so changes to the source
tree are picked up without reinstalling:
```bash
pip install -e .
```
or with [`uv`](https://docs.astral.sh/uv/):
```bash
uv venv && source .venv/bin/activate
```
```bash
uv pip install -e .
```
## Pre-commit hooks
This repo uses [pre-commit](https://pre-commit.com/) to run lint, format, and
type checks (`ruff`, `mypy`, etc.) before each commit.
Install once per clone:
```bash
pip install pre-commit
pre-commit install
```
`pre-commit install` registers a git hook in `.git/hooks/pre-commit`, so it
requires the directory to be a git repo. The hooks now run automatically on
`git commit` against staged files.
To run the hooks manually against every file in the repo (useful right after
the first install, or in CI):
```bash
pre-commit run --all-files
```
The first run downloads each hook's environment (ruff, mypy, etc.) into
`~/.cache/pre-commit/` and may take a minute. Subsequent runs are fast.
To bump pinned hook versions in `.pre-commit-config.yaml`:
```bash
pre-commit autoupdate
```
+63
@@ -0,0 +1,63 @@
# Inference Reference
Detailed parameters, sampler presets, supported resolutions, and optimization
tips for Ideogram 4 inference.
## Sampler Presets
Named presets bundle a step count, per-step CFG schedule, schedule mean (`mu`),
and schedule standard deviation (`std`) into a single flag:
```bash
python run_inference.py \
--prompt "a cat wearing a tiny top hat" \
--sampler-preset V4_QUALITY_48 \
--output out.png
```
| Preset | Steps | CFG schedule | `mu` | `std` |
| :----- | :---: | :----------- | :--: | :---: |
| `V4_QUALITY_48` | 48 | 45 steps @ gw=7, then 3 polish steps @ gw=3 | 0.0 | 1.5 |
| `V4_DEFAULT_20` | 20 | 18 steps @ gw=7, then 2 polish steps @ gw=3 | 0.0 | 1.75 |
| `V4_TURBO_12` | 12 | 11 steps @ gw=7, then 1 polish step @ gw=3 | 0.5 | 1.75 |
`V4_DEFAULT_20` is the default. Fewer steps trade quality for speed. The full
registry lives in
[`ideogram4.sampler_configs.PRESETS`](../src/ideogram4/sampler_configs.py); add a
new entry there to define your own.
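To make the CFG schedule column concrete, here is a minimal sketch that expands a preset's step split into a per-step list of guidance weights. The function name, the `SPLITS` table, and the list representation are assumptions for illustration; the real definitions live in `ideogram4.sampler_configs.PRESETS` and may be structured differently.

```python
def build_guidance_schedule(
    main_steps: int,
    polish_steps: int,
    main_gw: float = 7.0,
    polish_gw: float = 3.0,
) -> list[float]:
    """Per-step guidance weights in sampling order (first step first)."""
    return [main_gw] * main_steps + [polish_gw] * polish_steps


# (main steps, polish steps) copied from the preset table above.
SPLITS = {
    "V4_QUALITY_48": (45, 3),
    "V4_DEFAULT_20": (18, 2),
    "V4_TURBO_12": (11, 1),
}

quality_schedule = build_guidance_schedule(*SPLITS["V4_QUALITY_48"])
```

The list is in sampling order. If the pipeline expects a different index order for `guidance_schedule`, reverse the list before passing it in.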
## Key Parameters
These are the keyword arguments accepted by `Ideogram4Pipeline.__call__`. The
defaults below apply when you call `pipe(...)` directly; `run_inference.py`
overrides `num_steps`, `guidance_schedule`, `mu`, and `std` from the chosen
sampler preset (see above).
| Parameter | Default | Notes |
| :-------- | :-----: | :---- |
| `height` / `width` | 1024 | Must be multiples of 16. Supported range: 256 to 2048. Aspect ratios up to 6:1 or 1:6. |
| `num_steps` | 48 | More steps = higher quality. The `V4_QUALITY_48` preset (48 steps) is a good speed/quality trade-off. |
| `guidance_scale` | 7.0 | Constant guidance weight used when no `guidance_schedule` is given. Higher = more prompt adherence, lower = more diversity. |
| `guidance_schedule` | `None` | Optional per-step guidance weights (loop-index order: index 0 is the final step). Overrides `guidance_scale`. |
| `mu` | 0.5 | Logit-normal schedule mean. Auto-adjusted for resolution. |
| `std` | 1.0 | Logit-normal schedule standard deviation. |
| `seed` | `None` | Set for reproducible results. |
## Supported Resolutions
Ideogram 4 natively supports any resolution where both height and width are
multiples of 16, within the range 256 to 2048 (aspect ratios up to 6:1 or 1:6).
| Use case | Resolution | Aspect ratio |
| :------- | :--------: | :----------: |
| Square | 1024 × 1024 | 1:1 |
| Landscape | 1536 × 1024 | 3:2 |
| Portrait | 1024 × 1536 | 2:3 |
| Widescreen | 1920 × 1088 | ~16:9 |
| Ultrawide | 2048 × 768 | ~21:9 |
| Phone wallpaper | 1024 × 1792 | ~9:16 |
| Social banner | 1600 × 400 | 4:1 |
Resolution buckets use 16-pixel increments, giving fine-grained control over
output dimensions.
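The resolution rules in this section (each side a multiple of 16, between 256 and 2048, aspect ratio at most 6:1 either way) can be checked with a few lines. The function names are illustrative and not part of the `ideogram4` package.

```python
def is_supported_resolution(width: int, height: int) -> bool:
    """Apply the documented limits: 16-pixel buckets, 256 to 2048, ratio <= 6:1."""
    sides_ok = all(256 <= side <= 2048 and side % 16 == 0 for side in (width, height))
    aspect_ok = max(width, height) <= 6 * min(width, height)
    return sides_ok and aspect_ok


def snap_to_bucket(side: int) -> int:
    """Round a side length to the nearest multiple of 16, clamped to [256, 2048]."""
    snapped = int(side / 16 + 0.5) * 16
    return max(256, min(2048, snapped))
```

This explains the Widescreen row above: 1080 is not a multiple of 16, so the nearest bucket, 1088, is used instead.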
+45
@@ -0,0 +1,45 @@
# Model Architecture
```
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│               Ideogram4Transformer               │
│  • 34 × Ideogram4TransformerBlock                │
│      ◦ Ideogram4Attention (QK-RMSNorm, MRoPE)    │
│      ◦ Ideogram4MLP (SwiGLU)                     │
│      ◦ adaln scale/gate from t-embedding         │
│  • Ideogram4FinalLayer                           │
└──────────────────────────────────────────────────┘
                         │ velocity prediction
                         ▼
  Euler flow-matching sampler with asymmetric CFG
                         │ denoised image latents
                         ▼
                    VAE decode
                         │
                         ▼
                     PIL.Image
```
The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from
the extracted layers) and image latent tokens are concatenated into one
sequence, modulated per-block by an AdaLN computed from the flow-matching
timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and
image tokens share a unified positional space.
Model spec:
| field | value |
|-------------------|---------------|
| `emb_dim` | 4608 |
| `num_layers` | 34 |
| `num_heads` | 18 |
| `intermediate` | 12288 |
| `adanln_dim` | 512 |
| `rope_theta` | 5_000_000 |
| `mrope_section` | (24, 20, 20) |
| latent channels | 32 × 2² = 128 |
| max text tokens | 2048 |
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
+183
@@ -0,0 +1,183 @@
# Pipeline: How All the Components Work Together
This document explains the end-to-end Ideogram 4 inference pipeline
conceptually. For the architecture spec and code pointers, see
[model_architecture.md](model_architecture.md).
## Overview
Ideogram 4 is a **flow-matching text-to-image model** built on a
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
components:
```
┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
│  Qwen3-VL   │   │      Ideogram4       │   │    KL VAE    │   │           │
│    Text     ├──►│  Transformer (DiT)   ├──►│     VAE      ├──►│   Image   │
│   Encoder   │   │   + Euler Sampler    │   │   Decoder    │   │           │
└─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
    frozen               trainable                frozen
```
## 1. Text Encoder — Qwen3-VL-8B-Instruct
The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
vision-language model, used in text-only mode (no vision inputs).
**What it does:**
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a
multi-scale text representation.
**Why multi-layer extraction?** Different layers capture different levels of
abstraction — early layers encode surface-level token information, while later
layers encode deeper semantic meaning. Concatenating them gives the DiT access
to the full spectrum.
**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
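The layer selection and the resulting feature width can be sketched in plain Python. The list below reproduces the 13 indices given above; the helper name and the use of nested lists in place of tensors are simplifications for illustration, and how index 0 maps onto the encoder's internal outputs is not specified here.

```python
# Every third layer from 0 to 33, plus the last layer (35): 13 indices in total.
EXTRACT_LAYERS = list(range(0, 34, 3)) + [35]


def concat_features(per_layer: dict[int, list[float]]) -> list[float]:
    """Concatenate one token's hidden states from the selected layers.

    per_layer maps a layer index to that layer's hidden-state vector.
    The result has hidden_dim * len(EXTRACT_LAYERS) entries.
    """
    features: list[float] = []
    for layer in EXTRACT_LAYERS:
        features.extend(per_layer[layer])
    return features
```

For a hidden size of `hidden_dim`, each text token therefore carries `hidden_dim * 13` features into the DiT, matching the output shape stated above.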
## 2. DiT Backbone — Ideogram4Transformer
The core generative model is a 34-layer single-stream Diffusion Transformer.
### Sequence layout
Text tokens and image latent tokens are concatenated into one sequence and
processed through the same self-attention layers.
```
Sequence layout (per sample):
┌───────────────────┬────────────────────────┐
│ text tokens │ image latent tokens │
│ (up to 2048) │ (grid_h × grid_w) │
└───────────────────┴────────────────────────┘
▲ ▲
Qwen3-VL features noisy latents z_t
```
### Key components per block
- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
positional encoding is 3-dimensional: for text tokens it uses a 1D position
broadcast to 3 axes; for image tokens it uses (temporal, height, width)
coordinates. This lets text and image tokens coexist in a unified positional
space.
- **SwiGLU MLP** — the feed-forward layer uses a gated linear unit with SiLU
activation.
- **Adaptive Layer Norm (AdaLN)** conditioned on the timestep: the scalar `t` is
embedded, and that embedding generates per-block scale and gate parameters, so
every layer is conditioned on the current noise level.
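The positional layout described for MRoPE can be sketched as follows. This only builds the position triples; it does not implement the rotary embedding itself. The function name is hypothetical, and placing image coordinates after the text positions (the `offset`) is an assumption made for illustration, since this document does not specify how the two ranges are aligned.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def build_position_ids(num_text: int, grid_h: int, grid_w: int) -> List[Position]:
    """Return one position triple per token in the joint text + image sequence."""
    # Text tokens: a 1D position broadcast to all three axes.
    text_ids = [(p, p, p) for p in range(num_text)]
    # Image tokens: a fixed temporal index plus (row, column) coordinates.
    # Offsetting by num_text is an assumption of this sketch.
    offset = num_text
    image_ids = [
        (offset, offset + row, offset + col)
        for row in range(grid_h)
        for col in range(grid_w)
    ]
    return text_ids + image_ids


for triple in build_position_ids(num_text=2, grid_h=2, grid_w=2):
    print(triple)
```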
### Flow matching
The model is trained with a **flow-matching** objective. Instead of predicting
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
defines the ODE:
```
dz/dt = v(z_t, t)
```
At inference time, we start from pure Gaussian noise `z_1` and integrate
backward to `z_0` (the clean image) using the Euler method:
```
z_{t-dt} = z_t - v(z_t, t) * dt
```
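A toy illustration of backward Euler integration may help here. It assumes a uniform time grid and a hand-written velocity field (the real sampler uses the DiT and the logit-normal schedule), and all names are hypothetical. For straight-line paths `z_t = (1 - t) * z0 + t * z1` the velocity is the constant `z1 - z0`, so integrating from `t=1` to `t=0` should land on `z0`.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector], z1: Vector, num_steps: int) -> Vector:
    """Integrate dz/dt = v(z, t) from t=1 down to t=0 with uniform Euler steps."""
    z = list(z1)
    for i in range(num_steps):
        t = 1.0 - i / num_steps        # current time
        s = 1.0 - (i + 1) / num_steps  # next (smaller) time
        v = velocity(z, t)
        # (s - t) is negative, so the update moves backward in time.
        z = [zi + vi * (s - t) for zi, vi in zip(z, v)]
    return z


z0_true = [0.5, -1.0]   # the "clean" sample the path should end at
z1_noise = [2.0, 3.0]   # the starting point at t=1


def constant_velocity(z: Vector, t: float) -> Vector:
    """Velocity of the straight-line path between z0_true and z1_noise."""
    return [b - a for a, b in zip(z0_true, z1_noise)]


print(euler_sample(constant_velocity, z1_noise, num_steps=8))
```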
### Noise schedule
The timestep distribution follows a **logit-normal schedule** parameterized by
`(mu, sigma)`. The mean `mu` controls how much time the sampler spends at
different noise levels — higher `mu` shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:
```
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
```
where `base_pixels = 512 * 512`.
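As a quick worked example of the resolution adjustment, the sketch below evaluates the formula directly. It assumes `log` is the natural logarithm, and `adjust_mu` is a hypothetical helper name. Under that assumption, a 1024×1024 image has four times the base pixel count, so `mu` shifts by `0.5 * ln(4) = ln(2)`.

```python
import math

BASE_PIXELS = 512 * 512


def adjust_mu(mu_base: float, height: int, width: int) -> float:
    """Shift mu by half the log of the pixel-count ratio (natural log assumed)."""
    return mu_base + 0.5 * math.log((height * width) / BASE_PIXELS)


print(adjust_mu(0.0, 512, 512))    # no shift at the base resolution
print(adjust_mu(0.0, 1024, 1024))  # shift of ln(2) at 4x the pixels
```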
## 3. Classifier-Free Guidance (CFG)
At each sampling step, two forward passes are run through the DiT:
1. **Conditional (positive):** full text features + noisy image latents.
2. **Unconditional (negative):** zeroed text features + noisy image latents
(image-only tokens, asymmetric CFG).
The guided velocity is a weighted combination:
```
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
```
where `gw` is the per-step guidance weight. With
`gw > 1`, the model amplifies the text-conditional signal and suppresses the
unconditional prediction, producing images that follow the prompt more
faithfully.
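The combination above can be sketched on plain lists of floats. `cfg_combine` is a hypothetical helper; the real pipeline operates on tensors. Note that `gw = 1` returns the conditional prediction unchanged, while `gw > 1` extrapolates past it, away from the unconditional prediction.

```python
from typing import List


def cfg_combine(v_cond: List[float], v_uncond: List[float], gw: float) -> List[float]:
    """Blend conditional and unconditional velocities with guidance weight gw."""
    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [2.0, -1.0]
v_uncond = [1.0, 0.0]
print(cfg_combine(v_cond, v_uncond, gw=1.0))  # equals v_cond
print(cfg_combine(v_cond, v_uncond, gw=3.0))  # pushed further from v_uncond
```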
**Asymmetric CFG:** The unconditional branch only processes image tokens (no
text padding), making it computationally cheaper than a full-sequence negative
pass.
**Per-step schedules:** The guidance weight can vary across steps. The
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
`gw=3` for the final 3 "polish" steps near `t=0`.
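A two-phase schedule like the one described can be written as a plain list with one weight per step. This is only a sketch: the helper name is hypothetical, and the actual representation of `guidance_schedule` inside the presets is not specified here.

```python
from typing import List


def two_phase_guidance(num_steps: int, main_gw: float, polish_gw: float, polish_steps: int) -> List[float]:
    """Per-step guidance weights: main_gw first, then polish_gw for the final steps."""
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    return [main_gw] * (num_steps - polish_steps) + [polish_gw] * polish_steps


schedule = two_phase_guidance(num_steps=48, main_gw=7.0, polish_gw=3.0, polish_steps=3)
print(len(schedule), schedule[0], schedule[-1])
```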
## 4. VAE Decoder — KL Autoencoder
The denoised latent `z_0` is decoded to pixel space using a frozen KL
autoencoder.
**What it does:**
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
`(batch, channels, grid_h * 2, grid_w * 2)`.
- **Denormalization:** Per-channel shift and scale are applied to undo the
latent normalization used during training.
- **Decoding:** The VAE decoder maps latents to RGB pixels.
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
**Compression factor:** The autoencoder provides 8× spatial compression on each
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
is represented as a 64×64 grid of latent tokens, each with 128 channels
(32 base channels × 2² patch).
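The arithmetic above can be checked with a small helper. The constants come from this section; the function name and the requirement that both sides be multiples of 16 are assumptions of the sketch.

```python
from typing import Tuple

VAE_DOWNSAMPLE = 8    # spatial compression per axis
PATCH_SIZE = 2        # 2x2 latent pixels per DiT token
LATENT_CHANNELS = 32  # base latent channels


def token_grid(height: int, width: int) -> Tuple[int, int, int]:
    """Return (grid_h, grid_w, channels_per_token) for a given image size."""
    stride = VAE_DOWNSAMPLE * PATCH_SIZE
    if height % stride or width % stride:
        raise ValueError(f"height and width must be multiples of {stride}")
    return height // stride, width // stride, LATENT_CHANNELS * PATCH_SIZE ** 2


print(token_grid(1024, 1024))
```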
## Putting it all together
```python
# Pseudocode for one generation call (qwen3_vl, dit, vae, schedule, and gw are placeholders):
import torch

# 1. Encode text
text_features = qwen3_vl.encode(prompt)  # (B, L_text, D)

# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)  # pure noise at t=1

# 3. Euler integration from t=1 to t=0
for step in reversed(range(num_steps)):
    t = schedule(step)      # current noise level
    s = schedule(step - 1)  # next, lower noise level (t=0 after the last step)

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)
    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(torch.zeros_like(text_features), z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step: (s - t) is negative, so z moves toward t=0
    z = z + v * (s - t)

# 4. Decode to pixels
image = vae.decode(z)
```
+362
View File
@@ -0,0 +1,362 @@
# Prompting Guide
Ideogram 4 is trained exclusively on **structured JSON captions** (serialized as strings). While the
model can accept plain-text prompts, providing a JSON object that follows the
caption schema gives significantly better results, especially for
controllability, spatial layout, and style fidelity.
## Plain-text vs. JSON prompts
You can pass in plain-text prompts directly to the model and it will work. The
sampling parameters come from a named preset in `ideogram4.PRESETS` (the same
ones `run_inference.py` exposes via `--sampler-preset`), unpacked into the
`pipe()` call:
```python
from ideogram4 import PRESETS
preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    "a golden retriever on a skateboard",
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
For higher-quality generations and more control, pass a JSON string as the prompt:
```python
import json
from ideogram4 import PRESETS
caption = {
    "high_level_description": "A golden retriever riding a skateboard down a sunny sidewalk.",
    "style_description": {
        "aesthetics": "warm, playful, vibrant",
        "lighting": "bright afternoon sunlight, long soft shadows",
        "photo": "shallow depth of field, eye-level, 85mm lens",
        "medium": "photograph",
        "color_palette": ["#F5C542", "#87CEEB", "#4A4A4A", "#FFFFFF", "#2E8B57"]
    },
    "compositional_deconstruction": {
        "background": "A sun-drenched suburban sidewalk lined with green hedges and a white picket fence. Dappled light filters through overhead trees.",
        "elements": [
            {"type": "obj", "bbox": [200, 300, 800, 900], "desc": "A golden retriever with a fluffy coat, standing on a red skateboard with all four paws. Its tongue is out and ears are flapping in the wind."},
            {"type": "obj", "bbox": [250, 750, 750, 950], "desc": "A worn red skateboard with black wheels rolling along the concrete sidewalk."}
        ]
    }
}
preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    json.dumps(caption, separators=(",", ":"), ensure_ascii=False),
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
## Magic prompt
Writing these captions by hand is optional. *Magic prompt* uses an LLM to expand
a plain-text prompt into a full structured caption for you, so you get the
quality of a JSON prompt from a casual one. It is enabled by default in
`run_inference.py`; you can also call it directly:
```python
import os
from ideogram4 import ClaudeOpusMagicPromptV1, PRESETS
magic = ClaudeOpusMagicPromptV1(api_key=os.environ["MAGIC_PROMPT_API_KEY"])
caption = magic.expand("a golden retriever on a skateboard", aspect_ratio="1:1")
preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    caption,
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
The package ships three configurations, registered by name in
`ideogram4.MAGIC_PROMPTS` (the keys `run_inference.py` accepts via
`--magic-prompt-model`):
| Config class | Registry key | Backend |
| :--- | :--- | :--- |
| `Ideogram4MagicPromptV1` | `ideogram-4-v1` | Ideogram's hosted magic-prompt API (free; reads `IDEOGRAM_API_KEY`) |
| `ClaudeOpusMagicPromptV1` | `claude-opus-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
| `ClaudeSonnetMagicPromptV1` | `claude-sonnet-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
`ideogram-4-v1` is the default and is **free**. It runs the expansion
server-side, so there is no local model or system prompt involved — it just needs
an Ideogram API key (get one at
[developer.ideogram.ai](https://developer.ideogram.ai)). The `claude-*`
configurations instead send one of our open-source system prompts to an OpenRouter model;
select one with `--magic-prompt-model` and export `MAGIC_PROMPT_API_KEY`:
```bash
python run_inference.py \
    --prompt "an isometric illustration of a tiny city floating in the clouds" \
    --output out.png \
    --quantization "nf4" \
    --magic-prompt-model claude-opus-v1 \
    --magic-prompt-key "$MAGIC_PROMPT_API_KEY"
```
See the README's [CLI](../README.md#cli) section for the rest of the flags.
Our magic-prompt system prompts are **open source** (they ship in
`src/ideogram4/magic_prompt_system_prompts/`), so you're also welcome to
construct the caption with any system prompt and LLM of your choosing.
**A few caveats:**
- At Ideogram we've tested this magic prompt with **Claude Opus**. You're welcome
to implement your own `MagicPrompt` configurations and/or drive a different LLM
with our system prompt, but those paths aren't tested by us and quality may
vary.
- The magic prompt shipped here is **not** the same magic prompt used in
production at [Ideogram.ai](https://ideogram.ai) — results will differ from the
hosted product (including the `ideogram-4-v1` API).
## JSON caption schema
> **Note:** Following this schema is **not required** — the model accepts any
> string as a prompt. The schema below describes the exact structure the model
> was trained on, and matching it minimizes train/eval mismatch so the model
> generates closer to its full quality. Treat the "required" / "must" language
> in the rest of this section as the format the [`CaptionVerifier`](../src/ideogram4/caption_verifier.py)
> checks against, not as a hard pipeline constraint. Deviating from the schema
> is allowed; it just means you're sampling outside the training distribution.
The full caption schema has three top-level fields:
1. `high_level_description`: optional string, but strongly recommended.
2. `style_description`: optional object.
3. `compositional_deconstruction`: **required** object.
`compositional_deconstruction` must always be present. Within it, both
`background` and `elements` are required.
### `high_level_description`
A one- or two-sentence summary of the entire image. Strongly recommended in every prompt.
```json
"high_level_description": "A medium-shot photograph of a barista pouring latte art in a cozy cafe."
```
### `style_description`
Controls the visual style, lighting, medium, and color palette.
`style_description` must contain **exactly one** of:
- `photo` — for photographic captions (paired with `medium: "photograph"`).
- `art_style` — for non-photographic captions (illustration, painting, 3D render, etc.).
`aesthetics`, `lighting`, and `medium` are also required when `style_description` is present. `color_palette` is optional.
**Key order is strict** and depends on which of `photo` / `art_style` is used:
| Caption type | Required key order |
| :----------- | :----------------- |
| Photo (uses `photo`) | `aesthetics`, `lighting`, `photo`, `medium`, `color_palette` |
| Non-photo (uses `art_style`) | `aesthetics`, `lighting`, `medium`, `art_style`, `color_palette` |
`color_palette` is the only field in this list that may be omitted; if it is included it must remain in the final position.
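One way to guarantee the key order is to rebuild the dictionary from a fixed list of keys before serializing, since Python dictionaries preserve insertion order. The sketch below is a hypothetical helper (`order_style` is not part of the package) that applies the rules from the table above; `CaptionVerifier` remains the reference check.

```python
from typing import Any, Dict

PHOTO_ORDER = ["aesthetics", "lighting", "photo", "medium", "color_palette"]
ART_ORDER = ["aesthetics", "lighting", "medium", "art_style", "color_palette"]


def order_style(style: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a style_description with keys in the documented order."""
    has_photo, has_art = "photo" in style, "art_style" in style
    if has_photo == has_art:
        raise ValueError("style_description needs exactly one of 'photo' or 'art_style'")
    order = PHOTO_ORDER if has_photo else ART_ORDER
    unknown = sorted(set(style) - set(order))
    if unknown:
        raise ValueError(f"unexpected keys: {unknown}")
    missing = [key for key in order if key != "color_palette" and key not in style]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    # Rebuild in the documented order; color_palette is kept only if present.
    return {key: style[key] for key in order if key in style}


unordered = {"medium": "photograph", "photo": "35mm, f/1.4", "lighting": "low-key", "aesthetics": "moody"}
print(list(order_style(unordered)))
```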
Field descriptions:
| Field | Type | Description |
| :---- | :--- | :---------- |
| `aesthetics` | string | Aesthetic keywords (e.g. "moody, cinematic, desaturated") |
| `lighting` | string | Lighting description (e.g. "golden hour, rim light, dramatic shadows") |
| `photo` | string | Camera/lens details for photographic outputs (e.g. "35mm, f/1.4, bokeh"). Use this OR `art_style`, not both. |
| `medium` | string | Medium type: `"photograph"`, `"illustration"`, `"3d_render"`, `"painting"`, `"graphic_design"`, etc. |
| `art_style` | string | Art style description for non-photo captions (e.g. "flat vector illustration, bold outlines"). Use this OR `photo`, not both. |
| `color_palette` | list[str] | Hex color codes that steer the image's dominant colors. Up to 16 entries. |
### `compositional_deconstruction`
Provides fine-grained spatial control over the image layout using bounding
boxes and per-element descriptions. Both fields below are required.
| Field | Type | Description |
| :---- | :--- | :---------- |
| `background` | string | Description of the background/environment (required) |
| `elements` | list[dict] | List of elements with optional bounding boxes (required) |
`background` must come before `elements`.
Each element in `elements` must follow a fixed **key order** depending on its
type. `bbox` and `color_palette` are optional within an element; if present they
must appear in the positions shown below.
| Type | Required key order |
| :--- | :----------------- |
| `"obj"` | `type`, `bbox`, `desc`, `color_palette` |
| `"text"` | `type`, `bbox`, `text`, `desc`, `color_palette` |
Field descriptions:
| Field | Type | Description |
| :---- | :--- | :---------- |
| `type` | string | `"obj"` for objects/subjects, `"text"` for in-image text |
| `bbox` | list[int] | `[y_min, x_min, y_max, x_max]` in normalized `0-1000` coordinates (origin at top-left). Optional. |
| `desc` | string | Detailed description of the element |
| `text` | string | (only for `type: "text"`) The literal text to render |
| `color_palette` | list[str] | Optional per-element palette. Up to 5 hex entries. |
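The element rules above (y-first bounding boxes on a 0-1000 scale, fixed key order per type) can be sketched with two small helpers. Both names are hypothetical, and rounding to the nearest integer and clamping to the 0-1000 range are assumptions of this sketch rather than documented behavior.

```python
from typing import Any, Dict, List, Optional, Sequence


def to_bbox(x0: float, y0: float, x1: float, y1: float, image_w: int, image_h: int) -> List[int]:
    """Convert pixel corners to [y_min, x_min, y_max, x_max] on a 0-1000 scale."""
    def norm(value: float, size: int) -> int:
        return max(0, min(1000, round(1000 * value / size)))
    return [norm(y0, image_h), norm(x0, image_w), norm(y1, image_h), norm(x1, image_w)]


def make_element(kind: str, desc: str, *, bbox: Optional[Sequence[int]] = None,
                 text: Optional[str] = None,
                 color_palette: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Build an element dict whose keys follow the documented order."""
    if kind not in ("obj", "text"):
        raise ValueError("kind must be 'obj' or 'text'")
    if (kind == "text") != (text is not None):
        raise ValueError("'text' must be given for text elements and only for them")
    element: Dict[str, Any] = {"type": kind}
    if bbox is not None:
        element["bbox"] = list(bbox)
    if kind == "text":
        element["text"] = text
    element["desc"] = desc
    if color_palette is not None:
        element["color_palette"] = list(color_palette)
    return element


box = to_bbox(128, 256, 512, 768, image_w=1024, image_h=1024)
print(make_element("text", "Bold sans-serif title.", bbox=box, text="ACME TECH"))
```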
**Key ordering matters.** The model was trained on JSON with a consistent key
order, so maintaining it improves generation quality. The pipeline runs
[`CaptionVerifier`](../src/ideogram4/caption_verifier.py) on every prompt and emits
warnings for unknown keys, missing required keys, or out-of-order keys.
**Hex color format.** Colors in `color_palette` must be uppercase
`#RRGGBB` strings (e.g. `#1B1B2F`, not `#1b1b2f` or `#fff`).
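A simple way to enforce this before building a caption is a regular-expression check, with an optional normalization step for lowercase or shorthand input. Both function names are hypothetical, and expanding `#rgb` shorthand is a convenience of this sketch, not something the pipeline is stated to do.

```python
import re

HEX_COLOR = re.compile(r"#[0-9A-F]{6}")
HEX_SHORTHAND = re.compile(r"#[0-9A-F]{3}")


def is_valid_hex(color: str) -> bool:
    """True only for uppercase #RRGGBB strings."""
    return HEX_COLOR.fullmatch(color) is not None


def normalize_hex(color: str) -> str:
    """Uppercase a color and expand #RGB shorthand to #RRGGBB."""
    candidate = color.strip().upper()
    if HEX_SHORTHAND.fullmatch(candidate):
        candidate = "#" + "".join(ch * 2 for ch in candidate[1:])
    if not is_valid_hex(candidate):
        raise ValueError(f"not a hex color: {color!r}")
    return candidate


for raw in ["#1B1B2F", "#1b1b2f", "#fff"]:
    print(raw, is_valid_hex(raw), normalize_hex(raw))
```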
**Encoding.** When serializing with Python's `json` module, pass
`separators=(",", ":")` and `ensure_ascii=False`.
`CaptionVerifier` warns when it detects `\uXXXX` escapes with no literal
non-ASCII characters in the raw text.
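The difference between the recommended settings and the `json.dumps` defaults is easy to see on a caption containing a non-ASCII character. The caption below is a made-up minimal example.

```python
import json

caption = {"compositional_deconstruction": {"background": "Café terrace", "elements": []}}

# Recommended: compact separators, non-ASCII characters kept literal.
compact = json.dumps(caption, separators=(",", ":"), ensure_ascii=False)

# Defaults: spaces after separators, non-ASCII characters escaped as \uXXXX.
escaped = json.dumps(caption)

print(compact)
print(escaped)
```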
## Color palette conditioning
One of Ideogram 4's distinctive features is **color palette control**. By
providing a `color_palette` array of hex colors in `style_description`, you
can steer the dominant colors of the generated image.
```json
"style_description": {
"aesthetics": "moody, cinematic",
"lighting": "low-key, deep shadows",
"photo": "35mm, f/1.4",
"medium": "photograph",
"color_palette": ["#1B1B2F", "#162447", "#1F4068", "#E43F5A", "#F5F5F5"]
}
```
Tips for effective color palette use:
- **Up to 16 colors** in `style_description.color_palette` for the overall
image palette, and **up to 5 colors** per element in
`compositional_deconstruction.elements[*].color_palette`.
- **Include background colors** — if you want a dark background, include the
dark hex in the palette.
- **Contrast pairs** — include both your highlight and shadow colors for more
controlled lighting.
- **Uppercase hex only** — `#RRGGBB` form, no shorthand.
### Example: warm sunset palette
```json
{
"high_level_description": "A lone sailboat on calm water at sunset.",
"style_description": {
"aesthetics": "serene, warm, golden hour",
"lighting": "golden hour backlighting, warm atmospheric haze",
"photo": "wide angle, f/8, long exposure",
"medium": "photograph",
"color_palette": ["#FF6B35", "#F7C59F", "#004E89", "#1A659E", "#2B2D42"]
},
"compositional_deconstruction": {
"background": "A calm ocean stretching to a low horizon, sky washed in orange and pink with thin wisps of cloud.",
"elements": [
{"type": "obj", "desc": "A single sailboat with a white triangular sail, silhouetted against the setting sun."}
]
}
}
```
### Example: corporate design palette
```json
{
"high_level_description": "A clean, modern business card layout for a tech company.",
"style_description": {
"aesthetics": "minimal, professional, geometric",
"lighting": "even, diffuse studio lighting",
"medium": "graphic_design",
"art_style": "flat vector design, generous whitespace, sans-serif typography",
"color_palette": ["#FFFFFF", "#F0F0F0", "#333333", "#0066FF", "#00CC88"]
},
"compositional_deconstruction": {
"background": "A solid off-white card surface with subtle paper texture.",
"elements": [
{"type": "text", "text": "ACME TECH", "desc": "Bold dark grey sans-serif company name across the upper third of the card."},
{"type": "text", "text": "hello@acme.tech", "desc": "Small blue sans-serif contact email near the bottom of the card."}
]
}
}
```
## Full example
```json
{
"high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
"style_description": {
"aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
"lighting": "overcast daylight, diffused, soft subtle shadows",
"photo": "shallow depth of field, sharp focus, eye-level, telephoto",
"medium": "photograph"
},
"compositional_deconstruction": {
"background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
"elements": [
{"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
{"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
{"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
{"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
{"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
{"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
{"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
{"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
{"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
{"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
{"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
{"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
{"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
{"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
{"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
{"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
{"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
]
}
}
```
## Safety filter
NSFW prompts are blocked. Instead of an image, the model returns a gray screen
with the text "Image blocked by safety filter". The safety filter's
false-positive rate is higher for prompts that are not JSON-formatted. We are
aware of this issue and may release a future checkpoint update to improve it.
# Congratulations!
You are now a certified Ideogram 4 prompter!
With structured JSON captions, you have fine-grained control over composition,
color palettes, typography, and spatial layout — capabilities that go far
beyond what plain-text prompts can express!
We'd love to see what you create :-)
Share your results, experiments, and creative discoveries with the community,
especially the unexpected ones. Tag us on social media or open a discussion on
the repo. Happy generating!
BIN
View File
Binary file not shown.


File diff suppressed because it is too large Load Diff
File diff suppressed because it is too large Load Diff
BIN
View File
Binary file not shown.
