Initial commit: Ideogram 4 Prompt Builder
PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas, palette editor, presets, prompt library with previews, localisation (en/ru), light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@@ -0,0 +1,15 @@
# Python
__pycache__/
*.py[cod]

# Generated at runtime / regenerated from code on first launch
translations.json

# User-specific and runtime state (not part of the application source)
comfy_settings.json
draft.json
prompt_library.json
prompt_previews/

# Editor / agent
.claude/
@@ -0,0 +1,159 @@
|
|||||||
|
# Ideogram 4 Prompt Builder
|
||||||
|
|
||||||
|
**English** · [Русский](#ideogram-4-prompt-builder-ru)
|
||||||
|
|
||||||
|
A desktop GUI (PyQt6) for building structured JSON captions for **Ideogram 4** and ComfyUI workflows, with a prompt library, reference-image canvas, localisation, light/dark themes, and direct generation through a ComfyUI server.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Run

```powershell
python ideogram_prompt_builder.py
```

Requires `PyQt6` (no other third-party dependencies):

```powershell
pip install PyQt6
```
|
||||||
|
|
||||||
|
## What it builds
|
||||||
|
|
||||||
|
Prompts follow the schema from `docs/prompting.md`:
|
||||||
|
|
||||||
|
- `high_level_description`
|
||||||
|
- `style_description` with either `photo` or `art_style`
|
||||||
|
- `compositional_deconstruction.background`
|
||||||
|
- `compositional_deconstruction.elements`
|
||||||
|
- optional uppercase HEX color palettes
|
||||||
|
- optional bounding boxes in normalized `0-1000` coordinates
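
As a rough illustration of the fields listed above, a caption can be assembled as a plain Python dict and serialized with `json`. The placement of `colour_palette` inside `style_description`, the per-element keys `description` and `bbox`, and the bbox ordering are assumptions made for this sketch; `docs/prompting.md` remains the authoritative schema.

```python
import json

# Sketch only: nesting and per-element key names are assumptions;
# see docs/prompting.md for the real schema.
caption = {
    "high_level_description": "A ginger cat reading a spellbook by candlelight.",
    "style_description": {
        # Exactly one of "photo" or "art_style" is expected.
        "art_style": "storybook watercolor illustration",
        # Optional palette: uppercase HEX values.
        "colour_palette": ["#F4A261", "#264653", "#E9C46A"],
    },
    "compositional_deconstruction": {
        "background": "A dim wooden study lined with bookshelves.",
        "elements": [
            {
                "description": "Ginger cat wearing a tiny wizard hat",
                # Assumed order: [x_min, y_min, x_max, y_max], normalized to 0-1000.
                "bbox": [250, 300, 750, 950],
            }
        ],
    },
}

print(json.dumps(caption, indent=2))
```

The builder produces this kind of object from its form fields, so the sketch is mainly useful for checking what a finished caption looks like.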
|
||||||
|
|
||||||
|
Actions live in a menu bar (**File / Edit / Library / ComfyUI / View**) plus a slim toolbar (Generate, Undo/Redo, Save to library, Library, Copy) and the language/theme controls on the right. The right-hand panel is tabbed: **JSON** (output + validation) and **Result** (the generated image).
|
||||||
|
|
||||||
|
## Editing
|
||||||
|
|
||||||
|
- Move and resize layout boxes directly with the mouse on the bbox canvas.
|
||||||
|
- Palette fields accept comma-separated HEX, clickable swatches and a popup color picker, with a live `n/limit` counter and invalid-color highlighting.
|
||||||
|
- **Undo / Redo** (`Ctrl+Z` / `Ctrl+Y`).
|
||||||
|
- **Duplicate**, **reorder** (up/down) and add elements from **templates** (Character / Title text / Background object).
|
||||||
|
- The validation list is clickable — clicking an element-specific message selects that element.
|
||||||
|
- Text fields have a right-click translation menu (`Translate to RU` / `Translate to EN`, results cached).
|
||||||
|
- Work is autosaved to `draft.json`; on the next launch you are offered to restore it.
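
The palette behaviour described above (comma-separated HEX input, an `n/limit` counter, and flagging of invalid colors) can be sketched roughly as follows. The function name `parse_palette`, the six-digit-only rule, and the normalization to uppercase with a leading `#` are assumptions for illustration, not the app's actual code.

```python
import re

# Assumption: only six-digit HEX colors are accepted, with an optional '#'.
HEX_RE = re.compile(r"^#?[0-9A-Fa-f]{6}$")


def parse_palette(text: str, limit: int):
    """Split comma-separated HEX colors into (valid, invalid, counter)."""
    valid, invalid = [], []
    for raw in text.split(","):
        token = raw.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        if HEX_RE.match(token):
            # Normalize to uppercase with a leading '#'.
            valid.append("#" + token.lstrip("#").upper())
        else:
            invalid.append(token)
    counter = f"{len(valid)}/{limit}"
    return valid, invalid, counter


valid, invalid, counter = parse_palette("#ff8800, 00aaff, #12345, red", limit=5)
print(valid, invalid, counter)
```

A UI would typically use the `invalid` list to highlight the offending entries and the `counter` string for the live label.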
|
||||||
|
|
||||||
|
## Reference image & zoom
|
||||||
|
|
||||||
|
In the composition panel you can load a **reference image** (file or paste from clipboard) drawn under the bbox grid; the **grid scale** slider zooms the grid and the reference scales with it.
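
Because the grid and the reference zoom together, a box drawn on screen should map to the same normalized `0-1000` coordinates at any zoom level. A minimal sketch of that conversion, assuming a hypothetical helper `to_normalized` and an `[x_min, y_min, x_max, y_max]` ordering that the source does not confirm:

```python
def to_normalized(rect_px, canvas_px, scale=1.0):
    """Convert an on-screen pixel rect (x, y, w, h) to a 0-1000 bbox.

    canvas_px is the unzoomed canvas size; scale is the current grid zoom.
    """
    x, y, w, h = rect_px
    cw, ch = canvas_px
    zw, zh = cw * scale, ch * scale  # on-screen size after zoom

    def norm(value, size):
        # Clamp to the valid range after rounding to an integer.
        return max(0, min(1000, round(value / size * 1000)))

    return [norm(x, zw), norm(y, zh), norm(x + w, zw), norm(y + h, zh)]


same_box_1x = to_normalized((100, 50, 200, 300), (400, 400))
same_box_2x = to_normalized((200, 100, 400, 600), (400, 400), scale=2.0)
print(same_box_1x, same_box_2x)
```

Both calls describe the same region, once at 1x and once at 2x zoom, so they yield the same normalized box.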
|
||||||
|
|
||||||
|
## Prompt library
|
||||||
|
|
||||||
|
The **Library** menu saves the current caption (optionally with a preview image), updates the entry you loaded from, and opens the library browser, where you can:
|
||||||
|
|
||||||
|
- search by name / tag / description and edit per-entry **tags**;
|
||||||
|
- load any saved prompt back into the editor for reuse and editing;
|
||||||
|
- attach a preview from a file or **paste it from the clipboard**, or remove it;
|
||||||
|
- rename, delete entries, and view the preview + summary;
|
||||||
|
- **export / import** the whole library (prompts + previews) as a single `.zip`.
|
||||||
|
|
||||||
|
The library is stored in `prompt_library.json` next to the app, with preview images in `prompt_previews/` (created on first save).
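
For readers curious how a single-file export of this layout might work, here is a minimal sketch using the standard `zipfile` module. The function name `export_library`, the archive layout, and the placeholder JSON content are assumptions; the app's real export format may differ.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def export_library(app_dir: Path, archive: Path) -> list[str]:
    """Pack prompt_library.json and prompt_previews/ into one zip.

    Returns the member names written, in order.
    """
    names = []
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        library = app_dir / "prompt_library.json"
        if library.exists():
            zf.write(library, library.name)
            names.append(library.name)
        previews = app_dir / "prompt_previews"
        if previews.is_dir():
            for path in sorted(previews.iterdir()):
                if path.is_file():
                    arcname = f"prompt_previews/{path.name}"
                    zf.write(path, arcname)
                    names.append(arcname)
    return names


# Demo on a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "prompt_library.json").write_text(json.dumps({"entries": []}), encoding="utf-8")
    (root / "prompt_previews").mkdir()
    (root / "prompt_previews" / "cat.png").write_bytes(b"not a real image")
    members = export_library(root, root / "library.zip")
    print(members)
```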
|
||||||
|
|
||||||
|
## ComfyUI integration
|
||||||
|
|
||||||
|
The **ComfyUI** menu connects the builder to a running ComfyUI server:
|
||||||
|
|
||||||
|
- **ComfyUI settings** — host, port and HTTPS, with a *Test connection* button. Stored in `comfy_settings.json`.
|
||||||
|
- **Check ComfyUI** — verifies that every model, sampler and custom node the bundled `ideogram4NSFWComfyui_v11.json` workflow needs is installed on the server, and lists anything missing.
|
||||||
|
- **Generate in ComfyUI** — converts the bundled workflow to API format, injects the current compact JSON caption, submits it and retrieves the generated image. The result appears in the **Result** tab and can be saved to a file or into the library.
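
The HTTP side of these actions can be sketched with the standard library. ComfyUI queues a workflow via `POST /prompt` and lists installed node types via `GET /object_info`; everything else here (the helper names, the stand-in workflow, and the node name `HypotheticalIdeogramNode`) is an assumption for illustration, and the app's own workflow conversion and caption injection are not shown.

```python
import json
import urllib.request


def base_url(host: str, port: int, https: bool = False) -> str:
    """Build the server root from the host / port / HTTPS settings."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}"


def build_prompt_request(base: str, api_workflow: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /prompt request that queues a workflow."""
    body = json.dumps({"prompt": api_workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def missing_node_types(api_workflow: dict, object_info: dict) -> list[str]:
    """Node class types used by the workflow but absent from GET /object_info."""
    used = {node["class_type"] for node in api_workflow.values()}
    return sorted(used - set(object_info))


# Tiny stand-in workflow in API format; the second node name is a placeholder.
workflow = {
    "1": {"class_type": "CLIPTextEncode", "inputs": {"text": "{}"}},
    "2": {"class_type": "HypotheticalIdeogramNode", "inputs": {}},
}
req = build_prompt_request(base_url("127.0.0.1", 8188), workflow)
print(req.get_method(), req.full_url)
print(missing_node_types(workflow, {"CLIPTextEncode": {}}))
# To actually queue the job, pass req to urllib.request.urlopen()
# against a running ComfyUI server.
```

Keeping request construction separate from sending makes the logic easy to check without a server.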
|
||||||
|
|
||||||
|
## Appearance & localisation
|
||||||
|
|
||||||
|
- **Theme** (View menu) toggles a light / dark theme.
|
||||||
|
- The interface language is switched at runtime from the **Language** selector; the default is **English**.
|
||||||
|
|
||||||
|
UI strings are loaded from `translations.json`, created on first run from bundled `en` / `ru` translations. To add a language, add a top-level key with the same string keys (and optionally a display name in `LANGUAGE_NAMES`). Missing keys fall back to English then to the key name. Theme and language are saved in `comfy_settings.json`.
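
The fallback order described above (requested language, then English, then the key itself) can be sketched as below. The function name `tr`, the sample keys, and the in-memory table are assumptions; in the app the table would come from `translations.json`.

```python
# Sample table with the same shape as described: one top-level key per language.
TRANSLATIONS = {
    "en": {"menu.file": "File", "menu.edit": "Edit"},
    "ru": {"menu.file": "Файл"},
}


def tr(key: str, lang: str, table: dict = TRANSLATIONS) -> str:
    """Look up key in lang, then in English, then fall back to the key name."""
    for code in (lang, "en"):
        value = table.get(code, {}).get(key)
        if value is not None:
            return value
    return key


print(tr("menu.file", "ru"))  # translated
print(tr("menu.edit", "ru"))  # missing in ru, falls back to English
print(tr("menu.view", "ru"))  # missing everywhere, falls back to the key
```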
|
||||||
|
|
||||||
|
## Compact JSON for ComfyUI
|
||||||
|
|
||||||
|
The output can be copied in pretty or compact form. Compact JSON matches the recommended serialization style for inference and can be pasted into the Ideogram 4 prompt field in ComfyUI.
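
The difference between the two forms is easy to reproduce with the standard `json` module. This sketch assumes that "compact" means no whitespace between tokens and that non-ASCII text is kept as-is (`ensure_ascii=False`); the app's exact serialization settings are not stated in this README.

```python
import json

caption = {
    "high_level_description": "A red kite over dunes",
    "style_description": {"photo": "35mm film"},
}

# Pretty form: indented, easy to read and diff.
pretty = json.dumps(caption, indent=2, ensure_ascii=False)

# Compact form: no spaces after ',' or ':', single line.
compact = json.dumps(caption, separators=(",", ":"), ensure_ascii=False)

print(pretty)
print(compact)
```

Both strings decode to the same object; only the whitespace differs.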
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
<a name="ideogram-4-prompt-builder-ru"></a>
|
||||||
|
|
||||||
|
# Ideogram 4 Prompt Builder (RU)
|
||||||
|
|
||||||
|
[English](#ideogram-4-prompt-builder) · **Русский**
|
||||||
|
|
||||||
|
Десктопное GUI-приложение (PyQt6) для сборки структурированных JSON-промтов для **Ideogram 4** и ComfyUI: с библиотекой промтов, холстом с референс-изображением, локализацией, светлой/тёмной темой и прямой генерацией через сервер ComfyUI.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Запуск
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
python ideogram_prompt_builder.py
|
||||||
|
```
|
||||||
|
|
||||||
|
Нужен только `PyQt6` (других сторонних зависимостей нет):
|
||||||
|
|
||||||
|
```powershell
|
||||||
|
pip install PyQt6
|
||||||
|
```
|
||||||
|
|
||||||
|
## Что собирается
|
||||||
|
|
||||||
|
Промты соответствуют схеме из `docs/prompting.md`:
|
||||||
|
|
||||||
|
- `high_level_description`
|
||||||
|
- `style_description` с одним из `photo` или `art_style`
|
||||||
|
- `compositional_deconstruction.background`
|
||||||
|
- `compositional_deconstruction.elements`
|
||||||
|
- опциональные палитры HEX в верхнем регистре
|
||||||
|
- опциональные bbox в нормализованных координатах `0-1000`
|
||||||
|
|
||||||
|
Действия вынесены в меню (**Файл / Правка / Библиотека / ComfyUI / Вид**) плюс компактная панель инструментов (Сгенерировать, Отменить/Повторить, Сохранить в библиотеку, Библиотека, Копировать) и переключатели языка/темы справа. Правая панель — вкладки: **JSON** (вывод + валидация) и **Результат** (сгенерированное изображение).
|
||||||
|
|
||||||
|
## Редактирование
|
||||||
|
|
||||||
|
- Перемещайте и масштабируйте рамки прямо мышью на холсте bbox.
|
||||||
|
- Поля палитры принимают HEX через запятую, кликабельные образцы и всплывающий выбор цвета, со счётчиком `n/лимит` и подсветкой некорректных цветов.
|
||||||
|
- **Отмена / Повтор** (`Ctrl+Z` / `Ctrl+Y`).
|
||||||
|
- **Дублирование**, **изменение порядка** (вверх/вниз) и добавление элементов из **шаблонов** (Персонаж / Заголовок / Фоновый объект).
|
||||||
|
- Список валидации кликабельный — клик по сообщению об элементе выделяет этот элемент.
|
||||||
|
- У текстовых полей есть контекстное меню перевода (`Перевести на RU` / `Перевести на EN`, с кэшированием).
|
||||||
|
- Работа автосохраняется в `draft.json`; при следующем запуске предлагается восстановить черновик.
|
||||||
|
|
||||||
|
## Референс-изображение и масштаб
|
||||||
|
|
||||||
|
В панели композиции можно загрузить **референс-изображение** (из файла или вставить из буфера), которое рисуется под сеткой bbox; ползунок **масштаба сетки** увеличивает сетку, и референс масштабируется вместе с ней.
|
||||||
|
|
||||||
|
## Библиотека промтов
|
||||||
|
|
||||||
|
Меню **Библиотека** сохраняет текущий промт (по желанию с превью), обновляет загруженную запись и открывает браузер библиотеки, где можно:
|
||||||
|
|
||||||
|
- искать по имени / тегам / описанию и редактировать **теги** записи;
|
||||||
|
- загрузить любой сохранённый промт обратно в редактор для повторного использования и правки;
|
||||||
|
- прикрепить превью из файла или **вставить из буфера обмена**, либо убрать его;
|
||||||
|
- переименовывать, удалять записи и просматривать превью + сводку;
|
||||||
|
- **экспортировать / импортировать** всю библиотеку (промты + превью) одним `.zip`.
|
||||||
|
|
||||||
|
Библиотека хранится в `prompt_library.json` рядом с приложением, превью — в `prompt_previews/` (создаются при первом сохранении).
|
||||||
|
|
||||||
|
## Интеграция с ComfyUI
|
||||||
|
|
||||||
|
Меню **ComfyUI** связывает приложение с запущенным сервером ComfyUI:
|
||||||
|
|
||||||
|
- **Настройки ComfyUI** — хост, порт и HTTPS, с кнопкой *Проверить соединение*. Хранятся в `comfy_settings.json`.
|
||||||
|
- **Проверить ComfyUI** — проверяет, что все модели, семплеры и кастомные ноды, нужные встроенному workflow `ideogram4NSFWComfyui_v11.json`, установлены на сервере, и перечисляет отсутствующие.
|
||||||
|
- **Сгенерировать в ComfyUI** — конвертирует встроенный workflow в API-формат, подставляет текущий compact JSON, отправляет запрос и получает изображение. Результат показывается во вкладке **Результат** и может быть сохранён в файл или в библиотеку.
|
||||||
|
|
||||||
|
## Внешний вид и локализация
|
||||||
|
|
||||||
|
- **Тема** (меню Вид) переключает светлую / тёмную тему.
|
||||||
|
- Язык интерфейса переключается на лету через селектор **Язык**; по умолчанию — английский.
|
||||||
|
|
||||||
|
Строки интерфейса берутся из `translations.json`, который создаётся при первом запуске из встроенных переводов `en` / `ru`. Чтобы добавить язык, добавьте ключ верхнего уровня с тем же набором строк (и при желании отображаемое имя в `LANGUAGE_NAMES`). Отсутствующие ключи откатываются к английскому, затем к самому ключу. Тема и язык сохраняются в `comfy_settings.json`.
|
||||||
|
|
||||||
|
## Compact JSON для ComfyUI
|
||||||
|
|
||||||
|
Вывод можно скопировать в pretty- или compact-виде. Compact JSON соответствует рекомендованной сериализации для инференса и вставляется в поле промта Ideogram 4 в ComfyUI.
|
||||||
@@ -0,0 +1,336 @@
|
|||||||
|
<p align="center"><a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><picture>
|
||||||
|
<source media="(prefers-color-scheme: dark)" srcset="assets/ideogram_logo_darkmode.svg">
|
||||||
|
<source media="(prefers-color-scheme: light)" srcset="assets/ideogram_logo.svg">
|
||||||
|
<img src="assets/ideogram_logo.svg" alt="Ideogram" width="500">
|
||||||
|
</picture></a></p>
|
||||||
|
|
||||||
|
<p align="center"><em>Ideogram 4: Open image model at the forefront of design</em></p>
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<a href="https://ideogram.ai/blog/ideogram-4.0/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Blog-Post-orange" alt="Blog Post"></a>
|
||||||
|
<a href="https://github.com/ideogram-oss/ideogram4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github" alt="Code"></a>
|
||||||
|
<a href="https://huggingface.co/collections/ideogram-ai/ideogram-4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Model-HuggingFace-blue?logo=huggingface" alt="Model"></a>
|
||||||
|
<a href="https://developer.ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/API-developer.ideogram.ai-purple" alt="API"></a>
|
||||||
|
<a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Official%20Site-ideogram.ai-ff69b4" alt="Official Site"></a>
|
||||||
|
</p>
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/samples/collage_landscape.jpg" alt="A collage of Ideogram 4 samples spanning photorealism, illustration, typography, and poster design">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
|
||||||
|
Ideogram 4 is **[Ideogram](https://ideogram.ai)'s first open-weight text-to-image model**. It is a **state-of-the-art foundation model trained from scratch** — not a fine-tune of any existing model. It introduces a new structured JSON prompting interface, with best-in-class multilingual text rendering, deep language understanding, explicit bounding-box layout and color-palette controls, and native 2k resolution images. The easiest way to try the model is online at **[ideogram.ai](https://ideogram.ai/)**.
|
||||||
|
|
||||||
|
We believe openness drives innovation, and we invite the research community to innovate with us on the forefront of visual intelligence.
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
|
||||||
|
1. [News](#news)
|
||||||
|
2. [Model Zoo](#model-zoo)
|
||||||
|
3. [Performance](#performance)
|
||||||
|
4. [Quick Start](#quick-start)
|
||||||
|
5. [Model Summary](#model-summary)
|
||||||
|
6. [Prompting Guide](#prompting-guide)
|
||||||
|
7. [Documentation](#documentation)
|
||||||
|
8. [Citation](#citation)
|
||||||
|
|
||||||
|
## News
|
||||||
|
|
||||||
|
* **[2026-06-03]** **Ideogram 4 released!** Inference code and weights are now public, and our [technical blog post](https://ideogram.ai/blog/ideogram-4.0/) is live. See the [Quick Start](#quick-start) section to generate your first image, or try the model online at [ideogram.ai](https://ideogram.ai/).
|
||||||
|
|
||||||
|
## Model Zoo
|
||||||
|
|
||||||
|
| Model | Params | Weight Quantization | Supported Hardware | Diffusers Support | License |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **[Ideogram 4 (nf4)](https://huggingface.co/ideogram-ai/ideogram-4-nf4)** | 9.3B | nf4 | CUDA | Yes | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
| **[Ideogram 4 (fp8)](https://huggingface.co/ideogram-ai/ideogram-4-fp8)** | 9.3B | fp8 | All | No | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
|
||||||
|
|
||||||
|
We plan to support more quantizations in the future.
|
||||||
|
|
||||||
|
|
||||||
|
## Performance
|
||||||
|
|
||||||
|
We evaluate Ideogram 4 across third-party arenas and benchmarks, standard
|
||||||
|
open-source benchmarks, and our own internal human-preference benchmark. Across
|
||||||
|
all of them, **Ideogram 4 is the best open-weight image model by far, and sits
|
||||||
|
at the frontier of design.**
|
||||||
|
|
||||||
|
### Design Arena
|
||||||
|
|
||||||
|
[Design Arena](https://www.designarena.ai/) is a third-party image Elo
|
||||||
|
leaderboard focused specifically on design-oriented generation. On the overall
|
||||||
|
board, Ideogram 4 is the top-ranked open-weight model, trailing only proprietary
|
||||||
|
GPT and Gemini models:
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/design_arena.png" alt="Design Arena overall image Elo leaderboard with Ideogram 4.0 as the top open-weight model">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
Filtered to open-weight models only, Ideogram 4 leads by a commanding margin,
|
||||||
|
well ahead of the next-best open model:
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/design_arena2.png" alt="Design Arena open-weight image Elo leaderboard, with Ideogram 4.0 well ahead of all other open models">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
### ContraLabs
|
||||||
|
|
||||||
|
[ContraLabs](https://contralabs.com/research) ran a blind typography evaluation judged by
|
||||||
|
ten professional designers from Contra's top-earning talent. Ideogram 4 leads on
|
||||||
|
first-place win rate, picked as the best of four models 47.9% of the time
|
||||||
|
overall — well ahead of Gemini 3.1 Flash Image Preview (Nano Banana 2) at 30.0%,
|
||||||
|
FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%):
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/contralabs_typography.png" alt="ContraLabs typography first-place win rate, with Ideogram v4 leading">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
It also wins on practical usability: asked "Would you use this in real client
|
||||||
|
work?", the same designers rated Ideogram 4 highest at 3.55 / 5 — significantly
|
||||||
|
above Nano Banana 2 (2.84), Grok Imagine 1.0 (2.61), and FLUX.2 [max] (2.49):
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/contralabs_typography2.png" alt="ContraLabs 'would you use this in real client work?' rating, with Ideogram v4 leading">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
### LMArena
|
||||||
|
|
||||||
|
On [LMArena](https://lmarena.ai/), a third-party text-to-image leaderboard that
|
||||||
|
measures general-purpose text-to-image use cases, Ideogram is the top-ranked
|
||||||
|
open-weight lab and a top-5 image generation lab overall — beaten only by giant
|
||||||
|
companies with vastly larger budgets and resources:
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/lmarena_benchmark.png" alt="LMArena text-to-image lab leaderboard with Ideogram">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
### Ideogram internal eval
|
||||||
|
|
||||||
|
For our internal human-preference benchmark, focused on graphic design and
|
||||||
|
photography, we had graphic designers deeply familiar with professional design
|
||||||
|
work do the rating blind. Bradley-Terry scores rank Ideogram 4 #2 overall —
|
||||||
|
behind only GPT Image 2 medium — and the top open-weight model:
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/ideogram_benchmark.png" alt="Ideogram internal design leaderboard with Ideogram 4.0">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
### Open-source benchmarks
|
||||||
|
|
||||||
|
On standard open-source benchmarks measuring core capabilities — layout control
|
||||||
|
(7Bench), spatial reasoning and object fidelity (SpatialGenEval), text rendering
|
||||||
|
(X-Omni OCR), and prompt alignment (Prism) — Ideogram 4 closes the gap to the
|
||||||
|
leading closed-source models across every axis. On layout control (7Bench), it
|
||||||
|
is significantly better than all closed-source models:
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/opensource.png" alt="Five-axis capability radar comparing Ideogram 4.0 to leading closed-source models on layout control, spatial reasoning, object fidelity, prompt alignment, and text rendering">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
At 9.3B parameters, Ideogram 4 delivers the best text rendering of any open-weight
|
||||||
|
release we benchmarked — ahead of much larger models like Qwen-Image (20B),
|
||||||
|
FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE):
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="assets/benchmarks/opensource2.png" alt="Parameter-efficiency scatter plot showing Ideogram 4.0 at 9.3B parameters leading all other open-weight models on text rendering">
|
||||||
|
</p>
|
||||||
|
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
|
### Install
|
||||||
|
|
||||||
|
```bash
pip install .
```

If you plan to modify the code, install in editable mode instead so changes under `src/ideogram4/` take effect without reinstalling:

```bash
pip install -e .
```
|
||||||
|
|
||||||
|
### Model access
|
||||||
|
|
||||||
|
The model weights are **gated** on Hugging Face, so you must accept the gate and
|
||||||
|
authenticate before the code can download them — otherwise the download fails
|
||||||
|
with a `404` / `GatedRepoError`.
|
||||||
|
|
||||||
|
1. Open the model page, [ideogram-ai/ideogram-4-nf4](https://huggingface.co/ideogram-ai/ideogram-4-nf4) (or [ideogram-ai/ideogram-4-fp8](https://huggingface.co/ideogram-ai/ideogram-4-fp8)), and click **Agree and access repository** to accept the license gate.
2. Create a Hugging Face access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and log in so the download is authenticated:

   ```bash
   hf auth login
   ```

   Alternatively, export the token directly: `export HF_TOKEN="hf_..."`.
|
||||||
|
|
||||||
|
### CLI
|
||||||
|
|
||||||
|
The plain `--prompt` is rewritten into the structured JSON caption the model expects by a "magic prompt" LLM. By default this uses Ideogram's hosted magic-prompt API, which is **free** and does the expansion server-side (no local model or system prompt needed). It reads `IDEOGRAM_API_KEY`; get a key at https://developer.ideogram.ai/:

```bash
python run_inference.py \
  --prompt "a ginger cat wearing a tiny wizard hat reading a spellbook" \
  --output out.png \
  --quantization "nf4" \
  --magic-prompt-key "$IDEOGRAM_API_KEY"
```
|
||||||
|
|
||||||
|
You can also run the expansion through your own LLM provider: one of our magic-prompt system prompts is **open source**. See the [Prompting Guide](docs/prompting.md#magic-prompt) for details.
|
||||||
|
|
||||||
|
For the highest-quality images, set `--height 2048 --width 2048` and
|
||||||
|
`--sampler-preset V4_QUALITY_48`.
|
||||||
|
|
||||||
|
#### Safety screening with Hive
|
||||||
|
|
||||||
|
Prompt and output safety screening is performed via [Hive](https://thehive.ai/).
|
||||||
|
Sign up and create a Text Moderation key and a Visual Content Moderation key,
|
||||||
|
then export them as `HIVE_TEXT_MODERATION_KEY` and `HIVE_VISUAL_MODERATION_KEY`
|
||||||
|
(or pass them via `--hive-text-key` / `--hive-visual-key`).
|
||||||
|
|
||||||
|
```bash
python run_inference.py \
  --prompt "an isometric illustration of a tiny city floating in the clouds" \
  --output out.png \
  --quantization "nf4" \
  --magic-prompt-key "$IDEOGRAM_API_KEY" \
  --hive-text-key "$HIVE_TEXT_MODERATION_KEY" \
  --hive-visual-key "$HIVE_VISUAL_MODERATION_KEY"
```
|
||||||
|
|
||||||
|
For sampler presets, parameter reference, and optimization tips, see
|
||||||
|
[docs/inference.md](docs/inference.md).
|
||||||
|
|
||||||
|
## Model Summary
|
||||||
|
|
||||||
|
Ideogram 4 is a **foundation model trained entirely from scratch**, not a
|
||||||
|
fine-tune or distillation of any existing checkpoint. It is a flow-matching
|
||||||
|
text-to-image model built on a **fully single-stream** Diffusion Transformer
|
||||||
|
(DiT) architecture.
|
||||||
|
|
||||||
|
**Architecture:**
|
||||||
|
- **Fully single-stream DiT.** Text and image tokens are concatenated into one
|
||||||
|
unified sequence and processed through the same 34-layer transformer, with no
|
||||||
|
separate text or image branches. This enables deep cross-modal interaction at
|
||||||
|
every layer.
|
||||||
|
- **Vision-language model as text encoder.** Instead of a text-only encoder
|
||||||
|
like CLIP or T5, Ideogram 4 uses
|
||||||
|
[Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct),
|
||||||
|
a full vision-language model that provides far richer understanding of visual
|
||||||
|
concepts. Hidden states are extracted from **13 intermediate layers** and
|
||||||
|
concatenated, giving the model multi-scale semantic features ranging from
|
||||||
|
surface-level token information to deep compositional understanding.
|
||||||
|
- **Dual-branch classifier-free guidance.** The conditional (positive) and
|
||||||
|
unconditional (negative) branches can be independently refined, enabling
|
||||||
|
separate control over prompt adherence and image quality.
|
||||||
|
- **Flexible resolution.** Native support for any resolution from 256 to 2048
|
||||||
|
(multiples of 16), with aspect ratios up to 6:1. A single model handles
|
||||||
|
everything from square thumbnails to ultrawide banners, with the noise
|
||||||
|
schedule auto-adjusting per resolution.
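
The resolution constraints in the last bullet (each side from 256 to 2048, multiples of 16, aspect ratio up to 6:1) can be checked before launching a run. This is a standalone sketch with a hypothetical helper name, `check_resolution`; it is not the pipeline's own validation code.

```python
def check_resolution(height: int, width: int) -> list[str]:
    """Return a list of problems; an empty list means the size looks valid."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if not 256 <= value <= 2048:
            problems.append(f"{name}={value} is outside 256-2048")
        if value % 16 != 0:
            problems.append(f"{name}={value} is not a multiple of 16")
    # Aspect ratio limit applies in both orientations (6:1 and 1:6).
    if max(height, width) > 6 * min(height, width):
        problems.append("aspect ratio exceeds 6:1")
    return problems


print(check_resolution(1024, 1024))
print(check_resolution(256, 2048))
print(check_resolution(1000, 512))
```

Note that 256 by 2048 passes the per-side checks but is rejected on aspect ratio, since 8:1 is wider than the stated limit.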
|
||||||
|
|
||||||
|
**Key Capabilities:**
|
||||||
|
- **Extreme controllability.** Ideogram 4 is trained on structured JSON
|
||||||
|
captions, giving users unprecedented control over composition, style,
|
||||||
|
lighting, color palette, typography, and spatial layout, all from a single
|
||||||
|
prompt.
|
||||||
|
- **State-of-the-art text rendering.** Ideogram 4 delivers best-in-class
|
||||||
|
in-image text generation (signage, logos, captions, watermarks, multi-line
|
||||||
|
text) with high fidelity directly from the prompt.
|
||||||
|
- **Spatial layout control.** Bounding-box coordinates in the prompt allow
|
||||||
|
explicit placement of subjects, text elements, and background regions.
|
||||||
|
- **Color palette conditioning.** Specify hex colors in the prompt to steer the
|
||||||
|
image's dominant color scheme.
|
||||||
|
|
||||||
|
For full architecture details, see
|
||||||
|
[docs/model_architecture.md](docs/model_architecture.md). For a walkthrough of
|
||||||
|
how the pipeline components fit together, see
|
||||||
|
[docs/pipeline.md](docs/pipeline.md).
|
||||||
|
|
||||||
|
## Prompting Guide
|
||||||
|
|
||||||
|
Ideogram 4 is trained exclusively on **structured JSON captions**. While
|
||||||
|
plain-text prompts work, you will get the best results by providing a JSON
|
||||||
|
object that follows our caption schema.
|
||||||
|
|
||||||
|
|
||||||
|
Key points:
|
||||||
|
|
||||||
|
- **Use JSON prompts** for maximum controllability — the model was trained on
|
||||||
|
them and understands the structure natively.
|
||||||
|
- **Color palette conditioning** — specify a `colour_palette` array of hex
|
||||||
|
colors in the style description to steer the image's color scheme.
|
||||||
|
- **Aspect ratio flexibility** — Ideogram 4 supports a wide range of aspect
|
||||||
|
ratios (any multiple-of-16 resolution from 256 to 2048 on each side). This
|
||||||
|
is a key advantage for practical use: portraits, landscapes, banners,
|
||||||
|
phone wallpapers, social media formats, etc.
|
||||||
|
- **Bounding-box layout** — specify `bbox` coordinates in the prompt to
|
||||||
|
explicitly place subjects, text elements, and background regions.
|
||||||
|
- **Compositional control** — use `compositional_deconstruction` with bounding
|
||||||
|
boxes and per-element descriptions for precise spatial layout.
|
||||||
|
|
||||||
|
|
||||||
|
**Why JSON-only training?** We train exclusively on JSON so that training
|
||||||
|
and inference share a single, common prompt format. The training captions themselves are deliberately
|
||||||
|
**extremely descriptive**: each JSON exhaustively describes everything in
|
||||||
|
the image to maximize training efficiency. The more
|
||||||
|
text-to-image relationships each caption pins down, the more grounded
|
||||||
|
supervision the model extracts from a single training pair, rather than
|
||||||
|
having to infer those relationships across many sparsely-captioned samples.
|
||||||
|
|
||||||
|
**Why JSON at inference time?** Because the model was trained on captions
|
||||||
|
that name every object explicitly, the most reliable way to get every
|
||||||
|
requested object rendered is to mirror that pattern. Plain-text prompts still work, but
|
||||||
|
won't perform as well since the model was only trained on structured JSON captions.
|
||||||
|
|
||||||
|
**Don't want to write JSON by hand?** That's what *magic prompt* is for: it uses
|
||||||
|
an LLM to expand a plain-text prompt into a full structured caption before
|
||||||
|
generation, so you get JSON-quality results from a casual prompt. It runs by
|
||||||
|
default in `run_inference.py` (see the [CLI](#cli) section).
|
||||||
|
|
||||||
|
See [docs/prompting.md](docs/prompting.md) for a full guide.
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
| Document | Description |
| :------- | :---------- |
| [docs/prompting.md](docs/prompting.md) | How to write JSON prompts, color palette conditioning, aspect ratios |
| [docs/inference.md](docs/inference.md) | Sampler presets, parameter reference, resolutions, optimization tips |
| [docs/model_architecture.md](docs/model_architecture.md) | Architecture diagram, DiT spec, component details |
| [docs/pipeline.md](docs/pipeline.md) | Conceptual pipeline walkthrough: how all components fit together |
| [docs/development.md](docs/development.md) | Dev setup, pre-commit hooks, contributing |
| [docs/safety.md](docs/safety.md) | Pre-training, post-training, and inference-time safety mitigations; how to report violations |
|
||||||
|
|
||||||
|
## Citation
|
||||||
|
|
||||||
|
If you find the provided code or models useful for your research, consider citing them as:
|
||||||
|
|
||||||
|
|
||||||
|
```bibtex
@misc{ideogram-4-2026,
  author={Ideogram AI},
  title={{Ideogram 4}},
  year={2026},
  howpublished={\url{https://ideogram.ai/blog/ideogram-4.0/}},
}
```
|
||||||
|
|
||||||
|
## We're Hiring!
|
||||||
|
|
||||||
|
We're looking for **Research Scientists** and **Research Engineers** to work on next-generation generative models and the products built on top of them. Interested candidates, please apply at https://jobs.ashbyhq.com/ideogram.
|
||||||
@@ -0,0 +1,58 @@
|
|||||||
|
# Development
|
||||||
|
|
||||||
|
## Editable install
|
||||||
|
|
||||||
|
We recommend installing into an isolated environment, since the dependencies include several GB of CUDA-built wheels.

```bash
python -m venv .venv && source .venv/bin/activate
```

For development, install the package in editable mode so changes to the source tree are picked up without reinstalling:

```bash
pip install -e .
```

or with [`uv`](https://docs.astral.sh/uv/):

```bash
uv venv && source .venv/bin/activate
uv pip install -e .
```
|
||||||
|
|
||||||
|
## Pre-commit hooks
|
||||||
|
|
||||||
|
This repo uses [pre-commit](https://pre-commit.com/) to run lint, format, and type checks (`ruff`, `mypy`, etc.) before each commit.

Install once per clone:

```bash
pip install pre-commit
pre-commit install
```

`pre-commit install` registers a git hook in `.git/hooks/pre-commit`, so it requires the directory to be a git repo. After that, the hooks run automatically on `git commit` against staged files.

To run the hooks manually against every file in the repo (useful right after the first install, or in CI):

```bash
pre-commit run --all-files
```

The first run downloads each hook's environment (ruff, mypy, etc.) into `~/.cache/pre-commit/` and may take a minute. Subsequent runs are fast.

To bump pinned hook versions in `.pre-commit-config.yaml`:

```bash
pre-commit autoupdate
```
|
||||||
@@ -0,0 +1,63 @@
|
|||||||
|
# Inference Reference
|
||||||
|
|
||||||
|
Detailed parameters, sampler presets, supported resolutions, and optimization
|
||||||
|
tips for Ideogram 4 inference.
|
||||||
|
|
||||||
|
## Sampler Presets
|
||||||
|
|
||||||
|
Named presets bundle a step count, per-step CFG schedule, schedule mean (`mu`),
|
||||||
|
and schedule standard deviation (`std`) into a single flag:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python run_inference.py \
|
||||||
|
--prompt "a cat wearing a tiny top hat" \
|
||||||
|
--sampler-preset V4_QUALITY_48 \
|
||||||
|
--output out.png
|
||||||
|
```
|
||||||
|
|
||||||
|
| Preset | Steps | CFG schedule | `mu` | `std` |
|
||||||
|
| :----- | :---: | :----------- | :--: | :---: |
|
||||||
|
| `V4_QUALITY_48` | 48 | 45 steps @ gw=7, then 3 polish steps @ gw=3 | 0.0 | 1.5 |
|
||||||
|
| `V4_DEFAULT_20` | 20 | 18 steps @ gw=7, then 2 polish steps @ gw=3 | 0.0 | 1.75 |
|
||||||
|
| `V4_TURBO_12` | 12 | 11 steps @ gw=7, then 1 polish step @ gw=3 | 0.5 | 1.75 |
|
||||||
|
|
||||||
|
`V4_QUALITY_48` is the default. Fewer steps trade quality for speed. The full
|
||||||
|
registry lives in
|
||||||
|
[`ideogram4.sampler_configs.PRESETS`](../src/ideogram4/sampler_configs.py); add a
|
||||||
|
new entry there to define your own.
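
As a rough illustration of what a preset bundles, the sketch below rebuilds the three table rows as plain Python data. `SamplerPreset`, `make_schedule`, and `EXAMPLE_PRESETS` are hypothetical names for this example and are not taken from `sampler_configs.py`; only the numbers in the table and the loop-index ordering of the guidance schedule (index 0 is the final step) come from this document.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerPreset:
    """Hypothetical stand-in for one entry in the preset registry."""

    num_steps: int
    guidance_schedule: tuple[float, ...]  # loop-index order: index 0 = final step
    mu: float
    std: float


def make_schedule(main_steps: int, polish_steps: int,
                  main_gw: float = 7.0, polish_gw: float = 3.0) -> tuple[float, ...]:
    # The polish steps run last, so in loop-index order they take the lowest indices.
    return (polish_gw,) * polish_steps + (main_gw,) * main_steps


EXAMPLE_PRESETS = {
    "V4_QUALITY_48": SamplerPreset(48, make_schedule(45, 3), mu=0.0, std=1.5),
    "V4_DEFAULT_20": SamplerPreset(20, make_schedule(18, 2), mu=0.0, std=1.75),
    "V4_TURBO_12": SamplerPreset(12, make_schedule(11, 1), mu=0.5, std=1.75),
}

for name, preset in EXAMPLE_PRESETS.items():
    # Each schedule must supply exactly one guidance weight per sampling step.
    assert len(preset.guidance_schedule) == preset.num_steps, name
```

Whatever shape the real registry uses, a custom entry presumably has to keep the same invariant: one guidance weight per step.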
|
||||||
|
|
||||||
|
## Key Parameters
|
||||||
|
|
||||||
|
These are the keyword arguments accepted by `Ideogram4Pipeline.__call__`. The
|
||||||
|
defaults below apply when you call `pipe(...)` directly; `run_inference.py`
|
||||||
|
overrides `num_steps`, `guidance_schedule`, `mu`, and `std` from the chosen
|
||||||
|
sampler preset (see above).
|
||||||
|
|
||||||
|
| Parameter | Default | Notes |
|
||||||
|
| :-------- | :-----: | :---- |
|
||||||
|
| `height` / `width` | 1024 | Must be multiples of 16. Supported range: 256–2048. Aspect ratios up to 6:1 or 1:6. |
|
||||||
|
| `num_steps` | 48 | More steps = higher quality. The `V4_QUALITY_48` preset (48 steps) is a good speed/quality trade-off. |
|
||||||
|
| `guidance_scale` | 7.0 | Constant guidance weight used when no `guidance_schedule` is given. Higher = more prompt adherence, lower = more diversity. |
|
||||||
|
| `guidance_schedule` | `None` | Optional per-step guidance weights (loop-index order: index 0 is the final step). Overrides `guidance_scale`. |
|
||||||
|
| `mu` | 0.5 | Logit-normal schedule mean. Auto-adjusted for resolution. |
|
||||||
|
| `std` | 1.0 | Logit-normal schedule standard deviation. |
|
||||||
|
| `seed` | `None` | Set for reproducible results. |
|
||||||
|
|
||||||
|
## Supported Resolutions
|
||||||
|
|
||||||
|
Ideogram 4 natively supports any resolution where both height and width are
|
||||||
|
multiples of 16, within the range 256–2048 (aspect ratios up to 6:1 or 1:6).
|
||||||
|
|
||||||
|
| Use case | Resolution | Aspect ratio |
|
||||||
|
| :------- | :--------: | :----------: |
|
||||||
|
| Square | 1024 × 1024 | 1:1 |
|
||||||
|
| Landscape | 1536 × 1024 | 3:2 |
|
||||||
|
| Portrait | 1024 × 1536 | 2:3 |
|
||||||
|
| Widescreen | 1920 × 1088 | ~16:9 |
|
||||||
|
| Ultrawide | 2048 × 768 | 8:3 |
|
||||||
|
| Phone wallpaper | 1024 × 1792 | ~9:16 |
|
||||||
|
| Social banner | 1600 × 400 | 4:1 |
|
||||||
|
|
||||||
|
Resolution buckets use 16-pixel increments, giving fine-grained control over
|
||||||
|
output dimensions.
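
The constraints above (multiples of 16, each side within 256–2048, aspect ratio at most 6:1) are easy to check before calling the pipeline. The helper names below are made up for this sketch; the pipeline may perform its own validation, which this does not replace.

```python
from __future__ import annotations


def check_resolution(height: int, width: int) -> list[str]:
    """Return a list of problems; an empty list means the size meets the stated rules."""
    problems = []
    for name, value in (("height", height), ("width", width)):
        if value % 16 != 0:
            problems.append(f"{name}={value} is not a multiple of 16")
        if not 256 <= value <= 2048:
            problems.append(f"{name}={value} is outside 256-2048")
    long_side, short_side = max(height, width), min(height, width)
    if short_side > 0 and long_side > 6 * short_side:
        problems.append(f"aspect ratio {long_side}:{short_side} exceeds 6:1")
    return problems


def snap_to_16(value: int) -> int:
    """Round half-up to the nearest multiple of 16, then clamp into 256-2048."""
    return min(2048, max(256, (value + 8) // 16 * 16))


print(check_resolution(1088, 1920))  # widescreen row from the table
print(check_resolution(1080, 1920))  # 1080 is not a multiple of 16
print(snap_to_16(1080))
```

Snapping 1080 gives 1088, which is why the widescreen row in the table is only approximately 16:9.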
|
||||||
|
|
||||||
@@ -0,0 +1,45 @@
|
|||||||
|
# Model Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
                    │
                    ▼
   ┌──────────────────────────────────────────────────┐
   │ Ideogram4Transformer                             │
   │   • 34 × Ideogram4TransformerBlock               │
   │       – Ideogram4Attention (QK-RMSNorm, MRoPE)   │
   │       – Ideogram4MLP (SwiGLU)                    │
   │       – adaln scale/gate from t-embedding        │
   │   • Ideogram4FinalLayer                          │
   └──────────────────────────────────────────────────┘
                    │  velocity prediction
                    ▼
   Euler flow-matching sampler with asymmetric CFG
                    │  denoised image latents
                    ▼
               VAE decode
                    │
                    ▼
               PIL.Image
|
||||||
|
```
|
||||||
|
|
||||||
|
The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from
|
||||||
|
the activation layers) and image latent tokens are concatenated into one
|
||||||
|
sequence, modulated per-block by an AdaLN computed from the flow-matching
|
||||||
|
timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and
|
||||||
|
image tokens share a unified positional space.
|
||||||
|
|
||||||
|
Model spec:
|
||||||
|
|
||||||
|
| field | value |
|
||||||
|
|-------------------|---------------|
|
||||||
|
| `emb_dim` | 4608 |
|
||||||
|
| `num_layers` | 34 |
|
||||||
|
| `num_heads` | 18 |
|
||||||
|
| `intermediate` | 12288 |
|
||||||
|
| `adanln_dim` | 512 |
|
||||||
|
| `rope_theta` | 5_000_000 |
|
||||||
|
| `mrope_section` | (24, 20, 20) |
|
||||||
|
| latent channels | 32 × 2² = 128 |
|
||||||
|
| max text tokens | 2048 |
|
||||||
|
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
|
||||||
@@ -0,0 +1,183 @@
|
|||||||
|
# Pipeline: How All the Components Work Together
|
||||||
|
|
||||||
|
This document explains the end-to-end Ideogram 4 inference pipeline
|
||||||
|
conceptually. For the architecture spec and code pointers, see
|
||||||
|
[model_architecture.md](model_architecture.md).
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Ideogram 4 is a **flow-matching text-to-image model** built on a
|
||||||
|
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
|
||||||
|
components:
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
│  Qwen3-VL   │   │      Ideogram4       │   │    KL VAE    │   │           │
│    Text     ├──►│  Transformer (DiT)   ├──►│     VAE      ├──►│   Image   │
│   Encoder   │   │   + Euler Sampler    │   │   Decoder    │   │           │
└─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
    frozen               trainable                frozen
|
||||||
|
```
|
||||||
|
|
||||||
|
## 1. Text Encoder — Qwen3-VL-8B-Instruct
|
||||||
|
|
||||||
|
The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
|
||||||
|
vision-language model, used in text-only mode (no vision inputs).
|
||||||
|
|
||||||
|
**What it does:**
|
||||||
|
- Tokenizes the prompt using the Qwen3 chat template.
|
||||||
|
- Runs a forward pass through the 36-layer transformer.
|
||||||
|
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
|
||||||
|
24, 27, 30, 33, 35.
|
||||||
|
- Concatenates these hidden states along the feature dimension, producing a
|
||||||
|
multi-scale text representation.
|
||||||
|
|
||||||
|
**Why multi-layer extraction?** Different layers capture different levels of
|
||||||
|
abstraction — early layers encode surface-level token information, while later
|
||||||
|
layers encode deeper semantic meaning. Concatenating them gives the DiT access
|
||||||
|
to the full spectrum.
|
||||||
|
|
||||||
|
**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
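
To make the layer selection concrete, the sketch below derives the 13 layer indices and concatenates per-layer feature vectors for a single token using plain lists. It is a toy with a made-up `hidden_dim` of 4 and a hypothetical `concat_selected` helper; the real pipeline operates on batched tensors from the encoder.

```python
from __future__ import annotations

NUM_LAYERS = 36
# Every third layer starting at 0, plus the last layer (35): 13 indices in total.
SELECTED_LAYERS = list(range(0, NUM_LAYERS, 3)) + [NUM_LAYERS - 1]


def concat_selected(hidden_states: list[list[float]]) -> list[float]:
    """Concatenate one token's per-layer feature vectors along the feature axis.

    hidden_states[i] is the feature vector that layer i produced for this token.
    """
    features: list[float] = []
    for layer in SELECTED_LAYERS:
        features.extend(hidden_states[layer])
    return features


# Toy input: hidden_dim = 4, and each layer's vector is filled with its own index.
toy = [[float(i)] * 4 for i in range(NUM_LAYERS)]
token_features = concat_selected(toy)

print(SELECTED_LAYERS)
print(len(token_features))  # hidden_dim * 13 = 52
```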
|
||||||
|
|
||||||
|
## 2. DiT Backbone — Ideogram4Transformer
|
||||||
|
|
||||||
|
The core generative model is a 34-layer single-stream Diffusion Transformer.
|
||||||
|
|
||||||
|
### Sequence layout
|
||||||
|
|
||||||
|
Text tokens and image latent tokens are concatenated into one sequence and
|
||||||
|
processed through the same self-attention layers.
|
||||||
|
|
||||||
|
```
|
||||||
|
Sequence layout (per sample):

┌───────────────────┬────────────────────────┐
│   text tokens     │  image latent tokens   │
│   (up to 2048)    │   (grid_h × grid_w)    │
└───────────────────┴────────────────────────┘
         ▲                      ▲
  Qwen3-VL features      noisy latents z_t
|
||||||
|
```
|
||||||
|
|
||||||
|
### Key components per block
|
||||||
|
|
||||||
|
- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
|
||||||
|
positional encoding is 3-dimensional: for text tokens it uses a 1D position
|
||||||
|
broadcast to 3 axes; for image tokens it uses (temporal, height, width)
|
||||||
|
coordinates. This lets text and image tokens coexist in a unified positional
|
||||||
|
space.
|
||||||
|
- **SwiGLU MLP** — the feed-forward layer uses a gated linear unit with SiLU
|
||||||
|
activation.
|
||||||
|
- **Adaptive Layer Norm (AdaLN)** — the scalar timestep `t` is embedded, and
the embedding produces per-block scale and gate parameters. This conditions every layer
|
||||||
|
on the current noise level.
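
The 3-axis position scheme can be sketched as follows. Text tokens repeat a 1D index on all three axes and image tokens get (temporal, height, width) coordinates, as described above. Two details are assumptions made for this example and are not stated in this document: image coordinates are offset to start right after the text positions, and a single still image uses one constant temporal index. `mrope_positions` is a hypothetical helper, not a function of the package.

```python
from __future__ import annotations


def mrope_positions(num_text_tokens: int, grid_h: int, grid_w: int) -> list[tuple[int, int, int]]:
    """Return one (temporal, height, width) position triple per token in the sequence."""
    # Text tokens: the same 1D position broadcast to all three axes.
    positions = [(i, i, i) for i in range(num_text_tokens)]

    # Assumption: image positions begin where the text positions end,
    # and the temporal coordinate stays constant for a single image.
    start = num_text_tokens
    for row in range(grid_h):
        for col in range(grid_w):
            positions.append((start, start + row, start + col))
    return positions


pos = mrope_positions(num_text_tokens=3, grid_h=2, grid_w=2)
for p in pos:
    print(p)
```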
|
||||||
|
|
||||||
|
### Flow matching
|
||||||
|
|
||||||
|
The model is trained with a **flow-matching** objective. Instead of predicting
|
||||||
|
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
|
||||||
|
defines the ODE:
|
||||||
|
|
||||||
|
```
|
||||||
|
dz/dt = v(z_t, t)
|
||||||
|
```
|
||||||
|
|
||||||
|
At inference time, we start from pure Gaussian noise `z_1` and integrate
|
||||||
|
backward to `z_0` (the clean image) using the Euler method:
|
||||||
|
|
||||||
|
```
|
||||||
|
z_{t-dt} = z_t - v(z_t, t) * dt
|
||||||
|
```
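
A scalar toy makes the direction of integration easier to see. The sketch assumes a straight-line path `z_t = (1 - t) * z_0 + t * z_1`, for which the velocity is the constant `z_1 - z_0`; in the real sampler the velocity comes from the DiT. Writing the update as `z + v * (s - t)` with `s < t` handles the sign automatically. All names here are illustrative.

```python
def euler_integrate(z_start, velocity, times):
    """Integrate dz/dt = velocity(z, t) along a decreasing list of times."""
    z = z_start
    for t, s in zip(times[:-1], times[1:]):
        # s < t when moving from noise (t=1) toward data (t=0),
        # so (s - t) is negative and the step runs against the velocity.
        z = z + velocity(z, t) * (s - t)
    return z


z0_true, z1_noise = 2.0, -1.0


def straight_line_velocity(z, t):
    # d/dt [(1 - t) * z0 + t * z1] = z1 - z0, independent of z and t.
    return z1_noise - z0_true


times = [1.0 - i / 8 for i in range(9)]  # 1.0, 0.875, ..., 0.0
z0_estimate = euler_integrate(z1_noise, straight_line_velocity, times)
print(z0_estimate)
```

Because the velocity is constant in this toy, Euler integration recovers `z_0` exactly; with a learned, state-dependent velocity the result depends on the number of steps and where they are placed.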
|
||||||
|
|
||||||
|
### Noise schedule
|
||||||
|
|
||||||
|
The timestep distribution follows a **logit-normal schedule** parameterized by
|
||||||
|
`(mu, std)`. The mean `mu` controls how much time the sampler spends at
|
||||||
|
different noise levels — higher `mu` shifts more steps toward higher noise
|
||||||
|
(important for high-resolution images). The schedule auto-adjusts for
|
||||||
|
resolution:
|
||||||
|
|
||||||
|
```
|
||||||
|
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
|
||||||
|
```
|
||||||
|
|
||||||
|
where `base_pixels = 512 * 512`.
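
The adjustment is simple to reproduce. The sketch below assumes `log` is the natural logarithm, which the formula does not state explicitly. It also shows one plausible way to turn `(mu, std)` into step times, by pushing evenly spaced normal quantiles through a sigmoid; that construction is an assumption for illustration and is not confirmed to match the package's sampler. Function names are hypothetical.

```python
import math
from statistics import NormalDist

BASE_PIXELS = 512 * 512


def adjust_mu(mu_base, height, width):
    """Shift mu by half the log ratio of pixel counts (natural log assumed)."""
    return mu_base + 0.5 * math.log((height * width) / BASE_PIXELS)


def logit_normal_times(num_steps, mu, std):
    """Assumed construction: sigmoid of evenly spaced quantiles of N(mu, std^2)."""
    normal = NormalDist()
    times = []
    for i in range(num_steps):
        q = (i + 0.5) / num_steps          # midpoints avoid q = 0 and q = 1
        x = mu + std * normal.inv_cdf(q)   # logit-space sample location
        times.append(1.0 / (1.0 + math.exp(-x)))
    return sorted(times, reverse=True)     # high noise first


mu_1024 = adjust_mu(0.0, 1024, 1024)       # 4x the base pixel count
print(round(mu_1024, 4))
print([round(t, 3) for t in logit_normal_times(6, mu_1024, 1.5)])
```

At 1024×1024 the pixel ratio is 4, so `mu` rises by `0.5 * ln 4 = ln 2`, which moves the step times toward `t = 1`, the high-noise end.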
|
||||||
|
|
||||||
|
## 3. Classifier-Free Guidance (CFG)
|
||||||
|
|
||||||
|
At each sampling step, two forward passes are run through the DiT:
|
||||||
|
|
||||||
|
1. **Conditional (positive):** full text features + noisy image latents.
|
||||||
|
2. **Unconditional (negative):** zeroed text features + noisy image latents
|
||||||
|
(image-only tokens, asymmetric CFG).
|
||||||
|
|
||||||
|
The guided velocity is a weighted combination:
|
||||||
|
|
||||||
|
```
|
||||||
|
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
|
||||||
|
```
|
||||||
|
|
||||||
|
where `gw` is the per-step guidance weight. With
|
||||||
|
`gw > 1`, the model amplifies the text-conditional signal and suppresses the
|
||||||
|
unconditional prediction, producing images that follow the prompt more
|
||||||
|
faithfully.
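
The combination is a plain linear extrapolation, which a few lines of Python can demonstrate on toy vectors. `cfg_combine` is a hypothetical helper operating on lists of floats; the real pipeline applies the same arithmetic to velocity tensors.

```python
def cfg_combine(v_cond, v_uncond, gw):
    """v_guided = gw * v_cond + (1 - gw) * v_uncond, applied element-wise."""
    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, -2.0]
v_uncond = [0.5, 0.0]

print(cfg_combine(v_cond, v_uncond, 1.0))  # gw = 1 returns the conditional prediction
print(cfg_combine(v_cond, v_uncond, 0.0))  # gw = 0 returns the unconditional prediction
print(cfg_combine(v_cond, v_uncond, 7.0))  # gw > 1 extrapolates past the conditional one
```

Rearranged, the formula reads `v_uncond + gw * (v_cond - v_uncond)`: start from the unconditional prediction and move `gw` times the distance toward the conditional one.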
|
||||||
|
|
||||||
|
**Asymmetric CFG:** The unconditional branch only processes image tokens (no
|
||||||
|
text padding), making it computationally cheaper than a full-sequence negative
|
||||||
|
pass.
|
||||||
|
|
||||||
|
**Per-step schedules:** The guidance weight can vary across steps. The
|
||||||
|
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
|
||||||
|
`gw=3` for the final 3 "polish" steps near `t=0`.
|
||||||
|
|
||||||
|
|
||||||
|
## 4. VAE Decoder — KL Autoencoder
|
||||||
|
|
||||||
|
The denoised latent `z_0` is decoded to pixel space using a frozen KL
|
||||||
|
autoencoder.
|
||||||
|
|
||||||
|
**What it does:**
|
||||||
|
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
|
||||||
|
input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
|
||||||
|
`(batch, channels, grid_h * 2, grid_w * 2)`.
|
||||||
|
- **Denormalization:** Per-channel shift and scale are applied to undo the
|
||||||
|
latent normalization used during training.
|
||||||
|
- **Decoding:** The VAE decoder maps latents to RGB pixels.
|
||||||
|
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
|
||||||
|
|
||||||
|
**Compression factor:** The autoencoder provides 8× spatial compression on each
|
||||||
|
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
|
||||||
|
is represented as a 64×64 grid of latent tokens, each with 128 channels
|
||||||
|
(32 base channels × 2² patch).
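
The arithmetic generalizes to any supported resolution. The sketch below uses only the numbers quoted above (8× VAE compression, 2×2 patching, 32 base channels); `token_layout` is an illustrative helper, not part of the package.

```python
VAE_DOWNSCALE = 8    # spatial compression per axis in the autoencoder
PATCH = 2            # the DiT groups latent pixels into 2x2 patches
BASE_CHANNELS = 32   # latent channels before patching


def token_layout(height, width):
    """Compute the token grid and tensor shapes for an output image size."""
    stride = VAE_DOWNSCALE * PATCH  # 16 image pixels per token along each axis
    if height % stride or width % stride:
        raise ValueError("height and width must be multiples of 16")
    grid_h, grid_w = height // stride, width // stride
    return {
        "grid": (grid_h, grid_w),
        "num_tokens": grid_h * grid_w,
        "token_channels": BASE_CHANNELS * PATCH * PATCH,
        # Latent shape after unpatching, without the batch axis.
        "latent_shape": (BASE_CHANNELS, grid_h * PATCH, grid_w * PATCH),
    }


print(token_layout(1024, 1024))
print(token_layout(1088, 1920))
```

This also explains the multiple-of-16 rule for `height` and `width`: each token covers a 16×16 block of output pixels.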
|
||||||
|
|
||||||
|
## Putting it all together
|
||||||
|
|
||||||
|
```python
|
||||||
|
# Pseudocode for one generation call:
|
||||||
|
|
||||||
|
# 1. Encode text
|
||||||
|
text_features = qwen3_vl.encode(prompt) # (B, L_text, D)
|
||||||
|
|
||||||
|
# 2. Initialize noise
|
||||||
|
z = torch.randn(B, grid_h * grid_w, 128) # pure noise at t=1
|
||||||
|
|
||||||
|
# 3. Euler integration from t=1 to t=0
|
||||||
|
for step in reversed(range(num_steps)):
    t = schedule(step)       # current time (higher noise)
    s = schedule(step - 1)   # next time (lower noise); the last step lands on t=0

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)

    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(zeros, z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step; s < t, so (s - t) is negative
    z = z + v * (s - t)
|
||||||
|
|
||||||
|
# 4. Decode to pixels
|
||||||
|
image = vae.decode(z)
|
||||||
|
```
|
||||||
@@ -0,0 +1,362 @@
|
|||||||
|
# Prompting Guide
|
||||||
|
|
||||||
|
Ideogram 4 is trained exclusively on **structured JSON captions** (serialized as strings). While the
|
||||||
|
model can accept plain-text prompts, providing a JSON object that follows the
|
||||||
|
caption schema gives significantly better results, especially for
|
||||||
|
controllability, spatial layout, and style fidelity.
|
||||||
|
|
||||||
|
## Plain-text vs. JSON prompts
|
||||||
|
|
||||||
|
You can pass plain-text prompts directly to the model. The
|
||||||
|
sampling parameters come from a named preset in `ideogram4.PRESETS` (the same
|
||||||
|
ones `run_inference.py` exposes via `--sampler-preset`), unpacked into the
|
||||||
|
`pipe()` call:
|
||||||
|
|
||||||
|
```python
|
||||||
|
from ideogram4 import PRESETS
|
||||||
|
|
||||||
|
preset = PRESETS["V4_QUALITY_48"]
|
||||||
|
images = pipe(
|
||||||
|
"a golden retriever on a skateboard",
|
||||||
|
height=1024,
|
||||||
|
width=1024,
|
||||||
|
num_steps=preset.num_steps,
|
||||||
|
guidance_schedule=preset.guidance_schedule,
|
||||||
|
mu=preset.mu,
|
||||||
|
std=preset.std,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
For higher-quality generations and more control, pass a JSON string as the prompt:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import json
|
||||||
|
from ideogram4 import PRESETS
|
||||||
|
|
||||||
|
caption = {
|
||||||
|
"high_level_description": "A golden retriever riding a skateboard down a sunny sidewalk.",
|
||||||
|
"style_description": {
|
||||||
|
"aesthetics": "warm, playful, vibrant",
|
||||||
|
"lighting": "bright afternoon sunlight, long soft shadows",
|
||||||
|
"photo": "shallow depth of field, eye-level, 85mm lens",
|
||||||
|
"medium": "photograph",
|
||||||
|
"color_palette": ["#F5C542", "#87CEEB", "#4A4A4A", "#FFFFFF", "#2E8B57"]
|
||||||
|
},
|
||||||
|
"compositional_deconstruction": {
|
||||||
|
"background": "A sun-drenched suburban sidewalk lined with green hedges and a white picket fence. Dappled light filters through overhead trees.",
|
||||||
|
"elements": [
|
||||||
|
{"type": "obj", "bbox": [200, 300, 800, 900], "desc": "A golden retriever with a fluffy coat, standing on a red skateboard with all four paws. Its tongue is out and ears are flapping in the wind."},
|
||||||
|
{"type": "obj", "bbox": [250, 750, 750, 950], "desc": "A worn red skateboard with black wheels rolling along the concrete sidewalk."}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
preset = PRESETS["V4_QUALITY_48"]
|
||||||
|
images = pipe(
|
||||||
|
json.dumps(caption, separators=(",", ":"), ensure_ascii=False),
|
||||||
|
height=1024,
|
||||||
|
width=1024,
|
||||||
|
num_steps=preset.num_steps,
|
||||||
|
guidance_schedule=preset.guidance_schedule,
|
||||||
|
mu=preset.mu,
|
||||||
|
std=preset.std,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Magic prompt
|
||||||
|
|
||||||
|
Writing these captions by hand is optional. *Magic prompt* uses an LLM to expand
|
||||||
|
a plain-text prompt into a full structured caption for you, so you get the
|
||||||
|
quality of a JSON prompt from a casual one. It is enabled by default in
|
||||||
|
`run_inference.py`; you can also call it directly:
|
||||||
|
|
||||||
|
```python
|
||||||
|
import os
|
||||||
|
from ideogram4 import ClaudeOpusMagicPromptV1, PRESETS
|
||||||
|
|
||||||
|
magic = ClaudeOpusMagicPromptV1(api_key=os.environ["MAGIC_PROMPT_API_KEY"])
|
||||||
|
caption = magic.expand("a golden retriever on a skateboard", aspect_ratio="1:1")
|
||||||
|
preset = PRESETS["V4_QUALITY_48"]
|
||||||
|
images = pipe(
|
||||||
|
caption,
|
||||||
|
height=1024,
|
||||||
|
width=1024,
|
||||||
|
num_steps=preset.num_steps,
|
||||||
|
guidance_schedule=preset.guidance_schedule,
|
||||||
|
mu=preset.mu,
|
||||||
|
std=preset.std,
|
||||||
|
)
|
||||||
|
```
|
||||||
|
|
||||||
|
The package ships three configurations, registered by name in
|
||||||
|
`ideogram4.MAGIC_PROMPTS` (the keys `run_inference.py` accepts via
|
||||||
|
`--magic-prompt-model`):
|
||||||
|
|
||||||
|
| Config class | Registry key | Backend |
|
||||||
|
| :--- | :--- | :--- |
|
||||||
|
| `Ideogram4MagicPromptV1` | `ideogram-4-v1` | Ideogram's hosted magic-prompt API (free; reads `IDEOGRAM_API_KEY`) |
|
||||||
|
| `ClaudeOpusMagicPromptV1` | `claude-opus-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
|
||||||
|
| `ClaudeSonnetMagicPromptV1` | `claude-sonnet-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
|
||||||
|
|
||||||
|
`ideogram-4-v1` is the default and is **free**. It runs the expansion
|
||||||
|
server-side, so there is no local model or system prompt involved — it just needs
|
||||||
|
an Ideogram API key (get one at
|
||||||
|
[developer.ideogram.ai](https://developer.ideogram.ai)). The `claude-*`
|
||||||
|
configurations instead send one of our open-source system prompts to an OpenRouter model;
|
||||||
|
select one with `--magic-prompt-model` and export `MAGIC_PROMPT_API_KEY`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python run_inference.py \
|
||||||
|
--prompt "an isometric illustration of a tiny city floating in the clouds" \
|
||||||
|
--output out.png \
|
||||||
|
--quantization "nf4" \
|
||||||
|
--magic-prompt-model claude-opus-v1 \
|
||||||
|
--magic-prompt-key "$MAGIC_PROMPT_API_KEY"
|
||||||
|
```
|
||||||
|
|
||||||
|
See the README's [CLI](../README.md#cli) section for the rest of the flags.
|
||||||
|
|
||||||
|
Our magic-prompt system prompts are **open source** (they ship in
|
||||||
|
`src/ideogram4/magic_prompt_system_prompts/`), so you're also welcome to
|
||||||
|
construct the caption with any system prompt and LLM of your choosing.
|
||||||
|
|
||||||
|
**A few caveats:**
|
||||||
|
|
||||||
|
- At Ideogram we've tested this magic prompt with **Claude Opus**. You're welcome
|
||||||
|
to implement your own `MagicPrompt` configurations and/or drive a different LLM
|
||||||
|
with our system prompt, but those paths aren't tested by us and quality may
|
||||||
|
vary.
|
||||||
|
- The magic prompt shipped here is **not** the same magic prompt used in
|
||||||
|
production at [Ideogram.ai](https://ideogram.ai) — results will differ from the
|
||||||
|
hosted product (including the `ideogram-4-v1` API).
|
||||||
|
|
||||||
|
## JSON caption schema
|
||||||
|
|
||||||
|
> **Note:** Following this schema is **not required** — the model accepts any
|
||||||
|
> string as a prompt. The schema below describes the exact structure the model
|
||||||
|
> was trained on, and matching it minimizes train/eval mismatch so the model
|
||||||
|
> generates closer to its full quality. Treat the "required" / "must" language
|
||||||
|
> in the rest of this section as the format the [`CaptionVerifier`](../src/ideogram4/caption_verifier.py)
|
||||||
|
> checks against, not as a hard pipeline constraint. Deviating from the schema
|
||||||
|
> is allowed; it just means you're sampling outside the training distribution.
|
||||||
|
|
||||||
|
The full caption schema has three top-level fields:
|
||||||
|
|
||||||
|
1. `high_level_description` — optional string, but strongly recommended.
|
||||||
|
2. `style_description` — optional object.
|
||||||
|
3. `compositional_deconstruction` — **required** object.
|
||||||
|
|
||||||
|
`compositional_deconstruction` must always be present. Within it, both
|
||||||
|
`background` and `elements` are required.
|
||||||
|
|
||||||
|
### `high_level_description`
|
||||||
|
|
||||||
|
A one- or two-sentence summary of the entire image. Strongly recommended in every prompt.
|
||||||
|
|
||||||
|
```json
|
||||||
|
"high_level_description": "A medium-shot photograph of a barista pouring latte art in a cozy cafe."
|
||||||
|
```
|
||||||
|
|
||||||
|
### `style_description`
|
||||||
|
|
||||||
|
Controls the visual style, lighting, medium, and color palette.
|
||||||
|
|
||||||
|
`style_description` must contain **exactly one** of:
|
||||||
|
|
||||||
|
- `photo` — for photographic captions (paired with `medium: "photograph"`).
|
||||||
|
- `art_style` — for non-photographic captions (illustration, painting, 3D render, etc.).
|
||||||
|
|
||||||
|
`aesthetics`, `lighting`, and `medium` are also required when `style_description` is present. `color_palette` is optional.
|
||||||
|
|
||||||
|
**Key order is strict** and depends on which of `photo` / `art_style` is used:
|
||||||
|
|
||||||
|
| Caption type | Required key order |
|
||||||
|
| :----------- | :----------------- |
|
||||||
|
| Photo (uses `photo`) | `aesthetics`, `lighting`, `photo`, `medium`, `color_palette` |
|
||||||
|
| Non-photo (uses `art_style`) | `aesthetics`, `lighting`, `medium`, `art_style`, `color_palette` |
|
||||||
|
|
||||||
|
`color_palette` is the only field in this list that may be omitted; if it is included it must remain in the final position.
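
Since Python dictionaries preserve insertion order, the two key orders can be produced by construction. The helper below is a minimal sketch with a hypothetical name; it enforces the "exactly one of `photo` / `art_style`" rule and places `color_palette` last when given.

```python
def build_style_description(aesthetics, lighting, medium, *,
                            photo=None, art_style=None, color_palette=None):
    """Build a style_description dict whose keys follow the documented order."""
    if (photo is None) == (art_style is None):
        raise ValueError("provide exactly one of photo or art_style")

    style = {"aesthetics": aesthetics, "lighting": lighting}
    if photo is not None:
        # Photo order: aesthetics, lighting, photo, medium
        style["photo"] = photo
        style["medium"] = medium
    else:
        # Non-photo order: aesthetics, lighting, medium, art_style
        style["medium"] = medium
        style["art_style"] = art_style
    if color_palette:
        style["color_palette"] = list(color_palette)  # always the final key
    return style


photo_style = build_style_description(
    "moody, cinematic", "low-key, deep shadows", "photograph",
    photo="35mm, f/1.4", color_palette=["#1B1B2F", "#E43F5A"],
)
print(list(photo_style))
```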
|
||||||
|
|
||||||
|
Field descriptions:
|
||||||
|
|
||||||
|
| Field | Type | Description |
|
||||||
|
| :---- | :--- | :---------- |
|
||||||
|
| `aesthetics` | string | Aesthetic keywords (e.g. "moody, cinematic, desaturated") |
|
||||||
|
| `lighting` | string | Lighting description (e.g. "golden hour, rim light, dramatic shadows") |
|
||||||
|
| `photo` | string | Camera/lens details for photographic outputs (e.g. "35mm, f/1.4, bokeh"). Use this OR `art_style`, not both. |
|
||||||
|
| `medium` | string | Medium type: `"photograph"`, `"illustration"`, `"3d_render"`, `"painting"`, `"graphic_design"`, etc. |
|
||||||
|
| `art_style` | string | Art style description for non-photo captions (e.g. "flat vector illustration, bold outlines"). Use this OR `photo`, not both. |
|
||||||
|
| `color_palette` | list[str] | Hex color codes that steer the image's dominant colors. Up to 16 entries. |
|
||||||
|
|
||||||
|
### `compositional_deconstruction`
|
||||||
|
|
||||||
|
Provides fine-grained spatial control over the image layout using bounding
|
||||||
|
boxes and per-element descriptions. Both fields below are required.
|
||||||
|
|
||||||
|
| Field | Type | Description |
|
||||||
|
| :---- | :--- | :---------- |
|
||||||
|
| `background` | string | Description of the background/environment (required) |
|
||||||
|
| `elements` | list[dict] | List of elements with optional bounding boxes (required) |
|
||||||
|
|
||||||
|
`background` must come before `elements`.
|
||||||
|
|
||||||
|
Each element in `elements` must follow a fixed **key order** depending on its
|
||||||
|
type. `bbox` and `color_palette` are optional within an element; if present they
|
||||||
|
must appear in the positions shown below.
|
||||||
|
|
||||||
|
| Type | Required key order |
|
||||||
|
| :--- | :----------------- |
|
||||||
|
| `"obj"` | `type`, `bbox`, `desc`, `color_palette` |
|
||||||
|
| `"text"` | `type`, `bbox`, `text`, `desc`, `color_palette` |
|
||||||
|
|
||||||
|
Field descriptions:
|
||||||
|
|
||||||
|
| Field | Type | Description |
|
||||||
|
| :---- | :--- | :---------- |
|
||||||
|
| `type` | string | `"obj"` for objects/subjects, `"text"` for in-image text |
|
||||||
|
| `bbox` | list[int] | `[y_min, x_min, y_max, x_max]` in normalized `0–1000` coordinates (origin at top-left). Optional. |
|
||||||
|
| `desc` | string | Detailed description of the element |
|
||||||
|
| `text` | string | (only for `type: "text"`) The literal text to render |
|
||||||
|
| `color_palette` | list[str] | Optional per-element palette. Up to 5 hex entries. |
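
If you start from pixel coordinates, for example a box drawn on a reference image, converting to the normalized `bbox` form is a matter of scaling each axis to `0–1000` and emitting the values in `[y_min, x_min, y_max, x_max]` order. The helper below is a sketch; rounding to the nearest integer and clamping to the range are choices made for this example and are not specified by the schema.

```python
def to_bbox(top, left, bottom, right, image_height, image_width):
    """Convert a pixel box to [y_min, x_min, y_max, x_max] in 0-1000 coordinates."""

    def norm(value, size):
        # Scale to 0-1000, round to an integer, and clamp into range.
        return max(0, min(1000, round(1000 * value / size)))

    return [
        norm(top, image_height),
        norm(left, image_width),
        norm(bottom, image_height),
        norm(right, image_width),
    ]


# The right half of a 1024 x 1024 image.
print(to_bbox(top=0, left=512, bottom=1024, right=1024,
              image_height=1024, image_width=1024))
```

Note that the vertical coordinate comes first in each pair, which is easy to get wrong when converting from `(x, y, w, h)` style boxes.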
|
||||||
|
|
||||||
|
**Key ordering matters.** The model was trained on JSON with a consistent key
|
||||||
|
order, so maintaining it improves generation quality. The pipeline runs
|
||||||
|
[`CaptionVerifier`](../src/ideogram4/caption_verifier.py) on every prompt and emits
|
||||||
|
warnings for unknown keys, missing required keys, or out-of-order keys.
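
To catch ordering mistakes before a prompt reaches the pipeline, a small standalone check along the following lines can help. This is an independent sketch based only on the key-order table above; it does not reproduce the actual `CaptionVerifier` implementation, and the names are hypothetical.

```python
from __future__ import annotations

ELEMENT_KEY_ORDER = {
    "obj": ["type", "bbox", "desc", "color_palette"],
    "text": ["type", "bbox", "text", "desc", "color_palette"],
}
OPTIONAL_KEYS = {"bbox", "color_palette"}


def element_problems(element: dict) -> list[str]:
    """List unknown, missing, and out-of-order keys for one element."""
    expected = ELEMENT_KEY_ORDER.get(element.get("type"))
    if expected is None:
        return [f"unknown element type: {element.get('type')!r}"]

    problems = []
    for key in element:
        if key not in expected:
            problems.append(f"unknown key: {key}")
    for key in expected:
        if key not in OPTIONAL_KEYS and key not in element:
            problems.append(f"missing required key: {key}")

    # Compare the order of the recognized keys with the documented order.
    present = [key for key in element if key in expected]
    if present != [key for key in expected if key in element]:
        problems.append(f"keys out of order: {present}")
    return problems


print(element_problems({"type": "text", "desc": "Blue email text.", "text": "hello@acme.tech"}))
```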
|
||||||
|
|
||||||
|
**Hex color format.** Colors in `color_palette` must be uppercase
|
||||||
|
`#RRGGBB` strings (e.g. `#1B1B2F`, not `#1b1b2f` or `#fff`).
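
Palettes copied from design tools often arrive in lowercase or shorthand form. The sketch below normalizes such input to uppercase `#RRGGBB` and rejects anything else; `normalize_hex` is a hypothetical helper, and expanding three-digit shorthand is a convenience added here rather than something the schema asks for.

```python
import re

FULL_HEX = re.compile(r"#[0-9A-F]{6}")
SHORT_HEX = re.compile(r"#[0-9A-F]{3}")


def normalize_hex(color):
    """Return the color as an uppercase #RRGGBB string or raise ValueError."""
    value = color.strip().upper()
    if SHORT_HEX.fullmatch(value):
        # Expand shorthand such as #FFF to #FFFFFF by doubling each digit.
        value = "#" + "".join(digit * 2 for digit in value[1:])
    if not FULL_HEX.fullmatch(value):
        raise ValueError(f"not a #RRGGBB color: {color!r}")
    return value


print([normalize_hex(c) for c in ["#1b1b2f", "#fff", " #E43F5A "]])
```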
|
||||||
|
|
||||||
|
**Encoding.** When serializing with Python's `json` module, pass
|
||||||
|
`separators=(",", ":")` and `ensure_ascii=False`.
|
||||||
|
`CaptionVerifier` warns when it detects `\uXXXX` escapes with no literal
|
||||||
|
non-ASCII characters in the raw text.
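
The difference between the recommended settings and `json.dumps` defaults is visible on any caption containing a non-ASCII character. The caption below is a made-up minimal example.

```python
import json

caption = {
    "compositional_deconstruction": {
        "background": "A café terrace at dusk.",
        "elements": [],
    }
}

# Recommended: compact separators, literal non-ASCII characters.
compact = json.dumps(caption, separators=(",", ":"), ensure_ascii=False)

# Defaults: spaces after separators, non-ASCII escaped as \uXXXX.
default = json.dumps(caption)

print(compact)
print(default)
```

The default output contains `caf\u00e9` rather than `café`, which is the kind of escape the warning described above is meant to flag.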
|
||||||
|
|
||||||
|
## Color palette conditioning
|
||||||
|
|
||||||
|
One of Ideogram 4's distinctive features is **color palette control**. By
|
||||||
|
providing a `color_palette` array of hex colors in `style_description`, you
|
||||||
|
can steer the dominant colors of the generated image.
|
||||||
|
|
||||||
|
```json
|
||||||
|
"style_description": {
|
||||||
|
"aesthetics": "moody, cinematic",
|
||||||
|
"lighting": "low-key, deep shadows",
|
||||||
|
"photo": "35mm, f/1.4",
|
||||||
|
"medium": "photograph",
|
||||||
|
"color_palette": ["#1B1B2F", "#162447", "#1F4068", "#E43F5A", "#F5F5F5"]
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Tips for effective color palette use:
|
||||||
|
|
||||||
|
- **Up to 16 colors** in `style_description.color_palette` for the overall
|
||||||
|
image palette, and **up to 5 colors** per element in
|
||||||
|
`compositional_deconstruction.elements[*].color_palette`.
|
||||||
|
- **Include background colors** — if you want a dark background, include the
|
||||||
|
dark hex in the palette.
|
||||||
|
- **Contrast pairs** — include both your highlight and shadow colors for more
|
||||||
|
controlled lighting.
|
||||||
|
- **Uppercase hex only** — `#RRGGBB` form, no shorthand.
|
||||||
|
|
||||||
|
### Example: warm sunset palette
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"high_level_description": "A lone sailboat on calm water at sunset.",
|
||||||
|
"style_description": {
|
||||||
|
"aesthetics": "serene, warm, golden hour",
|
||||||
|
"lighting": "golden hour backlighting, warm atmospheric haze",
|
||||||
|
"photo": "wide angle, f/8, long exposure",
|
||||||
|
"medium": "photograph",
|
||||||
|
"color_palette": ["#FF6B35", "#F7C59F", "#004E89", "#1A659E", "#2B2D42"]
|
||||||
|
},
|
||||||
|
"compositional_deconstruction": {
|
||||||
|
"background": "A calm ocean stretching to a low horizon, sky washed in orange and pink with thin wisps of cloud.",
|
||||||
|
"elements": [
|
||||||
|
{"type": "obj", "desc": "A single sailboat with a white triangular sail, silhouetted against the setting sun."}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
### Example: corporate design palette
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"high_level_description": "A clean, modern business card layout for a tech company.",
|
||||||
|
"style_description": {
|
||||||
|
"aesthetics": "minimal, professional, geometric",
|
||||||
|
"lighting": "even, diffuse studio lighting",
|
||||||
|
"medium": "graphic_design",
|
||||||
|
"art_style": "flat vector design, generous whitespace, sans-serif typography",
|
||||||
|
"color_palette": ["#FFFFFF", "#F0F0F0", "#333333", "#0066FF", "#00CC88"]
|
||||||
|
},
|
||||||
|
"compositional_deconstruction": {
|
||||||
|
"background": "A solid off-white card surface with subtle paper texture.",
|
||||||
|
"elements": [
|
||||||
|
{"type": "text", "text": "ACME TECH", "desc": "Bold dark grey sans-serif company name across the upper third of the card."},
|
||||||
|
{"type": "text", "text": "hello@acme.tech", "desc": "Small blue sans-serif contact email near the bottom of the card."}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Full example
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
|
||||||
|
"style_description": {
|
||||||
|
"aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
|
||||||
|
"lighting": "overcast daylight, diffused, soft subtle shadows",
|
||||||
|
"photo": "shallow depth of field, sharp focus, eye-level, telephoto",
|
||||||
|
"medium": "photograph"
|
||||||
|
},
|
||||||
|
"compositional_deconstruction": {
|
||||||
|
"background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
|
||||||
|
"elements": [
|
||||||
|
{"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
|
||||||
|
{"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
|
||||||
|
{"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
|
||||||
|
{"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
|
||||||
|
{"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
|
||||||
|
{"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
|
||||||
|
{"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
|
||||||
|
{"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
|
||||||
|
{"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
|
||||||
|
{"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
{"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
{"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
{"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
{"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
{"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
{"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
{"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
]
}
}
```
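
A caption like the one above can be sanity-checked before it is sent for generation. The sketch below is illustrative only and is not part of the Ideogram tooling: `validate_elements` is a hypothetical helper, the accepted `type` values (`obj`, `text`) are taken from this example alone, and the reading of each `bbox` as two min/max pairs of integers in the `0-1000` range is an assumption inferred from the values shown.

```python
import json


def validate_elements(elements):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, el in enumerate(elements):
        kind = el.get("type")
        # Assumption: only the two types seen in the example are expected.
        if kind not in ("obj", "text"):
            problems.append(f"element {i}: unexpected type {kind!r}")
        # In the example, every "text" element carries the literal string to render.
        if kind == "text" and not isinstance(el.get("text"), str):
            problems.append(f"element {i}: text element has no 'text' string")
        bbox = el.get("bbox")
        if bbox is None:
            continue  # bounding boxes are optional
        in_range = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(type(v) is int and 0 <= v <= 1000 for v in bbox)
        )
        if not in_range:
            problems.append(f"element {i}: bbox must be four integers in 0..1000")
            continue
        # Assumption: values 0/2 and 1/3 form min/max pairs on the two axes.
        if bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
            problems.append(
                f"element {i}: bbox minimum must be less than maximum on each axis"
            )
    return problems


sample = json.loads("""[
  {"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo."},
  {"type": "text", "bbox": [797, 475, 815, 1018], "desc": "Small white text."}
]""")

for problem in validate_elements(sample):
    print(problem)
```

The second sample element is deliberately broken (no `text` string and an out-of-range coordinate), so the loop reports problems only for it. The check does not assume which axis comes first in a `bbox`; it only requires each pair to be ordered.
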
## Safety filter

NSFW prompts are blocked. Instead of an image, the model returns a gray screen
with the text "Image blocked by safety filter". The false-positive rate of the
safety filter is higher for prompts that are not JSON-like. We are aware of this
issue and may release a future checkpoint update to improve it.

# Congratulations!

You are now a certified Ideogram 4 prompter!

With structured JSON captions, you have fine-grained control over composition,
color palettes, typography, and spatial layout. These capabilities go far
beyond what plain-text prompts can express!
We'd love to see what you create :-)
Share your results, experiments, and creative discoveries with the community,
especially the unexpected ones. Tag us on social media or open a discussion on
the repo. Happy generating!