Initial commit: Ideogram 4 Prompt Builder

PyQt6 desktop app for building Ideogram 4 JSON captions: bbox canvas,
palette editor, presets, prompt library with previews, localisation (en/ru),
light/dark themes, and ComfyUI dependency check + generation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 16:36:27 +08:00
commit a5c319a1fc
12 changed files with 7084 additions and 0 deletions
+336
@@ -0,0 +1,336 @@
<p align="center"><a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><picture>
<source media="(prefers-color-scheme: dark)" srcset="assets/ideogram_logo_darkmode.svg">
<source media="(prefers-color-scheme: light)" srcset="assets/ideogram_logo.svg">
<img src="assets/ideogram_logo.svg" alt="Ideogram" width="500">
</picture></a></p>
<p align="center"><em>Ideogram 4: Open image model at the forefront of design</em></p>
<p align="center">
<a href="https://ideogram.ai/blog/ideogram-4.0/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Blog-Post-orange" alt="Blog Post"></a>
<a href="https://github.com/ideogram-oss/ideogram4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Code-GitHub-181717?logo=github" alt="Code"></a>
<a href="https://huggingface.co/collections/ideogram-ai/ideogram-4" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Model-HuggingFace-blue?logo=huggingface" alt="Model"></a>
<a href="https://developer.ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/API-developer.ideogram.ai-purple" alt="API"></a>
<a href="https://ideogram.ai/" target="_blank" rel="noopener noreferrer"><img src="https://img.shields.io/badge/Official%20Site-ideogram.ai-ff69b4" alt="Official Site"></a>
</p>
<p align="center">
<img src="assets/samples/collage_landscape.jpg" alt="A collage of Ideogram 4 samples spanning photorealism, illustration, typography, and poster design">
</p>
Ideogram 4 is **[Ideogram](https://ideogram.ai)'s first open-weight text-to-image model**. It is a **state-of-the-art foundation model trained from scratch** — not a fine-tune of any existing model. It introduces a new structured JSON prompting interface, with best-in-class multilingual text rendering, deep language understanding, explicit bounding-box layout and color-palette controls, and native 2k resolution images. The easiest way to try the model is online at **[ideogram.ai](https://ideogram.ai/)**.
We believe openness drives innovation, and we invite the research community to innovate with us on the forefront of visual intelligence.
## Table of Contents
1. [News](#news)
2. [Model Zoo](#model-zoo)
3. [Performance](#performance)
4. [Quick Start](#quick-start)
5. [Model Summary](#model-summary)
6. [Prompting Guide](#prompting-guide)
7. [Documentation](#documentation)
8. [Citation](#citation)
## News
* **[2026-06-03]** **Ideogram 4 released!** Inference code and weights
are now public, and our [technical blog post](https://ideogram.ai/blog/ideogram-4.0/) is live. See the
[Quick Start](#quick-start) section to generate your first image, or try the
model online at [ideogram.ai](https://ideogram.ai/).
## Model Zoo
| Model | Params | Weight Quantization | Supported Hardware | Diffusers Support | License |
| :--- | :---: | :---: | :---: | :---: | :---: |
| **[Ideogram 4 (nf4)](https://huggingface.co/ideogram-ai/ideogram-4-nf4)** | 9.3B | nf4 | CUDA | Yes | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
| **[Ideogram 4 (fp8)](https://huggingface.co/ideogram-ai/ideogram-4-fp8)** | 9.3B | fp8 | All | No | [Ideogram 4 Non-Commercial](model_licenses/LICENSE-IDEOGRAM-4-NON-COMMERCIAL) |
We plan to support more quantizations in the future.
## Performance
We evaluate Ideogram 4 across third-party arenas and benchmarks, standard
open-source benchmarks, and our own internal human-preference benchmark. Across
all of them, **Ideogram 4 is the best open-weight image model by far, and sits
at the frontier of design.**
### Design Arena
[Design Arena](https://www.designarena.ai/) is a third-party image Elo
leaderboard focused specifically on design-oriented generation. On the overall
board, Ideogram 4 is the top-ranked open-weight model, trailing only proprietary
GPT and Gemini models:
<p align="center">
<img src="assets/benchmarks/design_arena.png" alt="Design Arena overall image Elo leaderboard with Ideogram 4.0 as the top open-weight model">
</p>
Filtered to open-weight models only, Ideogram 4 leads by a commanding margin,
well ahead of the next-best open model:
<p align="center">
<img src="assets/benchmarks/design_arena2.png" alt="Design Arena open-weight image Elo leaderboard, with Ideogram 4.0 well ahead of all other open models">
</p>
### ContraLabs
[ContraLabs](https://contralabs.com/research) ran a blind typography evaluation judged by
ten professional designers from Contra's top-earning talent. Ideogram 4 leads on
first-place win rate, picked as the best of four models 47.9% of the time
overall — well ahead of Gemini 3.1 Flash Image Preview (Nano Banana 2) at 30.0%,
FLUX.2 [max] (15.5%), and Grok Imagine 1.0 (15.0%):
<p align="center">
<img src="assets/benchmarks/contralabs_typography.png" alt="ContraLabs typography first-place win rate, with Ideogram v4 leading">
</p>
It also wins on practical usability: asked "Would you use this in real client
work?", the same designers rated Ideogram 4 highest at 3.55 / 5 — significantly
above Nano Banana 2 (2.84), Grok Imagine 1.0 (2.61), and FLUX.2 [max] (2.49):
<p align="center">
<img src="assets/benchmarks/contralabs_typography2.png" alt="ContraLabs 'would you use this in real client work?' rating, with Ideogram v4 leading">
</p>
### LMArena
On [LMArena](https://lmarena.ai/), a third-party text-to-image leaderboard that
measures general-purpose text-to-image use cases, Ideogram is the top-ranked
open-weight lab and a top-5 image generation lab overall — beaten only by giant
companies with vastly larger budgets and resources:
<p align="center">
<img src="assets/benchmarks/lmarena_benchmark.png" alt="LMArena text-to-image lab leaderboard with Ideogram">
</p>
### Ideogram internal eval
For our internal human-preference benchmark, focused on graphic design and
photography, we had graphic designers deeply familiar with professional design
work do the rating blind. Bradley-Terry scores rank Ideogram 4 #2 overall —
behind only GPT Image 2 medium — and the top open-weight model:
<p align="center">
<img src="assets/benchmarks/ideogram_benchmark.png" alt="Ideogram internal design leaderboard with Ideogram 4.0">
</p>
### Open-source benchmarks
On standard open-source benchmarks measuring core capabilities — layout control
(7Bench), spatial reasoning and object fidelity (SpatialGenEval), text rendering
(X-Omni OCR), and prompt alignment (Prism) — Ideogram 4 closes the gap to the
leading closed-source models across every axis. On layout control (7Bench), it
is significantly better than all closed-source models:
<p align="center">
<img src="assets/benchmarks/opensource.png" alt="Five-axis capability radar comparing Ideogram 4.0 to leading closed-source models on layout control, spatial reasoning, object fidelity, prompt alignment, and text rendering">
</p>
At 9.3B parameters, Ideogram 4 delivers the best text rendering of any open-weight
release we benchmarked — ahead of much larger models like Qwen-Image (20B),
FLUX.2 [dev] (32B), and HunyuanImage 3.0 (80B MoE):
<p align="center">
<img src="assets/benchmarks/opensource2.png" alt="Parameter-efficiency scatter plot showing Ideogram 4.0 at 9.3B parameters leading all other open-weight models on text rendering">
</p>
## Quick Start
### Install
```bash
pip install .
```
If you plan to modify the code, install in editable mode instead so changes
under `src/ideogram4/` take effect without reinstalling:
```bash
pip install -e .
```
### Model access
The model weights are **gated** on Hugging Face, so you must accept the gate and
authenticate before the code can download them — otherwise the download fails
with a `404` / `GatedRepoError`.
1. Open the model page — [ideogram-ai/ideogram-4-nf4](https://huggingface.co/ideogram-ai/ideogram-4-nf4)
(or [ideogram-ai/ideogram-4-fp8](https://huggingface.co/ideogram-ai/ideogram-4-fp8)) — and click
**Agree and access repository** to accept the license gate.
2. Create a Hugging Face access token at
[huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) and log in so the
download is authenticated:
```bash
hf auth login
```
Alternatively, export the token directly: `export HF_TOKEN="hf_..."`.
### CLI
The plain `--prompt` is rewritten into the structured JSON caption the model
expects by a "magic prompt" LLM. By default this uses Ideogram's hosted
magic-prompt API, which is **free** and does the expansion server-side (no local
model or system prompt needed). It reads `IDEOGRAM_API_KEY` — get a key at
https://developer.ideogram.ai/:
```bash
python run_inference.py \
--prompt "a ginger cat wearing a tiny wizard hat reading a spellbook" \
--output out.png \
--quantization "nf4" \
--magic-prompt-key "$IDEOGRAM_API_KEY"
```
You can also run the expansion through your own LLM provider: our magic-prompt
system prompts are **open source**. See the
[Prompting Guide](docs/prompting.md#magic-prompt) for details.
For the highest-quality images, set `--height 2048 --width 2048` and
`--sampler-preset V4_QUALITY_48`.
#### Safety screening with Hive
Prompt and output safety screening is performed via [Hive](https://thehive.ai/).
Sign up and create a Text Moderation key and a Visual Content Moderation key,
then export them as `HIVE_TEXT_MODERATION_KEY` and `HIVE_VISUAL_MODERATION_KEY`
(or pass them via `--hive-text-key` / `--hive-visual-key`).
```bash
python run_inference.py \
--prompt "an isometric illustration of a tiny city floating in the clouds" \
--output out.png \
--quantization "nf4" \
--magic-prompt-key "$IDEOGRAM_API_KEY" \
--hive-text-key "$HIVE_TEXT_MODERATION_KEY" \
--hive-visual-key "$HIVE_VISUAL_MODERATION_KEY"
```
For sampler presets, parameter reference, and optimization tips, see
[docs/inference.md](docs/inference.md).
## Model Summary
Ideogram 4 is a **foundation model trained entirely from scratch**, not a
fine-tune or distillation of any existing checkpoint. It is a flow-matching
text-to-image model built on a **fully single-stream** Diffusion Transformer
(DiT) architecture.
**Architecture:**
- **Fully single-stream DiT.** Text and image tokens are concatenated into one
unified sequence and processed through the same 34-layer transformer, with no
separate text or image branches. This enables deep cross-modal interaction at
every layer.
- **Vision-language model as text encoder.** Instead of a text-only encoder
like CLIP or T5, Ideogram 4 uses
[Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct),
a full vision-language model that provides far richer understanding of visual
concepts. Hidden states are extracted from **13 intermediate layers** and
concatenated, giving the model multi-scale semantic features ranging from
surface-level token information to deep compositional understanding.
- **Dual-branch classifier-free guidance.** The conditional (positive) and
unconditional (negative) branches can be independently refined, enabling
separate control over prompt adherence and image quality.
- **Flexible resolution.** Native support for any resolution from 256 to 2048
(multiples of 16), with aspect ratios up to 6:1. A single model handles
everything from square thumbnails to ultrawide banners, with the noise
schedule auto-adjusting per resolution.
**Key Capabilities:**
- **Extreme controllability.** Ideogram 4 is trained on structured JSON
captions, giving users unprecedented control over composition, style,
lighting, color palette, typography, and spatial layout, all from a single
prompt.
- **State-of-the-art text rendering.** Ideogram 4 delivers best-in-class
in-image text generation (signage, logos, captions, watermarks, multi-line
text) with high fidelity directly from the prompt.
- **Spatial layout control.** Bounding-box coordinates in the prompt allow
explicit placement of subjects, text elements, and background regions.
- **Color palette conditioning.** Specify hex colors in the prompt to steer the
image's dominant color scheme.
For full architecture details, see
[docs/model_architecture.md](docs/model_architecture.md). For a walkthrough of
how the pipeline components fit together, see
[docs/pipeline.md](docs/pipeline.md).
## Prompting Guide
Ideogram 4 is trained exclusively on **structured JSON captions**. While
plain-text prompts work, you will get the best results by providing a JSON
object that follows our caption schema.
Key points:
- **Use JSON prompts** for maximum controllability — the model was trained on
them and understands the structure natively.
- **Color palette conditioning** — specify a `color_palette` array of hex
colors in the style description to steer the image's color scheme.
- **Aspect ratio flexibility** — Ideogram 4 supports a wide range of aspect
ratios (any multiple-of-16 resolution from 256 to 2048 on each side). This
is a key advantage for practical use: portraits, landscapes, banners,
phone wallpapers, social media formats, etc.
- **Bounding-box layout** — specify `bbox` coordinates in the prompt to
explicitly place subjects, text elements, and background regions.
- **Compositional control** — use `compositional_deconstruction` with bounding
boxes and per-element descriptions for precise spatial layout.
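
As a small illustration of the aspect-ratio point above, the sketch below picks a width and height for a requested aspect ratio while respecting the documented limits (multiples of 16, each side between 256 and 2048). The helper names `snap` and `fit_resolution` are hypothetical and not part of the `ideogram4` package.

```python
import math

# Limits quoted in this README: multiples of 16, each side within 256..2048.
MIN_SIDE, MAX_SIDE, STEP = 256, 2048, 16


def snap(value: float) -> int:
    """Round to the nearest multiple of 16 and clamp to the supported range."""
    snapped = int(math.floor(value / STEP + 0.5)) * STEP
    return max(MIN_SIDE, min(MAX_SIDE, snapped))


def fit_resolution(aspect_w: int, aspect_h: int, long_side: int) -> tuple[int, int]:
    """Return (width, height) close to aspect_w:aspect_h for the given long side.

    Hypothetical helper for illustration; not part of ideogram4.
    """
    if aspect_w >= aspect_h:
        width = snap(long_side)
        height = snap(long_side * aspect_h / aspect_w)
    else:
        height = snap(long_side)
        width = snap(long_side * aspect_w / aspect_h)
    return width, height


print(fit_resolution(16, 9, 1920))  # (1920, 1088)
print(fit_resolution(1, 1, 1024))   # (1024, 1024)
```

Because both sides are snapped independently, the result only approximates the requested ratio (1920 × 1088 is slightly taller than true 16:9).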
**Why JSON-only training?** We train exclusively on JSON so that training
and inference share a single, common prompt format. The training captions themselves are deliberately
**extremely descriptive**: each JSON exhaustively describes everything in
the image to maximize training efficiency. The more
text-to-image relationships each caption pins down, the more grounded
supervision the model extracts from a single training pair, rather than
having to infer those relationships across many sparsely-captioned samples.
**Why JSON at inference time?** Because the model was trained on captions
that name every object explicitly, the most reliable way to get every
requested object rendered is to mirror that pattern. Plain-text prompts still work, but
won't perform as well since the model was only trained on structured JSON captions.
**Don't want to write JSON by hand?** That's what *magic prompt* is for: it uses
an LLM to expand a plain-text prompt into a full structured caption before
generation, so you get JSON-quality results from a casual prompt. It runs by
default in `run_inference.py` (see the [CLI](#cli) section).
See [docs/prompting.md](docs/prompting.md) for a full guide.
## Documentation
| Document | Description |
| :------- | :---------- |
| [docs/prompting.md](docs/prompting.md) | How to write JSON prompts, color palette conditioning, aspect ratios |
| [docs/inference.md](docs/inference.md) | Sampler presets, parameter reference, resolutions, optimization tips |
| [docs/model_architecture.md](docs/model_architecture.md) | Architecture diagram, DiT spec, component details |
| [docs/pipeline.md](docs/pipeline.md) | Conceptual pipeline walkthrough — how all components fit together |
| [docs/development.md](docs/development.md) | Dev setup, pre-commit hooks, contributing |
| [docs/safety.md](docs/safety.md) | Pre-training, post-training, and inference-time safety mitigations; how to report violations |
## Citation
If you find the provided code or models useful for your research, consider citing them as:
```bibtex
@misc{ideogram-4-2026,
author={Ideogram AI},
title={{Ideogram 4}},
year={2026},
howpublished={\url{https://ideogram.ai/blog/ideogram-4.0/}},
}
```
## We're Hiring!
We're looking for **Research Scientists** and **Research Engineers** to
work on next-generation generative models and the products built on top of
them. Interested candidates, please apply at https://jobs.ashbyhq.com/ideogram.
+58
@@ -0,0 +1,58 @@
# Development
## Editable install
We recommend installing into an isolated environment — the dependencies include several GB of CUDA-built wheels.
```bash
python -m venv .venv && source .venv/bin/activate
```
For development, install the package in editable mode so changes to the source
tree are picked up without reinstalling:
```bash
pip install -e .
```
or with [`uv`](https://docs.astral.sh/uv/):
```bash
uv venv && source .venv/bin/activate
```
```bash
uv pip install -e .
```
## Pre-commit hooks
This repo uses [pre-commit](https://pre-commit.com/) to run lint, format, and
type checks (`ruff`, `mypy`, etc.) before each commit.
Install once per clone:
```bash
pip install pre-commit
pre-commit install
```
`pre-commit install` registers a git hook in `.git/hooks/pre-commit`, so it
requires the directory to be a git repo. The hooks now run automatically on
`git commit` against staged files.
To run the hooks manually against every file in the repo (useful right after
the first install, or in CI):
```bash
pre-commit run --all-files
```
The first run downloads each hook's environment (ruff, mypy, etc.) into
`~/.cache/pre-commit/` and may take a minute. Subsequent runs are fast.
To bump pinned hook versions in `.pre-commit-config.yaml`:
```bash
pre-commit autoupdate
```
+63
@@ -0,0 +1,63 @@
# Inference Reference
Detailed parameters, sampler presets, supported resolutions, and optimization
tips for Ideogram 4 inference.
## Sampler Presets
Named presets bundle a step count, per-step CFG schedule, schedule mean (`mu`),
and schedule standard deviation (`std`) into a single flag:
```bash
python run_inference.py \
--prompt "a cat wearing a tiny top hat" \
--sampler-preset V4_QUALITY_48 \
--output out.png
```
| Preset | Steps | CFG schedule | `mu` | `std` |
| :----- | :---: | :----------- | :--: | :---: |
| `V4_QUALITY_48` | 48 | 45 steps @ gw=7, then 3 polish steps @ gw=3 | 0.0 | 1.5 |
| `V4_DEFAULT_20` | 20 | 18 steps @ gw=7, then 2 polish steps @ gw=3 | 0.0 | 1.75 |
| `V4_TURBO_12` | 12 | 11 steps @ gw=7, then 1 polish step @ gw=3 | 0.5 | 1.75 |
`V4_QUALITY_48` is the default. Fewer steps trade quality for speed. The full
registry lives in
[`ideogram4.sampler_configs.PRESETS`](../src/ideogram4/sampler_configs.py); add a
new entry there to define your own.
## Key Parameters
These are the keyword arguments accepted by `Ideogram4Pipeline.__call__`. The
defaults below apply when you call `pipe(...)` directly; `run_inference.py`
overrides `num_steps`, `guidance_schedule`, `mu`, and `std` from the chosen
sampler preset (see above).
| Parameter | Default | Notes |
| :-------- | :-----: | :---- |
| `height` / `width` | 1024 | Must be multiples of 16. Supported range: 256 to 2048. Aspect ratios up to 6:1 or 1:6. |
| `num_steps` | 48 | More steps = higher quality. The `V4_QUALITY_48` preset (48 steps) is a good speed/quality trade-off. |
| `guidance_scale` | 7.0 | Constant guidance weight used when no `guidance_schedule` is given. Higher = more prompt adherence, lower = more diversity. |
| `guidance_schedule` | `None` | Optional per-step guidance weights (loop-index order: index 0 is the final step). Overrides `guidance_scale`. |
| `mu` | 0.5 | Logit-normal schedule mean. Auto-adjusted for resolution. |
| `std` | 1.0 | Logit-normal schedule standard deviation. |
| `seed` | `None` | Set for reproducible results. |
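
To make the loop-index ordering of `guidance_schedule` concrete, here is a minimal sketch that builds the kind of schedule the preset table describes (a main guidance weight followed by a few polish steps). `build_guidance_schedule` is a hypothetical helper, and whether the pipeline accepts a plain Python list is an assumption; check `sampler_configs.py` for the real representation.

```python
def build_guidance_schedule(num_steps, main_gw=7.0, polish_gw=3.0, polish_steps=3):
    """Per-step guidance weights in loop-index order (index 0 is the final step).

    Hypothetical helper for illustration; not part of ideogram4.
    """
    if not 0 <= polish_steps <= num_steps:
        raise ValueError("polish_steps must be between 0 and num_steps")
    # The polish steps run last, so they occupy the lowest indices.
    return [polish_gw] * polish_steps + [main_gw] * (num_steps - polish_steps)


schedule = build_guidance_schedule(48, polish_steps=3)

# Sampling walks the indices from high to low, so the gw=7 steps come first.
order = [schedule[step] for step in reversed(range(len(schedule)))]
print(order[:2], order[-3:])  # [7.0, 7.0] [3.0, 3.0, 3.0]
```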
## Supported Resolutions
Ideogram 4 natively supports any resolution where both height and width are
multiples of 16, within the range 256 to 2048 (aspect ratios up to 6:1 or 1:6).
| Use case | Resolution | Aspect ratio |
| :------- | :--------: | :----------: |
| Square | 1024 × 1024 | 1:1 |
| Landscape | 1536 × 1024 | 3:2 |
| Portrait | 1024 × 1536 | 2:3 |
| Widescreen | 1920 × 1088 | ~16:9 |
| Ultrawide | 2048 × 768 | 8:3 |
| Phone wallpaper | 1024 × 1792 | ~9:16 |
| Social banner | 1600 × 400 | 4:1 |
Resolution buckets use 16-pixel increments, giving fine-grained control over
output dimensions.
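
The constraints in this section can be checked before launching a generation. The following is a minimal sketch of such a check; `check_resolution` is a hypothetical helper written for this document, and the pipeline may enforce the limits differently.

```python
def check_resolution(width: int, height: int) -> list[str]:
    """Return a list of problems; an empty list means the size fits the documented limits.

    Hypothetical helper for illustration; not part of ideogram4.
    """
    problems = []
    for name, side in (("width", width), ("height", height)):
        if side % 16 != 0:
            problems.append(f"{name} is not a multiple of 16")
        if not 256 <= side <= 2048:
            problems.append(f"{name} is outside 256..2048")
    # Aspect ratio limit applies in both orientations (6:1 or 1:6).
    if max(width, height) > 6 * min(width, height):
        problems.append("aspect ratio exceeds 6:1")
    return problems


print(check_resolution(1920, 1088))  # []
print(check_resolution(1920, 1080))  # ['height is not a multiple of 16']
print(check_resolution(2048, 256))   # ['aspect ratio exceeds 6:1']
```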
+45
@@ -0,0 +1,45 @@
# Model Architecture
```
prompt ─► Qwen3-VL-8B-Instruct (extract hidden states from layers (0,3,…,33,35) → concat)
   │
   ▼
┌──────────────────────────────────────────────────┐
│ Ideogram4Transformer                             │
│ • 34 × Ideogram4TransformerBlock                 │
│     Ideogram4Attention (QK-RMSNorm, MRoPE)       │
│     Ideogram4MLP (SwiGLU)                        │
│     adaln scale/gate from t-embedding            │
│ • Ideogram4FinalLayer                            │
└──────────────────────────────────────────────────┘
   │ velocity prediction
   ▼
Euler flow-matching sampler with asymmetric CFG
   │ denoised image latents
   ▼
VAE decode
   │
   ▼
PIL.Image
```
The transformer is a single-stream DiT: text tokens (Qwen3-VL hidden states from
the activation layers) and image latent tokens are concatenated into one
sequence, modulated per-block by an AdaLN computed from the flow-matching
timestep embedding. Attention uses QK-RMSNorm and 3D MRoPE so that text and
image tokens share a unified positional space.
Model spec:
| field | value |
|-------------------|---------------|
| `emb_dim` | 4608 |
| `num_layers` | 34 |
| `num_heads` | 18 |
| `intermediate` | 12288 |
| `adanln_dim` | 512 |
| `rope_theta` | 5_000_000 |
| `mrope_section` | (24, 20, 20) |
| latent channels | 32 × 2² = 128 |
| max text tokens | 2048 |
| sampler | Euler flow-matching, logit-normal schedule, asymmetric CFG |
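
As a rough sanity check on the spec table, the sketch below derives the per-head dimension, the patched latent channel count, and a back-of-the-envelope parameter estimate for the transformer blocks. It assumes each block has four `emb_dim × emb_dim` attention projections and a three-matrix SwiGLU MLP, and it ignores biases, norms, AdaLN, embeddings, and the final layer. It is therefore an estimate, not the exact parameter count.

```python
# Values copied from the model spec table.
EMB_DIM = 4608
NUM_LAYERS = 34
NUM_HEADS = 18
INTERMEDIATE = 12288

head_dim = EMB_DIM // NUM_HEADS          # features per attention head
latent_channels = 32 * 2**2              # 32 VAE channels x (2x2 patch)

# Assumption: q, k, v and output projections, each emb_dim x emb_dim.
attn_params = 4 * EMB_DIM**2
# Assumption: SwiGLU uses gate, up and down projections.
mlp_params = 3 * EMB_DIM * INTERMEDIATE

block_params = attn_params + mlp_params
total_params = NUM_LAYERS * block_params

print(head_dim)                          # 256
print(latent_channels)                   # 128
print(f"{total_params / 1e9:.2f}B")      # 8.66B
```

Since several components are left out, the estimate lands below the 9.3B figure quoted for the full model.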
+183
@@ -0,0 +1,183 @@
# Pipeline: How All the Components Work Together
This document explains the end-to-end Ideogram 4 inference pipeline
conceptually. For the architecture spec and code pointers, see
[model_architecture.md](model_architecture.md).
## Overview
Ideogram 4 is a **flow-matching text-to-image model** built on a
**single-stream DiT** (Diffusion Transformer). The pipeline has four main
components:
```
┌─────────────┐   ┌──────────────────────┐   ┌──────────────┐   ┌───────────┐
│  Qwen3-VL   │   │  Ideogram4           │   │  KL VAE      │   │           │
│  Text       ├──►│  Transformer (DiT)   ├──►│  VAE         ├──►│  Image    │
│  Encoder    │   │  + Euler Sampler     │   │  Decoder     │   │           │
└─────────────┘   └──────────────────────┘   └──────────────┘   └───────────┘
    frozen               trainable                frozen
```
## 1. Text Encoder — Qwen3-VL-8B-Instruct
The text encoder is a frozen [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct)
vision-language model, used in text-only mode (no vision inputs).
**What it does:**
- Tokenizes the prompt using the Qwen3 chat template.
- Runs a forward pass through the 36-layer transformer.
- **Extracts hidden states** from 13 specific layers: 0, 3, 6, 9, 12, 15, 18, 21,
24, 27, 30, 33, 35.
- Concatenates these hidden states along the feature dimension, producing a
multi-scale text representation.
**Why multi-layer extraction?** Different layers capture different levels of
abstraction — early layers encode surface-level token information, while later
layers encode deeper semantic meaning. Concatenating them gives the DiT access
to the full spectrum.
**Output:** A tensor of shape `(batch, num_text_tokens, hidden_dim * 13)`.
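
The layer selection and concatenation described above can be sketched with plain Python lists. This is a toy illustration only: `concat_hidden_states` is a hypothetical function, the hidden size of 4 is made up, and how the layer numbers map onto the encoder's actual outputs is not covered here.

```python
# Layers 0, 3, ..., 33 plus the last layer, 35: 13 layers in total.
LAYER_INDICES = list(range(0, 34, 3)) + [35]


def concat_hidden_states(hidden_states, layer_indices):
    """Concatenate per-token features from the selected layers.

    hidden_states[layer][token] is a list of floats (one feature vector).
    Hypothetical helper for illustration; not part of ideogram4.
    """
    num_tokens = len(hidden_states[0])
    return [
        [x for layer in layer_indices for x in hidden_states[layer][tok]]
        for tok in range(num_tokens)
    ]


# Toy encoder output: 36 layers, 2 tokens, hidden size 4 (made-up numbers).
hidden_dim = 4
toy = [[[float(layer)] * hidden_dim for _ in range(2)] for layer in range(36)]

features = concat_hidden_states(toy, LAYER_INDICES)
print(len(LAYER_INDICES))              # 13
print(len(features), len(features[0]))  # 2 52
```

Each token ends up with `hidden_dim * 13` features, matching the output shape stated above.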
## 2. DiT Backbone — Ideogram4Transformer
The core generative model is a 34-layer single-stream Diffusion Transformer.
### Sequence layout
Text tokens and image latent tokens are concatenated into one sequence and
processed through the same self-attention layers.
```
Sequence layout (per sample):
┌───────────────────┬────────────────────────┐
│  text tokens      │  image latent tokens   │
│  (up to 2048)     │  (grid_h × grid_w)     │
└───────────────────┴────────────────────────┘
         ▲                      ▲
  Qwen3-VL features      noisy latents z_t
```
### Key components per block
- **Self-attention** with QK-RMSNorm and 3D Multimodal RoPE (MRoPE). The
positional encoding is 3-dimensional: for text tokens it uses a 1D position
broadcast to 3 axes; for image tokens it uses (temporal, height, width)
coordinates. This lets text and image tokens coexist in a unified positional
space.
- **SwiGLU MLP** — the feed-forward layer uses a gated linear unit with SiLU
activation.
- **Adaptive Layer Norm (AdaLN)** — the timestep `t` is embedded as a scalar
and generates per-block scale and gate parameters. This conditions every layer
on the current noise level.
### Flow matching
The model is trained with a **flow-matching** objective. Instead of predicting
noise (as in DDPM), the model predicts a **velocity field** `v(z_t, t)` that
defines the ODE:
```
dz/dt = v(z_t, t)
```
At inference time, we start from pure Gaussian noise `z_1` and integrate
backward to `z_0` (the clean image) using the Euler method:
```
z_{t-dt} = z_t - v(z_t, t) * dt
```
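
A toy one-dimensional example can make the backward Euler integration tangible. The sketch assumes a straight-line path `z_t = (1 - t) * z_0 + t * z_1`, whose velocity is the constant `z_1 - z_0`, and a uniform time grid. Both are simplifications for illustration: the real sampler uses a learned velocity and the logit-normal schedule.

```python
def euler_integrate(z1, velocity, num_steps):
    """Integrate dz/dt = velocity(z, t) from t=1 down to t=0 with uniform Euler steps."""
    z = z1
    for i in range(num_steps):
        t = 1.0 - i / num_steps          # current time
        s = 1.0 - (i + 1) / num_steps    # next, smaller time
        z = z + velocity(z, t) * (s - t)  # (s - t) is negative: we move toward t=0
    return z


z0_true, z1 = 2.0, -1.0  # toy "clean" value and toy "noise" value


def toy_velocity(z, t):
    # Constant velocity of the assumed straight-line path.
    return z1 - z0_true


z0 = euler_integrate(z1, toy_velocity, num_steps=8)
print(z0)  # 2.0
```

With a constant velocity field, Euler integration is exact, so the clean value is recovered regardless of the step count.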
### Noise schedule
The timestep distribution follows a **logit-normal schedule** parameterized by
`(mu, std)`. The mean `mu` controls how much time the sampler spends at
different noise levels — higher `mu` shifts more steps toward higher noise
(important for high-resolution images). The schedule auto-adjusts for
resolution:
```
mu_adjusted = mu_base + 0.5 * log(num_pixels / base_pixels)
```
where `base_pixels = 512 * 512`.
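
For a feel of the size of this adjustment, the sketch below evaluates the formula at a few resolutions. It assumes `log` is the natural logarithm, which the text does not state explicitly; `adjusted_mu` is a hypothetical name.

```python
import math

BASE_PIXELS = 512 * 512


def adjusted_mu(mu_base: float, width: int, height: int) -> float:
    """Resolution-adjusted schedule mean (assumes natural log)."""
    return mu_base + 0.5 * math.log(width * height / BASE_PIXELS)


for width, height in [(512, 512), (1024, 1024), (2048, 2048)]:
    print(width, height, round(adjusted_mu(0.0, width, height), 3))
# 512 512 0.0
# 1024 1024 0.693
# 2048 2048 1.386
```

Under that assumption, each doubling of both sides adds about 0.69 to `mu`.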
## 3. Classifier-Free Guidance (CFG)
At each sampling step, two forward passes are run through the DiT:
1. **Conditional (positive):** full text features + noisy image latents.
2. **Unconditional (negative):** zeroed text features + noisy image latents
(image-only tokens, asymmetric CFG).
The guided velocity is a weighted combination:
```
v_guided = gw * v_conditional + (1 - gw) * v_unconditional
```
where `gw` is the per-step guidance weight. With
`gw > 1`, the model amplifies the text-conditional signal and suppresses the
unconditional prediction, producing images that follow the prompt more
faithfully.
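
The combination above is a plain element-wise formula, sketched here on short Python lists standing in for velocity tensors. `guided_velocity` is a hypothetical name used only for this example.

```python
def guided_velocity(v_cond, v_uncond, gw):
    """Element-wise CFG combination: gw * v_cond + (1 - gw) * v_uncond."""
    return [gw * c + (1.0 - gw) * u for c, u in zip(v_cond, v_uncond)]


v_cond = [1.0, 0.5]
v_uncond = [0.0, 0.5]

print(guided_velocity(v_cond, v_uncond, 1.0))  # [1.0, 0.5]
print(guided_velocity(v_cond, v_uncond, 7.0))  # [7.0, 0.5]
```

With `gw = 1` the result is just the conditional prediction. With `gw = 7` the component where the two branches disagree is extrapolated, while the component where they agree is left unchanged.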
**Asymmetric CFG:** The unconditional branch only processes image tokens (no
text padding), making it computationally cheaper than a full-sequence negative
pass.
**Per-step schedules:** The guidance weight can vary across steps. The
`V4_QUALITY_48` preset, for example, uses `gw=7` for the first 45 steps and
`gw=3` for the final 3 "polish" steps near `t=0`.
## 4. VAE Decoder — KL Autoencoder
The denoised latent `z_0` is decoded to pixel space using a frozen KL
autoencoder.
**What it does:**
- **Unpatching:** The DiT works with 2×2 patches of latent pixels. The decoder
input is reshaped from `(batch, grid_h * grid_w, channels * 4)` to
`(batch, channels, grid_h * 2, grid_w * 2)`.
- **Denormalization:** Per-channel shift and scale are applied to undo the
latent normalization used during training.
- **Decoding:** The VAE decoder maps latents to RGB pixels.
- **Clipping:** Output is clamped to [-1, 1] and rescaled to [0, 255] uint8.
**Compression factor:** The autoencoder provides 8× spatial compression on each
axis, and the 2×2 patching in the DiT adds another 2×. So a 1024×1024 image
is represented as a 64×64 grid of latent tokens, each with 128 channels
(32 base channels × 2² patch).
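
The arithmetic in this paragraph generalizes to any supported resolution. The sketch below computes the token grid and channel count from the numbers given above (8× VAE compression, 2×2 patching, 32 base channels); `latent_layout` is a hypothetical helper.

```python
def latent_layout(width, height, vae_factor=8, patch=2, base_channels=32):
    """Token grid and channel count for an image size, per the factors stated above.

    Hypothetical helper for illustration; not part of ideogram4.
    """
    stride = vae_factor * patch  # 16 image pixels per token along each axis
    if width % stride or height % stride:
        raise ValueError(f"width and height must be multiples of {stride}")
    grid_w, grid_h = width // stride, height // stride
    return {
        "grid_h": grid_h,
        "grid_w": grid_w,
        "num_tokens": grid_h * grid_w,
        "channels": base_channels * patch**2,
    }


print(latent_layout(1024, 1024))
# {'grid_h': 64, 'grid_w': 64, 'num_tokens': 4096, 'channels': 128}
```

This also shows why both sides must be multiples of 16: one token covers a 16 × 16 pixel area.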
## Putting it all together
```python
# Pseudocode for one generation call:

# 1. Encode text
text_features = qwen3_vl.encode(prompt)    # (B, L_text, D)

# 2. Initialize noise
z = torch.randn(B, grid_h * grid_w, 128)   # pure noise at t=1

# 3. Euler integration from t=1 to t=0
#    (loop-index order: step 0 is the final step, nearest t=0)
for step in reversed(range(num_steps)):
    t = schedule(step)       # current time
    s = schedule(step - 1)   # next, lower-noise time

    # Conditional pass (text + image)
    v_cond = dit(text_features, z, t)

    # Unconditional pass (image only, zeroed text)
    v_uncond = dit(zeros, z, t)

    # CFG combination
    v = gw[step] * v_cond + (1 - gw[step]) * v_uncond

    # Euler step (s < t, so this moves toward t=0)
    z = z + v * (s - t)

# 4. Decode to pixels
image = vae.decode(z)
```
+362
@@ -0,0 +1,362 @@
# Prompting Guide
Ideogram 4 is trained exclusively on **structured JSON captions** (serialized as strings). While the
model can accept plain-text prompts, providing a JSON object that follows the
caption schema gives significantly better results, especially for
controllability, spatial layout, and style fidelity.
## Plain-text vs. JSON prompts
You can pass in plain-text prompts directly to the model and it will work. The
sampling parameters come from a named preset in `ideogram4.PRESETS` (the same
ones `run_inference.py` exposes via `--sampler-preset`), unpacked into the
`pipe()` call:
```python
from ideogram4 import PRESETS

preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    "a golden retriever on a skateboard",
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
For higher-quality generations and more control, pass a JSON string as the prompt:
```python
import json

from ideogram4 import PRESETS

caption = {
    "high_level_description": "A golden retriever riding a skateboard down a sunny sidewalk.",
    "style_description": {
        "aesthetics": "warm, playful, vibrant",
        "lighting": "bright afternoon sunlight, long soft shadows",
        "photo": "shallow depth of field, eye-level, 85mm lens",
        "medium": "photograph",
        "color_palette": ["#F5C542", "#87CEEB", "#4A4A4A", "#FFFFFF", "#2E8B57"]
    },
    "compositional_deconstruction": {
        "background": "A sun-drenched suburban sidewalk lined with green hedges and a white picket fence. Dappled light filters through overhead trees.",
        "elements": [
            {"type": "obj", "bbox": [200, 300, 800, 900], "desc": "A golden retriever with a fluffy coat, standing on a red skateboard with all four paws. Its tongue is out and ears are flapping in the wind."},
            {"type": "obj", "bbox": [250, 750, 750, 950], "desc": "A worn red skateboard with black wheels rolling along the concrete sidewalk."}
        ]
    }
}

preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    json.dumps(caption, separators=(",", ":"), ensure_ascii=False),
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
## Magic prompt
Writing these captions by hand is optional. *Magic prompt* uses an LLM to expand
a plain-text prompt into a full structured caption for you, so you get the
quality of a JSON prompt from a casual one. It is enabled by default in
`run_inference.py`; you can also call it directly:
```python
import os

from ideogram4 import ClaudeOpusMagicPromptV1, PRESETS

magic = ClaudeOpusMagicPromptV1(api_key=os.environ["MAGIC_PROMPT_API_KEY"])
caption = magic.expand("a golden retriever on a skateboard", aspect_ratio="1:1")

preset = PRESETS["V4_QUALITY_48"]
images = pipe(
    caption,
    height=1024,
    width=1024,
    num_steps=preset.num_steps,
    guidance_schedule=preset.guidance_schedule,
    mu=preset.mu,
    std=preset.std,
)
```
The package ships three configurations, registered by name in
`ideogram4.MAGIC_PROMPTS` (the keys `run_inference.py` accepts via
`--magic-prompt-model`):
| Config class | Registry key | Backend |
| :--- | :--- | :--- |
| `Ideogram4MagicPromptV1` | `ideogram-4-v1` | Ideogram's hosted magic-prompt API (free; reads `IDEOGRAM_API_KEY`) |
| `ClaudeOpusMagicPromptV1` | `claude-opus-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
| `ClaudeSonnetMagicPromptV1` | `claude-sonnet-v1` | [OpenRouter](https://openrouter.ai) (reads `MAGIC_PROMPT_API_KEY`) |
`ideogram-4-v1` is the default and is **free**. It runs the expansion
server-side, so there is no local model or system prompt involved — it just needs
an Ideogram API key (get one at
[developer.ideogram.ai](https://developer.ideogram.ai)). The `claude-*`
configurations instead send one of our open-source system prompts to an OpenRouter model;
select one with `--magic-prompt-model` and export `MAGIC_PROMPT_API_KEY`:
```bash
python run_inference.py \
--prompt "an isometric illustration of a tiny city floating in the clouds" \
--output out.png \
--quantization "nf4" \
--magic-prompt-model claude-opus-v1 \
--magic-prompt-key "$MAGIC_PROMPT_API_KEY"
```
See the README's [CLI](../README.md#cli) section for the rest of the flags.
Our magic-prompt system prompts are **open source** (they ship in
`src/ideogram4/magic_prompt_system_prompts/`), so you're also welcome to
construct the caption with any system prompt and LLM of your choosing.
**A few caveats:**
- At Ideogram we've tested this magic prompt with **Claude Opus**. You're welcome
to implement your own `MagicPrompt` configurations and/or drive a different LLM
with our system prompt, but those paths aren't tested by us and quality may
vary.
- The magic prompt shipped here is **not** the same magic prompt used in
production at [Ideogram.ai](https://ideogram.ai) — results will differ from the
hosted product (including the `ideogram-4-v1` API).
## JSON caption schema
> **Note:** Following this schema is **not required** — the model accepts any
> string as a prompt. The schema below describes the exact structure the model
> was trained on, and matching it minimizes train/eval mismatch so the model
> generates closer to its full quality. Treat the "required" / "must" language
> in the rest of this section as the format the [`CaptionVerifier`](../src/ideogram4/caption_verifier.py)
> checks against, not as a hard pipeline constraint. Deviating from the schema
> is allowed; it just means you're sampling outside the training distribution.
The full caption schema has three top-level fields:
1. `high_level_description`: optional string, but strongly recommended.
2. `style_description`: optional object.
3. `compositional_deconstruction`: **required** object.
`compositional_deconstruction` must always be present. Within it, both
`background` and `elements` are required.
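
As a rough illustration of the required/optional split above, here is a minimal sketch that builds the smallest schema-conforming caption and reports missing required fields. `missing_required_fields` is a hypothetical helper written for this example; it is not part of the `ideogram4` package and does not replace `CaptionVerifier`.

```python
# Hypothetical helper for illustration; not part of the ideogram4 package.
def missing_required_fields(caption: dict) -> list[str]:
    """Return dotted paths of required schema fields absent from a caption."""
    decon = caption.get("compositional_deconstruction")
    if not isinstance(decon, dict):
        return ["compositional_deconstruction"]
    # Within compositional_deconstruction, both keys are required.
    return [
        f"compositional_deconstruction.{key}"
        for key in ("background", "elements")
        if key not in decon
    ]


# Smallest caption that satisfies the schema: style_description is omitted,
# high_level_description is optional but recommended.
minimal_caption = {
    "high_level_description": "A red bicycle leaning against a brick wall.",
    "compositional_deconstruction": {
        "background": "A weathered red-brick wall in soft daylight.",
        "elements": [{"type": "obj", "desc": "A red bicycle with a wicker basket."}],
    },
}

print(missing_required_fields(minimal_caption))  # []
```
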
### `high_level_description`
A one- or two-sentence summary of the entire image. Strongly recommended in every prompt.
```json
"high_level_description": "A medium-shot photograph of a barista pouring latte art in a cozy cafe."
```
### `style_description`
Controls the visual style, lighting, medium, and color palette.
`style_description` must contain **exactly one** of:
- `photo` — for photographic captions (paired with `medium: "photograph"`).
- `art_style` — for non-photographic captions (illustration, painting, 3D render, etc.).
`aesthetics`, `lighting`, and `medium` are also required when `style_description` is present. `color_palette` is optional.
**Key order is strict** and depends on which of `photo` / `art_style` is used:
| Caption type | Required key order |
| :----------- | :----------------- |
| Photo (uses `photo`) | `aesthetics`, `lighting`, `photo`, `medium`, `color_palette` |
| Non-photo (uses `art_style`) | `aesthetics`, `lighting`, `medium`, `art_style`, `color_palette` |
`color_palette` is the only field in this list that may be omitted; if it is included it must remain in the final position.
Field descriptions:
| Field | Type | Description |
| :---- | :--- | :---------- |
| `aesthetics` | string | Aesthetic keywords (e.g. "moody, cinematic, desaturated") |
| `lighting` | string | Lighting description (e.g. "golden hour, rim light, dramatic shadows") |
| `photo` | string | Camera/lens details for photographic outputs (e.g. "35mm, f/1.4, bokeh"). Use this OR `art_style`, not both. |
| `medium` | string | Medium type: `"photograph"`, `"illustration"`, `"3d_render"`, `"painting"`, `"graphic_design"`, etc. |
| `art_style` | string | Art style description for non-photo captions (e.g. "flat vector illustration, bold outlines"). Use this OR `photo`, not both. |
| `color_palette` | list[str] | Hex color codes that steer the image's dominant colors. Up to 16 entries. |
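
Because Python dictionaries preserve insertion order, the key-order rules above can be enforced at construction time. The sketch below is one possible way to do that; `build_style_description` is a hypothetical name chosen for this example, not an `ideogram4` API.

```python
# Hypothetical helper for illustration; not part of the ideogram4 package.
def build_style_description(aesthetics, lighting, medium, *,
                            photo=None, art_style=None, color_palette=None):
    """Build a style_description dict with keys in the documented order."""
    # Exactly one of photo / art_style must be given.
    if (photo is None) == (art_style is None):
        raise ValueError("provide exactly one of 'photo' or 'art_style'")

    style = {"aesthetics": aesthetics, "lighting": lighting}
    if photo is not None:
        # Photo captions: photo comes before medium.
        style["photo"] = photo
        style["medium"] = medium
    else:
        # Non-photo captions: medium comes before art_style.
        style["medium"] = medium
        style["art_style"] = art_style
    if color_palette is not None:
        # color_palette, when present, is always last.
        style["color_palette"] = list(color_palette)
    return style


photo_style = build_style_description(
    "moody, cinematic", "low-key, deep shadows", "photograph",
    photo="35mm, f/1.4", color_palette=["#1B1B2F", "#E43F5A"],
)
print(list(photo_style))
# ['aesthetics', 'lighting', 'photo', 'medium', 'color_palette']
```
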
### `compositional_deconstruction`
Provides fine-grained spatial control over the image layout using bounding
boxes and per-element descriptions. Both fields below are required.
| Field | Type | Description |
| :---- | :--- | :---------- |
| `background` | string | Description of the background/environment (required) |
| `elements` | list[dict] | List of elements with optional bounding boxes (required) |
`background` must come before `elements`.
Each element in `elements` must follow a fixed **key order** depending on its
type. `bbox` and `color_palette` are optional within an element; if present they
must appear in the positions shown below.
| Type | Required key order |
| :--- | :----------------- |
| `"obj"` | `type`, `bbox`, `desc`, `color_palette` |
| `"text"` | `type`, `bbox`, `text`, `desc`, `color_palette` |
Field descriptions:
| Field | Type | Description |
| :---- | :--- | :---------- |
| `type` | string | `"obj"` for objects/subjects, `"text"` for in-image text |
| `bbox` | list[int] | `[y_min, x_min, y_max, x_max]` in normalized `0–1000` coordinates (origin at top-left). Optional. |
| `desc` | string | Detailed description of the element |
| `text` | string | (only for `type: "text"`) The literal text to render |
| `color_palette` | list[str] | Optional per-element palette. Up to 5 hex entries. |
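
If you lay out elements in pixel space (for example on a canvas), the boxes need converting to the `[y_min, x_min, y_max, x_max]` form on the 0 to 1000 grid. The sketch below shows one way to do that and to emit elements with the documented key order. Both function names are hypothetical and exist only for this example; rounding to the nearest integer is also an assumption.

```python
# Hypothetical helpers for illustration; not part of the ideogram4 package.
def pixel_box_to_bbox(left, top, right, bottom, width, height):
    """Convert a pixel box to [y_min, x_min, y_max, x_max] on the 0-1000 grid."""
    def scale(value, size):
        # Normalize to 0-1000, round (an assumption), and clamp to the grid.
        return max(0, min(1000, round(value * 1000 / size)))

    return [scale(top, height), scale(left, width),
            scale(bottom, height), scale(right, width)]


def make_element(kind, desc, *, text=None, bbox=None, color_palette=None):
    """Build an element dict with keys in the documented order."""
    if kind not in ("obj", "text"):
        raise ValueError("type must be 'obj' or 'text'")
    if (kind == "text") != (text is not None):
        raise ValueError("'text' is required for type 'text' and only allowed there")

    element = {"type": kind}
    if bbox is not None:
        element["bbox"] = list(bbox)           # optional, second position
    if kind == "text":
        element["text"] = text                 # literal text to render
    element["desc"] = desc
    if color_palette is not None:
        element["color_palette"] = list(color_palette)  # optional, last
    return element


bbox = pixel_box_to_bbox(left=256, top=128, right=768, bottom=896,
                         width=1024, height=1024)
print(bbox)  # [125, 250, 875, 750]
print(make_element("text", "Bold white title.", text="HELLO", bbox=bbox))
```
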
**Key ordering matters.** The model was trained on JSON with a consistent key
order, so maintaining it improves generation quality. The pipeline runs
[`CaptionVerifier`](../src/ideogram4/caption_verifier.py) on every prompt and emits
warnings for unknown keys, missing required keys, or out-of-order keys.
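
To make the idea of an order check concrete, here is a small sketch that flags unknown and out-of-order keys in a single element. It is written for this guide only and is not the `CaptionVerifier` implementation, whose internals may differ; it also does not check for missing required keys.

```python
# Illustrative only: this is NOT the CaptionVerifier implementation.
OBJ_KEY_ORDER = ["type", "bbox", "desc", "color_palette"]
TEXT_KEY_ORDER = ["type", "bbox", "text", "desc", "color_palette"]


def key_order_problems(obj: dict, expected: list[str]) -> list[str]:
    """Describe unknown or out-of-order keys in obj relative to expected."""
    problems = [f"unknown key: {key}" for key in obj if key not in expected]
    # Known keys in the order they actually appear ...
    present = [key for key in obj if key in expected]
    # ... versus the order they should appear in (optional keys may be absent).
    canonical = [key for key in expected if key in obj]
    if present != canonical:
        problems.append(f"out-of-order keys: {present} (expected {canonical})")
    return problems


good = {"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral."}
bad = {"type": "text", "desc": "Bold red numeral.", "text": "1"}

print(key_order_problems(good, TEXT_KEY_ORDER))  # []
print(key_order_problems(bad, TEXT_KEY_ORDER))   # one out-of-order message
```
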
**Hex color format.** Colors in `color_palette` must be uppercase
`#RRGGBB` strings (e.g. `#1B1B2F`, not `#1b1b2f` or `#fff`).
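
If your colors come from a design tool or CSS, they may be lowercase or use the three-digit shorthand. A minimal sketch of a normalizer follows; `normalize_hex` is a hypothetical helper, and expanding shorthand (`#fff` to `#FFFFFF`) is a convenience assumed here rather than something the pipeline does for you.

```python
import re

_HEX3 = re.compile(r"#[0-9A-F]{3}")
_HEX6 = re.compile(r"#[0-9A-F]{6}")


# Hypothetical helper for illustration; not part of the ideogram4 package.
def normalize_hex(color: str) -> str:
    """Return color as an uppercase #RRGGBB string, or raise ValueError."""
    value = color.strip().upper()
    if _HEX3.fullmatch(value):
        # Expand shorthand by doubling each digit: #FFF -> #FFFFFF.
        value = "#" + "".join(digit * 2 for digit in value[1:])
    if not _HEX6.fullmatch(value):
        raise ValueError(f"not a #RRGGBB color: {color!r}")
    return value


print(normalize_hex("#1b1b2f"))  # #1B1B2F
print(normalize_hex("#fff"))     # #FFFFFF
```
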
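**Encoding.** When serializing with Python's `json` module, pass
`separators=(",", ":")` and `ensure_ascii=False`, so the output has no extra
whitespace and non-ASCII characters are written literally. By default,
`json.dumps` escapes them as `\uXXXX`. `CaptionVerifier` warns when it detects
`\uXXXX` escapes with no literal non-ASCII characters in the raw text.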
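
A short sketch of the serialization settings described above. `serialize_caption` is a hypothetical wrapper name; the `json.dumps` arguments are the ones this section recommends.

```python
import json


# Hypothetical wrapper for illustration; not part of the ideogram4 package.
def serialize_caption(caption: dict) -> str:
    """Serialize a caption compactly, keeping non-ASCII characters literal."""
    return json.dumps(caption, separators=(",", ":"), ensure_ascii=False)


caption = {
    "compositional_deconstruction": {
        "background": "A café terrace at dusk.",
        "elements": [],
    }
}

prompt = serialize_caption(caption)
print(prompt)
# {"compositional_deconstruction":{"background":"A café terrace at dusk.","elements":[]}}
```

The resulting string is the prompt text; with the default `json.dumps` settings the `é` would instead appear as `\u00e9`.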
## Color palette conditioning
One of Ideogram 4's distinctive features is **color palette control**. By
providing a `color_palette` array of hex colors in `style_description`, you
can steer the dominant colors of the generated image.
```json
"style_description": {
"aesthetics": "moody, cinematic",
"lighting": "low-key, deep shadows",
"photo": "35mm, f/1.4",
"medium": "photograph",
"color_palette": ["#1B1B2F", "#162447", "#1F4068", "#E43F5A", "#F5F5F5"]
}
```
Tips for effective color palette use:
- **Up to 16 colors** in `style_description.color_palette` for the overall
image palette, and **up to 5 colors** per element in
`compositional_deconstruction.elements[*].color_palette`.
- **Include background colors** — if you want a dark background, include the
dark hex in the palette.
- **Contrast pairs** — include both your highlight and shadow colors for more
controlled lighting.
- **Uppercase hex only** — `#RRGGBB` form, no shorthand.
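
The size limits in the tips above are easy to check before generation. The sketch below counts palette entries at both levels; `palette_size_problems` and the two constants are hypothetical names for this example, and the limits are the ones stated in this guide.

```python
# Limits as documented in this guide.
MAX_STYLE_COLORS = 16
MAX_ELEMENT_COLORS = 5


# Hypothetical helper for illustration; not part of the ideogram4 package.
def palette_size_problems(caption: dict) -> list[str]:
    """Report palettes that exceed the documented size limits."""
    problems = []
    style_palette = caption.get("style_description", {}).get("color_palette", [])
    if len(style_palette) > MAX_STYLE_COLORS:
        problems.append(
            f"style_description.color_palette has {len(style_palette)} colors "
            f"(max {MAX_STYLE_COLORS})"
        )
    elements = caption.get("compositional_deconstruction", {}).get("elements", [])
    for index, element in enumerate(elements):
        palette = element.get("color_palette", [])
        if len(palette) > MAX_ELEMENT_COLORS:
            problems.append(
                f"elements[{index}].color_palette has {len(palette)} colors "
                f"(max {MAX_ELEMENT_COLORS})"
            )
    return problems


caption = {
    "style_description": {"color_palette": ["#1B1B2F", "#162447", "#E43F5A"]},
    "compositional_deconstruction": {
        "background": "A dark studio backdrop.",
        "elements": [{"type": "obj", "desc": "A red vase.", "color_palette": ["#E43F5A"]}],
    },
}
print(palette_size_problems(caption))  # []
```
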
### Example: warm sunset palette
```json
{
"high_level_description": "A lone sailboat on calm water at sunset.",
"style_description": {
"aesthetics": "serene, warm, golden hour",
"lighting": "golden hour backlighting, warm atmospheric haze",
"photo": "wide angle, f/8, long exposure",
"medium": "photograph",
"color_palette": ["#FF6B35", "#F7C59F", "#004E89", "#1A659E", "#2B2D42"]
},
"compositional_deconstruction": {
"background": "A calm ocean stretching to a low horizon, sky washed in orange and pink with thin wisps of cloud.",
"elements": [
{"type": "obj", "desc": "A single sailboat with a white triangular sail, silhouetted against the setting sun."}
]
}
}
```
### Example: corporate design palette
```json
{
"high_level_description": "A clean, modern business card layout for a tech company.",
"style_description": {
"aesthetics": "minimal, professional, geometric",
"lighting": "even, diffuse studio lighting",
"medium": "graphic_design",
"art_style": "flat vector design, generous whitespace, sans-serif typography",
"color_palette": ["#FFFFFF", "#F0F0F0", "#333333", "#0066FF", "#00CC88"]
},
"compositional_deconstruction": {
"background": "A solid off-white card surface with subtle paper texture.",
"elements": [
{"type": "text", "text": "ACME TECH", "desc": "Bold dark grey sans-serif company name across the upper third of the card."},
{"type": "text", "text": "hello@acme.tech", "desc": "Small blue sans-serif contact email near the bottom of the card."}
]
}
}
```
## Full example
```json
{
"high_level_description": "A medium-shot photograph of Formula 1 driver Max Verstappen wearing his Red Bull Racing racing suit and cap, smiling as he holds his racing helmet and talks to a man in a white shirt and black vest at a race track.",
"style_description": {
"aesthetics": "saturated primary colors, rule of thirds, joyful and triumphant",
"lighting": "overcast daylight, diffused, soft subtle shadows",
"photo": "shallow depth of field, sharp focus, eye-level, telephoto",
"medium": "photograph"
},
"compositional_deconstruction": {
"background": "The background is an out-of-focus racing paddock or track environment. Several blurred figures are visible, including one in an orange shirt. A purple and white structure with a red 'F1' logo stands on the left. The scene is outdoors with daylight, though the sky is not visible.",
"elements": [
{"type": "obj", "bbox": [55, 642, 1000, 937], "desc": "An older man standing in profile, facing left toward Max Verstappen. He has grey hair and fair skin. He is wearing a white long-sleeved button-down shirt with a navy blue quilted vest over it. He has a slight smile."},
{"type": "obj", "bbox": [34, 137, 1000, 617], "desc": "Max Verstappen, a fair-skinned male Formula 1 driver, positioned in the center. He is facing forward with a joyful expression and a slight smile. He wears a navy blue Red Bull Racing team uniform with numerous sponsor logos and a matching baseball cap with the number '1'. He is holding a white and red racing helmet in his hands. He has a silver watch on his left wrist."},
{"type": "obj", "bbox": [422, 212, 792, 452], "desc": "Max Verstappen's racing helmet, held in front of his chest. It features a white, red, and yellow design with the Red Bull logo and the 'Player 0.0' branding. The visor is clear and open."},
{"type": "text", "bbox": [657, 0, 755, 142], "text": "F1", "desc": "Large, stylized red logo on a black and purple background in the lower left."},
{"type": "text", "bbox": [768, 0, 818, 147], "text": "Formula 1\nWorld Championship™", "desc": "Small white sans-serif text below the F1 logo on the left side."},
{"type": "text", "bbox": [78, 447, 117, 510], "text": "ORACLE\nRed Bull\nRacing", "desc": "Very small white and orange logo on the front of the navy blue cap."},
{"type": "text", "bbox": [78, 417, 120, 440], "text": "1", "desc": "Bold red numeral '1' on the front left side of the navy blue cap."},
{"type": "text", "bbox": [332, 442, 363, 483], "text": "Red Bull", "desc": "Small yellow and red text logo on the collar of the uniform."},
{"type": "text", "bbox": [373, 490, 423, 532], "text": "RAUCH", "desc": "Small yellow and blue logo on the right chest of the uniform."},
{"type": "text", "bbox": [422, 473, 500, 532], "text": "BYBIT\nHONDA", "desc": "Medium-sized white sans-serif text on the right chest of the uniform."},
{"type": "text", "bbox": [410, 203, 442, 257], "text": "RAUCH", "desc": "Small yellow logo on the left upper arm of the uniform."},
{"type": "text", "bbox": [530, 448, 627, 510], "text": "Red Bull", "desc": "Medium red text logo on the right side of the torso, part of the Red Bull graphic."},
{"type": "text", "bbox": [680, 417, 768, 523], "text": "Red Bull", "desc": "Large red text logo across the lower torso of the uniform."},
{"type": "text", "bbox": [797, 475, 815, 518], "text": "MAX", "desc": "Small white text next to a Dutch flag on the belt area of the uniform."},
{"type": "text", "bbox": [558, 317, 715, 355], "text": "Player 0.0", "desc": "Black sans-serif text on a white band on the racing helmet."},
{"type": "text", "bbox": [560, 800, 582, 835], "text": "IA.COM", "desc": "Small blue sans-serif text on the right sleeve of the white shirt."},
{"type": "text", "bbox": [968, 8, 997, 332], "text": "© Anadolu Agency via Getty Images", "desc": "Small white watermark text in the bottom left corner."}
]
}
}
```
## Safety filter
NSFW prompts are blocked. Instead of an image, the model returns a gray screen
with the text "Image blocked by safety filter". The safety filter's
false-positive rate is higher for prompts that are not JSON-like. We are aware
of this issue and may release an updated checkpoint to improve it.
# Congratulations!
You are now a certified Ideogram 4 prompter!
With structured JSON captions, you have fine-grained control over composition,
color palettes, typography, and spatial layout — capabilities that go far
beyond what plain-text prompts can express!
We'd love to see what you create :-)
Share your results, experiments, and creative discoveries with the community,
especially the unexpected ones. Tag us on social media or open a discussion on
the repo. Happy generating!