Reproducible AI Research Infrastructure for Benchmarking Diffusion Models
DreamLayer provides benchmarking infrastructure for reproducible AI research by automating prompts, seeds, configs, metric scoring, and run logging across image and video model evaluations.
Compatible with leading image generation APIs and open-source diffusion model workflows
Leaderboard Overview
200 Prompts Benchmarked in 45 Minutes per Model
See how leading image generation models compare across reproducible evaluation metrics such as CLIP Score, FID, and Composition Correctness. DreamLayer automated prompt orchestration, generation, scoring, and result aggregation across models.

Methodology: This benchmark used a prompt set derived from Microsoft COCO and a reference set based on the CIFAR training split. To keep the evaluation controlled and reproducible, the same prompts, seeds, and configs were used across all models. The benchmark was published in September 2025.
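To make the controlled setup concrete, a run like this can be captured in a small manifest that pins prompts, seeds, and sampler settings for every model. The sketch below is illustrative only; the field names and values are hypothetical, not DreamLayer's actual configuration schema.

```python
# Illustrative benchmark manifest (hypothetical field names, not DreamLayer's schema).
# Every model is evaluated against the same prompts, seeds, and sampler settings.
BENCHMARK = {
    "prompts_file": "prompts/coco_derived_200.txt",   # prompt set derived from Microsoft COCO
    "reference_set": "refs/cifar_train/",             # reference images used for FID
    "seeds": [0, 1, 2, 3],                            # fixed seeds reused for every model
    "sampler": {"steps": 30, "guidance_scale": 7.5},  # identical generation config across models
    "metrics": ["clip_score", "fid", "precision", "recall", "f1"],
    "models": [
        "photon", "flux-pro", "dall-e-3", "nano-banana",
        "runway-gen-4", "ideogram-v3", "sd-turbo",
    ],
}
```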
CLIP Score
Measures how closely a generated image matches its text prompt (a scoring sketch follows the table).
Rank | Company | Model | CLIP Score
1 | Luma Labs | Photon | 0.265
2 | Black Forest Labs | Flux Pro | 0.263
3 | OpenAI | Dall-E 3 | 0.259
4 | Google Gemini | Nano Banana | 0.258
5 | Runway AI | Runway Gen 4 | 0.2505
6 | Ideogram | Ideogram V3 | 0.2501
7 | Stability AI | Stability SD Turbo | 0.249
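The CLIP scores above are in the range typical of raw cosine similarity between CLIP image and text embeddings. A minimal scoring sketch using the Hugging Face transformers CLIP model is shown below; it illustrates the metric itself, not DreamLayer's internal implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = closer match)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))
```

Here `image` is a PIL image (e.g., a decoded model output); reasonably well-aligned prompt–image pairs typically land around 0.2–0.35 with a CLIP model of this size.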
FID Score
Assesses how close AI-generated images are to real reference images; lower is better (a computation sketch follows the table).
Rank | Company | Model | FID Score
1 | Ideogram | Ideogram V3 | 305.60
2 | OpenAI | Dall-E 3 | 306.08
3 | Runway AI | Runway Gen 4 | 317.52
4 | Luma Labs | Photon | 318.55
5 | Black Forest Labs | Flux Pro | 318.63
6 | Google Gemini | Nano Banana | 318.80
7 | Stability AI | Stability SD Turbo | 321.75
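FID compares the distribution of Inception-v3 features of generated images against those of a reference set, so it is computed over image sets rather than individual prompt–image pairs, and lower values mean the two distributions are closer. A minimal sketch using torchmetrics (illustrative settings, not necessarily those behind the numbers above):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Both sets are uint8 tensors of shape (N, 3, H, W); torchmetrics extracts
# Inception-v3 features and compares their Gaussian statistics.
fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)       # stand-in for the reference set
generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in for model outputs

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(float(fid.compute()))  # lower is better
```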
F1 Score
Combines precision and recall into a single score of overall image accuracy (the formula is sketched after the table).
Rank | Company | Model | F1 Score
1 | Luma Labs | Photon | 0.463
2 | Stability AI | Stability SD Turbo | 0.447
3 | Runway AI | Runway Gen 4 | 0.445
4 | Black Forest Labs | Flux Pro | 0.421
5 | Ideogram | Ideogram V3 | 0.415
6 | OpenAI | Dall-E 3 | 0.380
7 | Google Gemini | Nano Banana | 0.351
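F1 is the harmonic mean of precision and recall, so a model must do reasonably well on both to score high. A quick sketch (note that when scores are computed per image and then averaged, the aggregate F1 need not equal the harmonic mean of the aggregate precision and recall):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```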
Precision
Measures the share of generated images judged correct, out of all images the model generated (a combined precision/recall sketch follows the Recall table).
Rank | Company | Model | Precision Score
1 | Luma Labs | Photon | 0.448
2 | Stability AI | Stability SD Turbo | 0.432
3 | Runway AI | Runway Gen 4 | 0.423
4 | Black Forest Labs | Flux Pro | 0.406
5 | Ideogram | Ideogram V3 | 0.397
6 | OpenAI | Dall-E 3 | 0.358
7 | Google Gemini | Nano Banana | 0.339
Recall
Measures how many of the possible correct images the model actually produced (see the sketch after this table).
Rank | Company | Model | Recall Score
1 | Stability AI | Stability SD Turbo | 0.533
2 | Luma Labs | Photon | 0.532
3 | Runway AI | Runway Gen 4 | 0.522
4 | Ideogram | Ideogram V3 | 0.497
5 | Black Forest Labs | Flux Pro | 0.495
6 | OpenAI | Dall-E 3 | 0.477
7 | Google Gemini | Nano Banana | 0.415
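For image generators, precision and recall are commonly estimated in a feature space using the k-nearest-neighbour manifold method of Kynkäänniemi et al. (2019): a generated sample counts toward precision if it falls inside the estimated manifold of real features, and a real sample counts toward recall if it falls inside the manifold of generated features. The sketch below shows that general family of metrics; it is an assumption about the approach, not DreamLayer's exact implementation, and in practice the features come from an embedding network (e.g., Inception or CLIP) rather than raw pixels.

```python
import numpy as np

def knn_radii(feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each feature vector to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the point itself (distance 0)

def manifold_precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Kynkäänniemi-style estimate: a sample is 'covered' if it falls inside the
    k-NN ball of at least one sample from the other set. O(n^2) memory; fine for a sketch."""
    real_r, fake_r = knn_radii(real, k), knn_radii(fake, k)
    d_fr = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)  # (n_fake, n_real)
    precision = float((d_fr <= real_r[None, :]).any(axis=1).mean())  # fake samples inside the real manifold
    recall = float((d_fr.T <= fake_r[None, :]).any(axis=1).mean())   # real samples inside the fake manifold
    return precision, recall
```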

FAQ

What is DreamLayer AI?

DreamLayer AI is an open-source benchmarking and evaluation platform for image and video diffusion models. It automates prompts, seeds, configs, metric scoring, and reproducible run logging so researchers and teams can compare model outputs consistently.

What can DreamLayer benchmark?

DreamLayer can benchmark image generation models, video generation models, prompt-to-image alignment, image quality, composition correctness, and reference-based similarity metrics. It is designed for reproducible model evaluation across prompts, seeds, configs, and metrics.

What metrics does DreamLayer support?

DreamLayer supports image and video evaluation metrics for benchmarking diffusion model outputs, including CLIP Score, FID, precision, recall, and F1, with support for additional quality metrics and custom evaluation pipelines. It is built to help researchers compare model outputs across reproducible prompts, seeds, configs, and scoring workflows.
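As a rough illustration of what a custom metric amounts to, any function that maps a generated image (and optionally its prompt or a reference image) to a number can serve as one. The example below is hypothetical and not DreamLayer's plug-in API; it just shows the shape of such a metric.

```python
import numpy as np
from PIL import Image

def sharpness_metric(image: Image.Image, prompt: str | None = None) -> float:
    """Hypothetical custom metric: mean absolute difference between neighbouring
    pixels, used as a crude sharpness proxy (illustrative only)."""
    arr = np.asarray(image.convert("L"), dtype=np.float32)
    return float(np.abs(np.diff(arr, axis=1)).mean())
```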

Does DreamLayer run locally?

Yes. DreamLayer runs locally and supports reproducible benchmarking workflows with prompts, seeds, configs, metrics, and exportable run results. It is built for teams that want controlled evaluations without relying only on manual scripts.

Who is DreamLayer for?

DreamLayer is built for AI researchers, ML engineers, labs, and model creators running reproducible image and video model evaluations. It is especially useful for comparing model outputs across controlled benchmark setups.

Can DreamLayer compare models across prompts, seeds, and configs?

Yes. DreamLayer is designed to compare model outputs across consistent prompts, seeds, configs, and evaluation metrics so benchmark results are easier to reproduce and analyze.
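Conceptually, the controlled comparison looks like the loop below: the same prompts and seeds go to every model, and every output is scored with the same metric functions. The generate and clip_score calls here are placeholders, not DreamLayer's API.

```python
# Illustrative comparison loop; generate() and clip_score() are placeholders, not DreamLayer's API.
def generate(model_name: str, prompt: str, seed: int):
    """Placeholder: call the model's API or local pipeline and return an image."""
    raise NotImplementedError

def clip_score(image, prompt: str) -> float:
    """Placeholder: score prompt-image alignment (see the CLIP sketch above)."""
    raise NotImplementedError

prompts = ["a red bicycle leaning against a brick wall"]  # same prompt list for every model
seeds = [0, 1, 2, 3]                                      # same seeds for every model

def run_comparison(models: list[str]) -> list[dict]:
    results = []
    for model_name in models:
        for prompt in prompts:
            for seed in seeds:
                image = generate(model_name, prompt, seed=seed)
                results.append({"model": model_name, "prompt": prompt,
                                "seed": seed, "clip_score": clip_score(image, prompt)})
    return results
```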

Can DreamLayer export benchmark results?

Yes. DreamLayer supports exportable benchmark results for reports, papers, internal review, and leaderboard workflows. Runs can be packaged with configs, outputs, and evaluation results for easier sharing.
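One common way to package a run for sharing is a directory containing the config plus a flat results file. The snippet below writes such a bundle with the Python standard library; it is an illustration of the idea, not DreamLayer's export format.

```python
import csv
import json
from pathlib import Path

def export_run(run_dir: str, config: dict, results: list[dict]) -> None:
    """Write a shareable bundle: config.json plus a flat results.csv (one row per scored image)."""
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "config.json").write_text(json.dumps(config, indent=2))
    with open(out / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
```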

Does DreamLayer support open-source and API-based models?

Yes. DreamLayer supports benchmarking workflows across open-source model setups and API-based model workflows. This makes it easier to compare models across the same benchmark configuration.
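One way to see why both fit into the same benchmark: as long as every backend exposes the same generate signature, the evaluation loop does not care whether the model runs locally or behind an API. The interface below is a hypothetical illustration, not DreamLayer's actual adapter API.

```python
from typing import Protocol

class ImageModel(Protocol):
    """Hypothetical backend interface: anything that turns a prompt and seed into an image."""
    def generate(self, prompt: str, seed: int): ...

class LocalDiffusionBackend:
    """Sketch of an open-source backend, e.g. wrapping a local diffusers pipeline."""
    def generate(self, prompt: str, seed: int):
        raise NotImplementedError

class HostedAPIBackend:
    """Sketch of an API-based backend that forwards the same arguments to a remote service."""
    def generate(self, prompt: str, seed: int):
        raise NotImplementedError
```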