Reproducible AI Research Infrastructure for Benchmarking Diffusion Models
DreamLayer provides benchmarking infrastructure for reproducible AI research by automating prompts, seeds, configs, metric scoring, and run logging across image and video model evaluations.
Compatible with leading image generation APIs and open-source diffusion model workflows
Leaderboard Overview
200 Prompts Benchmarked in 45 Minutes per Model
See how leading image generation models compare across reproducible evaluation metrics such as CLIP Score, FID, and Composition Correctness. DreamLayer automated prompt orchestration, generation, scoring, and result aggregation across models.

Methodology: This benchmark used a prompt set derived from Microsoft COCO and a reference set based on the CIFAR training split. To keep the evaluation controlled and reproducible, the same prompts, seeds, and configs were used across all models. The benchmark was published in September 2025.
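To make the controlled setup concrete, a run like this can be captured in a small manifest that pins prompts, seeds, and sampler settings for every model. The sketch below is illustrative only; the field names and values are hypothetical, not DreamLayer's actual configuration schema.

```python
# Illustrative benchmark manifest (hypothetical field names, not DreamLayer's schema).
# Every model is evaluated against the same prompts, seeds, and sampler settings.
BENCHMARK = {
    "prompts_file": "prompts/coco_derived_200.txt",   # prompt set derived from Microsoft COCO
    "reference_set": "refs/cifar_train/",             # reference images used for FID
    "seeds": [0, 1, 2, 3],                            # fixed seeds reused for every model
    "sampler": {"steps": 30, "guidance_scale": 7.5},  # identical generation config across models
    "metrics": ["clip_score", "fid", "precision", "recall", "f1"],
    "models": [
        "photon", "flux-pro", "dall-e-3", "nano-banana",
        "runway-gen-4", "ideogram-v3", "sd-turbo",
    ],
}
```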
CLIP Score
Measures how closely a generated image matches its text prompt (a scoring sketch follows the table).
Rank | Company | Model | CLIP Score
1 | Luma Labs | Photon | 0.265
2 | Black Forest Labs | Flux Pro | 0.263
3 | OpenAI | Dall-E 3 | 0.259
4 | Google Gemini | Nano Banana | 0.258
5 | Runway AI | Runway Gen 4 | 0.2505
6 | Ideogram | Ideogram V3 | 0.2501
7 | Stability AI | Stability SD Turbo | 0.249
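The CLIP scores above are in the range typical of raw cosine similarity between CLIP image and text embeddings. A minimal scoring sketch using the Hugging Face transformers CLIP model is shown below; it illustrates the metric itself, not DreamLayer's internal implementation.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = closer match)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))
```

Here `image` is a PIL image (e.g., a decoded model output); reasonably well-aligned prompt–image pairs typically land around 0.2–0.35 with a CLIP model of this size.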
FID Score
Assesses how close AI-generated images are to real reference images; lower is better (a computation sketch follows the table).
Rank | Company | Model | FID Score
1 | Ideogram | Ideogram V3 | 305.60
2 | OpenAI | Dall-E 3 | 306.08
3 | Runway AI | Runway Gen 4 | 317.52
4 | Luma Labs | Photon | 318.55
5 | Black Forest Labs | Flux Pro | 318.63
6 | Google Gemini | Nano Banana | 318.80
7 | Stability AI | Stability SD Turbo | 321.75
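FID compares the distribution of Inception-v3 features of generated images against those of a reference set, so it is computed over image sets rather than individual prompt–image pairs, and lower values mean the two distributions are closer. A minimal sketch using torchmetrics (illustrative settings, not necessarily those behind the numbers above):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Both sets are uint8 tensors of shape (N, 3, H, W); torchmetrics extracts
# Inception-v3 features and compares their Gaussian statistics.
fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)       # stand-in for the reference set
generated_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)  # stand-in for model outputs

fid.update(real_images, real=True)
fid.update(generated_images, real=False)
print(float(fid.compute()))  # lower is better
```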
F1 Score
Combines precision and recall into a single score of overall image accuracy (the formula is sketched after the table).
Rank | Company | Model | F1 Score
1 | Luma Labs | Photon | 0.463
2 | Stability AI | Stability SD Turbo | 0.447
3 | Runway AI | Runway Gen 4 | 0.445
4 | Black Forest Labs | Flux Pro | 0.421
5 | Ideogram | Ideogram V3 | 0.415
6 | OpenAI | Dall-E 3 | 0.380
7 | Google Gemini | Nano Banana | 0.351
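F1 is the harmonic mean of precision and recall, so a model must do reasonably well on both to score high. A quick sketch (note that when scores are computed per image and then averaged, the aggregate F1 need not equal the harmonic mean of the aggregate precision and recall):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```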
Precision
Measures the share of generated images judged correct, out of all images the model generated (a combined precision/recall sketch follows the Recall table).
Rank | Company | Model | Precision Score
1 | Luma Labs | Photon | 0.448
2 | Stability AI | Stability SD Turbo | 0.432
3 | Runway AI | Runway Gen 4 | 0.423
4 | Black Forest Labs | Flux Pro | 0.406
5 | Ideogram | Ideogram V3 | 0.397
6 | OpenAI | Dall-E 3 | 0.358
7 | Google Gemini | Nano Banana | 0.339
Recall
Measures how many of the possible correct images the model actually produced (see the sketch after this table).
Rank | Company | Model | Recall Score
1 | Stability AI | Stability SD Turbo | 0.533
2 | Luma Labs | Photon | 0.532
3 | Runway AI | Runway Gen 4 | 0.522
4 | Ideogram | Ideogram V3 | 0.497
5 | Black Forest Labs | Flux Pro | 0.495
6 | OpenAI | Dall-E 3 | 0.477
7 | Google Gemini | Nano Banana | 0.415
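For image generators, precision and recall are commonly estimated in a feature space using the k-nearest-neighbour manifold method of Kynkäänniemi et al. (2019): a generated sample counts toward precision if it falls inside the estimated manifold of real features, and a real sample counts toward recall if it falls inside the manifold of generated features. The sketch below shows that general family of metrics; it is an assumption about the approach, not DreamLayer's exact implementation, and in practice the features come from an embedding network (e.g., Inception or CLIP) rather than raw pixels.

```python
import numpy as np

def knn_radii(feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each feature vector to its k-th nearest neighbour in the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the point itself (distance 0)

def manifold_precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    """Kynkäänniemi-style estimate: a sample is 'covered' if it falls inside the
    k-NN ball of at least one sample from the other set. O(n^2) memory; fine for a sketch."""
    real_r, fake_r = knn_radii(real, k), knn_radii(fake, k)
    d_fr = np.linalg.norm(fake[:, None, :] - real[None, :, :], axis=-1)  # (n_fake, n_real)
    precision = float((d_fr <= real_r[None, :]).any(axis=1).mean())  # fake samples inside the real manifold
    recall = float((d_fr.T <= fake_r[None, :]).any(axis=1).mean())   # real samples inside the fake manifold
    return precision, recall
```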

FAQ

What is DreamLayer AI?

DreamLayer AI is an open-source benchmarking and evaluation platform for image and video diffusion models. It automates prompts, seeds, configs, metric scoring, and reproducible run logging so researchers and teams can compare model outputs consistently.

What can DreamLayer benchmark?

DreamLayer can benchmark image generation models, video generation models, prompt-to-image alignment, image quality, composition correctness, and reference-based similarity metrics. It is designed for reproducible model evaluation across prompts, seeds, configs, and metrics.

What metrics does DreamLayer support?

DreamLayer supports image and video evaluation metrics for benchmarking diffusion model outputs, including CLIP Score, FID, precision, recall, and F1, with support for additional quality metrics and custom evaluation pipelines. It is built to help researchers compare model outputs across reproducible prompts, seeds, configs, and scoring workflows.
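As a rough illustration of what a custom metric amounts to, any function that maps a generated image (and optionally its prompt or a reference image) to a number can serve as one. The example below is hypothetical and not DreamLayer's plug-in API; it just shows the shape of such a metric.

```python
import numpy as np
from PIL import Image

def sharpness_metric(image: Image.Image, prompt: str | None = None) -> float:
    """Hypothetical custom metric: mean absolute difference between neighbouring
    pixels, used as a crude sharpness proxy (illustrative only)."""
    arr = np.asarray(image.convert("L"), dtype=np.float32)
    return float(np.abs(np.diff(arr, axis=1)).mean())
```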

Does DreamLayer run locally?

Yes. DreamLayer runs locally and supports reproducible benchmarking workflows with prompts, seeds, configs, metrics, and exportable run results. It is built for teams that want controlled evaluations without relying only on manual scripts.

Who is DreamLayer for?

DreamLayer is built for AI researchers, ML engineers, labs, and model creators running reproducible image and video model evaluations. It is especially useful for comparing model outputs across controlled benchmark setups.

Can DreamLayer compare models across prompts, seeds, and configs?

Yes. DreamLayer is designed to compare model outputs across consistent prompts, seeds, configs, and evaluation metrics so benchmark results are easier to reproduce and analyze.
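Conceptually, the controlled comparison looks like the loop below: the same prompts and seeds go to every model, and every output is scored with the same metric functions. The generate and clip_score calls here are placeholders, not DreamLayer's API.

```python
# Illustrative comparison loop; generate() and clip_score() are placeholders, not DreamLayer's API.
def generate(model_name: str, prompt: str, seed: int):
    """Placeholder: call the model's API or local pipeline and return an image."""
    raise NotImplementedError

def clip_score(image, prompt: str) -> float:
    """Placeholder: score prompt-image alignment (see the CLIP sketch above)."""
    raise NotImplementedError

prompts = ["a red bicycle leaning against a brick wall"]  # same prompt list for every model
seeds = [0, 1, 2, 3]                                      # same seeds for every model

def run_comparison(models: list[str]) -> list[dict]:
    results = []
    for model_name in models:
        for prompt in prompts:
            for seed in seeds:
                image = generate(model_name, prompt, seed=seed)
                results.append({"model": model_name, "prompt": prompt,
                                "seed": seed, "clip_score": clip_score(image, prompt)})
    return results
```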

Can DreamLayer export benchmark results?

Yes. DreamLayer supports exportable benchmark results for reports, papers, internal review, and leaderboard workflows. Runs can be packaged with configs, outputs, and evaluation results for easier sharing.
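One common way to package a run for sharing is a directory containing the config plus a flat results file. The snippet below writes such a bundle with the Python standard library; it is an illustration of the idea, not DreamLayer's export format.

```python
import csv
import json
from pathlib import Path

def export_run(run_dir: str, config: dict, results: list[dict]) -> None:
    """Write a shareable bundle: config.json plus a flat results.csv (one row per scored image)."""
    out = Path(run_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "config.json").write_text(json.dumps(config, indent=2))
    with open(out / "results.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
        writer.writeheader()
        writer.writerows(results)
```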

Does DreamLayer support open-source and API-based models?

Yes. DreamLayer supports benchmarking workflows across open-source model setups and API-based model workflows. This makes it easier to compare models across the same benchmark configuration.
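One way to see why both fit into the same benchmark: as long as every backend exposes the same generate signature, the evaluation loop does not care whether the model runs locally or behind an API. The interface below is a hypothetical illustration, not DreamLayer's actual adapter API.

```python
from typing import Protocol

class ImageModel(Protocol):
    """Hypothetical backend interface: anything that turns a prompt and seed into an image."""
    def generate(self, prompt: str, seed: int): ...

class LocalDiffusionBackend:
    """Sketch of an open-source backend, e.g. wrapping a local diffusers pipeline."""
    def generate(self, prompt: str, seed: int):
        raise NotImplementedError

class HostedAPIBackend:
    """Sketch of an API-based backend that forwards the same arguments to a remote service."""
    def generate(self, prompt: str, seed: int):
        raise NotImplementedError
```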