# GenExam

**Repository Path**: py-service/GenExam

## Basic Information

- **Project Name**: GenExam
- **Description**: https://github.com/OpenGVLab/GenExam
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-04
- **Last Updated**: 2026-03-04

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

**GenExam: A Multidisciplinary Text-to-Image Exam**

[Zhaokai Wang](https://www.wzk.plus/)\*, [Penghao Yin](https://penghaoyin.github.io/)\*, [Xiangyu Zhao](https://scholar.google.com/citations?user=eqFr7IgAAAAJ), [Changyao Tian](https://scholar.google.com/citations?user=kQ3AisQAAAAJ), [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ), [Wenhai Wang](https://whai362.github.io/), [Jifeng Dai](https://jifengdai.org/), [Gen Luo](https://scholar.google.com/citations?user=EyZqU9gAAAAJ)

## ⭐️ News

* [2026/2/26] Results of Seedream 5.0 and Nano Banana 2 are updated.
* [2026/1/28] Results of Qwen-Image-2512 and FLUX.2 dev are updated.
* [2025/12/17] Results of GPT-Image-1.5, Seedream 4.5 and FLUX.2 max are updated.
* [2025/11/23] Nano Banana Pro achieves new SOTA! (72.7 strict score and 93.7 relaxed score)
* [2025/10/7] Results of HunyuanImage-3.0 are updated.
* [2025/9/18] GenExam is released!

## 📖 Introduction

Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams.

We introduce GenExam, the first benchmark for **multidisciplinary text-to-image exams**, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility.

Experiments on 17 text-to-image and unified models demonstrate the great challenge of GenExam and the huge gap where open-source models consistently lag behind the leading closed-source ones. By framing image generation as an exam, GenExam offers a rigorous assessment of models' ability to integrate understanding, reasoning, and generation, providing insights on the path to intelligent generative models. Our benchmark and evaluation code will be released.

## 🚀 Leaderboard

### Strict Score

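| Model | Math | Phy | Chem | Bio | Geo | Comp | Eng | Econ | Music | Hist | Overall |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| **Closed-source Models** | | | | | | | | | | | |
| Nano Banana Pro | 55.6 | 75.2 | 60.2 | 75.6 | 75.8 | 65.7 | 71.2 | 88.3 | 61.5 | 97.6 | 72.7 |
| Nano Banana 2 | 56.3 | 74.3 | 52.5 | 66.0 | 69.7 | 56.9 | 67.6 | 63.6 | 50.8 | 82.9 | 64.1 |
| Seedream 5.0 | 47.0 | 38.9 | 38.1 | 44.2 | 45.5 | 45.1 | 55.9 | 62.3 | 35.4 | 29.3 | 44.2 |
| GPT-Image-1.5 | 26.5 | 46.0 | 39.0 | 56.4 | 60.6 | 36.3 | 44.1 | 42.9 | 29.2 | 51.2 | 43.2 |
| GPT-Image-1 | 8.0 | 13.2 | 13.5 | 22.8 | 15.9 | 10.3 | 13.1 | 13.0 | 9.3 | 2.4 | 12.1 |
| Seedream 4.5 | 5.3 | 11.5 | 7.6 | 25.0 | 12.1 | 12.7 | 15.3 | 15.6 | 3.1 | 4.9 | 11.3 |
| FLUX.2 max | 6.6 | 8.8 | 6.8 | 11.6 | 15.2 | 8.8 | 10.8 | 2.6 | 6.2 | 7.3 | 8.5 |
| Seedream 4.0 | 2.6 | 3.5 | 5.9 | 18.6 | 10.6 | 6.9 | 11.7 | 5.2 | 0.0 | 7.3 | 7.2 |
| Imagen-4-Ultra | 2.6 | 9.7 | 9.3 | 14.7 | 7.6 | 2.9 | 12.6 | 9.1 | 0.0 | 0.0 | 6.9 |
| Gemini-2.5-Flash-Image | 0.7 | 7.1 | 4.2 | 5.1 | 4.5 | 4.9 | 10.0 | 1.3 | 1.5 | 0.0 | 3.9 |
| Seedream 3.0 | 0.7 | 0.0 | 0.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2 |
| FLUX.1 Kontext max | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| **Open-source T2I Models** | | | | | | | | | | | |
| FLUX.2 dev | 2.6 | 1.8 | 4.6 | 3.8 | 3.0 | 1.0 | 2.7 | 1.3 | 0.0 | 0.0 | 2.1 |
| Qwen-Image-2512 | 0.0 | 2.7 | 0.8 | 1.3 | 6.1 | 0.0 | 4.5 | 0.0 | 0.0 | 0.0 | 1.5 |
| Qwen-Image | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 |
| HiDream-I1-Full | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| HunyuanImage-3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FLUX.1 dev | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| FLUX.1 Krea | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Stable Diffusion 3.5 Large | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| **Open-source Unified MLLMs** | | | | | | | | | | | |
| BAGEL (thinking) | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| BAGEL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Show-o2-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Show-o2-1.5B-HQ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| BLIP3o-NEXT-GRPO-Text-3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| BLIP3o-8B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Janus-Pro | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Emu3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |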

### Relaxed Score

| Model | Math | Phy | Chem | Bio | Geo | Comp | Eng | Econ | Music | Hist | Overall |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| **Closed-source Models** | | | | | | | | | | | |
| Nano Banana Pro | 86.3 | 95.1 | 88.7 | 95.9 | 96.5 | 91.7 | 95.1 | 97.2 | 91.0 | 99.9 | 93.7 |
| Nano Banana 2 | 87.8 | 95.7 | 90.0 | 95.2 | 94.8 | 88.8 | 95.8 | 94.2 | 86.9 | 97.3 | 92.6 |
| Seedream 5.0 | 82.9 | 85.7 | 81.1 | 89.9 | 89.5 | 85.2 | 91.2 | 94.7 | 76.7 | 87.0 | 86.4 |
| GPT-Image-1.5 | 65.8 | 85.4 | 78.1 | 91.9 | 92.5 | 75.8 | 86.4 | 85.5 | 70.8 | 90.9 | 82.3 |
| GPT-Image-1 | 52.0 | 66.4 | 53.4 | 74.6 | 73.9 | 55.6 | 65.5 | 65.8 | 52.6 | 67.4 | 62.6 |
| FLUX.2 max | 49.1 | 63.2 | 54.0 | 74.5 | 76.3 | 56.5 | 68.9 | 61.5 | 47.0 | 68.0 | 61.9 |
| Seedream 4.5 | 44.7 | 63.4 | 48.9 | 75.8 | 67.6 | 57.9 | 69.7 | 67.3 | 38.0 | 55.0 | 58.8 |
| Gemini-2.5-Flash-Image | 43.1 | 60.9 | 45.3 | 72.6 | 70.2 | 47.4 | 65.8 | 59.8 | 37.0 | 57.1 | 55.9 |
| Imagen-4-Ultra | 35.9 | 57.4 | 44.5 | 68.1 | 66.9 | 40.1 | 65.6 | 59.7 | 38.4 | 57.8 | 53.4 |
| Seedream 4.0 | 39.8 | 49.0 | 46.1 | 71.0 | 65.1 | 52.2 | 60.0 | 56.0 | 34.5 | 56.7 | 53.0 |
| FLUX.1 Kontext max | 23.5 | 25.6 | 19.2 | 38.3 | 47.5 | 20.9 | 28.9 | 22.3 | 25.4 | 33.5 | 28.5 |
| Seedream 3.0 | 18.6 | 21.5 | 18.3 | 32.2 | 38.2 | 15.3 | 26.5 | 12.5 | 21.6 | 29.2 | 23.4 |
| **Open-source T2I Models** | | | | | | | | | | | |
| FLUX.2 dev | 31.6 | 42.7 | 33.2 | 54.8 | 62.6 | 31.1 | 48.9 | 43.6 | 33.4 | 47.5 | 42.9 |
| Qwen-Image-2512 | 27.9 | 41.3 | 23.2 | 44.4 | 56.6 | 24.1 | 42.9 | 32.3 | 28.3 | 37.0 | 35.8 |
| Qwen-Image | 18.9 | 26.3 | 15.3 | 32.1 | 49.6 | 18.9 | 32.0 | 20.3 | 23.4 | 38.6 | 27.5 |
| HiDream-I1-Full | 16.7 | 17.7 | 13.5 | 27.3 | 36.2 | 15.4 | 24.4 | 18.8 | 21.3 | 31.8 | 22.3 |
| HunyuanImage-3.0 | 17.0 | 17.2 | 18.8 | 18.7 | 30.4 | 15.5 | 16.9 | 11.7 | 23.9 | 20.4 | 19.1 |
| FLUX.1 dev | 12.2 | 14.4 | 12.5 | 22.8 | 36.4 | 11.0 | 14.0 | 9.2 | 21.3 | 21.7 | 17.6 |
| FLUX.1 Krea | 7.0 | 14.0 | 8.5 | 26.5 | 38.4 | 8.4 | 15.4 | 11.1 | 16.8 | 17.4 | 16.4 |
| Stable Diffusion 3.5 Large | 12.2 | 13.2 | 10.7 | 21.8 | 38.8 | 6.6 | 16.3 | 8.0 | 24.1 | 18.0 | 17.0 |
| **Open-source Unified MLLMs** | | | | | | | | | | | |
| BAGEL (thinking) | 11.7 | 13.8 | 11.9 | 15.2 | 28.5 | 6.2 | 10.7 | 6.3 | 14.7 | 16.0 | 13.5 |
| BAGEL | 14.7 | 10.6 | 7.9 | 10.8 | 24.5 | 6.8 | 10.2 | 5.3 | 13.7 | 14.4 | 11.9 |
| Show-o2-7B | 10.8 | 11.9 | 4.8 | 12.8 | 33.3 | 4.7 | 11.8 | 7.0 | 8.8 | 14.5 | 12.0 |
| Show-o2-1.5B-HQ | 7.3 | 7.5 | 6.2 | 15.0 | 25.3 | 4.3 | 9.3 | 7.3 | 7.6 | 19.8 | 11.0 |
| BLIP3o-NEXT-GRPO-Text-3 | 15.5 | 10.5 | 9.2 | 15.5 | 23.7 | 8.2 | 10.1 | 8.1 | 15.2 | 10.2 | 12.6 |
| BLIP3o-8B | 6.4 | 5.5 | 4.7 | 7.0 | 16.7 | 3.6 | 8.4 | 2.5 | 6.0 | 11.2 | 7.2 |
| Janus-Pro | 13.7 | 8.8 | 8.2 | 7.2 | 18.8 | 3.9 | 10.5 | 4.2 | 14.5 | 6.6 | 9.6 |
| Emu3 | 11.3 | 0.6 | 0.6 | 5.6 | 34.6 | 5.1 | 16.5 | 1.9 | 5.8 | 6.2 | 8.8 |

### Comparison Across Four Dimensions

## 🛠️ Usage

Our data is stored in `data/`. You can also download it from [Huggingface](https://huggingface.co/datasets/OpenGVLab/GenExam/resolve/main/GenExam_data.zip?download=true). Additionally, images organized by taxonomy can be found [here](https://huggingface.co/datasets/OpenGVLab/GenExam/resolve/main/images_by_taxonomy.zip?download=true).

### 1. Prerequisites

1. Install requirements: `pip install requests tqdm pillow`
2. Set `openai_api_key` and `openai_base_url` (optional, if you want to use a proxy) in `run_eval.py` for the gpt-5-20250807 evaluator and inference of gpt-image-1.
3. Generate the images offline with your model based on the `prompt` values in `data/annotations/All_Subjects.jsonl`. Saved image paths should be like `gen_imgs/{id}.png`.

### 2. Run Evaluation

#### Offline Inference

Run evaluation offline if images are already generated in `gen_imgs/`:

```bash
python run_eval.py --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
```

The evaluation results are saved as a separate JSON file per sample under `./eval_results`.

The `run_eval.py` script supports resuming from breakpoints. If your evaluation encounters an error midway, simply **re-run** the script.

#### Online Inference

Alternatively, you can add `--run_inference` to run inference and evaluation together (generate images online):

```bash
python run_eval.py --run_inference --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
```

This script runs gpt-image-1 by default, which costs $185 on the full set ($160 for inference and $25 for evaluation). You can replace the `inference_function` in the script with a custom function for your model's inference.

#### Speed Up with Multiprocessing

Add a `--max_worker` argument to speed up with multiprocessing:

```bash
python run_eval.py --max_worker 20 --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
```

### 3. Calculate Scores

Run the script to generate a detailed report for the eval results:

```bash
python cal_score.py --eval_results_dir ./eval_results
```

This should give a report like:

Report Example

```yaml
================================================================================
Each score dimension:
- semantic_correctness: 0.47
- spelling: 1.48
- readability: 1.55
- logical_consistency: 0.7
================================================================================
Each score dimension (average) for each subject:
- Computer_Science:
    semantic_correctness: 0.53
    spelling: 1.68
    readability: 1.43
    logical_consistency: 0.66
- Physics:
    semantic_correctness: 0.4
    spelling: 1.7
    readability: 1.41
    logical_consistency: 0.5
- Biology:
    semantic_correctness: 0.72
    spelling: 1.28
    readability: 1.59
    logical_consistency: 1.02
- History:
    semantic_correctness: 0.53
    spelling: 1.32
    readability: 1.68
    logical_consistency: 0.85
- Math:
    semantic_correctness: 0.24
    spelling: 1.5
    readability: 1.65
    logical_consistency: 0.29
- Geography:
    semantic_correctness: 0.62
    spelling: 1.27
    readability: 1.69
    logical_consistency: 0.98
- Economics:
    semantic_correctness: 0.56
    spelling: 1.77
    readability: 1.58
    logical_consistency: 0.75
- Chemistry:
    semantic_correctness: 0.33
    spelling: 1.33
    readability: 1.52
    logical_consistency: 0.6
- Music:
    semantic_correctness: 0.26
    spelling: 1.42
    readability: 1.5
    logical_consistency: 0.46
- Engineering:
    semantic_correctness: 0.56
    spelling: 1.49
    readability: 1.43
    logical_consistency: 0.94
--------------------------------------------------------------------------------
Total number of eval results: 487
--------------------------------------------------------------------------------
Strict score:
- Computer_Science(47 samples): 10.2%
- Physics(46 samples): 3.5%
- Biology(46 samples): 12.2%
- History(41 samples): 5.9%
- Math(52 samples): 0.0%
- Geography(52 samples): 7.7%
- Economics(52 samples): 3.1%
- Chemistry(52 samples): 4.6%
- Music(52 samples): 0.0%
- Engineering(47 samples): 6.8%
Average strict score: 5.4%
--------------------------------------------------------------------------------
Relaxed score:
- Computer_Science(47 samples): 44.8%
- Physics(46 samples): 36.9%
- Biology(46 samples): 56.1%
- History(41 samples): 45.4%
- Math(52 samples): 27.2%
- Geography(52 samples): 50.7%
- Economics(52 samples): 47.6%
- Chemistry(52 samples): 32.4%
- Music(52 samples): 27.8%
- Engineering(47 samples): 47.0%
Average relaxed score: 41.6%
```
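For the offline-generation step in the prerequisites (one image per `prompt`, saved as `gen_imgs/{id}.png`), the loop might look like the minimal sketch below. It assumes each line of `data/annotations/All_Subjects.jsonl` is a JSON object with `id` and `prompt` keys; the README names both, but the exact schema is an assumption. `generate_missing` and the `generate_fn` callback are hypothetical names for illustration and are not part of the GenExam scripts. The sketch skips images that already exist, so an interrupted run can simply be restarted.

```python
import json
from pathlib import Path
from typing import Callable, Iterator


def load_samples(annotation_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL annotation file."""
    with annotation_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def generate_missing(
    annotation_path: Path,
    img_dir: Path,
    generate_fn: Callable[[str, Path], None],
) -> list[str]:
    """Create {img_dir}/{id}.png for every sample that has no image yet.

    generate_fn(prompt, out_path) is a hypothetical callback: it should run
    your model on the prompt and write a PNG to out_path.
    Returns the ids generated during this call.
    """
    img_dir.mkdir(parents=True, exist_ok=True)
    generated = []
    for sample in load_samples(annotation_path):
        out_path = img_dir / f"{sample['id']}.png"
        if out_path.exists():
            continue  # produced by an earlier run; do not regenerate
        generate_fn(sample["prompt"], out_path)
        generated.append(str(sample["id"]))
    return generated
```

A call such as `generate_missing(Path("data/annotations/All_Subjects.jsonl"), Path("gen_imgs"), my_model_fn)` would then fill `gen_imgs/` before running `run_eval.py` in offline mode, where `my_model_fn` wraps your own model.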

### Run on GenExam-Mini

To run evaluation on the mini subset, you can add a `--mini` argument when running `run_eval.py`:

```bash
python run_eval.py --mini --data_dir ./data/ --img_save_dir ./gen_imgs --eval_save_dir ./eval_results
```

If you have already run evaluation on the full set, you can alternatively add `--mini` when running `cal_score.py`:

```bash
python cal_score.py --mini --eval_results_dir ./eval_results
```

## 🖼 Examples of Generated Images

For more examples, please refer to the appendix in our paper and [this repo](https://huggingface.co/datasets/OpenGVLab/GenExam_Gen_Images).

### Images Generated by Nano Banana Pro

*(Image grid of generated samples; surviving labels: `bio_1`, `comp`.)*

## 📃 License

This project is released under the [MIT license](LICENSE).

## 🖊️ Citation

If you find our work helpful, please consider giving us a ⭐ and citing our paper:

```bibtex
@article{GenExam,
  title={GenExam: A Multidisciplinary Text-to-Image Exam},
  author = {Wang, Zhaokai and Yin, Penghao and Zhao, Xiangyu and Tian, Changyao and Qiao, Yu and Wang, Wenhai and Dai, Jifeng and Luo, Gen},
  journal={arXiv preprint arXiv:2509.14232},
  year={2025}
}
```