KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities


Hsin-Ping Huang1,2*        Xinyi Wang1*        Yonatan Bitton1        Hagai Taitelbaum1        Gaurav Singh Tomar1

Ming-Wei Chang1        Xuhui Jia1        Kelvin C.K. Chan1        Hexiang Hu1        Yu-Chuan Su1        Ming-Hsuan Yang1,2

*: Equal Contribution

1 Google DeepMind        2 UC Merced



KITTEN evaluates the ability of T2I models to generate real-world entities grounded in knowledge sources.


🎧 NotebookLM Summary: Hear our paper's highlights in just a few minutes!

Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities described in creative text prompts.

Can text-to-image models generate precise visual details of real-world entities?


State-of-the-art text-to-image models can effectively render well-known entities (e.g., the Great Pyramid of Giza), but they often struggle with less familiar entities (e.g., Statue of Equality Ramanuja), leading to hallucinated depictions.

We construct the KITTEN benchmark using real-world entities across eight domains.


For each selected entity, we define five evaluation tasks, each pairing the entity with a different image-generation prompt. We also collect a support set and an evaluation set of entity images to assess the fidelity of the generated entities.
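To make this setup concrete, here is a minimal sketch of how a single benchmark record could be organized. The class and field names (KittenRecord, support_images, and so on) are our own illustration, not a released schema.

from dataclasses import dataclass

@dataclass
class KittenRecord:
    """One KITTEN-style benchmark record (hypothetical schema for illustration)."""
    entity: str                 # e.g., "Great Pyramid of Giza"
    domain: str                 # one of the eight entity domains
    prompts: list[str]          # five task prompts incorporating the entity
    support_images: list[str]   # reference images, usable by retrieval-augmented models
    eval_images: list[str]      # held-out images used to score entity fidelity

record = KittenRecord(
    entity="Great Pyramid of Giza",
    domain="landmark",
    prompts=[
        "A photo of the Great Pyramid of Giza.",
        "A watercolor painting of the Great Pyramid of Giza at sunset.",
        # ... three more task prompts
    ],
    support_images=["support/giza_0.jpg", "support/giza_1.jpg"],
    eval_images=["eval/giza_0.jpg", "eval/giza_1.jpg"],
)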

We evaluate the latest text-to-image models, retrieval-augmented models, and unified models.
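To illustrate the retrieval-augmented setting, the sketch below conditions a Stable Diffusion pipeline on a reference image retrieved from the entity's support set, using the IP-Adapter integration in Hugging Face diffusers. This is one plausible recipe under our own assumptions, not the specific retrieval-augmented models benchmarked in the paper.

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

# Any SD 1.5 checkpoint compatible with the SD 1.5 IP-Adapter weights works here.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")

# Reference image retrieved from the entity's support set.
reference = load_image("support/giza_0.jpg")

image = pipe(
    prompt="A watercolor painting of the Great Pyramid of Giza at sunset.",
    ip_adapter_image=reference,   # inject entity appearance from the reference
    num_inference_steps=50,
).images[0]
image.save("generated.png")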


We use carefully designed human evaluations (top left), automatic metrics (top right), and MLLMs as judges (bottom), placing an emphasis on faithfulness to the entities and the ability to follow instructions.
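As one concrete instance of an automatic metric (our own illustration; the paper's metric suite may differ), entity faithfulness can be approximated as CLIP image-image similarity between the generation and held-out evaluation photos of the entity, and instruction-following as CLIP text-image similarity with the prompt:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def embed_text(text):
    inputs = processor(text=[text], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

generated = Image.open("generated.png")
references = [Image.open(p) for p in ["eval/giza_0.jpg", "eval/giza_1.jpg"]]

gen_emb = embed_images([generated])
ref_embs = embed_images(references)
prompt_emb = embed_text("A watercolor painting of the Great Pyramid of Giza at sunset.")

# Faithfulness: mean similarity to held-out photos of the entity.
faithfulness = (gen_emb @ ref_embs.T).mean().item()
# Instruction-following: similarity between the generation and the prompt.
prompt_score = (gen_emb @ prompt_emb.T).item()
print(f"faithfulness={faithfulness:.3f}, prompt_score={prompt_score:.3f}")

Scoring both axes on the same generation is what exposes the trade-off discussed below: retrieval-augmented outputs tend to raise the faithfulness score while the prompt-alignment score stagnates on creative prompts.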

Our results illustrate the trade-off between faithfulness to entities and instruction-following.


Backbone text-to-image models and unified models still fall well short of generating faithful representations of entities. While retrieval-augmented methods enhance faithfulness, they often struggle with creative prompts. Our study underscores the need for techniques that improve entity fidelity without sacrificing the ability to follow instructions.

BibTeX

@article{huang_2026_kitten,
  title={{KITTEN}: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities},
  author={Huang, Hsin-Ping and Wang, Xinyi and Bitton, Yonatan and Taitelbaum, Hagai and
    Tomar, Gaurav Singh and Chang, Ming-Wei and Jia, Xuhui and Chan, Kelvin C.K. and
    Hu, Hexiang and Su, Yu-Chuan and Yang, Ming-Hsuan},
  journal={Transactions on Machine Learning Research},
  year={2026}
}