KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities


Hsin-Ping Huang1,2*    Xinyi Wang1*    Yonatan Bitton1    Hagai Taitelbaum1    Gaurav Singh Tomar1    Ming-Wei Chang1    Xuhui Jia1    Kelvin C.K. Chan1    Hexiang Hu1    Yu-Chuan Su1    Ming-Hsuan Yang1,2

*: Core Contribution

1 Google DeepMind        2 UC Merced



KITTEN evaluates the ability of text-to-image models to generate real-world entities grounded in knowledge sources.

Recent advancements in text-to-image generation have significantly enhanced the quality of synthesized images. Despite this progress, evaluations predominantly focus on aesthetic appeal or alignment with text prompts. Consequently, there is limited understanding of whether these models can accurately represent a wide variety of realistic visual entities — a task requiring real-world knowledge. To address this gap, we propose a benchmark focused on evaluating Knowledge-InTensive image generaTion on real-world ENtities (i.e., KITTEN). Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals. We evaluate the latest text-to-image models and retrieval-augmented customization models using both automatic metrics and carefully designed human evaluations, with an emphasis on the fidelity of entities in the generated images. Our findings reveal that even the most advanced text-to-image models often fail to generate entities with accurate visual details. Although retrieval-augmented models can enhance the fidelity of entities by incorporating reference images during testing, they often over-rely on these references and struggle to produce novel configurations of the entity as requested in creative text prompts.

Can text-to-image models generate precise visual details of real-world entities?

State-of-the-art text-to-image models can effectively render well-known entities (e.g., the Great Pyramid of Giza),
but they often struggle with less familiar ones such as the Statue of Equality (Ramanuja), leading to hallucinated depictions.

We construct the KITTEN benchmark using real-world entities across eight domains.

For each selected entity, we create four types of image-generation prompts.
A support set and an evaluation set of entity images are collected to assess the fidelity of the generated entities.
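
To make this setup concrete, here is a minimal sketch of how one benchmark entry could be assembled: an entity paired with prompts instantiated from four templates, plus its support and evaluation image sets. The templates, field names, and helper below are hypothetical placeholders, not the benchmark's actual prompts or data format.

from dataclasses import dataclass, field
from typing import List

# Hypothetical templates standing in for the four prompt types; the page
# does not reproduce the benchmark's exact wording, so these are placeholders.
PROMPT_TEMPLATES = [
    "A photo of {entity}.",                        # plain description
    "A photo of {entity} at night.",               # changed condition
    "A watercolor painting of {entity}.",          # style change
    "A photo of {entity} surrounded by flowers.",  # creative composition
]

@dataclass
class EntityRecord:
    """One KITTEN-style entry: an entity, its prompts, and its image sets."""
    name: str                    # e.g., "Great Pyramid of Giza"
    domain: str                  # one of the eight domains, e.g., "landmark"
    prompts: List[str] = field(default_factory=list)
    support_images: List[str] = field(default_factory=list)  # references for retrieval-augmented models
    eval_images: List[str] = field(default_factory=list)     # held out to score entity fidelity

def build_record(name: str, domain: str,
                 support_images: List[str],
                 eval_images: List[str]) -> EntityRecord:
    """Instantiate the four prompts for an entity and bundle its image sets."""
    prompts = [t.format(entity=name) for t in PROMPT_TEMPLATES]
    return EntityRecord(name, domain, prompts, support_images, eval_images)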

We evaluate the latest text-to-image models and retrieval-augmented models using both carefully designed human evaluations (top) and automatic metrics (bottom), placing an emphasis on faithfulness to the entities and the ability to follow instructions.
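
As a rough illustration of the automatic side of this protocol, the sketch below scores a generated image by its mean cosine similarity to the entity's held-out evaluation images in the embedding space of an off-the-shelf CLIP image encoder (here a Hugging Face transformers checkpoint, an assumption of this sketch). It is one plausible fidelity metric, not necessarily the one used in the paper.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed off-the-shelf checkpoint; any CLIP-style image encoder would do.
MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

@torch.no_grad()
def embed(images: list) -> torch.Tensor:
    """L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def entity_fidelity(generated: Image.Image, references: list) -> float:
    """Mean cosine similarity between one generated image and the
    entity's held-out evaluation images (higher = more faithful)."""
    gen = embed([generated])      # shape (1, d)
    refs = embed(references)      # shape (n, d)
    return (gen @ refs.T).mean().item()

A complementary instruction-following score could compare the generated image against the text prompt via CLIP's text encoder; human evaluation remains the primary signal on both axes.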

Our results illustrate the trade-off between faithfulness to entities and instruction-following in current models.

The backbone text-to-image models still exhibit a significant gap in generating faithful representations of entities.
While retrieval-augmented methods enhance faithfulness, they often struggle with creative prompts.
Our study underscores the need for techniques that improve the fidelity of entities without sacrificing the ability to follow instructions.

BibTeX

      
      @article{huang2024kitten,
        title={KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities},
        author={Huang, Hsin-Ping and Wang, Xinyi and Bitton, Yonatan and Taitelbaum, Hagai and
          Tomar, Gaurav Singh and Chang, Ming-Wei and Jia, Xuhui and Chan, Kelvin C.K. and
          Hu, Hexiang and Su, Yu-Chuan and Yang, Ming-Hsuan},
        journal={arXiv preprint arXiv:2410.11824},
        year={2024}
      }