IEI (Identification + Explanation + Implication):
In the IEI setup, the model receives the input objects and the theme, and is prompted to analyze their shared characteristics from multiple perspectives before generating prompts for text-to-image models.
II (Identification + Implication):
In the II setup, the model is given the same inputs (i.e., the objects and the theme) and uses chain-of-thought prompting, but is not explicitly guided to explore the attributes the objects share.
These results show that the IEI method generalizes across themes and concepts and yields higher-quality combinations than the II setting.
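For concreteness, below is a minimal Python sketch of how the two setups could be wired. The OpenAI-style chat client, the "gpt-4o" model name, and the prompt wording are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch of the two prompting setups (IEI vs. II).
# Assumptions: the chat client, model name, and prompt wording below
# are illustrative, not the paper's own prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def build_prompt(objects, theme, use_explanation):
    """Compose a step-by-step prompt for writing a text-to-image prompt."""
    steps = [f"1. Identify the two input objects: {objects[0]} and {objects[1]}."]
    if use_explanation:  # the Explanation step is what distinguishes IEI from II
        steps.append("2. Explain the attributes the two objects share "
                     "(e.g., shape, color, function, symbolism).")
    steps.append(f"{len(steps) + 1}. Derive a novel implication of blending them "
                 f"under the theme '{theme}', then write one concise prompt "
                 "for a text-to-image model.")
    return "Think step by step:\n" + "\n".join(steps)

def generate_t2i_prompt(objects, theme, setting="IEI"):
    prompt = build_prompt(objects, theme, use_explanation=(setting == "IEI"))
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable VLM/LLM endpoint works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g., generate_t2i_prompt(("fish", "garbage"), "environmental protection")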
Theme-Driven Generation
Given a theme and two objects as input, the model is prompted to generate a mashup image of the two objects that aligns with the theme.
Object-Driven Generation
Novel conceptual mashups starting from "horse"
Novel mashups for product designs
Starting from an initial object (e.g., a horse), we prompt a language model to generate conceptually related items by identifying common attributes across a generalized feature space (e.g., color, function). These outputs are structured as triplets of the form (object1, object2, shared attributes), which are then fed into the IEI pipeline to produce creative combinations: novel conceptual mashups (shown above) and product designs (shown below).
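A minimal sketch of this seed-expansion step, reusing the client from the sketch above; the prompt text and the JSON schema it requests are assumptions for illustration.

# Sketch of the seed-expansion step: ask a language model for objects that
# share attributes with the seed, then form (object1, object2, shared
# attributes) triplets for the IEI pipeline.
import json

SEED_PROMPT = (
    'Given the object "{seed}", list {k} other objects, each sharing at least '
    "one attribute with it in a generalized feature space (color, shape, "
    "function, symbolism, ...). Respond with only a JSON list of entries like "
    '{{"object": "...", "shared_attributes": ["..."]}}.'
)

def expand_seed(client, seed, k=5):
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": SEED_PROMPT.format(seed=seed, k=k)}],
    )
    # A real pipeline would validate the output; this sketch assumes clean JSON.
    items = json.loads(response.choices[0].message.content)
    return [(seed, item["object"], item["shared_attributes"]) for item in items]

# e.g., expand_seed(client, "horse") might return triplets such as
# ("horse", "rocking chair", ["curved silhouette", "rhythmic motion"])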
Abstract
The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) such as GPT-4V and DALL-E 3 have sparked debate about whether their outputs reflect combinational creativity—defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts—or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs through the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLM outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.
Method Overview
The IEI framework systematically evaluates combinational creativity in both comprehension and generation tasks. For comprehension, each level of the framework represents a progressively more complex analytical capability, providing clear metrics for probing model understanding. For generation, the framework guides the creative process step by step, yielding more sophisticated and meaningful outputs than standard chain-of-thought approaches. This structured decomposition supports the design of experiments that explicitly integrate systematic creative thinking, enhancing VLMs' capacity for contextually relevant creation.
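One way to picture the decomposition is as a three-field annotation record. The field names and example values below are illustrative, not CreativeMashup's actual schema.

# Sketch of a three-level IEI record (field names are assumptions).
from dataclasses import dataclass

@dataclass
class IEIAnnotation:
    identification: list[str]  # level 1: the input objects (input spaces)
    explanation: list[str]     # level 2: shared attributes that license the blend
    implication: str           # level 3: the novel semantic meaning of the blend

# Illustrative values only, loosely following the fish-garbage mashup
# discussed in the Task Setup below (not the dataset's actual annotation):
example = IEIAnnotation(
    identification=["fish", "garbage"],
    explanation=["both drift in water", "similar elongated silhouette"],
    implication="ocean pollution turns marine life into waste",
)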
Task Setup
Examples of the comprehension and generation tasks. (a) The comprehension task demonstrates the three evaluation components using a fish-garbage mashup image: human participants or VLMs identify the primary objects, explain the combination attributes, and interpret the implications. (b) The generation task compares outputs from human experts and two model settings (Identification + Implication vs. Identification + Explanation + Implication) across three concept pairs.
BibTex
@article{peng2025probing,
  title={Probing and Inducing Combinational Creativity in Vision-Language Models},
  author={Peng, Yongqian and Ma, Yuxi and Wang, Mengmeng and Wang, Yuxuan and Wang, Yizhou and Zhang, Chi and Zhu, Yixin and Zheng, Zilong},
  journal={arXiv preprint arXiv:2504.13120},
  year={2025}
}