Describing Differences in Image Sets
with Natural Language

CVPR 2024 Oral
A totally tubular collab between UC Berkeley and Stanford
Equal *authorship and advising contribution (author order randomly chosen)

What is the difference between two sets of images?

We all know it's valuable to understand your data, but sifting through thousands of images to find their differences is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning. Given two sets of images 𝓓A and 𝓓B, output natural language descriptions of concepts that are more true for 𝓓A.

We introduce VisDiff, a set difference captioning system that utilizes VLM, LLM, and CLIP to discover and rank differences between image sets ranging from a few dozen to several thousand images. To evaluate set difference captioning, we also introduce VisDiffBench, a benchmark of 187 paired image sets, each with a ground-truth difference description. VisDiffBench consists of shifts from ImageNetR and ImageNet* as well our new dataset PairedImageSets that covers 150 diverse real-world differences spanning three difficulty levels.

VisDiff: 𝓓A contains more.. Score

VisDiff: Describing Differences with Large Multi-Modal Models

VisDiff consists of a GPT-4 proposer on BLIP-2 generated captions and a CLIP ranker. The proposer takes randomly sampled image captions from 𝓓A and 𝓓B and proposes candidate differences. The ranker takes these proposed differences and evaluates them across all the images in 𝓓A and 𝓓B to assess which ones are most true.


We apply VisDiff to a ton of fun applications. Check out our WandB pages and project reports for more results and visualizations!

What is the difference between ImageNet and ImageNetV2?

While the two datasets are visually indistinguishable, we find several interesting per-class and global differences between the two datasets. Check out more per-class and global differences here!

Comparing the behavior of CLIP and ResNet

Even these models have similar accuracy on ImageNet, we find that CLIP and ResNet have very different failure modes by finding common concepts in images which are correctly classified by CLIP and not ResNet. Check out other per-class and global differences in our paper!

What types of images does ResNet perform poorly on?

We can find interpretable failure modes for a pretrained ResNet, such as images which people, by finding what concepts are in the incorrect classifications which do not appear in the correct classifications. Check out more results here!

Comparing StableDiffusionV1 and StableDiffusionV2

How can you tell if an image was created by the SDv2 rather than the original SDv1? Check out more differences we found here!

What makes an image memorable?

Some amazing work has quantified an image memorability, but humans are bad at determining whether a given image is memorable. Why not use VisDiff to find out? Check out more results here!


    title={Describing Differences in Image Sets with Natural Language},
    author={Dunlap, Lisa and Zhang, Yuhui and Wang, Xiaohan and Zhong, Ruiqi and Darrell, Trevor and Steinhardt, Jacob and Gonzalez, Joseph E. and Yeung-Levy, Serena},
    booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},