Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Yang Liu, Wentao Feng, Zhuoyao Liu, Shudong Huang, Jiancheng Lv
1 College of Computer Science, Sichuan University, Chengdu, 610065, China
2 Engineering Research Center of Machine Learning and Industry Intelligence, Ministry of Education, Chengdu, 610065, China

Abstract

Enabling visual semantic embedding models to effectively handle multi-view description matching has been a long-standing challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity accordingly. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we simultaneously optimize two tasks: distilling dense text embeddings into sparse text embeddings while aligning images with sparse texts, thereby enhancing the information capacity of sparse text embeddings. The proposed D2S-VSE is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
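
To make the two fine-tuning objectives concrete, the following is a minimal PyTorch-style sketch of how an image-to-sparse-text alignment loss and a dense-to-sparse distillation loss could be combined. This is an illustrative assumption, not the paper's implementation: the function names, the InfoNCE-style alignment term, the cosine-distance distillation term, and the weight lambda_distill are hypothetical placeholders standing in for whatever objectives D2S-VSE actually uses.

    import torch
    import torch.nn.functional as F

    # Assumed shapes: img_emb, sparse_emb, dense_emb are L2-normalized (B, D)
    # embeddings from the image encoder, the sparse-caption text encoder, and
    # the dense-caption text encoder (kept fixed for distillation), respectively.

    def alignment_loss(img_emb, txt_emb, temperature=0.05):
        # Symmetric InfoNCE-style image-text alignment (a common choice; the
        # paper may instead use, e.g., a hinge-based triplet loss).
        logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def dense_to_sparse_distillation_loss(sparse_emb, dense_emb):
        # Pull each sparse-text embedding toward its richer dense-text embedding;
        # a simple cosine distance is used here as an illustrative stand-in.
        return (1.0 - F.cosine_similarity(sparse_emb, dense_emb.detach(), dim=-1)).mean()

    def fine_tuning_loss(img_emb, sparse_emb, dense_emb, lambda_distill=1.0):
        # Fine-tuning stage: align images with sparse text while distilling
        # dense-text embeddings into sparse-text embeddings.
        align = alignment_loss(img_emb, sparse_emb)
        distill = dense_to_sparse_distillation_loss(sparse_emb, dense_emb)
        return align + lambda_distill * distill

In this reading, the pre-training stage would use only the alignment term between image and dense-text embeddings, while the fine-tuning stage adds the distillation term so that a single image embedding can match sparse descriptions written from different views.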

Introduction Figure
Figure 1: (Top) Previous methods handle descriptions of varying information density by learning a set of visual embeddings and matching each text to its most similar one. However, this approach yields embeddings with limited information capacity. (Middle) Since dense text has a higher information density than sparse text, aligning images with dense text during training produces image embeddings with greater information capacity. (Bottom) The sparse text embedding can be distilled from the dense text embedding to enhance its information capacity, enabling a single visual embedding to match text descriptions from different perspectives. (Best viewed in color)

Code/Pre-trained Models

Our code and pre-trained models are available on our GitHub repo.

Citation

            @InProceedings{liu2025aligning,
                author    = {Liu, Yang and Feng, Wentao and Liu, Zhuoyao and Huang, Shudong and Lv, Jiancheng},
                title     = {Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching},
                booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
                month     = {October},
                year      = {2025}
            }