Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation (CVPR 2024)
- Ji-Jia Wu NTU
- Andy Chia-Hao Chang NYCU
- Chieh-Yu Chuang NYCU
- Chun-Pei Chen NYCU
- Yu-Lun Liu NYCU
- Min-Hung Chen NVIDIA
- Hou-Ning Hu MediaTek
- Yung-Yu Chuang NTU
- Yen-Yu Lin NYCU

Abstract
This paper addresses text-supervised semantic segmentation, aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue, we propose a novel framework, Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model, we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
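To make the region-word alignment objective concrete, the snippet below is a minimal sketch of a symmetric contrastive loss between region and word-segment embeddings, in the spirit of CLIP-style InfoNCE. The tensor names, the use of a single embedding per region-word pair, and the temperature value are assumptions for illustration, not the paper's exact objective.

```python
# Minimal sketch of region-word contrastive alignment (illustrative; not the
# authors' code). Each image-text pair is assumed to contribute one region
# embedding and one word-segment embedding; matched pairs are pulled together
# and mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def region_word_contrastive_loss(region_emb: torch.Tensor,
                                 word_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """region_emb, word_emb: (B, D) embeddings of matched region-word pairs."""
    region_emb = F.normalize(region_emb, dim=-1)
    word_emb = F.normalize(word_emb, dim=-1)
    logits = region_emb @ word_emb.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy over both matching directions.
    loss_r2w = F.cross_entropy(logits, targets)           # region -> word
    loss_w2r = F.cross_entropy(logits.t(), targets)       # word -> region
    return 0.5 * (loss_r2w + loss_w2r)
```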
Training pipeline for image-text co-decomposition
Our method consists of three major modules; a schematic pseudocode sketch of one training step follows the list.
- (a) Image-text co-segmentation module: the image and text segmenters estimate the region and word masks, respectively, according to a selected noun.
- (b) Region-word highlighting module: the estimated masks, together with two learnable prompts, produce the highlighted image and the highlighted text.
- (c) Region-word alignment module: contrastive learning is applied to the embedded object regions and word segments to accomplish region-word alignment.
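As referenced above, the following sketch shows how the three modules could be chained in a single training step. The module names, call signatures, and the contrastive-loss argument are assumptions made for illustration and do not reflect the released implementation.

```python
# Schematic training step for image-text co-decomposition (illustrative only).
import torch

def training_step(image: torch.Tensor, text_tokens: torch.Tensor, nouns,
                  co_segmenter, highlighter, image_encoder, text_encoder,
                  contrastive_loss) -> torch.Tensor:
    # (a) Image-text co-segmentation: for each selected noun, estimate a
    #     region mask over the image and a word mask over the text.
    region_masks, word_masks = co_segmenter(image, text_tokens, nouns)

    # (b) Region-word highlighting: the estimated masks and two learnable
    #     prompts yield a highlighted image and a highlighted text that
    #     emphasize the selected region and word segment.
    highlighted_image, highlighted_text = highlighter(
        image, text_tokens, region_masks, word_masks)

    # (c) Region-word alignment: embed the highlighted inputs and apply
    #     contrastive learning so that matched region-word pairs align.
    region_emb = image_encoder(highlighted_image)
    word_emb = text_encoder(highlighted_text)
    return contrastive_loss(region_emb, word_emb)
```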
Qualitative results
The first two rows show the input image-text pairs (text and images). In each text, nouns are underlined in different colors, and our method uses these nouns as queries for image-text co-decomposition. The last two rows show the output of our co-decomposition method, where the regions and word segments associated with each noun appear in the corresponding color.
Quantitative results
The proposed method is compared with nine state-of-the-art methods on six popular semantic segmentation datasets: PASCAL VOC (VOC), PASCAL Context (Context), COCO-Object (Object), COCO-Stuff (Stuff), Cityscapes (City), and ADE20K (ADE). For each compared method, the dataset column lists its training datasets; several methods use datasets in addition to CC3M and CC12M, such as YFCC14M, COCO, and RedCaps12M. Where applicable, we also report the average mIoU across all six datasets. For each dataset, the best result is shown in bold and the second best is underlined.
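For reference, the reported numbers are mean intersection over union (mIoU). The snippet below is a minimal sketch of the standard mIoU computation from predicted and ground-truth label maps; it is generic evaluation code, not the paper's evaluation script.

```python
# Standard mIoU computation from integer label maps (generic; not from the paper).
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    valid = gt != ignore_index
    gt_v = gt[valid].astype(np.int64)
    pred_v = pred[valid].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    hist = np.bincount(num_classes * gt_v + pred_v,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)    # avoid division by zero
    return float(iou[union > 0].mean())          # average over classes present
```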
Citation
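The BibTeX entry below is assembled from the information on this page; the citation key and the booktitle string are assumptions rather than text copied from the official proceedings.

```bibtex
@inproceedings{wu2024image,
  title     = {Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation},
  author    = {Wu, Ji-Jia and Chang, Andy Chia-Hao and Chuang, Chieh-Yu and Chen, Chun-Pei and Liu, Yu-Lun and Chen, Min-Hung and Hu, Hou-Ning and Chuang, Yung-Yu and Lin, Yen-Yu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}
```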
Acknowledgements
This work was supported in part by the National Science and Technology Council (NSTC) under grants 112-2221-E-A49-090-MY3, 111-2628-E-A49-025-MY3, 112-2634-F-002-005, 112-2634-F-002-006, and 110-2221-E-002-124-MY3, and by NTU under grant 112L9009. This work was also funded in part by MediaTek and NVIDIA.
The website template was borrowed from CEVR.