Research Article | Open Access

Open-vocabulary camouflaged object segmentation with cascaded vision language models

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Nankai International Advanced Research Institute (Shenzhen Futian), Shenzhen 518045, China, and VCIP, and CS Department, Nankai University, Tianjin 300071, China

Abstract

Open-vocabulary camouflaged object segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, a task made challenging by visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: they first segment objects, and then classify the segmented regions using vision language models (VLMs). However, such methods (i) suffer from a domain gap caused by the mismatch between VLMs’ full-image training and cropped-region inference, and (ii) depend on generic segmentation models that are optimized for well-delineated objects and are therefore less effective for camouflaged ones. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework that addresses both issues in OVCOS. For segmentation, we leverage the segment anything model (SAM), guided by the VLM: our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel. This retains the full image context while providing precise spatial guidance, leading to more accurate, context-aware classification of camouflaged objects. The same VLM is shared between segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects. Our code and models are open-sourced at https://github.com/intcomp/camouflaged-vlm.
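The alpha-channel idea from the abstract can be illustrated with a minimal, dependency-free sketch. The function below is hypothetical (not the authors' actual implementation): it attaches a soft segmentation mask to an RGB image as a fourth channel, so a downstream classifier still sees the full image context while the mask indicates where to attend, in contrast to hard cropping, which discards everything outside the region.

```python
# Illustrative sketch of a "soft spatial prior via alpha channel".
# Names and data layout are assumptions for the example, not the paper's API.

def soft_alpha_prior(image, mask):
    """Attach a soft mask as the alpha channel of an RGB image.

    image: H x W x 3 nested lists of floats in [0, 1]
    mask:  H x W soft segmentation scores in [0, 1]
    Returns an H x W x 4 RGBA image: the full context is preserved,
    while the alpha channel carries the spatial prior.
    """
    h, w = len(image), len(image[0])
    return [[image[y][x] + [mask[y][x]] for x in range(w)]
            for y in range(h)]

# Toy 2x2 image: the bottom-right pixel is the (camouflaged) region.
img = [[[0.2, 0.3, 0.4], [0.1, 0.1, 0.1]],
       [[0.5, 0.5, 0.5], [0.9, 0.8, 0.7]]]
msk = [[0.0, 0.1],
       [0.2, 0.9]]

rgba = soft_alpha_prior(img, msk)
print(rgba[1][1])  # [0.9, 0.8, 0.7, 0.9] -- RGB kept, alpha marks the region
```

A hard crop would instead zero out or discard the three low-alpha pixels entirely, which is exactly the full-image-versus-cropped-region mismatch the paper argues causes the VLM domain gap.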

Computational Visual Media
Pages 473-492

Cite this article:
Zhao K, Yuan W, Wang Z, et al. Open-vocabulary camouflaged object segmentation with cascaded vision language models. Computational Visual Media, 2026, 12(2): 473-492. https://doi.org/10.26599/CVM.2025.9450512


Received: 24 June 2025
Accepted: 09 September 2025
Published: 20 March 2026
© The Author(s) 2026.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

To submit a manuscript, please go to https://jcvm.org.