Research Article | Open Access

Open-vocabulary camouflaged object segmentation with cascaded vision language models

School of Communication and Information Engineering, Shanghai University, Shanghai 200444, China
Nankai International Advanced Research Institute (Shenzhen Futian), Shenzhen 518045, China, and VCIP, and CS Department, Nankai University, Tianjin 300071, China

Abstract

Open-vocabulary camouflaged object segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, a task made challenging by visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: they first segment objects, and then classify the segmented regions using vision language models (VLMs). However, such methods (i) suffer from a domain gap caused by the mismatch between VLMs’ full-image training and cropped-region inference, and (ii) depend on generic segmentation models that are optimized for well-delineated objects and are therefore less effective for camouflaged ones. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework that addresses both issues in OVCOS. For segmentation, we leverage the segment anything model (SAM), guided by the VLM: our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel. This retains the full image context while providing precise spatial guidance, leading to more accurate, context-aware classification of camouflaged objects. The same VLM is shared between segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects. Our code and models are open-sourced at https://github.com/intcomp/camouflaged-vlm.
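The alpha-channel idea from the abstract can be illustrated with a minimal, dependency-free sketch. The function below is hypothetical (not the authors' actual implementation): it attaches a soft segmentation mask to an RGB image as a fourth channel, so a downstream classifier still sees the full image context while the mask indicates where to attend, in contrast to hard cropping, which discards everything outside the region.

```python
# Illustrative sketch of a "soft spatial prior via alpha channel".
# Names and data layout are assumptions for the example, not the paper's API.

def soft_alpha_prior(image, mask):
    """Attach a soft mask as the alpha channel of an RGB image.

    image: H x W x 3 nested lists of floats in [0, 1]
    mask:  H x W soft segmentation scores in [0, 1]
    Returns an H x W x 4 RGBA image: the full context is preserved,
    while the alpha channel carries the spatial prior.
    """
    h, w = len(image), len(image[0])
    return [[image[y][x] + [mask[y][x]] for x in range(w)]
            for y in range(h)]

# Toy 2x2 image: the bottom-right pixel is the (camouflaged) region.
img = [[[0.2, 0.3, 0.4], [0.1, 0.1, 0.1]],
       [[0.5, 0.5, 0.5], [0.9, 0.8, 0.7]]]
msk = [[0.0, 0.1],
       [0.2, 0.9]]

rgba = soft_alpha_prior(img, msk)
print(rgba[1][1])  # [0.9, 0.8, 0.7, 0.9] -- RGB kept, alpha marks the region
```

A hard crop would instead zero out or discard the three low-alpha pixels entirely, which is exactly the full-image-versus-cropped-region mismatch the paper argues causes the VLM domain gap.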

Computational Visual Media
Pages 473-492

Cite this article:
Zhao K, Yuan W, Wang Z, et al. Open-vocabulary camouflaged object segmentation with cascaded vision language models. Computational Visual Media, 2026, 12(2): 473-492. https://doi.org/10.26599/CVM.2025.9450512


Received: 24 June 2025
Accepted: 09 September 2025
Published: 20 March 2026
© The Author(s) 2026.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

To submit a manuscript, please go to https://jcvm.org.