Abstract: Camouflaged object detection (COD) in visible-spectrum images aims to use visible-spectrum information to detect camouflaged objects that are visually consistent with their surroundings. This visual consistency makes object boundaries hard to distinguish and discriminative features hard to learn, which limits the effectiveness of existing object detection methods for COD. A Cross-modal Dynamic Collaborative Dual-channel Network (CDCDN) is proposed to explore the potential of global-local multi-level visual perception and visual-language models (VLMs) in COD. First, to address the difficulty of distinguishing object boundaries, a dynamic collaborative dual-channel module is designed. The dual channels decouple detection into global information localization and local feature refinement, so that objects are detected and refined from a multi-level visual perspective. A dynamic information collaboration and fusion mechanism is also established, in which global and local information complement and correct each other through global gating constraints and local perception correction. This enhances the spatial capture capability of the model in scenarios with blurred object boundaries. Second, to address the difficulty of learning discriminative features, a cross-modal scene-object matching module is designed. By incorporating a pre-trained VLM, this module establishes cross-modal interactions between the visual and language modalities, thereby sharpening the distinction between objects and backgrounds in the feature space and improving the model's semantic discrimination in scenes with limited discriminative features. CDCDN is evaluated on the MHCD2022 and COD10K datasets using the mAP@0.5, mAP@0.5:0.95, and mAP@0.75 metrics. CDCDN achieves scores of 67.6%, 42.6%, and 48.4% on the MHCD2022 dataset, and 67.9%, 40.6%, and 41.0% on the COD10K dataset, respectively.
Compared with five mainstream object detection methods (Faster R-CNN, DETR, Lite-DETR, YOLOv5, and YOLOv10), CDCDN achieves the best detection accuracy on all three metrics. Visualizations of detection results in four common camouflaged scenes (barren land, grassland, woodland, and snowfield) demonstrate the adaptability of CDCDN to varied scenes. In an ablation study, the key components of CDCDN are removed one at a time to systematically evaluate their contributions, and the results show that each component significantly enhances the model's detection performance. Comprehensive experimental results indicate that CDCDN can accurately detect camouflaged objects with high visual consistency to their surroundings, providing a novel solution for COD.
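The three reported metrics differ only in the intersection-over-union (IoU) threshold at which a predicted box counts as a correct detection. The following minimal Python sketch illustrates that matching criterion. It assumes the common COCO-style convention, in which mAP@0.5:0.95 averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The abstract does not state the evaluation code used, so the function names `iou`, `coco_iou_thresholds`, and `is_true_positive` are hypothetical, and the sketch covers only the matching step, not the full precision-recall integration behind mAP.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coco_iou_thresholds():
    """Assumed COCO-style thresholds for mAP@0.5:0.95: 0.50, 0.55, ..., 0.95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def is_true_positive(pred_box, gt_box, threshold):
    """A prediction matches a ground-truth box when their IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    pred = (0, 0, 10, 10)  # hypothetical predicted box
    gt = (2, 0, 12, 10)    # hypothetical ground-truth box
    print(f"IoU = {iou(pred, gt):.3f}")
    print("match at 0.50:", is_true_positive(pred, gt, 0.50))
    print("match at 0.75:", is_true_positive(pred, gt, 0.75))
```

In this toy case the prediction is shifted slightly off the ground truth, so it counts as a detection under the mAP@0.5 criterion but not under the stricter mAP@0.75 criterion. Sensitivity to localization of this kind is why the stricter metrics are informative when object boundaries are blurred.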