Videos contain multi-modal content, and exploring multi-branch cross-modal interactions with natural language queries can benefit the text–video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained CLIP model to TVR focus only on visual cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features do not exploit fine-grained cross-modal information in videos. In this paper, we propose a multi-branch multi-modal hybrid fusion (M2HF) network to hierarchically explore interactions between text queries and the other modalities present in videos. Specifically, M2HF first fuses visual features extracted by CLIP with audio and motion features extracted from videos to obtain fused audio–visual features and motion–visual features, respectively. The multi-modal completion problem is also considered and solved in this process. Then, visual features, audio–visual features, motion–visual features, and text extracted from the video are used to establish cross-modal relationships with caption text queries via a multi-branch approach. The retrieval outputs from all branches are then fused to obtain the final text–video retrieval results. Our framework provides two training strategies: an ensemble approach and an end-to-end approach. Moreover, a novel multi-modal loss function is proposed to balance the contribution of each modality for efficient end-to-end training. M2HF achieves state-of-the-art results on various benchmarks: Rank@1 of 66.0%, 68.6%, 33.9%, 57.4%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
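The abstract describes fusing retrieval outputs from several modality branches into a final ranking. A minimal sketch of that late-fusion idea is shown below; the branch names, the weighted-averaging rule, and the Rank@1 metric implementation are illustrative assumptions, not the paper's exact fusion scheme.

```python
import numpy as np

def fuse_branch_similarities(branch_sims, weights=None):
    """Combine per-branch text-video similarity matrices.

    branch_sims: list of (num_texts, num_videos) arrays, one per branch
    (e.g., hypothetical visual, audio-visual, and motion-visual branches).
    The simple weighted average here is an assumed fusion rule.
    """
    sims = np.stack(branch_sims)                       # (B, T, V)
    if weights is None:
        weights = np.full(len(branch_sims), 1.0 / len(branch_sims))
    weights = np.asarray(weights).reshape(-1, 1, 1)
    return (weights * sims).sum(axis=0)                # (T, V)

def rank_at_1(sim):
    """Fraction of text queries whose top-scoring video is the true match,
    assuming query i pairs with video i (the standard TVR convention)."""
    preds = sim.argmax(axis=1)
    return float((preds == np.arange(sim.shape[0])).mean())

# Toy example: three branches, 4 queries x 4 videos, with noise added to
# an identity similarity matrix so no single branch is perfect.
rng = np.random.default_rng(0)
branches = [np.eye(4) + 0.3 * rng.standard_normal((4, 4)) for _ in range(3)]
fused = fuse_branch_similarities(branches)
print("fused Rank@1:", rank_at_1(fused))
```

Averaging branch scores is only one possible fusion; the paper's end-to-end variant instead learns how to balance modalities through its proposed multi-modal loss.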

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.