A multimodal deep fusion method for underwater acoustic and optical information for end-to-end object detection

刘禄; 陈燕斌; 张祥越; 朱红全; 石廷超; 范慧丽; 陶浩; 孔诗涵

doi:10.19693/j.issn.1673-3185.04920

A multimodal deep fusion method for underwater acoustic and optical information for end-to-end object detection

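Summary:
Objective Existing multimodal alignment and fusion algorithms for object detection adapt poorly to underwater environments. Because sonar and optical cameras image by different mechanisms, cross-modal features tend to be spatially misaligned and fusion performance degrades, which limits underwater object detection accuracy in complex waters. This study therefore investigates a deep acoustic-optical multimodal fusion method for detection.
Method A five-layer end-to-end acoustic-optical fusion framework for underwater object detection is constructed. Two ResNet50 branches extract multi-scale features from the optical and sonar images, respectively. A spatial alignment module learns affine transformation parameters to correct pixel offsets, and a dynamic weighted fusion module with embedded attention adaptively assigns a fusion weight to each modality. A hierarchical interactive fusion encoder that combines DenseNet with cross-attention makes the multi-scale features deeply complementary. The detection head retains DETR's Transformer decoder and Hungarian matching for object localization and classification.
Results Comparative and ablation experiments are carried out on a self-built dataset of real paired underwater acoustic-optical images. The proposed method reaches an mAP_50 of 95.6% and an mAP_50-95 of 50.7%, improving accuracy across the board over single-modality models such as the YOLO series and RTDETR and over mainstream multimodal algorithms such as DenseFusion and U2Fusion. The ablation experiments confirm that the spatial alignment and dynamic weighting modules both significantly improve detection performance, and simulated noise tests show that the dynamic weighting strategy effectively resists interference from water-body noise.
Conclusion The framework effectively resolves the spatial misalignment of acoustic and optical images and the resulting fusion degradation, significantly improves object detection accuracy in complex underwater scenes, and offers a standardized fusion paradigm for underwater multimodal perception. Future work can advance engineering deployment through model lightweighting and expansion of the dataset to more scenarios.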

Abstract:
Objective Conventional multimodal alignment and fusion methods struggle to adapt to complex underwater detection environments. Owing to the divergences in imaging mechanisms between acoustic sonar and optical cameras, cross-modal features are often spatially misaligned, resulting in degraded fusion performance. This issue significantly limits the accuracy of underwater target detection in turbulent and turbid marine environments. To address these challenges, this study proposes an acoustic-optical multimodal fusion detection architecture for underwater perception tasks.
Method An end-to-end five-layer detection framework is developed, comprising an input layer, a feature extraction layer, a multimodal fusion layer, a target perception layer, and an output layer. Two independent ResNet50 branches extract multi-scale feature representations from the sonar and optical images, respectively. A spatial alignment module is designed to estimate affine transformation parameters, including scaling and translation factors, enabling pixel-level spatial registration between the modalities. A dynamic weighted fusion module, integrated with channel and spatial attention mechanisms, adaptively adjusts the contribution of each modality and thereby suppresses low-quality and noisy features. Furthermore, a hierarchical interactive fusion encoder incorporating DenseNet and a cross-attention mechanism is constructed to achieve deep complementary fusion of multi-scale cross-modal features. The Transformer decoder and Hungarian matching loss inherited from DETR are used for end-to-end target classification and bounding-box regression, eliminating the need for a separate non-maximum suppression step.
Results Comparative and ablation experiments are conducted on a self-constructed real-world paired acoustic-optical underwater dataset. The proposed method achieves an mAP_50 of 95.6% and an mAP_50-95 of 50.7%, consistently outperforming state-of-the-art unimodal detectors (YOLO12X, YOLO13X, RTDETR) and advanced multimodal fusion methods (DenseFusion, U2Fusion, SwinFusion). Ablation studies confirm the critical contributions of both the spatial alignment module and the dynamic weighted fusion module to overall detection performance. In addition, noise injection experiments demonstrate that the dynamic weighting strategy exhibits strong robustness against speckle noise and random pixel occlusion in challenging underwater environments.
Conclusion The proposed framework effectively mitigates cross-modal spatial misalignment and fusion degradation between sonar and optical imagery, resulting in significant improvements in underwater target detection accuracy. It delivers a feasible fusion paradigm for underwater multimodal perception. Future work will focus on network lightweighting and extended dataset construction to facilitate deployment on autonomous underwater vehicles in real-world applications.
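
The two modules highlighted in the abstract, scale-and-translation alignment and dynamic modality weighting, can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the affine parameters and the modality weights are learned by the network, whereas here they are fixed by hand. The function names (`affine_align`, `modality_weights`, `fuse`), the use of a softmax to normalize the weights, and all numeric values are assumptions made for illustration.

```python
import math


def affine_align(points, scale, shift):
    """Map (x, y) sonar coordinates into the optical frame.

    Uses only scaling and translation: x' = sx * x + tx, y' = sy * y + ty.
    """
    sx, sy = scale
    tx, ty = shift
    return [(sx * x + tx, sy * y + ty) for x, y in points]


def modality_weights(scores):
    """Softmax over per-modality quality scores (assumed normalization).

    The returned weights are positive and sum to 1.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(optical_feat, sonar_feat, scores):
    """Element-wise weighted sum of two equal-length feature vectors."""
    w_opt, w_son = modality_weights(scores)
    return [w_opt * o + w_son * s for o, s in zip(optical_feat, sonar_feat)]


# Hypothetical values, chosen only to make the arithmetic easy to follow.
aligned = affine_align([(10.0, 20.0)], scale=(2.0, 0.5), shift=(1.0, -3.0))
fused = fuse([1.0, 2.0], [3.0, 6.0], scores=[0.0, 0.0])
print(aligned)  # [(21.0, 7.0)]
print(fused)    # [2.0, 4.0]
```

With equal scores both modalities contribute half of the fused feature. Lowering one score, for example for a noisy optical frame, shifts weight toward the other modality, which is the behavior the noise injection experiments are described as testing.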
