Abstract:
Objective Conventional multimodal alignment and fusion methods struggle to adapt to complex underwater detection environments. Because acoustic sonar and optical cameras rely on different imaging mechanisms, their cross-modal features are often spatially misaligned, which degrades fusion performance. This misalignment significantly limits the accuracy of underwater target detection in turbulent and turbid marine environments. To address these challenges, this study proposes an acoustic-optical multimodal fusion detection architecture for underwater perception tasks.
Method An end-to-end five-layer detection framework is developed, comprising an input layer, a feature extraction layer, a multimodal fusion layer, a target perception layer, and an output layer. Two independent ResNet50 branches extract multi-scale feature representations from the sonar and optical images, respectively. A novel spatial alignment module estimates affine transformation parameters, including scaling and translation factors, enabling pixel-level spatial registration between the two modalities. A dynamic weighted fusion module, integrated with channel and spatial attention mechanisms, adaptively adjusts the contribution of each modality, thereby suppressing low-quality and noisy features. Furthermore, a hierarchical interactive fusion encoder incorporating DenseNet and a cross-attention mechanism is constructed to achieve deep complementary fusion of multi-scale cross-modal features. The Transformer decoder and Hungarian matching loss inherited from DETR are used for end-to-end target classification and bounding-box regression, eliminating the need for a separate non-maximum suppression step.
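The alignment and weighting steps described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`affine_theta`, `warp_nearest`, `fuse_weighted`), the normalized [-1, 1] coordinate convention, nearest-neighbour sampling, and the use of one scalar quality score per modality in place of the channel and spatial attention maps are all assumptions made for illustration.

```python
import math


def affine_theta(sx, sy, tx, ty):
    """Build a 2x3 affine matrix restricted to scaling and translation
    (no rotation or shear), expressed in normalized [-1, 1] coordinates."""
    return [[sx, 0.0, tx],
            [0.0, sy, ty]]


def _norm(k, n):
    """Pixel index -> normalized coordinate in [-1, 1]."""
    return 0.0 if n == 1 else -1.0 + 2.0 * k / (n - 1)


def _pix(c, n):
    """Normalized coordinate -> nearest pixel index."""
    return round((c + 1.0) * (n - 1) / 2.0)


def warp_nearest(feat, theta):
    """Resample a 2-D feature map (list of rows) by inverse mapping:
    for each output pixel, theta gives the source location to read.
    Samples that fall outside the map are filled with 0."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            x, y = _norm(j, w), _norm(i, h)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            js, is_ = _pix(xs, w), _pix(ys, h)
            if 0 <= is_ < h and 0 <= js < w:
                out[i][j] = feat[is_][js]
    return out


def fuse_weighted(feat_a, feat_b, score_a, score_b):
    """Blend two aligned feature maps with softmax weights derived from
    per-modality quality scores (hypothetical stand-in for attention)."""
    m = max(score_a, score_b)  # subtract the max for numerical stability
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    wa = ea / (ea + eb)
    wb = 1.0 - wa
    return [[wa * a + wb * b for a, b in zip(ra, rb)]
            for ra, rb in zip(feat_a, feat_b)]


# Usage: align a toy sonar map to the optical grid, then fuse the two.
sonar = [[1.0, 2.0, 3.0]]
optical = [[4.0, 5.0, 6.0]]
aligned = warp_nearest(sonar, affine_theta(1.0, 1.0, 1.0, 0.0))
fused = fuse_weighted(aligned, optical, score_a=0.0, score_b=0.0)
```

In a trained network the transformation parameters and the weights would be predicted from the features themselves and the warp would use differentiable (for example bilinear) sampling; the sketch only shows the data flow of aligning first and weighting second.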
Results Comparative and ablation experiments are conducted on a self-constructed real-world paired acoustic-optical underwater dataset. The proposed method achieves an mAP50 of 95.6% and an mAP50-95 of 50.7%, consistently outperforming state-of-the-art unimodal detectors (YOLO12X, YOLO13X, RTDETR) and advanced multimodal fusion methods (DenseFusion, U2Fusion, SwinFusion). Ablation studies confirm the critical contributions of both the spatial alignment module and the dynamic weighted fusion module to overall detection performance. In addition, noise injection experiments demonstrate that the dynamic weighting strategy exhibits strong robustness against speckle noise and random pixel occlusion in challenging underwater environments.
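The two perturbations named in the noise injection experiments can be sketched as follows. The abstract does not specify the noise models or their parameters, so everything here is an assumption: speckle is modelled as multiplicative Gaussian noise on intensities in [0, 1], occlusion as zeroing a fixed fraction of randomly chosen pixels, and the names `add_speckle` and `occlude_pixels` are hypothetical.

```python
import random


def add_speckle(img, sigma, rng):
    """Multiplicative speckle: each pixel p becomes p * (1 + n) with
    n ~ N(0, sigma^2), clipped back to the [0, 1] intensity range."""
    return [[min(1.0, max(0.0, p * (1.0 + rng.gauss(0.0, sigma))))
             for p in row]
            for row in img]


def occlude_pixels(img, ratio, rng):
    """Set a randomly chosen fraction `ratio` of the pixels to zero."""
    h, w = len(img), len(img[0])
    k = int(round(ratio * h * w))                # number of pixels to hide
    hidden = set(rng.sample(range(h * w), k))    # flat indices, no repeats
    return [[0.0 if (i * w + j) in hidden else img[i][j]
             for j in range(w)]
            for i in range(h)]


# Usage: degrade a toy 4x4 image with a seeded generator for repeatability.
rng = random.Random(0)
clean = [[0.5] * 4 for _ in range(4)]
noisy = add_speckle(clean, sigma=0.2, rng=rng)
occluded = occlude_pixels(clean, ratio=0.25, rng=rng)
```

Applying such perturbations to one modality at a time is one way to check whether the fusion weights shift toward the cleaner input.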
Conclusion The proposed framework effectively mitigates cross-modal spatial misalignment and fusion degradation between sonar and optical imagery, resulting in significant improvements in underwater target detection accuracy. It provides a feasible fusion paradigm for underwater multimodal perception. Future work will focus on lightweight network design and dataset expansion to facilitate deployment on autonomous underwater vehicles in real-world applications.