Classification of Hyperspectral Remote Sensing Images by Joint Hybrid Convolution and Cascaded Group Attention Mechanisms
WANG Xiao-yan1, LIANG Wen-hui2, BI Chu-ran1, LI Jie3*, WANG Xi-yu2
1. School of Systems Science and Statistics, Beijing Wuzi University, Beijing 101149, China
2. School of Information, Beijing Wuzi University, Beijing 101149, China
3. School of Electromechanical and Vehicle Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
Abstract: The rich spectral information in hyperspectral remote sensing images provides reliable data support for land-cover classification. However, the high dimensionality and redundancy of spectral data, the difficulty of combining spatial and spectral features, and insufficient spectral feature extraction all pose challenges for deep-learning-based classification of hyperspectral remote sensing images. Convolutional neural networks (CNNs) and the Vision Transformer (ViT) are two deep learning architectures widely used in computer vision, each with its own advantages and limitations. CNNs are good at capturing local features and spatial hierarchies and handle image translation invariance well. ViT captures global dependencies in an image through its self-attention mechanism and is better at modeling complex patterns. To improve classification accuracy for hyperspectral remote sensing images and to exploit the strengths of both models, this paper combines the local feature extraction capability of CNNs with the global context modeling capability of ViT by introducing a 3D Efficient ViT module into hybrid convolution, and proposes EVIT3D_HSN, a hyperspectral remote sensing image classification algorithm that joins hybrid convolution with a cascaded group attention mechanism. The algorithm uses 3D convolution to extract joint spatial-spectral features and 2D convolution to extract spatial features, and adds the 3D Efficient ViT module on top of them. This improves generalization across datasets and captures the image features of hyperspectral data more comprehensively, thereby enhancing classification performance without increasing model complexity.
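The cascaded group attention idea named above can be illustrated with a small sketch. This is not the authors' EVIT3D_HSN implementation: it follows the general scheme described for EfficientViT, in which the channels are split into one group per head and each head's output is added to the next head's input before attention is applied. The function names (`self_attention`, `cascaded_group_attention`) are hypothetical, and the learned query, key, value, and output projections are replaced by identity mappings to keep the example short.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Assumption: Q, K and V are the tokens themselves (identity projections);
    a real model would learn separate projection matrices per head.
    """
    dim = len(tokens[0])
    attended = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * tok[c] for w, tok in zip(weights, tokens)) for c in range(dim)]
        )
    return attended


def cascaded_group_attention(tokens, num_heads):
    """Split channels into num_heads groups and chain the heads.

    Head j receives its own channel group plus the output of head j-1,
    so later heads refine what earlier heads produced.
    """
    channels = len(tokens[0])
    if channels % num_heads != 0:
        raise ValueError("channel count must be divisible by num_heads")
    width = channels // num_heads

    outputs = [[] for _ in tokens]
    previous = None
    for head in range(num_heads):
        group = [tok[head * width:(head + 1) * width] for tok in tokens]
        if previous is not None:
            # Cascade step: add the previous head's output to this head's input.
            group = [
                [a + b for a, b in zip(g, p)] for g, p in zip(group, previous)
            ]
        previous = self_attention(group)
        for out_row, head_row in zip(outputs, previous):
            out_row.extend(head_row)  # concatenate head outputs along channels
    return outputs
```

With a single token, attention returns the token unchanged, so `cascaded_group_attention([[1.0, 2.0, 3.0, 4.0]], 2)` yields `[[1.0, 2.0, 4.0, 6.0]]`: the second group `[3.0, 4.0]` has the first head's output `[1.0, 2.0]` added to it, which makes the cascade visible. In a full model the tokens would come from the flattened 3D/2D convolutional feature maps rather than hand-written lists.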
To validate the algorithm, EVIT3D_HSN is compared with 1DCNN, 2DCNN, 3DFCN, and 3DCNN, and an ablation experiment is conducted against the original HybridSN algorithm, on the hyperspectral remote sensing image classification datasets Indian Pines (IP), Pavia University (PU), and Salinas (SA). On these three datasets, EVIT3D_HSN achieves an overall accuracy (OA) of 97.66%, 99.00%, and 99.65% and a Kappa coefficient of 97.3%, 98.6%, and 99.6%, respectively. Compared with 1DCNN, classification accuracy improves by 37.12%, 25.09%, and 33.67%, respectively; compared with 2DCNN, by 59%, 57.43%, and 46.92%; compared with 3DFCN, by 45.36%, 24.5%, and 29.72%; compared with 3DCNN, by 28.05%, 14.26%, and 34.29%; and compared with HybridSN, by 3.76%, 1.85%, and 2.57%. In addition, EVIT3D_HSN achieves the highest F1 score on the remaining 37 land-cover classes, the exceptions being Stone-Steel-Towers in the IP dataset, Painted metal sheets and Shadows in the PU dataset, and Stubble in the SA dataset. The experimental results show that EVIT3D_HSN outperforms the above five hyperspectral remote sensing image classification algorithms in accuracy and generalization ability and has good practical value.
Key words: Hyperspectral remote sensing image classification; Hybrid convolution; 3D Efficient ViT; Cascaded group attention
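The abstract reports overall accuracy (OA), the Kappa coefficient, and per-class F1. As a reminder of how these quantities are conventionally derived from a confusion matrix, here is a minimal sketch. It is not taken from the paper: the function names are hypothetical, the matrix layout (rows are true classes, columns are predicted classes) is an assumption, and the values are returned as fractions, so they would be multiplied by 100 to match the percentages quoted above.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    observed = overall_accuracy(cm)
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1 - expected)


def f1_per_class(cm):
    """F1 for each class, written as 2*TP / (true count + predicted count)."""
    n = len(cm)
    scores = []
    for k in range(n):
        true_count = sum(cm[k])                            # TP + FN
        predicted_count = sum(cm[i][k] for i in range(n))  # TP + FP
        denom = true_count + predicted_count
        scores.append(2 * cm[k][k] / denom if denom else 0.0)
    return scores


# Toy two-class matrix (made-up numbers, not results from the paper).
toy = [[5, 1],
       [2, 4]]
oa = overall_accuracy(toy)
kappa = cohen_kappa(toy)
f1 = f1_per_class(toy)
```

For the toy matrix, 9 of 12 samples are correct, so OA is 0.75; the chance agreement is 0.5, which gives a Kappa of 0.5. Kappa is lower than OA here because it discounts the agreement expected by chance, which is why the two are often reported together.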
王晓燕,梁文辉,毕楚然,李 杰,王禧钰. 联合混合卷积与级联群注意力机制的高光谱遥感影像分类[J]. 光谱学与光谱分析, 2025, 45(05): 1485-1493.
WANG Xiao-yan, LIANG Wen-hui, BI Chu-ran, LI Jie, WANG Xi-yu. Classification of Hyperspectral Remote Sensing Images by Joint Hybrid Convolution and Cascaded Group Attention Mechanisms. SPECTROSCOPY AND SPECTRAL ANALYSIS, 2025, 45(05): 1485-1493.