Research on Effectiveness of the Pre-Training Model in Improving the Performance of Spectral Feature Extraction
REN Ju-xiang1, LIU Zhong-bao2*
1. College of Information Engineering, Shanxi Vocational University of Engineering Science and Technology, Jinzhong 030619, China
2. School of Information Science, Beijing Language and Culture University, Beijing 100083, China
Abstract: The development of observation technology has produced massive amounts of spectral data, and how to classify these data automatically has attracted the attention of researchers; the key step is feature extraction. Given the limitations of manual processing, most studies use machine learning algorithms to extract features from spectral data. However, these algorithms cannot handle massive spectral data because of their high space and time complexity. The pre-trained models that have emerged in recent years have excellent feature extraction capabilities, yet there is little research on their effectiveness for spectral feature extraction. Therefore, this paper takes stellar spectral data as the research object, separately introduces the pre-trained models BERT, ALBERT, and GPT together with a Convolutional Neural Network (CNN) for feature extraction and classification of stellar spectra, and verifies the effectiveness of these pre-trained models for stellar spectral feature extraction by comparing the experimental results. The spectral classification program is written in Python. Based on the features extracted by the pre-trained models, a CNN model implemented in TensorFlow 1.14 is used for spectral classification. The dataset used in the experiments is the SDSS DR10 stellar spectral dataset, which includes K-type, F-type, and G-type spectra. Grid search and 5-fold cross-validation are used to obtain the optimal experimental parameters. Under the same experimental conditions, the BERT model achieves the highest classification accuracies compared with ALBERT and GPT. In terms of average classification accuracy, the BERT model is 0.0251, 0.0215, and 0.0225 higher than ALBERT, and 0.0497, 0.0424, and 0.0432 higher than GPT, on the K-type, F-type, and G-type stellar datasets, respectively. The following conclusions can be drawn from the experimental results: first, the classification accuracies improve as the scale of the training data increases; second, for the same training-set size, the same model achieves the highest classification accuracy on the K-type stellar data, followed by the F-type and then the G-type; third, the BERT model has the best feature extraction ability compared with ALBERT and GPT.
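As an illustration of the pipeline described in the abstract, the sketch below shows one way the classification stage could be set up: feature vectors produced by a pre-trained model (loaded here from the hypothetical files bert_features.npy and labels.npy) are fed to a small 1-D CNN built with tf.keras as shipped with TensorFlow 1.14, while grid search combined with 5-fold cross-validation selects the hyper-parameters. The network architecture, hyper-parameter grid, and file names are assumptions made for illustration only; the abstract does not specify the actual implementation used in the paper.

# Minimal sketch, assuming features have already been extracted by BERT/ALBERT/GPT
# and stored as one vector per spectrum; the CNN shape and the hyper-parameter grid
# below are illustrative, not the configuration reported in the paper.
import itertools
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def build_cnn(input_dim, n_classes, filters, kernel_size):
    # Small 1-D CNN over the feature vector produced by the pre-trained model.
    model = tf.keras.Sequential([
        tf.keras.layers.Reshape((input_dim, 1), input_shape=(input_dim,)),
        tf.keras.layers.Conv1D(filters, kernel_size, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

X = np.load("bert_features.npy")   # hypothetical: one feature vector per spectrum
y = np.load("labels.npy")          # hypothetical: integer class labels

best_acc, best_params = 0.0, None
for filters, kernel_size in itertools.product([16, 32], [3, 5]):   # grid search
    fold_acc = []
    # 5-fold cross-validation for each point on the hyper-parameter grid.
    for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                               random_state=0).split(X, y):
        model = build_cnn(X.shape[1], len(np.unique(y)), filters, kernel_size)
        model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=32, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        fold_acc.append(acc)
    if np.mean(fold_acc) > best_acc:
        best_acc, best_params = np.mean(fold_acc), (filters, kernel_size)

print("best 5-fold accuracy %.4f with filters=%d, kernel=%d"
      % (best_acc, best_params[0], best_params[1]))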
REN Ju-xiang, LIU Zhong-bao. Research on Effectiveness of the Pre-Training Model in Improving the Performance of Spectral Feature Extraction. Spectroscopy and Spectral Analysis, 2024, 44(12): 3480-3484.
[1] Singh H P, Gulati R K, Gupta R. Monthly Notices of the Royal Astronomical Society, 1998, 295(2): 312.
[2] Liu Z B, Song L P. Publications of the Astronomical Society of the Pacific, 2015, 127(954): 789.
[3] Liu Z B. Journal of Astrophysics and Astronomy, 2016, 37(2): 12.
[4] Liu W, Zhu M, Dai C, et al. Monthly Notices of the Royal Astronomical Society, 2019, 483(4): 4774.
[5] Jiang B, Wei D L, Liu J Z, et al. Universe, 2020, 6(4): 60.
[6] Zhao Z, Wei J Y, Jiang B. Advances in Astronomy, 2022, 2022: 4489359.
[7] HE Dong-yuan, LIU Wei, CAO Shuo, et al. Journal of Beijing Normal University (Natural Science), 2020, 56(1): 37.
[8] Shi J H, Qiu B, Luo A L, et al. Monthly Notices of the Royal Astronomical Society, 2023, 520(2): 2269.
[9] JIANG Bin, ZHAO Zi-liang, WANG Shu-ting, et al. Spectroscopy and Spectral Analysis, 2020, 40(9): 2913.
[10] Devlin J, Chang M W, Lee K, et al. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, USA, 2019: 4171.
[11] Lan Z, Chen M, Goodman S, et al. Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 2020: 1.
[12] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training. [2024-01-05]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.