A Variable Selection Method of the Selectivity Ratio Competitive Model Population Analysis for Near Infrared Spectroscopy
WANG Yu-xi1, JIA Zhen-hong1*, YANG Jie2, Nikola K Kasabov3
1. College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
2. Institute of Image Processing and Pattern Recognition, Shanghai Jiaotong University, Shanghai 200240, China
3. Knowledge Engineering and Discovery Research Institute, Auckland University of Technology, Auckland 1020, New Zealand
Abstract: Spectral analysis is an important application of chemometrics and is widely used in many fields. Variable selection is a key step in spectral analysis, so it is important to study variable selection methods that objectively identify informative variables and eliminate irrelevant or interfering ones. This study proposes a new variable selection method, selectivity ratio competitive model population analysis (SRCMPA). The algorithm draws on the ideas of the selectivity ratio (SR), adaptive weighted sampling, and model population analysis, and combines variable ranking with an exponentially decreasing function. A key wavelength is defined as a wavelength with a high score in the regression model; here, the SR score under the partial least squares (PLS) model serves as the index of each wavelength's importance. Guided by these importance scores, SRCMPA sequentially selects N wavelength subsets through Monte Carlo sampling, in an iterative and competitive manner. In each sampling run, a PLS model is built on a fixed ratio of the samples and the SR value of each variable is calculated. Using the ranking of the SR scores, with the normalized SR scores as weights, the key variables are selected in two steps: compulsory selection by the exponentially decreasing function, followed by competitive selection by adaptive weighted sampling. Finally, cross validation (CV) is used to choose the optimal subset, namely the one with the lowest root mean square error of cross validation (RMSECV). The algorithm was tested on a wheat protein data set and a beer data set, and compared with three efficient algorithms.
The experimental results show that the algorithm finds the best combination of key wavelength variables for each data set, that the selected wavelengths can be used to interpret the chemical property of interest, and that the resulting models give the best evaluation results among the compared methods. Compared with the full-spectrum PLS model on the beer data set, the number of variables was reduced from 567 to about 42. RMSECV decreased from 0.622 to 0.115 and the root mean square error of prediction (RMSEP) from 0.823 to 0.363, that is, relative error reductions (reported as increases in prediction accuracy) of 81.5% and 55.9%, respectively. Q2_CV and Q2_test increased from 0.940 and 0.852 to 0.994 and 0.995. Compared with the full-spectrum PLS model on the wheat protein data set, the number of variables was reduced from 175 to about 18. RMSECV decreased from 0.607 to 0.292 and RMSEP from 0.519 to 0.234, relative error reductions of 51.9% and 54.9%, respectively. Q2_CV and Q2_test increased from 0.748 and 0.774 to 0.931 and 0.839.
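The two-step selection described in the abstract (a compulsory cut by an exponentially decreasing function, then competitive adaptive weighted sampling) can be sketched roughly as follows. The abstract gives no formulas, so this sketch assumes the CARS-style decay r_i = a·exp(-k·i), which keeps all variables in the first run and two in the last. The function names `edf_ratio` and `select_step` and the score values are hypothetical. A real run would recompute the SR scores from a PLS model fitted on a fresh Monte Carlo subsample in every iteration, then pick the subset with the lowest RMSECV; neither step is shown here.

```python
import math
import random


def edf_ratio(i, n_runs, n_vars):
    """Fraction of variables retained at sampling run i (1-based).

    CARS-style exponentially decreasing function (an assumption here):
    the ratio is 1 at the first run and 2 / n_vars at the last run.
    """
    a = (n_vars / 2) ** (1 / (n_runs - 1))
    k = math.log(n_vars / 2) / (n_runs - 1)
    return a * math.exp(-k * i)


def select_step(scores, active, ratio, rng):
    """Forced cut by the decay ratio, then adaptive weighted sampling.

    scores: importance score per variable index (e.g. SR values)
    active: variable indices that survived the previous run
    ratio:  fraction of all variables to retain in this run
    """
    n_keep = max(1, min(len(active), round(ratio * len(scores))))
    # Step 1 (compulsory): keep only the n_keep highest-scoring variables.
    ranked = sorted(active, key=lambda j: scores[j], reverse=True)[:n_keep]
    # Step 2 (competitive): draw with replacement, weight = normalized score;
    # variables that are never drawn are eliminated.
    total = sum(scores[j] for j in ranked)
    weights = [scores[j] / total for j in ranked]
    drawn = rng.choices(ranked, weights=weights, k=n_keep)
    return sorted(set(drawn))


# Hypothetical, fixed scores for 10 variables (illustration only).
rng = random.Random(0)
scores = [0.2, 3.1, 0.4, 2.7, 0.1, 0.9, 1.8, 0.3, 0.05, 1.2]
active = list(range(len(scores)))
n_runs = 5
subsets = []
for i in range(1, n_runs + 1):
    ratio = edf_ratio(i, n_runs, len(scores))
    active = select_step(scores, active, ratio, rng)
    subsets.append(active)
```

Each entry of `subsets` is a candidate wavelength subset nested inside the previous one. High-scoring variables are more likely to survive the weighted draw, which is what makes the selection competitive rather than a plain top-k cut.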
Key words: Variable selection; Selectivity ratio; Adaptive weighted sampling; Model population analysis; Monte Carlo sampling
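The percentage improvements quoted in the abstract are consistent with the relative decrease of each error metric, (before - after) / before. A minimal check, assuming that reading (the helper name `relative_reduction` is hypothetical), could look like this:

```python
def relative_reduction(before, after):
    """Relative decrease of an error metric, in percent."""
    return (before - after) / before * 100.0


# (full-spectrum PLS, SRCMPA) error values quoted in the abstract
reported = {
    "beer RMSECV": (0.622, 0.115),
    "beer RMSEP": (0.823, 0.363),
    "wheat RMSECV": (0.607, 0.292),
    "wheat RMSEP": (0.519, 0.234),
}

reductions = {
    name: round(relative_reduction(before, after), 1)
    for name, (before, after) in reported.items()
}
for name, pct in reductions.items():
    print(f"{name}: {pct}% lower than the full-spectrum model")
```

The four values reproduce the 81.5%, 55.9%, 51.9% and 54.9% figures given for the beer and wheat protein data sets.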
WANG Yu-xi, JIA Zhen-hong, YANG Jie, Nikola K Kasabov. A Variable Selection Method of the Selectivity Ratio Competitive Model Population Analysis for Near Infrared Spectroscopy [J]. Spectroscopy and Spectral Analysis, 2020, 40(04): 1056-1062.