A Variable Selection Method Based on Ensemble-SISPLS for Near Infrared Spectroscopy
LI Si-hai¹, ZHAO Lei²
1. School of Information Science & Engineering, Gansu University of Traditional Chinese Medicine, Lanzhou 730000, China
2. Key Laboratory of Chemistry and Quality for Traditional Chinese Medicines of the College of Gansu Province, Lanzhou 730000, China
Abstract Near-infrared spectroscopy data are high-dimensional with small sample sizes: the number of variables far exceeds the number of samples. Variable selection is an effective way to improve the robustness and interpretability of quantitative analysis models for near-infrared spectroscopy. Sure Independence Screening (SIS), a feature selection method for ultrahigh-dimensional spaces based on the marginal correlation between each predictor and the response, is widely used for variable selection in gene microarray data. SIS can reduce the dimensionality of the data to the sample size, a reduction comparable to that of LASSO. In a fairly general asymptotic framework, SIS has the sure screening property: with probability tending to one, all significant variables remain after screening. The variable selection method based on sure independence screening combined with partial least squares regression (SIS-SPLS) is an iterative SIS method. First, SIS performs an initial selection of significant variables. Stepwise forward selection is then carried out on the basis of the marginal correlations of the selected variables: a partial least squares regression model is established, and the final variable selection result is determined according to the Bayesian Information Criterion (BIC). SIS-SPLS thus screens important variables incrementally, in a stepwise forward manner. As the number of latent variables increases and the residual gradually decreases, the number of variables selected by SIS-SPLS stabilizes.
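The initial SIS step described above, ranking predictors by marginal correlation with the response and keeping the strongest ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sis_screen`, the default cutoff `d = n - 1`, and the synthetic data are all assumptions made for illustration, and the stepwise PLS/BIC stage of SIS-SPLS is not shown.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Keep the d predictors with the largest absolute marginal correlation with y."""
    n, p = X.shape
    if d is None:
        d = min(p, n - 1)  # assumption: keep fewer variables than samples
    Xc = X - X.mean(axis=0)          # center each predictor
    yc = y - y.mean()                # center the response
    x_norm = np.linalg.norm(Xc, axis=0)
    x_norm[x_norm == 0] = np.inf     # constant columns get zero correlation
    corr = (Xc.T @ yc) / (x_norm * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))  # strongest correlation first
    return np.sort(order[:d]), corr

# Synthetic example (hypothetical data): 60 samples, 500 variables,
# only variables 10 and 200 carry signal.
rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 10] - 3.0 * X[:, 200] + 0.1 * rng.normal(size=n)

selected, corr = sis_screen(X, y, d=20)
print(selected)
```

Because the screen looks at each variable in isolation, it is cheap even when the number of variables is far larger than the number of samples, but it can also retain variables that are only spuriously correlated with the response in a small sample.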
However, when the number of spectral variables is much larger than the number of samples, evaluating variable importance by marginal correlation alone leaves too many selected variables or makes the selection results insufficiently robust. To improve the robustness of variable selection with small samples, this paper develops a new variable selection method based on ensemble learning, SIS and partial least squares regression (Ensemble-SISPLS). First, following the bagging ensemble strategy, the bootstrap method was used to resample the calibration set at random. Variable selection was performed by SIS-SPLS on each calibration subset, and the results from all subsets were aggregated by a voting rule. Variables whose selection frequency exceeded a given threshold were retained, and a partial least squares regression model was built on them to calculate the root mean square error of 5-fold cross-validation. Grid search was used to optimize the two key parameters, the frequency threshold and the number of latent variables. Sub-model performance was evaluated comprehensively from the cross-validation root mean square error and the number of variables of each sub-model, and the variables in the optimal sub-model were taken as the final variable selection result. Variable selection experiments were performed on the Corn dataset and the Angelica sinensis dataset, and several variable selection methods, including Ensemble-SISPLS, SIS-SPLS and UVE-PLS, were compared in terms of the number of selected variables and model robustness. A total of 77 Angelica sinensis samples were collected from Minxian and Weiyuan Counties in Gansu Province. Near-infrared spectra of all samples were acquired with a Nicolet-6700 near-infrared spectrometer for predicting the ferulic acid content of Angelica sinensis.
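The bagging-and-voting step can be sketched as follows. This is a simplified illustration, not the published procedure: the base selector `top_corr` is a plain marginal-correlation screen standing in for SIS-SPLS, and the names `selection_frequency` and `vote`, the 100 bootstrap rounds, the threshold of 0.5 and the synthetic data are all assumptions. The grid search over the frequency threshold and the number of latent variables, scored by 5-fold cross-validated PLS error, is omitted.

```python
import numpy as np

def top_corr(X, y, d):
    """Stand-in base selector (NOT SIS-SPLS): d largest |marginal correlations|."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf       # guard against constant columns
    corr = (Xc.T @ yc) / denom
    return np.argsort(-np.abs(corr))[:d]

def selection_frequency(X, y, n_boot=100, d=20, seed=0):
    """Fraction of bootstrap resamples in which each variable is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # bootstrap: sample rows with replacement
        counts[top_corr(X[idx], y[idx], d)] += 1
    return counts / n_boot

def vote(freq, threshold):
    """Keep variables whose selection frequency exceeds the threshold."""
    return np.flatnonzero(freq > threshold)

# Synthetic example (hypothetical data): only variables 10 and 200 carry signal.
rng = np.random.default_rng(1)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 10] - 3.0 * X[:, 200] + 0.1 * rng.normal(size=n)

freq = selection_frequency(X, y)
kept = vote(freq, threshold=0.5)
print(kept, freq[kept])
```

Variables that are selected only because of chance correlation in one resample tend to receive low frequencies, so the vote filters them out while consistently selected variables survive. Raising the threshold trades a smaller variable set against the risk of discarding useful variables, which is why the threshold is treated as a tuning parameter.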
On the Corn dataset, Ensemble-SISPLS selected 22 variables, with an RMSEP of 0.0008 and a coefficient of determination of 0.9998; SIS-SPLS selected 97 variables, with an RMSEP of 0.0073 and a coefficient of determination of 0.9988. On the Angelica sinensis dataset, Ensemble-SISPLS selected 24 variables (RMSEP 0.0181, coefficient of determination 0.9963), whereas SIS-SPLS selected 38 variables (RMSEP 0.0226, coefficient of determination 0.9943). These results show that Ensemble-SISPLS further improved the robustness and predictive ability of the variable selection result. By combining the variable selection ability of SIS-SPLS with the good generalization capacity of ensemble learning, Ensemble-SISPLS improves the robustness of variable selection. In addition, the sub-model evaluation criterion strikes a compromise between prediction performance and the number of selected variables, which reduces the number of selected variables to some extent and improves the interpretability of the model.
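For reference, the two reported metrics are commonly defined as below. The abstract does not state which definitions were used, so this is an assumption based on standard usage; here $y_i$ is the reference value of prediction sample $i$, $\hat{y}_i$ its predicted value, $\bar{y}$ the mean reference value, and $n_p$ the number of prediction samples.

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{n_p}\sum_{i=1}^{n_p}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n_p}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n_p}\left(y_i - \bar{y}\right)^2}
```

Under these definitions a lower RMSEP and an $R^2$ closer to 1 both indicate better prediction. RMSEP is expressed in the units of the predicted property, so values are not directly comparable across the two datasets.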
Received: 2018-03-20
Accepted: 2018-07-08