Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method
CHEN Rui1, WANG Xue1, 2*, WANG Zi-wen1, QU Hao1, MA Tie-min1, CHEN Zheng-guang1, GAO Rui3
1. College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China
2. Daqing Center of Inspection and Testing for Agricultural Products and Processed Products, Ministry of Agriculture and Rural Affairs, Daqing 163319, China
3. School of Electrical and Information, Northeast Agricultural University, Harbin 150030, China
Abstract: When quantitative analysis models are built rapidly from near-infrared spectroscopy data, feature wavelength selection is one of the more effective ways to improve prediction accuracy. Selecting informative wavelengths reduces redundant data and improves the effectiveness of the data set. Random forest (RF) is an ensemble algorithm that can be used to calculate the feature importance of each spectral wavelength. Following the mean decrease accuracy (MDA) method on out-of-bag (OOB) data, the average mean square error is used as the feature importance result. Feature variables are then selected by setting a feature importance threshold, forming the feature wavelength subset. However, there is no theoretical basis for setting the threshold range, so the range of feature importance thresholds needs to be explored. In addition, because of the randomness of RF, invalid or even interfering variables may be included in the feature wavelength subset, so the effectiveness of the selected variables cannot be guaranteed. Therefore, the RF-iPLS feature wavelength selection algorithm is further proposed. The feature wavelength subset is divided into intervals by interval partial least squares (iPLS); the combination compensates for the invalid variables introduced by RF randomness and for the redundant information retained by iPLS. To illustrate the rationality of the RF-iPLS algorithm, an RF-MC-iPLS algorithm is constructed using the Monte Carlo (MC) method, and a comparison feature subset is generated after 500 samplings. Although the structure of RF-iPLS is similar to that of RF-MC-iPLS, its running time is 11.12% shorter. The results show that feature wavelength selection with the RF-iPLS algorithm is effective in the prediction model and has low time complexity. Furthermore, to verify the algorithm's effectiveness, RF-iPLS was applied to grain protein near-infrared spectroscopy data sets, and partial least squares regression (PLSR) models were established. These models were compared with the full-spectrum PLSR model and with PLSR models based on different wavelength selection methods. The results show that RF-iPLS selects 12 feature wavelength points, compared with 117 wavelength points in the full spectrum. The root mean square error of calibration (RMSEC) of the modeling set is reduced from 2.61 to 0.64, an improvement in prediction accuracy of about 75.5%. The root mean square error of prediction (RMSEP) of the prediction set is reduced from 2.63 to 0.69, an improvement in prediction accuracy of 73.8%. The prediction accuracy and the optimal prediction results show that RF-iPLS is an effective feature wavelength selection method that can reduce the complexity of near-infrared spectral quantitative analysis models and achieve efficient dimensionality reduction.
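The accuracy improvements quoted in the abstract appear to be relative reductions in RMSE between the full-spectrum PLSR model and the RF-iPLS PLSR model. The short check below reproduces them from the reported values under that assumption; the helper name `relative_reduction` is hypothetical and is not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Values reported in the abstract (full-spectrum PLSR vs. RF-iPLS PLSR).
# Assumption: "prediction accuracy improved by X%" means relative RMSE reduction.
rmsec_gain = relative_reduction(2.61, 0.64)  # modeling (calibration) set
rmsep_gain = relative_reduction(2.63, 0.69)  # prediction set

# Share of the full spectrum retained: 12 of 117 wavelength points.
kept_share = 12 / 117 * 100.0

print(f"RMSEC reduction: {rmsec_gain:.1f}%")    # prints "RMSEC reduction: 75.5%"
print(f"RMSEP reduction: {rmsep_gain:.1f}%")    # prints "RMSEP reduction: 73.8%"
print(f"Wavelengths kept: {kept_share:.1f}%")   # prints "Wavelengths kept: 10.3%"
```

Under this reading, the reported 75.5% and 73.8% figures are obtained while keeping roughly one tenth of the original wavelength points.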
陈 蕊,王 雪,王子文,曲 浩,马铁民,陈争光,高 睿. 基于随机森林特征重要性和区间偏最小二乘法的近红外光谱波长筛选方法[J]. 光谱学与光谱分析, 2023, 43(04): 1043-1050.
CHEN Rui, WANG Xue, WANG Zi-wen, QU Hao, MA Tie-min, CHEN Zheng-guang, GAO Rui. Wavelength Selection Method of Near-Infrared Spectrum Based on Random Forest Feature Importance and Interval Partial Least Square Method. SPECTROSCOPY AND SPECTRAL ANALYSIS, 2023, 43(04): 1043-1050.