Variable Selection Methods in Spectral Data Analysis
LI Yan-kun1*, DONG Ru-nan1, ZHANG Jin2, HUANG Ke-nan3, MAO Zhi-yi4
1. Department of Environmental Science and Engineering, North China Electric Power University, Hebei Key Lab of Power Plant Flue Gas Multi-Pollutants Control, Baoding 071003, China
2. School of Food Science, Guizhou Medical University, Guiyang 550025, China
3. The 82nd Army Group Hospital of the Chinese People’s Liberation Army, Baoding 071000, China
4. Tianjin Building Material Science Research Academy, Tianjin 300110, China
Abstract: Extracting useful information from massive or high-dimensional data is a major challenge in data analysis and an active research topic. Variable selection techniques extract informative feature variables from large, complex measurement data, which simplifies the multivariate model and can even improve its prediction performance. In spectral analysis, measurement data inevitably contain interfering and irrelevant variables, as well as multicollinearity among variables, all of which affect the robustness and prediction ability of the model. Variable (wavelength) selection methods have therefore advanced considerably in both research and application in spectral analysis. Based on the related literature and the authors' research experience, this paper summarizes the proposal, characteristics, development, categories and comparison of variable selection methods, together with their applications over the past five years, not only in near-infrared spectroscopy but also in mid-infrared, Raman and other spectroscopies. Two elements are vital: the parameters used as criteria or thresholds for evaluating the importance of variables, and the strategies or search paths used to select variables. Each method has its own advantages and limitations, so in practice an appropriate method must be chosen according to the characteristics of both the method and the object under study. The key contents are: (1) a comparison of wavelength selection and wavelength interval selection methods; (2) a summary of the variable selection methods based on PLS model parameters; (3) a classification and overview of variable selection methods according to their strategies for searching and selecting variables. Finally, we discuss the problems that variable selection methods encounter in real systems (such as overfitting and instability) and the corresponding solutions. We also look ahead to the research trends, development prospects and application directions of variable selection methods; in particular, new criteria for evaluating variable importance and new selection strategies still require further research. It is hoped that this paper will help promote follow-up research on and application of variable selection technology.
Key words: Variable selection; Spectral data; Characteristic variable; Redundant information