Drugs Identification Using Near-Infrared Spectroscopy Based on Random Forest and CatBoost
JIANG Ping1, LU Hao-xiang2, LIU Zhen-bing2*
1. School of Computer and Information Technology, Guangxi Police College, Nanning 530028, China
2. College of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
Abstract: Drug quality bears directly on public health and on the national interest, and the rapid, effective identification of drug quality is extremely important to fast-moving economic and social development. Spectral analysis is accurate, fast and non-contaminating to samples, and it is widely used in the chemical industry, petroleum, medicine and other areas important to people's livelihood. To address the low accuracy, slow identification and poor stability of traditional drug identification models, a spectrometer was used to collect near-infrared spectra of drugs, so that samples are measured without contamination, and random forest was then combined with CatBoost to classify and identify the drugs quickly and accurately. The proposed method first uses Random Forest (RF) to screen the characteristic wavelengths in the spectral data collected by the spectrometer, eliminating irrelevant wavelengths and retaining those that best characterize the sample properties. An Extreme Learning Machine (ELM) then serves as the weak classifier of CatBoost and analyzes the screened characteristic wavelengths for drug attribute identification. Because ELM contains only one hidden layer and requires no iterative optimization, the identification model runs quickly, while CatBoost improves identification accuracy by integrating the weak classifiers. To evaluate the proposed model, training sets of different sizes were randomly selected from the drug spectral data, the experiments were run independently, and the mean of 10 runs was taken as the final result.
In addition, the proposed model was compared with Back Propagation (BP) with CatBoost, Support Vector Machine (SVM), BP, ELM, Summation Wavelet Extreme Learning Machine (SWELM) and Boosting to evaluate its performance further. The classification results for training sets of different sizes show that, as the training set grows, the classification accuracy reaches as high as 100% and the standard deviation of the predictions tends to 0. The experimental results show that the proposed RF-CatBoost identification model achieves higher classification accuracy, faster speed and stronger robustness than the comparison methods on drug data sets of different sizes. It can be widely used for the accurate identification of drug categories and thus for effective supervision of drug quality.
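The two stages named in the abstract, RF wavelength screening followed by an ELM classifier, can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic "spectra", the helper names (`select_wavelengths`, `ELM`, `make_synthetic_spectra`), the number of trees, the number of hidden units and the tanh activation are all assumptions made for illustration, and the boosting stage that combines several ELMs is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_wavelengths(X, y, n_keep, seed=0):
    """Rank wavelengths by random-forest importance and keep the top n_keep."""
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, y)
    ranked = np.argsort(forest.feature_importances_)[::-1]
    return np.sort(ranked[:n_keep])


class ELM:
    """Single-hidden-layer ELM: random hidden weights, closed-form output weights."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Hidden-layer weights are drawn once at random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # Output weights come from a least-squares solve, with no iteration.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        scores = self._hidden(X) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]


def make_synthetic_spectra(n_per_class=60, n_wavelengths=100, seed=0):
    """Hypothetical stand-in for NIR data: two classes differing in one band."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(2 * n_per_class, n_wavelengths))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, 40:50] += 2.0  # class 1 responds more strongly in band 40..49
    return X, y


X, y = make_synthetic_spectra()
selected = select_wavelengths(X, y, n_keep=10)
model = ELM(n_hidden=50).fit(X[:, selected], y)
train_accuracy = float(np.mean(model.predict(X[:, selected]) == y))
```

The closed-form solve in `fit` is what the abstract refers to when it says that ELM needs no iterative optimization: only the output weights are estimated, in a single pseudo-inverse step.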
Keywords: Near-infrared spectroscopy; Random Forest; Extreme learning machine; CatBoost
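The evaluation protocol described in the abstract (randomly selected training sets of different sizes, independent runs, mean of 10 runs, standard deviation as a stability measure) might be sketched as follows. The nearest-centroid classifier is only a stand-in so that the example is self-contained; it is not the model proposed in the paper, and the synthetic data, training-set sizes and function names are assumptions.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (not the paper's model): assign the nearest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]


def repeated_random_split(X, y, train_size, n_runs=10, seed=0):
    """Mean and standard deviation of test accuracy over n_runs random splits."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_runs):
        order = rng.permutation(len(y))
        train, test = order[:train_size], order[train_size:]
        pred = nearest_centroid_predict(X[train], y[train], X[test])
        accuracies.append(np.mean(pred == y[test]))
    return float(np.mean(accuracies)), float(np.std(accuracies))


# Hypothetical two-class data standing in for drug spectra.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 30))
y = np.repeat([0, 1], 60)
X[y == 1, :5] += 2.0

# One (mean accuracy, standard deviation) pair per training-set size.
results = {size: repeated_random_split(X, y, size) for size in (20, 40, 80)}
```

Reporting the standard deviation alongside the mean is what allows a statement such as "the prediction standard deviation tends to 0" to be checked for each training-set size.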
蒋 萍,路皓翔,刘振丙. 随机森林结合CatBoost的近红外光谱药品鉴别[J]. 光谱学与光谱分析, 2022, 42(07): 2148-2155.
JIANG Ping, LU Hao-xiang, LIU Zhen-bing. Drugs Identification Using Near-Infrared Spectroscopy Based on Random Forest and CatBoost. SPECTROSCOPY AND SPECTRAL ANALYSIS, 2022, 42(07): 2148-2155.