Kernel Mahalanobis-Driven Clustering for Outlier Detection in
Mid-Infrared Spectroscopy
HU Rui1, 2, LI Yu-jun1, 2*, JIAO Shang-bin1, 2, SUN Peng-cheng1, 2, WU Chen-yan1, 2
1. School of Automation and Information Engineering, Xi'an University of Technology, Xi'an 710048, China
2. Shaanxi Province Complex System Control and Intelligent Information Processing Key Laboratory, Xi'an 710048, China
Abstract: In the quantitative analysis of alkane gas mixtures by infrared spectroscopy, manual preparation of calibration samples is complex, requiring precise control of multi-component gas concentrations, ambient temperature, and gas pressure. Operational deviations can easily cause the measured spectra to deviate from the nominal calibration concentrations, producing anomalous samples. Traditional single-criterion anomaly detection methods struggle to handle complex anomaly patterns in high-dimensional, nonlinear data. To address this problem, this paper proposes a hybrid anomaly detection framework that combines the kernel Mahalanobis distance (KMD) with K-means clustering. By pairing a kernelized feature mapping with dynamic density clustering, the framework overcomes both the covariance-matrix singularity problem and the insufficient sensitivity to local anomalies that arise with high-dimensional samples. The KMD is used to construct a nonlinear high-dimensional feature space and to quantify, through the covariance matrix, the degree of anomaly in the spectrum-concentration mapping; a 95% confidence threshold (χ²_{0.95}) then screens candidate anomalous samples. In parallel, the K-means algorithm partitions the training set into seven sub-clusters (the number is chosen by the elbow rule), and a dynamic threshold, based on the standard deviation of the distance from each test sample to its nearest cluster centroid, is used to reject anomalous samples. The final decision is the logical AND of the two threshold tests. Experiments used a German Bruker Tensor27 spectrometer to collect 938 samples (wavelength range 2.5 to 25 μm, resolution 4 cm⁻¹), with methane and ethane as the target components. The method was validated with a partial least squares (PLS) regression model and compared with the traditional Mahalanobis distance (MD) method. After the anomalous samples were excluded, the mean relative error (MRE) of the methane concentration prediction decreased from 38.29% to 18.77%, an improvement of 11.52 percentage points over the MD method (30.44%). The MRE for ethane decreased from 54.51% to 26.03%, an improvement of 13.39 percentage points over the MD method (39.42%). For both gases the prediction error was thus reduced by more than 50%. The proposed method addresses, in theory, the bottleneck of anomaly detection in high-dimensional spaces and, in practice, proves effective for the quantitative analysis of infrared spectra of complex gas mixtures. Compared with traditional methods, the hybrid framework of kernel Mahalanobis distance and K-means clustering is more robust when handling nonlinear, multidimensional data, and it offers a reliable and effective way to clean anomalous data in the quantitative infrared analysis of alkane gas mixtures.
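The dual-threshold decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the RBF kernel and its width `gamma`, the truncation to the leading `n_components` kernel principal components (used here to keep the covariance invertible), the Wilson-Hilferty approximation to the χ² quantile, the plain Lloyd-style `kmeans` helper, the multiplier `c` in the rule "mean + c × standard deviation" for the clustering threshold, and the synthetic three-dimensional data standing in for real spectra.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """RBF kernel matrix exp(-gamma * ||a - b||^2) between rows of A and B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_kmd(X, gamma, n_components):
    """Centre the training kernel matrix and keep its leading eigenpairs."""
    K = rbf_kernel(X, X, gamma)
    col_mean, all_mean = K.mean(axis=0), K.mean()
    Kc = K - col_mean[None, :] - col_mean[:, None] + all_mean
    lam, alpha = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    keep = np.argsort(lam)[::-1][:n_components]  # assumed truncation
    return {"X": X, "gamma": gamma, "col_mean": col_mean,
            "all_mean": all_mean, "lam": lam[keep], "alpha": alpha[:, keep]}

def kmd_squared(model, Z):
    """Squared Mahalanobis distance of rows of Z in the kernel principal subspace."""
    Kz = rbf_kernel(Z, model["X"], model["gamma"])
    Kzc = (Kz - model["col_mean"][None, :]
           - Kz.mean(axis=1, keepdims=True) + model["all_mean"])
    proj = Kzc @ model["alpha"]
    n = model["X"].shape[0]
    # Score j has training variance lam_j / n, hence the n / lam_j^2 scaling.
    return n * (proj**2 / model["lam"]**2).sum(axis=1)

def chi2_q95(dof):
    """Wilson-Hilferty approximation to the 0.95 quantile of chi-square(dof)."""
    z = 1.6449  # 0.95 quantile of the standard normal distribution
    return dof * (1.0 - 2.0 / (9.0 * dof) + z * np.sqrt(2.0 / (9.0 * dof)))**3

def nearest_center_dist(X, centers):
    """Euclidean distance from each row of X to its nearest centroid."""
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; an empty cluster keeps its previous centroid."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

def detect(X_train, X_test, gamma=0.005, n_components=3, k=7, c=3.0):
    """Flag a test sample only if BOTH criteria mark it as anomalous."""
    model = fit_kmd(X_train, gamma, n_components)
    kmd_flag = kmd_squared(model, X_test) > chi2_q95(n_components)
    centers = kmeans(X_train, k)
    d_train = nearest_center_dist(X_train, centers)
    threshold = d_train.mean() + c * d_train.std()   # assumed form of the rule
    km_flag = nearest_center_dist(X_test, centers) > threshold
    return kmd_flag, km_flag, kmd_flag & km_flag

# Synthetic stand-in data: 300 Gaussian training points, 2 inliers, 2 gross outliers.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 3))
X_test = np.array([[0.0, 0.0, 0.0], [0.3, -0.2, 0.1],
                   [5.0, 5.0, 5.0], [-5.0, 5.0, -5.0]])
kmd_flag, km_flag, flags = detect(X_train, X_test)
print(flags)
```

The AND rule makes the detector conservative: a sample is rejected only when it is both far from the training distribution in the kernel feature space and far from every cluster centroid. All numeric settings here are placeholders and would need tuning on real spectral data.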