%A LIU Xin-peng, SUN Xiang-hong, QIN Yu-hua, ZHANG Min, GONG Hui-li
%T Research on t-SNE Similarity Measurement Method Based on Wasserstein Divergence
%0 Journal Article
%D 2023
%J SPECTROSCOPY AND SPECTRAL ANALYSIS
%R 10.3964/j.issn.1000-0593(2023)12-3806-07
%P 3806-3812
%V 43
%N 12
%U https://www.gpxygpfx.com/CN/abstract/article_13620.shtml
%8 2023-12-01
%X Near-infrared (NIR) spectra are high-dimensional, highly redundant, and nonlinear, which seriously degrades similarity measurement between samples. This paper proposes Wt-SNE, a t-distributed stochastic neighbor embedding algorithm based on the Wasserstein divergence. Following the manifold learning approach, a Gaussian distribution converts the distances between high-dimensional data points into a probability distribution, and a t-distribution, which has heavier tails, represents the probability distribution of the corresponding points in the low-dimensional space. The probability distribution of the high-dimensional data is embedded into the low-dimensional space to reconstruct the low-dimensional manifold structure. The Wasserstein divergence measures the difference between the probability distributions in the two spaces, and reducing its value increases the similarity of the two distributions, which achieves the dimensionality reduction of the high-dimensional data. To verify the effectiveness of Wt-SNE, the paper first projects tobacco NIR spectral data into a low-dimensional space and compares the result with PCA, LPP, and t-SNE. The results show that the boundaries between sample categories are clearer after dimensionality reduction with Wt-SNE. Second, KNN, SVM, and PLS-DA classifiers were used to predict tobacco origin from the reduced-dimensional data, with accuracies of 93.8%, 91.5%, and 92.7%, respectively, indicating that the reduced-dimensional data not only reconstruct the spatial structure of the original spectra but also retain the similarity relationships between samples.
Finally, tobacco from a particular cigarette formula was selected as the target for single-material replacement, and replacement samples were chosen according to the Marginal distance between the candidate samples and the target samples. The experiments showed that the replacement tobacco selected by Wt-SNE was the most similar to the target tobacco: its chemical composition contents, such as nicotine and total sugar, differed less from those of the target tobacco, and its aroma, smoke, and taste scores were highly consistent with those of the target. The method can effectively measure the similarity between tobacco NIR spectra and provides a strong basis for maintaining cigarette formulas.
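
The similarity construction described in the abstract (a Gaussian kernel in the high-dimensional space, a Student-t kernel in the low-dimensional space) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation. The function names, the single fixed bandwidth `sigma`, and the toy data are all assumptions, and standard t-SNE tunes a per-point bandwidth from a perplexity target, which is omitted here. The abstract does not give the exact form of the Wasserstein divergence, so the sketch compares the two distributions with the Kullback-Leibler divergence of standard t-SNE, which is the term Wt-SNE is described as replacing.

```python
import math

def squared_distances(points):
    """Pairwise squared Euclidean distances as a nested list."""
    n = len(points)
    return [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
             for j in range(n)] for i in range(n)]

def normalize(weights):
    """Scale a matrix of non-negative weights so that all entries sum to 1."""
    total = sum(map(sum, weights))
    return [[v / total for v in row] for row in weights]

def gaussian_affinities(X, sigma=1.0):
    """Joint probabilities P from a Gaussian kernel.
    Assumption: one fixed bandwidth instead of a perplexity search."""
    d2 = squared_distances(X)
    n = len(X)
    return normalize([[0.0 if i == j else math.exp(-d2[i][j] / (2 * sigma ** 2))
                       for j in range(n)] for i in range(n)])

def student_t_affinities(Y):
    """Joint probabilities Q from a Student-t kernel with one degree of freedom."""
    d2 = squared_distances(Y)
    n = len(Y)
    return normalize([[0.0 if i == j else 1.0 / (1.0 + d2[i][j])
                       for j in range(n)] for i in range(n)])

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the objective of standard t-SNE (not the paper's divergence)."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)

# Toy data (hypothetical): four 3-D "spectra" and a candidate 2-D embedding.
X = [[0.0, 0.1, 0.2], [0.1, 0.0, 0.3], [2.0, 2.1, 1.9], [2.2, 1.9, 2.0]]
Y = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]]

P = gaussian_affinities(X)
Q = student_t_affinities(Y)
mismatch = kl_divergence(P, Q)
```

An optimizer would move the rows of `Y` to reduce `mismatch`. In the method the abstract describes, a Wasserstein divergence between `P` and `Q` takes the place of `kl_divergence` in that role.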