Edible Oil Classification Based on Molecular Spectra Analysis With SIMCA-SVDD Method

Cite this article

ZHAO Zhong, LI Bin, WU Yan-xian, YUAN Hong-fu. Edible Oil Classification Based on Molecular Spectra Analysis With SIMCA-SVDD Method[J]. Spectroscopy and Spectral Analysis, 2020, 40(8): 2651-2656.
DOI: 10.3964/j.issn.1000-0593(2020)08-2651-06

Copyright: Editorial Office of Spectroscopy and Spectral Analysis


Received: 2019-07-26; Accepted: 2020-01-18


CLC number: O561.3 Document code: A

Edible Oil Classification Based on Molecular Spectra Analysis With SIMCA-SVDD Method

ZHAO Zhong¹,*, LI Bin¹, WU Yan-xian¹, YUAN Hong-fu²

1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China

2. College of Materials Science and Engineering, Beijing University of Chemical Technology, Beijing 100029, China

*Corresponding author e-mail: zhaozhong@mail.buct.edu.cn

Biography: ZHAO Zhong (1970-), Professor, College of Information Science and Technology, Beijing University of Chemical Technology

Fund: National Key Research and Development Plan (2016YFF0102500), Beijing Natural Science Foundation (4172044)

Abstract

Edible oil is a necessity in daily life. The nutritional value and price of the different types of edible oil on the market vary widely. To guard against fraud in the market, effective detection methods are needed to classify the quality of the edible oils on sale. Traditional edible oil classification methods are usually time-consuming and require complex pre-treatment in the laboratory. Molecular spectroscopy reveals the composition and structure of a sample at the molecular level, and molecular spectra analysis is fast and non-destructive. Molecular spectra analysis combined with chemometrics is therefore becoming a popular approach to the rapid classification of edible oil. SIMCA (Soft Independent Modeling of Class Analogy) is widely applied in molecular spectra analysis. However, SIMCA uses the Euclidean distance to classify the features extracted with PCA and the F test, which makes it difficult to classify irregular feature spaces. When the molecular spectral differences among sample types are tiny, as with edible oils, it is usually difficult to identify them with the traditional SIMCA method. SVDD (Support Vector Domain Description) is a support-domain method for one-class classification. It solves a convex quadratic programming problem to obtain a minimal hypersphere that encloses as many target samples as possible. In this work, a molecular spectra analysis method based on SIMCA-SVDD is proposed for the rapid classification of edible oils. To recognize the different types of edible oil, the attenuated total reflectance infrared spectra of four types of edible oil are scanned on an ATR-FTIR spectrometer. SIMCA is applied to extract the classification features T² and Q. Since the extracted features T² and Q are irregularly distributed, SVDD is applied in place of the Euclidean distance of SIMCA to classify them. Because SVDD maps the extracted features to a high-dimensional space through a mapping function, an optimal classification hypersphere can be trained by solving the convex quadratic programming problem to classify the irregularly distributed feature spaces. Comparative experiments on the same molecular spectra samples were carried out with the proposed SIMCA-SVDD method and the traditional SIMCA method. The results verify that the proposed SIMCA-SVDD method classifies clearly better than SIMCA. The proposed method provides a new way to classify edible oil rapidly based on molecular spectra analysis.

Keywords: Edible oil; Molecular spectrum; SIMCA; Euclidean distance; SVDD


Introduction

Edible oil is a necessity of daily life. Many kinds of edible oil are on the market, such as peanut oil, rapeseed oil, soybean oil, corn oil, tea seed oil, sesame oil and olive oil. The nutritional value and price of the different types vary widely with their composition. To prevent market fraud, an effective detection method is needed to classify edible oils. Several methods have been reported for detecting the quality of edible oil^{[1, 2]}. However, these methods are usually time-consuming and require complex pre-treatment. Spectral analysis-based detection methods^{[3, 4]}, which are fast and non-destructive, have been developed to analyze edible oils.

SIMCA (Soft Independent Modeling of Class Analogy)^[5] is a widely applied method in molecular spectra analysis and chemometrics. In SIMCA, PCA and the F test are used to extract T² and Q as the classification features, and the Euclidean distance is then used to classify the extracted features. The region defined by the Euclidean distance, which is a circle in the plane of T² vs Q, cannot accurately classify extracted features that are distributed in irregular feature spaces. Support Vector Domain Description (SVDD)^[6] is a one-class machine learning method based on SVM theory. SVDD maps nonlinear feature data to a high-dimensional space with a kernel function, where a closed and compact sphere is optimized to classify the data. Since the sphere is optimized according to the distribution of the classification data, SVDD can be used to classify irregular feature spaces^{[7, 8]}. In this work, a molecular spectra analysis method based on SIMCA-SVDD is proposed for the rapid classification of edible oil. Comparative experiments on the same samples were carried out with the proposed SIMCA-SVDD method and SIMCA. The results verify that the proposed SIMCA-SVDD method classifies clearly better than SIMCA.

1 Methods

1.1 SIMCA

SIMCA is a supervised pattern recognition method. PCA is applied to decompose the sample matrix of each class as

$X = \bar{X} + TP$ (1)

where $\bar{X}$, T and P represent the mean matrix, score matrix and loading matrix. Cross validation is used to determine the number of principal components A, giving

$X = \bar{X} + T_{A} P_{A} + E$ (2)
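The decomposition in Eq.(1) and Eq.(2) can be sketched with a singular value decomposition. This is a minimal illustration, not the authors' MATLAB implementation: the function name `pca_decompose`, the use of NumPy, and the random demo matrix are all assumptions, and the number of components is simply passed in rather than chosen by cross validation.

```python
import numpy as np

def pca_decompose(X, n_components):
    """Return mean, scores T_A, loadings P_A and residual E so that X = mean + T_A @ P_A + E."""
    X = np.asarray(X, dtype=float)
    x_mean = X.mean(axis=0)              # column means (the mean matrix repeats this row)
    Xc = X - x_mean                      # mean-centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T_A = U[:, :n_components] * s[:n_components]   # score matrix, one row per sample
    P_A = Vt[:n_components]                        # loading matrix, one row per component
    E = Xc - T_A @ P_A                             # residual matrix of Eq.(2)
    return x_mean, T_A, P_A, E

# Hypothetical demo data: 10 samples with 6 variables each (not the oil spectra).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 6))
x_mean, T_A, P_A, E = pca_decompose(X_demo, n_components=2)
```

In SIMCA one such model is fitted per class, so the function would be called once on the calibration spectra of each oil type.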

where E is the residual matrix. The fraction of the total variation that can be predicted is estimated as

$Q = 1 - PRESS / SS$ (3)

where PRESS is the sum of squares of the prediction errors and SS is the sum of squares of the residuals of the previous component. According to the selected A components, the Hotelling T² for observation i is calculated as

$T_{i}^{2} = \sum_{a = 1}^{A} \frac{t_{ia}^{2}}{s_{ia}^{2}}$ (4)

where $t_{ia}$ is the score of observation i on component a and $s_{ia}^{2}$ is the variance of $t_{ia}$. The critical value of T² is calculated with the F test, and the residual standard deviation of observation i is calculated, as

$T_{\max}^{2} = \frac{A (N^{2} - 1)}{N (N - A)} \times F_{crit}$ (5)

$S_{i} = \sqrt{\frac{\sum_{k} e_{ik}^{2}}{K - A}} \times v$ (6)

where N is the number of samples in the calibration set, v is the correction factor, $e_{ik}$ denotes the residual associated with the i-th score value and the k-th loading value in the calibration set, K is the number of loading vectors and $F_{crit}$ is the critical value of the F test. $\frac{T_{i}}{T_{\max}}$ and $\frac{S_{i}}{Q}$ are used to describe the features of the samples. In SIMCA, the Euclidean distance is applied to classify the features of the samples as

$D_{i} = \sqrt{{(\frac{T_{i}}{T_{\max}})}^{2} + {(\frac{S_{i}}{Q})}^{2}}$ (7)
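As a rough illustration of Eq.(4) to Eq.(7), the functions below compute the two SIMCA features and the Euclidean distance for a single observation. All function and argument names are hypothetical, the critical F value is supplied by the caller instead of being looked up, and K is taken as the length of the residual vector, which is an assumption of this sketch.

```python
import math

def hotelling_t2(scores, score_variances):
    """Eq.(4): sum over the A components of t_ia^2 / s_ia^2 for one observation."""
    return sum(t * t / s2 for t, s2 in zip(scores, score_variances))

def t2_limit(n_components, n_samples, f_crit):
    """Eq.(5): critical T^2 value; f_crit is provided by the caller."""
    A, N = n_components, n_samples
    return A * (N * N - 1) / (N * (N - A)) * f_crit

def residual_sd(residuals, n_components, v=1.0):
    """Eq.(6): residual standard deviation of one observation; v is the correction factor."""
    K = len(residuals)  # assumption: K equals the length of the residual vector
    return math.sqrt(sum(e * e for e in residuals) / (K - n_components)) * v

def simca_distance(t_i, t_max, s_i, q):
    """Eq.(7): Euclidean distance in the plane of the two normalized features."""
    return math.hypot(t_i / t_max, s_i / q)
```

Because Eq.(7) weights both normalized features equally, every point with the same `simca_distance` lies on the same circle, which is the limitation the SVDD step is meant to address.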

1.2 SIMCA-SVDD

In this work, SVDD is applied instead of the Euclidean distance to classify the features $\frac{T_{i}}{T_{\max}}$ and $\frac{S_{i}}{Q}$ extracted with SIMCA. For a feature set {x₁, x₂, …, x_n} with n samples, one class is represented by a sphere with center a and radius R, obtained from

$\begin{array}{l} \min F (R, a, \xi_{i}) = \min \left( R^{2} + C \sum_{i = 1}^{n} \xi_{i} \right) \\ \mathrm{s.t.}\ \| x_{i} - a \|^{2} \leq R^{2} + \xi_{i},\ \xi_{i} \geq 0 \end{array}$ (8)

where C is the penalty coefficient and $\xi_{i}$ is a slack variable. According to Eq.(8), the Lagrangian function is defined as

$\begin{array}{l} L (R, a, \xi_{i}, \alpha_{i}, \gamma_{i}) = R^{2} + C \sum_{i = 1}^{n} \xi_{i} - \\ \sum_{i = 1}^{n} \alpha_{i} [R^{2} + \xi_{i} - (\| x_{i} \|^{2} - 2 a \cdot x_{i} + \| a \|^{2})] - \sum_{i = 1}^{n} \gamma_{i} \xi_{i} \end{array}$ (9)

where $\alpha_{i}$ ($\alpha_{i} \geq 0$) and $\gamma_{i}$ ($\gamma_{i} \geq 0$) are Lagrangian multipliers. The center a and the radius R of the class sphere are obtained by solving $\max_{\alpha, \gamma} \min_{R, a, \xi} L (R, a, \xi_{i}, \alpha_{i}, \gamma_{i})$. Setting the partial derivatives of Eq.(9) to zero gives

$\frac{\partial L}{\partial R} = 0 \Rightarrow \sum_{i} \alpha_{i} = 1$ (10)

$\frac{\partial L}{\partial a} = 0 \Rightarrow a = \sum_{i} \alpha_{i} x_{i}$ (11)

$\frac{\partial L}{\partial \xi_{i}} = 0 \Rightarrow C - \alpha_{i} - \gamma_{i} = 0,\ 0 \leq \alpha_{i} \leq C$ (12)

Substituting Eq.(10), Eq.(11) and Eq.(12) into Eq.(9) gives

$L = \sum_{i} \alpha_{i} (x_{i} \cdot x_{i}) - \sum_{i, j} \alpha_{i} \alpha_{j} (x_{i} \cdot x_{j})$ (13)
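The substitution that leads to Eq.(13) can be written out step by step. The grouping of terms below is our own presentation of the standard SVDD derivation and uses only the symbols already defined in Eq.(8) to Eq.(12).

```latex
\begin{align*}
L &= R^{2}\Bigl(1 - \sum_{i} \alpha_{i}\Bigr)
   + \sum_{i} (C - \alpha_{i} - \gamma_{i})\,\xi_{i}
   + \sum_{i} \alpha_{i}\,\| x_{i} - a \|^{2} \\
  &= \sum_{i} \alpha_{i}\,(x_{i} \cdot x_{i})
   - 2\, a \cdot \sum_{i} \alpha_{i} x_{i}
   + \| a \|^{2} \sum_{i} \alpha_{i}
   && \text{by Eq.(10) and Eq.(12)} \\
  &= \sum_{i} \alpha_{i}\,(x_{i} \cdot x_{i}) - \| a \|^{2}
   && \text{by Eq.(10) and Eq.(11)} \\
  &= \sum_{i} \alpha_{i}\,(x_{i} \cdot x_{i})
   - \sum_{i, j} \alpha_{i} \alpha_{j}\,(x_{i} \cdot x_{j})
\end{align*}
```

The first line regroups Eq.(9); the terms in $R^{2}$ and $\xi_{i}$ vanish because their coefficients are zero at the stationary point, so only inner products of the samples remain.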

Replacing the inner product in Eq.(13) with a kernel function K and maximizing L gives

$\max L = \sum_{i} \alpha_{i} K (x_{i}, x_{i}) - \sum_{i, j} \alpha_{i} \alpha_{j} K (x_{i}, x_{j})$ (14)

According to Eq.(14) and the defined C, $\alpha_{i}$ can be solved for every feature sample. The radius R can be calculated from

$\begin{array}{l} R^{2} = \| p - a \|^{2} = K (p, p) - 2 \sum_{i} \alpha_{i} K (p, x_{i}) + \\ \sum_{i, j} \alpha_{i} \alpha_{j} K (x_{i}, x_{j}) \end{array}$ (15)

where p is a support vector. For multi-class classification, the relative distance of a sample z to class i is defined as

$D_{i} = \frac{\| z - a_{i} \|}{R_{i}}, i = 1, \dots, n$ (16)

Each feature sample is assigned to the class with the minimum $D_{i}$ in Eq.(16).
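The training and classification steps of Eq.(14) to Eq.(16) can be sketched as follows. Several things here are assumptions rather than the authors' method: the dual is solved with a simple Frank-Wolfe iteration instead of a quadratic programming solver, only the case C ≥ 1 is handled (where the constraint 0 ≤ α_i ≤ C reduces to α_i ≥ 0), the RBF kernel is written as exp(-‖x-y‖²/(2σ²)), and the toy two-dimensional data are invented for illustration.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel exp(-||x - y||^2 / (2 sigma^2)); this parameterization is an assumption."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def distance2(model, z):
    """Squared feature-space distance from z to the class center, expanded as in Eq.(15)."""
    s = model["sigma"]
    cross = sum(a * rbf_kernel(z, x, s) for a, x in zip(model["alpha"], model["samples"]))
    return max(rbf_kernel(z, z, s) - 2.0 * cross + model["center_norm2"], 0.0)

def train_svdd(samples, sigma, n_iter=500):
    """Approximately maximize Eq.(14) over alpha >= 0 with sum(alpha) = 1 (the case C >= 1)."""
    n = len(samples)
    K = [[rbf_kernel(xi, xj, sigma) for xj in samples] for xi in samples]
    alpha = [1.0 / n] * n
    for t in range(n_iter):
        # Gradient of Eq.(14): K_ii - 2 * sum_j alpha_j K_ij
        grad = [K[i][i] - 2.0 * sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        best = max(range(n), key=lambda i: grad[i])
        step = 2.0 / (t + 2.0)  # standard Frank-Wolfe step size
        alpha = [(1.0 - step) * a for a in alpha]
        alpha[best] += step
    center_norm2 = sum(alpha[i] * alpha[j] * K[i][j] for i in range(n) for j in range(n))
    model = {"samples": samples, "alpha": alpha, "sigma": sigma, "center_norm2": center_norm2}
    # Radius: largest training distance; for an exact solution with C >= 1 this is
    # Eq.(15) evaluated at a support vector on the boundary.
    model["radius2"] = max(distance2(model, x) for x in samples)
    return model

def relative_distance(model, z):
    """Eq.(16): distance to the class center divided by the class radius."""
    return math.sqrt(distance2(model, z) / model["radius2"])

def classify(models, z):
    """Assign z to the class whose sphere gives the smallest relative distance."""
    return min(models, key=lambda label: relative_distance(models[label], z))

# Invented toy features for two classes (not the edible oil data).
class_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
class_b = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
models = {"A": train_svdd(class_a, sigma=0.5), "B": train_svdd(class_b, sigma=0.5)}
print(classify(models, (0.05, 0.05)), classify(models, (1.05, 1.05)))
```

One sphere is trained per class on that class's features alone, and a new sample is compared against all spheres, which is why the relative distance rather than the raw distance is used: it makes spheres of different radii comparable.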

2 Experimental

2.1 Materials

Fifty-four edible oil samples belonging to four types of edible oil were provided by the National Institute of Metrology (NIM) of China. Using the Rank-KS method^[9], 43 samples are chosen as the calibration set and the remaining 11 samples as the validation set. The numbers of calibration and validation samples for each type are shown in Table 1.
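The details of Rank-KS are given in the cited reference and are not reproduced here. As a related illustration only, the sketch below implements the plain Kennard-Stone selection, which picks samples that cover the feature space evenly. The function name and the one-dimensional toy data are hypothetical, and the code is not the Rank-KS procedure used in this work.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kennard_stone(samples, n_select):
    """Plain Kennard-Stone split: return (selected indices, remaining indices); needs n_select >= 2."""
    n = len(samples)
    dist = [[euclidean(a, b) for b in samples] for a in samples]
    # Start from the two samples that are farthest apart.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    first, second = max(pairs, key=lambda p: dist[p[0]][p[1]])
    selected = [first, second]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select and remaining:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda i: min(dist[i][j] for j in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining

# Toy one-dimensional samples, invented for illustration.
toy = [(0.0,), (1.0,), (2.0,), (10.0,)]
calibration_idx, validation_idx = kennard_stone(toy, n_select=3)
```

With the split sizes reported above, the call would select 43 of the 54 spectra for calibration and leave 11 for validation.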

Table 1 Statistics of samples

2.2 Spectra measurement

The infrared spectra of the samples are scanned with an attenuated total internal reflectance Fourier transform infrared (ATR-FTIR) spectrometer (Agilent 5500). The spectra are collected from 650 to 4 000 cm^-1 at a resolution of 4 cm^-1 with 32 scans. Each sample is scanned three times and the average is used for analysis. The spectra of all samples measured on the ATR instrument are shown in Fig.1.

Fig.1 Original spectra of four types of oil samples measured on the ATR spectrometer

2.3 Software

All data have been analyzed with MATLAB 2017a (The Mathworks Inc.).

2.4 Preprocessing

The spectra are preprocessed to eliminate the side effects of surface scattering and optical path changes on the infrared diffuse reflection spectra, and spectral mean centering is applied to the spectral data.

2.5 Evaluation of classification results

Correct classification rate (CCR) is applied to evaluate the qualitative recognition results^[9].
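The exact definition of CCR is given in the cited reference. A minimal sketch, assuming the usual definition (correctly classified samples divided by all samples, as a percentage), is shown below. The function name and the example labels are hypothetical and are not results from this work.

```python
def correct_classification_rate(true_labels, predicted_labels):
    """Percentage of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Invented labels for illustration only.
true_demo = ["A", "A", "B", "C"]
pred_demo = ["A", "B", "B", "C"]
ccr_demo = correct_classification_rate(true_demo, pred_demo)
```

The same function can be applied separately to the calibration and validation sets so that the two rates are reported side by side.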

3 Results and discussion

3.1 Feature extraction with SIMCA

PCA is applied to decompose the preprocessed spectra samples. Cross validation based on PRESS is used to determine the number of principal components. Then, the classification features $\frac{T_{i}}{T_{\max}}$ and $\frac{S_{i}}{Q}$ are calculated. PRESS and Q-T² distributions of the spectra samples are shown in Fig.2.

	Fig.2 PRESS and Q-T² distributions for the spectra samples

3.2 Classification results

The Euclidean distance discrimination for blended oil samples is shown in Fig.3. Euclidean distance discrimination in SIMCA weights the classification features $\frac{T_{i}}{T_{\max}}$ and $\frac{S_{i}}{Q}$ equally, so the discrimination area is an arc in the plane of T² vs Q. For clarity, the data are mirrored into the other three quadrants so that the arc forms a circle. Since the classification features of the edible oil are unevenly distributed, the circular discrimination area can be too large and include features of the wrong class.

	Fig.3 Euclidean distance discrimination for blended oil samples

In this work, the radial basis function (RBF) kernel is applied in SVDD to classify the features $\frac{T_{i}}{T_{\max}}$ and $\frac{S_{i}}{Q}$ that are extracted with SIMCA. The SVDD discrimination for blended oil samples is shown in Fig.4. In SVDD, the kernel function maps the linearly inseparable feature data to a high-dimensional feature space in which they are linearly separable. The bandwidth of the kernel function can be optimized to adapt the classification to irregular feature spaces.

	Fig.4 SVDD discrimination for blended oil samples
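To make the role of the bandwidth concrete, the sketch below evaluates the squared distance between two points after the RBF mapping, which equals 2 - 2K(x, y) because K(x, x) = 1. The kernel form exp(-‖x-y‖²/(2σ²)), the function names, and the sample points are assumptions of this illustration.

```python
import math

def rbf_kernel(x, y, sigma):
    """RBF kernel with bandwidth sigma (assumed parameterization)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def feature_space_distance2(x, y, sigma):
    """Squared distance between the mapped points: K(x,x) - 2K(x,y) + K(y,y) = 2 - 2K(x,y)."""
    return 2.0 - 2.0 * rbf_kernel(x, y, sigma)

# Two illustrative points one unit apart.
x_demo, y_demo = (0.0, 0.0), (1.0, 0.0)
narrow = feature_space_distance2(x_demo, y_demo, sigma=0.1)   # small bandwidth
wide = feature_space_distance2(x_demo, y_demo, sigma=10.0)    # large bandwidth
```

A small bandwidth pushes all mapped samples almost equally far apart, so the boundary hugs individual samples, while a large bandwidth makes the mapping nearly linear and the boundary approaches a plain sphere in the original features. Tuning the bandwidth between these extremes is what lets the boundary follow an irregular feature distribution.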

Comparative experiments on edible oil classification based on molecular spectra analysis were carried out with SIMCA and the proposed SIMCA-SVDD. The classification results are shown in Table 2. Fig.2 shows that the features extracted with SIMCA differ among the edible oils, so they can be used for classification. However, the extracted features are not always linearly separable. The discrimination area defined by the Euclidean distance in SIMCA is a circle, which makes it difficult to classify irregular feature spaces such as linearly inseparable ones. SVDD maps the linearly inseparable feature data to a high-dimensional space with the kernel trick, and the minimum hypersphere is then trained with SQP to include as many class samples as possible. According to the comparative experiments, the blended oil samples in the validation set are recognized accurately with SIMCA-SVDD.

Table 2 Classification results of SIMCA and SIMCA-SVDD

For SIMCA, the decision region is a circle and the radius is its only judging indicator, which is too simple. After feature extraction, the features of a single oil follow a regular distribution and can be distinguished within a regular area. For mixed oil, however, changes in composition make the decision region irregular. The SIMCA-SVDD method can adapt the decision region to this irregularity through the parameters of the kernel function, so better classification results are achieved.

4 Conclusions

In this work, a method of edible oil classification based on molecular spectra analysis with SIMCA-SVDD is proposed. The IR spectra of four types of edible oil are scanned on an ATR-FTIR spectrometer. SIMCA is applied to extract the classification features T² and Q. Instead of classification with the Euclidean distance in SIMCA, SVDD is applied to classify the extracted linearly inseparable features. For single oil samples, both SIMCA and the proposed SIMCA-SVDD method classify the samples well. However, because the composition of the mixed oil and the content of its components vary, SIMCA does not distinguish well between the mixed oil and the single oils, whereas SIMCA-SVDD correctly distinguishes the mixed oils in many samples. The comparative experiment results verify that the proposed method classifies edible oils better than the traditional SIMCA method. The proposed method provides a new way to classify edible oil rapidly based on molecular spectra analysis.

References

[1] Li Lin, Sun Qiuju, Xin Shigang, et al. Applied Mechanics & Materials, 2013, 395-396: 355.
[2] Yoshimi Kitada, Yasuyuki Ueda, Masatoshi Yamamoto, et al. Journal of Liquid Chromatography & Related Technologies, 1985, 8(1): 47.
[3] Xu L, Yu X, Lei L, et al. Food Chemistry, 2016, 202: 229.
[4] CAO Yu-ting, ZHAO Zhong, YUAN Hong-fu, et al. Spectroscopy and Spectral Analysis, 2019, 39(2): 659.
[5] Wold S. Pattern Recognition, 1976, 8(3): 127.
[6] Tax D M J, Duin R P W. Pattern Recognition Letters, 1999, 20(11-13): 1191.
[7] Chen G, Zhang X, Wang Z J, et al. Knowledge-Based Systems, 2015, 90: 129.
[8] Lazzaretti A E, Tax D M J. An Adaptive Radial Basis Function Kernel for Support Vector Data Description. Berlin: Springer International Publishing, 2015. 103.
[9] LIU Wei, ZHAO Zhong, YUAN Hong-fu, et al. Spectroscopy and Spectral Analysis, 2014, 34(4): 947.
