Research on the enhancement of machine fault evaluation model based on data-driven

Recently, deep learning methods for fault diagnosis from data have achieved promising results. However, the performance of most of these methods is difficult to improve once they have reached a certain accuracy. This paper uses data-driven fusion theory to address this problem. Firstly, the diagnostic models are divided into a feature extraction part and a neural network part. Then, four feature extraction methods are fused by pre-allocation. The neural network part consists of three single models, and the weights of the three output results are determined by regression analysis. Experiments show that the accuracy of the diagnostic models is improved. Finally, we combine the two studies and propose a Fusion-Ensemble superposition (FES) model. The AUC value of the model is higher than 98% in most tasks of the DCASE2020 machine failure dataset.


Introduction
Abnormal sound is an essential indicator of whether a machine is malfunctioning. The normal sounds of a working machine are often smooth and regular, but they are accompanied by obviously anomalous sounds when the machine is out of order. Anomalous sounds [1,2] indicate that a machine may have malfunctioned, including ruptured mechanical components, jamming, or failure to complete a specific function [3]. Timely discovery of faults can avoid heavy losses and reduce production costs. Most machine failures develop slowly, and uncertainty makes them difficult to predict, so data collection is extremely difficult. Out-of-distribution (OOD) [4] detection has methods suitable for supervised data and semi-supervised data. Therefore, OOD detection methods based on deep learning are often used for anomalous sound recognition. Many researchers now pay more attention to model innovation, but we find that feature extraction also affects the overall recognition effect. This paper shows the impact on machine fault recognition from two aspects: feature extraction and network structure.
In early work on abnormal sound detection [5], Koizumi et al. [6] proposed using the Gaussian mixture model to calculate anomaly scores [7], and Foggia et al. [5] used audio streams to perform sound detection to determine dangerous situations. However, traditional algorithms cannot handle high-dimensional data, and their feature extraction capabilities are weak. Deep anomaly detection (DAD) has been proposed to solve this problem, and the auto-encoder (AE) is one of the commonly used DAD algorithms. Long short-term memory networks, generative adversarial networks (GANs) [8], and OC-NN [9] have also been widely used in various sound detection scenarios. Suefusa et al. [10] used an interpolation-based deep learning network for abnormal sound detection. The spectrogram with its center frame removed is used as the input of the model, and the interpolated prediction of the removed frame is used as the output of the model. Komatsu et al. [11] proposed using WaveNet and I-Vector to detect abnormal acoustic events based on time, location, and changes in the surrounding environment.
The main contributions of the paper are:
- In audio recognition tasks, different feature extraction methods give different results. Four feature extraction methods are fused to improve the accuracy of machine fault diagnosis.
- The accuracy ceiling of a single model is broken through by a model ensemble.
- A method for machine fault diagnosis based on multi-feature fusion and model ensembles is proposed.
The rest of this paper is organized as follows: Section 2 introduces the data set and evaluation method. Section 3 presents the model structure and method. Section 4 shows the model accuracy and comparative test results.

Dataset and evaluation metrics
The DCASE2020 Task 2 data set was used to verify the performance of the proposed model. The main challenge of this task is to detect unknown anomalous sounds under the condition that only normal sound samples have been provided as training data. In real-world factories, actual anomalous sounds rarely occur and are highly diverse. Therefore, exhaustive patterns of anomalous sounds are impossible to deliberately make and/or collect. This means we have to detect unknown anomalous sounds that were not observed in the given training data. The data set is composed of ToyADMOS and MIMII, both of which are single-channel recordings. All audio clips were down-sampled to 16 kHz, and each is about 10 s long. The normal sound sample data used in Task 2 are divided into six categories: toy-car, toy-conveyor, valve, pump, fan, and slider. The first two are from toy machines, whereas the rest are from real machines [12]. Figure 1 shows the spectrograms of the six groups of samples.
There are several fixed detection indicators for OOD detection. The true positive rate (TPR) is calculated in equation (1), where TP and FN represent true positives and false negatives, respectively:

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{1}$$

The false positive rate (FPR) is calculated in equation (2), where FP and TN indicate false positives and true negatives, respectively:

$$\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}} \tag{2}$$

OOD detection usually uses the area under the curve (AUC) and the partial AUC (pAUC) to evaluate the quality of the model. AUC is the area under the receiver operating characteristic (ROC) curve. The abscissa of the ROC curve is FPR, and the ordinate is TPR. AUC is an indicator for judging the quality of a two-class prediction model. AUC is more commonly used than accuracy and recall rate [13]. AUC can demonstrate the overall performance of the model. A high AUC value indicates that the model's performance is excellent and that the error probability of positive predictions is low. The pAUC is the AUC within a specific false positive rate range.
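As a minimal sketch of these metrics, the following plain-Python functions build the ROC curve from anomaly scores and integrate it with the trapezoidal rule. The function names, the toy labels and scores, and the choice `max_fpr=0.5` are illustrative assumptions, not the evaluation code used in the paper. The pAUC here is simply the area for FPR up to `max_fpr` divided by `max_fpr`; other tools may apply a different normalization.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    labels: 1 = anomalous (positive), 0 = normal (negative).
    scores: anomaly scores; a higher score means "more anomalous".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
        points.append((fp / neg, tp / pos))
    return points


def area_under_roc(points, max_fpr=1.0):
    """Trapezoidal area for FPR in [0, max_fpr], divided by max_fpr."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


# Toy data (made up): four normal clips followed by four anomalous clips.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
pts = roc_points(labels, scores)
auc = area_under_roc(pts)                 # 0.9375 (15 of 16 pairs ranked correctly)
pauc = area_under_roc(pts, max_fpr=0.5)   # 0.875
```

In the toy data, one normal clip (score 0.6) outranks one anomalous clip (score 0.4), which is the single misordered pair that keeps the AUC below 1.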
In the feature extraction, the sampling rate is 16 kHz, the window length is 1024 samples (64 ms), and the hop length is 512 samples (32 ms). A 1024-point FFT (fast Fourier transform) and 128 mel filters are used. The sequence length of a training sample is n_frames = 640 audio frames, and every five frames are concatenated.
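The framing arithmetic can be checked with a short sketch. It assumes no padding at the clip edges and reads "every five frames are connected" as concatenating five consecutive 128-dimensional frames into one input vector; both are assumptions, and the helper names are hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz (from the text)
WIN_LENGTH = 1024     # samples per analysis window (from the text)
HOP_LENGTH = 512      # samples between successive windows (from the text)
N_MELS = 128          # mel filters (from the text)
CONTEXT = 5           # consecutive frames joined into one vector (assumed reading)


def duration_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Duration of n_samples in milliseconds."""
    return 1000.0 * n_samples / sample_rate


def count_frames(n_samples, win=WIN_LENGTH, hop=HOP_LENGTH):
    """Number of full windows that fit in the clip (no padding, an assumption)."""
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


def stack_context(frames, context=CONTEXT):
    """Concatenate each run of `context` consecutive frames into one flat vector."""
    return [
        [value for frame in frames[start:start + context] for value in frame]
        for start in range(len(frames) - context + 1)
    ]


clip_frames = count_frames(10 * SAMPLE_RATE)          # frames in a 10 s clip
dummy = [[0.0] * N_MELS for _ in range(clip_frames)]  # placeholder features
vectors = stack_context(dummy)
```

Under these assumptions a 1024-sample window lasts 64 ms, a 512-sample hop lasts 32 ms, and each stacked vector holds 5 × 128 = 640 values.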

Multi-feature fusion
Four feature extraction methods are applied in the feature extraction part: log-linear [14], Log-Mel [15], HPSS_h [16], and HPSS_p [17]. The principle of Log-Mel is to extract sound features by simulating the structure of the human ear. However, when the sound signal contains both high and low tones, the high tones are masked by the low tones. HPSS (harmonic/percussive source separation) technology was first applied in the music field. Music signals take two forms: one is distributed continuously and smoothly along the time axis, and the other along the frequency axis. These two kinds of music sources are called harmonic sources (HPSS_h) and shock (percussive) sources (HPSS_p) [16].

Log-Mel
Log-Mel adopts the signal energy as the basic feature, and the processed signal can be employed directly as the output feature. This feature is not affected by the nature of the signal and places no restriction on the input signal, so it gives a better recognition effect when the signal-to-noise ratio is low.
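To illustrate the mel scale behind this feature, the sketch below uses the common HTK-style conversion between hertz and mel and takes the logarithm of a band energy. The formula choice, the 8 kHz upper limit (the Nyquist frequency at 16 kHz), and all function names are assumptions for illustration; the paper does not state which mel convention or toolkit was used.

```python
import math


def hz_to_mel(freq_hz):
    """HTK-style mel scale (one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels=128, f_min=0.0, f_max=8000.0):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def log_energy_db(power, eps=1e-10):
    """Log-compress a band energy; eps guards against log(0)."""
    return 10.0 * math.log10(max(power, eps))


centers = mel_center_frequencies()
```

The filters are spaced evenly in mel, so in hertz they are dense at low frequencies and sparse at high frequencies, which mimics the frequency resolution of human hearing.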

HPSS_h
The harmonic source contains a fixed tone, which can form a series of smooth instantaneous envelopes on the frequency. It is smooth and continuous on the time axis and discontinuous on the frequency axis.

HPSS_p
The shock source is concentrated in a short time, forming a series of vertical broadband spectral envelopes on the frequency spectrum, so it is discontinuous on the time axis and smoothly continuous on the frequency axis.
Figure 2 is the structure diagram in which a machine fault recognition model is established by an AE [17] (the structure is shown in Fig. 3). Different feature extraction methods have different application ranges. Log-Mel is the most widely used, so it is the basic feature extraction method. Log-linear is suitable for data with strong correlation, HPSS_p has a good extraction effect for sound with a complete period, and HPSS_h is more inclined to extract the features of discontinuous sounds and bursts. We present a fusion strategy that allocates the best feature extraction method according to the audio characteristics of each task (the scheme is shown in Tab. 1).

Model ensemble
The purpose of a multi-model ensemble is to integrate the advantages of each model through scientific methods and obtain a stronger ability to solve unknown problems [18]. Consequently, multi-model ensembles have attracted increasing attention in practical applications.
An AE compresses the input data into a lower-dimensional representation and decodes it back into the original input data in an unsupervised manner. The encoded data is reconstructed by decoding, and the difference between the reconstructed data and the original input data is the reconstruction error. A large reconstruction error on the training data indicates a poorly trained auto-encoder. Layer-by-layer unsupervised training is used to pre-train the hidden layers between the encoder and the decoder. The backpropagation algorithm is then used to optimize and adjust the parameters of the whole neural network, which improves the learning ability and complements the pre-training. VAE and CAE are both variants of AE.
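The role of the reconstruction error in detection can be sketched as follows: the root mean square error between an input vector and its reconstruction serves as the anomaly score, and a clip is flagged when the score exceeds a threshold. The quantile-based threshold rule and all names are hypothetical; the paper does not specify how its threshold is chosen.

```python
import math


def rmse(original, reconstructed):
    """Root mean square error between an input vector and its reconstruction."""
    squared = [(x - y) ** 2 for x, y in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


def pick_threshold(normal_scores, quantile=0.9):
    """Hypothetical rule: use a high quantile of the scores seen on normal data."""
    ordered = sorted(normal_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_anomalous(original, reconstructed, threshold):
    """Flag the sample when its reconstruction error exceeds the threshold."""
    return rmse(original, reconstructed) > threshold


# Toy vectors (made up): a close reconstruction and a poor one.
x = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.0, 4.1]
bad = [3.0, 0.0, 5.0, 1.0]
threshold = 0.5
flags = (is_anomalous(x, good, threshold), is_anomalous(x, bad, threshold))
```

Because the auto-encoder is trained only on normal sounds, it is expected to reconstruct them well, so a high score suggests a sound unlike the training data.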
The structure of VAE is shown in Figure 5. It contains two encoders, used to calculate the mean and the variance, respectively. Gaussian noise is added to the output of the encoder that calculates the mean so that the decoder becomes robust to noise. A KL loss is applied as a regularizer on the encoder to drive the mean toward 0 and the variance toward 1. The function of the network that calculates the variance is to adjust the intensity of the noise dynamically.
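A minimal sketch of these two ingredients, assuming a diagonal Gaussian latent with the variance parameterized as a log-variance (a common convention, not confirmed by the paper): the closed-form KL divergence to the standard normal, and the noise injection around the mean. The function names are illustrative.

```python
import math
import random


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions.

    Per dimension: 0.5 * (mean^2 + var - log(var) - 1), which is zero
    exactly when mean = 0 and var = 1.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mean, log_var)
    )


def sample_latent(mean, log_var, rng):
    """Add Gaussian noise to the mean; the variance branch scales the noise."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


rng = random.Random(0)  # seeded only to make the sketch repeatable
z = sample_latent([0.5, -0.5], [0.0, 0.0], rng)
penalty = kl_to_standard_normal([0.5, -0.5], [0.0, 0.0])  # 0.25
```

The penalty grows as the mean moves away from 0 or the variance away from 1, which is the regularizing effect described above.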
CAE replaces the Hessian matrix of AE with the Jacobian matrix [22], and other parts are almost the same.
The WaveNet model is a sequence generation model that can directly learn the mapping of a sequence of sample values, so it has a good synthesis effect. At present, WaveNet is applied in speech synthesis, acoustic modeling, and vocoders, and it has great potential in speech synthesis. The structure is shown in Figure 6.
X_1, X_2, X_3 represent the outputs of the three network models, and the final result of the model ensemble is denoted Y. The relationship between them is given by equation (3):

$$Y = a_1 X_1 + a_2 X_2 + a_3 X_3 \tag{3}$$

The weights a_1, a_2, a_3 of the three network models are obtained by linear regression analysis. Finally, the network model parameters are optimized using the weighted training results, which yields the optimal network architecture for machine fault diagnosis.
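The weight fitting can be sketched as an ordinary least-squares problem without an intercept, solved here through the normal equations in plain Python. What the paper uses as the regression target Y is not stated, so the toy outputs and targets below are made up, and the function names are hypothetical.

```python
def solve_linear_system(matrix, vector):
    """Solve matrix * x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [rhs] for row, rhs in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def fit_ensemble_weights(outputs, targets):
    """Least-squares weights a with sum_j a[j] * outputs[i][j] close to targets[i]."""
    m = len(outputs[0])
    gram = [[sum(r[i] * r[j] for r in outputs) for j in range(m)] for i in range(m)]
    moment = [sum(r[i] * t for r, t in zip(outputs, targets)) for i in range(m)]
    return solve_linear_system(gram, moment)


def ensemble_score(weights, model_outputs):
    """Y = a_1 * X_1 + a_2 * X_2 + a_3 * X_3 for one sample."""
    return sum(a * x for a, x in zip(weights, model_outputs))


# Toy data (made up): each row holds (X_1, X_2, X_3) for one training sample.
outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
           [1.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
targets = [0.2, 0.3, 0.5, 1.0, 0.7]
weights = fit_ensemble_weights(outputs, targets)  # close to [0.2, 0.3, 0.5]
y = ensemble_score(weights, [0.4, 0.6, 0.8])
```

The normal equations are adequate for three well-conditioned inputs; if the three model outputs were nearly collinear, a more stable solver would be preferable.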

Fusion-ensemble superposition model
We propose a fusion-ensemble superposition (FES) model based on multi-feature fusion and model ensemble (the model structure is shown in Fig. 7). It is divided into a feature extraction module and a neural network module. The feature extraction module adopts the multi-feature fusion method in Section 3.1, and the neural network module uses the model ensemble method in Section 3.2.
ResNet solves the degradation problem of deep networks through residual learning, which makes it possible to train deeper networks (the structure is shown in Tab. 2). Because it is much easier to learn the residual than to learn the mapping between input and output directly, ResNet converges faster, and the classification accuracy can be improved by adding layers.
MobileFaceNet makes five changes to MobileNetV2: it uses a separable convolution instead of the average pooling layer, trains with the InsightFace loss function, reduces the channel expansion factor, uses PReLU instead of ReLU, and employs batch normalization. The structure is shown in Table 3. Both PReLU and ReLU are activation functions. PReLU retains some information from inputs below zero while still acting as an activation function. Their expressions show the specific difference:

$$\mathrm{ReLU}(x_i) = \max(0, x_i)$$

$$\mathrm{PReLU}(x_i) = \max(0, x_i) + a_i \min(0, x_i)$$

where a_i is learned automatically by the network through feedback during training, and i indexes the channels.
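The difference between the two activations can be seen in a few lines. This is a plain-Python illustration with hand-picked slopes; in a real network the slopes a_i are trainable parameters rather than constants.

```python
def relu(x):
    """ReLU discards every negative input."""
    return max(0.0, x)


def prelu(x, a):
    """PReLU keeps a scaled copy of negative inputs: max(0, x) + a * min(0, x)."""
    return max(0.0, x) + a * min(0.0, x)


def prelu_channels(values, slopes):
    """Apply PReLU channel-wise: values[i] uses its own slope slopes[i]."""
    return [prelu(x, a) for x, a in zip(values, slopes)]


activations = prelu_channels([-1.0, 2.0, -4.0], [0.1, 0.1, 0.5])  # [-0.1, 2.0, -2.0]
```

With a slope of 0, PReLU reduces to ReLU; with a nonzero slope, negative inputs still contribute a signal and a gradient.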
PCA is a statistical method that converts a set of potentially correlated variables into a set of linearly uncorrelated variables through an orthogonal transformation. It is widely used in many fields, such as satisfaction measurement, pattern recognition, and image compression. LOF determines whether a point is an outlier mainly by comparing the density of the point with the densities of its neighbors. Points with low densities are identified as outliers. GMM uses Gaussian probability density functions to quantify the data accurately. It is a model that decomposes the data into multiple normal distribution curves. The ensemble strategy of PLG (PCA, LOF, and GMM) is to convert the outlier scores of the three models into a standardized scale and then average the standardized values.
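The averaging step can be sketched as follows, assuming that "standardized scale" means a z-score (zero mean, unit standard deviation) computed per detector; the paper does not name the exact transform. The three score lists are made-up stand-ins for PCA, LOF, and GMM outputs.

```python
import statistics


def standardize(scores):
    """Z-score a list of outlier scores (assumed meaning of 'standardized')."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0.0:
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


def average_standardized(score_lists):
    """Average the standardized scores of several detectors, sample by sample."""
    standardized = [standardize(scores) for scores in score_lists]
    return [statistics.fmean(column) for column in zip(*standardized)]


# Toy scores (made up) on very different scales for the same three samples.
pca_scores = [0.1, 0.2, 0.9]
lof_scores = [1.0, 1.2, 3.0]
gmm_scores = [10.0, 30.0, 80.0]
combined = average_standardized([pca_scores, lof_scores, gmm_scores])
```

Standardizing first keeps a detector with a large numeric range, such as the third list here, from dominating the average.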

Effect comparison of multi-feature fusion
Table 4 lists the recognition effects of four different feature extraction methods using the same neural network, and the results of the multi-feature fusion model are also listed. The experimental results agree with the pre-allocation in Table 1, supporting the hypothesis about which features suit each task.
In the feature extraction based on HPSS_p, the identification accuracy (AUC/pAUC) of the Pump is 80.9%/64.1%, which is better than the other feature extraction schemes. In slider and valve, the accuracy of HPSS_h is 7.7%/12.7% and 18.3%/15.5% higher than Log-Mel, respectively. In the Pump task, the accuracy of HPSS_p is 6.7%/2.9% higher than Log-Mel. The experimental results show that Log-Mel performs well on signals with loud background sound, weak sound characteristics, or low SNR. Other feature extraction methods perform better than Log-Mel in some tasks. HPSS_h can extract more features for periodic smooth sound. Correspondingly, HPSS_p performs well in extracting the features of irregular sounds. Log-linear is not the best in any task, but it is the most comprehensive and achieves satisfactory results in different situations. In Figure 8, it is obvious that the performance of our model is significantly improved over the single-feature models. Therefore, the multi-feature fusion method is shown to be better than the single-feature method in machine fault diagnosis.
WaveNet is a deep neural network that generates raw audio waveforms and is specially designed for audio. The experimental results show that among the three single-model networks used in the model ensemble network, WaveNet achieved the best results in all projects. As described in Section 3.2, although CAE simply replaces the Hessian matrix in AE, it performs slightly better on periodic stable sound. In the Fan and Pump projects, the AUC of CAE is 0.88% and 2.91% higher than that of VAE, and its pAUC is 10.43% and 6.01% higher than that of VAE. VAE makes a large number of improvements relative to AE, and the experimental results confirm that these improvements are successful. The AUC and pAUC of VAE are 11.23% and 16.09% higher than those of AE in most other projects.

Effect comparison of model ensemble
It can be seen from Figure 9 that the ensemble network composed of three single models is better than any one of them. WaveNet gives the best fault diagnosis performance among the three single models. However, the average AUC of the ensemble network is 1.05% higher than that of WaveNet, and the average pAUC is 2.03% higher. Therefore, the model ensemble method is shown to improve on the single-model method in mechanical fault diagnosis.

Effect comparison of Fusion-Ensemble superposition model
Table 6 lists the recognition effects of the FES network and three comparison networks. ResNet, MobileFaceNet, and PLG have different effects on different projects. Overall, the fusion model is better than the single models. Although PLG only achieved the best results in the valve project, its average AUC was 1.46% and 2.62% higher than ResNet and MobileFaceNet, respectively. The average effects of the two single models differ little. The average AUC of FES was 6.62%, 7.78%, and 5.16% higher than ResNet, MobileFaceNet, and PLG, respectively. Moreover, the minimum value increases by more than 10% for most projects. This shows that our FES model can complete the task of machine fault diagnosis. The specific effect is shown in Figure 10.

Method generality experiment
To test the effect of our proposed method on other data sets, we use the FSDnoisy18k dataset [28]. The comparison model DenseNet-201 is a DenseNet [29] with 201 layers that uses ImageNet [30] for pre-training. Although ImageNet is an image data set, we found that it is very effective for pre-training. The results are shown in Table 7.
The FSDnoisy18k dataset is a multi-class classification task and contains 18532 audio clips across 20 classes, totaling 42.5 h of audio. The clip durations range from 300 ms to 30 s. DenseNet-201 is pre-trained on ImageNet and then trained on the target data set. It is not only a deep learning method but also represents a data fusion method. The results show that the proposed method is also suitable for multi-class classification problems and is 1.2% and 0.8% higher than the data fusion method in average precision and accuracy, respectively.

Conclusion
In the field of mechanical fault diagnosis, vibration signals are often used as data sources, but there are three problems: (1) to ensure the diagnosis effect, vibration sensors need to be deployed at each monitoring position, and the equipment cost is very high; (2) some compact devices do not have space to install vibration sensors; (3) heavy equipment is very stable, and its vibration is not obvious, resulting in inaccurate diagnosis. We propose a fault diagnosis method based on sound characteristics, which can solve these problems. Compared with the traditional method, our method has lower cost and wider applicability.
In this paper, we discuss the influence of multi-feature fusion and model ensemble on the effect of machine fault diagnosis. The experiments show that each feature extraction method is suited to different machine fault types. Selecting an appropriate feature extraction method plays an important role in improving the accuracy of machine fault diagnosis. For tasks with multiple data types, we propose a feature fusion method that configures an appropriate feature extraction method for each type, which helps to improve the overall performance on the task. The multi-model ensemble fuses excellent models through scientific methods to break through the generalization bottleneck of a single model on unknown problems, and it integrates the advantages of each model to obtain the optimal solution to the same problem. An ensemble model is proposed whose average AUC value reaches 95.83%, which is higher than that of any single-model network. The superiority of the ensemble network is further demonstrated. A new mechanical fault diagnosis model, FES, is proposed by combining the two experiments. The results show that the AUC value of the model is more than 98% in most projects and that the model has good fault identification accuracy.

Fig. 1. Log-Mel spectrograms of the various data. The horizontal axis represents time, and the vertical axis represents frequency.

Fig. 3. AE is a special neural network architecture in which the input and the output have the same structure. It is trained in an unsupervised manner to obtain a lower-dimensional representation of the input data. These low-dimensional representations are then reconstructed back into high-dimensional data.

Fig. 2. The input data selects different feature extraction methods according to the correspondence in Table 1, and the features are used to train the AE network. During testing, the root mean square of the reconstruction error is calculated, and the current state is obtained by comparing it with the threshold value.

Fig. 4. Log-Mel is used for data feature extraction to train the VAE, CAE, and WaveNet networks, respectively. The training set is used to collect the output results, and the voting weights of the three networks are determined by regression analysis.

Fig. 5. Variational autoencoder structure diagram, where x is the input data, x′ is the reconstructed data, and μ_x and δ_x are the mean and standard deviation of the normal distribution, respectively.

Fig. 6. Overview of the entire WaveNet architecture. A residual module of the model is shown inside the dotted line. Multiple such modules are stacked together in the network. K is the layer index. Each node of a hidden layer adds its original input value to the value of the activation function and passes the sum to the next layer. The 1 × 1 convolution kernel is used to reduce the number of channels. The post-activation results of every hidden layer are then summed, passed through a series of operations, and transmitted to the output layer. The output layer uses softmax to calculate the probability of each sampling point.

Fig. 8. Comparison line chart of each model: (a) line chart of AUC value; (b) line chart of pAUC value.

Fig. 9. Comparison line chart of each model: (a) line chart of AUC value; (b) line chart of pAUC value.

Table 1. Tasks and feature extraction methods pre-allocation table.

Table 4. AUC/pAUC value summary using the same convolutional neural network and different feature extraction methods.

Table 5 lists the recognition effects of a single network and a multi-model ensemble network.

Table 5. AUC/pAUC value summary using the same feature extraction methods and different convolutional neural networks. All values are in %.

Table 6. The recognition effect is significantly different using the same convolutional neural network and different feature extraction methods. All values are in %.

Cui et al.: Int. J. Metrol. Qual. Eng. 13, 13 (2022)

ResNet and MobileFaceNet each have their own advantages. ResNet works best in the ToyCar and Pump projects, whereas MobileFaceNet achieves the best results in the ToyConveyor, Fan, and Slider projects, especially in the ToyConveyor project.

Fig. 10. Comparison line chart of each model: (a) line chart of AUC value; (b) line chart of pAUC value.