Open Access
Issue
Int. J. Metrol. Qual. Eng.
Volume 16, 2025
Article Number 3
Number of page(s) 9
DOI https://doi.org/10.1051/ijmqe/2025001
Published online 07 March 2025

© H. Xu et al., Published by EDP Sciences, 2025

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Fault diagnosis plays a crucial role in manufacturing systems by facilitating early detection of emerging issues, thereby saving valuable time and costs. With the development of intelligent manufacturing, data-driven fault diagnosis has become a research hotspot. In the intelligent manufacturing era, establishing fault models based on large volumes of machinery data has become feasible, providing opportunities for data-driven fault diagnosis methods, which have garnered increasing attention from researchers and engineers.

In the field of fault diagnosis, both time- and frequency-domain signals are used to construct multiple features, which are analyzed using mathematical algorithms. Compared with traditional shallow machine learning models, deep learning models can handle complex high-dimensional functions that are difficult to express using shallow networks, thus offering stronger representation learning capabilities. Deep learning was first applied by Tamilselvan et al. [1] to aircraft engine fault diagnosis in 2013; since then, it has been extensively applied in the field of fault diagnosis. Sparse auto-encoders (SAE), deep belief networks (DBN), convolutional neural networks (CNN), and deep residual networks (DRN) are among the common deep learning models. Kang et al. [2] applied one-dimensional time-domain vibration signals as inputs to a CNN. Shen et al. [3] applied a support vector machine (SVM) to bearing life prediction. Building on the self-attention mechanism for machine translation, Vaswani et al. [4] introduced the transformer model; they integrated information from different subspaces by utilizing multi-head attention, thus improving the encoding of dependency relationships. Guo et al. [5] transformed one-dimensional signals into two-dimensional grayscale images using the continuous wavelet transform and trained a CNN. Chen et al. [6] applied a CNN to gear fault recognition, achieving an average recognition accuracy of 96.8% on their dataset. Jia et al. [7] improved text classification by combining an attention mechanism with a capsule network. Kong et al. [8] achieved planetary gearbox fault diagnosis using a method combining time-frequency fusion and an attention mechanism.

Gearboxes have multiple gear engagements, presenting complex scenarios, and vibration measurements are affected by significant noise, which in turn greatly hampers the extraction of fault features. In addition, large fluctuations in load lead to strongly time-varying vibration signals, resulting in variations in signal features across different time intervals [9,10]. The time-frequency representation of mechanical vibration signals offers detailed information about the operating condition of the equipment, and time-frequency plots constructed using different methods and used as inputs to deep neural networks can lead to differences in the final output [11,12]. This study proposes a fault diagnosis method for gearboxes based on a CNN with a multi-head attention mechanism. In the proposed method, the signals are preprocessed and the one-dimensional signals are fed directly into the network. The CNN and the multi-head attention mechanism are employed for automatic fault feature extraction and recognition from the feature maps. The multi-head attention mechanism is designed with a larger input scale to expand the receptive field, thereby capturing temporal context information more comprehensively and learning the importance of local sequence features. Finally, the proposed method was validated and analyzed using a validation dataset.

2 The composition of a convolutional neural network (CNN)

A CNN [13] is a type of neural network specifically designed to process data with grid-like structures. It has mature applications in various fields such as Natural Language Processing (NLP) [14] and Computer Vision (CV) [15]. A typical CNN comprises an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer (Fig. 1).

Fig. 1. Neural network structure diagram.

2.1 Convolutional layer

In the convolutional layer, one-dimensional data are convolved using a convolutional kernel, which involves performing cross-correlation between the input tensor and the kernel tensor. The mathematical model for one-dimensional convolution is as follows:

$$y(n) = f(n) * g(n) = \sum_{m=0}^{N-1} f(m)\, g(n-m). \tag{1}$$

In the equation, N represents the length of the input signal, and y(n) represents the sequence of convolution results. As shown in Figure 2, one-dimensional convolution can be understood as the output, at a given moment, of the collective effect of multiple inputs to a system.
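As a concrete check of equation (1), the discrete convolution can be computed directly. The following minimal NumPy sketch (an illustration, not code from the paper) implements the sum term by term and verifies it against NumPy's built-in routine.

```python
import numpy as np

def conv1d(f: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Discrete 1-D convolution y(n) = sum_m f(m) g(n-m), as in Eq. (1)."""
    N, M = len(f), len(g)
    y = np.zeros(N + M - 1)
    for n in range(len(y)):
        for m in range(N):
            if 0 <= n - m < M:          # only terms where g(n-m) is defined
                y[n] += f[m] * g[n - m]
    return y

# Cross-check against NumPy's built-in full convolution
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])
assert np.allclose(conv1d(f, g), np.convolve(f, g))
```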

Fig. 2. Convolution principle.

2.2 Activation function

After the convolution operation, an activation function is applied to process the data: it computes the weighted sum plus a bias to determine whether a neuron should be activated, transforming the input signal into an output through differentiable operations. Activation functions introduce nonlinearity to the neural network, enhancing the network's feature representation ability. The most popular activation function is the Rectified Linear Unit (ReLU), which has been widely applied to accelerate the convergence of CNNs. ReLU facilitates the training of the shallower layers by addressing the vanishing-gradient problem that plagued earlier neural networks. Selecting ReLU as the activation function yields the following output:

$$f(x) = \max(0, x). \tag{2}$$

2.3 Pooling layer

In CNNs, the pooling layer primarily downsamples the input features, reducing the number of network parameters and the computational complexity, thereby helping to prevent overfitting.

Max pooling and average pooling are the most common pooling operations. In this study, max pooling was employed, as shown in Figure 3. The CNN adopts local connectivity and weight sharing, while the pooling operation significantly reduces parameter training as well as model complexity [16,17]. Max pooling selects the maximum value within each local region of the input features, which serves as the downsampled representation. This approach retains the most salient information from the input features and reduces their dimensionality, which is crucial for reducing the number of parameters and the computational complexity, especially when dealing with large-scale data.
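A minimal sketch (with illustrative function names, not taken from the paper) of the ReLU activation of equation (2) followed by non-overlapping max pooling, showing how pooling halves the feature dimension while keeping the salient values:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Eq. (2): f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def max_pool1d(x: np.ndarray, width: int) -> np.ndarray:
    """Non-overlapping max pooling: keep the largest value in each window."""
    n = len(x) // width * width            # drop any incomplete tail window
    return x[:n].reshape(-1, width).max(axis=1)

feature_map = np.array([-1.0, 3.0, 0.5, -2.0, 4.0, 1.0])
print(max_pool1d(relu(feature_map), 2))    # -> [3.  0.5 4. ]
```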

Fully connected output layer: the Softmax function is used for the output of multi-class classification. Its mathematical representation is as follows:

$$y_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}. \tag{3}$$

In the equation, $y_i$ represents the probability of the input belonging to the $i$-th class, where $0 < y_i < 1$ for all $i$, and $x_i$ represents the network's output value for the $i$-th class.
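Equation (3) can be sketched in a few lines; shifting the inputs by their maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Eq. (3), shifted by max(x) for numerical stability (same result)."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())   # probabilities in (0, 1) that sum to 1
```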

Fig. 3. Max pooling principle.

2.4 Regularization and loss functions

Cross entropy is a commonly used loss function that evaluates the difference between the probability distribution obtained by the current training and the real distribution. The formula is:

$$L(Y|f(x)) = \sum_{i=1}^{n} Y_i \times \log\!\left(\frac{Y_i}{f(x_i)}\right). \tag{4}$$

where Y represents the true value and f(x) represents the predicted value.

When L2 regularization is required, a regularization term can be added, changing the formula to:

$$L(Y|f(x)) = \sum_{i=1}^{n} Y_i \times \log\!\left(\frac{Y_i}{f(x_i)}\right) + \gamma \sum_{j=1}^{M} \left\| \omega_j \right\|. \tag{5}$$

In the formula, M is the number of parameters in the regularization part, γ is the regularization coefficient, which determines the degree of regularization, and ω is the parameter participating in regularization.
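A hedged NumPy sketch of the loss in equations (4) and (5); the epsilon guard against log(0) is an implementation choice, not from the paper. For one-hot targets $Y$, the divergence of equation (4) reduces to the familiar cross-entropy $-\sum_i Y_i \log f(x_i)$:

```python
import numpy as np

def loss_with_l2(Y, f, weights, gamma=1e-4):
    """Eq. (5): the Eq. (4) divergence between the true distribution Y and
    the predicted distribution f, plus an L2 penalty on the weights.
    For one-hot Y this reduces to the usual cross-entropy -sum(Y * log f)."""
    eps = 1e-12                                    # guard against log(0)
    ce = np.sum(Y * np.log((Y + eps) / (f + eps)))
    l2 = gamma * sum(np.linalg.norm(w) for w in weights)
    return ce + l2

Y = np.array([0.0, 1.0, 0.0])                      # one-hot true label
f = np.array([0.1, 0.8, 0.1])                      # predicted probabilities
print(loss_with_l2(Y, f, [np.ones(5)]))            # ~0.223 plus a small penalty
```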

2.5 Multi-head attention mechanism

The multi-head attention mechanism is an extension of the attention mechanism that enhances the model's capacity for representation learning and feature extraction by simultaneously learning multiple parallel attention heads [18]. It can capture feature information at different levels and from different aspects; the captured aspects are then fused to obtain a comprehensive representation.

The self-attention mechanism is depicted in Figure 4. As shown, it first takes an input sequence $(x_1, x_2, \ldots, x_t)$ and obtains initial embeddings $(a_1, a_2, \ldots, a_t)$. Then, three matrices, $W^Q$, $W^K$, and $W^V$, are multiplied with the embeddings, yielding $q_i$, $k_i$, and $v_i$ for $i \in \{1, 2, \ldots, t\}$. Subsequently, the dot product between $q_t$ and $(k_1, k_2, \ldots, k_t)$ is calculated, followed by a Softmax operation. The resulting weights are multiplied with $(v_1, v_2, \ldots, v_t)$ and summed to obtain the corresponding output.

$$a_t = W x_t \tag{6}$$
$$q_t = W^Q a_t \tag{7}$$
$$k_t = W^K a_t \tag{8}$$
$$v_t = W^V a_t \tag{9}$$
$$\mathrm{Attention}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) \tag{10}$$
$$\mathrm{Head}_i = \mathrm{Attention}_i \, V_i \tag{11}$$

The multi-head attention mechanism involves h different linear projections, obtained through independent learning. Rather than a single attention pooling, it uses these different linear projections to transform the queries, keys, and values, and the corresponding h sets of transformed queries, keys, and values are then fed in parallel into the attention pooling. Specifically, multiple sets of matrices, denoted $W_i^Q$, $W_i^K$, $W_i^V$, perform the respective multiplications, yielding multiple sets of $q_i$, $k_i$, $v_i$, as illustrated in Figure 5. The outputs of these h attention pooling operations are concatenated and transformed through another learnable linear projection, generating the final output.
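The following NumPy sketch traces equations (6)-(11) through h = 4 heads and the final concatenation. Random matrices stand in for the learned projections $W_i^Q$, $W_i^K$, $W_i^V$ and the output projection; all sizes here are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eqs. (10)-(11): softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
t, d_model, h = 8, 32, 4                  # sequence length, embed dim, heads
a = rng.normal(size=(t, d_model))         # embeddings a_1 .. a_t, Eq. (6)
d_k = d_model // h
heads = []
for _ in range(h):                        # independent projections per head
    WQ = rng.normal(size=(d_model, d_k))  # Eqs. (7)-(9)
    WK = rng.normal(size=(d_model, d_k))
    WV = rng.normal(size=(d_model, d_k))
    heads.append(scaled_dot_product_attention(a @ WQ, a @ WK, a @ WV))
WO = rng.normal(size=(d_model, d_model))  # final learnable linear projection
output = np.concatenate(heads, axis=-1) @ WO
print(output.shape)                       # (8, 32)
```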

Fig. 4. Attention connection linear transformation.

Fig. 5. (a) Multi-head attention connection linear transformation. (b) Complete input and output of the multi-head attention mechanism.

2.6 Gearbox diagnosis process

After the model is successfully constructed, the gearbox diagnosis process is initiated, which involves the following steps:

  • Model initialization and parameter update. The model is initialized, and the parameters are updated using a loss function. After pre-training the model, the parameters are saved, and the model is outputted.

  • Model learning with fault features. The saved parameters are reloaded, and the model learns the fault features using a classifier.

  • Data partitioning. The data are divided into a training set, a test set, and a validation set, and those from the training and test sets are used for model training.

  • Model prediction on the validation set. Data from the validation set are used to predict the outputs of the model.

2.7 Fault diagnosis model

This study proposes a fault diagnosis model that incorporates a multi-head attention mechanism into a CNN with a wide first-layer convolutional kernel (WDCNN) [19]. The model leverages the WDCNN's wide kernels in the first convolutional layer together with parallel attention mechanisms. The WDCNN, designed with a wider first layer, was employed to extract fault features from vibration signals. The wide convolutional kernels in the initial layer allow the model to capture a broader range of features, enabling it to obtain more global information in the early stages. Wider convolutional kernels also reduce the number of layers required to achieve the same receptive field, and fewer layers result in shorter propagation paths, thereby mitigating gradient problems. By integrating more contextual information, the model becomes more stable against minor variations in the input data. Incorporating the multi-head attention mechanism enhances the model's ability to capture informative features while suppressing irrelevant ones.

The architecture of the fault diagnosis model is depicted in Figure 6. As shown, the raw data are preprocessed and fed into the convolutional layer, which transforms the signals into a set of feature maps that are then downsampled by the pooling layer. This process is iterated twice, with the multi-head attention mechanism selecting the more discriminative features to enter the fully connected layer. The output of the fully connected layer is activated using the ReLU function and passed to the Softmax layer, which provides probability values for each class; the class with the highest probability is taken as the final result.
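A minimal PyTorch sketch of the Figure 6 pipeline is given below. The 6 attention heads and the 0.45 dropout coefficient anticipate the hyperparameters reported later in Section 3.2, but the channel counts, kernel sizes, and the mean-pooling over time are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class WDCNNAttention(nn.Module):
    """Sketch of the Fig. 6 pipeline: wide-kernel conv -> pool (twice),
    multi-head self-attention over the feature sequence, then FC -> Softmax.
    Channel and kernel sizes are illustrative assumptions."""
    def __init__(self, n_classes=3, n_heads=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=8, padding=28),  # wide first layer
            nn.BatchNorm1d(16), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 48, kernel_size=3, padding=1),
            nn.BatchNorm1d(48), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.attn = nn.MultiheadAttention(embed_dim=48, num_heads=n_heads,
                                          batch_first=True)
        self.dropout = nn.Dropout(0.45)
        self.fc = nn.Linear(48, n_classes)  # Softmax is applied inside the loss

    def forward(self, x):                # x: (batch, 1, signal_length)
        z = self.features(x)             # (batch, 48, t)
        z = z.transpose(1, 2)            # (batch, t, 48) sequence for attention
        z, _ = self.attn(z, z, z)        # self-attention: query = key = value
        z = self.dropout(z.mean(dim=1))  # pool over time, then dropout
        return self.fc(z)                # class logits

model = WDCNNAttention()
logits = model(torch.randn(4, 1, 1024))
print(logits.shape)                      # torch.Size([4, 3])
```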

Fig. 6. Architecture of the fault diagnosis model.

3 Experimental setup and results

3.1 Dataset

The vibration acceleration signal data from the Southern Methodist University test platform were used for the experiment in this study. Acceleration sensors were employed to capture vibration signals from both healthy and faulty gearboxes on the test rig. The following three gearbox health states were simulated: gear tooth breakage, gear wear, and normal condition, as shown in Figure 7. The data were sampled at 10 kHz, with a sampling duration of 10 s. The rotational speed was set to 1420 revolutions per minute (rpm), with the small gear having 15 teeth and the large gear having 110 teeth. The meshing frequency was calculated as follows: (1420/60) × 15 Hz = 355 Hz.

The time-domain signals of the gear faults are shown in Figure 8. From the figures, it can be observed that the fault signals exhibit characteristics of transient impacts.

This study focuses primarily on three categories of gear condition, namely gear surface wear, gear tooth breakage, and normal gears. To classify and identify gear fault signals using CNNs, the original vibration signal data must first be labeled. The data were divided into training, testing, and validation sets in a ratio of 3:1:1. The dataset used in this research consisted of 30,000 samples, divided into three groups of 10,000 samples each. The labels for the gear dataset are presented in Table 1.
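As a sketch of the 3:1:1 partition (an illustration, not the authors' code), the 30,000 samples can be shuffled and sliced as follows:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30_000                                    # total labeled samples
idx = rng.permutation(n)                      # shuffle before splitting
train_idx = idx[: int(0.6 * n)]               # 3 parts -> 18,000 samples
test_idx = idx[int(0.6 * n): int(0.8 * n)]    # 1 part  ->  6,000 samples
val_idx = idx[int(0.8 * n):]                  # 1 part  ->  6,000 samples
print(len(train_idx), len(test_idx), len(val_idx))  # 18000 6000 6000
```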

Fig. 7. Gear transmission and fault diagram: (a) platform, (b) chipped tooth, (c) worn tooth.

Fig. 8. Time-domain signals of gear faults.

Table 1. Gear dataset labels.

Table 2. Main parameters of the model.

3.2 Data processing

For processing, the data were labeled and normalized, and the signals were then converted into tensor format, making them compatible with the model's input requirements.

Normalization of the data: batch normalization (BN) reduces the internal covariate shift and accelerates the training process of deep neural networks [20]. Typically, the BN layer is added after the convolutional layer and before the activation units. Dropout and BN reduce overfitting in the network, allowing for stronger generalization and enabling neurons to learn more robust features [21,22]. The minimum-maximum normalization method is used to scale the elements of a vector to the range [0,1]. The calculation formula is as follows:

$$X_{\mathrm{normal}} = (X - X_{\min}) / (X_{\max} - X_{\min}) \tag{12}$$

where $X$ represents the original value; $X_{\max}$ and $X_{\min}$ represent the maximum and minimum values, respectively; and $X_{\mathrm{normal}}$ represents the normalized value.
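Equation (12) is a one-liner in practice; a minimal sketch:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Eq. (12): scale each element of x into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

signal = np.array([-2.0, 0.0, 3.0, 1.0])
print(min_max_normalize(signal))   # [0.  0.4 1.  0.6]
```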

One-hot encoding: this is the process of converting categorical variables into a form compatible with machine learning algorithms. In this process, a "binary-like" transformation is performed on the categories, which are then used as features for model training.
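A short sketch of one-hot encoding using the Table 1 label mapping (0 = normal, 1 = chipped tooth, 2 = worn tooth); the identity-matrix indexing trick is a common NumPy idiom:

```python
import numpy as np

labels = np.array([0, 2, 1, 0])    # class indices per Table 1
one_hot = np.eye(3)[labels]        # each row selects the matching identity row
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]
```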

The proposed model is a streamlined multi-head attention CNN composed of two convolutional layers, two pooling layers, one dropout layer, one multi-head attention layer, one fully connected layer, and one Softmax classification layer. The first convolutional layer uses wide kernels to effectively suppress the influence of noise on feature extraction. The model's performance is determined by its hyperparameters: a learning rate of 0.01, a Dropout coefficient of 0.45, and an L2 regularization coefficient of 0.0001 to prevent overfitting. The Adam optimizer was employed for parameter optimization, with a batch size of 128 and 60 training iterations. Other model parameters were adjusted dynamically to achieve higher accuracy and stronger generalization.
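A hedged training-loop sketch wiring up the stated hyperparameters (learning rate 0.01, batch size 128, 60 iterations, Adam with an L2 term via weight decay). It assumes the WDCNNAttention sketch from Section 2.7 is in scope, and random tensors stand in for the labeled gear signals:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the labeled, normalized gear signals
train_dataset = TensorDataset(torch.randn(512, 1, 1024),
                              torch.randint(0, 3, (512,)))
loader = DataLoader(train_dataset, batch_size=128, shuffle=True)

model = WDCNNAttention()                        # sketch from Section 2.7
criterion = nn.CrossEntropyLoss()               # cross-entropy loss, Eq. (4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                             weight_decay=1e-4)  # L2 term, cf. Eq. (5)

for epoch in range(60):                          # 60 passes over the data
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```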

3.3 Result analysis

In this paper, accuracy and the loss function are used as the performance metrics for the model. For fault diagnosis tasks, accuracy represents the model's ability to correctly identify normal states and the various fault conditions; high accuracy means that the model can correctly classify the majority of cases. Loss is another key performance metric that measures the difference between the model's predicted values and the true values. During training, the loss function is used to optimize the model parameters so as to minimize the loss value. The training loss is the loss value of the model on the training dataset, representing the model's fit to that dataset. The validation loss ("val loss") is the loss value of the model on the validation dataset and is used to evaluate the model's performance on unseen data.

After establishing the basic model and assigning rough initial values to the hyperparameters, the model was trained for the first time. As can be seen from Figure 9, the loss function on the training set had not completely flattened, and the curve for the test set reflects the model's performance on unfamiliar data. The larger the loss value, the worse the model's generalization; the model parameters therefore needed to be optimized.

To obtain the optimal model, the parameters of the network were adjusted, including the batch size, the number of heads in the multi-head attention mechanism, and the Dropout value. The classification accuracy was then compared to select the optimal values.

  • Larger batch sizes can improve training speed because each parameter update is calculated from more samples in a single iteration, allowing more efficient use of computational resources. Larger batch sizes also reduce randomness during training, leading to more stable parameter updates, as shown in Figure 10. However, overly large batch sizes may result in excessive memory consumption, preventing the model from being trained within GPU memory, while smaller batch sizes may increase training time, as each pass over the data requires more parameter updates. Selecting an appropriate batch size is therefore crucial.

  • The impact of Dropout: if the value is set too low, the model may underfit, meaning it cannot fully learn the characteristics of the data, which degrades performance. If the value is set too high, the model may be over-regularized, losing too much information to learn sufficient features, which in turn harms both performance and generalization. The Dropout value is typically taken between 0.3 and 0.9. The model was therefore analyzed at values of 0.3, 0.5, 0.7, and 0.9, and the analysis in Figure 11 shows that a Dropout value of 0.5 best balances training performance and generalization.

The impact of the number of heads in multi-head attention: increasing the number of attention heads enhances the model's ability to represent the diversity of the input, as each head can focus on different parts of the input, thereby capturing more feature information. However, more heads also raise the computational cost, as each head requires additional resources to compute attention weights and perform weighted summations; consequently, adding heads may slow down training and inference. As shown in Figure 12, the impact on training time is minimal, but the effect on accuracy is significant: overfitting occurred when the number of heads reached 7. After comprehensive consideration, using 6 heads strikes a balance between accuracy and training time.

The model uses a small number of network layers and attention heads, so the hardware cost of computation is low, and the impact of computational complexity is not considered further.

In this study, a multi-head attention mechanism-based CNN with a wide kernel in the first layer was constructed for the original gear fault data. The accuracy and loss function curves were analyzed and compared with those of a conventional CNN.

Through model training, the accuracy trend of the CNN was plotted. Within the first 20 iterations, the accuracy of the model increased significantly. With further iterations, the growth rate of accuracy gradually decreased and eventually stabilized. The model accuracy on both the training and validation sets reached a stable level, demonstrating the model's ability to effectively capture the gear fault features and perform accurate classification and recognition.

Figure 13 illustrates the significant advantage of the proposed method on the validation set. The accuracy on the validation set exceeded 90%, while a plain CNN achieved only around 80%, indicating that the proposed method is robust against noise signals and possesses stronger generalization capabilities for gear fault classification tasks. Of note, the training and validation accuracy curves of the proposed method were smooth, showing no signs of overfitting. This indicates good generalization ability and stability of the model during training. The performance of the conventional CNN model is depicted in Figure 14, revealing weaker generalization and a slower improvement in accuracy during training compared with the proposed model.

The confusion matrix provides a visual representation of the ability of a model to recognize different types of gear faults. The multi-class confusion matrix introduced in the proposed model illustrates the errors and classification outcomes between the predicted and actual values, as shown in Figure 15.

As shown in Figure 15, the recognition rate for normal samples in the test set is high, whereas the rates for the chipped-tooth and worn-tooth cases were relatively lower and could still be improved, though they exceeded 93% (label 0 denotes normal, 1 chipped tooth, and 2 worn tooth). It can therefore be concluded that the model constructed in this study exhibits a high fault recognition rate.

To further demonstrate the model's ability to learn the different vibration data features of the gearbox, t-Distributed Stochastic Neighbor Embedding (t-SNE) was used for dimensionality reduction. t-SNE represents high-dimensional datasets in a low-dimensional (two- or three-dimensional) space, enabling visualization and comparative analysis. Specifically, the features extracted by the deep learning models were mapped onto a two-dimensional space, as shown in Figure 16.
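A minimal scikit-learn sketch of this kind of t-SNE visualization; random arrays stand in for the penultimate-layer activations and the fault labels, so the plot itself is illustrative only:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders for the learned features (samples x dim) and fault classes
features = np.random.randn(300, 48)
labels = np.random.randint(0, 3, 300)

embedded = TSNE(n_components=2, perplexity=30,
                random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="viridis", s=8)
plt.title("t-SNE projection of learned features")
plt.show()
```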

To verify the generalization of the model and its superiority over other models, data that had not been input to the model were taken from the original dataset. In the experiment, different models were compared to demonstrate that the model proposed in this paper achieves higher accuracy. To reduce the accidental impact of the initial weights, the comparative experiment was conducted four times, and the results are shown in Figure 17. While a conventional convolutional neural network also yields good results, the method proposed in this paper, which incorporates attention mechanisms and residual connections, significantly improves accuracy, with minimal impact from initialization.

Fig. 9. Loss curve of the model with undetermined hyperparameters.

Fig. 10. Accuracy for different batch sizes.

Fig. 11. Accuracy for different Dropout values.

Fig. 12. The impact of the number of attention heads.

Fig. 13. Accuracy trend of the proposed model.

Fig. 14. Accuracy trend of the ordinary CNN model.

Fig. 15. Multi-class confusion matrix.

Fig. 16. (a) Input feature distribution. (b) Output feature distribution.

Fig. 17. Comparison of accuracy among different models.

4 Conclusions

In this study, a multi-head attention mechanism-based CNN model with wide kernels in the first layer was constructed for fault diagnosis. The proposed model operates on normalized data and eliminates the need for manual feature engineering, thus simplifying the diagnostic process. Deep feature extraction was performed on the vibration signals of the gearbox using operations such as convolution, pooling, and dropout. The inclusion of the multi-head attention mechanism allowed for more diverse and comprehensive feature extraction, with the accuracy on the training set exceeding 98% and that on the validation set being close to 80%. The proposed model exhibited better generalization capabilities and robustness against interference from complex signals than traditional CNNs. Multi-head attention mechanisms can dynamically learn the importance of different parts of the input sequence, yielding a more nuanced and context-aware feature representation and enhancing the model's ability to distinguish relevant from irrelevant features. In industrial settings, the combined use of multi-head attention and CNNs can significantly improve the monitoring and diagnosis of machinery: the ability to detect subtle fault signatures amidst noisy signals can prevent unexpected downtime and costly repairs.

Acknowledgment

The work was supported by the National Natural Science Foundation of China (No. 52075561).

Funding

Funding for this paper is supported by the National Natural Science Foundation of China (No. 52075561).

Conflicts of interest

No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication.

Data availability statement

The data in this paper are obtained by simulation analysis.

Author contribution statement

Hang Xu: Conceptualization, Formal Analysis, Project Administration, Funding Acquisition, Writing-Review & Editing; Huawei Li: Investigation, Writing-Original Draft, Software; Shufeng Yang: Visualization, Investigation, Data Curation; Jianghong Cui: Formal Analysis, Validation; Youhua Li: Supervision; Yuanchun He: Resources; Guiping Xie and Yaoting Wu: Methodology.

References

  1. T. Tamilselvan, P. Wang, Failure diagnosis using deep belief learning based health state classification, Reliab. Eng. Syst. Saf. 115, 124–135 (2013)
  2. J. Kang, Y.J. Park, Novel leakage detection by ensemble CNN-SVM and graph-based location in water distribution systems, IEEE Trans. Ind.
  3. Z.J. Shen, X.F. Chen, Z.J. He, Remaining life predictions of rolling bearing based on relative features and multivariable support vector machine, J. Mech. Eng. 49, 183–189 (2013)
  4. A. Vaswani, N. Shazeer, N. Parmar, Attention is all you need, in Proceedings of the 31st Conference on Neural Information Processing Systems, Curran Associates Inc., Long Beach, USA, 2017, pp. 5998–6008
  5. M.F. Guo, X.D. Zeng, D.Y. Chen, Deep-learning-based earth fault detection using continuous wavelet transform and convolutional neural network in resonant grounding distribution systems, IEEE Sens. J. 18, 1291–1300 (2018)
  6. Z.Q. Chen, C. Li, R.V. Sanchez, Gearbox fault identification and classification with convolutional neural networks, Shock Vibrat. 2, 1–10 (2015)
  7. X.D. Jia, L. Wang, Text classification model based on multi-head attention capsule networks, J. Tsinghua Univ. (Science and Technology) 60, 415–421 (2020)
  8. Z.Q. Kong, L. Deng, B.P. Tang, Deep learning planetary gearbox fault diagnosis method based on time-frequency fusion and attention mechanism, J. Instrum. Measur. 40, 221–227 (2019)
  9. L.Y. Jing, M. Zhang, P. Li et al., A convolutional neural network based feature learning and fault diagnosis method for the condition monitoring of gearbox, Measurement 111, 1–10 (2017)
  10. D.D. Li, H.Y. Wang, Feature extraction and detection of planetary gearbox faults based on unsupervised feature learning, Power System Technol. 42, 3805–3811 (2018)
  11. D. Verstraete, A. Ferrada, E.L. Droguett, Deep learning enabled fault diagnosis using time-frequency image analysis of rolling element bearings, Shock Vibrat. 1–17 (2017)
  12. C. Lu, Z.Y. Wang, B. Zhou, Intelligent fault diagnosis of rolling bearing using hierarchical convolutional network based on health state classification, Adv. Eng. Inform. 32, 139–151 (2017)
  13. S.B. Li, G.K. Liu, X.H. Tang et al., An ensemble deep convolutional neural network model with improved DS evidence fusion for bearing fault diagnosis, Sensors 17, 1–19 (2017)
  14. Y. Lecun, L. Bottou, Gradient-based learning applied to document recognition, Proc. IEEE 86, 2278–2324 (1998)
  15. Y.X. Tan, H.G. Yao, Deep capsule network handwritten digit recognition, Int. J. Adv. Network Monitor. Controls 5, 1–8 (2021)
  16. J.W. Chen, Z.G. Liu, H.R. Wang, Automatic defect detection of fasteners on the catenary support device using deep convolutional neural network, IEEE Trans. Instrument. Measur. 67, 257–269 (2018)
  17. H.D. Shao, H.K. Jiang, H.Z. Zhang, Electric locomotive bearing fault diagnosis using a novel convolutional deep belief network, IEEE Trans. Ind. Electr. 65, 2727–2736 (2018)
  18. C. Sun, M. Zhang, R.J. Wu et al., A convolutional recurrent neural network with attention framework for speech separation in monaural recordings, Sci. Rep. 11, 1434 (2021)
  19. W. Zhang, Research on Bearing Fault Diagnosis Algorithm Based on Convolutional Neural Network, Dissertation, Harbin Institute of Technology, Harbin (2017)
  20. A. Gibson, J. Patterson, Deep Learning: A Practitioner's Approach, O'Reilly Media, Boston (2017), pp. 324–325
  21. A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of Advances in Neural Information Processing Systems, Curran Associates Inc., Red Hook (2012), pp. 1097–1105
  22. J.T. Huang, J.Y. Li, Y.T. Gong, An analysis of convolutional neural networks for speech recognition, in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Press, Piscataway (2015), pp. 4898–4993

Cite this article as: Hang Xu, Huawei Li, Shufeng Yang, Jianghong Cui, Youhua Li, Yuanchun He, Guiping Xie, Yaoting Wu, Gearbox fault diagnosis convolutional neural networks with multi-head attention mechanism, Int. J. Metrol. Qual. Eng. 16, 3 (2025), https://doi.org/10.1051/ijmqe/2025001

