Early warning signals of failures in building management systems

In the context of sensor data generated by Building Management Systems (BMS), early warning signals are still an unexplored topic. The early detection of anomalies can help prevent malfunctions of key parts of a heating, ventilation and air conditioning (HVAC) system that may lead to a range of BMS problems, from significant energy waste to fatal errors in the worst case. We analyse early warning signals in BMS sensor data for early failure detection. In this paper, the studied failure is a malfunction of one specific Air Handling Unit (AHU) control system that causes temperature spikes of up to 30 degrees Celsius. The spikes are due to overreaction of the heating and cooling valves in response to an anomalous temperature change caused by the pre-heat coil during the winter period in a specific area of a manufacturing facility. For this purpose, variance, lag-1 autocorrelation function (ACF1), power spectrum (PS) and variational autoencoder (VAE) techniques are applied to both univariate and multivariate scenarios. The univariate scenario considers the application of these techniques to the control variable only (the one that displays the failure), whereas the multivariate analysis considers the variables affecting the control variable for the same purpose. Results show that anomalies can be detected up to 32 hours prior to failure, which gives BMS engineers sufficient time to prevent a failure, so that a proactive approach to BMS failures can be adopted instead of a reactive one.


Introduction
Digital transformation involves changes to key business operations which affect products and processes [1], and data analytics and machine learning play a key role in these processes [2]. In recent years, the development of connectivity and flows of information between devices and sensors has provided abundant data. In order to extract value from this data, Early Warning Signals (EWS) can be used to reduce system downtime, optimise capacity and reduce operational costs.
Building Management Systems (BMS) present advantages for energy control and comfort policy management, such as identifying locations of potential energy waste, decreasing equipment operating cost, providing indoor environmental safety and comfort through HVAC system control, as well as control of water consumption, elevators, etc. This paper focuses on the early detection of failures in the temperature control of zones where a comfort policy has to be maintained. The zone temperature is often the variable in which a failure of the heating or cooling system becomes visible. An example could be a failure at a certain point of the system (fan coil unit, ventilation, heating, etc.) that causes the temperature of a zone to decrease well below its comfort policy.

Fault detection and diagnosis
As a part of this study, we consider that the EWS methodologies applied could be relevant to the diagnosis of system errors. Some previous work on AHU fault diagnosis has been reported in the literature. Reference [3] describes the application of Artificial Neural Networks (ANNs) to fault diagnosis in an AHU by using residuals of system variables to quantify the dominant symptoms of fault modes of operation. Following the same approach, reference [4] proposed AHU subsystem level fault detection using a General Regression Neural-Network (GRNN), residual generation and fault detection and diagnosis. A novel feature extraction technique to extract temperature and power associated features from high-dimensional and unstructured terminal unit data is presented in [5], to diagnose faulty HVAC in an automatic and remote manner. Reference [6] uses Air handling unit Performance Assessment Rules (APAR), in which control signals determine the mode of operation of the AHU. A subset of expert rules which correspond to that mode is then evaluated to determine whether a fault exists. In the review of fault detection and diagnosis methodologies carried out in [7], various Fault Detection and Diagnosis (FDD) methods are described to illustrate the use of standard evaluation parameters for improving the performance of AHUs. This work divides FDD methods into three main categories, namely analytical-based methods, knowledge-based methods, and data-driven methods. In [8], an approach for clustering air conditioning zones of influence together shows how room areas, weather conditions, and air conditioning settings affect the air conditioning power consumption of rooms in real life. In the context of building energy performance, reference [9] proposes a hybrid and multilevel FDD tool for the identification and prioritization of corrective maintenance actions, helping to ensure the energy performance of buildings by using dynamic Bayesian networks to monitor energy consumption.
Following a similar line of work in the context of building energy performance, reference [10] proposed to evaluate the uncertainty associated with the use of a simplified model for the estimation of the energy consumption of a given building. In a more recent study, reference [11] proposes a method that employs sequential two-state clustering to identify abnormal behaviour of the fan coil unit. Some other recent studies on HVAC system fault detection and diagnosis can be seen in [5,12,13].

Early warning signals
The main problem in the state-of-the-art BMSs is that they tend to use a reactive approach instead of a proactive one. The system failure is detected once it has happened, thus causing a disruption in services and forcing engineers to temporarily shut down some of the equipment in order to fix the failure. This paper proposes a proactive approach to early detection of BMS failures, aiming at both EWS and forecast of time series temperature sensor data.
Prior publications show that EWS have a wide range of applications, such as climate pattern change analysis [14][15][16][17]. A common application for EWS in manufacturing is degradation assessment [18,19] and, more generally, the early detection of failures in critical system components, as explained in [20]. Another popular area of application is economics, including the analysis of banking system collapse [21][22][23], credit risk diagnosis [24,25] and studies of economic cycles for certain asset prices [26][27][28]. EWS are also used in biology, with a wide variety of applications within the field [29][30][31][32].
There are several studies of EWS in dynamical systems such as climatic variables that can be applied to our case study, as it is also a dynamical system. An example is [14], which uses lag-1 autocorrelation (ACF1), the Detrended Fluctuation Analysis (DFA) exponent and Power Spectrum (PS) on tropical cyclone data. For the multivariate tropical cyclone data case, reference [15] uses Empirical Orthogonal Functions (EOF) for dimensionality reduction prior to applying ACF1, DFA and PS. They also study the possibility of using the eigenvalues of the Jacobian matrix of the system as a tipping point indicator. In the same area of application, references [16,17] use ACF and DFA to detect climate tipping points. In [33], a novel statistical method, the method of potentials [34], is applied for EWS to analyse the changing number of climate states during the last 60 kyr. The method detects the changes between states by estimating the probability density of the recorded time series. In the predictive maintenance field, reference [35] uses DFA and ACF to detect anomalies in electronic components commonly used in applications of the automotive and aviation industries. A different approach is shown in [36], which uses invariant-based identification for Lithium-ion battery performance degradation. Reference [37] presents data-driven models for air source heat pump performance evaluation and anomaly detection based on real data collected from a water heating system.
Variational Autoencoders (VAE) are becoming increasingly popular for failure detection. In [38], the magnitude of the autoencoder residual error vector is used as the indicator. This method is tested on several image datasets, concluding that it is a valid methodology for failure detection, as the hidden layer representation is capable of characterising the fundamental attributes of the system within normal conditions and can therefore measure the deviation from "normal functioning". Following a similar method in a manufacturing field, [18] uses a VAE for degradation assessment of a ball screw. The assessment is done using the Variational Autoencoder Reconstruction Error (VAERE) and it demonstrates the progressive degradation of this component.
The paper is organised as follows. Section 2 describes the data used in this study, both for the univariate and multivariate cases. In Section 3, the methodologies used for the purpose of this paper are explained and summarised. Results are then outlined in Section 4. Finally, Section 5 outlines the conclusions.

Problem description & data
The goal of this paper is to apply EWS techniques for early failure detection in Building Management Systems (BMS). The building used in this case study is a large manufacturing facility that consists of office spaces and production spaces, separated into different plants. The failure consists of a malfunction of the heating and cooling valve control system that causes unnecessary temperature spikes in the affected zone of the building. When the Outside Air Temperature (OAT) drops below 5 degrees Celsius, the control system activates the frost coil valve, which is designed to protect the HVAC equipment from freezing but may also have an effect on the room temperature (in this case it does). This causes the cooling and heating valves to overreact in order to compensate for the influence on temperature, thus causing the spikes. This malfunction has been observed after a long period of time with no failures.
Our variable of interest is the supply air temperature (SAT) of a particular zone of the building in the manufacturing plant. SAT is controlled by the average room temperature, measured by four sensors installed in the zone, using the control signal of a pre-established fixed setpoint. The average room temperature is then used by the system to regulate the opening percentage of the heating or cooling valves, according to the setpoint. Another factor that influences the temperature of the room is the OAT. These four variables are represented in Figure 1, where the origin, 0, is taken to represent the start of the failure.
Another variable that affects the temperature of the room is the frost coil valve (or pre-heat coil), which is worth mentioning as it may influence the room temperature by increasing it, but we do not use it for the purpose of early failure detection.
- Outside Air Temperature controls the frost coil valve. The frost coil valve protects the Air Handling Unit (AHU) from very low temperatures (the pre-heat coil protects the parts containing liquid elements, which would otherwise contract due to the low temperatures and damage the AHU as a consequence).
- Supply Air Temperature controls the heating and cooling valves. As the data were recorded in winter, the cooling valve is closed most of the time, but in periods of anomaly it activates and provides cooling to the environment.
We first conduct univariate analysis on the SAT by using different indicators explained in Section 3, and then we take into account other variables, reducing their dimensionality and applying the same techniques to compare both approaches and identify which one provides an EWS the earliest.

Early warning signals
There are a number of techniques to detect Early Warning Signals (EWS), as mentioned in Section 1. For the purpose of early failure detection applied to BMS we use four: variance, power spectrum, ACF1 and VAE. These early warning indicators are computed over a chosen sliding window applied to the time series preceding the onset of a transition. The choice of the length of the sliding window is a trade-off between time resolution (data availability) and the clarity of the change of the signal prior to transition. In order to identify trends before the transition, we apply the Kendall τ correlation coefficient [39] to one of the indicators, PS, as this indicator presents a more chaotic behaviour prior to transition. A positive Kendall τ coefficient indicates an increasing trend in the indicator prior to transition, as applied in [16].
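As an illustration of the trend test, the following is a minimal sketch of a Kendall τ computed between an indicator series and its time order. It is not the authors' implementation: the function name `kendall_tau_trend` and the sample values are hypothetical, and the simple τ-a variant (no tie correction) is assumed.

```python
def kendall_tau_trend(values):
    """Kendall tau (tau-a) between a sequence and its time order.

    +1 means strictly increasing, -1 strictly decreasing.
    Ties count as neither concordant nor discordant (no tie correction).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    concordant = 0
    discordant = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            if diff > 0:
                concordant += 1
            elif diff < 0:
                discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical indicator values from successive windows before a transition.
indicator = [0.10, 0.12, 0.11, 0.15, 0.18, 0.17, 0.22]
tau = kendall_tau_trend(indicator)  # positive: mostly increasing trend
```

A value close to +1 would indicate a steadily rising indicator, while values oscillating around zero would indicate no clear trend.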

Variance
The first indicator we use is the variance of the time series, which is defined by:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 \tag{1}$$
where $s^2$ is the sample variance, $x_i$ is the value of one observation, $\bar{x}$ is the mean value of all observations and $n$ is the number of observations. Variance is included as a simpler baseline against which the other EWS indicators can be compared.
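A minimal sketch of the variance indicator over a sliding window is given below. The function names and the example series are hypothetical, and the window length is expressed in samples, since the sampling interval of the sensor data is not stated here.

```python
def sample_variance(window):
    """Sample variance with the n - 1 denominator."""
    n = len(window)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(window) / n
    return sum((x - mean) ** 2 for x in window) / (n - 1)


def rolling_variance(series, window_len):
    """Variance of each trailing window; one value per window end point."""
    if not 2 <= window_len <= len(series):
        raise ValueError("window_len must be between 2 and len(series)")
    return [sample_variance(series[end - window_len:end])
            for end in range(window_len, len(series) + 1)]


# Hypothetical temperature readings with a jump at the end.
readings = [1.0, 2.0, 3.0, 4.0, 10.0]
indicator = rolling_variance(readings, window_len=3)
```

Each output value only uses data up to its own time stamp, so the indicator can be evaluated online as new readings arrive.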

Power spectrum (PS)
Another EWS indicator is the power spectrum scaling exponent $\beta$, which is calculated by estimating the slope of the power spectrum $S(f)$ of the data, plotted on logarithmic axes [40], in the short-term range. The exponent $\beta$ can be estimated from:
$$S(f) \sim f^{-\beta} \tag{2}$$
where the power spectrum is approximated by the periodogram, obtained from the absolute value of the fast Fourier transform. We then obtain $\beta$ by measuring the slope within the frequency range $10^{-2} \le f \le 10^{-1}$.
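The estimation of the power spectrum scaling exponent can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: frequencies are taken in cycles per sample, the periodogram is the squared magnitude of the Fourier coefficients divided by the series length, a direct DFT stands in for the FFT to keep the example dependency-free, and the function names are hypothetical.

```python
import cmath
import math


def periodogram(series):
    """Periodogram at the positive Fourier frequencies (cycles per sample)."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    freqs, power = [], []
    for k in range(1, n // 2 + 1):
        # Direct DFT; an FFT yields the same coefficients faster.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        freqs.append(k / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def ps_exponent(series, f_min=1e-2, f_max=1e-1):
    """Estimate the exponent in S(f) ~ f^(-exponent) by a log-log fit."""
    freqs, power = periodogram(series)
    points = [(math.log10(f), math.log10(p))
              for f, p in zip(freqs, power)
              if f_min <= f <= f_max and p > 0.0]
    if len(points) < 2:
        raise ValueError("too few frequencies inside the fitting band")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return -sxy / sxx  # the exponent is minus the fitted slope
```

Applied to each sliding window in turn, `ps_exponent` gives the PS indicator. The window must be long enough for several Fourier frequencies to fall inside the fitting band.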

Lag-1 auto-correlation function (ACF1)
The autocorrelation function is also used in this paper. It measures the correlation of the time series with itself at different time lags. According to [41], the lag-k autocorrelation function is defined as:
$$r_k = \frac{\sum_{t=1}^{n-k}\left(x_t - \bar{x}\right)\left(x_{t+k} - \bar{x}\right)}{\sum_{t=1}^{n}\left(x_t - \bar{x}\right)^2} \tag{3}$$
where $x_t$ is the observation at time $t$, $\bar{x}$ is the mean of the series and $n$ is the number of observations. We use the lag-1 ($k = 1$) autocorrelation function (ACF1), as in previous studies. The choice of k matters. If k is too small, the indicator would respond very quickly to changes in the time series. If k is too large, the indicator would barely change prior to a significant change in the series.
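A minimal sketch of the lag-k autocorrelation and its sliding-window version is shown below. The function names are hypothetical, and returning NaN for a constant window (where the autocorrelation is undefined) is an assumption of this sketch.

```python
import math


def lag_k_autocorrelation(window, k=1):
    """Lag-k autocorrelation of a window, normalised by its total variation."""
    n = len(window)
    if not 0 < k < n:
        raise ValueError("lag must satisfy 0 < k < len(window)")
    mean = sum(window) / n
    denom = sum((x - mean) ** 2 for x in window)
    if denom == 0.0:
        return math.nan  # undefined for a constant window
    numer = sum((window[t] - mean) * (window[t + k] - mean)
                for t in range(n - k))
    return numer / denom


def rolling_acf1(series, window_len):
    """ACF1 of each trailing window; one value per window end point."""
    if not 2 <= window_len <= len(series):
        raise ValueError("window_len must be between 2 and len(series)")
    return [lag_k_autocorrelation(series[end - window_len:end], k=1)
            for end in range(window_len, len(series) + 1)]
```

For example, a strictly alternating window such as `[1, -1, 1, -1, 1, -1]` gives a strongly negative ACF1, whereas a slowly drifting window gives a positive one.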

Variational auto-encoder (VAE)
This technique is a novel methodology for EWS. According to [42], the VAE is a network which attempts to represent the input with a probability density function (PDF) instead of several hidden nodes. This is illustrated in [43], with a training set $X = [x_1, x_2, \ldots, x_A]^T$, where the $x_t$ are vectors from time $t = 1$ to time $t = A$. The VAE uses a neural network for the probabilistic encoder $q_\phi(z|x)$ to approximate the posterior of the generative model $p_\theta(x, z)$. Let the prior over the latent variables be the centred isotropic multivariate Gaussian $p_\theta(z) = \mathcal{N}(z; 0, I)$. The parameters of the distribution are computed with a fully-connected neural network with a single hidden layer that attempts to reduce the reconstruction error. According to [43], the posterior is approximated with a multivariate Gaussian with a diagonal covariance structure:
$$\log q_\phi(z|x^{(i)}) = \log \mathcal{N}\left(z; \mu^{(i)}, \sigma^{2(i)} I\right) \tag{4}$$
Here, the mean and standard deviation, $\mu^{(i)}$ and $\sigma^{(i)}$ respectively, are the outputs of the encoding half of the network. Concerning the decoder half of the network, samples are computed from the posterior $z^{(i,l)} \sim q_\phi(z|x^{(i)})$ using $z^{(i,l)} = g_\phi(x^{(i)}, \epsilon^{(l)}) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}$, where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$, with $\odot$ being an element-wise multiplication. As both prior ($p_\theta(z)$) and posterior ($q_\phi(z|x)$) are assumed to be Gaussian, the resulting estimator for the model and datapoint $x^{(i)}$ is, according to [43]:
$$\mathcal{L}(\theta, \phi; x^{(i)}) \simeq \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\left((\sigma_j^{(i)})^2\right) - (\mu_j^{(i)})^2 - (\sigma_j^{(i)})^2\right) + \frac{1}{L}\sum_{l=1}^{L}\log p_\theta(x^{(i)}|z^{(i,l)}) \tag{5}$$
where $J$ is the dimensionality of $z$ and $L$ is the number of samples drawn per datapoint. We first assume a normal functioning state of the system and train the model parameters ($\mu^{(i)}$ and $\sigma^{(i)}$) with such data. As EWS indicator, we track the reconstruction error, which we define with the root mean square error (RMSE), described by:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \tag{6}$$
where $n$ denotes the number of sampled points in a sequence, and $\hat{y}$ and $y$ are the reconstructed output and the actual value, respectively. The VAE indicator is therefore constructed by measuring the reconstruction error according to the RMSE, defined as the variational autoencoder reconstruction error (VAERE).
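The building blocks of this indicator can be sketched without a deep learning library. The sketch below is illustrative only and does not train a network: it shows the reparameterised sampling step, the closed-form Gaussian regularisation term of the estimator in [43], and the RMSE used as VAERE. All function names are hypothetical, and the encoder and decoder networks themselves are assumed to exist elsewhere.

```python
import math
import random


def reparameterise(mu, sigma, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def gaussian_regulariser(mu, sigma):
    """0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2).

    This equals minus the KL divergence between N(mu, sigma^2 I) and
    N(0, I), so it is zero only when the posterior equals the prior.
    """
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                     for m, s in zip(mu, sigma))


def vaere(reconstructed, actual):
    """Root mean square error between reconstruction and actual sequence."""
    if len(reconstructed) != len(actual) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(actual)
    return math.sqrt(sum((r - a) ** 2
                         for r, a in zip(reconstructed, actual)) / n)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
z = reparameterise([0.5, -0.2], [0.1, 0.3], rng)
```

In use, a model trained on normal-operation windows would reconstruct each new window, and `vaere` of that reconstruction against the observed window would be tracked over time as the indicator.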

Principal component analysis
Principal Component Analysis (PCA) was first introduced in [44]. It is a methodology that obtains an r-dimensional basis that best captures the variance in the data. Given input data $D \in \mathbb{R}^{n \times d}$ and the desired threshold $\alpha$, it selects the smallest set of dimensions r that captures at least an $\alpha$ fraction of the total variance. The steps are shown in Algorithm 1.
The direction with the largest projected variance is called the first principal component, the direction orthogonal to the first one which captures the second largest projected variance is the second principal component, and so on.
We select a number of dimensions r smaller than that of the original dataset, such that the subspace spanned by these r dimensions captures at least a fraction $\alpha$ of the variance. In practice, $\alpha$ is set to a value around 0.95 (as in this case), so that the reduced dataset captures at least 95% of the total variance. Here r = 1, so the original three time series are reduced to a single series that captures the features of the three with the minimum possible error.
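For the r = 1 case used here, the standard PCA steps (centre the data, form the covariance matrix, find the leading eigenvector, project) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: power iteration is swapped in for a full eigendecomposition to keep the example dependency-free, the covariance uses a 1/n normalisation, and the function name and example data are hypothetical.

```python
import math
import random


def first_principal_component(rows, iterations=1000, tol=1e-12, seed=0):
    """Return (direction, explained variance fraction, projected scores)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(d)]
           for a in range(d)]
    total_variance = sum(cov[j][j] for j in range(d))
    if total_variance == 0.0:
        raise ValueError("data has no variance")

    # Power iteration from a random unit vector (assumed not to be
    # orthogonal to the leading eigenvector).
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("power iteration collapsed; try another seed")
        w = [x / norm for x in w]
        shift = max(abs(x - y) for x, y in zip(w, v))
        v = w
        if shift < tol:
            break

    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, eigenvalue / total_variance, scores


# Hypothetical three-variable data lying exactly on one line.
rows = [[t, 2.0 * t, -t] for t in range(10)]
direction, fraction, scores = first_principal_component(rows)
```

The returned fraction can be compared with the threshold (0.95 in this case) to check whether a single component is sufficient, and `scores` is the reduced one-dimensional series to which the EWS indicators would then be applied.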
We now apply the methods in Section 3 to the observed sensor data. For the purpose of this study we perform univariate analysis only with SAT, then we perform multivariate analysis with all variables controlling SAT by reducing the dimensionality before applying these techniques.
Although these techniques have not been implemented in similar BMS-related failures so far, the techniques proved successful for early failure detection/tipping point analysis in similar dynamical systems such as manufacturing, automotive industry, climatological events, etc.
The window lengths can be chosen for each methodology according to how early the corresponding indicator starts to peak prior to failure. Once the values are chosen, the techniques can be extrapolated to similar problems (for instance, early detection of failures in different parts of the HVAC system affecting room/area temperature).

Univariate analysis
Results for univariate analysis are presented in Figure 2, where SAT is presented together with the EWS indicators. The plot shows data from 64 hours prior to the failure at moment 0, which is when the temperature increases anomalously for the first time. We use a different window to obtain each EWS: 22 hours for variance and ACF1, 8 hours for PS and 4 hours for VAERE. The window sizes have been selected according to the clarity of the signal they provide for each indicator. The y-axis of the variance is shown on a logarithmic scale, for convenience.
As shown in Figure 2, ACF1 and variance are the indicators presenting the clearest signals. It can be observed that PS also increases before the failure occurs, although it gives a signal over a shorter period. The VAE has first been trained with the system functioning under normal conditions. The output displayed shows the VAERE. This seems to show a drop right before the tipping point, but less than 4 hours prior to the event.
As can be seen in Figure 3, the Kendall τ values show a predominantly positive trend, as the majority of values are greater than zero. However, some negative values can be seen as oscillations occur.

Multivariate analysis
Multivariate analysis for EWS is shown in Figure 4. The plot on the top corresponds to OAT, cooling valve and heating valve reduced with PCA to their first principal component, that is, the direction of maximum projected variance. The windows used for each indicator for multivariate analysis are: 22 hours for variance and ACF1, 16 hours for PS and 20 hours for VAERE.
Comparing this plot with the univariate one, we observe that ACF1 and variance again present clear EWS, with the variance showing a steadier slope in the multivariate case. PS shows an earlier signal in the multivariate case, 24 hours prior to failure in comparison to 8 hours prior to failure in the univariate analysis. However, the Kendall τ values incline a bit more towards negative values than in the univariate case, although positive values are still predominant. This means that the PS indicator is less robust in the multivariate case. An earlier signal is also given by the VAE, whose indicator for the multivariate signal gives a clear EWS from 32 hours before the event, in comparison to the 4 hours prior to failure of the univariate case.

Discussion, conclusions and limitations
In this paper, we applied several early warning signal techniques to analyse BMS temperature sensor data and the opening percentage of the heating and cooling valves, which activate the heating and/or cooling systems, respectively, in an area of a large industrial facility. The studied failure is a malfunction of one specific AHU control system that causes temperature spikes of up to 30 degrees Celsius due to overreaction of the heating and cooling valves in response to an anomalous temperature change caused by the pre-heat coil during the winter period in a specific manufacturing facility. The aim of the paper is therefore to detect this anomaly before it happens. The analysis has been divided into univariate and multivariate EWS analysis, using the same indicators for each: variance, ACF, PS and VAE. PCA has been used for dimensionality reduction in the multivariate case.
The analysis shows that, in general, the indicators provide an earlier and more reliable signal in the multivariate case. This improvement can be seen especially in PS and VAE. In the case of PS, the difference is that the indicator completely changes its range of movement 24 hours prior to failure. With the VAE, the reconstruction error does not reproduce the gradual increase before the failure in the univariate case, despite the abrupt temperature change. In the multivariate case, the VAE does show the expected behaviour, producing an early signal almost 32 hours prior to failure. The reason that the VAE does not provide a good signal in the univariate case may lie in what is considered a normally functioning system when training the model using only SAT. Such a distribution undergoes small changes due to external factors, which actually control its non-stationary behaviour. In the multivariate case, the behaviour of the system is defined mainly by the heating valve, which steadily opens and closes to provide heating to the environment within an established range or set point. In both univariate and multivariate analysis, ACF performed very well, starting to clearly go above its 0.2 value almost 32 hours prior to failure, with a very similar performance in both cases.
The SAT is controlled by the average of the four temperature sensors in the room. The SAT was chosen instead of each of the four temperature sensors in order to avoid false positives due to noise. Although there is a standard on the locations at which the sensors should be installed, they can be temporarily exposed to heat sources (if someone places a desk with a computer next to a sensor by mistake, or if sunlight beams directly at the sensor, for instance). Therefore there could have been external factors that could not be controlled for the purpose of this study.
The indicators generally respond earlier in the multivariate analysis, as the variables controlling SAT are the ones considered for the dimensionality reduction prior to applying the EWS techniques. This has been possible due to a good knowledge of the system, so that the right variables controlling SAT, out of hundreds of sensors, could be selected. In other circumstances, a lack of knowledge about the system or the sensors installed would not allow multivariate analysis. In such a case, a technique for single variable EWS analysis can also deliver satisfactory results.
In this work, lack of data and difficulty of automation are the main limitations encountered. Here the problem lies not only in the automation of data acquisition, but also in locating specific failures in order to test the algorithms. Further work to improve these results would involve improved software solutions for data extraction, as well as locating new failures and case studies to further improve and generalise these results.
The early detection of such failures gives time to on-site engineers to make adjustments when necessary before these failures actually happen. This not only reduces maintenance and operational costs, but also produces energy savings by advising when parts of the system should not be activated in some given period, thus compensating the "blind spots" from the BMS control system by extracting real value from the data generated.

Funding information
This work was funded by EPSRC DTP PhD Studentship, NPL and Mitie. For the purpose of Open Access the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.