Open Access
Int. J. Metrol. Qual. Eng., Volume 11, 2020
Article Number 14, 16 pages
DOI: https://doi.org/10.1051/ijmqe/2020011
Published online 19 November 2020

© V. Ramnath, published by EDP Sciences, 2020

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

1.1 Research motivation

Least squares based statistical regression analysis is a widely used tool for analysing experimental data when estimating and fitting model parameter values and uncertainties. In the particular case of straight line equations of the form y = ax + b, known input data xi and output data yi for data-points i = 1, …, n, inclusive of any available uncertainty information, are used in the regression analysis, where standard uncertainty analysis theory from the GUM [1] specifies the model uncertainty as

(1) u²(y) = (∂y/∂x)²u²(x) + (∂y/∂a)²u²(a) + (∂y/∂b)²u²(b) + 2(∂y/∂x)(∂y/∂a)cov(x, a) + 2(∂y/∂x)(∂y/∂b)cov(x, b) + 2(∂y/∂a)(∂y/∂b)cov(a, b)

(2) u²(y) = a²u²(x) + x²u²(a) + u²(b) + 2x cov(a, b)

In the above formulae, since x is a statistically independent random variable, it follows that cov(x, a) = 0 and cov(x, b) = 0, whilst in general cov(a, b) ≠ 0, so that equation (1) reduces to equation (2). As a result, the probability density function (PDF) gy(ηy) of the output may be determined, where ηy is the random variable associated with the model output y, gx(ξx) is the PDF of the univariate random variable ξx associated with the model input x, and ga,b(ξa, ξb) is the joint PDF of the bivariate random variable (ξa, ξb) associated with the model parameters a and b, which incorporates the coupling effect between ξa and ξb that is modelled in terms of cov(a, b).
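
As a brief illustration of equation (2), the following Gnu Octave/Matlab sketch propagates values of u(a), u(b), cov(a, b) and u(x) through the straight line model; all of the numbers are hypothetical and are chosen only to show how the covariance term enters the budget.

% Illustration of equation (2) for y = a*x + b; all values are hypothetical.
a = 2.0;  b = 1.0;             % fitted slope and intercept
ua = 0.02;  ub = 0.05;         % standard uncertainties u(a) and u(b)
cab = -0.0008;                 % covariance cov(a,b)
x = 3.0;  ux = 0.01;           % input value and its standard uncertainty
u2y = a^2*ux^2 + x^2*ua^2 + ub^2 + 2*x*cab;   % equation (2)
uy = sqrt(u2y)                 % combined standard uncertainty of y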

The research problem that occurs when OLS, WLS and GLS regressions are used to estimate the parameters a and b to varying degrees of accuracy is that additional Monte Carlo simulations must then be performed in order to determine the variances u2(a) and u2(b) along with the covariance cov(a, b), and this information must then be further processed in order to infer u2(A0), u2(λ) and cov(A0, λ) for the pressure balance parameters A0 and λ discussed below. Consequently, if the covariance is unknown or unavailable and only the variances of the model parameters are utilized in uncertainty calculations, the model uncertainty will be systematically under-estimated.

1.2 Research focus

This paper focuses on demonstrating the conceptual ease and functionality of the WTLSC approach, which is able to incorporate the measurement variances u2(xi) and u2(yi) along with the covariances cov(xi, yi) of the data, when available, in order to determine both the variances u2(a) and u2(b) and the covariance cov(a, b) of the model parameters for statistical completeness. A new, simple analytical calculation approach is presented to directly infer the variances u2(A0) and u2(λ) as well as the covariance term cov(A0, λ) from knowledge of u2(a), u2(b) and cov(a, b), as obtained from the WTLSC algorithm implemented in a Gnu Octave/Matlab program, without relying on additional Monte Carlo simulations or statistical post-processing.

2 Literature review

2.1 Physical theory of pressure balances

Pressure balance theory from Dadson [2] for a cross-floated pressure balance in equilibrium may be used to define the applied pressure P in terms of a generalized force F and an effective area S as P = F/S, where F is a generalized force term that accounts for buoyancy effects, and S is a cross-sectional effective area that accounts for pressure induced deformation and temperature expansion/contraction effects, such that S = A0(1 + λP)φ, where A0 is the effective area extrapolated to zero pressure, λ is a distortion coefficient based on linear elasticity theory, and φ is an appropriate temperature compensation function that adjusts the area to an appropriate reference temperature.

In general, since there is a pressure dependence within the effective area term S, the equation is usually rearranged as a homogeneous equation h(P) = F − P A0(1 + λP)φ = 0 and then solved for specified inputs to obtain the value of P such that h = 0. In the event that the pressure P is known and the area S must be determined, the equation is rearranged as S = F/P and again expressed as a homogeneous equation in S in order to determine the value of S that solves it for the specified inputs.

In the above equations m is the mass on top of the piston, ρa is the ambient fluid density in contact with the piston to account for buoyancy forces acting on the piston, ρm is the mass density, g is the local gravitational acceleration, A0 is the zero-pressure area, λ is the distortion coefficient, P is the applied pressure, and φ = 1 + α(t − t0) is a temperature compensation function that adjusts the cross-sectional area to a reference temperature t0, where α = αp + αc is the combined linear thermal expansion coefficient of the piston and cylinder, with αp and αc the corresponding linear thermal expansion coefficients of the piston and cylinder respectively, so that the generalized force may be written as F = mg(1 − ρa/ρm).

When the above system of equations is implemented, the equation for the cross-floated area S may then be simply modelled as

(3) Si = A0(1 + λPi),  i = 1, …, n
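
For illustration, the homogeneous form of the equilibrium equation may be solved numerically for P with a root finder; the following Gnu Octave/Matlab sketch uses entirely hypothetical instrument values and is not the program of Figures 1 and 2.

% Numerical solution of h(P) = 0 for the applied pressure; hypothetical values.
m = 50;                          % applied mass in kg (hypothetical)
g = 9.78635;                     % local gravitational acceleration in m/s^2
rhoa = 1.2;  rhom = 7850;        % ambient fluid and mass densities in kg/m^3
A0 = 1.96e-5;                    % zero-pressure area in m^2 (hypothetical)
lambda = 4.5e-13;                % distortion coefficient in 1/Pa (hypothetical)
alpha = 9.1e-6;  t = 20.4;  t0 = 20;   % thermal expansion coefficient and temperatures
phi = 1 + alpha*(t - t0);        % temperature compensation function
F = m*g*(1 - rhoa/rhom);         % generalized force with buoyancy correction
h = @(P) F - P.*A0.*(1 + lambda*P)*phi;   % homogeneous equation h(P) = 0
P = fzero(h, F/(A0*phi))         % applied pressure in Pa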

In the above equation the inputs are the expected values and standard uncertainties of the applied pressures (μ(Pi) ± u(Pi)), the cross-floated areas (μ(Si) ± u(Si)), and estimates of the covariances cov(Pi, Si) for a sequence of measurements i = 1, …, n, where n is the number of measurements, whilst the unknowns are considered to be the expected values of the zero-pressure area A0 and the distortion coefficient λ, their variances u2(A0) and u2(λ), and the covariance cov(A0, λ). This modelling approach is equivalent to fitting a straight line of the form y = ax + b from experimental data and utilizing a least squares regression to quantify the values and uncertainties of the parameters a and b.

2.2 Historical and mathematical context

The fitting of straight lines is a common numerical task in fields such as science, engineering, and economics that require experimental or computational data to be modelled and fitted. Usually a least-squares algorithm is implemented to minimize a merit function, as discussed by Saunders [3], where in general a non-linear calibration equation y = y(x; a1, …, aN) is to be fitted from measurement pairs of controlled values xi and observed values yi. In the case where the number of measurements is M and the number of parameters to determine is N, a least squares solution is necessary in the general over-determined case where M > N. For this situation a merit function is constructed as

(4) χ² = Σ_{k=1}^{M} wk [yk − y(xk; a1, …, aN)]²

and optimized in terms of the parameters a1, …, aN by simultaneously solving ∂χ²/∂ai = 0 for i = 1, …, N, where wk is a suitable weighting function that accounts for the underlying measurement accuracy and available information.

The simplest case involves an optimization where all of the data points have an assumed constant uncertainty/accuracy such that u(xi) = u(yi) = σ for i = 1, …, M measurements, which corresponds to the situation where no detailed or specific measurement uncertainty information is available; as a result the optimization of the merit function does not require any specific knowledge of σ, since this can be conveniently factored out when optimizing χ². This particular problem corresponds to the well known historical total least squares problem as investigated by Van Huffel and Vandewalle [4].

At the next conceptual level, the problem of fitting a straight line with errors in both coordinates was originally formulated by York [5], and subsequent work by researchers in the area of regression analysis of experimental measurements involved either attempts to directly solve the original problem posed by York or various approximation methods to infer the variances of the fitting parameters. For this problem the approach of Press et al. [6] has achieved the most notable recognition amongst non-mathematicians for fitting a straight line where both the x and y coordinates possess some underlying known statistical uncertainty.

In this particular case the model y(x) = a + bx involves a χ² merit function optimization of the form

(5) χ²(a, b) = Σ_{i=1}^{M} [yi − a − b xi]² / [u²(yi) + b² u²(xi)]

where u(xi) and u(yi) are the standard deviations of the xi and yi data points, which are allowed to vary from point to point. Although conceptually straightforward, the main challenge with directly solving this problem is that the presence of the parameter b in the denominator renders the partial derivative equation ∂χ²/∂b = 0 non-linear, and as a result this problem is considerably more difficult to solve than the earlier total least squares problem, which is amenable to an analytical solution.

Consequently, the optimization problem for straight lines with errors in both coordinates is typically solved either with classical optimization techniques such as the Levenberg-Marquardt method or with the Monte Carlo bootstrap technique as outlined by Press et al. [6], and as a result this solution approach is not generally directly accessible for many metrologists working in industrial laboratories due to the mathematical complexity of the method.
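
For readers who wish to experiment with a direct numerical minimization of equation (5), a minimal Gnu Octave/Matlab sketch is given below; the data points and uncertainties are entirely hypothetical, and the built-in simplex optimizer fminsearch is used purely for illustration rather than the Levenberg-Marquardt or bootstrap methods of [6].

% Direct numerical minimization of the merit function in equation (5);
% the data and uncertainties below are hypothetical illustrative values.
xi = [1 2 3 4 5];  yi = [2.1 2.9 4.2 4.9 6.1];
uxi = 0.05*ones(1,5);  uyi = 0.10*ones(1,5);
chi2 = @(p) sum((yi - p(1) - p(2)*xi).^2 ./ (uyi.^2 + p(2)^2*uxi.^2));
p0 = polyfit(xi, yi, 1);                 % ordinary least squares starting values
p  = fminsearch(chi2, [p0(2) p0(1)]);    % p(1) = intercept a, p(2) = slope b

As noted above, such a point estimate yields the values of a and b only, and not their variances or covariance.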

It should be noted that a conventional numerical solution of this non-linear optimization problem, using either classical techniques such as Newton's, steepest descent and homotopy methods as discussed by Burden and Faires [7] or newer methods such as the particle swarm optimization (PSO) approach, would still suffer from two limitations, namely that firstly it would not be able to compute the parameter variances, and secondly it would not be able to provide any insight into the parameter covariances.

The statistical effects of the choice of OLS, WLS and GLS regressions were subsequently investigated by Duer et al. [8], who determined that whilst the specific choice of OLS, WLS or GLS regression did not generally produce inconsistent estimates for the intercept b and slope a parameter values, the different regression schemes nevertheless produced considerable variations in the corresponding estimates of the parameter uncertainties u(a) and u(b). As a result, Duer et al. concluded that a simplified OLS regression could only be safely utilized in the very specific and special case where the uncertainties u(xk) in the xk data points were negligible.

In the particular case of scientific metrology work, an inconsistent or incorrect estimate of a parameter value and variance/covariance could have a catastrophic effect on high level laboratory inter-comparison work or on safety critical experimental measurements, as parameter uncertainties of equipment and instruments are typically utilized in many quality engineering systems to specify tolerances and safety limits for equipment and personnel.

Due to these specific limitations, particularly for scientific measurement work at national laboratories, the total least squares problem was subsequently revisited by Krystek and Anton [9], who simplified the original two-dimensional problem into an equivalent one-dimensional optimization that is amenable to solution with a simpler set of mathematical tools. This initial approach was later generalized to solve the full problem of errors in both coordinates with correlation effects in a later paper by Krystek and Anton [10]. This approach is now termed a weighted total least squares with correlation (WTLSC) analysis approach, and it assumes that the covariance matrices Uk of the individual experimental data points for k = 1, …, n are available prior to optimizing the merit function, which is of the form

(6) χ² = Σ_{k=1}^{n} (zk − ζk)^T Uk^{-1} (zk − ζk)

where zk = (xk, yk)^T is the measured data point and ζk is the corresponding point on the fitted straight line.

Part of the reason why the WTLSC approach does not appear to have been more widely adopted is that, in the specific area of scientific metrology, the calculation of the covariance matrix is extremely mathematically complicated when using standard calculus based approaches, and is therefore only tractable from a full Monte Carlo pure numerical simulation approach as discussed by Harris [11]. As a result, if the covariance matrix is unavailable then the WTLSC approach becomes problematic, in which case the covariance must either be calculated or estimated, or alternatively the correlations may be set to zero so that the algorithm implementation simplifies to the earlier weighted total least squares algorithm.

Whilst the absence of a covariance matrix may be viewed as a technical limitation when implementing a GLS regression, where the standard statistical formulation consists of determining the value of the regression parameter β that optimizes the residuals ε in a linear regression model y = Xβ + ε, it has subsequently been determined by researchers that this is not an insurmountable challenge, as mechanisms exist to infer a likely covariance matrix. For this problem the covariance matrix Cov[ε | X] = Ω is known to take the general form

(7) Ω = [σ1² σ12 … σ1n; σ21 σ2² … σ2n; … ; σn1 σn2 … σn²]

where the diagonal elements σi² are the residual variances and the off-diagonal elements σij = σji are the corresponding covariances.

In order to account for the unavailability of the actual covariance matrix Ω, which may be unknown or impractical to estimate, one may nevertheless obtain a consistent estimate of Ω, usually written as Ω̂, that is termed the feasible generalized least squares (FGLS) estimator, using either iteration based commercial routines in software such as Matlab [12], or newer support vector machine (SVM) approaches that are currently under development, such as the recently reported results from Miller and Startz [13].

2.3 Recent statistical regression developments

Work recently reported in the literature by Wuethrich and Souiyam [14], building on earlier work by Otal and Yardin [15], has investigated the use of Monte Carlo based simulations for Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS) regression schemes in order to determine the parameter values of a pressure balance model of the form S = bP + a, where P is an input of applied pressure, S is an output of cross-floated area, and a and b are model parameters whose values and uncertainties must be determined. Conventionally the pressure balance model has the form S = A0(1 + λP), where A0 is a zero-pressure area and λ is a distortion coefficient, and the regression analysis fits the value a = A0 and the product b = A0λ for the i = 1, …, n data-points such that

(8) Si = A0(1 + λPi) = a + bPi,  i = 1, …, n

(9) Si = mi g (1 − ρa/ρm) / (Pi φi)

where equation (9) is the cross-float equilibrium equation solved for the area at each applied pressure Pi.

For the above regression scheme the distortion coefficient is recovered as λ = b/a after the residuals εi are minimized in the matrix equation X C = Y + ε. This minimization is implemented through the use of a coefficient matrix V that incorporates the known uncertainty information, as specified in equations (10) and (11).

In an OLS scheme no weighting is applied and the coefficient matrix V is simply a diagonal matrix with unity on each of the diagonal elements, such that Vi,i = 1 for i = 1, …, n. In a WLS scheme the coefficient matrix is a diagonal matrix whose diagonal elements are proportional to the variances of the areas, i.e. Vi,i ∝ u2(Si). Finally, in a GLS scheme the diagonal elements of V equal the variances of the areas, such that Vi,i = u2(Si), whilst the off-diagonal elements equal the covariances, such that Vi,j = cov(Si, Sj), i, j = 1, …, n, i ≠ j.
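
A compact Gnu Octave/Matlab sketch of the three weighting choices is given below, using the weighted estimator C = (X^T W X)^{-1}(X^T W Y) with W = V^{-1} that also appears in Appendix B; all of the numerical values are hypothetical placeholders.

% Sketch of the OLS, WLS and GLS coefficient matrices V for the model S = a + b*P;
% the pressures, areas, uncertainties and covariance below are hypothetical.
P  = [100 200 300 400 500]'*1e6;                        % applied pressures in Pa
S  = [1.96010 1.96019 1.96028 1.96037 1.96046]'*1e-5;   % cross-floated areas in m^2
uS = [1.0 1.1 1.8 2.7 4.1]'*1e-10;                      % standard uncertainties u(S_i)
X  = [ones(5,1) P];                                     % design matrix for S = a + b*P
V_ols = eye(5);                                         % OLS: V(i,i) = 1
V_wls = diag(uS.^2);                                    % WLS: V(i,i) proportional to u^2(S_i)
V_gls = diag(uS.^2);                                    % GLS: variances on the diagonal ...
V_gls(1,2) = 0.5*uS(1)*uS(2);  V_gls(2,1) = V_gls(1,2); % ... plus an example covariance cov(S_1,S_2)
W = inv(V_gls);                                         % W = V^-1
C = (X'*W*X)\(X'*W*S)                                   % C = [a; b] with a = A0 and b = A0*lambda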

The general approach of Wuethrich and Souiyam [14] is to solve equation (9), with nominal values of the input quantities at specified pressures Pi, in its equivalent homogeneous form in order to determine corresponding values of Si for a sequence of pressures P1, …, Pn with i = 1, …, n, and to then use this information to determine the regression parameter Cj = [(A0)j, bj]^T from the matrix equation with an appropriate choice of weighting matrix V. This process is repeated M = 10000 times with slightly perturbed values of the inputs, consistent with the underlying uncertainties, to sequentially generate samples Cj for j = 1, …, M, resulting in a sequence C1, …, CM of regression parameters, which are then post-processed to calculate the expected values, variances, and covariance of the regression parameters A0 and b = A0λ, where the distortion coefficient is extracted as λj = bj/(A0)j. The pairs of values ((A0)j, λj) are then further post-processed according to the GUM [1] for two sampled random variables qj and rj, for which the means and covariance are estimated as

(12) q̄ = (1/M) Σ_{j=1}^{M} qj

(13) r̄ = (1/M) Σ_{j=1}^{M} rj

(14) cov(q, r) = [1/(M − 1)] Σ_{j=1}^{M} (qj − q̄)(rj − r̄)
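
A minimal sketch of this post-processing per equations (12)–(14) is given below; the correlated samples are generated synthetically here purely for illustration and do not correspond to the actual pressure balance simulation.

% Post-processing of Monte Carlo samples per equations (12)-(14);
% dummy correlated samples are generated here purely for illustration.
M = 10000;
z = randn(M,2)*chol([1 0.7; 0.7 1]);    % correlated standard normal samples
A0s  = 1.9601e-5 + 2e-10*z(:,1);        % hypothetical sampled (A0)_j values
lams = 4.5e-13  + 5e-15*z(:,2);         % hypothetical sampled lambda_j values
A0bar  = mean(A0s);                     % equation (12)
lambar = mean(lams);                    % equation (13)
covA0lam = sum((A0s - A0bar).*(lams - lambar))/(M - 1);  % equation (14)
rho = covA0lam/(std(A0s)*std(lams))     % dimensionless correlation estimate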

Investigations performed by Wuethrich and Souiyam [14] concluded that a GLS regression produces high accuracy results, followed by the WLS regression which produces medium accuracy results, and finally by the OLS regression which produces low accuracy results. As a result the GLS approach may be considered the best overall approach; however, the key difficulty with implementing it is in the estimates of the covariances cov(Si, Sj) that are required, which are either mathematically complex to determine or must be numerically estimated with a large number of Monte Carlo simulations.

The key difference between the newer approach of Wuethrich and Souiyam [14] and the earlier approach of Ramnath [16] is that in the latter work a multivariate GUM Supplement 2 approach [17] was utilized that modelled the various inputs x by treating the two parameters A0 and λ as a single vector y = [A0, λ]^T with an equation h(y, x) = 0, where a multivariate PDF gy(ηy) of the output was determined from a multivariate input PDF gx(ξx) by numerically solving the equation h(ηy, ξx) = 0 for a large number M of Monte Carlo simulation events.

In this approach sampled multivariate values of the inputs, consistent with the joint PDFs of the inputs x, were used to solve the corresponding homogeneous vector equation, and a sequence of pressures P1, …, Pn and areas S1, …, Sn was generated that was then used with an OLS approach to calculate sampled values ((A0)j, λj) from the joint PDF gy(ηy). The sampled values were then used to construct a bivariate copula that modelled u2(A0), u2(λ) and cov(A0, λ); a limitation of this approach, however, was that it required a large number of Monte Carlo simulations and was mathematically complex, and it is thus more suited to scientific metrology applications in national metrology institutes.

The differences between the two approaches are that, firstly, the method of Wuethrich and Souiyam [14] requires estimates for the uncertainties of the areas u2(S1), …, u2(Sn) in the case of a WLS approach, and additionally the covariances cov(Si, Sj), 1 ≤ i, j ≤ n, for each Monte Carlo event j, and secondly it makes an assumption of a Gaussian correlation between A0 and λ. On the other hand, the method of Ramnath [16] firstly avoids the explicit need for calculating u2(Si) and cov(Si, Sj) at each Monte Carlo event j, since the data is treated as statistical samples, and secondly does not assume a Gaussian correlation between A0 and λ, since bivariate copulas can model non-Gaussian correlation effects.

It should be noted that the calculation of the covariances cov(Si, Sj) that are required in a GLS regression, whilst theoretically possible, requires an additional matrix formulation solution as outlined in the GUM Supplement 2 [17], pp. 15–17. For an explicit matrix equation y = f(x) the covariance matrix of the output is Uy = Cx Ux Cx^T, and for an implicit matrix equation h(y, x) = 0 the covariance matrix of the output is solved from the matrix equation Cy Uy Cy^T = Cx Ux Cx^T, where Ux is the covariance matrix of the input, and Cx = ∂h/∂x and Cy = ∂h/∂y are the corresponding sensitivity or Jacobian matrices.
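
For the explicit case, the matrix propagation reduces to a single line of Octave/Matlab, as in the following sketch with an assumed two-input, two-output Jacobian and input covariance matrix.

% Sketch of the explicit multivariate propagation Uy = Cx*Ux*Cx^T;
% the Jacobian Cx and input covariance Ux below are assumed illustrative values.
Ux = [1.0e-4 2.0e-5; 2.0e-5 4.0e-4];   % input covariance matrix (hypothetical)
Cx = [1.0 2.0; 0.5 -1.0];              % sensitivity (Jacobian) matrix (hypothetical)
Uy = Cx*Ux*Cx'                         % output covariance matrix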

Due to the combination of algebraic complexity and numerical ill-conditioning that is present when determining a multivariate covariance Uy, this calculation is usually only tractable from a pure numerical Monte Carlo simulation using an appropriate homogeneous equation for i = 1, …, n, as previously discussed.

3 Mathematical modelling

We consider the standard case of a straight line equation y = ax + b with inputs xi, outputs yi, variances u2(xi) and u2(yi), and covariances cov(xi, yi) for measurements i = 1, …, n. This system was originally analysed by Krystek and Anton [10], who termed it a weighted total least squares with correlation system and developed a WTLSC algorithm that is able to calculate the variances u2(a) and u2(b) and the covariance cov(a, b). An earlier paper by Ramnath [18] utilized this scheme to quantify the variances u2(A0) and u2(λ) but did not explicitly report on the covariance term cov(A0, λ).

Considering the straight line equation S = aP + b = A0(1 + λP) it immediately follows that

(15) A0 = b

(16) λ = a/b

From the above set of equations it follows that the variances are

(17) u²(A0) = u²(b)

(18) u²(λ) = u²(a)/b² + a²u²(b)/b⁴ − 2a cov(a, b)/b³

Using a = A0λ it follows that

u²(a) = λ²u²(A0) + A0²u²(λ) + 2A0λ cov(A0, λ)

Expanding out the partial derivatives in the above equation and solving for cov(A0, λ) then yields

(19) cov(A0, λ) = [u²(a) − λ²u²(A0) − A0²u²(λ)] / (2A0λ)
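
A minimal Octave/Matlab sketch of this analytical post-processing step is given below; the numerical values of a, b, u(a), u(b) and cov(a, b) are hypothetical stand-ins for the outputs of the WTLSC fit.

% Analytical extraction of u^2(A0), u^2(lambda) and cov(A0,lambda) from the
% fitted straight line S = a*P + b using equations (15)-(19); hypothetical inputs.
a  = 8.8e-18;   b  = 1.9601e-5;        % fitted slope and intercept
ua = 2.0e-19;   ub = 2.0e-10;          % standard uncertainties u(a) and u(b)
cab = -1.5e-29;                        % covariance cov(a,b)
A0  = b;                               % equation (15)
lam = a/b;                             % equation (16)
u2A0  = ub^2;                          % equation (17)
u2lam = ua^2/b^2 + a^2*ub^2/b^4 - 2*a*cab/b^3;           % equation (18)
covA0lam = (ua^2 - lam^2*u2A0 - A0^2*u2lam)/(2*A0*lam);  % equation (19)
% Cross-check: linearizing lambda = a/b directly gives the equivalent form
covA0lam_check = cab/b - a*ub^2/b^2;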

4 Numerical simulations

Numerical simulations are performed using the summary of statistical data reported in Table 1, which is based on an earlier full M = 1000 Monte Carlo simulation reported in detail in Ramnath [16]. In this table the columns are the values of Pk, u(Pk), Sk, u(Sk) and ρ(Pk, Sk) for k = 1, …, 10, where ρ(Pk, Sk) is the dimensionless correlation that is defined in terms of the covariance cov(Pk, Sk) by the formula

(20) ρ(Pk, Sk) = cov(Pk, Sk) / [u(Pk) u(Sk)]

Results are obtained by running the Octave/Matlab implementation of the WTLSC algorithm reported in Figure 1 with the script in Figure 2, and are summarized in Table 2. All of the results reported in this table were directly obtained from an Octave/Matlab program, with the exception of the dimensionless correlation coefficient ρ(A0, λ), as the available number of Monte Carlo simulation events at M = 1000 was too small to obtain meaningful statistical results with the use of equation (11).

Instead the correlation was estimated by saving the processed data from the Octave/Matlab program as a txt file and fitting a Gaussian copula which is defined in terms of the covariance using the following RStudio computer code [19]:

library(copula)
library(VineCopula)
# Read the two-column Monte Carlo output saved by the Octave/Matlab program
xydata <- read.table("C:/mydata.txt")
x1 <- xydata[, 1]
x2 <- xydata[, 2]
# Empirical cumulative distribution functions used to transform the samples to pseudo-observations
fncF1 <- ecdf(x1)
fncF2 <- ecdf(x2)
u1 <- fncF1(x1)
u2 <- fncF2(x2)
# Fit a Gaussian copula (familyset = 1) and extract its correlation parameter
C21 <- BiCopSelect(u1, u2, familyset = c(1))
rho <- C21$par

This approach was utilized since the calculation of a covariance/correlation is mathematically equivalent to the assumption of a Gaussian bivariate copula to model the coupling effect between two random variables, and the VineCopula statistical package [20] uses more advanced optimization algorithms to estimate the parameters of such a covariance/correlation. In general, if a large number of Monte Carlo simulation events, for example M = 10000, is available then equation (11) can be used directly, and this observation illustrates the essential need for a large number of simulations to adequately apply a GLS regression in order to obtain a statistically meaningful result, particularly when only a relatively small number of scattered data-points from a small number of Monte Carlo simulations is available for data analysis, as shown in Figure 3.
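
When a sufficiently large Monte Carlo sample is available, the same correlation can be estimated directly in Octave/Matlab without fitting a copula; the sketch below assumes the same two-column mydata.txt file used by the R code.

% Direct sample estimate of the correlation coefficient for a large sample
D   = load('C:/mydata.txt');      % columns of sampled A0 and lambda values
R   = corrcoef(D(:,1), D(:,2));   % 2 x 2 sample correlation matrix
rho = R(1,2)                      % dimensionless correlation rho(A0,lambda)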

It may be observed that in the case of a bivariate PDF the joint PDF is a three-dimensional surface, and as a result it becomes difficult to visualize the three PDFs gOLS(x), gWLS(x) and gGLS(x) corresponding to the OLS, WLS and GLS regressions, as these functions overlap and intersect with one another. As a result, in order to simplify the interpretation of the numerical results, level curves of the PDFs are extracted as shown in Figure 4 so that the relative shape and sharpness of the PDFs may be meaningfully compared later in the paper.

Due to the complexity of the underlying pressure balance equations, a large number of Monte Carlo simulations, such as M = 10000, is only feasible if the equations are implemented in an appropriate computer programming environment such as a Visual Basic for Applications (VBA) script within an Excel spreadsheet or a Matlab script, and this may present onerous technical computing demands on a pressure metrologist in an industrial calibration laboratory.

The results from Table 2 are visualized in Figure 5, from which it is confirmed that an OLS regression gives the lowest accuracy and a WLS regression a medium accuracy. The relative errors for the zero-pressure area A0, the distortion coefficient λ, the correlation ρ(A0, λ), the distortion coefficient uncertainty u(λ), and the zero-pressure area uncertainty u(A0) may be defined as

(21) e(Q, T) = |Q_T − Q_GLS| / |Q_GLS|

where Q represents a selection of one of these quantities and T denotes a selection of a technique (OLS, WLS or WTLSC).

Considering the relative errors for the WTLSC regression when compared to the predictions from the GLS regression, which is considered the most accurate technique, it is observed that the relative errors in all five quantities are small for the representative pressure balance data that was analysed (see Table 2 and Fig. 5). As a result it may be concluded that a WTLSC algorithm gives essentially the same quality of results as a full GLS Monte Carlo simulation if estimates are available for the correlation terms ρ(Pi, Si) for the data points i = 1, …, n.

The key benefit of utilizing a WTLSC algorithm is that it is a relatively straightforward implementation, using an Octave/Matlab function file FitWTLSC of about 20 lines of computer code as outlined in Figure 2, and it does not require extensive or advanced programming skills for a pressure metrologist to implement. Limitations of the method are nevertheless present, in that estimates for the correlation terms ρ(Pi, Si), i = 1, …, n are technically necessary to utilize the WTLSC algorithm.

This issue may be addressed in one of two practical ways, namely (i) setting zero correlations for all of the cross-float data-points, such that ρ(Pi, Si) = 0 for i = 1, …, n, for which the WTLSC algorithm will reproduce the results of a WLS algorithm as a conservative compromise, or (ii) performing a small number of, say, 50 or 100 repeat measurements in an Excel spreadsheet, which is feasible without the need for any specialist VBA scripting within the spreadsheet, together with appropriate expert physical judgement, to generate (Pi, Si) data and fit a Gaussian copula with the 11 lines of RStudio code previously discussed in order to obtain an approximate estimate of the correlation.
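
A sketch of option (i) is given below. The call signature of FitWTLSC shown here is assumed purely for illustration, since the actual interface is the one defined in FitWTLSC.m of Figure 1, and the uncertainty values are hypothetical placeholders.

% Option (i): conservative WLS-type use of the WTLSC routine with zero correlations;
% the FitWTLSC call signature below is hypothetical and the data values are placeholders.
n   = 10;
P   = (50:50:500)'*1e6;                 % pressures in Pa as in Table 1
uP  = 1e-6*P;                           % hypothetical standard uncertainties u(P_i)
S   = 1.9601e-5*(1 + 4.5e-13*P);        % hypothetical areas consistent with S = A0*(1 + lambda*P)
uS  = 1e-10*ones(n,1);                  % hypothetical standard uncertainties u(S_i)
rho = zeros(n,1);                       % set rho(P_i, S_i) = 0 for i = 1,...,n
[a, b, ua, ub, cab] = FitWTLSC(P, uP, S, uS, rho);   % assumed signature; see Figure 1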

Referring to the results in Figure 3, it is observed that the PDFs of the OLS, WLS and GLS regressions may be compared by examining the behaviour of the PDFs along the semi-major and semi-minor axes of the hyper-ellipsoid of the confidence region. When this analysis was performed, as visualized in Figure 4, it was observed that the WLS regression performed relatively well when compared to the GLS regression. As a result it may be concluded that the potential absence of a covariance matrix is not excessively detrimental.

Table 1

Statistical data for pressure balance cross-floating measurements for pressures Pk and areas Sk in approximately equal steps with k = 1, …, 10 from P1 = 50 MPa through to P10 = 500 MPa. The above data is sufficient to use as an input for the WTLSC algorithm and produces essentially the same quality of results as a full GLS Monte Carlo simulation. If the dimensionless correlation term ρ(Pk, Sk) is unknown or cannot be otherwise estimated, it may be set to zero and the WTLSC algorithm will equivalently reproduce the results from a WLS algorithm, which provides conservative estimates for the corresponding parameter expected values and associated variances/covariance. It should be noted that a zero correlation of the input/output data will in general still produce a non-zero correlation in the parameters to account for and model the coupling effect between the parameter random variables.

Fig. 1

Gnu Octave/Matlab function file to implement weighted total least squares with correlation algorithm. The x input and its standard uncertainty u(x) have the same physical units, and similarly the y output and its standard uncertainty u(y) also have the same physical units, whilst the correlation ρ(x, y) = cov(x, y)/(u(x) × u(y)) is dimensionless.

Fig. 2

Gnu Octave/Matlab script file using the function file FitWTLSC.m to calculate the pressure balance parameter values and associated variance and covariance information in a typical pressure laboratory calibration certificate for a client pressure balance instrument. The data in the loaded file is from Table 1 and may conveniently be extracted from an Excel spreadsheet of measurement results and saved as a txt file.

Table 2

Final results and comparison of numerical experiments using the WTLSC, OLS, WLS, and GLS methods.

Fig. 3

Comparison of ordinary least squares (OLS), weighted least squares (WLS) and generalized least squares (GLS) regression analysis from Monte Carlo data. Hyper-ellipsoids are generated to indicate the corresponding 95% confidence regions (CRs) and reveal that the highest accuracy scheme is the GLS, which has the smallest CR corresponding to a high degree of certainty in the range of possible values, followed by the WLS, which has a slightly lower accuracy than the GLS, and finally the OLS, which has the lowest accuracy as its CR is unduly large.

Fig. 4

Comparison of joint PDFs from OLS, WLS and GLS regressions. The confidence region may be visualized as a three dimensional volume underneath the envelope covered by the joint PDF and bounded by the hyper-ellipsoids, and indicates that the joint PDFs may fit inside one another. Due to the difficulty of visualizing three different joint PDFs intersecting with each other simultaneously, cuts along the major axis (orange line in the previous figure) and minor axis (magenta line in the previous figure) of the ellipsoidal confidence regions are taken, which produce level curves that more easily visualize the shape, orientation and size of the different OLS, WLS and GLS regression schemes. From these level curves it is observed that, along the ellipsoid major axis, the GLS regression (green curve) has the narrowest/sharpest curve, which corresponds to the highest confidence or accuracy in the estimates of the parameters, followed by the WLS regression (blue curve), which has a medium spread/sharpness, and finally the OLS regression (red curve), which is relatively wide and flat and corresponds to less certainty/accuracy of the parameters in the regression analysis. A broadly similar behaviour is observed for level curves cut along the minor ellipsoid axis, and as a result it is concluded that an OLS regression has the least accuracy, a WLS regression has a medium accuracy, and a GLS regression has the highest accuracy. From the above graphs it may be concluded that a straight line regression for pressure metrology work should ideally utilize a GLS regression (green line); however, if no covariance information is available then a WLS regression (blue line) can give a reasonable estimate, particularly for industrial pressure metrology laboratories.

Fig. 5

Comparison of relative errors of the quantities predicted by the OLS, WLS and WTLSC regression approaches relative to the predictions from a GLS regression approach. These graphs indicate that the relative error between a WTLSC regression and a GLS regression is relatively insignificant, and as a result a pressure metrologist, instead of applying a more complicated GLS regression, can simply apply a WTLSC regression with the aid of the methodology developed in this paper.

5 Conclusions

Based on the research reported in this paper, the following conclusions may be drawn:

  • Numerical simulations have been performed that demonstrate that a weighted total least squares with correlation (WTLSC) algorithm gives essentially equivalent results for straight line parameters as a more advanced generalized least squares (GLS) algorithm with additional Monte Carlo simulations for parameter uncertainty estimation

  • A method has been developed for the WTLSC algorithm that is able to analytically extract the variance/covariance information of the straight line parameters without the need for additional and more advanced Monte Carlo simulations

6 Influences and implications

Based on the research reported in this paper, the following influences and implications are noted:

  • Pressure metrologists working in industrial calibration laboratories may now utilize the WTLSC algorithm to conveniently and easily quantify the expected values and variances/covariance of the zero-pressure area A0 and distortion coefficient λ

  • A computer program FitWTLSC.m is now freely and publicly available to conveniently implement the WTLSC algorithm in Gnu Octave or Matlab for calculating the straight line parameters and their associated variance/covariance information and may now also be used by other metrologists from different fields

This work was performed with funds provided by the Department of Higher Education, Science and Technology (DHEST) on behalf of the South African government for research by public universities.

Appendix A Hyper-ellipsoidal bivariate confidence region

Let random variables x1, …, xM ∈ ℝ^N be grouped as Ω = [x1 | … | xM], where xj, j = 1, …, M are N × 1 vectors, Ω is an N × M matrix, and M is a suitably large positive integer number of Monte Carlo simulation events. Then the expected value μ, an N × 1 vector, and the covariance matrix U, an N × N matrix, may be estimated following the GUM Supplement 2 [17] as μ = (1/M) Σ_{j=1}^{M} xj and U = [1/(M − 1)] Σ_{j=1}^{M} (xj − μ)(xj − μ)^T. Under these circumstances the joint PDF, unless higher order statistical information over and above the expected value and covariance is provided, is then assumed to follow a multivariate Gaussian distribution g(x) ~ N(μ, U).

The corresponding joint PDF g(x) has level sets that consist of hyper-ellipsoids within the ℝ^N dimensional space, centred at μ and with an orientation determined from the characteristics of U, as expressed in equations (A1)–(A3).

In the special case of bivariate data, where N = 2, x = (x1, x2)^T and μ = (μ1, μ2)^T, the hyper-ellipsoid will lie in the x1x2-plane within a two-dimensional space, and the joint PDF will be

(A4) g(x1, x2) = [1/(2πσ1σ2√(1 − ρ²))] exp{−[1/(2(1 − ρ²))][(x1 − μ1)²/σ1² − 2ρ(x1 − μ1)(x2 − μ2)/(σ1σ2) + (x2 − μ2)²/σ2²]}

For the above equation the terms are σ1 = u(x1), σ2 = u(x2) and ρ = cov(x1, x2)/(σ1σ2), since the covariance matrix may equivalently be written as U = [σ1², ρσ1σ2; ρσ1σ2, σ2²]. According to the GUM Supplement 2 [17] the 100p% coverage region is a hyper-ellipsoid of the form

(A5) (x − μ)^T U^{-1} (x − μ) ≤ kp²

where kp is the coverage factor corresponding to the coverage probability p.

An earlier analysis by Ramnath [21] determined that the ellipsoid semi-axis values a1 and a2 could be approximately determined by forming the matrix given in equation (A6).

Once this matrix was formed, the two eigenvectors v1 and v2 along with the corresponding eigenvalues λ1 and λ2 were calculated from the eigenvalue equation (A7).

The ellipse semi-axis values were then calculated from the eigenvalues as given in equations (A8) and (A9), and the eigenvectors v1 and v2 then define the orientations of the ellipsoid principal axes, following the analysis by Lee [22]. The disadvantage of this approach is that an appropriate software routine for determining eigenvectors and eigenvalues is necessary; an advantage, however, is that it generalizes to higher dimensions.
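
A short Octave/Matlab sketch of this eigenvalue/eigenvector construction is given below; the covariance matrix and the scaling of the semi-axes by the coverage factor are assumed for illustration and may differ in detail from the formulation of [21].

% Hypothetical sketch of the eigenvalue approach to the coverage ellipse
U  = [4.0e-20 -7.7e-25; -7.7e-25 1.0e-28];   % assumed 2x2 covariance matrix of (A0, lambda)
p  = 0.95;
kp = sqrt(-2*log(1 - p));                    % coverage factor, equal to sqrt(chi2inv(p,2)) for 2 dof
[V, L] = eig(U);                             % columns of V are the eigenvectors v1, v2
a1 = kp*sqrt(L(1,1));                        % semi-axis along v1
a2 = kp*sqrt(L(2,2));                        % semi-axis along v2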

An alternative direct algebraic approach is to substitute the statistical information μ and U directly into equation (A5) and expand it out, so that the coverage region boundary takes the form of the general conic given in equation (A10).

The above equation is in the form of the generalized ellipse equation, as discussed by Weisstein [23], that is centred at xo and yo, with semi-major and semi-minor axes a and b, and rotated by an angle θ (equation (A11)).

The algebraic coefficients of this conic may now be expressed in terms of the statistical terms μ1, μ2, σ1, σ2 and ρ as given in equations (A12)–(A17).

The corresponding ellipse geometry parameters may then be algebraically calculated as given in equations (A18)–(A20),

and the ellipse rotation angle θ is given by equation (A21).

The hyper-ellipsoid may then be conveniently plotted using the parametrization 0 ≤ t ≤ 2π, so that the initial boundary points xi and yi are

(A22) xi = a cos t

(A23) yi = b sin t

Next the boundary points are rotated by the angle θ using a rotation matrix to give xr and yr such that

(A24) [xr; yr] = [cos θ, −sin θ; sin θ, cos θ] [xi; yi]

Finally the points are translated as

(A25) x = xr + xo

(A26) y = yr + yo
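
The parametrize, rotate and translate construction of equations (A22)–(A26) may be sketched in a few lines of Octave/Matlab; the ellipse geometry values below are placeholders.

% Sketch of the ellipse boundary construction; a, b, theta, xo and yo are placeholders.
a = 2.0;  b = 0.5;                      % semi-major and semi-minor axes
theta = pi/6;                           % rotation angle
xo = 1.0;  yo = -0.5;                   % centre coordinates
t  = linspace(0, 2*pi, 200);
xi = a*cos(t);   yi = b*sin(t);         % equations (A22) and (A23)
R  = [cos(theta) -sin(theta); sin(theta) cos(theta)];
pr = R*[xi; yi];                        % equation (A24): rotate the boundary points
x  = pr(1,:) + xo;   y = pr(2,:) + yo;  % equations (A25) and (A26): translate
plot(x, y);  axis equal;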

This algorithm is used to produce the hyper-ellipsoids in Figure 3 and may be used to determine the points along the ellipse major and minor axes. As a result, if these lines are used, a "cut" through the joint PDF may be made in order to study how the level curves of the different OLS, WLS, and GLS regressions behave, since it is too difficult to easily visualize the three-dimensional joint PDFs in Figure 4 even with transparency effects. Once the lines are determined, the x1 and x2 points along them are known from a simple straight line fit, so these x1 and x2 points may then be substituted into the bivariate joint PDF of equation (A4), which produces a density value. The set of points (x1, x2, g(x1, x2)) may then be plotted as a three-dimensional curve in order to visualize the shape and sharpness of the corresponding level curve of the PDF. This algebraic approach is only suitable for bivariate simulations and becomes impractical in higher dimensions, for which the eigenvector/eigenvalue approach is then necessary.

Appendix B Simplified worked example

In order to illustrate the method, suppose that the experimental data has n = 5 points of generated pressures from 100 MPa through to 500 MPa in 100 MPa steps, together with the corresponding cross-floated areas, which gives an experimental control variable x = [x1, x2, x3, x4, x5] and an experimental observed variable y = [y1, y2, y3, y4, y5]. The basis equation X C = Y + ε may then be written down with the experimental data (for one particular Monte Carlo event) as follows.

Let X = [1 x1; 1 x2; 1 x3; 1 x4; 1 x5], Y = [y1; y2; y3; y4; y5], and C = [A0; b], where b = A0λ.

Considering an ordinary least squares (OLS) regression, the estimate for the matrix V is just the 5 × 5 identity matrix I, so that W = V^{-1} = I.

Implementing the formula C = (X^T W X)^{-1}(X^T W Y) in Matlab or GNU Octave with the above information then gives the OLS parameter estimates.

Next, considering a WLS regression, it is known that Vi,i ∝ u2(Si), so using the data from Table 1 let u(S1) = 0.0000990361, u(S2) = 0.0001110578, u(S3) = 0.0001766929, u(S4) = 0.0002740451 and u(S5) = 0.0004087028, and then simply let V = diag(u2(S1), u2(S2), u2(S3), u2(S4), u2(S5)).

Using Matlab/Octave to implement the matrix operations then produces the corresponding WLS parameter estimates.

The above process is repeated for all of the experimental data points and then the results can be post-processed in order to obtain the statistical characteristics for the OLS, WLS and GLS regressions as shown in Figure 3.
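
A numerical sketch of this n = 5 worked example is given below; the pressure and area values are hypothetical, while the u(S_i) values are those quoted in the text above.

% Sketch of the n = 5 worked example for the OLS and WLS regressions;
% the pressures and areas are hypothetical, the u(S_i) values are those quoted above.
P  = [100 200 300 400 500]'*1e6;                           % generated pressures in Pa
S  = [1.96010 1.96019 1.96028 1.96037 1.96046]'*1e-5;      % hypothetical areas in m^2
X  = [ones(5,1) P];   Y = S;                               % matrix equation X*C = Y + eps
% OLS: V is the identity matrix, so W = inv(V) = I
W_ols = eye(5);
C_ols = (X'*W_ols*X)\(X'*W_ols*Y);                         % C = [A0; b] with b = A0*lambda
% WLS: V(i,i) proportional to u^2(S_i) using the quoted uncertainties
uS    = [0.0000990361 0.0001110578 0.0001766929 0.0002740451 0.0004087028]';
W_wls = diag(1./uS.^2);                                    % W = inv(V)
C_wls = (X'*W_wls*X)\(X'*W_wls*Y);
lambda_wls = C_wls(2)/C_wls(1)                             % distortion coefficient lambda = b/A0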

Actual physical data is first obtained from the experimental measurements in the laboratory, and this data is used as input into the instrument models to produce the data X and Y in this example, where the computer program outlines the entire logical calculation process. As an example, the piston temperature may have been measured as, say, 20.4 °C with a standard uncertainty of 0.1 °C with an assumed Gaussian distribution; random temperatures consistent with the physical experimental uncertainty are then sampled and used as inputs into the physical model. This is due to the fact that modern measurement science has generally adopted a Bayesian statistical framework for measurement uncertainty, and as a result no physical measurement is deterministic, since all physical measurements, with the partial exception of certain physical constants, have an underlying probability distribution, i.e. any physical measurement is considered a statistical sample from that random variable's probability density function. Once sufficient samples have been generated and used as inputs/outputs of the measurement model, the OLS, WLS or GLS regressions may be implemented as outlined in Figure B1.

Fig. B1

Gnu Octave/Matlab function file to implement OLS, WLS and GLS regression analysis.

The general steps to perform the regression analysis are therefore as follows:

  • Obtain laboratory nominal experimental measurement data x that has an associated joint PDF gX(x)

  • Sample a random variable ξ from the joint PDF gX(x)

  • Solve the associated measurement equation h(η, ξ) = 0 to obtain the random variable η associated with ξ, and repeat this several times to produce the Monte Carlo data Ω

  • Step through the data inputs ξ and data outputs η, i.e. the inputs x and outputs y, and substitute these into the regression equation C = (X^T W X)^{-1}(X^T W Y)

  • Solve the regression equation for C and then post-process all the results to work out the statistical characteristics of A0 and b namely u2(A0), u2(b) and cov(A0, b)

  • Simply apply the new algebraic formulae of equations (15)–(19) to work out u2(A0) = u2(b), u2(λ), and cov(A0, λ)

The above computer code routines for performing the statistical regressions may be observed to be relatively lengthy and complicated, and may be compared with the WTLSC implementation, which is simply the 20 lines of Matlab/Octave code in Figure 2. In this approach there is no need to perform any Monte Carlo simulations, and the actual physical experimental data, in this simplified problem the n = 5 data points, may be used directly as input into the Octave/Matlab program FitWTLSC.m to directly extract the uncertainty information of the parameters A0 and λ, i.e. u2(A0), u2(λ) and cov(A0, λ).

References

  1. BIPM, IEC, IFCC, ILAC, ISO, IUPAP and OIML, Evaluation of measurement data − Guide to the expression of uncertainty in measurement, tech. rep., JCGM/WG1 GUM, 2008. Revised 1st edition − https://www.bipm.org/en/publications/guides/
  2. R.S. Dadson, S.L. Lewis, G.N. Peggs, The Pressure Balance: Theory and Practice (HMSO, London, 1982) ISBN 0114800480
  3. P. Saunders, Propagation of uncertainty for non-linear calibration equations with an application in radiation thermometry, Metrologia 40, 93–101 (2003)
  4. S. Van Huffel, J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis (SIAM, 1987)
  5. D. York, Least squares fitting of a straight line with correlated errors, Earth Planet. Sci. Lett. 5, 320–324 (1968)
  6. W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, 2007), 3rd edn
  7. R.L. Burden, J.D. Faires, Numerical Analysis (Brookes/Cole, 2001), 7th edn
  8. W.C. Duer, P.J. Ogren, A. Meetze, C.J. Kitchen, R.V. Lindern, D.C. Yaworsky, C. Boden, J.A. Gayer, Comparison of ordinary, weighted, and generalized least-squares straight-line calibrations for LC-MS/MS, GC-MS, HPLC, GC, and enzymatic assay, J. Anal. Toxicol. 32, 329–338 (2008)
  9. M. Krystek, M. Anton, A weighted total least-squares algorithm for fitting a straight line, Measur. Sci. Technol. 18, 3438–3442 (2007)
  10. M. Krystek, M. Anton, A least-squares algorithm for fitting data points with mutually correlated coordinates to a straight line, Measur. Sci. Technol. 22, 035101 (2011)
  11. P.M. Harris, C.E. Matthews, M.G. Cox, Summarizing the output of a Monte Carlo method for uncertainty evaluation, Metrologia 51, 243–252 (2014)
  12. Mathworks, fgls − Feasible generalized least squares, 2020. https://uk.mathworks.com/help/econ/fgls.html
  13. S. Miller, R. Startz, Feasible generalized least squares using support vector machines, Econ. Lett. 175, 28–31 (2019)
  14. C. Wuethrich, S. Souiyam, Monte Carlo determination of the uncertainty of effective area and deformation coefficient for a piston cylinder unit, in 24th IMEKO TC-3, 14th TC-5, 6th TC-16 and 5th TC-22 International Conference, edited by A. Salceanu, D. Agrez, J. Saliga, M. Savino (Cavtat-Dubrovnik, Croatia: IMEKO, 2020), pp. 1–6
  15. P. Otal, C. Yardin, Modelling methods for pressure balance calibration, Measur. Sci. Technol. 31, 034004 (2019)
  16. V. Ramnath, Numerical analysis of the accuracy of bivariate quantile distributions utilizing copulas compared to the GUM supplement 2 for oil pressure balance uncertainties, Int. J. Metrol. Qual. Eng. 8, 29 (2017)
  17. BIPM, IEC, IFCC, ILAC, ISO, IUPAP and OIML, Evaluation of measurement data − Supplement 2 to the Guide to the expression of uncertainty in measurement − Propagation of distributions using a Monte Carlo method, tech. rep., JCGM/WG1 GUM Supplement 2, 2011. 1st edition − https://www.bipm.org/en/publications/guides/
  18. V. Ramnath, Determination of pressure balance distortion coefficient and zero-pressure effective area uncertainties, Int. J. Metrol. Qual. Eng. 2, 101–119 (2011)
  19. CRAN, The R Project for Statistical Computing, 2020. https://www.r-project.org/about.html
  20. CRAN, VineCopula: Statistical Inference of Vine Copulas, 2020. https://cran.r-project.org/web/packages/VineCopula/index.html
  21. V. Ramnath, Analysis and comparison of hyper-ellipsoidal and smallest coverage regions for multivariate Monte Carlo measurement uncertainty analysis simulation datasets, MAPAN-J. Metrol. Soc. India 1–16 (2019)
  22. S.-N. Lee, M.-H. Shih, A volume problem for an n-dimensional ellipsoid intersecting with a hyperplane, Linear Algebra Appl. 132, 90–102 (1990)
  23. E.W. Weisstein, Ellipse − a Wolfram web resource (2020). https://mathworld.wolfram.com/Ellipse.html

Cite this article as: Vishal Ramnath, Comparison of straight line curve fit approaches for determining parameter variances and covariances, Int. J. Metrol. Qual. Eng. 11, 14 (2020)
