Issue 
Int. J. Metrol. Qual. Eng.
Volume 11, 2020



Article Number  14  
Number of page(s)  16  
DOI  https://doi.org/10.1051/ijmqe/2020011  
Published online  19 November 2020 
Research Article
Comparison of straight line curve fit approaches for determining parameter variances and covariances
Department of Mechanical and Industrial Engineering, University of South Africa, Private Bag X6, Florida 1710, South Africa
^{*} Corresponding author: ramnav@unisa.ac.za
Received:
24
August
2020
Accepted:
23
October
2020
Pressure balances are known to have a linear straight line equation of the form y = ax + b that relates the applied pressure x to the effective area y, and recent work has investigated the use of Ordinary Least Squares (OLS), Weighted Least Squares (WLS), and Generalized Least Squares (GLS) regression schemes in order to quantify the expected values of the zeropressure area A_{0} = b and distortion coefficient λ = a/b in pressure balance models of the form y = A_{0}(1 + λx). The limitations with conventional OLS, WLS and GLS approaches is that whilst they may be used to quantify the uncertainties u(a) and u(b) and the covariance cov(a, b), it is technically challenging to analytically quantify the covariance term cov(A_{0}, λ) without additional Monte Carlo simulations. In this paper, we revisit an earlier Weighted Total Least Squares with Correlation (WTLSC) algorithm to determine the variances u^{2}(a) and u^{2}(b) along with the covariance cov(a, b), and develop a simple analytical approach to directly infer the corresponding covariance cov(A_{0}, λ) for pressure metrology uncertainty analysis work. Results are compared to OLS, WLS and GLS approaches and indicate that the WTLSC approach may be preferable as it avoids the need for Monte Carlo simulations and additional numerical postprocessing to fit and quantify the covariance term, and is thus simpler and more suitable for industrial metrology pressure calibration laboratories. Novel aspects is that a Gnu Octave/Matlab program for easily implementing the WTLSC algorithm to calculate parameter expected values, variances and covariances is also supplied and reported.
Key words: Ordinary least squares (OLS) / weighted least squares (WLS) / generalized least squares (GLS) / weighted total least squares with correlation (WTLSC) / pressure measurement
© V. Ramnath, published by EDP Sciences, 2020
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
1.1 Research motivation
The use of least squares based statistical regression analysis is a widely used tool to analyse experimental data when estimating and fitting model parameter values and uncertainties. In the particular case of straight line equations of the form y = ax + b known input data x_{i} and output data y_{i} for datapoints i = 1, … , n inclusive of possible available uncertainty information are used in the regression analysis where standard uncertainty analysis theory from the GUM [1] specifies the model uncertainty as(1) (2)
In the above formula since x is a statistically independent random variable it follows that cov(x, a)= 0, cov(x, b) = 0, and in general cov(a, b) ≠ 0. As a result, the probability density function (PDF) of the output g_{y}(η_{y}) may be determined where η_{y} is a univariate random variable of the model output y, g_{x}(ξ_{x}) is a univariate random variable of the model input x, and g_{a,b}(ξ_{a}, ξ_{b}) is a bivariate random variable of the joint PDF of the model parameters a and b that incorporates the coupling effect between random variables ξ_{a} and ξ_{b} for the model parameters a and b that are modelled in terms of cov(a, b).
The research problem that occurs when OLS, WLS and GLS regressions are used to estimate the parameters a and b to varying degrees of accuracies, is that additional Monte Carlo simulations must then be performed in order to determine the variances u^{2}(a) and u^{2}(b) along with the covariance cov(a, b), and that this information must then be further processed in order to infer u^{2}(A_{0}), u^{2}(λ) and cov(A_{0}, λ). Consequently, in the absence of covariance information the model uncertainty will be systematically underestimated if the covariance is unknown or unavailable and only the variances of the model parameters are utilized in uncertainty calculations.
1.2 Research focus
This paper focuses on demonstrating the conceptual ease and functionality of the WTLSC approach and how it is able to incorporate measurement variance information u^{2}(x_{i}) and u^{2}(y_{i}) along with covariance information cov(x_{i}, y_{i}) of the data when available, in order to determine both the variances u^{2}(a) and u^{2}(b) along with the covariance information cov(a, b) of the model parameters for statistical completeness. A new simple analytical calculation approach to directly infer both the variances u^{2}(A_{0}) and u^{2}(λ) as well as covariance term cov(A_{0}, λ) from knowledge of u^{2}(a), u^{2}(b) and cov(a, b) as obtained from the WTLSC algorithm implemented with a Gnu Octave/Matlab program is presented that does not rely on additional Monte Carlo simulations or statistical postprocessing.
2 Literature review
2.1 Physical theory of pressure balances
Pressure balance theory from Dadson [2] for a crossfloated pressure balance in equilibrium may be used to define the applied pressure P in terms of a generalized force F and an effective area S as where F is a generalized force term to account for buoyancy effects, and S is a crosssectional effective area that accounts for pressure induced deformation and temperature expansion/contraction effects such that where S = A_{0}(1 + λP)φ, A_{0} is a effective area when extrapolated to a zero pressure, λ is a distortion coefficient based on linear elasticity theory, and φ is an appropriate temperature compensation function that adjusts the area to an appropriate reference temperature.
In general since there is a pressure dependence within the effective area term S the equation is usually rearranged as a homogeneous equation and then solved for specified inputs to obtain a value of P such that h = 0. In the event that the pressure P is known and the area S must be determined the equation is rearranged as and again rearranged as another homogeneous equation in order to determine the value of S that solves the homogeneous equation for specified inputs.
In the above equation m is the mass on top of the piston, ρ_{a} is the ambient fluid density in contact with the piston to account for buoyancy forces acting on the piston, ρ_{m} is the mass density, g is the gravitational acceleration constant, A_{0} is the zeropressure area, λ is the distortion coefficient, P is the applied pressure, and φ = 1 + α(t − t_{0}) is a temperature compensation function to adjust the crosssectional area to a reference temperature t_{0} where α = α_{p} + α_{c} is the linear thermal expansion coefficient of the combination of the piston and cylinder where α_{p} and α_{c} are the corresponding linear thermal expansion coefficients of the piston and cylinder respectively.
When the above system of equations is implemented the equation for the crossfloated area S may then be simply modelled as(3)
In the above equation the inputs are the expected values and standard uncertainties of the applied pressures (μ(P_{i}) ± u(P_{i})), the crossfloated areas (μ(S_{i}) ± u(S_{i})), and estimates of the covariances cov(P_{i}, S_{i}) for a sequence of measurements i = 1, … , n where n is the number of measurements, whilst the unknowns are considered to be the expected values of the zeropressure area A_{0}, the distortion coefficient λ, and the covariance cov(A_{0}, λ). This modelling approach is equivalent to fitting a straight line of the form y = ax + b from experimental data, and utilizing a least squares regression to quantify the values and uncertainties of the parameters a and b.
2.2 Historical and mathematical context
The fitting of straight lines is a common numerical task to fields such as science, engineering, and economics that require experimental or computational data to be modelled and fitted. Usually a leastsquares algorithm is implemented to minimize a merit function as discussed by Saunders [3] where in general a nonlinear calibration equation y = y(x ; a_{1}, … , a_{N}) is desired to be fitted from measurement pairs of controlled values x_{i} and observed values y_{i}. In the case where the number of measurements is M and the number of parameters to determine is N then a least squares solution is mathematically essential in the general case where M > N. For this situation a merit function is constructed as(4)and optimized in terms of the parameters a_{1}, … , a_{N} by simultaneously solving ∂χ^{2}/∂ a_{i} = 0 for i = 1, … , N where w_{i} is a suitable and appropriate weighting function to account for the underlying measurement accuracy and available information.
The simplest case involves an optimization where all of the data points have an assumed constant uncertainty/accuracy such that u(x_{i}) = u(y_{i}) = σ for i = 1, … , M measurements, which corresponds to the situation where no detailed or specific measurement uncertainty information is available, and as a result the optimization of the merit function does not require any specific knowledge of σ as this can be conveniently factored out when optimizing χ^{2}. This particular problem thus corresponds to the well known historical total least squares problem as investigated by Van Huffel and Vandervalle [4].
At a next conceptual level the problem of fitting a straight line with errors in both coordinates was originally formulated by York [5] and subsequently involved either attempts to directly solve the original problem posed by York or by various approximation methods in order to infer the variances of the fitting parameters by researchers working in the area of regression analysis of experimental measurements. For this problem the approach of Press et al. [6] has achieved the most notable recognition amongst many nonmathematicians for the fitting of a straight line where both x and y coordinates possess some underlying known statistical uncertainty.
In this particular case the model y(x) = a + bx involves a χ^{2} merit function optimization of the form(5)where u(x_{i}) and u(y_{i}) are the standard deviations of the x_{i} and y_{i} data points that are allowed to vary. Although conceptually straightforward the main challenge with directly solving this problem is that the presence of the parameter b in the denominator renders the partial derivative equation ∂χ^{2}/∂ b = 0 a nonlinear equation, and as a result this problem is considerably more difficult to solve than the earlier total least squares problem which is amenable to an analytical solution.
Consequently, the optimization problem for straight lines with errors in both coordinates is typically solved either with classical optimization techniques such as the LevenbergMarquardt method or with the Monte Carlo bootstrap technique as outlined by Press et al. [6], and as a result this solution approach is not generally directly accessible for many metrologists working in industrial laboratories due to the mathematical complexity of the method.
It should be noted that a conventional numerical solution of this nonlinear optimization problem using either classical techniques such as Newton's, steepest descent and homotopy methods as discussed by Burden and Faires [7] or newer methods such as the particle swarm optimization (PSO) approach would still suffer from two limitations, namely that it would firstly not be able to compute the parameter variances, and secondly that it would not be able to provide any insights into the parameter variances and covariances.
The statistical effects from the choice of OLS, WLS and GLS regressions was subsequently investigated by Duer et al. [8] who determined that whilst the specific choice of OLS, WLS or GLS regressions did not generally produce inconsistent estimates for the intercept b and slope a parameter values, that the different regression schemes nevertheless did in fact produce considerable variations in the corresponding estimates of the uncertainties u(a) and u(b) of the parameters a and b. As a result, Duer et al. concluded that a simplified OLS regression could only be safely utilized in the very specific and special case of where the uncertainties u(x_{k}) in the x_{k} data points was negligible.
In the particular case of scientific metrology work an inconsistent or incorrect estimate of a parameter value and variance/covariance could then have a catastrophic effect on high level laboratory intercomparison work or in safety critical experimental measurements as parameter uncertainties of equipment and instruments are typically utilized in many quality engineering systems to specify tolerances and safety limits for equipment and personnel.
Due to these specific limitations particularly for scientific measurement work at national laboratories the total least squares problem was subsequently revisited by Krystek and Anton [9] who simplified the original twodimensional problem into an equivalent onedimensional optimization which is amenable with a set of simplified mathematical tools. This initial approach was then later generalized to solve the full problem of errors in both coordinates with correlation effects in a later paper by Krystek and Anton [10]. This approach is now termed as a weighted total least squares with correlation (WTLSC) analysis approach and it makes an assumption that the covariance matrix U_{k} for individual experimental data points for k = 1, … , n are available prior to optimizing the merit function which is explicitly specified as (6)
Part of the reasoning as to why the WTLSC approach does not appear to have been more widely adopted is that in the specific area of scientific metrology that the calculation of the covariance matrix is extremely mathematically complicated when using standard calculus based approaches, and is therefore only tractable from a full Monte Carlo pure numerical simulation approach as discussed by Harris [11]. As a result, if the covariance matrix is unavailable then the WTLSC approach becomes problematic in which case the covariance must either be calculated or estimated, or alternately modified by setting the correlations to zero such that the algorithm implementation then simplifies to the earlier weighted total least squares WLS algorithm.
Whilst the absence of a covariance matrix may be viewed as technical limitation when implementing a GLS regression where the standard statistical formulation comprises of determining the value of the regression parameter β which optimizes the residuals ε in a linear regression model y = X β + ε, it has subsequently been determined by researchers that this is not an insurmountable challenge as mechanisms exist to infer a likely covariance matrix. For this problem the covariance matrix Cov[ε X] = Ω, is known to take the general form(7)
In order to account for the unavailability of the actual covariance matrix Ω which may be unknown, or impractical to estimate, one may nevertheless get a consistent estimate of Ω usually written as that is termed the feasible generalized least squares (FGLS) estimator using either iteration based commercial routines with software such as Matlab [12], or with newer support vector machines (SVMs) approaches that are currently under development such as recently reported results from Miller and Startz [13].
2.3 Recent statistical regression developments
Recently work reported in the literature by Wuethrich and Souiyam [14] building on earlier work by Otal and Yardin [15] has investigated the use of Monte Carlo based simulations for Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS) regression schemes in order to determine the parameter values of a pressure balance model of the form S = bP + a, where P is an input of applied pressure, S is an output of crossfloated area, and a and b are model parameters whose values and uncertainties must be determined. Conventionally the pressure balance model has the form S = A_{0}(1 + λP) where A_{0} is a zeropressure area and λ is a distortion coefficient and the regression analysis fits the values for a = A_{0} and the product b = A_{0} λ for i = 1, … , n datapoints such that(8) (9)
For the above regression scheme the distortion coefficient is recovered as λ = b/a after the residuals ε_{i} are minimized in the matrix equation X C = Y + ε. This minimization is implemented through the use of a coefficient matrix V that is used to incorporate the known information such that(10) (11)
In an OLS scheme no weighting is applied and a coefficient matrix V consists of just a diagonal matrix with unity on each of the diagonal elements such that V_{i,i} = 1, i = 1,…,n in a WLS scheme the coefficient matrix consists of a diagonal matrix where the elements on each of the diagonal elements are proportional to the uncertainties of the areas i.e. V_{i,i} ∝ u^{2}(S_{i}), and finally in a GLS scheme the diagonal elements of V equals the variances of the areas such that V_{i,i} = u^{2}(S_{i}) whilst the offdiagonal elements equal the covariances such that V_{i,j} = cov(S_{i}, S_{j}), i, j = 1, … , n, i ≠ j.
The general approach of Wuethrich and Souiyam [14] is to solve equation (9) with nominal values of the input quantities at specified pressures P_{i} for the equivalent homogeneous equation in order to determine corresponding values of S_{i} for a sequence of pressures P_{1}, … , P_{n} with i = 1, … , n, and to then use this information to determine the regression parameter C_{j} = [(A_{0})_{j}, b_{j}]^{T} from the matrix equation with an appropriate choice of weighting matrix V. This process is repeated M = 10000 times with slightly perturbed values of the inputs consistent with the underlying uncertainties to sequentially generate samples C_{j} for j = 1, … , M that then results in a sequence C_{1}, … , C_{M} of regression parameters, which are then postprocessed to calculate expected values, variances, and the covariance of the regression parameters A_{0} and b = A_{0} λ where the distortion coefficient is extracted as λ_{j} = b_{j}/(A_{0})_{j}. The pair of values ((A_{0})_{j}, λ_{j}) is then further postprocessed according to the GUM [1] for two sampled random variables q_{j} and r_{j} to calculate the covariance that is estimated as(12) (13) (14)
Investigations performed by Wuethrich and Souiyam [14] has concluded that a GLS regression produces high accuracy results, followed by the WLS regression which produces medium accuracy results, and finally by the OLS regression which produces low accuracy results. As a result the GLS approach may be considered the best overall approach however the key difficulty with implementing this approach is in the estimates of the covariances cov(S_{i}, S_{j}) that are required, which are either mathematically complex to determine or which must be numerically estimated with a large number of Monte Carlo simulations.
The key differences in the newer approach of Wuethrich and Souiyam [14] and an earlier approach of Ramnath [16] is that in the latter work a multivariate GUM Supplement 2 approach [17] was utilized that modelled the various inputs x by treating the two parameters A_{0} and λ as a single vector y = [A_{0}, λ]^{T} with an equation h(y, x) = 0 where a multivariate PDF g_{y}(η_{y}) of the output was to be determined from a multivariate input g_{x}(ξ_{x}) by numerically solving the equation h(η_{y}, ξ_{x}) = 0 for a large number M of Monte Carlo simulation events.
In this approach sampled multivariate values of the inputs consistent with the joint PDFs of the inputs x were used to solve the homogeneous vector equation corresponding to and a sequence of pressures P_{1}, … , P_{n} and areas S_{1}, … , S_{n} were generated that was then used with an OLS approach to calculate sampled values of ((A_{0})_{j}, λ_{j}) from the joint PDF g_{y}(η_{y}). The sampled values were then used to construct a bivariate copula that modelled u^{2}(A_{0}), u^{2}(λ) and cov(A_{0}, λ), however a limitation of this approach was that it required a large number of Monte Carlo simulations, was mathematically complex, and thus more suited to scientific metrology applications in national metrology institutes.
Differences in the two approaches is that firstly the method of Wuethrich and Souiyam [14] requires estimates for the uncertainties of the areas u^{2}(S_{1}), … , u^{2}(S_{n}) in the case of a WLS approach and additionally that of the covariances cov(S_{i}, S_{j}), 1i, jn for each Monte Carlo event j, and secondly makes an assumption of a Gaussian correlation between A_{0} and λ. On the other hand the method of Ramnath [16] firstly avoids the explicit need for calculating u^{2}(S_{i}) and cov(S_{i}, S_{j}) at each Monte Carlo event j as the data is considered as statistical samples, and secondly does not assume a Gaussian correlation between A_{0} and λ since bivariate copulas can model nonGaussian correlation effects.
It should be noted that the calculation of the covariances cov(S_{i}, S_{j}) that are required in a GLS regression whilst theoretically possible requires an additional matrix formulation solution as outlined in the GUM Supplement 2 [17], pp. 15–17. For an explicit matrix equation y = f(x) the covariance matrix of the output is and for an implicit matrix equation h(y, x) = 0 the covariance matrix of the output is solved from the matrix equation , where U_{x} is a covariance matrix of the input, and and are corresponding sensitivity or Jacobian matrices.
Due to the combination of algebraic and numerical illconditioning complexity that is present for determining a multivariate covariance U_{y}, this calculation is usually only tractable from a pure numerical based Monte Carlo simulation using an appropriate homogeneous equation for i = 1, … , n as previously discussed.
3 Mathematical modelling
We consider the standard case of a straight line equation y = ax + b with inputs x_{i}, outputs y_{i}, variances u^{2}(x_{i}) and u^{2}(y_{i}), and covariances cov(x_{i}, y_{i}) for measurements i = 1, … , n. This system was originally analysed by Krystek and Anton [10] who termed this system as a weighted total least squares with correlation system and developed a WTLSC algorithm that is able to calculate the variances u^{2}(a) and u^{2}(b) and the covariance cov(a, b). An earlier paper by Ramnath [18] utilized this scheme to quantify the variances u^{2}(A_{0}) and u^{2}(λ) but did not explicitly report on the covariance cov(A_{0}, λ) term.
Considering the straight line equation S = aP + b = A_{0}(1 + λ) it immediately follows that(15) (16)
From the above set of equations it follows that the variances are(17) (18)
Using a = A_{0} λ it follows that
Expanding out the partial derivatives in the above equation and solving for cov(A_{0}, λ) then yields (19)
4 Numerical simulations
Numerical simulations are performed using the summary of statistical data reported in Table 1 which is based on an earlier full M = 1000 Monte Carlo simulation reported in detail in Ramnath [16]. In this table the columns are the values of P_{k}, u(P_{k}), S_{k}, u(S_{k}) and ρ(P_{k}, S_{k}) for k = 1, … , 10, where ρ(P_{k}, S_{k}) is the dimensionless correlation that is defined in terms of the covariance cov(P_{k}, S_{k}) by the formula(20)
Results are obtained by running the Octave/Matlab implementation of the WTLSC algorithm reported in Figure 1 with the script in Figure 2 as summarized in Table 2. All of the results reported in this table were directly obtained from a Octave/Matlab program with the exception of the dimensionless correlation coefficient ρ(A_{0}, λ) as the available number of Monte Carlo simulation events at M = 1000 was too small to obtain meaningful statistical results with the use of equation (11).
Instead the correlation was estimated by saving the processed data from the Octave/Matlab program as a txt file and fitting a Gaussian copula which is defined in terms of the covariance using the following RStudio computer code [19]:
library(copula) library(VineCopula) xydata < read.table(''C:/mydata.txt'') x1 < xydata[, 1] x2 < xydata[, 2] fncF1 < ecdf(x1) fncF2 < ecdf(x2) u1 < fncF1(x1) u2 < fncF2(x2) C21 < BiCopSelect(u1, u2, familyset = c(1)) rho < C21$par
This trick was utilized since a calculation of a covariance/correlation is mathematically equivalent to the assumption of a Gaussian bivariate copula to model the coupling effect between two random variables, and the VineCopula statistical package [20] uses more advanced optimization algorithms to estimate the parameters for a covariance/correlation. In general if a large number of Monte Carlo simulation events for example M = 10000 is available then equation (11) can be used directly, and this observation illustrates the essential need for a large number of simulations to adequately apply a GLS regression in order to obtain a statistically meaningful result particularly when a relatively small number of scattered datapoints from a small number of Monte Carlo simulations is available for data analysis as shown in Figure 3.
It may be observed that in the case of a bivariate PDF that the joint PDF is then a threedimensional function and as a result it becomes difficult to visualize the three PDFs g_{OLS}(x), g_{WLS}(x) and g_{GLS}(x) corresponding to the OLS, WLS and GLS regressions as these function overlap and intersect with one another. As a result in order to simplify the interpretation of the numerical results level curves of the PDFs are extracted as shown in Figure 4 so that the relative shape and sharpness of the PDFs may be meaningfully compared later in the paper.
Due to the complexity of the underlying pressure balance equations a large number of Monte Carlo simulations such as say M = 10000 is only feasible if the equations are implemented in an appropriate computer programming environment such as a Visual Basic for Applications (VBA) script within a Excel spreadsheet or a Matlab script, and as a result may present onerous technical computing demands on a pressure metrologist in an industrial calibration laboratory.
The results from Table 2 are visualized in Figure 5 from which it is confirmed that a OLS regression gives the lowest accuracy and a WLS gives a medium accuracy. The relative errors for the zeropressure area , distortion coefficient , correlation , distortion coefficient uncertainty , and zeropressure uncertainty , may be defined as (21)where Q represents a selection of a quantity and T denotes a selection of a technique.
Considering the relative errors for the WTLSC regression when compared to predictions from the GLS which is considered as the most accurate technique, it is observed that the relative errors are , , , , and respectively for the representative pressure balance data that was analysed. As a result it may be concluded that a WTLSC algorithm gives essentially the same quality of a results as full GLS Monte Carlo simulation if there are available estimates for the correlation terms cov(P_{i}, S_{i}) for the data points i = 1, … , n.
The key benefits of utilizing a WTLSC algorithm is that it is a relatively straightforward implementation of using a Octave/Matlab function file FitWTLSC with about 20 lines of computer code as outlined in Figure 2, and does not require extensive or advanced programming skills for a pressure metrologist to implement. Limitations of the method are nevertheless present in that estimates for the correlation terms ρ(P_{i}, S_{i}), i = 1, … , n are technically necessary to utilize the WTLSC algorithm.
This issue may be addressed in one of two practical ways, namely (i) setting zero correlations for all of the crossfloat datapoints such that ρ(P_{i}, S_{i}) = 0 for i = 1, … , n for which the WTLSC algorithm will reproduce results for a WLS algorithm as a conservative compromise, or (ii) performing a small number of say 50 or 100 repeat measurements in an Excel spreadsheet which is feasible without the need for any specialist VBA scripting programming within a spreadsheet with appropriate expert physical judgement to generate (P_{i}, S_{i}) data to fit a Gaussian copula with the 11 lines of RStudio code previously discussed to obtain an approximate estimate for the correlation.
Referring to the results in Figure 3 it is observed that the PDFs of the OLS, WLS and GLS regressions may be compared by examining the behaviour of the PDFs along the semimajor and semiminor axes of the hyperellipsoid for the confidence region. When this analysis was performed as visualized in Figure 4 it was observed that the WLS regression performed relatively well when compared to the GLS regression. As a result it may be concluded that the limitations of the potential absence of a covariance matrix is not excessively detrimental.
Statistical data for pressure balance crossfloating measurements for pressures P_{k} and areas S_{k} in approximate equal steps with k = 1, … , 10 from P_{1} = 50 MPa through to P_{10} = 500 MPa. The above data is sufficient to use as an input for the WTLSC algorithm and produces essentially the same quality of results as a full GLS Monte Carlo simulation. If the dimensionless correlation term ρ(P_{k}, S_{k}) is unknown or cannot be otherwise estimated it may be set to zero and the WTLSC algorithm will equivalently reproduce the results from a WLS algorithm which provides conservative estimates for the corresponding parameter expected values and associated variances/covariance. It should be noted that a zero correlation of the input/output data will in general still produce a nonzero correlation in the parameters to account for and model the coupling effect between the parameter random variables.
Fig. 1
Gnu Octave/Matlab function file to implement weighted total least squares with correlation algorithm. The x input and its standard uncertainty u(x) have the same physical units, and similarly the y output and its standard uncertainty u(y) also have the same physical units, whilst the correlation ρ(x, y) = cov(x, y)/(u(x) × u(y)) is dimensionless. 
Fig. 2
Gnu Octave/Matlab script file using the function file FitWTLSC.m to calculate the pressure balance parameter values and associated variance and covariance information in a typical pressure laboratory calibration certificate for a client pressure balance instrument. The data in the loaded file is from Table 1 and may conveniently be extracted from an Excel spreadsheet of measurement results and as saved a txt file. 
Final results and comparison of numerical experiments using the WTLSC, OLS, WLS, and GLS methods.
Fig. 3
Comparison of ordinary least squares (OLS), weighted least squares (WLS) and generalized least squares regression analysis from Monte Carlo data. Hyperellipsoids are generated to indicate the corresponding 95% confidence regions (CRs) and reveals that the highest accuracy scheme is that of a GLS which has the smallest CR which corresponds to a high degree of certainty of the range of possible values, followed by a WLS which has a slightly lower accuracy compared to a GLS, and finally a OLS which has the lowest accuracy as the CR is unduly large. 
Fig. 4
Comparison of joint PDFs from OLS, WLS and GLS regressions. The confidence region may be visualized as a three dimensional volume underneath the envelope covered by the joint PDF and bounded by the hyperellipsoids, and indicates that the joint PDFs may fit inside one another. Due to the difficulties with visualizing three different joint PDFs intersecting within each other simultaneously, cuts along the major axis (orange line in previous figure) and minor axis (magenta line in previous figure) of the ellipsoids confidence regions are taken which produce level curves to more easily visualize the shape, orientation and size of the different OLS, WLS and GLS regression schemes. From these level curves it is observed that for level curves along the ellipsoid major axis that the GLS regression (green curve) has the narrowest/sharpest curve which corresponds to the highest confidence or accuracy in the estimates of the parameters, then the WLS regression (blue curve) which has a medium spread/sharpness, and finally the OLS regression (red curve) which is relative wide and flat which corresponds to less certainty/accuracy of the parameters in the regression analysis. A broadly similar behaviour is observed for level curves cut along the minor ellipsoid axis, and as a result it is observed and concluded a OLS regression has the least accuracy, a WLS regression has a medium accuracy, and a GLS regression has the highest accuracy. From the above graphs it may be concluded that a straight line regression for pressure metrology work should ideally utilize a GLS regression (green line) however if no covariance information is available the a WLS regression (blue line) can give a reasonable estimate particularly for industrial pressure metrology laboratories. 
Fig. 5
Comparison of relative errors of the quantities predicted by OLS, WLS and WTLSC regression approaches relative to predictions from a GLS regression approach. These graphs indicate that the relative error between a WTLSC regression and a GLS regression is relatively insignificant, and as a result a pressure metrologist instead of applying a more complicated GLS regression can simply apply a WTLSC regression with the aid of the supplied methodology developed in this paper. 
5 Conclusions
Based on the research reported in this paper the following conclusions were determined:

Numerical simulations have been performed that demonstrate that a weighted total least squares with correlation (WTLSC) algorithm gives essentially equivalent results for straight line parameters as a more advanced generalized least squares (GLS) algorithm with additional Monte Carlo simulations for parameter uncertainty estimation

A method has been developed for the WTLSC algorithm that is able to analytically extract the variance/covariance information of the straight line parameters without the need for additional and more advanced Monte Carlo simulations
6 Influences and implications
Based on the research reported in this paper the following influences and implications are:

Pressure metrologists working in industrial calibration laboratories may now utilize the WTLSC algorithm to conveniently and easily quantify the expected values and variances/covariance of the zeropressure area A_{0} and distortion coefficient λ

A computer program FitWTLSC.m is now freely and publicly available to conveniently implement the WTLSC algorithm in Gnu Octave or Matlab for calculating the straight line parameters and their associated variance/covariance information and may now also be used by other metrologists from different fields
This work was performed with funds provided by the Department of Higher Education, Science and Technology (DHEST) on behalf of the South African government for research by public universities.
Appendix A Hyperellipsoidal bivariate confidence region
Let random variables x_{1}, … x_{M} ∈ ℝ^{N} be grouped as Ω = [x_{1} … x_{M}] where x_{j}, j = 1, … , M are N × 1 vectors, Ω is a N × M matrix, and M is a suitably large positive integer number of Monte Carlo simulation events. Then the expected value μ a N × 1 vector and the covariance matrix U a N × N matrix may be estimated with the GUM Supplement 2 [17] as μ and U. Under these circumstances the joint PDF, unless higher order statistics information over and above the expected and covariance values are provided, is then assumed to follow a multivariate Gaussian distribution g(x) ~ (μ, U).
The corresponding joint PDF is then g(x) and consists of hyperellipsoids within a ℝ^{ N } dimensional space that are centred at μ and have an orientation determined from characteristics of U such that(A1) (A2) (A3)
In the special case of bivariate data where , and , the hyperellipsoid will lie in the x_{1} x_{2}plane within a 2 dimensional space, and the joint PDF will be (A4)
For the above equation the terms are , , and since the covariance matrix may equivalently be written as . According to the GUM Supplement 2 [17] the 100p% coverage region is a hyperellipsoid of the form(A5)
An earlier analysis by Ramnath [21] determined that the ellipsoid semiaxis values a_{1} and a_{2} could be approximately determined by forming the matrix(A6)
Once this matrix was formed the two eigenvectors v_{1} and v_{2} along with the corresponding eigenvalues λ_{1} and λ_{2} were calculated from the equation(A7)
The ellipse semiaxis values were then calculated as(A8) (A9)and the eigenvectors v_{1} and v_{2} then define the orientations of the ellipsoid principal axes following the analysis by Lee [22]. The disadvantage of this approach is that an appropriate software routine for determining eigenvectors and eigenvalues is necessary, however an advantage of this approach is that it is generalisable to higher dimensions.
An alternative direct algebraic approach is to directly substitute the statistical information μ and U into the formula equation (A5) and expand out so that
The above equation is in the form of the generalized ellipse equation as discussed by Weisstein [23] that is centred at x_{o} and y_{o}, with semimajor and semiminor axes a and b and rotated by an angle θ of the form(A11)
The algebraic coefficients may now be expressed in terms of the statistical terms as(A12) (A13) (A14) (A15) (A16) (A17)
The corresponding ellipse geometry parameters may now be algebraically calculated as (A18) (A19) (A20)
and the ellipse rotation is given as(A21)
The hyperellipsoid may then be conveniently plotted using the parametrization 0 ≤ t ≤ 2π so that the initial boundary points x_{i} and y_{i} are(A22) (A23)
Next the boundary points are rotated by the angle θ by using a rotation matrix to give x_{r} and y_{r} such that(A24)
Finally the points are translated as(A25) (A26)
This algorithm is used to produce the hyperellipsoids in Figure 3 and may be used to determine the points along the ellipse major and minor axes. As a result if these lines are used a “cut” through the joint PDF may be used to study how the level curves of the different OLS, WLS, and GLS regressions behave as it is too difficult to easily visualize the three dimensional joint PDFs in Figure 4 even with transparency effects. Once the lines are determine the x_{1} and x_{2} points are known from a simple straight line fit so the x_{1} and x_{2} points may then be substituted in the bivariate joint PDF in equation (A4) which will produce a density value. The set of points (x_{1}, x_{2}, g(x_{1}, x_{2})) may then be plotted as a threedimensional curve in order to visualize the shape and sharpness of the corresponding level curve of the PDF. This algebraic approach is only suitable for bivariate simulations and becomes impractical in higher dimensions, for which the eigenvector/eigenvalue approach is then necessary.
Appendix B Simplified worked example
In order to illustrate the method suppose that the experimental data has n = 5 points of generated pressures from 100 MPa through to 500 MPa in 100 MPa steps and crossfloated areas, which will give an experimental control variable of x = [x_{1}, x_{2}, x_{3}, x_{4}, x_{5}] and an experimental observed variable of y = [y_{1}, y_{2}, y_{3}, y_{4}, y_{5}]. Writing down the basis equation with the experimental data (for one particular Monte Carlo event) as follows.
Let
Considering an ordinary least squares (OLS) regression the estimate for the matrix V is justwhere W = V^{−1}.
Implementing the formula in Matlab or GNU Octave with the above information gives
Next considering a WLS regression it is known that the V_{i,i} ∝ u^{2}(S_{i}) so using the data from Table 1 let u(S_{1}) =0.0000990361, u(S_{2}) = 0.0001110578, u(S_{3}) = 0.0001766929, u(S_{4}) = 0.0002740451 and u(S_{5}) = 0.0004087028, and then simply let
Using Matlab/Octave to implement the matrix operations then produces
The above process is repeated for all of the experimental data points and then the results can be postprocessed in order to obtain the statistical characteristics for the OLS, WLS and GLS regressions as shown in Figure 3.
Actual physical data is first obtained from the experimental measurements in the laboratory and this data is used as inputs into the instrument models to produce the data X and Y in this example where the computer program outlines the entire logical calculation process. As an example the piston temperature may have been measured as say 20.4^{∘}C with a standard uncertainty of ±0.1^{∘}C with an assumed Gaussian distribution, then random temperatures consistent with the physical experimental uncertainty are sampled and used as inputs into the physical model. This is due to the fact that modern measurement science has generally adopted a Bayesian statistics framework for measurement uncertainty and as a result no physical measurement is deterministic since all physical measurements, with the partial exception of certain physical constants, have an underlying probability distribution i.e. any physical measurement is considered as a statistical sample from that random variables probability density function. Once sufficient samples have been generated and used as inputs/outputs into the measurement model then the OLS, WLS or GLS regressions may be implemented as outlined in Figure B1.
Fig. B1
Gnu Octave/Matlab function file to implement OLS, WLS and GLS regression analysis. 
The general steps to perform the regression analysis are therefore as follows:

Obtain laboratory nominal experimental measurement data x that has an associated joint PDF g_{X}(x)

Sample a random variable ξ from the joint PDF g_{X}(x)

Solve the associated measurement equation h(η, ξ) = 0 to obtain the random variable η associated with ξ, and repeat this several times to produce the Monte Carlo data Ω

Step through the data input ξ and data output η i.e. the inputs x and outputs y and substitute this in the regression equation C = (X^{T} W X)^{−1}(X^{T} W Y)

Solve the regression equation for C and then postprocess all the results to work out the statistical characteristics of A_{0} and b namely u^{2}(A_{0}), u^{2}(b) and cov(A_{0}, b)

Simply apply the new algebraic formula to work out u^{2}(A_{0}) = u^{2}(b), , and
The above computer code routines for performing the statistical regressions may be observed to be relatively lengthy and complicated, and may be compared to the WTLSC implementation which is simply the 20 lines of Matlab/Octave code in Figure 2. In this approach there is no need to perform any Monte Carlo simulations and the actual physical experimental data, in this simplified problem the n = 5 data points, may be used directly as inputs into the Octave/Matlab program FitWTLSC.m to directly extract the uncertainty information of the parameters A_{0} and λ and obtain u^{2}(A_{0}), u^{(}λ) and cov(A_{0}, λ).
References
 BIPM, IEC, IFCC, ILAC, ISO, IUPAP, and OIML, Evaluation of measurement data − Guide to the expression of uncertainty in measurement, tech. rep., JCGM/WG1 GUM, 2008. Revised 1st edition − https://www.bipm.org/en/publications/guides/ [Google Scholar]
 R.S. Dadson, S.L. Lewis, G.N. Peggs, The Pressure Balance: Theory and Practice (HMSO, London, 1982) ISBN 0114800480. [Google Scholar]
 P. Saunders, Propagation of uncertainty for nonlinear calibration equations with an application in radiation thermometry, Metrologia 40 , 93–101 (2003) [Google Scholar]
 S. Vanhuffel, J. Vandewalle, The Total Least Squares Problem: Computational Aspects and Analysis (SIAM, 1987) [Google Scholar]
 D. York, Least squares fitting of a straight line with correlated errors, Earth Planet. Sci. Lett. 5 , 320–324 (1968) [Google Scholar]
 W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, 2007), 3rd edn [Google Scholar]
 R.L. Burden, J.D. Faires, Numerical Analysis (Brookes/Cole, 2001), 7th edn [Google Scholar]
 W.C. Duer, P.J. Ogren, A. Meetze, C.J. Kitchen, R.V. Lindern, D.C. Yaworsky, C. Boden, J.A. Gayer, Comparison of ordinary, weighted, and generalized leastsquares straightline calibrations for LCMS=MS, GCMS, HPLC, GC, and enzymatic assay, J. Anal. Toxicol. 32 , 329–338 (2008) [CrossRef] [PubMed] [Google Scholar]
 M. Krystek, M. Anton, A weighted total leastsquares algorithm for fitting a straight line, Measur. Sci. Technol. 18 , 3438–3442 (2007) [CrossRef] [Google Scholar]
 M. Krystek, M. Anton, A leastsquares algorithm for fitting data points with mutually correlated coordinates to a straight line, Measur. Sci. Technol. 22 , 035101 (2011) [CrossRef] [Google Scholar]
 P.M. Harris, C.E. Matthews, M.G. Cox, Summarizing the output of a Monte Carlo method for uncertainty evaluation, Metrologia 51 , 243–252 (2014) [Google Scholar]
 Mathworks, fgls − Feasible generalized least squares, 2020. https://uk.mathworks.com/help/econ/fgls.html [Google Scholar]
 S. Miller, R. Startz, Feasible generalized least squares using support vector machines, Econ. Lett. 175 , 28–31 (2019) [Google Scholar]
 C. Wuethrich, S. Souiyam, Monte Carlo determination of the uncertainty of effective area and deformation coefficient for a piston cylinder unit, in 24th IMEKO TC3, 14th TC5, 6th TC16 and 5th TC22 International Conference , edited by A. Salceanu, D. Agrez, J. Saliga, M. Savino (CavtatDubrovnik, Croatia: IMEKO, 2020), pp. 1–6 [Google Scholar]
 P. Otal, C. Yardin, Modelling methods for pressure balance calibration, Measur. Sci. Technol. 31 , 034004 (2019) [CrossRef] [Google Scholar]
 V. Ramnath, Numerical analysis of the accuracy of bivariate quantile distributions utilizing copulas compared to the GUM supplement 2 for oil pressure balance uncertainties, Int. J. Metrol. Qual. Eng. 8 , 29 (2017) [CrossRef] [Google Scholar]
 BIPM, IEC, IFCC, ILAC, ISO, IUPAP, and OIML, Evaluation of measurement data − Supplement 2 to the Guide to the expression of uncertainty in measurement − Propogation of distributions using a Monte Carlo method, tech. rep., JCGM/WG1 GUM Supplement 2, 2011. 1st edition − https://www.bipm.org/en/publications/guides/ [Google Scholar]
 V. Ramnath, Determination of pressure balance distortion coefficient and zeropressure effective area uncertainties, Int. J. Metrol. Qual. Eng. 2 , 101–119 (2011) [CrossRef] [EDP Sciences] [Google Scholar]
 CRAN, The R Project for Statistical Computing, 2020. https://www.rproject.org/about.html [Google Scholar]
 CRAN, VineCopula: Statistical Inference of Vine Copulas, 2020. https://cran.rproject.org/web/packages/VineCopula/index.html [Google Scholar]
 V. Ramnath, Analysis and comparison of hyperellipsoidal and smallest coverage regions for multivariate Monte Carlo measurement uncertainty analysis simulation datasets, MAPANJ. Metrol. Soc. India 1–16 (2019) [Google Scholar]
 S.N. Lee, M.H. Shih, A volume problem for an ndimensional ellipsoid intersecting with a hyperplane, Linear Algebra Appl. 132 , 90–102 (1990) [Google Scholar]
 E.W. Weisstein, Ellipse − a Wolfram web resource (2020). https://mathworld.wolfram.com/Ellipse.html [Google Scholar]
Cite this article as: Vishal Ramnath, Comparison of straight line curve fit approaches for determining parameter variances and covariances, Int. J. Metrol. Qual. Eng. 11, 14 (2020)
All Tables
Statistical data for pressure balance crossfloating measurements for pressures P_{k} and areas S_{k} in approximate equal steps with k = 1, … , 10 from P_{1} = 50 MPa through to P_{10} = 500 MPa. The above data is sufficient to use as an input for the WTLSC algorithm and produces essentially the same quality of results as a full GLS Monte Carlo simulation. If the dimensionless correlation term ρ(P_{k}, S_{k}) is unknown or cannot be otherwise estimated it may be set to zero and the WTLSC algorithm will equivalently reproduce the results from a WLS algorithm which provides conservative estimates for the corresponding parameter expected values and associated variances/covariance. It should be noted that a zero correlation of the input/output data will in general still produce a nonzero correlation in the parameters to account for and model the coupling effect between the parameter random variables.
Final results and comparison of numerical experiments using the WTLSC, OLS, WLS, and GLS methods.
All Figures
Fig. 1
Gnu Octave/Matlab function file to implement weighted total least squares with correlation algorithm. The x input and its standard uncertainty u(x) have the same physical units, and similarly the y output and its standard uncertainty u(y) also have the same physical units, whilst the correlation ρ(x, y) = cov(x, y)/(u(x) × u(y)) is dimensionless. 

In the text 
Fig. 2
Gnu Octave/Matlab script file using the function file FitWTLSC.m to calculate the pressure balance parameter values and associated variance and covariance information in a typical pressure laboratory calibration certificate for a client pressure balance instrument. The data in the loaded file is from Table 1 and may conveniently be extracted from an Excel spreadsheet of measurement results and as saved a txt file. 

In the text 
Fig. 3
Comparison of ordinary least squares (OLS), weighted least squares (WLS) and generalized least squares regression analysis from Monte Carlo data. Hyperellipsoids are generated to indicate the corresponding 95% confidence regions (CRs) and reveals that the highest accuracy scheme is that of a GLS which has the smallest CR which corresponds to a high degree of certainty of the range of possible values, followed by a WLS which has a slightly lower accuracy compared to a GLS, and finally a OLS which has the lowest accuracy as the CR is unduly large. 

In the text 
Fig. 4
Comparison of joint PDFs from OLS, WLS and GLS regressions. The confidence region may be visualized as a three dimensional volume underneath the envelope covered by the joint PDF and bounded by the hyperellipsoids, and indicates that the joint PDFs may fit inside one another. Due to the difficulties with visualizing three different joint PDFs intersecting within each other simultaneously, cuts along the major axis (orange line in previous figure) and minor axis (magenta line in previous figure) of the ellipsoids confidence regions are taken which produce level curves to more easily visualize the shape, orientation and size of the different OLS, WLS and GLS regression schemes. From these level curves it is observed that for level curves along the ellipsoid major axis that the GLS regression (green curve) has the narrowest/sharpest curve which corresponds to the highest confidence or accuracy in the estimates of the parameters, then the WLS regression (blue curve) which has a medium spread/sharpness, and finally the OLS regression (red curve) which is relative wide and flat which corresponds to less certainty/accuracy of the parameters in the regression analysis. A broadly similar behaviour is observed for level curves cut along the minor ellipsoid axis, and as a result it is observed and concluded a OLS regression has the least accuracy, a WLS regression has a medium accuracy, and a GLS regression has the highest accuracy. From the above graphs it may be concluded that a straight line regression for pressure metrology work should ideally utilize a GLS regression (green line) however if no covariance information is available the a WLS regression (blue line) can give a reasonable estimate particularly for industrial pressure metrology laboratories. 

In the text 
Fig. 5
Comparison of relative errors of the quantities predicted by OLS, WLS and WTLSC regression approaches relative to predictions from a GLS regression approach. These graphs indicate that the relative error between a WTLSC regression and a GLS regression is relatively insignificant, and as a result a pressure metrologist instead of applying a more complicated GLS regression can simply apply a WTLSC regression with the aid of the supplied methodology developed in this paper. 

In the text 
Fig. B1
Gnu Octave/Matlab function file to implement OLS, WLS and GLS regression analysis. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.