Int. J. Metrol. Qual. Eng., Volume 8, 2017
Article Number: 28
Number of pages: 10
DOI: https://doi.org/10.1051/ijmqe/2017021
Published online: 27 November 2017
Research Article
Reversed inverse regression for the univariate linear calibration and its statistical properties derived using a new methodology
Quality Management Center, KEPCO NF, 242, Daedeokdaero 989 beongil, Daejeon 34057, Korea
* pskang@knfc.co.kr
Received: 12 March 2017
Accepted: 28 September 2017
Since simple linear regression theory was established at the beginning of the 1900s, it has been used in a variety of fields. Unfortunately, it cannot be used directly for calibration. In practical calibrations, the observed measurements (the inputs) are subject to errors, and hence they vary, thus violating the assumption that the inputs are fixed. Therefore, in the case of calibration, the regression line fitted using the method of least squares is not consistent with the statistical properties of simple linear regression as already established based on this assumption. To resolve this problem, “classical regression” and “inverse regression” have been proposed. However, they do not completely resolve the problem. As a fundamental solution, we introduce “reversed inverse regression” along with a new methodology for deriving its statistical properties. In this study, the statistical properties of this regression are derived using the “error propagation rule” and the “method of simultaneous error equations” and are compared with those of the existing regression approaches. The accuracy of the statistical properties thus derived is investigated in a simulation study. We conclude that the newly proposed regression and methodology constitute the complete regression approach for univariate linear calibrations.
Key words: bias / classical regression / error propagation / mean-data-point-based variance / population-regression-line-based variance / reversed inverse regression / simultaneous error equations / Taylor approximation
© P. Kang et al., published by EDP Sciences, 2017
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Simple linear regression is a model with a single independent variable in which a regression line is fitted through n data points such that the sum of squared errors (SSE), i.e., the vertical distances between the data points and the fitted line, is as small as possible. The statistical properties of this model have been established as theorems and are presented in many statistics textbooks, e.g., the textbook written by Walpole and Myers [1]. In this model, a regression line of y on x is fitted based on the assumption that x is fixed but y varies according to a normal distribution. This model is called “basic regression” throughout the remainder of this study. Unfortunately, when calibrating an instrument such as a chemical analyzer using basic regression, a problem arises. In practical calibrations, the observed measurements (the x values) are subject to errors, and hence they vary, thus violating the assumption of fixed inputs. As a result, in the case of calibration, the regression line fitted using the method of least squares is not consistent with the statistical properties of basic regression as already established based on this assumption.
Two approaches have been considered as possible solutions for this problem. In the first approach [2], called classical regression, the “standards” (the x values) are treated as the inputs, and the observed measurements (the y values) are treated as the response; these values are used to fit a regression line of y on x. This regression approach is consistent with the assumption that x is fixed. The problem with this approach is that estimating the x value for a new observed measurement involves the reciprocal of the estimated slope. Williams [3] demonstrated that the reciprocal of the slope has an infinite variance, which indicates that classical regression has an infinite variance and, hence, an infinite mean squared error. Nevertheless, Parker et al. [4] obtained an asymptotic approximation of the variance of the prediction interval using a formula derived by Casella and Berger [5] using the Delta Method. However, Parker et al.'s approach still has limitations. Even if we rely on this approximation, we cannot determine a prediction interval with a given confidence level because the approximation cannot be used to express the prediction interval as a t_{n−2} distribution.
In the second approach [6], called inverse regression, the standards (the x values) are treated as the response, the observed measurements (the y values) are treated as the inputs, and these values are used to fit a regression line of x on y. This regression approach is inconsistent with the assumption that the inputs are fixed. Shukla and Datta [7] and Oman [8] derived expressions for the mean and mean squared error of the predicted x value based on multiple measurements taken during the prediction stage of the calibration process. Fuller [9] made a similar suggestion regarding the derivation of both the predicted x value and the prediction interval. Fuller's approach requires that the variance of the observed measurements is known. In his approach, it is necessary to measure a standard multiple times independently to estimate the variance. Parker et al. [4] derived the bias in prediction using a formula established by Pham-Gia et al. [10] with the aid of the Delta Method. Parker et al. [4] also showed through several simulation studies that inverse regression is preferable to classical regression in terms of bias and mean squared error. However, to derive the statistical properties of inverse regression, Parker et al. were obliged to borrow their estimate for the variance of the slope from “reversed basic regression” because of technical difficulties, which devalues their approach. (Reversed basic regression is basic regression in which the roles of x and y have merely been reversed.)
As a fundamental solution for the calibration problem, which has not yet been resolved completely, the current study introduces “reversed inverse regression” along with a new methodology for deriving its statistical properties. (Simply put, “fundamental solution for the univariate linear calibration problem” = “reversed inverse regression” + “new methodology for deriving the statistical properties of the regression”.) In the proposed regression approach, the observed measurements (the x values) are treated as the inputs, and the standards (the y values) are treated as the response; these values are used to fit a regression line of y on x. The statistical properties of this regression are derived using the “error propagation rule” and the “method of simultaneous error equations”. In this regression approach, it is not necessary to measure any standards multiple times independently. We present an example of practical calibration. Each of three types of regression (i.e., classical regression, inverse regression and reversed inverse regression) is applied to the calibration example, and the corresponding calibration results, including the subsequently calculated estimates for the variance of the prediction interval, are compared. In addition, the accuracy of the statistical properties derived using the new methodology is investigated in a Monte Carlo simulation study.
2 Regression and methodology
If the roles of x and y are reversed, then inverse regression becomes reversed inverse regression. Reversed inverse regression is more convenient to use for calibration than inverse regression because the reversed roles are consistent with the convention that the variable x represents the inputs, whereas the variable y represents the response. This regression approach also violates the assumption that the inputs are fixed. It is modeled as follows. (It may be desirable to use some term other than “reversed inverse regression”, e.g., “pseudo-basic regression”, to eliminate potential confusion in terminology.)

– There is a linear relationship between x and y.

– The observed measurements (the x values) are treated as the inputs, the standards (the y values) are treated as the response, and these values are used to fit a regression line of y on x.

– For the fitting of the regression line, n data points of the form (x_{i}, y_{i}) (i = 1, …, n) are used. The x_{i} value varies according to a normal distribution, whereas the y_{i} value is fixed; y_{i} = α + βx_{i} + ε_{i}, ε_{i} ∼ N(0, σ^{2}).

– The x_{i}'s (i.e., x_{1}, …, x_{n}) are treated as variables. The variables x_{i} and x_{j} (i ≠ j) are independent of each other: cov[x_{i}, x_{j}] = 0, i ≠ j.

– The regression line ŷ = α̂ + β̂x is fitted such that SSE is minimized.

• β̂ = S_{xy}/S_{xx}, α̂ = ȳ − β̂x̄.

• S_{xx} = ∑(x_{i} − x̄)^{2}, S_{yy} = ∑(y_{i} − ȳ)^{2}, S_{xy} = ∑(x_{i} − x̄)(y_{i} − ȳ), x̄ = ∑x_{i}/n, ȳ = ∑y_{i}/n.

– The variance of x_{i} is uniform for all i (i = 1, …, n). In other words, the variance of the observed measurements is equal over the entire calibration range of interest.

• σ_{x}^{2} denotes the variance of the variable x_{i}; var[x_{i}] = σ_{x}^{2} for all i.

– The population regression line y = α + βx is defined as follows:

• x_{i0} denotes the mean of the variable x_{i}: E[x_{i}] = x_{i0} (i = 1, …, n), and x̄_{0} = ∑x_{i0}/n.

• All points (x_{i0}, y_{i}) (i = 1, …, n) lie on the population regression line, i.e., y_{i} = α + βx_{i0}. In this study, we call these points the “mean data points”.

• Because the y_{i}'s are fixed, the error term satisfies ε_{i} = −β(x_{i} − x_{i0}), so that σ^{2} = β^{2}σ_{x}^{2}.

(∑ denotes summation from i = 1 to n throughout this study.)
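The fitted line itself is obtained by ordinary least squares, exactly as in basic regression; only the interpretation of which variable carries the error differs. As a minimal sketch (not the authors' code) of the estimators β̂ = S_{xy}/S_{xx} and α̂ = ȳ − β̂x̄:

```python
def fit_line(x, y):
    """Least-squares fit of y on x; returns (alpha_hat, beta_hat)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx                 # slope: S_xy / S_xx
    alpha_hat = y_bar - beta_hat * x_bar   # intercept: y-bar - beta-hat * x-bar
    return alpha_hat, beta_hat

# Synthetic check: points lying exactly on y = 1 + 2x recover the line.
a_hat, b_hat = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

In reversed inverse regression the same computation is applied with the observed measurements as x and the standards as y; what changes is the statistical analysis of the result, not the fitting itself.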
In reversed inverse regression, the assumption that the observed measurements (the x values), despite being the inputs, vary according to normal distributions is very important. Suppose that the regression line fitting is repeated an infinite number of times using a “new set of n different standards (or reference solutions)” each time. Here, this “new set of n different standards” refers to newly prepared standards whose nominal y values (or target y values) and confidence levels are identical to those of the previous set of standards. In this case, the x_{i}'s (i.e., x_{1}, …, x_{n}) will be observed to vary according to normal distributions. The standards are subject to errors that may arise when preparing or manufacturing them. However, such errors will appear as variations in the x_{i}'s after being combined with random measurement errors. If the “same set of n different standards” is measured repeatedly, we will only observe the variance associated with the random measurement errors; the errors of the standards themselves will not be reflected. Such a variance should not be treated as the variance needed to derive the statistical properties of linear regression. In this respect, Fuller [9] is incorrect, because his approach requires a standard to be independently measured multiple times to estimate the variance. As previously mentioned, reversed inverse regression does not require any such separate prior measurements.
The slope of the regression line that is fitted on the basis of reversed inverse regression is β̂ = S_{xy}/S_{xx}.
Unfortunately, it is technically difficult to derive the variance of the slope directly from the definition of the variance, i.e., var[f(x_{1}, …, x_{n})] = E[{f(x_{1}, …, x_{n}) − E[f(x_{1}, …, x_{n})]}^{2}], because β̂ is a fractional expression that contains “∑(x_{i} − x̄)^{2}” in the denominator and the x_{i}'s vary rather than being fixed. Because of this difficulty, we directly treat the x_{i}'s as variables and derive the variance of the slope based on the first-order Taylor approximation as follows: var[f(x_{1}, …, x_{n})] ≈ ∑[∂f/∂x_{i}]^{*2}var[x_{i}] + 2∑_{i<j}[∂f/∂x_{i}]^{*}[∂f/∂x_{j}]^{*}cov[x_{i}, x_{j}], where the notation [ ]^{*} or { }^{*} indicates that the value of the function contained within the bracket is determined using the mean values of the variables, i.e., x_{10}, …, x_{n0} [11]. Even in the case of derivation of expectations, this notation is often used for the same purpose. In particular, we define the expectation E[{f(x_{1}, …, x_{n}) − f(x_{10}, …, x_{n0})}^{2}] as the “mean-data-point-based variance”. The approximation method for deriving the variance described herein is commonly referred to as the “error propagation rule”, and only the first-order partial derivatives are included in its derivation. To derive the variance of the slope, var[β̂], after the partial differentiation of β̂ with respect to the x_{i}'s, the variances of the x_{i}'s, including the covariances of x_{i} and x_{j} (j > i), are combined in accordance with the error propagation rule. The final result obtained from this combination process is the approximate variance of the slope. The same method can be used to derive the variance of the intercept and the variance of the predicted y value. All other statistical properties of reversed inverse regression, such as the expectation and bias of the slope and the expectation of the mean squared error, are derived by utilizing another special method, called the “method of simultaneous error equations” in this study, in combination with the error propagation rule.
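The error propagation rule can be checked numerically. The sketch below (my illustration, with made-up mean data points on the line y = 1 + 2x and an assumed common σ_x) approximates each partial derivative [∂β̂/∂x_i]* by a central finite difference at the mean data points and combines them as ∑([∂β̂/∂x_i]*)²σ_x² (the covariance terms vanish under cov[x_i, x_j] = 0); for points lying exactly on the line this should agree with the closed form β²σ_x²/S_xx:

```python
def slope(x, y):
    """beta-hat = S_xy / S_xx for the given data points."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return s_xy / s_xx

x0 = [1.0, 2.0, 3.0, 4.0, 5.0]        # assumed mean data points x_i0
y = [1.0 + 2.0 * xi for xi in x0]     # fixed standards on the line y = 1 + 2x
sigma_x = 0.01                        # assumed common std. dev. of the x_i's
h = 1e-6                              # finite-difference step

# First-order error propagation with cov[x_i, x_j] = 0 (i != j):
# var[beta-hat] ~ sum_i ([d beta-hat / d x_i]*)^2 * sigma_x^2.
var_prop = 0.0
for i in range(len(x0)):
    xp, xm = list(x0), list(x0)
    xp[i] += h
    xm[i] -= h
    deriv = (slope(xp, y) - slope(xm, y)) / (2 * h)  # partial at mean points
    var_prop += deriv ** 2 * sigma_x ** 2

# Closed form for points lying exactly on the line: beta^2 sigma_x^2 / S_xx.
beta = 2.0
x_bar0 = sum(x0) / len(x0)
s_xx0 = sum((xi - x_bar0) ** 2 for xi in x0)
var_closed = beta ** 2 * sigma_x ** 2 / s_xx0
```

The two values agree to within the finite-difference error, which illustrates that the propagated variance is a first-order quantity evaluated entirely at the mean data points.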
When we need to derive another statistical property from the primary expressions already obtained using the error propagation rule, the first-order Taylor approximation is mainly used. Error terms of orders higher than (σ_{x}/A)^{2} are discarded during or after the approximation calculations. For example, when σ_{x}/A = 1/10^{2}, the term (σ_{x}/A)^{4} (= 1/10^{8}) is very small and can be neglected in comparison with (σ_{x}/A)^{2} (= 1/10^{4}).
The Delta Method is also an asymptotic approximation method based on Taylor approximation [12]. Parker et al. [4] used the Delta Method to derive the variance of the prediction interval for classical regression. When the Delta Method is applied to the inverted equation x = −α̂/β̂ + (1/β̂)y, the x_{i}'s and y_{i}'s are not directly treated as variables. Instead, two auxiliary statistics, U and V, are treated as the variables [4,5,10]. This is the most notable difference between the Delta Method and the approximation method used in this study.
3 Statistical properties of reversed inverse regression
The variance and bias of the slope and the expectation of the mean squared error are the statistical properties that are primarily required in linear regression because other properties, such as the variance and bias of the intercept and the variance of the prediction interval, depend on them. Therefore, the variance of the slope, var[β̂], is first derived using the error propagation rule as follows (see supplementary material): var[β̂] ≈ [S_{yy}/S_{xy}^{2}]^{*}σ^{2} = [S_{yy}/S_{xy}^{2}]^{*}β^{2}σ_{x}^{2}. (1)
To investigate the accuracy of the variance obtained using equation (1), we should consider two factors. One is that error terms of orders higher than (σ_{x}/A)^{2} are not included in the derivation. The other is that because equation (1) represents the population-regression-line-based variance, the bias in β̂ is not reflected in the calculation of [S_{yy}/S_{xy}^{2}]^{*}σ^{2} (= [S_{yy}/S_{xy}^{2}]^{*}β^{2}σ_{x}^{2}). The bias in β̂ depends on σ_{x}^{2} and n. The details of the effects of these two factors are explained based on the simulation results in Section 5. For reference, the variance of β̂ for basic regression is [1/S_{xx}]^{*}σ^{2}, and this variance is not an approximation but an exact expression. The relationship between the estimates of var_{reversed inverse}[β̂] and var_{basic}[β̂] for a given set of data points is as follows: the estimate (S_{yy}/S_{xy}^{2})MSE for reversed inverse regression equals the estimate (1/S_{xx})MSE for basic regression divided by r^{2}(x, y), where r(x, y) is the estimated correlation coefficient between x and y, i.e., r(x, y) = S_{xy}/(S_{xx}S_{yy})^{1/2}, and r^{2}(x, y) is typically very close to 1 in linear calibrations.
The variance of the intercept, var[α̂], is also derived using the error propagation rule as follows (see supplementary material): var[α̂] ≈ [1/n + x̄^{2}(S_{yy}/S_{xy}^{2})]^{*}σ^{2}. (2)
Separately from the previous derivation process, another equation for deriving var[α̂] can be obtained by applying the error propagation rule to α̂ = ȳ − β̂x̄: var[α̂] ≈ [x̄^{2}]^{*}var[β̂] + β^{2}σ_{x}^{2}/n + 2[x̄]^{*}β·cov[β̂, x̄]. (3)
From equations (2) and (3), we can see that cov[β̂, x̄] ≈ 0, i.e., r(β̂, x̄) ≈ 0, and hence, β̂ and x̄ are nearly independent of each other. In equation (2), var[α̂] is derived by treating α̂ as a function of the x_{i}'s (i = 1, …, n), whereas in equation (3), var[α̂] is derived by treating α̂ as a function of β̂ and x̄. In this way, by formulating two separate equations to obtain the variance of a statistic using the error propagation rule, we can derive the covariance or correlation coefficient between any two statistics. This method is called the “method of simultaneous error equations” in this study. Nearly all of the covariances (or correlation coefficients) in a linear regression problem can be derived using this method. In addition, the derived covariances can be further used to derive other statistical properties. However, we should note that the covariances thus derived are typically approximations, not exact expressions.
A predicted y value is the y value of a point (x, y) on the fitted regression line and is determined by substituting x into ŷ = α̂ + β̂x. The variance of such a predicted y value, var[ŷ], is derived using the error propagation rule as follows: var[ŷ] ≈ [1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})]^{*}σ^{2}. (4)
Separately from equation (4), another equation for deriving var[ŷ] can be obtained by applying the error propagation rule to ŷ = α̂ + β̂x: var[ŷ] ≈ var[α̂] + x^{2}var[β̂] + 2x·cov[α̂, β̂]. (5)
From equations (4) and (5), the correlation coefficient r(α̂, β̂) can be determined as follows: cov[α̂, β̂] ≈ −[x̄]^{*}var[β̂], and hence r(α̂, β̂) ≈ −[x̄]^{*}{var[β̂]/var[α̂]}^{1/2}.
As the next step, we derive the expectations of β̂ and α̂, and the biases in β̂, α̂ and ŷ. For this purpose, the following statistical properties are derived in advance using the method of simultaneous error equations (see supplementary material): E[1] = E[∑(x_{i} − x̄)^{2}/∑(x_{i} − x̄)^{2}] = E[∑(x_{i} − x̄)^{2}] ∙ E[1/∑(x_{i} − x̄)^{2}] + cov[∑(x_{i} − x̄)^{2}, 1/∑(x_{i} − x̄)^{2}], and hence, E[1/∑(x_{i} − x̄)^{2}] ≈ {1 + [4σ_{x}^{2}/S_{xx}]^{*}}/{∑(x_{i0} − x̄_{0})^{2} + (n − 1)σ_{x}^{2}}. Therefore, the expectation of the slope, β_{E}, can be derived as follows (see supplementary material for more details): β_{E} = E[β̂] = E[S_{xy}/S_{xx}] ≈ E[S_{xy}] ∙ E[1/S_{xx}] + cov[S_{xy}, 1/S_{xx}].
If we apply the first-order Taylor approximation to simplify the resulting expression E[S_{xy}]{1 + [4σ_{x}^{2}/S_{xx}]^{*}}/{∑(x_{i0} − x̄_{0})^{2} + (n − 1)σ_{x}^{2}} + cov[S_{xy}, 1/S_{xx}], we obtain the following expressions for β_{E} and α_{E}: β_{E} ≈ β{1 − [σ_{x}^{2}/S_{xx}]^{*}(n − 3)} and α_{E} ≈ α + [x̄]^{*}β[σ_{x}^{2}/S_{xx}]^{*}(n − 3).
Accordingly, the biases in β̂, α̂ and ŷ are as follows: bias[β̂] = β_{E} − β ≈ −β[σ_{x}^{2}/S_{xx}]^{*}(n − 3), bias[α̂] = α_{E} − α ≈ [x̄]^{*}β[σ_{x}^{2}/S_{xx}]^{*}(n − 3), (6) and bias[ŷ] ≈ −β[(x − x̄)σ_{x}^{2}/S_{xx}]^{*}(n − 3). (7)
Based on these biases, we can see that β and α are not the mean, median, or mode of the β̂ and α̂ distributions. However, we can say that β̂ and α̂, despite being slightly skewed, follow approximately normal distributions centered at β and α respectively, because the terms β[σ_{x}^{2}/S_{xx}]^{*}(n − 3) and [x̄]^{*}β[σ_{x}^{2}/S_{xx}]^{*}(n − 3) are each very small in magnitude in practical calibrations. (When n is 3, β coincides with β_{E}. The same can be said of α and α_{E}.)
To show that the slope, intercept and predicted y value in reversed inverse regression can be expressed as t_{n−2} distributions, it is necessary to know the statistical properties of the mean squared error (MSE). The expectation of MSE is first derived (see supplementary material for more details): (8)
To investigate the accuracy of the expectation of MSE obtained using equation (8), we should consider the same factors taken into account in the case of the variance of β̂. The accuracy of the derived E[MSE] is discussed in detail based on simulation results in Section 5.
The correlation coefficient between the slope and the mean squared error, r(β̂, MSE), is derived using the method of simultaneous error equations. Let A = β̂ = S_{xy}/S_{xx} and F = ∑(y_{i} − α̂ − β̂x_{i})^{2} = (S_{xx}S_{yy} − S_{xy}^{2})/S_{xx}. Then, two separate equations for deriving the variance of F can be established, and the correlation coefficient r(β̂, MSE) is obtained from these two equations.
Additionally, β̂ and MSE are nearly independent of each other, and x̄ and MSE are also nearly independent of each other; therefore, r(α̂, MSE) = r(ȳ − β̂x̄, MSE) ≈ 0.
In the expression MSE = ∑(y_{i} − α̂ − β̂x_{i})^{2}/(n − 2), the y_{i}'s are constant, α̂ and β̂ follow approximately normal distributions, and the x_{i}'s also follow normal distributions. Therefore, (n − 2)MSE/σ^{2} approximately follows a χ^{2} distribution with n − 2 degrees of freedom. In addition, both α̂ and β̂ are nearly independent of MSE. Based on these facts, the following expressions can be obtained (see equations (1), (2), (4) and (8)): T_{β} = (β̂ − β)/{σ̂(S_{yy}/S_{xy}^{2})^{1/2}}, T_{α} = (α̂ − α)/{σ̂[1/n + x̄^{2}(S_{yy}/S_{xy}^{2})]^{1/2}}, T_{ŷ} = (ŷ − y)/{σ̂[1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})]^{1/2}} and T_{y0} = (y_{0} − ŷ_{0})/{σ̂[1 + 1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})]^{1/2}}, where σ̂ is the square root of MSE and y_{0} is the nominal y value of a newly prepared standard. The T's are all approximate t_{n−2} distributions. Although x̄, (x − x̄)^{2} and S_{yy}/S_{xy}^{2}, which appear in the T's, are functions of x_{i} (i = 1, …, n), the t_{n−2} distributions are not greatly deformed by these functions because the fluctuations of S_{yy}/S_{xy}^{2} (or [1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})]) corresponding to the variations of the x_{i}'s are typically very small compared with the magnitude of S_{yy}/S_{xy}^{2} (or [1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})]) itself. Based on these t_{n−2} distributions, we can evaluate the uncertainty (or confidence interval) of a measurement value determined based on the fitted regression line.
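As a numerical sketch (my illustration, not the authors' code) of a t_{n−2}-based prediction interval, the summary statistics of Suh's five data points from Section 4.3 can be used. The MSE is recomputed here from the S statistics, so the figures differ slightly from the rounded values quoted in Section 4; the critical value t_{3, 0.975} = 3.182 is hard-coded:

```python
import math

# Summary statistics from Section 4.3 (x: absorbance (%), y: Cd concentration (ppm)).
n = 5
x_bar, y_bar = 0.1284, 0.5
s_xx, s_yy, s_xy = 0.02225, 0.4, 0.094

beta_hat = s_xy / s_xx                  # slope, ~4.22472
alpha_hat = y_bar - beta_hat * x_bar    # intercept, ~-0.04245
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)

x_new = 0.215                           # new observed absorbance (%)
y_hat = alpha_hat + beta_hat * x_new    # predicted concentration (ppm)
# Variance estimate of the prediction interval (the EV_RI form of Section 4.3).
ev_ri = mse * (1 + 1 / n + (x_new - x_bar) ** 2 * s_yy / s_xy ** 2)
t_crit = 3.182                          # t_{3, 0.975}, hard-coded
half_width = t_crit * math.sqrt(ev_ri)
lower, upper = y_hat - half_width, y_hat + half_width
```

The predicted concentration comes out near 0.866 ppm with a half-width of roughly 0.12 ppm at the 95% level, which illustrates how the approximate t_{n−2} distribution turns the derived variance into a usable uncertainty statement.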
4 Comparison of regression approaches
Krutchkoff [6,13] compared classical regression and inverse regression using Monte Carlo simulations and recommended inverse regression based on the mean squared error. However, Berkson [14] and Halpern [15] presented significant criticisms of Krutchkoff's work. Parker et al. [4] also conducted several simulation studies and concluded that inverse regression performs better than classical regression. It seems that such debates arise because the existing regression approaches and accompanying methodologies are theoretically incomplete. In contrast to these studies, we compare the different linear regression approaches using a practical calibration example. Each of three types of regression (classical, inverse and reversed inverse) is applied to the calibration scenario. In practical calibrations, the variance of the prediction interval is one of the most important statistical properties. Therefore, we identify the differences among the three regressions based on a comparison of the variances of the prediction interval estimated using the three regression approaches. For the fitting of a regression line as an example of practical calibration, we use a set of data points collected by Suh [16] while evaluating the uncertainty in the measurements recorded by an absorption spectrometer. The spectrometer determines the chemical concentrations (ppm) in a sample by measuring the absorbances (%) due to the corresponding chemical elements. Suh measured five different Cd (cadmium) standards. The data points collected by Suh and the calibration results are as follows:
4.1 Classical regression

– x: Cd concentration (ppm), y: absorbance (%).

– x̄ = 0.5, ȳ = 0.1284, S_{xx} = 0.4, S_{yy} = 0.02225, S_{xy} = 0.094, r(x, y) = S_{xy}/(S_{xx}S_{yy})^{1/2} = 0.9964.

– MSE = ∑(y_{i} − α̂ − β̂x_{i})^{2}/(5 − 2) = 0.000056, β̂ = S_{xy}/S_{xx} = 0.235, α̂ = ȳ − β̂x̄ = 0.0109.

– Regression line: ŷ = α̂ + β̂x.

– Estimator for the variance of the prediction interval (EV_{C}): MSE[1 + 1/n + (x̂ − x̄)^{2}/S_{xx}](1/β̂)^{2}.

• x̂ = −0.04638 + 4.25532y. (Measurement equation)

• EV_{C} = {1 + 1/5 + (0.8685 − 0.5)^{2}/0.4} × 0.000056 × (1/0.235)^{2} = 0.0015611 (at x̂ = 0.8685 ppm).

• Note: −0.04638 + 4.25532 × 0.215(%) = 0.8685 (ppm).
4.2 Inverse regression

– x: Cd concentration (ppm), y: absorbance (%).

– x̄ = 0.5, ȳ = 0.1284, S_{xx} = 0.4, S_{yy} = 0.02225, S_{xy} = 0.094, r(x, y) = S_{xy}/(S_{xx}S_{yy})^{1/2} = 0.9964.

– MSE = ∑(x_{i} − α̂ − β̂y_{i})^{2}/(5 − 2) = 0.001, β̂ = S_{xy}/S_{yy} = 4.22472, α̂ = x̄ − β̂ȳ = −0.04245.

– Regression line: x̂ = α̂ + β̂y.

– Estimator for the variance of the prediction interval (EV_{I}): MSE[1 + 1/n + (y − ȳ)^{2}/S_{yy}].

• x̂ = −0.04245 + 4.22472y. (Measurement equation)

• EV_{I} = {1 + 1/5 + (0.215 − 0.1284)^{2}/0.02225} × 0.001 = 0.0015371 (at y = 0.215%).
4.3 Reversed inverse regression

– x: absorbance (%), y: Cd concentration (ppm).

– x̄ = 0.1284, ȳ = 0.5, S_{xx} = 0.02225, S_{yy} = 0.4, S_{xy} = 0.094, r(x, y) = S_{xy}/(S_{xx}S_{yy})^{1/2} = 0.9964.

– MSE = ∑(y_{i} − α̂ − β̂x_{i})^{2}/(5 − 2) = 0.001, β̂ = S_{xy}/S_{xx} = 4.22472, α̂ = ȳ − β̂x̄ = −0.04245.

– Regression line: ŷ = α̂ + β̂x.

– Estimator for the variance of the prediction interval (EV_{RI}): MSE[1 + 1/n + (x − x̄)^{2}(S_{yy}/S_{xy}^{2})].

• ŷ = −0.04245 + 4.22472x. (Measurement equation)

• EV_{RI} = {1 + 1/5 + (0.215 − 0.1284)^{2}(0.4/0.094^{2})} × 0.001 = 0.0015395 (at x = 0.215%).
The estimate EV_{RI} derived via reversed inverse regression at x = 0.215% (the upper end of the calibration range) is compared with the estimate EV_{C} derived via classical regression at x̂ = 0.8685 ppm and with the estimate EV_{I} derived via inverse regression at y = 0.215%. All three estimates are different from one another. Classical regression yields the largest estimate, and inverse regression yields the smallest one. This can be explained by rewriting the three estimators in the common notation of Section 4.3 (x: absorbance, y: Cd concentration) and comparing them: EV_{I} = MSE[1 + 1/n + (x − x̄)^{2}/S_{xx}], EV_{RI} = MSE[1 + 1/n + (x − x̄)^{2}/{r^{2}(x, y)S_{xx}}], and EV_{C} = {MSE/r^{2}(x, y)}[1 + 1/n + r^{2}(x, y)(x − x̄)^{2}/S_{xx}]. (Both EV_{C} and EV_{I} are those derived by Parker et al. [4].) When rewriting EV_{C} and EV_{I}, the roles of x and y were reversed to facilitate comparison. In addition, the term (ŷ − ȳ)^{2}/S_{yy} in the expression for classical regression was changed to r^{2}(x, y)(x − x̄)^{2}/S_{xx}.
The correlation coefficient r(x, y) {= S_{xy}/(S_{xx}S_{yy})^{1/2}} is very close to, but always smaller than, 1 in linear calibrations. In addition, EV_{RI} − EV_{I} = MSE{1/r^{2}(x, y) − 1}(x − x̄)^{2}/S_{xx} and EV_{C} − EV_{RI} = MSE{1/r^{2}(x, y) − 1}[(1 + 1/n) − (x − x̄)^{2}/S_{xx}]. The term (1 + 1/n) is greater than (x − x̄)^{2}/S_{xx} over the calibration range. Therefore, the estimates can be arranged in order of increasing magnitude as follows: “inverse”, “reversed inverse” and then “classical”. This ordering holds for all linear calibrations. The differences among the three estimates depend on r(x, y). In Suh's measurement experiment, r(x, y) is 0.9964 (n = 5), the estimate derived via classical regression at the upper end of the calibration range is approximately 1.5% greater than that derived via inverse regression, and the estimate derived via reversed inverse regression is approximately 0.15% greater than that derived via inverse regression. If Suh had repeated this measurement experiment, the results would have been similar to those of this calibration. Regarding these calibration results, we should remind ourselves that even if we rely on the estimate derived via classical regression, we cannot determine the prediction interval with a given confidence level because the estimate cannot be used to express the prediction interval as a t_{n−2} distribution. In addition, we should remind ourselves that the estimate derived via inverse regression is not a theoretically correct one.
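The ordering can be verified numerically from Suh's summary statistics. In the sketch below (my rearrangement in the common x = absorbance notation, not code from the paper), the MSE is recomputed from the S statistics, so the absolute numbers differ slightly from the rounded values quoted above, but the ordering EV_I < EV_RI < EV_C is preserved:

```python
# Suh's summary statistics after role reversal (x: absorbance, y: ppm).
n = 5
x_bar = 0.1284                               # mean absorbance (%)
s_xx, s_yy, s_xy = 0.02225, 0.4, 0.094
r2 = s_xy ** 2 / (s_xx * s_yy)               # squared correlation, ~0.9928
mse = (s_yy - s_xy ** 2 / s_xx) / (n - 2)    # residual MSE of y on x, ~0.00096
d2 = (0.215 - x_bar) ** 2                    # (x - x_bar)^2 at x = 0.215%

ev_i = mse * (1 + 1 / n + d2 / s_xx)                # inverse
ev_ri = mse * (1 + 1 / n + d2 / (r2 * s_xx))        # reversed inverse
ev_c = (mse / r2) * (1 + 1 / n + r2 * d2 / s_xx)    # classical (rearranged)

assert ev_i < ev_ri < ev_c                   # ordering claimed in the text
```

Because 1/r² multiplies only the (x − x̄)² term in EV_RI but the whole bracket in EV_C, the gap between the three estimators shrinks to zero as r²(x, y) approaches 1.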
5 Simulation study
We conducted a Monte Carlo simulation study to investigate the accuracy of the statistical properties derived using the error propagation rule and the method of simultaneous error equations based on the first-order Taylor approximation. var[β̂], bias[β̂] and E[MSE] were the main targets of investigation because the accuracy of other properties, such as var[α̂], bias[α̂], var[ŷ], bias[ŷ] and var[prediction interval], depends on the accuracy of these three properties. We designed a simulation of regression line fitting using five data points based on reversed inverse regression. We first created five intended mean data points (x_{i0}, y_{i}) (i = 1, …, 5) that were needed for the simulation as follows:


Intended population regression line: y = −0.3 + 0.025x (β = 0.025, α = −0.3).
Depending on the intended variance σ_{x}^{2}, the simulation study was organized into five simulation groups, SG1, SG2, SG3, SG4 and SG5, and the intended variances assigned to the five groups were 90^{2}, 60^{2}, 24^{2}, 12^{2} and 6^{2}, respectively. Five simulations per group were conducted (25 simulations in total). In every simulation, the regression line fitting was repeated 50 000 times using independent random numbers generated from normal distributions using the program “Minitab 15”. The results of the conducted simulations are presented along with the corresponding theoretically derived properties in Tables 1 and 2. (Even if different parameters, such as a different number of data points, a different ratio of σ_{x}^{2} to S_{xx}, or unequal distances between the x_{i0}'s, were applied in a simulation study, such a simulation study would yield conclusions essentially similar to those of this study.)
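A minimal sketch of one such simulation run is shown below. It is my illustration, not the paper's Minitab setup: the mean data points are an assumption (equally spaced values chosen so that ∑(x_{i0} − x̄_{0})² = 1.6 × 10^{6}, consistent with the figures 1000 and 40 000² quoted later in this section), the smallest intended variance 6² is used, and 20 000 repetitions replace the paper's 50 000 to keep the sketch fast:

```python
import random
import statistics

random.seed(1)

x0 = [200.0, 600.0, 1000.0, 1400.0, 1800.0]   # assumed mean data points x_i0
alpha, beta = -0.3, 0.025                     # intended population line
y = [alpha + beta * xi for xi in x0]          # fixed standards y_i
sigma_x = 6.0                                 # intended sigma_x of group SG5
y_bar = sum(y) / len(y)

slopes = []
for _ in range(20_000):
    # One regression-line fitting: each x_i varies about its mean x_i0.
    x = [random.gauss(xi, sigma_x) for xi in x0]
    x_bar = sum(x) / len(x)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slopes.append(s_xy / s_xx)                # beta-hat for this repetition

svar = statistics.variance(slopes)            # simulated var[beta-hat] (Svar)
x_bar0 = sum(x0) / len(x0)
s_xx0 = sum((xi - x_bar0) ** 2 for xi in x0)  # S_xx at the mean data points
dvar = beta ** 2 * sigma_x ** 2 / s_xx0       # derived var (Eq. (1) at the mean points)
```

For this small-variance group the simulated and derived variances agree to within sampling noise, mirroring the close Svar/Dvar ratios reported in the tables.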
In Tables 1 and 2, the ratio of Svar[β̂] to Dvar[β̂] ranges from 0.971 to 1.017 and the ratio of SE[MSE] to DE[MSE] ranges from 0.983 to 1.002. (The prefixes “S” and “D” denote simulated and derived values, respectively.) In addition, the two derived variances ^{*}Dvar[β̂] and Dvar[β̂] are very close to each other. Therefore, we can conclude that the variance of the slope and the expectation of the mean squared error derived using the error propagation rule and the method of simultaneous error equations largely coincide with the simulation results.
According to Table 1, when σ_{x}^{2} is 6^{2}, the ratio of bias[β̂] to {var[β̂]}^{1/2} is approximately −0.01, and when σ_{x}^{2} is 90^{2}, the ratio is approximately −0.14. These two ratios are very different from each other in magnitude. In the case of either simulation or derivation, as the variance σ_{x}^{2} increases, both the absolute value of the bias in β̂ and the variance of β̂ increase. The rate of increase of the absolute value of the bias in β̂ is equal to the rate of increase of σ_{x}^{2} (see Eq. (6)), whereas the rate of increase of {var[β̂]}^{1/2} is the square root of the rate of increase of σ_{x}^{2} (see Eq. (1)). This indicates that as σ_{x}^{2} increases, the β̂ distribution becomes more skewed. In Tables 1 and 2, the derived values of the bias in β̂ largely coincide with the simulation results regardless of σ_{x}^{2}. This indicates that although the first-order Taylor approximation is used to derive the bias in β̂, the derived bias does not greatly differ from the simulation result. The bias in β̂ plays an important role in analyzing the accuracy of other derived statistical properties.
When σ_{x}^{2} is small, the derived variance of β̂ exactly coincides with the simulation result; however, when σ_{x}^{2} is large, the derived variance of β̂ is generally slightly greater than the simulation result. When the variance of β̂ (i.e., Dvar[β̂]) is derived using the error propagation rule, the partial derivatives of orders higher than the first are not included in the derivation, and the approximation var[f(x_{1}, …, x_{n})] ≈ E[{f(x_{1}, …, x_{n}) − f(x_{10}, …, x_{n0})}^{2}] is used instead of the exact definition var[f(x_{1}, …, x_{n})] = E[{f(x_{1}, …, x_{n}) − E[f(x_{1}, …, x_{n})]}^{2}] to derive the variance. This results in two phenomena. The first phenomenon is that error terms of orders higher than (σ_{x}/A)^{2} are excluded from the derivation, and the second phenomenon is that the bias in β̂ is not reflected in the derivation. The bias in β̂ depends on σ_{x}^{2} and n (see equation (6)). In this simulation study, n is 5. The first phenomenon typically causes the derived variance of β̂ (i.e., Dvar[β̂]) to decrease, whereas the second phenomenon tends to cause it to increase. If σ_{x}^{2} is small, both effects are trivial, and Dvar[β̂] is nearly equal to Svar[β̂]. If σ_{x}^{2} is large, both of these effects are also large. However, the effect of the second phenomenon is much greater than that of the first. As a result, if σ_{x}^{2} is large, then Dvar[β̂] is greater than Svar[β̂]. If we substitute β_{E} (SMean[β̂] in Tab. 2) into equation (1) in place of β, we can obtain a variance of β̂ that is much closer to the simulation result. For example, for SG1-1, we can obtain (1000/40 000^{2}) × 90^{2} × 0.0247465^{2} = 0.0017607^{2} by substituting β_{E} (= 0.0247465) into equation (1). (The difference between Dvar[β̂] and this recalculated variance is approximately equal to the square of bias[β̂].) This value is very close to the simulation result. The difference that still remains can be regarded as the effect of the first phenomenon.
With regard to the expectation of the mean squared error, a similar explanation is possible. Even in this case, the effect of the second phenomenon is greater than that of the first phenomenon, and hence, DE[MSE] is generally greater than SE[MSE]. In particular, the effect of the first phenomenon can be approximately calculated using another expression for the expectation of MSE, in which the last term on the right-hand side reflects the effect of the first phenomenon to a certain extent. This equation helps us understand the two phenomena.
In Table 2, if σ_{x}^{2} is large, then ^{*}Dvar[β̂] is generally greater than Dvar[β̂]. In every simulation, the estimate for the variance of the slope, i.e., (S_{yy}/S_{xy}^{2})MSE, was calculated for each regression line. ^{*}Dvar[β̂] is the mean of the 50 000 estimates thus calculated. We can also obtain ^{*}Dvar[β̂] using another method, as Dvar[β̂] plus an additional correction term; this term reflects the difference between ^{*}Dvar[β̂] and Dvar[β̂]. The difference depends on σ_{x}^{2} and n.
In this section, we investigated the accuracy of the statistical properties of reversed inverse regression as derived using the error propagation rule and the method of simultaneous error equations through comparisons with simulation results. However, it should also be noted that the main target that calibration experts wish to obtain (or approach) by means of regression line fitting is the population regression line y = α + βx, not the average regression line y = α_{E} + β_{E}x. In this respect, it is recommended that after the physical or chemical value of a sample is determined based on the fitted regression line, the determined value be corrected taking into account the bias in the predicted y value (see Eq. (7)); such a bias correction will lead us closer to the true value.
Simulation results and theoretically derived properties.
Ratios of the simulation results to the corresponding derived properties.
6 Conclusion
From Osborne [17], it can be seen that considerable effort has been made to resolve the linear calibration problem since the 1930s. Most representatively, Eisenhart [2] suggested classical regression as a solution for the problem, and Krutchkoff [6] suggested inverse regression as another solution. Later, Parker et al. [4] derived the variances of the prediction interval and the biases of the predicted values for these two types of regression using the Delta Method. However, it can be said that the problem has not yet been resolved completely. As a fundamental solution for this problem, the current study introduced reversed inverse regression along with a methodology for deriving its statistical properties. In this study, the statistical properties of reversed inverse regression, such as the variance and bias of the slope, the expectation of the mean squared error, and the variance of the predicted y value, were derived using the error propagation rule and the method of simultaneous error equations. The method of simultaneous error equations, which was introduced for the first time in this study, is a useful tool for deriving the covariance of any two statistics. As another example of its use, all of the statistical properties of basic regression can be derived much more easily with the aid of this method. Even in the case of weighted linear regression, this method can be used to derive its statistical properties.
We presented an example of a practical calibration. Each of the three types of regression (i.e., classical, inverse and reversed inverse) was applied to this calibration example. As a result, we found that the estimates of the variance of the prediction interval can be arranged in order of increasing magnitude as follows: “inverse,” “reversed inverse” and then “classical”. This ordering holds for all linear calibrations; the differences among the three estimates depend on r(x, y). As the next step, to investigate the accuracy of the three derived statistical properties of reversed inverse regression, i.e., Dvar[β̂], Dbias[β̂] and DE[MSE], a Monte Carlo simulation study was conducted. Through this simulation study, we found that when the variance of the observed measurements is small, the theoretically derived variance and bias of the slope, as well as the theoretically derived expectation of the mean squared error, coincide with the simulation results. However, when this variance is large, there are small differences between the derived properties and the simulation results. Such differences are caused by two phenomena: the first is that error terms of higher order are excluded from the derivation, and the second is that the bias in β̂ is not reflected in the derivation. The first phenomenon typically causes the derived statistical properties to decrease, whereas the second tends to cause them to increase (when n is greater than 3). The effect of the second phenomenon is larger than that of the first, and hence the values of the derived properties are typically slightly greater than the simulation results. In this way, the simulations allowed us to investigate and analyze the differences between the derived statistical properties and the simulation results, which is another benefit of the new methodology used to derive the statistical properties of reversed inverse regression.
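Of the three approaches, the two long-established ones are easy to sketch on synthetic data (the data, noise level, and new observation y0 below are assumed; the reversed inverse fit and the prediction-interval variances themselves are not reproduced here). Classical regression fits y on x and inverts the fitted line; inverse regression fits x on y directly.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.array([0.0, 25.0, 50.0, 75.0, 100.0])    # reference standards (fixed)
y = 1.0 + 0.025 * x + rng.normal(0, 0.05, 5)    # observed responses

y0 = 2.0                                        # new observation to calibrate

# Classical regression: fit y = b0 + b1*x, then invert
b1, b0 = np.polyfit(x, y, 1)
x_classical = (y0 - b0) / b1

# Inverse regression: fit x = d0 + d1*y directly
d1, d0 = np.polyfit(y, x, 1)
x_inverse = d0 + d1 * y0

print(x_classical, x_inverse)   # with these settings both land near 40
```

When r(x, y) is high the two estimates nearly coincide; as r(x, y) decreases, the inverse estimate shrinks more strongly toward the mean of x, which is one source of the ordering of the prediction-interval variances noted above.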
7 Implications and influences
Lwin and Maritz [18] suggested that regression models do not require the assumption of fixed inputs. In other words, regardless of whether the regression model of interest is consistent with this assumption, the method of least squares can be applied to fit a regression line. In that sense, it is meaningless to identify whether the line fitted using one regression approach is preferable to that fitted using another regression approach. However, it is nevertheless essential to know the statistical properties of the type of regression used for fitting. Unfortunately, the known statistical properties of the existing regression approaches are not without flaw. By contrast, all of the statistical properties of reversed inverse regression can be derived using the newly proposed methodology, and the statistical properties derived in this manner are theoretically correct and sufficiently accurate. In this respect, we claim that reversed inverse regression and the new methodology for deriving its statistical properties together serve as a fundamental solution for the univariate linear calibration problem, which had not previously been completely resolved. Finally, we expect this new methodology to be widely used in the field of calibration.
Supplementary Material
Derivations of the statistical properties of reversed inverse regression.
Acknowledgments
The study reported in this paper was conducted as part of a plan to improve the quality assurance and control system of KEPCO Nuclear Fuel. The authors would like to express their thanks for the support from their company, without which the study could not have been successfully completed. In particular, the authors would like to express special thanks to President & CEO, Jaehee Lee; Executive Vice President & Chief Production Officer, Sundoo Kim; and Ex-Executive Vice President & Chief Production Officer, Chuljoo Park, who cordially supported and encouraged the authors in their study on the statistical theory and development of a new calibration approach using a regression model.
References
1. R.E. Walpole, R.H. Myers, Probability and Statistics for Engineers and Scientists, 5th edn. (Macmillan Publishing Company, London, 1993)
2. C. Eisenhart, The interpretation of certain regression methods and their use in biological and industrial research, Ann. Math. Stat. 10, 162–186 (1939)
3. E.J. Williams, A note on regression methods in calibration, Technometrics 11, 189–192 (1969)
4. P.A. Parker, G.G. Vining, S.R. Wilson, J.L. Szarka III, N.G. Johnson, The prediction properties of inverse and reverse regression for the simple linear calibration problem, J. Qual. Technol. 42, 332–347 (2010)
5. G. Casella, R.L. Berger, Statistical Inference, 2nd edn. (Duxbury, Pacific Grove, 2002)
6. R.G. Krutchkoff, Classical and inverse regression methods, Technometrics 9, 425–439 (1967)
7. G.K. Shukla, P. Datta, Comparison of the inverse estimator with the classical estimator subject to a preliminary test in linear calibration, J. Stat. Plan. Inference 12, 93–102 (1985)
8. S.D. Oman, An exact formula for the M.S.E. of the inverse estimator in the linear calibration problem, J. Stat. Plan. Inference 11, 189–196 (1985)
9. W. Fuller, Measurement Error Models (John Wiley & Sons, Hoboken, 1987)
10. T. Pham-Gia, N. Turkkan, E. Marchand, Density of the ratio of two normal random variables and applications, Commun. Stat. Theory Methods 35, 1569–1591 (2006)
11. N. Tsoulfanidis, Measurement and Detection of Radiation (Hemisphere Publishing Corporation, Washington, 1983)
12. A. Papanicolaou, Taylor Approximation and the Delta Method, lecture notes (Stanford University, 2009)
13. R.G. Krutchkoff, Classical and inverse regression methods in extrapolation, Technometrics 11, 605–608 (1969)
14. J. Berkson, Estimation of a linear function for a calibration line; consideration of a recent proposal, Technometrics 11, 647–660 (1969)
15. M. Halpern, On inverse estimation in linear regression, Technometrics 12, 727–736 (1970)
16. M.Y. Suh, Methods for the Calculation of Uncertainty in Analytical Chemistry, KAERI/TR-1602/2000, in Korean (Korea Atomic Energy Research Institute, Daejeon, 2000)
17. C. Osborne, Statistical calibration: a review, Int. Stat. Rev. 59, 309–336 (1991)
18. T. Lwin, J.S. Maritz, An analysis of the linear calibration controversy from the perspective of compound estimation, Technometrics 24, 235–242 (1982)
Cite this article as: Pilsang Kang, Changhoi Koo, Hokyu Roh, Reversed inverse regression for the univariate linear calibration and its statistical properties derived using a new methodology, Int. J. Metrol. Qual. Eng. 8, 28 (2017)