Open Access
Int. J. Metrol. Qual. Eng. 10, 3 (2019). https://doi.org/10.1051/ijmqe/2019003. Published online 18 April 2019.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

Interlaboratory comparisons (ILCs) are one of the key activities in metrology, and are undertaken for a number of reasons, e.g., to determine a consensus value for a quantity by combining experimental results from a number of laboratories. In this paper, we are primarily interested in finding a consensus estimate for the quantity being measured. The starting point for the analysis of ILC data is usually a model [1] of the form

  x_j = α + ε_j,  ε_j ~ N(0, v_{0,j}),  v_{0,j} = u_j^2,  j = 1, …, n,   (1)

where α is the value of the measurand, x_j is the estimate of α produced by the jth laboratory and v_{0,j} its associated variance, where u_j is its standard uncertainty. Given the model (1), the distribution for α is derived from the weighted least squares (WLS) estimate

  a = (C^T V_0^{-1} C)^{-1} C^T V_0^{-1} x,  V_0 = diag(v_0),

involving the n × 1 observation matrix C = (1, …, 1)^T. If the data is regarded as a sample from Gaussian distributions as in (1), then a is a sample from the Gaussian distribution

  a ~ N(α, V_a),  V_a = (C^T V_0^{-1} C)^{-1} = (Σ_j 1/v_{0,j})^{-1}.

Under the model (1), the sum of the squares of the weighted residuals r = r(v_0) = x − Ca,

  R^2 = r^T V_0^{-1} r = Σ_j (x_j − a)^2 / v_{0,j},

is a sample from a χ² distribution with n − 1 degrees of freedom. The degree of consistency of the data x = (x_1, …, x_n)^T and u = (u_1, …, u_n)^T can be assessed by evaluating Pr(ζ² ≥ R²), where ζ² ~ χ²_{n−1}. If Pr(ζ² ≥ R²) is less than 5%, say, then there is doubt about the consistency of the data [1]. If the data is judged to be inconsistent, the WLS estimate a might not be a reliable estimate of α in the sense that the variance matrix V_a may underestimate the uncertainty associated with this estimate. In the presence of inconsistency, a procedure can be applied to adjust the uncertainties to achieve consistency, see, e.g., [2–10]. The underlying assumption behind these approaches is that some or all of the laboratories have underestimated or neglected some uncertainty contributions, sometimes referred to as ‘dark uncertainty’ [11], and the adjusted uncertainties implicitly provide an estimate of this dark uncertainty derived from the complete set of laboratory results.
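As an illustrative sketch of this consistency check (the function names are ours, not from a reference implementation), the following computes the WLS consensus value, the statistic R², and Pr(ζ² ≥ R²); the χ² survival function is evaluated through the standard series for the regularized lower incomplete gamma function.

```python
import math

def chi2_sf(x, nu):
    """Pr(chi^2_nu >= x) via the series expansion of the regularized
    lower incomplete gamma function P(nu/2, x/2)."""
    if x <= 0.0:
        return 1.0
    s, hx = nu / 2.0, x / 2.0
    term = 1.0 / s          # k = 0 term of sum_k x^k / (s (s+1) ... (s+k))
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= hx / (s + k)
        total += term
        k += 1
    cdf = total * math.exp(-hx + s * math.log(hx) - math.lgamma(s))
    return 1.0 - cdf

def wls_consistency(x, u0):
    """Weighted mean a, its standard uncertainty, the chi-squared
    statistic R^2 and the consistency probability Pr(zeta^2 >= R^2)."""
    w = [1.0 / (uj * uj) for uj in u0]      # weights 1/v_{0,j}
    a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    ua = math.sqrt(1.0 / sum(w))
    r2 = sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))
    return a, ua, r2, chi2_sf(r2, len(x) - 1)
```

With n = 11 laboratories, for instance, the statistic is referred to χ²_{10}, whose 95th percentile is approximately 18.31.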

The methods discussed in this paper are relevant to several types of ILCs, such as those included in ISO standards and in the field of testing. For these types of ILC, a primary objective is to learn about laboratory effects from the information generated in the intercomparison. They are, however, not directly relevant to the analysis of Key Comparisons (KCs), as the type of data adjustments considered in this paper are not permitted.

1.1 Adjustment as a model selection problem

The issue of whether or not to adjust a set of input uncertainties can be thought of as a model selection problem involving two models. The first model M_1 assumes complete faith in the input data x and u_0. In model M_1, any large value for R², say greater than the 95th percentile of the relevant χ² distribution, is due to chance: we expect R² to exceed this threshold for 1 in 20 ILCs performed at ‘random’. The second model M_2 assumes that one or more of the laboratories have underestimated their uncertainties and that some adjustment to the input uncertainties may be necessary to achieve a more consistent set of results.

The observed value of R² is used to decide which of these models is most supported by the data: we set a threshold c_0^2 so that if R² ≤ c_0^2, we select M_1, and select model M_2 otherwise. For example, if c_0^2 corresponds to the 95th percentile, we are saying that a value of R² above the threshold would arise by chance under model M_1 in fewer than 1 in 20 comparisons, so such inconsistency is better explained by model M_2. (It is possible to provide a more formal methodology for such model selection problems, e.g., [12,13], but this is not the focus of this paper. See also [6,14] that discuss Bayesian model averaging approaches to the analysis of ILC data.)

We now turn to the question of how far to adjust. Prior to the analysis, we assume that a threshold value c_0^2 for R² has been assigned, say corresponding to the 95th percentile of χ²_{n−1}, and that R²(v_0) > c_0^2, so that model M_2 is selected. We look for adjusted variances v such that the corresponding value of R²(v) satisfies R²(v) ≤ c². The choice of the constraint value c² needs to be considered. A common choice is to set c² = n − 1, the expected value of the χ²_{n−1} distribution. In the example below, there are n = 11 laboratories. The 95th percentile of χ²_{10} is 18.31, while its mean is 10, which in fact corresponds to the 56th percentile, P(χ² < 10) = 0.56. (The 50th percentile, in other words the median, of a χ²_ν distribution is approximately ν(1 − 2/(9ν))³ and is somewhat less than its mean, due to the skewness of the distribution.) Thus, if R²(v_0) is just above the threshold, an adjustment procedure will be applied to achieve a value of R²(v) = 10, while if R²(v_0) is just below the threshold, no adjustment procedure is applied. Hence, a relatively small change in the input data can lead to relatively large changes in the inferences about α. We would prefer the adjustment procedures to be smooth with respect to changes in the input data and therefore argue that c² should be the same as c_0^2. It also seems somewhat illogical to tolerate a value of R² just below the threshold value but, if adjustment is deemed necessary, to adjust to a level of consistency much lower than the threshold value.

We return to the question of what is a good value for c_0^2. While a value corresponding to the 95th percentile is usual [5], other choices such as a value corresponding to, say, the 80th percentile may be appropriate, being a balance between only adjusting where there is evidence of inconsistency and achieving a level of consistency that is closer to the expected level. It also represents a compromise with current practice, which is to set a threshold at the 95th percentile and adjust, if necessary, to the mean of the χ² distribution. In terms of model selection, a choice of the 80th percentile indicates that there is a better than 1 in 5 chance that the inconsistency is due to model M_2 rather than model M_1.

1.2 Relative entropy as a measure of the extent of an adjustment

Our main concern is to design adjustment procedures that lead to adjusted v that represent a minimal departure from the stated v_0 according to some measure. We regard the input v_0 as being the best estimates that the participating laboratories can deliver, based on their knowledge of their own systems. If the complete set of results indicates inconsistency, then this new knowledge is available to help provide adjusted v that improve consistency, i.e., deliver estimates that are more likely given the new knowledge, while at the same time respecting as much as possible the knowledge represented by the initial input data. In this paper, we discuss adjusted uncertainty measures derived by minimising the relative entropy or Kullback–Leibler divergence [15,16], a measure of how far v is from v_0. We consider parameterising v = v(b) as a function of parameters b with the input v_0 = v(b_0) for some b_0, and we look to perturb b from b_0 to bring about consistency. We now regard a = a(b) and R² = R²(b) as functions of b through their dependence on v(b).

Section 2 defines the relative entropy of an adjustment and Section 3 describes adjustment through minimising relative entropy. Section 4 describes a number of single parameter adjustment schemes in which the degree of adjustment is determined by the consistency constraint. Section 5 introduces the steepest descent (S-D) adjustment scheme. A comparison of adjustment procedures in terms of the degree of adjustment of ILC data is presented in Section 6. This data is used for illustrative purposes only as key comparison data cannot be adjusted in the manner described in the paper. Our concluding remarks are given in Section 7.

2 Relative entropy as a measure of degree of adjustment

Relative entropy, or Kullback–Leibler divergence [15,16], is a measure of the difference between two distributions. For continuous distributions with densities p = p(x) and p_0 = p_0(x), it is given by

  D_KL(p || p_0) = ∫ p(x) log [ p(x)/p_0(x) ] dx.

For any p and p_0, D_KL(p || p_0) ≥ 0, with D_KL(p || p_0) = 0 only if p = p_0. Note that D_KL(p || p_0) ≠ D_KL(p_0 || p) in general, and therefore relative entropy is not a norm. It is usually the case that p_0 is an approximate distribution for p and the relative entropy is a measure of the information gained using p over p_0.

For multivariate Gaussian distributions p = N(μ_1, V_1) and p_0 = N(μ_0, V_0) in n dimensions,

  D_KL(p || p_0) = (1/2) [ Tr(V_0^{-1} V_1) + (μ_0 − μ_1)^T V_0^{-1} (μ_0 − μ_1) − n + log(|V_0|/|V_1|) ].

In the above, |A| is the determinant of a square matrix A and Tr(A) is the trace of A, the sum of its diagonal elements. For our application, with the mean x the same for both distributions and diagonal variance matrices, the relative entropy is a function of b and b_0 in terms of v = v(b) and v_0 = v(b_0):

  D_KL(b || b_0) = (1/2) Σ_j [ v_j/v_{0,j} − 1 − log(v_j/v_{0,j}) ].
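The diagonal-Gaussian form of the relative entropy is straightforward to compute. A minimal sketch (the function name is ours) evaluating D_KL as a function of the adjusted and input variances:

```python
import math

def dkl_diag(v, v0):
    """Relative entropy D_KL between two Gaussians sharing a mean, with
    diagonal variance matrices diag(v) and diag(v0):
        D = 0.5 * sum_j (v_j/v0_j - 1 - log(v_j/v0_j))."""
    return 0.5 * sum(vj / v0j - 1.0 - math.log(vj / v0j)
                     for vj, v0j in zip(v, v0))
```

D_KL is zero when v = v_0 and grows as the adjusted variances move away from the input variances, in either direction.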

3 Adjustment procedure based on minimising relative entropy

We are interested in adjustment procedures that represent as small a change as necessary, as measured by the relative entropy, to bring about consistency. This problem can be formulated as

  min_b D_KL(b || b_0)  subject to  R²(b) ≤ c².

The constraint is equivalent to R²(b) = c²: since the unconstrained minimiser b = b_0 has R²(b_0) > c², the inequality constraint is active at the solution of this minimisation problem.

The problem can also be parametrised in terms of weights w_j = 1/v_j, j = 1, …, n,

with

  D_KL(w || w_0) = (1/2) Σ_j [ w_{0,j}/w_j − 1 + log(w_j/w_{0,j}) ],  w_{0,j} = 1/v_{0,j}.

In terms of the weights w, the consistency constraint R²(w) = c² can be written explicitly as E(w) = 0, where

  E(w) = c² s_0 − s_2 s_0 + s_1²,  s_0 = Σ_j w_j,  s_1 = Σ_j w_j x_j,  s_2 = Σ_j w_j x_j²,

so that E(w) = s_0 (c² − R²(w)).
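A minimal numerical check of the weight-form constraint, under the sign convention assumed here, namely E(w) = s_0 (c² − R²(w)) with s_k = Σ_j w_j x_j^k, so that E(w_0) < 0 signals inconsistency (function names are ours):

```python
def constraint_E(w, x, c2):
    """Consistency constraint in weight form (assumed sign convention):
        E(w) = c2*s0 - s2*s0 + s1^2,  s_k = sum_j w_j * x_j**k,
    which equals s0 * (c2 - R^2(w)); E(w) = 0 iff R^2(w) = c2."""
    s0 = sum(w)
    s1 = sum(wj * xj for wj, xj in zip(w, x))
    s2 = sum(wj * xj * xj for wj, xj in zip(w, x))
    return c2 * s0 - s2 * s0 + s1 * s1

def r2_of_w(w, x):
    """Weighted sum of squared residuals about the weighted mean."""
    a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    return sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))
```

The identity E(w) = s_0 (c² − R²(w)) makes explicit that E is quadratic in the weights, whereas R² itself is a ratio of quadratics.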

Thus, E(w) is a quadratic function of the parameters w, one of the advantages of working with w(b) rather than v(b). In terms of the weights w, we arrive at the adjustment procedure

  min_w D_KL(w || w_0)

subject to E(w) = 0, 0 ≤ w_j ≤ w_{0,j}, j = 1, …, n.

The last set of inequalities imposes the constraint that we only look for an adjustment that increases the input uncertainties.

3.1 Newton scheme to determine w

If we assume that E(w_0) < 0, that is, R²(v_0) > c², the optimal w automatically satisfies the constraints w_j ≤ w_{0,j}, and the optimal w is then a critical point of the Lagrangian function

  L(w, κ) = D_KL(w || w_0) + κ E(w).

This leads to a square system of n + 1 equations,

  ∂D_KL/∂w_j + κ ∂E/∂w_j = 0,  j = 1, …, n,  E(w) = 0,

that can be solved iteratively using Newton's method, for example [17]. A starting estimate could be w = w_0, with κ defined by the least squares solution of the first n equations above. The Newton scheme can be implemented in stages for decreasing values of c², starting with a value of c² close to c_0^2, determined by the input v_0, and ending with the desired value for c². The staged approach may help with divergence of a simple Newton scheme if the starting point is far from the solution.

4 Single-parameter adjustment procedures

There are a number of adjustment procedures that are commonly used for which v = v(λ) depends on a single parameter λ, and the consistency constraint R²(λ) = c² defines the value of λ.

The Birge procedure [18] corresponds to the model

  v_j(λ) = λ v_{0,j},  j = 1, …, n,

while the M-P procedure [8] corresponds to the model

  v(λ) = v_0 + λ e,

where e_j = 1, j = 1, …, n. The M-P procedure can be motivated by a model in which all the participants have omitted the same additive influence factor from their uncertainty budget. The variance of this influence factor is estimated from the consistency constraint. It is not so easy to associate the Birge procedure with a plausible underlying statistical model.
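Both single-parameter procedures are easy to implement: the weighted mean is invariant under the Birge scaling, so R²(λ) = R²(v_0)/λ and λ is available in closed form, while the M-P equation R²(λ) = c² can be solved by bisection since R² decreases monotonically in λ. A sketch (function names are ours, assuming R²(v_0) > c²):

```python
def r2_of_v(x, v):
    """Chi-squared statistic R^2 for estimates x and variances v."""
    w = [1.0 / vj for vj in v]
    a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    return sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))

def birge(x, u0, c2):
    """Birge: v_j(lam) = lam * v0_j.  The weighted mean is unchanged
    and R^2(lam) = R^2(v0)/lam, so lam = R^2(v0)/c2 in closed form."""
    v0 = [uj * uj for uj in u0]
    lam = r2_of_v(x, v0) / c2
    return [lam * v0j for v0j in v0]

def mandel_paule(x, u0, c2, tol=1e-12):
    """M-P: v_j(lam) = v0_j + lam.  R^2(lam) decreases monotonically
    in lam, so R^2(lam) = c2 is solved by bracketing and bisection."""
    v0 = [uj * uj for uj in u0]
    hi = 1.0
    while r2_of_v(x, [v0j + hi for v0j in v0]) > c2:   # bracket the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        if r2_of_v(x, [v0j + mid for v0j in v0]) > c2:
            lo = mid
        else:
            hi = mid
    return [v0j + 0.5 * (lo + hi) for v0j in v0]
```

For two laboratories with equal input uncertainties, both procedures return the same adjusted variances; they differ when the input uncertainties differ, since Birge scales each v_{0,j} proportionally while M-P adds the same amount to each.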

5 Adjustment procedures based on a steepest descent vector

Inconsistency is indicated if the sum of squared residuals R²(b_0) is too high. This suggests adjusting b_0 to b along a descent direction g so that, at least for small λ > 0, R²(b_0 + λg) < R²(b_0). Since we are interested in determining b that achieves consistency with an adjustment that is minimal in some way, we set g to be the steepest descent (S-D) direction:

  g = −∇_b R²(b_0).

A small step in the direction of g will bring about the swiftest reduction in R 2 ( b ) relative to the size of the change in b .

The S-D direction depends on how v(b) is parametrised in terms of b. There are a number of candidates: (i) v_j(b) = b_j, so that b_j represents the variance, (ii) v_j(b) = b_j², so that b_j represents the standard uncertainty, (iii) v_j(b) = 1/b_j, so that b_j represents the precision, or (iv) v_j(b) = 1/b_j², where b_j is the square root of the precision.

We may also choose the parametrisation

  v_j(b) = v_{0,j} exp(b_j),  j = 1, …, n,  b_0 = 0.   (2)

For this parametrisation, we have that

  g_j = −∂R²/∂b_j |_{b = b_0} = ((x_j − a_0)/u_{0,j})².   (3)

Thus, the jth element of the S-D vector depends on the difference between x j and the weighted least-squares solution a 0 relative to the input uncertainty u 0,j . This means that the degree of adjustment for a laboratory depends on the extent to which its result is deemed to be outlying, relative to the weighted least-squares solution.

For the parametrisation (2), we have

  D_KL(b || b_0) = (1/2) Σ_j [ exp(b_j) − 1 − b_j ].

Thus, for b_j ≈ 0, j = 1, …, n, D_KL(b || b_0) ≈ (1/4) Σ_j b_j². This means that for the parametrisation (2), a small step along the S-D direction derived from (3) brings about the largest reduction in R²(b) relative to the change in D_KL(b || b_0).

5.1 Stepped approach

With the S-D approach, if the estimate x_j provided by the jth laboratory happens to be very close to the consensus value a_0 determined by the WLS approach (using the input v(b_0)), then the S-D vector will have only a very small component in the direction of increasing v_j, so that v_j remains essentially unadjusted. If the consensus value a given by the adjusted v(b) is significantly different from the initial consensus value, there may be evidence that v_j should also undergo some adjustment, as x_j is no longer so close to the new consensus value. This suggests implementing a staged approach, determining S-D adjustments for a sequence of values of c² descending from some large value (just below the R²(b_0) value determined by the initial data, say) down to the desired c².
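The stepped approach can be sketched as follows, assuming the exponential parametrisation v_j(b) = v_{0,j} exp(b_j) for the S-D direction and a simple linear schedule of intermediate targets; the staging details and function names here are illustrative, not the authors' exact scheme.

```python
import math

def r2_of_v(x, v):
    """Chi-squared statistic R^2 for estimates x and variances v."""
    w = [1.0 / vj for vj in v]
    a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    return sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))

def sd_adjust(x, u0, c2, n_stages=20):
    """Stepped steepest-descent adjustment (illustrative sketch).
    Assumes v_j(b) = v0_j * exp(b_j); at each stage the descent direction
    g_j = (x_j - a)^2 / v_j is re-evaluated at the current variances, and
    the step length along g is found by bisection so that R^2 meets an
    intermediate target on a linear schedule descending to c2."""
    v0 = [uj * uj for uj in u0]
    b = [0.0] * len(x)

    def v_of(bb):
        return [v0j * math.exp(bj) for v0j, bj in zip(v0, bb)]

    r2_start = r2_of_v(x, v0)
    for k in range(1, n_stages + 1):
        target = r2_start + (c2 - r2_start) * k / n_stages
        v = v_of(b)
        w = [1.0 / vj for vj in v]
        a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
        g = [(xj - a) ** 2 / vj for xj, vj in zip(x, v)]  # -dR^2/db_j
        lo, hi = 0.0, 1.0
        while r2_of_v(x, v_of([bj + hi * gj
                               for bj, gj in zip(b, g)])) > target:
            hi *= 2.0                       # bracket the step length
        for _ in range(80):                 # bisection for the step length
            mid = 0.5 * (lo + hi)
            if r2_of_v(x, v_of([bj + mid * gj
                                for bj, gj in zip(b, g)])) > target:
                lo = mid
            else:
                hi = mid
        b = [bj + hi * gj for bj, gj in zip(b, g)]
    return v_of(b)
```

Because g_j ≥ 0, each step only increases variances, and laboratories far from the current consensus value receive the largest adjustments, as described above.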

6 Example calculations

We present calculations for data derived from the Consultative Committee for Length key comparison CCL-K1 [19], calibration of gauge blocks by interferometry. This data is used for illustrative purposes only as, according to the technical protocol, key comparison data cannot be adjusted in the manner described in this paper. The measured values x_j and associated standard uncertainties u_{0,j} are from 11 participating laboratories measuring 9 tungsten carbide gauge blocks of nominal lengths ranging from 0.5 mm to 100 mm ([20], Tab. 5). The laboratory results and uncertainty bars x_j ± 2u_j are represented in Figure 1 by the first (red) uncertainty bar in each group. The figure shows that the results from laboratories 6 and 7 and, to a lesser extent, laboratory 11 deviated from the consensus value given by the weighted mean, the bottom (solid red) horizontal line. The analysis of the data for the 100 mm gauge block found that the R²(b_0) value was at the 98th percentile of the χ² distribution, indicating inconsistency of the data: if we simulated a large number of data sets generated according to the model (1), only 2% would result in values of R² greater than or equal to R²(b_0).

Fig. 1. Estimates and 95% confidence intervals of a 100 mm gauge block. The graph shows the uncertainty bars x_j ± 2u_j for the input uncertainties u_0, the first set of uncertainty bars in each group, and for the adjusted uncertainties u based on the relative entropy method at the 95th percentile, second set, 80th percentile, third set, and the mean, fourth set. The horizontal lines are the weighted means determined from the input uncertainties (red solid line), adjusted uncertainties to the 95th percentile (green dashed line), adjusted uncertainties to the 80th percentile (blue dotted and dashed line) and adjusted uncertainties to the mean (magenta dotted line).

6.1 Adjustment to different consistency levels

As mentioned in Section 1, adjustments are often made to the mean of the χ 2 distribution. We first apply the relative entropy approach to determine adjustments to the 95th and 80th percentiles and to the mean. Table 1 shows the adjusted uncertainties for each level while Figure 1 shows the resulting uncertainty bars x j   ±  2u j , along with the corresponding weighted means. It is seen that, for the relative entropy method, the main adjustments are made to the results of laboratories 6, 7 and 11, with adjustments increasing as the level of consistency is increased. Of interest is the fact that even for the adjustment from the 98th percentile (the original data) to the 95th percentile, the weighted mean changes.

Table 1

Uncertainties adjusted to various levels using the relative entropy method.

We now leave aside the question of the level of consistency to which we should adjust and examine the degree of adjustment of different adjustment strategies as measured by relative entropy. For these comparisons, the adjustment is determined for c² set at the 80th percentile value of the χ² distribution. As shown in Table 1, the adjustments made to the uncertainties are then also moderate compared to those adjusted to the mean of the χ² distribution.

Figure 2 shows the estimates x and the uncertainty bars x j   ±  2u j for the input uncertainties u 0 and for the adjusted uncertainties u from the Birge, M-P, S-D and minimum relative entropy (DKL) procedures.

Consensus values and their associated standard uncertainties are shown in Table 3. All the adjustment schemes result in a larger uncertainty compared to that for WLS, which is a direct consequence of adjusting each laboratory's uncertainty u 0,j . The minimum relative entropy adjustment method yields the smallest uncertainty of all the adjustment procedures as a result of the minimal adjustment to u 0,j .

There are a number of points of interest associated with Tables 2 and 3. In Table 2, we see that the S-D method makes a smaller change to the input distributions as measured by the relative entropy compared to the Birge and M-P procedures. Both the Birge and M-P procedures provide an adjustment direction that does not depend on how the input estimates x j relate to the consensus value a 0, whereas the S-D procedure does reflect this information with the uncertainties associated with estimates further from the consensus value undergoing more adjustment. Of course, the fact that there is inconsistency means that use of the initial consensus estimate a 0 to determine the degree of adjustment has to be undertaken with caution. An advantage of the relative entropy approach (as well as guaranteeing a minimal adjustment according to the criterion) is that the initial consensus estimate plays a significantly lesser role in the adjustment procedure. A feature of Table 3 is the spread of the consensus estimates from −73.2 nm to −76.4 nm, a range of 3.2 nm that can be compared to a standard uncertainty of the order of 4 nm. This shows that the method of adjustment can have a significant impact on the reported consensus value.

Fig. 2. Estimates and 95% confidence intervals of a 100 mm gauge block for various adjustment procedures. The graph shows the uncertainty bars x_j ± 2u_j for the input uncertainties u_0, the first set of uncertainty bars in each group, and for the adjusted uncertainties u from the Birge, second set, Mandel-Paule (M-P), third set, steepest descent (S-D), fourth set, and minimum relative entropy (DKL) procedures, fifth set. The horizontal lines are the weighted means determined from the input uncertainties (solid red), steepest descent (blue dashed) and minimum relative entropy (black dotted).
Table 2

D KL ( b || b 0) for various adjustment schemes.

Table 3

Consensus values with their associated uncertainty.

The adjustment procedures we have considered so far have the feature that the input uncertainties are adjusted to achieve a desired degree of consistency and the adjusted uncertainties are then treated as known reliably as far as the determination of the consensus value is concerned. It seems more reasonable to regard the input uncertainties as estimates of the true uncertainties and then use the ensemble of information provided by the ILC to update these estimates in a Bayesian framework. Such an approach recognises that the updated estimates are not known exactly and this lack of complete knowledge about them will contribute an uncertainty to the consensus value derived from them. A Bayesian version of the Birge and M-P procedures is considered in [18], for example, and leads to t-distributions associated with the consensus value. The Bayesian approach can be developed further [21] in a hierarchical model that leads to a consensus distribution that is a mixture of t-distributions (in the same way that a t-distribution is a mixture of Gaussians). The analysis of the gauge block data in [21] led to a consensus estimate of −73.3 nm with associated standard uncertainty of 3.8 nm which is more in line with the relative entropy estimate in Table 3. We must bear in mind that the relative entropy approach associates a Gaussian distribution to the consensus estimate while the Bayesian approaches assign heavier-tailed distributions.

Acknowledgments

This work was supported by the UK's National Measurement System programme for Data Science. We thank our colleague Dr. Peter Harris, NPL, for comments on a draft of this paper.

References

1. M.G. Cox, The evaluation of key comparison data, Metrologia 39 , 589–595 (2002) [Google Scholar]
2. R.T. Birge, Probable values of the general physical constants, Rev. Mod. Phys. 1 , 1–73 (1929) [Google Scholar]
3. A.G. Chunovkina, C. Elster, I. Lira, W. Wöger, Analysis of key comparison data and laboratory biases, Metrologia 45 , 211–216 (2008) [Google Scholar]
4. M.G. Cox, A.B. Forbes, J. Flowers, P.M. Harris, Least squares adjustment in the presence of discrepant data, in Advanced Mathematical and Computational Tools in Metrology VI , edited by P. Ciarlini, M.G. Cox, F. Pavese, G.B. Rossi (World Scientific, Singapore, 2004), pp. 37–51 [CrossRef] [Google Scholar]
5. M.G. Cox, The evaluation of key comparison data: determining the largest consistent subset, Metrologia 44 , 187–200 (2007) [Google Scholar]
6. C. Elster, B. Toman, Analysis of key comparisons: estimating laboratories' biases by a fixed effects model using Bayesian model averaging, Metrologia 47 , 113–119 (2010) [Google Scholar]
7. A.B. Forbes, C. Perruchet, Measurement systems analysis: concepts and computational approaches, in IMEKO World Congress, Rio de Janeiro, September 18–22, 2006 [Google Scholar]
8. R.C. Paule, J. Mandel, Consensus values and weighting factors, J. Res. Natl. Bur. Stand. 87 , 377–385 (1982) [CrossRef] [Google Scholar]
9. K. Weise, W. Woeger, Removing model and data non-conformity in measurement evaluation, Meas. Sci. Technol. 11 , 1649–1658 (2000) [Google Scholar]
10. R. Willink, Statistical determination of a comparison reference value using hidden errors, Metrologia 39 , 343–354 (2002) [Google Scholar]
11. S. Thompson, S. Ellison, Dark uncertainty, Accredit. Qual. Assur. 16 , 483–487 (2011) [CrossRef] [Google Scholar]
12. J.O. Berger, in Statistical decision theory and Bayesian analysis , 2nd edn. (Springer, New York, 1985) [Google Scholar]
13. H. Chipman, E.I. George, R.E. McCulloch, The practical implementation of Bayesian model selection (Institute of Mathematical Statistics, Beachwood, Ohio, 2001) [Google Scholar]
14. A.B. Forbes, Traceable measurements using sensor networks, Trans. Mach. Learning Data Mining 8 , 77–100 (2015) [Google Scholar]
15. S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 55 , 79–86 (1951) [CrossRef] [Google Scholar]
16. D.J.C. MacKay, Information theory, inference and learning algorithms (Cambridge University Press, Cambridge, 2003) [Google Scholar]
17. P.E. Gill, W. Murray, M.H. Wright, Practical optimization (Academic Press, London, 1981) [Google Scholar]
18. O. Bodnar, C. Elster, J. Fischer, A. Possolo, B. Toman, Evaluation of uncertainty in the adjustment of fundamental constants, Metrologia 53 , S46–S54 (2016) [Google Scholar]
19. R. Thalmann, CCL key comparison: calibration of gauge blocks by interferometry, Metrologia 39, 165 (2002) [Google Scholar]
20. R. Thalmann, CCL key comparison: calibration of gauge blocks by interferometry, Metrologia 39, 165–177 (2002) [Google Scholar]
21. A.B. Forbes, A hierarchical model for the analysis of inter-laboratory comparison data, Metrologia 53 , 1295–1305 (2016) [Google Scholar]

Cite this article as: Kavya Jagan, Alistair B. Forbes, Assessing interlaboratory comparison data adjustment procedures, Int. J. Metrol. Qual. Eng. 10, 3 (2019)

