Assessing interlaboratory comparison data adjustment procedures

Interlaboratory comparisons (ILCs) are one of the key activities in metrology. Estimates x = (x1, …, xn)^T of a measurand α, along with their associated standard uncertainties u0 = (u0,1, …, u0,n)^T, u0,j = u0(xj), are provided by n laboratories. Employing a model of the form
xj ∈ N(α, v0,j),  j = 1,…,n,

where v0,j = u0,j², we may wish to find a consensus value for α. A χ² test can be used to assess the degree to which the spread of the estimates x is consistent with the stated uncertainties u0. If they are judged to be inconsistent, then an adjustment procedure can be applied to determine vj ≥ v0,j so that x and v represent a consistent set of results. The underlying assumption behind this approach is that some or all of the laboratories have underestimated or neglected some uncertainty contributions, sometimes referred to as 'dark uncertainty', and the adjusted v provides an estimate of this dark uncertainty derived from the complete set of laboratory results. There are many such adjustment procedures, including the Birge and Mandel–Paule (M-P) procedures.
In implementing an adjustment procedure, a desirable objective is to make as minimal an adjustment as necessary in order to bring about the required degree of consistency. In this paper, we discuss the use of relative entropy, also known as the Kullback–Leibler divergence, as a measure of the degree of adjustment. We consider parameterising v = v (b) as a function of parameters b with the input v0 = v (b0) for some b0. We look to perturb b from b0 to bring about consistency in a way that minimises how far b is from b0 in terms of the relative entropy or Kullback–Leibler divergence.


Introduction
Interlaboratory comparisons (ILCs) are one of the key activities in metrology, and are undertaken for a number of reasons, e.g., to determine a consensus value for a quantity by combining experimental results from a number of laboratories. In this paper, we are primarily interested in finding a consensus estimate for the quantity being measured. The starting point for the analysis of ILC data is usually a model [1] of the form

xj ∈ N(a, vj), j = 1, …, n, (1)

where a is the value of the measurand, xj is the estimate of a produced by the jth laboratory and vj = uj² its associated variance, where uj is its standard uncertainty. Given the model (1), the distribution for a is derived from the weighted least-squares (WLS) estimate involving the n × 1 observation matrix C. If the data is regarded as a sample from Gaussian distributions as in (1), then the WLS estimate of a is a sample from the Gaussian distribution N(a, Va), Va = (C^T C)^{-1}. Under the model (1), R², the sum of the squares of the weighted residuals r = r(v) = y − Ca evaluated at the WLS estimate, is a sample from a χ² distribution with n − 1 degrees of freedom. The degree of consistency of the data x = (x1, …, xn)^T and u = (u1, …, un)^T can be assessed by evaluating Pr(z² ≥ R²), where z² has a χ² distribution with n − 1 degrees of freedom. If Pr(z² ≥ R²) is less than 5%, say, then there is doubt about the consistency of the data [1]. If the data is judged to be inconsistent, the WLS estimate might not be a reliable estimate of a, in the sense that the variance matrix Va may underestimate the uncertainty associated with this estimate. In the presence of inconsistency, a procedure can be applied to adjust the uncertainties to achieve consistency, see, e.g., [2–10].
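As a concrete illustration of the model (1) and the consistency statistic, the following sketch computes the WLS consensus value, its standard uncertainty and R² for a small data set. The values of x and u below are invented for illustration, not taken from any comparison, and the function name is our own.

```python
def wls_consensus(x, u):
    """Weighted mean, its standard uncertainty, and R^2 for model (1)."""
    w = [1.0 / (uj * uj) for uj in u]                # precisions 1/v_j
    sw = sum(w)
    a = sum(wj * xj for wj, xj in zip(w, x)) / sw    # WLS estimate of the measurand
    ua = (1.0 / sw) ** 0.5                           # u(a) from Va = (C^T C)^{-1}
    # sum of squared weighted residuals, compared against chi-squared with n - 1 dof
    r2 = sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))
    return a, ua, r2

x = [10.1, 9.8, 10.4, 12.0]     # illustrative laboratory estimates
u = [0.2, 0.3, 0.25, 0.4]       # stated standard uncertainties
a, ua, r2 = wls_consensus(x, u)
```

Here the fourth laboratory lies well away from the others, so r2 is large relative to the n − 1 = 3 degrees of freedom, flagging inconsistency.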
The underlying assumption behind these approaches is that some or all of the laboratories have underestimated or neglected some uncertainty contributions, sometimes referred to as 'dark uncertainty' [11], and the adjusted uncertainties implicitly provide an estimate of this dark uncertainty derived from the complete set of laboratory results.
The methods discussed in this paper are relevant to several types of ILCs, such as those included in ISO standards and in the field of testing. For these types of ILC, a primary objective is to learn about laboratory effects from the information generated in the intercomparison. They are, however, not directly relevant to the analysis of Key Comparisons (KCs), as the type of data adjustment considered in this paper is not permitted.

When to adjust, how far to adjust
The issue of whether or not to adjust a set of input uncertainties can be thought of as a model selection problem involving two models. The first model, M1, assumes complete faith in the input data x and u0. In model M1, any large value of R², say greater than the 95th percentile c̄² of the relevant χ² distribution, is due to chance: we expect R² to exceed this threshold for 1 in 20 ILCs performed at 'random'. The second model, M2, assumes that one or more of the laboratories have underestimated their uncertainties and that some adjustment to the input uncertainties may be necessary to achieve a more consistent set of results.
The observed value of R² is used to decide which of these models is most supported by the data: we set c̄² so that if R² < c̄², we select M1, and select model M2 otherwise. For example, if c̄² corresponds to the 95th percentile, we are saying that there is a better than 1 in 20 chance that inconsistency is due to model M2 rather than model M1. (It is possible to provide a more formal methodology for such model selection problems, e.g., [12,13], but this is not the focus of this paper. See also [6,14], which discuss Bayesian model averaging approaches to the analysis of ILC data.)

We now turn to the question of how far to adjust. Prior to the analysis, we assume that a threshold value c̄² for R² has been assigned, say c̄² corresponding to the 95th percentile of the χ² distribution with n − 1 degrees of freedom, and that R0² > c̄², so that model M2 is selected. We look for adjusted variances v such that the corresponding value of R²(v) satisfies R²(v) = c² ≤ c̄². The choice of the constraint value c² needs to be considered. A common choice is to set c² = n − 1, the expected value of the χ² distribution with n − 1 degrees of freedom. In the example below, there are n = 11 laboratories. The 95th percentile of the χ² distribution with 10 degrees of freedom is c̄² = 18.3, while its mean is 10, which in fact corresponds to the 56th percentile, Pr(χ² < 10) = 0.56. (The 50th percentile, in other words the median, of a χ² distribution with n degrees of freedom is approximately n(1 − 2/(9n))³ and is somewhat less than its mean, due to the skewness of the distribution.) Thus, if R0² = 18.5, above the threshold, an adjustment procedure will be applied to achieve a value of R²(v) = 10, while if R0² = 18.1, less than the threshold, no adjustment procedure is applied. Hence, a relatively small change in the input data can lead to relatively large changes in the inferences about a. We would prefer adjustment procedures to be smooth with respect to changes in the input data and therefore argue that c² should be the same as c̄².
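The percentiles quoted above can be checked numerically. The sketch below implements the χ² CDF via the regularised lower incomplete gamma function and inverts it by bisection, using only the standard library; the function names are our own.

```python
import math

def chi2_cdf(x, k):
    """Chi-squared CDF with k degrees of freedom: the regularised lower
    incomplete gamma function P(k/2, x/2), evaluated by its power series."""
    a, z = k / 2.0, x / 2.0
    if z <= 0.0:
        return 0.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_ppf(p, k, hi=100.0):
    """Percentile (inverse CDF) by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if chi2_cdf(mid, k) < p else (lo, mid)
    return 0.5 * (lo + hi)

dof = 10                                         # n = 11 laboratories
threshold = chi2_ppf(0.95, dof)                  # 95th percentile, about 18.3
mean_pct = chi2_cdf(dof, dof)                    # percentile of the mean, about 0.56
median_approx = dof * (1 - 2 / (9 * dof)) ** 3   # about 9.35, below the mean of 10
```

The computed values reproduce the figures in the text: a threshold of 18.3, a mean of 10 sitting at the 56th percentile, and a median slightly below the mean.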
It also seems somewhat illogical to tolerate a value of R0² just below the threshold value but, if adjustment is deemed necessary, to adjust to a level of consistency much lower than the threshold value.
We return to the question of what is a good value for c² (= c̄²). While a value corresponding to the 95th percentile is usual [5], other choices, such as a value corresponding to, say, the 80th percentile, may be appropriate, being a balance between only adjusting where there is evidence of inconsistency and achieving a level of consistency that is closer to the expected level. It also represents a compromise from the current practice of setting a threshold at the 95th percentile and adjusting, if necessary, to the mean of the χ² distribution. In terms of model selection, a choice of the 80th percentile indicates that if R² > c̄², there is a better than 1 in 5 chance that inconsistency is due to model M2 rather than model M1.

Relative entropy as a measure of the extent of an adjustment
Our main concern is to design adjustment procedures that lead to adjusted v that represent a minimal departure from the stated v0 according to some measure. We regard the input v0 as the best estimates that the participating laboratories can deliver, based on their knowledge of their own systems. If the complete set of results indicates inconsistency, then this new knowledge is available to help provide adjusted v that improve consistency, i.e., that deliver estimates that are more likely given the new knowledge, while at the same time respecting as much as possible the knowledge represented by the initial input data. In this paper, we discuss adjusted uncertainty measures derived by minimising the relative entropy, or Kullback–Leibler divergence [15,16], a measure of how far v is from v0. We consider parameterising v = v(b) as a function of parameters b, with the input v0 = v(b0) for some b0, and we look to perturb b from b0 to bring about consistency. We now regard a = a(b) and R² = R²(b) as functions of b through their dependence on v = v(b).
Section 2 defines the relative entropy of an adjustment and Section 3 describes adjustment through minimising relative entropy. Section 4 describes a number of single parameter adjustment schemes in which the degree of adjustment is determined by the consistency constraint. Section 5 introduces the steepest descent (S-D) adjustment scheme. A comparison of adjustment procedures in terms of the degree of adjustment of ILC data is presented in Section 6. This data is used for illustrative purposes only as key comparison data cannot be adjusted in the manner described in the paper. Our concluding remarks are given in Section 7.
Relative entropy as a measure of degree of adjustment

Relative entropy, or Kullback–Leibler divergence [15,16], is a measure of the difference between two distributions. For continuous distributions with densities p and p0, it is given by

DKL(p||p0) = ∫ p(x) log( p(x)/p0(x) ) dx.

In general, DKL(p||p0) ≠ DKL(p0||p), and therefore relative entropy is not a norm. It is usually the case that p0 is an approximate distribution for p, and the relative entropy is a measure of the information gained using p over p0.
For multivariate Gaussian distributions in n dimensions,

DKL( N(x1, V1) || N(x0, V0) ) = (1/2) [ Tr(V0^{-1} V1) + (x0 − x1)^T V0^{-1} (x0 − x1) − n + log( |V0| / |V1| ) ].

In the above, |A| is the determinant of a square matrix A and Tr(A) is the trace of A, the sum of its diagonal elements. For our application, with x the same for both distributions and diagonal variance matrices, the relative entropy is a function of b and b0 through v(b) and v(b0):

DKL(b||b0) = (1/2) Σj [ vj(b)/vj(b0) − log( vj(b)/vj(b0) ) − 1 ].

Adjustment procedure based on minimising relative entropy

We are interested in adjustment procedures that represent as small a change as necessary, as measured by the relative entropy, to bring about consistency. This problem can be formulated as

min_b DKL(b||b0) subject to R²(b) ≤ c².

The constraint is equivalent to R²(b) = c², as this is a minimisation problem.
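For diagonal variance matrices and a common mean, the divergence reduces to a sum over laboratories, which can be sketched in a few lines. The variances below are illustrative and the function name is our own.

```python
import math

def kl_diag_equal_mean(v, v0):
    """DKL( N(x, diag(v)) || N(x, diag(v0)) ) for a common mean x.
    Each term v_j/v0_j - log(v_j/v0_j) - 1 is non-negative and vanishes
    only when v_j = v0_j, so no adjustment gives zero divergence."""
    return 0.5 * sum(vj / v0j - math.log(vj / v0j) - 1.0
                     for vj, v0j in zip(v, v0))

v0 = [0.04, 0.09, 0.0625]                               # illustrative input variances
no_change = kl_diag_equal_mean(v0, v0)                  # zero divergence
doubled = kl_diag_equal_mean([2.0 * v for v in v0], v0) # 3 * (2 - log 2 - 1) / 2
```

Note that the divergence depends only on the ratios vj/v0,j, so uniformly doubling every variance costs the same divergence whatever the input uncertainties are.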
The problem can also be parametrised in terms of the weights w = w(b), where wj = 1/uj so that vj = 1/wj². In terms of the weights w, the consistency constraint can be written explicitly as

R²(w) = Σj wj² (xj − a)² = c², with a = a(w) = Σj wj² xj / Σj wj²,

in which, for a given value of a, the left-hand side is a quadratic function of the parameters w, one of the advantages of working with w(b) rather than v(b). In terms of the weights w, we arrive at the adjustment procedure

min_w DKL(w||w0) subject to R²(w) = c², wj ≤ w0,j, j = 1, …, n.

The last set of inequalities imposes the constraint that we only look for an adjustment that increases the input uncertainties.

Newton scheme to determine w
If we assume that E(w0) < 0, where E(w) = c² − R²(w), that is, R²(v0) > c², the optimal w automatically satisfies the constraints wj ≤ w0,j, and the optimal w is then a critical point of the Lagrangian function

L(w, k) = DKL(w||w0) + k E(w).

This leads to a square system of n + 1 equations in w and k that can be solved iteratively using Newton's method, for example [17]. A starting estimate could be w = w(b0), with k defined by the least-squares solution of the first n equations above. The Newton scheme can be implemented in stages for decreasing values ck² of c², starting with a value of c² close to the value R0² determined by the input v0, and ending with the desired value for c². The staged approach may help avoid divergence of a simple Newton scheme if the starting point is far from the solution.
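The Newton scheme itself is not reproduced here, but the same minimum relative entropy adjustment can be sketched with elementary tools: differentiating DKL + k(R² − c²) with respect to vj (using the fact that ∂R²/∂a = 0 at the weighted mean) gives a closed-form vj for each value of the multiplier k, and k can then be found by bisection on the constraint. This is an alternative to the paper's Newton iteration, not its implementation; all names and the data are illustrative.

```python
def consensus(x, v):
    """Weighted mean with weights 1/v_j."""
    w = [1.0 / vj for vj in v]
    return sum(wj * xj for wj, xj in zip(w, x)) / sum(w)

def r2(x, v):
    a = consensus(x, v)
    return sum((xj - a) ** 2 / vj for xj, vj in zip(x, v))

def v_of_k(x, v0, k, iters=200):
    """Stationarity of DKL(v||v0) + k (R^2(v) - c^2) in v_j gives
    v_j = ( v0_j + sqrt(v0_j^2 + 8 k (x_j - a)^2 v0_j) ) / 2,
    iterated to self-consistency with the weighted mean a (damped update).
    k = 0 returns v0; k > 0 only ever inflates the variances."""
    v = list(v0)
    for _ in range(iters):
        a = consensus(x, v)
        vnew = [(v0j + (v0j * v0j + 8.0 * k * (xj - a) ** 2 * v0j) ** 0.5) / 2.0
                for v0j, xj in zip(v0, x)]
        v = [0.5 * (vj + vnj) for vj, vnj in zip(v, vnew)]
    return v

def min_kl_adjust(x, v0, c2, k_hi=1e6):
    """Bisect the multiplier k >= 0 so that R^2(v(k)) = c2 (assumes R^2(v0) > c2)."""
    k_lo = 0.0
    for _ in range(100):
        k = 0.5 * (k_lo + k_hi)
        k_lo, k_hi = (k, k_hi) if r2(x, v_of_k(x, v0, k)) > c2 else (k_lo, k)
    return v_of_k(x, v0, 0.5 * (k_lo + k_hi))

x = [10.1, 9.8, 10.4, 12.0]        # illustrative laboratory estimates
v0 = [0.04, 0.09, 0.0625, 0.16]    # stated variances, R^2(v0) about 22
v = min_kl_adjust(x, v0, 7.81)     # adjust down to the 95th percentile of chi-squared, 3 dof
```

The bisection on a single scalar multiplier plays the role of the staged Newton iteration: each trial k yields an adjusted v, and k is tuned until the consistency constraint holds.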

Single parameter adjustment schemes
There are a number of commonly used adjustment procedures for which v = v(λ) depends on a single parameter λ, and the consistency constraint R²(λ) = c² defines the value of λ.
The Birge procedure [18] corresponds to the model v(λ) = λ v0, while the M-P procedure [8] corresponds to the model v(λ) = v0 + λ e, where ej = 1, j = 1, …, n. The M-P procedure can be motivated by a model in which all the participants have omitted the same additive influence factor from their uncertainty budget; the variance of this influence factor is estimated from the consistency constraint. It is not so easy to associate the Birge procedure with a plausible underlying statistical model.
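Both single parameter procedures can be sketched in a few lines. For the Birge model, scaling every variance by λ leaves the weighted mean unchanged and divides R² by λ, so λ = R²(v0)/c² in closed form; for the M-P model, λ is found by bisection, since R²(λ) decreases monotonically in λ. The data and function names are illustrative.

```python
def wmean_r2(x, v):
    """Weighted mean and the chi-squared statistic R^2 for variances v."""
    w = [1.0 / vj for vj in v]
    a = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
    return a, sum(wj * (xj - a) ** 2 for wj, xj in zip(w, x))

def birge(x, v0, c2):
    """Birge: v(lam) = lam * v0, with lam = R^2(v0) / c2 in closed form."""
    _, r20 = wmean_r2(x, v0)
    return [(r20 / c2) * vj for vj in v0]

def mandel_paule(x, v0, c2):
    """M-P: v_j(lam) = v0_j + lam; bisect lam so that R^2(lam) = c2
    (assumes R^2(v0) > c2 and that the bracket below is wide enough)."""
    lo, hi = 0.0, max((xi - xj) ** 2 for xi in x for xj in x)
    for _ in range(100):
        lam = 0.5 * (lo + hi)
        _, r2 = wmean_r2(x, [vj + lam for vj in v0])
        lo, hi = (lam, hi) if r2 > c2 else (lo, lam)
    return [vj + 0.5 * (lo + hi) for vj in v0]

x = [10.1, 9.8, 10.4, 12.0]        # illustrative laboratory estimates
v0 = [0.04, 0.09, 0.0625, 0.16]    # stated variances
c2 = 7.81                          # 95th percentile of chi-squared, 3 dof
vb, vmp = birge(x, v0, c2), mandel_paule(x, v0, c2)
```

Note the different character of the two adjustments: Birge inflates large stated uncertainties by more in absolute terms, while M-P adds the same amount to every variance, affecting the most confident laboratories proportionally most.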

Adjustment procedures based on a steepest descent vector
Inconsistency is indicated if the sum of squared residuals R²(b0) is too high. This suggests adjusting b0 to b along a descent direction g so that, at least for small λ > 0, R²(b0 + λg) < R²(b0). Since we are interested in determining b that achieves consistency with an adjustment that is minimal in some way, we set g to be the steepest descent (S-D) direction:

g = −∇R²(b0).

A small step in the direction of g will bring about the swiftest reduction in R²(b) relative to the size of the change in b.
The S-D direction depends on how v(b) is parametrised in terms of b. There are a number of candidates, e.g., bj = vj, bj = uj or bj = wj = 1/uj, where bj is the square root of the precision. We may also choose the parametrisation in which bj enters through the log-variance,

vj(b) = v0,j exp(2 bj), with b0 = 0. (2)

For this parametrisation, we have

gj = −∂R²/∂bj evaluated at b0 = 2 ( (xj − a0)/u0,j )². (3)

Thus, the jth element of the S-D vector depends on the difference between xj and the weighted least-squares solution a0 relative to the input uncertainty u0,j. This means that the degree of adjustment for a laboratory depends on the extent to which its result is deemed to be outlying, relative to the weighted least-squares solution.
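To illustrate, the following sketch takes the weights wj = 1/uj as the parameters (one of the candidate parametrisations above). The gradient used, gj = −2wj(xj − a0)², follows because ∂R²/∂a = 0 at the weighted mean, and the step length is bisected so that R² meets the target c². The data are synthetic, the function names are our own, and the bracket assumes a single step can reach c².

```python
def r2_of_w(x, w):
    """Weighted mean and R^2 with parameters w_j = 1/u_j (weights)."""
    p = [wj * wj for wj in w]                 # precisions 1/v_j
    a = sum(pj * xj for pj, xj in zip(p, x)) / sum(p)
    return a, sum(pj * (xj - a) ** 2 for pj, xj in zip(p, x))

def sd_adjust(x, u0, c2):
    """One steepest-descent adjustment in w: step along g_j = -2 w_j (x_j - a0)^2,
    choosing the step length t by bisection so that R^2(w0 + t g) = c2.
    Laboratories close to a0 have g_j near 0 and are left essentially unadjusted."""
    w0 = [1.0 / uj for uj in u0]
    a0, _ = r2_of_w(x, w0)
    g = [-2.0 * wj * (xj - a0) ** 2 for wj, xj in zip(w0, x)]
    # keep all weights positive: stop just short of the first w_j to reach zero
    lo, hi = 0.0, 0.999 * min(-wj / gj for wj, gj in zip(w0, g) if gj < 0)
    for _ in range(100):
        t = 0.5 * (lo + hi)
        _, r2 = r2_of_w(x, [wj + t * gj for wj, gj in zip(w0, g)])
        lo, hi = (t, hi) if r2 > c2 else (lo, t)
    t = 0.5 * (lo + hi)
    return [1.0 / (wj + t * gj) for wj, gj in zip(w0, g)]   # adjusted u_j

x = [10.1, 9.8, 10.4, 12.0]        # synthetic estimates
u0 = [0.2, 0.3, 0.25, 0.4]         # stated standard uncertainties
u = sd_adjust(x, u0, 7.81)         # adjust R^2 down to c^2 = 7.81
```

Running this, the laboratory furthest from the consensus receives by far the largest relative inflation, while the laboratory closest to the consensus is barely touched, matching the behaviour described above.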
For the parametrisation (2), we have

DKL(b||b0) = (1/2) Σj [ vj(b)/v0,j − log( vj(b)/v0,j ) − 1 ].

Thus, for bj ≈ 0, j = 1, …, n, DKL(b||b0) ∝ ||b − b0||². This means that for the parametrisation (2), a small step along the S-D direction derived from (3) brings about the largest reduction in R²(b) relative to the change in DKL(b||b0).

Stepped approach
With the S-D approach, if the estimate xj provided by the jth laboratory happens to be very close to the consensus value a0 determined by the WLS approach (using the input v(b0)), then the S-D vector will have only a very small component in the direction of increasing vj, so that vj remains essentially unadjusted. If the consensus value a given by the adjusted v(b) is significantly different from the initial consensus value, there may be evidence that the vj should also undergo some adjustment, as xj is no longer so close to the new consensus value. This suggests implementing a staged approach, determining S-D adjustments for a sequence of values ck² descending from some large c0² (just shy of the value R0² determined by the initial data, say) down to the desired c².
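The stepped scheme can be sketched as follows, taking the weights wj = 1/uj as illustrative parameters and recomputing the consensus value and steepest-descent direction at every stage. The stage targets ck² simply descend linearly from the initial R² to the final c²; the data and names are synthetic assumptions, not the paper's implementation.

```python
def r2_of_w(x, w):
    """Weighted mean and R^2 with weights w_j = 1/u_j."""
    p = [wj * wj for wj in w]
    a = sum(pj * xj for pj, xj in zip(p, x)) / sum(p)
    return a, sum(pj * (xj - a) ** 2 for pj, xj in zip(p, x))

def sd_stage(x, w, target):
    """One steepest-descent step from the current w down to R^2 = target."""
    a0, r2 = r2_of_w(x, w)
    if r2 <= target:
        return w
    g = [-2.0 * wj * (xj - a0) ** 2 for wj, xj in zip(w, x)]
    lo, hi = 0.0, 0.999 * min(-wj / gj for wj, gj in zip(w, g) if gj < 0)
    for _ in range(100):
        t = 0.5 * (lo + hi)
        _, r2t = r2_of_w(x, [wj + t * gj for wj, gj in zip(w, g)])
        lo, hi = (t, hi) if r2t > target else (lo, t)
    t = 0.5 * (lo + hi)
    return [wj + t * gj for wj, gj in zip(w, g)]

def stepped_sd(x, u0, c2, stages=20):
    """Descend through targets c2_k from just below the initial R^2 down to c2,
    so that the S-D direction is refreshed as the consensus value moves."""
    w = [1.0 / uj for uj in u0]
    _, r20 = r2_of_w(x, w)
    for k in range(1, stages + 1):
        w = sd_stage(x, w, r20 + (c2 - r20) * k / stages)
    return [1.0 / wj for wj in w]   # adjusted standard uncertainties

x = [10.1, 9.8, 10.4, 12.0]
u0 = [0.2, 0.3, 0.25, 0.4]
u = stepped_sd(x, u0, 7.81)
```

Because each stage makes only a small reduction in R², the direction used at every stage reflects the current, rather than the initial, consensus value.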

Example calculations
We present calculations for data derived from the Consultative Committee for Length key comparison CCL-K1 [19], calibration of gauge blocks by interferometry. This data is used for illustrative purposes only as, according to the technical protocol, key comparison data cannot be adjusted in the manner described in this paper. The measured values xj and associated standard uncertainties u0,j are from 11 participating laboratories measuring 9 tungsten carbide gauge blocks of nominal lengths ranging from 0.5 mm to 100 mm ([20], Tab. 5). The laboratory results and uncertainty bars xj ± 2uj are represented in Figure 1 by the first (red) uncertainty bar in each group. The figure shows that the results from laboratories 6 and 7 and, to a lesser extent, laboratory 11 deviated from the consensus value given by the weighted mean, the bottom (solid red) horizontal line. The analysis of the data for the 100 mm gauge block found that the value of R²(b0) was at the 98th percentile of the χ² distribution, indicating inconsistency of the data: if we simulated a large number of data sets generated according to the model (1), only 2% would result in values of R² greater than or equal to R²(b0).

Adjustment to different consistency levels
As mentioned in Section 1, adjustments are often made to the mean of the χ² distribution. We first apply the relative entropy approach to determine adjustments to the 95th and 80th percentiles and to the mean. Table 1 shows the adjusted uncertainties for each level, while Figure 1 shows the resulting uncertainty bars xj ± 2uj, along with the corresponding weighted means. It is seen that, for the relative entropy method, the main adjustments are made to the results of laboratories 6, 7 and 11, with adjustments increasing as the level of consistency is increased. Of interest is the fact that even for the adjustment from the 98th percentile (the original data) to the 95th percentile, the weighted mean changes.

Comparison of adjustment procedures
We now leave aside the question of the level of consistency to which we should adjust and examine the degree of adjustment made by different adjustment strategies, as measured by relative entropy. For these comparisons, the adjustment is determined for c² set at the 80th percentile value of the χ² distribution. As shown in Table 1, the adjustments made to the uncertainties are also moderate compared to those obtained by adjusting to the mean of the χ² distribution. Figure 2 shows the estimates x and the uncertainty bars xj ± 2uj for the input uncertainties u0 and for the adjusted uncertainties u from the Birge, M-P, S-D and minimum relative entropy (DKL) procedures.
The Birge and M-P schemes each apply the same adjustment, multiplicative or additive respectively, to the uncertainties from all the laboratories. The degree of adjustment for the S-D approach depends on how far the laboratory estimate is from the consensus value a0 determined from the initial WLS analysis. If xj is close to a0 relative to u0,j, then uj remains close to u0,j. For laboratories 6, 7 and 11, the S-D approach makes a larger adjustment to the uncertainties than for the other laboratories, as the respective estimates are further away from a0 relative to the laboratories' initial uncertainties. Visually, there is evidence that the S-D method brings about a smaller adjustment than the other two single parameter adjustment schemes, which is expected, as shown by the calculations in Section 5. The extent of adjustment using each of these methods can be quantified using the Kullback–Leibler divergence: the lower the divergence, the smaller the adjustment. Table 2 shows that, among the single parameter adjustment schemes, the S-D method has the lowest Kullback–Leibler divergence; only the minimum relative entropy adjustment scheme has a lower value. Like the S-D scheme, the minimum relative entropy method only makes a substantial adjustment to the uncertainties of laboratories 6, 7 and 11. Consensus values and their associated standard uncertainties are shown in Table 3. All the adjustment schemes result in a larger uncertainty compared to that for WLS, which is a direct consequence of increasing each laboratory's uncertainty u0,j. The minimum relative entropy adjustment method yields the smallest uncertainty of all the adjustment procedures, as a result of the minimal adjustment to the u0,j.
There are a number of points of interest associated with Tables 2 and 3. In Table 2, we see that the S-D method makes a smaller change to the input distributions, as measured by the relative entropy, than the Birge and M-P procedures. Both the Birge and M-P procedures provide an adjustment direction that does not depend on how the input estimates xj relate to the consensus value a0, whereas the S-D procedure does reflect this information, with the uncertainties associated with estimates further from the consensus value undergoing more adjustment. Of course, the fact that there is inconsistency means that use of the initial consensus estimate a0 to determine the degree of adjustment has to be undertaken with caution. An advantage of the relative entropy approach (as well as guaranteeing a minimal adjustment according to the criterion) is that the initial consensus estimate plays a significantly lesser role in the adjustment procedure. A further point of interest in Table 3 is the spread of the consensus estimates from −73.2 nm to −76.4 nm, a range of 3.2 nm that can be compared to a standard uncertainty of the order of 4 nm. This shows that the method of adjustment can have a significant impact on the reported consensus value.

Fig. 2. Estimates and 95% confidence intervals of a 100 mm gauge block for various adjustment procedures. The graph shows the uncertainty bars xj ± 2uj for the input uncertainties u0 (the first set of uncertainty bars in each group) and for the adjusted uncertainties u from the Birge (second set), Mandel–Paule (M-P, third set), steepest descent (S-D, fourth set) and minimum relative entropy (DKL, fifth set) procedures. The horizontal lines are the weighted means determined from the input uncertainties (solid red), steepest descent (blue dashed) and minimum relative entropy (black dotted).

Bayesian approaches to adjustment
The adjustment procedures we have considered so far have the feature that the input uncertainties are adjusted to achieve a desired degree of consistency and the adjusted uncertainties are then treated as known reliably as far as the determination of the consensus value is concerned. It seems more reasonable to regard the input uncertainties as estimates of the true uncertainties and then use the ensemble of information provided by the ILC to update these estimates in a Bayesian framework.
Such an approach recognises that the updated estimates are not known exactly, and this lack of complete knowledge about them will contribute an uncertainty to the consensus value derived from them. A Bayesian version of the Birge and M-P procedures is considered in [18], for example, and leads to t-distributions associated with the consensus value. The Bayesian approach can be developed further [21] in a hierarchical model that leads to a consensus distribution that is a mixture of t-distributions (in the same way that a t-distribution is a mixture of Gaussians). The analysis of the gauge block data in [21] led to a consensus estimate of −73.3 nm with an associated standard uncertainty of 3.8 nm, which is more in line with the relative entropy estimate in Table 3. We must bear in mind that the relative entropy approach associates a Gaussian distribution with the consensus estimate, while the Bayesian approaches assign heavier-tailed distributions.

Concluding remarks
This paper has been concerned with the analysis of ILC data and approaches to resolving inconsistency in such data. When data is judged to be inconsistent, the WLS estimate of the consensus value of the quantity being measured might not be reliable. Several adjustment schemes have been discussed. The underlying assumption behind adjustment procedures is that some or all of the laboratories have underestimated or neglected some uncertainty contributions, sometimes referred to as 'dark uncertainty', and the adjusted uncertainties provide an estimate of this dark uncertainty derived from the complete set of laboratory results. We have been interested in procedures that make minimal adjustments to the input uncertainties. One way of quantifying the extent of adjustment is the Kullback–Leibler divergence or relative entropy. We have compared adjustment schemes such as the Birge and M-P methods with the S-D and relative entropy adjustments, which are specifically designed to minimise the degree of adjustment. The Birge and M-P methods tend to make large adjustments compared to those made using a minimal relative entropy adjustment. The S-D method tends to compare well with the relative entropy adjustment and is not as numerically demanding, as it involves only solving a single nonlinear equation. However, the S-D method is dependent on the initial consensus estimate; adopting a stepped approach could help alleviate this issue. An advantage of the relative entropy approach (as well as guaranteeing a minimal adjustment according to the criterion) is that the initial consensus estimate, determined from inconsistent data, plays a significantly lesser role in the adjustment procedure.