Variable data measurement systems analysis: advances in gage bias and linearity referencing and acceptability

Measurement systems analysis (MSA) is a set of requirements and procedures adopted by the automotive industry and other disciplines to evaluate the accuracy and precision of measurement systems through assessing and quantifying the random and systematic errors and assigning appropriate dispositions for tolerance and performance acceptance. The methodology of variable data MSA comprises studies of a system's stability, bias, linearity and gage repeatability and reproducibility (GR&R). This paper describes advances in referencing and criteria for estimation of uncertainty errors, dispositions, and acceptability of MSA bias and linearity, proposing an extension to the basic statistical zero null-hypothesis to include overlap between confidence intervals and uncertainty associated with the reference standards used in bias and linearity studies.


Introduction
A measurement system may be defined collectively as the gage instrument hardware, software and tooling; the standards or reference parts; the procedures, personnel and measurement environment; and the statistical assumptions, hypotheses and data analysis. Measurement systems analysis (MSA) aims to estimate the accuracy and precision of measured, tested, and inspected characteristics of manufactured products, ensuring that the inherent variabilities from all elements of a measurement system are understood and controlled alongside the product manufacturing process variability, which is itself controlled within set limits. A variable data MSA study for a given characteristic comprises collecting data on stability, bias, linearity, and gage repeatability and reproducibility (GR&R), then deciding acceptability of the measurement system based on statistical hypotheses and disposition criteria. Bias and linearity studies expose any systematic errors and validate the accuracy of the measurement system over the operating range. GR&R studies, on the other hand, expose random errors and validate precision of the gage. Stability charts track normal random variation of measurements over usage time, flagging any drift or other special cause effects in the system. The approach in this paper aligns with the guidance provided in the automotive Measurement Systems Analysis (MSA) reference manual, 4th Edition [1], with acceptance set at 95% confidence (±2σ statistics). All relevant requirements and procedures are captured in Texas Instruments Inc. internal specifications, including formulated Excel worksheets for calculations and dispositions. Additionally, the paper proposes an extension to acceptance of bias and linearity by the statistical zero null hypothesis to include quantified overlap between the bias confidence intervals and the uncertainty associated with the reference standards used in bias and linearity studies.
Section 1 of the paper introduces the types of reference standards used in MSA studies, which include traceable, consensus and check standards. We derive the formulae estimating expanded uncertainty for calculated values of consensus and check standards, using a nested ANOVA method. Section 2 outlines the method for evaluating the amount of bias in a measurement system using repeatability trials, and the acceptance condition by null-hypothesis statistical zero bias condition (statzero). We then propose extending acceptance by a new criterion which we call statzero proxy, based on the degree of overlap between the confidence interval for the bias data fit at 95% confidence, and the uncertainty associated with the reference standard used in the bias evaluation experiment. We also include the Student's t-test for small repeatability sample.
Section 3 deals with evaluation of the measurement system's bias linearity over the gage operating range. First, we derive the simple linear regression formulae that are needed for computing the best fit line, its slope and intercept, and the confidence interval hyperbolae of the regression analysis. Then we set up the statzero conditions needed for acceptance of linearity, applicable to the regression best fit line as well as to the slope and intercept. The Student's t-test is also deployed to justify acceptance for a small sample. Furthermore, we extend the acceptance of linearity to the statzero proxy criteria based on the degree of overlap between the confidence interval hyperbolae curves and the uncertainty bars associated with the reference standards used in the linearity evaluation experiment.
Section 4 introduces examples to demonstrate calculation of a check standard and a consensus standard. It also contains examples demonstrating evaluation and acceptance of bias and linearity by the basic statzero conditions and the extended statzero proxy criteria. Figure 1 shows a typical flow for the reference standard(s), the setup of the measurement trials for single-point bias and multi-point bias linearity studies, and the decision tree for acceptance.

Traceable standard
MSA studies ab initio require reference standards with known values and uncertainties that are traceable to values accepted by a National Measurement Institute (NMI), such as NIST or equivalent. This prerequisite is essential for assessing the accuracy and precision of the measurement system by repeatability trials on a known standard value. Nonetheless, NMI-traceable standards may not be available for all measurement situations: they may not exist for a unique measurement characteristic and/or a unique metrology system, or they may be too expensive to purchase, e.g. for destructive test systems. In such cases, MSA may be performed using in-house reference specimens or master parts, referred to as check standards in the MSA reference manual [1]. However, whereas check standards are certainly suitable for stability and GR&R studies, which do not require accuracy, we believe they should not be the default choice for bias and linearity studies because of their self-traceability limitation and lack of independently verified accuracy, unless there is no other option (see the example under Results & discussion). A better alternative to NMI-traceable standards is the consensus-generated standard, referred to as a consensus standard in [1]. We present these types in §2.1.2 and §2.1.3, and propose methods for estimating their values and uncertainties.

Check standard (in-house reference part)
The MSA reference manual [1] defines a check standard as "a measurement artifact that closely resembles what the process is designed to measure, but is inherently more stable than the measurement process being evaluated." Accordingly, we define it as an in-house reference specimen or master part created and verified at a production site or laboratory under controlled conditions at least similar to, or better than, normal processing conditions. We offer an evaluation method to estimate the check standard value, Rchk, and uncertainty, Uchk, that includes correlation to an NMI-traceable standard of the same generic type as the check standard but with a different value, if available.
The evaluation starts by running repeatability measurement trials Ri on the check standard using a calibrated gage having as much precision as possible, preferably 10× the resolution of the systems under MSA study (rule of thumb). Rchk is taken as the mean value:
$$R_{chk} = \frac{1}{m}\sum_{i=1}^{m} R_i \qquad (1)$$
To estimate uncertainty, given that the repeatability sample size is typically small (10 ≤ m ≤ 20), we start by using t-distribution statistics whereby Tstat is expressed as:
$$T_{stat} = \frac{\bar{x} - \mu}{s/\sqrt{m}} \qquad (2)$$
where $\bar{x}$ and $s$ are the sample mean and standard deviation, normally distributed about the true mean μ, and $s/\sqrt{m}$ is the familiar standard deviation of the mean (also called standard error of the mean). Tstat characterizes a wider spread and shift of mean for the t-distribution of random small samples relative to the normal distribution of the population at large (N >> 30, std. dev. = σ) (see e.g. [5], §2.7.3). For a t-curve with (m − 1) degrees of freedom (df = m − 1 since $\bar{x}$ is already decided), equation (2) is expressed at (1 − α)% confidence by the critical value t(α/2, m − 1), which we call Tcrit, and rearranged:
$$|\bar{x} - \mu| \le T_{crit}\,\frac{s}{\sqrt{m}} \qquad (3)$$
The left hand side of (3) represents the uncertainty U($\bar{x}$) as a delta between $\bar{x}$ and the true mean; it is thus estimated by calculating the sample standard deviation and using Tcrit from standard T-statistic tables or from the Excel function =TINV(α, m − 1), which takes the two-tailed probability α and returns t(α/2, m − 1). In this paper we use 95% confidence, α = 0.05.
The standard deviation is found from the variance of the m repeatability measurements for Rchk:
$$s^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(R_i - R_{chk}\right)^2 \qquad (4)$$
Hence,
$$U(\bar{x}) = T_{crit}\,\frac{s}{\sqrt{m}} \qquad (5)$$
Next, we estimate the 'combined uncertainty'. This item is discussed in many literature references, but we limit our referencing to the MSA manual [1], NIST [2], and for more details the JCGM Guide to Uncertainty [3]. Here we include the gage calibration uncertainty tolerance Ug as specified by the equipment manufacturer or supplied by a calibration house, and the limit of its resolution r as a capability error component. We combine these in quadrature with U($\bar{x}$) to obtain the combined uncertainty uc, and multiply by the 95%-confidence 2-tail coverage factor k = 2 to obtain the measurement 'expanded uncertainty' U = 2uc, equation (6) (using the terminology and symbols in [2]). Even though (6) ensures a reasonable estimate of measurement uncertainty by combining the standard error of the mean with fixed errors due to the quoted gage calibration uncertainty and the resolution limit of the instrument, this only validates the precision of the gage with a high degree of confidence. The same degree of confidence cannot be inferred regarding the accuracy of the gage, i.e. how close Rchk really is to the true value within the calculated measurement uncertainty. The biggest concern is whether the gage has a hidden 'offset' that the repeatability measurement method would not be able to uncover. To help address this, we propose adding variance statistics for a generic NMI-traceable standard, if available at the site owning the gage ('generic' meaning of similar type to the check standard, e.g. a thin film wafer standard having a thickness different from the check standard). Let the generic standard be characterized by Rt ± Ut, where Rt is the traceable value (closer to the true value, whatever its value is), and Ut is the quoted uncertainty.
Running m repeatability trials R'i on the traceable standard using the same gage, the mean value is:
$$R_m = \frac{1}{m}\sum_{i=1}^{m} R'_i \qquad (7)$$
The delta between Rm, the mean value as determined by repeatability using the in-house gage, and Rt, the quoted value of the traceable standard, can be considered a systematic offset error:
$$\Delta R = R_m - R_t \qquad (8)$$
The uncertainty associated with ΔR may be composed additively from the expanded uncertainty of the repeatability trials on the generic standard and Ut, the uncertainty quoted for it. The variance of the repeatability trials R'i is:
$$s'^2 = \frac{1}{m-1}\sum_{i=1}^{m}\left(R'_i - R_m\right)^2 \qquad (9)$$
The measurement expanded uncertainty U' for the generic standard, equation (10), has the same form as (6). The uncertainty in ΔR then follows in quadrature, equation (11). Finally, applying quadrature combination of the uncertainty components (6) and (11) and the offset error ΔR of (8), we obtain the total estimated uncertainty for the check standard, Uchk, equation (12). The ±Uchk uncertainty represents a self-traceable estimated accuracy error bar around the value estimated for the check standard.
Even though the formula (12) represents a reasonably good estimate of a 'simulated' accuracy of the gage by including an offset factor relative to a generic traceable standard, there is no guarantee that the offset is constant, i.e. that it can be applied as is across the measurement range for which the gage is used. Because of this, and the evident self-traceability handicap, we do not recommend a check standard as an alternative to an NMI-traceable standard for use in bias and linearity studies unless there is no other option. See Example 1 in the Results & discussion section for a quantitative illustration supporting this counter-recommendation. On the other hand, check standards are quite useful and handy for GR&R studies and ongoing stability tracking via SPC control/monitor charts.
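The repeatability part of the check standard evaluation (mean value, t-based uncertainty of the mean, and expanded uncertainty) can be sketched in Python as follows. This is a minimal illustration, not the authors' worksheet: the function name, the trial readings and the plain quadrature sum of U(x̄), Ug and r are assumptions made here for illustration (the text states only that the three terms are combined in quadrature and multiplied by the coverage factor 2), and Tcrit is passed in by the user, e.g. 2.262 for m = 10 at 95% confidence.

```python
import math
import statistics

def check_standard_uncertainty(trials, t_crit, u_gage, resolution):
    """Return (mean, U_xbar, U_expanded) for repeatability trials on a check standard."""
    m = len(trials)
    r_chk = statistics.mean(trials)            # mean of the trials
    variance = statistics.variance(trials)     # sample variance, divisor m - 1
    u_xbar = t_crit * math.sqrt(variance / m)  # t-based uncertainty of the mean
    # ASSUMPTION: plain quadrature sum of the three terms, then coverage factor 2.
    u_c = math.sqrt(u_xbar ** 2 + u_gage ** 2 + resolution ** 2)
    return r_chk, u_xbar, 2.0 * u_c

# Hypothetical thickness readings in nm (not the paper's Table 1 data).
trials = [1004, 1006, 1005, 1003, 1007, 1005, 1004, 1006, 1005, 1005]
r_chk, u_xbar, u_exp = check_standard_uncertainty(
    trials, t_crit=2.262, u_gage=2.0, resolution=2.0
)
print(f"Rchk = {r_chk} nm, U(xbar) = {u_xbar:.2f} nm, U = {u_exp:.2f} nm")
```

With a 2 nm resolution and a 2 nm calibration tolerance, the fixed terms dominate the expanded uncertainty in this sketch, which is why tightening repeatability alone would not shrink U much.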
For bias and linearity assessment, a more acceptable alternative to an NMI-traceable standard, if one is unavailable or cost-prohibitive, is the consensus standard. This features better traceability than mere self-traceability, as discussed next.

Consensus standard
The MSA reference manual [1] describes consensus value as "based on collaborative experimental work under the auspices of a scientific or engineering group, defined by a consensus of users such as professional and trade organizations." Accordingly, a consensus standard may start as a check standard belonging to one site (factory, laboratory) that is then evaluated by consensus measurement trials across three or more independent sites having measurement systems compatible with the system at the site which generated the check standard. Additionally: (i) the participant sites' gages used in generating the consensus information should be calibrated and have equivalent or greater resolution (preferably 10×, rule of thumb) than the gages for which the consensus standard is to be used in MSA studies; and (ii) the gages' calibration uncertainty tolerances Ug, as quoted by equipment manufacturers or by calibration vendors, should be available to be included in assessing the combined uncertainty. Based on these criteria, successful generation of a consensus standard would assure reasonable confidence in the accuracy of the reference value within uncertainty limits established by independent subgroup data sets and augmented by available gage calibration and resolution errors.
A consensus standard is characterized by a consensus value and a combined uncertainty. Each site participating in the consensus standard evaluation would run m repeatability measurement trials Ri on the characteristic feature(s) of the check standard/reference part at the same reference point(s), and calculate its subgroup sample average Rp(s) similar to equation (1). With carefully executed trials and assuming samples with normal distribution, the estimated consensus value Rcon is the mean of the subgroup samples' averages:
$$R_{con} = \frac{1}{k}\sum_{s=1}^{k} R_p(s) \qquad (13)$$
where k is the number of participating sites (subgroups). Estimation of the combined uncertainty needs more work, assembling independent errors from the significant components of variation: random standard deviation errors associated with analysis of variance (ANOVA) of independent sample means and, as in §2.1.2, systematic errors due to equipment calibration uncertainty and instrument resolution limit.
Each site calculates the variance Vp(s) in its subgroup repeatability sample using equation (4):
$$V_p(s) = \frac{1}{m-1}\sum_{i=1}^{m}\left(R_i - R_p(s)\right)^2 \qquad (14)$$
and calculates the subgroup measurement expanded uncertainty U(s) according to equation (6), giving equation (15), where Ug is the gage calibration uncertainty tolerance and r is the gage discriminating resolution.
Next, the participating sites combine the measurement variance over all subgroups. This has two components, the mean within-subgroup sample variance Vms and the subgroup-to-subgroup variance Vss:
$$V = V_{ms} + V_{ss} \qquad (16)$$
Vms is estimated by averaging the repeatability sample variances Vp(s) over all subgroups:
$$V_{ms} = \frac{1}{k}\sum_{s=1}^{k} V_p(s) \qquad (17)$$
To estimate Vss, we use a nested random-effects ANOVA model, treating the subgroup average Rp(s) as a sample-dependent statistic around the group mean Rcon, with repeated measurement trials mathematically nested within the subgroups. Based on this, the expected value of the mean sum of squares from subgroup to subgroup is expressed by:
$$e(M_{ss}) = \frac{1}{k-1}\sum_{s=1}^{k}\left(R_p(s) - R_{con}\right)^2 = V_{ss} + \frac{V_{ms}}{m} \qquad (18)$$
where Vms/m is the standard variance of the samples' mean relative to the population mean; in this case it is a correction factor accounting for overestimation of the expected value of Vss due to the nested subgroups ANOVA structure (see e.g. [4], Ch. 10 on theory of ANOVA). Hence, Vss is obtained from equation (18) by subtracting the correction factor from e(Mss), then substituting in (16) to get the combined variance:
$$V = V_{ms} + e(M_{ss}) - \frac{V_{ms}}{m} \qquad (19)$$
Additionally, we consider the systematic error due to the participant sites' gage calibration uncertainty tolerances, Ug. Treating this like a variance, we estimate an average of the calibration uncertainty tolerance over the group of gages using quadrature summation, equation (20). We also add the gage resolution r as a systematic capability error applicable to all measurements (assuming the participant gages have the same resolution). Hence, the combined uncertainty uc over the group of all measurement trials is expressed by equation (21). Finally, using (19) in (21) gives the expanded uncertainty U = 2uc for the consensus standard, Ucon, equation (22), in which Vms and e(Mss) are calculated by equations (17) and (18), respectively. The ±Ucon expanded uncertainty represents a consensus-traceable accuracy error bar around Rcon, the estimated value of the consensus standard established by (13). See Example 2 for a quantitative illustration.
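The variance bookkeeping for a consensus standard can be illustrated with a short Python sketch. It follows the description above (consensus value as the mean of site averages, mean within-site variance, and a site-to-site component obtained by subtracting Vms/m from the variance of the site averages). The function name and the site readings are hypothetical, and clamping a negative site-to-site estimate to zero is a choice made here, not something stated in the text. The gage calibration and resolution terms are deliberately left out because their exact combination is not reproduced in this sketch.

```python
import statistics

def consensus_variance(site_trials):
    """site_trials: list of k lists, each holding m repeatability readings from one site."""
    k = len(site_trials)
    m = len(site_trials[0])
    if k < 3 or any(len(t) != m for t in site_trials):
        raise ValueError("need at least 3 sites with equal sample sizes")
    site_means = [statistics.mean(t) for t in site_trials]            # Rp(s)
    r_con = statistics.mean(site_means)                               # consensus value
    v_ms = statistics.mean([statistics.variance(t) for t in site_trials])  # within-site
    e_mss = statistics.variance(site_means)                           # variance of site means
    # ASSUMPTION: clamp at zero if sampling noise makes the estimate negative.
    v_ss = max(e_mss - v_ms / m, 0.0)                                 # site-to-site component
    return r_con, v_ms, v_ss, v_ms + v_ss

# Hypothetical readings in nm from three sites, five trials each.
sites = [
    [1004, 1005, 1006, 1005, 1005],
    [1007, 1008, 1006, 1007, 1007],
    [1003, 1002, 1004, 1003, 1003],
]
r_con, v_ms, v_ss, v_total = consensus_variance(sites)
print(r_con, v_ms, v_ss, v_total)
```

In this made-up data set the site-to-site component is much larger than the within-site variance, the situation in which a single-site check standard would understate the uncertainty.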

Gage bias

Bias measurement
A bias study requires using an NMI-traceable reference standard Rt ± Ut. However, if this is justifiably not available, then a consensus standard Rcon ± Ucon may be used. For conciseness we will use Rr (reference) to mean either Rt or Rcon, and Ur for either Ut or Ucon.
The procedure starts by checking that the measurement system's gage is properly calibrated, then proceeding to repeatability measurement trials Ri of the reference standard by a qualified person or by automation, as the case may require. The sample size should be m ≥ 10 trials. The bias average Bav is then obtained by averaging the deltas between the trial values Ri and the reference value Rr over the sample size:
$$B_{av} = \frac{1}{m}\sum_{i=1}^{m}\left(R_i - R_r\right) \qquad (23)$$
Bav may be expressed as a percentage of the reference value: (Bav)% = (Bav/Rr) × 100.
Ideally Bav should be zero. However, this is not typically the case due to inherent variation in the measurement system and random normal variation in the repeatability trial runs. Most, if not all, measurement systems tend to show a small non-zero positive or negative bias. Acceptability is subject to non-rejection of the null hypothesis, as will be discussed below.

Statistical zero bias hypothesis (statzero)
Acceptance of bias is subject to testing the null hypothesis {H0: B = 0}, such that the bias error of a measurement system is acceptable if not statistically significantly different from zero [1], a condition referred to as 'statistical zero bias'. We will call this 'statzero' for short. For validation, we take into account the standard deviation of the trials' sample and the interval for normal 2-tail distributed bias at 95% confidence. We also validate the Student's t-test, Tstat < Tcrit, in accordance with the small sample size in bias studies (typically 10 ≤ m ≤ 20). The standard deviation of the bias repeatability trials, $s_r$, is given by:
$$s_r = \sqrt{\frac{1}{m-1}\sum_{i=1}^{m}\left(B_i - B_{av}\right)^2} \qquad (24)$$
where $B_i = R_i - R_r$ is the bias of trial i. Unlike statistical systems in general where the population mean is unknown, the bias study case has a precise population mean, its target zero value. Hence, substituting $\bar{x} = B_{av}$ and μ = 0 in equation (2) gives:
$$T_{stat} = \frac{B_{av}}{s_r/\sqrt{m}} \qquad (25)$$
with the magnitude of Tstat used in the comparison against Tcrit. Next, we determine the upper and lower limits of the confidence interval [UCL; LCL] for the small single-sample bias study, using the general formula for boundaries of a presumed normal t-distribution at (1 − α)% confidence:
$$[UCL;\,LCL] = B_{av} \pm T_{crit}\,\frac{s_r}{\sqrt{m}} = B_{av}\left(1 \pm \frac{T_{crit}}{T_{stat}}\right) \qquad (26)$$
The second equivalence in (26), obtained by substituting for $s_r/\sqrt{m}$ from (25), indicates the wider interval and shift in mean for the small sample subject to the Student t-test: Tstat < Tcrit.

Bias acceptance by statzero condition
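This is fulfilled by not rejecting the null hypothesis {H0: B = 0}, subject to zero being confined within the confidence interval about the bias average [1]:
$$B_{av} - T_{crit}\,\frac{s_r}{\sqrt{m}} \;\le\; 0 \;\le\; B_{av} + T_{crit}\,\frac{s_r}{\sqrt{m}} \qquad (27)$$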
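As an illustration of the statzero disposition, the following Python sketch computes the bias average, the standard deviation of the bias trials, Tstat, and the confidence limits, then checks whether zero lies inside the interval. The function name and the readings are hypothetical; Tcrit is supplied by the user (2.262 for m = 10 at 95% confidence, as quoted in Example 1).

```python
import math
import statistics

def bias_statzero(trials, reference, t_crit):
    """Evaluate single-point bias and the statzero condition for repeatability trials."""
    m = len(trials)
    biases = [r - reference for r in trials]   # per-trial bias
    b_av = statistics.mean(biases)             # bias average
    s_r = statistics.stdev(biases)             # sample standard deviation, divisor m - 1
    std_err = s_r / math.sqrt(m)               # standard error of the bias average
    lcl = b_av - t_crit * std_err
    ucl = b_av + t_crit * std_err
    return {
        "b_av": b_av,
        "t_stat": abs(b_av) / std_err,         # magnitude compared against t_crit
        "lcl": lcl,
        "ucl": ucl,
        "statzero": lcl <= 0.0 <= ucl,         # zero confined within the interval
    }

# Hypothetical readings in nm against a 1000 nm reference.
readings = [1001, 999, 1002, 1000, 1001, 998, 1000, 1001, 1000, 1001]
result = bias_statzero(readings, reference=1000, t_crit=2.262)
print(result)
```

Note that "zero inside the interval" and "|Tstat| ≤ Tcrit" are the same condition written two ways, so the two checks cannot disagree for a single bias sample.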

Statistical zero bias proxy (statzero proxy)
Acceptance by the statzero condition (27) does not take into account the uncertainty spread around the true value of the reference used in the bias study; namely, the extent of overlap between the 95% confidence interval of the repeatability trials and the reference uncertainty bar. The MSA manual [1] does not include specific guidance or a procedure to account for this overlap when making bias acceptance decisions. Hence, we propose an additional test of significance for non-zero bias, to include acceptance based on the extent of the overlap. Disposition with the proposed criterion is established by calculating ΔUovrlp, expression (28), as a ratio of the magnitude of overlap between the width of the confidence interval, UCL − LCL, and the reference uncertainty bar, ±Ur. ΔUovrlp is a positive number between 0 and 1: 0 ≤ ΔUovrlp ≤ 1. A value < 0 means no overlap.
Expression (28) represents how much of the repeatability 95% confidence interval lies within the reference uncertainty interval. See Figure 2 for cartoon diagrams depicting various overlap cases.

Bias acceptance by statzero proxy criterion
We propose extending acceptance of non-zero bias as still insignificant if the 95% confidence interval of the repeatability sample overlaps the reference uncertainty bar ±Ur by more than 25%, i.e.:
$$\Delta U_{ovrlp} > 25\% \qquad (29)$$
The validity of >25% overlap as a general acceptability rule of thumb has been established in the statistics literature [4]. We adopt (29) as a criterion for incrementally extending bias acceptability beyond the statzero condition, and call this extended acceptance 'statistical zero bias proxy', or, for short, 'statzero proxy'. It draws credence from the appreciable probability that the estimated uncertainty for the reference value, as determined by repeatability and represented by the confidence interval, sufficiently overlaps the traceable uncertainty of the standard used, thus facilitating extension of not rejecting the null hypothesis. This makes sense in light of the basic definition of uncertainty in the MSA reference manual as the "estimated range of values about the measured value in which the true value is believed to be contained". The criterion (29) therefore safeguards that the bias can still be considered statistically zero by proximity of the estimated reference value to the true value within an acceptable overlap of uncertainty values.
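The overlap test can be sketched in Python as below. Because expression (28) is not reproduced here, the ratio is an assumed reading of the description: the part of the bias confidence interval [LCL, UCL] that falls inside the reference uncertainty bar, taken as [−Ur, +Ur] around zero bias, divided by the interval width UCL − LCL. Under that reading a negative value signals no overlap, in line with the text. The function names and numbers are hypothetical.

```python
def overlap_fraction(lcl, ucl, u_ref):
    """Assumed form of the overlap ratio: share of [lcl, ucl] lying inside [-u_ref, +u_ref]."""
    width = ucl - lcl
    if width <= 0 or u_ref <= 0:
        raise ValueError("interval width and reference uncertainty must be positive")
    # Signed overlap length: negative when the two intervals do not touch.
    overlap = min(ucl, u_ref) - max(lcl, -u_ref)
    return overlap / width

def statzero_proxy(lcl, ucl, u_ref, threshold=0.25):
    """Accept non-zero bias when the overlap exceeds the 25% rule of thumb."""
    return overlap_fraction(lcl, ucl, u_ref) > threshold

# Hypothetical case 1: interval [0.5, 2.5] excludes zero but overlaps a +/-1.5 bar.
print(overlap_fraction(0.5, 2.5, 1.5), statzero_proxy(0.5, 2.5, 1.5))
# Hypothetical case 2: interval [2.0, 4.0] lies entirely outside the bar.
print(overlap_fraction(2.0, 4.0, 1.5), statzero_proxy(2.0, 4.0, 1.5))
```

In the first case the bias fails the strict statzero condition yet half of its confidence interval lies within the reference uncertainty, so it would be accepted by proxy; in the second the ratio is negative and the bias is rejected.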
See Example 3 demonstrating statzero and statzero proxy dispositions.

Gage linearity

Regression analysis
The purpose of a linearity study is to verify that the bias of a measurement system satisfies the primary null hypothesis statzero condition (27) over the system's applicable operating range. Based on the statzero proxy criterion (29) advanced in §2.2.5 for a single bias sample, we propose extending acceptance by statzero proxy to the linearity case. Mathematical validation of linearity requires bivariate linear regression analysis in place of the univariate analysis of a single bias sample. Acceptance requires applying the statzero condition, or its proxy, not just to the bias average but also to the slope and intercept of the regression best fit line. It is therefore necessary to determine the confidence intervals for the regression slope and intercept, in addition to the confidence interval about the scatter points of the bias measurements. The basic formula (26) for confidence interval limits still applies; however, the repeatability standard deviation by least squares regression is algebraically more complex because of the error sum of squares analysis and the slope and intercept statistics.
To proceed, we first present the general formulae of the simple linear regression model, which are solutions to a linear equation in the parameters a and b (for reference we use [5][6][7][8]):
$$y_i = a + b\,x_i + e_i \qquad (30)$$
Given n scatter data points $(y_i, x_i)$, the least squares estimators for the regression best fit line slope, b, and intercept, a, are obtained by minimizing the sum of the squared deviations $e_i^2$:
$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (31a)$$
$$a = \bar{y} - b\,\bar{x} \qquad (31b)$$
where $\bar{x}$ and $\bar{y}$ are the samples' means of the x and y variables, respectively.
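Working out the algebraic expressions yields the following decoupled formulae for b and a:
$$b = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (32a)$$
$$a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (32b)$$
(Note: These formulae appear in the MSA manual with a and b interchanged ([1], p. 97). Here we use α and a for intercept and β and b for slope, in alignment with [5][6][7][8].)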
The regression best fit line points, $\hat{y}$, are expressed by the equation:
$$\hat{y} = a + b\,x \qquad (33)$$
For repeated trial runs such as in bias linearity studies, the standard deviation for least squares repeatability residuals, $s_{rr}$, is estimated from the variance of the $y_i$ scatter points about the regressed best fit line $\hat{y}$. In the so-called 'reduced major axis regression method' [6], this is done by summing the squares of the deltas between the $y_i$ data points and the expected values $\hat{y}_i$ on the best fit line:
$$s_{rr} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}} = \sqrt{\frac{\sum y_i^2 - a\sum y_i - b\sum x_i y_i}{n-2}} \qquad (34)$$
where (n − 2) are the degrees of freedom (df) associated with the bivariate analysis (since the estimator $\hat{y}$ is dependent on two estimators: a and b). The right hand form of equation (34) is derived by expanding $\sum (y_i - \hat{y}_i)^2$ and using $\hat{y}$ from equation (33), then using $b\sum x_i = \sum y_i - na$ and $b\sum x_i^2 = \sum x_i y_i - a\sum x_i$, obtainable from the formulae (31b) and (32b), respectively.
Since $\hat{y}$, a and b are not known beforehand, one needs to transform the formula (34) by substituting the formulae (32a) and (32b) into (34), followed by algebraic manipulation, to obtain equation (35), in which for brevity the subscript 'i' is dropped. The estimated values of the covariate slope and intercept of the regressed best fit line are influenced by the scatter-dependence of the data points, on the premise that a set of simulated regression lines around the population's true best fit line represents a sampling-dependent statistic with slopes and intercepts distributed relative to the samples' means $\bar{x}$ and $\bar{y}$ [5]. Hence, variance components due to slope and intercept need to be considered in estimating the combined variance for the best fit line. Formulation can be simplified by realizing that: (a) all regression lines anchor at the point $(\bar{x}, \bar{y})$ such that $\bar{y} = a + b\bar{x}$ is valid; and (b) the intercept variance can be handled through a transformation at a specified $x_0$ value of the independent variable, such that $\hat{y} = \bar{y} + b(x_0 - \bar{x})$. Assuming uncertainty is the same for all $y_i$ measurements, the variance/standard error for the best fit line $\hat{y}$ is therefore a combination of the standard error of the repeatability mean $\bar{y}$, given by $V_{\bar{y}} = (s_{rr})^2/n$, and the variance of the estimated mean slope $V_b$ multiplied by a factor. This is expressed by the following formula (for more details see [6] or [7]):
$$V_{\hat{y}} = V_{\bar{y}} + (x_0 - \bar{x})^2\,V_b \qquad (36)$$
where
$$V_b = \frac{s_{rr}^2}{\sum (x_i - \bar{x})^2} \qquad (37)$$
is obtainable from expression (31a) under the assumption of negligible uncertainty of the independent variable x. (Note: $\mathrm{Var}(cx) = c^2\,\mathrm{Var}(x)$, where x is a variable and c is a multiplication factor.)
The formula for the variance associated with the intercept of the best fit line is obtained additively from the equation $\bar{y} = a + b\bar{x}$ and substitution of $V_b$ from expression (37):
$$V_a = V_{\bar{y}} + \bar{x}^2\,V_b = s_{rr}^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right) \qquad (38)$$
Switching to standard deviation expressions, since these are used in the formulae of confidence intervals, we insert the formula (35) into (36), followed by algebraic manipulations and taking the square root, to obtain $S_{\hat{y}}$, the calculable standard deviation of the best fit line $\hat{y}$, equation (39). The calculable standard deviation associated with the slope of the best fit line, $s_b$, is obtained by substitution of (35) in (37) and taking the square root, equation (40). The calculable standard deviation associated with the intercept of the best fit line, $s_a$, is obtained by substitution of (35) in (38) and taking the square root, equation (41). We are now in a position to formulate the confidence intervals for $\hat{y}$, b and a:
$$[UCL;\,LCL]_{\hat{y}} = \hat{y} \pm T_{crit}\,S_{\hat{y}} \qquad (42)$$
$$[UCL;\,LCL]_{b} = b \pm T_{crit}\,s_b \qquad (43)$$
$$[UCL;\,LCL]_{a} = a \pm T_{crit}\,s_a \qquad (44)$$
Due to quadratic nonlinear components in the formulae above, the confidence interval points trace hyperbolic curves at the lower and upper boundaries (see Figs. 4-6).
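The regression quantities discussed above can be computed with a few lines of Python. This sketch uses the standard textbook forms in terms of deviations from the means, which may differ in algebraic layout from the paper's 'calculable' expressions built on raw sums; the function names and data are hypothetical, and the use of Tcrit with n − 2 degrees of freedom is an assumption of this sketch.

```python
import math

def linearity_regression(x, y):
    """Least squares fit of y on x with standard deviations for the line, slope and intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                          # slope
    a = y_bar - b * x_bar                  # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_rr = math.sqrt(sse / (n - 2))        # residual standard deviation, df = n - 2
    s_b = s_rr / math.sqrt(sxx)            # standard deviation of the slope
    s_a = s_rr * math.sqrt(1.0 / n + x_bar ** 2 / sxx)  # standard deviation of the intercept

    def s_fit(x0):
        """Standard deviation of the best fit line at x0 (narrowest at the mean of x)."""
        return s_rr * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)

    return a, b, s_rr, s_a, s_b, s_fit

def confidence_band(a, b, s_fit, x0, t_crit):
    """Lower and upper confidence limits of the fitted line at x0."""
    y_hat = a + b * x0
    half_width = t_crit * s_fit(x0)
    return y_hat - half_width, y_hat + half_width

# Hypothetical bias data: five reference values, two trials each.
x = [2, 2, 4, 4, 6, 6, 8, 8, 10, 10]
y = [0.1, -0.1, 0.0, 0.2, -0.1, 0.1, 0.2, 0.0, 0.1, -0.2]
a, b, s_rr, s_a, s_b, s_fit = linearity_regression(x, y)
t_crit = 2.306  # two-tailed 95% value for n - 2 = 8 degrees of freedom
for x0 in sorted(set(x)):
    print(x0, confidence_band(a, b, s_fit, x0, t_crit))
```

Because the term (x0 − x̄)² grows away from the centre of the data, the printed limits widen toward both ends of the range, which is the hyperbolic shape described in the text.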

Linearity bias measurements and regression analysis
A gage linearity study requires a number of traceable reference standards or, in lieu, consensus standards as appropriate, having accurate scalar values Rr(1), Rr(2), …, Rr(g), (g ≥ 5), such that the values cover the applicable operating range of the measurement system [1]. Information about the traceable or consensus-assessed uncertainty values Ur(1), Ur(2), …, Ur(g) must also be available.
Using a typical gage representing the measurement system, the reference standards are to be measured by a single qualified appraiser, or by automation as applicable, using a repeatability sample size of m ≥ 10 trials for each reference subgroup Rr(j). In what follows, we index the subgroup references by j and the m trials by i. To minimize appraiser memory recall, it is recommended to randomize the standards and trials [1], if practically feasible. A random number generator in an Excel sheet, for example, may be used to set up random sequences. (Note that random sequencing may not be practical for fully automated systems.) After collecting the group {Rji} of reference measurement data for the g sets of repeatability trials, the bias value Bji for each individual trial is calculated, and all are arranged in a matrix:
$$B_{ji} = R_{ji} - R_r(j), \quad j = 1, \dots, g; \;\; i = 1, \dots, m \qquad (45)$$
Using equation (23), the bias average is calculated for each subgroup j:
$$B_{av}(j) = \frac{1}{m}\sum_{i=1}^{m} B_{ji} \qquad (46)$$
The bias repeatability scatter data Bji (dependent variable y) and the bias averages per (46) are plotted against the values of the reference standards (independent variable x). Simple least squares linear regression is applied using the formulae in §2.3.1 to calculate the regression parameters and to obtain and plot the best fit line. Calculations and plots may be performed with any desired package, e.g. Minitab, JMP, or the increasingly popular R [7,8]. However, we chose to set up the formulae and execute them in Excel since it is widely used and gives users the opportunity to readily verify the formulae. Our linearity Excel worksheet calculates the bias scatter values Bji and the average Bav(j) for each subgroup; then computes $\bar{x}$, $\bar{y}$, $\sum x$, $\sum y$, $\sum xy$, $\sum x^2$, $\sum y^2$ for the whole group (n = gm) and uses the formulae (32), (33), (39), (42) to determine the slope b and intercept a of the best fit line, the regression's best fit points $\hat{y}_i$, the standard deviation $S_{\hat{y}}$, and the 95%-confidence [UCL; LCL] points for $\hat{y}$, plotting the best fit line and the confidence hyperbolae curves.

Linearity acceptance
The acceptance of gage linearity requires disposition by the null hypothesis statzero condition or, by extension as we propose, by the statzero proxy criterion, at every reference point on the linearity range. We will use the disposition in §2.2.2 for acceptance by statzero and the disposition in

Statzero condition applied to linearity
This requires the null hypothesis {H0: B = 0} not to be rejected at each bias checkpoint corresponding to a reference standard in the linearity study, i.e. subject to validity of the statzero condition (27) over the operating range of the measurement system. Furthermore, the acceptance test includes the slope and the intercept also meeting the statzero condition. This imposes the following requisites:
i) Zero is contained within the confidence interval around the regression's best fit points throughout the linearity range at every reference point j, whereby from (42):
$$a + b\,R_r(j) - T_{crit}\,S_{\hat{y}} \;\le\; 0 \;\le\; a + b\,R_r(j) + T_{crit}\,S_{\hat{y}} \qquad (47)$$
b and a are calculated by (32a) and (32b), and $S_{\hat{y}}$ is calculated by (39), using the substitutions x = Rr(j) and y = Bji.
ii) The null hypothesis is also applicable to the slope and intercept statistics, such that by (43) and (44):
$$b - T_{crit}\,s_b \;\le\; 0 \;\le\; b + T_{crit}\,s_b \qquad (48)$$
$$a - T_{crit}\,s_a \;\le\; 0 \;\le\; a + T_{crit}\,s_a \qquad (49)$$
iii) The Student t-test is valid for the slope and intercept statistics, such that:
$$T_{stat}(b) < T_{crit}, \qquad T_{stat}(a) < T_{crit} \qquad (50)$$
where
$$T_{stat}(b) = \frac{|b|}{s_b}, \qquad T_{stat}(a) = \frac{|a|}{s_a} \qquad (51)$$
[Eqs. (51) are derived from the formula (2) by replacing $\bar{x}$ by the mean slope b or mean intercept a, applying μ = 0 for the population of slopes and intercepts, and using the standard deviations of the mean slope and intercept, $s_b$ and $s_a$, respectively.]
The validation of a small-sample linearity study is by default subject to fulfilling the null hypothesis statzero conditions (47), (48), (49), and the t-test (50). See the illustrative Example 4. On the other hand, if the result of a linearity study fails any of the conditions above, then the next step is to evaluate acceptance by the statzero proxy criterion which we have proposed in §2.2.4 for the single-sample bias case; here it is to be tested for linearity validation at every reference point j of the linearity subgroup samples, as explained below.
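A compact way to see how the linearity requisites fit together is the Python sketch below. It takes already-computed regression results (intercept, slope, their standard deviations, and the standard deviation of the fitted line at each reference value) and reports each check separately. The function name, argument layout and numbers are hypothetical, and the Tcrit value is illustrative only.

```python
def linearity_statzero(a, b, s_a, s_b, fit_sd, t_crit):
    """Check the statzero requisites for linearity.

    fit_sd maps each reference value to the standard deviation of the fitted
    line at that value (hypothetical input layout).
    """
    # Zero inside [v - h, v + h] is the same as |v| <= h.
    line_ok = all(abs(a + b * ref) <= t_crit * sd for ref, sd in fit_sd.items())
    slope_ok = abs(b) <= t_crit * s_b        # zero inside the slope interval
    intercept_ok = abs(a) <= t_crit * s_a    # zero inside the intercept interval
    return {
        "line": line_ok,
        "slope": slope_ok,
        "intercept": intercept_ok,
        "accepted": line_ok and slope_ok and intercept_ok,
    }

# Hypothetical regression results for five reference values.
fit_sd = {2: 0.015, 4: 0.011, 6: 0.010, 8: 0.011, 10: 0.015}
passing = linearity_statzero(a=0.02, b=-0.004, s_a=0.02, s_b=0.003,
                             fit_sd=fit_sd, t_crit=2.0)
failing = linearity_statzero(a=0.02, b=-0.01, s_a=0.02, s_b=0.003,
                             fit_sd=fit_sd, t_crit=2.0)
print(passing)
print(failing)
```

Written this way, the interval conditions on slope and intercept and the corresponding t-tests reduce to the same comparison of |b|/s_b and |a|/s_a against Tcrit, so in practice they pass or fail together; a study failing any check would move on to the statzero proxy evaluation.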

Statzero proxy criterion applied to linearity
Based on the criterion developed in §2.2.4 for single-sample bias, the acceptability of linearity by statzero proxy is subject to assessing the amount of overlap, ΔUovrlp as determined by expression (28), between the hyperbolae-bounded 95% confidence interval about the regression best fit line and the reference value uncertainty, at each of the linearity study reference values spaced across the gage's applicable operating range. We consider linearity to be acceptable if ΔUovrlp is greater than 25% at every reference point, in alignment with the criterion (29). See the illustrative Examples 5 and 6.

Results & discussion
We will present generic examples and discuss them to illustrate the methods we proposed in § 2.

Check standard evaluation
Example 1: A production site keeps an NMI-traceable thin film oxide wafer standard with quoted thickness and expanded uncertainty Rt ± Ut = (3000 ± 5) nm. The site starts a new process that requires a film thickness of ≈1000 nm; however, no standard for this thickness is available at the site, so they decide to use in-house reference parts for MSA stability and GR&R. The thin film gage used by the site has resolution r = 2 nm and calibration uncertainty tolerance Ug = ±2 nm. The site metrology engineer proceeds to establish a check standard by best estimate of a 1000 nm target thermal oxide film on a prime wafer using the procedure described in § 2.1.2, running repeatability measurement trials on the check wafer and on the available traceable standard wafer, and obtaining the data sets in Table 1, which give Rchk ≅ 1005 nm and Rm ≅ 3010 nm. Using the repeatability variance results from Table 1 and the values of r and Ug above with Tcrit = 2.262 (m = 10, α = 0.05), equations (6) and (10) yield the measurement expanded uncertainty U ≅ 6.0 nm for the check standard and U' ≅ 5.9 nm for the traceable standard. By equation (8), the gage offset error is ΔR = 3010 − 3000 = 10 nm. Substituting this, the values of U and U', and half the value of the traceable standard expanded uncertainty (half Ut = 2.5 nm) into equation (12) gives the total estimated uncertainty for the check standard: Uchk ≅ 13.5 nm. Hence the value of the in-house check standard is estimated to be Rchk ≅ (1005 ± 14) nm. This is quite good for stability and GR&R studies. However, the gage offset of 10 nm will present an issue for bias and linearity studies since it represents a 'hidden' bias increment of ≈0.33% at 3000 nm, which will not be accounted for if one uses the in-house check standard whose assessed value is traceable only to the in-house gage.
This demonstrates why using check standards for bias and linearity is not recommended unless there is no other option for a unique measurement characteristic and/or a unique gage system or for destructive testing, as already alluded to in § 2.1.1 and § 2.1.2. In such cases, one may adjust the bias readings to account for the offset. For process control monitoring, applying the offset to collected process data in SPC charts (if it is known at the process target value) is reasonable, provided the specified process tolerance is sufficiently accommodating to absorb any negative impact on process Cpk entitlement; otherwise one may consider adjusting the tolerance limits in correlation with the offset, if allowed. The MSA manual [1] advises that if a system has non-zero bias, the first thing to do is attempt to recalibrate or modify it to remove the offset, i.e. reset the gage to zero bias. If this is not successful, the manual posits that the gage may still be used by correcting for the offset at every measurement reading.
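The offset arithmetic of Example 1 and the per-reading correction mentioned above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the two raw readings are invented, and the sketch assumes the offset measured at the 3000 nm traceable standard is the one being corrected. Whether the same offset holds at the 1000 nm process target is exactly the caveat raised in the text.

```python
def gage_offset(measured_mean, certified_value):
    """Gage offset error: trial mean on the traceable standard minus its value."""
    return measured_mean - certified_value


def correct_for_offset(readings, offset):
    """Subtract a known gage offset from every reading."""
    return [x - offset for x in readings]


# Values quoted in Example 1 (nm).
offset = gage_offset(3010.0, 3000.0)
relative_percent = 100.0 * offset / 3000.0

# Hypothetical raw readings taken near the traceable standard (nm).
corrected = correct_for_offset([3008.0, 3012.0], offset)

print(offset, round(relative_percent, 2), corrected)
```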

Consensus standard evaluation
Example 2: Four factory sites of a company, FAC-1 to FAC-4, need a dimensional measurement standard for a characteristic feature on a new product with a target pitch of ≈500 nm ± 1.0% tolerance, to be verified by contactless profilometry. Traceable standards of titanium alloy with micro-etched features are available commercially but are too expensive to purchase. The sites decide to adopt a self-made 3D-printed reference block, which includes a ≈500 nm trench, as a consensus standard for their profilometry systems. The gage calibration uncertainty values are Ug = 1.0 nm, 1.0 nm, 1.5 nm, and 1.5 nm, respectively, for FAC-1 to FAC-4, and the gage resolution is r = 0.5 nm as quoted by the OEM manual. The sites then run repeatability measurement trials on the feature using the procedure described in § 2.1.3, obtaining the 4 independent data sets shown in Table 2. This table also shows, per site, the trial means Rp(s), the repeatability variance Vp(s) by (14), and the measurement expanded uncertainty U(s) by (15). Using equation (13) with the values of Rp(s) in Table 2 yields the estimated consensus value Rcon = 501.9 nm. Using equation (17) with the values of Vp(s) in Table 2 and k = 4 yields Vms = 2.1 nm. Using equation (18) with the values of Rp(s) in Table 2 yields e(Mss) = 0.4 nm. Using equation (20) with the values of Ug yields Ūg = 1.27 nm. Finally, using equation (22) with the numerical results above yields the expanded uncertainty for the consensus standard: Ucon = 4.1 nm. Hence the consensus standard value is best estimated to be Rcon ± Ucon ≅ (502 ± 4) nm.
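As a quick arithmetic check on Example 2, the quoted combined gage calibration uncertainty of 1.27 nm is consistent with a root-mean-square pooling of the four site values. Equation (20) is not reproduced in this section, so treating it as an RMS is an assumption that happens to reproduce the quoted figure. The sketch also converts the ±1.0% tolerance at 500 nm into nanometres; the variable names are hypothetical.

```python
import math


def rms(values):
    """Root-mean-square of a list of values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


# Gage calibration uncertainties quoted for FAC-1 to FAC-4 (nm).
ug_by_site = [1.0, 1.0, 1.5, 1.5]
pooled_ug = rms(ug_by_site)          # ASSUMED form of equation (20)

# Target tolerance: 1.0 % of the 500 nm pitch, expressed in nm.
tolerance_nm = 500.0 * 1.0 / 100.0

print(round(pooled_ug, 2), tolerance_nm)
```

The resulting tolerance of ±5 nm is the bound against which the consensus expanded uncertainty of ±4 nm is compared in the discussion of Figure 3.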
Graphically, Figure 3 shows the readings for each gage, the mean value of the measurements, and the error bars as calculated by equation (6) for the expanded uncertainty of each individual subgroup. It also shows the consensus value Rcon of 502 nm and its error bar of ±4 nm. It is a validation of our method that the ANOVA-estimated consensus expanded uncertainty error bar of ±4 nm encompasses the individual gage readings and error bars, and lies within the target tolerance of ±5 nm.
The consensus standard round-robin method whereby samples of measurement trials for the same reference part are performed on independent measurement systems, coupled with ANOVA modeling, enhances the confidence in traceability and provides assurance that the estimated group mean Rcon represents a reasonably accurate value in the vicinity of the true population mean within the expanded uncertainty bar of ± Ucon.

Single sample bias disposition
Example 3: To illustrate the statzero and statzero proxy dispositions for single sample bias, suppose the factory site FAC-1 of Example 2 uses the established consensus standard reference of (502 ± 4) nm to run bias trials on four similar systems A, B, C, D in different processing areas of their factory, collecting the data in Table 3 and obtaining the results in Table 4. (Note that system A and system B are matched in precision by having similar expanded uncertainty of ±2.5 nm, while C and D are also matched at ±2.8 nm.) The results in Table 4 show that all four systems have lower means relative to the consensus reference value, with progressively negative bias offset and confidence intervals shifting to negative numbers. System A's mean of (501.5 ± 2.5) nm is the closest to the reference value and shows the smallest negative bias (0.09%). This is acceptable by statzero condition (27) since zero is contained within the confidence interval and Tstat is less than Tcrit, as seen in Table 4. System B shows 0.15% negative bias, slightly more than system A; however, because the confidence interval slips below zero into negative territory and Tstat goes above Tcrit, system B is not accepted by statzero, even though it is matched in precision to system A. (Note the sensitivity of the statzero hypothesis: there is only ≈0.25 nm difference between the trial means of systems A and B.) Applying equation (28) to system B's data gives ΔUovrlp = 100% [Table 4]; hence system B is acceptable by statzero proxy criterion (29). On the other hand, systems C and D exhibit bias an order of magnitude larger than system A, clearly away from the statzero zone. However, testing by the statzero proxy criterion shows that system C has ΔUovrlp = 31%, so its bias is acceptable by proxy and can be tolerated.
System D, which is matched in precision to system C but exhibits slightly more negative bias than system C, just fails the statzero proxy criterion (29) by having 23% overlap, and thus its bias error cannot be tolerated. Action must be undertaken to investigate the source of the intolerable negatively-offset bias problem of system D, and adjustments should be made to bring it back to statzero or at least statzero proxy status.
In general, if the size of the bias offset is within the maximum permissible calibration error (uncertainty tolerance) set by the gage manufacturer, then one may, if possible, tune the gage by counter-offset to correct the bias problem. However, if the size of the offset exceeds the maximum permissible calibration error and, we propose, fails the statzero condition and the statzero proxy criterion, then the gage is not acceptable and should be subjected to corrective recalibration or hardware/software modification. In this illustrative example, the gage calibration uncertainty Ug = 1.0 nm translates to a maximum permissible error of ≈ ±0.2% (relative to the reference value 502 nm). Both systems C and D exceed this error; however, system C passes by the statzero proxy criterion and so is considered still in the accuracy zone, i.e. acceptable for use in process/product measurements, with an attempt to counter the offset bias if possible. On the other hand, by failing the statzero proxy criterion, system D has drifted outside the accuracy zone, so attempting to tune the nonconforming offset bias back is not the best course of action, since the system may have significant hardware/software issues that need to be investigated and addressed.
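The single sample statzero disposition discussed in Example 3 can be sketched as a one-sample t-test on the bias. This is a hedged illustration: condition (27) and formula (2) are not reproduced in this section, so the sketch assumes the usual form in which the standard deviation of the mean bias is s/√m. The readings are hypothetical and are not the Table 3 data, and the function names are invented for the example.

```python
import math
from statistics import mean, stdev


def bias_statzero(readings, reference, t_crit):
    """One-sample check that zero bias lies inside the bias confidence interval.

    ASSUMES the bias standard error is stdev(readings) / sqrt(m); t_crit is
    the two-sided 95 % critical value for m - 1 degrees of freedom.
    """
    m = len(readings)
    bias = mean(readings) - reference
    half_width = t_crit * stdev(readings) / math.sqrt(m)
    return {"bias": bias,
            "ci": (bias - half_width, bias + half_width),
            "accepted": abs(bias) <= half_width}


def max_permissible_error_percent(u_gage, reference):
    """Gage calibration uncertainty expressed as a percentage of the reference."""
    return 100.0 * u_gage / reference


# Hypothetical trials (nm), m = 10, against the 502 nm consensus reference.
on_target = [501.0, 502.0, 503.0, 502.0, 501.0, 503.0, 502.0, 502.0, 501.0, 503.0]
offset_low = [x - 5.0 for x in on_target]
T_CRIT_DF9 = 2.262   # value quoted in Example 1 for m = 10

print(bias_statzero(on_target, 502.0, T_CRIT_DF9)["accepted"])
print(bias_statzero(offset_low, 502.0, T_CRIT_DF9)["accepted"])
print(round(max_permissible_error_percent(1.0, 502.0), 2))
```

The last line reproduces the ≈0.2% maximum permissible error quoted above for Ug = 1.0 nm at 502 nm.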

Gage linearity
To illustrate the statzero and statzero proxy dispositions for linearity acceptance, we present and discuss the following generic examples:

Linearity acceptable by statzero
Example 4: Suppose that in addition to the 500 nm feature in Example 2, other 3D-printed micro-etched blocks are patterned with features at target pitches ≈1000, 1500, 2250, and 3000 nm, and maximum tolerances of ±0.8%, ±0.6%, ±0.5%, and ±0.4% respectively. The four sites which participated in generating the consensus standard Rcon(1) ≅ (502 ± 4) nm now run trial measurements on the other four features and generate consensus reference parts with the following values and expanded uncertainty: Rcon(2) ≅ (1012 ± 5) nm, Rcon(3) ≅ (1509 ± 5) nm, Rcon(4) ≅ (2262 ± 6) nm, and Rcon(5) ≅ (3015 ± 6) nm. The FAC-1 site then uses the five consensus standards for a linearity study on their measurement system A. The trials data are shown in Table 5. {Tcrit is obtainable from standard statistics tables or by the Excel function TINV(0.05, gm − 2) at 95% confidence level.} Table 6 shows the regression analysis results for the best-estimated slope and intercept, and Table 7 shows the results for the best fit line. Both tables show that the statzero conditions, (48) for slope, (49) for intercept, and (47) for the best fit line, as well as the Student t-test (50), are all met at 95% confidence, with zero contained within the respective confidence intervals and both Tstat(b) and Tstat(a) less than Tcrit. Accordingly, the linearity of measurement system A is acceptable by the statzero condition. Note that this is true even as the best fit line shows a slight negative bias intercept of ≈ −0.5 nm through the range studied, as seen in Table 7 and the plot in Figure 4.

Table 3. Example 3. FAC-1; 4 measurement systems; bias trials; consensus standard = (502 ± 4) nm (measurement unit = nm).

Linearity acceptable by statzero proxy
Example 5: Suppose the FAC-1 site of Example 3 next uses the five consensus standards of Example 4 for a linearity study on their measurement system C. The measurement trials are shown in Table 8, and the linear regression analysis results are in Tables 9 and 10. These show that the statzero condition is satisfied for the slope, with zero contained within the slope's confidence interval and Tstat(b) < Tcrit, but is not satisfied for the best fit line nor for the intercept; hence system C linearity is not accepted by the statzero hypothesis. On the other hand, applying the statzero proxy criterion (28) and (29) gives the results in Table 11, which validate that all overlaps are >25%. Hence, linearity of measurement system C is acceptable by statzero proxy. Note that the bias average over the linearity range is in the negative zone, as evidenced by the results in Table 10 and the graph of Figure 5, showing a small linear gradient from −4.2 nm for Rcon(1) to −3.5 nm for Rcon(5) at a small slope of 1.8 × 10⁻⁴. Nonetheless, acceptance is justified by the amount of overlap between the confidence interval about the regression best fit line and the reference uncertainty bar being more than 25% for each of the five reference points [Table 11], ensuring the gage is in the accuracy zone with acceptable linearity by regression analysis over the operating range. This facilitates tuning the gage back to statzero, if possible, by an amount equivalent to the linear regression's best fit line intercept, in this example approximately 4 nm. Alternatively, if practical, the offset may be applied to individual measurement points as the process/product data are being collected.

Linearity unacceptable
Example 6: Suppose the FAC-1 site of Example 3 next uses the five consensus standards of Example 4 for a linearity study on their measurement system D. The measurement trials are shown in Table 12, and the linear regression analysis results are in Tables 13 and 14. These show that the statzero condition is satisfied for the slope, with zero contained within the slope's confidence interval and Tstat(b) < Tcrit, but is not satisfied for the best fit line nor for the intercept; hence system D linearity is not accepted by the statzero hypothesis. Applying the statzero proxy criterion (28) and (29) gives the results in Table 15, which show that the >25% overlap criterion is valid for Rcon(3) to Rcon(5), but not valid for Rcon(1) and Rcon(2). Hence linearity of measurement system D is not acceptable by statzero proxy. The results in Table 14 show the bias average over the linearity range in the negative zone but, unlike in Example 5, it is nonlinear as it exhibits an inflexion point at Rcon(3) = 1509 nm, graphically depicted in Figure 6 at the intersection of the two dashed lines. The linear regression results show a slope of 4.7 × 10⁻⁴, which is 2.6 times the slope in Example 5 (1.8 × 10⁻⁴), and an intercept of −5.9 nm. These results indicate that system D is non-linear and hence does not lend itself to simple tuning back to statzero, or to applying a uniform offset to the data points. This system's gage has to be subjected to corrective recalibration and/or hardware/software modification to fix the bias nonlinearity problem.

* Bias average values in Table 10.
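The three outcomes illustrated by Examples 4 to 6 follow one decision sequence, which can be sketched as below. The function name and the overlap percentages are hypothetical placeholders (the values of Tables 11 and 15 are not reproduced here); only the order of the tests and the 25% threshold are taken from the text.

```python
def linearity_disposition(statzero_ok, overlaps_percent, threshold=25.0):
    """Return the linearity disposition for one measurement system.

    statzero_ok      -- True when conditions (47)-(50) are all met
    overlaps_percent -- overlap at each reference point, in percent
    threshold        -- overlap that must be exceeded at every point
    """
    if statzero_ok:
        return "acceptable by statzero"
    if all(p > threshold for p in overlaps_percent):
        return "acceptable by statzero proxy"
    failing = [j + 1 for j, p in enumerate(overlaps_percent) if p <= threshold]
    return "unacceptable (overlap too small at reference points %s)" % failing


# Hypothetical overlap values for five reference points, in percent.
print(linearity_disposition(True, []))
print(linearity_disposition(False, [31.0, 40.0, 55.0, 60.0, 70.0]))
print(linearity_disposition(False, [10.0, 20.0, 30.0, 45.0, 60.0]))
```

Reporting which reference points fail, rather than a bare pass/fail, mirrors the way Example 6 localises the problem to the low end of the range.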

Range consideration
When the product manufacturing or test/inspection process spans a wide range of characteristic measurements, it is recommended to validate MSA linearity using three studies with three sets of reference parts, each set having at least five distinctly independent references representing the low end, mid-range, and high end of the production measurements. A similar approach may be adopted if the measured characteristic has ranges that differ widely by technology type.

Conclusions
This paper starts by introducing methods for establishing references for MSA bias and linearity studies when there are no available traceable standards; in particular, a method for establishing consensus and check standard values and expanded uncertainty using a nested ANOVA approach.
The paper argues, however, that check standards are unsuitable for evaluating bias and linearity of measurement systems because of the limitation of self-traceability (even though check standards are appropriate for stability and GR&R studies of gage systems). We then proceed to present the mathematical t-statistic based background for studies of gage bias and linearity, providing the appropriate formulae for the single reference bias case as well as deriving the formulae for simple linear regression analysis needed for multi-reference bias linearity validation. For acceptance, we primarily use the null-hypothesis statistical zero bias (statzero) condition, combined with the Student's t-test, to justify acceptance of bias and linearity given the small samples normally used in such studies (typically 10 ≤ m ≤ 20). Moreover, we propose a novel idea of taking into consideration the degree of overlap between the confidence interval of bias fit data, or the confidence hyperbolae in the case of linearity regression analysis, and the uncertainty bars of the reference standards used in bias and linearity studies, extending acceptance of gage bias and linearity according to the criterion of >25% overlap. We call this extended test for significant overlap the statzero proxy criterion. We provide illustrative examples at the end to demonstrate the concepts and formulae used in this work, using calculated consensus standards.