Appendix A: An Illustration of Sample Size Calculations

In general, the value of the complication rate under the alternative hypothesis is derived using a combination of quantitative and qualitative reasoning. The precise methods used are context dependent and thus not discussed in detail here. In the present example, a cost-effectiveness analysis might suggest that complication rates of 6 percent and above would call into question the efficacy of CE. Given these inputs, it can be shown that the effect size is 0.21, and the sample size required for 80-percent power is approximately 370.
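The figures quoted above can be reproduced approximately with the arcsine effect size for proportions. The sketch below is an illustration, not necessarily the calculation that produced those figures. It assumes a null complication rate of 3 percent (the value used in the later designs), a two-sided test at the .05 level, and the convention of multiplying the arcsine difference by the square root of 2 when a single sample is compared with a fixed reference value. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def arcsine_effect_size(p0, p1):
    """Difference between two proportions on the arcsine scale."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))

def one_sample_n(p0, p1, alpha=0.05, power=0.80):
    """Approximate n for a two-sided, one-sample test of a proportion."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    h = arcsine_effect_size(p0, p1)
    return math.ceil(((z_alpha + z_beta) / h) ** 2)

# Assumed inputs: null rate of 3 percent, alternative rate of 6 percent.
h = arcsine_effect_size(0.03, 0.06)
h_adjusted = h * math.sqrt(2)           # one-sample convention, about 0.21
n_required = one_sample_n(0.03, 0.06)   # a little under 370
```

Under these assumptions the result lands close to the sample size of approximately 370 given in the text; different rounding or a one-sided test would move it somewhat.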

Design 2: Continuing to follow patients at high risk of stroke, now suppose that the goal of the registry is to compare complication rates across hospitals. For simplicity, we continue to assume that patients are sufficiently similar to the comparator patients that no explicit adjustment for case mix is required.

Design 2 is a simple form of benchmarking application. For example, the CE complication rates for each hospital might be reported to a regulatory agency and/or the general public, on the presumption that statistically significant differences between complication rates can be used to identify hospitals with differences in quality of care. The particular danger in this design is that the complication rate for any particular hospital might be estimated with relatively little precision, thus generating results that have more noise than signal. Another danger, discussed later, is that case-mix adjustment is required but is either not performed or performed inadequately.

We assume that the benchmarking will focus on comparing specific hospitals—i.e., in the underlying statistical model, hospital will represent a fixed rather than random effect. The null hypothesis is that the complication rates for all the hospitals are identical, and the alternative hypothesis is that the complication rates follow some pattern other than being identical. In this design, specifying the alternative hypothesis of interest is a potentially formidable task. One way to formulate this hypothesis is to focus on outlier hospitals. For example, suppose that there are 10 hospitals in the registry, the overall complication rate among 9 of these is expected to be 3 percent, and the complication rate at the tenth hospital is 10 percent. This information, along with expected number of cases in each hospital, is sufficient to calculate an effect size and thus perform the sample size calculation.

When comparing complication rates among specific hospitals, some adjustment should be made for multiple comparisons: in any group of hospitals, there will always be a hospital with the highest complication rate, and focusing on differences between the outcomes of this particular hospital and the outcomes of the others will overstate the level of statistical significance. The initial statistical test used to assess the homogeneity of complication rates across all the hospitals in the registry implicitly takes this multiple-comparison problem into account. Subsequent tests, in particular those tests that compare apparent outlier hospitals with others, should include an explicit adjustment for multiple comparisons, and the sample size calculations should reflect the fact that an adjusted comparison is being made.
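To give a sense of how much an explicit adjustment can cost in sample size, the sketch below uses a Bonferroni correction, which is one possible adjustment and is chosen here only for simplicity; the text does not prescribe a particular method. It relies on the normal approximation, under which the required sample size is proportional to the square of the sum of the two normal quantiles. The function name is hypothetical.

```python
from statistics import NormalDist

def sample_size_inflation(n_comparisons, alpha=0.05, power=0.80):
    """Factor by which n grows when a two-sided test is Bonferroni adjusted."""
    norm = NormalDist()
    z_beta = norm.inv_cdf(power)
    z_plain = norm.inv_cdf(1 - alpha / 2)
    z_adjusted = norm.inv_cdf(1 - alpha / (2 * n_comparisons))
    return ((z_adjusted + z_beta) / (z_plain + z_beta)) ** 2

# Each of 10 hospitals compared with the rest: about 1.7 times the
# unadjusted sample size is needed for the same power.
inflation = sample_size_inflation(10)
```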

In practice, the approach to this design might reasonably depend on whether registry data are being collected electronically or manually. If data are being collected electronically, the most sensible policy is to collect information on all CE procedures performed within each hospital and to use the sample size formula as an assessment of whether the registry as a whole is likely to produce results that are sufficiently accurate to support decision making. This assessment can be framed in terms of statistical power, as discussed above, or in terms of precision.

Considering precision, a 95-percent confidence interval for a nonzero complication rate for any hospital is p ± 1.96 × sqrt(pq/n), where p is the observed complication rate, q = 1 − p, and n is the sample size. Supposing that p = 3 percent and n = 300 per hospital, the half-width of this confidence interval within any particular hospital is expected to be approximately 1.9 percentage points (that is, an interval of roughly 1.1 to 4.9 percent). If data are being collected manually, and thus the marginal cost of data collection per patient is high, a reasonable policy would be to collect data on a sufficient number of patients in each hospital so that the precision of the estimates of the complication rate within a given hospital would be considered adequate.
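The precision figure quoted above can be checked directly from the formula. This is a minimal sketch using the normal-approximation interval given in the text; the function name is hypothetical.

```python
import math

def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation interval p +/- z * sqrt(pq/n)."""
    return z * math.sqrt(p * (1 - p) / n)

# p = 3 percent, n = 300 per hospital, as in the text.
half_width = wald_half_width(0.03, 300)   # about 0.019
```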

As with hypothesis testing, the analysis to derive the width of the confidence interval usually applies a combination of qualitative and quantitative insights. In particular, the question can be reframed as the following: For what values of the complication rate will my decision (whether taken from the perspective of clinical medicine, public health, etc.) be the same? For example, if the decision is the same regardless of where the complication rate falls within the range of 2 to 4 percent, an interval of this width is sufficiently precise.
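Once an acceptable interval has been chosen, the confidence interval formula can be inverted to give the number of patients needed per hospital. The sketch below assumes the 2 to 4 percent range from the example, read as an expected rate of 3 percent with a half-width of 1 percentage point, and uses the same normal approximation as before. The function name is hypothetical.

```python
import math

def n_for_half_width(p, half_width, z=1.96):
    """Smallest n for which z * sqrt(pq/n) does not exceed half_width."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Expected rate 3 percent, interval of 2 to 4 percent:
# on the order of 1,100 patients per hospital.
n_needed = n_for_half_width(0.03, 0.01)
```

For rates this close to zero, the normal approximation is only a rough guide, so the result is best treated as an order of magnitude.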

Unless sample sizes are large, using registries to compare individual hospitals is potentially quite problematic. Although determining the inputs to the power calculations is not always a straightforward task, performing this analysis is quite useful, even if the result is only to suggest extreme caution in the interpretation of differences between hospitals.

Design 3: Continuing to follow patients undergoing CE, now suppose that the goal of the registry is to compare two different versions of the surgical procedure. For simplicity, continue to assume that patients are sufficiently similar to the comparator patients that no explicit adjustment for case mix is required. The following discussion (after including an adjustment for case mix, if appropriate) also applies to comparisons of two different versions of a medical device and similar applications. The key distinctions between this design and Design 2 are that in Design 3 the primary comparison or comparisons can be stated ahead of time and the number of comparisons is relatively small, so that the issue of multiple comparisons can be ignored.

The analytic approach to this design is a logistic regression, with the input file having one record per patient. The outcome variable is the presence or absence of a complication, the categorically scaled control variable is the hospital, and the primary predictor is the categorically scaled coding of the type of surgical procedure (i.e., CE using version A vs. CE using version B). The null hypothesis is that, after accounting for any differences in hospitals, the two different versions of the procedure have identical complication rates. The alternative hypothesis is that the rates differ by a specified amount, this amount being the minimum clinically significant difference interpreted to be of concern. Power calculations proceed in the same fashion as for logistic regression with multiple predictors.

The main pitfall in this design is that patients who receive version A of the surgical procedure might differ from those who receive version B of the procedure along some dimension that has an impact on outcomes. (This pitfall is discussed in more detail under Design 4.) In this application, the null and alternative hypotheses are sometimes structured the same way as in an equivalence trial—that is, differences in complication rates are not expected, and the goal of the study is to demonstrate that complication rates for the two versions of the surgical procedure are similar within a certain level of precision. The structure of the analysis is not fundamentally different. Indeed, sample size calculations for equivalence trials are sometimes not performed within a hypothesis-testing framework but instead by identifying a sample size of sufficient magnitude to make the confidence interval for the difference in the complication rates between the two versions of the surgical procedure a certain width. For simplicity of presentation, let us assume from now on that any equivalence-trial-type calculations can be reframed into confidence interval format, and thus need not be discussed separately.
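The confidence-interval framing of an equivalence-type question can be sketched as follows. The inputs are illustrative assumptions, not values from the text: both versions are taken to have a 3 percent complication rate, the two groups are equal in size, and the target half-width for the difference is 2 percentage points. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_arm_for_difference(p_a, p_b, half_width, confidence=0.95):
    """Patients per version so the interval for (p_a - p_b) has the given half-width."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    variance_sum = p_a * (1 - p_a) + p_b * (1 - p_b)
    return math.ceil(z ** 2 * variance_sum / half_width ** 2)

# Hypothetical inputs: 3 percent in both arms, half-width of 0.02.
n_each = n_per_arm_for_difference(0.03, 0.03, 0.02)   # between 550 and 570
```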

Design 4: Continuing to follow patients at high risk of stroke, and continuing to assume that the goal of the registry is to compare two different versions of the surgical procedure, now additionally assume that this comparison will include an adjustment for case mix. Within the logistic regression paradigm, variables used to adjust for case mix are accounted for as covariates (i.e., additional predictors). Alternatively, propensity-scoring methods could be used to adjust for those variables that predict the assignment of patients to particular versions of the procedure. For concreteness, let us focus on logistic regression. In order to perform a sample size calculation for a logistic regression, the analyst must specify the predictive ability of the covariates and the odds ratio associated with the predictor of interest. (For example, version B of the procedure might increase the odds of complications by a factor of 1.5.) Once these inputs are specified, the sample size calculation is straightforward.
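One common approximation, offered here as a sketch rather than as the method the text has in mind, is to compute the sample size for an unadjusted comparison of two proportions and then inflate it by 1 / (1 − R²), where R² is the squared multiple correlation between the procedure indicator and the case-mix covariates. The sketch assumes equal group sizes, a 3 percent complication rate for version A, and a two-sided test at the .05 level; the function names and the R² value of 0.2 are hypothetical.

```python
import math
from statistics import NormalDist

def rate_from_odds_ratio(p_ref, odds_ratio):
    """Complication rate implied by applying an odds ratio to a reference rate."""
    odds = odds_ratio * p_ref / (1 - p_ref)
    return odds / (1 + odds)

def n_per_group(p_a, odds_ratio, r_squared=0.0, alpha=0.05, power=0.80):
    """Approximate patients per version, inflated for correlated covariates."""
    p_b = rate_from_odds_ratio(p_a, odds_ratio)
    p_bar = (p_a + p_b) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    n_unadjusted = numerator / (p_b - p_a) ** 2
    # Variance inflation: covariates correlated with the predictor cost power.
    return math.ceil(n_unadjusted / (1 - r_squared))

n_plain = n_per_group(0.03, 1.5)                    # no covariate correlation
n_adjusted = n_per_group(0.03, 1.5, r_squared=0.2)  # hypothetical R^2 of 0.2
```

With an odds ratio of 1.5 applied to a 3 percent baseline, the implied rate for version B is only about 4.4 percent, which is why the required sample size runs to several thousand patients per version.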

Both the logistic-regression and propensity-scoring approaches suffer from the fundamental drawback that they can adjust only for covariates that are observed. In particular, if there are variables that predict outcome that are unmeasured (e.g., a physician's assessment of a patient's likelihood to comply with treatment, or an assessment of "stroke in evolution" not included in the administrative database used as the source of data for the registry), then the comparison between the two versions of the surgical procedure is potentially biased. Accordingly, before proposing to use a registry to compare complication rates (e.g., across different versions of a procedure or a device) or other outcomes, it is critical to determine that the following three conditions do not all hold: (1) a patient, provider, system, or other characteristic affects the complication rate; (2) this characteristic is unmeasured within the registry; and (3) there is a reasonable likelihood that this characteristic might be differentially distributed across the different versions of the procedure or the device. If all three conditions (in epidemiologic terms, the conditions for confounding) hold, use of the registry to compare outcomes is potentially dangerous.

Critical to Designs 1–4 is the assumption that the CE complication rate is stable over time. If this is the case, it is appropriate to use the registry to estimate a single complication rate associated with version A of the procedure, estimate another single complication rate associated with version B of the procedure, and compare the rates. On the other hand, if the technology of CE (e.g., physical materials, surgical technique) is improving, then the registry should continue to monitor the performance of CE over time. Such an ongoing monitoring function seems particularly relevant for medical devices and similar applications.

Even when the associated technology is assumed stable, some registries are intended to provide ongoing assessments of outcomes. For example, in a quality assurance context, CE complication rates might be assessed at individual hospitals on an annual basis (e.g., in order to check for problems that have recently arisen). On the other hand, a registry whose purpose is to assess whether complication rates observed in randomized trials could be achieved in usual practice could be designed with a sunset provision to cease operation once this question is answered. The latter type of registry might, for example, support a coverage decision by the Centers for Medicare & Medicaid Services.

Having an ongoing monitoring function induces additional analytical complications, among others a multiple-comparisons problem. Traditional statistical power calculations are performed under the assumption that the sample size is fixed and that, unless otherwise noted, multiple comparisons are not a major issue. Sequential testing methods associated with randomized trials (where, for example, the type I error of .05 is apportioned into an early test with alpha = .001 and a subsequent test with alpha = .049) are not appropriate for this particular design, since most of these methods assume that the maximum sample size is fixed. Some methods assume that what is fixed is not the number of patients but the number of events, but these methods are also inappropriate for registry applications.

Design 5: Suppose the goal is to estimate the complication rate associated with CE at multiple time points for the foreseeable future. Control chart methodology might reasonably be applied to this class of problems. This methodology, often used in the quality assurance and quality improvement context, was originally developed for industrial applications. In this example, the null hypothesis, under which the system in question is "in control," is that the CE complication rate remains at the desired value of 3 percent throughout the entire followup period. Samples are taken at given points in time (e.g., monthly). As an example, if these monthly samples are of size 100, then the standard error is approximately 1.7 percent. The analyst then creates a control chart by plotting these monthly complication rates over time and forming channels based on the standard error. In this example, the channel extending from the point estimate to 1 standard error above the point estimate is 3 percent to 4.7 percent.
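The standard error and channels in this example follow from the binomial formula. The sketch below reproduces them under the stated inputs (target rate of 3 percent, monthly samples of 100); truncating the lower limits at zero is an added assumption, since a complication rate cannot be negative. The function name is hypothetical.

```python
import math

def control_channels(p0, n, multiples=(1, 2, 3)):
    """Standard error and (lower, upper) channel limits around the target rate."""
    se = math.sqrt(p0 * (1 - p0) / n)
    limits = {k: (max(0.0, p0 - k * se), p0 + k * se) for k in multiples}
    return se, limits

se, limits = control_channels(0.03, 100)
upper_one_se = limits[1][1]   # about 0.047, matching the channel in the text
```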

Once the basic control chart—which goes by different names depending on the scale of measurement of the outcome variable—is formed, the plot is checked for various violations of the null hypothesis of constant complication rates. The set of possible violations to be flagged as statistically significant might include (1) any observation more than 3 standard errors from the mean; (2) two of three consecutive observations more than 2 standard errors from the mean; (3) eight observations in a row that increase or decrease; and (4) eight observations in a row on one side of the mean. These rules of thumb implicitly take into account the multiple comparisons problem by requiring noteworthy departures from the null hypothesis in order to be flagged; they are also based on the observed properties of physical machines as they fall out of adjustment: suddenly breaking down and producing an extreme outlier, or gradually heating and thus producing sequentially higher readings. Complication rates of CE might or might not follow the properties of physical machines, but the decision rules from control chart methodology are at least a good place to start.
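The four rules listed above can be expressed as a simple scan over the monthly rates. This is a minimal sketch that follows the wording of the rules as given here; for instance, rule 2 is applied without requiring the two observations to fall on the same side of the mean, which some formulations do require. The function name and the default values are hypothetical.

```python
import math

def flag_violations(rates, p0=0.03, n=100, run=8):
    """Return (month_index, rule_number) pairs for each rule violation."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = [(r - p0) / se for r in rates]
    flags = []
    for i in range(len(z)):
        # Rule 1: a single observation more than 3 standard errors out.
        if abs(z[i]) > 3:
            flags.append((i, 1))
        # Rule 2: two of three consecutive observations more than 2 SE out.
        if i >= 2 and sum(abs(v) > 2 for v in z[i - 2:i + 1]) >= 2:
            flags.append((i, 2))
        if i >= run - 1:
            window = rates[i - run + 1:i + 1]
            steps = [b - a for a, b in zip(window, window[1:])]
            # Rule 3: eight observations in a row that increase or decrease.
            if all(s > 0 for s in steps) or all(s < 0 for s in steps):
                flags.append((i, 3))
            # Rule 4: eight observations in a row on one side of the mean.
            z_window = z[i - run + 1:i + 1]
            if all(v > 0 for v in z_window) or all(v < 0 for v in z_window):
                flags.append((i, 4))
    return flags

# A single month at 9 percent after five months on target trips rule 1.
spike = flag_violations([0.03] * 5 + [0.09])
```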