Overview
This vignette outlines the statistical methods used to evaluate Key Risk Indicators (KRIs) in {gsm}. KRIs are metrics that allow users to measure pre-defined risks and determine the level of observed risk to data quality and patient safety in a clinical trial. The {gsm} package implements a standardized data pipeline to facilitate KRI analysis. Other vignettes provide an overview of this framework (1, 2, 3, 4, 5), and the statistical methods for this process are described in detail below.
{gsm} calculates KRIs by defining a numerator and a denominator for each metric. By default, {gsm} then calculates z-scores using a normal approximation, with an adjustment for over-dispersion, to assign risk levels.
For KRIs that are percentages (binary outcome), the numerator is the number of events and the denominator is the total number of participants; we then apply the normal approximation of the binomial distribution to determine a risk level.
For KRIs that are rates (count outcome), the numerator is the number of events and the denominator is the total participant exposure or study duration; we then apply the normal approximation of the Poisson distribution to determine a risk level.
Alternative statistical methods for calculating standardized scores are also available in {gsm}, including the Identity, Fisher’s exact, and Poisson regression methods. More details are provided below.
Statistical Methods
1. The Normal Approximation Method
Introduction
This method applies a normal approximation of the binomial distribution to binary-outcome KRIs, or a normal approximation of the Poisson distribution to rate-outcome KRIs, using the sample sizes or total exposure of the sites, to assess data quality and safety. Control limits based on the asymptotic normal approximation are constructed as risk thresholds for identifying site-level risks.
Reference: Zink, Richard C., Anastasia Dmitrienko, and Alex Dmitrienko. Rethinking the clinically based thresholds of TransCelerate BioPharma for risk-based monitoring. Therapeutic Innovation & Regulatory Science 52, no. 5 (2018): 560-571.
Methods
Binary
Consider the problem of monitoring KRIs with binary outcomes, such as protocol deviation or discontinuation from the study, across multiple sites in a clinical trial. Assume that there are $m$ sites with $n_i$ patients at the $i$th site, $i = 1, \ldots, m$. Denote the total number of patients in the study by $n = \sum_{i=1}^{m} n_i$. Let $x_{ij}$ signify the outcome of interest for the $j$th patient at the $i$th site, where $x_{ij} = 1$ indicates that an event has occurred and $x_{ij} = 0$ indicates that an event has not occurred. Finally, let $p_i$ denote the site-level proportion at the $i$th site. Monitoring tools focus on testing the null hypothesis of consistency of the true site-level proportion across multiple sites. Specifically, the null hypothesis states that the site-level proportion of the binary outcome is constant across the sites, that is, $p_1 = p_2 = \cdots = p_m = p$, where $p$ is the common proportion. This common proportion can be estimated as $\hat{p} = \frac{1}{n} \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij}$.
The control limits are computed using $100(1-\alpha)\%$ confidence limits based on an asymptotic normal approximation. A 95% confidence interval is obtained if the significance level $\alpha = 0.05$. Let $X_i = \sum_{j=1}^{n_i} x_{ij}$ represent the total number of events that occur and let $\hat{p}_i = X_i / n_i$ denote the estimated event rate at the $i$th site. The asymptotic confidence interval for $p_i$ is given by $\hat{p}_i \pm z_{1-\alpha/2} \sqrt{\hat{p}_i (1 - \hat{p}_i) / n_i}$, where $z_{1-\alpha/2}$ is the upper $100(1-\alpha/2)$th percentile of the standard normal distribution. To construct the control limits for the observed event rate at this site, the estimated event rate is forced to be equal to the overall event rate $\hat{p}$. This means that the lower ($l$) and upper ($u$) asymptotic control limits for the $i$th site are defined as $p_{il} = \hat{p} - z_{1-\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n_i}$ and $p_{iu} = \hat{p} + z_{1-\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n_i}$, respectively. Asymptotic control limits may not be reliable in smaller clinical trials, so exact limits for an event rate may be preferable.
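As an illustration, these control limits can be computed directly in base R. The data and object names below are hypothetical, not part of the {gsm} API:

```r
# Illustrative base-R sketch of the asymptotic control limits for a binary KRI.
set.seed(123)
n_i <- c(25, 40, 60, 15, 80)            # patients enrolled per site
x_i <- rbinom(length(n_i), n_i, 0.10)   # simulated event counts per site

p_hat  <- sum(x_i) / sum(n_i)           # overall (common) event proportion
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)          # ~1.96 for a 95% interval

# Lower/upper control limits per site, centered at the overall proportion
se_i  <- sqrt(p_hat * (1 - p_hat) / n_i)
lower <- p_hat - z_crit * se_i
upper <- p_hat + z_crit * se_i

data.frame(site = seq_along(n_i), n = n_i, rate = x_i / n_i, lower, upper)
```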
Rate
Assume that the distribution of the number of events $X(t)$ up to time $t$ is Poisson with mean $\lambda t$, where $\lambda$ is the event rate for a given unit of time. For the $i$th site with $X_i$ events and $T_i$ exposure, define the exposure-adjusted incidence rate (EAIR) as $\hat{\lambda}_i = X_i / T_i$. For all sites, define $X = \sum_{i=1}^{m} X_i$ and $T = \sum_{i=1}^{m} T_i$ with $\hat{\lambda} = X / T$. Under a normal approximation, the $100(1-\alpha)\%$ confidence interval for $\lambda_i$ is $\hat{\lambda}_i \pm z_{1-\alpha/2} \sqrt{\hat{\lambda}_i / T_i}$. For these funnel plots accounting for exposure, the x-axis representing the site sample size ($n_i$) in the above examples is replaced by the total exposure time $T_i$. To develop a funnel plot, fix $\hat{\lambda}$ and vary $T$ from $\min_i T_i$ to $\max_i T_i$ to compute the control limits, as sketched below. As an area of future research, the work of Chan and Wang (2009) may suggest methods appropriate for computing an exact confidence interval for the EAIR. Finally, similar methods can be applied for a count-type endpoint $X_{ij}$, where $t_{ij}$ would denote the time on study for the $j$th patient at the $i$th site.
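A hedged sketch of these exposure-based funnel limits, holding $\hat{\lambda}$ fixed and varying exposure over a grid; the data are simulated and the code is illustrative, not {gsm} internals:

```r
# Illustrative funnel-plot control limits for an exposure-adjusted
# incidence rate (EAIR).
events   <- c(12, 30, 8, 45)       # events per site
exposure <- c(100, 250, 60, 400)   # exposure per site (e.g., patient-days)

lambda_hat <- sum(events) / sum(exposure)  # overall event rate
z_crit     <- qnorm(0.975)

# Vary exposure from the smallest to the largest observed value
t_grid <- seq(min(exposure), max(exposure), length.out = 200)
lower  <- lambda_hat - z_crit * sqrt(lambda_hat / t_grid)
upper  <- lambda_hat + z_crit * sqrt(lambda_hat / t_grid)

plot(exposure, events / exposure, xlab = "Total exposure", ylab = "EAIR")
lines(t_grid, lower, lty = 2)
lines(t_grid, upper, lty = 2)
abline(h = lambda_hat, col = "grey")
```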
KRI Metric and Z-score
The KRI metric along with a KRI score is created for each site to measure the level of observed risk to data quality and patient safety in a clinical trial. For scoring purposes, z-scores from the normal approximation are calculated and defined as $z_i = \frac{y_i - \mu}{\sigma_i}$ for site $i$, where $y_i$ is the KRI metric calculated for site $i$, $\mu$ is the overall mean, and $\sigma_i$ is the measure of variance.
For a binary outcome, $y_i = \hat{p}_i$, $\mu = \hat{p}$, and $\sigma_i = \sqrt{\hat{p}(1-\hat{p})/n_i}$.
For a rate outcome, $y_i = \hat{\lambda}_i$, $\mu = \hat{\lambda}$, and $\sigma_i = \sqrt{\hat{\lambda}/T_i}$.
Over-dispersion adjustment
The standard normal approximation method described above assumes that the null distribution fully expresses the variability of the in-control sites, but in many situations this assumption will not hold. When there is greater variability than expected, a majority of the sites will fall outside the specified limits, casting doubt on the appropriateness of the constructed limits.
A way of handling this issue is to allow for over-dispersion in the normal approximation. A multiplicative over-dispersion adjustment is implemented in our approach.
Suppose a sample of $m$ units is assumed to be in-control; the over-dispersion factor $\phi$ can then be estimated as the mean of the squared z-scores, i.e., $\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} z_i^2$. For a binary outcome, the over-dispersion adjusted variance is $\hat{\phi}\,\hat{p}(1-\hat{p})/n_i$. For a rate outcome, the over-dispersion adjusted variance is $\hat{\phi}\,\hat{\lambda}/T_i$. Therefore, after the over-dispersion adjustment, the adjusted z-scores for site $i$ are $z_i^{*} = \frac{\hat{p}_i - \hat{p}}{\sqrt{\hat{\phi}\,\hat{p}(1-\hat{p})/n_i}}$ and $z_i^{*} = \frac{\hat{\lambda}_i - \hat{\lambda}}{\sqrt{\hat{\phi}\,\hat{\lambda}/T_i}}$, respectively.
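The full over-dispersion-adjusted calculation for a binary KRI can be sketched in a few lines of base R. This is a simplified re-implementation for illustration, not the exact code inside the package's `Analyze_NormalApprox()` (introduced below):

```r
# Simplified illustration of the over-dispersion-adjusted z-score for a
# binary KRI; a sketch, not the exact {gsm} implementation.
n_i <- c(25, 40, 60, 15, 80)   # patients per site
x_i <- c(4, 3, 9, 4, 6)        # events per site

p_hat <- sum(x_i) / sum(n_i)   # overall event proportion
p_i   <- x_i / n_i             # site-level proportions

# Unadjusted z-scores under the normal approximation
z <- (p_i - p_hat) / sqrt(p_hat * (1 - p_hat) / n_i)

# Over-dispersion factor: mean of the squared z-scores
phi <- mean(z^2)

# Adjusted z-scores with the inflated variance
z_adj <- (p_i - p_hat) / sqrt(phi * p_hat * (1 - p_hat) / n_i)
round(z_adj, 2)
```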
Reference: Spiegelhalter, David J. Funnel plots for comparing institutional performance. Statistics in Medicine 24, no. 8 (2005): 1185-1202.
Estimate and Score
The function `Analyze_NormalApprox()` in {gsm} calculates the adjusted z-score for each site as discussed above. The adjusted z-scores are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below.
Threshold
By default, sites with an adjusted z-score exceeding ±2 or ±3 from the normal approximation analysis are flagged as amber or red, respectively. These thresholds are common choices, corresponding to approximately 95.4% and 99.7% of the data falling within those limits around the mean of a standard normal distribution. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.
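Mapping adjusted z-scores to flags is then a simple thresholding step. The helper below is illustrative, not a {gsm} function, and uses the default cut-offs described above:

```r
# Illustrative flagging rule using the default +/-2 (amber) and +/-3 (red)
# thresholds; not a {gsm} function.
flag_site <- function(z_adj, amber = 2, red = 3) {
  ifelse(abs(z_adj) >= red, "red",
         ifelse(abs(z_adj) >= amber, "amber", "green"))
}

flag_site(c(-3.4, -1.2, 0.4, 2.5))
#> "red" "green" "green" "amber"
```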
Special Situations
- Results are not interpretable, or it is not appropriate to apply the asymptotic method: we don’t want to flag sites when results are not interpretable or when it is not appropriate to apply the asymptotic method due to small sample sizes. The default threshold for the minimum denominator requirement is 30 days of exposure or 3 patients at the site level.
2. The Identity Method
The Identity method simply uses the event count in the numerator of the KRI metric itself as the score. The thresholds for monitoring site risk are then set based on the actual counts.
3. The Fisher’s Exact Method
Introduction
For the binary outcome KRIs, an optional method in {gsm} is implemented with Fisher’s exact test.
Fisher’s exact test is a statistical significance test used in the analysis of contingency tables when we have nominal variables and want to find out whether the proportions for one variable differ across the values of the other variable.
In contrast to large-sample asymptotic statistics, which rely on approximation, Fisher’s exact test can be applied when sample sizes are small.
The function `Analyze_Fisher()` in {gsm} utilizes `stats::fisher.test()` to generate an estimate of the odds ratio as well as a p-value from Fisher’s exact test applied to site-level count data. For each site, the test compares that site to all other sites combined in a 2×2 contingency table. The p-values are then used as a scoring metric in {gsm} to flag possible outliers; by default, sites with p-values less than 0.05 are flagged, with the significance level set at a common choice. The details of the test and its defaults are described under Estimate and Score below.
Methods
For example, in a 2×2 contingency table comparing a particular site to all other sites combined, the two rows displaying the binary outcome are considered repeated Bernoulli random samples with the same probability of success or failure under the null. Given a contingency table,
| Site 1 | Rest of Sites |
|---|---|
| a | c |
| b | d |
Fisher (1922) showed that, conditional on the margins of the table, $a$ is distributed as a hypergeometric distribution with $a + c$ draws from a population with $a + b$ successes and $c + d$ failures. Let $n = a + b + c + d$; the probability of obtaining such a set of values is given by:

$$p = \frac{\binom{a+b}{a} \binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$
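This conditional distribution can be verified directly with `stats::dhyper()` and compared against `stats::fisher.test()`, whose two-sided p-value sums the probabilities of all tables at least as extreme as the one observed. The cell counts and labels below are made up for illustration:

```r
# Probability of the observed table under the hypergeometric distribution,
# conditional on the margins (cell counts are hypothetical).
a <- 5; b <- 20; c <- 30; d <- 245

# dhyper(x, m, n, k): x successes in k draws from m successes and n failures
p_table <- dhyper(a, m = a + b, n = c + d, k = a + c)

# fisher.test() sums the probabilities of all tables at least as extreme,
# so its two-sided p-value always includes p_table
tbl <- matrix(c(a, b, c, d), nrow = 2,
              dimnames = list(c("Event", "No event"),
                              c("Site 1", "Rest of sites")))
fisher.test(tbl)$p.value >= p_table  # TRUE
```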
Estimate and Score
The function `Analyze_Fisher()` in {gsm} utilizes `stats::fisher.test()` to generate an estimate of the odds ratio as well as a p-value using Fisher’s exact test with site-level count data. For each site, Fisher’s exact test is conducted by comparing that site to all other sites combined in a 2×2 contingency table. The p-values are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below. The default in `stats::fisher.test()` uses a two-sided test (equivalent to testing the null hypothesis OR = 1) and does not compute p-values by Monte Carlo simulation unless `simulate.p.value = TRUE` is specified.
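A hedged sketch of the site-versus-all-other-sites comparison; the data frame, column names, and loop below illustrate the idea, not the internal implementation of `Analyze_Fisher()`:

```r
# Illustrative site-vs-rest Fisher's exact test over site-level counts.
sites <- data.frame(
  site   = c("S1", "S2", "S3", "S4"),
  events = c(4, 1, 12, 3),
  n      = c(30, 25, 40, 35)
)

p_values <- vapply(seq_len(nrow(sites)), function(i) {
  # Row 1: site i (events, non-events); row 2: all other sites combined
  tbl <- matrix(c(sites$events[i],       sites$n[i] - sites$events[i],
                  sum(sites$events[-i]), sum(sites$n[-i]) - sum(sites$events[-i])),
                nrow = 2, byrow = TRUE)
  fisher.test(tbl)$p.value
}, numeric(1))

data.frame(site = sites$site, p = round(p_values, 3))
```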
Threshold
By default, sites with p-values less than 0.05 or 0.01 from the Fisher’s exact test analysis are flagged as amber or red, respectively. The thresholds are set based on an empirical p-value approach, where we use the distribution of the p-values to find the best separation of the data to identify sites at risk; the defaults are set at common choices of significance level. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.
Fisher’s exact test assumptions
- The row totals and the column totals are both fixed by design.
- The samples are mutually exclusive and mutually independent.
These assumptions can be assessed from knowledge of the data collected; no formal assumption check is necessary.
Special Situations
- Functionally, where we don’t have the required input to run Fisher’s exact test, the p-value will be set to `NA`.
- Results not interpretable: we don’t want to flag sites when results are not interpretable due to small sample sizes. The default threshold for the minimum denominator requirement is 3 patients at the site level.
An observed zero cell is not an issue when using Fisher’s exact test. However, when an expected cell count is zero, it means either that a margin is zero (meaningless) or that there are structural zeros, which require considering a zero-inflated model (West, L. and Hankin, R. (2008), “Exact Tests for Two-Way Contingency Tables with Structural Zeros,” Journal of Statistical Software, 28(11), 1–19).
Constraints
For small samples, Fisher’s exact test is highly discrete, and it is often considered to be conservative. This may be due to the use of a discrete statistic with fixed significance levels (FET Controversies Wiki).
Although in practice Fisher’s exact test is usually used when sample sizes are small (e.g., cell counts less than 5), it is valid for all sample sizes. However, when sample sizes are large, the computation of the exact test, which evaluates the hypergeometric probability function given the margins, can take a very long time.
4. The Poisson Regression Method
Introduction
For the rate outcome KRIs, an optional method in {gsm} is implemented with Poisson regression.
The Poisson distribution is often used to model count data. If $Y$ is the number of counts following a Poisson distribution, the probability mass function is given by $P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$, where $\lambda$ is the average number of counts and $y = 0, 1, 2, \ldots$.
Methods
This method fits a Poisson model to site-level data and then calculates deviance residuals for each site. The Poisson model is run using standard methods in the stats package by fitting a `glm()` model with family set to `poisson` using a “log” link. Site-level deviance residuals are calculated using `resid()` from `stats::predict.glm()` via `broom::augment()`.
Let $Y_1, \ldots, Y_n$ be independent random variables, with $Y_i$ denoting the number of events observed over an exposure (or time on study) of $t_i$ for the $i$th observation, following a Poisson distribution. Then $E(Y_i) = \mu_i = t_i \lambda_i$. Thus, the log-linear generalized linear model (Poisson regression) is

$$\log(\mu_i) = \log(t_i) + \beta_0 + \beta_1 x_i,$$

where $\log(t_i)$ is an offset term.
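A minimal sketch of this model in base R with simulated site-level data; the exact model formula used inside `Analyze_Poisson()` may differ:

```r
# Illustrative Poisson regression with a log-exposure offset and site-level
# deviance residuals; a sketch, not the exact Analyze_Poisson() code.
site_data <- data.frame(
  exposure = c(120, 340, 80, 500, 210),  # e.g., patient-days per site
  events   = c(3, 10, 1, 30, 5)          # events per site
)

fit <- glm(events ~ offset(log(exposure)),
           family = poisson(link = "log"), data = site_data)

# Deviance residuals per site: large absolute values indicate outlying sites
round(resid(fit, type = "deviance"), 2)
```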
Estimate and Score
The function `Analyze_Poisson()` in {gsm} utilizes `stats::glm()` to generate estimates of fitted values as well as deviance residuals from site-level count data. The deviance residuals are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below.
Threshold
By default, sites with deviance residuals exceeding the amber or red cut-offs from the Poisson analysis are flagged accordingly. The thresholds are set based on an empirical approach, where we use the distribution of the deviance residuals to find the best separation of the data to identify sites at risk; the defaults are empirical values based on pilot studies’ data. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.
Special Situations
- Results are not interpretable, or it is not appropriate to apply the Poisson method: we don’t want to flag sites when results are not interpretable or when it is not appropriate to apply the Poisson method due to small sample sizes. The default threshold for the minimum denominator requirement is 30 days of exposure at the site level.
Poisson regression assumptions
- Independence: the responses $Y_i$ are independent of each other.
- Count data: the responses are non-negative integers (counts).
- Poisson response: each $Y_i$ follows the Poisson distribution as noted above, with mean and variance equal to $\mu_i$.
- Linearity: $\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$, where $x_{i1}, \ldots, x_{ik}$ are independent predictors.
Assumption checks, constraints and model diagnosis
The assumptions on independence and count data can be assessed from knowledge of the data collected.
The assumption on the Poisson response can be checked by plotting a histogram of the data and comparing the empirical mean and variance stratified by the explanatory variable(s). If there is evidence that the mean = variance assumption is violated, oftentimes we observe variance > mean; this is called overdispersion, and it can be checked as sketched below. In this case, the negative binomial distribution provides an alternative, where $\mathrm{Var}(Y) = \mu + \mu^2/k > \mu$.
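One simple overdispersion check divides the Pearson statistic by the residual degrees of freedom; the data below are made up, and `MASS::glm.nb()` is noted only as one standard negative-binomial alternative:

```r
# Illustrative overdispersion check for a Poisson glm; values much
# greater than 1 suggest variance > mean (overdispersion).
counts <- c(0, 2, 1, 7, 0, 12, 3, 1, 9, 0)   # hypothetical event counts
expo   <- c(50, 80, 60, 90, 40, 120, 70, 55, 100, 45)  # exposures

fit <- glm(counts ~ offset(log(expo)), family = poisson)

dispersion <- sum(resid(fit, type = "pearson")^2) / df.residual(fit)
dispersion

# One standard alternative when overdispersion is present:
# fit_nb <- MASS::glm.nb(counts ~ offset(log(expo)))
```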
Diagnostics: goodness-of-fit test (chi-squared) based on the deviance, inspection of deviance residuals, a residuals-vs-fitted plot, and a Q-Q plot.
Other considerations: structural zeros may occur, in contrast to random zeros that arise from sampling from a Poisson distribution. In this case, a mixture model (a zero-inflated Poisson model) may be required.