
Overview

This vignette outlines the statistical methods used to evaluate Key Risk Indicators (KRIs) in {gsm}. KRIs are metrics that allow users to measure pre-defined risks and determine the level of observed risk to data quality and patient safety in a clinical trial. The {gsm} package implements a standardized data pipeline to facilitate KRI analysis. Other vignettes provide an overview of this framework (1, 2, 3, 4, 5), and the statistical methods for this process are described in detail below.

{gsm} calculates KRIs by defining a numerator and a denominator for each metric. By default, {gsm} then calculates z-scores using a normal approximation, with an adjustment for over-dispersion, to assign risk levels.

For KRIs that are percentages (binary outcome), the numerator is the number of events and the denominator is the total number of participants; we then apply the normal approximation of the binomial distribution to determine a risk level.

For KRIs that are rates (count outcome), the numerator is the number of events and the denominator is the total participant exposure or study duration; we then apply the normal approximation of the Poisson distribution to determine a risk level.

Alternative statistical methods to calculate standardized scores are also available in {gsm}, including the Identity, Fisher’s exact, and Poisson regression methods. More details are provided below.

Statistical Methods

1. The Normal Approximation Method

Introduction

This method applies a normal approximation of the binomial distribution to binary-outcome KRIs, or a normal approximation of the Poisson distribution to rate-outcome KRIs, using the sites’ sample sizes or total exposures. Control limits based on the asymptotic normal approximation are constructed to serve as risk thresholds for identifying site-level risks.

Reference: Zink, Richard C., Anastasia Dmitrienko, and Alex Dmitrienko. Rethinking the clinically based thresholds of TransCelerate BioPharma for risk-based monitoring. Therapeutic Innovation & Regulatory Science 52, no. 5 (2018): 560-571.

Methods

Binary

Consider the problem of monitoring KRIs with binary outcomes, such as protocol deviation or discontinuation from the study, across multiple sites in a clinical trial. Assume that there are $m$ sites with $n_i$ patients at the $i$th site, $i = 1, 2, \dots, m$. Denote the total number of patients in the study by $n=\sum_{i=1}^m n_i$. Let $X_{ij}$ signify the outcome of interest for the $j$th patient at the $i$th site, where $X_{ij}=1$ indicates that an event has occurred and $X_{ij}=0$ indicates that an event has not occurred. Finally, let $p_i$ denote the site-level proportion at the $i$th site. Monitoring tools focus on testing the null hypothesis of consistency of the true site-level proportion across multiple sites. Specifically, the null hypothesis states that the site-level proportion of the binary outcome is constant across the sites, that is, $H_0: p_1 = \dots = p_m = p$, where $p$ is the common proportion. This common proportion can be estimated as $\hat{p} = \frac{1}{n}\sum_{i=1}^m\sum_{j=1}^{n_i}X_{ij}$.

The control limits are computed using confidence limits based on an asymptotic normal approximation. A 95% confidence interval is obtained when the significance level is $\alpha=0.05$. Let $X_i=\sum_{j=1}^{n_i}X_{ij}$ represent the total number of events at the $i$th site and let $\hat{p}_i=X_i/n_i$ denote the estimated event rate at that site. The asymptotic $100(1-\alpha)\%$ confidence interval for $p_i$ is given by
$$\hat{p}_i-z_{1-\alpha/2}\sqrt{\frac{\hat{p}_i(1-\hat{p}_i)}{n_i}} \leq p_i \leq \hat{p}_i+z_{1-\alpha/2}\sqrt{\frac{\hat{p}_i(1-\hat{p}_i)}{n_i}},$$
where $z_{1-\alpha/2}$ is the upper percentile of the standard normal distribution. To construct the control limits for the observed event rate at a site, the estimated event rate is set equal to the overall event rate $\hat{p}$. This means that the lower ($l$) and upper ($u$) asymptotic control limits for the $i$th site are defined as $l_i=\hat{p}-z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n_i}}$ and $u_i=\hat{p}+z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n_i}}$, respectively. Asymptotic control limits may not be reliable in smaller clinical trials, so exact limits for an event rate may be preferable.
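To make the construction concrete, below is a minimal sketch in R of the binary control limits, using made-up site counts rather than {gsm}’s internal code:

```r
# A minimal sketch (not {gsm} internals): asymptotic control limits for a
# binary KRI. x_i = event counts, n_i = participant counts (made-up data).
x_i <- c(5, 12, 2, 30, 9)
n_i <- c(40, 95, 25, 210, 70)

p_hat <- sum(x_i) / sum(n_i)      # common proportion p-hat under H0
alpha <- 0.05
z     <- qnorm(1 - alpha / 2)     # z_{1 - alpha/2}

se_i    <- sqrt(p_hat * (1 - p_hat) / n_i)
lower_i <- p_hat - z * se_i       # l_i
upper_i <- p_hat + z * se_i       # u_i

# A site falls outside the 95% control limits if its observed event rate
# x_i / n_i is below lower_i or above upper_i.
outside <- (x_i / n_i) < lower_i | (x_i / n_i) > upper_i
```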

Rate

Assume that the number of events up to time $t$ follows a Poisson distribution with mean $\lambda t$, where $\lambda$ is the event rate per unit of time. For the $i$th site with $X_i=\sum_{j=1}^{N_i}X_{ij}$ events and $T_i=\sum_{j=1}^{N_i}t_{ij}$ exposure, define the exposure-adjusted incidence rate (EAIR) as $\hat{\lambda}_i=\frac{X_i}{T_i}$. Across all sites, define $X=\sum_{i=1}^{m}X_i$ and $T=\sum_{i=1}^{m}T_i$, with $\hat{\lambda}=\frac{X}{T}$. Under a normal approximation, the $100(1-\alpha)\%$ confidence interval for the $i$th site is
$$\hat{\lambda}_i-z_{1-\alpha/2}\sqrt{\frac{\hat{\lambda}_i}{T_i}} \leq \lambda_i \leq \hat{\lambda}_i+z_{1-\alpha/2}\sqrt{\frac{\hat{\lambda}_i}{T_i}}.$$
For funnel plots accounting for exposure, the x-axis representing the site sample size ($n$) in the binary case is replaced by the total exposure time $T$. To develop a funnel plot, fix $\hat{\lambda}_i=\hat{\lambda}$ and vary $T$ from $\min(T_i)$ to $\max(T_i)$ to compute the control limits. As an area of future research, the work of Chan and Wang (2009) may suggest methods appropriate for computing an exact confidence interval for the EAIR. Finally, similar methods can be applied for a count-type endpoint $X_{ij}$, where $t_{ij}$ denotes the time on study for the $j$th patient at the $i$th site.
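A similar sketch for a rate KRI, again with illustrative data, computes the EAIR control limits and the funnel-plot curves over the observed exposure range:

```r
# A minimal sketch (made-up data): EAIR control limits and funnel-plot
# curves for a rate KRI. x_i = event counts, t_i = site exposures.
x_i <- c(3, 14, 6, 25)
t_i <- c(120, 480, 200, 900)

lambda_hat <- sum(x_i) / sum(t_i)   # pooled EAIR under H0
z <- qnorm(0.975)

lower_i <- lambda_hat - z * sqrt(lambda_hat / t_i)
upper_i <- lambda_hat + z * sqrt(lambda_hat / t_i)

# Funnel-plot curves: sweep exposure T over its observed range; the limits
# narrow around lambda_hat as exposure grows.
T_grid <- seq(min(t_i), max(t_i), length.out = 200)
funnel <- data.frame(
  T     = T_grid,
  lower = lambda_hat - z * sqrt(lambda_hat / T_grid),
  upper = lambda_hat + z * sqrt(lambda_hat / T_grid)
)
```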

KRI Metric and Z-score

The KRI metric, along with a KRI score, is created for each site to measure the level of observed risk to data quality and patient safety in a clinical trial. For scoring purposes, z-scores from the normal approximation are calculated as $z_i=\frac{y_i-\theta_0}{\sqrt{V(Y|\theta_0)}}$ for site $i$, where $y_i$ is the KRI metric calculated for site $i$, $\theta_0$ is the overall mean, and $\sqrt{V(Y|\theta_0)}$ is the standard deviation of the metric under the null hypothesis.

For a binary outcome, $\sqrt{V(Y|\theta_0)}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_i}}$.

For a rate outcome, $\sqrt{V(Y|\theta_0)}=\sqrt{\frac{\hat{\lambda}}{T_i}}$.
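As a brief illustration, these z-scores can be computed directly from site-level counts (made-up data; {gsm} performs this internally):

```r
# A minimal sketch (made-up data): unadjusted z-scores for a binary and a
# rate KRI, plugging the pooled estimates into the formulas above.
x_i <- c(5, 12, 2, 30, 9)          # events per site
n_i <- c(40, 95, 25, 210, 70)      # participants per site (binary case)
t_i <- c(150, 400, 90, 800, 300)   # exposures per site (rate case)

p_hat      <- sum(x_i) / sum(n_i)
lambda_hat <- sum(x_i) / sum(t_i)

z_binary <- (x_i / n_i - p_hat)      / sqrt(p_hat * (1 - p_hat) / n_i)
z_rate   <- (x_i / t_i - lambda_hat) / sqrt(lambda_hat / t_i)
```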

Over-dispersion adjustment

The standard normal approximation method described above assumes that the null distribution fully captures the variability of the in-control sites, but in many situations this assumption will not hold. When there is greater variability than expected, a majority of the sites will fall outside the specified limits, casting doubt on the appropriateness of the constructed limits.

One way to handle this issue is to allow for over-dispersion in the normal approximation. A multiplicative over-dispersion adjustment is implemented in our approach.

Suppose a sample of $m$ units is assumed to be in-control. The over-dispersion factor $\phi$ can then be estimated as the mean of the squared z-scores, i.e., $\hat\phi = \frac{1}{m}\sum_{i=1}^m z_i^2$. For a binary outcome, the over-dispersion-adjusted variance is $V'(Y_i|\phi, p)=\phi\frac{p(1-p)}{n_i}$. For a rate outcome, the over-dispersion-adjusted variance is $V'(Y_i|\phi, \lambda)=\phi\frac{\lambda}{T_i}$. Therefore, after the over-dispersion adjustment, the adjusted z-scores for site $i$ are $z_i = \frac{\hat{p}_i - \hat{p}}{\sqrt{\hat\phi\,\frac{\hat{p}(1-\hat{p})}{n_i}}}$ and $z_i = \frac{\hat{\lambda}_i - \hat{\lambda}}{\sqrt{\hat\phi\,\frac{\hat\lambda}{T_i}}}$, respectively.
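The adjustment amounts to inflating the null variance by $\hat\phi$, or equivalently dividing each z-score by $\sqrt{\hat\phi}$. A minimal sketch for the binary case, using illustrative counts:

```r
# A minimal sketch (made-up data) of the multiplicative over-dispersion
# adjustment for a binary KRI.
x_i <- c(5, 12, 2, 30, 9)
n_i <- c(40, 95, 25, 210, 70)
p_hat <- sum(x_i) / sum(n_i)

z_i     <- (x_i / n_i - p_hat) / sqrt(p_hat * (1 - p_hat) / n_i)
phi_hat <- mean(z_i^2)                       # over-dispersion factor
z_adj   <- (x_i / n_i - p_hat) /
  sqrt(phi_hat * p_hat * (1 - p_hat) / n_i)  # adjusted z-scores

all.equal(z_adj, z_i / sqrt(phi_hat))        # TRUE: same adjustment
```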

Reference: Spiegelhalter, David J. Funnel plots for comparing institutional performance. Statistics in Medicine 24, no. 8 (2005): 1185-1202.

Estimate and Score

The function Analyze_NormalApprox() in {gsm} calculates the adjusted z-score for each site as discussed above. The adjusted z-scores are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below.

Threshold

By default, sites with adjusted z-scores exceeding $\pm 2$ or $\pm 3$ from the normal approximation analysis are flagged as amber or red, respectively. These thresholds are common choices, corresponding to approximately 95.4% and 99.7% of the data falling within the limits in a standard normal distribution. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.
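As an illustration of the default cutoffs (the adjusted z-scores and flag labels below are made up, not {gsm}’s exact output):

```r
# A minimal sketch of the default amber/red cutoffs on adjusted z-scores.
z_adj <- c(-0.4, 2.3, -3.6, 1.1, 0.2)
flag <- cut(
  abs(z_adj),
  breaks = c(0, 2, 3, Inf),
  labels = c("green", "amber", "red"),
  right  = FALSE
)
flag  # green amber red green green
```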

Special Situations

  1. Results are not interpretable, or it is not appropriate to apply the asymptotic method: we do not want to flag sites when results are not interpretable, or when it is not appropriate to apply the asymptotic method, due to small sample sizes. The default minimum denominator requirement is 30 days of exposure or 3 patients at the site level, as sketched below.
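A minimal sketch of this check, with hypothetical column names rather than {gsm}’s actual interface:

```r
# A minimal sketch of the minimum-denominator check (column names are
# hypothetical, not {gsm}'s actual interface).
site_kri <- data.frame(
  GroupID     = c("S1", "S2", "S3"),
  Numerator   = c(1, 9, 0),
  Denominator = c(2, 60, 45)      # e.g., patient counts
)
min_denominator <- 3              # default for a binary KRI

# Scoring and flagging are applied only to evaluable sites; the rest are
# left unflagged rather than risking a misleading flag.
evaluable <- site_kri$Denominator >= min_denominator
site_kri[evaluable, ]
```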

Recommendation

The normal approximation method can be used in all scenarios with binary or rate KRIs.

2. The Identity Method

The Identity method simply uses the event count from the numerator of the KRI metric as the score. The thresholds for monitoring site risk are then set based on the actual counts.

3. The Fisher’s Exact Method

Introduction

For the binary outcome KRIs, an optional method in {gsm} is implemented with Fisher’s exact test.

Fisher’s exact test is a statistical significance test used in the analysis of contingency tables when we have nominal variables and want to determine whether the proportions for one variable differ across levels of the other variable.

In contrast to large-sample asymptotic statistics, which rely on approximation, Fisher’s exact test can be applied even when sample sizes are small.


Methods

For example, in a $2 \times 2$ contingency table comparing a particular site to all other sites combined, the binary outcomes in the two groups are considered repeated Bernoulli random samples with the same probability of success under the null hypothesis. Given a $2 \times 2$ contingency table,

             Site 1   All Other Sites
  Event        a            c
  No Event     b            d

Fisher (1922) showed that, conditional on the margins of the table, $a$ follows a hypergeometric distribution with $a+c$ draws from a population with $a+b$ successes and $c+d$ failures. Letting $n=a+b+c+d$, the probability of obtaining such a set of values is given by:

$$p=\frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}}=\frac{\binom{a+b}{b}\binom{c+d}{d}}{\binom{n}{b+d}}=\frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}.$$
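The identity above can be verified numerically: the sketch below compares R’s built-in hypergeometric mass function against the factorial form for a small made-up table:

```r
# A small numeric check (made-up counts): the conditional probability of
# the table from R's hypergeometric mass function equals the factorial form.
a <- 3; b <- 7; c <- 5; d <- 15
n <- a + b + c + d

# a ~ hypergeometric: a + c draws, a + b successes, c + d failures
p_hyper <- dhyper(a, m = a + b, n = c + d, k = a + c)

p_factorial <- (factorial(a + b) * factorial(c + d) *
                factorial(a + c) * factorial(b + d)) /
  (factorial(a) * factorial(b) * factorial(c) * factorial(d) * factorial(n))

all.equal(p_hyper, p_factorial)  # TRUE
```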

Estimate and Score

The function Analyze_Fisher() in {gsm} utilizes stats::fisher.test() to generate an estimate of the odds ratio as well as a p-value from Fisher’s exact test on site-level count data. For each site, Fisher’s exact test is conducted by comparing that site to all other sites combined in a $2 \times 2$ contingency table. The p-values are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below. By default, stats::fisher.test() performs a two-sided test (equivalent to testing the null hypothesis OR = 1) and does not compute p-values by Monte Carlo simulation unless simulate.p.value = TRUE is specified.
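The per-site comparison can be sketched directly with base R (illustrative counts; Analyze_Fisher() encapsulates this logic in {gsm}):

```r
# A minimal sketch of the per-site comparison with base R (made-up counts).
events <- c(3, 11, 0, 7)     # events per site
n      <- c(25, 90, 18, 60)  # participants per site

p_values <- vapply(seq_along(events), function(i) {
  tbl <- matrix(
    c(events[i],       n[i] - events[i],               # site i
      sum(events[-i]), sum(n[-i]) - sum(events[-i])),  # all other sites
    nrow = 2, byrow = TRUE
  )
  stats::fisher.test(tbl)$p.value  # two-sided by default
}, numeric(1))
```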

Threshold

By default, sites with p-values less than 0.05 or 0.01 from the Fisher’s exact test analysis are flagged as amber or red, respectively. The thresholds follow an empirical p-value approach, in which the distribution of the p-values is used to find the best separation of the data for identifying sites at risk. The default thresholds are common choices of significance levels. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.

Fisher’s exact test assumptions

  1. The row totals and the column totals are both fixed by design.

  2. The samples are mutually exclusive and mutually independent.

These assumptions can be assessed from knowledge of how the data were collected; no formal assumption check is necessary.

Special situations

  1. Functional: when the required input to run Fisher’s exact test is not available, the p-value is set to NA.

  2. Results not interpretable: we do not want to flag sites when results are not interpretable due to small sample sizes. The default minimum denominator requirement is 3 patients at the site level.

  3. An observed zero cell is not an issue when using Fisher’s exact test. However, when an expected cell is zero, it means either that a margin is zero (meaningless) or that there are structural zeros, which require consideration of zero inflation (West, L. and Hankin, R. (2008), “Exact Tests for Two-Way Contingency Tables with Structural Zeros,” Journal of Statistical Software, 28(11), 1-19).

Constraints

For small samples, Fisher’s exact test is highly discrete. It is often considered conservative, which may be due to the use of a discrete statistic with fixed significance levels (FET Controversies Wiki).

Although in practice Fisher’s exact test is usually used when sample sizes are small (e.g., n < 5), it is valid for all sample sizes. However, when sample sizes are large, evaluating the hypergeometric probability function given the margins can take a very long time.

Recommendation

Fisher’s exact test can be used in all scenarios with binary KRIs.

4. The Poisson Regression Method

Introduction

For the rate outcome KRIs, an optional method in {gsm} is implemented with Poisson regression.

The Poisson distribution is often used to model count data. If $Y$ is a count following a Poisson distribution, the probability mass function is given by $$f(y)=\frac{\mu^y e^{-\mu}}{y!},$$ where $\mu$ is the average number of counts and $E(Y)=Var(Y)=\mu$.

Methods

This method fits a Poisson model to site-level data and then calculates deviance residuals for each site. The Poisson model is fit using standard methods in the stats package: a glm() model with family set to poisson using a “log” link. Site-level deviance residuals are obtained from the fitted model via broom::augment().

Let $Y_1, \dots, Y_N$ be independent random variables with $Y_i \sim Poisson(\mu_i)$ denoting the number of events observed out of exposure $n_i$ for the $i$th observation. Then $E(Y_i)=\mu_i=n_i e^{x_i\beta}$. Thus, the log-linear generalized linear model (Poisson regression) is $$\log{\mu_i}=\log{n_i}+x_i\beta, \quad Y_i \sim Poisson(\mu_i),$$

where $\log{n_i}$ is an offset term.
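The model can be sketched with stats::glm() using a log-exposure offset; the data below are simulated and the variable names are illustrative only:

```r
# A minimal sketch (simulated data): a Poisson GLM with a log link and a
# log-exposure offset, as in the model statement above.
set.seed(1)
exposure <- runif(10, 50, 500)                   # e.g., participant-days
events   <- rpois(10, lambda = 0.02 * exposure)  # common underlying rate

fit <- stats::glm(events ~ 1 + offset(log(exposure)),
                  family = poisson(link = "log"))
coef(fit)  # intercept estimates log(lambda); here log(0.02) is about -3.9
```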

Estimate and Score

The function Analyze_Poisson() in {gsm} utilizes stats::glm() to generate fitted values as well as deviance residuals from site-level count data. The deviance residuals are then used as a scoring metric in {gsm} to flag possible outliers using the thresholds discussed below.
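A sketch of the residual extraction with base stats (broom::augment() exposes the same deviance residuals in its .resid column by default):

```r
# A minimal sketch (simulated data): extracting site-level deviance
# residuals from the fitted Poisson model with base stats.
set.seed(1)
exposure <- runif(10, 50, 500)
events   <- rpois(10, lambda = 0.02 * exposure)

fit <- stats::glm(events ~ 1 + offset(log(exposure)),
                  family = poisson(link = "log"))

dev_resid <- stats::residuals(fit, type = "deviance")
flagged   <- abs(dev_resid) > 5   # default amber threshold (see below)
```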

Threshold

By default, sites with deviance residuals exceeding $\pm 5$ or $\pm 7$ from the Poisson analysis are flagged as amber or red, respectively. The thresholds follow an empirical approach, in which the distribution of the deviance residuals is used to find the best separation of the data for identifying sites at risk. The default thresholds are empirical values based on pilot studies’ data. However, they are fully configurable in the package and can be customized and specified in the {gsm} functions.

Special Situations

  1. Results are not interpretable, or it is not appropriate to apply the Poisson method: we do not want to flag sites when results are not interpretable, or when it is not appropriate to apply the Poisson method, due to small sample sizes. The default minimum denominator requirement is 30 days of exposure at the site level.

Poisson regression assumptions

  1. Independence: the responses $y_i$ are independent of each other.

  2. Count data: the responses $y_i$ are non-negative integers (counts).

  3. Poisson response: each $Y_i$ follows the Poisson distribution as noted above, with mean and variance equal to $\mu_i$.

  4. Linearity: $\log{\mu_i}=\log{n_i}+x_i\beta$, where $x_i$ are independent predictors.

Assumption checks, constraints and model diagnosis

  1. The assumptions of independence and count data can be assessed from knowledge of how the data were collected.

  2. The Poisson response assumption can be checked by plotting a histogram of the data and comparing the empirical mean and variance, stratified by the explanatory variable(s). If there is evidence that the mean = variance assumption is violated, we often observe variance > mean; this is called over-dispersion. In this case, the negative binomial distribution provides an alternative where $Var(Y_i)=\phi E(Y_i)$ (see the dispersion-check sketch after this list).

  3. Diagnostics: goodness-of-fit test (chi-squared), deviance residuals, residuals-vs-fitted plot, and Q-Q plot.

  4. Other considerations: structural zeros may occur, in contrast to random zeros arising from sampling from a Poisson distribution. In this case, a mixture model (zero-inflated Poisson model) may be required.
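As a quick check for over-dispersion (item 2 above), the following sketch with simulated data compares the Pearson chi-squared statistic to the residual degrees of freedom:

```r
# A quick over-dispersion check (simulated data): the ratio of the Pearson
# chi-squared statistic to the residual degrees of freedom should be near 1
# if the Poisson mean-variance assumption holds.
set.seed(2)
exposure <- runif(20, 50, 500)
events   <- rpois(20, lambda = 0.02 * exposure)

fit <- stats::glm(events ~ 1 + offset(log(exposure)),
                  family = poisson(link = "log"))

dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion  # values well above 1 suggest over-dispersion

# One alternative in that case is a negative binomial model, e.g.,
# MASS::glm.nb(events ~ 1 + offset(log(exposure))).
```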

Recommendation

Use this method when the Poisson assumptions hold.