Introduction
The {gsm.app} package provides interactive dashboards for exploring Good Statistical Monitoring (GSM) Key Risk Indicator (KRI) assessments in clinical trials. A critical component of this exploration is the ability to drill down from high-level summary statistics to the underlying source data that drives those statistics.
This vignette explains why data fetching is essential for the generated GSM app and provides step-by-step guidance on implementing data ingestion functions that respond to the app’s current state. See vignette("data-preparation") for advice on preparing data for use with {gsm.app}.
Why Does the App Need to Fetch Data?
The GSM app operates on two levels:
- Summary Level: Displays KRI results, statistical bounds, and group-level metrics.
- Detail Level: Shows the underlying domain-specific data that generated those summaries.
When a user identifies a site with concerning metrics (e.g., high adverse event rate), they need to examine the specific records that contributed to that signal. The data fetching functionality enables this drill-down capability by:
- Filtering data based on the current app selections (domain, group, subject).
- Returning only relevant records for detailed review.
- Maintaining performance by avoiding loading unnecessary data.
- Providing real-time access to source data during investigation.
Performance and Memory Considerations
You might wonder why we don’t simply load all domain data when the app initializes. The dynamic data fetching approach offers several critical advantages:
Memory Efficiency: Clinical trial datasets can be enormous. A large Phase III study might have:
- Millions of adverse event records.
- Hundreds of thousands of laboratory measurements.
- Extensive demographics and medical history data.
- Multiple gigabytes of total domain data.
Loading all of this data into memory would consume substantial RAM and potentially crash the application or make it unusable on standard hardware.
App Startup Performance: Reading and loading large datasets takes time. By fetching data on-demand, the app starts quickly and users can begin exploring summary-level results immediately, rather than waiting for all domain data to load.
Network Efficiency: In database-connected environments, loading all domain data would require massive data transfers. On-demand fetching transfers only the specific records needed for the current investigation.
Scalability: As studies grow in size or when monitoring multiple studies simultaneously, the dynamic approach scales naturally without increasing the base memory footprint.
User Experience: Most app usage focuses on summary-level exploration. Users typically drill down to detailed data for only a small subset of sites or subjects, making it wasteful to preload data that may never be viewed.
The trade-off is a small delay when users first request detailed data for a specific domain/group combination, but this is much faster than the alternative of loading everything upfront.
Step 1: Manual Data Fetching Example
Let’s start with a concrete scenario. An investigator has identified that Site 101 is reporting adverse events at a rate significantly higher than other sites. They want to review the specific adverse event records for this site to understand what’s driving the signal.
Setting Up the Data Location
First, we’ll define where our domain data files are stored:
# Define the base path where domain data files are stored
data_path <- "C:/clinical_trial_data/study_ABC123"
# List the available domain files
# Typically you would have files like:
# - AE.csv (Adverse Events)
# - LB.csv (Laboratory Data)
# - DM.csv (Demographics)
# - VS.csv (Vital Signs)
# - etc.
Manual Data Retrieval
Now let’s manually fetch the adverse events data for Site 101:
# Load dplyr for the %>% pipe used below
library(dplyr)
# Construct the file path for adverse events data
ae_file_path <- file.path(data_path, "AE.csv")
# Read the adverse events data
dfAE <- read.csv(ae_file_path, stringsAsFactors = FALSE)
# Check the structure of the data
head(dfAE)
# SubjectID SiteID AETERM AESEV AEREL AESTDT
# 1 ABC-001 101 Nausea Mild Possible 2023-01-15
# 2 ABC-002 101 Headache Moderate Unlikely 2023-01-16
# 3 ABC-003 102 Dizziness Mild Probable 2023-01-17
# 4 ABC-004 101 Fatigue Severe Definite 2023-01-18
# Filter for Site 101 specifically
dfAE_site101 <- dfAE %>%
dplyr::filter(SiteID == "101")
# Review the results
nrow(dfAE_site101) # Number of AE records for Site 101
head(dfAE_site101) # First few records for review
This manual approach works for a one-time investigation, but the GSM app needs a dynamic solution that can respond to user selections in real-time.
Step 2: Understanding App State
The GSM app maintains several state variables that determine what data should be displayed:
- strDomain: The currently selected data domain (e.g., "AE", "LB", "DM").
- strGroupLevel: The level of grouping (e.g., "Site", "Country", "Study").
- strGroupID: The specific group identifier (e.g., "101", "USA"; optional, for group-level drill-down).
- strSubjectID: The specific subject identifier (optional, for subject-level drill-down).
- dSnapshotDate: The date of the active snapshot (currently always the most recent snapshot in the summary data).
When a user selects a site in the app interface, these state variables are automatically updated. For example:
- User selects Site 101 in the adverse events summary → strDomain = "AE", strGroupLevel = "Site", strGroupID = "101".
- User then selects Subject ABC-004 → strSubjectID = "ABC-004" is added.
- User switches to the lab data tab → strDomain = "LB" while the other selections remain.
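To make the mapping concrete, here is a sketch of the calls the app issues for this sequence of selections, using the fnFetchData() function we define in the next step:

# Site 101 selected in the AE summary
fnFetchData(strDomain = "AE", strGroupLevel = "Site", strGroupID = "101")
# Subject ABC-004 then selected
fnFetchData(strDomain = "AE", strGroupLevel = "Site", strGroupID = "101",
            strSubjectID = "ABC-004")
# Lab data tab selected, with the same site and subject still active
fnFetchData(strDomain = "LB", strGroupLevel = "Site", strGroupID = "101",
            strSubjectID = "ABC-004")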
Step 3: Creating a Basic Data Fetching Function
Now let’s create a function that can respond to the app’s state. This function will be passed to run_gsm_app() via the fnFetchData parameter.
# Basic data fetching function that responds to app state
fnFetchData <- function(
  strDomain,
  strGroupLevel = "Site",
  strGroupID = NULL,
  strSubjectID = NULL,
  dSnapshotDate = NULL
) {
  # Define base path to data files
  data_path <- "C:/clinical_trial_data/study_ABC123"

  # Construct file path based on the requested domain
  file_path <- file.path(data_path, paste0(strDomain, ".csv"))

  # Check if the file exists
  if (!file.exists(file_path)) {
    # Error messages are passed on to app users for better bug reports. The app
    # continues to function (other than the requested domain).
    stop("Domain data file not found: ", file_path)
  }

  # Read the domain data
  dfDomain <- read.csv(file_path, stringsAsFactors = FALSE)

  # Apply group-level filtering based on app state
  if (!is.null(strGroupID)) {
    # Construct the filter column name (e.g., "SiteID", "CountryID")
    group_col <- paste0(strGroupLevel, "ID")
    if (group_col %in% names(dfDomain)) {
      dfDomain <- dfDomain %>%
        dplyr::filter(.data[[group_col]] == strGroupID)
    }
  }

  # Apply subject-level filtering if specified
  if (!is.null(strSubjectID) && "SubjectID" %in% names(dfDomain)) {
    dfDomain <- dfDomain %>%
      dplyr::filter(.data$SubjectID == strSubjectID)
  }

  return(dfDomain)
}
How the Function Responds to App State
When the app calls this function, it automatically passes the current state:
- Domain Selection: When the user is viewing the AE tab → strDomain = "AE".
- Group Selection: When the user selects Site 101 → strGroupLevel = "Site", strGroupID = "101".
- Subject Selection: When the user selects a specific subject → strSubjectID = "ABC-004".
The function uses these parameters to:
- Load the correct domain file (AE.csv).
- Filter to the selected group (Site 101).
- Further filter to the selected subject if specified.
Step 4: Enhanced Function with Error Handling
Let’s improve our function with better error handling and validation:
fnFetchData <- function(
  strDomain,
  strGroupLevel = "Site",
  strGroupID = NULL,
  strSubjectID = NULL,
  dSnapshotDate = NULL
) {
  # Define data configuration
  data_path <- "C:/clinical_trial_data/study_ABC123"

  # Define valid domains and their file mappings
  valid_domains <- c(
    "AE" = "AE.csv",
    "LB" = "LB.csv",
    "DM" = "DM.csv",
    "VS" = "VS.csv",
    "QUERY" = "QUERY.csv",
    "ENROLL" = "ENROLL.csv"
  )

  # Validate domain
  if (!strDomain %in% names(valid_domains)) {
    stop("Invalid domain: ", strDomain)
  }

  # Construct file path
  file_name <- valid_domains[[strDomain]]
  file_path <- file.path(data_path, file_name)

  # Read data with error handling
  dfDomain <- tryCatch({
    read.csv(file_path, stringsAsFactors = FALSE)
  }, error = function(e) {
    stop("Error reading file: ", file_path, "\n", e$message)
  })

  # If there are no rows, return early (preserving the column structure)
  if (nrow(dfDomain) == 0) {
    return(dfDomain)
  }

  # Apply group-level filtering
  if (!is.null(strGroupID) && !is.null(strGroupLevel)) {
    # Handle different group level column naming conventions
    possible_group_cols <- c(
      paste0(strGroupLevel, "ID"),          # e.g., "SiteID"
      strGroupLevel,                        # e.g., "Site"
      toupper(paste0(strGroupLevel, "ID")), # e.g., "SITEID"
      toupper(strGroupLevel)                # e.g., "SITE"
    )

    # Find which column exists in the data (NA if none do)
    group_col <- intersect(possible_group_cols, names(dfDomain))[1]

    if (!is.na(group_col)) {
      dfDomain <- dfDomain %>%
        dplyr::filter(.data[[group_col]] == strGroupID)
    } else {
      warning("Group level column not found for ", strGroupLevel)
    }
  }

  # Apply subject-level filtering
  if (!is.null(strSubjectID)) {
    # Handle different subject ID column naming conventions
    possible_subj_cols <- c("SubjectID", "SUBJID", "USUBJID", "subjectID")
    subj_col <- intersect(possible_subj_cols, names(dfDomain))[1]

    if (!is.na(subj_col)) {
      dfDomain <- dfDomain %>%
        dplyr::filter(.data[[subj_col]] == strSubjectID)
    } else {
      warning("Subject ID column not found in data")
    }
  }

  return(dfDomain)
}
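A quick interactive check of the new validation, again assuming the Step 1 files exist:

# An unknown domain now fails fast with a clear message
try(fnFetchData("XX"))
#> Error in fnFetchData("XX") : Invalid domain: XX

# A valid request still returns the filtered records
dfAE <- fnFetchData("AE", strGroupLevel = "Site", strGroupID = "101")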
Step 5: Passing the Function to the App
Once you’ve defined your data fetching function, you pass it to the GSM app when launching (see vignette("data-preparation") for guidance on the other arguments):
# Launch the GSM app with your custom data fetching function
run_gsm_app(
  dfAnalyticsInput = dfAnalyticsInput, # Your prepared analytics data
  dfBounds = dfBounds,                 # Statistical bounds
  dfGroups = dfGroups,                 # Group metadata
  dfMetrics = dfMetrics,               # Metric definitions
  dfResults = dfResults,               # KRI results
  fnFetchData = fnFetchData,           # Your custom data fetching function
  fnCountData = function(...) {        # Optional: data counting function
    nrow(fnFetchData(...))
  }
)
How the App Uses Your Function
When the app is running:
- Initial Load: App displays summary KRI results using the prepared data frames.
- User Interaction: User selects Site 101 in the AE summary chart.
- State Update: App updates internal state: strDomain = "AE", strGroupLevel = "Site", strGroupID = "101".
- Function Call: App calls fnFetchData("AE", "Site", "101", NULL, dSnapshotDate = max(dfAnalyticsInput$SnapshotDate)).
- Data Display: Your function returns filtered AE records, which the app displays in a data table.
- Further Drill-down: User selects a specific subject. If the app already has site-level data that includes this subject, it filters that data; otherwise it calls fnFetchData() with the subject ID in addition to the other parameters (see the sketch below).
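For illustration, the subject-level call in that final step would look like the following (the app supplies the snapshot date itself):

fnFetchData(
  strDomain = "AE",
  strGroupLevel = "Site",
  strGroupID = "101",
  strSubjectID = "ABC-004",
  dSnapshotDate = max(dfAnalyticsInput$SnapshotDate)
)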
Step 6: Advanced Data Fetching Considerations
For production deployments, consider these enhancements:
Database Connectivity
fnFetchData_DB <- function(
  strDomain,
  strGroupLevel = "Site",
  strGroupID = NULL,
  strSubjectID = NULL,
  dSnapshotDate = NULL
) {
  # Connect to your database. Read credentials from the environment (or a
  # secrets manager) rather than hard-coding them in the app.
  con <- DBI::dbConnect(
    RPostgres::Postgres(),
    host = "your-db-host",
    dbname = "clinical_trial_db",
    user = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASSWORD")
  )
  # Close the connection even if the query fails
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Build a dynamic SQL query based on app state, using bound parameters
  # ($1, $2, ...) rather than pasted strings to prevent SQL injection
  base_query <- paste("SELECT * FROM", tolower(strDomain))
  where_clauses <- c()
  params <- list()

  if (!is.null(strGroupID)) {
    group_col <- paste0(tolower(strGroupLevel), "_id")
    where_clauses <- c(where_clauses, paste0(group_col, " = $", length(params) + 1))
    params <- c(params, list(strGroupID))
  }

  if (!is.null(strSubjectID)) {
    where_clauses <- c(where_clauses, paste0("subject_id = $", length(params) + 1))
    params <- c(params, list(strSubjectID))
  }

  if (length(where_clauses) > 0) {
    query <- paste(base_query, "WHERE", paste(where_clauses, collapse = " AND "))
  } else {
    query <- base_query
  }

  # Execute the query and return the results
  if (length(params) > 0) {
    DBI::dbGetQuery(con, query, params = params)
  } else {
    DBI::dbGetQuery(con, query)
  }
}
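Note that opening a fresh connection on every call adds latency. For multi-user deployments, consider creating a single long-lived connection (or a connection pool, e.g., with the {pool} package) when the app starts and reusing it inside the function.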
Performance Optimization
For large datasets, consider:
- Lazy loading strategies.
- Pagination for large result sets.
- Caching frequently accessed data.
- Indexing on commonly filtered columns.
# Example with caching and pagination. Cached results live in a dedicated
# environment rather than the global environment.
cache_env <- new.env(parent = emptyenv())

fnFetchData_Optimized <- function(
  strDomain,
  strGroupLevel = "Site",
  strGroupID = NULL,
  strSubjectID = NULL,
  dSnapshotDate = NULL,
  nMaxRows = 10000
) {
  # Create a cache key based on the parameters. NULLs are encoded explicitly
  # so that different parameter combinations can never collide.
  key_part <- function(x) if (is.null(x)) "<all>" else as.character(x)
  cache_key <- paste(
    key_part(strDomain), key_part(strGroupLevel), key_part(strGroupID),
    key_part(strSubjectID), key_part(dSnapshotDate),
    sep = "|"
  )

  # Return the cached result if we have one
  if (exists(cache_key, envir = cache_env, inherits = FALSE)) {
    return(get(cache_key, envir = cache_env))
  }

  # Fetch data using your preferred method (your_data_fetch_logic() is a
  # placeholder for that logic)
  dfDomain <- your_data_fetch_logic(strDomain, strGroupLevel, strGroupID, strSubjectID)

  # Apply row limit for performance
  if (nrow(dfDomain) > nMaxRows) {
    warning("Data truncated to ", nMaxRows, " rows for performance")
    dfDomain <- dfDomain[seq_len(nMaxRows), ]
  }

  # Cache the result
  assign(cache_key, dfDomain, envir = cache_env)
  dfDomain
}
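Because the snapshot date is part of the cache key, a new snapshot automatically bypasses stale entries. If your source data can change within a snapshot, clear cache_env (or add a refresh timestamp to the key) when the data updates.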
Data Counting Function (fnCountData)
The Domain Summary tab in gsm.app provides information on the number of records in each domain. To construct these counts, {gsm.app} uses a special fnCountData function. fnCountData takes the same arguments as fnFetchData, but returns a single integer instead of a data.frame of results.
By default, fnCountData is constructed from fnFetchData via ConstructDataCounter(). When working with large datasets, retrieving only the count of records is often significantly faster than fetching the full dataset. In such cases, you may wish to provide a specialized fnCountData function that fetches just the counts.
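Conceptually, the default counter simply wraps your fetcher, equivalent to the inline function shown in Step 5:

# Roughly what the default construction does (a sketch, not the actual
# ConstructDataCounter() source)
fnCountData_default <- function(...) {
  nrow(fnFetchData(...))
}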
Sample Implementation
fnCountData <- function(
  strDomain,
  strGroupLevel = "Site",
  strGroupID = NULL,
  strSubjectID = NULL,
  dSnapshotDate = NULL
) {
  # Connect to your database (credentials from the environment, as above)
  con <- DBI::dbConnect(
    RPostgres::Postgres(),
    host = "your-db-host",
    dbname = "clinical_trial_db",
    user = Sys.getenv("DB_USER"),
    password = Sys.getenv("DB_PASSWORD")
  )
  on.exit(DBI::dbDisconnect(con), add = TRUE)

  # Build a dynamic SQL query based on app state, using bound parameters
  base_query <- paste("SELECT COUNT(*) AS n FROM", tolower(strDomain))
  where_clauses <- c()
  params <- list()

  if (!is.null(strGroupID)) {
    group_col <- paste0(tolower(strGroupLevel), "_id")
    where_clauses <- c(where_clauses, paste0(group_col, " = $", length(params) + 1))
    params <- c(params, list(strGroupID))
  }

  if (!is.null(strSubjectID)) {
    where_clauses <- c(where_clauses, paste0("subject_id = $", length(params) + 1))
    params <- c(params, list(strSubjectID))
  }

  if (length(where_clauses) > 0) {
    query <- paste(base_query, "WHERE", paste(where_clauses, collapse = " AND "))
  } else {
    query <- base_query
  }

  # Execute the query and return the count as a single integer
  # (dbGetQuery() returns a one-row data.frame, so extract the value)
  result <- if (length(params) > 0) {
    DBI::dbGetQuery(con, query, params = params)
  } else {
    DBI::dbGetQuery(con, query)
  }
  as.integer(result$n[1])
}
Summary
Data ingestion is a critical component that enables the GSM app to provide meaningful drill-down functionality. By implementing a custom fnFetchData function, you can:
- Respond to App State: Your function receives the current domain, group, and subject selections.
- Filter Appropriately: Return only the data relevant to the current investigation.
- Handle Errors Gracefully: Provide meaningful warnings and fallbacks.
- Optimize Performance: Consider caching and pagination for large datasets.
The data fetching function bridges the gap between high-level KRI summaries and detailed source data, empowering investigators to understand the underlying drivers of risk signals in their clinical trials.