Understanding the Domain Registry

library(gsm.datasim)
#> Registered S3 method overwritten by 'logger':
#>   method         from 
#>   print.loglevel log4r

Introduction

The Domain Registry is a centralized system in gsm.datasim that defines how clinical domains (datasets) are generated and managed. It provides a structured approach to data generation that is extensible, maintainable, and consistent across different study types.

This vignette explains how the domain registry works, how to use existing domains, and how to extend it with new domains.

Domain Registry Architecture

What is a Domain?

In clinical trial terminology, a “domain” refers to a specific type of clinical data, such as:

AE: Adverse Events
LB: Laboratory Data
VISIT: Visit Records
PD: Protocol Deviations
PK: Pharmacokinetics
QUERY: Data Queries

Each domain has its own data structure, generation rules, and dependencies.

Registry Structure

The domain registry is a named list where each entry defines:

Dataset name: The raw dataset identifier (e.g., “Raw_AE”)
Package: Which R package contains the generation function
Workflow path: Where to find mapping specifications
Generation function: How to create the data
Dependencies: What other domains this domain needs
Count functions: How many records to generate
Argument builders: How to prepare function arguments
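
To make the field list above concrete, here is a minimal sketch of what a single entry could look like as a plain R list. It uses base R only. The field names `workflow_path` and `dependencies`, the placeholder path, the listed dependencies, and both function bodies are illustrative assumptions, not the package's actual definitions; inspect the output of get_domain_registry() for the real structure.

```r
# Hypothetical registry entry (illustration only, not the package's definition)
toy_registry <- list(
  Raw_AE = list(
    dataset       = "Raw_AE",
    package       = "gsm.mapping",
    workflow_path = "path/to/mappings",            # placeholder, assumed
    dependencies  = c("Raw_SUBJ", "Raw_SITE"),     # assumed upstream domains
    # How many records to generate for a given snapshot
    count_fn = function(counts, snapshot_idx) counts$ae_count[snapshot_idx],
    # How to create the data; everything arrives through `context`
    generate_fn = function(context) {
      data.frame(
        subjid = sample(context$data$Raw_SUBJ$subjid, context$n, replace = TRUE),
        stringsAsFactors = FALSE
      )
    }
  )
)

names(toy_registry)                  # domains in the toy registry
toy_registry$Raw_AE$dependencies     # what Raw_AE needs first
```

Keeping functions inside the list means each entry carries both its configuration and its behaviour, so adding a domain amounts to adding one list element.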

Viewing the Registry

# Get the complete domain registry
registry <- get_domain_registry()

# View available domains
names(registry)

# Examine a specific domain entry
str(registry$Raw_AE)

Core Registry Domains

Adverse Events (Raw_AE)

Adverse events are a critical domain in clinical trials and in risk-based quality management (RBQM). The registry entry for this domain can be inspected as follows:

# View AE domain configuration
ae_config <- registry$Raw_AE

# Key components:
# - dataset: "Raw_AE"
# - package: "gsm.mapping" 
# - required_inputs: data, previous_data, combined_specs, etc.
# - count_fn: Uses ae_count from snapshot configuration
# - generate_fn: Calls Raw_AE() function

print(ae_config$dataset)
print(ae_config$package)
print(ae_config$required_inputs)

Laboratory Data (Raw_LB)

Laboratory data represents test results over time:

# View LB domain configuration  
lb_config <- registry$Raw_LB

# Key characteristics:
# - Uses subject_count for record generation
# - Includes visit-based data splitting
# - Supports longitudinal data patterns

print(lb_config$count_fn)
print(lb_config$generate_fn)

Available Domains

# Get all available domains from the registry
available_domains <- get_available_domains()
print(available_domains)

# Get domains for a specific package
mapping_domains <- get_available_domains(package = "gsm.mapping")
print(mapping_domains)

Using the Registry System

Basic Domain Usage

The registry system is used automatically when you configure studies:

# Create a study configuration
config <- create_study_config(
  study_id = "REGISTRY001",
  participant_count = 100,
  site_count = 5
)

# Add domains - these will use registry definitions
config <- add_dataset_config(config, "Raw_AE", enabled = TRUE)
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE)

# Generate data using registry
raw_data <- generate_study_data(config, verbose = TRUE)

# The registry handles:
# - Loading appropriate generation functions
# - Managing dependencies between domains  
# - Applying correct count formulas
# - Handling multi-package scenarios

Domain Dependencies

Some domains depend on others. The registry manages this automatically:

# Check domain dependencies
check_domain_dependencies <- function(domain_name) {
  registry <- get_domain_registry()
  domain_config <- registry[[domain_name]]
  
  if (!is.null(domain_config$dependencies)) {
    return(domain_config$dependencies)
  } else {
    return("No explicit dependencies defined")
  }
}

# Check dependencies for various domains
check_domain_dependencies("Raw_AE")
check_domain_dependencies("Raw_LB")

# Core domains (STUDY, SITE, SUBJ, ENROLL) are always required
core_domains <- c("Raw_STUDY", "Raw_SITE", "Raw_SUBJ", "Raw_ENROLL")
print("Core required domains:")
print(core_domains)
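
Generating domains in dependency order is essentially a topological sort. The sketch below shows one way such an ordering could be computed in base R. The dependency map and the helper name `resolve_order` are hypothetical and chosen for illustration; the package's own resolution logic may differ.

```r
# Hypothetical dependency map: each domain lists the domains it needs first
deps <- list(
  Raw_STUDY = character(0),
  Raw_SITE  = "Raw_STUDY",
  Raw_SUBJ  = "Raw_SITE",
  Raw_AE    = "Raw_SUBJ",
  Raw_LB    = c("Raw_SUBJ", "Raw_SITE")
)

# Repeatedly emit every domain whose dependencies have already been emitted
resolve_order <- function(deps) {
  ordered <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    is_ready <- vapply(remaining, function(d) all(deps[[d]] %in% ordered), logical(1))
    ready <- remaining[is_ready]
    if (length(ready) == 0) {
      stop("Circular or unsatisfied dependency among: ",
           paste(remaining, collapse = ", "))
    }
    ordered <- c(ordered, ready)
    remaining <- setdiff(remaining, ready)
  }
  ordered
}

resolve_order(deps)
# "Raw_STUDY" "Raw_SITE" "Raw_SUBJ" "Raw_AE" "Raw_LB"
```

The explicit stop() matters: without it, a cycle or a dependency that is missing from the map would make the loop run forever.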

Multi-Package Support

The registry supports domains from different packages:

# Create a domain-package mapping
domain_package_df <- data.frame(
  domain = c("AE", "LB", "VISIT", "CustomDomain"),
  package = c("gsm.mapping", "gsm.mapping", "gsm.mapping", "custom.package"),
  stringsAsFactors = FALSE
)

# Create study with multi-package support
config <- create_study_config(
  study_id = "MULTI001",
  participant_count = 50,
  site_count = 3
)

config <- add_dataset_config(config, "Raw_AE", enabled = TRUE)
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE)

# Generate with package specification
raw_data <- generate_raw_data_for_endpoints(config, domain_package_df)

Domain Timeline and Evolution

Understanding Domain Timelines

Different domains have different temporal patterns:

# Get timeline information for a domain
ae_timeline <- get_domain_timeline("Raw_AE")
print(ae_timeline)

lb_timeline <- get_domain_timeline("Raw_LB") 
print(lb_timeline)

# Timeline shows:
# - When domain data typically starts appearing
# - How frequently it's collected
# - Growth patterns over time
# - Dependencies on other domains

Longitudinal Data Patterns

The registry handles different growth patterns:

# Domains have different growth patterns:

# Linear growth (steady increase)
config <- add_dataset_config(config, "Raw_AE", enabled = TRUE, 
                           growth_pattern = "linear")

# Exponential growth (accelerating increase)  
config <- add_dataset_config(config, "Raw_QUERY", enabled = TRUE,
                           growth_pattern = "exponential")

# Constant (same amount each snapshot)
config <- add_dataset_config(config, "Raw_VISIT", enabled = TRUE, 
                           growth_pattern = "constant")

# Custom formulas
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE,
                           count_formula = function(config) {
                             config$study_params$participant_count * 3  # 3 per participant
                           })
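
To see how the three named patterns differ, the following sketch computes record counts across four snapshots. The formulas, the helper name `growth_count`, and the `rate` parameter are assumptions made for illustration; the vignette does not state how gsm.datasim translates each pattern into counts.

```r
# Illustrative record count per snapshot for each pattern (assumed formulas)
growth_count <- function(pattern, base_count, snapshot_idx, rate = 2) {
  switch(pattern,
    constant    = base_count,                                   # same every snapshot
    linear      = base_count * snapshot_idx,                    # steady increase
    exponential = round(base_count * rate^(snapshot_idx - 1)),  # accelerating increase
    stop("Unknown growth pattern: ", pattern)
  )
}

snapshots <- 1:4
patterns <- c("constant", "linear", "exponential")

# One column per pattern, one row per snapshot
sapply(patterns, function(p) {
  vapply(snapshots, function(i) growth_count(p, 10, i), numeric(1))
})
```

With a base count of 10 and a rate of 2, the linear column reads 10, 20, 30, 40 while the exponential column reads 10, 20, 40, 80, which shows where the two patterns start to diverge.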

Working with Domain Generation Functions

Understanding Generation Functions

Each domain has a specific generation function (e.g., Raw_AE(), Raw_LB()):

# Domain generation functions typically take these parameters:
# - data: Current snapshot data
# - previous_data: Previous snapshot(s) for continuity  
# - combined_specs: Data generation specifications
# - n: Number of records to generate
# - startDate/endDate: Time boundaries
# - split_vars: How to distribute data across timepoints

# Example: AE generation function signature
# Raw_AE(data, previous_data, combined_specs, n, startDate, endDate, split_vars)

# Example: LB generation function signature  
# Raw_LB(data, previous_data, combined_specs, n, startDate, split_vars)
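
Because each generation function accepts a different subset of these parameters, something has to select the right arguments before the call. A minimal sketch of that idea, using do.call() and a declared vector of required inputs, is shown below. `toy_generator` and the contents of `context` are hypothetical stand-ins, not package functions or package data.

```r
# Hypothetical generator with a signature in the style described above
toy_generator <- function(data, n, startDate) {
  data.frame(
    subjid      = sample(data$Raw_SUBJ$subjid, n, replace = TRUE),
    record_date = as.Date(startDate) + seq_len(n) - 1,
    stringsAsFactors = FALSE
  )
}

# Everything any domain might need for the current snapshot
context <- list(
  data          = list(Raw_SUBJ = data.frame(subjid = c("S001", "S002", "S003"),
                                             stringsAsFactors = FALSE)),
  previous_data = NULL,
  n             = 5,
  startDate     = "2024-01-01",
  endDate       = "2024-01-31"
)

# Keep only the inputs this generator declares, then call it
required_inputs <- c("data", "n", "startDate")
args <- context[required_inputs]
result <- do.call(toy_generator, args)

nrow(result)   # 5
```

Subsetting the context by name keeps the generator's signature small and makes an undeclared input fail loudly instead of being passed silently.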

Custom Count Functions

You can customize how record counts are determined:

# Create a study with custom count logic
config <- create_study_config(
  study_id = "CUSTOM-COUNTS",
  participant_count = 100,
  site_count = 10
)

# Custom AE count: more AEs at larger sites
config <- add_dataset_config(
  config, 
  "Raw_AE", 
  enabled = TRUE,
  count_formula = function(config) {
    # Complex formula based on site size distribution
    base_ae_rate <- 0.8  # 80% of participants have AEs
    site_size_factor <- config$study_params$participant_count / config$study_params$site_count
    round(config$study_params$participant_count * base_ae_rate * (site_size_factor / 10))
  }
)

# Custom LB count: four lab records per participant per snapshot
config <- add_dataset_config(
  config,
  "Raw_LB",
  enabled = TRUE,
  count_formula = function(config) {
    participants <- config$study_params$participant_count
    # Assuming monthly snapshots
    participants * 4  # 4 lab visits per participant per month
  }
)
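
It can help to evaluate a count formula by hand before wiring it into a configuration. The sketch below applies the AE formula from above to a mock list that carries only the two fields the formula reads. `mock_config` and `ae_count_formula` are illustrative names, not objects created by the package.

```r
# Mock configuration with only the fields the formula reads
mock_config <- list(
  study_params = list(participant_count = 100, site_count = 10)
)

# Same arithmetic as the custom AE count formula above
ae_count_formula <- function(config) {
  base_ae_rate <- 0.8
  site_size_factor <- config$study_params$participant_count /
    config$study_params$site_count
  round(config$study_params$participant_count * base_ae_rate * (site_size_factor / 10))
}

# 100 participants over 10 sites: 100 * 0.8 * (10 / 10) = 80
ae_count_formula(mock_config)

# Same participants over 5 sites: 100 * 0.8 * (20 / 10) = 160
mock_config$study_params$site_count <- 5
ae_count_formula(mock_config)
```

Halving the number of sites doubles the average site size and therefore doubles the AE count, which is the "more AEs at larger sites" behaviour the formula is meant to produce.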

Extending the Domain Registry

Adding New Domains

You can extend the registry with custom domains:

# Define a custom domain
add_custom_domain_to_registry <- function() {
  # Get current registry
  current_registry <- get_domain_registry()
  
  # Add custom domain
  current_registry$Raw_CUSTOM <- list(
    dataset = "Raw_CUSTOM",
    required_inputs = c("data", "combined_specs", "n", "start_date"),
    count_fn = function(counts, snapshot_idx) {
      # Custom count logic
      counts$subject_count[snapshot_idx] * 2  # 2 records per participant
    },
    generate_fn = function(context) {
      # All generation logic lives here — no separate Raw_CUSTOM() function needed.
      # Access everything through `context`:
      #   context$data           — current snapshot data (Raw_SUBJ, Raw_SITE, …)
      #   context$previous_data  — previous snapshot
      #   context$combined_specs — field specs from CombineSpecs()
      #   context$n              — target record count this snapshot
      #   context$start_date / context$end_date
      warning("Raw_CUSTOM generate_fn not implemented")
      data.frame()
    }
  )
  
  return(current_registry)
}

# Example usage (generate_fn above is a stub; implement its body before generating data)
# extended_registry <- add_custom_domain_to_registry()

Custom Generation Functions

If you prefer to keep the generation logic in a standalone function, for example one that generate_fn calls, it can follow this structure:

# Example structure for a custom domain generation function
Raw_CUSTOM <- function(data, combined_specs, n, startDate, custom_param = NULL) {
  # Your custom data generation logic here
  
  # Basic structure:
  custom_data <- data.frame(
    studyid = rep(data$Raw_STUDY$studyid[1], n),
    usubjid = sample(data$Raw_SUBJ$usubjid, n, replace = TRUE),
    custom_field = sprintf("CUSTOM_%d", seq_len(n)),      # seq_len() is safe when n = 0
    custom_date = as.Date(startDate) + seq_len(n) - 1,    # one consecutive day per record
    custom_value = runif(n, 0, 100),
    stringsAsFactors = FALSE
  )
  
  return(custom_data)
}

# To use this in the registry:
# 1. Define the function in your package
# 2. Add registry entry as shown above  
# 3. Include the domain in your study configuration

Registry Migration and Maintenance

Domain Registry Migration

The registry supports migration between different data generation approaches:

# The package includes migration utilities for moving from
# older approaches to the registry-based system

# Check which domains can be migrated
available_for_migration <- get_available_domains()
print("Domains available for registry-based generation:")
print(available_for_migration)

# Migration typically involves:
# 1. Updating study configurations to use registry
# 2. Converting custom domain logic to registry format  
# 3. Testing compatibility with existing workflows

Registry Validation

# Validate registry configuration
validate_domain_registry <- function(registry = NULL) {
  if (is.null(registry)) {
    registry <- get_domain_registry()
  }
  
  validation_results <- list()
  
  for (domain_name in names(registry)) {
    domain_config <- registry[[domain_name]]
    
    # Check required fields
    required_fields <- c("dataset", "package", "generate_fn")
    missing_fields <- setdiff(required_fields, names(domain_config))
    
    # Check package availability (entries may omit `package`, so guard against NULL)
    package_available <- !is.null(domain_config$package) &&
      requireNamespace(domain_config$package, quietly = TRUE)
    
    validation_results[[domain_name]] <- list(
      missing_fields = missing_fields,
      package_available = package_available,
      has_count_fn    = !is.null(domain_config$count_fn),
      has_generate_fn = !is.null(domain_config$generate_fn)
    )
  }
  
  return(validation_results)
}

# Validate current registry
validation <- validate_domain_registry()
str(validation, max.level = 2)
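
For registries with many domains, the nested list printed by str() can be hard to scan. One option is to flatten it into a data frame and filter for problems, as sketched below. The `validation` object here is a hand-written toy in the same shape as the result above, so the example runs without the package; the entries in it are made up.

```r
# Toy validation output in the shape returned by the function above
validation <- list(
  Raw_AE = list(missing_fields = character(0), package_available = TRUE,
                has_count_fn = TRUE, has_generate_fn = TRUE),
  Raw_CUSTOM = list(missing_fields = "package", package_available = FALSE,
                    has_count_fn = TRUE, has_generate_fn = TRUE)
)

# One row per domain
summary_df <- data.frame(
  domain = names(validation),
  missing_fields = vapply(validation,
                          function(v) paste(v$missing_fields, collapse = ", "),
                          character(1)),
  package_available = vapply(validation, function(v) v$package_available, logical(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Domains that need attention
problems <- summary_df[summary_df$missing_fields != "" | !summary_df$package_available, ]
problems
```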

Best Practices

Registry Usage Guidelines

Use existing domains when possible: The registry includes well-tested domains for common clinical data types.
Follow naming conventions: Use “Raw_” prefix for dataset names to maintain consistency.
Define clear dependencies: If your domain needs data from other domains, specify dependencies clearly.
Test thoroughly: Validate your custom domains across different study configurations.
Document extensions: Provide clear documentation for any custom domains you add.

Performance Considerations

# Registry-based generation is optimized for performance:

# 1. Efficient package loading - only loads needed packages
# 2. Dependency resolution - generates domains in correct order
# 3. Multi-package support - handles complex study configurations
# 4. Caching - reuses specifications across snapshots

# For large studies, consider:
# - Enabling verbose mode to monitor progress
# - Breaking large studies into smaller chunks
# - Using appropriate snapshot intervals

Troubleshooting

# Common issues and solutions:

# Issue: Domain not found in registry
# Solution: Check available domains and spelling
available <- get_available_domains()
print(available)

# Issue: Package not available
# Solution: Install required packages
missing_packages <- character(0)
registry <- get_domain_registry()
for (domain in names(registry)) {
  pkg <- registry[[domain]]$package
  # Skip entries without a package; union() avoids listing a package twice
  if (!is.null(pkg) && !requireNamespace(pkg, quietly = TRUE)) {
    missing_packages <- union(missing_packages, pkg)
  }
}
if (length(missing_packages) > 0) {
  cat("Missing packages:", paste(missing_packages, collapse = ", "), "\n")
}

# Issue: Generation function errors  
# Solution: Check function parameters and data availability
# Use verbose mode to see detailed error messages

Conclusion

The Domain Registry provides a robust, extensible framework for clinical data generation in gsm.datasim. Key benefits include:

Standardization: Consistent approach to domain definition and generation
Extensibility: Easy to add new domains and customize existing ones
Maintainability: Centralized configuration reduces duplication
Multi-package support: Domains can come from different R packages
Dependency management: Automatic handling of domain dependencies

Understanding the registry system enables you to:

Use existing domains effectively
Create custom domains for specialized needs
Maintain complex multi-domain studies
Migrate from older generation approaches

The registry system is designed to grow with your needs while maintaining backward compatibility and ease of use.