library(gsm.datasim)
Introduction
The Domain Registry is a centralized system in
gsm.datasim that defines how clinical domains (datasets)
are generated and managed. It provides a structured approach to data
generation that is extensible, maintainable, and consistent across
different study types.
This vignette explains how the domain registry works, how to use existing domains, and how to extend it with new domains.
Domain Registry Architecture
What is a Domain?
In clinical trial terminology, a “domain” refers to a specific type of clinical data, such as:
- AE: Adverse Events
- LB: Laboratory Data
- VISIT: Visit Records
- PD: Protocol Deviations
- PK: Pharmacokinetics
- QUERY: Data Queries
Each domain has its own data structure, generation rules, and dependencies.
Registry Structure
The domain registry is a named list where each entry defines:
- Dataset name: The raw dataset identifier (e.g., “Raw_AE”)
- Package: Which R package contains the generation function
- Workflow path: Where to find mapping specifications
- Generation function: How to create the data
- Dependencies: What other domains this domain needs
- Count functions: How many records to generate
- Argument builders: How to prepare function arguments
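To make the list above concrete, here is a minimal sketch of what one entry could look like as a plain named list. The field names (`workflow_path`, `dependencies`, `build_args`), the package name, and the path are assumptions chosen to mirror the bullets; they are not confirmed internals of gsm.datasim.

```r
# Hypothetical registry with a single entry. Field names mirror the list
# above and are assumptions, not the package's confirmed structure.
example_registry <- list(
  Raw_EXAMPLE = list(
    dataset       = "Raw_EXAMPLE",
    package       = "example.package",      # hypothetical package name
    workflow_path = "workflow/1_mappings",  # hypothetical path
    dependencies  = c("Raw_SUBJ", "Raw_SITE"),
    count_fn      = function(n_subjects) n_subjects * 2L,
    build_args    = function(context) list(n = context$n),
    generate_fn   = function(n) data.frame(record_id = seq_len(n))
  )
)

# Walk through the entry the way a generator might:
# count -> build arguments -> generate.
entry <- example_registry[["Raw_EXAMPLE"]]
n_records <- entry$count_fn(5L)
args <- entry$build_args(list(n = n_records))
example_rows <- do.call(entry$generate_fn, args)
nrow(example_rows)
```

Keeping functions inside the list means a domain's whole definition travels as one object, which is what lets new domains be added without touching the generator itself.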
Viewing the Registry
# Get the complete domain registry
registry <- get_domain_registry()
# View available domains
names(registry)
# Examine a specific domain entry
str(registry$Raw_AE)
Core Registry Domains
Adverse Events (Raw_AE)
Adverse Events are a critical domain in clinical trials and in risk-based quality management (RBQM). The registry entry for this domain can be inspected as follows:
# View AE domain configuration
ae_config <- registry$Raw_AE
# Key components:
# - dataset: "Raw_AE"
# - package: "gsm.mapping"
# - required_inputs: data, previous_data, combined_specs, etc.
# - count_fn: Uses ae_count from snapshot configuration
# - generate_fn: Calls Raw_AE() function
print(ae_config$dataset)
print(ae_config$package)
print(ae_config$required_inputs)
Available Domains
# Get all available domains from the registry
available_domains <- get_available_domains()
print(available_domains)
# Get domains for a specific package
mapping_domains <- get_available_domains(package = "gsm.mapping")
print(mapping_domains)
Using the Registry System
Basic Domain Usage
The registry system is used automatically when you configure studies:
# Create a study configuration
config <- create_study_config(
study_id = "REGISTRY001",
participant_count = 100,
site_count = 5
)
# Add domains - these will use registry definitions
config <- add_dataset_config(config, "Raw_AE", enabled = TRUE)
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE)
# Generate data using registry
raw_data <- generate_study_data(config, verbose = TRUE)
# The registry handles:
# - Loading appropriate generation functions
# - Managing dependencies between domains
# - Applying correct count formulas
# - Handling multi-package scenarios
Domain Dependencies
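Dependency handling can be pictured as an ordering problem: a domain can only be generated once every domain it needs already exists. The sketch below shows one simple way to derive such an order. The `deps` map and the `order_domains()` helper are hypothetical illustrations and do not describe how gsm.datasim resolves dependencies internally.

```r
# Hypothetical dependency map: each domain lists the domains it needs first.
deps <- list(
  Raw_STUDY = character(0),
  Raw_SITE  = "Raw_STUDY",
  Raw_SUBJ  = "Raw_SITE",
  Raw_AE    = "Raw_SUBJ",
  Raw_LB    = c("Raw_SUBJ", "Raw_SITE")
)

# Repeatedly pick the domains whose dependencies have all been ordered.
order_domains <- function(deps) {
  ordered <- character(0)
  remaining <- names(deps)
  while (length(remaining) > 0) {
    is_ready <- vapply(
      remaining,
      function(d) all(deps[[d]] %in% ordered),
      logical(1)
    )
    ready <- remaining[is_ready]
    if (length(ready) == 0) {
      stop("Circular or unresolved dependency among: ",
           paste(remaining, collapse = ", "))
    }
    ordered <- c(ordered, ready)
    remaining <- setdiff(remaining, ready)
  }
  ordered
}

generation_order <- order_domains(deps)
print(generation_order)
```

With this map, the study, site, and subject domains come first and the event-level domains (AE, LB) come last; a cycle stops with an error instead of looping forever.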
Some domains depend on others. The registry manages this automatically:
# Check domain dependencies
check_domain_dependencies <- function(domain_name) {
registry <- get_domain_registry()
domain_config <- registry[[domain_name]]
if (!is.null(domain_config$dependencies)) {
return(domain_config$dependencies)
} else {
return("No explicit dependencies defined")
}
}
# Check dependencies for various domains
check_domain_dependencies("Raw_AE")
check_domain_dependencies("Raw_LB")
# Core domains (STUDY, SITE, SUBJ, ENROLL) are always required
core_domains <- c("Raw_STUDY", "Raw_SITE", "Raw_SUBJ", "Raw_ENROLL")
print("Core required domains:")
print(core_domains)
Multi-Package Support
The registry supports domains from different packages:
# Create a domain-package mapping
domain_package_df <- data.frame(
domain = c("AE", "LB", "VISIT", "CustomDomain"),
package = c("gsm.mapping", "gsm.mapping", "gsm.mapping", "custom.package"),
stringsAsFactors = FALSE
)
# Create study with multi-package support
config <- create_study_config(
study_id = "MULTI001",
participant_count = 50,
site_count = 3
)
config <- add_dataset_config(config, "Raw_AE", enabled = TRUE)
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE)
# Generate with package specification
raw_data <- generate_raw_data_for_endpoints(config, domain_package_df)
Domain Timeline and Evolution
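As orientation for this section, it helps to see how record counts might evolve across snapshots under different growth shapes. The following is a minimal, package-independent sketch: `snapshot_counts()` is a hypothetical helper, and its formulas are illustrative assumptions rather than the calculations gsm.datasim performs. Only the pattern names are taken from the examples in this section.

```r
# Hypothetical helper: cumulative record counts per snapshot for three
# growth shapes. The formulas are illustrative assumptions.
snapshot_counts <- function(pattern, n_snapshots, base = 10) {
  idx <- seq_len(n_snapshots)
  counts <- switch(
    pattern,
    linear      = base * idx,            # steady increase
    exponential = base * 2^(idx - 1),    # doubles every snapshot
    constant    = rep(base, n_snapshots),
    stop("Unknown growth pattern: ", pattern)
  )
  as.integer(round(counts))
}

snapshot_counts("linear", 4)
snapshot_counts("exponential", 4)
snapshot_counts("constant", 4)
```

Under these assumed formulas, four snapshots give 10, 20, 30, 40 records for linear growth, 10, 20, 40, 80 for exponential growth, and 10 in every snapshot for the constant pattern.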
Understanding Domain Timelines
Different domains have different temporal patterns:
# Get timeline information for a domain
ae_timeline <- get_domain_timeline("Raw_AE")
print(ae_timeline)
lb_timeline <- get_domain_timeline("Raw_LB")
print(lb_timeline)
# Timeline shows:
# - When domain data typically starts appearing
# - How frequently it's collected
# - Growth patterns over time
# - Dependencies on other domains
Longitudinal Data Patterns
The registry handles different growth patterns:
# Domains have different growth patterns:
# Linear growth (steady increase)
config <- add_dataset_config(config, "Raw_AE", enabled = TRUE,
growth_pattern = "linear")
# Exponential growth (accelerating increase)
config <- add_dataset_config(config, "Raw_QUERY", enabled = TRUE,
growth_pattern = "exponential")
# Constant (same amount each snapshot)
config <- add_dataset_config(config, "Raw_VISIT", enabled = TRUE,
growth_pattern = "constant")
# Custom formulas
config <- add_dataset_config(config, "Raw_LB", enabled = TRUE,
count_formula = function(config) {
config$study_params$participant_count * 3 # 3 per participant
})
Working with Domain Generation Functions
Understanding Generation Functions
Each domain has a specific generation function (e.g.,
Raw_AE(), Raw_LB()):
# Domain generation functions typically take these parameters:
# - data: Current snapshot data
# - previous_data: Previous snapshot(s) for continuity
# - combined_specs: Data generation specifications
# - n: Number of records to generate
# - startDate/endDate: Time boundaries
# - split_vars: How to distribute data across timepoints
# Example: AE generation function signature
# Raw_AE(data, previous_data, combined_specs, n, startDate, endDate, split_vars)
# Example: LB generation function signature
# Raw_LB(data, previous_data, combined_specs, n, startDate, split_vars)
Custom Count Functions
You can customize how record counts are determined:
# Create a study with custom count logic
config <- create_study_config(
study_id = "CUSTOM-COUNTS",
participant_count = 100,
site_count = 10
)
# Custom AE count: more AEs at larger sites
config <- add_dataset_config(
config,
"Raw_AE",
enabled = TRUE,
count_formula = function(config) {
# Complex formula based on site size distribution
base_ae_rate <- 0.8 # 80% of participants have AEs
site_size_factor <- config$study_params$participant_count / config$study_params$site_count
round(config$study_params$participant_count * base_ae_rate * (site_size_factor / 10))
}
)
# Custom LB count: four lab records per participant per snapshot
config <- add_dataset_config(
config,
"Raw_LB",
enabled = TRUE,
count_formula = function(config) {
participants <- config$study_params$participant_count
# Assuming monthly snapshots
participants * 4 # 4 lab visits per participant per month
}
)
Extending the Domain Registry
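Extending the registry comes down to supplying two functions per domain: one that decides how many records to create and one that creates them. The sketch below shows how a caller might drive such an entry. The `run_entry()` driver, the `Raw_DEMO` entry, and the shapes of `counts` and `context` are assumptions modelled on the Raw_CUSTOM example in this section, not the package's actual generator.

```r
# Hypothetical entry shaped like the Raw_CUSTOM example in this section.
entry <- list(
  dataset = "Raw_DEMO",
  count_fn = function(counts, snapshot_idx) {
    counts$subject_count[snapshot_idx] * 2  # 2 records per participant
  },
  generate_fn = function(context) {
    data.frame(
      subjid = sample(context$data$Raw_SUBJ$subjid, context$n, replace = TRUE),
      demo_value = runif(context$n),
      stringsAsFactors = FALSE
    )
  }
)

# Hypothetical driver: compute the target count, assemble the context,
# then hand everything to generate_fn.
run_entry <- function(entry, data, counts, snapshot_idx) {
  n <- entry$count_fn(counts, snapshot_idx)
  context <- list(data = data, previous_data = NULL, n = n)
  entry$generate_fn(context)
}

data <- list(Raw_SUBJ = data.frame(subjid = c("S1", "S2", "S3")))
counts <- list(subject_count = c(3, 3))
demo <- run_entry(entry, data, counts, snapshot_idx = 1)
nrow(demo)
```

Passing a single `context` list keeps the signature of `generate_fn` stable even when a domain needs extra inputs later.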
Adding New Domains
You can extend the registry with custom domains:
# Define a custom domain
add_custom_domain_to_registry <- function() {
# Get current registry
current_registry <- get_domain_registry()
# Add custom domain
current_registry$Raw_CUSTOM <- list(
dataset = "Raw_CUSTOM",
required_inputs = c("data", "combined_specs", "n", "start_date"),
count_fn = function(counts, snapshot_idx) {
# Custom count logic
counts$subject_count[snapshot_idx] * 2 # 2 records per participant
},
generate_fn = function(context) {
# All generation logic lives here — no separate Raw_CUSTOM() function needed.
# Access everything through `context`:
# context$data — current snapshot data (Raw_SUBJ, Raw_SITE, …)
# context$previous_data — previous snapshot
# context$combined_specs — field specs from CombineSpecs()
# context$n — target record count this snapshot
# context$start_date / context$end_date
warning("Raw_CUSTOM generate_fn not implemented")
data.frame()
}
)
return(current_registry)
}
# Example usage (generate_fn above is a stub and must be implemented first)
# extended_registry <- add_custom_domain_to_registry()
Custom Generation Functions
If you prefer to keep the generation logic in a standalone function rather than inline in generate_fn, it could be structured like this:
# Example structure for a custom domain generation function
Raw_CUSTOM <- function(data, combined_specs, n, startDate, custom_param = NULL) {
# Your custom data generation logic here
# Basic structure:
custom_data <- data.frame(
studyid = rep(data$Raw_STUDY$studyid[1], n),
usubjid = sample(data$Raw_SUBJ$usubjid, n, replace = TRUE),
custom_field = paste("CUSTOM", seq_len(n), sep = "_"),
custom_date = as.Date(startDate) + seq_len(n) - 1, # one consecutive day per record; also safe when n is 0
custom_value = runif(n, 0, 100),
stringsAsFactors = FALSE
)
return(custom_data)
}
# To use this in the registry:
# 1. Define the function in your package
# 2. Add registry entry as shown above
# 3. Include the domain in your study configuration
Registry Migration and Maintenance
Domain Registry Migration
The registry supports migration between different data generation approaches:
# The package includes migration utilities for moving from
# older approaches to the registry-based system
# Check which domains can be migrated
available_for_migration <- get_available_domains()
print("Domains available for registry-based generation:")
print(available_for_migration)
# Migration typically involves:
# 1. Updating study configurations to use registry
# 2. Converting custom domain logic to registry format
# 3. Testing compatibility with existing workflows
Registry Validation
# Validate registry configuration
validate_domain_registry <- function(registry = NULL) {
if (is.null(registry)) {
registry <- get_domain_registry()
}
validation_results <- list()
for (domain_name in names(registry)) {
domain_config <- registry[[domain_name]]
# Check required fields
required_fields <- c("dataset", "package", "generate_fn")
missing_fields <- setdiff(required_fields, names(domain_config))
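# Check package availability; entries without a package field count as unavailable
pkg <- domain_config$package
package_available <- is.character(pkg) && length(pkg) == 1 &&
requireNamespace(pkg, quietly = TRUE)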
validation_results[[domain_name]] <- list(
missing_fields = missing_fields,
package_available = package_available,
has_count_fn = !is.null(domain_config$count_fn),
has_generate_fn = !is.null(domain_config$generate_fn)
)
}
return(validation_results)
}
# Validate current registry
validation <- validate_domain_registry()
str(validation, max.level = 2)
Best Practices
Registry Usage Guidelines
- Use existing domains when possible: The registry includes well-tested domains for common clinical data types.
- Follow naming conventions: Use the “Raw_” prefix for dataset names to maintain consistency.
- Define clear dependencies: If your domain needs data from other domains, specify dependencies clearly.
- Test thoroughly: Validate your custom domains across different study configurations.
- Document extensions: Provide clear documentation for any custom domains you add.
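Some of these guidelines can be checked mechanically before a custom entry is added. The helper below is a hypothetical sketch, not part of gsm.datasim: `check_custom_entry()` and the `dependencies` field name are assumptions made for illustration.

```r
# Hypothetical pre-flight check for a custom registry entry.
# Returns a character vector of problems (empty when none are found).
check_custom_entry <- function(entry, known_domains) {
  problems <- character(0)
  if (!is.character(entry$dataset) || !startsWith(entry$dataset, "Raw_")) {
    problems <- c(problems, "dataset name should start with 'Raw_'")
  }
  unknown <- setdiff(entry$dependencies, known_domains)
  if (length(unknown) > 0) {
    problems <- c(
      problems,
      paste("unknown dependencies:", paste(unknown, collapse = ", "))
    )
  }
  if (!is.function(entry$generate_fn)) {
    problems <- c(problems, "generate_fn must be a function")
  }
  problems
}

known <- c("Raw_STUDY", "Raw_SITE", "Raw_SUBJ", "Raw_ENROLL")

good <- list(
  dataset = "Raw_CUSTOM",
  dependencies = "Raw_SUBJ",
  generate_fn = function(context) data.frame()
)
bad <- list(dataset = "CUSTOM", dependencies = "Raw_MISSING")

check_custom_entry(good, known)
check_custom_entry(bad, known)
```

Returning a vector of messages rather than stopping at the first failure lets you report every problem with an entry in one pass.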
Performance Considerations
# Registry-based generation is optimized for performance:
# 1. Efficient package loading - only loads needed packages
# 2. Dependency resolution - generates domains in correct order
# 3. Multi-package support - handles complex study configurations
# 4. Caching - reuses specifications across snapshots
# For large studies, consider:
# - Enabling verbose mode to monitor progress
# - Breaking large studies into smaller chunks
# - Using appropriate snapshot intervals
Troubleshooting
# Common issues and solutions:
# Issue: Domain not found in registry
# Solution: Check available domains and spelling
available <- get_available_domains()
print(available)
# Issue: Package not available
# Solution: Install required packages
missing_packages <- c()
registry <- get_domain_registry()
for (domain in names(registry)) {
pkg <- registry[[domain]]$package
# Skip entries without a package field and avoid listing a package twice
if (is.character(pkg) && !requireNamespace(pkg, quietly = TRUE)) {
missing_packages <- union(missing_packages, pkg)
}
}
if (length(missing_packages) > 0) {
cat("Missing packages:", paste(missing_packages, collapse = ", "), "\n")
}
# Issue: Generation function errors
# Solution: Check function parameters and data availability
# Use verbose mode to see detailed error messages
Conclusion
The Domain Registry provides a robust, extensible framework for
clinical data generation in gsm.datasim. Key benefits
include:
- Standardization: Consistent approach to domain definition and generation
- Extensibility: Easy to add new domains and customize existing ones
- Maintainability: Centralized configuration reduces duplication
- Multi-package support: Domains can come from different R packages
- Dependency management: Automatic handling of domain dependencies
Understanding the registry system enables you to:
- Use existing domains effectively
- Create custom domains for specialized needs
- Maintain complex multi-domain studies
- Migrate from older generation approaches
The registry system is designed to grow with your needs while maintaining backward compatibility and ease of use.