Feature Selection and Feature Engineering with multiDEGGs in Nested Cross-Validation

In computational biology applications involving high-throughput data, researchers commonly encounter situations where the number of potential predictors far exceeds the available sample size. This dimensional challenge requires careful feature selection strategies for both mathematical and clinical reasons.

Standard feature selection methods typically evaluate predictors individually, identifying those variables that show the strongest univariate associations with the outcome variable (such as through t-tests or Wilcoxon tests). While effective, this approach overlooks the interconnected nature of biological systems.

Feature engineering represents a complementary strategy that creates new predictors by combining or transforming existing variables. In biology, such an approach can be used to capture higher-order information that reflects the interconnected nature of molecular processes. For instance, the ratio between two genes may provide more discriminative power than either gene's expression level alone, particularly when their relative balance is disrupted in disease states.
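The point about ratios can be illustrated with a small simulation. This is only a sketch on made-up data, not output from multiDEGGs: two hypothetical genes share a large sample-specific scale factor, so each gene alone is noisy, while their ratio cancels the shared factor and exposes a modest opposite shift in the disease group.

```r
set.seed(42)
n <- 50  # samples per group
group <- factor(rep(c("control", "disease"), each = n))

# Shared sample-specific factor (e.g. overall pathway activity) that
# dominates the variance of each gene taken on its own
scale_factor <- exp(rnorm(2 * n, mean = 5, sd = 0.5))
gene_a <- scale_factor * exp(rnorm(2 * n, sd = 0.1))
gene_b <- scale_factor * exp(rnorm(2 * n, sd = 0.1))

# In disease, the balance between the two genes is disrupted
is_disease <- group == "disease"
gene_a[is_disease] <- gene_a[is_disease] * 1.2
gene_b[is_disease] <- gene_b[is_disease] * 0.8

# Engineered feature: the ratio cancels the shared factor
ratio <- gene_a / gene_b

p_values <- c(
  gene_a = t.test(gene_a ~ group)$p.value,
  gene_b = t.test(gene_b ~ group)$p.value,
  ratio  = t.test(ratio ~ group)$p.value
)
print(signif(p_values, 3))
```

In this simulated setting the ratio separates the groups far more clearly than either gene alone. Real data will not always behave this way.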

The informative content encoded in differential interactions, combined with multiDEGGs’ ability to identify only literature-validated differential relationships, makes the package particularly well suited for both individual feature selection and guided creation of engineered predictors in machine learning. Such an approach has the potential to overcome the limitations of conventional algorithms, which may select individual predictors without clear biological significance, compromising both the interpretability and the clinical credibility of the resulting models.

Why Nested Cross-Validation for Feature Engineering?

It is crucial that feature selection and modification are conducted exclusively on training data within cross-validation loops to prevent information leakage from the test set. The nestedcv package enables nested modification of predictors within each outer fold, ensuring that the attributes learned from the training partition are applied to the test data without prior knowledge of the test data itself.
The selected and combined features, and corresponding model, can then be evaluated on the hold-out test data without introducing bias.
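The principle can be sketched by hand in base R, independently of nestedcv. The example below uses simulated data and a plain t-test ranking (a stand-in, not the multiDEGGs procedure): within each outer fold, genes are ranked on the training samples only, and the held-out samples are merely subset to the columns chosen that way.

```r
set.seed(1)
n <- 60
p <- 200
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- factor(rep(c("A", "B"), each = n / 2))
x[y == "B", 1:5] <- x[y == "B", 1:5] + 2  # genes 1-5 carry the signal

n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = n))

results <- lapply(seq_len(n_folds), function(k) {
  train <- fold_id != k
  x_train <- x[train, , drop = FALSE]
  y_train <- y[train]
  # Rank genes by t-test p-value using training samples only
  pvals <- apply(x_train, 2, function(g) {
    t.test(g[y_train == "A"], g[y_train == "B"])$p.value
  })
  keep <- names(sort(pvals))[1:10]
  # The held-out fold is only subset; it never influences the ranking
  list(keep = keep,
       x_train = x_train[, keep, drop = FALSE],
       x_test = x[!train, keep, drop = FALSE])
})

# The selected genes can differ from fold to fold
table(unlist(lapply(results, `[[`, "keep")))
```

Running the selection once on all samples before splitting would let the test folds influence which genes are kept, which is the leakage that nesting avoids.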

Both nestcv.glmnet() and nestcv.train() from nestedcv accept any user-defined function that filters or transforms the feature matrix; the function name is passed to the modifyX parameter.
The multiDEGGs package provides two specialized functions for this purpose.
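To give a feel for what such a user-defined function might look like, here is a minimal sketch of a hypothetical modifier, top_cor_filter(), which is not part of multiDEGGs or nestedcv. It assumes the convention, as I understand it from the nestedcv documentation, that with modifyX_useY = TRUE the function receives the training y and x and returns a classed object whose predict() method is then applied to the training and test partitions separately. Check the nestedcv help pages before relying on this.

```r
# Hypothetical modifier: keep the n_keep columns of x most correlated with y.
# Assumed calling convention: function(y, x, ...) returning a classed object.
# x is expected to be a numeric matrix with column names.
top_cor_filter <- function(y, x, n_keep = 10) {
  score <- abs(cor(x, as.numeric(y)))[, 1]
  keep <- names(sort(score, decreasing = TRUE))[seq_len(min(n_keep, ncol(x)))]
  structure(list(keep = keep), class = "top_cor_filter")
}

# predict() method: applies the columns learned on training data to any
# partition, without looking at that partition's outcome
predict.top_cor_filter <- function(object, newdata, ...) {
  newdata[, object$keep, drop = FALSE]
}

# Stand-alone check on simulated data
set.seed(3)
x_demo <- matrix(rnorm(40 * 8), 40, 8,
                 dimnames = list(NULL, paste0("g", 1:8)))
y_demo <- 2 * x_demo[, "g2"] + rnorm(40, sd = 0.1)

mod <- top_cor_filter(y_demo, x_demo, n_keep = 3)
head(predict(mod, x_demo))
```

Under the same assumption, such a function would be supplied as modifyX = "top_cor_filter" together with modifyX_useY = TRUE, with n_keep set through modifyX_options.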

multiDEGGs_filter(): Pure Differential Network-Based Selection

The multiDEGGs_filter() function performs feature selection based entirely on differential network analysis. It identifies significant differential molecular interactions and can return either the interaction pairs alone or both pairs and individual variables involved in those interactions.

Key Parameters

When using multiDEGGs_filter(), you can control the following parameters through modifyX_options:

  • keep_single_genes (logical, default FALSE): Controls whether to include individual genes from significant pairs in addition to the pairs themselves
  • nfilter (integer, default NULL): Maximum number of predictors to return. When NULL, all significant interactions found are included

Usage Examples

Basic Usage: Pairs Only

library(multiDEGGs)
library(nestedcv)
data("synthetic_metadata")
data("synthetic_rnaseqData")

# Regularized linear model with interaction pairs only
fit.glmnet <- nestcv.glmnet(
  y = as.numeric(synthetic_metadata$response),
  x = t(synthetic_rnaseqData),
  modifyX = "multiDEGGs_filter",
  modifyX_options = list(
    keep_single_genes = FALSE,
    nfilter = 20
  ),
  modifyX_useY = TRUE,
  n_outer_folds = 5,
  n_inner_folds = 6,
  verbose = FALSE
)

summary(fit.glmnet)
#> Nested cross-validation with glmnet
#> No filter
#> Modifier:  multiDEGGs_filter 
#> Outer loop:  5-fold CV
#> Inner loop:  6-fold CV
#> 100 observations, 14 predictors
#> 
#>        alpha  lambda n.filter
#> Fold 1   0.1 0.10555        7
#> Fold 2   0.1 0.09944        7
#> Fold 3   0.1 0.08392        7
#> Fold 4   1.0 0.14202        7
#> Fold 5   0.1 0.15576        7
#> 
#> Final parameters:
#>  lambda    alpha  
#> 0.05894  0.10000  
#> 
#> Final coefficients:
#>  (Intercept) TNF:TNFRSF1A    AKT2:MTOR   IL1B:IL1R2    FASLG:FAS TGFB3:TGFBR1 
#>     1.823874    -0.193020    -0.119887     0.052089    -0.035947    -0.033527 
#> MAP2K2:MAPK3  FANCD2:FAN1 
#>    -0.021308    -0.008862 
#> 
#> Result:
#>        RMSE     R.squared   Pearson.r^2           MAE   
#>      0.4663        0.1074        0.1081        0.4391

Including Individual Genes (keep_single_genes = TRUE)

# Random forest model including both pairs and individual genes
fit.rf <- nestcv.train(
  y = synthetic_metadata$response,
  x = t(synthetic_rnaseqData),
  method = "rf",
  modifyX = "multiDEGGs_filter",
  modifyX_options = list(
    keep_single_genes = TRUE,
    nfilter = 30
  ),
  modifyX_useY = TRUE,
  n_outer_folds = 5,
  n_inner_folds = 6,
  verbose = FALSE
)

fit.rf$summary
#>                Reference
#> Predicted       Non_responder Responder
#>   Non_responder            56         1
#>   Responder                 2        41
#> 
#>               AUC            Accuracy   Balanced accuracy   
#>            0.9760              0.9700              0.9709

# Plot ROC on outer folds
plot(fit.rf$roc)

How nfilter works with keep_single_genes

  • When keep_single_genes = FALSE: nfilter limits only the number of interaction pairs returned
  • When keep_single_genes = TRUE: nfilter limits the combined count of unique individual genes plus interaction pairs. The function prioritizes pairs by significance and adds individual genes as needed until the limit is reached

multiDEGGs_combined_filter(): Hybrid Statistical and Network-Based Selection

The multiDEGGs_combined_filter() function combines traditional statistical feature selection with differential network analysis. This hybrid approach allows you to benefit from both conventional univariate selection methods and the biological insights from interaction analysis.

Key Parameters

  • filter_method (character): Statistical method for single feature selection.
    Options: "ttest", "wilcoxon", "ranger", "glmnet", "pls"
  • nfilter (integer): Maximum number of features to select
  • dynamic_nfilter (logical): Controls how nfilter is applied (see detailed explanation below)
  • keep_single_genes (logical): When dynamic_nfilter = TRUE, determines whether to include individual genes from multiDEGGs pairs

Dynamic vs. Balanced Selection Modes

Dynamic Selection (dynamic_nfilter = TRUE)

In dynamic mode, the function:
1. Selects nfilter single genes using the chosen statistical method
2. Adds ALL significant interaction pairs found by multiDEGGs
3. Total predictors = nfilter single genes + number of significant pairs

This mode allows the feature space to expand based on the biological complexity discovered in each fold.

# Dynamic selection with t-test for single genes
fit.dynamic <- nestcv.glmnet(
  y = as.numeric(synthetic_metadata$response),
  x = t(synthetic_rnaseqData),
  modifyX = "multiDEGGs_combined_filter",
  modifyX_options = list(
    filter_method = "ttest", 
    nfilter = 20,
    dynamic_nfilter = TRUE, 
    keep_single_genes = FALSE
  ),
  modifyX_useY = TRUE,
  n_outer_folds = 5,
  n_inner_folds = 6,
  verbose = FALSE
)

Balanced Selection (dynamic_nfilter = FALSE)

In balanced mode, the function:
1. Allocates approximately half of nfilter to interaction pairs
2. Fills remaining slots with single genes from the statistical filter
3. Maintains consistent total number of predictors across all folds

This mode ensures a fixed feature space size while balancing single genes and interactions.

# Balanced selection with the Wilcoxon rank-sum test for single genes
fit.balanced <- nestcv.train(
  y = synthetic_metadata$response,
  x = t(synthetic_rnaseqData),
  method = "rf",
  modifyX = "multiDEGGs_combined_filter",
  modifyX_options = list(
    filter_method = "wilcoxon", 
    nfilter = 40,
    dynamic_nfilter = FALSE
  ),
  modifyX_useY = TRUE,
  n_outer_folds = 5,
  n_inner_folds = 6,
  verbose = FALSE
)
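
The difference between the dynamic and balanced modes comes down to simple arithmetic on the number of predictors. The helper below is hypothetical and only illustrates the description above. In particular, it assumes that "approximately half" means floor(nfilter / 2), capped by the number of significant pairs actually found, which may not match the package's exact rule.

```r
# Hypothetical helper, not part of multiDEGGs: expected predictor counts
count_predictors <- function(nfilter, n_sig_pairs, dynamic_nfilter) {
  if (dynamic_nfilter) {
    # Dynamic: nfilter single genes plus every significant pair
    c(single_genes = nfilter,
      pairs = n_sig_pairs,
      total = nfilter + n_sig_pairs)
  } else {
    # Balanced (assumed rule): up to half of nfilter goes to pairs,
    # single genes fill the remaining slots
    pairs <- min(n_sig_pairs, nfilter %/% 2)
    c(single_genes = nfilter - pairs,
      pairs = pairs,
      total = nfilter)
  }
}

# Seven significant pairs found in a fold
count_predictors(nfilter = 20, n_sig_pairs = 7, dynamic_nfilter = TRUE)
count_predictors(nfilter = 40, n_sig_pairs = 7, dynamic_nfilter = FALSE)
```

With seven significant pairs, the dynamic call yields 27 predictors in total, whereas the balanced call stays at 40 regardless of how many pairs a fold produces.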

Available Statistical Methods

  • "ttest": Two-sample t-test for differential expression
  • "wilcoxon": Wilcoxon rank-sum test (non-parametric alternative to t-test)
  • "ranger": Random Forest variable importance scoring (the ranger package must be installed first)
  • "glmnet": Elastic net regularization coefficients
  • "pls": Partial Least Squares variable importance

Practical considerations

Before implementing multiDEGGs in your machine learning pipeline, it’s highly recommended to first run a preliminary analysis on your complete dataset to assess the number of differential interactions detected. This exploratory step can guide your choice of approach and parameter settings.

If multiDEGGs identifies only a small number of differential interactions (e.g., fewer than 10-20 pairs), these features alone may lack sufficient predictive power. In such cases, consider:

  • Using multiDEGGs_combined_filter() to integrate network-based features with traditional statistical selection methods
  • Setting keep_single_genes = TRUE in multiDEGGs_filter() to include individual genes involved in the differential pairs
  • Adjusting the percentile_vector or significance thresholds in the initial multiDEGGs analysis to potentially capture more interactions

Conversely, if a large number of differential interactions are detected, multiDEGGs_filter() alone may provide sufficient feature diversity for effective model training.

Feature Engineering Details

Both functions create ratio-based features from significant gene pairs (Gene A / Gene B), which capture the relative expression relationships that drive differential network connectivity. The predict methods automatically handle the feature transformation for both training and test data within each cross-validation fold, ensuring no information leakage.
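
As a rough illustration of the transformation, the sketch below builds ratio columns from a fixed set of gene pairs and applies the same pairs to a training and a test matrix. add_ratio_features() is a hypothetical helper, the expression values are simulated, and the "A:B" column naming is inferred from the coefficient names printed earlier in this document. The package's own handling (for example of zero denominators) is not reproduced here.

```r
# Hypothetical helper: one ratio column per gene pair, named "A:B".
# `pairs` is a two-column character matrix (numerator, denominator).
add_ratio_features <- function(x, pairs) {
  ratios <- x[, pairs[, 1], drop = FALSE] / x[, pairs[, 2], drop = FALSE]
  colnames(ratios) <- paste(pairs[, 1], pairs[, 2], sep = ":")
  ratios
}

set.seed(7)
genes <- c("TNF", "TNFRSF1A", "AKT2", "MTOR")
# Simulated strictly positive expression values (no zero denominators)
x_train <- matrix(rexp(6 * 4) + 1, 6, 4, dimnames = list(NULL, genes))
x_test  <- matrix(rexp(3 * 4) + 1, 3, 4, dimnames = list(NULL, genes))

# In a real fold, the pairs would be learned from the training partition only
pairs <- rbind(c("TNF", "TNFRSF1A"), c("AKT2", "MTOR"))

train_feats <- add_ratio_features(x_train, pairs)
test_feats  <- add_ratio_features(x_test, pairs)
colnames(test_feats)
```

Because the pairs are fixed before the test matrix is touched, the test features depend on the training data only through the choice of pairs.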

Note: If no significant differential interactions are found in a particular fold, both functions automatically fall back to t-test-based selection to ensure robust performance across all scenarios. This fallback is indicated by a printed “0” during execution.

Citation

citation("multiDEGGs")
#> To cite package 'multiDEGGs' in publications use:
#> 
#>   Sciacca E, et al. (2023). "DEGGs: An R package with shiny app for the
#>   identification of differentially expressed gene-gene interactions in
#>   high-Throughput sequencing data." _Bioinformatics_, *39*, btad192.
#>   doi:10.1093/bioinformatics/btad192
#>   <https://doi.org/10.1093/bioinformatics/btad192>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {DEGGs: An R package with shiny app for the identification of differentially expressed gene-gene interactions in high-Throughput sequencing data},
#>     author = {Elisabetta Sciacca and {et al.}},
#>     journal = {Bioinformatics},
#>     year = {2023},
#>     volume = {39},
#>     pages = {btad192},
#>     doi = {10.1093/bioinformatics/btad192},
#>   }