Contents

1 Overview

1.1 Motivation

The identification of reproducible biological patterns from high-dimensional omics data is a key factor in understanding the biology of complex disease or traits. Incorporating prior biological knowledge into machine learning is an important step in advancing such research.

1.2 Deliverables

We have implemented a biologically informed multi-stage machine learning framework termed BioMM [1] specifically for phenotype prediction using omics-scale data based on prior biological information including gene ontological (GO) and KEGG pathways.

Features of BioMM in a nutshell:

  1. Applicability for various omics data modalities (e.g. methylome, transcriptomics, genomics).
  2. Various biological stratification strategies.
  3. Prioritizing outcome-associated functional patterns.
  4. End-to-end prediction at the individual level based on biological stratified patterns.
  5. Possibility for an extension to machine learning models of interest.
  6. Parallel computing.

2 Getting started

2.1 Installation and dependencies

  • Install BioMM from Bioconductor (R 4.0):
## Do not execute if you have already installed BioMM.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# The following initializes usage of Bioc devel
BiocManager::install(version='devel')
BiocManager::install("BioMM")

BioMM installation from Github

install.packages("devtools")
library("devtools")
install_github("transbioZI/BioMM", build_vignettes=TRUE)
  • Load required libraries
library(BioMM)
library(BiocParallel) 
library(ranger)
library(rms)
library(glmnet)
library(e1071)
library(precrec)
library(vioplot)
library(CMplot)
library(imager)
library(topGO)
library(xlsx)

3 Omics data

A wide range of omics data is supported for BioMM including whole-genome DNA methylation, transcriptome-wide gene expression and genome-wide SNP data. Other types of omics data that can map into pathways are also encouraging.
For a better understanding of the BioMM framework, we used two small examplar datasets: one genome-wide DNA methylation data consisting of 20 subjects and 26486 CpGs, and one genome-wide gene expression data comprising of 20 subjects and 15924 genes for demonstration.

## DNA methylation data 
methylData <- readRDS(system.file("extdata", "/methylData.rds", package="BioMM"))
# The first column is the label, and the rest are the features (e.g., CpGs) 
head(methylData[,1:4])
##           label cg00000292 cg00002426 cg00003994
## GSM951168     0      0.538      0.054      0.029
## GSM951173     0      0.508      0.052      0.038
## GSM951174     0      0.547      0.044      0.047
## GSM951177     0      0.409      0.048      0.034
## GSM951178     0      0.374      0.044      0.051
## GSM951179     0      0.598      0.088      0.064
# 0: control and 1: patient
table(methylData[,1]) 
## 
##  0  1 
## 10 10
dim(methylData)
## [1]    20 26487
## Gene expression data
expData <- readRDS(system.file("extdata", "/expData.rds", package="BioMM"))
# The first column is the label, and the rest are the features (e.g., genes) 
head(expData[,1:4])
##           label     1    10   100
## GSM943251     0 5.242 8.807 4.514
## GSM943252     0 5.286 8.394 5.238
## GSM943253     0 5.009 8.251 5.162
## GSM943254     0 5.209 8.131 4.559
## GSM943255     0 5.125 8.448 4.936
## GSM943256     0 4.913 8.902 4.603
# 0: control and 1: patient
table(expData[,1]) 
## 
##  0  1 
## 10 10
dim(expData)
## [1]    20 15925

4 Feature stratification

Features like CpGs, genes or SNPs can be mapped into pathways based on genomic location and pathway annotation, as implemented in the function omics2pathlist(). The examples of pathway databases are gene ontology (GO), KEGG and Reactome, which are widely used public repositories. Gene ontological and KEGG pathways are used in this tutorial.

## Load feature annotation data
featureAnno <- readRDS(system.file("extdata", "cpgAnno.rds", package="BioMM"))
# The mapping between CpGs and genes (i.e. entrezID or gene symbol)
head(featureAnno)
##           ID chr entrezID symbol
## 1 cg00000292  16      487 ATP2A1
## 2 cg00002426   3     7871  SLMAP
## 3 cg00003994   7     4223  MEOX2
## 4 cg00005847   2     3232  HOXD3
## 5 cg00006414   7    57541 ZNF398
## 6 cg00007981  11    24145  PANX1
# total number of CpGs under investigation
str(unique(featureAnno[,1]))
##  chr [1:26486] "cg00000292" "cg00002426" "cg00003994" "cg00005847" "cg00006414" "cg00007981" "cg00008493" "cg00008713" ...
## Reprocessed Gene ontological pathway annotation with 10 and 200 genes for each pathway
golist <- readRDS(system.file("extdata", "goDB.rds", package="BioMM")) 
## Number of investigated biological processes
length(golist)
## [1] 2944
str(golist[1:3])
## List of 3
##  $ GO:0000002: Named chr [1:12] "291" "1890" "4205" "4358" ...
##   ..- attr(*, "names")= chr [1:12] "TAS" "IMP" "ISS" "IMP" ...
##  $ GO:0000012: Named chr [1:11] "3981" "7141" "7515" "23411" ...
##   ..- attr(*, "names")= chr [1:11] "IDA" "IDA" "IEA" "IMP" ...
##  $ GO:0000027: Named chr [1:29] "4839" "6122" "6123" "6125" ...
##   ..- attr(*, "names")= chr [1:29] "IMP" "IBA" "IBA" "IBA" ...
## Reprocessed KEGG pathway annotation with 10 and 200 genes for each pathway
kegglist <- readRDS(system.file("extdata", "keggDB.rds", package="BioMM"))  
## Number of investigated KEGG pathways 
length(kegglist)
## [1] 293
str(kegglist[1:3]) 
## List of 3
##  $ hsa00010: chr [1:67] "10327" "124" "125" "126" ...
##  $ hsa00020: chr [1:30] "1431" "1737" "1738" "1743" ...
##  $ hsa00030: chr [1:30] "132158" "2203" "221823" "226" ...

To annotate pathways, we demonstrate the usage of omics2pathlist() function based on two different pathway databases and two data modalities as follows.

## Feature annotation to pathways 
## Use 100 pathways to reduce the runtime for the downstream analysis. But if possible, please make sure to use all.
numPath <- 100

# GO pathway mapping using DNA methylation data
golistSub <- golist[seq_len(numPath)]
methylGOlist <- omics2pathlist(data=methylData, pathlistDB=golistSub, 
                               featureAnno=featureAnno, 
                               restrictUp=200, restrictDown=10, minPathSize=10) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   21.00   39.00   67.25   78.00  337.00
# KEGG pathway mapping using DNA methylation data
kegglistSub <- kegglist[seq_len(numPath)]
methylKEGGlist <- omics2pathlist(data=methylData, pathlistDB=kegglistSub, 
                                 featureAnno=featureAnno, 
                                 restrictUp=200, restrictDown=10, minPathSize=10) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   16.00   37.75   58.00   74.57   89.25  286.00
# GO pathway mapping using gene expression data
golistSub <- golist[seq_len(numPath)]
expGOlist <- omics2pathlist(data=expData, pathlistDB=golistSub, 
                            featureAnno=NULL, 
                            restrictUp=200, restrictDown=10, minPathSize=10) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   15.25   24.50   37.91   43.75  174.00
# KEGG pathway mapping using gene expression data
kegglistSub <- kegglist[seq_len(numPath)]
expKEGGlist <- omics2pathlist(data=expData, pathlistDB=kegglistSub, 
                              featureAnno=NULL, 
                              restrictUp=200, restrictDown=10, minPathSize=10) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   22.00   32.50   40.55   50.25  135.00

5 BioMM framework

5.1 Recapitulation

Briefly, the BioMM framework consists of two learning stages [1]. During the first stage, biological meta-information is used to ‘compress’ the variables of the original dataset into pathway-level ‘latent variables’ (henceforth called stage-2 data) using either supervised or unsupervised learning models (stage-1 models). In the second stage, a supervised model (stage-2 model) is built using the stage-2 data with non-negative outcome-associated features for final prediction.

5.1.1 Interface to machine learning models

The end-to-end prediction is performed using BioMM() function. Both supervised and unsupervised learning are implemented in the BioMM framework, which is indicated by the argument supervisedStage1=TRUE or supervisedStage1=FALSE. Commonly used supervised classifiers: generalized regression models with lasso, ridge or elastic net regularization (GLM) [4], support vector machine (SVM) [3] and random forest [2] are included. For the unsupervised method, regular or sparse constrained principal component analysis (PCA) [5] is used. predMode indicates the prediction type. In the case of classification setting, the “probability” or “classification” mode can be used. Generic resampling methods include cross-validation (CV) and bootstrapping (BS) procedures as the argument resample1="CV" or resample1="BS". Stage-2 data is reconstructed using either resampling methods during machine learning prediction or independent test set prediction if the argument testData is provided. For more details, please check BioMM() in the manual.

5.1.1.1 BioMM with Random Forest

To apply BioMM with the Random Forest model, we use the argument supervisedStage1=TRUE and classifier=randForest in BioMM(). DNA methylation data mapping to GO pathways is used.

## To reduce the runtime, only use a subset of DNA methylation data
## However, if possible, subsetting the data is not suggested.
trainData <- methylData[,1:3000]  
trainDataY <- trainData[,1]
testData <- NULL

## Model parameters
supervisedStage1=TRUE
classifier <- "randForest"
predMode <- "probability"
paramlist <- list(ntree=100, nthreads=20)   
core <- MulticoreParam(workers = 10) 

set.seed(123)
result <- BioMM(trainData=trainData, testData=NULL, 
                pathlistDB=golistSub, featureAnno, 
                restrictUp=200, restrictDown=10, minPathSize=10, 
                supervisedStage1, typePCA="regular", 
                resample1="BS", resample2="CV", dataMode="allTrain",
                repeatA1=50, repeatA2=1, repeatB1=10, repeatB2=1, 
                nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
                cutP1=0.05, cutP2=0.05, fdr2=NULL, 
                FScore=MulticoreParam(workers = 1), 
                classifier, predMode, paramlist, innerCore=core)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   13.75   16.50   20.46   27.25   45.00
if (is.null(testData)) {
    metricCV <- getMetrics(dataY = trainDataY, predY = result)
    message("Cross-validation prediction performance:")
    print(metricCV)
} else if (!is.null(testData)){
    testDataY <- testData[,1]  
    metricCV <- getMetrics(dataY = trainDataY, cvYscore = result[[1]])
    metricTest <- getMetrics(dataY = testDataY, testYscore = result[[2]])
    message("Cross-validation performance:")
    print(metricCV)
    message("Test set prediction performance:")
    print(metricTest)
}
##    AUC AUCPR ACC    R2
## R2 0.7 0.621 0.7 0.193

Other machine learning models can be employed with the following respective parameter settings. For the classifier "SVM", parameters can be tuned using an internal cross-validation if tuneP=TRUE. For generalized regression model glmnet, elastic net is specified by the input argument alpha=0.5. Alternatively, alpha=1 is for the lasso and alpha=0 is the ridge. For the unsupervised learning supervisedStage1=FALSE, regular PCA typePCA="regular" is applied and followed with random forest classification classifier2=TRUE.

5.1.2 Interface to biological stratification

For the stratification of predictors using biological information, various strategies can be applied. In this tutorial, BioMM() integrates GO and KEGG pathway based stratification, which not only accounts for epistasis between stage-1 features within the functional category, but also considers the interaction between pathway-level features. Therefore, this may provide more value-relevant information on biological insight into the underlying phenotype.

5.1.2.1 BioMM with KEGG pathways

To apply BioMM with the random forest model, we use the argument supervisedStage1=TRUE and classifier=randForest in BioMM(). Gene expression data mapping to KEGG pathways is demonstrated.

## to reduce the runtime, only use a subset of gene expression data
## However, if possible, subsetting the data is not suggested.
trainData <- expData[,1:3000] 
trainDataY <- trainData[,1] 
testData <- NULL 
## Only for gene expression data
featureAnno=NULL
## Model parameters
supervisedStage1=TRUE
classifier <- "randForest"
predMode <- "probability"
paramlist <- list(ntree=100, nthreads=20)   
core <- MulticoreParam(workers = 10) 

set.seed(123)
result <- BioMM(trainData=trainData, testData=NULL, 
                pathlistDB=kegglistSub, featureAnno, 
                restrictUp=200, restrictDown=10, minPathSize=10, 
                supervisedStage1, typePCA="regular", 
                resample1="BS", resample2="CV", dataMode="allTrain",
                repeatA1=50, repeatA2=1, repeatB1=10, repeatB2=1, 
                nfolds=10, FSmethod1=NULL, FSmethod2=NULL, 
                cutP1=0.05, cutP2=0.05, fdr2=NULL, 
                FScore=MulticoreParam(workers = 1), 
                classifier, predMode, paramlist, innerCore=core)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   12.00   15.00   16.44   22.00   25.00
## Cross-validation is applied on the training data, therefore 'result' only returns the CV predicted score.
metricCV <- getMetrics(dataY = trainDataY, predY = result)
message("Cross-validation prediction performance:")
print(metricCV)
##      AUC AUCPR ACC    R2
## R2 0.845 0.716 0.8 0.441

5.2 Stage-2 data exploration

5.2.1 Generation of stage-2 data

Here we use BioMM with the Random Forest method on gene expression data incorporating KEGG pathways to create stage-2 pathway-level data.

## Define the omics type 
# omicsType <- "methylation"
omicsType <- "expression"
pathType <- "GO"
pathType <- "KEGG"
if (omicsType == "methylation" & pathType == "GO"){
    studylist <- methylGOlist
} else if (omicsType == "methylation" & pathType == "KEGG"){
    studylist <- methylKEGGlist
} else if (omicsType == "expression" & pathType == "GO"){
    studylist <- expGOlist
} else if (omicsType == "expression" & pathType == "KEGG"){
    studylist <- expKEGGlist
} else {
    stop("Wrong specified omicsType and pathType!")
} 

length(studylist)
## [1] 100
## Model parameters 
classifier <- "randForest"
predMode <- "probability"
paramlist <- list(ntree=100, nthreads=20)   
core <- MulticoreParam(workers = 10) 

set.seed(123)
stage2dataA <- reconBySupervised(trainDataList=studylist, 
                   testDataList=NULL,
                   resample="BS", dataMode="allTrain",
                   repeatA=50, repeatB=1, nfolds=10,
                   FSmethod=NULL, cutP=0.05, fdr=NULL, 
                   FScore=MulticoreParam(workers = 1),
                   classifier, predMode, paramlist,
                   innerCore=core, outFileA=NULL, outFileB=NULL)
## Check stage-2 data
dim(stage2dataA)
## [1]  20 101
print(table(stage2dataA[,1]))
## 
##  0  1 
## 10 10
head(stage2dataA[,1:4])
##           label hsa00010 hsa00020 hsa00030
## GSM943251     0    0.502    0.554    0.515
## GSM943252     0    0.488    0.483    0.433
## GSM943253     0    0.546    0.604    0.463
## GSM943254     0    0.413    0.523    0.410
## GSM943255     0    0.538    0.642    0.418
## GSM943256     0    0.506    0.551    0.397

5.2.2 Feature Visualization

5.2.2.1 Explained variation of stage-2 data

The distribution of the proportion of variance explained for the individual generated feature of stage-2 data for the classification task is illustrated plotVarExplained() below. Nagelkerke pseudo R-squared measure is used to compute the explained variance. The argument posF=TRUE indicates that only positively outcome-associated features are plotted since negative associations likely reflect random effects in the underlying data [6].

core <- MulticoreParam(workers = 10)   
fileName <- paste0(omicsType,"_", pathType, "_featuresVarExplained.png")
plotVarExplained(data=stage2dataA, posF=TRUE, binarize=FALSE, core=core, 
                 pathTitle=paste0(pathType, " pathways"), fileName)
## png 
##   2
plot(load.image(fileName)) 

5.2.2.2 Prioritization of outcome-associated functional patterns

plotRankedFeature() is employed to rank and visualize the outcome-associated features from stage-2 data. The argument topF=10 and posF=TRUE are used to define the top 10 positively outcome-associated features. The negative log P value using logistic regression is utilized to evaluate the importance of the ranked features as indicated by the argument rankMetric="negPlogit". Other metrics including Nagelkerke pseudo R-squared “R2”, and Z score “Zscore” measure are also available (see plotRankedFeature in the manual for details). The size of the respective pathway is pictured as the argument colorMetric="size".

core <- MulticoreParam(workers = 1)   
rankMetric <- "negPlogit" 
filePrefix <- paste0(omicsType, "_", pathType, "_topPath_", rankMetric)
topPath <- plotRankedFeature(data=stage2dataA, 
                             posF=TRUE, topF=10, binarize=FALSE, 
                             blocklist=studylist,  
                             rankMetric=rankMetric, 
                             colorMetric="size",  core, 
                             pathTitle=paste0(pathType, " pathways"), 
                             fileName=paste0(filePrefix, ".png"))
plot(load.image(paste0(filePrefix, ".png")))   

The statistical metrics and descriptions of these above top pathways are shown below:

## Report the top pathways
if (pathType == "GO"){
    
  goterms = unlist(Term(GOTERM))  
  topGOID = gsub("\\.", ":", rownames(topPath))
  subTerm = goterms[is.element(names(goterms), topGOID)] 
  topIDmatch = subTerm[match(topGOID, names(subTerm))]  
  topPath <- data.frame(topPath, Description=topIDmatch)
  
} else if (pathType == "KEGG"){
  ## A matching list between KEGG ID and names. Data freezes on Aug 2020
  keggID2name <- readRDS(system.file("extdata", "/keggID2name202008.rds", 
                                     package="BioMM"))  
  keggSub <- keggID2name[is.element(keggID2name[,"ID"], rownames(topPath)),]
  topIDmatch <- keggSub[match(rownames(topPath), keggSub[,"ID"]),] 
  topPath <- data.frame(topPath, Description=topIDmatch[,"name"])
}

print(topPath) 
##            AUC    R2 Zscore negPlogit negPwilcox       ID size
## hsa00030 0.865 0.507  2.325     1.698      2.188 hsa00030   23
## hsa03010 0.825 0.487  2.280     1.646      1.809 hsa03010  123
## hsa00532 0.810 0.383  2.243     1.604      1.732 hsa00532   17
## hsa00240 0.870 0.489  2.114     1.462      2.410 hsa00240   47
## hsa00480 0.815 0.359  2.076     1.421      1.720 hsa00480   51
## hsa00983 0.770 0.338  2.002     1.344      1.346 hsa00983   74
## hsa00561 0.940 0.731  1.980     1.322      3.000 hsa00561   53
## hsa00760 0.745 0.314  1.919     1.260      1.158 hsa00760   28
## hsa03060 0.765 0.257  1.784     1.129      1.307 hsa03060   22
## hsa00051 0.750 0.228  1.760     1.105      1.194 hsa00051   30
##                                                                      Description
## hsa00030                                               Pentose phosphate pathway
## hsa03010                                                                Ribosome
## hsa00532 Glycosaminoglycan biosynthesis - chondroitin sulfate / dermatan sulfate
## hsa00240                                                   Pyrimidine metabolism
## hsa00480                                                  Glutathione metabolism
## hsa00983                                         Drug metabolism - other enzymes
## hsa00561                                                 Glycerolipid metabolism
## hsa00760                                  Nicotinate and nicotinamide metabolism
## hsa03060                                                          Protein export
## hsa00051                                         Fructose and mannose metabolism
write.xlsx(topPath,file=paste0(filePrefix, ".xlsx"))
# write.table(topPath,file=paste0(filePrefix, ".txt"), sep="\t")

5.2.2.3 The significance of CpGs in pathways of interest

cirPlot4pathway() illustrates the significance of the individual CpGs (for DNA methylation data) or genes (for gene expression data) falling into a set of pathways. Here the top 10 outcome-associated pathways are investigated. Negative log P value is used to define the significance of each CpG or genes within these pathways.

core <- MulticoreParam(workers = 10)   
pathID <- gsub("\\.", ":", rownames(topPath))
## The number of top pathways must be bigger than overall investigated pathways
pathSet <- studylist[is.element(names(studylist), pathID)]
pathMatch <- pathSet[match(pathID, names(pathSet))]
fileName <- paste0(omicsType, "_", pathType, "_SigRankBy_", rankMetric)

cirPlot4pathway(datalist=pathMatch, topPathID=names(pathMatch), core, fileName)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
##  Circular-Manhattan Plotting pval.
##  Plots are stored in: /tmp/RtmpX1UN5a/Rbuild1313e45bf62a5f/BioMM/vignettes

5.3 Computational consideration

BioMM with supervised models at both stages incorporating pathway based stratification method will take longer to run than unsupervised approaches. But the prediction is more powerful. Therefore, we suggest the former even if the computation is more demanding, as the adoption of the next-generation technology (e.g., 5G) is pushing advances in computational storage and speed.

Furthermore, the stability of BioMM prediction is often facilitated with the increasing number of resampling repetitions and some other related model parameters such as the number of trees used in the Random Forest model. Finally, parallel computing is implemented and recommended for such a scenario. In this vignette, due to the runtime, we only showcased the smaller examples and models with less computation.

6 Session information

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_GB                 LC_COLLATE=C                 
##  [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
##  [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] xlsx_0.6.5           topGO_2.48.0         GO.db_3.15.0         AnnotationDbi_1.58.0 IRanges_2.30.0       S4Vectors_0.34.0    
##  [7] Biobase_2.56.0       graph_1.74.0         BiocGenerics_0.42.0  imager_0.42.13       magrittr_2.0.3       CMplot_4.0.0        
## [13] vioplot_0.3.7        zoo_1.8-10           sm_2.2-5.7           precrec_0.12.9       e1071_1.7-9          glmnet_4.1-4        
## [19] Matrix_1.4-1         rms_6.3-0            SparseM_1.81         Hmisc_4.7-0          ggplot2_3.3.5        Formula_1.2-4       
## [25] survival_3.3-1       lattice_0.20-45      ranger_0.13.1        BiocParallel_1.30.0  BioMM_1.12.0         BiocStyle_2.24.0    
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.4.1        igraph_1.3.1           splines_4.2.0          GenomeInfoDb_1.32.0    TH.data_1.1-1         
##   [6] digest_0.6.29          foreach_1.5.2          htmltools_0.5.2        magick_2.7.3           tiff_0.1-11           
##  [11] fansi_1.0.3            checkmate_2.1.0        memoise_2.0.1          nsprcomp_0.5.1-2       cluster_2.1.3         
##  [16] Biostrings_2.64.0      matrixStats_0.62.0     sandwich_3.0-1         jpeg_0.1-9             colorspace_2.0-3      
##  [21] blob_1.2.3             xfun_0.30              dplyr_1.0.8            crayon_1.5.1           RCurl_1.98-1.6        
##  [26] jsonlite_1.8.0         iterators_1.0.14       glue_1.6.2             gtable_0.3.0           zlibbioc_1.42.0       
##  [31] XVector_0.36.0         MatrixModels_0.5-0     shape_1.4.6            scales_1.2.0           mvtnorm_1.1-3         
##  [36] DBI_1.1.2              Rcpp_1.0.8.3           htmlTable_2.4.0        foreign_0.8-82         bit_4.0.4             
##  [41] proxy_0.4-26           htmlwidgets_1.5.4      httr_1.4.2             RColorBrewer_1.1-3     ellipsis_0.3.2        
##  [46] farver_2.1.0           pkgconfig_2.0.3        rJava_1.0-6            nnet_7.3-17            sass_0.4.1            
##  [51] utf8_1.2.2             labeling_0.4.2         tidyselect_1.1.2       rlang_1.0.2            munsell_0.5.0         
##  [56] tools_4.2.0            cachem_1.0.6           cli_3.3.0              generics_0.1.2         RSQLite_2.2.12        
##  [61] evaluate_0.15          stringr_1.4.0          fastmap_1.1.0          yaml_2.3.5             knitr_1.38            
##  [66] bit64_4.0.5            purrr_0.3.4            KEGGREST_1.36.0        readbitmap_0.1.5       nlme_3.1-157          
##  [71] quantreg_5.88          compiler_4.2.0         rstudioapi_0.13        png_0.1-7              tibble_3.1.6          
##  [76] bslib_0.3.1            stringi_1.7.6          highr_0.9              vctrs_0.4.1            pillar_1.7.0          
##  [81] lifecycle_1.0.1        BiocManager_1.30.17    jquerylib_0.1.4        data.table_1.14.2      bitops_1.0-7          
##  [86] R6_2.5.1               latticeExtra_0.6-29    bookdown_0.26          gridExtra_2.3          bmp_0.3               
##  [91] codetools_0.2-18       polspline_1.1.20       MASS_7.3-57            assertthat_0.2.1       xlsxjars_0.6.1        
##  [96] withr_2.5.0            multcomp_1.4-19        GenomeInfoDbData_1.2.8 parallel_4.2.0         grid_4.2.0            
## [101] rpart_4.1.16           class_7.3-20           rmarkdown_2.14         base64enc_0.1-3

7 References

[1] NIPS ML4H submission: Chen, J. and Schwarz, E., 2017. BioMM: Biologically-informed Multi-stage Machine learning for identification of epigenetic fingerprints. arXiv preprint arXiv:1712.00336.

[2] Breiman, L. (2001). “Random forests.” Machine learning 45(1): 5-32.

[3] Cortes, C., & Vapnik, V. (1995). “Support-vector networks.” Machine learning 20(3): 273-297.

[4] Friedman, J., Hastie, T., & Tibshirani, R. (2010). “Regularization paths for generalized linear models via coordinate descent.” Journal of statistical software 33(1): 1.

[5] Wold, S., Esbensen, K., & Geladi, P. (1987). “Principal component analysis.” Chemometrics and intelligent laboratory systems 2(1-3): 37-52.

[6] Claudia Perlich and Grzegorz Swirszcz. On cross-validation and stacking: Building seemingly predictive models on random data. ACM SIGKDD Explorations Newsletter, 12(2):11-15, 2011.