marr 1.6.0
Reproducibility is an on-going challenge with high-throughput technologies that have been developed in the last two decades for quantifying a wide range of biological processes. One of the main difficulties faced by researchers is the variability of output across replicate experiments (Li et al. (2011)). Several authors have addressed the issue of reproducibility among high-throughput experiments (Porazinska et al. (2010), Marioni et al. (2008), AC’t Hoen et al. (2013)). In each high-throughput experiment (e.g., arrays, sequencing, mass spectrometry), a large number of features are measured simultaneously, and candidates are often subjected for follow-up statistical analysis. We use the term features to refer to biological features (e.g., metabolites, genes) resulting from a high-throughput experiment in the rest of this article. When measurements show consistency across replicate experiments, we define that measurement to be reproducible. Similarly, measurements that are not consistent across replicates may be problematic and should be identified. In this vignette, features that show consistency across high-dimensional replicate experiments are termed reproducible and the ones that are not consistent are termed irreproducible. The reproducibility of a high-throughput experiment primarily depends on the technical variables, such as run time, technical replicates, laboratory operators and biological variables, such as healthy and diseased subjects. A critical step toward making optimal design choices is to assess how these biological and technical variables affect reproducibility across replicate experiments (Talloen et al. (2010), Arvidsson et al. (2008)).
In this vignette, we introduce the marr procedure Philtron et al. (2018), referred to as maximum rank reproducibility (marr) to identify reproducible features in high-throughput replicate experiments. In this vignette, we demonstrate with an example data set that the (ma)ximum (r)ank (r)eproducibility (marr) procedure can be adapted to high-throughput MS-Metabolomics experiments across (biological or technical) replicate samples (Ghosh et al, 2020, in preparation).
The marr procedure was originally proposed to assess reproducibility of gene
ranks in replicate experiments. The marr
R-package contains the Marr()
function, which calculates a matrix of signals (\(\text{irreproducible}=0\),
\(\text{reproducible}=1\)) with \(M\) rows (total number of features) and \(J\)
columns (\(J={I \choose 2}\)) (replicate sample pairs \({I \choose 2}\)), where
\(J\) is the total possible number of sample pairs of replicate experiments.
We assign feature \(m\) to be reproducible if a certain percentage signals
(\(100c_s\%\)) are reproducible for pairwise combinations of replicate
experiments, i.e.,
if \[ \frac{{\sum_{i<i'}{{{r_{m,{(i,i')}}}}}}}{J} >c_s, \]
such that, \(c_s \in (0,1)\).
Similarly, we assign a sample pair \((i,~i')\) to be reproducible if a certain percentage signals (\(100c_m\%\)) are reproducible across all features, i.e., if \[ \frac{\sum_{m}{{r{_{m,(i,i')}}}}}{M}>c_m, \] such that, \(c_m \in (0,1)\).
The reproducible signal matrix is shown in Figure 1 below.The marr package contains a pre-processed data SummarizedExperiment
assay
object of 645 metabolites (features) measured in plasma and 20 biological
replicates from the multi-center Genetic Epidemiology of COPD (COPDGene) study
which was designed to study the underlying genetic factors of COPD,
(Regan et al. (2011)). We only used a subset of the original raw COPD data in this
vignette.
The msprepCOPD data in the marr package was pre-processed using the
MSPrep software (Hughes et al. (2013)). The data pre-processing include \(3\) steps and
they are as follows:
1. Filtering
: Metabolites are removed if they are missing more
than \(80\%\) of the samples, (Bijlsma et al. (2006), Chong et al. (2018)).
Originally, there were 662 metabolites in the raw data. After filtering,
645 metabolites remain.
2. Missing value imputation technique
: We apply Bayesian Principal Component
Analysis (BPCA) to impute missing values (Hastie et al. (1999)).
3. Normalization
: Median normalization are performed.
## class: SummarizedExperiment
## dim: 645 20
## metadata(0):
## assays(1): abundance
## rownames: NULL
## rowData names(3): Mass Retention.Time Compound.Name
## colnames(20): 10062C 10071D ... 10473 10544U
## colData names(1): subject_id
Marr()
functionMarr()
The Marr()
function must have one object as input:
1. object
: a data frame or a matrix or a SummarizedExperiment object with
abundance measurements of metabolites (features) on the rows and replicates
(samples) as the columns. Marr()
accepts objects which are a data frame or
matrix with observations (e.g. metabolites) on the rows and replicates as the
columns.
2. pSamplepairs
: optional We assign a metabolite (feature) for a replicate
sample pair to be reproducible using a threshold value of pSamplepairs
(\(c_s=0.75\)).
3. pFeatures
: optional We assign a sample pair for a metabolite (feature)
to be reproducible using a threshold value of pFeatures
(\(c_m=0.75\)).
4. alpha
: optional level of significance to control the False Discovery
rate (FDR). Default is \(0.05\) (i.e., \(\alpha=0.05\)).
Marr()
We apply the Marr procedure to assess the reproducibility of replicates in the
msprepCOPD data. The distribution of reproducible pairs and metabolites
(features) are illustrated in Figures 2 and 3, respectively.
To run the Marr()
function, we only input the data object. We obtain 4 outputs
after running the Marr()
function. They are shown below:
library(marr)
Marr_output<- Marr(msprepCOPD, pSamplepairs =
0.75, pFeatures = 0.75, alpha=0.05)
Marr_output
## Marr: Maximum Rank Reproducibility
## MarrSamplepairs (length = 190 ):
## sampleOne sampleTwo reproducibility
## 1 10062C 10071D 60.000
## 2 10062C 10087S 64.651
## 3 10062C 10097V 60.620
## ...
## MarrFeatures (length = 645 ):
## Mass Retention.Time Compound.Name reproducibility
## 1 58.054 0.511 Aminoacetaldehyde 100
## 2 71.074 3.216 3-Buten-1-amine 100
## 3 85.089 2.119 2-Amino-3-methyl-1-butanol 100
## ...
## Mass Retention.Time Compound.Name reproducibility
## 1 58.0537 0.5107867 Aminoacetaldehyde 100.00000
## 2 71.0741 3.2161212 3-Buten-1-amine 100.00000
## 3 85.0893 2.1190255 2-Amino-3-methyl-1-butanol 100.00000
## 4 87.0687 0.5137375 4-Aminobutyraldehyde 16.31579
## 5 102.0782 0.6029379 N-Nitrosodiethylamine 20.52632
## 6 103.1000 3.8783060 Neurine 100.00000
## sampleOne sampleTwo reproducibility
## 1 10062C 10071D 60.00000
## 2 10062C 10087S 64.65116
## 3 10062C 10097V 60.62016
## 4 10062C 101020 51.78295
## 5 10062C 10104S 53.33333
## 6 10062C 10136F 55.50388
## Percent of reproducible sample pairs per metabolite (feature)
##greater than 75%
MarrFeaturesfiltered(Marr_output)
## [1] 49.45736
## Percent of reproducible metabolites (features) per sample pair
## greater than 75%
MarrSamplepairsfiltered(Marr_output)
## [1] 7.368421
The distribution of reproducible metabolites/features (sample pairs) per sample
pair (metabolite) can be extracted using theMarrSamplepairs()
(MarrFeatures()
) function (see above). The distribution of reproducible
metabolites/features and sample pairs can plotted using the
MarrPlotSamplepairs()
and MarrPlotFeatures()
functions, respectively (see
below).
Figure 2 illustrates percentage of reproducible metabolites (features) per sample pair in the \(x\)-axis. In Figure 2, the higher percentage of reproducible metabolites (features) per sample pair in the \(x\)-axis would indicate stronger reproducibility between the sample pairs.
Figure 3 illustrates percentage of reproducible sample pairs per metabolite (feature) in the \(x\)-axis. In Figure 3, the higher percentage of reproducible sample pairs per metabolite (feature) in the \(x\)-axis would indicate stronger reproducibility of a metabolite (feature) across all sample pairs.
## class: SummarizedExperiment
## dim: 319 20
## metadata(1): removedFeatures
## assays(1): abundance
## rownames: NULL
## rowData names(3): Mass Retention.Time Compound.Name
## colnames(20): 10062C 10071D ... 10473 10544U
## colData names(1): subject_id
## class: SummarizedExperiment
## dim: 645 16
## metadata(1): removedSamples
## assays(1): abundance
## rownames: NULL
## rowData names(3): Mass Retention.Time Compound.Name
## colnames(16): 10062C 10071D ... 10465Y 10473
## colData names(1): subject_id
## class: SummarizedExperiment
## dim: 319 16
## metadata(2): removedSamples removedFeatures
## assays(1): abundance
## rownames: NULL
## rowData names(3): Mass Retention.Time Compound.Name
## colnames(16): 10062C 10071D ... 10465Y 10473
## colData names(1): subject_id
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] marr_1.6.0 knitr_1.38 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.8.3 lattice_0.20-45
## [3] assertthat_0.2.1 digest_0.6.29
## [5] utf8_1.2.2 R6_2.5.1
## [7] GenomeInfoDb_1.32.0 stats4_4.2.0
## [9] evaluate_0.15 ggplot2_3.3.5
## [11] highr_0.9 pillar_1.7.0
## [13] zlibbioc_1.42.0 rlang_1.0.2
## [15] jquerylib_0.1.4 magick_2.7.3
## [17] S4Vectors_0.34.0 Matrix_1.4-1
## [19] rmarkdown_2.14 labeling_0.4.2
## [21] stringr_1.4.0 RCurl_1.98-1.6
## [23] munsell_0.5.0 DelayedArray_0.22.0
## [25] compiler_4.2.0 xfun_0.30
## [27] pkgconfig_2.0.3 BiocGenerics_0.42.0
## [29] htmltools_0.5.2 tidyselect_1.1.2
## [31] SummarizedExperiment_1.26.0 tibble_3.1.6
## [33] GenomeInfoDbData_1.2.8 bookdown_0.26
## [35] IRanges_2.30.0 matrixStats_0.62.0
## [37] fansi_1.0.3 crayon_1.5.1
## [39] dplyr_1.0.8 bitops_1.0-7
## [41] grid_4.2.0 jsonlite_1.8.0
## [43] gtable_0.3.0 lifecycle_1.0.1
## [45] DBI_1.1.2 magrittr_2.0.3
## [47] scales_1.2.0 cli_3.3.0
## [49] stringi_1.7.6 farver_2.1.0
## [51] XVector_0.36.0 bslib_0.3.1
## [53] ellipsis_0.3.2 generics_0.1.2
## [55] vctrs_0.4.1 tools_4.2.0
## [57] Biobase_2.56.0 glue_1.6.2
## [59] purrr_0.3.4 MatrixGenerics_1.8.0
## [61] fastmap_1.1.0 yaml_2.3.5
## [63] colorspace_2.0-3 BiocManager_1.30.17
## [65] GenomicRanges_1.48.0 sass_0.4.1
AC’t Hoen, Peter, Marc R Friedländer, Jonas Almlöf, Michael Sammeth, Irina Pulyakhina, Seyed Yahya Anvar, Jeroen FJ Laros, et al. 2013. “Reproducibility of High-Throughput mRNA and Small Rna Sequencing Across Laboratories.” Nature Biotechnology 31 (11): 1015.
Arvidsson, Samuel, Miroslaw Kwasniewski, Diego Mauricio Riaño-Pachón, and Bernd Mueller-Roeber. 2008. “QuantPrime–a Flexible Tool for Reliable High-Throughput Primer Design for Quantitative Pcr.” BMC Bioinformatics 9 (1): 465.
Bijlsma, Sabina, Ivana Bobeldijk, Elwin R Verheij, Raymond Ramaker, Sunil Kochhar, Ian A Macdonald, Ben Van Ommen, and Age K Smilde. 2006. “Large-Scale Human Metabolomics Studies: A Strategy for Data (Pre-) Processing and Validation.” Analytical Chemistry 78 (2): 567–74.
Chong, Jasmine, Othman Soufan, Carin Li, Iurie Caraus, Shuzhao Li, Guillaume Bourque, David S Wishart, and Jianguo Xia. 2018. “MetaboAnalyst 4.0: Towards More Transparent and Integrative Metabolomics Analysis.” Nucleic Acids Research 46 (W1): W486–W494.
Hastie, Trevor, Robert Tibshirani, Gavin Sherlock, Michael Eisen, Patrick Brown, and David Botstein. 1999. “Imputing Missing Data for Gene Expression Arrays.”
Hughes, Grant, Charmion Cruickshank-Quinn, Richard Reisdorph, Sharon Lutz, Irina Petrache, Nichole Reisdorph, Russell Bowler, and Katerina Kechris. 2013. “MSPrep—Summarization, Normalization and Diagnostics for Processing of Mass Spectrometry–Based Metabolomic Data.” Bioinformatics 30 (1): 133–34.
Li, Qunhua, James B Brown, Haiyan Huang, Peter J Bickel, and others. 2011. “Measuring Reproducibility of High-Throughput Experiments.” The Annals of Applied Statistics 5 (3): 1752–79.
Marioni, John C, Christopher E Mason, Shrikant M Mane, Matthew Stephens, and Yoav Gilad. 2008. “RNA-Seq: An Assessment of Technical Reproducibility and Comparison with Gene Expression Arrays.” Genome Research 18 (9): 1509–17.
Philtron, Daisy, Yafei Lyu, Qunhua Li, and Debashis Ghosh. 2018. “Maximum Rank Reproducibility: A Nonparametric Approach to Assessing Reproducibility in Replicate Experiments.” Journal of the American Statistical Association 113 (523): 1028–39.
Porazinska, Dorota L, WAY Sung, ROBIN M GIBLIN-DAVIS, and W Kelley Thomas. 2010. “Reproducibility of Read Numbers in High-Throughput Sequencing Analysis of Nematode Community Composition and Structure.” Molecular Ecology Resources 10 (4): 666–76.
Regan, Elizabeth A., John E. Hokanson, James R. Murphy, Barry Make, David A. Lynch, Terri H. Beaty, Douglas Curran-Everett, Edwin K. Silverman, and James D. Crapo. 2011. “Genetic Epidemiology of Copd (Copdgene) Study Design.” COPD: Journal of Chronic Obstructive Pulmonary Disease 7 (1): 32–43. https://doi.org/10.3109/15412550903499522.
Talloen, Willem, Sepp Hochreiter, Luc Bijnens, Adetayo Kasim, Ziv Shkedy, Dhammika Amaratunga, and Hinrich Göhlmann. 2010. “Filtering Data from High-Throughput Experiments Based on Measurement Reliability.” Proceedings of the National Academy of Sciences 107 (46): E173–E174.