if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("GenomicSuperSignature")
library(GenomicSuperSignature)
You can download GenomicSuperSignature from Google Cloud bucket using
GenomicSuperSignature::getModel
function. Currently available models are
built from top 20 PCs of 536 studies (containing 44,890 samples) containing
13,934 common genes from each of 536 study’s top 90% varying genes based on
their study-level standard deviation. There are two versions of this RAVmodel
annotated with different gene sets for GSEA: MSigDB C2 (C2
) and three
priors from PLIER package (PLIERpriors
). In this vignette, we are showing
the C2
annotated model.
Note that the first interactive run of this code, you will be asked to allow
R to create a cache directory. The model file will be stored there and
subsequent calls to getModel
will read from the cache.
RAVmodel <- getModel("C2", load=TRUE)
#> [1] "downloading"
RAVindex
is a matrix containing genes in rows and RAVs in columns. colData
slot provides the information on each RAVs, such as GSEA annotation and
studies involved in each cluster. metadata
slot stores model construction
information. trainingData
slot contains the information on individual
studies in training dataset, such as MeSH terms assigned to each study.
RAVmodel
#> class: PCAGenomicSignatures
#> dim: 13934 4764
#> metadata(8): cluster size ... version geneSets
#> assays(1): RAVindex
#> rownames(13934): CASKIN1 DDX3Y ... CTC-457E21.9 AC007966.1
#> rowData names(0):
#> colnames(4764): RAV1 RAV2 ... RAV4763 RAV4764
#> colData names(4): RAV studies silhouetteWidth gsea
#> trainingData(2): PCAsummary MeSH
#> trainingData names(536): DRP000987 SRP059172 ... SRP164913 SRP188526
version(RAVmodel)
#> [1] "1.1.1"
geneSets(RAVmodel)
#> [1] "MSigDB C2 version 7.1"
Replicable Axis of Variation (RAV) index is the main component of
GenomicSuperSignature. It serves as an index connecting new datasets and the
existing database. You can access it through GenomicSuperSignature::RAVindex
(equivalent of SummarizedExperiment::assay
). Rows are genes and columns are
RAVs.
Here, RAVmodel consists of 13,934 genes and 4,764 RAVs.
class(RAVindex(RAVmodel))
#> [1] "matrix" "array"
dim(RAVindex(RAVmodel))
#> [1] 13934 4764
RAVindex(RAVmodel)[1:4, 1:4]
#> RAV1 RAV2 RAV3 RAV4
#> CASKIN1 0.002449532 -2.845471e-03 0.001601715 -0.009366156
#> DDX3Y 0.006093438 1.870528e-03 -0.004346019 -0.006094526
#> MDGA1 0.009999975 2.610302e-03 -0.002831447 -0.006710757
#> EIF1AY 0.007923652 8.507021e-05 0.002322275 -0.007850784
Metadata slot of RAVmodel contains information related to the model building.
names(metadata(RAVmodel))
#> [1] "cluster" "size" "k" "n" "MeSH_freq"
#> [6] "updateNote" "version" "geneSets"
cluster
: cluster membership of each PCs from the training datasetsize
: an integer vector with the length of clusters, containing the number
of PCs in each clusterk
: the number of all clusters in the given RAVmodeln
: the number of top PCs kept from each study in the training datasetgeneSets
: the name of gene sets used for GSEA annotationMeSH_freq
: the frequency of MeSH terms associated with the training
dataset. MeSH terms like ‘Humans’ and ‘RNA-seq’ are top ranked (which is very
expected) because the training dataset of this model is Human RNA sequencing
data.updateNote
: a brief note on the given model’s specificationversion
: the version of the given modelhead(metadata(RAVmodel)$cluster)
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4 DRP000987.PC5
#> 1 2 3 4 5
#> DRP000987.PC6
#> 6
head(metadata(RAVmodel)$size)
#> RAV1 RAV2 RAV3 RAV4 RAV5 RAV6
#> 6 21 4 7 3 3
metadata(RAVmodel)$k
#> [1] 4764
metadata(RAVmodel)$n
#> [1] 20
geneSets(RAVmodel)
#> [1] "MSigDB C2 version 7.1"
head(metadata(RAVmodel)$MeSH_freq)
#>
#> Humans RNA-Seq Transcriptome
#> 1163 591 576
#> RNA Whole Exome Sequencing Sequencing, RNA
#> 372 284 277
updateNote(RAVmodel)
#> [1] "GSEA with set.seed"
metadata(RAVmodel)$version
#> [1] "1.1.1"
You can find which studies are in each cluster using studies
method. Output is
a list with the length of clusters, where each element is a character vector
containing the name of studies in each cluster.
length(studies(RAVmodel))
#> [1] 4764
studies(RAVmodel)[1:3]
#> $Cl4764_01
#> [1] "DRP000987" "SRP059172" "SRP059959" "SRP118741" "SRP144178" "SRP183186"
#>
#> $Cl4764_02
#> [1] "DRP000987" "ERP022839" "ERP104463" "ERP106996" "ERP109789" "ERP114435"
#> [7] "SRP059039" "SRP059172" "SRP065317" "SRP071547" "SRP073061" "SRP074274"
#> [13] "SRP075449" "SRP092158" "SRP096736" "SRP107337" "SRP111402" "SRP125008"
#> [19] "SRP133442" "SRP135684" "SRP155483"
#>
#> $Cl4764_03
#> [1] "DRP000987" "SRP001540" "SRP056840" "SRP125396"
You can check which PC from different studies are in RAVs using PCinRAV
.
PCinRAV(RAVmodel, 2)
#> [1] "DRP000987.PC2" "ERP022839.PC2" "ERP104463.PC1" "ERP106996.PC2"
#> [5] "ERP109789.PC2" "ERP114435.PC1" "SRP059039.PC3" "SRP059172.PC2"
#> [9] "SRP065317.PC1" "SRP071547.PC3" "SRP073061.PC2" "SRP074274.PC5"
#> [13] "SRP075449.PC2" "SRP092158.PC2" "SRP096736.PC2" "SRP107337.PC3"
#> [17] "SRP111402.PC1" "SRP125008.PC2" "SRP133442.PC1" "SRP135684.PC4"
#> [21] "SRP155483.PC2"
Silhouette width ranges from -1 to 1 for each cluster. Typically, it is
interpreted as follows:
- Values close to 1 suggest that the observation is well matched to the
assigned cluster
- Values close to 0 suggest that the observation is borderline matched
between two clusters
- Values close to -1 suggest that the observations may be assigned to the
wrong cluster
For RAVmodel, the average silhouette width of each cluster is a quality control measure and suggested as a secondary reference to choose proper RAVs, following validation score.
x <- silhouetteWidth(RAVmodel)
head(x) # average silhouette width of the first 6 RAVs
#> 1 2 3 4 5 6
#> -0.05470163 0.06426256 -0.01800335 -0.04005584 0.05786189 -0.02520973
Pre-processed GSEA results on each RAV are stored in RAVmodel and can be
accessed through gsea
function.
class(gsea(RAVmodel))
#> [1] "list"
class(gsea(RAVmodel)[[1]])
#> [1] "data.frame"
length(gsea(RAVmodel))
#> [1] 4764
gsea(RAVmodel)[1]
#> $RAV1
#> Description NES
#> REACTOME_MITOCHONDRIAL_TRANSLATION REACTOME_MITOCHONDRIAL_TRANSLATION 2.503359
#> REACTOME_TRANSLATION REACTOME_TRANSLATION 2.271857
#> CROONQUIST_STROMAL_STIMULATION_UP CROONQUIST_STROMAL_STIMULATION_UP -3.634893
#> pvalue qvalues
#> REACTOME_MITOCHONDRIAL_TRANSLATION 1.00000e-10 9.218794e-08
#> REACTOME_TRANSLATION 1.00000e-10 9.218794e-08
#> CROONQUIST_STROMAL_STIMULATION_UP 1.20798e-10 9.218794e-08
You can find MeSH terms associated with each study using mesh
method.
Output is a list with the length of studies used for training. Each element of
this output list is a data frame containing the assigned MeSH terms and the
detail of them. The last column bagOfWords
is the frequency of the MeSH term
in the whole training dataset.
length(mesh(RAVmodel))
#> [1] 536
mesh(RAVmodel)[1]
#> $DRP000987
#> score identifier concept name
#> 35822 1000 DRP000987 C0042333 Genetic Variation
#> 58627 18575 DRP000987 C0017431 Genotype
#> 149067 37162 DRP000987 C0021247 Indonesia
#> 216848 74145 DRP000987 C0030498 Parasites
#> 716101 22634 DRP000987 C4760607 RNA-Seq
#> 761253 55509 DRP000987 C0024530 Malaria
#> 832289 96999 DRP000987 C0024535 Malaria, Falciparum
#> 1116527 161422 DRP000987 C0032150 Plasmodium falciparum
#> 1230230 7197 DRP000987 C0020969 Innate Immune Response
#> 1468242 162422 DRP000987 C0003062 Animals
#> 1573542 19938 DRP000987 C0035668 RNA
#> 1596889 162422 DRP000987 C0086418 Humans
#> 1627781 36315 DRP000987 C0752248 Transcriptome Analysis
#> 1689323 57176 DRP000987 C3178810 Transcriptome
#> explanation
#> 35822 Forced Non-Leaf Node Lookup:genetic variations
#> 58627 Forced Lookup:genotyping
#> 149067 RtM via: Indonesia, RtM via: Indonesians, Forced Leaf Node Lookup:indonesians
#> 216848 RtM via: Parasites, Forced Leaf Node Lookup:parasites, checkForceMH Boosted
#> 716101 RtM via: RNA-Seq, Forced New Lookup:rna-seq
#> 761253 RtM via: Malaria, Forced Non-Leaf Node Lookup:malaria
#> 832289 RtM via: Plasmodium falciparum infection
#> 1116527 RtM via: Plasmodium falciparum, Forced Leaf Node Lookup:plasmodium falciparums, checkForceMH Boosted
#> 1230230 Entry Term Replacement for "Immunity, Innate", RtM via: Immunity, Innate, Forced Non-Leaf Node Lookup:innate immune responses
#> 1468242 Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1573542 RtM via: RNA
#> 1596889 Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, CT Text Lookup: patients, Forced Leaf Node Lookup:humans, CT Text Lookup: human, CT Text Lookup: humans, CT Text Lookup: patient, Parasites, CT Treecode Lookup: B01.050.500.714 (Parasites)
#> 1627781 Entry Term Replacement for "Gene Expression Profiling", RtM via: Gene Expression Profiling, Forced Non-Leaf Node Lookup:transcriptome analysis
#> 1689323 RtM via: Transcriptome, Forced Leaf Node Lookup:transcriptomes
#> bagOfWords
#> 35822 11
#> 58627 20
#> 149067 2
#> 216848 7
#> 716101 591
#> 761253 6
#> 832289 4
#> 1116527 5
#> 1230230 8
#> 1468242 209
#> 1573542 372
#> 1596889 1163
#> 1627781 66
#> 1689323 576
PCA summary of each study can be accessed through PCAsummary
method. Output
is a list with the length of studies, where each element is a matrix containing
PCA summary results: standard deviation (SD), variance explained by each PC
(Variance), and the cumulative variance explained (Cumulative).
length(PCAsummary(RAVmodel))
#> [1] 536
PCAsummary(RAVmodel)[1]
#> $DRP000987
#> DRP000987.PC1 DRP000987.PC2 DRP000987.PC3 DRP000987.PC4
#> SD 42.6382804 21.97962707 12.94848367 11.89265640
#> Variance 0.3580557 0.09514629 0.03302091 0.02785537
#> Cumulative 0.3580557 0.45320196 0.48622287 0.51407824
#> DRP000987.PC5 DRP000987.PC6 DRP000987.PC7 DRP000987.PC8
#> SD 10.89460616 10.53655583 9.47278967 8.97268098
#> Variance 0.02337622 0.02186495 0.01767287 0.01585607
#> Cumulative 0.53745446 0.55931942 0.57699228 0.59284836
#> DRP000987.PC9 DRP000987.PC10 DRP000987.PC11 DRP000987.PC12
#> SD 8.43652010 7.90810771 7.60865968 7.076401846
#> Variance 0.01401774 0.01231676 0.01140165 0.009862254
#> Cumulative 0.60686609 0.61918285 0.63058449 0.640446748
#> DRP000987.PC13 DRP000987.PC14 DRP000987.PC15 DRP000987.PC16
#> SD 6.89881271 6.766783788 6.703417100 6.6042884
#> Variance 0.00937346 0.009018116 0.008850009 0.0085902
#> Cumulative 0.64982021 0.658838324 0.667688332 0.6762785
#> DRP000987.PC17 DRP000987.PC18 DRP000987.PC19 DRP000987.PC20
#> SD 6.44257163 6.245257583 6.042236748 5.992461574
#> Variance 0.00817466 0.007681604 0.007190294 0.007072317
#> Cumulative 0.68445319 0.692134796 0.699325091 0.706397408
sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GenomicSuperSignature_1.3.7 SummarizedExperiment_1.25.3
#> [3] Biobase_2.55.2 GenomicRanges_1.47.6
#> [5] GenomeInfoDb_1.31.10 IRanges_2.29.1
#> [7] S4Vectors_0.33.17 BiocGenerics_0.41.2
#> [9] MatrixGenerics_1.7.0 matrixStats_0.62.0
#> [11] BiocStyle_2.23.1
#>
#> loaded via a namespace (and not attached):
#> [1] bitops_1.0-7 bit64_4.0.5 filelock_1.0.2
#> [4] httr_1.4.2 doParallel_1.0.17 RColorBrewer_1.1-3
#> [7] tools_4.2.0 backports_1.4.1 bslib_0.3.1
#> [10] utf8_1.2.2 R6_2.5.1 DBI_1.1.2
#> [13] colorspace_2.0-3 GetoptLong_1.0.5 withr_2.5.0
#> [16] tidyselect_1.1.2 curl_4.3.2 bit_4.0.4
#> [19] compiler_4.2.0 cli_3.2.0 DelayedArray_0.21.2
#> [22] bookdown_0.26 sass_0.4.1 scales_1.2.0
#> [25] rappdirs_0.3.3 stringr_1.4.0 digest_0.6.29
#> [28] rmarkdown_2.13 XVector_0.35.0 pkgconfig_2.0.3
#> [31] htmltools_0.5.2 dbplyr_2.1.1 fastmap_1.1.0
#> [34] rlang_1.0.2 GlobalOptions_0.1.2 RSQLite_2.2.12
#> [37] shape_1.4.6 jquerylib_0.1.4 generics_0.1.2
#> [40] jsonlite_1.8.0 dplyr_1.0.8 car_3.0-12
#> [43] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.8
#> [46] Matrix_1.4-1 Rcpp_1.0.8.3 munsell_0.5.0
#> [49] fansi_1.0.3 abind_1.4-5 lifecycle_1.0.1
#> [52] stringi_1.7.6 yaml_2.3.5 carData_3.0-5
#> [55] zlibbioc_1.41.0 BiocFileCache_2.3.4 blob_1.2.3
#> [58] grid_4.2.0 parallel_4.2.0 crayon_1.5.1
#> [61] lattice_0.20-45 circlize_0.4.14 knitr_1.38
#> [64] ComplexHeatmap_2.11.1 pillar_1.7.0 ggpubr_0.4.0
#> [67] rjson_0.2.21 ggsignif_0.6.3 codetools_0.2-18
#> [70] glue_1.6.2 evaluate_0.15 BiocManager_1.30.16
#> [73] png_0.1-7 vctrs_0.4.1 foreach_1.5.2
#> [76] gtable_0.3.0 purrr_0.3.4 tidyr_1.2.0
#> [79] clue_0.3-60 assertthat_0.2.1 cachem_1.0.6
#> [82] ggplot2_3.3.5 xfun_0.30 broom_0.8.0
#> [85] rstatix_0.7.0 tibble_3.1.6 iterators_1.0.14
#> [88] memoise_2.0.1 cluster_2.1.3 ellipsis_0.3.2