if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("LiNk-NY/terraTCGAdata")
Some public Terra workspaces come pre-packaged with TCGA data (i.e., cloud data
resources are linked within the data model), particularly the workspaces
labelled OpenAccess_V1-0. Datasets harmonized to the hg38 genome use a
different data model / workflow and are not compatible with the functions in
this package. For compatible workspaces, we make use of the Terra data model
and represent the data as a MultiAssayExperiment.
For more information on MultiAssayExperiment, please see the vignette in
that package.
library(AnVIL)
library(terraTCGAdata)
A valid GCloud SDK installation is required to use the package. Use the
gcloud_exists() function from the AnVIL package to check whether it is
installed on your system.
gcloud_exists()
## [1] FALSE
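As a complementary check that does not depend on AnVIL, one can look for the gcloud executable on the PATH with base R. This is only a sketch: it is not necessarily how gcloud_exists() is implemented, and the names gcloud_path and has_gcloud are made up for illustration.

```r
# Sys.which() returns an empty string when the executable is not on the PATH.
gcloud_path <- Sys.which("gcloud")
has_gcloud <- unname(nzchar(gcloud_path))

if (!has_gcloud) {
    message("gcloud not found on the PATH; install the Google Cloud SDK first")
}
```

A result of FALSE here only means the executable was not found on the search path of the current R session; the SDK may still be installed elsewhere.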
You can also use the gcloud_project() function to set a project name by
specifying the project argument:
gcloud_project()
To get a list of available TCGA workspaces, use the findTCGAworkspaces()
function:
findTCGAworkspaces()
You can then set a package-wide option with the terraTCGAworkspace function
and check the setting with getOption("terraTCGAdata.workspace").
terraTCGAworkspace("TCGA_COAD_OpenAccess_V1-0_DATA")
getOption("terraTCGAdata.workspace")
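getOption() is the base R accessor for session-wide options, so the pattern can be tried without Terra access. The following is a minimal sketch, assuming only that the workspace name lives under the terraTCGAdata.workspace key shown above. It sets the option directly with options(), which may or may not be everything terraTCGAworkspace() does.

```r
# Assumption: the workspace name is stored under this option key, as the
# getOption() call in the vignette suggests.
old <- options(terraTCGAdata.workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")

# Read the value back, as done after terraTCGAworkspace().
ws <- getOption("terraTCGAdata.workspace")
ws

# Restore the previous setting so the sketch leaves the session unchanged.
options(old)
```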
In order to determine what datasets to download, use the getClinicalTable
function to list all of the columns that correspond to clinical data
from the different collection centers.
ct <- getClinicalTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
ct
names(ct)
After picking a column from the getClinicalTable output, use the column
name as input to the getClinical function to obtain the data:
column_name <- "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin"
clin <- getClinical(
    columnName = column_name,
    participants = TRUE,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
clin[, 1:6]
dim(clin)
We use the same approach for assay data. We first produce a list of assays
with the getAssayTable function and then select one along with any sample
codes of interest.
at <- getAssayTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
at
names(at)
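The assay names are long, double-underscore-delimited strings, so it can help to filter them programmatically rather than by eye. The sketch below uses base R on a character vector and assumes that names(at) returns strings like the two assay names used later in this vignette. The variable names are illustrative.

```r
# Assay names copied from this vignette; in practice they would come from
# names(at).
assay_names <- c(
    "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"
)

# Keep only the names that contain a fixed substring of interest.
rna_assays <- grep("rnaseqv2", assay_names, value = TRUE, fixed = TRUE)
rna_assays

# Split one name on the double underscore to inspect its parts.
parts <- strsplit(assay_names[1], "__", fixed = TRUE)[[1]]
parts
```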
You can get a summary table of all the samples in the data by using the
sampleTypesTable function:
sampleTypesTable(workspace = "TCGA_COAD_OpenAccess_V1-0_DATA")
Note that if you have the package-wide option set, the workspace argument is not needed in the function call.
prot <- getAssayData(
    assayName = "protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
    sampleCode = c("01", "10"),
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA",
    sampleIdx = 1:4
)
head(prot)
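The sampleCode values "01" and "10" are TCGA sample type codes. In the standard TCGA barcode layout they occupy characters 14 and 15, with 01 denoting a primary solid tumor and 10 a blood-derived normal. The sketch below pulls those codes out of barcodes with base R. The barcodes are invented for illustration, and this is not necessarily how getAssayData filters samples internally.

```r
# Hypothetical barcodes, invented for illustration only.
barcodes <- c(
    "TCGA-AA-0001-01A-01R-0000-07",
    "TCGA-AA-0001-10A-01D-0000-01",
    "TCGA-AA-0002-11A-01R-0000-07"
)

# Characters 14-15 of a TCGA barcode hold the two-digit sample type code.
sample_codes <- substr(barcodes, 14, 15)

# Keep only primary tumors ("01") and blood-derived normals ("10").
keep <- barcodes[sample_codes %in% c("01", "10")]
keep
```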
Finally, once you have collected all the relevant column names, these can be
inputs to the main terraTCGAdata function:
mae <- terraTCGAdata(
    clinicalName = "clin__bio__nationwidechildrens_org__Level_1__biospecimen__clin",
    assays =
        c("protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data",
        "rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data"),
    sampleCode = NULL,
    split = FALSE,
    sampleIdx = 1:4,
    workspace = "TCGA_COAD_OpenAccess_V1-0_DATA"
)
mae
We expect that most OpenAccess_V1-0 cancer datasets follow this data model.
If you encounter any errors, please provide a minimal reproducible example
at https://github.com/waldronlab/terraTCGAdata.
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] terraTCGAdata_1.0.0 MultiAssayExperiment_1.22.0
## [3] SummarizedExperiment_1.26.0 Biobase_2.56.0
## [5] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0
## [7] IRanges_2.30.0 S4Vectors_0.34.0
## [9] BiocGenerics_0.42.0 MatrixGenerics_1.8.0
## [11] matrixStats_0.62.0 AnVIL_1.8.0
## [13] dplyr_1.0.8 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 tidyr_1.2.0 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] futile.options_1.0.1 rapiclient_0.1.3 evaluate_0.15
## [10] httr_1.4.2 pillar_1.7.0 zlibbioc_1.42.0
## [13] rlang_1.0.2 jquerylib_0.1.4 Matrix_1.4-1
## [16] rmarkdown_2.14 stringr_1.4.0 RCurl_1.98-1.6
## [19] DelayedArray_0.22.0 compiler_4.2.0 xfun_0.30
## [22] pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.2
## [25] tibble_3.1.6 GenomeInfoDbData_1.2.8 bookdown_0.26
## [28] codetools_0.2-18 fansi_1.0.3 crayon_1.5.1
## [31] bitops_1.0-7 grid_4.2.0 jsonlite_1.8.0
## [34] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.3
## [37] formatR_1.12 cli_3.3.0 stringi_1.7.6
## [40] XVector_0.36.0 futile.logger_1.4.3 bslib_0.3.1
## [43] ellipsis_0.3.2 generics_0.1.2 vctrs_0.4.1
## [46] lambda.r_1.2.4 tools_4.2.0 glue_1.6.2
## [49] purrr_0.3.4 parallel_4.2.0 fastmap_1.1.0
## [52] yaml_2.3.5 BiocManager_1.30.17 knitr_1.39
## [55] sass_0.4.1