From the Genomic Data Commons (GDC) website:
The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardization of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonized using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonizes these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.
The data model for the GDC is complex, but it is worth a quick overview; a graphical representation is included here.
The GDC API exposes these nodes and edges in a somewhat simplified set of RESTful endpoints.
This quickstart section is just meant to show basic functionality. More details of functionality are included further on in this vignette and in function-specific help.
This software is available at Bioconductor.org and can be downloaded via BiocManager::install.
To report bugs or problems, either submit a new issue or run bug.report(package='GenomicDataCommons') from within R (which will redirect you to the new issue page on GitHub).
Installation can be achieved via Bioconductor’s BiocManager package.
if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install('GenomicDataCommons')
library(GenomicDataCommons)
The GenomicDataCommons package relies on having network connectivity. In addition, the NCI GDC API must also be operational and not under maintenance. The status() function can be used to check this connectivity and functionality.
GenomicDataCommons::status()
## $commit
## [1] "4dd3680528a19ed33cfc83c7d049426c97bb903b"
##
## $data_release
## [1] "Data Release 35.0 - September 28, 2022"
##
## $status
## [1] "OK"
##
## $tag
## [1] "3.0.0"
##
## $version
## [1] 1
And to check the status in code:
stopifnot(GenomicDataCommons::status()$status=="OK")
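In scripts, it can help to turn this check into a guard that also survives network errors. The following is a minimal sketch in base R, not part of the package: gdc_is_up() is a hypothetical helper that accepts any status-like function, and the two stand-in functions only simulate responses so that the example runs offline.

```r
# Hypothetical helper: TRUE only if the status function returns status "OK".
# Errors (for example, no network) are caught and treated as "not available".
gdc_is_up <- function(status_fn) {
  res <- tryCatch(status_fn(), error = function(e) NULL)
  is.list(res) && identical(res$status, "OK")
}

# Stand-ins that simulate a healthy API and a failed connection
fake_ok <- function() list(status = "OK", tag = "3.0.0")
fake_down <- function() stop("could not resolve host")

up <- gdc_is_up(fake_ok)
down <- gdc_is_up(fake_down)
```

With the package loaded, gdc_is_up(GenomicDataCommons::status) would run the same guard against the real API.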
The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds gene expression files quantified as raw counts using STAR from ovarian cancer patients.
ge_manifest <- files() %>%
filter( cases.project.project_id == 'TCGA-OV') %>%
filter( type == 'gene_expression' ) %>%
filter( analysis.workflow_type == 'STAR - Counts') %>%
manifest()
head(ge_manifest)
The query above specifies 762 gene expression files; the code below downloads the first 20 of them. Using multiple processes for the download can speed up the transfer significantly in many cases. On a standard 1Gb connection, the following completes in about 30 seconds. The first time the data are downloaded, R will ask to create a cache directory (see ?gdc_cache for details of setting and interacting with the cache). Downloaded files are stored in the cache directory. Future access to the same files will come directly from the cache, avoiding repeated downloads.
fnames <- lapply(ge_manifest$id[1:20], gdcdata)
If the download had included controlled-access data, the download above would have needed to include a token. Details are available in the authentication section below.
Accessing clinical data is a very common task. Given a set of case_ids, the gdc_clinical() function will return a list of four tibbles.
case_ids = cases() %>% results(size=10) %>% ids()
clindat = gdc_clinical(case_ids)
names(clindat)
## [1] "demographic" "diagnoses" "exposures" "main"
head(clindat[["main"]])
head(clindat[["diagnoses"]])
The GenomicDataCommons package can access the significant clinical, demographic, biospecimen, and annotation information contained in the NCI GDC. The gdc_clinical() function will often be all that is needed, but the API and GenomicDataCommons package offer much flexibility if fine-tuning is required.
expands = c("diagnoses","annotations",
"demographic","exposures")
clinResults = cases() %>%
GenomicDataCommons::select(NULL) %>%
GenomicDataCommons::expand(expands) %>%
results(size=50)
str(clinResults[[1]],list.len=6)
## chr [1:50] "b9a32a1c-9c93-5a92-8b30-e09a91dc3cfc" ...
# or listviewer::jsonedit(clinResults)
This package design is meant to have some similarities to the “hadleyverse” approach of dplyr. Roughly, the functionality for finding and accessing files and metadata can be divided into:
In addition, there are auxiliary functions for asking the GDC API for information about available and default fields, slicing BAM files, and downloading actual data files. Here is an overview of functionality (see individual function and methods documentation for specific details).
projects()
cases()
files()
annotations()
filter()
facet()
select()
mapping()
available_fields()
default_fields()
grep_fields()
available_values()
available_expand()
results()
count()
response()
gdcdata()
transfer()
gdc_client()
aggregations()
gdc_token()
slicing()
There are two main classes of operations when working with the NCI GDC: querying metadata and downloading data files. Both classes of operation are reviewed in detail in the following sections.
Vast amounts of metadata about cases (patients, basically), files, projects, and so-called annotations are available via the NCI GDC API. Typically, one will want to query metadata to either focus in on a set of files for download or transfer or to perform so-called aggregations (pivot-tables, facets, similar to the R table() functionality).
Querying metadata starts with creating a “blank” query. One will often then want to filter the query to limit results prior to retrieving results. The GenomicDataCommons package has helper functions for listing fields that are available for filtering.
In addition to fetching results, the GDC API allows faceting, or aggregating, which is useful for compiling reports, generating dashboards, or building user interfaces to GDC data (see GDC web query interface for a non-R-based example).
A query of the GDC starts its life in R. Queries follow the four metadata endpoints available at the GDC. In particular, there are four convenience functions that each create GDCQuery objects (actually, specific subclasses of GDCQuery):
projects()
cases()
files()
annotations()
pquery = projects()
The pquery object is now an object of (S3) class GDCQuery (and gdc_projects and list). The object contains the following elements:
fields: the fields to be returned when results are retrieved. If no fields are specified to, for example, the projects() function, the default fields from the GDC are used (see default_fields()).
filters: set by the filter() method and used to filter results on retrieval.
facets: the field names used for aggregating data in a call to aggregations().
Looking at the actual object (get used to using str()!), note that the query contains no results.
str(pquery)
## List of 5
## $ fields : chr [1:10] "dbgap_accession_number" "disease_type" "intended_release_date" "name" ...
## $ filters: NULL
## $ facets : NULL
## $ legacy : logi FALSE
## $ expand : NULL
## - attr(*, "class")= chr [1:3] "gdc_projects" "GDCQuery" "list"
[ GDC pagination documentation ]
With a query object available, the next step is to retrieve results from the GDC. The most basic type of results we can get is a simple count() of records available that satisfy the filter criteria. Note that we have not set any filters, so a count() here will represent all the project records publicly available at the GDC in the “default” archive.
pcount = count(pquery)
# or
pcount = pquery %>% count()
pcount
## [1] 72
The results() method will fetch actual results.
presults = pquery %>% results()
These results are returned from the GDC in JSON format and converted into a (potentially nested) list in R. The str() method is useful for taking a quick glimpse of the data.
str(presults)
## List of 9
## $ id : chr [1:10] "TARGET-NBL" "GENIE-GRCC" "GENIE-DFCI" "GENIE-NKI" ...
## $ primary_site :List of 10
## ..$ TARGET-NBL: chr [1:20] "Retroperitoneum and peritoneum" "Lymph nodes" "Stomach" "Connective, subcutaneous and other soft tissues" ...
## ..$ GENIE-GRCC: chr [1:45] "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" "Ovary" ...
## ..$ GENIE-DFCI: chr [1:49] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
## ..$ GENIE-NKI : chr [1:42] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
## ..$ GENIE-VICC: chr [1:46] "Bronchus and lung" "Adrenal gland" "Gallbladder" "Esophagus" ...
## ..$ GENIE-UHN : chr [1:42] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
## ..$ GENIE-MDA : chr [1:42] "Eye and adnexa" "Uterus, NOS" "Ovary" "Other and unspecified urinary organs" ...
## ..$ GENIE-MSK : chr [1:49] "Eye and adnexa" "Other and ill-defined sites in lip, oral cavity and pharynx" "Uterus, NOS" "Rectum" ...
## ..$ GENIE-JHU : chr [1:33] "Eye and adnexa" "Uterus, NOS" "Rectum" "Ovary" ...
## ..$ FM-AD : chr [1:42] "Bronchus and lung" "Esophagus" "Cervix uteri" "Other and unspecified female genital organs" ...
## $ dbgap_accession_number: chr [1:10] "phs000467" NA NA NA ...
## $ project_id : chr [1:10] "TARGET-NBL" "GENIE-GRCC" "GENIE-DFCI" "GENIE-NKI" ...
## $ disease_type :List of 10
## ..$ TARGET-NBL: chr [1:2] "Neuroepitheliomatous Neoplasms" "Not Applicable"
## ..$ GENIE-GRCC: chr [1:32] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" ...
## ..$ GENIE-DFCI: chr [1:52] "Osseous and Chondromatous Neoplasms" "Other Leukemias" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
## ..$ GENIE-NKI : chr [1:23] "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" "Transitional Cell Papillomas and Carcinomas" ...
## ..$ GENIE-VICC: chr [1:43] "Neoplasms, NOS" "Adnexal and Skin Appendage Neoplasms" "Squamous Cell Neoplasms" "Gliomas" ...
## ..$ GENIE-UHN : chr [1:39] "Other Leukemias" "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
## ..$ GENIE-MDA : chr [1:34] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Fibromatous Neoplasms" "Myomatous Neoplasms" ...
## ..$ GENIE-MSK : chr [1:49] "Osseous and Chondromatous Neoplasms" "Synovial-like Neoplasms" "Lymphoid Leukemias" "Fibromatous Neoplasms" ...
## ..$ GENIE-JHU : chr [1:33] "Osseous and Chondromatous Neoplasms" "Other Leukemias" "Synovial-like Neoplasms" "Lymphoid Leukemias" ...
## ..$ FM-AD : chr [1:23] "Gliomas" "Acinar Cell Neoplasms" "Specialized Gonadal Neoplasms" "Miscellaneous Tumors" ...
## $ name : chr [1:10] "Neuroblastoma" "AACR Project GENIE - Contributed by Institut Gustave Roussy" "AACR Project GENIE - Contributed by Dana-Farber Cancer Institute" "AACR Project GENIE - Contributed by Netherlands Cancer Institute" ...
## $ releasable : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ state : chr [1:10] "open" "open" "open" "open" ...
## $ released : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
## - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
## - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
By default, only 10 records are returned. We can use the size and from arguments to results() to either page through results or to change the number of results. Finally, there is a convenience method, results_all(), that will simply fetch all the available results given a query. Note that results_all() may take a long time and return HUGE result sets if not used carefully. Using a combination of count() and results() to get a sense of the expected data size is probably warranted before calling results_all().
length(ids(presults))
## [1] 10
presults = pquery %>% results_all()
length(ids(presults))
## [1] 72
# includes all records
length(ids(presults)) == count(pquery)
## [1] TRUE
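Paging through results with size and from amounts to computing a sequence of offsets. The sketch below uses only base R and assumes that from is a zero-based offset (an assumption, check the results() help page); page_offsets() is a hypothetical helper, not a package function.

```r
# Hypothetical helper: offsets for paging through `total` records, `size` at a time.
# Assumes `from` is a zero-based offset.
page_offsets <- function(total, size = 10L) {
  if (total <= 0L) return(integer(0))
  seq.int(0L, total - 1L, by = size)
}

# 72 project records, fetched 10 at a time
offsets <- page_offsets(72L, size = 10L)
# Number of records expected on each page
page_sizes <- pmin(10L, 72L - offsets)

# With a real query, each page could then be fetched along these lines:
# pages <- lapply(offsets, function(o) results(pquery, size = 10, from = o))
```

Checking that sum(page_sizes) matches count(pquery) before fetching is a cheap way to confirm that the paging plan covers every record.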
Extracting subsets of results or manipulating the results into a more conventional R data structure is not easily generalizable. However, the purrr, rlist, and data.tree packages are all potentially of interest for manipulating complex, nested list structures. For viewing the results in an interactive viewer, consider the listviewer package.
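As one possible approach in base R, list-valued columns can be collapsed into delimited strings to obtain a flat data frame. This is only a sketch on toy data shaped like the projects results above; flatten_results() and the toy values are made up for illustration, and collapsing loses the nested structure.

```r
# Toy stand-in for a results object: two vectors and one list-valued element
toy_results <- list(
  id = c("PROJ-A", "PROJ-B"),
  name = c("Project A", "Project B"),
  primary_site = list(`PROJ-A` = c("Stomach", "Ovary"), `PROJ-B` = "Rectum")
)

# Hypothetical helper: collapse list-valued elements into one string per record
flatten_results <- function(res, sep = "; ") {
  cols <- lapply(res, function(col) {
    if (is.list(col)) {
      vapply(col, function(v) paste(v, collapse = sep), character(1),
             USE.NAMES = FALSE)
    } else {
      col
    }
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

flat <- flatten_results(toy_results)
```

For deeper nesting, the packages mentioned above are likely a better fit than this one-level collapse.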
Central to querying and retrieving data from the GDC is the ability to specify which fields to return, filtering by fields and values, and faceting or aggregating. The GenomicDataCommons package includes two simple functions, available_fields() and default_fields(). Each can operate on a character(1) endpoint name (“cases”, “files”, “annotations”, or “projects”) or a GDCQuery object.
default_fields('files')
## [1] "access" "acl"
## [3] "average_base_quality" "average_insert_size"
## [5] "average_read_length" "channel"
## [7] "chip_id" "chip_position"
## [9] "contamination" "contamination_error"
## [11] "created_datetime" "data_category"
## [13] "data_format" "data_type"
## [15] "error_type" "experimental_strategy"
## [17] "file_autocomplete" "file_id"
## [19] "file_name" "file_size"
## [21] "imaging_date" "magnification"
## [23] "md5sum" "mean_coverage"
## [25] "msi_score" "msi_status"
## [27] "pairs_on_diff_chr" "plate_name"
## [29] "plate_well" "platform"
## [31] "proc_internal" "proportion_base_mismatch"
## [33] "proportion_coverage_10X" "proportion_coverage_10x"
## [35] "proportion_coverage_30X" "proportion_coverage_30x"
## [37] "proportion_reads_duplicated" "proportion_reads_mapped"
## [39] "proportion_targets_no_coverage" "read_pair_number"
## [41] "revision" "stain_type"
## [43] "state" "state_comment"
## [45] "submitter_id" "tags"
## [47] "total_reads" "tumor_ploidy"
## [49] "tumor_purity" "type"
## [51] "updated_datetime"
# The number of fields available for files endpoint
length(available_fields('files'))
## [1] 1017
# The first few fields available for files endpoint
head(available_fields('files'))
## [1] "access" "acl"
## [3] "analysis.analysis_id" "analysis.analysis_type"
## [5] "analysis.created_datetime" "analysis.input_files.access"
The fields to be returned by a query can be specified following a similar paradigm to that of the dplyr package. The select() function is a verb that resets the fields slot of a GDCQuery; note that this is not quite analogous to the dplyr select() verb, which limits the result to a subset of already-present fields. We completely replace the fields when using select() on a GDCQuery.
# Default fields here
qcases = cases()
qcases$fields
## [1] "aliquot_ids" "analyte_ids"
## [3] "case_autocomplete" "case_id"
## [5] "consent_type" "created_datetime"
## [7] "days_to_consent" "days_to_lost_to_followup"
## [9] "diagnosis_ids" "disease_type"
## [11] "index_date" "lost_to_followup"
## [13] "portion_ids" "primary_site"
## [15] "sample_ids" "slide_ids"
## [17] "state" "submitter_aliquot_ids"
## [19] "submitter_analyte_ids" "submitter_diagnosis_ids"
## [21] "submitter_id" "submitter_portion_ids"
## [23] "submitter_sample_ids" "submitter_slide_ids"
## [25] "updated_datetime"
# set up query to use ALL available fields
# Note that checking of fields is done by select()
qcases = cases() %>% GenomicDataCommons::select(available_fields('cases'))
head(qcases$fields)
## [1] "case_id" "aliquot_ids"
## [3] "analyte_ids" "annotations.annotation_id"
## [5] "annotations.case_id" "annotations.case_submitter_id"
Finding fields of interest is such a common operation that the GenomicDataCommons package includes the grep_fields() function. See the appropriate help pages for details.
The GDC API offers a feature known as aggregation or faceting. By specifying one or more fields (of appropriate type), the GDC can return to us a count of the number of records matching each potential value. This is similar to the R table method. Multiple fields can be returned at once, but the GDC API does not have a cross-tabulation feature; all aggregations are only on one field at a time. Results of aggregations() calls come back as a list of data.frames (actually, tibbles).
# total number of files of a specific type
res = files() %>% facet(c('type','data_type')) %>% aggregations()
res$type
Using aggregations() is also an easy way to learn the contents of individual fields and forms the basis for faceted search pages.
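To make the analogy with table() concrete, the sketch below mimics one-field-at-a-time faceting on a small local data frame. Everything here is illustrative: the data are toy values, and facet_counts() is a hypothetical helper that only imitates the doc_count/key layout of aggregation results.

```r
# Toy file metadata (made-up values, not real GDC records)
toy_files <- data.frame(
  type = c("gene_expression", "gene_expression", "type_b", "type_c"),
  access = c("open", "open", "controlled", "open"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count records per value of a single field
facet_counts <- function(df, field) {
  tab <- table(df[[field]])
  data.frame(doc_count = as.integer(tab), key = names(tab),
             stringsAsFactors = FALSE)
}

# Several fields can be requested at once, but each is tabulated separately;
# there is no cross-tabulation of type by access.
res <- lapply(c(type = "type", access = "access"),
              function(f) facet_counts(toy_files, f))
```

Each element of res is an independent table for one field, which is why a type-by-access breakdown would need separate filtered queries.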
[ GDC filtering documentation ]
The GenomicDataCommons package uses a form of non-standard evaluation to specify R-like queries that are then translated into an R list. That R list is, upon calling a method that fetches results from the GDC API, translated into the appropriate JSON string. The R expression uses the formula interface as suggested by Hadley Wickham in his vignette on non-standard evaluation:
It’s best to use a formula because a formula captures both the expression to evaluate and the environment where the evaluation occurs. This is important if the expression is a mixture of variables in a data frame and objects in the local environment [for example].
For the user, these details will not be too important except to note that a filter expression may begin with a “~”; the examples in this document use both the formula form and the bare expression form.
qfiles = files()
qfiles %>% count() # all files
## [1] 843011
To limit the file type, we can refer back to the section on faceting to see the possible values for the file field “type”. For example, to filter file results to only “gene_expression” files, we simply specify a filter.
qfiles = files() %>% filter( type == 'gene_expression')
# here is what the filter looks like after translation
str(get_filter(qfiles))
## List of 2
## $ op : 'scalar' chr "="
## $ content:List of 2
## ..$ field: chr "type"
## ..$ value: chr "gene_expression"
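The nested list shown above maps directly onto a JSON filter. The following base R sketch builds the same structure by hand and serializes it; eq_filter(), and_filter(), and to_json() are hypothetical helpers written for illustration (the package does this translation internally), and the serializer handles only named lists, unnamed lists, and single strings without escaping.

```r
# Hypothetical helpers that build the op/content structure shown by str()
eq_filter <- function(field, value) {
  list(op = "=", content = list(field = field, value = value))
}
and_filter <- function(...) {
  list(op = "and", content = list(...))
}

# Minimal serializer: named list -> object, unnamed list -> array,
# anything else -> quoted string (no escaping; illustration only)
to_json <- function(x) {
  if (is.list(x)) {
    if (is.null(names(x))) {
      items <- vapply(x, to_json, character(1))
      paste0("[", paste(items, collapse = ","), "]")
    } else {
      parts <- vapply(names(x), function(n) {
        paste0('"', n, '":', to_json(x[[n]]))
      }, character(1))
      paste0("{", paste(parts, collapse = ","), "}")
    }
  } else {
    paste0('"', x, '"')
  }
}

single <- to_json(eq_filter("type", "gene_expression"))
combined <- to_json(and_filter(
  eq_filter("cases.project.project_id", "TCGA-OV"),
  eq_filter("type", "gene_expression")
))
```

Printing single with cat() shows the JSON object that corresponds to the one-condition filter; combined wraps two such objects in an "and" operator.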
What if we want to create a filter based on the project (‘TCGA-OVCA’, for example)? Well, we have a couple of possible ways to discover available fields. The first is based on base R functionality and some intuition.
grep('pro',available_fields('files'),value=TRUE) %>%
head()
## [1] "analysis.input_files.proc_internal"
## [2] "analysis.input_files.proportion_base_mismatch"
## [3] "analysis.input_files.proportion_coverage_10X"
## [4] "analysis.input_files.proportion_coverage_10x"
## [5] "analysis.input_files.proportion_coverage_30X"
## [6] "analysis.input_files.proportion_coverage_30x"
Interestingly, the project information is “nested” inside the case. We don’t need to know that detail other than to know that we now have a few potential guesses for where our information might be in the files records. We need to know where because we need to construct the appropriate filter.
files() %>%
facet('cases.project.project_id') %>%
aggregations() %>%
head()
## $cases.project.project_id
## doc_count key
## 1 54096 FM-AD
## 2 49455 TCGA-BRCA
## 3 56636 CPTAC-3
## 4 38305 TARGET-AML
## 5 26469 TCGA-LUAD
## 6 36470 GENIE-MSK
## 7 22749 TCGA-UCEC
## 8 23663 TCGA-HNSC
## 9 22838 TCGA-THCA
## 10 23734 TCGA-KIRC
## 11 21971 TCGA-OV
## 12 23893 TCGA-LUSC
## 13 23134 TCGA-LGG
## 14 22531 TCGA-PRAD
## 15 20776 TCGA-COAD
## 16 18013 TCGA-GBM
## 17 28464 GENIE-DFCI
## 18 20167 TCGA-SKCM
## 19 19694 TCGA-STAD
## 20 27014 MMRF-COMMPASS
## 21 18331 TCGA-BLCA
## 22 16965 TCGA-LIHC
## 23 16959 TARGET-ALL-P2
## 24 13339 TCGA-CESC
## 25 13387 TCGA-KIRP
## 26 11593 TCGA-SARC
## 27 14968 BEATAML1.0-COHORT
## 28 11553 REBC-THYR
## 29 8187 TCGA-PAAD
## 30 8164 TCGA-ESCA
## 31 7868 TCGA-PCPG
## 32 7178 TCGA-READ
## 33 6731 TCGA-TGCT
## 34 9244 CPTAC-2
## 35 5358 TARGET-NBL
## 36 7221 TCGA-LAML
## 37 5376 TCGA-THYM
## 38 6984 HCMI-CMDC
## 39 5304 CGCI-HTMCP-CC
## 40 5454 CMI-MBC
## 41 3876 TCGA-ACC
## 42 3493 TCGA-KICH
## 43 5286 NCICCR-DLBCL
## 44 3666 TCGA-MESO
## 45 3383 TCGA-UVM
## 46 2532 TARGET-WT
## 47 2801 TARGET-OS
## 48 3625 TARGET-ALL-P3
## 49 3857 GENIE-MDA
## 50 3833 GENIE-VICC
## 51 2549 TCGA-UCS
## 52 3320 GENIE-JHU
## 53 2043 TCGA-DLBC
## 54 2059 TCGA-CHOL
## 55 2632 GENIE-UHN
## 56 2139 CGCI-BLGSP
## 57 1826 EXCEPTIONAL_RESPONDERS-ER
## 58 1571 MP2PRT-WT
## 59 1036 TARGET-RT
## 60 1093 WCDT-MCRPC
## 61 1038 GENIE-GRCC
## 62 878 OHSU-CNL
## 63 806 CMI-ASC
## 64 801 GENIE-NKI
## 65 758 ORGANOID-PANCREATIC
## 66 553 CTSP-DLBCL1
## 67 480 CMI-MPC
## 68 339 TRIO-CRU
## 69 222 BEATAML1.0-CRENOLANIB
## 70 163 TARGET-CCSK
## 71 96 TARGET-ALL-P1
## 72 21 VAREPOP-APOLLO
We note that cases.project.project_id looks like it is a good fit. We also note that TCGA-OV is the correct project_id, not TCGA-OVCA. Note that successive filter() calls on the same query combine their conditions, as the chained example later in this section shows; to start from a fresh filter, begin again from files().
qfiles = files() %>%
filter( cases.project.project_id == 'TCGA-OV' & type == 'gene_expression')
str(get_filter(qfiles))
## List of 2
## $ op : 'scalar' chr "and"
## $ content:List of 2
## ..$ :List of 2
## .. ..$ op : 'scalar' chr "="
## .. ..$ content:List of 2
## .. .. ..$ field: chr "cases.project.project_id"
## .. .. ..$ value: chr "TCGA-OV"
## ..$ :List of 2
## .. ..$ op : 'scalar' chr "="
## .. ..$ content:List of 2
## .. .. ..$ field: chr "type"
## .. .. ..$ value: chr "gene_expression"
qfiles %>% count()
## [1] 762
Asking for a count() of results given these new filter criteria gives 762 results. Filters can be chained (or nested) to accomplish the same effect as multiple & conditionals. The count() below is equivalent to the & filtering done above.
qfiles2 = files() %>%
filter( cases.project.project_id == 'TCGA-OV') %>%
filter( type == 'gene_expression')
qfiles2 %>% count()
## [1] 762
(qfiles %>% count()) == (qfiles2 %>% count()) #TRUE
## [1] TRUE
Generating a manifest for bulk downloads is as simple as asking for the manifest from the current query.
manifest_df = qfiles %>% manifest()
head(manifest_df)
Note that we might still not be quite there. Looking at filenames, there are suspiciously named files that might include “FPKM”, “FPKM-UQ”, or “counts”. Another round of grep and available_fields, looking for “type”, turned up the field “analysis.workflow_type”, which has the appropriate filter criteria.
qfiles = files() %>% filter( ~ cases.project.project_id == 'TCGA-OV' &
type == 'gene_expression' &
access == "open" &
analysis.workflow_type == 'STAR - Counts')
manifest_df = qfiles %>% manifest()
nrow(manifest_df)
## [1] 381
The GDC Data Transfer Tool can be used (from R via transfer(), or from the command-line) to orchestrate high-performance, restartable transfers of all the files in the manifest. See the bulk downloads section for details.
[ GDC authentication documentation ]
The GDC offers both “controlled-access” and “open” data. As of this writing, only data stored as files is “controlled-access”; that is, metadata accessible via the GDC is all “open” data and some files are “open” and some are “controlled-access”. Controlled-access data are only available after going through the process of obtaining access.
After controlled-access to one or more datasets has been granted, logging into the GDC web portal will allow you to access a GDC authentication token, which can be downloaded and then used to access available controlled-access data via the GenomicDataCommons package.
The GenomicDataCommons package uses authentication tokens only for downloading data (see transfer and gdcdata documentation). The package includes a helper function, gdc_token, that looks for the token to be stored in one of three ways (resolved in this order):
As a string in the environment variable GDC_TOKEN
In the file named by the environment variable GDC_TOKEN_FILE
In a file named .gdc_token in the user's home directory
As a concrete example:
token = gdc_token()
transfer(...,token=token)
# or
transfer(...,token=gdc_token())
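The lookup order can be sketched in base R as follows. This is an illustration and not the package's implementation: resolve_token() is a hypothetical function, and it assumes the three locations are the GDC_TOKEN environment variable, a file named by GDC_TOKEN_FILE, and a .gdc_token file in the home directory. The demo uses a temporary file and a temporary "home" so that it runs without a real token.

```r
# Hypothetical re-implementation of the lookup order described above
resolve_token <- function(home = path.expand("~")) {
  # 1. Token string stored directly in GDC_TOKEN
  token <- Sys.getenv("GDC_TOKEN", unset = "")
  if (nzchar(token)) return(token)
  # 2. File named by GDC_TOKEN_FILE
  token_file <- Sys.getenv("GDC_TOKEN_FILE", unset = "")
  if (nzchar(token_file) && file.exists(token_file)) {
    return(trimws(readLines(token_file, n = 1, warn = FALSE)))
  }
  # 3. .gdc_token in the home directory
  default_file <- file.path(home, ".gdc_token")
  if (file.exists(default_file)) {
    return(trimws(readLines(default_file, n = 1, warn = FALSE)))
  }
  stop("No GDC token found")
}

# Demo with a fake token file; the environment is cleaned up afterwards
tmp <- tempfile()
writeLines("token-from-file", tmp)
Sys.unsetenv("GDC_TOKEN")
Sys.setenv(GDC_TOKEN_FILE = tmp)
from_file <- resolve_token(home = tempdir())
Sys.setenv(GDC_TOKEN = "token-from-env")
from_env <- resolve_token(home = tempdir())
Sys.unsetenv(c("GDC_TOKEN", "GDC_TOKEN_FILE"))
```

Because the environment variable is checked first, from_env takes precedence over the file even though GDC_TOKEN_FILE is still set.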
The gdcdata function takes a character vector of one or more file ids. A simple way of producing such a vector is to produce a manifest data frame and then pass in the first column, which will contain file ids.
fnames = gdcdata(manifest_df$id[1:2],progress=FALSE)
Note that for controlled-access data, a GDC authentication token is required. Using the BiocParallel package may be useful for downloading in parallel, particularly for large numbers of smallish files.
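A minimal sketch of the parallel pattern is shown below. It swaps in the base parallel package rather than BiocParallel so that it has no extra dependencies, and it uses a stand-in fetch function instead of gdcdata() so that it runs offline; download_in_parallel() and fake_fetch() are hypothetical names.

```r
library(parallel)

# Hypothetical helper: apply a download function to each file id in parallel
download_in_parallel <- function(file_ids, fetch_one, workers = 2L) {
  # mclapply() relies on forking, which is unavailable on Windows,
  # so fall back to a single core there
  cores <- if (.Platform$OS.type == "windows") 1L else workers
  res <- mclapply(file_ids, fetch_one, mc.cores = cores)
  names(res) <- file_ids
  res
}

# Stand-in for gdcdata(): only builds a fake local path
fake_fetch <- function(id) file.path(tempdir(), paste0(id, ".tsv"))

paths <- download_in_parallel(c("id-a", "id-b", "id-c"), fake_fetch)

# With real data, the call might look like this (untested assumption):
# paths <- download_in_parallel(manifest_df$id[1:2], gdcdata)
```

Whether parallel workers interact well with the download cache is worth verifying on a few files before scaling up.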
The bulk download functionality is only efficient (as of v1.2.0 of the GDC Data Transfer Tool) for relatively large files, so use this approach only when transferring BAM files or larger VCF files, for example. Otherwise, consider using the approach shown above, perhaps in parallel.
# Requires the gdc-client command-line utility to be installed
# separately.
fnames = gdcdata(manifest_df$id[3:10], access_method = 'client')
res = cases() %>% facet("project.project_id") %>% aggregations()
head(res)
## $project.project_id
## doc_count key
## 1 18004 FM-AD
## 2 16824 GENIE-MSK
## 3 14232 GENIE-DFCI
## 4 3857 GENIE-MDA
## 5 3320 GENIE-JHU
## 6 2632 GENIE-UHN
## 7 2492 TARGET-AML
## 8 2052 GENIE-VICC
## 9 1587 TARGET-ALL-P2
## 10 1132 TARGET-NBL
## 11 1098 TCGA-BRCA
## 12 1046 CPTAC-3
## 13 1038 GENIE-GRCC
## 14 995 MMRF-COMMPASS
## 15 826 BEATAML1.0-COHORT
## 16 801 GENIE-NKI
## 17 652 TARGET-WT
## 18 617 TCGA-GBM
## 19 608 TCGA-OV
## 20 585 TCGA-LUAD
## 21 560 TCGA-UCEC
## 22 537 TCGA-KIRC
## 23 528 TCGA-HNSC
## 24 516 TCGA-LGG
## 25 507 TCGA-THCA
## 26 504 TCGA-LUSC
## 27 500 TCGA-PRAD
## 28 489 NCICCR-DLBCL
## 29 470 TCGA-SKCM
## 30 461 TCGA-COAD
## 31 443 TCGA-STAD
## 32 440 REBC-THYR
## 33 412 TCGA-BLCA
## 34 383 TARGET-OS
## 35 377 TCGA-LIHC
## 36 342 CPTAC-2
## 37 339 TRIO-CRU
## 38 307 TCGA-CESC
## 39 291 TCGA-KIRP
## 40 261 TCGA-SARC
## 41 212 CGCI-HTMCP-CC
## 42 200 CMI-MBC
## 43 200 TCGA-LAML
## 44 191 TARGET-ALL-P3
## 45 185 TCGA-ESCA
## 46 185 TCGA-PAAD
## 47 179 TCGA-PCPG
## 48 176 OHSU-CNL
## 49 172 TCGA-READ
## 50 150 TCGA-TGCT
## 51 124 TCGA-THYM
## 52 120 CGCI-BLGSP
## 53 113 TCGA-KICH
## 54 110 HCMI-CMDC
## 55 101 WCDT-MCRPC
## 56 92 TCGA-ACC
## 57 87 TCGA-MESO
## 58 84 EXCEPTIONAL_RESPONDERS-ER
## 59 80 TCGA-UVM
## 60 70 ORGANOID-PANCREATIC
## 61 69 TARGET-RT
## 62 58 TCGA-DLBC
## 63 57 TCGA-UCS
## 64 56 BEATAML1.0-CRENOLANIB
## 65 52 MP2PRT-WT
## 66 51 TCGA-CHOL
## 67 45 CTSP-DLBCL1
## 68 36 CMI-ASC
## 69 30 CMI-MPC
## 70 24 TARGET-ALL-P1
## 71 13 TARGET-CCSK
## 72 7 VAREPOP-APOLLO
library(ggplot2)
ggplot(res$project.project_id,aes(x = key, y = doc_count)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
cases() %>% filter(~ project.program.name=='TARGET') %>% count()
## [1] 6543
cases() %>% filter(~ project.program.name=='TCGA') %>% count()
## [1] 11315
# The need to do the "&" here is a requirement of the
# current version of the GDC API. I have filed a feature
# request to remove this requirement.
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
project.project_id=='TCGA-BRCA' ) %>%
facet('samples.sample_type') %>% aggregations()
resp$samples.sample_type
resp = cases() %>% filter(~ project.project_id=='TCGA-BRCA' &
samples.sample_type=='Solid Tissue Normal') %>%
GenomicDataCommons::select(c(default_fields(cases()),'samples.sample_type')) %>%
response_all()
count(resp)
## [1] 162
res = resp %>% results()
str(res[1],list.len=6)
## List of 1
## $ id: chr [1:162] "3d676bba-154b-4d22-ab59-d4d4da051b94" "1133b8a9-6b11-4511-b70a-f200e3b8b5db" "17c1d42c-cb84-4655-a4cd-b54bae17ecaf" "9da462b0-93c2-4305-89f6-7199a30399a7" ...
head(ids(resp))
## [1] "3d676bba-154b-4d22-ab59-d4d4da051b94"
## [2] "1133b8a9-6b11-4511-b70a-f200e3b8b5db"
## [3] "17c1d42c-cb84-4655-a4cd-b54bae17ecaf"
## [4] "9da462b0-93c2-4305-89f6-7199a30399a7"
## [5] "14267783-5624-4fe5-ba81-9d67f1017474"
## [6] "26573441-eedb-4364-966c-e7f803deef19"
cases() %>%
GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
"cases.demographic.gender" %in% "female") %>%
GenomicDataCommons::results(size = 4) %>%
ids()
## [1] "cbfef004-b437-4d51-9d88-a2db50aa6481"
## [2] "a9644274-13bb-4228-9b4f-14260ccc26eb"
## [3] "096bd95f-9900-4db2-b1c4-103902c3b31f"
## [4] "0a45f302-5748-48f3-9dc9-66c01843a68e"
cases() %>%
GenomicDataCommons::filter(~ project.project_id == 'TCGA-COAD' &
"cases.demographic.gender" %exclude% "female") %>%
GenomicDataCommons::results(size = 4) %>%
ids()
## [1] "58facedb-fcb8-4ecf-8338-2bfa4947acef"
## [2] "0a94eecf-4db2-4846-8383-c83ff02e4a9f"
## [3] "8368d745-c74d-4236-ba75-16ca7aaeb3ca"
## [4] "eb4e4e09-98b3-4e85-8dd2-75676ff2af14"
cases() %>%
GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
missing("cases.demographic.gender")) %>%
GenomicDataCommons::results(size = 4) %>%
ids()
## [1] "aebd0313-23be-46a8-abc6-b16c531c3a8e"
## [2] "07119baf-64a7-454c-b1b0-c769b506a63d"
## [3] "494d50b6-578e-441b-8195-4b4d26c0d810"
## [4] "9b714a42-62e8-4b33-947e-6c4850725afd"
cases() %>%
GenomicDataCommons::filter(~ project.program.name == 'TCGA' &
!missing("cases.demographic.gender")) %>%
GenomicDataCommons::results(size = 4) %>%
ids()
## [1] "35243518-b086-4d76-a336-8c61a14f9ded"
## [2] "df3362cb-0f6b-412e-8af4-c5606526be17"
## [3] "6433f001-db5c-476b-83c7-23f4c5397ae9"
## [4] "eb80244a-5f20-49a8-8f73-e92c14395895"
res = files() %>% facet('type') %>% aggregations()
res$type
ggplot(res$type,aes(x = key,y = doc_count)) + geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
q = files() %>%
GenomicDataCommons::select(available_fields('files')) %>%
filter(~ cases.project.project_id=='TCGA-GBM' &
data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations()
## list()
# so need to add another filter
file_ids = q %>% filter(~ cases.project.project_id=='TCGA-GBM' &
data_type=='Gene Expression Quantification' &
analysis.workflow_type == 'STAR - Counts') %>%
GenomicDataCommons::select('file_id') %>%
response_all() %>%
ids()
I need to figure out how to do slicing reproducibly in a testing environment and for vignette building.
q = files() %>%
GenomicDataCommons::select(available_fields('files')) %>%
filter(~ cases.project.project_id == 'TCGA-GBM' &
data_type == 'Aligned Reads' &
experimental_strategy == 'RNA-Seq' &
data_format == 'BAM')
file_ids = q %>% response_all() %>% ids()
bamfile = slicing(file_ids[1],regions="chr12:6534405-6538375",token=gdc_token())
library(GenomicAlignments)
aligns = readGAlignments(bamfile)
Error in curl::curl_fetch_memory(url, handle = handle) :
SSL connect error
Possible fix: upgrade the system openssl to version 1.0.1 or later. After upgrading openssl, reinstall the R curl and httr packages.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 GenomicDataCommons_1.20.3
## [3] magrittr_2.0.3 knitr_1.40
## [5] BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.29
## [4] utf8_1.2.2 R6_2.5.1 GenomeInfoDb_1.32.4
## [7] stats4_4.2.1 evaluate_0.17 highr_0.9
## [10] httr_1.4.4 pillar_1.8.1 zlibbioc_1.42.0
## [13] rlang_1.0.6 curl_4.3.3 jquerylib_0.1.4
## [16] magick_2.7.3 S4Vectors_0.34.0 rmarkdown_2.17
## [19] labeling_0.4.2 readr_2.1.3 stringr_1.4.1
## [22] RCurl_1.98-1.9 bit_4.0.4 munsell_0.5.0
## [25] compiler_4.2.1 xfun_0.33 pkgconfig_2.0.3
## [28] BiocGenerics_0.42.0 htmltools_0.5.3 tidyselect_1.1.2
## [31] tibble_3.1.8 GenomeInfoDbData_1.2.8 bookdown_0.29
## [34] IRanges_2.30.1 fansi_1.0.3 crayon_1.5.2
## [37] dplyr_1.0.10 tzdb_0.3.0 withr_2.5.0
## [40] bitops_1.0-7 rappdirs_0.3.3 grid_4.2.1
## [43] jsonlite_1.8.2 gtable_0.3.1 lifecycle_1.0.3
## [46] DBI_1.1.3 scales_1.2.1 cli_3.4.1
## [49] stringi_1.7.8 vroom_1.6.0 cachem_1.0.6
## [52] farver_2.1.1 XVector_0.36.0 xml2_1.3.3
## [55] bslib_0.4.0 ellipsis_0.3.2 generics_0.1.3
## [58] vctrs_0.4.2 tools_4.2.1 bit64_4.0.5
## [61] glue_1.6.2 purrr_0.3.5 hms_1.1.2
## [64] parallel_4.2.1 fastmap_1.1.0 yaml_2.3.5
## [67] colorspace_2.0-3 BiocManager_1.30.18 GenomicRanges_1.48.0
## [70] sass_0.4.2
The S3 object-oriented programming paradigm is used.