The purpose of this vignette is to explore the file manifests available from the Human Cell Atlas project.
These files provide a tabular metadata summary for a collection of files, including but not limited to information about the process and workflow used to generate each file, information about the specimens the file data were derived from, and identifiers that connect specific projects, files, and specimens.
The WARP (WDL Analysis Research Pipelines) repository contains information on a variety of pipelines, and can be used alongside a manifest to better understand the metadata.
Evaluate the following code chunk to install packages required for this vignette.
## install from Bioconductor if you haven't already
pkgs <- c("LoomExperiment", "hca")
pkgs_needed <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(pkgs_needed)
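A slightly more defensive variant of the chunk above, offered only as a sketch: it tests each package with requireNamespace() instead of scanning installed.packages(), and calls the installer only when something is missing. The helpers missing_packages() and install_if_missing() are hypothetical names introduced here; they are not part of hca or BiocManager.

```r
## hypothetical helper: which of `pkgs` cannot be loaded?
missing_packages <- function(pkgs) {
    is_available <- vapply(
        pkgs, requireNamespace, logical(1), quietly = TRUE
    )
    pkgs[!is_available]
}

## hypothetical helper: call `installer` only for the missing packages;
## the default installer is evaluated lazily, so BiocManager is only
## needed when there is something to install
install_if_missing <- function(pkgs, installer = BiocManager::install) {
    pkgs_needed <- missing_packages(pkgs)
    if (length(pkgs_needed) > 0L)
        installer(pkgs_needed)
    invisible(pkgs_needed)
}

## base packages are always present, so nothing is installed here
install_if_missing(c("stats", "utils"))
```

For this vignette the call would be install_if_missing(c("LoomExperiment", "hca")).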
Load the packages into your R session.
library(dplyr)
library(SummarizedExperiment)
library(LoomExperiment)
library(hca)
The manifest for all available files can be obtained with
default_manifest_tbl <- hca::manifest()
default_manifest_tbl
This is seldom useful; instead, create a filter identifying the files of interest.
manifest_filter <- hca::filters(
    projectId = list(is = "4a95101c-9ffc-4f30-a809-f04518a23803"),
    fileFormat = list(is = "loom"),
    workflow = list(is = c("optimus_v4.2.2", "optimus_v4.2.3"))
)
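To make the shape of the filter concrete, the sketch below treats a filter as a named list of "is" conditions and applies it to a small, made-up data frame in base R. It assumes that different fields are combined with AND and that several values within one "is" are alternatives (OR). That reading fits the way the two workflow versions are listed above, but the real matching is done by the HCA service, not by this code. apply_is_filter() and the toy rows are hypothetical.

```r
## hypothetical stand-in for a file index; all values are invented
toy_files <- data.frame(
    fileFormat = c("loom", "loom", "bam", "loom"),
    workflow = c("optimus_v4.2.2", "optimus_v4.2.3",
                 "optimus_v4.2.2", "optimus_v1.0.0"),
    stringsAsFactors = FALSE
)

## same shape as the arguments passed to hca::filters() above
toy_filter <- list(
    fileFormat = list(is = "loom"),
    workflow = list(is = c("optimus_v4.2.2", "optimus_v4.2.3"))
)

## assumed semantics: AND across fields, OR within a field's `is` values
apply_is_filter <- function(tbl, filter) {
    keep <- rep(TRUE, nrow(tbl))
    for (field in names(filter))
        keep <- keep & tbl[[field]] %in% filter[[field]]$is
    tbl[keep, , drop = FALSE]
}

## keeps the two loom rows produced by the listed workflow versions
toy_result <- apply_is_filter(toy_files, toy_filter)
toy_result
```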
Retrieve the manifest for the files matching the filter.
manifest_tibble <- hca::manifest(filters = manifest_filter)
manifest_tibble
## # A tibble: 20 × 56
## source_id sourc…¹ bundl…² bundle_version file_…³ file_…⁴ file_…⁵ file_…⁶
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 423338be… tdr:da… b593b6… 2020-02-03 01:00:00 131ea5… analys… 098cc6… loom
## 2 423338be… tdr:da… 5a63dd… 2021-02-02 23:50:00 1bb375… analys… 58cf25… loom
## 3 423338be… tdr:da… 407338… 2021-02-02 23:55:00 1f8ff0… analys… t-cell… loom
## 4 423338be… tdr:da… 1a41eb… 2021-02-02 23:50:00 2fffe2… analys… fcbaa3… loom
## 5 423338be… tdr:da… c12a6c… 2021-02-02 23:50:00 31aa5a… analys… t-cell… loom
## 6 423338be… tdr:da… f58d69… 2021-02-02 23:50:00 48eea2… analys… 219e1b… loom
## 7 423338be… tdr:da… 21c4e2… 2021-02-02 23:50:00 514589… analys… 36ca61… loom
## 8 423338be… tdr:da… 50620c… 2021-02-02 23:50:00 5bbebe… analys… 24ae6c… loom
## 9 423338be… tdr:da… e3ecdf… 2021-02-02 23:55:00 5bc232… analys… c763f6… loom
## 10 423338be… tdr:da… ae338c… 2021-02-02 23:50:00 6326b6… analys… t-cell… loom
## 11 423338be… tdr:da… d62c45… 2020-02-03 01:00:00 7848d8… analys… 294fe5… loom
## 12 423338be… tdr:da… 81df10… 2021-02-02 23:55:00 9f8bc0… analys… 58a18a… loom
## 13 423338be… tdr:da… 283832… 2020-02-03 01:00:00 b98cfa… analys… d65364… loom
## 14 423338be… tdr:da… c3f672… 2021-02-02 23:50:00 bf7751… analys… 6fcd2c… loom
## 15 423338be… tdr:da… a9c903… 2020-02-03 01:00:00 c7b647… analys… a040da… loom
## 16 423338be… tdr:da… 9d0f5c… 2020-02-03 01:00:00 d0b95f… analys… t-cell… loom
## 17 423338be… tdr:da… 59de15… 2021-02-02 23:50:00 d18759… analys… bfbf2c… loom
## 18 423338be… tdr:da… 54fb0e… 2021-02-02 23:55:00 dfd990… analys… c76d90… loom
## 19 423338be… tdr:da… 751656… 2021-02-02 23:55:00 e07ca7… analys… fb72f4… loom
## 20 423338be… tdr:da… 8e850d… 2021-02-02 23:50:00 fd41f3… analys… 3ddf14… loom
## # … with 48 more variables: read_index <lgl>, file_size <dbl>, file_uuid <chr>,
## # file_version <dttm>, file_crc32c <chr>, file_sha256 <chr>,
## # file_content_type <chr>, file_drs_uri <chr>, file_url <chr>,
## # cell_suspension.provenance.document_id <chr>,
## # cell_suspension.biomaterial_core.biomaterial_id <chr>,
## # cell_suspension.estimated_cell_count <lgl>,
## # cell_suspension.selected_cell_type <chr>, …
## # ℹ Use `colnames()` to see all variable names
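With 56 columns, it can help to see which entities the manifest describes. The sketch below groups column names by the text before the first "."; the names are a handful copied from the output in this vignette, and with the real object one would start from colnames(manifest_tibble) instead. Treating the prefix as the "entity" is an interpretation made here, not something the manifest itself states.

```r
## a few manifest column names copied from this vignette's output
manifest_columns <- c(
    "file_uuid", "file_url",
    "cell_suspension.provenance.document_id",
    "cell_suspension.selected_cell_type",
    "specimen_from_organism.organ",
    "specimen_from_organism.organ_part",
    "donor_organism.sex",
    "donor_organism.genus_species",
    "donor_organism.organism_age"
)

## text before the first "." is taken as the entity;
## names without a "." are left unchanged
entity <- sub("\\..*$", "", manifest_columns)

## number of columns per entity
entity_counts <- table(entity)
entity_counts
```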
The manifest can then be summarized further, e.g., by counting the specimen organs represented in the files.
manifest_tibble |>
    dplyr::count(specimen_from_organism.organ)
## # A tibble: 4 × 2
## specimen_from_organism.organ n
## <chr> <int>
## 1 blood 5
## 2 hematopoietic system 5
## 3 lung 5
## 4 mediastinal lymph node 5
From manifest_tibble, select one file for download.
manifest_tibble
## # A tibble: 20 × 56
## source_id sourc…¹ bundl…² bundle_version file_…³ file_…⁴ file_…⁵ file_…⁶
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 423338be… tdr:da… b593b6… 2020-02-03 01:00:00 131ea5… analys… 098cc6… loom
## 2 423338be… tdr:da… 5a63dd… 2021-02-02 23:50:00 1bb375… analys… 58cf25… loom
## 3 423338be… tdr:da… 407338… 2021-02-02 23:55:00 1f8ff0… analys… t-cell… loom
## 4 423338be… tdr:da… 1a41eb… 2021-02-02 23:50:00 2fffe2… analys… fcbaa3… loom
## 5 423338be… tdr:da… c12a6c… 2021-02-02 23:50:00 31aa5a… analys… t-cell… loom
## 6 423338be… tdr:da… f58d69… 2021-02-02 23:50:00 48eea2… analys… 219e1b… loom
## 7 423338be… tdr:da… 21c4e2… 2021-02-02 23:50:00 514589… analys… 36ca61… loom
## 8 423338be… tdr:da… 50620c… 2021-02-02 23:50:00 5bbebe… analys… 24ae6c… loom
## 9 423338be… tdr:da… e3ecdf… 2021-02-02 23:55:00 5bc232… analys… c763f6… loom
## 10 423338be… tdr:da… ae338c… 2021-02-02 23:50:00 6326b6… analys… t-cell… loom
## 11 423338be… tdr:da… d62c45… 2020-02-03 01:00:00 7848d8… analys… 294fe5… loom
## 12 423338be… tdr:da… 81df10… 2021-02-02 23:55:00 9f8bc0… analys… 58a18a… loom
## 13 423338be… tdr:da… 283832… 2020-02-03 01:00:00 b98cfa… analys… d65364… loom
## 14 423338be… tdr:da… c3f672… 2021-02-02 23:50:00 bf7751… analys… 6fcd2c… loom
## 15 423338be… tdr:da… a9c903… 2020-02-03 01:00:00 c7b647… analys… a040da… loom
## 16 423338be… tdr:da… 9d0f5c… 2020-02-03 01:00:00 d0b95f… analys… t-cell… loom
## 17 423338be… tdr:da… 59de15… 2021-02-02 23:50:00 d18759… analys… bfbf2c… loom
## 18 423338be… tdr:da… 54fb0e… 2021-02-02 23:55:00 dfd990… analys… c76d90… loom
## 19 423338be… tdr:da… 751656… 2021-02-02 23:55:00 e07ca7… analys… fb72f4… loom
## 20 423338be… tdr:da… 8e850d… 2021-02-02 23:50:00 fd41f3… analys… 3ddf14… loom
## # … with 48 more variables: read_index <lgl>, file_size <dbl>, file_uuid <chr>,
## # file_version <dttm>, file_crc32c <chr>, file_sha256 <chr>,
## # file_content_type <chr>, file_drs_uri <chr>, file_url <chr>,
## # cell_suspension.provenance.document_id <chr>,
## # cell_suspension.biomaterial_core.biomaterial_id <chr>,
## # cell_suspension.estimated_cell_count <lgl>,
## # cell_suspension.selected_cell_type <chr>, …
## # ℹ Use `colnames()` to see all variable names
file_uuid <- "24a8a323-7ecd-504e-a253-b0e0892dd730"
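The UUID above is written out by hand. As an alternative, a UUID could be picked from the manifest itself. The sketch below does this with dplyr on a toy tibble; the UUIDs and rows are invented placeholders, and only the column names file_uuid and specimen_from_organism.organ are taken from the manifest shown in this vignette.

```r
library(dplyr)

## toy stand-in for manifest_tibble; UUIDs are placeholders, not real files
toy_manifest <- tibble(
    file_uuid = c("uuid-a", "uuid-b", "uuid-c", "uuid-d"),
    file_format = "loom",
    specimen_from_organism.organ = c("blood", "lung", "lung", "blood")
)

## keep the lung rows, drop repeated UUIDs, and take the first one
chosen_uuid <-
    toy_manifest |>
    filter(specimen_from_organism.organ == "lung") |>
    distinct(file_uuid) |>
    slice(1) |>
    pull(file_uuid)

chosen_uuid
```

Taking the first match is arbitrary; any other rule (largest file, a particular donor) could be substituted in the pipeline.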
Create a file_hca_tbl for the file, based on its UUID.
file_filter <- hca::filters(
    fileId = list(is = file_uuid)
)
file_tbl <- hca::files(filters = file_filter)
file_tbl
## # A tibble: 1 × 8
## fileId name fileF…¹ size version proje…² proje…³ url
## <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
## 1 24a8a323-7ecd-504e-a253-b0… t-ce… loom 3.90e8 2021-0… A sing… 4a9510… http…
## # … with abbreviated variable names ¹fileFormat, ²projectTitle, ³projectId
file_location <-
    file_tbl |>
    hca::files_download()
file_location
## 24a8a323-7ecd-504e-a253-b0e0892dd730-2021-02-11T19:00:05.000000Z
## "/home/biocbuild/.cache/R/hca/3e264919340f3d_3e264919340f3d.loom"
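As the output shows, the result is a named character vector: the name identifies the file and the value is a local path. A minimal sketch of working with such a value follows; it uses a temporary placeholder file and an invented name rather than a real download, so it runs without network access.

```r
## stand-in for the value returned above; the name and the file content
## are invented for illustration
toy_path <- tempfile(fileext = ".loom")
writeLines("placeholder, not a real loom file", toy_path)
toy_location <- c("hypothetical-file-id" = toy_path)

## confirm the file is on disk before trying to import it
stopifnot(file.exists(toy_location))

## size in bytes
file.size(toy_location)

## bare path without the identifying name
local_path <- unname(toy_location)
local_path
```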
Import the file into R as a LoomExperiment object.
loom <- LoomExperiment::import(file_location)
metadata(loom) |>
    dplyr::glimpse()
## List of 15
## $ last_modified : chr "20210211T185949.186062Z"
## $ CreationDate : chr "20210211T185658.758915Z"
## $ LOOM_SPEC_VERSION : chr "3.0.0"
## $ donor_organism.genus_species : chr "Homo sapiens"
## $ expression_data_type : chr "exonic"
## $ input_id : chr "58a18a4c-5423-4c59-9b3c-50b7f30b1ca5, c763f679-e13d-4f81-844f-c2c80fc90f46, c76d90b8-c190-4c58-b9bc-b31f586ec7f"| __truncated__
## $ input_id_metadata_field : chr "sequencing_process.provenance.document_id"
## $ input_name : chr "PP012_suspension, PP003_suspension, PP004_suspension, PP011_suspension"
## $ input_name_metadata_field : chr "sequencing_input.biomaterial_core.biomaterial_id"
## $ library_preparation_protocol.library_construction_approach: chr "10X v2 sequencing"
## $ optimus_output_schema_version : chr "1.0.0"
## $ pipeline_version : chr "Optimus_v4.2.2"
## $ project.project_core.project_name : chr "HumanTissueTcellActivation"
## $ project.provenance.document_id : chr "4a95101c-9ffc-4f30-a809-f04518a23803"
## $ specimen_from_organism.organ : chr "hematopoietic system"
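Several metadata fields, such as input_id and input_name, hold a comma-separated list in a single string. A small sketch of splitting such a field into a character vector is shown below, using the input_name value copied from the output above; with the real object the string would come from metadata(loom)$input_name.

```r
## value copied from the metadata output above
input_name <-
    "PP012_suspension, PP003_suspension, PP004_suspension, PP011_suspension"

## split on commas, dropping the whitespace that follows each comma
inputs <- strsplit(input_name, ",\\s*")[[1]]

inputs
length(inputs)
```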
colData(loom) |>
    dplyr::as_tibble() |>
    dplyr::glimpse()
## Rows: 91,713
## Columns: 43
## $ CellID <chr> "GCTTCCATCACCGT…
## $ antisense_reads <int> 0, 0, 0, 0, 0, …
## $ cell_barcode_fraction_bases_above_30_mean <dbl> 0.9846281, 0.98…
## $ cell_barcode_fraction_bases_above_30_variance <dbl> 0.003249023, 0.…
## $ cell_names <chr> "GCTTCCATCACCGT…
## $ duplicate_reads <int> 0, 0, 0, 0, 0, …
## $ emptydrops_FDR <dbl> 1.000000000, 0.…
## $ emptydrops_IsCell <raw> 00, 01, 00, 00,…
## $ emptydrops_Limited <raw> 00, 01, 00, 00,…
## $ emptydrops_LogProb <dbl> -689.6831, -120…
## $ emptydrops_PValue <dbl> 0.91840816, 0.0…
## $ emptydrops_Total <int> 255, 16705, 681…
## $ fragments_per_molecule <dbl> 1.693252, 8.453…
## $ fragments_with_single_read_evidence <int> 504, 139828, 58…
## $ genes_detected_multiple_observations <int> 82, 2873, 1552,…
## $ genomic_read_quality_mean <dbl> 36.62988, 36.87…
## $ genomic_read_quality_variance <dbl> 25.99015, 20.19…
## $ genomic_reads_fraction_bases_quality_above_30_mean <dbl> 0.8584288, 0.86…
## $ genomic_reads_fraction_bases_quality_above_30_variance <dbl> 0.03981779, 0.0…
## $ input_id <chr> "58a18a4c-5423-…
## $ molecule_barcode_fraction_bases_above_30_mean <dbl> 0.9820324, 0.98…
## $ molecule_barcode_fraction_bases_above_30_variance <dbl> 0.005782884, 0.…
## $ molecules_with_single_read_evidence <int> 276, 5028, 2041…
## $ n_fragments <int> 552, 202060, 84…
## $ n_genes <int> 227, 3381, 1826…
## $ n_mitochondrial_genes <int> 5, 22, 17, 5, 2…
## $ n_mitochondrial_molecules <int> 8, 3528, 2928, …
## $ n_molecules <int> 326, 23902, 998…
## $ n_reads <int> 679, 341669, 13…
## $ noise_reads <int> 0, 0, 0, 0, 0, …
## $ pct_mitochondrial_molecules <dbl> 1.1782032, 1.03…
## $ perfect_cell_barcodes <int> 667, 336674, 13…
## $ perfect_molecule_barcodes <int> 384, 227854, 89…
## $ reads_mapped_exonic <int> 343, 210716, 84…
## $ reads_mapped_intergenic <int> 39, 19439, 8042…
## $ reads_mapped_intronic <int> 175, 58833, 276…
## $ reads_mapped_multiple <int> 162, 90968, 365…
## $ reads_mapped_too_many_loci <int> 0, 0, 0, 0, 0, …
## $ reads_mapped_uniquely <int> 450, 227309, 93…
## $ reads_mapped_utr <int> 55, 29289, 1039…
## $ reads_per_fragment <dbl> 1.230072, 1.690…
## $ reads_unmapped <int> 67, 23392, 9340…
## $ spliced_reads <int> 99, 73817, 2987…
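The per-cell columns above can be used to subset cells. The sketch below works on a toy data frame with invented values and makes two assumptions that should be checked against the Optimus documentation: that a raw value of 01 in emptydrops_IsCell marks a barcode called as a cell, and that the mitochondrial threshold used here is purely illustrative.

```r
## toy stand-in for a few colData columns; all values are invented
toy_cells <- data.frame(
    CellID = c("AAAC", "AAAG", "AAAT", "AACA"),
    emptydrops_IsCell = as.raw(c(0, 1, 1, 0)),
    n_genes = c(227L, 3381L, 1826L, 150L),
    pct_mitochondrial_molecules = c(1.2, 1.0, 30.5, 2.0),
    stringsAsFactors = FALSE
)

## the column is stored as raw bytes (00 / 01), so compare against as.raw(0);
## assumption: a non-zero byte means "called as a cell"
is_cell <- toy_cells$emptydrops_IsCell != as.raw(0)

## illustrative threshold only, not a recommendation
low_mito <- toy_cells$pct_mitochondrial_molecules < 10

kept <- toy_cells[is_cell & low_mito, ]
kept$CellID
```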
The function optimus_loom_annotation() takes in the file path of a .loom file generated by the Optimus pipeline and returns a LoomExperiment object whose colData has been annotated with additional specimen data extracted from a manifest.
annotated_loom <- optimus_loom_annotation(file_location)
annotated_loom
## class: SingleCellLoomExperiment
## dim: 58347 91713
## metadata(16): last_modified CreationDate ...
## specimen_from_organism.organ manifest
## assays(1): matrix
## rownames: NULL
## rowData names(29): Gene antisense_reads ... reads_per_molecule
## spliced_reads
## colnames: NULL
## colData names(98): input_id CellID ...
## sequencing_input.biomaterial_core.biomaterial_id
## sequencing_input_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowGraphs(0): NULL
## colGraphs(0): NULL
## new metadata
setdiff(
    names(metadata(annotated_loom)),
    names(metadata(loom))
)
## [1] "manifest"
metadata(annotated_loom)$manifest
## # A tibble: 4 × 56
## source_id sourc…¹ bundl…² bundle_version file_…³ file_…⁴ file_…⁵ file_…⁶
## <chr> <chr> <chr> <dttm> <chr> <chr> <chr> <chr>
## 1 423338be-… tdr:da… e3ecdf… 2021-02-02 23:55:00 5bc232… analys… c763f6… loom
## 2 423338be-… tdr:da… 81df10… 2021-02-02 23:55:00 9f8bc0… analys… 58a18a… loom
## 3 423338be-… tdr:da… 54fb0e… 2021-02-02 23:55:00 dfd990… analys… c76d90… loom
## 4 423338be-… tdr:da… 751656… 2021-02-02 23:55:00 e07ca7… analys… fb72f4… loom
## # … with 48 more variables: read_index <chr>, file_size <dbl>, file_uuid <chr>,
## # file_version <dttm>, file_crc32c <chr>, file_sha256 <chr>,
## # file_content_type <chr>, file_drs_uri <chr>, file_url <chr>,
## # cell_suspension.provenance.document_id <chr>,
## # cell_suspension.biomaterial_core.biomaterial_id <chr>,
## # cell_suspension.estimated_cell_count <lgl>,
## # cell_suspension.selected_cell_type <chr>, …
## # ℹ Use `colnames()` to see all variable names
## new colData columns
setdiff(
    names(colData(annotated_loom)),
    names(colData(loom))
)
## [1] "source_id"
## [2] "source_spec"
## [3] "bundle_uuid"
## [4] "bundle_version"
## [5] "file_document_id"
## [6] "file_type"
## [7] "file_name"
## [8] "file_format"
## [9] "read_index"
## [10] "file_size"
## [11] "file_uuid"
## [12] "file_version"
## [13] "file_crc32c"
## [14] "file_sha256"
## [15] "file_content_type"
## [16] "file_drs_uri"
## [17] "file_url"
## [18] "cell_suspension.provenance.document_id"
## [19] "cell_suspension.biomaterial_core.biomaterial_id"
## [20] "cell_suspension.estimated_cell_count"
## [21] "cell_suspension.selected_cell_type"
## [22] "sequencing_protocol.instrument_manufacturer_model"
## [23] "sequencing_protocol.paired_end"
## [24] "library_preparation_protocol.library_construction_approach"
## [25] "library_preparation_protocol.nucleic_acid_source"
## [26] "project.provenance.document_id"
## [27] "project.contributors.institution"
## [28] "project.contributors.laboratory"
## [29] "project.project_core.project_short_name"
## [30] "project.project_core.project_title"
## [31] "project.estimated_cell_count"
## [32] "specimen_from_organism.provenance.document_id"
## [33] "specimen_from_organism.diseases"
## [34] "specimen_from_organism.organ"
## [35] "specimen_from_organism.organ_part"
## [36] "specimen_from_organism.preservation_storage.preservation_method"
## [37] "donor_organism.sex"
## [38] "donor_organism.biomaterial_core.biomaterial_id"
## [39] "donor_organism.provenance.document_id"
## [40] "donor_organism.genus_species"
## [41] "donor_organism.development_stage"
## [42] "donor_organism.diseases"
## [43] "donor_organism.organism_age"
## [44] "cell_line.provenance.document_id"
## [45] "cell_line.biomaterial_core.biomaterial_id"
## [46] "organoid.provenance.document_id"
## [47] "organoid.biomaterial_core.biomaterial_id"
## [48] "organoid.model_organ"
## [49] "organoid.model_organ_part"
## [50] "_entity_type"
## [51] "sample.provenance.document_id"
## [52] "sample.biomaterial_core.biomaterial_id"
## [53] "sequencing_input.provenance.document_id"
## [54] "sequencing_input.biomaterial_core.biomaterial_id"
## [55] "sequencing_input_type"
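The effect of the annotation, per-cell rows gaining manifest columns, can be pictured as a lookup from each cell into a manifest row. The base R sketch below illustrates that idea on toy data. It is not the implementation of optimus_loom_annotation(): the key column input_id in the toy manifest is a hypothetical stand-in for whatever field the function actually matches on, and all values are invented.

```r
## toy per-cell table; values are invented
toy_cells <- data.frame(
    CellID = c("c1", "c2", "c3"),
    input_id = c("in-B", "in-A", "in-B"),
    stringsAsFactors = FALSE
)

## toy manifest with one row per input; `input_id` as the key is an assumption
toy_manifest <- data.frame(
    input_id = c("in-A", "in-B"),
    specimen_from_organism.organ = c("lung", "blood"),
    donor_organism.sex = c("female", "male"),
    stringsAsFactors = FALSE
)

## match() keeps the cell order, unlike merge(), which may reorder rows
idx <- match(toy_cells$input_id, toy_manifest$input_id)
extra <- toy_manifest[
    idx, setdiff(names(toy_manifest), "input_id"), drop = FALSE
]

annotated <- cbind(toy_cells, extra)
rownames(annotated) <- NULL
annotated
```

Each cell keeps its original position and gains the manifest columns of its input, which is the pattern visible in the 55 new colData columns listed above.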
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] hca_1.4.3 LoomExperiment_1.14.0
## [3] BiocIO_1.6.0 rhdf5_2.40.0
## [5] SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1
## [7] Biobase_2.56.0 GenomicRanges_1.48.0
## [9] GenomeInfoDb_1.32.2 IRanges_2.30.0
## [11] S4Vectors_0.34.0 BiocGenerics_0.42.0
## [13] MatrixGenerics_1.8.1 matrixStats_0.62.0
## [15] dplyr_1.0.9 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.3 tidyr_1.2.0 sass_0.4.2
## [4] vroom_1.5.7 bit64_4.0.5 jsonlite_1.8.0
## [7] bslib_0.4.0 assertthat_0.2.1 BiocManager_1.30.18
## [10] BiocFileCache_2.4.0 blob_1.2.3 GenomeInfoDbData_1.2.8
## [13] yaml_2.3.5 pillar_1.8.0 RSQLite_2.2.15
## [16] lattice_0.20-45 glue_1.6.2 digest_0.6.29
## [19] XVector_0.36.0 htmltools_0.5.3 Matrix_1.4-1
## [22] pkgconfig_2.0.3 bookdown_0.27 zlibbioc_1.42.0
## [25] purrr_0.3.4 HDF5Array_1.24.1 tzdb_0.3.0
## [28] tibble_3.1.7 generics_0.1.3 ellipsis_0.3.2
## [31] cachem_1.0.6 cli_3.3.0 crayon_1.5.1
## [34] magrittr_2.0.3 memoise_2.0.1 evaluate_0.15
## [37] fansi_1.0.3 tools_4.2.1 hms_1.1.1
## [40] formatR_1.12 lifecycle_1.0.1 stringr_1.4.0
## [43] Rhdf5lib_1.18.2 DelayedArray_0.22.0 lambda.r_1.2.4
## [46] compiler_4.2.1 jquerylib_0.1.4 rlang_1.0.4
## [49] futile.logger_1.4.3 grid_4.2.1 RCurl_1.98-1.7
## [52] rhdf5filters_1.8.0 rappdirs_0.3.3 bitops_1.0-7
## [55] rmarkdown_2.14 DBI_1.1.3 curl_4.3.2
## [58] R6_2.5.1 knitr_1.39 fastmap_1.1.0
## [61] bit_4.0.4 utf8_1.2.2 filelock_1.0.2
## [64] futile.options_1.0.1 readr_2.1.2 stringi_1.7.8
## [67] parallel_4.2.1 Rcpp_1.0.9 vctrs_0.4.1
## [70] dbplyr_2.2.1 tidyselect_1.1.2 xfun_0.31