Introduction

The sesameData package provides data associated with the sesame package. This includes example data for testing and instructional purposes, as well as probe annotations for different Infinium platforms.

library(sesameData)
library(GenomicRanges)

Data from ExperimentHub

Titles of all the available data can be shown with:

head(sesameDataList())

Local caching

Each sesame datum from ExperimentHub is accessible through the sesameDataGet interface. Note that all data must be pre-cached to local disk before they can be used. This design prevents conflicts in annotation data caching and removes the internet dependency. Caching needs to be done only once per sesame/sesameData installation. One can cache data using:

sesameDataCache()

Once a data object is loaded, it is stored in a temporary cache, so that the data doesn’t need to be retrieved again the next time we call sesameDataGet. This design is meant to speed up run time.

For example, the annotation for HM27 can be retrieved with the title:

HM27.address <- sesameDataGet('HM27.address')

In-memory caching

It’s worth noting that once a data object is retrieved through the sesameDataGet interface, it stays in memory, so the next request for the same object returns immediately. This design avoids repeated disk/web retrieval. In rare situations, one may want to redo the download/disk IO, or empty the cache to save memory. This can be done with:

sesameDataGet_resetEnv()

##           used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 4296615 229.5    8461768 452.0  5789384 309.2
## Vcells 7568745  57.8   14786712 112.9 10146327  77.5
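The in-memory caching described above can be pictured as a lookup in an R environment. The following is only a minimal base-R sketch of that idea, not sesameData's actual implementation; cache_env, slow_load, cached_get and cache_reset are hypothetical names introduced for illustration.

```r
# Hypothetical illustration of in-memory caching with an R environment.
# None of these names are part of sesameData.
cache_env <- new.env(parent = emptyenv())
load_count <- 0

# Stand-in for an expensive disk or web retrieval (an assumption).
slow_load <- function(title) {
    load_count <<- load_count + 1
    paste("data for", title)
}

# Return the cached object if present; otherwise load it and cache it.
cached_get <- function(title) {
    if (!exists(title, envir = cache_env, inherits = FALSE)) {
        assign(title, slow_load(title), envir = cache_env)
    }
    get(title, envir = cache_env, inherits = FALSE)
}

# Emptying the cache forces the next call to reload the object.
cache_reset <- function() {
    rm(list = ls(cache_env, all.names = TRUE), envir = cache_env)
    invisible(NULL)
}

x1 <- cached_get("HM27.address")  # first call: slow_load runs
x2 <- cached_get("HM27.address")  # second call: served from cache_env
cache_reset()
x3 <- cached_get("HM27.address")  # after reset: slow_load runs again
```

In this sketch the loader runs twice in total: once for the first request and once after the reset.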

Transcript models

Sesame provides some utility functions to process transcript models, which can be represented as data.frame, GRanges and GRangesList objects. For example, sesameData_getTxnGRanges calls sesameDataGet("genomeInfo.mm10")$txns to retrieve a transcript-centric GRangesList object from GENCODE, including its gene annotation, exons and CDS (for protein-coding genes). It is then turned into a simple GRanges object of transcripts:

txns_gr <- sesameData_getTxnGRanges("mm10")
txns_gr

## GRanges object with 142604 ranges and 9 metadata columns:
##                        seqnames          ranges strand |      transcript_type
##                           <Rle>       <IRanges>  <Rle> |          <character>
##   ENSMUST00000193812.1     chr1 3073253-3074322      + |                  TEC
##   ENSMUST00000082908.1     chr1 3102016-3102125      + |                snRNA
##   ENSMUST00000162897.1     chr1 3205901-3216344      - | processed_transcript
##   ENSMUST00000159265.1     chr1 3206523-3215632      - | processed_transcript
##   ENSMUST00000070533.4     chr1 3214482-3671498      - |       protein_coding
##                    ...      ...             ...    ... .                  ...
##   ENSMUST00000082419.1     chrM     13552-14070      - |       protein_coding
##   ENSMUST00000082420.1     chrM     14071-14139      - |              Mt_tRNA
##   ENSMUST00000082421.1     chrM     14145-15288      + |       protein_coding
##   ENSMUST00000082422.1     chrM     15289-15355      + |              Mt_tRNA
##   ENSMUST00000082423.1     chrM     15356-15422      - |              Mt_tRNA
##                          transcript_name     gene_name              gene_id
##                              <character>   <character>          <character>
##   ENSMUST00000193812.1 4933401J01Rik-201 4933401J01Rik ENSMUSG00000102693.1
##   ENSMUST00000082908.1       Gm26206-201       Gm26206 ENSMUSG00000064842.1
##   ENSMUST00000162897.1          Xkr4-203          Xkr4 ENSMUSG00000051951.5
##   ENSMUST00000159265.1          Xkr4-202          Xkr4 ENSMUSG00000051951.5
##   ENSMUST00000070533.4          Xkr4-201          Xkr4 ENSMUSG00000051951.5
##                    ...               ...           ...                  ...
##   ENSMUST00000082419.1        mt-Nd6-201        mt-Nd6 ENSMUSG00000064368.1
##   ENSMUST00000082420.1         mt-Te-201         mt-Te ENSMUSG00000064369.1
##   ENSMUST00000082421.1       mt-Cytb-201       mt-Cytb ENSMUSG00000064370.1
##   ENSMUST00000082422.1         mt-Tt-201         mt-Tt ENSMUSG00000064371.1
##   ENSMUST00000082423.1         mt-Tp-201         mt-Tp ENSMUSG00000064372.1
##                             gene_type      source       level  cdsStart
##                           <character> <character> <character> <numeric>
##   ENSMUST00000193812.1            TEC      HAVANA           2        NA
##   ENSMUST00000082908.1          snRNA     ENSEMBL           3        NA
##   ENSMUST00000162897.1 protein_coding      HAVANA           2        NA
##   ENSMUST00000159265.1 protein_coding      HAVANA           2        NA
##   ENSMUST00000070533.4 protein_coding      HAVANA           2   3216025
##                    ...            ...         ...         ...       ...
##   ENSMUST00000082419.1 protein_coding     ENSEMBL           3     13555
##   ENSMUST00000082420.1        Mt_tRNA     ENSEMBL           3        NA
##   ENSMUST00000082421.1 protein_coding     ENSEMBL           3     14145
##   ENSMUST00000082422.1        Mt_tRNA     ENSEMBL           3        NA
##   ENSMUST00000082423.1        Mt_tRNA     ENSEMBL           3        NA
##                           cdsEnd
##                        <numeric>
##   ENSMUST00000193812.1        NA
##   ENSMUST00000082908.1        NA
##   ENSMUST00000162897.1        NA
##   ENSMUST00000159265.1        NA
##   ENSMUST00000070533.4   3671348
##                    ...       ...
##   ENSMUST00000082419.1     14070
##   ENSMUST00000082420.1        NA
##   ENSMUST00000082421.1     15288
##   ENSMUST00000082422.1        NA
##   ENSMUST00000082423.1        NA
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths

The returned GRanges object does not contain the exon coordinates. We can further collapse different transcripts of the same gene (isoforms) to the gene level. The gene start is the minimum of all isoform starts, and the gene end is the maximum of all isoform ends.

genes_gr <- sesameData_txnToGeneGRanges(txns_gr)
genes_gr

## GRanges object with 55401 ranges and 2 metadata columns:
##                        seqnames          ranges strand |     gene_name
##                           <Rle>       <IRanges>  <Rle> |   <character>
##   ENSMUSG00000102693.1     chr1 3073253-3074322      + | 4933401J01Rik
##   ENSMUSG00000064842.1     chr1 3102016-3102125      + |       Gm26206
##   ENSMUSG00000051951.5     chr1 3205901-3671498      - |          Xkr4
##   ENSMUSG00000102851.1     chr1 3252757-3253236      + |       Gm18956
##   ENSMUSG00000103377.1     chr1 3365731-3368549      - |       Gm37180
##                    ...      ...             ...    ... .           ...
##   ENSMUSG00000064368.1     chrM     13552-14070      - |        mt-Nd6
##   ENSMUSG00000064369.1     chrM     14071-14139      - |         mt-Te
##   ENSMUSG00000064370.1     chrM     14145-15288      + |       mt-Cytb
##   ENSMUSG00000064371.1     chrM     15289-15355      + |         mt-Tt
##   ENSMUSG00000064372.1     chrM     15356-15422      - |         mt-Tp
##                                   gene_type
##                                 <character>
##   ENSMUSG00000102693.1                  TEC
##   ENSMUSG00000064842.1                snRNA
##   ENSMUSG00000051951.5       protein_coding
##   ENSMUSG00000102851.1 processed_pseudogene
##   ENSMUSG00000103377.1                  TEC
##                    ...                  ...
##   ENSMUSG00000064368.1       protein_coding
##   ENSMUSG00000064369.1              Mt_tRNA
##   ENSMUSG00000064370.1       protein_coding
##   ENSMUSG00000064371.1              Mt_tRNA
##   ENSMUSG00000064372.1              Mt_tRNA
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
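The collapsing rule stated above (gene start is the minimum isoform start, gene end is the maximum isoform end) can be sketched in base R on a plain data frame. The table below is made up for illustration, and the code is not how sesameData_txnToGeneGRanges is implemented; it only shows the arithmetic.

```r
# Toy transcript table; gene IDs and coordinates are made up.
txns_df <- data.frame(
    gene_id = c("geneA", "geneA", "geneA", "geneB"),
    chrm    = c("chr1", "chr1", "chr1", "chr2"),
    start   = c(100, 150, 120, 500),
    end     = c(400, 900, 300, 650),
    stringsAsFactors = FALSE
)

# Gene start = minimum isoform start; gene end = maximum isoform end.
gene_start <- tapply(txns_df$start, txns_df$gene_id, min)
gene_end   <- tapply(txns_df$end, txns_df$gene_id, max)

genes_df <- data.frame(
    gene_id = names(gene_start),
    start   = as.vector(gene_start),
    end     = as.vector(gene_end[names(gene_start)]),
    stringsAsFactors = FALSE
)
genes_df
```

Here the three isoforms of geneA collapse to a single interval from 100 to 900, while geneB keeps its only isoform's coordinates.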

Annotate probes

One can annotate given probe IDs using any genomic features stored in GRanges objects. For example, the following demonstrates the annotation of the first 500 Mammal40 probes for gene promoters.

probes <- names(sesameData_getManifestGRanges("Mammal40"))[1:500]
head(probes) # our input

## [1] "cg08067365" "cg13449535" "cg19945840" "cg13587552" "cg13354934"
## [6] "cg15492552"

txns <- sesameData_getTxnGRanges("hg38")
pm <- promoters(txns, upstream = 1500, downstream = 1500)
pm <- pm[pm$transcript_type == "protein_coding"]
sesameData_annoProbes(probes, pm, column = "gene_name")

## Platform set to: Mammal40

## GRanges object with 500 ranges and 1 metadata column:
##              seqnames            ranges strand |    gene_name
##                 <Rle>         <IRanges>  <Rle> |  <character>
##   cg08067365     chr1   1013513-1013514      - |        ISG15
##   cg13449535     chr1   1165968-1165969      + |         <NA>
##   cg19945840     chr1   1232656-1232657      - | SDF4,B3GALT6
##   cg13587552     chr1   1281061-1281062      - |       SCNN1D
##   cg13354934     chr1   1360713-1360714      - |        MXRA8
##          ...      ...               ...    ... .          ...
##   cg12093060     chr1 21666908-21666909      + |         <NA>
##   cg02760280     chr1 21667062-21667063      + |         <NA>
##   cg08608952     chr1 22056705-22056706      + |         <NA>
##   cg14655297     chr1 22089987-22089988      - |         <NA>
##   cg23019935     chr1 22089995-22089996      + |         <NA>
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
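The promoter windows used above are strand-aware: upstream and downstream are measured relative to the transcription start site (TSS), which is the start of a plus-strand transcript and the end of a minus-strand one. The base-R sketch below reproduces that arithmetic on made-up coordinates. promoter_window is a hypothetical helper, not part of sesameData or GenomicRanges, and it follows the convention that the downstream count includes the TSS itself.

```r
# Hypothetical helper: strand-aware promoter windows around the TSS.
# Coordinates are 1-based and inclusive.
promoter_window <- function(start, end, strand,
                            upstream = 1500, downstream = 1500) {
    plus <- strand == "+"
    # TSS is the start on the plus strand and the end on the minus strand.
    tss <- ifelse(plus, start, end)
    data.frame(
        start = ifelse(plus, tss - upstream, tss - downstream + 1),
        end   = ifelse(plus, tss + downstream - 1, tss + upstream)
    )
}

# Two made-up transcripts, one on each strand.
pw <- promoter_window(
    start  = c(10000, 20000),
    end    = c(12000, 25000),
    strand = c("+", "-")
)
pw
```

Both windows are 3000 bp wide; only their orientation relative to the transcript body differs.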

Manifest from GitHub

Sesame provides access to array manifests stored as GRanges objects. These GRanges objects are converted from the raw tsv files on our array annotation website using the conversion code. For example, the HM450 manifest can be retrieved with:

gr <- sesameData_getManifestGRanges("HM450")
length(gr)

## [1] 485545

Note that by default the GRanges object excludes decoy sequence probes (e.g., probes on _alt and _random contigs). To include them, we need to use the decoy = TRUE option in sesameData_getManifestDF.
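To make the notion of decoy contigs concrete, the base-R sketch below separates primary chromosomes from contigs whose names end in _alt or _random. The contig names are examples only, and the regular expression is an assumption made for illustration; it is not necessarily the filter sesameData applies internally.

```r
# Example contig names; the last two suffixes mark decoy-style contigs.
contigs <- c("chr1", "chr1_KI270706v1_random", "chr5_GL339449v2_alt", "chrX")

# Assumed pattern: names ending in "_alt" or "_random".
is_decoy <- grepl("_(alt|random)$", contigs)

primary <- contigs[!is_decoy]
primary
```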

Subset probes

One can directly get probes from different parts of the genome.

library(GenomicRanges)

regs <- GRanges('chr5', IRanges(135313937, 135419936))
sesameData_getProbesByRegion(regs, platform = 'Mammal40')

## GRanges object with 10 ranges and 0 metadata columns:
##              seqnames              ranges strand
##                 <Rle>           <IRanges>  <Rle>
##   cg18945109     chr5 135350775-135350776      +
##   cg14826942     chr5 135350857-135350858      -
##   cg14620903     chr5 135350865-135350866      -
##   cg12825194     chr5 135350880-135350881      -
##   cg10071034     chr5 135350884-135350885      -
##   cg04472379     chr5 135350953-135350954      +
##   cg25568354     chr5 135350962-135350963      +
##   cg08401998     chr5 135351012-135351013      +
##   cg23345269     chr5 135367261-135367262      +
##   cg22464003     chr5 135369530-135369531      +
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
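Selecting probes by region amounts to an interval-overlap test on the same chromosome. The base-R sketch below applies that test to three probes whose coordinates are copied from the outputs in this guide; it only illustrates the overlap logic and is not the implementation of sesameData_getProbesByRegion. probes_df, reg and hits are names introduced for illustration.

```r
# Three probes with coordinates taken from the outputs shown in this guide.
probes_df <- data.frame(
    probe = c("cg18945109", "cg22464003", "cg02171705"),
    chrm  = c("chr5", "chr5", "chrX"),
    start = c(135350775, 135369530, 9463141),
    end   = c(135350776, 135369531, 9463142),
    stringsAsFactors = FALSE
)

# The query region used in the example above.
reg <- list(chrm = "chr5", start = 135313937, end = 135419936)

# A probe overlaps the region if it is on the same chromosome,
# starts no later than the region ends, and ends no earlier than it starts.
in_region <- probes_df$chrm == reg$chrm &
    probes_df$start <= reg$end &
    probes_df$end >= reg$start

hits <- probes_df$probe[in_region]
hits
```

The two chr5 probes are retained and the chrX probe is dropped.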

sesameData_getProbesByChromosome('chrX', platform = 'Mammal40')

## GRanges object with 1294 ranges and 0 metadata columns:
##                          seqnames              ranges strand
##                             <Rle>           <IRanges>  <Rle>
##               cg02171705     chrX     9463141-9463142      +
##               cg01252899     chrX     9463189-9463190      +
##   rs2521373_II_F_C_37521     chrX             9508950      -
##               cg26545086     chrX     9944794-9944795      -
##               cg14704094     chrX   10566874-10566875      -
##                      ...      ...                 ...    ...
##               cg04337186     chrX 154031401-154031402      +
##               cg16330204     chrX 154766280-154766281      +
##               cg00547789     chrX 155264465-155264466      -
##               cg10512285     chrX 155264469-155264470      -
##               cg18230281     chrX 155264505-155264506      +
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

sesameData_getAutosomeProbes("Mammal40")

## GRanges object with 36126 ranges and 0 metadata columns:
##              seqnames            ranges strand
##                 <Rle>         <IRanges>  <Rle>
##   cg08067365     chr1   1013513-1013514      -
##   cg13449535     chr1   1165968-1165969      +
##   cg19945840     chr1   1232656-1232657      -
##   cg13587552     chr1   1281061-1281062      -
##   cg13354934     chr1   1360713-1360714      -
##          ...      ...               ...    ...
##   cg17285325    chr22 50529914-50529915      -
##   cg00083937    chr22 50601376-50601377      +
##   cg00256932    chr22 50603303-50603304      +
##   cg13194594    chr22 50679061-50679062      +
##   cg19491113    chr22 50722054-50722055      -
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

sesameData_getProbesByGene('DNMT3A', "Mammal40", upstream=500)

## GRanges object with 9 ranges and 0 metadata columns:
##              seqnames            ranges strand
##                 <Rle>         <IRanges>  <Rle>
##   cg11228575     chr2 25228462-25228463      -
##   cg16316743     chr2 25232953-25232954      -
##   cg23393100     chr2 25234335-25234336      -
##   cg20545546     chr2 25240304-25240305      -
##   cg19346456     chr2 25240337-25240338      -
##   cg09611799     chr2 25240401-25240402      -
##   cg00206304     chr2 25241581-25241582      -
##   cg15034063     chr2 25247679-25247680      -
##   cg26550430     chr2 25274954-25274955      -
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

sesameData_getProbesByTSS('DNMT3A', "Mammal40")

## GRanges object with 2 ranges and 0 metadata columns:
##              seqnames            ranges strand
##                 <Rle>         <IRanges>  <Rle>
##   cg00206304     chr2 25241581-25241582      -
##   cg15034063     chr2 25247679-25247680      -
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

One can also get all TSS probes with:

TSSprobes <- sesameData_getProbesByTSS(NULL, "Mammal40")

Get nearby genes

sesameData_getGenesByProbes(c("cg14620903","cg22464003"), max_distance = 10000)

## Platform set to: Mammal40

## GRanges object with 1 range and 2 metadata columns:
##                      seqnames              ranges strand |   gene_name
##                         <Rle>           <IRanges>  <Rle> | <character>
##   ENSG00000113648.16     chr5 135333900-135399914      - |   MACROH2A1
##                           gene_type
##                         <character>
##   ENSG00000113648.16 protein_coding
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Installation

From Bioconductor

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesameData")

The development version can be installed from GitHub:

BiocManager::install("zwdzwd/sesameData")

sessionInfo()

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0  IRanges_2.30.0      
##  [4] S4Vectors_0.34.0     sesameData_1.14.0    ExperimentHub_2.4.0 
##  [7] AnnotationHub_3.4.0  BiocFileCache_2.4.0  dbplyr_2.1.1        
## [10] BiocGenerics_0.42.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3                  png_0.1-7                    
##  [3] Biostrings_2.64.0             assertthat_0.2.1             
##  [5] digest_0.6.29                 utf8_1.2.2                   
##  [7] mime_0.12                     R6_2.5.1                     
##  [9] RSQLite_2.2.12                evaluate_0.15                
## [11] httr_1.4.2                    pillar_1.7.0                 
## [13] zlibbioc_1.42.0               rlang_1.0.2                  
## [15] curl_4.3.2                    jquerylib_0.1.4              
## [17] blob_1.2.3                    rmarkdown_2.14               
## [19] readr_2.1.2                   stringr_1.4.0                
## [21] RCurl_1.98-1.6                bit_4.0.4                    
## [23] shiny_1.7.1                   compiler_4.2.0               
## [25] httpuv_1.6.5                  xfun_0.30                    
## [27] pkgconfig_2.0.3               htmltools_0.5.2              
## [29] tidyselect_1.1.2              KEGGREST_1.36.0              
## [31] tibble_3.1.6                  GenomeInfoDbData_1.2.8       
## [33] interactiveDisplayBase_1.34.0 fansi_1.0.3                  
## [35] withr_2.5.0                   tzdb_0.3.0                   
## [37] crayon_1.5.1                  dplyr_1.0.8                  
## [39] later_1.3.0                   bitops_1.0-7                 
## [41] rappdirs_0.3.3                jsonlite_1.8.0               
## [43] xtable_1.8-4                  lifecycle_1.0.1              
## [45] DBI_1.1.2                     magrittr_2.0.3               
## [47] cli_3.3.0                     stringi_1.7.6                
## [49] cachem_1.0.6                  XVector_0.36.0               
## [51] promises_1.2.0.1              bslib_0.3.1                  
## [53] ellipsis_0.3.2                filelock_1.0.2               
## [55] generics_0.1.2                vctrs_0.4.1                  
## [57] tools_4.2.0                   bit64_4.0.5                  
## [59] Biobase_2.56.0                glue_1.6.2                   
## [61] purrr_0.3.4                   BiocVersion_3.15.2           
## [63] hms_1.1.1                     fastmap_1.1.0                
## [65] yaml_2.3.5                    AnnotationDbi_1.58.0         
## [67] BiocManager_1.30.17           memoise_2.0.1                
## [69] knitr_1.39                    sass_0.4.1

SeSAMe Data User Guide