SingleMoleculeFootprinting is an R package providing functions to analyze Single Molecule Footprinting (SMF) data. Following the workflow exemplified in this vignette, the user will be able to perform basic data analysis of SMF data with minimal coding effort.
Starting from an aligned bam file, we show how to
For installation, the user can use the following:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("SingleMoleculeFootprinting")
For compatibility with our analysis tools, we recommend the user to perform genomic alignments using the qAlign
function from QuasR as exemplified in the chunk below.
library(parallel)
library(QuasR)
library(BSgenome.Mmusculus.UCSC.mm10)
cl <- makeCluster(40)
prj <- QuasR::qAlign(sampleFile = Qinput,
genome = "BSgenome.Mmusculus.UCSC.mm10",
aligner = "Rbowtie",
projectName = "prj",
paired = "fr",
bisulfite = "undir",
alignmentParameter = "-e 70 -X 1000 -k 2 --best -strata",
alignmentsDir = "./",
cacheDir = tempdir(),
clObj = cl)
SingleMoleculeFootprinting inherits QuasR’s philosophy of working with pointer files. Briefly, a pointer is a tab-delimited file with two or three fields indicating the location of the input data files. For more details, please check the qAlign
documentation. Here we show how our pointer file looks like.
N.b.: The execution of the example code in this vignette, as well as some functions of this software package, depend on accessory data that can be downloaded and cached through the data package SingleMoleculeFootprintingData. Under the hood, we download this data for the user and create the QuasR sampleSheet file for the example data at tempdir()
under the name NRF1Pair_Qinput.txt
. We just ask the user to make sure that their default cache directory has enough storage capacity (~1 Gb). The user can check this via ExperimentHub::getExperimentHubOption(arg = "CACHE")
and eventually change this value via ExperimentHub::setExperimentHubOption(arg = "CACHE", value = "/your/favourite/location")
.
Before investing in a deep sequencing run for a SMF experiment, it is advisable to first perform a shallow sequencing run and to perform quality controls on the sequencing libraries.
It is always a good idea to examine canonical quality measures after aligning. We advice the user to employ the qQCreport
function from QuasR.
If SMF was performed on an large genome (e.g Mouse) it is possible that bait capture was performed to focus the sequencing efforts: here we check how efficient that process was by essentially computing the ratio of genomic alignments inside bait regions over the total ones.
BaitRegions <- SingleMoleculeFootprintingData::EnrichmentRegions_mm10.rds()
## see ?SingleMoleculeFootprintingData and browseVignettes('SingleMoleculeFootprintingData') for documentation
## loading from cache
clObj = makeCluster(5)
BaitCaptureEfficiency <- BaitCapture(sampleSheet = Qinput,
genome = BSgenome.Mmusculus.UCSC.mm10,
baits = BaitRegions, clObj = clObj)
## all necessary alignment files found
## preparing to run on 5 nodes...done
## counting alignments...done
## preparing to run on 5 nodes...done
## counting alignments...done
stopCluster(clObj)
BaitCaptureEfficiency
## [1] 1
In this case the capture efficiency equals to 1 because the example data was purposefully subset for an interesting region which falls entirely within baits. Under normal circumstances, one expects this value to be lower than 1 for the mouse genome for example we observe values around 0.7.
Here we ask how much of the observed Cytosine methylation falls in the expected contexts (CpG or GpC). Beware, this is a much slower computation than the above.
ConversionRateValue <- ConversionRate(sampleSheet = Qinput,
genome = BSgenome.Mmusculus.UCSC.mm10,
chr = 6)
For this sample, the observed conversion rate is 92.4%. This value happens to be slightly below the expected value of ~95%
Methylation values at the single molecule level can be extracted for the region of interest from aligned data using the CallContextMethylation
function. Under the hood, the function performs the following operations:
ExperimentType | substring | RelevanContext | Notes |
---|---|---|---|
Single enzyme SMF | _NO_ | DGCHN & NWCGW | Methylation info is reported separately for each context |
Double enzyme SMF | _DE_ | GCH + HCG | Methylation information is aggregated across the contexts |
No enzyme (endogenous mCpG only) | _SS_ | CG | - |
MySample <- suppressMessages(readr::read_delim(Qinput, delim = "\t")[[2]])
Region_of_interest <- GRanges(seqnames = "chr6",
ranges = IRanges(start = 88106000,
end = 88106500),
strand = "*")
Methylation <- CallContextMethylation(sampleSheet = Qinput,
sample = MySample,
genome = BSgenome.Mmusculus.UCSC.mm10,
range = Region_of_interest,
coverage = 20,
ConvRate.thr = 0.2)
## Setting QuasR project
## all necessary alignment files found
## Calling methylation at all Cytosines
## 0.9% of reads found with conversion rate above 0.2
## Subsetting Cytosines by permissive genomic context (NGCNN, NNCGN)
## Collapsing strands
## 0 reads found mapping to the + strand, collapsing to -
## 0 reads found mapping to the + strand, collapsing to -
## Filtering Cs for coverage
## Detected experiment type: DE
## Subsetting Cytosines by strict genomic context (GCH, HCG) based on the detected experiment type: DE
## Merging matrixes
Methylation[[1]]
## GRanges object with 38 ranges and 1 metadata column:
## seqnames ranges strand | NRF1pair_DE__M
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr6 88106112 - | 0.741117
## [2] chr6 88106116 - | 0.589141
## [3] chr6 88106118 - | 0.838407
## [4] chr6 88106127 - | 0.543319
## [5] chr6 88106138 - | 0.400601
## ... ... ... ... . ...
## [34] chr6 88106434 - | 0.554072
## [35] chr6 88106444 - | 0.607526
## [36] chr6 88106455 - | 0.679593
## [37] chr6 88106492 - | 0.720641
## [38] chr6 88106495 - | 0.809129
## -------
## seqinfo: 239 sequences (1 circular) from mm10 genome
Methylation[[2]][1:10, 1:10]
## 88106112 88106116 88106118
## M00758:819:000000000-CB7NB:1:1101:10081:9865 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 1 1
## 88106127 88106138 88106153
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 0 1 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10219:24481 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10408:27375 0 0 0
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 0 1 0
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 0 1
## 88106155 88106158 88106164
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1 0 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1 1 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1 1 1
## 88106172
## M00758:819:000000000-CB7NB:1:1101:10081:9865 1
## M00758:819:000000000-CB7NB:1:1101:10119:12887 1
## M00758:819:000000000-CB7NB:1:1101:10172:5248 1
## M00758:819:000000000-CB7NB:1:1101:10214:24193 1
## M00758:819:000000000-CB7NB:1:1101:10219:24481 1
## M00758:819:000000000-CB7NB:1:1101:10408:27375 1
## M00758:819:000000000-CB7NB:1:1101:10428:10873 1
## M00758:819:000000000-CB7NB:1:1101:10428:22900 1
## M00758:819:000000000-CB7NB:1:1101:10453:11606 1
## M00758:819:000000000-CB7NB:1:1101:10489:13730 1
The following chunk has the sole scope of downsampling reads to make the vignette lighter. The user should skip this.
Already at this stage it is possible to visually inspect results. To that end, we provide the function PlotAvgSMF
to plot the average SMF signal, defined as 1 - average methylation, at a genomic locus of interest. Optionally, the user can pass a GRanges object of genomic coordinates for the transcription factor binding sites (TFBSs) of interest to include in the plot, we show an example of such an object.
TFBSs <- GenomicRanges::GRanges("chr6",
IRanges(c(88106216, 88106253),
c(88106226, 88106263)),
strand = "-")
elementMetadata(TFBSs)$name <- c("NRF1", "NRF1")
names(TFBSs) <- c(paste0("TFBS_", c(4305215, 4305216)))
TFBSs
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## TFBS_4305215 chr6 88106216-88106226 - | NRF1
## TFBS_4305216 chr6 88106253-88106263 - | NRF1
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
PlotAvgSMF(MethGR = Methylation[[1]],
range = Region_of_interest,
TFBSs = TFBSs)
Furthermore, the function PlotSM
can be used to plot the single molecule SMF information at a given site.
PlotSM(MethSM = Methylation[[2]],
range = Region_of_interest)
## No sorting passed or specified, will plot unsorted reads
Optionally, hierarchical clustering can be performed by setting the parameter SortedReads = "HC"
. This can be useful to aggregate reads visually in order to spot PCR artifacts. N.b. reads will be downsampled to 500.
Ultimately though, we want to classify reads based on their patterns of molecular occupancy. To that end we provide the functions SortReadsBySingleTF
and SortReadsByTFCluster
to classify reads based either on the occupancy patterns of one or multiple transcription factors.
Under the hood, the classification is based on the definition of \(n+2\) bins (with \(n\) being the number of TFs). The \(n\) bins are each centered around one of the TFBSs of interest, while the 2 extra bins are located upstream and downstream of the two outmost TFBSs.
For SortReadsBySingleTF
, the coordinates of the bins relative to the center of the TFBS are [-35;-25], [-15;+15], [+25,+35]. Instead, the function SortReadsByTFCluster
draws a bin with coordinates [-7;+7] around the center of each TFBS, a bin with coordinates [-35;-25] relative to center of the most upstream TFBS and a bin with coordinates [+25,+35] relative to the center of the most downstream TFBS. The user can also employ custom coordinates by specifying them explicitly using the function SortReads
.
For each read, the binary methylation status of the cytosines contained in each bin is averaged to either a 0 or a 1 such that each read is eventually represented as sequence of \(n+2\) binary digits, for a total of \(2^{(n+2)}\) possible sequences.
Here, we show a usage case for the SortReadsByTFCluster
function, as we have already identified the double binding of NRF1 at the genomic site under scrutiny. Usage and exploration of the output is identical for the other function, except for the the format of the TFBSs argument which should consist of a GRanges object of length 1 for SortReadsBySingleTF
and of length \(>\) 1 for SortReadsByTFCluster
.
SortedReads_TFCluster <- SortReadsByTFCluster(MethSM = Methylation[[2]],
TFBSs = TFBSs)
## Collecting summarized methylation for bins
## TF cluster mode
## Number of retrieved states: 13
## States population:
## 0000 0001 0011 0101 0111 1000 1001 1010 1011 1100 1101 1110 1111
## 2 10 8 3 14 22 341 5 152 32 103 17 287
N.b. non-populated states are not returned.
The output of each of these sorting functions can be passed directly as the SortedReads
argument of the PlotSM
function.
PlotSM(MethSM = Methylation[[2]],
range = Region_of_interest,
SortedReads = SortedReads_TFCluster)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair
N.b. despite sorting reads by a TF cluster is in principle possible with clusters of any size, as of now the PlotSM
function can only deal with TF pairs.
In order to be quantitative about these observations, the user can employ the StateQuantificationPlot
. The function outputs a bar plot annotated with the percentage of reads found in each state. The function takes, as argument, the output of either of the two sorting functions.
StateQuantificationPlot(SortedReads = SortedReads_TFCluster)
## Inferring sorting was performed by TF pair
Finally, we provide the wrapper function PlotSingleSiteSMF
to plot at once the three kinds of information detailed above and to export results as a pdf.
PlotSingleSiteSMF(ContextMethylation = Methylation,
sample = MySample,
range = Region_of_interest,
SortedReads = SortedReads_TFCluster,
TFBSs = TFBSs,
saveAs = NULL)
## Subsetting data by range (extended)
## Sorting reads according to passed values before plotting
## Inferring sorting was performed by TF pair
## Inferring sorting was performed by TF pair
This vignette was produced under:
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SingleMoleculeFootprintingData_1.3.0 SingleMoleculeFootprinting_1.4.0
## [3] BSgenome.Mmusculus.UCSC.mm10_1.4.3 BSgenome_1.64.0
## [5] rtracklayer_1.56.0 Biostrings_2.64.0
## [7] XVector_0.36.0 GenomicRanges_1.48.0
## [9] GenomeInfoDb_1.32.0 IRanges_2.30.0
## [11] S4Vectors_0.34.0 BiocGenerics_0.42.0
##
## loaded via a namespace (and not attached):
## [1] rjson_0.2.21 hwriter_1.3.2.1
## [3] ellipsis_0.3.2 Rbowtie_1.36.0
## [5] bit64_4.0.5 interactiveDisplayBase_1.34.0
## [7] AnnotationDbi_1.58.0 fansi_1.0.3
## [9] xml2_1.3.3 cachem_1.0.6
## [11] knitr_1.38 QuasR_1.36.0
## [13] jsonlite_1.8.0 Rsamtools_2.12.0
## [15] dbplyr_2.1.1 png_0.1-7
## [17] shiny_1.7.1 BiocManager_1.30.17
## [19] readr_2.1.2 compiler_4.2.0
## [21] httr_1.4.2 assertthat_0.2.1
## [23] Matrix_1.4-1 fastmap_1.1.0
## [25] cli_3.3.0 later_1.3.0
## [27] htmltools_0.5.2 prettyunits_1.1.1
## [29] tools_4.2.0 glue_1.6.2
## [31] GenomeInfoDbData_1.2.8 dplyr_1.0.8
## [33] rappdirs_0.3.3 ShortRead_1.54.0
## [35] Rcpp_1.0.8.3 Biobase_2.56.0
## [37] jquerylib_0.1.4 vctrs_0.4.1
## [39] ExperimentHub_2.4.0 xfun_0.30
## [41] stringr_1.4.0 mime_0.12
## [43] lifecycle_1.0.1 restfulr_0.0.13
## [45] XML_3.99-0.9 AnnotationHub_3.4.0
## [47] zlibbioc_1.42.0 vroom_1.5.7
## [49] VariantAnnotation_1.42.0 hms_1.1.1
## [51] promises_1.2.0.1 MatrixGenerics_1.8.0
## [53] SummarizedExperiment_1.26.0 RColorBrewer_1.1-3
## [55] yaml_2.3.5 curl_4.3.2
## [57] memoise_2.0.1 sass_0.4.1
## [59] biomaRt_2.52.0 latticeExtra_0.6-29
## [61] stringi_1.7.6 RSQLite_2.2.12
## [63] highr_0.9 BiocVersion_3.15.2
## [65] BiocIO_1.6.0 GenomicFeatures_1.48.0
## [67] filelock_1.0.2 BiocParallel_1.30.0
## [69] rlang_1.0.2 pkgconfig_2.0.3
## [71] GenomicFiles_1.32.0 matrixStats_0.62.0
## [73] bitops_1.0-7 evaluate_0.15
## [75] lattice_0.20-45 purrr_0.3.4
## [77] GenomicAlignments_1.32.0 bit_4.0.4
## [79] tidyselect_1.1.2 plyr_1.8.7
## [81] magrittr_2.0.3 R6_2.5.1
## [83] generics_0.1.2 DelayedArray_0.22.0
## [85] DBI_1.1.2 pillar_1.7.0
## [87] withr_2.5.0 KEGGREST_1.36.0
## [89] RCurl_1.98-1.6 tibble_3.1.6
## [91] crayon_1.5.1 utf8_1.2.2
## [93] BiocFileCache_2.4.0 tzdb_0.3.0
## [95] rmarkdown_2.14 jpeg_0.1-9
## [97] progress_1.2.2 grid_4.2.0
## [99] data.table_1.14.2 blob_1.2.3
## [101] digest_0.6.29 xtable_1.8-4
## [103] httpuv_1.6.5 bslib_0.3.1