The package can be directly installed from the Bioconductor website.
Load the package before starting the analysis.
The package provides example datasets to run the functions.
data("TestData", package = "DominoEffect")
data("SnpData", package = "DominoEffect")
data("DominoData", package = "DominoEffect")
These can be used to run the package and obtain an example result data frame:
The default settings are to use the package-provided data that will be loaded in the objects mutation_dataset, gene_data and snp_data, respectively. Required column names are described below and the user should adher to these when providing their own mutation dataset. Gene information (DominoData) and SNP data (SnpData) correspond to the Ensembl version 73 (GRCh37). Optionally, the user can use data for different Ensembl releases and provide larger SNP datasets.
The package identifies individual amino acid residues, which accumulate a high fraction of the overall mutation load within a protein. Such hotspot mutation residues could have critical functions in regulating cancer-associated cellular processes. After detecting mutation hotspots, the package obtains functional and structural annotations for the affected protein regions as these could aid interpretation of mutation effects. The package is based on the Ensembl version 73 (GRCh37), but it is also flexible with allowing a user to obtain coordinates for different Ensembl releases via the BiomaRt package. The package can be run with the default options using the following command:
hotspot_mutations <- DominoEffect(mutation_dataset = TestData,
gene_data = DominoData, snp_data = SnpData)
## Warning: Ensembl will soon enforce the use of https.
## Ensure the 'host' argument includes "https://"
An example mutation set is provided as an object data(TestData, package = “DominoEffect”). It contains mutations reported in a sequencing study for the thyroid cancer (Integrated genomic characterization of papillary thyroid carcinoma, Cell 2014 159(3):676-90, Cancer Genome Atlas Research Network). The mutation_dataset should provided by the user should have the same columns (see below). The Mutation data can also be provided as a GRangesList or GRanges object (see GenomicRanges package).
Ensembl_gene_id | Protein_residue | Original_aa | Mutated_aa | Patient_id | Genomic_coordinate | Original_base | Mutated_base |
---|---|---|---|---|---|---|---|
ENSG00000204518 | 91 | D | D | TCGA-V4-A9F1-01A-11D-A39W-08 | 1:12711246 | C | T |
ENSG00000204518 | 112 | R | W | TCGA-V4-A9EK-01A-11D-A39W-08 | 1:12711307 | C | T |
ENSG00000144452 | 1036 | Q | E | TCGA-WC-AA9E-01A-11D-A39W-08 | 2:215865502 | G | C |
ENSG00000179869 | 875 | Q | L | TCGA-V4-A9EK-01A-11D-A39W-08 | 7:48311887 | A | T |
ENSG00000167972 | 689 | D | D | TCGA-YZ-A985-01A-11D-A39W-08 | 16:2347526 | G | A |
ENSG00000167972 | 155 | R | Q | TCGA-YZ-A985-01A-11D-A39W-08 | 16:2373673 | C | T |
Results of the analysis will be saved as a file Protein_hotspot_residues.txt in the working directory at it will contain positions of hotspot residues as well as information on the protein region affected by the hotspot. We recommend to further use one or more of the tools that prioritizes mutations with a likely deleterious effect and to use additional resources to filter out common population polymorphisms.
Gene | Ensembl_id | Prot_position | N_mut | Protein_funcional_region | Repres_tr | Assoc_unip_ids | Perc_region_1 | Perc_region_2 | Adj_p_value_region_1 | Adj_p_value_region_2 | Prot_length |
---|---|---|---|---|---|---|---|---|---|---|---|
GNAQ | ENSG00000156052 | 209 | 37 | domain: G-alpha; nucleotide binding: GTP; region: G3 motif | ENST00000286548 | P50148 | 94.87% | 94.87% | 0 | 0 | 359 |
GNA11 | ENSG00000088256 | 209 | 34 | domain: G-alpha; nucleotide binding: GTP; region: G3 motif | ENST00000078429 | P29992 | 94.44% | 94.44% | 0 | 0 | 359 |
SF3B1 | ENSG00000115524 | 625 | 14 | repeat: HEAT 3 | ENST00000335508 | O75533 | 77.78% | 77.78% | 0 | 0 | 1304 |
PRMT8 | ENSG00000111218 | 31 | 5 | region: Disordered; motif: SH3-binding 1 | ENST00000382622 | Q9NR22 | 100.00% | 100.00% | 0 | 0 | 394 |
CHEK2 | ENSG00000183765 | 416 | 5 | domain: Protein kinase; region: T-loop/activation segment | ENST00000382580 | O96017 | 55.56% | 55.56% | 0 | 0 | 586 |
The package can also be ran by modifying any or all of the options that are discussed in the section below:
The analysis has several parameters that define what a hotspot residue is. The default values are suggested below, but these can all be easily modified. As a recommendation, we provide values that we found useful when analyzing the pan-cancer dataset, but for smaller cohorts we strongly suggest these to be relaxed. For instance, for smaller cohorts to set: min_n_muts to 3. We recommend the user to try different parameters and if they allow shorter windows around the searched residues to increase the percentage threshold (which defines a minimum fraction of mutations within a window that need to map to a single residue). It is also possible to replace the percentage threshold with a parameter that defines overrepresentation compared to the expected number of mutations at a single residue. Details below.
Define how often a mutation needs to occur on a specific amino acid to be considered as a possible hotspot residue.
Threshold for the minoar allele frequence in the population above which a variant is considered a common variant and will not be assessed as a potential hotspot mutation.
Size of the sequence region (in amino acids) to which the frequency of the muation is compared to. We recommend asking for the hotspot to be significant within windows of different lengths, but it is also possible to use a single window.
Statistical threshold for the residues with frequent mutations that should be considered a protein hotspot. The value defines the adjusted p-values after performing a Poisson test and Benjamini-Hochberg correction for multiple testing.
A fraction of mutations within the window that need to fall on a single residue in order for it to be classified as a hotspot.
Requirement that a number of mutations on a single residue should exceed what would be expected by chance given a background mutation rate in the window (i.e. the surrounding region).
Hotspots are always filtered to include only residues that are significant according to the defined p-value threshold (i.e. adjusted Poisson p-value: poisson.thr above). Additionally, the parameter ‘approach’ defines whether a percentage or overrepresentation should be assessed to define if the residue accumulates a high fraction of mutations within the window. The options are to set the ‘approach’ parameter to : use “percentage”, “ratio” or “both”. When “percentage” is used the approach checks for the percentage.thr, when “ratio” it cheks for ratio.thr and when it is defined as “both” than it checks both percentage.thr and ratio.thr. If the user considers the p-value threshold to be sufficient, they can set percentage.thr and ratio.thr to zero.
If the write_to_file is set to “YES” the result object ‘results_w_annotations’ will be saved in the working directory as a file ‘Protein_hotspot_residues.txt’. The default is “NO”.
hotspot_mutations <- DominoEffect(mutation_dataset = TestData,
gene_data = DominoData, snp_data = SnpData)
It is also possible to run separately the function that identifies hotspot mutations. The function: identify_hotspots() can be ran on any unique identifiers, but the same column names as in the example DominoData should be preserved. If using different Ensembl releases, the user should provide a table of the same format, and with the same column names (for this we recommend using online Ensembl BioMart or the R package biomaRt). Ensembl identifiers are necessary for obtaining protein sequences in the function: map_to_func_elem().
hotspot_mutations <- identify_hotspots(mutation_dataset = TestData,
gene_data = DominoData,
snp_data = SnpData, min_n_muts = 5,
MAF_thresh = 0.01,
flanking_region = c(200, 300),
poisson.thr = 0.01,
percentage.thr = 0.15, ratio.thr = 45,
approach = "percentage")
If desired, the user can then annotate the identified hotspots using information from the UniProt/Swiss=Prot knowledgebase. Hotspots are identified on the Ensembl sequences. To make sure there are no any discrepancies, during this step sequences are retrieved from the UniProt KB and Ensembl Biomart and then the the Ensembl segment with the hotspot residue (15 aa) is aligned to the UniProt sequence. If different ensembl release than the default is used, two functions should be ran separately and a host address for the respective Ensembl release should be specified in the call to the function: map_to_func_elem(). For instance, for the current release: ens_release=“www.ensembl.org” instead of ens_release = “73”.
As input data the package needs these three datasets. We provide a small example mutation data set TestData (see above).
DominoData (that is used for gene_data) contains basic information for the genes in the Ensembl version 73. This includes: Ensembl gene identifier, Representative transcript identifier (i.e. a transcript with the longest protein coding sequence), cDNA_length of the representative transcript, Gene_name and Associated UniProt identifiers. The DominoData can also be provided as an TxDB object obtained from the makeTxDbFromEnsembl function (Genomic Features package), however, the functional annotation of hotspots will not be possible as this relies on Uniprot identifiers.
The required format of the gene_data is the following:
Ensembl_gene_id | Representative_tr | cDNA_length | Gene_name | Uniprot_id |
---|---|---|---|---|
ENSG00000256574 | ENST00000553795 | 987 | OR13A1 | Q8NGR1 |
ENSG00000127720 | ENST00000248306 | 1812 | METTL25 | Q8N6Q8 |
ENSG00000109819 | ENST00000264867 | 2397 | PPARGC1A | Q9UBK2 |
ENSG00000161057 | ENST00000435765 | 1302 | PSMC2 | P35998 |
ENSG00000237787 | ENST00000446603 | 303 | C3orf79 | P0CE67 |
ENSG00000051596 | ENST00000265097 | 1056 | THOC3 | Q96J01 |
Finally, to exclude common polymorphisms we provide SnpData, a set of SNPs with a population frequency higher than 1% that we obtained from the Ensembl BioMart version 73. The user can also set snp_data = NULL if they do not wish to include this in the analysis. Preferably, they should however use as comprehensive set of population polymorphisms as possible. The SnpData can also be provided as an vcf object (see VariantAnnotation package) or GPo object (see GenomicRanges package). The dataset has the following format:
Chr_name | Position_on_chr | Minor_allele_freq |
---|---|---|
1 | 150917624 | 0.0160 |
1 | 150936280 | 0.0124 |
1 | 169823521 | 0.0261 |
1 | 169823718 | 0.3516 |
1 | 169823790 | 0.0472 |
1 | 17256626 | 0.0124 |
The genomic information on the hotspot mutations can be converted into a GPo object for further analyses in other Bioconductor packages.
## UnstitchedGPos object with 6 positions and 4 metadata columns:
## seqnames pos strand | REF_NT MUT_NT Gene_ID
## <Rle> <integer> <Rle> | <character> <character> <character>
## [1] chr 22 29091840 * | T C ENSG00000183765
## [2] chr 19 3118942 * | A T ENSG00000088256
## [3] chr 9 80409488 * | T G ENSG00000156052
## [4] chr 9 80409488 * | T A ENSG00000156052
## [5] chr 12 3649787 * | T C ENSG00000111218
## [6] chr 2 198267483 * | C T ENSG00000115524
## Protein_pos
## <character>
## [1] 416
## [2] 209
## [3] 209
## [4] 209
## [5] 31
## [6] 625
## -------
## seqinfo: 5 sequences from an unspecified genome; no seqlengths
If the user is starting from genomic mutations, we are happy to share a perl script we wrote for mapping these to Ensembl protein residues. Please contact the package maintainers.
##Session Info
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DominoEffect_1.16.0 knitr_1.38
##
## loaded via a namespace (and not attached):
## [1] MatrixGenerics_1.8.0 Biobase_2.56.0
## [3] httr_1.4.2 sass_0.4.1
## [5] bit64_4.0.5 jsonlite_1.8.0
## [7] bslib_0.3.1 assertthat_0.2.1
## [9] highr_0.9 stats4_4.2.0
## [11] BiocFileCache_2.4.0 blob_1.2.3
## [13] BSgenome_1.64.0 GenomeInfoDbData_1.2.8
## [15] Rsamtools_2.12.0 yaml_2.3.5
## [17] progress_1.2.2 pillar_1.7.0
## [19] RSQLite_2.2.12 lattice_0.20-45
## [21] glue_1.6.2 digest_0.6.29
## [23] GenomicRanges_1.48.0 XVector_0.36.0
## [25] htmltools_0.5.2 Matrix_1.4-1
## [27] XML_3.99-0.9 pkgconfig_2.0.3
## [29] biomaRt_2.52.0 zlibbioc_1.42.0
## [31] purrr_0.3.4 BiocParallel_1.30.0
## [33] tibble_3.1.6 KEGGREST_1.36.0
## [35] generics_0.1.2 IRanges_2.30.0
## [37] ellipsis_0.3.2 withr_2.5.0
## [39] cachem_1.0.6 SummarizedExperiment_1.26.0
## [41] GenomicFeatures_1.48.0 BiocGenerics_0.42.0
## [43] cli_3.3.0 magrittr_2.0.3
## [45] crayon_1.5.1 memoise_2.0.1
## [47] evaluate_0.15 fansi_1.0.3
## [49] xml2_1.3.3 tools_4.2.0
## [51] data.table_1.14.2 prettyunits_1.1.1
## [53] hms_1.1.1 BiocIO_1.6.0
## [55] lifecycle_1.0.1 matrixStats_0.62.0
## [57] stringr_1.4.0 S4Vectors_0.34.0
## [59] DelayedArray_0.22.0 AnnotationDbi_1.58.0
## [61] Biostrings_2.64.0 compiler_4.2.0
## [63] jquerylib_0.1.4 GenomeInfoDb_1.32.0
## [65] rlang_1.0.2 grid_4.2.0
## [67] RCurl_1.98-1.6 rjson_0.2.21
## [69] VariantAnnotation_1.42.0 rappdirs_0.3.3
## [71] bitops_1.0-7 rmarkdown_2.14
## [73] restfulr_0.0.13 DBI_1.1.2
## [75] curl_4.3.2 R6_2.5.1
## [77] GenomicAlignments_1.32.0 dplyr_1.0.8
## [79] rtracklayer_1.56.0 fastmap_1.1.0
## [81] bit_4.0.4 utf8_1.2.2
## [83] filelock_1.0.2 stringi_1.7.6
## [85] parallel_4.2.0 Rcpp_1.0.8.3
## [87] vctrs_0.4.1 png_0.1-7
## [89] dbplyr_2.1.1 tidyselect_1.1.2
## [91] xfun_0.30