receptLoss is an R package designed to identify novel nuclear hormone receptors (NHRs) whose expression levels in cancers could serve as biomarkers for patient survival.

By utilizing both expression data from both tumor and normal tissue, receptLoss provides biological context to the process of tumor subclassification that is lacking in existing methods that rely solely on expression patterns in tumor tissue.

receptLoss is complementary to oncomix. Whereas oncomix detects genes that gain expression in subsets of tumors relative to normal tissue, receptLoss detects genes that lose expression in subsets of tumors relative to normal tissue.

Installation

## Install the development version from GitHub
devtools::install_github("dpique/receptLoss", 
                            build_opts=c("--no-resave-data", "--no-manual"),
                            build_vignettes=TRUE)

Usage

receptLoss consists of 2 main functions:

  • receptLoss() takes in 2 matrices of gene expression data, one from tumor and one from adjacent normal tissue. The output is a matrix with rows representing genes and columns representing summary statistics.

  • plotReceptLoss() generates a histogram visualization of the distribution of the gene desired by the user.

We begin by simulating two gene expression data matrices, one from tumor and the other from normal tissue.

library(receptLoss)
library(dplyr)
library(ggplot2)

set.seed(100)

## Simulate matrix of expression values from
## 10 genes measured in both normal tissue and 
## tumor tissue in 100 patients

exprMatrNml <- matrix(abs(rnorm(100, mean=2.5)), nrow=10) 
exprMatrTum <- matrix(abs(rnorm(1000)), nrow=10)
geneNames <- paste0(letters[seq_len(nrow(exprMatrNml))], 
                seq_len(nrow(exprMatrNml)))
rownames(exprMatrNml) <- rownames(exprMatrTum) <- geneNames

exprMatrNml and exprMatrTum are \(m \times n\) matrices containing gene expression data from normal and tumor tissue, respectively, with \(m\) genes as rows and \(n\) patients as columns. The row names of these matrices are the gene names.

These two matrices should have the same number of rows (ie genes), with genes listed in the same order between the two matrices. However, they don’t have to have the same number of columns (ie patients).

To run receptLoss(), we also define 2 parameters:

  • nSdBelow is an integer value that places a lower boundary (i.e. lowerBound, shown as the pink ‘B’ in image below) \(n\) standard deviations below the mean of each gene’s expression levels in normal tissue (dotted pink curve below). The larger nSdBelow is, the smaller (i.e. further to the left) the lowerBound becomes.

    • We recommend setting nSdBelow=2, as ~97.7% of the normal tissue expression data should be greater than the lowerBound (assuming the expression data from normal tissue is distributed as a Gaussian).
  • minPropPerGroup - a numeric value between \((0,0.5)\) indicating the minimum proportion of tumor samples desired within each of the two tumor subgroups defined by the lowerBound. Determines the value of meetsMinPropPerGrp (either TRUE or FALSE) in the output.

    • We recommend setting minPropPerGroup=0.20. Values close to 0 may result in the inclusion of genes that subdivide tumors into very unequally-sized subgroups. Values closer to 0.5 will identify genes that subdivide tumors into nearly equal-sized groups and may be unnecessarily restrictive.

<img src=“fig_2_1.png” alt=“fig 2”, width=“60%”>

nSdBelow <- 2
minPropPerGroup <- .2
rl <- receptLoss(exprMatrNml, exprMatrTum, nSdBelow, minPropPerGroup)
head(rl)
#> # A tibble: 6 × 7
#>   geneNm lowerBound propTumLessThBound  muAb  muBl deltaMu meetsMinPropPerGrp
#>   <chr>       <dbl>              <dbl> <dbl> <dbl>   <dbl> <lgl>             
#> 1 f6          0.928               0.58  1.52 0.425   1.10  TRUE              
#> 2 b2          0.538               0.38  1.30 0.258   1.05  TRUE              
#> 3 i9          0.379               0.24  1.10 0.203   0.893 TRUE              
#> 4 g7          0.805               0.57  1.30 0.405   0.891 TRUE              
#> 5 d4          0.359               0.32  1.04 0.174   0.866 TRUE              
#> 6 c3          0.554               0.41  1.12 0.290   0.826 TRUE

The output of receptLoss() is an \(m\times7\) matrix, with \(m\) equaling the number of genes. The 7 columns are as follows:

  • geneNm - the gene name

  • lowerBound (\(B\)) - the value nSdBelow the mean of the normal tissue expression data. Can be expressed as \[B=\mu_N - \sigma_N * n_{sdBelow},\] where \(\mu_N\) is the mean of the normal tissue expression data, \(\sigma_N\) is the standard deviation of the normal tissue expression data, and \(n_{sdBelow}\) is the value nSdBelow set by the user.

  • propTumLessThBound (\(\pi_L\)) - the proportion of tumor samples with expression levels less than lowerBound. Can be expressed as: \[\pi_L =\frac{1}{N_T}\sum_{j=1}^{N_T} \Bigg\{ \begin{array}{ll} 1,~ if~x_{j} < lowerBound \\ 0, ~ otherwise \end{array}, \] where \(x_{j}\) is the \(j^{th}\) tumor sample and \(N_T\) is the total number of tumor samples.

  • muAb (\(\mu_A\)) - “mu above”, the arithmetic mean across expression values from tumors greater than (ie above) the lowerBound.

  • muBl(\(\mu_B\)) - “mu below”, the arithmetic mean across expression values from tumors less than (ie below) the lowerBound.

  • deltaMu (\(\Delta\mu\)) - equal to \(\mu_A - \mu_B\). The rows in the output matrix are sorted in descending order by the deltaMu statistic, which indicates the degree of separation between the two tumor subgroups. Higher deltaMu values indicate tumor subgroups that are more cleanly separated and more likely to constitute a bimodal distribution within the tumor samples.

  • meetsMinPropPerGrp - a logical indicating whether the proportion of samples in each group is greater than that set by minPropPerGroup. If \(min(\pi_L, 1-\pi_L) >\) minPropPerGroup, then meetsMinPropPerGrp is TRUE; otherwise, it is FALSE. Genes for which meetsMinPropPerGrp equals FALSE can be filtered out - they do not have a sufficient proportion of tumors in each group to permit useful tumor subgrouping.

Visualization

Let’s take the top-ranked gene and plot its distribution.

clrs <- c("#E78AC3", "#8DA0CB")

tryCatch({plotReceptLoss(exprMatrNml, exprMatrTum, rl, 
    geneName=as.character(rl[1,1]), clrs=clrs)}, 
    warning=function(cond){
        knitr::include_graphics("rl_fig.png")
    }, error=function(cond){
        knitr::include_graphics("rl_fig.png")
    }
)

Here’s what this graph is showing us:

  • The x-axis represents RNA expression values, with lower values toward the left and larger values (i.e. higher expression) toward the right. The y-axis represents density. The name of the gene (“f6”) is shown in the upper left of the plot.

  • The dotted curve represents a Gaussian distribution fit to the expression data from normal tissue, and the blue histogram represents expression data from tumor tissue.

  • The pink vertical line corresponds to the lowerBound for the expression data from normal tissue.

  • Since most normal tissue expresses the RNA above the lowerBound, any tumors that express the RNA below this value have lost RNA expression relative to normal tissue. Thus, the lowerBound forms a boundary between 2 tumor subgroups that either have or have not lost RNA expression relative to normal tissue.

Nuclear Hormone Receptor (NHR) filtering

The question that inspired this package was whether the loss of expression of any of the ~50 NHRs (beyond the well-known estrogen, progesterone, and androgen NHRs) in uterine tumors was associated with differences in patient survival. NHRs might not only serve as survival biomarkers but also as drug targets, as their activity can be modulated by small molecules that resemble their hormonal ligands.

To facilitate the application of this question to additional cancer types, a list of all NHRs is included in this package as the object nhrs.

This object facilitates filtering of NHRs from a matrix of gene expression data, as it contains several commonly-used gene identifiers (e.g. HGNC symbol, HGNC ID, Entrez ID, and Ensembl ID) for the NHRs that might be found in different RNA expression datasets.

The source code for generating nhrs is available in “data-raw/nhrs.R”.

receptLoss::nhrs
#> # A tibble: 54 × 6
#>    hgnc_symbol hgnc_id hgnc_name                         entre…¹ ensem…² synon…³
#>    <chr>         <dbl> <chr>                               <dbl> <chr>   <chr>  
#>  1 NR0B1          7960 nuclear receptor subfamily 0 gro…     190 ENSG00… AHC|DS…
#>  2 NR0B2          7961 nuclear receptor subfamily 0 gro…    8431 ENSG00… Small …
#>  3 THRA          11796 thyroid hormone receptor alpha       7067 ENSG00… AR7|EA…
#>  4 THRB          11799 thyroid hormone receptor beta        7068 ENSG00… THR1|T…
#>  5 RARA           9864 retinoic acid receptor alpha         5914 ENSG00… RAR al…
#>  6 RARB           9865 retinoic acid receptor beta          5915 ENSG00… HAP|HB…
#>  7 RARG           9866 retinoic acid receptor gamma         5916 ENSG00… RARC|R…
#>  8 PPARA          9232 peroxisome proliferator activate…    5465 ENSG00… NUC1|n…
#>  9 PPARD          9235 peroxisome proliferator activate…    5467 ENSG00… NUCII|…
#> 10 PPARG          9236 peroxisome proliferator activate…    5468 ENSG00… PPARG1…
#> # … with 44 more rows, and abbreviated variable names ¹​entrez_gene_id,
#> #   ²​ensembl_gene_id, ³​synonyms
#> # ℹ Use `print(n = ...)` to see more rows

Conclusions

  • receptLoss identifies genes that subclassify tumors based on their RNA expression levels relative to normal tissue. The genes are ranked by their \(\Delta\mu\) statistic which reflects a measure of the cleaness of separation (ie bimodality) between the two tumor subgroups.

  • receptLoss can be expanded for use with a variety of tumors, genes (e.g. to identify novel candidate tumor suppressors), biological data (e.g. miRNA, protein expression), and even non-biological data types where you have numeric data from two groups (one normal group and one abnormal group) and where subgroup identification is desired within the abnormal group

  • receptLoss is particularly useful when there are a large number of tumor samples (hundreds) relative to normal samples (dozens), as is the case in several cancer databases, including the uterine cancer database from the Cancer Genome Atlas/Genomic Data Commons. By assuming that the normal expression data are distributed as a single Gaussian, receptLoss can subclassify large numbers of tumors even in the presence of small numbers of normal tissue samples.

Please contact me at with any suggestions, questions, or comments. Thank you!

Display this vignette

vignette("receptLoss") 

Session Info

sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.6    dplyr_1.0.9      receptLoss_1.8.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.26.1 tidyselect_1.1.2           
#>  [3] xfun_0.32                   bslib_0.4.0                
#>  [5] purrr_0.3.4                 lattice_0.20-45            
#>  [7] colorspace_2.0-3            vctrs_0.4.1                
#>  [9] generics_0.1.3              htmltools_0.5.3            
#> [11] stats4_4.2.1                yaml_2.3.5                 
#> [13] utf8_1.2.2                  rlang_1.0.4                
#> [15] jquerylib_0.1.4             pillar_1.8.0               
#> [17] withr_2.5.0                 glue_1.6.2                 
#> [19] DBI_1.1.3                   BiocGenerics_0.42.0        
#> [21] matrixStats_0.62.0          GenomeInfoDbData_1.2.8     
#> [23] lifecycle_1.0.1             stringr_1.4.0              
#> [25] MatrixGenerics_1.8.1        zlibbioc_1.42.0            
#> [27] munsell_0.5.0               gtable_0.3.0               
#> [29] evaluate_0.16               labeling_0.4.2             
#> [31] Biobase_2.56.0              knitr_1.39                 
#> [33] IRanges_2.30.0              fastmap_1.1.0              
#> [35] GenomeInfoDb_1.32.3         fansi_1.0.3                
#> [37] highr_0.9                   scales_1.2.0               
#> [39] cachem_1.0.6                DelayedArray_0.22.0        
#> [41] S4Vectors_0.34.0            jsonlite_1.8.0             
#> [43] XVector_0.36.0              png_0.1-7                  
#> [45] digest_0.6.29               stringi_1.7.8              
#> [47] GenomicRanges_1.48.0        grid_4.2.1                 
#> [49] cli_3.3.0                   tools_4.2.1                
#> [51] bitops_1.0-7                magrittr_2.0.3             
#> [53] sass_0.4.2                  RCurl_1.98-1.8             
#> [55] tibble_3.1.8                tidyr_1.2.0                
#> [57] pkgconfig_2.0.3             ellipsis_0.3.2             
#> [59] Matrix_1.4-1                assertthat_0.2.1           
#> [61] rmarkdown_2.14              R6_2.5.1                   
#> [63] compiler_4.2.1