The scDblFinder package gathers various methods for the detection and handling of doublets/multiplets in single-cell sequencing data (i.e. multiple cells captured within the same droplet or reaction volume). This vignette provides a brief overview of the different approaches (which are each covered in their own vignettes) for single-cell RNA sequencing. For doublet detection in genomic data, see the scATACseq vignette. For a more general introduction to the topic of doublets, refer to the OSCA book.

All methods require as an input either a matrix of counts or a SingleCellExperiment containing count data. With the exception of findDoubletClusters, which operates at the level of clusters (and consequently requires clustering information), all methods try to assign each cell a score indicating its likelihood (broadly understood) of being a doublet.
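As a minimal sketch of the two accepted input types, the following builds a small simulated count matrix and wraps it in a SingleCellExperiment. The dimensions, gene and cell names, and the Poisson simulation are illustrative assumptions, not part of the package.

```r
library(SingleCellExperiment)

set.seed(1)
# Hypothetical toy data: 200 genes x 50 cells of Poisson-distributed counts
count_mat <- matrix(
  rpois(200 * 50, lambda = 2), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50))
)

# The same counts stored in a SingleCellExperiment, in an assay named "counts"
sce <- SingleCellExperiment(assays = list(counts = count_mat))
dim(counts(sce))
```

Either `count_mat` or `sce` could then be passed to the cell-level methods described below.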

The approaches described here are complementary to the identification of doublets via cell hashes or SNPs in multiplexed samples. Hashing and genotypes can identify doublets formed by cells of the same type (homotypic doublets) from two different samples, which are often nearly indistinguishable from real cells transcriptionally (and hence generally unidentifiable through the present package). However, they cannot identify doublets formed by cells of the same sample, even when these are heterotypic (formed by different cell types). The methods presented here, in contrast, are primarily geared towards the identification of heterotypic doublets, which for most purposes are also the most critical ones.

0.1 computeDoubletDensity

The computeDoubletDensity method (formerly scran::doubletCells) generates random artificial doublets from the real cells, and tries to identify cells whose neighborhood has a high local density of artificial doublets. See computeDoubletDensity for more information.
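A possible way to try this on toy data is sketched below. The two simulated populations, the simulated doublets (drawn from the sum of both expression profiles), and all variable names are assumptions made for illustration; default parameters are used throughout.

```r
library(scDblFinder)

set.seed(100)
ngenes <- 1000
# Hypothetical mean expression profiles for two cell populations
mu1 <- 2^rnorm(ngenes)
mu2 <- 2^rnorm(ngenes)

# 100 simulated singlets per population, plus 20 simulated doublets
# whose expected counts are the sum of the two profiles
counts.1 <- matrix(rpois(ngenes * 100, mu1), nrow = ngenes)
counts.2 <- matrix(rpois(ngenes * 100, mu2), nrow = ngenes)
counts.m <- matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)
count_mat <- cbind(counts.1, counts.2, counts.m)
origin <- rep(c("pop1", "pop2", "doublet"), c(100, 100, 20))

# One score per cell; higher values indicate a denser neighborhood
# of artificial doublets
scores <- computeDoubletDensity(count_mat)

# Median score per simulated group
tapply(scores, origin, median)
```

On real data, one would typically restrict the computation to informative genes and pick a score threshold by inspecting the distribution, as covered in the dedicated vignette.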

0.2 recoverDoublets

The recoverDoublets method is meant to be used when some doublets are already known, for instance through genotype-based calls or cell hashing in multiplexed experiments. The function then tries to identify intra-sample doublets that are neighbors to the known inter-sample doublets. See recoverDoublets for more information.
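The idea can be sketched on simulated data as follows. Here half of the simulated doublets are pretended to be known, and the experiment is pretended to consist of two samples of equal size. The simulation, the choice of `k = 10`, and the variable names are illustrative assumptions.

```r
library(scDblFinder)

set.seed(100)
ngenes <- 1000
# Hypothetical mean expression profiles for two cell populations
mu1 <- 2^rnorm(ngenes, sd = 2)
mu2 <- 2^rnorm(ngenes, sd = 2)

counts.1 <- matrix(rpois(ngenes * 100, mu1), nrow = ngenes)      # singlets
counts.2 <- matrix(rpois(ngenes * 100, mu2), nrow = ngenes)      # singlets
counts.m <- matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes) # doublets
all.counts <- cbind(counts.1, counts.2, counts.m)

# Neighbors are searched in log-normalized expression space
lcounts <- scuttle::normalizeCounts(all.counts)

# Pretend that 10 of the 20 simulated doublets (columns 201 to 210)
# were already called, e.g. through hashing
known <- 200 + seq_len(10)
out <- recoverDoublets(lcounts, doublets = known, k = 10, samples = c(1, 1))

# Cross-tabulate known doublets against newly predicted ones
table(known = out$known, predicted = out$predicted)
```

The `samples` argument gives the relative sizes of the multiplexed samples, which the function uses to estimate how many intra-sample doublets to expect.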

0.3 scDblFinder

The scDblFinder method combines both known doublets (if available) and cluster-based artificial doublets to identify doublets. The approach builds and improves on a variety of earlier efforts, and is at present the most accurate approach included in this package. See scDblFinder for more information.
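A minimal run on a dummy dataset might look like the sketch below. It uses the package's `mockDoubletSCE` helper to simulate data; the doublet rate of 0.1 is an assumption chosen because the dataset is small, and would normally be left to the default on real data.

```r
library(SingleCellExperiment)
library(scDblFinder)

set.seed(123)
# Small simulated dataset with an unusually high doublet rate
sce <- mockDoubletSCE(dbl.rate = 0.1, ngenes = 300)

# Run the method, supplying the (known, simulated) doublet rate
sce <- scDblFinder(sce, dbr = 0.1)

# Scores and calls are added to the colData of the object
head(sce$scDblFinder.score)
table(truth = sce$type, call = sce$scDblFinder.class)
```

The returned object is the input SingleCellExperiment with additional `scDblFinder.*` columns, so it can be filtered directly on `scDblFinder.class`.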

0.4 directDblClassification

The directDblClassification method identifies doublets by training a classifier directly on gene expression. This follows the same procedure as scDblFinder for doublet generation and iterative training, but skips the k-nearest neighbor step and directly uses the matrix of real cells and artificial doublets. This is computationally more intensive and generally leads to worse predictions than scDblFinder, and it is included chiefly for comparative purposes. See ?directDblClassification for more information.

0.5 findDoubletClusters

The findDoubletClusters method identifies clusters that are likely to be composed of doublets by estimating whether their expression profile lies between two other clusters. See findDoubletClusters for more information.
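As a rough illustration, the sketch below simulates two populations plus a third cluster made of their mixtures, and passes the counts together with the cluster labels. The simulation and all variable names are assumptions for illustration only.

```r
library(scDblFinder)

set.seed(100)
ngenes <- 1000
# Hypothetical mean expression profiles for two cell populations
mu1 <- 2^rnorm(ngenes, sd = 2)
mu2 <- 2^rnorm(ngenes, sd = 2)

counts.1 <- matrix(rpois(ngenes * 100, mu1), nrow = ngenes)
counts.2 <- matrix(rpois(ngenes * 100, mu2), nrow = ngenes)
counts.m <- matrix(rpois(ngenes * 20, mu1 + mu2), nrow = ngenes)
count_mat <- cbind(counts.1, counts.2, counts.m)
rownames(count_mat) <- paste0("gene", seq_len(ngenes))

# Cluster labels: 1 and 2 are the pure populations, 3 is the mixture
clusters <- rep(1:3, c(ncol(counts.1), ncol(counts.2), ncol(counts.m)))

# One row per cluster; a low number of unique DE genes (num.de) relative
# to the two putative source clusters suggests a doublet cluster
dbl <- findDoubletClusters(count_mat, clusters)
dbl[, c("source1", "source2", "num.de")]
```

Because the method works on clusters rather than cells, its output depends on the clustering supplied, and at least three clusters are needed for the comparison to be defined.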

1 Installation

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("scDblFinder")

# or, to get the latest developments:
BiocManager::install("plger/scDblFinder")

Session information

sessionInfo()

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bluster_1.5.1               scDblFinder_1.9.12         
##  [3] scater_1.23.6               ggplot2_3.3.5              
##  [5] scran_1.23.1                scuttle_1.5.2              
##  [7] ensembldb_2.19.10           AnnotationFilter_1.19.0    
##  [9] GenomicFeatures_1.47.14     AnnotationDbi_1.57.1       
## [11] scRNAseq_2.9.2              SingleCellExperiment_1.17.2
## [13] SummarizedExperiment_1.25.3 Biobase_2.55.2             
## [15] GenomicRanges_1.47.6        GenomeInfoDb_1.31.10       
## [17] IRanges_2.29.1              S4Vectors_0.33.17          
## [19] BiocGenerics_0.41.2         MatrixGenerics_1.7.0       
## [21] matrixStats_0.62.0          BiocStyle_2.23.1           
## 
## loaded via a namespace (and not attached):
##   [1] AnnotationHub_3.3.12          BiocFileCache_2.3.5          
##   [3] igraph_1.3.1                  lazyeval_0.2.2               
##   [5] BiocParallel_1.29.21          digest_0.6.29                
##   [7] htmltools_0.5.2               magick_2.7.3                 
##   [9] viridis_0.6.2                 fansi_1.0.3                  
##  [11] magrittr_2.0.3                memoise_2.0.1                
##  [13] ScaledMatrix_1.3.0            cluster_2.1.3                
##  [15] limma_3.51.8                  Biostrings_2.63.3            
##  [17] prettyunits_1.1.1             colorspace_2.0-3             
##  [19] ggrepel_0.9.1                 blob_1.2.3                   
##  [21] rappdirs_0.3.3                xfun_0.30                    
##  [23] dplyr_1.0.8                   crayon_1.5.1                 
##  [25] RCurl_1.98-1.6                jsonlite_1.8.0               
##  [27] glue_1.6.2                    gtable_0.3.0                 
##  [29] zlibbioc_1.41.0               XVector_0.35.0               
##  [31] DelayedArray_0.21.2           BiocSingular_1.11.0          
##  [33] scales_1.2.0                  DBI_1.1.2                    
##  [35] edgeR_3.37.3                  Rcpp_1.0.8.3                 
##  [37] viridisLite_0.4.0             xtable_1.8-4                 
##  [39] progress_1.2.2                dqrng_0.3.0                  
##  [41] bit_4.0.4                     rsvd_1.0.5                   
##  [43] metapod_1.3.0                 httr_1.4.2                   
##  [45] ellipsis_0.3.2                farver_2.1.0                 
##  [47] pkgconfig_2.0.3               XML_3.99-0.9                 
##  [49] sass_0.4.1                    dbplyr_2.1.1                 
##  [51] locfit_1.5-9.5                utf8_1.2.2                   
##  [53] labeling_0.4.2                tidyselect_1.1.2             
##  [55] rlang_1.0.2                   later_1.3.0                  
##  [57] munsell_0.5.0                 BiocVersion_3.15.2           
##  [59] tools_4.2.0                   cachem_1.0.6                 
##  [61] xgboost_1.6.0.1               cli_3.2.0                    
##  [63] generics_0.1.2                RSQLite_2.2.12               
##  [65] ExperimentHub_2.3.7           evaluate_0.15                
##  [67] stringr_1.4.0                 fastmap_1.1.0                
##  [69] yaml_2.3.5                    knitr_1.38                   
##  [71] bit64_4.0.5                   purrr_0.3.4                  
##  [73] KEGGREST_1.35.0               sparseMatrixStats_1.7.0      
##  [75] mime_0.12                     xml2_1.3.3                   
##  [77] biomaRt_2.51.4                compiler_4.2.0               
##  [79] beeswarm_0.4.0                filelock_1.0.2               
##  [81] curl_4.3.2                    png_0.1-7                    
##  [83] interactiveDisplayBase_1.33.0 tibble_3.1.6                 
##  [85] statmod_1.4.36                bslib_0.3.1                  
##  [87] stringi_1.7.6                 highr_0.9                    
##  [89] lattice_0.20-45               ProtGenerics_1.27.2          
##  [91] Matrix_1.4-1                  vctrs_0.4.1                  
##  [93] pillar_1.7.0                  lifecycle_1.0.1              
##  [95] BiocManager_1.30.17           jquerylib_0.1.4              
##  [97] BiocNeighbors_1.13.0          cowplot_1.1.1                
##  [99] data.table_1.14.2             bitops_1.0-7                 
## [101] irlba_2.3.5                   httpuv_1.6.5                 
## [103] rtracklayer_1.55.4            R6_2.5.1                     
## [105] BiocIO_1.5.0                  bookdown_0.26                
## [107] promises_1.2.0.1              gridExtra_2.3                
## [109] vipor_0.4.5                   MASS_7.3-57                  
## [111] assertthat_0.2.1              rjson_0.2.21                 
## [113] withr_2.5.0                   GenomicAlignments_1.31.2     
## [115] Rsamtools_2.11.0              GenomeInfoDbData_1.2.8       
## [117] parallel_4.2.0                hms_1.1.1                    
## [119] grid_4.2.0                    beachmat_2.11.0              
## [121] rmarkdown_2.13                DelayedMatrixStats_1.17.0    
## [123] Rtsne_0.16                    shiny_1.7.1                  
## [125] ggbeeswarm_0.6.0              restfulr_0.0.13

Introduction to the scDblFinder package

22 April 2022
