(0) Installation and Quick-Start

Installation

To install ompBAM, start R (version “4.2”) and enter:

To create new packages using ompBAM, install devtools and usethis packages from CRAN:

For MacOS users, make sure OpenMP libraries are installed correctly. We recommend users follow this guide, but the quickest way to get started is to install libomp via brew:

Creating and testing an example ompBAM-based package

To create a new package (in temporary directory):

Before this package is functional, it needs to be roxygenised. This will automate the export of the example function idxstats(). Run the following:

devtools::document(pkg_path)
#> ℹ Updating MyPkg documentation
#> ℹ Loading MyPkg
#> Exports from /tmp/RtmpTNL7Sp/MyPkg/src/ompBAM_example.cpp:
#>    int idxstats_pbam(std::string bam_file, int n_threads_to_use = 1)
#> 
#> /tmp/RtmpTNL7Sp/MyPkg/src/RcppExports.cpp updated.
#> /tmp/RtmpTNL7Sp/MyPkg/R/RcppExports.R updated.
#> Re-compiling MyPkg
#> * installing *source* package ‘MyPkg’ ...
#> ** using staged installation
#> ** libs
#> g++ -std=gnu++14 -I"/home/biocbuild/bbs-3.15-bioc/R/include" -DNDEBUG  -I'/tmp/RtmpqQWUN6/Rinst2faa5525427024/ompBAM/include' -I'/home/biocbuild/bbs-3.15-bioc/R/library/Rcpp/include' -I'/home/biocbuild/bbs-3.15-bioc/R/library/zlibbioc/include' -I/usr/local/include  -fopenmp  -fpic  -g -O2  -Wall -UNDEBUG -Wall -pedantic -g -O0 -c RcppExports.cpp -o RcppExports.o
#> g++ -std=gnu++14 -I"/home/biocbuild/bbs-3.15-bioc/R/include" -DNDEBUG  -I'/tmp/RtmpqQWUN6/Rinst2faa5525427024/ompBAM/include' -I'/home/biocbuild/bbs-3.15-bioc/R/library/Rcpp/include' -I'/home/biocbuild/bbs-3.15-bioc/R/library/zlibbioc/include' -I/usr/local/include  -fopenmp  -fpic  -g -O2  -Wall -UNDEBUG -Wall -pedantic -g -O0 -c ompBAM_example.cpp -o ompBAM_example.o
#> g++ -std=gnu++14 -shared -L/home/biocbuild/bbs-3.15-bioc/R/lib -L/usr/local/lib -o MyPkg.so RcppExports.o ompBAM_example.o -fopenmp -L/home/biocbuild/bbs-3.15-bioc/R/lib -lR
#> installing to /tmp/RtmpTNL7Sp/devtools_install_2fac361d3524df/00LOCK-MyPkg/00new/MyPkg/libs
#> ** checking absolute paths in shared objects and dynamic libraries
#> * DONE (MyPkg)
#> Writing NAMESPACE
#> Writing NAMESPACE

To simulate the compilation and loading of this package (this is almost equivalent to running R CMD BUILD then loading the MyPkg package:

To test the new package works as expected, run the following commands:

idxstats(ompBAM::example_BAM("scRNAseq"), 2)
#> /tmp/RtmpqQWUN6/Rinst2faa5525427024/ompBAM/extdata/MGE_E35_1.bam summary:
#> Name Length  Number of reads
#> 1    195471971   50000
#> 10   130694993   0
#> 11   122082543   0
#> 12   120129022   0
#> 13   120421639   0
#> 14   124902244   0
#> 15   104043685   0
#> 16   98207768    0
#> 17   94987271    0
#> 18   90702639    0
#> 19   61431566    0
#> 2    182113224   0
#> 3    160039680   0
#> 4    156508116   0
#> 5    151834684   0
#> 6    149736546   0
#> 7    145441459   0
#> 8    129401213   0
#> 9    124595110   0
#> MT   16299   0
#> X    171031299   0
#> Y    91744698    0
#> JH584299.1   953012  0
#> GL456233.1   336933  0
#> JH584301.1   259875  0
#> GL456211.1   241735  0
#> GL456350.1   227966  0
#> JH584293.1   207968  0
#> GL456221.1   206961  0
#> JH584297.1   205776  0
#> JH584296.1   199368  0
#> GL456354.1   195993  0
#> JH584294.1   191905  0
#> JH584298.1   184189  0
#> JH584300.1   182347  0
#> GL456219.1   175968  0
#> GL456210.1   169725  0
#> JH584303.1   158099  0
#> JH584302.1   155838  0
#> GL456212.1   153618  0
#> JH584304.1   114452  0
#> GL456379.1   72385   0
#> GL456216.1   66673   0
#> GL456393.1   55711   0
#> GL456366.1   47073   0
#> GL456367.1   42057   0
#> GL456239.1   40056   0
#> GL456213.1   39340   0
#> GL456383.1   38659   0
#> GL456385.1   35240   0
#> GL456360.1   31704   0
#> GL456378.1   31602   0
#> GL456389.1   28772   0
#> GL456372.1   28664   0
#> GL456370.1   26764   0
#> GL456381.1   25871   0
#> GL456387.1   24685   0
#> GL456390.1   24668   0
#> GL456394.1   24323   0
#> GL456392.1   23629   0
#> GL456382.1   23158   0
#> GL456359.1   22974   0
#> GL456396.1   21240   0
#> GL456368.1   20208   0
#> JH584292.1   14945   0
#> JH584295.1   1976    0
#> [1] 0

To create an ompBAM-based package from within RStudio:

To create an example package, create a new R project using ompBAM by running the following in the R console:

NB: be sure to replace project_path to the actual directory path in which you wish to store your project.

After running the example code above, use RStudio to open the newly created MyPkg.Rproj file located at the given path. Then, run the following:

This will process the roxygen2 flags and write to the NAMESPACE file.

After this, the new package can be built by running Install and Restart from the Build tab.

(1) Requirements to setting up a new R package to include ompBAM

ompBAM provides a one-step function called use_ompBAM() that creates a new package R project in a new directory (or adds requisite files to an existing project). The function is similar to those in the usethis package. The user supplies a directory path and the directory containing the package must also be the name of the new package.

(1a) Making a new package that compiles with ompBAM

Using the R function use_ompBAM(), we can create a new package inside the R-generated temporary directory as an example:

library(ompBAM)
pkg_path = file.path(tempdir(), "myPkgName")
use_ompBAM(pkg_path)
#> ✔ Creating '/tmp/RtmpTNL7Sp/myPkgName/'
#> ✔ Setting active project to '/tmp/RtmpTNL7Sp/myPkgName'
#> ✔ Creating 'R/'
#> ✔ Writing 'DESCRIPTION'
#> Package: myPkgName
#> Title: What the Package Does (One Line, Title Case)
#> Version: 0.0.0.9000
#> Authors@R (parsed):
#>     * First Last <first.last@example.com> [aut, cre] (YOUR-ORCID-ID)
#> Description: What the package does (one paragraph).
#> License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
#>     license
#> Encoding: UTF-8
#> Roxygen: list(markdown = TRUE)
#> RoxygenNote: 7.1.2
#> ✔ Writing 'NAMESPACE'
#> ✔ Setting active project to '/tmp/RtmpTNL7Sp/MyPkg'
#> ✔ Setting active project to '/tmp/RtmpTNL7Sp/myPkgName'
#> ✔ Creating 'src/'
#> ✔ Adding '*.o', '*.so', '*.dll' to 'src/.gitignore'
#> ✔ Created /tmp/RtmpTNL7Sp/myPkgName/R/ompBAM_imports.R with roxygen tags
#> ✔ Created src/Makevars.in and src/Makevars.win
#> ✔ Created configure scripts
#> ✔ Created src/ompBAM_example.cpp with idxstats_pbam() function
#> ✔ Adding Rcpp to Imports field in DESCRIPTION
#> ✔ Adding zlibbioc to Imports field in DESCRIPTION
#> ✔ Adding ompBAM to LinkingTo field in DESCRIPTION
#> ✔ Adding Rcpp to LinkingTo field in DESCRIPTION
#> ✔ Adding zlibbioc to LinkingTo field in DESCRIPTION
#> ✔ myPkgName successfully created in /tmp/RtmpTNL7Sp/myPkgName
#> Please run roxygen2 using devtools::document() before building the package.

We recommend the user run this function in a project directory of their choice and NOT use tempdir(), as the temp directory is destroyed on each R session restart! We only demonstrate using tempdir() here to demonstrate the typical output of the use_ompBAM() function.

Users following this vignette should instead use something like:

In the remainder of this section we will examine this newly-created project and explain the importance of the configurations made.

After running ompBAM(), please open the newly-generated myPkgName.Rproj from within RStudio.

Note that the newly-created package is designed to run with roxygen2. We recommend users build their package documentation and NAMESPACE using roxygen2. To do this, simply run devtools::document() to process roxygen2 tags.

(1b) Dependencies required to compile with ompBAM

To compile with ompBAM capabilities, the package must import both Rcpp and zlibbioc. Check that the following has been added to the DESCRIPTION file:

Imports: Rcpp, zlibbioc
LinkingTo: Rcpp, zlibbioc, ompBAM

Also, use_ompBAM() creates a “wrapper” function for the C++ function idxstats_pbam() function. This is a simple R function idxstats() which in turn calls the C++ function. We export this function with a roxygen2 tag so that idxstats() can be called once the MyPkgName package is loaded.

Open this file to verify that the following has been added:

To make sure your roxygen2 setup works, make sure the new package is your active project by opening the project within RStudio. Then run devtools::document() to allow roxygen2 to do its magic. When it is done, the NAMESPACE file should contain the following:

# Generated by roxygen2: do not edit by hand

export(idxstats)
import(Rcpp)
import(zlibbioc)
useDynLib(myPkgName, .registration = TRUE)

If for whatever reason roxygen2 says it did not edit the NAMESPACE file because it was not created by roxygen2, we suggest deleting the NAMESPACE file, then run devtools::document() again.

Also, you may have noticed, the included source code may have been prompted to compile. Don’t worry about this, we will explain it in detail in the next section.

(1c) Make Files

use_ompBAM() has created two make files that instructs the compiler to link with OpenMP and the zlib library. You can view these files from within the src/ directory.

Make files instruct the C++ compiler to link to the appropriate libraries at compile-time. For R packages, these are called Makevars and are streamlined make files. For more information, refer to (Writing R Extensions - Using MakeVars) [https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Using-Makevars]

Makevars.in is a template make file used by Linux and MacOS, and should contain the following:

PKG_CXXFLAGS = $(OMPBAM_PKG_CXXFLAGS)
PKG_LIBS = $(OMPBAM_PKG_LIBS)

Additionally, a configure script is added to the root project directory. This is a shell script run when the package is built from source. It detects whether the operating system is Linux or MacOS, and whether OpenMP libraries are available for MacOS. It assigns the correct compile and linker flags to the OMPBAM_PKG_CXXFLAGS and OMPBAM_PKG_LIBS variables in the template make file. The contents of the configure script is beyond the scope of this documentation, but feel free to look at it. It is based on the configure script for the data.table R package on CRAN.

For Windows installations, a second make file called “Makevars.win” is required. This is because zlib libraries are linked dynamically (see the zlibbioc vignette for more information). In windows systems, “Makevars.win” is used to build your package, and contains the following:

PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS) 
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)

ZLIB_CFLAGS+=$(shell echo 'zlibbioc::pkgconfig("PKG_CFLAGS")'|\
    "${R_HOME}/bin/R" --vanilla --slave)
PKG_LIBS+=$(shell echo 'zlibbioc::pkgconfig("PKG_LIBS_shared")' |\
    "${R_HOME}/bin/R" --vanilla --slave)

This will ensure that the zlib libraries are linked correctly to Windows systems.

(1d) OpenMP compatibility

Windows and Linux systems should support OpenMP natively. MacOS, however, does not. In order to set up OpenMP for MacOS, we recommend following this guide. In short, to utilise OpenMP on MacOS users must first install the correct OpenMP libraries. Secondly, specific compiler and linker flags must be set. This has already been done using the configure script and template Makevars.in file as described in section (1c).

To simplify the installation of OpenMP libraries on MacOS, instruct package users to run the following:

As can be seen, OpenMP for MacOS will continue to be a difficult issue. We recommend writing C++ code that is compatible for both OpenMP and non-OpenMP systems. For non-OpenMP systems, multi-threading can still be implemented via BiocParallel::MulticoreParam() but the session memory will be multiplied over the number of cores used.

In C++ code, to check whether OpenMP is installed in a system, use the #ifdef and #ifndef directives. An example below:

(2) Step-by-step guide to creating your first ompBAM-powered package

use_ompBAM() creates source code contained within src/ompBAM_example.cpp which contains a “Hello world!” equivalent function that is ready to be compiled. This function, called idxstats_pbam(), replicates the idxstats function of Samtools. It reports the chromosome names and lengths of the genome to which the sequences in the BAM file is aligned, as well as the number of reads aligned to each chromosome. See the idxstats documentation for more details

In this section, we will examine the src/ompBAM_example.cpp source code in detail and explain how we used ompBAM to re-create the idxstats function.

First, lets see how this function works. Make sure you have created the example project using the steps as described in Section (0) - Installation and Quick-Start. Make sure you can reproduce all the example output in that section.

Lets look at the line used to run the ompBAM-based idxstats function:

Like the Samtools idxstats, this function counts the number of reads aligned to each chromosome in the genome. However, unlike idxstats, this function can be run on BAM files that are not sorted by chromosome and not indexed.

In the following guide, we will go through the src/ompBAM_example.cpp in the example package, step by step, explaining the code behind this simple function.

(2a) Headers and Includes

To add the ompBAM header files into your source code, src/ompBAM_example.cpp contains the following:

ompBAM headers must be added AFTER Rcpp. This is because ompBAM uses cout function for console output. The cout C++ function is not supported in R, and must replaced by Rcout in Rcpp.

The line #define cout Rcpp::Rcout ensures that cout in all ompBAM headers are interpreted by the compiler as Rcpp::Rcout.

(2b) Sanity check for number of threads

src/ompBAM_example.cpp contains a simple internal function use_threads() which takes an int parameter and returns an unsigned int parameter:

This function takes a user request for the number of threads (CPU workers) to be used. Sometimes these requests may not be appropriate (e.g. requesting more threads than available in the system). First, notice that we use the #ifdef directives to tell the compiler which block of code to compile, based on whether OpenMP is available on the system. If not compiled with OpenMP , this function returns 1 (single core). If compiled with OpenMP, the function checks the maximum available threads and returns this number if the user requests more threads than is available.

We recommend users include a similar function to make sure their program knows how many threads to run, as they will be running an OpenMP-based loop in their functions.

(2c) Opening BAM files with ompBAM

Now, lets return to the idxstats_pbam() function. We will show a subset of this function below:

Note 3 things:

  1. The line // [[Rcpp::export]] tells Rcpp to export this function so that R can call these functions directly. They will not be exported functions so you will still need to create wrapper functions from within R, and export them, for example:
  1. idxstats_pbam() takes two inputs. The first, bam_file, takes a string input which is the file name of the BAM file. The second, n_threads_to_use will take a user input to request the number of threads. We sanitise this number as discussed in the previous section, to make sure the number of threads in this system is not above that available to the system.

  2. We create a pbam_in object called inbam. The pbam_in class is described in the pbam_in.hpp file which is included in ompBAM.hpp. Users can find the header files in the include/ directory within the installation path of ompBAM, but we endeavour to explain as much as you need to know in this document. For now, we call pbam_in::openFile() which directs the pbam_in object inbam to open and examine the given BAM file. openFile() takes a std::string for the path to the BAM file, and an unsigned int to specify the number of threads to initialize pbam_in for parallel file reading and decompression.

(2d) Obtaining chromosome names and lengths

Once pbam_in opens a BAM file, it will automatically check the BAM file for errors. After noting the file size, it will check whether the BAM file is truncated (by verifying whether a BGZF EOF end of file marker is set at the end of the file). It will then read the BAM header and extract the chromosome names and their genome lengths.

To obtain the chromosome names and lengths, and to check that the BAM file is indeed valid:

Here, we create a string vector to store the chromosome names, and an unsigned 32-bit integer vector for chromosome lengths. These vectors are passed by reference to the pbam_in::obtainChrs() function which will fill these vectors with the chromosome names and lengths. obtainChrs() will return a negative number if the BAM header read fails, and will return the number of chromosomes if the function is run successfully. We can check for this with the following:

(2e) Constructing a loop to read the BAM file using ompBAM

ompBAM is designed to read an entire BAM file using multiple threads. This functionality is ideal for programs that perform whole-transcriptome calculations. Programs that interrogate small genomic regions are better off using htslib and interrogating sorted BAM files.

To read an entire BAM file, we need to construct code contained within a loop. This is because we can conserve RAM by not committing the entire decompressed BAM to memory. Instead, we read a small portion of the BAM file sequentially, process all the aligned reads therein, before reading / decompressing more of the BAM file again.

First, lets declare a vector to store the per-chromosome reads so we can count these reads:

To sequentially read a “small” portion of the BAM file, we call pbam_in::fillReads() and contain this within a while loop:

pbam_in::fillReads() returns 0 if the BAM file successfully extracts some reads and we are not at the end of file. If all the reads have already been extracted from the BAM file and no data was decompressed, fillReads() returns 1, and if the BAM file is corrupt or the decompression returns an error, fillReads() returns -1.

Note that fillReads() will also return an -1 error if the program has not processed all the reads decompressed by the previous call to fillReads(). We will explain this further in the next section

(2f) Processing BAM reads using an OpenMP parallel FOR loop

OpenMP provides an easy-to-use parallelism which we can use to run each iteration of a FOR loop in parallel. That is, each iteration of the FOR loop is run simultaneously, each iteration being processed by a separate thread. We can set up this parallel FOR loop in the following code:

Note the line beginning with #pragma omp parallel for. This line simply specifies that the FOR loop declared after this line should be run in parallel. num_threads(n_threads_to_really_use) declares that we want to use the specified number of threads, as stored in the n_threads_to_really_use variable. schedule(static,1) declares that one thread should be used to run each FOR iteration.

Also note that this line is enclosed within an #ifdef _OPENMP. If OpenMP is not available at compile, the compiler simply ignores the #pragma and will run the FOR loop sequentially.

Within this loop, we are free to declare any thread-specific variables. Variables declared within the parallel FOR loop are thread-specific, meaning their contents cannot be accessed by other threads processing other iterations. So within the FOR loop, we declare a read_counter vector which will count reads processed by each thread.

(2g) Getting thread-specific reads from ompBAM

To obtain a single aligned read from a thread-specific buffer of reads processed by ompBAM, we declare a pbam1_t object and use pbam_in::supplyRead() to get a single read from the thread-specific buffer:

As the above code is contained within the parallel FOR loop, i here refers to the thread ID. For n threads, i will take any integer between 0 and n-1. supplyRead() will return the next read in the thread-specific buffer which we receive by calling it within the pbam1_t constructor at declaration.

Note that the above code is equivalent to:

(2g:i) Checking “validity” of reads and “realizing” reads

pbam1_t is an ompBAM object that is designed to store and get values contained in a single aligned BAM read. We will discuss its full functionality in section (4). For now, the most important function is pbam1_t::validate(), which will check whether the obtained read actually contains any data.

A pbam1_t can be “invalid” for several reasons. The most important reason is when supplyRead() has reached the end of its buffer. When this occurs, supplyRead() will simply return an empty read, which will not validate.

So, to profile all the reads in a single thread, we simply call supplyRead() for thread i until it returns an invalid read, using the following:

pbam1_t::validate() will return false for two other reasons.

  1. If the BAM contains corrupt data, validate() will return false. validate() checks the first 4 bytes of the read which returns the size of the read (in bytes). It then checks whether this is larger than the size of the individual components, including read name, cigar, sequence and quality scores. If there is corrupt data, we might expect that this will trigger a failure and safely abort the program, as the next call to fillReads() will trigger an error notifying that not all reads were processed.

  2. pbam1_t optimises performance by storing pointers to the memory that contains decompressed BAM data, rather than creating separate memory containers. Therefore, copying a pbam1_t read will only copy its memory address, not the data contained in the read. We refer to these as “virtual” pbam1_t reads. The consequence is, the next time fillReads() is called, these reads will point to an invalid memory address and will become invalid.

In some situations (e.g. for paired RNA-seq reads), it is desirable to keep some reads around in memory. To do this we can do something like the following:

In the above code, we call pbam1_t::realize(). This function creates a read- specific data buffer and copies the data from the thread-specific buffer. A “real” pbam1_t read simply means it has its own data storage and is thus “persistent”, i.e. it will still store the read after the next call to fillReads(). In other words:

NB: the pbam1_t::realize() and pbam1_t::isReal() functions are not demonstrated in the included example code created by use_ompBAM(). For more information about these functions, refer to Section (4c)

(2h) Counting chromosome-specific reads

Now that we have discussed the most important aspects of pbam1_t as well as the fillReads() and supplyRead() functions of pbam_in, we will now count the reads within the OpenMP parallel FOR loop. The entire processing loop is included below and we will discuss the remaining lines in detail:

If you have read all the sections until now, you will understand most of the code in this loop. We will discuss the things we have not already addressed.

(2h:i) Obtaining the chromosome ID of each aligned read

The pbam1_t::refID() returns the refID of the aligned read. Rather than the name of the chromosome, it returns the ID as a number from 0 to k-1 for an k-chromosome genome. These are in the same order as described in the BAM header. So here, we simply check whether the read is mapped to a valid chromosome refID, and add it to the thread-specific read container vector:

For other “getter” functions of pbam1_t, refer to Section (4e-h).

(2h:ii) Tallying reads to a variable external to the parallel FOR loop

Any variables declared outside the OpenMP FOR loop is accessible to code within the parallel loop. This can be both a blessing and a curse. It can cause problems when multiple threads write to the variable at the same time (known as a “race condition”). To prevent this, we declare that only 1 thread can run the code that is responsible for accessing the external variable total_reads, using the #pragma omp critical directive. This directive instructs the following block of code can only be run by a single thread at any one time. Of course, we declare this within an #ifdef _OPENMP directive so that compilers without OpenMP will ignore it.

(2i) Closing the BAM file

After we have read the BAM file, we will close the file.

This will instruct pbam_in to close the file that it opened internally using the ifstream object contained within. It will also destroy all memory associated with its file and decompressed data buffers, thereby safely freeing up memory. This is also called on destruction of the pbam_in object, so files are safely closed when the pbam_in object goes out of scope. Of course, pbam1_t reads will invalidate once these buffers are cleared, except “real” pbam1_t reads created via pbam1_t::realize().

(2j) Summarizing the tallied reads

For the output of this function, we use the function Rcout (from Rcpp) to print out the chromosome names, lengths and read counts, just like the samtools idxstats function:

(2k) Adding a progress bar

BAM files are very large files (or we wouldn’t use multi-threading to read them). Sometimes a progress bar is very useful in this regard.

pbam_in has two functions that provide useful data that can be parsed to the RcppProgress progress bar.

  • pbam_in::GetFileSize() will return the file size (in bytes) of the opened BAM file.
  • pbam_in::IncProgress() will return the number of bytes processed (read and decompressed) since the last call to IncProgress() was made

These functions are demonstrated in the example code contained within the ompBAMExample package contained within the examples/ folder of the ompBAM installation path. Readers are welcome to peruse this example package which contains the idxstats_pbam function with the addition of a RcppProgress bar.

The path to the source code can be returned using the following in R:

Alternatively, refer to Section (3i) for an example RcppProgress bar

(3) pbam_in function documentation

The pbam_in object is responsible for opening BAM files for sequential multi-threaded BAM processing.

Internally, pbam_in checks BAM files for truncation, data corruption and other issues. Then it reads and decompresses BAM files using multiple threads. Applications using ompBAM simply have to retrieve reads from the pbam_in object using the supplyRead() function, after “priming the pump” to fill the reads buffer using the fillReads() function. supplyRead() is considered “thread-safe”, meaning that applications using ompBAM can run supplyRead() within multi-threaded blocks of code. This allows applications using ompBAM to process reads using multiple threads.

It is highly recommend to read the previous section (2) - “Step-by-step guide to creating your first ompBAM-powered package” before scrutinising this section. The “Step-by-step guide” provides the context behind all the functions mentioned in this section.

Please note that currently, ompBAM only supports BAM file reading. BAM writing may be supported in later versions of ompBAM.

(3a) Constructor

Creates a pbam_in object.

Parameters

  • const size_t file_buffer_cap: (default 500 Mb) The size (in bytes) of each of the two file buffers
  • const size_t data_buffer_cap: (default 1 Gb) The size (in bytes) of the data buffer containing uncompressed data
  • const unsigned int chunks_per_file_buffer: (default 5) How many chunks should the file buffer be divided. See details
  • const bool read_file_using_multiple_threads (default true): Whether to use multiple threads to read compressed data from file.

Details

pbam_in reads a set amount of data from the BAM file to efficiently process reads. Internally, it allocates two “file” buffers to store compressed data, each of size file_buffer_cap, as well as a larger “data” buffer to store uncompressed data, of size data_buffer_cap.

The first call to pbam_in::fillReads(), the function that decompresses the BAM file to extract BAM reads, will fill the first file buffer with data, as well as a fraction (one “chunk”) of data to fill the second file buffer. This chunk size is determined by chunks_per_file_buffer. For example, if file_buffer_cap = 5e8 i.e. 500 Mb and chunks_per_file_buffer = 5, then the chunk size is 100 Mb.

After importing compressed data, pbam_in will use multiple threads to decompress enough data to either fill up the data buffer, or decompress a portion of the file buffer (as determined by chunk size), whichever occurs first.

Subsequent calls to pbam_in::fillReads() will check whether the number of chunks stored in the second file buffer exceeds the number of chunks already processed in the first file buffer. Should this occur, it will read compressed data from the BAM file and place this in the second file buffer. When the amount of unprocessed data in the first file buffer drop below one chunk size amount, a “buffer swap” occurs where the data is copied from the second file buffer to the first. Thereafter, the process continues until the entire BAM file has been read and decompressed.

As can be seen, the optimal values of file_buffer_cap, data_buffer_cap and chunks_per_file_buffer will depend on the compression ratio of the BAM file, the I/O speed of the storage device, as well as the available memory. We believe the default settings is a good starting point and will consume approximately 2 Gb of RAM.

Storage on hard disk drives (with spinning components) typically found on traditional desktop computers may experience lower file I/O in multi-threaded applications. Solid-state drives (typically found in modern notebook laptops) and some RAID setups may favour multi-threaded file input. To instruct pbam_in to read the file using a single thread and decompress with the remaining cores, set read_file_using_multiple_threads = false.

Note that copy constructor and copy assignment operators are disabled for pbam_in. Thus, code containing things like the following will fail at compile-time:

(3b) openFile()

Opens a BAM file given the file path, and the requested number of threads to be used. Also checks the BAM file and reads the BAM header (silently).

Parameters

  • std::string filename: The path to the BAM file to be opened
  • const unsigned int n_threads: The number of threads to use to read the BAM file

(3c) SetInputHandle()

Assigns an istream handle to pbam_in. Also defines how many threads to use to read the BAM file. This is used as an alternative to pbam_in::openFile().

Parameters

  • std::istream *in_stream: The handle of an ifstream that has opened a BAM file using input binary access
  • const unsigned int n_threads: The number of threads to use to read the BAM file

Details

Frankly this is superseded by openFile().

NB: multi-threaded read access is disabled when pbam_in accesses a BAM file via this method, because it does not know the BAM file path.

(3d) closeFile()

Closes the BAM file. Also deallocates all memory contained in the pbam_in object, essentially resetting this object.

Parameters

None

Details

Note that closeFile() internally calls reset(), which is also called at object destruction. Thus, any open BAM files are closed when the pbam_in object goes out of scope. This function is provided if closing the ifstream associated with the BAM file, and/or freeing up RAM associated with pbam_in, is desired.

(3e) obtainChrs()

Retrieves chromosome names and lengths from an opened BAM file

Parameters

  • std::vector<std::string> & s_chr_names A reference to a string vector to contain chromosome names
  • std::vector<uint32_t> & u32_chr_lens A reference to an uint32_t vector to contain chromosome lengths

Return value

(int) The number of chromosomes in the genome the BAM file is aligned to. Chromosome names and lengths are returned to the two vectors supplied by reference.

If pbam_in has not opened a valid BAM file, this function returns -1

Details

When a BAM file is opened via openFile() or SetInputHandle(), pbam_in will automatically read and store the header data, namely chromosome names and lengths. obtainChrs() can be called to retrieve this data.

(3f) fillReads()

Reads from the BAM file, decompresses the file buffer to extract the reads.

Parameters

None

Return value

Returns 0 if successful, -1 if error, and 1 if end of file is already reached.

Details

A loop is required to read the BAM file until finished. This can be set up by checking the return value of fillReads() and making sure it is zero. If it is non-zero, abort the loop as there is either an error with the file, or end of file was reached in the last call to fillReads(). See example below.

NB1: fillReads() should only be called by the main thread (i.e. it should not be called from within child threads). fillReads() contain OpenMP parallel FOR loops that perform file reads and decompressions using multi-threading.

NB2: Applications must ensure that all reads decompressed by fillReads() are obtained from every thread via pbam_in::supplyRead(). If some reads are not obtained, the next call to fillReads() will return an error.

(3g) supplyRead()

Reads the thread-specific data buffer to supply a single aligned read from the BAM file.

Parameters

  • const unsigned int thread_id The index of the thread-specific buffer from which to retrieve the read.

Return value

pbam1_t A pbam1_t object containing the data from the aligned read.

Details

After fillReads() is called and the data buffer is filled, the reads are split into equal portions of decompressed data. Each chunk of data is processed by a single thread. This allows developers to set up an OpenMP for-loop such that reads from each chunk are processed exclusively by each thread.

Reads are divided into contiguous blocks. E.g., thread i=0 contains a contiguous block of reads, and that the first read in thread i=1 contains the read following the last read from thread i=0.

Setting i to the thread-ID makes sure supplyRead() is thread-safe, ensuring that we don’t have a situation where multiple threads are accessing the same data (i.e. a “race condition”).

When the end of the thread-specific buffer is reached, fillReads() will return an empty read (which will not validate using pbam1_t::validate(). This is useful to check whether the thread-specific buffer is exhausted.

(3h) remainingThreadReadsBuffer();

Queries how much decompressed data is available to be read by the thread- specific buffer

Parameters

  • const unsigned int thread_id (default 0) The index of the thread- specific buffer from which to retrieve the read.

Return value

Returns (in bytes) the amount of aligned reads available to be returned by supplyReads()

Details

This function is useful to distinguish between whether an invalid read returned by supplyReads() is because all the data from the thread-specific buffer has been process, or whether corrupt data is detected.

Frankly, it is sufficient to check for errors at fillReads(), because it is probably inappropriate to keep reading the BAM file if some reads are corrupt.

(3i) Progress and File Size Functions

Returns data regarding file size, and progress

Return value

  • GetFileSize() Returns the file size of the BAM file
  • GetProgress() Returns the total number of (compressed) bytes in the BAM file that has been processed (i.e. read and decompressed)
  • IncProgress() Returns the number of bytes processed since the last call to IncProgress()

Details

These functions are useful for addition of progress bars. The following example demonstrates a simple method of setting up an RcppProgress progress bar.

(4) pbam1_t function documentation

The pbam1_t object is used to retrieve data from a single aligned read.

pbam1_t object is returned from the supplyRead() function of pbam_in and is the primary way in which ompBAM provides access to the alignment data in BAM files. pbam1_t contain many useful functions that make it easy to access data from read alignments. In the documentation below, “reads” and “read alignments” are use interchangeably and are equivalent.

NB: In all the examples below, it is assumed that a pbam_in object, named inbam, has been initialized, and the first fillReads() has been called using the following code:

(4a) Constructor

Creates a pbam1_t object

Parameters

  • char * src A raw pointer to the char* in the pbam_in data buffer containing the read.
  • const bool realize Whether to “realize” the read.

Details

As explained in the previous section (2g:i Checking “validity” of reads and “realizing” reads"), pbam1_t does not initially contain dedicated memory space to store the data in a BAM read. Rather, it is a “virtual” read, in which it contain pointers to the corresponding data on the data buffer in the pbam_in object. This allows rapid processing of the BAM file and spares unnecessary memcpy operations which will increase RAM usage and slow the program.

However it can be converted into a “real” read, whereby pbam1_t will allocate the required memory space and copies the data from the pbam_in data buffer. When reads are “realized”, subsequent calls to pbam_in::fillReads() will not result in invalidation of such reads. This is important when data persistence is required, e.g. when matching paired reads in coordinate-sorted paired-end RNA-seq data.

NB: In typical usage, users are not expected to use the constructor to create “real” reads. Instead, it is better to first obtain the virtual read using pbam_in::supplyRead(), and subsequently “realize” the read via pbam1_t::realize().

(4b) validate()

Checks whether the read contained within the pbam1_t object contains valid data for a BAM read.

Parameters

None

Return value

Returns true if the read contains valid data, and false otherwise

Details

validate() makes the following checks, and returns false quickly if any of these checks fail:

  • whether a valid pointer to data exists.
  • whether the block_size as given by the first 4 bytes of the buffer matches the internally stored value of the read block_size calculated at its construction
  • whether the size of the individual components (cigar, sequence, quality scores, and size of additional tag data) totals the given block_size.

It is most often used to check whether pbam_in::supplyRead() has supplied an empty read signifying end of the thread-specific buffer. It can also check for invalid data pointers if users call pbam_in::fillReads() before “realizing” the reads to make the pbam1_t data persist. As explained previously in Section (2g:i), pbam_in::fillReads() will invalidate virtual reads as their data would have been recycled; however, realized reads will be persistent and continue to validate.

Internally, validate() is called prior to returning any values from the pbam_1 “getter” functions.

(4c) realize()

Converts a virtual read into a real read, thereby allowing data persistence by allocating specific memory space to store the data.

Parameters

None

Return value

Returns 0 if the pbam1_t was successfully converted to a “real” read, and -1 if the given read was invalid or if this function fails.

Details

realize() first checks if the pbam1_t was already a real read (i.e. it is contained within a dedicated persistent data buffer). If it is not, it allocates a buffer using malloc, and copies the data from the buffer pointed to by the previously-virtual read, using memcpy. It then rechecks validity.

This function is used to convert reads such that they no longer point to memory initiated by pbam_in, but instead have their own dedicated memory buffers (i.e. their memory becomes persistent)

(4d) isReal()

Checks whether the pbam1_t is persistent (real) or points to the pbam_in memory (virtual)

Parameters

None

Return value

Returns true if the read is real, and false if it is virtual.

Details

See Section (4c) pbam1_t::realize() for detailed explanation of “real” vs “virtual” reads

(4e) Getters of Read Core Data

The core is the first 36 bytes of the alignment read data. It contains constant-length data including chromosome refID and the left-most genomic coordinates of the alignment. It also contains metadata including cigar size and sequence length.

Parameters

None

Return value

See comments in the “Usage” section above. Also, it is helpful to refer to SAMv1.pdf - section 4.2 for further details regarding the BAM format.

(4f) Getters of Cigar Data

The cigar is data which specifies the nature of the alignment.

Parameters

  • std::string & dest The reference to a string to contain the cigar string
  • const uint16_t pos The index of the cigar operation (0 <= pos < cigar_size())
  • std::vector<char> & ops The reference to a char vector to contain the cigar operation
  • std::vector<uint32_t> & vals The reference to a uint32_t vector to contain the number of nucleotides to apply the cigar operation

Return value

uint32_t * cigar() returns a pointer to the data buffer at which the cigar data begins.

uint32_t cigar_size() returns the number of cigar operations. The BAM format is limited at 65535 operations which is insufficient for some long-read applications. Recently, the “CG” tag has been implemented that allows storage of alignment cigar data beyond this limit. ompBAM allows for this by first checking whether the “CG” tag exists. If it does, cigar_size() returns the length of this tag. If it does not, cigar_size() returns n_cigar_op which is the 16-bit storage of the cigar length.

int cigar() takes as reference a string and fills it with a string (in SAM format) of the entire cigar string.

cigar_op() and cigar_val() returns the cigar operation and length, respectively, given the index of cigar operation pos (where 0 <= pos < cigar_size())

cigar_ops_and_vals() takes two vectors by reference (ops and vals) and fills these with the cigar operations and lengths, respectively, of the entire cigar.

Refer to SAMv1.pdf - section 1.4 for further details regarding the cigar string (in SAM format).

(4g) Getters of Alignment Sequence and Quality score

Reads consist of a sequence of nucleotides. FASTQ files also contain per- nucleotide quality scores representing the level of statistical significance of each sequenced nucleotide. These values are often stored in BAM files after alignment.

Parameters

  • std::string & dest A string reference to contain the sequence.
  • std::vector<uint8_t> & dest A 8-bit unsigned integer vector to contain the list of quality scores

Return value

uint8_t * seq() and char * qual() returns the pointer to the raw data containing the sequence (in uint8_t) and quality scores (in char). Advanced users will have to convert the 4-bit sequence to nucleotides as described in the SAM documentation: SAMv1.pdf.

It is more convenient to use the latter functions. int seq(std::string & dest) takes a string reference as parameter and fills this reference with the sequence of the read. It returns the length of the read. qual(std::vector<uint8_t> & dest) takes a uint8_t vector by reference and fills this with per-nucleotide quality scores for the sequence. It returns the length of the read.

Details

Quality score can take values between [0,93] (without ASCII conversion). Users wishing to convert these to ASCII must add +33 to these scores before ASCII conversion. Alignments without quality scores will have all QUAL scores set at 255 (0xFF).

It is helpful to refer to SAMv1.pdf - section 4.2.3 for further details regarding SEQ and QUAL encoding.

(4h) Tag Getters

Tags are auxiliary information about each alignment.

Usage

// Provides an index of tags available to the alignment
int AvailTags(std::vector<std::string> & tags);

// Provides metadata about the specific tag
char Tag_Type(const std::string tag);
char Tag_Subtype(const std::string tag);
uint32_t Tag_Size(const std::string tag);
char Tag_Type_SAM(const std::string tag);

// Returns raw char pointer to the beginning of the info stored by the tag
// - For advanced users only
char * p_tagVal(const std::string tag);

// Returns values of fixed length
// - For tags of type AcCsSiIf
// Returns '\0' or 0 if tag does not exist or if the type is inappropriate
// for the given tag
char tagVal_A(const std::string tag);       // tags of type 'A'
int8_t tagVal_c(const std::string tag);     // tags of type 'c'
uint8_t tagVal_C(const std::string tag);    // tags of type 'C'
int16_t tagVal_s(const std::string tag);    // tags of type 's'
uint16_t tagVal_S(const std::string tag);   // tags of type 'S'
int32_t tagVal_i(const std::string tag);    // tags of type 'i'
uint32_t tagVal_I(const std::string tag);   // tags of type 'I'
float tagVal_f(const std::string tag);      // tags of type 'f'

// Fills given string reference by given Z-type tag
// Returns tag length of string if success, -1 if fail
int tagVal_Z(const std::string tag, std::string & dest);  // tags of type 'Z'

// Returns a B-tag by reference to its respective type
// Returns tag length if success, -1 if fail
int tagVal_B(const std::string tag, std::vector<int8_t> & dest);     // 'B, c'
int tagVal_B(const std::string tag, std::vector<uint8_t> & dest);    // 'B, C'
int tagVal_B(const std::string tag, std::vector<int16_t> & dest);    // 'B, s'
int tagVal_B(const std::string tag, std::vector<uint16_t> & dest);   // 'B, S'
int tagVal_B(const std::string tag, std::vector<int32_t> & dest);    // 'B, i'
int tagVal_B(const std::string tag, std::vector<uint32_t> & dest);   // 'B, I'
int tagVal_B(const std::string tag, std::vector<float> & dest);      // 'B, f'

Parameters

  • std::vector<std::string> & tags A reference to a string vector in which to store a list of available tags for the read.
  • const std::string tag A string containing the two-character tag to query
  • std::string & dest A string reference to store the given string in Z-type tags.
  • std::vector<T> & dest A reference to a vector of type in which to store the data contained in ‘B’-type tags.

Return value

  • AvailTags() returns the number of tags in the read. It also fills the given string vector with a list of available tags that can be queried.
  • Tag_Type(), Tag_Subtype() and Tag_Size() returns the type, subtype and tag length of the given tag. Tag_Type_SAM() returns the type as displayed in SAM format. See details for further information

  • p_tagVal() returns the raw pointer to the beginning of the data contained within the given tag. That is, the pointer is 3 bytes downstream to the start of the tag label in all tags except tags of type ‘B’, where it is 8 bytes downstream of the tag label. See SAMv1.pdf for more details.

  • tagVal_{A/c/C/s/S/i/I/f}() returns the 1-length value of the given tag type stored in the given tag.
  • tagVal_Z() takes by reference a string dest in which to store the string contained in Z-type tags. It returns the length of the tag if success, or -1 if fail.
  • tagval_B() takes by reference a vector dest of the appropriate type, in which to store the vector of values associated with B-type tags. It returns the length of the tag if success, or -1 if fails.

Details

Tags of type A/c/C/s/S/i/I/f are of length 1. Tags of type Z are of the length of its stored string (including the terminating ‘\0’ byte). Tags of type B are of the length of its vector.

Tag types designate the data type of its stored value. These are ‘A’ char, ‘c’ int8_t, ‘C’ uint8_t, ‘s’ int16_t, ‘S’ uint16_t, ‘i’ int32_t, ‘I’ uint32_t, and ‘f’ float.

Subtypes only apply to tags of type B. They can be of type c/C/s/S/i/I/f which refers to the data type of its stored value (see above).

SAM types are slightly different. Tags of BAM type c/C/s/S/i/I are displayed in SAM format as type ‘i’. Other types remain as they are.

Note that tags of type ‘H’ are not supported in ompBAM.

For more details, refer to SAMv1.pdf, section 4.2.4 for more information about how tags are stored in BAM format.

Also, refer to SAMtags.pdf for a list of the commonly-used BAM/SAM tags annotated with their SAM-types and the type of information they store.

(5) SessionInfo

sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] MyPkg_0.0.0.9000 ompBAM_1.0.0    
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3      bslib_0.3.1       compiler_4.2.0    pillar_1.7.0     
#>  [5] jquerylib_0.1.4   remotes_2.4.2     prettyunits_1.1.1 tools_4.2.0      
#>  [9] zlibbioc_1.42.0   testthat_3.1.4    pkgload_1.2.4     pkgbuild_1.3.1   
#> [13] digest_0.6.29     memoise_2.0.1     jsonlite_1.8.0    evaluate_0.15    
#> [17] lifecycle_1.0.1   tibble_3.1.6      pkgconfig_2.0.3   rlang_1.0.2      
#> [21] rstudioapi_0.13   cli_3.3.0         yaml_2.3.5        xfun_0.30        
#> [25] fastmap_1.1.0     withr_2.5.0       stringr_1.4.0     knitr_1.39       
#> [29] roxygen2_7.1.2    xml2_1.3.3        devtools_2.4.3    desc_1.4.1       
#> [33] fs_1.5.2          sass_0.4.1        vctrs_0.4.1       rprojroot_2.0.3  
#> [37] glue_1.6.2        R6_2.5.1          processx_3.5.3    fansi_1.0.3      
#> [41] rmarkdown_2.14    sessioninfo_1.2.2 callr_3.7.0       purrr_0.3.4      
#> [45] magrittr_2.0.3    ps_1.7.0          htmltools_0.5.2   ellipsis_0.3.2   
#> [49] usethis_2.1.5     utf8_1.2.2        stringi_1.7.6     cachem_1.0.6     
#> [53] crayon_1.5.1      brio_1.1.3