rprimer 1.0.0
rprimer provides tools for designing degenerate oligos (primers and probes) for sequence variable targets, such as RNA viruses. A multiple DNA sequence alignment is used as input data, and the outputs are presented in data frame format and as dashboard-like plots. The aim of the package is to provide a visual and efficient approach to degenerate oligo design, where sequence conservation analysis forms the basis of the design process.
In this vignette, I describe and demonstrate how to use the package by designing broadly reactive primers and probes for reverse transcription (RT) polymerase chain reaction (PCR)-based amplification and detection of hepatitis E virus (HEV), a sequence variable RNA virus and a common foodborne pathogen.
To install rprimer, start R (version 4.2.0 or higher), and enter the following code:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rprimer")
For best functionality, it is also recommended to install the Biostrings package [1]:
BiocManager::install("Biostrings")
The oligo and assay design workflow consists of five functions:
consensusProfile()
designOligos()
designAssays()
checkMatch()
plotData()
The package can be run through a Shiny application (a graphical user interface). To start the application, type runRprimerApp()
from within R upon installing and attaching the package.
library(rprimer)
library(Biostrings)
The first step is to identify and align target sequences of interest. It is important that the alignment does not contain any poorly aligned sequences and accurately represents the generic variation of the target population.
The file “example_alignment.txt” is provided with the package and contains an alignment of 50 HEV sequences. The file path can be retrieved by:
system.file("extdata", "example_alignment.txt", package = "rprimer")
The alignment must be imported to R using the readDNAMultipleAlignment()
function from Biostrings [1]. The input file can contain from one to several thousand sequences.
Below, I show how to import the file “example_alignment.txt”:
filepath <- system.file("extdata", "example_alignment.txt", package = "rprimer")
myAlignment <- readDNAMultipleAlignment(filepath, format = "fasta")
For some applications, there is often an interest in amplifying a specific region of the genome. In such cases, the alignment can be masked so that only the desired oligo binding regions are available for the subsequent design process. This type of masking can be achieved by using colmask()
from Biostrings [1], as illustrated in the example below:
## Mask everything but position 3000 to 4000 and 5000 to 6000
myMaskedAlignment <- myAlignment
colmask(myMaskedAlignment, invert = TRUE) <- c(3000:4000, 5000:6000)
It is also possible to mask specific sequences by using rowmask()
. Gap-rich positions can be hidden with gapmask()
[1].
consensusProfile
Once the alignment is imported, a consensus profile can be generated. The consensus profile contains all the information needed for the subsequent design process:
The majority consensus base is the most frequently occurring letter among the letters A, C, G, T and -.
Identity represents the proportion of sequences, among all sequences with a DNA base (A, C, G, T) or gap (-) that has the majority consensus base.
The IUPAC consensus character includes all DNA bases that occur at a relative frequency higher than a user-specified ambiguity threshold. The threshold can range from 0 to 0.2: a value of 0 (default) will catch all the variation, but with the potential downside of generating oligos with high degeneracy. A higher value will capture most of the variation, but with the potential downside of missing less common sequence variants.
Coverage represents the proportion of sequences in the target alignment (among all sequences with a DNA base) that are covered by the IUPAC consensus character.
consensusProfile()
has two arguments:
x
: a Biostrings::MultipleDNAAlignment
object (see Data import)ambiguityThreshold
: position wise “detection level” for ambiguous bases. All DNA bases that occur with a proportion higher than the specified value will be included in the IUPAC consensus character. Defaults to 0
, can range from 0
to 0.2
Type the following code to generate a consensus profile from the example alignment below:
myConsensusProfile <- consensusProfile(myAlignment, ambiguityThreshold = 0.05)
Results (row 100-105):
position | a | c | g | t | other | gaps | majority | identity | iupac | coverage |
---|---|---|---|---|---|---|---|---|---|---|
100 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0 | C | 1.00 | C | 1.00 |
101 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0 | A | 1.00 | A | 1.00 |
102 | 0.16 | 0.00 | 0.84 | 0.00 | 0.00 | 0 | G | 0.84 | R | 1.00 |
103 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0 | G | 1.00 | G | 1.00 |
104 | 0.00 | 0.98 | 0.00 | 0.00 | 0.02 | 0 | C | 1.00 | C | 1.00 |
105 | 0.20 | 0.00 | 0.02 | 0.78 | 0.00 | 0 | T | 0.78 | W | 0.98 |
The results can be visualized by using plotData()
:
plotData(myConsensusProfile)
The dots represent the value at each position, and the black lines show centered running averages. High identity values (in combination with low gap values) indicate high sequence conservation.
The data can be explored in more detail by zooming into a specific region of interest:
## Select position 5000 to 5800 in the consensus profile
selection <- myConsensusProfile[
myConsensusProfile$position >= 5000 & myConsensusProfile$position <= 5800,
]
plotData(selection)
designOligos
The next step is to design oligos. Primers must be designed, and probes are optional. Primers can be generated by using two strategies:
The ambiguous strategy (default) generates primers from the IUPAC consensus sequence alone, which means that ambiguous bases can occur at any position in the primer
The mixed strategy generates primers from both the majority and IUPAC consensus sequence. These primers consist of a shorter degenerate part at the 3’ end (~1/3 of the primer, targeting a conserved region) and a longer consensus part at the 5’ end (~2/3 of the primer), which instead of having ambiguous bases, contains the most frequently occuring nucleotide at each position. This strategy resembles the widely used Consensus-Degenerate Hybrid Oligonucleotide Primer (CODEHOP) principle [2]. It aims to allow amplification of highly variable targets using primers with low degeneracy. The idea is that the degenerate 3’ end will bind specifically to the target sequence in the initial PCR cycles and promote amplification despite eventual mismatches at the 5’ consensus part (since 5’ end mismatches are generally less detrimental than 3’ end mismatches) [2]. In this way, the generated products will perfectly match the 5’ ends of all primers, allowing them to be efficiently amplified in later PCR cycles. To provide a sufficiently high melting temperature (tm) with eventual 5’ end mismatches, it is recommended to design relatively long primers (at least 25 bases) when using this strategy
Probes are always designed using the ambiguous strategy.
designOligos()
has several arguments (design settings):
x
: the consensus profile from Step 1maxGapFrequency
: maximum gap frequency (in the alignment) for primers and probes, defaults to 0.01
lengthPrimer
: primer length range, defaults to c(18, 22)
maxDegeneracyPrimer
: maximum degeneracy for primers, defaults to 4
gcClampPrimer
: if primers should have a GC clamp, defaults to TRUE
. A GC clamp is identified as two to three G or C:s within the last five bases (3’ end) of the primer. The presence of a GC clamp is thought to promote specific binding of the 3’ endavoidThreeEndRunsPrimer
: if primers with more than two runs of the same nucleotide at the terminal 3’ end should be avoided (to reduce the risk of mispriming), defaults to TRUE
gcPrimer
: GC-content range of primers, defaults to c(0.40, 0.65)
tmPrimer
: melting temperature range for primers (in Celsius degrees), defaults to c(55, 65)
. Melting temperatures are calculated for perfectly matching oligo-target duplexes using the nearest neighbor method [3], using the formula, salt correction method, and table values as described in [4]. See the manual (?oligos
) for more informationconcPrimer
: primer concentration in nM, defaults to 500
(for tm calculation)designStrategyPrimer
: design strategy for primers, "ambiguous"
or "mixed"
, defaults to "ambiguous"
probe
: if probes should be designed, defaults to TRUE
lengthProbe
: defaults to c(18, 22)
maxDegeneracyProbe
: defaults to 4
avoidFiveEndGProbe
: if probes with a G at the terminal 5’ end should be avoided (to prevent quenching of the 5’ flourophore of hydrolysis probes), defaults to TRUE
gcProbe
: defaults to c(0.40, 0.65)
tmProbe
: defaults to c(55, 70)
concProbe
: defaults to 250
concNa
: sodium ion (equivalent) concentration in the PCR reaction (in M), defaults to 0.05
(50 mM) (for calculation of tm and delta G)All sequence variants of an oligo must fulfill all specified design constraints to be considered.
Oligos with at least one sequence variant containing more than four consecutive runs of the same nucleotide (e.g. “AAAAA”) and/or more than three consecutive runs of the same di-nucleotide (e.g. “TATATATA”) are not considered.
An error message will return if no oligos are found. If so, a good idea could be to re-run the process from Step 1 and increase the ambiguityThreshold
in consensusProfile()
, and/or relax the design constraints above.
Enter the following code to design oligos with default settings:
myOligos <- designOligos(myConsensusProfile)
Results (first five rows):
type | fwd | rev | start | end | length | iupacSequence | iupacSequenceRc | identity | coverage | degeneracy | gcContentMean | gcContentRange | tmMean | tmRange | deltaGMean | deltaGRange | sequence | sequenceRc | gcContent | tm | deltaG | method | score | roiStart | roiEnd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
probe | TRUE | TRUE | 124 | 143 | 20 | TCYGCCYTGGCGAATGCTGT | ACAGCATTCGCCARGGCRGA | 0.95 | 0.99 | 4 | 0.60 | 0.10 | 63.17 | 4.33 | -21.61 | 2.08 | TCCGCCCTGGCGAATGCTGT, TCTGCCCTGGCGAATGCTGT, TCCGCCTTGGCGAATGCTGT, TCTGCCTTGGCGAATGCTGT | ACAGCATTCGCCAGGGCGGA, ACAGCATTCGCCAGGGCAGA, ACAGCATTCGCCAAGGCGGA, ACAGCATTCGCCAAGGCAGA | 0.65, 0.60, 0.60, 0.55 | 65.33624, 62.86736, 63.47792, 61.00290 | -22.65389, -21.40840, -21.81968, -20.57419 | ambiguous | 2 | 1 | 7597 |
probe | FALSE | TRUE | 127 | 146 | 20 | GCCYTGGCGAATGCTGTGGT | ACCACAGCATTCGCCARGGC | 0.98 | 0.99 | 2 | 0.62 | 0.05 | 63.18 | 1.84 | -21.73 | 0.83 | GCCCTGGCGAATGCTGTGGT, GCCTTGGCGAATGCTGTGGT | ACCACAGCATTCGCCAGGGC, ACCACAGCATTCGCCAAGGC | 0.65, 0.60 | 64.10586, 62.26140 | -22.14750, -21.31329 | ambiguous | 3 | 1 | 7597 |
primer | TRUE | FALSE | 128 | 146 | 19 | CCYTGGCGAATGCTGTGGT | ACCACAGCATTCGCCARGG | 0.97 | 0.99 | 2 | 0.61 | 0.05 | 61.48 | 1.95 | -19.84 | 0.83 | CCCTGGCGAATGCTGTGGT, CCTTGGCGAATGCTGTGGT | ACCACAGCATTCGCCAGGG, ACCACAGCATTCGCCAAGG | 0.6315789, 0.5789474 | 62.45335, 60.50291 | -20.25708, -19.42287 | ambiguous | 3 | 1 | 7597 |
primer | TRUE | FALSE | 128 | 147 | 20 | CCYTGGCGAATGCTGTGGTR | YACCACAGCATTCGCCARGG | 0.96 | 0.99 | 4 | 0.60 | 0.10 | 61.61 | 3.37 | -20.55 | 1.76 | CCCTGGCGAATGCTGTGGTA, CCTTGGCGAATGCTGTGGTA, CCCTGGCGAATGCTGTGGTG, CCTTGGCGAATGCTGTGGTG | TACCACAGCATTCGCCAGGG, TACCACAGCATTCGCCAAGG, CACCACAGCATTCGCCAGGG, CACCACAGCATTCGCCAAGG | 0.60, 0.55, 0.65, 0.60 | 61.77167, 59.91655, 63.28690, 61.45963 | -20.50897, -19.67476, -21.43472, -20.60051 | ambiguous | 2 | 1 | 7597 |
probe | TRUE | TRUE | 128 | 146 | 19 | CCYTGGCGAATGCTGTGGT | ACCACAGCATTCGCCARGG | 0.97 | 0.99 | 2 | 0.61 | 0.05 | 60.45 | 1.94 | -19.84 | 0.83 | CCCTGGCGAATGCTGTGGT, CCTTGGCGAATGCTGTGGT | ACCACAGCATTCGCCAGGG, ACCACAGCATTCGCCAAGG | 0.6315789, 0.5789474 | 61.41687, 59.47567 | -20.25708, -19.42287 | ambiguous | 3 | 1 | 7597 |
The manual (?designOligos
) contains a detailed description of each variable. However, what may not be completely self-explanatory is that some variables (e.g. sequence
gcContent
and tm
) can hold several values in each entry. For such variables, all values within a specific row can be retrieved by:
myOligos$sequence[[1]] ## All sequence variants of the first oligo (i.e., first row)
#> [1] "TCCGCCCTGGCGAATGCTGT" "TCTGCCCTGGCGAATGCTGT" "TCCGCCTTGGCGAATGCTGT"
#> [4] "TCTGCCTTGGCGAATGCTGT"
myOligos$gcContent[[1]] ## GC-content of all variants of the first oligo
#> [1] 0.65 0.60 0.60 0.55
myOligos$tm[[1]] ## Tm of all variants of the first oligo
#> [1] 65.33624 62.86736 63.47792 61.00290
The primer and probe candidates can be visualized using plotData()
:
plotData(myOligos)
Next, I show how to design mixed primers using the dataset where I masked all positions but 3000 to 4000 and 5000 to 6000. In this case, I select to not design probes. I also select to modify some of the other settings.
## I first need to make a consensus profile of the masked alignment
myMaskedConsensusProfile <- consensusProfile(myMaskedAlignment, ambiguityThreshold = 0.05)
myMaskedMixedPrimers <- designOligos(myMaskedConsensusProfile,
maxDegeneracyPrimer = 8,
lengthPrimer = c(24:28),
tmPrimer = c(65,70),
designStrategyPrimer = "mixed",
probe = FALSE)
plotData(myMaskedMixedPrimers)
Importantly, for mixed primers, identity refers to the average identity within the 5’ consensus part of the primer binding region, and coverage refers to the average coverage of the 3’ degenerate part of the primer. Conversely, for ambiguous primers (and probes), both of these values are calculated for the oligo as a whole. For a detailed explanation of identity and coverage, see Step 1.
All valid oligos are scored based on their average identity, coverage, and GC-content. The score can range from 0 to 9, where 0 is considered best. The manual (?designOligos
) contains more information on the scoring system.
The best scoring oligos can be retrieved as follows:
## Get the minimum score from myOligos
bestOligoScore <- min(myOligos$score)
bestOligoScore
#> [1] 0
## Make a subset that only oligos with the best score are included
oligoSelection <- myOligos[myOligos$score == bestOligoScore, ]
It is possible to select a specific oligo and plot the nucleotide distribution within the binding region, by subsetting the consensus profile from Step 1:
## Get the binding region of the first oligo in the selection above (first row):
bindingRegion <- myConsensusProfile[
myConsensusProfile$position >= oligoSelection[1, ]$start &
myConsensusProfile$position <= oligoSelection[1, ]$end,
]
To plot the binding region, we can add type = "nucleotide"
:
plotData(bindingRegion, type = "nucleotide")
The binding region can be plotted in forward direction as above, or as a reverse complement, by specifying rc = TRUE
.
designAssays
The final step is to find pairs of forward and reverse primers, and eventually, to combine them with probes. If probes are present in the input dataset, only assays with a probe located between the primer pair will be kept.
designAssays()
has four arguments:
x
: the oligo dataset from Step 2length
: amplicon length range, defaults to c(65, 120)
tmDifferencePrimers
: maximum allowed difference between the mean tm of the forward and reverse primer (in Celsius degrees). Defaults to NULL
, which means that primers will be paired regardless of their tm.1 Note that the strategy of matching primer tm is flawed. What in reality is of interest is the amount of primer bound at the annealing temperature (ta). A PCR is most efficient if the two primers hybridize to the target at the same extent at the ta. See [5] for details on how to calculate the proportion of hybridized primer at different temperatures.An error message will return if no assays are found.
Below, I show how to design assays using default settings using the dataset myOligos
from Step 2:
myAssays <- designAssays(myOligos)
Output (first five rows):
start | end | length | totalDegeneracy | score | startFwd | endFwd | lengthFwd | iupacSequenceFwd | identityFwd | coverageFwd | degeneracyFwd | gcContentMeanFwd | gcContentRangeFwd | tmMeanFwd | tmRangeFwd | deltaGMeanFwd | deltaGRangeFwd | sequenceFwd | gcContentFwd | tmFwd | deltaGFwd | methodFwd | startRev | endRev | lengthRev | iupacSequenceRev | identityRev | coverageRev | degeneracyRev | gcContentMeanRev | gcContentRangeRev | tmMeanRev | tmRangeRev | deltaGMeanRev | deltaGRangeRev | sequenceRev | gcContentRev | tmRev | deltaGRev | methodRev | plusPr | minusPr | startPr | endPr | lengthPr | iupacSequencePr | iupacSequenceRcPr | identityPr | coveragePr | degeneracyPr | gcContentMeanPr | gcContentRangePr | tmMeanPr | tmRangePr | deltaGMeanPr | deltaGRangePr | sequencePr | sequenceRcPr | gcContentPr | tmPr | deltaGPr | methodPr | roiStart | roiEnd |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5605 | 5673 | 69 | 6 | 2.00 | 5605 | 5624 | 20 | GGCRGTGGTTTCTGGGGTGA | 0.98 | 1 | 2 | 0.62 | 0.05 | 62.84 | 2.51 | -20.86 | 1.25 | GGCAGTGGTTTCTGGGGTGA, GGCGGTGGTTTCTGGGGTGA | 0.60, 0.65 | 61.57995, 64.09018 | -20.23509, -21.48058 | ambiguous | 5654 | 5673 | 20 | GTTGGTTGGATGAASATAGG | 1 | 1 | 2 | 0.4 | 0 | 50.71 | 1.1 | -15.27 | 0.52 | GTTGGTTGGATGAATATAGG, GTTGGTTGGATGAAAATAGG | 0.4, 0.4 | 50.15470, 51.25631 | -15.00784, -15.52870 | ambiguous | TRUE | FALSE | 5625 | 5642 | 18 | CMGGGTTGATTCTCAGCC | GGCTGAGAATCAACCCKG | 0.97 | 0.99 | 2 | 0.58 | 0.06 | 55.14 | 2.79 | -17.06 | 1.25 | CAGGGTTGATTCTCAGCC, CCGGGTTGATTCTCAGCC | GGCTGAGAATCAACCCTG, GGCTGAGAATCAACCCGG | 0.5555556, 0.6111111 | 53.74555, 56.54052 | -16.43245, -17.67794 | ambiguous | 1 | 7597 |
5605 | 5673 | 69 | 6 | 2.33 | 5605 | 5624 | 20 | GGCRGTGGTTTCTGGGGTGA | 0.98 | 1 | 2 | 0.62 | 0.05 | 62.84 | 2.51 | -20.86 | 1.25 | GGCAGTGGTTTCTGGGGTGA, GGCGGTGGTTTCTGGGGTGA | 0.60, 0.65 | 61.57995, 64.09018 | -20.23509, -21.48058 | ambiguous | 5654 | 5673 | 20 | GTTGGTTGGATGAASATAGG | 1 | 1 | 2 | 0.4 | 0 | 50.71 | 1.1 | -15.27 | 0.52 | GTTGGTTGGATGAATATAGG, GTTGGTTGGATGAAAATAGG | 0.4, 0.4 | 50.15470, 51.25631 | -15.00784, -15.52870 | ambiguous | TRUE | FALSE | 5625 | 5643 | 19 | CMGGGTTGATTCTCAGCCC | GGGCTGAGAATCAACCCKG | 0.97 | 0.99 | 2 | 0.61 | 0.05 | 57.63 | 2.64 | -18.54 | 1.25 | CAGGGTTGATTCTCAGCCC, CCGGGTTGATTCTCAGCCC | GGGCTGAGAATCAACCCTG, GGGCTGAGAATCAACCCGG | 0.5789474, 0.6315789 | 56.30714, 58.95105 | -17.91855, -19.16404 | ambiguous | 1 | 7597 |
5605 | 5673 | 69 | 6 | 2.00 | 5605 | 5624 | 20 | GGCRGTGGTTTCTGGGGTGA | 0.98 | 1 | 2 | 0.62 | 0.05 | 62.84 | 2.51 | -20.86 | 1.25 | GGCAGTGGTTTCTGGGGTGA, GGCGGTGGTTTCTGGGGTGA | 0.60, 0.65 | 61.57995, 64.09018 | -20.23509, -21.48058 | ambiguous | 5654 | 5673 | 20 | GTTGGTTGGATGAASATAGG | 1 | 1 | 2 | 0.4 | 0 | 50.71 | 1.1 | -15.27 | 0.52 | GTTGGTTGGATGAATATAGG, GTTGGTTGGATGAAAATAGG | 0.4, 0.4 | 50.15470, 51.25631 | -15.00784, -15.52870 | ambiguous | TRUE | TRUE | 5625 | 5644 | 20 | CMGGGTTGATTCTCAGCCCT | AGGGCTGAGAATCAACCCKG | 0.97 | 1.00 | 2 | 0.58 | 0.05 | 58.87 | 2.55 | -19.43 | 1.25 | CAGGGTTGATTCTCAGCCCT, CCGGGTTGATTCTCAGCCCT | AGGGCTGAGAATCAACCCTG, AGGGCTGAGAATCAACCCGG | 0.55, 0.60 | 57.59837, 60.14566 | -18.80351, -20.04900 | ambiguous | 1 | 7597 |
5605 | 5673 | 69 | 6 | 1.67 | 5605 | 5624 | 20 | GGCRGTGGTTTCTGGGGTGA | 0.98 | 1 | 2 | 0.62 | 0.05 | 62.84 | 2.51 | -20.86 | 1.25 | GGCAGTGGTTTCTGGGGTGA, GGCGGTGGTTTCTGGGGTGA | 0.60, 0.65 | 61.57995, 64.09018 | -20.23509, -21.48058 | ambiguous | 5654 | 5673 | 20 | GTTGGTTGGATGAASATAGG | 1 | 1 | 2 | 0.4 | 0 | 50.71 | 1.1 | -15.27 | 0.52 | GTTGGTTGGATGAATATAGG, GTTGGTTGGATGAAAATAGG | 0.4, 0.4 | 50.15470, 51.25631 | -15.00784, -15.52870 | ambiguous | TRUE | TRUE | 5625 | 5645 | 21 | CMGGGTTGATTCTCAGCCCTT | AAGGGCTGAGAATCAACCCKG | 0.98 | 1.00 | 2 | 0.55 | 0.05 | 59.21 | 2.43 | -20.08 | 1.25 | CAGGGTTGATTCTCAGCCCTT, CCGGGTTGATTCTCAGCCCTT | AAGGGCTGAGAATCAACCCTG, AAGGGCTGAGAATCAACCCGG | 0.5238095, 0.5714286 | 57.99473, 60.42130 | -19.45540, -20.70089 | ambiguous | 1 | 7597 |
5605 | 5673 | 69 | 6 | 2.00 | 5605 | 5624 | 20 | GGCRGTGGTTTCTGGGGTGA | 0.98 | 1 | 2 | 0.62 | 0.05 | 62.84 | 2.51 | -20.86 | 1.25 | GGCAGTGGTTTCTGGGGTGA, GGCGGTGGTTTCTGGGGTGA | 0.60, 0.65 | 61.57995, 64.09018 | -20.23509, -21.48058 | ambiguous | 5654 | 5673 | 20 | GTTGGTTGGATGAASATAGG | 1 | 1 | 2 | 0.4 | 0 | 50.71 | 1.1 | -15.27 | 0.52 | GTTGGTTGGATGAATATAGG, GTTGGTTGGATGAAAATAGG | 0.4, 0.4 | 50.15470, 51.25631 | -15.00784, -15.52870 | ambiguous | TRUE | FALSE | 5625 | 5646 | 22 | CMGGGTTGATTCTCAGCCCTTC | GAAGGGCTGAGAATCAACCCKG | 0.98 | 0.99 | 2 | 0.57 | 0.05 | 59.91 | 2.28 | -21.11 | 1.25 | CAGGGTTGATTCTCAGCCCTTC, CCGGGTTGATTCTCAGCCCTTC | GAAGGGCTGAGAATCAACCCTG, GAAGGGCTGAGAATCAACCCGG | 0.5454545, 0.5909091 | 58.77534, 61.05418 | -20.48812, -21.73361 | ambiguous | 1 | 7597 |
The output contains many columns, but you can get an overview by calling View(as.data.frame(myAssays))
if you are working in RStudio.
Again, the results can be visualized using plotData()
:
plotData(myAssays)
There are now over 1000 assays, but we can reduce the number by filtering the dataset. In the example below, I select to subset the assays based on their average oligo score (see the score variable in the table above), and only keep those with the best score.
The best possible average score is 0 and the worst is 9 (see Step 2 for more information).
## Get the minimum (best) score from myAssays
bestAssayScore <- min(myAssays$score)
bestAssayScore
#> [1] 0.6666667
## Make a subset that only contains the assays with the best score
myAssaySelection <- myAssays[myAssays$score == bestAssayScore, ]
Oligo and assay binding region(s) can be inspected in more detail by using the consensus profile from Step 1. As an example, I select to take a closer look at the first assay in myAssaySelection
from Step 3.
The start and end position of an assay can be retrieved as follows:
from <- myAssaySelection[1, ]$start
to <- myAssaySelection[1, ]$end
It is now possible to indicate the amplicon region, using the highlight
argument in plotData()
:
plotData(myConsensusProfile, highlight = c(from, to))
To get a more detailed view, we can make a subset of the consensus profile to only contain the amplicon region, and plot it:
myAssayRegion <- myConsensusProfile[
myConsensusProfile$position >= from &
myConsensusProfile$position <= to,
]
plotData(myAssayRegion, type = "nucleotide")
The consensus amplicon sequence is obtained as follows:
paste(myAssayRegion$iupac, collapse = "")
#> [1] "CRGTGGTTTCTGGGGTGACMGGGTTGATTCTCAGCCCTTCGCMMTCCCCTATWTTCATCCAACCAACCC"
It is often valuable to investigate how the generated oligos match with their target sequences. checkMatch()
can be used for this purpose. The function is a wrapper to vcountPDict()
from Biostrings [1] and has two arguments:
The output gives information on the proportion and names of target sequences that match perfectly and with one, two, three, or four or more mismatches to the oligo within the intended oligo binding region in the alignment (on-target match). It also tells the proportion and names of target sequences that match with a maximum of two mismatches to the oligo outside the intended oligo binding region (off-target match). Thus, be cautious that false negatives (or positives) may occur due to poorly aligned sequences. Moreover, the output does not tell which strand (minus or plus) the oligo matches to, which is important to consider when assessing off-target matches to single-stranded targets. See the manual (?checkMatch
) for more information.
Ambiguous bases and gaps in the target sequences will be identified as mismatches.
checkMatch()
can be used for both oligo and assay datasets. The function is rather slow, especially if there are many target sequences, so a good idea could be to select only a few oligo or assay candidates to investigate.
Below, I show how to check how the three first candidates of the oligoSelection
dataset matches to the sequences in myAlignment
:
## Check the first three candidates in the oligoSelection dataset
oligoSelectionMatch <- checkMatch(oligoSelection[1:3, ], target = myAlignment)
Output:
iupacSequence | perfectMatch | idPerfectMatch | oneMismatch | idOneMismatch | twoMismatches | idTwoMismatches | threeMismatches | idThreeMismatches | fourOrMoreMismatches | idFourOrMoreMismatches | offTargetMatch | idOffTargetMatch |
---|---|---|---|---|---|---|---|---|---|---|---|---|
GGGTTGATTCTCAGCCCTT | 0.9 | AB073912.1, AB425830.1, AB437316.1, AB437317.1, AB443627.1, AB291955.1, AB074918.2, BD378055.1, KJ507955.1, AB290312.1, KU176130.1, MG783569.1, LR777865.1, EU723516.1, LC055973.1, MF444097.1, MF444040.1, MF444092.1, MF444103.1, MH184583.1, MH504141.1, MH504149.1, MN646690.1, GU937805.1, LY647099.1, KJ013415.1, MH410175.1, MH410176.1, AY594199.1, GU361892.1, KX531115.1, AB909125.1, HM439284.1, AB161717.1, LC387631.1, KR872415.1, AF444003.1, JF443726.1, LC314156.1, MN401238.1, MH504159.1, KY436507.1, AY230202.1, AY204877.1, MH918640.1 | 0.1 | AB481228.1, KJ701409.1, JQ953665.1, AB521805.1, LC042232.1 | 0 | 0 | 0 | 0 | ||||
GGGTTGATTCTCAGCCCTT | 0.9 | AB073912.1, AB425830.1, AB437316.1, AB437317.1, AB443627.1, AB291955.1, AB074918.2, BD378055.1, KJ507955.1, AB290312.1, KU176130.1, MG783569.1, LR777865.1, EU723516.1, LC055973.1, MF444097.1, MF444040.1, MF444092.1, MF444103.1, MH184583.1, MH504141.1, MH504149.1, MN646690.1, GU937805.1, LY647099.1, KJ013415.1, MH410175.1, MH410176.1, AY594199.1, GU361892.1, KX531115.1, AB909125.1, HM439284.1, AB161717.1, LC387631.1, KR872415.1, AF444003.1, JF443726.1, LC314156.1, MN401238.1, MH504159.1, KY436507.1, AY230202.1, AY204877.1, MH918640.1 | 0.1 | AB481228.1, KJ701409.1, JQ953665.1, AB521805.1, LC042232.1 | 0 | 0 | 0 | 0 | ||||
GGTTGATTCTCAGCCCTT | 0.9 | AB073912.1, AB425830.1, AB437316.1, AB437317.1, AB443627.1, AB291955.1, AB074918.2, BD378055.1, KJ507955.1, AB290312.1, KU176130.1, MG783569.1, LR777865.1, EU723516.1, LC055973.1, MF444097.1, MF444040.1, MF444092.1, MF444103.1, MH184583.1, MH504141.1, MH504149.1, MN646690.1, GU937805.1, LY647099.1, KJ013415.1, MH410175.1, MH410176.1, AY594199.1, GU361892.1, KX531115.1, AB909125.1, HM439284.1, AB161717.1, LC387631.1, KR872415.1, AF444003.1, JF443726.1, LC314156.1, MN401238.1, MH504159.1, KY436507.1, AY230202.1, AY204877.1, MH918640.1 | 0.1 | AB481228.1, KJ701409.1, JQ953665.1, AB521805.1, LC042232.1 | 0 | 0 | 0 | 0 |
Note that the id variables can hold several values in each row. All values can be retrieved by:
## Get the id of all sequences that has one mismatch to the first oligo in the input dataset
oligoSelectionMatch$idOneMismatch[[1]]
#> [1] "AB481228.1" "KJ701409.1" "JQ953665.1" "AB521805.1" "LC042232.1"
The output can be visualized using plotData()
:
plotData(oligoSelectionMatch)
Before proceeding to wet lab evaluation, it is highly recommended to also evaluate the final primer and probe candidates for 1) the potential to form primer-dimers and hairpin structures, and 2) the potential to (not) cross react with non-targets.
These tasks are best performed by other software. Below, I show how to export results to a file so they can be readily analyzed by other tools.
All result tables can be saved to .txt or .csv-files, for instance:
write.csv(myAssays, file = "myAssays.csv", quote = FALSE, row.names = FALSE)
write.table(myAssays, file = "myAssays.txt", quote = FALSE, row.names = FALSE)
It is also possible to export oligos and assays in fasta-format. To do this, the object of interest must first be coerced to a DNAStringSet
object, as follows:
## Convert the first two oligos
as(myOligos[1:2, ], "DNAStringSet")
#> DNAStringSet object of length 6:
#> width seq names
#> [1] 20 TCCGCCCTGGCGAATGCTGT oligo_1_variant_1
#> [2] 20 TCTGCCCTGGCGAATGCTGT oligo_1_variant_2
#> [3] 20 TCCGCCTTGGCGAATGCTGT oligo_1_variant_3
#> [4] 20 TCTGCCTTGGCGAATGCTGT oligo_1_variant_4
#> [5] 20 GCCCTGGCGAATGCTGTGGT oligo_2_variant_1
#> [6] 20 GCCTTGGCGAATGCTGTGGT oligo_2_variant_2
## Convert the first two assays
as(myAssays[1:2, ], "DNAStringSet")
#> DNAStringSet object of length 12:
#> width seq names
#> [1] 20 GGCAGTGGTTTCTGGGGTGA assay_1_fwd_varia...
#> [2] 20 GGCGGTGGTTTCTGGGGTGA assay_1_fwd_varia...
#> [3] 20 GGCAGTGGTTTCTGGGGTGA assay_2_fwd_varia...
#> [4] 20 GGCGGTGGTTTCTGGGGTGA assay_2_fwd_varia...
#> [5] 20 CCTATATTCATCCAACCAAC assay_1_rev_varia...
#> ... ... ...
#> [8] 20 CCTATTTTCATCCAACCAAC assay_2_rev_varia...
#> [9] 18 CAGGGTTGATTCTCAGCC assay_1_pr_variant_1
#> [10] 18 CCGGGTTGATTCTCAGCC assay_1_pr_variant_2
#> [11] 19 CAGGGTTGATTCTCAGCCC assay_2_pr_variant_1
#> [12] 19 CCGGGTTGATTCTCAGCCC assay_2_pr_variant_2
Note that all sequences will be written in the same direction as the input target alignment, and all sequence variants of each oligo will be printed.
The sequences can now be saved in fasta-format by using writeXStringSet()
from Biostrings [1].
toFile <- as(myOligos[1:2, ], "DNAStringSet")
writeXStringSet(toFile, file = "myOligos.txt")
The design process as a whole is summarized below.
Using the R console:
library(rprimer)
## Enter the filepath to an alignment with target sequences of interest
filepath <- system.file("extdata", "example_alignment.txt", package = "rprimer")
## Import the alignment
myAlignment <- Biostrings::readDNAMultipleAlignment(filepath, format = "fasta")
## Design primers, probes and assays (modify settings if needed)
myConsensusProfile <- consensusProfile(myAlignment, ambiguityThreshold = 0.05)
myOligos <- oligos(myConsensusProfile)
myAssays <- assays(myOligos)
## Visualize the results
plotData(myConsensusProfile)
plotData(myOligos)
plotData(myAssays)
## Show result tables (in RStudio)
View(as.data.frame(myConsensusProfile))
View(as.data.frame(myOligos))
View(as.data.frame(myAssays))
Using the Shiny application:
library(rprimer)
## Start application
runRprimerApp()
The “Rprimer” classes (RprimerProfile
, RprimerOligo
, RprimerAssay
, RprimerMatchOligo
and RprimerMatchAssay
) extends the DFrame
class from S4Vectors [6], and behave in a similar way as traditional data frames, with methods for [
, $
, nrow()
, ncol()
, head()
, tail()
, rbind()
, cbind()
etc. They can be coerced to traditional data frames by using as.data.frame()
.
Example datasets of each class are provided with the package, and are loaded by:
data("exampleRprimerProfile")
data("exampleRprimerOligo")
data("exampleRprimerAssay")
data("exampleRprimerMatchOligo")
data("exampleRprimerMatchAssay")
To provide reproducible examples, I have also included an alignment of class Biostrings::DNAMultipleAlignment
. It is loaded by data("exampleRprimerAlignment")
.
Tables used for determination of complement bases, IUPAC consensus character codes, degeneracy and nearest neighbors can be found by typing:
rprimer:::lookup$complement
rprimer:::lookup$iupac
, rprimer:::lookup$degenerates
,rprimer:::lookup$degeneracy
andrprimer:::lookup$nn
, respectively.The source code is available at https://github.com/sofpn/rprimer.
Enter citation("rprimer")
from within R for information on how to cite the package.
This document was generated under the following conditions:
sessionInfo()
#> R version 4.2.0 RC (2022-04-19 r82224)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] Biostrings_2.64.0 GenomeInfoDb_1.32.0 XVector_0.36.0
#> [4] IRanges_2.30.0 S4Vectors_0.34.0 BiocGenerics_0.42.0
#> [7] rprimer_1.0.0 kableExtra_1.3.4 BiocStyle_2.24.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.8.3 svglite_2.1.0 assertthat_0.2.1
#> [4] digest_0.6.29 utf8_1.2.2 mime_0.12
#> [7] plyr_1.8.7 R6_2.5.1 evaluate_0.15
#> [10] highr_0.9 httr_1.4.2 ggplot2_3.3.5
#> [13] pillar_1.7.0 zlibbioc_1.42.0 rlang_1.0.2
#> [16] rstudioapi_0.13 jquerylib_0.1.4 magick_2.7.3
#> [19] rmarkdown_2.14 mathjaxr_1.6-0 labeling_0.4.2
#> [22] webshot_0.5.3 stringr_1.4.0 RCurl_1.98-1.6
#> [25] munsell_0.5.0 shiny_1.7.1 compiler_4.2.0
#> [28] httpuv_1.6.5 xfun_0.30 pkgconfig_2.0.3
#> [31] systemfonts_1.0.4 htmltools_0.5.2 tidyselect_1.1.2
#> [34] tibble_3.1.6 GenomeInfoDbData_1.2.8 bookdown_0.26
#> [37] fansi_1.0.3 viridisLite_0.4.0 crayon_1.5.1
#> [40] dplyr_1.0.8 later_1.3.0 bitops_1.0-7
#> [43] grid_4.2.0 DBI_1.1.2 jsonlite_1.8.0
#> [46] xtable_1.8-4 gtable_0.3.0 lifecycle_1.0.1
#> [49] magrittr_2.0.3 scales_1.2.0 cli_3.3.0
#> [52] stringi_1.7.6 reshape2_1.4.4 farver_2.1.0
#> [55] promises_1.2.0.1 xml2_1.3.3 bslib_0.3.1
#> [58] ellipsis_0.3.2 vctrs_0.4.1 generics_0.1.2
#> [61] tools_4.2.0 glue_1.6.2 purrr_0.3.4
#> [64] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
#> [67] BiocManager_1.30.17 rvest_1.0.2 knitr_1.38
#> [70] patchwork_1.1.1 sass_0.4.1
1. H. Pagès and P. Aboyoun and R. Gentleman and S. DebRoy. Biostrings: Efficient manipulation of biological strings. 2021. https://bioconductor.org/packages/Biostrings.
2. TM. Rose, ER. Schultz, JG. Henikoff, S. Pietrokovski, CM. McCallum, and S. Henikoff. Consensus-degenerate hybrid oligonucleotide primers for amplification of distantly related sequences. Nucleic acids research. 1998;26:1628–35.
3. J. SantaLucia. A unified view of polymer, dumbbell, and oligonucleotide dna nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences. 1998;95:1460–5.
4. J. SantaLucia and D. Hicks. The thermodynamics of dna structural motifs. Annu Rev Biophys Biomol Struct. 2004;33:415–40.
5. J. SantaLucia. Physical principles and visual-omp software for optimal pcr design. Springer; 2007.
6. H. Pagès and M. Lawrence and P. Aboyoun. S4Vectors: Foundation of vector-like and list-like containers in bioconductor. 2021. https://bioconductor.org/packages/S4Vectors.