MSnSet
s. There is also features for quality control and visualisation of results. Please see other vignettes for convergence and other methodology.
bandle 1.0.0
# options(width = 150)
library("bandle")
library("ggplot2")
library("pRoloc")
library("MSnbase")
library("pheatmap")
library("viridis")
library("dplyr")
setStockcol(NULL)
setStockcol(paste0(getStockcol(),"90")) # make colours transparent
In this vignette we use a real-life biological use-case to demonstrate how to analyse mass-spectrometry based proteomics data using the Bayesian ANalysis of Differential Localisation Experiments (BANDLE) method.
As mentioned in “Vignette 1: Getting Started with BANDLE” data from mass
spectrometry based proteomics methods most commonly yield a matrix of
measurements where we have proteins/peptides/peptide spectrum matches
(PSMs) along the rows, and samples/fractions along the columns. To use bandle
the data must be stored as a MSnSet
, as implemented in the Bioconductor
MSnbase package. Please see the relevant vignettes in
MSnbase for constructing these data containers.
The data used in this vignette has been published in Mulvey et al. (2021) and is currently
stored as MSnSet
instances in the the pRolocdata package. We
will load it in the next section.
In this workflow we analyse the data produced by Mulvey et al. (2021). In this experiment triplicate hyperLOPIT experiments (Mulvey et al. 2017) were conducted on THP-1 human leukaemia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).
In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples. Please note to adhere to Bioconductor vignette build times we only load 2 of the 3 replicates for each condition to demonstrate the BANDLE workflow.
library("pRolocdata")
data("thpLOPIT_unstimulated_rep1_mulvey2021")
data("thpLOPIT_unstimulated_rep3_mulvey2021")
data("thpLOPIT_lps_rep1_mulvey2021")
data("thpLOPIT_lps_rep3_mulvey2021")
By typing the names of the datasets we get a MSnSet
data summary. For
example,
thpLOPIT_unstimulated_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
## unstim_rep1_set2_131_F24 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
## fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
thpLOPIT_lps_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
## lps_rep1_set2_131_F25 (20 total)
## varLabels: Tag Treatment ... Fraction (5 total)
## varMetadata: labelDescription
## featureData
## featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
## fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
## total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021.
## Normalised to sum of intensities.
## MSnbase version: 2.14.2
We see that the datasets thpLOPIT_unstimulated_rep1_mulvey2021
and
thpLOPIT_lps_rep1_mulvey2021
contain 5107 and 4879 proteins respectively,
across 20 TMT channels. The data is accessed through different slots of the
MSnSet
(see str(thpLOPIT_unstimulated_rep1_mulvey2021)
for all available
slots). The 3 main slots which are used most frequently are those that contain
the quantitation data, the features i.e. PSM/peptide/protein information and the
sample information, and these can be accessed using the functions exprs
,
fData
, and pData
, respectively.
To run bandle
there are a few minimal requirements that the data must fulfill.
Data are required to have:
list
of MSnSet
instancesIf we use the dim
function we see that the datasets we have loaded have the
same number of channels but a different number of proteins per experiment.
dim(thpLOPIT_unstimulated_rep1_mulvey2021)
## [1] 5107 20
dim(thpLOPIT_unstimulated_rep3_mulvey2021)
## [1] 5733 20
dim(thpLOPIT_lps_rep1_mulvey2021)
## [1] 4879 20
dim(thpLOPIT_lps_rep3_mulvey2021)
## [1] 5848 20
We use the function commonFeatureNames
to extract proteins that are common
across all replicates. This function has a nice side effect which is that it
also wraps the data into a list
, ready for input into bandle
.
thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021, ## unstimulated rep
thpLOPIT_unstimulated_rep3_mulvey2021, ## unstimulated rep
thpLOPIT_lps_rep1_mulvey2021, ## 12h-LPS rep
thpLOPIT_lps_rep3_mulvey2021)) ## 12h-LPS rep
## 3727 features in common
We now have our list of MSnSet
s ready for bandle
with 3727 proteins common
across all 4 replicates/conditions.
thplopit
## Instance of class 'MSnSetList' containig 4 objects.
We can visualise the data using the plot2D
function from pRoloc
## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
"12h-LPS 1st rep", "12h-LPS 2nd rep")
## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)
By default the plot2D
uses principal components analysis (PCA)
for the data transformation. Other options such as t-SNE, kernal
PCA etc. are also available, see ?plot2D
and the method
argument.
PCA sometimes will randomly flip the axis, because the eigenvectors
only need to satisfy \(||v|| = 1\), which allows a sign flip. You will
notice this is the case for the 3rd plot. If desired you can flip
the axis/change the sign of the PCs by specifying any of the arguments
mirrorX
, mirrorY
, axsSwitch
to TRUE when you call plot2D
.
bandle
: fitting GPs and setting the priorsAs mentioned in the first vignette, bandle
uses a complex model to analyse the
data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior
distribution of parameters and latent variables from which statistics of
interest can be computed. Again, here we only run a few iterations for brevity
but typically one needs to run thousands of iterations to ensure convergence, as
well as multiple parallel chains.
First, we need to fit non-parametric regression functions to the markers
profiles. We use the fitGPmaternPC
function using the default penalised
complexity priors (see ?fitGP
), which work well.
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))
We apply the fitGPmaternPC
function on to each dataset by using lapply
over
the thplopit
list of data. The posterior predictive means, standard deviations
and MAP hyperparamters for the GP are returned. If desired we can visualise the
predictives overlaid onto the marker profiles of the data by using the plotGPmatern
function.
The prior needs to form a K*3
matrix (where K
is the number of subcellular
classes in the data),
(mrkCl <- getMarkerClasses(thplopit[[1]], fcol = "markers"))
## [1] "40S/60S Ribosome" "Chromatin" "Cytosol"
## [4] "Endoplasmic Reticulum" "Golgi Apparatus" "Lysosome"
## [7] "Mitochondria" "Nucleolus" "Nucleus"
## [10] "Peroxisome" "Plasma Membrane"
So for this data we require a 11*3
matrix. Three columns are needed, one for
the prior, one for length-scale amplitude and finally one for the standard
deviation parameters (see hyppar
in ?fitGP
). For more details see the
manuscript by Crook et al. (2021).
K <- length(mrkCl)
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(10, 60, 250),
each = K), ncol = 3)
head(pc_prior)
## [,1] [,2] [,3]
## [1,] 10 60 250
## [2,] 10 60 250
## [3,] 10 60 250
## [4,] 10 60 250
## [5,] 10 60 250
## [6,] 10 60 250
Now we have generated these complexity priors we can pass them as an
argument to the fitGPmaternPC
function. For example,
gpParams <- lapply(thplopit,
function(x) fitGPmaternPC(x, hyppar = pc_prior))
By plotting the predictives using the plotGPmatern
function we see that
the distributions and fit looks sensible for each class so we will proceed with
setting the prior on the weights.
par(mfrow = c(4, 3))
plotGPmatern(thplopit[[1]], gpParams[[1]])