scp 1.6.0
scp
packageThe scp
package is used to process and analyse mass spectrometry
(MS)-based single cell proteomics (SCP) data. The functions rely on a
specific data structure that wraps
QFeatures
objects (Gatto (2020)) around
SingleCellExperiment
objects (Amezquita et al. (2019)). This data structure could be seen as
Matryoshka dolls were the SingleCellExperiment
objects are small
dolls contained in the bigger QFeatures
doll.
The SingleCellExperiment
class provides a dedicated framework for
single-cell data. The SingleCellExperiment
serves as an interface to
many cutting-edge methods for processing, visualizing and analysis
single-cell data. More information about the SingleCellExperiment
class and associated methods can be found in the
OSCA book.
The QFeatures
class is a data framework dedicated to manipulate and
process MS-based quantitative data. It preserves the relationship
between the different levels of information: peptide to spectrum match
(PSM) data, peptide data and protein data. The QFeatures
package
also provides an interface to many utility functions to streamline the
processing MS data. More information about MS data analysis tools can
be found in the
RforMassSpectrometry project.
Before running the vignette we need to load the scp
package.
library("scp")
We also load ggplot2
, magrittr
and dplyr
for convenient data
manipulation and plotting.
library("ggplot2")
library("magrittr")
library("dplyr")
This vignette will guide you through some common steps of mass
spectrometry-based single-cell proteomics (SCP) data analysis. SCP is
an emerging field and further research is required to develop a
principled analysis workflow. Therefore, we do not guarantee that the
steps presented here are the best steps for this type of data analysis.
This vignette performs the steps that were described in the SCoPE2
landmark paper (Specht et al. (2021)). We hope to convince the reader that,
although the workflow is probably not optimal, scp
has the full
potential to perform standardized and principled data analysis. All
functions presented here are comprehensively documented, highly modular,
can easily be extended with new algorithms. Suggestions, feature
requests or bug reports are warmly welcome. Feel free to open an issue
in the
GitHub repository.
This workflow can be applied to any MS-based SCP data. The minimal requirement to follow this workflow is that the data should contain the following information:
Raw.file
: field in both the feature data and the sample data that gives
the names of the batches or MS runs or MS files.Channel
: field in the sample data that links to columns in the
quantification data and that allows to link samples to MS channels
(more details in another
vignette).SampleType
: field in the sample data that provides the type of sample
that is acquired (carrier, blank, reference, single-cell). Only needed
for multiplexing experiments.Potential.contaminant
: field in the feature data that marks
contaminant peptides.Reverse
: field in the feature data that marks reverse peptides.PIF
: field in the feature data that provides spectral purity.PEP
or dart_PEP
: field in the feature data that provides peptide
posterior error probabilities.Modified.sequence
: field in the feature data that provides the
peptide identifiers.Leading.razor.protein
: field in the feature data that provides the
protein identifiers.Reporter.intensity.
followed by an index (1
to 16
).Each required field will be described more in detail in the corresponding sections. Names can be adapted by the user to more meaningful ones or adapted to other output tables.
The first step is to read in the PSM quantification table
generated by, for example, MaxQuant (Tyanova, Temu, and Cox (2016)). We created a
small example data by subsetting the MaxQuant evidence.txt
table
provided in the SCoPE2 landmark paper (Specht et al. (2021)). The
mqScpData
table is a typical example of what you would get after
reading in a CSV file using read.csv
or read.table
. See
?mqScpData
for more information about the table content.
data("mqScpData")
We also provide an example of a sample metadata table that provides
useful information about the samples that are present in the example
data. See ?sampleAnnotation
for more information about the table
content.
data("sampleAnnotation")
As a note, the example sample data contains 5 different types of samples
(SampleType
) that can be found in a TMT-based SCP data set:
table(sampleAnnotation$SampleType)
#>
#> Blank Carrier Macrophage Monocyte Reference Unused
#> 19 3 20 5 3 14
Carrier
) contain 200 cell equivalents and
are meant to boost the peptide identification rate.Reference
) contain 5 cell equivalents
and are used to partially correct for between-run variation.Unused
) are channels that are left empty due
to isotopic cross-contamination by the carrier channel.Blank
) contain samples that do not contain any
cell but are processed as single-cell samples.Macrophage
) or monocyte
(Monocyte
).Using readSCP
, we combine both tables in a QFeatures
object
formatted as described above.
scp <- readSCP(featureData = mqScpData,
colData = sampleAnnotation,
channelCol = "Channel",
batchCol = "Raw.file",
removeEmptyCols = TRUE)
#> Loading data as a 'SingleCellExperiment' object
#> Splitting data based on 'Raw.file'
#> Formatting sample metadata (colData)
#> Formatting data as a 'QFeatures' object
scp
#> An instance of class QFeatures containing 4 assays:
#> [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 395 rows and 11 columns
#> [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 487 rows and 11 columns
#> [3] 190321S_LCA10_X_FP97_blank_01: SingleCellExperiment with 109 rows and 11 columns
#> [4] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 370 rows and 16 columns
See here that the 3 first assays contain 11 columns that correspond to the TMT-11 labels and the last assay contains 16 columns that correspond to the TMT-16 labels.
Important: More details about the usage of readSCP
and how to
read your own data set are provided in the Load data using readSCP
vignette.
Another way to get an overview of the scp object is to plot the
QFeatures
object. This will create a graph where each node is an
assay and links between assays are denoted as edges.
plot(scp)
All single-cell data contain many zeros. The zeros can be biological
zeros or technical zeros and differentiating between the two types is
not a trivial task. To avoid artefacts in downstream steps, we replace
the zeros by the missing value NA
. The zeroIsNA
function takes the
QFeatures
object and the name(s) or index/indices of the assay(s) to
clean and automatically replaces any zero in the selected quantitative
data by NA
.
scp <- zeroIsNA(scp, i = 1:4)
A common steps in SCP is to filter out low-confidence PSMs. Each PSM
assay contains feature meta-information that are stored in the
rowData
of the assays. The QFeatures
package allows to quickly
filter the rows of an assay by using these information. The available
variables in the rowData
are listed below for each assay.
rowDataNames(scp)
#> CharacterList of length 4
#> [["190222S_LCA9_X_FP94BM"]] uid Sequence Length ... residual participated
#> [["190321S_LCA10_X_FP97AG"]] uid Sequence Length ... residual participated
#> [["190321S_LCA10_X_FP97_blank_01"]] uid Sequence ... residual participated
#> [["190914S_LCB3_X_16plex_Set_21"]] uid Sequence ... residual participated
Below are some examples of criteria that are used to identify low-confidence. The information is readily available since this was computed by MaxQuant:
We can perform this filtering using the filterFeatures
function from
QFeatures
. filterFeatures
automatically accesses the feature
metadata and selects the rows that meet the provided condition(s). For
instance, Reverse != "+"
keeps the rows for which the Reverse
variable in the rowData
is not "+"
(i.e. the PSM is not matched
to the decoy database).
scp <- filterFeatures(scp,
~ Reverse != "+" &
Potential.contaminant != "+" &
!is.na(PIF) & PIF > 0.8)
To avoid proceeding with failed runs, another interesting filter is to
remove assays with too few features. If a batch contains less than,
for example, 150 features we can then suspect something wrong happened
in that batch and it should be removed. Using dims
, we can query the
dimensions (hence the number of features and the number of samples) of
all assays contained in the dataset.
dims(scp)
#> 190222S_LCA9_X_FP94BM 190321S_LCA10_X_FP97AG 190321S_LCA10_X_FP97_blank_01
#> [1,] 283 318 60
#> [2,] 11 11 11
#> 190914S_LCB3_X_16plex_Set_21
#> [1,] 200
#> [2,] 16
Actually, a QFeatures
object can be seen as a three-order array:
\(features \times samples \times assay\). Hence, QFeatures
supports
three-order subsetting x[rows, columns, assays]
. We first select the
assays that have sufficient PSMs (the number of rows is greater than
150), and then subset the scp
object for the assays that meet the
criterion.
keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
#> Warning: 'experiments' dropped; see 'metadata'
#> harmonizing input:
#> removing 11 sampleMap rows not in names(experiments)
#> removing 11 colData rownames not in sampleMap 'primary'
scp
#> An instance of class QFeatures containing 3 assays:
#> [1] 190222S_LCA9_X_FP94BM: SingleCellExperiment with 283 rows and 11 columns
#> [2] 190321S_LCA10_X_FP97AG: SingleCellExperiment with 318 rows and 11 columns
#> [3] 190914S_LCB3_X_16plex_Set_21: SingleCellExperiment with 200 rows and 16 columns
Notice the 190321S_LCA10_X_FP97_blank_01
sample was removed because
it did not contain sufficient features, as expected from a blank
run. This could also have been the case for failed runs.
Another type of filtering is specific to SCP. In the SCoPE2 analysis, the authors suggest a filters based on the sample to carrier ratio (SCR), that is the reporter ion intensity of a single-cell sample divided by the reporter ion intensity of the carrier channel (200 cells) from the same batch. It is expected that the carrier intensities are much higher than the single-cell intensities.
The SCR can be computed using the computeSCR
function from
scp
. The function must be told which channels are the samples that
must be divided and which channel contains the carrier. This
information is provided in the sample metadata and is accessed using
the colData
, under the SampleType
field.
table(colData(scp)[, "SampleType"])
#>
#> Blank Carrier Macrophage Monocyte Reference Unused
#> 3 3 20 5 3 4
In this dataset, SampleType
gives the type of sample that is present
in each TMT channel. The SCoPE2 protocole includes 5 types of samples:
Carrier
) contain 200 cell equivalents and
are meant to boost the peptide identification rate.Reference
) contain 5 cell equivalents
and are used to partially correct for between-run variation.Unused
) are channels that are left empty due
to isotopic cross-contamination by the carrier channel.Blank
) contain samples that do not contain any
cell but are processed as single-cell samples.Macrophage
) or monocyte
(Monocyte
).The computeSCR
function expects the following input:
QFeatures
datasetcolvar
: the variable in the sample metadata (colData
) that hold
the information used to discriminate sample channels from carrier
channels.carrierPattern
: a string pattern (following regular expression
syntax) that identifies the carrier channel in each batch.samplePattern
: a string pattern (following regular expression
syntax) that identifies the samples to divide.Optionally, you can also provide the following arguments:
rowDataName
: the name of the column in the rowData
where to
store the computed SCR for each feature.sampleFUN
: when multiple samples are present in an assay, there
are as many SCR as there are samples that need to be summarized to
a single value per feature. sampleFUN
tells which function to use
for summarizing the sample values before computing the SCR; the
default is the mean
.carrierFUN
: some designs might include several carriers per run
(not the case in this example). Similarly to sampleFUN
,
carrierFUN
tells which function to use for summarizing the carrier
values before computing the SCR; the default is the same function as
sampleFUN
.The function creates a new field in the rowData
of the assays. We
compute the average SCR for each PSM and store it in the corresponding
rowData
, under the MeanSCR
column.
scp <- computeSCR(scp,
i = 1:3,
colvar = "SampleType",
carrierPattern = "Carrier",
samplePattern = "Macrophage|Monocyte",
sampleFUN = "mean",
rowDataName = "MeanSCR")
Before applying the filter, we plot the distribution of the average
SCR. We collect the rowData
from several assays in a single table
DataFrame
using the rbindRowData
function from QFeatures
.
rbindRowData(scp, i = 1:3) %>%
data.frame %>%
ggplot(aes(x = MeanSCR)) +
geom_histogram() +
geom_vline(xintercept = c(1/200, 0.1),
lty = c(2, 1)) +
scale_x_log10()
The expected ratio between single cells and the carrier is 1/200
(dashed line). We can see that the distribution mode is slightly
shifted towards higher ratios with a mode around 0.01. However, there
are a few PSMs that stand out of the distribution and have a much
higher signal than expected, indicating something wrong happened
during the quantification of those PSMs. We therefore filter out PSMs
with an average SCR higher than 0.1 (solide line). This is again easily
performed using the filterFeatures
functions.
scp <- filterFeatures(scp,
~ !is.na(MeanSCR) &
MeanSCR < 0.1)
Finally, we might also want to control for false discovery rate (FDR).
MaxQuant already computes posterior error probabilities (PEP), but
filtering on PEPs is too conservative (Käll et al. (2008)) so we provide the
pep2qvalue
function to convert PEPs to q-values that are directly
related to FDR. We here compute the q-values from the PEP (dart_PEP
)
across all 3 assays. dart_PEP
contains the PEP values that have been
updated using the DART-ID algorithm (Chen, Franks, and Slavov (2019)). The function will
store the results in the rowData
, we here asked to name the new
column qvalue_PSMs
.
scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
rowDataName = "qvalue_PSMs")
We also allow to compute q-values at peptide or protein level rather
than PSM. In this case, you need to supply the groupBy
argument.
Suppose we want to compute the q-values at protein level, we can fetch
the protein information stored under Leading.razor.protein
in the
rowData
. This time, we store the q-values in a new field called
qvalue_proteins
.
scp <- pep2qvalue(scp,
i = 1:3,
PEP = "dart_PEP",
groupBy = "Leading.razor.protein",
rowDataName = "qvalue_proteins")
We can now filter the PSM to control, let’s say, the protein FDR at
1%. This can be performed using filterFeatures
because the q-values
were stored in the rowData
.
scp <- filterFeatures(scp,
~ qvalue_proteins < 0.01)
In order to partialy correct for between-run variation, SCoPE2
suggests computing relative reporter ion intensities. This means that
intensities measured for single-cells are divided by the reference
channel containing 5-cell equivalents. We use the divideByReference
function that divides channels of interest by the reference
channel. Similarly to computeSCR
, we can point to the samples and
the reference columns in each assay using the annotation contained in
the colData
.
We here divide all columns (using the regular expression wildcard .
)
by the reference channel (Reference
).
scp <- divideByReference(scp,
i = 1:3,
colvar = "SampleType",
samplePattern = ".",
refPattern = "Reference")
Now that the PSM assays are processed, we can aggregate them to
peptides. This is performed using the aggregateFeaturesOverAssays
function. For each assay, the function aggregates several PSMs into a
unique peptide. This is best illustrated by the figure below.