MSstatsLiP Workflow: An example workflow and analysis of the MSstatsLiP package

Devon Kohler (kohler.d@northeastern.edu)

2022-04-28

MSstatsLiP Workflow Vignette

This Vignette provides an example workflow for how to use the package MSstatsLiP.

Installation

To install this package, start R (version “4.1”) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstatsLiP")
library(MSstatsLiP)
library(tidyverse)
library(data.table)

Workflow

1. Preprocessing

1.1 Raw Data Format

The first step is to load in the raw dataset for both the PTM and Protein datasets. This data can then be converted into MSstatsLiP format with one of the built in converters, or manually converted. In this case we use the converter for Spectronaut.

## Read in raw data files
head(LiPRawData)
#>    R.Condition    R.FileName R.Replicate PG.ProteinAccessions PG.ProteinGroups
#> 1:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#> 2:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#> 3:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#> 4:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#> 5:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#> 6:        Osmo 180423_IP1742      Osmo_1               P14164           P14164
#>    PG.Quantity PEP.GroupingKey PEP.StrippedSequence PEP.Quantity
#> 1:    105910.6         ILQNDLK              ILQNDLK     65116.91
#> 2:    105910.6         ILQNDLK              ILQNDLK     65116.91
#> 3:    105910.6         ILQNDLK              ILQNDLK     65116.91
#> 4:    105910.6         ILQNDLK              ILQNDLK     65116.91
#> 5:    105910.6         ILQNDLK              ILQNDLK     65116.91
#> 6:    105910.6         ILQNDLK              ILQNDLK     65116.91
#>    EG.iRTPredicted           EG.Library EG.ModifiedSequence EG.PrecursorId
#> 1:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#> 2:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#> 3:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#> 4:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#> 5:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#> 6:       -9.285238 180517_Sc_Sent_O_all           _ILQNDLK_    _ILQNDLK_.2
#>       EG.Qvalue FG.Charge       FG.Id FG.PrecMz FG.Quantity F.Charge F.FrgIon
#> 1: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y5
#> 2: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y6
#> 3: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y5
#> 4: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y4
#> 5: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y5
#> 6: 0.0001794077         2 _ILQNDLK_.2  422.2504    65116.91        1       y3
#>    F.FrgLossType  F.FrgMz F.FrgNum F.FrgType F.ExcludedFromQuantification
#> 1:        noloss 617.3253        5         y                         TRUE
#> 2:        noloss 730.4094        6         y                        FALSE
#> 3:           NH3 600.2988        5         y                        FALSE
#> 4:        noloss 489.2667        4         y                        FALSE
#> 5:           H2O 599.3148        5         y                         TRUE
#> 6:        noloss 375.2238        3         y                         TRUE
#>    F.NormalizedPeakArea F.NormalizedPeakHeight F.PeakArea F.PeakHeight
#> 1:             40994.32              275561.99   39189.54    263430.31
#> 2:             32552.17              267988.39   31119.05    256190.14
#> 3:             17712.62              124252.06   16932.82    118781.84
#> 4:             14852.12              101414.80   14198.25     96949.99
#> 5:             30117.24              257451.20   28791.32    246116.86
#> 6:             28752.17               69229.75   27486.35     66181.90
head(TrPRawData)
#>    R.Condition    R.FileName R.Replicate PG.ProteinAccessions PG.ProteinGroups
#> 1:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#> 2:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#> 3:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#> 4:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#> 5:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#> 6:       OsmoT 180423_IP1739     OsmoT_1               P14164           P14164
#>    PG.Quantity PEP.GroupingKey PEP.StrippedSequence PEP.Quantity
#> 1:    159438.5         ILQNDLK              ILQNDLK     125562.7
#> 2:    159438.5         ILQNDLK              ILQNDLK     125562.7
#> 3:    159438.5         ILQNDLK              ILQNDLK     125562.7
#> 4:    159438.5         ILQNDLK              ILQNDLK     125562.7
#> 5:    159438.5         ILQNDLK              ILQNDLK     125562.7
#> 6:    159438.5 HQEISSAGTSSNTTK      HQEISSAGTSSNTTK      56869.4
#>    EG.iRTPredicted             EG.Library EG.ModifiedSequence
#> 1:       -8.957758 180516_Sc_Sent_O_mixed           _ILQNDLK_
#> 2:       -8.957758 180516_Sc_Sent_O_mixed           _ILQNDLK_
#> 3:       -8.957758 180516_Sc_Sent_O_mixed           _ILQNDLK_
#> 4:       -8.957758 180516_Sc_Sent_O_mixed           _ILQNDLK_
#> 5:       -8.957758 180516_Sc_Sent_O_mixed           _ILQNDLK_
#> 6:      -46.080009 180516_Sc_Sent_O_mixed   _HQEISSAGTSSNTTK_
#>         EG.PrecursorId            EG.Qvalue FG.Charge               FG.Id
#> 1:         _ILQNDLK_.2 1.87107676352627E-08         2         _ILQNDLK_.2
#> 2:         _ILQNDLK_.2 1.87107676352627E-08         2         _ILQNDLK_.2
#> 3:         _ILQNDLK_.2 1.87107676352627E-08         2         _ILQNDLK_.2
#> 4:         _ILQNDLK_.2 1.87107676352627E-08         2         _ILQNDLK_.2
#> 5:         _ILQNDLK_.2 1.87107676352627E-08         2         _ILQNDLK_.2
#> 6: _HQEISSAGTSSNTTK_.3 5.94616028780965E-24         3 _HQEISSAGTSSNTTK_.3
#>    FG.PrecMz FG.Quantity F.Charge F.FrgIon F.FrgLossType  F.FrgMz F.FrgNum
#> 1:  422.2504   125562.66        1       y5        noloss 617.3253        5
#> 2:  422.2504   125562.66        1       y6        noloss 730.4094        6
#> 3:  422.2504   125562.66        1       y5           NH3 600.2988        5
#> 4:  422.2504   125562.66        1       y4        noloss 489.2667        4
#> 5:  422.2504   125562.66        1       y3        noloss 375.2238        3
#> 6:  516.5814    38460.38        1       y6        noloss 637.3151        6
#>    F.FrgType F.ExcludedFromQuantification F.NormalizedPeakArea
#> 1:         y                        FALSE             43032.81
#> 2:         y                        FALSE             45005.01
#> 3:         y                        FALSE             37524.85
#> 4:         y                         TRUE             19428.34
#> 5:         y                         TRUE             50290.83
#> 6:         y                         TRUE             44622.36
#>    F.NormalizedPeakHeight F.PeakArea F.PeakHeight
#> 1:              196684.09   39642.51    181188.52
#> 2:              216104.63   41459.34    199079.03
#> 3:              153040.68   34568.49    140983.52
#> 4:               70077.31   17897.70     64556.34
#> 5:              220417.00   46328.71    203051.66
#> 6:              241914.85   41287.37    223834.58

1.2 Converter


## Run converter
MSstatsLiP_data <- SpectronauttoMSstatsLiPFormat(LiPRawData, 
                                      "../inst/extdata/ExampleFastaFile.fasta", 
                                      TrPRawData, use_log_file = FALSE, 
                                      append = FALSE)
#> INFO  [2022-04-28 17:05:14] ** Raw data from Spectronaut imported successfully.
#> INFO  [2022-04-28 17:05:14] ** Raw data from Spectronaut cleaned successfully.
#> INFO  [2022-04-28 17:05:14] ** Using annotation extracted from quantification data.
#> INFO  [2022-04-28 17:05:14] ** Run labels were standardized to remove symbols such as '.' or '%'.
#> INFO  [2022-04-28 17:05:14] ** The following options are used:
#>   - Features will be defined by the columns: PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge
#>   - Shared peptides will be removed.
#>   - Proteins with single feature will not be removed.
#>   - Features with less than 3 measurements across runs will be removed.
#> WARN  [2022-04-28 17:05:14] ** PGQvalue not found in input columns.
#> INFO  [2022-04-28 17:05:14] ** Intensities with values not smaller than 0.01 in EGQvalue are replaced with 0
#> INFO  [2022-04-28 17:05:14] ** Features with all missing measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Shared peptides are removed.
#> INFO  [2022-04-28 17:05:14] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
#> INFO  [2022-04-28 17:05:14] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Run annotation merged with quantification data.
#> INFO  [2022-04-28 17:05:14] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Fractionation handled.
#> INFO  [2022-04-28 17:05:14] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2022-04-28 17:05:14] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.
#> INFO  [2022-04-28 17:05:14] ** Raw data from Spectronaut imported successfully.
#> INFO  [2022-04-28 17:05:14] ** Raw data from Spectronaut cleaned successfully.
#> INFO  [2022-04-28 17:05:14] ** Using annotation extracted from quantification data.
#> INFO  [2022-04-28 17:05:14] ** Run labels were standardized to remove symbols such as '.' or '%'.
#> INFO  [2022-04-28 17:05:14] ** The following options are used:
#>   - Features will be defined by the columns: PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge
#>   - Shared peptides will be removed.
#>   - Proteins with single feature will not be removed.
#>   - Features with less than 3 measurements across runs will be removed.
#> WARN  [2022-04-28 17:05:14] ** PGQvalue not found in input columns.
#> INFO  [2022-04-28 17:05:14] ** Intensities with values not smaller than 0.01 in EGQvalue are replaced with 0
#> INFO  [2022-04-28 17:05:14] ** Features with all missing measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Shared peptides are removed.
#> INFO  [2022-04-28 17:05:14] ** Multiple measurements in a feature and a run are summarized by summaryforMultipleRows: max
#> INFO  [2022-04-28 17:05:14] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Run annotation merged with quantification data.
#> INFO  [2022-04-28 17:05:14] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:14] ** Fractionation handled.
#> INFO  [2022-04-28 17:05:14] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2022-04-28 17:05:14] ** Finished preprocessing. The dataset is ready to be processed by the dataProcess function.

head(MSstatsLiP_data[["LiP"]])
#>    ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge
#> 1:      P14164         ILQNDLK               2          y4             1
#> 2:      P14164         ILQNDLK               2          y4             1
#> 3:      P14164         ILQNDLK               2          y4             1
#> 4:      P14164         ILQNDLK               2          y4             1
#> 5:      P14164         ILQNDLK               2          y4             1
#> 6:      P14164         ILQNDLK               2          y4             1
#>    IsotopeLabelType Condition BioReplicate           Run Fraction    Intensity
#> 1:                L      Osmo       Osmo_1 180423_IP1742        1 1.419825e+04
#> 2:                L      Osmo       Osmo_2 180423_IP1743        1 1.053235e+04
#> 3:                L      Osmo       Osmo_3 180423_IP1744        1 1.232257e+04
#> 4:                L   Control    Control_1 180423_IP1748        1 1.034930e+04
#> 5:                L   Control    Control_2 180423_IP1749        1 2.396684e+04
#> 6:                L   Control    Control_3 180423_IP1750        1 9.265785e+03
#>      FULL_PEPTIDE
#> 1: P14164_ILQNDLK
#> 2: P14164_ILQNDLK
#> 3: P14164_ILQNDLK
#> 4: P14164_ILQNDLK
#> 5: P14164_ILQNDLK
#> 6: P14164_ILQNDLK
head(MSstatsLiP_data[["TrP"]])
#>    ProteinName    PeptideSequence PrecursorCharge FragmentIon ProductCharge
#> 1:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#> 2:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#> 3:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#> 4:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#> 5:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#> 6:      P14164 GLDDESGPTHGNDSGNHR               4         y12             2
#>    IsotopeLabelType Condition BioReplicate             Run Fraction
#> 1:                L     OsmoT      OsmoT_1   180423_IP1739        1
#> 2:                L     OsmoT      OsmoT_2 180423_IP1740_2        1
#> 3:                L     OsmoT      OsmoT_3 180423_IP1741_2        1
#> 4:                L     CtrlT      CtrlT_1   180423_IP1745        1
#> 5:                L     CtrlT      CtrlT_2   180423_IP1746        1
#> 6:                L     CtrlT      CtrlT_3   180423_IP1747        1
#>       Intensity
#> 1: 7.497252e+02
#> 2:         <NA>
#> 3: 3.105397e+03
#> 4: 3.712693e+03
#> 5: 4.353941e+03
#> 6: 4.617424e+03

## Make conditions match
MSstatsLiP_data[["LiP"]][MSstatsLiP_data[["LiP"]]$Condition == "Control", 
                         "Condition"] = "Ctrl"
MSstatsLiP_data[["TrP"]]$Condition = substr(MSstatsLiP_data[["TrP"]]$Condition, 
  1, nchar(MSstatsLiP_data[["TrP"]]$Condition)-1)

In the case above the SpectronauttoMSstatsLiPFormat was ran to convert the data into the format required for MSstatsLiP. Note that the condition names did not match between the LiP and TrP datasets. Here we edit the conditions in each dataset to match.

Additionally, it is important that the column “FULL_PEPTIDE” in the LiP dataset contains both the Protein and Peptide information, seperated by an underscore. This allows us to summarize up to the peptide level, while keeping important information about which protein the peptide corresponds to.

2. Summarization

The next step in the MSstatsLiP workflow is to summarize feature intensities per run into one value using the dataSummarizationLiP function. This function takes as input the formatted data from the converter.

2.1 Summarization Function


MSstatsLiP_Summarized <- dataSummarizationLiP(MSstatsLiP_data,
                                              MBimpute = FALSE, 
                                              use_log_file = FALSE, 
                                              append = FALSE)
#> Starting PTM summarization...
#> INFO  [2022-04-28 17:05:15] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:15] ** Fractionation handled.
#> INFO  [2022-04-28 17:05:15] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2022-04-28 17:05:15] ** Use all features that the dataset originally has.
#> INFO  [2022-04-28 17:05:15] 
#>  # proteins: 14
#>  # peptides per protein: 1-2
#>  # features per peptide: 2-4
#> INFO  [2022-04-28 17:05:15] 
#>                     Ctrl Osmo
#>              # runs    3    3
#>     # bioreplicates    3    3
#>  # tech. replicates    1    1
#> INFO  [2022-04-28 17:05:15]  == Start the summarization per subplot...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |======================================================================| 100%
#> INFO  [2022-04-28 17:05:15]  == Summarization is done.
#> Starting Protein summarization...
#> INFO  [2022-04-28 17:05:15] ** Features with one or two measurements across runs are removed.
#> INFO  [2022-04-28 17:05:15] ** Fractionation handled.
#> INFO  [2022-04-28 17:05:15] ** Updated quantification data to make balanced design. Missing values are marked by NA
#> INFO  [2022-04-28 17:05:15] ** Use all features that the dataset originally has.
#> INFO  [2022-04-28 17:05:15] 
#>  # proteins: 13
#>  # peptides per protein: 1-4
#>  # features per peptide: 2-4
#> INFO  [2022-04-28 17:05:15] 
#>                     Ctrl Osmo
#>              # runs    3    3
#>     # bioreplicates    3    3
#>  # tech. replicates    1    1
#> INFO  [2022-04-28 17:05:15] Some features are completely missing in at least one condition:  
#>  SNEVNGNNDDDADANNIFK_2_b3_1,
#>  SNEVNGNNDDDADANNIFK_2_y6_1,
#>  KDDDTDFLK_3_y3_1,
#>  KDDDTDFLK_3_y4_1,
#>  SNDLLSGLTGSSQTR_2_b3_1 ...
#> INFO  [2022-04-28 17:05:15]  == Start the summarization per subplot...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |======================================================================| 100%
#> INFO  [2022-04-28 17:05:15]  == Summarization is done.
names(MSstatsLiP_Summarized[["LiP"]])
#> [1] "FeatureLevelData"  "ProteinLevelData"  "SummaryMethod"    
#> [4] "ModelQC"           "PredictBySurvival"

lip_protein_data <- MSstatsLiP_Summarized[["LiP"]]$ProteinLevelData
trp_protein_data <- MSstatsLiP_Summarized[["TrP"]]$ProteinLevelData

head(lip_protein_data)
#>      FULL_PEPTIDE RUN LogIntensities   originalRUN GROUP   SUBJECT
#> 1: P14164_ILQNDLK   1       15.67592 180423_IP1748  Ctrl Control_1
#> 2: P14164_ILQNDLK   2       15.71415 180423_IP1749  Ctrl Control_2
#> 3: P14164_ILQNDLK   3       15.50261 180423_IP1750  Ctrl Control_3
#> 4: P14164_ILQNDLK   4       14.35382 180423_IP1742  Osmo    Osmo_1
#> 5: P14164_ILQNDLK   5       14.98971 180423_IP1743  Osmo    Osmo_2
#> 6: P14164_ILQNDLK   6       14.36873 180423_IP1744  Osmo    Osmo_3
#>    TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing
#> 1:                      6                  2                 0         FALSE
#> 2:                      6                  2                 0         FALSE
#> 3:                      6                  2                 0         FALSE
#> 4:                      6                  2                 0         FALSE
#> 5:                      6                  2                 0         FALSE
#> 6:                      6                  2                 0         FALSE
#>    NumImputedFeature Protein
#> 1:                 0  P14164
#> 2:                 0  P14164
#> 3:                 0  P14164
#> 4:                 0  P14164
#> 5:                 0  P14164
#> 6:                 0  P14164
head(trp_protein_data)
#>   RUN Protein LogIntensities     originalRUN GROUP SUBJECT
#> 1   1  P14164       12.20158   180423_IP1745  Ctrl CtrlT_1
#> 2   2  P14164       12.00412   180423_IP1746  Ctrl CtrlT_2
#> 3   3  P14164       12.34676   180423_IP1747  Ctrl CtrlT_3
#> 4   4  P14164       11.58219   180423_IP1739  Osmo OsmoT_1
#> 5   6  P14164       11.78048 180423_IP1741_2  Osmo OsmoT_3
#> 6   3  P16622       12.85950   180423_IP1747  Ctrl CtrlT_3
#>   TotalGroupMeasurements NumMeasuredFeature MissingPercentage more50missing
#> 1                     15                  5               0.0         FALSE
#> 2                     15                  5               0.0         FALSE
#> 3                     15                  5               0.0         FALSE
#> 4                     15                  3               0.4         FALSE
#> 5                     15                  3               0.4         FALSE
#> 6                      9                  3               0.0         FALSE
#>   NumImputedFeature
#> 1                 0
#> 2                 0
#> 3                 0
#> 4                 0
#> 5                 0
#> 6                 0

Again the summarization function returns a list of two dataframes one each for LiP and TrP. Each LiP and TrP is also a list with additional summary information. This summarized data can be used as input into some of the plotting functions included in the package.

2.2 Tryptic barplot

MSstatsLiP has a wide variety of plotting functionality to analysis and assess the results of experiments. Here we plot the number of half vs fully tryptic peptides per replicate.

2.3 Run Correlation Plot

MSstatsLiP also provides a function to plot run correlation.

2.7 PCA Plot

In addition, Priciple Component Analysis can also be done on the summarized dataset. Three different PCA plots can be created one each for: Percent of explained variance per component, PC1 vs PC2 for peptides, and PC1 vs PC2 for conditions.

2.8 Calculate Trypticity

Finally, the trypticity of a peptide can also be calculated and added to any dataframe with the ProteinName and PeptideSequence column.


feature_data <- data.table::copy(MSstatsLiP_Summarized$LiP$FeatureLevelData)
data.table::setnames(feature_data, c("PEPTIDE", "PROTEIN"), 
                     c("PeptideSequence", "ProteinName"))
feature_data$PeptideSequence <- substr(feature_data$PeptideSequence, 1, 
                                       nchar(as.character(
                                         feature_data$PeptideSequence)) - 2)

calculateTrypticity(feature_data, "../inst/extdata/ExampleFastaFile.fasta")
#>     ProteinName       PeptideSequence fully_TRI NSEMI_TRI CSEMI_TRI CTERMINUS
#>  1:      P14164               ILQNDLK      TRUE     FALSE     FALSE     FALSE
#>  2:      P16622 SHLQSNQLYSNQLPLDFALGK      TRUE     FALSE     FALSE     FALSE
#>  3:      P17891    ALQLINQDDADIIGGRDR      TRUE     FALSE     FALSE     FALSE
#>  4:      P17891              DDDTDFLK      TRUE     FALSE     FALSE     FALSE
#>  5:      P24004            FIGASEQNIR      TRUE     FALSE     FALSE     FALSE
#>  6:      P36112       SNDLLSGLTGSSQTR      TRUE     FALSE     FALSE     FALSE
#>  7:      P38805               LGQTVGR      TRUE     FALSE     FALSE     FALSE
#>  8:      P46959        DIIGKPYGSQIAIR      TRUE     FALSE     FALSE     FALSE
#>  9:      P52893           SSSQGVEGIRK     FALSE     FALSE      TRUE     FALSE
#> 10:      P52911          TWITEDDFEQIK      TRUE     FALSE     FALSE     FALSE
#> 11:      P53235      ERQAVGDKLEDTQVLK      TRUE     FALSE     FALSE     FALSE
#> 12:      P53858       FLDNHEVDSIVSLER      TRUE     FALSE     FALSE     FALSE
#> 13:      Q02908            ISVISGVGVR      TRUE     FALSE     FALSE     FALSE
#> 14:      Q12248            EFQSVSDLWK      TRUE     FALSE     FALSE     FALSE
#>     NTERMINUS MISSED StartPos EndPos
#>  1:     FALSE  FALSE      358    365
#>  2:     FALSE  FALSE      354    375
#>  3:     FALSE   TRUE      196    214
#>  4:     FALSE  FALSE       21     29
#>  5:     FALSE  FALSE      770    780
#>  6:     FALSE  FALSE      102    117
#>  7:     FALSE  FALSE      205    212
#>  8:     FALSE  FALSE       49     63
#>  9:     FALSE   TRUE      221    232
#> 10:     FALSE  FALSE      117    129
#> 11:     FALSE   TRUE      607    623
#> 12:     FALSE  FALSE      491    506
#> 13:     FALSE  FALSE      528    538
#> 14:     FALSE  FALSE       57     67


MSstatsLiP_Summarized$LiP$FeatureLevelData%>%
  rename(PeptideSequence=PEPTIDE, ProteinName=PROTEIN)%>%
  mutate(PeptideSequence=substr(PeptideSequence, 1, 
                                nchar(as.character(PeptideSequence))-2)
         ) %>% calculateTrypticity("../inst/extdata/ExampleFastaFile.fasta")
#>     ProteinName       PeptideSequence fully_TRI NSEMI_TRI CSEMI_TRI CTERMINUS
#>  1:      P14164               ILQNDLK      TRUE     FALSE     FALSE     FALSE
#>  2:      P16622 SHLQSNQLYSNQLPLDFALGK      TRUE     FALSE     FALSE     FALSE
#>  3:      P17891    ALQLINQDDADIIGGRDR      TRUE     FALSE     FALSE     FALSE
#>  4:      P17891              DDDTDFLK      TRUE     FALSE     FALSE     FALSE
#>  5:      P24004            FIGASEQNIR      TRUE     FALSE     FALSE     FALSE
#>  6:      P36112       SNDLLSGLTGSSQTR      TRUE     FALSE     FALSE     FALSE
#>  7:      P38805               LGQTVGR      TRUE     FALSE     FALSE     FALSE
#>  8:      P46959        DIIGKPYGSQIAIR      TRUE     FALSE     FALSE     FALSE
#>  9:      P52893           SSSQGVEGIRK     FALSE     FALSE      TRUE     FALSE
#> 10:      P52911          TWITEDDFEQIK      TRUE     FALSE     FALSE     FALSE
#> 11:      P53235      ERQAVGDKLEDTQVLK      TRUE     FALSE     FALSE     FALSE
#> 12:      P53858       FLDNHEVDSIVSLER      TRUE     FALSE     FALSE     FALSE
#> 13:      Q02908            ISVISGVGVR      TRUE     FALSE     FALSE     FALSE
#> 14:      Q12248            EFQSVSDLWK      TRUE     FALSE     FALSE     FALSE
#>     NTERMINUS MISSED StartPos EndPos
#>  1:     FALSE  FALSE      358    365
#>  2:     FALSE  FALSE      354    375
#>  3:     FALSE   TRUE      196    214
#>  4:     FALSE  FALSE       21     29
#>  5:     FALSE  FALSE      770    780
#>  6:     FALSE  FALSE      102    117
#>  7:     FALSE  FALSE      205    212
#>  8:     FALSE  FALSE       49     63
#>  9:     FALSE   TRUE      221    232
#> 10:     FALSE  FALSE      117    129
#> 11:     FALSE   TRUE      607    623
#> 12:     FALSE  FALSE      491    506
#> 13:     FALSE  FALSE      528    538
#> 14:     FALSE  FALSE       57     67

3. Modeling

The modeling function groupComparisonLiP takes as input the output of the summarization function dataSummarizationLiP.

3.1 Function


MSstatsLiP_model <- groupComparisonLiP(MSstatsLiP_Summarized,
                               fasta = "../inst/extdata/ExampleFastaFile.fasta",
                               use_log_file = FALSE, 
                               append = FALSE)
#> Starting PTM modeling...
#> INFO  [2022-04-28 17:05:21]  == Start to test and get inference in whole plot ...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===============                                                       |  21%
  |                                                                            
  |====================                                                  |  29%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |==============================                                        |  43%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |========================================                              |  57%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==================================================                    |  71%
  |                                                                            
  |=======================================================               |  79%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |=================================================================     |  93%
  |                                                                            
  |======================================================================| 100%
#> INFO  [2022-04-28 17:05:21]  == Comparisons for all proteins are done.
#> Starting Protein modeling...
#> INFO  [2022-04-28 17:05:21]  == Start to test and get inference in whole plot ...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=====                                                                 |   8%
  |                                                                            
  |===========                                                           |  15%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |======================                                                |  31%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |======================================================                |  77%
  |                                                                            
  |===========================================================           |  85%
  |                                                                            
  |=================================================================     |  92%
  |                                                                            
  |======================================================================| 100%
#> INFO  [2022-04-28 17:05:21]  == Comparisons for all proteins are done.
#> Starting adjustment...
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 1 has 0 rows but longest item has 1; filled with NA
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 2 has 0 rows but longest item has 1; filled with NA
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 3 has 0 rows but longest item has 1; filled with NA
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 1 has 0 rows but longest item has 1; filled with NA
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 2 has 0 rows but longest item has 1; filled with NA
#> Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
#> check.names, : Item 3 has 0 rows but longest item has 1; filled with NA

lip_model <- MSstatsLiP_model[["LiP.Model"]]
trp_model <- MSstatsLiP_model[["TrP.Model"]]
adj_lip_model <- MSstatsLiP_model[["Adjusted.LiP.Model"]]

head(lip_model)
#>    ProteinName       PeptideSequence                 FULL_PEPTIDE        Label
#> 1:      P14164               ILQNDLK               P14164_ILQNDLK Ctrl vs Osmo
#> 2:      P16622 SHLQSNQLYSNQLPLDFALGK P16622_SHLQSNQLYSNQLPLDFALGK Ctrl vs Osmo
#> 3:      P17891    ALQLINQDDADIIGGRDR    P17891_ALQLINQDDADIIGGRDR Ctrl vs Osmo
#> 4:      P17891              DDDTDFLK              P17891_DDDTDFLK Ctrl vs Osmo
#> 5:      P24004            FIGASEQNIR            P24004_FIGASEQNIR Ctrl vs Osmo
#> 6:      P36112       SNDLLSGLTGSSQTR       P36112_SNDLLSGLTGSSQTR Ctrl vs Osmo
#>        log2FC          SE      Tvalue DF      pvalue adj.pvalue issue
#> 1:  1.0601391 0.219397767   4.8320413  4 0.008448744 0.03942747    NA
#> 2:  0.1655540 0.277961791   0.5955998  3 0.593383886 0.71358286    NA
#> 3:  0.5436687 0.296272579   1.8350288  4 0.140405767 0.28081153    NA
#> 4: -0.3294106 0.173801944  -1.8953216  4 0.130943731 0.28081153    NA
#> 5:  1.7019345 0.006157026 276.4215332  1 0.002303066 0.03224292    NA
#> 6: -0.1584000 0.601777383  -0.2632203  1 0.836145500 0.85751951    NA
#>    MissingPercentage ImputationPercentage fully_TRI NSEMI_TRI CSEMI_TRI
#> 1:         0.0000000                    0      TRUE     FALSE     FALSE
#> 2:         0.1666667                    0      TRUE     FALSE     FALSE
#> 3:         0.0000000                    0      TRUE     FALSE     FALSE
#> 4:         0.0000000                    0      TRUE     FALSE     FALSE
#> 5:         0.5000000                    0      TRUE     FALSE     FALSE
#> 6:         0.5000000                    0      TRUE     FALSE     FALSE
#>    CTERMINUS NTERMINUS MISSED StartPos EndPos
#> 1:     FALSE     FALSE  FALSE      358    365
#> 2:     FALSE     FALSE  FALSE      354    375
#> 3:     FALSE     FALSE   TRUE      196    214
#> 4:     FALSE     FALSE  FALSE       21     29
#> 5:     FALSE     FALSE  FALSE      770    780
#> 6:     FALSE     FALSE  FALSE      102    117
head(trp_model)
#>    Protein        Label      log2FC         SE    Tvalue DF     pvalue
#> 1:  P14164 Ctrl vs Osmo  0.50281634 0.14796419  3.398230  3 0.04251661
#> 2:  P16622 Ctrl vs Osmo -0.48689817 0.38376886 -1.268728  1 0.42494298
#> 3:  P17891 Ctrl vs Osmo  1.15384384 0.52228878  2.209207  4 0.09170709
#> 4:  P24004 Ctrl vs Osmo -0.09463078 0.03735157 -2.533516  1 0.23932853
#> 5:  P36112 Ctrl vs Osmo -0.85301975 0.23381706 -3.648236  4 0.02180540
#> 6:  P38805 Ctrl vs Osmo  0.40570919 0.25707606  1.578168  4 0.18966680
#>    adj.pvalue issue MissingPercentage ImputationPercentage
#> 1: 0.09211932    NA         0.3000000                    0
#> 2: 0.48236810    NA         0.5000000                    0
#> 3: 0.17031317    NA         0.3194444                    0
#> 4: 0.34569677    NA         0.5000000                    0
#> 5: 0.06993334    NA         0.4393939                    0
#> 6: 0.30820855    NA         0.2500000                    0
head(adj_lip_model)
#>                    FULL_PEPTIDE        Label ProteinName       PeptideSequence
#> 1:               P14164_ILQNDLK Ctrl vs Osmo      P14164               ILQNDLK
#> 2: P16622_SHLQSNQLYSNQLPLDFALGK Ctrl vs Osmo      P16622 SHLQSNQLYSNQLPLDFALGK
#> 3:    P17891_ALQLINQDDADIIGGRDR Ctrl vs Osmo      P17891    ALQLINQDDADIIGGRDR
#> 4:              P17891_DDDTDFLK Ctrl vs Osmo      P17891              DDDTDFLK
#> 5:            P24004_FIGASEQNIR Ctrl vs Osmo      P24004            FIGASEQNIR
#> 6:       P36112_SNDLLSGLTGSSQTR Ctrl vs Osmo      P36112       SNDLLSGLTGSSQTR
#>        log2FC         SE    Tvalue       DF     pvalue adj.pvalue fully_TRI
#> 1:  0.5573227 0.26462952  2.106049 6.635790 0.07538247 0.15076494      TRUE
#> 2:  0.6524522 0.47385789  1.376894 2.129099 0.29543338 0.51700842      TRUE
#> 3: -0.6101751 0.60046899 -1.016164 6.332717 0.34679119 0.53945296      TRUE
#> 4: -1.4832544 0.55044772 -2.694633 4.875155 0.04419616 0.12374924      TRUE
#> 5:  1.7965653 0.03785563 47.458342 1.054304 0.01100135 0.04872612      TRUE
#> 6:  0.6946197 0.64560548  1.075920 1.317219 0.44032063 0.54772318      TRUE
#>    NSEMI_TRI CSEMI_TRI CTERMINUS NTERMINUS MISSED StartPos EndPos issue
#> 1:     FALSE     FALSE     FALSE     FALSE  FALSE      358    365    NA
#> 2:     FALSE     FALSE     FALSE     FALSE  FALSE      354    375    NA
#> 3:     FALSE     FALSE     FALSE     FALSE   TRUE      196    214    NA
#> 4:     FALSE     FALSE     FALSE     FALSE  FALSE       21     29    NA
#> 5:     FALSE     FALSE     FALSE     FALSE  FALSE      770    780    NA
#> 6:     FALSE     FALSE     FALSE     FALSE  FALSE      102    117    NA

## Number of significant adjusted lip peptides
adj_lip_model %>% filter(adj.pvalue < .05) %>% nrow()
#> [1] 4

The groupComparisonLiP function outputs a list with three separate models. These models are as follows: LiP model, TrP model, and adjusted LiP model.

3.4 Barcode

Here we show a barcode plot, showing the coverage of LiP and TrP peptides. This function requires the data in MSstatsLiP format and the path to a fasta file.