OmniPath Bioconductor workshop

Denes Turei*, Alberto Valdeolivas, Attila Gabor and Julio Saez-Rodriguez1

1Institute for Computational Biomedicine, Heidelberg University

16 October 2022

Abstract

OmniPath is a database of molecular signaling knowledge, combining data from more than 100 resources. It contains protein-protein and gene regulatory interactions, enzyme-PTM relationships, protein complexes, annotations about protein function, structure, localization and intercellular communication. OmniPath focuses on networks with directed interactions and effect signs (activation or inhibition) which are suitable inputs for many modeling techniques. OmniPath also features a large collection of proteins’ intercellular communication roles and interactions. OmniPath is distributed by a web service at https://omnipathdb.org/. The Bioconductor package OmnipathR is an R client with full support for all features of the OmniPath web server. Apart from OmniPath, it provides direct access to more than 15 further signaling databases (such as BioPlex, InBioMap, EVEX, Harmonizome, etc) and contains a number of convenience methods, such as igraph integration, and a close integration with the NicheNet pipeline for ligand activity prediction from transcriptomics data. In this demo we show the diverse data in OmniPath and the versatile and convenient ways to access this data by OmnipathR.

Package

OmnipathR 3.4.7

1 Introduction

Database knowledge is essential for omics data analysis and modeling. Despite being an important factor, contributing to the outcome of studies, often subject to little attention. With OmniPath our aim is to raise awarness of the diversity of available resources and facilitate access to these resources in a uniform and transparent way. OmniPath has been developed in a close contact to mechanistic modeling applications and functional omics analysis, hence it is especially suitable for these fields. OmniPath has been used for the analysis of various omics data. In the Saez-Rodriguez group we often use it in a pipeline with our footprint based methods DoRothEA and PROGENy and our causal reasoning method CARNIVAL to infer signaling mechanisms from transcriptomics data.

One recent novelty of OmniPath is a collection of intercellular communication interactions. Apart from simply merging data from existing resources, OmniPath defines a number of intercellular communication roles, such as ligand, receptor, adhesion, enzyme, matrix, etc, and generalizes the terms ligand and receptor by introducing the terms transmitter, receiver and mediator. This unique knowledge base is especially adequate for the emerging field of cell-cell communication analysis, typically from single cell transcriptomics, but also from other kinds of data.

2 Overview

2.1 Pre-requisites

No special pre-requisites apart from basic knowledge of R. OmniPath, the database resource in the focus of this workshop has been published in [1,2], however you don’t need to know anything about OmniPath to benefit from the workshop. In the workshop we will demonstrate the R/Bioconductor package OmnipathR. If you would like to try the examples yourself we recommend to install the latest version of the package before the workshop:

library(devtools)
install_github('saezlab/OmnipathR')

2.2 Participation

In the workshop we will present the design and some important features of the OmniPath database, so can be confident you get the most out of it. Then we will demonstrate further useful features of the OmnipathR package, such as accessing other resources, building graphs. Participants are encouraged to experiment with the examples and shape the contents of the workshop by asking questions. We are happy to recieve questions and topic suggestions by email also before the workshop. These could help us to adjust the contents to the interests of the participants.

2.3 R / Bioconductor packages used

OmnipathR
igraph
dplyr

2.4 Time outline

Total: 45 minutes

Activity	Time
OmniPath database overview	5m
Network datasets	10m
Other OmniPath databases	5m
Intercellular communication	10m
Igraph integration	5m
Further resources	10m

2.5 Workshop goals and objectives

In this workshop you will get familiar with the design and features of the OmniPath databases. For example, to know some important details about the datasets and parameters which help you to query the database the most suitable way according to your purposes. You will also learn about functionalities of the OmnipathR package which might make your work easier.

2.6 Learning goals

Learn about the OmniPath database, its contents and how it can be useful
Get a picture about the OmnipathR package capabilities
Learn about the datasets and parameters of various OmniPath query types

2.7 Learning objectives

Try examples of each OmniPath query type with various parameters
Build igraph networks, search for paths
Access some further interesting resources

3 Workshop

library(OmnipathR)

3.1 Data from OmniPath

OmniPath consists of five major databases, each combining many original resources. The five databases are:

Network (interactions)
Enzyme-substrate relationships (enzsub)
Protein complexes (complexes)
Annotations (annotations)
Intercellular communication roles (intercell)

The parameters for each database (query type) are available in the web service, for example: https://omnipathdb.org/queries/interactions. The R package supports all features of the web service and the parameter names and values usually correspond to the web service parameters which you would use in a HTTP query string.

3.1.1 Networks

The network database contains protein-protein, gene regulatory and miRNA-mRNA interactions. Soon more interaction types will be added. Some of these categories can be further divided into datasets which are defined by the type of evidences. A full list of network datasets:

Protein-protein interactions (post_translational) import_post_translational_interactions
- omnipath: literature curated, directed interactions with effect signs; corresponds to the first edition of OmniPath, hence the confusing name is due to historical reasons import_omnipath_interactions
- pathwayextra: directed and signed interactions, without literature references (might be literature curated, but references are not available) import_pathwayextra_interactions
- kinaseextra: enzyme-PTM interactions without literature references import_kinaseextra_interactions
- ligrecextra: ligand-receptor interactions without literature references import_ligrecextra_interactions
Gene regulatory interactions (transcriptional) import_transcriptional_interactions
- dorothea: a comprehensive collection built out of 18 resources, contains literature curated, ChIP-Seq, gene expression derived and TF binding site predicted data, with 5 confidence levels (A-E) import_dorothea_interactions
- tf_target: additional literature curated interactions import_tf_target_interactions
miRNA interactions (post_transcriptional and mirna_transcriptional)
- mirnatarget: literature curated miRNA-mRNA interactions import_mirnatarget_interactions
- tf_mirna: literature curated TF-miRNA interactions (transcriptional regulations of miRNA) import_tf_mirna_interactions
lncRNA interactions (lncrna_post_transcriptional)
- lncrna_mrna: literature curated lncRNA-mRNA interactions import_lncrna_mrna_interactions
Small molecule-protein interactions (small_molecule_protein)
- small_molecule: metabolites, intrinsic ligands or drug compounds targeting human proteins import_small_molecule_protein_interactions

Not individual interactions but resource are classified into the datasets above, so these can overlap. Each interaction type and dataset has its dedicated function in OmnipathR, above we provide links to their help pages. As an example, let’s see the gene regulatory interactions:

gri <- import_transcriptional_interactions()
gri

## # A tibble: 74,513 × 16
##    source target source_gene…¹ targe…² is_di…³ is_st…⁴ is_in…⁵ conse…⁶ conse…⁷ conse…⁸ sources refer…⁹ curat…˟
##    <chr>  <chr>  <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
##  1 P35869 Q9ULH1 AHR           ASAP1         1       0       0       0       0       0 ARACNe… <NA>          0
##  2 P35869 P04798 AHR           CYP1A1        1       1       0       1       1       0 DoRoth… HTRIdb…       6
##  3 P35869 P05177 AHR           CYP1A2        1       0       0       0       0       0 DoRoth… HTRIdb…       1
##  4 P35869 Q16678 AHR           CYP1B1        1       1       0       1       1       0 DoRoth… HTRIdb…       3
##  5 P35869 P11308 AHR           ERG           1       0       0       0       0       0 ARACNe… <NA>          0
##  6 P35869 P01100 AHR           FOS           1       0       0       0       0       0 DoRoth… DoRoth…       2
##  7 P35869 P01106 AHR           MYC           1       0       0       0       0       0 FANTOM… <NA>          0
##  8 P35869 Q07889 AHR           SOS1          1       0       0       0       0       0 DoRoth… DoRoth…       2
##  9 P35869 P19224 AHR           UGT1A6        1       0       0       0       0       0 DoRoth… DoRoth…       2
## 10 P35869 Q14135 AHR           VGLL4         1       0       0       0       0       0 ARACNe… <NA>          0
## # … with 74,503 more rows, 3 more variables: dorothea_level <chr>, n_references <chr>, n_resources <int>, and
## #   abbreviated variable names ¹source_genesymbol, ²target_genesymbol, ³is_directed, ⁴is_stimulation,
## #   ⁵is_inhibition, ⁶consensus_direction, ⁷consensus_stimulation, ⁸consensus_inhibition, ⁹references,
## #   ˟curation_effort

The interaction data frame contains the UniProt IDs and Gene Symbols of the interacting partners, the list of resources and references (PubMed IDs) for each interaction, and whether the interaction is directed, stimulatory or inhibitory.

3.1.1.1 Igraph integration

The network data frames can be converted to igraph graph objects, so you can make use of the graph and visualization methods of igraph:

gr_graph <- interaction_graph(gri)
gr_graph

## IGRAPH 465088b DN-- 15084 74513 -- 
## + attr: name (v/c), up_ids (v/c), is_directed (e/n), is_stimulation (e/n), is_inhibition (e/n),
## | consensus_direction (e/n), consensus_stimulation (e/n), consensus_inhibition (e/n), sources
## | (e/x), references (e/x), curation_effort (e/n), dorothea_level (e/c), n_references (e/c),
## | n_resources (e/n)
## + edges from 465088b (vertex names):
##  [1] AHR->ASAP1    AHR->CYP1A1   AHR->CYP1A2   AHR->CYP1B1   AHR->ERG      AHR->FOS      AHR->MYC     
##  [8] AHR->SOS1     AHR->UGT1A6   AHR->VGLL4    AR ->ABCC4    AR ->ABCE1    AR ->ABHD2    AR ->ABLIM1  
## [15] AR ->ACAD10   AR ->ACOXL    AR ->ACP3     AR ->ACSL1    AR ->ACTA1    AR ->ADAMTS4  AR ->ADAMTSL1
## [22] AR ->ADGRG6   AR ->ADGRV1   AR ->ADIPOR1  AR ->AFDN     AR ->AFF1     AR ->AFF3     AR ->AGAP1   
## [29] AR ->AHSG     AR ->AKAP13   AR ->AKAP6    AR ->AKAP7    AR ->AKAP7    AR ->AKR1B1   AR ->AKR1C3  
## + ... omitted several edges

On this network we can use OmnipathR’s find_all_paths function, which is able to look up all paths up to a certain length between two set of nodes:

paths <- find_all_paths(
    graph = gr_graph,
    start = c('EGFR', 'STAT3'),
    end = c('AKT1', 'ULK1'),
    attr = 'name'
)

As this is a gene regulatory network, the paths are TFs regulating the transcription of other TFs.

3.1.2 Enzyme-substrate relationships

Enzyme-substrate interactions are also available also in the interactions query, but the enzyme-substrate query type provides additional information about the PTM types and residues.

enz_sub <- import_omnipath_enzsub()
enz_sub

## # A tibble: 42,248 × 12
##    enzyme substrate enzyme_genesymbol substr…¹ resid…² resid…³ modif…⁴ sources refer…⁵ curat…⁶ n_ref…⁷ n_res…⁸
##    <chr>  <chr>     <chr>             <chr>    <chr>     <dbl> <chr>   <chr>   <chr>     <dbl> <chr>     <int>
##  1 P06239 O14543    LCK               SOCS3    Y           204 phosph… KEA;MI… KEA:12…       2 KEA:12…       6
##  2 P06239 O14543    LCK               SOCS3    Y           221 phosph… KEA;MI… KEA:12…       3 KEA:12…       6
##  3 P12931 O14746    SRC               TERT     Y           707 phosph… BEL-La… HPRD:1…       8 HPRD:1…       8
##  4 P06241 O15117    FYN               FYB1     Y           651 phosph… HPRD;K… HPRD:1…       7 HPRD:1…       6
##  5 P06241 O15117    FYN               FYB1     Y           595 phosph… HPRD;K… HPRD:1…       7 HPRD:1…       7
##  6 P06241 O15117    FYN               FYB1     Y           697 phosph… HPRD;K… HPRD:1…       3 HPRD:1…       3
##  7 P06241 O15117    FYN               FYB1     Y           625 phosph… Phosph… <NA>          0 0             2
##  8 P06241 O15117    FYN               FYB1     Y           571 phosph… Phosph… <NA>          0 0             2
##  9 P06241 O15117    FYN               FYB1     Y           771 phosph… Phosph… <NA>          0 0             2
## 10 P06241 O15117    FYN               FYB1     Y           559 phosph… Phosph… <NA>          0 0             2
## # … with 42,238 more rows, and abbreviated variable names ¹substrate_genesymbol, ²residue_type,
## #   ³residue_offset, ⁴modification, ⁵references, ⁶curation_effort, ⁷n_references, ⁸n_resources

This data frame also can be converted to an igraph object:

es_graph <- enzsub_graph(enz_sub)
es_graph

## IGRAPH 281df32 DN-- 4700 42248 -- 
## + attr: name (v/c), up_ids (v/c), residue_type (e/c), residue_offset (e/n), modification (e/c),
## | sources (e/x), references (e/x), curation_effort (e/n), n_references (e/c), n_resources (e/n)
## + edges from 281df32 (vertex names):
##  [1] LCK  ->SOCS3  LCK  ->SOCS3  SRC  ->TERT   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1  
##  [8] FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1  
## [15] FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1   FYN  ->FYB1  
## [22] ABL1 ->PLSCR1 ABL1 ->PLSCR1 SRC  ->PLSCR1 SRC  ->PLSCR1 ABL1 ->TP73   CDK2 ->TP73   CHEK1->TP73  
## [29] AURKB->BIRC5  AURKB->BIRC5  AURKB->BIRC5  CDK1 ->BIRC5  PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1 
## [36] PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1  PDPK1->PDPK1 
## [43] PDPK1->PDPK1  PDPK1->PDPK1  SRC  ->PDPK1  SRC  ->PDPK1  SRC  ->PDPK1  SRC  ->PDPK1  SRC  ->PDPK1 
## + ... omitted several edges

It is also possible to add effect signs (stimulatory or inhibitory) to enzyme-PTM relationships:

es_signed <- get_signed_ptms(enz_sub)

3.1.3 Protein complexes

cplx <- import_omnipath_complexes()
cplx

## # A tibble: 32,761 × 7
##    name         components                                components_genesym…¹ stoic…² sources refer…³ ident…⁴
##    <chr>        <chr>                                     <chr>                <chr>   <chr>   <chr>   <chr>  
##  1 NFY          P23511_P25208_Q13952                      NFYA_NFYB_NFYC       1:1:1   CORUM;… 147552… CORUM:…
##  2 mTORC2       P42345_P68104_P85299_Q6R327_Q8TB45_Q9BVC4 DEPTOR_EEF1A1_MLST8… 0:0:0:… SIGNOR  <NA>    SIGNOR…
##  3 mTORC1       P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4        AKT1S1_DEPTOR_MLST8… 0:0:0:… SIGNOR  <NA>    SIGNOR…
##  4 SCF-betaTRCP P63208_Q13616_Q9Y297                      BTRC_CUL1_SKP1       1:1:1   CORUM;… 9990852 CORUM:…
##  5 CBP/p300     Q09472_Q92793                             CREBBP_EP300         0:0     SIGNOR  <NA>    SIGNOR…
##  6 P300/PCAF    Q09472_Q92793_Q92831                      CREBBP_EP300_KAT2B   0:0:0   SIGNOR  <NA>    SIGNOR…
##  7 SMAD2/SMAD4  Q13485_Q15796                             SMAD2_SMAD4          1:2     Comple… 406541… PDB:1u…
##  8 SMAD3/SMAD4  P84022_Q13485                             SMAD3_SMAD4          2:1     Comple… 406541… PDB:1U…
##  9 SMAD4/JUN    P05412_Q13485                             JUN_SMAD4            0:0     SIGNOR  <NA>    SIGNOR…
## 10 SMAD2/SMURF2 Q15796_Q9HAU4                             SMAD2_SMURF2         1:1     Comple… 113894… Comple…
## # … with 32,751 more rows, and abbreviated variable names ¹components_genesymbols, ²stoichiometry,
## #   ³references, ⁴identifiers

The resulted data frame provides the constitution and stoichiometry of protein complexes, with references.

3.1.4 Annotations

The annotations query type includes a diverse set of resources (about 60 of them), about protein function, localization, structure and expression. For most use cases it is better to convert the data into wide data frames, as these correspond to the original format of the resources. If you load more than one resources into wide data frames, the result will be a list of data frames, otherwise one data frame. See a few examples with localization data from UniProt, tissue expression from Human Protein Atlas and pathway information from SignaLink:

uniprot_loc <- import_omnipath_annotations(
    resources = 'UniProt_location',
    wide = TRUE
)
uniprot_loc

## # A tibble: 65,264 × 5
##    uniprot genesymbol entity_type location                          features                           
##    <chr>   <chr>      <chr>       <chr>                             <chr>                              
##  1 P51451  BLK        protein     Cell membrane                     Lipid-anchor                       
##  2 A6H8Y1  BDP1       protein     Nucleus                           <NA>                               
##  3 O60885  BRD4       protein     Chromosome                        <NA>                               
##  4 O60885  BRD4       protein     Nucleus                           <NA>                               
##  5 P22223  CDH3       protein     Cell membrane                     Single-pass type I membrane protein
##  6 Q9BXJ4  C1QTNF3    protein     Secreted                          <NA>                               
##  7 Q9ULX7  CA14       protein     Membrane                          Single-pass type I membrane protein
##  8 Q01518  CAP1       protein     Cell membrane                     Peripheral membrane protein        
##  9 Q9BUK0  CHCHD7     protein     Mitochondrion intermembrane space <NA>                               
## 10 Q9P2X3  IMPACT     protein     Cytoplasm                         <NA>                               
## # … with 65,254 more rows

The entity_type field can be protein, mirna or complex. Protein complexes mostly annotated based on the consensus of their members, we should be aware that this is an in silico inference.

In case of spelling mistake either in parameter names or values OmnipathR either corrects the mistake or gives a warning or error:

uniprot_loc <- import_omnipath_annotations(
    resources = 'Uniprot_location',
    wide = TRUE
)

## Warning in omnipath_check_param(param): The following resources are not available: Uniprot_location. Check the
## resource names for spelling mistakes.

Above the name of the resource is wrong. If the parameter name is wrong, it throws an error:

uniprot_loc <- import_omnipath_annotations(
    resuorces = 'UniProt_location',
    wide = TRUE
)

## Error in import_omnipath_annotations(resuorces = "UniProt_location", wide = TRUE): Downloading the entire annotations database is not allowed by default because of its huge size (>1GB). If you really want to do that, you find static files at https://archive.omnipathdb.org/. However we recommend to query a set of proteins or a few resources, depending on your interest.

Singular vs. plural forms and a few synonyms are automatically corrected:

uniprot_loc <- import_omnipath_annotations(
    resource = 'UniProt_location',
    wide = TRUE
)

Another example with tissue expression from Human Protein Atlas:

hpa_tissue <- import_omnipath_annotations(
    resources = 'HPA_tissue',
    wide = TRUE,
    # Limiting to a handful of proteins for a faster vignette build:
    proteins = c('DLL1', 'MEIS2', 'PHOX2A', 'BACH1', 'KLF11', 'FOXO3', 'MEFV')
)
hpa_tissue

## # A tibble: 529 × 15
##    uniprot genesymbol entity…¹ organ tissue level n_not…² n_low n_med…³ n_high progn…⁴ favou…⁵   score patho…⁶
##    <chr>   <chr>      <chr>    <chr> <chr>  <chr>   <dbl> <dbl>   <dbl>  <dbl> <lgl>   <lgl>     <dbl> <lgl>  
##  1 O43524  FOXO3      protein  colo… color… Medi…       0     2       9      0 FALSE   TRUE     0.0428 TRUE   
##  2 O43524  FOXO3      protein  skin… skin … Medi…       0     2       9      1 NA      NA      NA      TRUE   
##  3 O43524  FOXO3      protein  cere… glial… Not …      NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  4 O43524  FOXO3      protein  endo… cells… Medi…      NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  5 O43524  FOXO3      protein  fall… gland… High       NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  6 O43524  FOXO3      protein  skin… hair … Medi…      NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  7 O43524  FOXO3      protein  skin… cells… Not …      NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  8 O43524  FOXO3      protein  bone… hemat… Low        NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
##  9 O43524  FOXO3      protein  carc… carci… Medi…       0     2       2      0 NA      NA      NA      TRUE   
## 10 O43524  FOXO3      protein  skin… cells… Not …      NA    NA      NA     NA FALSE   FALSE   NA      FALSE  
## # … with 519 more rows, 1 more variable: status <chr>, and abbreviated variable names ¹entity_type,
## #   ²n_not_detected, ³n_medium, ⁴prognostic, ⁵favourable, ⁶pathology

And pathway annotations from SignaLink:

slk_pathw <- import_omnipath_annotations(
    resources = 'SignaLink_pathway',
    wide = TRUE
)
slk_pathw

## # A tibble: 2,451 × 4
##    uniprot genesymbol entity_type pathway                 
##    <chr>   <chr>      <chr>       <chr>                   
##  1 P20963  CD247      protein     T-cell receptor         
##  2 P43403  ZAP70      protein     Receptor tyrosine kinase
##  3 P43403  ZAP70      protein     T-cell receptor         
##  4 Q9NYJ8  TAB2       protein     JAK/STAT                
##  5 Q9NYJ8  TAB2       protein     Receptor tyrosine kinase
##  6 Q9NYJ8  TAB2       protein     Toll-like receptor      
##  7 Q9NYJ8  TAB2       protein     Innate immune pathways  
##  8 O43318  MAP3K7     protein     Toll-like receptor      
##  9 O43318  MAP3K7     protein     TGF                     
## 10 O43318  MAP3K7     protein     JAK/STAT                
## # … with 2,441 more rows

3.1.4.1 Combining networks with annotations

Annotations can be easily added to network data frames, in this case both the source and target nodes will have their annotation data. This function accepts either the name of an annotation resource or an annotation data frame:

network <- import_omnipath_interactions()

network_slk_pw <- annotated_network(network, 'SignaLink_pathway')
network_slk_pw

## # A tibble: 82,713 × 17
##    source target source_gene…¹ targe…² is_di…³ is_st…⁴ is_in…⁵ conse…⁶ conse…⁷ conse…⁸ sources refer…⁹ curat…˟
##    <chr>  <chr>  <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
##  1 P0DP25 P48995 CALM3         TRPC1         1       0       1       1       0       1 TRIP    TRIP:1…       3
##  2 P0DP23 P48995 CALM1         TRPC1         1       0       1       1       0       1 TRIP    TRIP:1…       3
##  3 P0DP24 P48995 CALM2         TRPC1         1       0       1       1       0       1 TRIP    TRIP:1…       3
##  4 Q03135 P48995 CAV1          TRPC1         1       1       0       1       1       0 DIP;HP… DIP:19…      13
##  5 P14416 P48995 DRD2          TRPC1         1       1       0       1       1       0 TRIP    TRIP:1…       1
##  6 Q99750 P48995 MDFI          TRPC1         1       0       1       1       0       1 HPRD;T… HPRD:1…       3
##  7 Q14571 P48995 ITPR2         TRPC1         1       1       0       1       1       0 HPRD;T… HPRD:1…      13
##  8 P29966 P48995 MARCKS        TRPC1         1       0       1       1       0       1 TRIP    TRIP:2…       1
##  9 Q13255 P48995 GRM1          TRPC1         1       1       0       1       1       0 TRIP    TRIP:1…       1
## 10 Q13586 P48995 STIM1         TRPC1         1       1       0       1       1       0 IntAct… IntAct…      22
## # … with 82,703 more rows, 4 more variables: n_references <chr>, n_resources <int>, pathway_source <chr>,
## #   pathway_target <chr>, and abbreviated variable names ¹source_genesymbol, ²target_genesymbol,
## #   ³is_directed, ⁴is_stimulation, ⁵is_inhibition, ⁶consensus_direction, ⁷consensus_stimulation,
## #   ⁸consensus_inhibition, ⁹references, ˟curation_effort

3.1.5 Intercellular communication roles

The intercell database assigns roles to proteins such as ligand, receptor, adhesion, transporter, ECM, etc. The design of this database is far from being simple, best is to check the description in the recent OmniPath paper [1].

ic <- import_omnipath_intercell()
ic

## # A tibble: 323,572 × 15
##    category parent datab…¹ scope aspect source uniprot genes…² entit…³ conse…⁴ trans…⁵ recei…⁶ secre…⁷ plasm…⁸
##    <chr>    <chr>  <chr>   <chr> <chr>  <chr>  <chr>   <chr>   <chr>     <dbl> <lgl>   <lgl>   <lgl>   <lgl>  
##  1 transme… trans… UniPro… gene… locat… resou… Q9P0S3  ORMDL1  protein       5 FALSE   FALSE   FALSE   FALSE  
##  2 transme… trans… UniPro… gene… locat… resou… P26951  IL3RA   protein       7 FALSE   FALSE   FALSE   TRUE   
##  3 transme… trans… UniPro… gene… locat… resou… Q96CH1  GPR146  protein       6 FALSE   FALSE   FALSE   TRUE   
##  4 transme… trans… UniPro… gene… locat… resou… Q7Z442  PKD1L2  protein       6 FALSE   FALSE   FALSE   FALSE  
##  5 transme… trans… UniPro… gene… locat… resou… Q6UXU6  TMEM92  protein       6 FALSE   FALSE   FALSE   TRUE   
##  6 transme… trans… UniPro… gene… locat… resou… Q8IV01  SYT12   protein       5 FALSE   FALSE   FALSE   FALSE  
##  7 transme… trans… UniPro… gene… locat… resou… Q12913  PTPRJ   protein       7 FALSE   FALSE   FALSE   TRUE   
##  8 transme… trans… UniPro… gene… locat… resou… Q6UW88  EPGN    protein       7 FALSE   FALSE   TRUE    TRUE   
##  9 transme… trans… UniPro… gene… locat… resou… Q8TDU5  VN1R17P protein       4 FALSE   FALSE   FALSE   FALSE  
## 10 transme… trans… UniPro… gene… locat… resou… Q9P2W7  B3GAT1  protein       7 FALSE   FALSE   TRUE    FALSE  
## # … with 323,562 more rows, 1 more variable: plasma_membrane_peripheral <lgl>, and abbreviated variable names
## #   ¹database, ²genesymbol, ³entity_type, ⁴consensus_score, ⁵transmitter, ⁶receiver, ⁷secreted,
## #   ⁸plasma_membrane_transmembrane

This data frame is about individual proteins. To create a network of intercellular communication, we provide the import_intercell_network function:

icn <- import_intercell_network(high_confidence = TRUE)
icn

## # A tibble: 17,964 × 45
##    category_in…¹ paren…² source target categ…³ paren…⁴ targe…⁵ sourc…⁶ is_di…⁷ is_st…⁸ is_in…⁹ conse…˟ conse…˟
##    <chr>         <chr>   <chr>  <chr>  <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 activating_c… recept… O14786 P35968 adhesi… adhesi… KDR     NRP1          1       1       0       1       1
##  2 activating_c… recept… O14786 P35968 cell_a… cell_a… KDR     NRP1          1       1       0       1       1
##  3 activating_c… recept… O14786 P35968 matrix… matrix… KDR     NRP1          1       1       0       1       1
##  4 activating_c… recept… O14786 P35968 recept… recept… KDR     NRP1          1       1       0       1       1
##  5 activating_c… recept… P08138 P04629 adhesi… adhesi… NTRK1   NGFR          1       1       1       1       1
##  6 activating_c… recept… P08138 P04629 cell_a… cell_a… NTRK1   NGFR          1       1       1       1       1
##  7 activating_c… recept… P08138 P04629 recept… recept… NTRK1   NGFR          1       1       1       1       1
##  8 activating_c… recept… P08138 P05067 adhesi… adhesi… APP     NGFR          1       0       0       0       0
##  9 activating_c… recept… P08138 P05067 cell_a… cell_a… APP     NGFR          1       0       0       0       0
## 10 activating_c… recept… P08138 P05067 recept… recept… APP     NGFR          1       0       0       0       0
## # … with 17,954 more rows, 32 more variables: consensus_inhibition <dbl>, omnipath <lgl>, ligrecextra <lgl>,
## #   sources <chr>, references <chr>, curation_effort <dbl>, n_references <chr>, n_resources <int>,
## #   database_intercell_source <chr>, scope_intercell_source <chr>, aspect_intercell_source <chr>,
## #   category_source_intercell_source <chr>, genesymbol_intercell_source <chr>,
## #   entity_type_intercell_source <chr>, consensus_score_intercell_source <dbl>,
## #   transmitter_intercell_source <lgl>, receiver_intercell_source <lgl>, secreted_intercell_source <lgl>,
## #   plasma_membrane_transmembrane_intercell_source <lgl>, …

The result is similar to the annotated_network, each interacting partner has its intercell annotations. In the intercell database, OmniPath aims to ship all available information, which means it might contain quite some false positives. The high_confidence option is a shortcut to stringent filter settings based on the number and consensus of provenances. Using instead the filter_intercell_network function, you can have a fine control over the quality filters. It has many options which are described in the manual.

icn <- import_intercell_network()
icn_hc <- filter_intercell_network(
    icn,
    ligand_receptor = TRUE,
    consensus_percentile = 30,
    loc_consensus_percentile = 50,
    simplify = TRUE
)

The filter_intecell function does a similar procedure on an intercell annotation data frame.

3.1.6 Metadata

The list of available resources for each query type can be retrieved by the get_..._resources function. For example, the annotation resources are:

get_annotation_resources()

##  [1] "Adhesome"             "Almen2009"            "Baccin2019"           "CORUM_Funcat"        
##  [5] "CORUM_GO"             "CSPA"                 "CSPA_celltype"        "CancerDrugsDB"       
##  [9] "CancerGeneCensus"     "CancerSEA"            "CellCall"             "CellCellInteractions"
## [13] "CellChatDB"           "CellChatDB_complex"   "CellPhoneDB"          "CellPhoneDB_complex" 
## [17] "CellTalkDB"           "CellTypist"           "Cellinker"            "Cellinker_complex"   
## [21] "ComPPI"               "CytoSig"              "DGIdb"                "DisGeNet"            
## [25] "EMBRACE"              "Exocarta"             "GO_Intercell"         "GPCRdb"              
## [29] "Guide2Pharma"         "HGNC"                 "HPA_secretome"        "HPA_subcellular"     
## [33] "HPA_tissue"           "HPMR"                 "HumanCellMap"         "ICELLNET"            
## [37] "ICELLNET_complex"     "IntOGen"              "Integrins"            "KEGG-PC"             
## [41] "Kirouac2010"          "LOCATE"               "LRdb"                 "MCAM"                
## [45] "MSigDB"               "Matrisome"            "MatrixDB"             "Membranome"          
## [49] "NetPath"              "OPM"                  "PROGENy"              "PanglaoDB"           
## [53] "Phobius"              "Phosphatome"          "Ramilowski2015"       "Ramilowski_location" 
## [57] "SIGNOR"               "SignaLink_function"   "SignaLink_pathway"    "Surfaceome"          
## [61] "TCDB"                 "TFcensus"             "TopDB"                "UniProt_family"      
## [65] "UniProt_keyword"      "UniProt_location"     "UniProt_tissue"       "UniProt_topology"    
## [69] "Vesiclepedia"         "Zhong2015"            "connectomeDB2020"     "iTALK"               
## [73] "kinase.com"           "scConnect"            "scConnect_complex"    "talklr"

Categories in the intercell query also can be listed:

get_intercell_generic_categories()

##  [1] "transmembrane"                       "transmembrane_predicted"            
##  [3] "peripheral"                          "plasma_membrane"                    
##  [5] "plasma_membrane_transmembrane"       "plasma_membrane_regulator"          
##  [7] "plasma_membrane_peripheral"          "secreted"                           
##  [9] "cell_surface"                        "ecm"                                
## [11] "ligand"                              "receptor"                           
## [13] "secreted_enzyme"                     "secreted_peptidase"                 
## [15] "extracellular"                       "intracellular"                      
## [17] "receptor_regulator"                  "secreted_receptor"                  
## [19] "sparc_ecm_regulator"                 "ecm_regulator"                      
## [21] "ligand_regulator"                    "cell_surface_ligand"                
## [23] "cell_adhesion"                       "matrix_adhesion"                    
## [25] "adhesion"                            "matrix_adhesion_regulator"          
## [27] "cell_surface_enzyme"                 "cell_surface_peptidase"             
## [29] "secreted_enyzme"                     "extracellular_peptidase"            
## [31] "secreted_peptidase_inhibitor"        "transporter"                        
## [33] "ion_channel"                         "ion_channel_regulator"              
## [35] "gap_junction"                        "tight_junction"                     
## [37] "adherens_junction"                   "desmosome"                          
## [39] "intracellular_intercellular_related"

# get_intercell_categories() # this would show also the specific categories

3.2 Data from other resources

An increasing number of other resources (currently around 20) can be directly accessed by OmnipathR (not from the omnipathdb.org domain, but from their original providers). As an example,

3.3 General purpose functionalities

3.3.1 Identifier translation

OmnipathR uses UniProt data to translate identifiers. You may find a list of the available identifiers in the manual page of translate_ids function. The evaluation of the parameters is tidyverse style, and both UniProt’s notation and a simple internal notation can be used. Furthermore, it can handle vectors, data frames or list of vectors.

d <- data.frame(uniprot_id = c('P00533', 'Q9ULV1', 'P43897', 'Q9Y2P5'))
d <- translate_ids(
    d,
    uniprot_id = uniprot, # the source ID type and column name
    genesymbol # the target ID type using OmniPath's notation
)
d

##   uniprot_id genesymbol
## 1     P00533       EGFR
## 2     Q9ULV1       FZD4
## 3     P43897       TSFM
## 4     Q9Y2P5    SLC27A5

It is possible to have one source ID type and column in one call, but multiple target ID types and columns: to translate a network, two calls are necessary. Note: certain functionality fails recently due to changes in other packages, will be fixed in a few days.

network <- import_omnipath_interactions()
network <- translate_ids(
    network,
    source = uniprot_id,
    source_entrez = entrez
)
network <- translate_ids(
    network,
    target = uniprot_id,
    target_entrez = entrez
)

3.3.2 Gene Ontology

OmnipathR is able to look up ancestors and descendants in ontology trees, and also exposes the ontology tree in three different formats: as a data frame, as a list of lists or as an igraph graph object. All these can have two directions: child-to-parent (c2p) or parent-to-child (p2c).

go <- go_ontology_download()
go$rel_tbl_c2p

## # A tibble: 58,379 × 3
##    term       relation parents  
##    <fct>      <chr>    <list>   
##  1 GO:0000001 is_a     <chr [2]>
##  2 GO:0000002 is_a     <chr [1]>
##  3 GO:0000003 is_a     <chr [1]>
##  4 GO:0000006 is_a     <chr [1]>
##  5 GO:0000007 is_a     <chr [1]>
##  6 GO:0000009 is_a     <chr [1]>
##  7 GO:0000010 is_a     <chr [1]>
##  8 GO:0000011 is_a     <chr [2]>
##  9 GO:0000012 is_a     <chr [1]>
## 10 GO:0000014 is_a     <chr [1]>
## # … with 58,369 more rows

To convert the relations to list or graph format, use the relations_table_to_list or relations_table_to_graph functions. To swap between c2p and p2c use the swap_relations function.

go_graph <- relations_table_to_graph(go$rel_tbl_c2p)
go_graph

## IGRAPH e20afc8 DN-- 43328 85221 -- 
## + attr: name (v/c), relation (e/c)
## + edges from e20afc8 (vertex names):
##  [1] GO:0000001->GO:0048308 GO:0000001->GO:0048311 GO:0000002->GO:0007005 GO:0000003->GO:0008150
##  [5] GO:0000006->GO:0005385 GO:0000007->GO:0005385 GO:0000009->GO:0000030 GO:0000010->GO:0004659
##  [9] GO:0000011->GO:0007033 GO:0000011->GO:0048308 GO:0000012->GO:0006281 GO:0000014->GO:0004520
## [13] GO:0000015->GO:1902494 GO:0000015->GO:0005829 GO:0000016->GO:0004553 GO:0000017->GO:0042946
## [17] GO:0000018->GO:0051052 GO:0000018->GO:0006310 GO:0000019->GO:0000018 GO:0000019->GO:0006312
## [21] GO:0000022->GO:0051231 GO:0000022->GO:1903047 GO:0000022->GO:0000070 GO:0000022->GO:0007052
## [25] GO:0000023->GO:0005984 GO:0000024->GO:0000023 GO:0000024->GO:0046351 GO:0000025->GO:0000023
## [29] GO:0000025->GO:0046352 GO:0000026->GO:0000030 GO:0000027->GO:0022618 GO:0000027->GO:0042255
## + ... omitted several edges

It can also translate term IDs to term names:

ontology_ensure_name('GO:0000022')

## [1] "mitotic spindle elongation"

The first call takes a few seconds as it loads the database, subsequent calls are faster.

3.4 Useful tips

OmnipathR features a logging facility, a YML configuration file and a cache directory. By default the highest level messages are printed to the console, and you can browse the full log from R by calling omnipath_log(). The cache can be controlled by a number of functions, for example you can search for cache files by omnipath_cache_search(), and delete them by omnipath_cache_remove:

omnipath_cache_search('dorothea')

## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$key
## [1] "f4de1f63d60f35d18bb262dd6de84d1cccd9cc97"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$url
## [1] "https://omnipathdb.org/interactions?genesymbols=yes&datasets=dorothea,tf_target&organisms=9606&dorothea_levels=A,B&fields=sources,references,curation_effort,dorothea_level&license=academic"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$post
## list()
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$payload
## list()
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$ext
## [1] "rds"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`$number
## [1] "1"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`$path
## [1] "/home/biocbuild/.cache/OmnipathR/f4de1f63d60f35d18bb262dd6de84d1cccd9cc97-1.rds"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`$dl_started
## [1] "2022-10-16 17:15:48 EDT"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`$status
## [1] "ready"
## 
## $f4de1f63d60f35d18bb262dd6de84d1cccd9cc97$versions$`1`$dl_finished
## [1] "2022-10-16 17:15:48 EDT"

The configuration can be set by options, all options are prefixed with omnipath., and can be saved by omnipath_save_config. For example, to exclude all OmniPath resources which don’t allow for-profit use:

options(omnipath.license = 'commercial')

The internal state is contained by the omnipath.env environment.

3.5 Further information

Find more examples in the other vignettes and the manual. For example, the NicheNet vignette presents the integratation between OmnipathR and nichenetr, a method for prediction of ligand-target gene connections. Another Bioconductor package wppi is able to add context specific scores to networks, based on genes of interest, functional annotations and network proximity (random walks with restart). The new paths vignette presents some approaches to construct pathways from networks. The design of the OmniPath database is described in our recent paper [1], while an in depth analysis of the pathway resources is available in the first OmniPath paper [2].

Session info

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB             
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] OmnipathR_3.4.7  BiocStyle_2.24.0
## 
## loaded via a namespace (and not attached):
##  [1] progress_1.2.2      tidyselect_1.2.0    xfun_0.33           bslib_0.4.0         purrr_0.3.5        
##  [6] vctrs_0.4.2         generics_0.1.3      htmltools_0.5.3     yaml_2.3.5          utf8_1.2.2         
## [11] rlang_1.0.6         jquerylib_0.1.4     pillar_1.8.1        later_1.3.0         glue_1.6.2         
## [16] withr_2.5.0         DBI_1.1.3           rappdirs_0.3.3      bit64_4.0.5         readxl_1.4.1       
## [21] lifecycle_1.0.3     stringr_1.4.1       cellranger_1.1.0    evaluate_0.17       knitr_1.40         
## [26] tzdb_0.3.0          fastmap_1.1.0       parallel_4.2.1      curl_4.3.3          fansi_1.0.3        
## [31] Rcpp_1.0.9          readr_2.1.3         backports_1.4.1     checkmate_2.1.0     BiocManager_1.30.18
## [36] cachem_1.0.6        vroom_1.6.0         jsonlite_1.8.2      bit_4.0.4           hms_1.1.2          
## [41] digest_0.6.29       stringi_1.7.8       bookdown_0.29       dplyr_1.0.10        cli_3.4.1          
## [46] tools_4.2.1         magrittr_2.0.3      logger_0.2.2        sass_0.4.2          tibble_3.1.8       
## [51] crayon_1.5.2        tidyr_1.2.1         pkgconfig_2.0.3     ellipsis_0.3.2      xml2_1.3.3         
## [56] prettyunits_1.1.1   assertthat_0.2.1    rmarkdown_2.17      httr_1.4.4          R6_2.5.1           
## [61] igraph_1.3.5        compiler_4.2.1

References

[1] D Turei, A Valdeolivas, L Gul, N Palacio-Escat, M Klein, O Ivanova, M Olbei, A Gabor, F Theis, D Modos, T Korcsmaros and J Saez-Rodriguez (2021) Integrated intra- and intercellular signaling knowledge for multicellular omics analysis. Molecular Systems Biology 17:e9923

[2] D Turei, T Korcsmaros and J Saez-Rodriguez (2016) OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nature Methods 13(12)