PPIntegrator - PPI Triplification Process
master @ 6d3008c

Workflow Type: Python

Stable

Summary

The main goal of this pipeline is to provide a tool for formalizing and standardizing protein-protein interaction (PPI) prediction data using the OntoPPI ontology. The pipeline is split into two parts: (i) the first part prepares data from three main sources of PPI data (HINT, STRING and PredPrin) and creates the standard files to be processed by the next part; (ii) the second part takes the prepared data and describes it semantically using ontologies related to the concepts of this domain. It describes the provenance information of PPI prediction experiments, dataset characteristics, functional annotations of the proteins involved in the PPIs, the PPI detection methods (also called evidence) used in the experiment, and the prediction score obtained by each PPI detection method for the PPIs. The pipeline also performs data fusion to map the same protein pairs across different data sources and, finally, creates a database of all this information in the AllegroGraph triple store.
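
The data fusion idea mentioned above (mapping the same protein pair across sources) can be illustrated with a minimal sketch. This is not the pipeline's implementation: the function names, the toy identifiers and the scores are all hypothetical, and the real fusion step works on the generated semantic descriptions rather than on plain tuples.

```python
from collections import defaultdict


def pair_key(protein_a, protein_b):
    """Return an order-independent key, so (A, B) and (B, A) match."""
    return tuple(sorted((protein_a, protein_b)))


def fuse_sources(sources):
    """Group scores per protein pair across data sources.

    `sources` maps a source name to a list of
    (protein_a, protein_b, score) tuples.
    """
    fused = defaultdict(dict)
    for source_name, interactions in sources.items():
        for protein_a, protein_b, score in interactions:
            fused[pair_key(protein_a, protein_b)][source_name] = score
    return dict(fused)


# Hypothetical toy data: identifiers and scores are invented for illustration.
example_sources = {
    "HINT": [("protA", "protB", 1.0)],
    "STRING": [("protB", "protA", 0.82), ("protA", "protC", 0.40)],
}
fused_pairs = fuse_sources(example_sources)
```

Here the pair (protA, protB) appears in both toy sources in opposite order and ends up under a single key holding one score per source.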

Requirements:

Python packages needed:
- pip3 install numpy
- pip3 install rdflib
- pip3 install uuid
- pip3 install SPARQLWrapper
- AllegroGraph tools (pip3 install agraph-python)
  For the installation tutorial, see the client link in the Preparation section below

Usage Instructions

Preparation:

git clone https://github.com/YasCoMa/ppintegrator.git
cd ppintegrator
pip3 install -r requirements.txt
AllegroGraph is a triple store, a database that maintains semantic descriptions. Its server provides a web application with a user interface to run, edit and manage queries, visualize results and manipulate the data without writing any code other than SPARQL queries. Using AllegroGraph is optional, but if you want to export the data to it and use it, you have to install both the server and the client.
If you want to use the AllegroGraph server option (this triple store has a free license for up to 5,000,000 triples), install the AllegroGraph server on your machine and configure a user and password: Server - https://franz.com/agraph/support/documentation/current/server-installation.html; Client - https://franz.com/agraph/support/documentation/current/python/install.html
Export the following environment variables to configure the AllegroGraph server:

export AGRAPH_HOST=127.0.0.1
export AGRAPH_PORT=10035
export AGRAPH_USER=chosen_user
export AGRAPH_PASSWORD=chosen_password

Start AllegroGraph: path/to/allegrograph/bin/agraph-control --config path/to/allegrograph/lib/agraph.cfg start
Read the file data_requirements.txt to understand which files are needed for the process
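
Before running the export step, it can help to check that the AGRAPH_* variables listed above are actually set. The sketch below uses only the standard library; the function name `load_agraph_settings` is hypothetical and is not part of this repository or of the AllegroGraph client.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = ("AGRAPH_HOST", "AGRAPH_PORT", "AGRAPH_USER", "AGRAPH_PASSWORD")


def load_agraph_settings(environ=None):
    """Read the AllegroGraph settings, failing early if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": environ["AGRAPH_HOST"],
        "port": int(environ["AGRAPH_PORT"]),  # the port must be an integer
        "user": environ["AGRAPH_USER"],
        "password": environ["AGRAPH_PASSWORD"],
    }


# Example with an explicit mapping that mirrors the exports above;
# call load_agraph_settings() with no argument to read the real environment.
example_settings = load_agraph_settings({
    "AGRAPH_HOST": "127.0.0.1",
    "AGRAPH_PORT": "10035",
    "AGRAPH_USER": "chosen_user",
    "AGRAPH_PASSWORD": "chosen_password",
})
```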

Data preparation (first part) - File `prepare_data_triplification.py` :

Pipeline parameters:
- -rt or --running_type
  Indicates the source for which you want to prepare PPI data, as follows:
  1 - Prepare data for PredPrin
  2 - Prepare data for STRING
  3 - Prepare data for HINT
- -fec or --file_experiment_config
  File with the experiment configuration in json format
  
  Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json
- -org or --organism
  Prepare data only for one organism of interest (example: homo_sapiens)
  
  This parameter is optional. If you do not specify it, the organisms described in the experiment configuration file above are used automatically
Running mode examples:
1. Running for PPI data generated by PredPrin:
  python3 prepare_data_triplification.py -rt 1 -fec params_predrep_5k.json
2. Running for HINT database:
  python3 prepare_data_triplification.py -rt 3 -fec params_hint.json
3. Running for STRING database:
  python3 prepare_data_triplification.py -rt 2 -fec params_string.json
The file auxiliar_data_preparation.py runs the preparation automatically for all the examples provided, as follows:
python3 auxiliar_data_preparation.py
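
The parameters described above can be pictured as a small argparse interface. This is only a sketch of how such a command line could be declared, not the actual parser in prepare_data_triplification.py: the flag names come from the list above, while the `required`, `choices` and `default` settings are assumptions.

```python
import argparse


def build_parser():
    """Declare the three documented flags (validation rules are assumed)."""
    parser = argparse.ArgumentParser(
        description="Prepare PPI data for triplification (illustrative sketch)"
    )
    parser.add_argument(
        "-rt", "--running_type", type=int, choices=[1, 2, 3], required=True,
        help="1 - PredPrin, 2 - STRING, 3 - HINT",
    )
    parser.add_argument(
        "-fec", "--file_experiment_config", required=True,
        help="experiment configuration file in json format",
    )
    parser.add_argument(
        "-org", "--organism", default=None,
        help="optional single organism of interest, e.g. homo_sapiens",
    )
    return parser


# Parse the same arguments as the first running example above.
args = build_parser().parse_args(["-rt", "1", "-fec", "params_predrep_5k.json"])
```

With `-org` omitted, `args.organism` stays `None`, which corresponds to falling back to the organisms listed in the experiment configuration file.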

PPI data triplification (second part) - File `triplification_ppi_data.py`:

Pipeline parameters:
- -rt or --running_type
  Indicates which execution step you want to run (following the order shown is recommended):
  0 - Generate the descriptions for all the protein interaction steps of an experiment (run steps 1, 2 and 3)
  1 - Generate triples just about data provenance
  2 - Generate triples just for protein functional annotations
  3 - Generate triples just for the score results of each evidence
  4 - Execute data fusion
  5 - Generate descriptions and execute data fusion (run steps 1, 2, 3 and 4)
  6 - Export to AllegroGraph server
- -fec or --file_experiment_config
  File with the experiment configuration in json format
  
  Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json
- -fev or --file_evidence_info
  File with the PPI detection methods information in json format
  
  Examples are in these files (all the metadata are required): evidences_information.json, evidences_information_hint.json and evidences_information_string.json
- -fcv or --file_config_evidence
  File with the addresses of the experiment and evidence method files in tsv format
  
  Example of this file: config_evidence_file.tsv
Running mode examples:
1. Running to generate all semantic descriptions for PredPrin:
  python3 triplification_ppi_data.py -rt 0 -fec params_predrep_5k.json -fev evidences_information.json
2. Running to generate only triples of data provenance:
  python3 triplification_ppi_data.py -rt 1 -fec params_hint.json -fev evidences_information_hint.json
3. Running to generate only triples of PPI scores for each evidence:
  python3 triplification_ppi_data.py -rt 3 -fec params_hint.json -fev evidences_information_hint.json
4. Running to generate only triples of protein functional annotations (only PredPrin exports these annotations):
  python3 triplification_ppi_data.py -rt 2 -fec params_predrep_5k.json -fev evidences_information.json
5. Running to generate all semantic descriptions for STRING:
  python3 triplification_ppi_data.py -rt 0 -fec params_string.json -fev evidences_information_string.json
For the next options (4, 5 and 6), you must first run at least modes 1 and 3 for HINT, STRING and PredPrin
1. Running to execute data fusion of different sources:
  python3 triplification_ppi_data.py -rt 4 -fcv config_evidence_file.tsv
2. Running to generate all semantic descriptions and execute data fusion of different sources (combines mode 0 and 4):
  python3 triplification_ppi_data.py -rt 5 -fcv config_evidence_file.tsv
3. Export semantic data to the AllegroGraph server:
  python3 triplification_ppi_data.py -rt 6 -fcv config_evidence_file.tsv
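
In the examples above, modes 0 to 3 are called with -fec and -fev, while modes 4 to 6 are called with -fcv. The sketch below encodes that pattern to assemble a command line; the pairing of modes and flags is inferred from those examples only, and the helper names `required_flags` and `build_command` are hypothetical.

```python
def required_flags(running_type):
    """Return the file flags used with each mode in the examples above."""
    if running_type in (0, 1, 2, 3):
        return ("-fec", "-fev")
    if running_type in (4, 5, 6):
        return ("-fcv",)
    raise ValueError("running_type must be between 0 and 6")


def build_command(running_type, files):
    """Assemble the argument list; `files` maps a flag to a file path."""
    needed = required_flags(running_type)
    missing = [flag for flag in needed if flag not in files]
    if missing:
        raise ValueError("Missing files for flags: " + ", ".join(missing))
    command = ["python3", "triplification_ppi_data.py", "-rt", str(running_type)]
    for flag in needed:
        command += [flag, files[flag]]
    return command


# Reproduces the data fusion example above as an argument list.
fusion_command = build_command(4, {"-fcv": "config_evidence_file.tsv"})
```

The resulting list can be passed to `subprocess.run(fusion_command, check=True)` from the repository folder.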

Query Scenarios for analysis

Assuming you ran all the steps shown in the section above, you can run the following options to analyse the data stored in the AllegroGraph triple store.
File to use for this section: query_analysis_ppitriplificator.py

Parameter:
- -q or --query_option
  Indicates which query you want to perform:
  1 - Get all the different organisms whose interactions are stored in the database
  2 - Get the interactions that have scientific papers associated and the list of these papers
  3 - Get a list of the most frequent biological processes annotated for the interactions of Escherichia coli bacteria
  4 - Get only the interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
  5 - Get the scores of interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
  6 - Get a list of the most frequent biological processes annotated for the interactions of human organism
  7 - Get only the interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in human organism
  8 - Get the scores of interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in human organism
Running mode examples:
1. Running queries:
  python3 query_analysis_ppitriplificator.py -q 1
  Change number 1 to the respective number of the query you want to perform
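
To run several query scenarios in sequence, the command above can be generated for each option. This is a minimal sketch: the helper name `query_command` is hypothetical, and the valid range 1 to 8 is taken from the list of query options above.

```python
def query_command(option):
    """Build the argument list for one of the eight documented queries."""
    if option not in range(1, 9):
        raise ValueError("query option must be between 1 and 8")
    return ["python3", "query_analysis_ppitriplificator.py", "-q", str(option)]


# One command per documented query option.
all_query_commands = [query_command(option) for option in range(1, 9)]

# To execute them from the repository folder (requires the populated
# triple store), each list can be passed to subprocess.run:
#     import subprocess
#     for command in all_query_commands:
#         subprocess.run(command, check=True)
```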

Reference

Martins, Y. C., Ziviani, A., Cerqueira e Costa, M. D. O., Cavalcanti, M. C. R., Nicolás, M. F., & de Vasconcelos, A. T. R. (2023). PPIntegrator: semantic integrative system for protein–protein interaction and application for host–pathogen datasets. Bioinformatics Advances, 3(1), vbad067.