Summary
The main goal of this pipeline is to provide a tool for formalizing and standardizing protein-protein interaction (PPI) prediction data using the OntoPPI ontology. The pipeline is split into two parts: (i) the first part prepares data from three main sources of PPI data (HINT, STRING and PredPrin) and creates the standard files to be processed by the next part; (ii) the second part takes the prepared data and describes it semantically using ontologies related to the concepts of this domain. It describes the provenance information of PPI prediction experiments, dataset characteristics, functional annotations of the proteins involved in the PPIs, the PPI detection methods (also called evidence) used in the experiment, and the prediction score obtained by each PPI detection method for the PPIs. The pipeline also performs data fusion to map the same protein pairs across different data sources and, finally, creates a database with all this information in the AllegroGraph triple store.
Requirements:
- Python packages needed:
- pip3 install numpy
- pip3 install rdflib
- pip3 install uuid
- pip3 install SPARQLWrapper
- AllegroGraph tools (pip3 install agraph-python)
Go to this site for the installation tutorial
Usage Instructions
Preparation:
git clone https://github.com/YasCoMa/ppintegrator.git
cd ppintegrator
pip3 install -r requirements.txt
AllegroGraph is a triple store, a database for maintaining semantic descriptions. Its server provides a web application with a user interface to run, edit and manage queries, visualize results and manipulate the data without writing any code other than SPARQL queries. Using AllegroGraph is not mandatory, but if you want to export the data to it and use it, you have to install both the server and the client.
- If you want to use the AllegroGraph server option (this triple store has a free license for up to 5,000,000 triples), install the AllegroGraph server on your machine and configure a user and password: Server - https://franz.com/agraph/support/documentation/current/server-installation.html; Client - https://franz.com/agraph/support/documentation/current/python/install.html
- Export the following environment variables to configure the AllegroGraph server
export AGRAPH_HOST=127.0.0.1
export AGRAPH_PORT=10035
export AGRAPH_USER=chosen_user
export AGRAPH_PASSWORD=chosen_password
- Start AllegroGraph:
path/to/allegrograph/bin/agraph-control --config path/to/allegrograph/lib/agraph.cfg start
- Read the file data_requirements.txt to understand which files are needed for the process
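The AGRAPH_* variables exported above can be read from Python before connecting to the server. The following is a minimal sketch and not code from this repository: the names AgraphSettings and load_agraph_settings are hypothetical, and the fallback host and port simply mirror the example exports.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgraphSettings:
    """Connection settings taken from the AGRAPH_* environment variables."""
    host: str
    port: int
    user: str
    password: str


def load_agraph_settings(env=None):
    """Read AGRAPH_HOST, AGRAPH_PORT, AGRAPH_USER and AGRAPH_PASSWORD.

    Assumption: host and port fall back to the example values shown above;
    user and password have no sensible default, so a missing one raises
    KeyError instead of silently connecting with wrong credentials.
    """
    env = os.environ if env is None else env
    return AgraphSettings(
        host=env.get("AGRAPH_HOST", "127.0.0.1"),
        port=int(env.get("AGRAPH_PORT", "10035")),
        user=env["AGRAPH_USER"],
        password=env["AGRAPH_PASSWORD"],
    )


if __name__ == "__main__":
    # Placeholder credentials, matching the export example.
    example = {"AGRAPH_USER": "chosen_user", "AGRAPH_PASSWORD": "chosen_password"}
    print(load_agraph_settings(example))
```

Failing early on a missing user or password makes a misconfigured shell visible before the export step is attempted.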
Data preparation (first part) - File prepare_data_triplification.py:
- Pipeline parameters:
- -rt or --running_type
Use to indicate the source for which you want to prepare PPI data, as follows:
1 - Prepare data for PredPrin
2 - Prepare data for STRING
3 - Prepare data for HINT
- -fec or --file_experiment_config
File with the experiment configuration in JSON format. Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json
- -org or --organism
Prepare data only for one organism of interest (example: homo_sapiens). This parameter is optional. If you do not specify it, the pipeline automatically uses the organisms described in the experiment configuration file above.
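To make the parameter list above concrete, here is a hedged sketch of a command-line parser that accepts the same flags. It is an illustration only, not the actual code of prepare_data_triplification.py: the function name build_parser, the help strings, and the choice to make -rt and -fec required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Prepare PPI data for triplification (sketch)"
    )
    # 1 = PredPrin, 2 = STRING, 3 = HINT, as listed in the parameters above.
    parser.add_argument(
        "-rt", "--running_type", type=int, choices=[1, 2, 3], required=True,
        help="Source to prepare: 1 PredPrin, 2 STRING, 3 HINT",
    )
    parser.add_argument(
        "-fec", "--file_experiment_config", required=True,
        help="Experiment configuration file in JSON format",
    )
    # Optional: when omitted, the organisms come from the configuration file.
    parser.add_argument(
        "-org", "--organism", default=None,
        help="Restrict preparation to one organism, e.g. homo_sapiens",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["-rt", "3", "-fec", "params_hint.json"])
    print(args.running_type, args.file_experiment_config, args.organism)
```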
- Running modes examples:
- Running for PPI data generated by PredPrin:
python3 prepare_data_triplification.py -rt 1 -fec params_predrep_5k.json
- Running for HINT database:
python3 prepare_data_triplification.py -rt 3 -fec params_hint.json
- Running for STRING database:
python3 prepare_data_triplification.py -rt 2 -fec params_string.json
You can use the file auxiliar_data_preparation.py to run all the provided examples automatically, as follows:
python3 auxiliar_data_preparation.py
PPI data triplification (second part) - File triplification_ppi_data.py:
- Pipeline parameters:
- -rt or --running_type
Use to indicate which execution step you want to run (following the order shown is recommended):
0 - Generate the descriptions for all the protein interaction steps of an experiment (run steps 1, 2 and 3)
1 - Generate triples just about data provenance
2 - Generate triples just for protein functional annotations
3 - Generate triples just for the score results of each evidence
4 - Execute data fusion
5 - Generate descriptions and execute data fusion (run steps 1, 2, 3 and 4)
6 - Export to AllegroGraph server
- -fec or --file_experiment_config
File with the experiment configuration in JSON format. Examples are in these files (all the metadata are required): params_hint.json, params_predrep_5k.json and params_string.json
- -fev or --file_evidence_info
File with the PPI detection methods information in JSON format. Examples are in these files (all the metadata are required): evidences_information.json, evidences_information_hint.json and evidences_information_string.json
- -fcv or --file_config_evidence
File with the paths of the experiment and evidence method files in TSV format. Example of this file: config_evidence_file.tsv
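Running types 0 and 5 are shortcuts for sequences of the single steps listed above. The sketch below shows one way to express that mapping; the names STEP_NAMES, COMPOSITE_MODES and expand_running_type are hypothetical, and this is not taken from triplification_ppi_data.py.

```python
# Single steps, as documented for -rt / --running_type.
STEP_NAMES = {
    1: "data provenance",
    2: "protein functional annotations",
    3: "evidence score results",
    4: "data fusion",
    6: "export to AllegroGraph server",
}

# Composite modes and the single steps they stand for.
COMPOSITE_MODES = {
    0: (1, 2, 3),
    5: (1, 2, 3, 4),
}


def expand_running_type(running_type):
    """Return the ordered list of single steps behind a running type."""
    if running_type in COMPOSITE_MODES:
        return list(COMPOSITE_MODES[running_type])
    if running_type in STEP_NAMES:
        return [running_type]
    raise ValueError(f"Unknown running type: {running_type}")


if __name__ == "__main__":
    for step in expand_running_type(5):
        print(step, STEP_NAMES[step])
```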
- Running modes examples:
- Running to generate all semantic descriptions for PredPrin:
python3 triplification_ppi_data.py -rt 0 -fec params_predrep_5k.json -fev evidences_information.json
- Running to generate only triples of data provenance:
python3 triplification_ppi_data.py -rt 1 -fec params_hint.json -fev evidences_information_hint.json
- Running to generate only triples of PPI scores for each evidence:
python3 triplification_ppi_data.py -rt 3 -fec params_hint.json -fev evidences_information_hint.json
- Running to generate only triples of protein functional annotations (only PredPrin exports these annotations):
python3 triplification_ppi_data.py -rt 2 -fec params_predrep_5k.json -fev evidences_information.json
- Running to generate all semantic descriptions for STRING:
python3 triplification_ppi_data.py -rt 0 -fec params_string.json -fev evidences_information_string.json
For the next options (4, 5 and 6), you must first have run at least modes 1 and 3 for HINT, STRING and PredPrin
- Running to execute data fusion of different sources:
python3 triplification_ppi_data.py -rt 4 -fcv config_evidence_file.tsv
- Running to generate all semantic descriptions and execute data fusion of different sources (combines modes 0 and 4):
python3 triplification_ppi_data.py -rt 5 -fcv config_evidence_file.tsv
- Export semantic data to the AllegroGraph server:
python3 triplification_ppi_data.py -rt 6 -fcv config_evidence_file.tsv
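The data fusion step maps the same protein pairs found in different sources. As a rough illustration of that idea only, and not the pipeline's actual implementation, the sketch below treats an interaction as an unordered pair and collects the score reported by each source. It assumes all sources already use the same protein identifier scheme; pair_key, fuse_sources and the identifiers P1, P2, P3 are hypothetical.

```python
from collections import defaultdict


def pair_key(protein_a, protein_b):
    """Order-independent key, so (A, B) and (B, A) name the same interaction."""
    return tuple(sorted((protein_a, protein_b)))


def fuse_sources(sources):
    """Group scores by protein pair.

    sources maps a source name to an iterable of (protein_a, protein_b, score).
    Returns {pair: {source_name: score}}.
    """
    fused = defaultdict(dict)
    for source_name, interactions in sources.items():
        for protein_a, protein_b, score in interactions:
            fused[pair_key(protein_a, protein_b)][source_name] = score
    return dict(fused)


if __name__ == "__main__":
    # Made-up identifiers and scores, for illustration only.
    demo = {
        "HINT": [("P2", "P1", 0.9)],
        "STRING": [("P1", "P2", 0.7), ("P1", "P3", 0.4)],
    }
    for pair, scores in fuse_sources(demo).items():
        print(pair, scores)
```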
Query Scenarios for analysis
Assuming you ran all the steps shown in the section above, you can run the following options to analyse the data stored in the AllegroGraph triple store.
File to use for this section: query_analysis_ppitriplificator.py
- Parameter:
- -q or --query_option
Use to indicate which query you want to perform:
1 - Get all the different organisms whose interactions are stored in the database
2 - Get the interactions that have scientific papers associated and the list of these papers
3 - Get a list of the most frequent biological processes annotated for the interactions of Escherichia coli bacteria
4 - Get only the interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
5 - Get the scores of interactions belonging to a specific biological process (regulation of transcription, DNA-templated) in Escherichia coli bacteria
6 - Get a list of the most frequent biological processes annotated for the interactions in humans
7 - Get only the interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in humans
8 - Get the scores of interactions belonging to a specific biological process (positive regulation of transcription by RNA polymerase II) in humans
- Running modes examples:
- Running queries:
python3 query_analysis_ppitriplificator.py -q 1
Change number 1 to the respective number of the query you want to perform
Reference
Martins, Y. C., Ziviani, A., Cerqueira e Costa, M. D. O., Cavalcanti, M. C. R., Nicolás, M. F., & de Vasconcelos, A. T. R. (2023). PPIntegrator: semantic integrative system for protein–protein interaction and application for host–pathogen datasets. Bioinformatics Advances, 3(1), vbad067.
Bug Report
Please, use the Issues tab to report any bug.
Version History
master @ 6d3008c (earliest) Created 21st Oct 2023 at 00:56 by Yasmmin Martins