Summary
The PPI information aggregation pipeline starts by retrieving all datasets in the GEO database whose expression data were generated by expression profiling with high-throughput sequencing. For each dataset identifier, it extracts the supplementary files that contain the counts table. Once the download step finishes, it identifies which tables are already normalized and which contain raw counts that need to be normalized. It also identifies the gene identifiers used (usually HGNC symbols or Ensembl IDs) and maps them to UniProt accessions. For each normalized counts table belonging to an experiment, it keeps only the proteins (already mapped to UniProt identifiers) that appear in the pairs under evaluation. It then computes the Pearson correlation matrix for each table and stores the correlation value of each pair. Finally, a report is produced for each pair, listing the experiment identifiers in descending order of correlation value.
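As an illustration of the correlation and ranking step described above, the minimal sketch below (hypothetical code, not the pipeline's actual implementation; file names and table layout are assumptions) loads a normalized counts table indexed by UniProt accession, computes the Pearson correlation matrix with pandas, and ranks the pairs of interest by their correlation value:

# Minimal sketch of the correlation step; file names and column layout are assumptions.
import pandas as pd

# Normalized counts table: rows = proteins (UniProt accessions), columns = samples
counts = pd.read_csv("normalized_counts_example.tsv", sep="\t", index_col=0)

# Pairs under evaluation: two columns of UniProt identifiers, no header assumed
pairs = pd.read_csv("all_pairs_example.tsv", sep="\t", header=None, names=["a", "b"])

# Pearson correlation between proteins (corr() works on columns, so transpose first)
corr = counts.T.corr(method="pearson")

# Collect the correlation value for each pair present in this table
records = []
for a, b in zip(pairs["a"], pairs["b"]):
    if a in corr.index and b in corr.columns:
        records.append((a, b, corr.loc[a, b]))

# Rank pairs by correlation value, highest first
ranking = pd.DataFrame(records, columns=["protein_a", "protein_b", "pearson_r"])
print(ranking.sort_values("pearson_r", ascending=False).head())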
Requirements:
- Python packages needed:
- os (Python standard library; no installation needed)
- scipy
- pandas
- scikit-learn (sklearn)
- biopython
- numpy
Usage Instructions
- Preparation:
git clone https://github.com/YasCoMa/PipeAggregationInfo.git
cd PipeAggregationInfo
pip3 install -r requirements.txt
Preprocessing pipeline
- Go to the NCBI GEO DataSets (GDS) web page, use keywords to filter the GDS datasets of interest, save the results to a file ("Send to" option), and choose "Summary (text)" (a sketch for extracting the series accessions from this file is shown after these steps)
- Alternatively, we have already saved the search results concerning protein interactions; you may use them to run the preprocessing and obtain the files required by the main pipeline
- Running preprocessing:
cd preprocessing
python3 data_preprocessing.py ./workdir_preprocessing filter_files
cd ../
- Copy the generated output folder "data_matrices_count" into the workflow folder:
cp -R preprocessing/workdir_preprocessing/data_matrices_count .
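The preprocessing works on the "Summary (text)" export mentioned above. As a rough, hypothetical illustration (the actual parsing in data_preprocessing.py may differ, and the file name is an assumption), the GSE series accessions listed in that export could be collected like this:

# Hypothetical sketch: collect GSE series accessions from a GDS "Summary (text)" export.
import re

accessions = set()
with open("gds_result.txt") as handle:
    for line in handle:
        # Series accessions look like GSE followed by digits, e.g. GSE123456
        accessions.update(re.findall(r"GSE\d+", line))

print(f"Found {len(accessions)} series accessions")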
Main pipeline
Pipeline parameters:

- -rt or --running_type
  Use it to indicate the step you want to execute (it is desirable to follow the order):
  1 - Find the experiments and rank them by correlation
  2 - Select the pairs that were already processed and ranked, placing them in a separate folder of interest

- -fo or --folder
  Folder to store the files (use the folder where the other required files can be found)

- -if or --interactome_file
  File with the pairs (two columns with UniProt identifiers, in TSV format). Example of this file: running_example/all_pairs.tsv (see the format sketch after this list)

- -spf or --selected_pairs_file
  File with the PPIs of interest (two columns with UniProt identifiers, in TSV format). Example of this file: running_example/selected_pairs.tsv
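For reference, the pair files are expected to be plain two-column TSVs of UniProt accessions; the minimal sketch below (illustrative accessions and file name, no header assumed) writes a file in that shape:

# Hypothetical sketch of the expected pair-file format: two tab-separated UniProt accessions per line.
pairs = [("P04637", "P38398"), ("P01308", "P06213")]
with open("my_pairs.tsv", "w") as out:
    for a, b in pairs:
        out.write(f"{a}\t{b}\n")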
Running modes examples:

- Run step 1:
  python3 pipeline_expression_pattern.py -rt 1 -fo running_example/ -if all_pairs.tsv

- Run step 2:
  python3 pipeline_expression_pattern.py -rt 2 -fo running_example/ -spf selected_pairs.tsv
Bug Report
Please use the Issues tab to report any bugs.
Version History
main @ beb490b (earliest) Created 22nd Oct 2023 at 01:02 by Yasmmin Martins