The PPI information aggregation pipeline starts by retrieving all datasets in the GEO database whose material was generated using expression profiling by high-throughput sequencing. From each dataset identifier, it extracts the supplementary files that contain the counts table. Once the download step finishes, it identifies the tables that are already normalized or that contain raw counts to be normalized. It also identifies the gene IDs and maps them to UniProt (the IDs found were usually from HGNC and Ensembl). For each normalized counts table belonging to an experiment, it keeps those that contain the proteins (already mapped from HGNC to UniProt identifiers) of the pairs under evaluation. Then, it calculates the Pearson correlation matrix for each table and saves the correlation value of the respective pairs for each table. Finally, a report is made for each pair, listing the experiment identifiers in descending order of correlation value.
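The correlation and ranking steps described above can be sketched as follows. This is a minimal illustration using only the standard library, not the pipeline's own code: the function names `pearson` and `rank_experiments`, the in-memory table layout, the placeholder experiment labels, and the toy counts are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    n = len(x)
    if n != len(y) or n < 2:
        return None
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant profile has no defined correlation
    return cov / sqrt(var_x * var_y)


def rank_experiments(tables, pair):
    """Rank experiments for one protein pair by descending correlation.

    tables: {experiment_id: {uniprot_id: [normalized counts per sample]}}
    (hypothetical layout chosen for this sketch).
    """
    a, b = pair
    scored = []
    for exp_id, table in tables.items():
        # Keep only experiments whose table contains both proteins.
        if a in table and b in table:
            r = pearson(table[a], table[b])
            if r is not None:
                scored.append((exp_id, r))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: experiment labels and counts are invented for illustration.
tables = {
    "EXP_A": {"P04637": [1.0, 2.0, 3.0, 4.0], "Q00987": [2.0, 4.0, 6.0, 8.0]},
    "EXP_B": {"P04637": [1.0, 2.0, 3.0, 4.0], "Q00987": [4.0, 3.0, 2.0, 1.0]},
    "EXP_C": {"P04637": [1.0, 2.0, 3.0, 4.0]},  # second protein missing
}
ranking = rank_experiments(tables, ("P04637", "Q00987"))
```

In this toy case `EXP_C` is skipped because it lacks one of the two proteins, and the remaining experiments are ordered from the most positively to the most negatively correlated.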
- Python packages needed:
- Biopython
git clone https://github.com/YasCoMa/PipeAggregationInfo.git
pip3 install -r requirements.txt
- Go to the NCBI GDS database webpage, use keywords to filter the GDS datasets of interest, save the results as a file ("Send to" option), and choose "Summary (text)"
- Alternatively, the results concerning protein interactions are already saved; you may use them to run the preprocessing and obtain the files needed by the main pipeline
- Running preprocessing:
python3 data_preprocessing.py ./workdir_preprocessing filter_files
- Copy the generated output folder "data_matrices_count" into the workflow folder:
cp -R preprocessing/workdir_preprocessing/data_matrices_count .
-rt or --running_type
Indicates the step you want to execute (it is best to follow the order):
1 - Find the experiments and rank them by correlation
2 - Select pairs that were already processed and ranked, placing them in a separate folder of interest
-fo or --folder
Folder to store the files (use the folder where the other required files can be found)
-if or --interactome_file
File with the pairs (two columns with uniprot identifiers in tsv format)
Example of this file: running_example/all_pairs.tsv
-spf or --selected_pairs_file
File with PPIs of interest (two columns with uniprot identifiers in tsv format)
Example of this file: running_example/selected_pairs.tsv
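Both pair files share the same two-column TSV layout, so one reader can serve for either. The sketch below is an illustration only: the helper `read_pairs` is hypothetical, and it assumes the file has no header row and that blank or incomplete lines can be skipped, which the description above does not state.

```python
import csv
import io


def read_pairs(handle):
    """Read (uniprot_a, uniprot_b) tuples from a two-column TSV handle."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and rows without two non-empty identifiers.
        if len(row) < 2 or not row[0].strip() or not row[1].strip():
            continue
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


# In-memory stand-in for a file such as running_example/all_pairs.tsv;
# the identifiers below are illustrative, not taken from that file.
example = io.StringIO("P04637\tQ00987\nP38398\tQ06609\n\n")
pairs = read_pairs(example)
```

With a real file, the same function could be called as `read_pairs(open(path, newline=""))`.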
Running mode examples:
Run step 1:
python3 pipeline_expression_pattern.py -rt 1 -fo running_example/ -if all_pairs.tsv
Run step 2:
python3 pipeline_expression_pattern.py -rt 2 -fo running_example/ -spf selected_pairs.tsv
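Conceptually, step 2 narrows the ranked results from step 1 down to the pairs of interest. The sketch below shows one way this selection could work and is not the pipeline's actual implementation: the function `select_ranked_pairs`, the in-memory result layout, the choice to treat a pair as unordered, and the toy correlation values are all assumptions.

```python
def select_ranked_pairs(ranked, selected):
    """Keep only ranked entries whose pair appears in `selected`.

    ranked: {(uniprot_a, uniprot_b): [(experiment_id, correlation), ...]}
    selected: iterable of (uniprot_a, uniprot_b) tuples.
    Pairs are compared as unordered sets (an assumption of this sketch).
    """
    wanted = {frozenset(pair) for pair in selected}
    return {
        pair: hits for pair, hits in ranked.items() if frozenset(pair) in wanted
    }


# Toy ranked results; experiment labels and values are invented.
ranked = {
    ("P04637", "Q00987"): [("EXP_A", 0.91), ("EXP_B", 0.12)],
    ("P38398", "Q06609"): [("EXP_A", 0.40)],
}
selected = [("Q00987", "P04637")]  # same pair, listed in reverse order
subset = select_ranked_pairs(ranked, selected)
```

Comparing pairs as `frozenset` objects means that a pair listed as A-B in one file still matches B-A in the other; if the order of the two columns is meaningful, plain tuples would be the stricter choice.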
Please use the Issues tab to report any bugs.
Created: 22nd Oct 2023 at 01:02