PipePatExp - Pipeline to aggregate gene expression correlation information for PPI
main @ beb490b

Workflow Type: Python


The PPI information aggregation pipeline starts by retrieving all datasets in the GEO database whose material was generated by expression profiling by high-throughput sequencing. For each dataset identifier, it extracts the supplementary files that contain the counts table. Once the download step finishes, it identifies the tables that are already normalized and those that hold raw counts still to be normalized. It also identifies the gene identifiers and maps them to UniProt (the identifiers found were usually from HGNC and Ensembl). For each normalized counts table belonging to an experiment, it keeps the tables that contain the proteins (already mapped to UniProt identifiers) of the pairs under evaluation. It then computes the Pearson correlation matrix for each table and saves the correlation value of each pair for that table. Finally, a report is produced for each pair, listing the experiment identifiers in descending order of correlation value.
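
The filtering, correlation and ranking steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the code of the pipeline: the names `pearson`, `rank_pairs`, the experiment labels and the protein identifiers `P1`, `P2`, `P3`, `P9` are all hypothetical, and the sketch uses only the standard library, whereas the pipeline itself lists pandas and scipy among its packages.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (None if undefined)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None  # a constant expression profile has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def rank_pairs(tables, pairs):
    """Collect, per pair, the (experiment, correlation) values in descending order.

    tables: {experiment_id: {protein_id: [normalized counts per sample]}}
    pairs:  list of (protein_id, protein_id) tuples under evaluation
    """
    report = {}
    for experiment_id, table in tables.items():
        for a, b in pairs:
            # Keep only the tables that contain both proteins of the pair.
            if a not in table or b not in table:
                continue
            value = pearson(table[a], table[b])
            if value is not None:
                report.setdefault((a, b), []).append((experiment_id, value))
    for entries in report.values():
        entries.sort(key=lambda item: item[1], reverse=True)
    return report


# Hypothetical normalized counts tables (placeholder identifiers).
tables = {
    "EXP_B": {"P1": [1.0, 2.0, 3.0, 4.0], "P2": [4.0, 3.0, 2.0, 1.0]},
    "EXP_A": {"P1": [1.0, 2.0, 3.0, 4.0], "P2": [2.0, 4.0, 6.0, 8.0]},
    "EXP_C": {"P1": [1.0, 2.0, 3.0, 4.0], "P3": [1.0, 1.0, 2.0, 2.0]},
}
pairs = [("P1", "P2"), ("P1", "P9")]

report = rank_pairs(tables, pairs)
for pair, entries in report.items():
    for experiment_id, value in entries:
        print(pair, experiment_id, round(value, 3))
```

In this toy input, `EXP_C` is skipped because it lacks `P2`, and the pair `("P1", "P9")` produces no entry because `P9` appears in no table.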


  • Python packages needed:
    • os (part of the Python standard library)
    • scipy
    • pandas
    • scikit-learn (imported as sklearn)
    • Biopython
    • numpy

Usage Instructions

  • Preparation:
    1. git clone https://github.com/YasCoMa/PipeAggregationInfo.git
    2. cd PipeAggregationInfo
    3. pip3 install -r requirements.txt

Preprocessing pipeline

  • Go to the NCBI GDS database web page, use keywords to filter the GDS datasets of interest, and save the results to a file ("Send to" option), choosing the "Summary (text)" format
  • Alternatively, we have already saved the results concerning protein interactions; you may use them to run the preprocessing and obtain the files required by the main pipeline
  • Running preprocessing:
    • cd preprocessing
    • python3 data_preprocessing.py ./workdir_preprocessing filter_files
    • cd ../
    • Copy the generated output folder "data_matrices_count" into the workflow folder: cp -R preprocessing/workdir_preprocessing/data_matrices_count .

Main pipeline

  • Pipeline parameters:

    • -rt or --running_type
      Indicates the step to execute (following the order is recommended):
      1 - Find the experiments and rank them by correlation
      2 - Select pairs that were already processed and ranked, placing them in a separate folder of interest

    • -fo or --folder
      Folder to store the files (use the folder where the other required files can be found)

    • -if or --interactome_file
      File with the pairs (two columns of UniProt identifiers, in TSV format)

      Example of this file: running_example/all_pairs.tsv

    • -spf or --selected_pairs_file
      File with the PPIs of interest (two columns of UniProt identifiers, in TSV format)

      Example of this file: running_example/selected_pairs.tsv

  • Examples of running modes:

    1. Run step 1:
      python3 pipeline_expression_pattern.py -rt 1 -fo running_example/ -if all_pairs.tsv

    2. Run step 2:
      python3 pipeline_expression_pattern.py -rt 2 -fo running_example/ -spf selected_pairs.tsv
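
For readers who want to see how the flags above fit together, here is a minimal argparse sketch. The option names come from the parameter list; everything else is an assumption for illustration, including the helper names `build_parser` and `resolve_input` and the idea that the file names are resolved relative to the folder given with -fo (suggested by the examples, but not confirmed by the script itself).

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(
        description="Sketch of the PipePatExp command-line interface"
    )
    parser.add_argument("-rt", "--running_type", type=int, choices=[1, 2],
                        required=True, help="Step to execute (1 or 2)")
    parser.add_argument("-fo", "--folder", required=True,
                        help="Folder holding the input and output files")
    parser.add_argument("-if", "--interactome_file",
                        help="TSV with the pairs to evaluate (step 1)")
    parser.add_argument("-spf", "--selected_pairs_file",
                        help="TSV with the PPIs of interest (step 2)")
    return parser


def resolve_input(args):
    """Return the path of the file needed by the chosen step.

    Assumption: file names are relative to the folder passed with -fo.
    """
    if args.running_type == 1:
        name = args.interactome_file
    else:
        name = args.selected_pairs_file
    if name is None:
        raise ValueError("the file required by this running type is missing")
    return os.path.join(args.folder, name)


# Same arguments as the step 1 example.
args = build_parser().parse_args(
    ["-rt", "1", "-fo", "running_example/", "-if", "all_pairs.tsv"]
)
print(resolve_input(args))
```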

Bug Report

Please use the Issues tab to report any bug.

Version History

main @ beb490b (earliest) Created 22nd Oct 2023 at 01:02 by Yasmmin Martins


Creators and Submitter
  • Yasmmin Martins
Martins, Y. (2023). PipePatExp - Pipeline to aggregate gene expression correlation information for PPI. https://github.com/YasCoMa/PipeAggregationInfo


