ScreenDOP - Screening of strategies for disease outcome prediction
master @ b8cc280

Workflow Type: Python


The data preparation pipeline contains tasks for two distinct scenarios: leukaemia that contains microarray data for 119 patients and ovarian cancer that contains next generation sequencing data for 380 patients.

The disease outcome prediction pipeline offers two strategies for this task:

Graph kernel method: It starts generating personalized networks for each patient using the interactome file provided and generate the patient network checking if each PPI of the interactome has both proteins up regulated or down regulated according to the gene expression table provided. The first step generate a set of graphs for the patients that are evaluated with 4 distinct kernels for graph classification, which are: Linear kernel between edge histograms, Linear kernel between vertex histograms and the Weisfeiler lehman. These kernels functions calculate a similarity matrix for the graphs and then this matrix is used by the support vector machine classifier. Then the predictions are delivered to the last task that exports a report with the accuracy reached by each kernel. It allows some customizations about the network parameters to be used, such as the DEG cutoff to determine up and down regulated based on the log2 fold change, which will determine the topology and the labels distribution in the specific sample graphs. It is also possible customize the type of node/edge attributes passed to the kernel function, which may be only label, only weight or both.

GSEA based pathway scores method: This method is faster and do not rely on tensor inputs such as the previous method. It uses geneset enrichment analysis on the pathways from KEGG 2021 of Human, and uses the scores of the pathways found enriched for the samples to build the numerical features matrix, that is then delivered to the AdaBoost classifier. The user may choose balance the dataset using oversampling strategy provided by SMOTE.

Usage Instructions


  1. git clone
  2. cd screendop
  3. Decompress screening_ovarian/raw_expression_table.tsv.tar.xz
  4. Create conda environment to handle dependencies: conda env create -f drugresponse_env.yml
  5. conda activate drugresponse_env
  6. Setup an environment variable named "path_workflow_screendop" with the full path to this workflow folder

Data preparation - File :

Files decompression

  • Decompress data_preparation/lekaemia.tar.xz
  • Decompress data_preparation/ovarian/GSE140082_data.tar.xz
    • Put the decompressed file GSE140082_series_matrix.txt in data_preparation/ovarian/

Pipeline parameters

  • -rt or --running_type
    Use to prepare data for the desired scenario:
    1 - Run with Leukaemia data
    2 - Run with Ovarian cancer data

Running modes examples

  1. Run for Leukaemia data:
    python3 -rt 1

In this case, you must have R installed and also the library limma, it is used to determine DEGs from microarray data. For this dataset, the files are already prepared in the folder.

  1. Run for Ovarian cancer data:
    python3 -rt 2

In this case, you must have R installed and also the library DESeq, because this scenario treats next generation sequencing data

Disease outcome prediction execution - File

Pipeline parameters

  • -rt or --running_step
    Use to prepare data for the desired scenario:
    1 - Run graph kernel method
    2 - Run gsea based pathway scores method

  • -cf or --configuration_file
    File with the expression values for the genes by sample/patient in tsv format

    Example of this file: config.json

Input configuration file

  • Configuration file keys (see also the example in config.json):
    • folder (mandatory for both methods): working directory
    • identifier: project identifier to be used in the result files
    • mask_expression_table (mandatory for both methods): Gene expression values file with the result of the fold change normalized value of a certain gene for each sample, already pruned by the significance (p-value).
    • raw_expression_table (mandatory for both methods): Raw gene expression values already normalized following the method pf preference of the user.
    • labels_file (mandatory for both methods): File with the prognosis label for each sample
    • deg_cutoff_up: Cutoff value to determine up regulated gene. Default value is 1.
    • deg_cutoff_down: Cutoff value to determine down regulated gene. Default value is -1.
    • nodes_enrichment: Node attributes to be used in the screening evaluation. It may be a list combining the options "label", "weight" or "all". Examples: ["all", "weight"], ["label"], ["label", "weight"]. Default value is ["all"].
    • edges_enrichment: Edge attributes to be used in the screening evaluation. It may be a list combining the options "label", "weight" or "all". Examples: ["all", "weight"], ["label"], ["label", "weight"]. Default value is ["all"].
    • flag_balance: Flag to indicate whether the user wants to balance the samples in each outcome class, by SMOTE oversampling. Values may be false or true. Default value is false.

Running modes examples

  1. Running disease outcome prediction by graph kernel method:
    python3 -rt 1 -cf config.json

  2. Running disease outcome prediction by gsea enriched network method:
    python3 -rt 2 -cf config.json


Martins, Y. C. (2023). Multi-task analysis of gene expression data on cancer public datasets. medRxiv, 2023-09.

Bug Report

Please, use the Issue tab to report any bug.

Version History

master @ b8cc280 (earliest) Created 22nd Oct 2023 at 01:18 by Yasmmin Martins


Frozen master b8cc280
help Creators and Submitter
  • Yasmmin Martins
Martins, Y. (2023). {ScreenDOP - Screening of strategies for disease outcome prediction}.

Views: 848   Downloads: 142

Created: 22nd Oct 2023 at 01:18

Last updated: 22nd Oct 2023 at 01:19

help Attributions


Total size: 214 MB
Powered by
Copyright © 2008 - 2024 The University of Manchester and HITS gGmbH