EGG: Epitope Generation Gateway

A modular Snakemake pipeline for personalized neoantigen discovery and prioritization using patient-specific gene co-expression networks.
main @ cc46a7a (latest)

main @ 1f78621 (earliest)

Workflow Type: Snakemake

Overview

EGG is a comprehensive computational workflow for identifying and prioritizing tumor neoantigens from DNA and RNA sequencing data. Unlike traditional neoantigen prediction pipelines that focus solely on sequence-level features, EGG integrates patient-specific gene co-expression networks to identify neoantigens in biologically central and functionally relevant contexts.

Key Features

Modular Architecture: Seven independent Snakemake modules can be run separately or as a complete workflow
Multi-source Neoantigen Detection: Identifies neoantigens from:
- Somatic mutations (SNVs and indels)
- Gene fusions
- Alternative splicing events
Network-based Prioritization: Uses LIONESS-derived personalized co-expression networks to assess functional importance
Automated Setup: Automatically downloads references, installs dependencies, and configures environments
Reproducible: Containerized tools (Docker) and versioned environments (Conda) ensure consistent results

What Makes EGG Different?

EGG goes beyond traditional binding affinity predictions by incorporating:

Gene co-expression network topology to identify functionally critical genes
Protein-protein interaction data (HumanNet-XN) to prioritize biologically relevant candidates
Network centrality metrics that capture a gene's importance to tumor cell function, in particular how likely the gene is to be retained under tumor evolution and immune evasion
Consensus scoring that integrates network features with orthogonal evidence (DepMap essentiality, binding affinity, expression)

Pipeline Implementation

EGG is implemented as a Snakemake workflow consisting of seven separate modules, each corresponding to a dedicated .smk file that can be executed independently or as part of an end-to-end run.

Each module is encapsulated in its own Snakefile under scripts/final_scripts/ and follows a consistent interface (csvfile=..., results_dir=...) so outputs are organized under a single root directory and can be reused reliably by downstream modules. Sample inputs are provided through a standardized CSV metadata file (sample identity, FASTQ paths, datatype, lane), enabling uniform batch processing without per-sample manual configuration.

To minimize setup burden while preserving reproducibility, EGG provisions dependencies automatically at runtime: modules create and use local Conda environments via --use-conda, pull and run Docker images when required by specific tools, and download reference resources as needed (see the resource requirements section for details). Beyond installing Snakemake, Conda, and Docker, no additional manual installation should be necessary.

In practice, the workflow is commonly initialized with QC filtering (Module 0), after which DNA (Module 1A) and RNA (Module 1B) analyses can run in parallel. Downstream epitope generation modules consume these outputs: 2A uses somatic variants from 1A plus HLA alleles (from 1B or provided externally), 2B uses fusion calls from 1B, and 2C integrates RNA-derived splice events with matched DNA calls and HLA context.

RNA-seq expression outputs feed the co-expression network module, which generates patient-specific network features. The prioritization module uses these features alongside orthogonal evidence (binding features, expression, essentiality, and biological annotations) to produce a final per-sample ranked neoepitope table. This design is intended to complement binding and expression-based ranking with tumor functional context: it prioritizes epitopes arising from functionally indispensable, network-central genes that may be less likely to be lost under tumor evolution and selective pressure, thereby supporting identification of potentially more durable vaccine targets. The recommended execution order is explained in the Execution Guide.
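The dependencies described above can be sketched as a small graph. This is a minimal illustration and not part of EGG: the module labels are shorthand invented for the example, 2A is assumed to take its HLA alleles from 1B rather than from an external file, and 2C is assumed to depend on both 1A and 1B.

```python
from graphlib import TopologicalSorter

# Hypothetical labels for this sketch, not file names from the repository.
# Each module maps to the set of modules whose outputs it consumes.
DEPENDENCIES = {
    "0_qc_filtering": set(),
    "1A_dna_analysis": {"0_qc_filtering"},
    "1B_rna_analysis": {"0_qc_filtering"},
    "2A_mutation_epitopes": {"1A_dna_analysis", "1B_rna_analysis"},
    "2B_fusion_epitopes": {"1B_rna_analysis"},
    "2C_splicing_epitopes": {"1A_dna_analysis", "1B_rna_analysis"},
    "network_generation": {"1B_rna_analysis"},
    "prioritization": {
        "2A_mutation_epitopes",
        "2B_fusion_epitopes",
        "2C_splicing_epitopes",
        "network_generation",
    },
}


def execution_waves(dependencies):
    """Group modules into waves; modules within one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every module whose inputs exist
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for index, wave in enumerate(execution_waves(DEPENDENCIES)):
        print(index, ", ".join(wave))
```

Under these assumptions the graph resolves into four waves: QC, then 1A and 1B together, then the three epitope modules plus network generation, then prioritization.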

Installation

Prerequisites

Conda/Mamba (for environment management)
Snakemake ≥6.0
Docker (for containerized tools)

Setup

Clone the repository

git clone https://github.com/fhaive/epitope_generation_gateway.git
cd epitope_generation_gateway

Install Snakemake (if not already installed)

conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake

That's it! All other dependencies are managed automatically by the pipeline.

Pipeline Modules

EGG is organized into seven modular components, each handling a specific analysis step:

Module 0: QC Filtering

Purpose: Quality control and preprocessing of raw sequencing data

Description: This module employs FastQC for sequence quality assessment and fastp for automated adapter trimming and low-quality read filtering. All trimming thresholds and filters are fully configurable via a central YAML configuration file, ensuring downstream analysis proceeds with high-quality data.

Tools: FastQC, fastp

Output: Trimmed FASTQ files, QC reports

snakemake --use-conda --snakefile scripts/final_scripts/QC_filtering_module_0.smk \
    --cores {threads} --config csvfile="samples.csv" results_dir="output/"
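Every module is invoked with the same flags and the same two config keys, so a wrapper script can assemble the command from a few parameters. The helper below is a hypothetical convenience for illustration, not something shipped with EGG; the core count of 8 is an arbitrary example value.

```python
import shlex


def build_snakemake_command(snakefile, csvfile, results_dir, cores):
    """Return the argument list for running one EGG module."""
    return [
        "snakemake", "--use-conda",
        "--snakefile", snakefile,
        "--cores", str(cores),
        "--config", f"csvfile={csvfile}", f"results_dir={results_dir}",
    ]


cmd = build_snakemake_command(
    "scripts/final_scripts/QC_filtering_module_0.smk",
    csvfile="samples.csv",
    results_dir="output/",
    cores=8,
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually launch the module: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when sample sheet or output paths contain spaces.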

Module 1B: RNA Analysis

Purpose: Multi-faceted RNA-seq analysis

Description: Using trimmed tumor RNA-seq data, this module performs three parallel analyses: STAR-Fusion detects gene fusion events, ArcasHLA conducts HLA typing, and FeatureCounts quantifies gene expression for downstream co-expression network analysis.

Tools:

STAR-Fusion (gene fusions)
ArcasHLA (HLA typing)
FeatureCounts (gene expression quantification)

Output: Fusion calls, HLA types, gene expression matrices

snakemake --use-conda --snakefile scripts/final_scripts/RNA_analysis_module_1B.smk \
    --cores {threads} --config csvfile="samples.csv" results_dir="output/"

Module 2B: Fusion Epitopes

Purpose: Predict neoepitopes from gene fusion breakpoints

Description: RNA-derived fusion transcripts from STAR-Fusion are processed for neoantigen prediction using pVACfuse. The module automatically handles fusion breakpoint processing and peptide candidate generation across junction sites.

Tools: pVACfuse

Input: Fusion transcripts from Module 1B

Output: Predicted neoepitopes spanning fusion junctions

snakemake --use-conda --snakefile scripts/final_scripts/fusion_epitopes_module_2B.smk \
    --cores {threads} --config csvfile="samples.csv" results_dir="output/"

Network Generation Module

Purpose: Construct patient-specific gene co-expression networks, produce Patient-Specific Functional Interaction Networks (pFINs) by integrating the co-expression networks with HumanNetV2 functional PPI networks, and compute network centrality metrics

Description: This module constructs personalized co-expression networks by applying the LIONESS methodology to a Pearson gene co-expression network. Gene-level centrality metrics are computed and overlaid with HumanNet-XN protein-protein interaction data to identify functionally important neoantigens.
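For orientation, LIONESS estimates the weight of an edge in sample q by comparing the network built from all N samples with the network built without q: N × (all-sample weight − weight without q) + weight without q. The snippet below is a minimal pure-Python sketch of that formula for a single gene pair using Pearson correlation. It is not the code EGG runs, the function names are made up for the example, and it assumes at least three samples with non-constant expression.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lioness_edge_weights(expr_i, expr_j):
    """Sample-specific weights of the edge between genes i and j.

    expr_i and expr_j are lists holding the expression of the two genes
    across the same N samples, in the same sample order.
    """
    n = len(expr_i)
    r_all = pearson(expr_i, expr_j)
    weights = []
    for q in range(n):
        # Correlation recomputed with sample q left out.
        r_without_q = pearson(expr_i[:q] + expr_i[q + 1:],
                              expr_j[:q] + expr_j[q + 1:])
        weights.append(n * (r_all - r_without_q) + r_without_q)
    return weights


if __name__ == "__main__":
    # The last sample breaks an otherwise perfect correlation,
    # so it receives the most negative sample-specific weight.
    print(lioness_edge_weights([1, 2, 3, 4], [1, 2, 3, -10]))
```

A real implementation would vectorize this over all gene pairs rather than looping pair by pair.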

Tools: LIONESS, HumanNet-XN PPI database

Method:

Generates personalized Pearson correlation networks
Filters networks using PPI data and membrane protein enrichment
Computes centrality metrics (degree, betweenness, strength, Weighted Connectivity Impact (WCI))
Calculates Largest Component Impact (LCI) to identify structurally critical genes
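Two of the simpler metrics listed above, degree and strength, together with a component-based impact score, can be sketched on a toy network. This README does not give the exact LCI formula, so the sketch assumes one plausible definition: the fraction by which the largest connected component shrinks when a gene is removed. Function names and the toy edges are hypothetical; betweenness and WCI are omitted.

```python
from collections import defaultdict


def build_adjacency(edges):
    """edges: {(gene_a, gene_b): weight} for an undirected network."""
    adjacency = defaultdict(dict)
    for (a, b), weight in edges.items():
        adjacency[a][b] = weight
        adjacency[b][a] = weight
    return adjacency


def largest_component_size(adjacency, removed=None):
    """Size of the largest connected component, optionally ignoring one gene."""
    seen = set() if removed is None else {removed}
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def gene_metrics(edges):
    """Degree, strength and an assumed LCI-style score for every gene."""
    adjacency = build_adjacency(edges)
    full = largest_component_size(adjacency)
    metrics = {}
    for gene, neighbours in adjacency.items():
        remaining = largest_component_size(adjacency, removed=gene)
        metrics[gene] = {
            "degree": len(neighbours),
            "strength": sum(abs(w) for w in neighbours.values()),
            # Assumed definition, not taken from the EGG source code.
            "lci": (full - remaining) / full,
        }
    return metrics


if __name__ == "__main__":
    toy = {("A", "B"): 0.9, ("B", "C"): -0.5, ("B", "D"): 0.2}
    for gene, values in gene_metrics(toy).items():
        print(gene, values)
```

In the toy star network, removing the hub B fragments the graph into isolated genes, which is the kind of structural criticality the LCI step is meant to flag.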

Output: Network features for each gene per patient

snakemake --use-conda --snakefile scripts/final_scripts/sample_networks_generation.smk \
    --cores {threads} --config csvfile="samples.csv" results_dir="output/"

Input Requirements

Sample Metadata CSV

The pipeline requires a CSV file with the following columns:

| Column | Description | Example |
|---|---|---|
| `sample_name` | Unique sample identifier | SAMPLE_001 |
| `fastq_R1` | Path to forward reads | /path/to/sample_R1.fastq.gz |
| `fastq_R2` | Path to reverse reads | /path/to/sample_R2.fastq.gz |
| `datatype` | Sample type | CancerRNA / CancerDNA / NormalDNA |
| `lane` | Sequencing lane identifier | L001 |

Required Data Types

CancerRNA: Tumor RNA-seq (for expression, fusions, splicing, HLA typing)
CancerDNA: Tumor DNA-seq (for somatic mutations)
NormalDNA: Matched normal DNA-seq (for filtering germline variants)
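A sample sheet can be checked before launching any module. The validator below is a hypothetical helper for illustration, not part of EGG: it verifies the five columns and the three datatype values listed above, and additionally assumes that each combination of sample, datatype and lane should appear only once.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_name", "fastq_R1", "fastq_R2", "datatype", "lane"]
ALLOWED_DATATYPES = {"CancerRNA", "CancerDNA", "NormalDNA"}


def validate_sample_sheet(handle):
    """Return a list of human-readable problems found in a sample CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            problems.append(f"line {line_no}: empty value in {', '.join(empty)}")
            continue
        if row["datatype"] not in ALLOWED_DATATYPES:
            problems.append(f"line {line_no}: unknown datatype {row['datatype']!r}")
        # Assumption of this sketch: one row per sample, datatype and lane.
        key = (row["sample_name"], row["datatype"], row["lane"])
        if key in seen:
            problems.append(f"line {line_no}: duplicate sample/datatype/lane {key}")
        seen.add(key)
    return problems


if __name__ == "__main__":
    sheet = io.StringIO(
        "sample_name,fastq_R1,fastq_R2,datatype,lane\n"
        "SAMPLE_001,/path/to/sample_R1.fastq.gz,/path/to/sample_R2.fastq.gz,CancerRNA,L001\n"
    )
    print(validate_sample_sheet(sheet) or "sample sheet looks consistent")
```

For a file on disk, pass `open("samples.csv", newline="")` instead of the in-memory example.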

Output

Output Organization

The pipeline generates outputs organized by module:

results/
├── 0_Filtering_and_QC/
│   ├── trimmed_fastq/                      # Quality-filtered FASTQ files
│   ├── qc_pre_filtering/                   # Initial FastQC reports
│   └── qc_post_filtering/                  # Post-trimming QC reports
│
├── 1A_mutation_analysis/
│   ├── VCF/                                # Raw variant calls
│   ├── VCF_filtered/                       # Filtered somatic variants
│   ├── VCF_germline/                       # Germline variant calls
│   ├── VCF_germline_filtered/              # Filtered germline variants
│   ├── bwa/                                # Alignment outputs
│   ├── dedup/                              # Duplicate-marked BAMs/metrics
│   ├── bqsr/                               # BQSR-processed BAMs
│   ├── merge/                              # Merged FASTQs from the same sample across lanes
│   ├── metrics/                            # Alignment/QC metrics
│   └── filtering_tables/                   # Variant filtering tables/logs
│
├── 1B_RNA_fusion_HLA/
│   ├── ArcasHLA/                           # HLA typing results
│   ├── RNA_Counts/                         # Gene expression quantification
│   ├── StarFusionOut/                      # Fusion predictions
│   └── sorted_bam/                         # Coordinate-sorted RNA BAMs
│
├── 2A_somatic_mutation_epitopes/
│   ├── annotated_germline_VCF/             # Annotated germline variants
│   ├── annotated_germline_VCF_name_updated/# Germline VCFs with updated naming
│   ├── annotated_phased_VCF/               # Annotated phased variants
│   ├── annotated_somatic_VCF/              # Annotated somatic variants
│   ├── combined_VCF/                       # Combined VCFs (somatic/germline/etc.)
│   ├── kallisto_quantification/            # Expression quantification (kallisto)
│   ├── kallisto_somatic_VCF/               # Somatic VCFs used alongside expression
│   ├── phased_VCF/                         # Phased VCFs
│   ├── pvacSeq/                            # Neoepitope predictions
│   ├── regtools/                           # regtools outputs for splicing evidence
│   ├── sorted_VCF/                         # Sorted VCFs
│   └── tumor_only_VCF/                     # Tumor-only variant calls
│
├── 2B_fusion_epitopes/
│   ├── AGfusion/                           # Fusion annotations
│   └── pvacFuse/                           # Fusion neoepitope predictions
│
├── 2C_splicing_epitopes/
│   ├── annotated_somatic_VCF/              # Annotated somatic VCFs (splicing context)
│   ├── pvacSplice/                         # Splicing neoepitope predictions
│   └── regtools_genomic_VCF_genecode/      # Splice junction calls
│
├── sample_specific_networks/
│   ├── Network_Metrics_Betweenness/        # Betweenness centrality per gene
│   ├── Network_Metrics_Degree/             # Degree centrality per gene
│   ├── Network_Metrics_Full/               # Full metrics outputs
│   │   ├── Strength/                       # Strength metrics
│   │   └── WCI/                            # WCI metrics
│   ├── Network_Metrics_LargestComponentImpact/ # LCI scores per gene
│   ├── Network_Metrics_Strength/           # Strength centrality per gene
│   ├── Sample_Specific_Networks/           # Patient-specific networks
│   ├── Sample_Specific_Networks_Distances/ # Network distance computations
│   ├── Sample_Specific_Networks_PPI_filtered/
│   │   ├── filtered_networks_matrix/
│   │   ├── filtered_networks_rds/
│   │   ├── filtered_networks_rds_old/
│   │   └── qc_plots/
│   ├── Sample_Specific_Networks_PPI_filtered_copy/
│   │   └── filtered_networks_matrix/
│   ├── gene_lists/                         # Gene sets used for networks
│   └── normalised_counts/                  # Normalized expression for network construction
│
└── epitopes_prioritisation/
    ├── Borda_Epiotpes_Prioritisation/      # Consensus scoring results
    ├── annotated_epitopes/                 # Annotated epitopes
    ├── annotated_epitopes_with_network/    # Epitopes merged with network features
    ├── annotationhub_cache/                # AnnotationHub cache
    ├── combined_epitopes/                  # Merged epitopes from all sources
    └── final_epitopes/                     # Final ranked neoepitope outputs
        ├── HLA/
        ├── HTML/
        └── html_out/

Final Prioritized Neoepitope Table

The prioritization module produces a comprehensive ranked table containing:

Per-epitope information:

sample_id: Patient identifier
gene: Gene harboring the alteration
variant_class: SNV / indel / fusion / splice
peptide: Amino acid sequence of the epitope
peptide_length: Length of the epitope peptide
HLA_allele: Predicted binding HLA allele

Immunogenicity features:

binding_affinity: Predicted IC50 (nM)
binding_rank: Percentile rank of binding affinity
expression: Gene expression level (TPM/FPKM)

Network features:

degree: Number of co-expressed genes
betweennes_centrality: Network information flow score
LCI: Largest Component Impact (structural criticality)

Functional annotations:

DepMap_dependency: Cancer cell line essentiality score
Gene_Ontology: GO cellular component
driver_status: IntOGen driver classification
cancer_hallmarks: Associated cancer pathways

Final scores:

Borda_consensus_score: Integrated multi-feature score
final_rank: Overall epitope ranking

Interactive Outputs

The pipeline also generates:

Interactive HTML tables for exploring results
Gene Ontology enrichment analysis
Cancer hallmark enrichment per sample

Configuration

All editable YAML configs are found under: Epitope_Generation_Gateway/scripts/final_scripts/config/

Adjusting QC Filtering

Configure read-quality filtering (via fastp) by editing:

# Relative path: Epitope_Generation_Gateway/scripts/final_scripts/config/QC_filtering_config_0.yaml

fastp:
  detect_adapter: true          # Auto-detect adapters; set to false to use custom sequences
  custom_adapter_R1: ""         # Custom adapter for R1 (leave blank if auto-detecting)
  custom_adapter_R2: ""         # Custom adapter for R2 (leave blank if auto-detecting)
  min_length: 50                # Drop reads shorter than this after trimming
  quality: 20                   # Minimum Phred quality for a base to be considered "qualified"
  unqualified_base_limit: 30    # Max % of unqualified bases allowed per read
  cut_front: false              # Trim low-quality bases from the 5' end if true
  cut_tail: false               # Trim low-quality bases from the 3' end if true
  cut_window_size: 4            # Sliding window size for quality-based trimming
  cut_mean_quality: 20          # Mean quality threshold within the window to trigger trimming
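To make the effect of each key concrete, the sketch below translates the settings into fastp command-line options. The fastp flags are real, but the mapping is an assumption made for illustration; how the EGG Snakefile actually wires these keys into fastp may differ. The config is written as a plain dict so the example runs without a YAML parser.

```python
def fastp_arguments(cfg):
    """Translate the fastp section of the QC config into fastp CLI options."""
    args = [
        "--length_required", str(cfg["min_length"]),
        "--qualified_quality_phred", str(cfg["quality"]),
        "--unqualified_percent_limit", str(cfg["unqualified_base_limit"]),
    ]
    if cfg["detect_adapter"]:
        # Adapter auto-detection for paired-end reads.
        args.append("--detect_adapter_for_pe")
    else:
        if cfg["custom_adapter_R1"]:
            args += ["--adapter_sequence", cfg["custom_adapter_R1"]]
        if cfg["custom_adapter_R2"]:
            args += ["--adapter_sequence_r2", cfg["custom_adapter_R2"]]
    if cfg["cut_front"]:
        args.append("--cut_front")
    if cfg["cut_tail"]:
        args.append("--cut_tail")
    if cfg["cut_front"] or cfg["cut_tail"]:
        # Window settings only matter when sliding-window trimming is on.
        args += ["--cut_window_size", str(cfg["cut_window_size"]),
                 "--cut_mean_quality", str(cfg["cut_mean_quality"])]
    return args


# Values copied from the example configuration above.
DEFAULT_FASTP = {
    "detect_adapter": True, "custom_adapter_R1": "", "custom_adapter_R2": "",
    "min_length": 50, "quality": 20, "unqualified_base_limit": 30,
    "cut_front": False, "cut_tail": False,
    "cut_window_size": 4, "cut_mean_quality": 20,
}

if __name__ == "__main__":
    print(" ".join(fastp_arguments(DEFAULT_FASTP)))
```

With the values shown, reads shorter than 50 bases after trimming are dropped, as are reads in which more than 30% of bases fall below Phred 20.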

Adjusting PPI Filtering and Membrane Edge Addition

# Configuration for sample-specific network filtering.
#
# The final filtered network is built in two explicit steps:
#   1. PPI filtering using HumanNet edges.
#   2. Optional enrichment with high-weight edges touching membrane genes.

ppi_filter:
  # For each gene/node, keep this fraction of its strongest incident HumanNet/PPI edges.
  # The strength is based on absolute sample-specific LIONESS edge weight.
  # 0.20 keeps the top 20% of PPI edges per gene.
  top_prop: 0.20

membrane_enrichment:
  # If true, add additional high-weight raw-network edges that touch membrane genes.
  # If false, the final filtered network will be the PPI-only filtered network.
  enabled: true

  # Fraction of additional membrane-touched edges to add.
  # For example, add_prop: 0.05 adds 5% additional membrane-touched edges.
  add_prop: 0.05

  # Denominator used to calculate how many membrane-touched edges are added.
  # Options:
  #   "unfiltered_ppi_edges":
  #       n_added = ceiling(add_prop * number_of_weighted_unfiltered_HumanNet_edges)
  #   "ppi_filtered_edges":
  #       n_added = ceiling(add_prop * number_of_edges_in_the_PPI_only_filtered_network)
  denominator: "unfiltered_ppi_edges"

  # Candidate membrane-enrichment edges must touch at least one membrane gene.
  # This should remain true. Setting it to false is not currently supported because
  # it would change the biological meaning from membrane enrichment to general edge enrichment.
  require_membrane_endpoint: true

  # Candidate membrane-touched edges are ranked by absolute LIONESS edge weight.
  # Currently, only "absolute_weight" is supported.
  edge_selection_metric: "absolute_weight"
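The two filtering steps described in the configuration comments can be sketched as follows. This is an illustrative reading of those comments, not EGG's implementation. It assumes that an edge survives PPI filtering if it is among the top edges of either endpoint, that the per-gene count is rounded up, and that edges are stored as `(gene_a, gene_b)` tuples with a consistent ordering. All function names are hypothetical.

```python
import math
from collections import defaultdict


def _ceil_count(proportion, total):
    """Ceiling of proportion * total, guarded against float round-off."""
    return math.ceil(round(proportion * total, 9))


def top_ppi_edges(ppi_weights, top_prop):
    """Step 1: keep each gene's strongest incident PPI edges by |weight|."""
    incident = defaultdict(list)
    for edge in ppi_weights:
        for gene in edge:
            incident[gene].append(edge)
    kept = set()
    for edges in incident.values():
        ranked = sorted(edges, key=lambda e: abs(ppi_weights[e]), reverse=True)
        kept.update(ranked[:_ceil_count(top_prop, len(edges))])
    return kept


def membrane_edges_to_add(raw_weights, kept, membrane_genes, add_prop, denominator_size):
    """Step 2: pick extra high-weight edges touching at least one membrane gene.

    denominator_size is the number of unfiltered PPI edges or the number of
    PPI-filtered edges, depending on the `denominator` setting.
    """
    n_added = _ceil_count(add_prop, denominator_size)
    candidates = [
        edge for edge in raw_weights
        if edge not in kept
        and (edge[0] in membrane_genes or edge[1] in membrane_genes)
    ]
    candidates.sort(key=lambda e: abs(raw_weights[e]), reverse=True)
    return candidates[:n_added]


if __name__ == "__main__":
    ppi = {("A", "B"): 0.9, ("A", "C"): 0.8, ("A", "D"): 0.1,
           ("B", "C"): 0.2, ("B", "D"): 0.7, ("C", "D"): 0.6}
    raw = dict(ppi, **{("A", "M1"): 0.95, ("B", "M1"): -0.99, ("C", "X"): 0.97})
    kept = top_ppi_edges(ppi, top_prop=0.5)
    extra = membrane_edges_to_add(raw, kept, {"M1"}, add_prop=0.25,
                                  denominator_size=len(ppi))
    print(sorted(kept), extra)
```

Note how the `denominator` choice changes the number of added edges: in the toy example, 25% of the six unfiltered PPI edges rounds up to two additions, whereas 25% of the four filtered edges would allow only one.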

Adjusting Prioritization Weights

EGG uses a Borda consensus to combine feature-ranked scores. Edit the YAML to change which columns are used, their weights, and whether higher or lower values rank better:

# config/borda_config.yaml
borda:
  # Choose columns, weight (any positive numbers; auto-renormalized),
  # and direction: "lower_better" or "higher_better".
  columns:
    Median.MT.IC50.Score:
      weight: 0.25
      direction: lower_better
    Depmap_survivability_score:
      weight: 0.25
      direction: lower_better
    Network_Betweenness:
      weight: 0.10
      direction: higher_better
    Network_Degree:
      weight: 0.10
      direction: higher_better
    Network_Impact:
      weight: 0.10
      direction: higher_better
    Network_Strength:
      weight: 0.10
      direction: higher_better
    Network_WCI:
      weight: 0.10
      direction: higher_better

  # How to break ties between equal ranks: average|min|max|first|random
  ties_method: average

  # NA handling policy:
  # - ignore: exclude NA for that feature and renormalize weights per row (no penalty)
  # - worst:  treat NA as worst rank (penalize missing values)
  # - drop:   drop rows with any NA across selected features (strict)
  na_policy: ignore
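A weighted Borda consensus of this kind can be sketched in a few lines. The version below is a simplified illustration, not the EGG implementation: it covers only `ties_method: average` and `na_policy: ignore`, scales each feature's rank to Borda points between 0 (worst) and 1 (best), and uses a toy table with two of the column names from the config above.

```python
def average_ranks(values, higher_better):
    """Rank non-missing values (1 = best); ties share their average rank."""
    present = [(v, i) for i, v in enumerate(values) if v is not None]
    present.sort(key=lambda pair: pair[0], reverse=higher_better)
    ranks = [None] * len(values)
    pos = 0
    while pos < len(present):
        end = pos
        while end + 1 < len(present) and present[end + 1][0] == present[pos][0]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the tied 1-based positions
        for _, index in present[pos:end + 1]:
            ranks[index] = shared
        pos = end + 1
    return ranks


def borda_scores(rows, columns):
    """Weighted Borda score per row; missing features are skipped and the
    remaining weights renormalized for that row (na_policy: ignore)."""
    points = {}
    for name, spec in columns.items():
        ranks = average_ranks([row.get(name) for row in rows],
                              spec["direction"] == "higher_better")
        n = sum(r is not None for r in ranks)
        points[name] = [
            None if r is None else ((n - r) / (n - 1) if n > 1 else 1.0)
            for r in ranks
        ]
    scores = []
    for i in range(len(rows)):
        total = weight_sum = 0.0
        for name, spec in columns.items():
            if points[name][i] is not None:
                total += spec["weight"] * points[name][i]
                weight_sum += spec["weight"]
        scores.append(total / weight_sum if weight_sum else None)
    return scores


if __name__ == "__main__":
    columns = {
        "Median.MT.IC50.Score": {"weight": 0.25, "direction": "lower_better"},
        "Network_Degree": {"weight": 0.10, "direction": "higher_better"},
    }
    epitopes = [  # toy values, not real predictions
        {"Median.MT.IC50.Score": 50, "Network_Degree": 10},
        {"Median.MT.IC50.Score": 500, "Network_Degree": 30},
        {"Median.MT.IC50.Score": 5000, "Network_Degree": None},
    ]
    print(borda_scores(epitopes, columns))
```

Because weights are renormalized per row, an epitope with a missing network feature is scored on its remaining features only instead of being penalized.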

Citation

If you use EGG in your research, please cite:

[Citation]

Acknowledgments

EGG integrates and builds upon numerous open-source tools:
- **GATK** (Broad Institute)
- **pVACtools** (Griffith Lab)
- **STAR-Fusion** (Broad Institute)
- **HumanNet-XN** (Seoul National University)
- **LIONESS** (Glass Lab)
- **DepMap** (Broad Institute)


We thank the developers of these tools for making their software freely available.

SEEK ID: https://workflowhub.eu/workflows/2165?version=2

Version History

main @ cc46a7a (latest) Created 1st Jun 2026 at 12:49 by Luca Mannino

Update epitopes_prioritisation.smk

Frozen main cc46a7a

main @ cc46a7a (latest) Created 1st Jun 2026 at 12:48 by Luca Mannino