EGG: Epitope Generation Gateway
A modular Snakemake pipeline for personalized neoantigen discovery and prioritization using patient-specific gene co-expression networks.
Overview
EGG is a comprehensive computational workflow for identifying and prioritizing tumor neoantigens from DNA and RNA sequencing data. Unlike traditional neoantigen prediction pipelines that focus solely on sequence-level features, EGG integrates patient-specific gene co-expression networks to identify neoantigens in biologically central and functionally relevant contexts.
Key Features
- Modular Architecture: Seven independent Snakemake modules can be run separately or as a complete workflow
- Multi-source Neoantigen Detection: Identifies neoantigens from:
- Somatic mutations (SNVs and indels)
- Gene fusions
- Alternative splicing events
- Network-based Prioritization: Uses LIONESS-derived personalized co-expression networks to assess functional importance
- Automated Setup: Automatically downloads references, installs dependencies, and configures environments
- Reproducible: Containerized tools (Docker) and versioned environments (Conda) ensure consistent results
What Makes EGG Different?
EGG goes beyond traditional binding affinity predictions by incorporating:
- Gene co-expression network topology to identify functionally critical genes
- Protein-protein interaction data (HumanNet-XN) to prioritize biologically relevant candidates
- Network centrality metrics that capture a gene's importance to tumor cell function, in particular how resistant it is to loss during tumor evolution and immune evasion
- Consensus scoring that integrates network features with orthogonal evidence (DepMap essentiality, binding affinity, expression)
Pipeline Implementation
EGG is implemented as a Snakemake workflow consisting of seven separate modules, where each module corresponds to a dedicated .smk file that can be executed independently or as part of an end-to-end run.
Each module is encapsulated in its own Snakefile under scripts/final_scripts/ and follows a consistent interface (csvfile=..., results_dir=...) so outputs are organized under a single root directory and can be reused reliably by downstream modules. Sample inputs are provided through a standardized CSV metadata file (sample identity, FASTQ paths, datatype, lane), enabling uniform batch processing without per-sample manual configuration.
To minimize setup burden while preserving reproducibility, EGG provisions dependencies automatically at runtime: modules create and use local Conda environments via --use-conda, pull and run Docker images when required by specific tools, and download reference resources as needed (see the resource requirements section for details). Beyond installing Snakemake, Conda, and Docker, no additional manual installation should be necessary.
In practice, the workflow is commonly initialized with QC filtering (Module 0), after which DNA (Module 1A) and RNA (Module 1B) analyses can run in parallel. Downstream epitope generation modules consume these outputs: 2A uses somatic variants from 1A plus HLA alleles (from 1B or provided externally), 2B uses fusion calls from 1B, and 2C integrates RNA-derived splice events with matched DNA calls and HLA context. RNA-seq expression outputs feed the co-expression network module, which generates patient-specific network features used alongside orthogonal evidence (binding features, expression, essentiality, and biological annotations) in the prioritization module to produce a final per-sample ranked neoepitope table. This is intended to complement binding and expression-based ranking with tumor functional context, prioritizing epitopes arising from functionally indispensable / network-central genes that may be less likely to be lost under tumor evolution and selective pressure, thereby supporting identification of potentially more durable vaccine targets. The recommended execution order is explained in the Execution Guide.
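The dependencies described above can be written down as a small graph and sorted into execution steps. The following is a minimal sketch, not part of EGG itself: the labels "network" and "prioritization" are shorthand chosen here, and the dependency of 2A on 1B assumes HLA alleles come from Module 1B rather than being provided externally.

```python
from graphlib import TopologicalSorter

# Each module maps to the set of modules whose outputs it consumes.
# "network" and "prioritization" are labels chosen for this sketch.
DEPENDENCIES = {
    "0": set(),
    "1A": {"0"},
    "1B": {"0"},
    "2A": {"1A", "1B"},    # somatic variants + HLA alleles (assumed from 1B)
    "2B": {"1B"},          # fusion calls
    "2C": {"1A", "1B"},    # splice events + matched DNA calls + HLA context
    "network": {"1B"},     # RNA-seq expression
    "prioritization": {"2A", "2B", "2C", "network"},
}


def execution_batches(dependencies):
    """Return a list of batches; modules within a batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for step, batch in enumerate(execution_batches(DEPENDENCIES), start=1):
        print(f"step {step}: {', '.join(batch)}")
```

Under these assumptions the sort yields four steps: QC first, then 1A and 1B together, then the three epitope modules and the network module, and finally prioritization.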
Installation
Prerequisites
- Conda/Mamba (for environment management)
- Snakemake ≥6.0
- Docker (for containerized tools)
Setup
- Clone the repository
git clone https://github.com/fhaive/epitope_generation_gateway.git
cd epitope_generation_gateway
- Install Snakemake (if not already installed)
conda create -n snakemake -c conda-forge -c bioconda snakemake
conda activate snakemake
- That's it! All other dependencies are managed automatically by the pipeline.
Pipeline Modules
EGG is organized into seven modular components, each handling a specific analysis step:
Module 0: QC Filtering
Purpose: Quality control and preprocessing of raw sequencing data
Description: This module employs FastQC for sequence quality assessment and fastp for automated adapter trimming and low-quality read filtering. All trimming thresholds and filters are fully configurable via a central YAML configuration file, ensuring downstream analysis proceeds with high-quality data.
Tools: FastQC, fastp
Output: Trimmed FASTQ files, QC reports
snakemake --use-conda --snakefile scripts/final_scripts/QC_filtering_module_0.smk \
--cores {threads} --config csvfile="samples.csv" results_dir="output/"
Module 1B: RNA Analysis
Purpose: Multi-faceted RNA-seq analysis
Description: Using trimmed Cancer RNA-seq data, this module performs three parallel analyses: STAR-Fusion detects gene fusion events, ArcasHLA conducts HLA typing, and FeatureCounts quantifies gene expression for downstream co-expression network analysis.
Tools:
- STAR-Fusion (gene fusions)
- ArcasHLA (HLA typing)
- FeatureCounts (gene expression quantification)
Output: Fusion calls, HLA types, gene expression matrices
snakemake --use-conda --snakefile scripts/final_scripts/RNA_analysis_module_1B.smk \
--cores {threads} --config csvfile="samples.csv" results_dir="output/"
Module 2B: Fusion Epitopes
Purpose: Predict neoepitopes from gene fusion breakpoints
Description: RNA-derived fusion transcripts from STAR-Fusion are processed for neoantigen prediction using pVACfuse. The module automatically handles fusion breakpoint processing and peptide candidate generation across junction sites.
Tools: pVACfuse
Input: Fusion transcripts from Module 1B
Output: Predicted neoepitopes spanning fusion junctions
snakemake --use-conda --snakefile scripts/final_scripts/fusion_epitopes_module_2B.smk \
--cores {threads} --config csvfile="samples.csv" results_dir="output/"
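Because every module shares the same command-line interface, chaining modules can be scripted. The sketch below builds (but does not run) the commands for the QC, RNA analysis, and fusion epitope modules shown above. The helper names are hypothetical and the core count is an arbitrary example value.

```python
import shlex

SNAKEFILE_DIR = "scripts/final_scripts"

# Snakefiles for the fusion branch, in the order they would be run.
FUSION_BRANCH = [
    "QC_filtering_module_0.smk",
    "RNA_analysis_module_1B.smk",
    "fusion_epitopes_module_2B.smk",
]


def build_module_command(snakefile, csvfile, results_dir, cores):
    """Return the argument list for one module (hypothetical helper)."""
    return [
        "snakemake", "--use-conda",
        "--snakefile", f"{SNAKEFILE_DIR}/{snakefile}",
        "--cores", str(cores),
        "--config", f"csvfile={csvfile}", f"results_dir={results_dir}",
    ]


def fusion_branch_commands(csvfile, results_dir, cores):
    """Return one command per module of the fusion branch."""
    return [
        build_module_command(snakefile, csvfile, results_dir, cores)
        for snakefile in FUSION_BRANCH
    ]


if __name__ == "__main__":
    for command in fusion_branch_commands("samples.csv", "output/", 8):
        # Print only; pass the list to subprocess.run(command, check=True)
        # to execute the module for real.
        print(shlex.join(command))
```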
Network Generation Module
Purpose: Construct patient-specific gene co-expression networks, produce Patient-Specific Functional Interaction Networks (pFINs) by integrating the co-expression networks with HumanNetV2 functional PPI networks, and compute network centrality
Description: This module constructs personalized co-expression networks by applying the LIONESS methodology to a Pearson gene co-expression network. Gene-level centrality metrics are computed and overlaid with HumanNet-XN protein-protein interaction data to identify functionally important neoantigens.
Tools: LIONESS, HumanNet-XN PPI database
Method:
- Generates personalized Pearson correlation networks
- Filters networks using PPI data and membrane protein enrichment
- Computes centrality metrics (degree, betweenness, strength, Weighted Connectivity Impact (WCI))
- Calculates Largest Component Impact (LCI) to identify structurally critical genes
Output: Network features for each gene per patient
snakemake --use-conda --snakefile scripts/final_scripts/sample_networks_generation.smk \
--cores {threads} --config csvfile="samples.csv" results_dir="output/"
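For readers unfamiliar with LIONESS, the core idea is to estimate one sample's edge weight from the difference between the network built on all N samples and the network built without that sample: e(q) = N * (e(all) - e(all without q)) + e(all without q). The sketch below illustrates this for Pearson correlation in plain Python. It is an illustration of the published formula, not EGG's implementation; the function names and the handling of constant vectors are choices made here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # sketch-only choice for constant expression vectors
    return cov / sqrt(var_x * var_y)


def lioness_edge(x, y, q):
    """Sample-specific edge weight for sample index q between two genes."""
    n = len(x)
    full = pearson(x, y)
    # Aggregate correlation recomputed with sample q left out.
    without_q = pearson(x[:q] + x[q + 1:], y[:q] + y[q + 1:])
    return n * (full - without_q) + without_q


def lioness_network(expression, q):
    """expression: {gene: [value per sample]}; returns {(gene_a, gene_b): weight}."""
    genes = sorted(expression)
    return {
        (g1, g2): lioness_edge(expression[g1], expression[g2], q)
        for i, g1 in enumerate(genes)
        for g2 in genes[i + 1:]
    }
```

A sample that breaks an otherwise perfect correlation receives a strongly negative edge weight, which is how the method separates an individual sample's contribution from the cohort-level network.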
Input Requirements
Sample Metadata CSV
The pipeline requires a CSV file with the following columns:
| Column | Description | Example |
|---|---|---|
| sample_name | Unique sample identifier | SAMPLE_001 |
| fastq_R1 | Path to forward reads | /path/to/sample_R1.fastq.gz |
| fastq_R2 | Path to reverse reads | /path/to/sample_R2.fastq.gz |
| datatype | Sample type | CancerRNA / CancerDNA / NormalDNA |
| lane | Sequencing lane identifier | L001 |
Required Data Types
- CancerRNA: Tumor RNA-seq (for expression, fusions, splicing, HLA typing)
- CancerDNA: Tumor DNA-seq (for somatic mutations)
- NormalDNA: Matched normal DNA-seq (for filtering germline variants)
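A quick check of the metadata file before launching a run can catch missing columns or misspelled data types early. The sketch below is a hypothetical pre-flight validator, not a tool shipped with EGG; it assumes the CSV has a header row using the column names listed above.

```python
import csv
import io

REQUIRED_COLUMNS = ["sample_name", "fastq_R1", "fastq_R2", "datatype", "lane"]
ALLOWED_DATATYPES = {"CancerRNA", "CancerDNA", "NormalDNA"}


def validate_sample_sheet(handle):
    """Return a list of problems found in an open CSV file (empty if none)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"line {line_no}: empty {column}")
        datatype = (row["datatype"] or "").strip()
        if datatype and datatype not in ALLOWED_DATATYPES:
            problems.append(f"line {line_no}: unknown datatype {datatype!r}")
    return problems


if __name__ == "__main__":
    example = io.StringIO(
        "sample_name,fastq_R1,fastq_R2,datatype,lane\n"
        "SAMPLE_001,/path/to/sample_R1.fastq.gz,/path/to/sample_R2.fastq.gz,CancerRNA,L001\n"
    )
    print(validate_sample_sheet(example) or "sample sheet looks consistent")
```

In practice the function would be called on `open("samples.csv", newline="")`; it deliberately does not check that the FASTQ paths exist.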
Output
Output Organization
The pipeline generates outputs organized by module:
results/
├── 0_Filtering_and_QC/
│ ├── trimmed_fastq/ # Quality-filtered FASTQ files
│ ├── qc_pre_filtering/ # Initial FastQC reports
│ └── qc_post_filtering/ # Post-trimming QC reports
│
├── 1A_mutation_analysis/
│ ├── VCF/ # Raw variant calls
│ ├── VCF_filtered/ # Filtered somatic variants
│ ├── VCF_germline/ # Germline variant calls
│ ├── VCF_germline_filtered/ # Filtered germline variants
│ ├── bwa/ # Alignment outputs
│ ├── dedup/ # Duplicate-marked BAMs/metrics
│ ├── bqsr/ # BQSR-processed BAMs
│ ├── merge/ # Merge fastqs from same sample run across different lanes
│ ├── metrics/ # Alignment/QC metrics
│ └── filtering_tables/ # Variant filtering tables/logs
│
├── 1B_RNA_fusion_HLA/
│ ├── ArcasHLA/ # HLA typing results
│ ├── RNA_Counts/ # Gene expression quantification
│ ├── StarFusionOut/ # Fusion predictions
│ └── sorted_bam/ # Coordinate-sorted RNA BAMs
│
├── 2A_somatic_mutation_epitopes/
│ ├── annotated_germline_VCF/ # Annotated germline variants
│ ├── annotated_germline_VCF_name_updated/# Germline VCFs with updated naming
│ ├── annotated_phased_VCF/ # Annotated phased variants
│ ├── annotated_somatic_VCF/ # Annotated somatic variants
│ ├── combined_VCF/ # Combined VCFs (somatic/germline/etc.)
│ ├── kallisto_quantification/ # Expression quantification (kallisto)
│ ├── kallisto_somatic_VCF/ # Somatic VCFs used alongside expression
│ ├── phased_VCF/ # Phased VCFs
│ ├── pvacSeq/ # Neoepitope predictions
│ ├── regtools/ # regtools outputs for splicing evidence
│ ├── sorted_VCF/ # Sorted VCFs
│ └── tumor_only_VCF/ # Tumor-only variant calls
│
├── 2B_fusion_epitopes/
│ ├── AGfusion/ # Fusion annotations
│ └── pvacFuse/ # Fusion neoepitope predictions
│
├── 2C_splicing_epitopes/
│ ├── annotated_somatic_VCF/ # Annotated somatic VCFs (splicing context)
│ ├── pvacSplice/ # Splicing neoepitope predictions
│ └── regtools_genomic_VCF_genecode/ # Splice junction calls
│
├── sample_specific_networks/
│ ├── Network_Metrics_Betweenness/ # Betweenness centrality per gene
│ ├── Network_Metrics_Degree/ # Degree centrality per gene
│ ├── Network_Metrics_Full/ # Full metrics outputs
│ │ ├── Strength/ # Strength metrics
│ │ └── WCI/ # WCI metrics
│ ├── Network_Metrics_LargestComponentImpact/ # LCI scores per gene
│ ├── Network_Metrics_Strength/ # Strength centrality per gene
│ ├── Sample_Specific_Networks/ # Patient-specific networks
│ ├── Sample_Specific_Networks_Distances/ # Network distance computations
│ ├── Sample_Specific_Networks_PPI_filtered/
│ │ ├── filtered_networks_matrix/
│ │ ├── filtered_networks_rds/
│ │ ├── filtered_networks_rds_old/
│ │ └── qc_plots/
│ ├── Sample_Specific_Networks_PPI_filtered_copy/
│ │ └── filtered_networks_matrix/
│ ├── gene_lists/ # Gene sets used for networks
│ └── normalised_counts/ # Normalized expression for network construction
│
└── epitopes_prioritisation/
├── Borda_Epiotpes_Prioritisation/ # Consensus scoring results
├── annotated_epitopes/ # Annotated epitopes
├── annotated_epitopes_with_network/ # Epitopes merged with network features
├── annotationhub_cache/ # AnnotationHub cache
├── combined_epitopes/ # Merged epitopes from all sources
└── final_epitopes/ # Final ranked neoepitope outputs
├── HLA/
├── HTML/
└── html_out/
Final Prioritized Neoepitope Table
The prioritization module produces a comprehensive ranked table containing:
Per-epitope information:
- sample_id: Patient identifier
- gene: Gene harboring the alteration
- variant_class: SNV / indel / fusion / splice
- peptide: Amino acid sequence of the epitope
- peptide_length: Length of the epitope peptide
- HLA_allele: Predicted binding HLA allele
Immunogenicity features:
- binding_affinity: Predicted IC50 (nM)
- binding_rank: Percentile rank of binding affinity
- expression: Gene expression level (TPM/FPKM)
Network features:
- degree: Number of co-expressed genes
- betweennes_centrality: Network information flow score
- LCI: Largest Component Impact (structural criticality)
Functional annotations:
- DepMap_dependency: Cancer cell line essentiality score
- Gene_Ontology: GO cellular component
- driver_status: IntOGen driver classification
- cancer_hallmarks: Associated cancer pathways
Final scores:
- Borda_consensus_score: Integrated multi-feature score
- final_rank: Overall epitope ranking
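To make the network features in the table more concrete, the sketch below computes degree, strength, and a Largest Component Impact value on a toy weighted network. The exact definitions used by EGG are not given here, so two assumptions are made: strength is the sum of absolute edge weights, and LCI is the drop in the size of the largest connected component when a gene is removed.

```python
def build_adjacency(edges):
    """edges: iterable of (gene_a, gene_b, weight) for an undirected network."""
    adjacency = {}
    for gene_a, gene_b, weight in edges:
        adjacency.setdefault(gene_a, {})[gene_b] = weight
        adjacency.setdefault(gene_b, {})[gene_a] = weight
    return adjacency


def degree(adjacency, gene):
    """Number of neighbours of a gene."""
    return len(adjacency[gene])


def strength(adjacency, gene):
    """Sum of absolute incident edge weights (assumed definition)."""
    return sum(abs(weight) for weight in adjacency[gene].values())


def largest_component_size(adjacency, removed=None):
    """Size of the largest connected component, optionally without one gene."""
    seen = set() if removed is None else {removed}
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def largest_component_impact(adjacency, gene):
    """Drop in largest-component size after removing a gene (assumed definition)."""
    return largest_component_size(adjacency) - largest_component_size(
        adjacency, removed=gene
    )
```

Under these definitions, a hub whose removal splits the network scores a high impact, while a peripheral gene only reduces the largest component by one.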
Interactive Outputs
The pipeline also generates:
- Interactive HTML tables for exploring results
- Gene Ontology enrichment analysis
- Cancer hallmark enrichment per sample
Configuration
All editable YAML configs are found under: Epitope_Generation_Gateway/scripts/final_scripts/config/
Adjusting QC Filtering
Configure read-quality filtering (via fastp) by editing:
# Relative path: Epitope_Generation_Gateway/scripts/final_scripts/config/QC_filtering_config_0.yaml
fastp:
  detect_adapter: true # Auto-detect adapters; set to false to use custom sequences
  custom_adapter_R1: "" # Custom adapter for R1 (leave blank if auto-detecting)
  custom_adapter_R2: "" # Custom adapter for R2 (leave blank if auto-detecting)
  min_length: 50 # Drop reads shorter than this after trimming
  quality: 20 # Minimum Phred quality for a base to be considered "qualified"
  unqualified_base_limit: 30 # Max % of unqualified bases allowed per read
  cut_front: false # Trim low-quality bases from the 5' end if true
  cut_tail: false # Trim low-quality bases from the 3' end if true
  cut_window_size: 4 # Sliding window size for quality-based trimming
  cut_mean_quality: 20 # Mean quality threshold within the window to trigger trimming
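As a rough guide to what these settings control, the sketch below translates the configuration keys into fastp command-line options. The option names are fastp's own, but the mapping from keys to options is an assumption made for illustration; the pipeline's actual rule may differ. The adapter sequence in the usage example is only a placeholder.

```python
# Defaults mirror the values shown in the configuration above.
DEFAULT_FASTP_CONFIG = {
    "detect_adapter": True,
    "custom_adapter_R1": "",
    "custom_adapter_R2": "",
    "min_length": 50,
    "quality": 20,
    "unqualified_base_limit": 30,
    "cut_front": False,
    "cut_tail": False,
    "cut_window_size": 4,
    "cut_mean_quality": 20,
}


def fastp_arguments(cfg):
    """Translate config keys into fastp options (assumed mapping)."""
    args = []
    if cfg.get("detect_adapter", True):
        args.append("--detect_adapter_for_pe")
    else:
        if cfg.get("custom_adapter_R1"):
            args += ["--adapter_sequence", cfg["custom_adapter_R1"]]
        if cfg.get("custom_adapter_R2"):
            args += ["--adapter_sequence_r2", cfg["custom_adapter_R2"]]
    args += ["--length_required", str(cfg["min_length"])]
    args += ["--qualified_quality_phred", str(cfg["quality"])]
    args += ["--unqualified_percent_limit", str(cfg["unqualified_base_limit"])]
    if cfg.get("cut_front"):
        args.append("--cut_front")
    if cfg.get("cut_tail"):
        args.append("--cut_tail")
    # Window settings only matter when sliding-window trimming is enabled.
    if cfg.get("cut_front") or cfg.get("cut_tail"):
        args += ["--cut_window_size", str(cfg["cut_window_size"])]
        args += ["--cut_mean_quality", str(cfg["cut_mean_quality"])]
    return args


if __name__ == "__main__":
    custom = dict(DEFAULT_FASTP_CONFIG, detect_adapter=False,
                  custom_adapter_R1="AGATCGGAAGAGC", cut_tail=True)
    print(" ".join(fastp_arguments(custom)))
```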
Adjusting PPI Filtering and Membrane Edge Addition
# Configuration for sample-specific network filtering.
#
# The final filtered network is built in two explicit steps:
#   1. PPI filtering using HumanNet edges.
#   2. Optional enrichment with high-weight edges touching membrane genes.
ppi_filter:
  # For each gene/node, keep this fraction of its strongest incident HumanNet/PPI edges.
  # The strength is based on absolute sample-specific LIONESS edge weight.
  # 0.20 keeps the top 20% of PPI edges per gene.
  top_prop: 0.20
membrane_enrichment:
  # If true, add additional high-weight raw-network edges that touch membrane genes.
  # If false, the final filtered network will be the PPI-only filtered network.
  enabled: true
  # Fraction of additional membrane-touched edges to add.
  # For example, add_prop: 0.05 adds 5% additional membrane-touched edges.
  add_prop: 0.05
  # Denominator used to calculate how many membrane-touched edges are added.
  # Options:
  #   "unfiltered_ppi_edges":
  #     n_added = ceiling(add_prop * number_of_weighted_unfiltered_HumanNet_edges)
  #   "ppi_filtered_edges":
  #     n_added = ceiling(add_prop * number_of_edges_in_the_PPI_only_filtered_network)
  denominator: "unfiltered_ppi_edges"
  # Candidate membrane-enrichment edges must touch at least one membrane gene.
  # This should remain true. Setting it to false is not currently supported because
  # it would change the biological meaning from membrane enrichment to general edge enrichment.
  require_membrane_endpoint: true
  # Candidate membrane-touched edges are ranked by absolute LIONESS edge weight.
  # Currently, only "absolute_weight" is supported.
  edge_selection_metric: "absolute_weight"
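The two filtering steps described in this configuration can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline's code: an edge is kept when it falls in the top fraction for at least one of its endpoints, the per-gene count is rounded up, and the function names and gene labels are hypothetical.

```python
from math import ceil


def ppi_filter(ppi_edges, top_prop):
    """Keep each gene's strongest PPI edges.

    ppi_edges: {(gene_a, gene_b): weight} for HumanNet-supported edges.
    Assumption: an edge survives if either endpoint ranks it in its top fraction.
    """
    incident = {}
    for key, weight in ppi_edges.items():
        for gene in key:
            incident.setdefault(gene, []).append((key, weight))
    kept = set()
    for edges in incident.values():
        edges.sort(key=lambda item: abs(item[1]), reverse=True)
        n_keep = ceil(top_prop * len(edges))  # rounding up is an assumption
        kept.update(key for key, _ in edges[:n_keep])
    return {key: ppi_edges[key] for key in kept}


def membrane_enrichment(raw_edges, filtered, membrane_genes, add_prop,
                        denominator_size):
    """Add the strongest raw edges that touch at least one membrane gene.

    denominator_size is the edge count selected by the `denominator` option.
    """
    n_added = ceil(add_prop * denominator_size)
    candidates = [
        (key, weight) for key, weight in raw_edges.items()
        if key not in filtered
        and (key[0] in membrane_genes or key[1] in membrane_genes)
    ]
    candidates.sort(key=lambda item: abs(item[1]), reverse=True)
    enriched = dict(filtered)
    enriched.update(candidates[:n_added])
    return enriched
```

With `denominator: "unfiltered_ppi_edges"`, `denominator_size` would be `len(ppi_edges)`; with `"ppi_filtered_edges"`, it would be `len(filtered)`.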
Adjusting Prioritization Weights
EGG uses a Borda consensus to combine feature-ranked scores. Edit the YAML to change which columns are used, their weights, and whether higher or lower values rank better:
# config/borda_config.yaml
borda:
  # Choose columns, weight (any positive numbers; auto-renormalized),
  # and direction: "lower_better" or "higher_better".
  columns:
    Median.MT.IC50.Score:
      weight: 0.25
      direction: lower_better
    Depmap_survivability_score:
      weight: 0.25
      direction: lower_better
    Network_Betweenness:
      weight: 0.10
      direction: higher_better
    Network_Degree:
      weight: 0.10
      direction: higher_better
    Network_Impact:
      weight: 0.10
      direction: higher_better
    Network_Strength:
      weight: 0.10
      direction: higher_better
    Network_WCI:
      weight: 0.10
      direction: higher_better
  # How to break ties between equal ranks: average|min|max|first|random
  ties_method: average
  # NA handling policy:
  # - ignore: exclude NA for that feature and renormalize weights per row (no penalty)
  # - worst: treat NA as worst rank (penalize missing values)
  # - drop: drop rows with any NA across selected features (strict)
  na_policy: ignore
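A weighted Borda consensus of this kind can be sketched as follows. The sketch covers only `ties_method: average` and `na_policy: ignore`, and it scales each feature's rank to points between 0 and 1 before weighting. That scaling is a choice made here for illustration; the exact scoring formula used by EGG is not specified in this document.

```python
def average_ranks(values, higher_better):
    """Rank non-missing values (1 = best); tied values share the average rank."""
    present = [(value, idx) for idx, value in enumerate(values) if value is not None]
    present.sort(key=lambda item: item[0], reverse=higher_better)
    ranks = [None] * len(values)
    pos = 0
    while pos < len(present):
        end = pos
        while end + 1 < len(present) and present[end + 1][0] == present[pos][0]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the tied 1-based positions
        for _, idx in present[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def borda_scores(table, columns):
    """Weighted Borda score per row; higher is better.

    table:   {column: [value per epitope]}, with None marking NA.
    columns: {column: {"weight": float, "direction": str}}.
    """
    n_rows = len(next(iter(table.values())))
    totals = [0.0] * n_rows
    weight_sums = [0.0] * n_rows
    for name, spec in columns.items():
        ranks = average_ranks(table[name], spec["direction"] == "higher_better")
        n_present = sum(rank is not None for rank in ranks)
        for row, rank in enumerate(ranks):
            if rank is None:
                continue  # na_policy "ignore": skip and renormalize below
            points = (n_present - rank) / (n_present - 1) if n_present > 1 else 1.0
            totals[row] += spec["weight"] * points
            weight_sums[row] += spec["weight"]
    return [t / w if w else None for t, w in zip(totals, weight_sums)]
```

Dividing by the per-row sum of weights is what makes a missing value cost nothing under the `ignore` policy: the remaining features simply carry proportionally more weight for that epitope.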
Citation
If you use EGG in your research, please cite:
[Citation]
Acknowledgments
EGG integrates and builds upon numerous open-source tools:
- GATK (Broad Institute)
- pVACtools (Griffith Lab)
- STAR-Fusion (Broad Institute)
- HumanNet-XN (Seoul National University)
- LIONESS (Glass Lab)
- DepMap (Broad Institute)
We thank the developers of these tools for making their software freely available.
Version History
- main @ cc46a7a (latest), created 1st Jun 2026 at 12:49 by Luca Mannino: Update epitopes_prioritisation.smk
- main @ cc46a7a (latest), created 1st Jun 2026 at 12:48 by Luca Mannino: Update epitopes_prioritisation.smk
- main @ 1f78621 (earliest), created 28th Apr 2026 at 16:38 by Luca Mannino: Update README.md
Created: 28th Apr 2026 at 16:38
Last updated: 1st Jun 2026 at 12:49
https://orcid.org/0009-0006-2549-2680