EXCON (v2.1.1)
A Nextflow pipeline for gene family EXpansion and CONtraction analysis across multiple species using CAFE5.
Given a set of genome assemblies and annotations, EXCON builds orthogroups with OrthoFinder, fits and compares multiple CAFE models to identify gene families evolving at significantly different rates, and automatically selects the best-fitting model for downstream analysis. Optionally, GO enrichment analysis can be run on expanded and contracted gene families, and on genes grouped by chromosome.
It works with any set of species that have a genome (FASTA) and annotation (GFF) file. Use a minimum of 5 species, ideally up to around 30; the normal maximum is 100 species.
To run the GO annotation, provide a pre-downloaded EggNOG database with --eggnog_data_dir; otherwise the pipeline downloads the database every time it runs.
Be careful, as it is ~45GB. Please download it once and save it somewhere handy to point to.
The database is used to check which GO terms are associated with expanded or contracted gene sets.
Overview
The general pipeline logic is as follows:
- Downloads genome and annotation files from NCBI [NCBIGENOMEDOWNLOAD], or you provide your own.
- Unzips the files, if necessary [GUNZIP].
- Standardises and filters GFF annotations [AGAT_CONVERTSPGXF2GXF].
- Extracts the longest protein isoform [AGAT_SPKEEPLONGESTISOFORM].
- Gets the protein sequences [GFFREAD].
- Renames the genes to gene name (as some will be isoform name) [RENAME_FASTA].
- Finds orthologous genes across species [ORTHOFINDER_CAFE].
- Rescales species tree branch lengths for CAFE [RESCALE_TREE].
- Prepares gene count input and runs the base CAFE model [CAFE_PREP].
- Runs two additional CAFE models in parallel for model comparison [CAFE_RUN]:
  - Gamma model with k=3 rate categories (-k 3)
  - Gamma model with per-family rates (-p -k 3)
- Compares all three CAFE models using AIC and likelihood ratio tests, and selects the best-fitting model [CAFE_MODEL_COMPARE].
- Plots gene family expansions and contractions for the best model [CAFE_PLOT].
Optional — GO enrichment (--run_eggnog)
- Optionally downloads the eggnogmapper database [EGGNOG_DOWNLOAD].
- Optionally assigns GO terms to genes using [EGGNOGMAPPER].
- Optionally prepares GO gene lists from the best CAFE model results [CAFE_GO_PREP].
- Optionally runs GO enrichment in parallel, one job per species/node and direction (expansion/contraction) [CAFE_GO_RUN].
Optional — chromosome GO enrichment (--chromo_go --run_eggnog)
- Optionally plots GO enrichment of genes by chromosome [CHROMO_GO].
- Optionally summarizes GO enrichment by chromosome [SUMMARIZE_CHROMO_GO].
Optional — genome quality statistics (--stats)
- Optionally describes genome assembly and annotation:
  - [BUSCO_BUSCO]: Completeness of the genome compared to expected gene set.
  - [QUAST]: Assembly contiguity statistics (N50 etc).
  - [AGAT_SPSTATISTICS]: Gene, exon, and intron statistics.
Installation
Nextflow pipelines require a few prerequisites. Further documentation on how to install Nextflow is available on the nf-core website.
Prerequisites
- Docker or Singularity.
- Java and openJDK >= 8 (Please Note: Java versions are written as 1.VERSION, so Java 8 is Java 1.8).
- Nextflow >= v25.10.0.
- When running Nextflow with this pipeline, ideally set NXF_VER=25.10.0 beforehand, to ensure functionality on this version.
Install
To install the pipeline please use the following commands but replace VERSION with a release.
wget https://github.com/Eco-Flow/excon/archive/refs/tags/VERSION.tar.gz -O - | tar -xzvf -
or
curl -L https://github.com/Eco-Flow/excon/archive/refs/tags/VERSION.tar.gz --output - | tar -xzvf -
This will produce a directory in the current directory called excon-VERSION which contains the pipeline.
Inputs
Required
--input /path/to/csv/file - A single CSV file as input, in one of the two formats stated below.
This CSV can take 2 forms:
- A 2 field CSV where each row is a unique species name followed by a RefSeq genome reference ID (NOT a GenBank reference ID), e.g. data/input_small-s3.csv. The pipeline will download the relevant genome fasta file and annotation gff3 (or gff augustus) file.
- A 3 field CSV where each row is a unique species name, followed by an absolute path to a genome fasta file, followed by an absolute path to an annotation gff3 (or gff augustus) file. Input can be gzipped (.gz) or not.
Please Note: The genome has to be chromosome level not contig level.
2 fields (Name,Refseq_ID):
Drosophila_yakuba,GCF_016746365.2
Drosophila_simulans,GCF_016746395.2
Drosophila_santomea,GCF_016746245.2
3 fields (Name,genome.fna,annotation.gff):
Drosophila_yakuba,data/Drosophila_yakuba/genome.fna.gz,data/Drosophila_yakuba/genomic.gff.gz
Drosophila_simulans,data/Drosophila_simulans/genome.fna.gz,data/Drosophila_simulans/genomic.gff.gz
Drosophila_santomea,data/Drosophila_santomea/genome.fna.gz,data/Drosophila_santomea/genomic.gff.gz
Note: Genomes should be chromosome-level, not contig-level. RefSeq IDs must be used (not GenBank IDs).
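Before launching a run, it can be useful to sanity-check the input CSV against the two layouts described above. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes only that RefSeq assembly accessions start with GCF_ (GenBank assembly accessions start with GCA_) and that species names must be unique.

```python
import csv
import io


def classify_input_row(row):
    """Return 'refseq' for a 2-field row or 'local' for a 3-field row.

    Hypothetical helper: raises ValueError for any other layout.
    """
    fields = [field.strip() for field in row]
    if len(fields) == 2:
        name, accession = fields
        # RefSeq assembly accessions start with GCF_, GenBank ones with GCA_.
        if not accession.startswith("GCF_"):
            raise ValueError(
                f"{name}: expected a RefSeq accession (GCF_...), got {accession!r}"
            )
        return "refseq"
    if len(fields) == 3:
        return "local"
    raise ValueError(f"expected 2 or 3 fields, got {len(fields)}: {row!r}")


def check_input_csv(text):
    """Check every row and make sure species names are unique."""
    kinds = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # ignore blank lines
        name = row[0].strip()
        if name in kinds:
            raise ValueError(f"duplicate species name: {name}")
        kinds[name] = classify_input_row(row)
    return kinds


example = (
    "Drosophila_yakuba,GCF_016746365.2\n"
    "Drosophila_simulans,GCF_016746395.2\n"
)
print(check_input_csv(example))
```

The sketch does not check that the files in a 3-field CSV exist or that the assemblies are chromosome-level; those checks are left to the user.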
Parameters
Core options
| Parameter | Description | Default |
|---|---|---|
| --input | Path to input CSV file | Required |
| --outdir | Output directory | results |
| --groups | NCBI taxonomy group for genome download (e.g. insects, bacteria) | insects |
| --help | Display help message | false |
| --custom_config | Path to a custom Nextflow config file | null |
Quality statistics (optional)
| Parameter | Description | Default |
|---|---|---|
| --stats | Run BUSCO, QUAST and AGAT statistics on genomes | null |
| --busco_lineage | BUSCO lineage database (e.g. insecta_odb10) | null |
| --busco_mode | BUSCO mode (genome, proteins, transcriptome) | null |
| --busco_lineages_path | Path to local BUSCO lineage databases | null |
| --busco_config | Path to BUSCO config file | null |
CAFE gene family evolution
| Parameter | Description | Default |
|---|---|---|
| --skip_cafe | Skip CAFE analysis | null |
| --cafe_max_differential | Maximum gene count differential for CAFE filtering on retry | 50 |
| --tree_scale_factor | Scale factor for rescaling species tree branch lengths | 1000 |
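To illustrate what --tree_scale_factor does, here is a minimal sketch that multiplies every branch length in a Newick string by a constant. It is an illustration only, not the pipeline's own rescale_tree.py: the function name and the regex-based approach are assumptions, and it expects plain Newick without quoted labels or comments.

```python
import re

# A branch length is the number that directly follows a colon in Newick.
BRANCH_LENGTH = re.compile(r":(\d*\.?\d+(?:[eE][-+]?\d+)?)")


def rescale_newick(newick, factor):
    """Multiply every branch length in a Newick string by `factor`."""

    def scale(match):
        value = float(match.group(1)) * factor
        # 10 significant digits keeps the output compact and hides
        # floating-point noise such as 3.0000000000000004.
        return ":" + format(value, ".10g")

    return BRANCH_LENGTH.sub(scale, newick)


tree = "((A:0.001,B:0.002):0.0005,C:0.003);"
print(rescale_newick(tree, 1000))
```

Node support values (numbers that follow a closing parenthesis rather than a colon) are left untouched by this pattern.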
Note on CAFE model selection: The pipeline runs three CAFE models: a base single-λ model, a Gamma model with k=3 rate categories, and a Gamma model with per-family rate estimation. These run in parallel and are compared using AIC. The best-fitting model is automatically selected and used for all downstream GO enrichment and plotting. Model comparison results are written to results/cafe/model_comparison/cafe_model_comparison.tsv. If model scores cannot be parsed (e.g. on very small datasets), the pipeline defaults to the base model.
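As a rough illustration of AIC-based model selection, the sketch below computes AIC = 2k - 2 ln L for each model, the difference from the best model (delta-AIC), and Akaike weights. It is not the CAFE_MODEL_COMPARE implementation: the function names are hypothetical, and the log-likelihoods and parameter counts are made-up numbers, not pipeline results. It assumes ln L is the (negative) log-likelihood; if a tool reports -lnL, flip the sign first.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def compare_models(models):
    """Rank models given as {name: (log_likelihood, n_params)}.

    Returns the best model name and a per-model table holding the AIC,
    delta-AIC relative to the best model, and the Akaike weight.
    """
    scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
    best = min(scores, key=scores.get)
    # Relative likelihood of each model: exp(-delta_AIC / 2).
    relative = {
        name: math.exp(-0.5 * (score - scores[best]))
        for name, score in scores.items()
    }
    total = sum(relative.values())
    table = {
        name: {
            "aic": scores[name],
            "delta_aic": scores[name] - scores[best],
            "weight": relative[name] / total,
        }
        for name in scores
    }
    return best, table


# Hypothetical values for illustration only.
models = {
    "base": (-1000.0, 1),
    "gamma": (-990.0, 2),
    "gamma_per_family": (-989.5, 3),
}
best, table = compare_models(models)
print(best)
for name, row in table.items():
    print(name, row)
```

With these made-up numbers the AIC values are 2002, 1984 and 1985, so the gamma model is chosen: the extra parameter of the third model does not improve the likelihood enough to offset the penalty.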
GO annotation with EggNOG-mapper (optional)
| Parameter | Description | Default |
|---|---|---|
| --run_eggnog | Run EggNOG-mapper GO annotation | false |
| --eggnog_data_dir | Path to pre-downloaded EggNOG database directory | null |
| --eggnog_target_taxa | Restrict annotations to orthologs from this taxon and its descendants (NCBI taxon ID) | null |
| --eggnog_tax_scope | Taxonomic scope for orthologous group assignment (e.g. 50557 for Insecta) | null |
| --eggnog_evalue | Maximum e-value threshold for sequence matches | null |
| --eggnog_score | Minimum bitscore threshold for matches | null |
| --eggnog_pident | Minimum percent identity (%) | null |
| --eggnog_query_cover | Minimum query coverage (%) | null |
| --eggnog_subject_cover | Minimum subject coverage (%) | null |
Note: The EggNOG database is ~45GB. If --eggnog_data_dir is not provided, the database will be downloaded automatically on each run. We strongly recommend downloading it once, keeping it somewhere permanent, and passing --eggnog_data_dir /path/to/eggnog_data to the pipeline on every run.
GO enrichment analysis (optional, requires --run_eggnog)
| Parameter | Description | Default |
|---|---|---|
| --chromo_go | Run GO enrichment analysis by chromosome | null |
| --go_cutoff | P-value cutoff for GO enrichment | 0.05 |
| --go_type | GO test type (e.g. none) | none |
| --go_max_plot | Maximum number of GO terms to plot | 10 |
Resource limits
| Parameter | Description | Default |
|---|---|---|
| --max_memory | Maximum memory per job | 128.GB |
| --max_cpus | Maximum CPUs per job | 16 |
| --max_time | Maximum runtime per job | 240.h |
Profiles
| Profile | Description |
|---|---|
| docker | Run with Docker containers |
| singularity | Run with Singularity containers |
| conda | Run with Conda environments |
| test_bacteria | Test run with small bacterial genomes |
| test_small | Test run with small insect genomes |
Profiles
This pipeline is designed to run in various modes that can be supplied as a comma separated list i.e. -profile profile1,profile2.
Container Profiles
Please select one of the following profiles when running the pipeline.
- docker - This profile uses the container software Docker when running the pipeline. This container software requires root permissions so is used when running on cloud infrastructure or your local machine (depending on permissions). Please Note: You must have Docker installed to use this profile.
- singularity - This profile uses the container software Singularity when running the pipeline. This container software does not require root permissions so is used when running on on-premise HPCs or your local machine (depending on permissions). Please Note: You must have Singularity installed to use this profile.
- apptainer - This profile uses the container software Apptainer when running the pipeline. This container software does not require root permissions so is used when running on on-premise HPCs or your local machine (depending on permissions). Please Note: You must have Apptainer installed to use this profile.
Optional Profiles
- local - This profile is used if you are running the pipeline on your local machine.
- aws_batch - This profile is used if you are running the pipeline on AWS utilising the AWS Batch functionality. Please Note: You must use the Docker profile with AWS Batch.
- test_small - This profile is used if you want to test running the pipeline on your infrastructure, running from predownloaded go files. Please Note: You do not provide any input parameters if this profile is selected but you still provide a container profile.
- test_biomart - This profile is used if you want to test running the pipeline on your infrastructure, running from the biomart input. Please Note: You do not provide any input parameters if this profile is selected but you still provide a container profile.
Custom Configuration
If you want to run this pipeline on your institute's on-premise HPC or specific cloud infrastructure then please contact us and we will help you build and test a custom config file. This config file will be published to our configs repository.
Running the Pipeline
Please note: The -resume flag uses previously cached successful runs of the pipeline.
- To run the full test example data:
export NXF_VER=25.04.8 # latest version the pipeline has been tested on
nextflow run main.nf -resume -profile docker,test_small
Settings in test_small: input = "input_small-s3.csv"
For the fastest run use: nextflow run main.nf -resume -profile docker,test_bacteria
- To run on your own data (minimal run), cafe only.
# NXF_VER=25.04.8
nextflow run main.nf -resume -profile docker --input data/input_small-s3.csv
- To run with cafe and GO analysis
# NXF_VER=25.04.8
nextflow run main.nf -resume -profile docker --input data/input_small-s3.csv --chromo_go --go_type bonferoni --stats --run_eggnog --eggnog_data_dir /path/to/eggnogdb
Output Structure
results/
├── cafe/
│ ├── base/ # Base single-λ CAFE model outputs
│ │ ├── Out_cafe/ # CAFE5 output files (trees, counts, probabilities)
│ │ ├── hog_gene_counts.tsv # Filtered gene count input to CAFE
│ │ └── hog_filtering_report.tsv # Filtering report (only present if retry triggered)
│ ├── gamma/ # Gamma k=3 model outputs
│ │ └── Out_gamma/ # CAFE5 output files
│ ├── gamma_per_family/ # Gamma per-family rate model outputs
│ │ └── Out_gamma_per_family/ # CAFE5 output files
│ └── model_comparison/
│ ├── cafe_model_comparison.tsv # AIC, delta-AIC, AIC weights, LRT results
│ ├── best_model.txt # Name of the winning model
│ └── Significant_trees.tre # Nexus trees with significant branches (from best model)
├── cafe_plot/
│ └── cafe_plotter/ # Expansion/contraction plots for best model
├── cafe_go/ # GO enrichment (one job per species/node x direction)
│ ├── CAFE_summary.txt # Summary of expansions/contractions per branch
│ ├── *_TopGo_results_ALL.tab # TopGO results per target
│ ├── TopGO_Pval_barplot_*.pdf # Barplots per target
│ ├── Go_summary_pos.pdf # Summary plot across all expansions
│ ├── Go_summary_neg.pdf # Summary plot across all contractions
│ ├── Go_summary_pos_noNode.pdf # As above, terminal branches only
│ └── Go_summary_neg_noNode.pdf
├── chromo_go/ # [optional] GO enrichment by chromosome
│ ├── *.pdf # Per-chromosome GO plots
│ └── summary/ # Summarized results across chromosomes
├── eggnogmapper/
│ └── go_files/ # Per-species GO annotation files
├── gffread/
│ └── *.fasta # Protein sequences per species
├── ncbigenomedownload/
│ ├── *.fna.gz # Downloaded genome assemblies
│ └── *.gff.gz # Downloaded annotations
├── orthofinder_cafe/
│ └── ortho_cafe/ # OrthoFinder results including species tree
├── busco/ # [optional, --stats] BUSCO completeness results
├── agat/ # [optional, --stats] AGAT annotation statistics
├── quast/ # [optional, --stats] Assembly contiguity statistics
└── pipeline_info/
├── execution_report_*.html # Nextflow execution report
├── execution_timeline_*.html # Per-process timeline
├── execution_trace_*.txt # Per-task resource usage
├── pipeline_dag_*.html # Pipeline DAG diagram
└── software_versions.yml # Versions of all tools used
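Once a run has finished, the model comparison outputs listed above can be inspected programmatically. The sketch below is one way to do that, under stated assumptions: the function name is hypothetical, best_model.txt is assumed to hold the model name as plain text, and cafe_model_comparison.tsv is assumed to be tab-separated with a header row. No column names are assumed; rows come back as dictionaries keyed by whatever the header contains.

```python
import csv
from pathlib import Path


def load_model_comparison(results_dir):
    """Read the winning model name and the comparison table from a results directory."""
    comparison_dir = Path(results_dir) / "cafe" / "model_comparison"

    # Assumption: the file holds just the model name, possibly with a newline.
    best_model = (comparison_dir / "best_model.txt").read_text().strip()

    # Assumption: tab-separated with a header row; column names are not
    # hard-coded, each row is returned as a dict keyed by the header.
    with open(comparison_dir / "cafe_model_comparison.tsv", newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))

    return best_model, rows


# Usage (with the default --outdir):
#   best_model, rows = load_model_comparison("results")
#   print(best_model)
#   for row in rows:
#       print(row)
```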
Citation
This pipeline is published on WorkflowHub using the nf-core template. If you use this pipeline in your work, the following citations are essential:
excon: Wyatt, C. (2026). Gene EXpansion and CONtraction analysis pipeline. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.2141.5
nf-core: Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
If you used any of these tools within the pipeline, you must also cite:
CAFE: Fábio K Mendes, Dan Vanderpool, Ben Fulton, Matthew W Hahn, CAFE 5 models variation in evolutionary rates among gene families, Bioinformatics, 2020; btaa1022, https://doi.org/10.1093/bioinformatics/btaa1022
Orthofinder: Emms, D.M. and Kelly, S. (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology 20:238
AGAT: Dainat J. 2022. Another Gtf/Gff Analysis Toolkit (AGAT): Resolve interoperability issues and accomplish more with your annotations. Plant and Animal Genome XXIX Conference. https://github.com/NBISweden/AGAT.
eggNOG-mapper (if used): Carlos P Cantalapiedra, Ana Hernández-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas, eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale, Molecular Biology and Evolution, Volume 38, Issue 12, December 2021, Pages 5825–5829, https://doi.org/10.1093/molbev/msab293
A full list of tools and their versions is found in software_versions.yml under results/pipeline_info. Check it for any additional tools you need to cite.
For the --stats module, you would also need to cite BUSCO, QUAST and AGAT.
Contact Us
If you need any support do not hesitate to contact us at any of:
c.wyatt [at] ucl.ac.uk
ecoflow.ucl [at] gmail.com
Version History
v2.3.1 (latest) Created 12th Apr 2026 at 11:33 by Chris Wyatt
Merge pull request #132 from Eco-Flow/dev
Dev
Frozen
v2.3.1
5392c10
v2.3.0 Created 8th Apr 2026 at 21:31 by Chris Wyatt
Merge pull request #128 from Eco-Flow/dev
Better GO handling
Frozen
v2.3.0
726fa18
v2.2.0 Created 6th Apr 2026 at 15:28 by Chris Wyatt
Merge pull request #90 from Eco-Flow/dev
[v2.2.0] - 2026-04-06
Added
- New OrthoFinder algorithm parameters: --orthofinder_method (-M), --orthofinder_search (-S), --orthofinder_msa_prog (-A), and --orthofinder_tree (-T). These map directly to OrthoFinder command-line flags and are all optional; OrthoFinder defaults are used when unset.
- New --orthofinder_v2 flag (default false) to run OrthoFinder v2.5.5 instead of v3.1.3. v2 uses Hierarchical Orthogroups (N0.tsv) which are more appropriate for CAFE5 as they represent gene families traceable to the common ancestor. v3 uses flat orthogroups (Orthogroups.tsv) which can have inflated copy-number variance. For large datasets (>30 species), v2 is recommended.
- ORTHOFINDER_V2_CAFE results are now published to results/orthofinder_cafe/ (was only published for v3).
- CAFE_PREP now emits pruned_tree (the rescaled, species-name-stripped tree) for use by all downstream CAFE runs.
- CAFE_RUN_LARGE now retries with progressively smaller lambda values (estimated → 1e-4 → 1e-5 → 1e-6 → 1e-7) when the initial fixed-lambda run fails to converge, as recommended in hahnlab/CAFE5#132.
- CAFE GO enrichment plots now display full GO term text labels and GO IDs.
- New output documentation page docs/outputs.md with example figures and detailed descriptions of all output files.
Changed
- Reverted tree scaling back to the original RESCALE_TREE approach (rescale_tree.py multiplies branch lengths by --tree_scale_factor). The MAKE_ULTRAMETRIC module introduced in v2.1.4 is removed.
- --tree_scale_factor default changed from 1 back to 1000.
- CAFE_PREP base run and error model estimation now use pruned_tree (the rescaled non-ultrametric tree) directly, matching the approach that was validated on large datasets. SpeciesTree_rooted_ultra.txt (produced by chronoMPL()) is retained for reference only.
- CAFE_RUN_K and CAFE_RUN_BEST now use pruned_tree instead of SpeciesTree_rooted_ultra.txt, avoiding convergence failures caused by the ultrametric tree's maximum-possible-lambda constraint.
- CAFE_RUN_LARGE failure is now non-fatal: the pipeline continues even if high-differential families cannot be modelled.
- CAFE_PLOT (and CAFE_PLOT_LARGE) now skip gracefully when CAFE5 did not produce an *_asr.tre file, instead of crashing the pipeline.
- ORTHOFINDER_V2 module now emits N0.tsv as orthologues (previously emitted Orthogroups.tsv). This ensures CAFE_PREP receives hierarchical orthogroups, which have lower copy-number variance and are required for correct CAFE5 analysis.
- CHROMO_GO now symlinks N0.tsv to Orthogroups.tsv when the v2 path is used, so the downstream perl script works regardless of OrthoFinder version.
Fixed
- Fixed species name mismatch between tree and gene counts in CAFE_RUN_K: the SpeciesTree_rescaled.nwk retains .clean suffixes but hog_gene_counts.tsv uses plain names. All CAFE runs now use pruned_tree which has suffixes stripped by sed in CAFE_PREP.
- Fixed chronos() convergence failure on large datasets: passing the ×1000 pre-scaled tree to chronos() caused a degenerate starting point (~-74 billion log-likelihood). CAFE_PREP now receives the original unscaled tree and scaling is applied after any ultrametric correction.
Frozen
v2.2.0
1237320
v2.1.1 Created 1st Apr 2026 at 08:41 by Chris Wyatt
Merge pull request #82 from Eco-Flow/transcript_augustus
Augustus gffs and eggnogmapper memory
Frozen
v2.1.1
481d735
v2.1.0 Created 30th Mar 2026 at 17:34 by Chris Wyatt
Merge pull request #77 from Eco-Flow/cafeadvanced
Cafe advanced
Frozen
v2.1.0
ebe7fed
v2.0.4 Created 28th Mar 2026 at 08:12 by Chris Wyatt
Merge pull request #72 from Eco-Flow/patch_bug
minor_patch
Frozen
v2.0.4
c8a1070
v2.0.3 Created 28th Mar 2026 at 07:06 by Chris Wyatt
Merge pull request #71 from Eco-Flow/transcriptGOs
Added
- Software versions now collected via topic: versions and written to pipeline_info/software_versions.yml
- EGGNOG_TO_GO now outputs isoform-level GO annotations (*.isoform_go.txt) in addition to gene-level
- versions topic emit added to RENAME_FASTA, EGGNOG_TO_GO, EGGNOG_TO_OG_GO, RESCALE_TREE, SUMMARIZE_CHROMO_GO, CAFE, and CAFE_PLOT modules
Fixed
- QUAST now receives genome assembly FASTA instead of protein FASTA
- Removed orphaned heredoc version blocks from CAFE script
Changed
- EGGNOG_TO_GO now receives pre-filtered GFF (AGAT_CONVERTSPGXF2GXF.out.output_gff) to capture all isoforms
- Replaced deprecated CUSTOM_DUMPSOFTWAREVERSIONS module with native topic: versions approach
- Legacy unused modules moved to modules/local/legacy_modules/
Removed
- Legacy modules and removed old input config input types
Frozen
v2.0.3
47ff519
v2.0.2 (earliest) Created 26th Mar 2026 at 22:22 by Chris Wyatt
Merge pull request #69 from Eco-Flow/Chromo_go_parellelise
Add option flags for eggnogmapper, main ones
Frozen
v2.0.2
a60621a
Created: 26th Mar 2026 at 22:22
Last updated: 12th Apr 2026 at 11:33
https://orcid.org/0000-0001-8033-2213