Module for integrative Mobilome prediction
Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. These mobile genetic elements can be classified into three main categories: plasmids, phages, and integrons. Autonomous elements are those capable of excising themselves from the chromosome, reintegrating elsewhere, and potentially modifying the host's physiology. Small integrative elements like insertion sequences usually contain one or two genes and are frequently present in multiple copies in the genome, whereas large elements like integrative conjugative elements, often carry multiple cargo genes. The acquisition of large mobile genetic elements may provide genes for defence against other mobile genetic elements or impart new metabolic capabilities to the host.
MoMofy is a wraper that integrates the ouptput of different tools designed for the prediction of autonomous integrative mobile genetic elements in prokaryotic genomes and metagenomes.
This workflow is built using Nextflow. It uses Singularity containers making installation trivial and results highly reproducible. Explained in this section section, there is one manual step required to build the singularity image for ICEfinder, as we can't distribute that software due to license issues.
- Install Nextflow version >=21.10
- Install Singularity
MoMofy install and dependencies
To install MoMofy, clone this repo by:
$ git clone https://github.com/EBI-Metagenomics/momofy.git
The mobileOG-database is required to run an extra step of annotation on the mobilome coding sequences. The first time you run MoMofy, you will need to download the Beatrix 1.6 v1 database, move the tarball to
/PATH/momofy/databases, decompress it, and run the script to format the db for diamond:
$ mv beatrix-1-6_v1_all.zip /PATH/momofy/databases $ cd /PATH/momofy/databases $ unzip beatrix-1-6_v1_all.zip $ nextflow run /PATH/momofy/format_mobileOG.nf
Most of the tools are available on quay.io and no install is needed.
In the case of ICEfinder, you will need to contact the author to get a copy of the software, visit the ICEfinder website for more information. Once you have the
ICEfinder_linux.tar.gz tarball, move it to
momofy/templates and build the singularity image using the following command:
$ mv ICEfinder_linux.tar.gz /PATH/momofy/templates/ $ cd /PATH/momofy/templates/ $ sudo singularity build ../../singularity/icefinder-v1.0-local.sif icefinder-v1.0-local.def
PaliDIS is an optional step on the workflow and the install is optional as well. Visit PaliDIS repo for installing instructions.
If you are aim to run the pipeline in a system with jobs scheduler as LSF or SGE, set up a config file and provide it as part of the arguments as follows:
$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta -c /PATH/configs/some_cluster.config
You can find an example in the
configs directory of this repo.
Running the tool with
--help option will display the following message:
$ nextflow run /PATH/momofy/momofy.nf --help N E X T F L O W ~ version 21.10.0 Launching `momofy.nf` [gigantic_pare] - revision: XXXXX MoMofy is a wraper that integrates the ouptput of different tools designed for the prediction of autonomous integrative mobile genetic elements in prokaryotic genomes and metagenomes. Usage: The basic command for running the pipeline is as follows: nextflow run momofy.nf --assembly contigs.fasta Mandatory arguments: --assembly (Meta)genomic assembly in fasta format (uncompress) Optional arguments: --user_genes User annotation files. See --prot_fasta and --prot_gff [ default = false ] --prot_gff Annotation file in GFF3 format. Mandatory with --user_genes true --prot_fasta Fasta file of aminoacids. Mandatory with --user_genes true --palidis Incorporate PaliDIS predictions to final output [ default = false ] --palidis_fasta Fasta file of PaliDIS insertion sequences. Mandatory with --palidis true --palidis_info Information file of PaliDIS insertion sequences. Mandatory with --palidis true --gff_validation Run a step of format validation on the GFF3 file output [ default = true ] --outdir Output directory to place final MoMofy results [ default = MoMofy_results ] --help This usage statement [ default = false ]
To run MoMofy in multiple samples, create a directory per sample and launch the tool from the sample directory. The only mandatory input is the (meta)genomic assembly file in fasta format (uncompress).
$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta
Note that the final output in gff format is created by adding information to PROKKA output. If you have your own protein prediction files, provide the gff and the fasta file of amino acid sequences (both uncompressed files are mandatory with this option). These files will be used for Diamond annotation and CDS coordinates mapping to the MGEs boundaries. If any original annotation is present in the gff file, it will remained untouched.
Running MoMofy with user's genes prediction:
$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \ --user_genes true \ --prot_fasta proteins.faa \ --prot_gff annotation.gff \
If you want to incorporate PaliDIS predictions to the final output, provide the path of the two outputs of PaliDIS (fasta file of insertion sequences and the information for each insertion sequence file).
To run MoMofy incorporating PaliDIS results:
$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \ --palidis true \ --palidis_fasta insertion_sequences.fasta \ --palidis_info insertion_sequences_info.txt \
Then, if you have protein files and PaliDIS outputs, you can run:
$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \ --user_genes true \ --prot_fasta proteins.faa \ --prot_gff annotation.gff \ --palidis true \ --palidis_fasta insertion_sequences.fasta \ --palidis_info insertion_sequences_info.txt \
A GFF validation process is used to detect formatting errors in the final GFF3 output. This process can be skipped adding
--gff_validation false to the command.
Results will be written by default in the
MoMofy_results directory inside the sample dir unless the user define
--outdir option. There you will find the following output files:
MoMofy_results/ ├── discarded_mge.txt ├── momofy_predictions.fna ├── momofy_predictions.gff └── nested_integrons.txt
The main MoMofy output files are the
momofy_predictions.fna containing the nucleotide sequences of every prediction, and the
momofy_predictions.gff containing the mobilome annotation plus any other feature annotated by PROKKA or in the gff file provided by the user with the option
--user_genes. The labels used in the Type column of the gff file corresponds to the following nomenclature according to the Sequence Ontology resource:
|Type in gff file||Sequence ontology ID||Element description||Reporting tool|
|insertion_sequence||SO:0000973||Insertion sequence||ISEScan, PaliDIS|
|terminal_inverted_repeat_element||SO:0000481||Terminal Inverted Repeat (TIR) flanking insertion sequences||ISEScan, PaliDIS|
|integron||SO:0000365||Integrative mobilizable element||IntegronFinder, ICEfinder|
|attC_site||SO:0000950||Integration site of DNA integron||IntegronFinder|
|conjugative_transposon||SO:0000371||Integrative Conjugative Element||ICEfinder|
|direct_repeat||SO:0000314||Flanking regions on mobilizable elements||ICEfinder|
discarded_mge.txt contains a list of predictions that were discarded, along with the reason for their exclusion. Possible reasons include:
- overlapping For insertion sequences only, ISEScan prediction is discarded if an overlap with PaliDIS is found.
- mge<500bp Discarded by length.
- no_cds If there are no genes encoded in the prediction.
nested_integrons.txt is a report of overlapping predictions reported by IntegronFinder and ICEfinder. No predictions are discarded in this case.
Additionally, you will see the directories containing the main outputs of each tool.
Nextflow tests are executed with nf-test. It takes around 3 min in executing.
$ cd /PATH/momofy $ nf-test test *.nf.test
MoMofy performance was profiled using 460 public metagenomic assemblies and co-assemblies of chicken gut (ERP122587, ERP125074, and ERP131894) with sizes ranging from ~62 K to ~893 M assembled bases. We used the metagenomic assemblies, CDS prediction and annotation files generated by MGnify v5 pipeline, and PaliDIS outputs generated after downsampling the number of reads to 10 M. MoMofy was run adding the following options:
-with-report -with-trace -with-timeline timeline.out.
If you use MoMofy on your data analysis, please cite:
MoMofy is a wrapper that integrates the output of the following tools and DBs:
- ISEScan v22.214.171.124 Xie et al., Bioinformatics, 2017
- IntegronFinder2 v2.0.2 Néron et al., Microorganisms, 2022
- ICEfinder v1.0 Liu et al., Nucleic Acids Research, 2019
- PaliDIS Carr et al., biorxiv, 2022
- MobileOG-DB Beatrix 1.6 v1 Brown et al., Appl Environ Microbiol, 2022
main @ 9ed4ca9 (earliest) Created 6th Apr 2023 at 10:40 by Alejandra Escobar
Merge pull request #5 from EBI-Metagenomics/dev
MoMofy to public
Created: 6th Apr 2023 at 10:40
Last updated: 6th Apr 2023 at 10:50