MoMofy: Module for integrative Mobilome prediction
Version 1

Workflow Type: Nextflow
Work-in-progress

Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. The mobile genetic elements (MGEs) involved can be classified into three main categories: plasmids, phages, and integrons. Autonomous elements are those capable of excising themselves from the chromosome, reintegrating elsewhere, and potentially modifying the host's physiology. Small integrative elements, like insertion sequences, usually contain one or two genes and are frequently present in multiple copies in the genome, whereas large elements, like integrative conjugative elements, often carry multiple cargo genes. The acquisition of large mobile genetic elements may provide genes for defence against other mobile genetic elements or impart new metabolic capabilities to the host.

MoMofy is a wrapper that integrates the output of different tools designed for the prediction of autonomous integrative mobile genetic elements in prokaryotic genomes and metagenomes.

Setup

This workflow is built using Nextflow. It uses Singularity containers, which keeps installation simple and results highly reproducible. One manual step, explained in this section, is required: building the Singularity image for ICEfinder, which we cannot distribute because of licensing restrictions.

MoMofy install and dependencies

To install MoMofy, clone this repository:

$ git clone https://github.com/EBI-Metagenomics/momofy.git

The mobileOG-database is required to run an extra annotation step on the mobilome coding sequences. The first time you run MoMofy, you will need to download the Beatrix 1.6 v1 database, move the zip archive to /PATH/momofy/databases, decompress it, and run the script that formats the database for DIAMOND:

$ mv beatrix-1-6_v1_all.zip /PATH/momofy/databases
$ cd /PATH/momofy/databases
$ unzip beatrix-1-6_v1_all.zip
$ nextflow run /PATH/momofy/format_mobileOG.nf

Most of the tools are available as containers on quay.io, so no installation is needed.

In the case of ICEfinder, you will need to contact the author to get a copy of the software; visit the ICEfinder website for more information. Once you have the ICEfinder_linux.tar.gz tarball, move it to momofy/templates and build the Singularity image using the following commands:

$ mv ICEfinder_linux.tar.gz /PATH/momofy/templates/
$ cd /PATH/momofy/templates/
$ sudo singularity build ../../singularity/icefinder-v1.0-local.sif icefinder-v1.0-local.def

PaliDIS is an optional step of the workflow, so installing it is optional as well. Visit the PaliDIS repository for installation instructions.

If you aim to run the pipeline on a system with a job scheduler such as LSF or SGE, set up a config file and pass it as an argument as follows:

$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta -c /PATH/configs/some_cluster.config

You can find an example in the configs directory of this repo.

Usage

Running the tool with the --help option will display the following message:

$ nextflow run /PATH/momofy/momofy.nf --help
N E X T F L O W  ~  version 21.10.0
Launching `momofy.nf` [gigantic_pare] - revision: XXXXX

	MoMofy is a wraper that integrates the ouptput of different tools designed for the prediction of autonomous integrative mobile genetic elements in prokaryotic genomes and metagenomes.

        Usage:
         The basic command for running the pipeline is as follows:

         nextflow run momofy.nf --assembly contigs.fasta

         Mandatory arguments:
          --assembly                     (Meta)genomic assembly in fasta format (uncompress)

         Optional arguments:
          --user_genes                    User annotation files. See --prot_fasta and --prot_gff [ default = false ]
          --prot_gff                      Annotation file in GFF3 format. Mandatory with --user_genes true
          --prot_fasta                    Fasta file of aminoacids. Mandatory with --user_genes true
          --palidis                       Incorporate PaliDIS predictions to final output [ default = false ]
          --palidis_fasta                 Fasta file of PaliDIS insertion sequences. Mandatory with --palidis true
          --palidis_info                  Information file of PaliDIS insertion sequences. Mandatory with --palidis true
          --gff_validation                Run a step of format validation on the GFF3 file output [ default = true ]
          --outdir                        Output directory to place final MoMofy results [ default = MoMofy_results ]
          --help                          This usage statement [ default = false ]

Inputs

To run MoMofy on multiple samples, create a directory per sample and launch the tool from the sample directory. The only mandatory input is the (meta)genomic assembly file in FASTA format (uncompressed).

Basic run:

$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta
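The per-sample layout described above lends itself to a small batch launcher. The following is a minimal sketch in Python, not part of MoMofy: it assumes one subdirectory per sample, each holding an uncompressed assembly named contigs.fasta, and the function names momofy_command and run_samples are hypothetical.

```python
import subprocess
from pathlib import Path


def momofy_command(pipeline, assembly="contigs.fasta"):
    """Build the basic MoMofy command line as an argument list."""
    return ["nextflow", "run", str(pipeline), "--assembly", assembly]


def run_samples(samples_root, pipeline, dry_run=True):
    """Launch MoMofy once per sample directory, using that directory as cwd.

    Assumption (not from the MoMofy docs): every sample directory contains
    a file called contigs.fasta. With dry_run=True the commands are only
    collected and returned, nothing is executed.
    """
    planned = []
    for sample_dir in sorted(Path(samples_root).iterdir()):
        if not (sample_dir / "contigs.fasta").is_file():
            continue  # skip entries that do not look like a sample directory
        cmd = momofy_command(pipeline)
        planned.append((sample_dir.name, cmd))
        if not dry_run:
            # Run from inside the sample directory, as the README recommends.
            subprocess.run(cmd, cwd=sample_dir, check=True)
    return planned
```

Calling run_samples("samples", "/PATH/momofy/momofy.nf") first with the default dry_run=True lets you inspect the planned commands before launching anything.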

Note that the final output in GFF format is created by adding information to the PROKKA output. If you have your own protein prediction files, provide the GFF file and the FASTA file of amino acid sequences (both files are mandatory with this option and must be uncompressed). These files will be used for DIAMOND annotation and for mapping CDS coordinates to the MGE boundaries. Any original annotation present in the GFF file will remain untouched.

Running MoMofy with user-provided gene predictions:

$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \
    --user_genes true \
    --prot_fasta proteins.faa \
    --prot_gff annotation.gff

If you want to incorporate PaliDIS predictions into the final output, provide the paths to the two PaliDIS outputs: the FASTA file of insertion sequences and the information file describing each insertion sequence.

To run MoMofy incorporating PaliDIS results:

$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \
    --palidis true \
    --palidis_fasta insertion_sequences.fasta \
    --palidis_info insertion_sequences_info.txt

If you have both protein files and PaliDIS outputs, you can combine the options:

$ nextflow run /PATH/momofy/momofy.nf --assembly contigs.fasta \
    --user_genes true \
    --prot_fasta proteins.faa \
    --prot_gff annotation.gff \
    --palidis true \
    --palidis_fasta insertion_sequences.fasta \
    --palidis_info insertion_sequences_info.txt

A GFF validation process is used to detect formatting errors in the final GFF3 output. This process can be skipped by adding --gff_validation false to the command.
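The option pairings described in this section (protein FASTA and GFF together, PaliDIS FASTA and information file together) can be checked before launching Nextflow. The helper below is a sketch of that idea under stated assumptions; build_momofy_args is a hypothetical name and is not shipped with MoMofy. It only uses the options listed in the help message.

```python
def build_momofy_args(assembly, prot_fasta=None, prot_gff=None,
                      palidis_fasta=None, palidis_info=None,
                      gff_validation=True, outdir=None):
    """Return the option list to append to `nextflow run momofy.nf`.

    Enforces the pairings stated in the help text: --prot_fasta and
    --prot_gff go with --user_genes true; --palidis_fasta and
    --palidis_info go with --palidis true.
    """
    args = ["--assembly", assembly]

    if prot_fasta or prot_gff:
        if not (prot_fasta and prot_gff):
            raise ValueError("--user_genes needs both --prot_fasta and --prot_gff")
        args += ["--user_genes", "true",
                 "--prot_fasta", prot_fasta, "--prot_gff", prot_gff]

    if palidis_fasta or palidis_info:
        if not (palidis_fasta and palidis_info):
            raise ValueError("--palidis needs both --palidis_fasta and --palidis_info")
        args += ["--palidis", "true",
                 "--palidis_fasta", palidis_fasta, "--palidis_info", palidis_info]

    if not gff_validation:
        args += ["--gff_validation", "false"]  # validation is on by default
    if outdir:
        args += ["--outdir", outdir]
    return args
```

The returned list can be appended to ["nextflow", "run", "/PATH/momofy/momofy.nf"] and passed to subprocess.run; failing early on a missing file pair avoids starting a run that cannot complete.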

Outputs

By default, results are written to the MoMofy_results directory inside the sample directory, unless the user sets the --outdir option. There you will find the following output files:

MoMofy_results/
├── discarded_mge.txt
├── momofy_predictions.fna
├── momofy_predictions.gff
└── nested_integrons.txt

The main MoMofy output files are momofy_predictions.fna, which contains the nucleotide sequence of every prediction, and momofy_predictions.gff, which contains the mobilome annotation plus any other feature annotated by PROKKA or present in the GFF file provided by the user with the --user_genes option. The labels used in the Type column of the GFF file correspond to the following Sequence Ontology terms:

| Type in GFF file | Sequence Ontology ID | Element description | Reporting tool |
|---|---|---|---|
| insertion_sequence | SO:0000973 | Insertion sequence | ISEScan, PaliDIS |
| terminal_inverted_repeat_element | SO:0000481 | Terminal Inverted Repeat (TIR) flanking insertion sequences | ISEScan, PaliDIS |
| integron | SO:0000365 | Integrative mobilizable element | IntegronFinder, ICEfinder |
| attC_site | SO:0000950 | Integration site of DNA integron | IntegronFinder |
| conjugative_transposon | SO:0000371 | Integrative Conjugative Element | ICEfinder |
| direct_repeat | SO:0000314 | Flanking regions on mobilizable elements | ICEfinder |
| CDS | SO:0000316 | Coding sequence | Prodigal |
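As a quick check of a finished run, the Type labels listed above can be tallied directly from momofy_predictions.gff. The following sketch relies only on the general GFF3 layout (nine tab-separated columns, feature type in the third, an optional ##FASTA section at the end); the function name count_feature_types is hypothetical and the code is not part of MoMofy.

```python
from collections import Counter

# Mobilome labels from the nomenclature above (CDS is left out on purpose).
MOBILOME_TYPES = {
    "insertion_sequence", "terminal_inverted_repeat_element", "integron",
    "attC_site", "conjugative_transposon", "direct_repeat",
}


def count_feature_types(gff_lines):
    """Count features per Type (third column) in an iterable of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not annotation
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 feature line
        counts[fields[2]] += 1
    return counts
```

For example, opening MoMofy_results/momofy_predictions.gff and passing the file handle to count_feature_types returns a Counter; keeping only the keys found in MOBILOME_TYPES gives the number of predicted elements per category.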

The file discarded_mge.txt contains a list of predictions that were discarded, along with the reason for their exclusion. Possible reasons include:

  1. overlapping: For insertion sequences only, an ISEScan prediction is discarded if it overlaps a PaliDIS prediction.
  2. mge<500bp: The prediction is shorter than 500 bp.
  3. no_cds: No genes are encoded in the prediction.
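The three discard reasons above can be read as a simple decision function. The sketch below is an illustration only and is not MoMofy's implementation: the order of the checks, the use of 1-based inclusive coordinates, treating any shared base as an overlap, and requiring a CDS to lie fully inside the prediction are all assumptions, and the names Region and discard_reason are hypothetical.

```python
from typing import NamedTuple, Optional

MIN_LENGTH = 500  # threshold implied by the "mge<500bp" label


class Region(NamedTuple):
    contig: str
    start: int  # assumed 1-based, inclusive (GFF-style)
    end: int
    tool: str = ""
    kind: str = ""


def overlaps(a, b):
    """True if two regions on the same contig share at least one base."""
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def contains(outer, inner):
    """True if inner lies entirely within outer on the same contig."""
    return (outer.contig == inner.contig
            and outer.start <= inner.start and inner.end <= outer.end)


def discard_reason(mge, cds, palidis) -> Optional[str]:
    """Return a discard label for one prediction, or None to keep it."""
    # Assumed rule order: overlap check first, then length, then CDS content.
    if (mge.kind == "insertion_sequence" and mge.tool == "ISEScan"
            and any(overlaps(mge, p) for p in palidis)):
        return "overlapping"
    if mge.end - mge.start + 1 < MIN_LENGTH:
        return "mge<500bp"
    if not any(contains(mge, c) for c in cds):
        return "no_cds"
    return None
```

Under these assumptions, a 301 bp integron prediction would be labelled mge<500bp, while a 1,000 bp prediction enclosing at least one CDS would be kept.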

The file nested_integrons.txt lists overlapping predictions reported by IntegronFinder and ICEfinder. No predictions are discarded in this case.

Additionally, you will see the directories containing the main outputs of each tool.

Tests

Nextflow tests are executed with nf-test. The test suite takes around 3 minutes to run.

Run:

$ cd /PATH/momofy
$ nf-test test *.nf.test

Performance

MoMofy performance was profiled using 460 public metagenomic assemblies and co-assemblies of chicken gut (ERP122587, ERP125074, and ERP131894) with sizes ranging from ~62 K to ~893 M assembled bases. We used the metagenomic assemblies, CDS prediction and annotation files generated by the MGnify v5 pipeline, and PaliDIS outputs generated after downsampling the number of reads to 10 M. MoMofy was run with the following Nextflow options: -with-report -with-trace -with-timeline timeline.out.

Citation

If you use MoMofy in your data analysis, please cite:

XXXXX

MoMofy is a wrapper that integrates the output of the following tools and DBs:

  1. ISEScan v1.7.2.3 Xie et al., Bioinformatics, 2017
  2. IntegronFinder2 v2.0.2 Néron et al., Microorganisms, 2022
  3. ICEfinder v1.0 Liu et al., Nucleic Acids Research, 2019
  4. PaliDIS Carr et al., bioRxiv, 2022

Databases:

Version History

main @ 9ed4ca9 (earliest) Created 6th Apr 2023 at 10:40 by Alejandra Escobar

Merge pull request #5 from EBI-Metagenomics/dev

MoMofy to public


Created: 6th Apr 2023 at 10:40

Last updated: 6th Apr 2023 at 10:50
