# Mobilome Annotation Pipeline (formerly MoMofy)

Bacteria can acquire genetic material through horizontal gene transfer, allowing them to rapidly adapt to changing environmental conditions. These mobile genetic elements can be classified into three main categories: plasmids, phages, and integrative elements. Plasmids are mostly extrachromosomal; phages can be found extrachromosomally or as temperate phages (prophages); whereas integrons are stably inserted in the chromosome. Autonomous elements are those integrative elements capable of excising themselves from the chromosome and reintegrating elsewhere. They can use a transposase (like insertion sequences and transposons) or an integrase/excisionase (like ICEs and IMEs).

The Mobilome Annotation Pipeline is a wrapper that integrates the output of different tools designed for the prediction of plasmids, phages, insertion sequences, and other autonomous integrative mobile genetic elements such as ICEs, IMEs and integrons in prokaryotic genomes and metagenomes. The output is a PROKKA gff file with extra entries for the mobilome.

## Contents

- [Workflow](#wf)
- [Setup](#sp)
- [Install and dependencies](#install)
- [Usage](#usage)
- [Inputs](#in)
- [Outputs](#out)
- [Tests](#test)
- [Citation](#cite)

## Workflow

This workflow has the following main subworkflows:

- Preprocessing: Rename and filter contigs, and run PROKKA annotation.
- Prediction: Run geNomad, ICEfinder, IntegronFinder, and ISEScan.
- Annotation: Generate extra annotation for antimicrobial resistance genes (AMRFinderPlus) and other mobilome-related proteins (MobileOG).
- Integration: Parse and integrate the outputs generated by the `Prediction` and `Annotation` subworkflows. Optional VIRify v3.0.0 results can be incorporated in this step. MGEs shorter than 500 bp and predictions with no genes are discarded.
- Postprocessing: Write the mobilome fasta file, write a report of the location of AMR genes (either mobilome or chromosome), and generate three new GFF files:
  1. `mobilome_clean.gff`: mobilome + associated CDSs
  2. `mobilome_extra.gff`: mobilome + ViPhOGs/mobileOG annotated genes (note that ViPhOG annotation is generated by VIRify)
  3. `mobilome_nogenes.gff`: mobilome only

  The output `mobilome_nogenes.gff` is validated in this subworkflow.

## Setup

This workflow is built using [Nextflow](https://www.nextflow.io/). It uses Singularity containers, which keeps installation simple and results highly reproducible. As explained in this section, one manual step is required to build the Singularity image for [ICEfinder](https://bioinfo-mml.sjtu.edu.cn/ICEfinder/index.php), as we can't distribute that software due to license issues.

- Install [Nextflow version >=21.10](https://www.nextflow.io/docs/latest/getstarted.html#installation)
- Install [Singularity](https://github.com/apptainer/singularity/blob/master/INSTALL.md)

## Install and dependencies

To get a copy of the Mobilome Annotation Pipeline, clone this repo:

```bash
$ git clone https://github.com/EBI-Metagenomics/mobilome-annotation-pipeline.git
```

The mobileOG-database is required to run an extra step of annotation on the mobilome coding sequences.
The first time you run the Mobilome Annotation Pipeline, you will need to download the [Beatrix 1.6 v1](https://mobileogdb.flsi.cloud.vt.edu/entries/database_download) database, move the zip archive to `mobilome-annotation-pipeline/databases`, decompress it, and run the script to format the db for diamond:

```bash
$ mv beatrix-1-6_v1_all.zip /PATH/mobilome-annotation-pipeline/databases
$ cd /PATH/mobilome-annotation-pipeline/databases
$ unzip beatrix-1-6_v1_all.zip
$ nextflow run /PATH/mobilome-annotation-pipeline/format_mobileOG.nf
```

Two additional databases need to be manually downloaded and extracted: the [AMRFinder plus db](https://ftp.ncbi.nlm.nih.gov/pathogen/Antimicrobial_resistance/AMRFinderPlus/database/latest) and the [geNomad database](https://zenodo.org/records/8339387). You can then provide the paths to your databases using `mobileog_db`, `amrfinder_plus_db` and `genomad_db`, respectively, when you run the pipeline.

Most of the tools are available on [quay.io](https://quay.io) and no install is needed. However, in the case of ICEfinder, you will need to contact the author to get a copy of the software. Visit the [ICEfinder website](https://bioinfo-mml.sjtu.edu.cn/ICEfinder/download.html) for more information. Once you have the `ICEfinder_linux.tar.gz` tarball, move it to `mobilome-annotation-pipeline/templates` and build the Singularity image using the following commands:

```bash
$ mv ICEfinder_linux.tar.gz /PATH/mobilome-annotation-pipeline/templates/
$ cd /PATH/mobilome-annotation-pipeline/templates/
$ sudo singularity build ../../singularity/icefinder-v1.0-local.sif icefinder-v1.0-local.def
```

The path to the ICEfinder image needs to be provided when running the pipeline, unless a custom config file is created.

## Inputs

To run the Mobilome Annotation Pipeline on multiple samples, prepare a samplesheet with your input data, as in the following example.
Note that `virify_gff` is an optional input for this pipeline, generated with the [VIRify](https://github.com/EBI-Metagenomics/emg-viral-pipeline) v3.0.0 tool.

`samplesheet.csv`:

```csv
sample,assembly,user_proteins_gff,virify_gff
minimal,/PATH/assembly.fasta,,
assembly_proteins,/PATH/assembly.fasta,/PATH/proteins.gff,
assembly_proteins_virify,/PATH/assembly.fasta,/PATH/proteins.gff,/PATH/virify_out.gff
```

Each row represents a sample. The minimal input is the (meta)genome assembly in fasta format.

Basic run:

```bash
$ nextflow run /PATH/mobilome-annotation-pipeline/main.nf --input samplesheet.csv [--icefinder_sif icefinder-v1.0-local.sif]
```

Note that the final output in gff format is created by adding information to the PROKKA output. If you have your own protein prediction files, provide the path to the uncompressed gff file in the samplesheet.csv. This file will be used to generate a `user_mobilome_extra.gff` file containing the mobilome plus any extra annotation generated in the annotation subworkflow.

If you want to integrate VIRify results into the final output, provide the path to the GFF file generated by VIRify v3.0.0 in your samplesheet.csv.

## Outputs

Results are written to the `mobilome_results` directory by default, unless the `--outdir` option is used. There, you will find the following outputs:

```bash
mobilome_results/
├── mobilome.fasta
├── mobilome_prokka.gff
├── overlapping_integrons.txt
├── discarded_mge.txt
├── func_annot/
├── gff_output_files/
├── prediction/
└── preprocessing
```

The AMRFinderPlus results are generated by default. The `func_annot/amr_location.txt` file contains a summary of the AMR genes annotated and their location (either mobilome or chromosome).

The file `discarded_mge.txt` contains a list of predictions that were discarded, along with the reason for their exclusion. Possible reasons include:

1. 'mge < 500bp': discarded by length.
2. 'no_cds': there are no genes encoded in the prediction.

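
The two discard rules listed above could be expressed roughly as follows. This is only an illustrative Python sketch, not the pipeline's actual code: the function name `discard_reason`, the constant `MIN_MGE_LENGTH`, the example records, the use of 1-based inclusive coordinates, and the order in which the rules are checked are all assumptions. Only the 500 bp threshold and the two reason labels are taken from this README.

```python
from typing import Optional

# Threshold taken from the README: MGEs shorter than 500 bp are discarded.
MIN_MGE_LENGTH = 500


def discard_reason(start: int, end: int, n_cds: int) -> Optional[str]:
    """Return the discard label for a prediction, or None if it is kept.

    Assumes 1-based, inclusive coordinates (as in GFF) and that the
    length rule is checked before the CDS rule; both are assumptions.
    """
    length = end - start + 1
    if length < MIN_MGE_LENGTH:
        return "mge < 500bp"
    if n_cds == 0:
        return "no_cds"
    return None


# Hypothetical predictions: (name, start, end, number of CDSs inside)
predictions = [
    ("short_element", 100, 450, 1),
    ("empty_element", 1, 12000, 0),
    ("kept_element", 2000, 41000, 35),
]

for name, start, end, n_cds in predictions:
    reason = discard_reason(start, end, n_cds)
    status = f"discarded ({reason})" if reason else "kept"
    print(f"{name}\t{status}")
```

In this sketch a prediction of exactly 500 bp is kept, since only elements strictly shorter than the threshold are treated as too short.
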

The file `overlapping_integrons.txt` is a report of long-MGEs with overlapping coordinates. No predictions are discarded in this case.

The main output files containing the mobilome predictions are `mobilome.fasta`, containing the nucleotide sequences of every prediction, and `mobilome_prokka.gff`, containing the mobilome annotation plus any other feature annotated by PROKKA, mobileOG, or ViPhOG (only when VIRify results are provided).

The mobilome prediction IDs are built as follows:

1. Contig ID
2. MGE type:
   - flanking_site
   - recombination_site
   - prophage
   - viral_sequence
   - plasmid
   - phage_plasmid
   - integron
   - conjugative_integron
   - insertion_sequence
3. Start and end coordinates separated by ':'

Example:

```bash
>contig_id|mge_type-start:end
```

Any CDS with a coverage >= 0.9 within the boundaries of a predicted MGE is considered part of the mobilome and labelled accordingly in the attributes field under the key `location`.

The labels used in the Type column of the gff file correspond to the following nomenclature, following the [Sequence Ontology resource](http://www.sequenceontology.org/browser/current_svn/term/SO:0000001) when possible:

| Type in gff file                 | Sequence ontology ID                                                              | Element description                                         | Reporting tool            |
| -------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------- | ------------------------- |
| insertion_sequence               | [SO:0000973](http://www.sequenceontology.org/browser/current_svn/term/SO:0000973) | Insertion sequence                                          | ISEScan, PaliDIS          |
| terminal_inverted_repeat_element | [SO:0000481](http://www.sequenceontology.org/browser/current_svn/term/SO:0000481) | Terminal Inverted Repeat (TIR) flanking insertion sequences | ISEScan, PaliDIS          |
| integron                         | [SO:0000365](http://www.sequenceontology.org/browser/current_svn/term/SO:0000365) | Integrative mobilizable element                             | IntegronFinder, ICEfinder |
| attC_site                        | [SO:0000950](http://www.sequenceontology.org/browser/current_svn/term/SO:0000950) | Integration site of DNA integron                            | IntegronFinder            |
| conjugative_integron             | [SO:0000371](http://www.sequenceontology.org/browser/current_svn/term/SO:0000371) | Integrative Conjugative Element                             | ICEfinder                 |
| direct_repeat                    | [SO:0000314](http://www.sequenceontology.org/browser/current_svn/term/SO:0000314) | Flanking regions on mobilizable elements                    | ICEfinder                 |
| prophage                         | [SO:0001006](http://www.sequenceontology.org/browser/current_svn/term/SO:0001006) | Temperate phage                                             | geNomad, VIRify           |
| viral_sequence                   | [SO:0001041](http://www.sequenceontology.org/browser/current_svn/term/SO:0001041) | Viral genome fragment                                       | geNomad, VIRify           |
| plasmid                          | [SO:0000155](http://www.sequenceontology.org/browser/current_svn/term/SO:0000155) | Plasmid                                                     | geNomad                   |

## Tests

Nextflow tests are executed with [nf-test](https://github.com/askimed/nf-test). They take around 3 minutes to run.

Run:

```bash
$ cd mobilome-annotation-pipeline/
$ nf-test test
```

## Citation

The Mobilome Annotation Pipeline parses and integrates the output of the following tools and DBs, sorted alphabetically:

- AMRFinderPlus v3.11.4 with database v2023-02-23.1 [Feldgarden et al., Sci Rep, 2021](https://doi.org/10.1038/s41598-021-91456-0)
- Diamond v2.0.12 [Buchfink et al., Nature Methods, 2021](https://doi.org/10.1038/s41592-021-01101-x)
- geNomad v1.6.1 [Camargo et al., Nature Biotechnology, 2023](https://doi.org/10.1038/s41587-023-01953-y)
- ICEfinder v1.0 [Liu et al., Nucleic Acids Res, 2019](https://doi.org/10.1093/nar/gky1123)
- IntegronFinder2 v2.0.2 [Néron et al., Microorganisms, 2022](https://doi.org/10.3390/microorganisms10040700)
- ISEScan v1.7.2.3 [Xie et al., Bioinformatics, 2017](https://doi.org/10.1093/bioinformatics/btx433)
- MobileOG-DB Beatrix 1.6 v1 [Brown et al., Appl Environ Microbiol, 2022](https://doi.org/10.1128/aem.00991-22)
- PROKKA v1.14.6 [Seemann, Bioinformatics, 2014](https://doi.org/10.1093/bioinformatics/btu153)
- VIRify v3.0.0 [Rangel-Pineros et al., PLoS Comput Biol, 2023](https://doi.org/10.1371/journal.pcbi.1011422)