ANNOTATO - Annotation workflow To Annotate Them Oll
- ANNOTATO - Annotation workflow To Annotate Them Oll
Overview of the workflow
The pipeline is based on Funannotate
or BRAKER
and was initially developed and tested on the two datasets:
- Drosophila melanogaster: https://doi.org/10.5281/zenodo.8013373
- Pocillopora cf. effusa: https://www.ncbi.nlm.nih.gov/biosample/26809107
Then, it was further tested on these species during the BioHackathon 2023 - project 20
- Helleia helle
- Homo sapiens chrom 19
- Melampus jaumei
- Phakellia ventilabrum
- Trifolium dubium
Input data
- Reference genome
genome.[.fna, .fa, .fasta][.gz]
- RNAseq data listed in a metadata csv file. Input type can be mixed between long and short reads, with the option of single-end read. The input file should follow the format below:
sample_id,R1_path,R2_path,read_type
SAM1,/path/to/R1,,long # For long reads
SAM2,/path/to/R1,/path/to/R2,short # For PE reads
SAM3,/path/to/R1,,short # For SE reads
- Protein sequence data in fasta format, could be gzip or not
Pipeline steps
The main pipeline is divided into five different subworkflows.
Preprocess RNA
is where the input RNASeq data are QC and trimmed.Process RNA Minimap
is triggered when long reads FastQ are in the input CSV file.Process RNA STAR
will run when short reads FastQ are in the input CSV.Genome Masking
runs by default if not skipped. It assumes the input genome fasta is not masked and will run Denovo repeat masking with RepeatModeler and RepeatMasker.Filter Repeat
whenever there is a Denovo masking step, this sub-workflow will be triggered to remove the repeat sequences that appeared in the Uniprot Swissprot protein data.
Output data
- MultiQC report for the RNASeq data, before and after trimming, mapping rate of short reads, and the BUSCO results of predicted genes.
- RepeatMasker report containing quantity of masked sequence and distribution among TE families
- Protein-coding gene annotation file in gff3 format
- BUSCO summary of annotated sequences
Prerequisites
The following programs are required to run the workflow and the listed version were tested.
nextflow v23.04.0 or higher
singularity
conda
and mamba
(currently, having problem with Funannotate and BRAKER installation)
docker
(have not been tested but in theory should work fine)
Installation
Simply get the code from github or workflowhub and directly use it for the analysis with nextflow
.
git clone https://github.com/ERGA-consortium/pipelines/tree/main/annotation/nextflow
Running ANNOTATO
Before running the pipeline (IMPORTANT)
One thing with Nextflow is that it is running off a Java Virtual Machine (JVM), and it will try to use all available memory for Nextflow even though it is unnecessary (for workflow management and job control). This will cause much trouble if you run a job on an HPC cluster. Thus, to minimize the effect of it, we need to limit the maximum memory the JVM can use.
export NFX_OPTS="-Xms=512m -Xmx=3g"
-Xms
is the lower limit, which is set as 512 MB.
-Xmx
is the upper limit, which in this case is set as 3 GB.
Please modify this according to your situation.
Without RNASeq and protein data
Perform the analysis with only the draft genome and busco database.
nextflow run main.nf --genome /path/to/genome.fasta --species "Abc def" --buscodb 'metazoa'
The workflow will run Denovo repeat masking on the draft genome, then softmask the repeat region and use the genome to run funannotate
. Add --run_braker
to run the genome prediction using BRAKER
instead.
Running ANNOTATO with RNASeq data
When you want to let the workflow run the mapping by itself, uses input.csv
as input with the link to all FASTQ
file.
nextflow run main.nf --genome /path/to/genome.fasta[.gz] --rnaseq /path/to/input.csv --species "Abc def" --buscodb 'metazoa'
Based on the content of the input.csv
file to trigger different RNASeq processing workflows. The output bam
file will then be used for genome prediction.
When reads are mapped to the reference genome, the aligned bam
file can be used as input to the pipeline instead of the raw FASTQ
nextflow run main.nf --genome /path/to/genome.fasta[.gz] --short_rna_bam /path/to/shortreads.bam [--long_rna_bam /path/to/longreads.bam] --species "Abc def" --buscodb 'metazoa'
ATTENTION: One major drawback of the current workflow is that the input genome will be sorted and renamed by the funannotate sort
function. This is because AUGUSTUS
and Funannotate
won't work normally when the header of the input genome is too long and contains weird characters. Therefore, if you want to provide a bam
file as input instead of the raw FASTQ
, please run funannotate sort
on the genome fasta first and then use it as the reference for running alignment. Or in case your genome headers are already shorter than 16 character, please add --skip_rename
when running the pipeline.
Running ANNOTATO with protein data
nextflow run main.nf --genome /path/to/genome.fasta[.gz] --protein /path/to/protein.fasta[.gz] --species "Abc def" --buscodb 'metazoa'
When only protein data is provided, the workflow will run denovo masking then repeat filter with the additional protein data. The masked genome and protein fasta will then be used for gene prediction.
Running ANNOTATO with both protein and RNASeq data
The full pipeline is triggered when both RNASeq data and protein fasta is provided.
nextflow run main.nf --genome /path/to/genome.fasta[.gz] --protein /path/to/protein.fasta[.gz] --rnaseq /path/to/input.csv --species "Abc def" --buscodb 'metazoa'
Running ANNOTATO with params.json
One plus side with Nextflow is that it can use a parameter JSON file called params.json
to start the analysis pipeline with all required parameters. Please modify the content of the params.json
according to your need then run the following command.
nextflow run main.nf -params-file params.json
Other parameters for running the analysis
Compulsory input:
--genome Draft genome fasta file contain the assembled contigs/scaffolds
--species Species name for the annotation pipeline, e.g. "Drosophila melanogaster"
Optional input:
--protein Fasta file containing known protein sequences used as an additional information for gene prediction pipeline.
Ideally this should come from the same species and/or closely related species. [default: null]
--rnaseq A CSV file following the pattern: sample_id,R1_path,R2_path,read_type.
This could be generated using gen_input.py. Run `python gen_input.py --help` for more information.
[default: null]
--long_rna_bam A BAM file for the alignment of long reads (if any) to the draft genome. Noted that the header of the draft
genome need to be renamed first before alignment otherwise it will causes trouble for AUGUSTUS and funannotate.
[default: null]
--short_rna_bam A BAM file for the alignment of short reads (if any) to the draft genome. Noted that the header of the draft
genome need to be renamed first before alignment otherwise it will causes trouble for AUGUSTUS and funannotate.
[default: null]
--knownrepeat Fasta file containing known repeat sequences of the species, this will be used directly for masking
(if --skip_denovo_masking) or in combination with the denovo masking. [default: null]
Output option:
--outdir Output directory.
--tracedir Pipeline information.
--publish_dir_mode Option for nextflow to move data to the output directory. [default: copy]
--tmpdir Database directory.
Funannotate params:
--run_funannotate Whether to use funannotate for gene prediction. [default: true]
--organism Fungal-specific option. Should be change to "fungus" if the annotated organism is fungal. [default: other]
--ploidy Set the ploidy for gene prediction, in case of haploid, a cleaning step will be performed by funannotate to remove
duplicated contigs/scaffold. [default: 2]
--buscodb BUSCO database used for AUGUSTUS training and evaluation. [default: eukaryota]
--buscoseed AUGUSTUS pre-trained species to start BUSCO. Will be override if rnaseq data is provided. [default: null]
Braker params:
--run_braker Whether to use BRAKER for gene prediction. [default: false]
Skipping options:
--skip_rename Skip renaming genome fasta file by funannotate sort.
--skip_all_masking Skip all masking processes, please be sure that your --genome input is soft-masked before triggering this
parameter. [default: false]
--skip_denovo_masking Skip denovo masking using RepeatModeler, this option can only be run when --knownrepeat fasta is provided.
[default: false]
--skip_functional_annotation Skip functional annotation step. [default: false]
--skip_read_preprocessing Skip RNASeq preprocessing step. [default: false]
Execution/Engine profiles:
The pipeline supports profiles to run via different Executers and Engines e.g.: -profile local,conda
Executer (choose one):
local
slurm
Engines (choose one):
conda
mamba
docker
singularity
Per default: -profile slurm,singularity is executed.
Evaluating output GFF to the exon level
We provided a script to analyze the output GFF of ANNOTATO (which also could be applied to the GFF file output of other pipelines) to report the number of exons per mRNA/tRNA. To run this, simply use:
python bin/analyze_exons.py -f ${GFF}
Below is the sample output of this script
INFORMATION REGARDING mRNA
Number of transcripts: 33086
Largest number of exons in all transcripts: 128
Monoexonic transcripts: 4085
Multiexonic transcripts: 29001
Mono:Mult Ratio: 0.14
Boxplot of number of exons per transcript:
Min: 1
25%: 2
50%: 4
75%: 8
Max: 128
Mean: 6.978812790908542
==================================================
INFORMATION REGARDING tRNA
Number of transcripts: 2017
Largest number of exons in all transcripts: 1
Monoexonic transcripts: 2017
Multiexonic transcripts: 0
No multiexonic transcripts, unable to calculate Mono:Mult Ratio
Boxplot of number of exons per transcript:
Min: 1
25%: 1
50%: 1
75%: 1
Max: 1
Mean: 1.0
==================================================
This script was originally written by Katharina Hoff and was modified accordingly to suit the analysis of GFF file.
Performance of the workflow on annotating difference eukaryote genomes
The following table is the result predicted by ANNOTATO on difference species during the Europe BioHackathon 2023.
Species | Genome size | N.Genes | N.Exons | N.mRNA | BUSCO lineage | BUSCO score | OMArk Completeness | OMArk Consistency |
---|---|---|---|---|---|---|---|---|
Drosophila melanogaster | 143M | 14,753 | 57,343 | 14,499 | diptera | C:96.1%[S:95.6%,D:0.5%],F:1.2%,M:2.7% | melanogaster subgroup, C:90.38%[S:84.32%,D:6.06%],M:9.62%,,n:12442 | A:94.21%[P:4.05%,F:7.28%],I:1.61%[P:0.5%,F:0.42%],C:0.00%[P:0.00%,F:0.00%],U:4.19% |
Helleia helle | 547M | 37,367 | 139,302 | 28,445 | lepidoptera | C:74.6%[S:73.4%,D:1.2%],F:5.4%,M:20.0% | Papilionidea, C:82.04%[S:66.12%,D:15.92%],M:17.96%, n:7939 | A:44.78%[P:14.41%,F:6.02%],I:3.53%[P:2.1%,F:0.7%],C:0.00%[P:0.00%,F:0.00%],U:51.69% |
Homo sapiens chrom 19 | 58M | 1,872 | 11,937 | 1,862 | primates | C:5.0%[S:4.8%,D:0.2%],F:0.5%,M:94.5% | Hominidae, C:8.57%[S:7.74%,D:0.83%],M:91.43%, n=17843 | A:87.54%[P:12.73%,F:13.1%],I:4.78%[P:1.5%,F:2.04%],C:0.00%[P:0.00%,F:0.00%],U:7.68% |
Melampus jaumei | 958M | 61,128 | 335,483 | 60,720 | mollusca | C:80.4%[S:67.2%,D:13.2%],F:3.8%,M:15.8% | Lophotrochozoa, C: 92.5%[S: 66.29%, D: 26.21%], M:7.5%, n:2373 | A:41.45%[P:15.72%,F:9.97%],I:15.97%[P:10.68%,F:3.07%],C:0.00%[P:0.00%,F:0.00%],U:42.57% |
Phakellia ventilabrum | 186M | 19,073 | 157,441 | 18,855 | metazoa | C:80.9%[S:79.2%,D:1.7%],F:6.5%,M:12.6% | Metazoa, C:86.79%[S:76.9%,D:9.9%],M:13.21% , n:3021 | A:53.81%[P:18.92%,F:5.06%],I:5.0%[P:2.7%,F:0.68%],C:0.00%[P:0.00%,F:0.00%],U:41.19% |
Pocillopora cf. effusa | 347M | 35,103 | 230,901 | 33,086 | metazoa | C:95.1%[S:92.2%,D:2.9%],F:1.7%,M:3.2% | Eumetazoa, C:94.16%[S:84.3%,D:9.86%],M:5.84%,n:3255 | A:52.94%[P:22.30%,F:3.69%],I:3.44%[P:2.08%,F:0.28%],C:0.00%[P:0.00%,F:0.00%],U:43.62% |
Trifolium dubium | 679M | 78,810 | 354,662 | 77,763 | fabales | C:95.1%[S:19.5%,D:75.6%],F:1.5%,M:3.4% | NPAAA clade, C:94.58%[S:19.21%,D:75.38%],M:5.42%,n:15412 | A:71.99%[P:11.03%,F:6.63%],I:2.77%[P:1.66%,F:0.52%],C:0.00%[P:0.00%,F:0.00%],U:25.23% |
Future work
- Python wrapper function to remove intermediate files
- Adding functional annotation with
Interproscan
andeggnog
- Adding PASA results to further improve the accuracy of the training
- Adding custom parameter for both
BRAKER
andfunannotate
Version History
Version 2 (latest) Created 23rd Nov 2023 at 13:28 by Phuong Doan
- Python script to analyze the number of exons per transcript (let it mRNA or tRNA), generate the number of mono exons, multi exons, mono/multi ratio, etc.
- Adding a step in the workflow to rename the output GFF file's contigs back to the original name, now saved as rename_output.gff, which can be compared with other GFF.
- Modify README.md, adding TOC and more detailed information about the workflow.
Frozen
Version-2
0d15734
Version 1 (earliest) Created 9th Nov 2023 at 09:43 by Tom Brown
Public release v1
Frozen
Version-1
acb1402
Creator
Submitter
Views: 3233 Downloads: 893
Created: 9th Nov 2023 at 09:43
Last updated: 24th Nov 2023 at 15:24
None