ERGA Protein-coding gene annotation workflow
Version 1

Workflow Type: Snakemake

ERGA Protein-coding gene annotation workflow.

Adapted from the work of Sagane Joye:

https://github.com/sdind/genome_annotation_workflow

Prerequisites

The following programs are required to run the workflow and the listed version were tested. It should be noted that older versions of snakemake are not compatible with newer versions of singularity as is noted here: https://github.com/nextflow-io/nextflow/issues/1659.

conda v 23.7.3

singularity v 3.7.3

snakemake v 7.32.3

You will also need to acquire a licence key for Genemark and place this in your home directory with name ~/.gm_key The key file can be obtained from the following location, where the licence should be read and agreed to: http://topaz.gatech.edu/GeneMark/license_download.cgi

Workflow

The pipeline is based on braker3 and was tested on the following dataset from Drosophila melanogaster: https://doi.org/10.5281/zenodo.8013373

Input data

  • Reference genome in fasta format

  • RNAseq data in paired-end zipped fastq format

  • uniprot fasta sequences in zipped fasta format

Pipeline steps

  • Repeat Model and Mask Run RepeatModeler using the genome as input, filter any repeats also annotated as protein sequences in the uniprot database and use this filtered libray to mask the genome with RepeatMasker

  • Map RNAseq data Trim any remaining adapter sequences and map the trimmed reads to the input genome

  • Run gene prediction software Use the mapped RNAseq reads and the uniprot sequences to create hints for gene prediction using Braker3 on the masked genome

  • Evaluate annotation Run BUSCO to evaluate the completeness of the annotation produced

Output data

  • FastQC reports for input RNAseq data before and after adapter trimming

  • RepeatMasker report containing quantity of masked sequence and distribution among TE families

  • Protein-coding gene annotation file in gff3 format

  • BUSCO summary of annotated sequences

Setup

Your data should be placed in the data folder, with the reference genome in the folder data/ref and the transcript data in the foler data/rnaseq.

The config file requires the following to be given:

asm: 'absolute path to reference fasta'
snakemake_dir_path: 'path to snakemake working directory'
name: 'name for project, e.g. mHomSap1'
RNA_dir: 'absolute path to rnaseq directory'
busco_phylum: 'busco database to use for evaluation e.g. mammalia_odb10'

Version History

Version 1 (earliest) Created 12th Sep 2023 at 20:29 by Tom Brown

Initial commit


Frozen Version-1 b8e3693
help Creators and Submitter
Creator
Submitter
Citation
Joye-Dind, S. (2023). ERGA Protein-coding gene annotation workflow. WorkflowHub. https://doi.org/10.48546/WORKFLOWHUB.WORKFLOW.569.1
Activity

Views: 2372   Downloads: 278

Created: 12th Sep 2023 at 20:29

Last updated: 13th Sep 2023 at 14:40

Annotated Properties
help Attributions

None

Total size: 87.4 MB
Powered by
(v.1.16.0-main)
Copyright © 2008 - 2024 The University of Manchester and HITS gGmbH