Metagenome and metatranscriptome assembly in CWL
Version 1

Workflow Type: Common Workflow Language




This repository contains two workflows for metagenome and metatranscriptome assembly of short-read data. By default, metaSPAdes is used for paired-end data and MEGAHIT for single-end data. MEGAHIT can instead be set as the default assembler in the yaml file if preferred. Steps include:

QC - removal of short reads, low-quality regions and adapters, plus host decontamination

Assembly - with metaSPAdes or MEGAHIT

Post-assembly - host and PhiX decontamination, contig length filter (500 bp), stats generation

Multiple input read files can also be specified for co-assembly.
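
The contig length filter in the post-assembly step can be illustrated with a minimal sketch. This is not the pipeline's own code: the function names are hypothetical, and it assumes that 500 bp is a minimum length and that contigs of exactly 500 bp are kept.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_CONTIG_LENGTH = 500  # bp, the threshold named in the post-assembly step


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str],
                   min_length: int = MIN_CONTIG_LENGTH) -> List[Tuple[str, str]]:
    """Keep contigs of at least min_length bases (inclusive cutoff is an assumption)."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]
```

In practice the lines would come from an open assembly file such as `final.contigs.fa`, for example `filter_contigs(open("final.contigs.fa"))`.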


This pipeline requires an environment with cwltool, blastn, metaspades and megahit.
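
A quick way to confirm that the environment is complete is to look each tool up on the `PATH`. The sketch below is only illustrative. The executable names are assumptions (SPAdes typically installs metaSPAdes as `metaspades.py`), so adjust them to match your installation.

```python
import shutil
from typing import Callable, Dict, Iterable, List, Optional

# Assumed executable names; change these if your installation differs.
REQUIRED_TOOLS = ("cwltool", "blastn", "metaspades.py", "megahit")


def find_tools(tools: Iterable[str] = REQUIRED_TOOLS,
               which: Callable[[str], Optional[str]] = shutil.which
               ) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: which(tool) for tool in tools}


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS,
                  which: Callable[[str], Optional[str]] = shutil.which
                  ) -> List[str]:
    """Return the tools that could not be found, in the order given."""
    return [tool for tool, path in find_tools(tools, which).items() if path is None]
```

Calling `missing_tools()` with no arguments checks the current environment. An empty list means every listed executable was found.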


Predownload fasta files for host decontamination and generate:

- bwa index folder
- blast index folder

Specify the locations in the yaml file when running the pipeline.
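
The two index folders could be produced with `bwa index` and `makeblastdb`, as in the hedged sketch below. The folder layout, the reuse of the FASTA basename as the index prefix, and the function names are all assumptions; the repository may expect a different layout. By default the function only returns the commands without running them.

```python
import subprocess
from pathlib import Path
from typing import List


def index_commands(host_fasta: str, bwa_dir: str, blast_dir: str) -> List[List[str]]:
    """Build the bwa and BLAST indexing commands for one host FASTA file."""
    name = Path(host_fasta).stem  # hypothetical naming: reuse the FASTA basename
    return [
        # bwa index -p <prefix> <in.fasta>
        ["bwa", "index", "-p", str(Path(bwa_dir) / name), host_fasta],
        # makeblastdb builds a nucleotide database for blastn
        ["makeblastdb", "-in", host_fasta, "-dbtype", "nucl",
         "-out", str(Path(blast_dir) / name)],
    ]


def build_indexes(host_fasta: str, bwa_dir: str, blast_dir: str,
                  dry_run: bool = True) -> List[List[str]]:
    """Return the commands; run them only when dry_run is False."""
    commands = index_commands(host_fasta, bwa_dir, blast_dir)
    if not dry_run:
        for folder in (bwa_dir, blast_dir):
            Path(folder).mkdir(parents=True, exist_ok=True)
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

For example, `build_indexes("refs/host.fasta", "refs/bwa_index", "refs/blast_index", dry_run=False)` would write both indexes. The resulting folder locations are what the yaml file then needs to point to.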

Main pipeline executables

src/workflows/metagenome_pipeline.cwl

src/workflows/metatranscriptome_pipeline.cwl
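
Either workflow can be launched with cwltool by passing the workflow file followed by a job yaml. The sketch below only assembles that command line. The helper name and the `job_config.yml` filename are illustrative assumptions, and the keys inside the yaml are defined by the workflows themselves and are not shown here.

```python
import subprocess
from typing import List, Optional

WORKFLOWS = {
    "metagenome": "src/workflows/metagenome_pipeline.cwl",
    "metatranscriptome": "src/workflows/metatranscriptome_pipeline.cwl",
}


def cwltool_command(kind: str, job_yaml: str,
                    outdir: Optional[str] = None) -> List[str]:
    """Build a `cwltool [--outdir DIR] workflow.cwl job.yml` command line."""
    if kind not in WORKFLOWS:
        raise ValueError(f"unknown workflow kind: {kind!r}")
    command = ["cwltool"]
    if outdir:
        command += ["--outdir", outdir]
    command += [WORKFLOWS[kind], job_yaml]
    return command


def run_workflow(kind: str, job_yaml: str, outdir: Optional[str] = None) -> None:
    """Run the workflow and raise if cwltool exits with a non-zero status."""
    subprocess.run(cwltool_command(kind, job_yaml, outdir), check=True)
```

For instance, `run_workflow("metagenome", "job_config.yml", outdir="results")` runs the metagenome workflow from the repository root.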

Example output directory structure

    └── SRP074153               Project directory containing all assemblies under that project
        ├── downloads.yml       Raw data download caching logfile, to avoid duplicate downloads of raw data
        ├── SRR6257
        │   └── SRR6257420      Run directory
        │       └── megahit
        │           ├── 001     Assembly directory
        │           │   ├── SRR6257420.fasta               Trimmed assembly
        │           │   ├── SRR6257420.fasta.gz            Archive trimmed assembly
        │           │   ├── SRR6257420.fasta.gz.md5        MD5 hash of above archive
        │           │   ├──                   Coverage file
        │           │   ├── final.contigs.fa               Raw assembly
        │           │   ├── job_config.yml                 CWL job configuration
        │           │   ├── megahit.log                    Assembler output log
        │           │   ├── output.json                    Human-readable Assembly stats file
        │           │   ├── sorted.bam                     BAM file of assembly
        │           │   ├── sorted.bam.bai                 Secondary BAM file
        │           │   └── toil.log                       cwlToil output log
        │           └── metaspades  Assembly of equivalent data using another assembler (e.g. metaspades, spades...)
        │               └── ... 
        ├── raw                 Raw data directory
        │   └── SRR6257420.fastq.gz                        Raw data files
        └── tmp                 Temporary directory for assemblies
            └── SRR6257
                └── SRR6257420
                    └── megahit
                        └── 001
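
The `.fasta.gz.md5` file in the tree above can be used to check that an archived assembly is intact. The following is a minimal sketch. It assumes the sidecar file holds the hexadecimal digest as its first whitespace-separated field (the layout `md5sum` produces), which this document does not confirm.

```python
import hashlib
from pathlib import Path
from typing import Optional, Union

PathLike = Union[str, Path]


def md5_of_file(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches_md5(archive: PathLike,
                        md5_file: Optional[PathLike] = None) -> bool:
    """Compare an archive against its sidecar file (default: <archive>.md5)."""
    archive = Path(archive)
    md5_path = Path(md5_file) if md5_file else archive.with_name(archive.name + ".md5")
    fields = md5_path.read_text().split()
    # Assumption: the first field of the sidecar file is the hex digest.
    return bool(fields) and fields[0].lower() == md5_of_file(archive)
```

For example, `archive_matches_md5("SRR6257420.fasta.gz")` looks for `SRR6257420.fasta.gz.md5` in the same directory.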

Version History

master @ b269a55 (earliest) Created 19th May 2023 at 14:59 by Varsha Kale


