Workflow Type: Common Workflow Language




This repository contains two workflows for assembling short-read data: one for metagenomes and one for metatranscriptomes. By default, metaSPAdes is used for paired-end data, and MEGAHIT for single-end data and co-assemblies. If preferred, MEGAHIT can be set as the default assembler in the yaml file. The steps are:

  • QC: removal of short reads, low-quality regions, and adapters; host decontamination
  • Assembly: with metaSPAdes or MEGAHIT
  • Post-assembly: host and PhiX decontamination, contig length filter (500 bp), stats generation
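To make the contig length filter concrete, here is a minimal sketch of how such a filter could be written with awk. It is not the pipeline's own implementation: the function name filter_contigs is hypothetical, and treating 500 bp as a minimum length is an assumption.

```shell
# Sketch: keep only FASTA records whose sequence is at least MIN_LEN bases long.
# filter_contigs is a hypothetical helper; it reads FASTA on stdin and writes
# the surviving records to stdout with each sequence joined onto one line.
filter_contigs() {
    min_len=$1
    awk -v min="$min_len" '
        # Print the record collected so far if it passes the length threshold.
        function emit() { if (header != "" && length(seq) >= min) print header "\n" seq }
        /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
        { seq = seq $0 }                               # accumulate wrapped sequence lines
        END { emit() }                                 # do not forget the last record
    '
}

# Example (not run here):
#   gunzip -c contigs.fasta.gz | filter_contigs 500 > contigs.min500.fasta
```

Joining wrapped sequence lines before measuring matters, because a contig split over several lines would otherwise be judged by the length of a single line.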

Requirements - How to install

This pipeline requires a conda environment with cwltool, blastn, and metaspades. If created with requirements.yml, the environment will be called cwl_assembly.

conda env create -f requirements.yml
conda activate cwl_assembly
pip install cwltool==3.1.20230601100705


You will need to pre-download the FASTA files used for host decontamination and build the following indexes from them:

  • bwa index
  • blast index

Specify the locations in the yaml file when running the pipeline.
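As an illustration, the two indexes could be built roughly as follows. This is a sketch under assumptions: the function name build_host_indexes, the DRY_RUN switch, and the file and directory names are hypothetical, bwa is assumed to be installed separately, and makeblastdb is assumed to be available from the same BLAST+ installation that provides blastn. The prefix layout the pipeline expects is not stated here, so check the yml template before relying on it.

```shell
# Sketch: build a bwa index and a nucleotide BLAST database from one host FASTA.
# Both are written under the same prefix inside DB_DIR (an assumed layout).
build_host_indexes() {
    host_fasta=$1
    db_dir=$2
    prefix="$db_dir/$(basename "$host_fasta")"
    # With DRY_RUN=1 the commands are printed instead of executed.
    run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
    run mkdir -p "$db_dir"
    run bwa index -p "$prefix" "$host_fasta"
    run makeblastdb -in "$host_fasta" -dbtype nucl -out "$prefix"
}

# Example (not run here):
#   DRY_RUN=1 build_host_indexes host.fasta databases   # preview the commands
#   build_host_indexes host.fasta databases             # actually build
```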

Main pipeline executables

  • src/workflows/metagenome_pipeline.cwl
  • src/workflows/metatranscriptome_pipeline.cwl

Example command

cwltool --singularity --outdir ${OUTDIR} ${CWL} ${YML}

${CWL} is one of the executables listed above, and ${YML} is a yaml config file. You can find a yml template in the examples folder.
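The choice of executable can be wrapped in a small helper. The sketch below is illustrative only: pipeline_for and its data-type labels are hypothetical names, while the workflow paths and the cwltool options are the ones given above.

```shell
# Sketch: map a data type to the matching workflow file.
# pipeline_for is a hypothetical helper, not part of the repository.
pipeline_for() {
    case "$1" in
        metagenome)        echo "src/workflows/metagenome_pipeline.cwl" ;;
        metatranscriptome) echo "src/workflows/metatranscriptome_pipeline.cwl" ;;
        *) echo "unknown data type: $1" >&2; return 1 ;;
    esac
}

# Example (not run here), assuming OUTDIR and YML are already set:
#   CWL=$(pipeline_for metagenome)
#   cwltool --singularity --outdir "${OUTDIR}" "${CWL}" "${YML}"
```

Failing on an unknown label keeps a typo from silently launching the wrong workflow.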

Example output directory structure

Root directory
    ├── megahit
    │   └── 001 -------------------------------- Assembly root directory
    │       ├── assembly_stats.json ------------ Human-readable assembly stats file
    │       ├── ------------------- Coverage file
    │       ├── log ---------------------------- CwlToil+megahit output log
    │       ├── options.json ------------------- Megahit input options
    │       ├── SRR6257420.fasta.gz ------------ Archived and trimmed assembly
    │       └── SRR6257420.fasta.gz.md5 -------- MD5 hash of above archive
    ├── metaspades
    │   └── 001 -------------------------------- Assembly root directory
    │       ├── assembly_graph.fastg ----------- Assembly graph
    │       ├── assembly_stats.json ------------ Human-readable assembly stats file
    │       ├── ------------------- Coverage file
    │       ├── params.txt --------------------- Metaspades input options
    │       ├── spades.log --------------------- Metaspades output log
    │       ├── SRR6257420.fasta.gz ------------ Archived and trimmed assembly
    │       └── SRR6257420.fasta.gz.md5 -------- MD5 hash of above archive
    └── raw ------------------------------------ Raw data directory
        ├── SRR6257420.fastq.qc_stats.tsv ------ Stats for cleaned fastq
        ├── SRR6257420_fastp_clean_1.fastq.gz -- Cleaned paired-end file_1
        └── SRR6257420_fastp_clean_2.fastq.gz -- Cleaned paired-end file_2
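
The .md5 file next to each archive can be used to check that an assembly was copied intact. The sketch below assumes that the first whitespace-separated field of the .md5 file is the hex digest and that GNU md5sum is installed; neither is confirmed by this README, and verify_md5 is a hypothetical helper name.

```shell
# Sketch: compare an archive's MD5 digest with the one stored in <archive>.md5.
# Assumes the digest is the first field of the .md5 file (format not confirmed).
verify_md5() {
    archive=$1
    expected=$(awk '{ print $1; exit }' "$archive.md5")
    actual=$(md5sum "$archive" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK: $archive"
    else
        echo "MISMATCH: $archive" >&2
        return 1
    fi
}

# Example (not run here):
#   verify_md5 metaspades/001/SRR6257420.fasta.gz
```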

Version History

master @ 39efebc (latest) Created 21st Jun 2023 at 11:41 by Germana Baldi

Merge pull request #8 from EBI-Metagenomics/readme_requirements

Update of README, examples, and installation requirements


master @ b269a55 (earliest) Created 19th May 2023 at 14:59 by Varsha Kale

