Workflow Type: Common Workflow Language
Stable

CWL-assembly

Description

This repository contains two workflows for metagenome and metatranscriptome assembly of short-read data. metaSPAdes is the default assembler for paired-end data, and MEGAHIT is used for single-end data and co-assemblies. If preferred, MEGAHIT can be set as the default assembler in the yaml file. Steps include:

  • QC: removal of short reads, low-quality regions and adapters, plus host decontamination
  • Assembly: with metaSPAdes or MEGAHIT
  • Post-assembly: host and PhiX decontamination, contig length filter (500 bp), stats generation
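
To make the contig length filter concrete, here is a minimal sketch of a length filter over a FASTA file. It is not the pipeline's own implementation: the function name filter_contigs is hypothetical, and keeping contigs of at least the threshold length (rather than strictly longer) is an assumption.

```shell
# Hypothetical helper, not part of the pipeline: keep only FASTA records
# whose sequence is at least MIN_LEN bases long (assumed to be inclusive).
# Usage: filter_contigs MIN_LEN < contigs.fasta > filtered.fasta
filter_contigs() {
    awk -v min="$1" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= min) {
                printf "%s\n%s\n", header, seq
            }
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { emit(); header = $0; seq = ""; next }
        # Sequence lines may be wrapped, so join them before measuring.
        { seq = seq $0 }
        END { emit() }
    '
}
```

For the threshold named above, this would be called as `filter_contigs 500 < contigs.fasta > filtered.fasta`, where both file names are placeholders.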

Requirements - How to install

This pipeline requires a conda environment with cwltool, blastn, and metaSPAdes. If created from requirements.yml, the environment will be called cwl_assembly.

conda env create -f requirements.yml
conda activate cwl_assembly
pip install cwltool==3.1.20230601100705

Databases

Before running the pipeline, download the FASTA files used for host decontamination and build the following databases from them:

  • bwa index
  • blast index

Specify the locations in the yaml file when running the pipeline.
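
The two indexes could be built along these lines. This is a sketch under assumptions: the function name build_host_indexes, the bwa/ and blast/ subdirectory layout, and the file names are all hypothetical, and the layout the pipeline expects should be checked against the yml template.

```shell
# Hypothetical helper: build a bwa index and a nucleotide BLAST database
# from one host FASTA file. The output layout is an assumption.
# Usage: build_host_indexes host.fasta /path/to/db_dir
build_host_indexes() {
    # Fail early if the FASTA file is missing.
    [ -f "$1" ] || { echo "FASTA not found: $1" >&2; return 1; }
    mkdir -p "$2/bwa" "$2/blast" || return 1
    # bwa index -p sets the prefix of the generated index files.
    bwa index -p "$2/bwa/$(basename "$1")" "$1" || return 1
    # makeblastdb builds a nucleotide database for use with blastn.
    makeblastdb -in "$1" -dbtype nucl -out "$2/blast/$(basename "$1")"
}
```

A call such as `build_host_indexes host.fasta databases` would then leave the two index prefixes under databases/bwa and databases/blast.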

Main pipeline executables

  • src/workflows/metagenome_pipeline.cwl
  • src/workflows/metatranscriptome_pipeline.cwl

Example command

cwltool --singularity --outdir ${OUTDIR} ${CWL} ${YML}

${CWL} is one of the executables listed above, and ${YML} is a config yaml file. You can find a yml template in the examples folder.
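
One way to wrap the command above is sketched here: it checks that both input files exist, validates the workflow document, and only then runs it. The function name run_assembly and the argument order are hypothetical. The cwltool flags used are --validate, --singularity and --outdir.

```shell
# Hypothetical wrapper around the example command.
# Usage: run_assembly WORKFLOW.cwl CONFIG.yml OUTDIR
run_assembly() {
    # Fail early if either input file is missing.
    [ -f "$1" ] || { echo "workflow not found: $1" >&2; return 1; }
    [ -f "$2" ] || { echo "config not found: $2" >&2; return 1; }
    mkdir -p "$3" || return 1
    # --validate only checks the CWL document; it does not run any step.
    cwltool --validate "$1" || return 1
    cwltool --singularity --outdir "$3" "$1" "$2"
}
```

For example, `run_assembly src/workflows/metagenome_pipeline.cwl config.yml results`, where config.yml and results are placeholder names.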

Example output directory structure

Root directory
    ├── megahit
    │   └── 001 -------------------------------- Assembly root directory
    │       ├── assembly_stats.json ------------ Human-readable assembly stats file
    │       ├── coverage.tab ------------------- Coverage file
    │       ├── log ---------------------------- CwlToil+megahit output log
    │       ├── options.json ------------------- Megahit input options
    │       ├── SRR6257420.fasta.gz ------------ Archived and trimmed assembly
    │       └── SRR6257420.fasta.gz.md5 -------- MD5 hash of above archive
    ├── metaspades
    │   └── 001 -------------------------------- Assembly root directory
    │       ├── assembly_graph.fastg ----------- Assembly graph
    │       ├── assembly_stats.json ------------ Human-readable assembly stats file
    │       ├── coverage.tab ------------------- Coverage file
    │       ├── params.txt --------------------- Metaspades input options
    │       ├── spades.log --------------------- Metaspades output log
    │       ├── SRR6257420.fasta.gz ------------ Archived and trimmed assembly
    │       └── SRR6257420.fasta.gz.md5 -------- MD5 hash of above archive
    │ 
    └── raw ------------------------------------ Raw data directory
        ├── SRR6257420.fastq.qc_stats.tsv ------ Stats for cleaned fastq
        ├── SRR6257420_fastp_clean_1.fastq.gz -- Cleaned paired-end file_1
        └── SRR6257420_fastp_clean_2.fastq.gz -- Cleaned paired-end file_2
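
The .md5 file next to each assembly archive can be used to check the archive after copying it. The sketch below assumes the .md5 file starts with the hex digest as its first whitespace-separated field (the format written by md5sum). The source does not state the file format, and the function name verify_md5 is hypothetical.

```shell
# Hypothetical helper: compare an archive's MD5 digest with the digest
# stored in ARCHIVE.md5 (assumed to be the first field of that file).
# Usage: verify_md5 SRR6257420.fasta.gz
verify_md5() {
    # Read the expected digest from the first field of the first line.
    expected=$(awk 'NR == 1 { print $1 }' "$1.md5") || return 1
    # Compute the actual digest of the archive contents.
    actual=$(md5sum < "$1" | awk '{ print $1 }') || return 1
    # Succeed only if a digest was found and both values match.
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

The function returns a zero exit status on a match, so it can be used as `verify_md5 SRR6257420.fasta.gz && echo "checksum ok"`.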

Version History

master @ 39efebc (latest) Created 21st Jun 2023 at 11:41 by Germana Baldi

Merge pull request #8 from EBI-Metagenomics/readme_requirements

Update of README, examples, and installation requirements


Frozen master 39efebc

master @ b269a55 (earliest) Created 19th May 2023 at 14:59 by Varsha Kale

Update README.md


Frozen master b269a55
Created: 19th May 2023 at 14:59

Last updated: 21st Jun 2023 at 11:41