(Hybrid) Metagenomics workflow
Version 2 (latest)

Workflow Type: Common Workflow Language

Work-in-progress

Workflow for (hybrid) metagenomic assembly and binning

Workflow Illumina Quality: https://workflowhub.eu/workflows/336?version=1
- FastQC (control)
- fastp (quality trimming)
- kraken2 (taxonomy)
- bbmap contamination filter
Workflow Longread Quality:
- NanoPlot (control)
- filtlong (quality trimming)
- kraken2 (taxonomy)
- minimap2 contamination filter
Kraken2 taxonomic classification of FASTQ reads
SPAdes/Flye (Assembly)
Pilon/Medaka/PyPolCA (Assembly polishing)
QUAST (Assembly quality report)

(optional)

Workflow binning: https://workflowhub.eu/workflows/64?version=11
- Metabat2/MaxBin2/SemiBin
- DAS Tool
- CheckM
- BUSCO
- GTDB-Tk
- (optional) Workflow bin annotation (https://workflowhub.eu/workflows/1170)
- bakta
- KofamScan (optional)
- InterProScan (optional)
- eggNOG mapper (optional)
- Workflow SAPP conversion (optional, default on) (https://workflowhub.eu/workflows/1174/)

(optional)

Workflow Genome-scale metabolic models from bins https://workflowhub.eu/workflows/372
- CarveMe (GEM generation)
- MEMOTE (GEM test suite)
- SMETANA (Species METabolic interaction ANAlysis)

Other UNLOCK workflows on WorkflowHub: https://workflowhub.eu/projects/16/workflows?view=default

All tool CWL files and other workflows can be found here:
Tools: https://gitlab.com/m-unlock/cwl/-/tree/master/cwl
Workflows: https://gitlab.com/m-unlock/cwl/-/tree/master/cwl/workflows

How to setup and use an UNLOCK workflow:
https://m-unlock.gitlab.io/docs/setup/setup.html

SEEK ID: https://workflowhub.eu/workflows/367?version=2

Inputs

ID	Name	Description	Type
identifier	Identifier	Identifier for this dataset used in this workflow (required)	string
illumina_forward_reads	Forward reads	Illumina Forward sequence file(s)	File[]?
illumina_reverse_reads	Reverse reads	Illumina Reverse sequence file(s)	File[]?
pacbio_reads	PacBio reads	File(s) with PacBio reads in FASTQ format	File[]?
nanopore_reads	Oxford Nanopore reads	File(s) with Oxford Nanopore reads in FASTQ format	File[]?
fastq_rich	Fastq rich (ONT)	Input fastq is generated by albacore, MinKNOW or guppy with additional information concerning channel and time. Used to create more informative quality plots (default false)	boolean
longread_minimum_length	Minimum read length	Minimum read length threshold (default 1000)	int
longread_keep_percent	Keep percentage	Keep only this percentage of the best reads (measured by bases) (default 90)	float
longread_length_weight	Length weight	Weight given to the length score (default 10)	float
filter_references	Reference file(s)	Reference fasta file(s) used for pre-filtering. Can be gzipped (not mixed)	File[]?
use_reference_mapped_reads	Keep mapped reads	Continue with reads mapped to the given reference (default false)	boolean
keep_filtered_reads	Keep filtered reads	Keep filtered reads in the final output (default false)	boolean
deduplicate	Deduplicate reads	Remove exact duplicate Illumina reads with fastp (default false)	boolean
kraken2_confidence	Kraken2 confidence threshold	Confidence score threshold must be in [0, 1] (default 0.0)	float?
kraken2_database	Kraken2 database	Database location of kraken2	Directory[]?
skip_bracken	Skip Bracken	Skip Bracken analysis (default false)	boolean
gtdbtk_data	gtdbtk data directory	Directory containing the GTDBTK repository	Directory?
busco_data	BUSCO dataset	Path to the BUSCO dataset downloaded location	Directory?
ont_basecall_model	ONT basecalling model for Medaka	Basecalling model that was used with guppy, passed to Medaka (default r941_min_high; required for Medaka). Available: r103_fast_g507, r103_fast_snp_g507, r103_fast_variant_g507, r103_hac_g507, r103_hac_snp_g507, r103_hac_variant_g507, r103_min_high_g345, r103_min_high_g360, r103_prom_high_g360, r103_prom_snp_g3210, r103_prom_variant_g3210, r103_sup_g507, r103_sup_snp_g507, r103_sup_variant_g507, r1041_e82_400bps_fast_g615, r1041_e82_400bps_fast_variant_g615, r1041_e82_400bps_hac_g615, r1041_e82_400bps_hac_variant_g615, r1041_e82_400bps_sup_g615, r1041_e82_400bps_sup_variant_g615, r104_e81_fast_g5015, r104_e81_fast_variant_g5015, r104_e81_hac_g5015, r104_e81_hac_variant_g5015, r104_e81_sup_g5015, r104_e81_sup_g610, r104_e81_sup_variant_g610, r10_min_high_g303, r10_min_high_g340, r941_e81_fast_g514, r941_e81_fast_variant_g514, r941_e81_hac_g514, r941_e81_hac_variant_g514, r941_e81_sup_g514, r941_e81_sup_variant_g514, r941_min_fast_g303, r941_min_fast_g507, r941_min_fast_snp_g507, r941_min_fast_variant_g507, r941_min_hac_g507, r941_min_hac_snp_g507, r941_min_hac_variant_g507, r941_min_high_g303, r941_min_high_g330, r941_min_high_g340_rle, r941_min_high_g344, r941_min_high_g351, r941_min_high_g360, r941_min_sup_g507, r941_min_sup_snp_g507, r941_min_sup_variant_g507, r941_prom_fast_g303, r941_prom_fast_g507, r941_prom_fast_snp_g507, r941_prom_fast_variant_g507, r941_prom_hac_g507, r941_prom_hac_snp_g507, r941_prom_hac_variant_g507, r941_prom_high_g303, r941_prom_high_g330, r941_prom_high_g344, r941_prom_high_g360, r941_prom_high_g4011, r941_prom_snp_g303, r941_prom_snp_g322, r941_prom_snp_g360, r941_prom_sup_g507, r941_prom_sup_snp_g507, r941_prom_sup_variant_g507, r941_prom_variant_g303, r941_prom_variant_g322, r941_prom_variant_g360, r941_sup_plant_g610, r941_sup_plant_variant_g610	string?
pilon_fixlist	Pilon fix list	A comma-separated list of categories of issues to try to fix: "snps": try to fix individual base errors; "indels": try to fix small indels; "gaps": try to fill gaps; "local": try to detect and fix local misassemblies; "all": all of the above (Pilon's own default); "bases": shorthand for "snps" and "indels" (for backward compatibility). Workflow default: snps,gaps,local (conservative)	string
genome_size	Genome Size	Estimated genome size (for example, 5m or 2.6g)	string?
metagenome	When working with metagenomes	Metagenome option for assemblers (default true)	boolean
semibin_environment	SemiBin Environment	SemiBin built-in models; human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/chicken_caecum/global (default global)	string
run_binspreader	n/a	Whether to use BinSPreader for bin refinement	boolean?
annotate_bins	Annotate bins	Annotate bins. Default false	boolean
annotate_unbinned	Annotate unbinned	Annotate unbinned contigs. Will be treated as metagenome. Default false	boolean
bakta_db	Bakta DB	Bakta Database directory (required when annotating bins)	Directory?
skip_bakta_crispr	Skip bakta CRISPR	Skip bakta CRISPR array prediction using PILER-CR. Default false	boolean
interproscan_directory	InterProScan 5 directory	Directory of the (full) InterProScan 5 program. Used for annotating bins. (optional)	Directory?
eggnog_dbs	n/a	n/a	record containing Directory? File? File?
run_kofamscan	Run kofamscan	Run with KEGG KO KoFamKOALA annotation. Default false	boolean
kofamscan_limit_sapp	SAPP kofamscan limit	Limit max number of entries of kofamscan hits per locus in SAPP. Default 5	int?
run_eggnog	Run eggNOG-mapper	Run with eggNOG-mapper annotation. Requires eggnog database files. Default false	boolean
run_interproscan	Run InterProScan	Run with InterProScan annotation. Requires InterProScan v5 program files. Default false	boolean
interproscan_applications	InterProScan applications	Comma-separated list of analyses: FunFam,SFLD,PANTHER,Gene3D,Hamap,PRINTS,ProSiteProfiles,Coils,SUPERFAMILY,SMART,CDD,PIRSR,ProSitePatterns,AntiFam,Pfam,MobiDBLite,PIRSF,NCBIfam (default Pfam,SFLD,SMART,AntiFam,NCBIfam)	string
run_spades	Use SPAdes	Run with SPAdes assembler (default true)	boolean
only_assembler_mode_spades	Only spades assembler	Run spades in only assembler mode (without read error correction) (default false)	boolean
run_flye	Use Flye	Run with Flye assembler (default false)	boolean
run_pilon	Use Pilon	Run with Pilon assembly polishing using Illumina reads (default false)	boolean
run_medaka	Use Medaka	Run with Medaka assembly polishing using Nanopore reads (default false)	boolean
run_pypolca	Use PyPolCA	Run with PyPolCA polishing of long-read assemblies using Illumina data (default false)	boolean
assembly_choice	Assembly choice	User's choice of assembly for post-assembly (binning) processes ('spades', 'medaka', 'flye', 'pilon', 'pypolca'). Optional. Only one choice allowed.	enum of: spades, medaka, flye, pilon, pypolca
binning	Run binning workflow	Run with contig binning workflow (default false)	boolean
run_GEM	Run GEM workflow	Run the community GEnome-scale Metabolic models workflow on bins (default false). NOTE: uses private Docker containers by default	boolean
run_smetana	Run SMETANA	Run SMETANA (Species METabolic interaction ANAlysis) (default false)	boolean
smetana_solver	n/a	Solver to be used in SMETANA (now only run with cplex)	string?
memote_solver	MEMOTE solver	MEMOTE solver Choice (cplex, glpk, gurobi, glpk_exact); by default glpk	string?
gapfill	Gap fill	Gap fill model for given media	string?
mediadb	Media database	Media database file	File?
carveme_solver	CarveMe solver	CarveMe solver (default scip), possible to use cplex in private container (not provided in public container)	string?
skip_qc_unfiltered	Skip QC unfiltered	Skip quality analyses of unfiltered input reads (default false)	boolean
threads	Number of threads	Number of threads to use for each computational process (default 2)	int
memory	Memory usage (MB)	Maximum memory usage in megabytes (default 8GB)	int
destination	Output Destination (prov only)	Not used in this workflow. Output destination used for cwl-prov reporting only.	string?
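
The dependencies between inputs stated in the table above can be checked before a run is started. The following is a minimal sketch, not part of the UNLOCK tooling: it builds a job object as a Python dictionary, checks a handful of the documented constraints, and serialises it to JSON, a format that CWL runners such as cwltool accept for job files. The helper names (cwl_file, cwl_dir, validate_job) and all file paths are hypothetical; only the input IDs and the rules themselves come from the table.

```python
import json

# Allowed values for assembly_choice, copied from the inputs table.
ASSEMBLY_CHOICES = {"spades", "medaka", "flye", "pilon", "pypolca"}


def cwl_file(path):
    """Return a CWL File object for a path (hypothetical helper)."""
    return {"class": "File", "path": path}


def cwl_dir(path):
    """Return a CWL Directory object for a path (hypothetical helper)."""
    return {"class": "Directory", "path": path}


def validate_job(job):
    """Return a list of problems found; an empty list means none were found.

    Only a few rules stated in the inputs table are checked here.
    """
    problems = []
    if not job.get("identifier"):
        problems.append("identifier is required")
    confidence = job.get("kraken2_confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        problems.append("kraken2_confidence must be in [0, 1]")
    choice = job.get("assembly_choice")
    if choice is not None and choice not in ASSEMBLY_CHOICES:
        problems.append("assembly_choice must be one of " + ", ".join(sorted(ASSEMBLY_CHOICES)))
    if job.get("annotate_bins") and not job.get("bakta_db"):
        problems.append("bakta_db is required when annotate_bins is true")
    if job.get("run_eggnog") and not job.get("eggnog_dbs"):
        problems.append("eggnog_dbs is required when run_eggnog is true")
    return problems


# Example job; every path below is a placeholder, not a real location.
job = {
    "identifier": "sample_01",
    "illumina_forward_reads": [cwl_file("reads/sample_01_R1.fastq.gz")],
    "illumina_reverse_reads": [cwl_file("reads/sample_01_R2.fastq.gz")],
    "nanopore_reads": [cwl_file("reads/sample_01_ont.fastq.gz")],
    "kraken2_database": [cwl_dir("databases/kraken2")],
    "kraken2_confidence": 0.0,
    "run_spades": True,
    "assembly_choice": "spades",
    "binning": True,
    "threads": 2,
}

problems = validate_job(job)
job_json = json.dumps(job, indent=2)  # text that could be saved as a job file
```

If validate_job returns an empty list, job_json can be written to a file and passed to the CWL runner together with the workflow. The checks are deliberately incomplete; the runner itself still validates types against the workflow definition.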

Steps

ID	Name	Description
prepare_fasta_db	Prepare references	Prepare references to a single fasta file and unique headers
workflow_quality_illumina	Quality and filtering workflow	Quality, filtering and taxonomic classification of Illumina reads
workflow_quality_nanopore	Oxford Nanopore quality workflow	Quality, filtering and taxonomic classification workflow for Oxford Nanopore reads
workflow_quality_pacbio	PacBio quality and filtering workflow	Quality, filtering and taxonomic classification for PacBio reads
spades	SPAdes assembly	Genome assembly using SPAdes with Illumina and/or long reads
compress_spades	SPAdes compressed	Compress the large SPAdes assembly output files
flye	Flye assembly	De novo assembly of single-molecule reads with Flye
medaka	Medaka polishing of assembly	Medaka polishing of an assembled genome with ONT reads
metaquast_medaka	Assembly evaluation	Evaluation of polished assembly with metaQUAST
workflow_pilon	Pilon workflow	Illumina reads assembly polishing with Pilon
metaquast_pilon	Illumina assembly evaluation	Evaluation of Pilon-polished assembly with metaQUAST
workflow_pypolca	Run PyPolCA assembly polishing	PyPolCA polishing of long-read assembly with Illumina reads
metaquast_pypolca	PyPolCA polished assembly evaluation with QUAST	Evaluation of PyPolCA-polished assembly with metaQUAST
assembly_read_mapping_illumina	Minimap2	Illumina read mapping using Minimap2 on assembled scaffolds
contig_read_counts	Samtools idxstats	Reports alignment summary statistics
workflow_binning	Binning workflow	Binning workflow to create bins
workflow_GEM	GEM workflow	CarveMe community genome-scale metabolic models workflow from bins
keep_readfilter_files_to_folder	Read filtering output folder	Preparation of read filtering output files to a specific output folder
readfilter_files_to_folder	Read filtering output folder	Preparation of read filtering output files to a specific output folder
spades_files_to_folder	SPAdes output to folder	Preparation of SPAdes output files to a specific output folder
flye_files_to_folder	Flye output folder	Preparation of Flye output files to a specific output folder
metaquast_medaka_files_to_folder	Medaka metaQUAST output folder	Preparation of metaQUAST output files to a specific output folder
medaka_files_to_folder	Medaka output folder	Preparation of Medaka output files to a specific output folder
metaquast_pilon_files_to_folder	Illumina metaQUAST output folder	Preparation of QUAST output files to a specific output folder
pilon_files_to_folder	Pilon output folder	Preparation of pilon output files to a specific output folder
metaquast_pypolca_files_to_folder	PyPolca metaQUAST output folder	Preparation of PyPolCA metaQUAST output files to a specific output folder
pypolca_files_to_folder	PyPolca output folder	Preparation of PyPolCA output files to a specific output folder
assembly_files_to_folder	Flye output folder	Preparation of Flye output files to a specific output folder
binning_files_to_folder	Binning output to folder	Preparation of binning output files and folders to a specific output folder
GEM_files_to_folder	GEM workflow output to folder	Preparation of GEM workflow output files and folders to a specific output folder

Outputs

ID	Name	Description	Type
read_filtering_output_keep	Read filtering output	Read filtering stats + filtered reads	Directory?
read_filtering_output	Read filtering output	Read filtering stats + filtered reads	Directory?
assembly_output	Assembly output	Output from different assembly steps	Directory
binning_output	Binning output	Binning output folders
gem_output	Community GEM output	Community GEM output folder	Directory?
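
Four of the five outputs have an optional type (Directory?), so a finished run may report some of them as null. As a small sketch of how a run could be inspected afterwards, the snippet below parses the JSON output object that a CWL runner such as cwltool prints on completion and reports where each output directory ended up. The function name summarize_outputs and the example locations are hypothetical; the output IDs are taken from the table above.

```python
import json

# Output IDs from the outputs table.
OUTPUT_IDS = [
    "read_filtering_output_keep",
    "read_filtering_output",
    "assembly_output",
    "binning_output",
    "gem_output",
]


def summarize_outputs(output_json):
    """Map each workflow output ID to its directory location, or None if absent."""
    outputs = json.loads(output_json)
    summary = {}
    for output_id in OUTPUT_IDS:
        value = outputs.get(output_id)
        if value is None:
            # Optional output that was not produced (null or missing key).
            summary[output_id] = None
        else:
            summary[output_id] = value.get("location") or value.get("path")
    return summary


# Hypothetical runner output for a run without binning or GEM generation.
example_output = """
{
  "read_filtering_output_keep": null,
  "read_filtering_output": {"class": "Directory", "location": "file:///tmp/run/read_filtering"},
  "assembly_output": {"class": "Directory", "location": "file:///tmp/run/assembly"},
  "binning_output": null
}
"""

summary = summarize_outputs(example_output)
for output_id, location in summary.items():
    print(output_id, "->", location if location else "not produced")
```

Treating a missing key the same as null keeps the summary stable even when a runner omits unset optional outputs.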