Escalibur
v0.3-beta

Workflow Type: Workflow Description Language

The Escalibur Population Genomic Analysis Pipeline explores key aspects of the population genetics of organisms. It automates three key bioinformatic components of population genomic analysis using the Workflow Description Language (WDL: https://openwdl.org/) together with customised R, Perl, Python and Unix shell scripts. The associated programs are packaged into a platform-independent Singularity image, for which the definition file is provided.

Analysis with Escalibur consists of three steps, each run as a separate workflow in sequential order; step 2 is optional.

1. Trimming and mapping the raw data, and selecting the best reference genome;
2. Removing contamination from the mapped data;
3. Recalibration, variant calling and filtering.

This implementation runs both locally and in a distributed environment that uses the SLURM job scheduler.

Dependencies

The following software dependencies are required: Singularity (installation is described in Step 1) and Java (to run the bundled Cromwell engine, cromwell-50.jar). All other programs are packaged into the Singularity image.

Step 1: Installation

Typically, the installation of Singularity requires root privileges, so you should contact your administrator to get it correctly installed. The minimum Linux kernel version requirement is 3.8, though >= 3.18 is preferred (https://sylabs.io/guides/3.5/admin-guide/installation.html).
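
If Singularity is already installed, you can check the kernel and Singularity versions like this:

> uname -r                 # kernel version; should be >= 3.8, preferably >= 3.18
> singularity --version    # confirms Singularity is installed and on your PATH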

Clone the git repository to a directory on your cluster or stand-alone server.

> git clone --depth 1 -b v0.3-beta https://gitlab.unimelb.edu.au/bioscience/escalibur.git
> cd escalibur

Description of Files

  • workflow-runtime.local.config: main configuration file for a stand-alone server runtime environment
  • workflow-runtime.slurm.config: main configuration file for an HPC runtime environment that supports the Slurm job scheduler
  • workflow-mapping.json: defines location of input files, has behavioral settings and sets resource allocations
  • workflow-cleaning.json: defines location of input files and sets resource allocations
  • workflow-variants.json: defines location of input files, has behavioral settings and sets resource allocations
  • workflow-mapping.wdl: main workflow file to trim PE reads and map them to the genome
  • workflow-cleaning.wdl: main workflow file to clean contamination from mapped PE reads against genomes representing putative contamination
  • workflow-variants.wdl: main workflow file to call variants using mapped and cleaned reads
  • workflow-mapping.outputs.json: defines location for resultant outputs and logs from mapping workflow
  • workflow-cleaning.outputs.json: defines location for resultant outputs and logs from cleaning workflow
  • workflow-variants.outputs.json: defines location for resultant outputs and logs from variants workflow
  • inputReads.txt: example input file listing fastq read files for the mapping step
  • cleanup.conf: example configuration file describing putative host contamination for the cleaning step
  • inputBams.txt: example input file listing resultant BAM files for the variant calling step
  • references.txt: contains a list of example reference genomes
  • perl_scripts: contains Perl scripts used by the pipeline
  • scripts: contains Python scripts used by the pipeline
  • R_scripts: contains R scripts used by the pipeline
  • sub_workflows: sub-workflows, one for each of the workflow steps
  • tasks: workflow tasks
  • cromwell-50.jar: java archive file required to run the workflow.

Two runtime configuration files have been provided: one for a stand-alone server (workflow-runtime.local.config) and one for an HPC environment that supports the Slurm scheduler (workflow-runtime.slurm.config). These files have already been optimised. For the Slurm configuration you only need to define the HPC partition on line 35 ("String rt_queue"); change this to the partition you have access to in your HPC environment.
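
For example, to locate the setting:

> grep -n 'rt_queue' workflow-runtime.slurm.config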

The files workflow-mapping.outputs.json, workflow-cleaning.outputs.json and workflow-variants.outputs.json define the directories to which the result files are copied. Modify them if you want to change the default output directories outputMapping, outputCleaning and outputVariants. These output directories are created under the escalibur directory.

NOTE: delete the output directories from previous runs. If files are already there and a name matches during the copy, the workflow may fail.
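
For example, from the escalibur directory:

> rm -rf outputMapping outputCleaning outputVariants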

The Singularity directory contains the definition file for the software used in Escalibur. A pre-built Singularity image can be downloaded from library://pakorhon/workflows/escalibur:0.0.1-beta.

> singularity pull escalibur.sif library://pakorhon/workflows/escalibur:0.0.1-beta
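
You can verify the downloaded image, e.g. by inspecting its metadata:

> singularity inspect escalibur.sif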

Step 2: Test run

To confirm that the workflows (mapping, cleaning and variant calling) function correctly, fix the required absolute paths, marked by three dots (...), in workflow-mapping.json, workflow-cleaning.json and workflow-variants.json, and in the configuration files cleanup.conf and inputBams.txt.
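
To find any placeholder paths you may have missed, you can search for the three dots:

> grep -n '\.\.\.' workflow-*.json cleanup.conf inputBams.txt

Then run each workflow with the provided test files and parameter settings: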

> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-mapping.wdl -i workflow-mapping.json -o workflow-mapping.outputs.json > out.mapping 2> err.mapping
> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-cleaning.wdl -i workflow-cleaning.json -o workflow-cleaning.outputs.json > out.cleaning 2> err.cleaning
> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-variants.wdl -i workflow-variants.json -o workflow-variants.outputs.json > out.variants 2> err.variants

Slurm file templates runMapping.slurm, runCleaning.slurm and runVariants.slurm are available for each workflow.
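
For example, to submit the mapping workflow:

> sbatch runMapping.slurm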

NOTE: the default parameter settings for run times, memory usage and module loading in these files may require adjustment when run in an HPC environment using Slurm. The current settings should be sufficient for the test run.

After the runs are complete, the results will be in the output directories outputMapping, outputCleaning and outputVariants. You can compare the result in outputVariants/full_genotype_output.vcf to that of the pre-run TestResults/full_genotype_output.vcf.
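
For example:

> diff outputVariants/full_genotype_output.vcf TestResults/full_genotype_output.vcf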

Step 3: Mapping

Make a directory for your fastq files, e.g. Reads, and copy your paired-end raw data there, as shown below.

> mkdir Reads
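
For example (the source path below is only a placeholder):

> cp /path/to/raw_data/*.fastq.gz Reads/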

It should look something like the provided test data:

> ls TestReads/
1-1_r1.fastq.gz  32-1_r1.fastq.gz  44-1_r1.fastq.gz
1-1_r2.fastq.gz  32-1_r2.fastq.gz  44-1_r2.fastq.gz

Run the Python script to create a file listing your input samples, and edit the resulting file to match your sample identifiers and libraries.

> python3 scripts/inputArgMaker.py -d Reads/ -p -ps 33 -pq 20 -pl ILLUMINA -ml 50 -o inputReads.txt 

The edited output file is shown below. The script will automatically sort the files by size.

> cat inputReads.txt
# Prefix PE/SE	MinLen	PhredS	Sequencer	PhredQ	Library	Read Group ID	Sample	Platform Unit	First pair of PE reads		Second pair of PE reads
test1	 PE	50	33	ILLUMINA	28	LIB1	CL100082180L1	SM1	CL100082180L1	./TestReads/1-1_r1.fastq.gz	./TestReads/1-1_r2.fastq.gz
test2	 PE	50	33	ILLUMINA	20	LIB2	CL100082180L1	SM2	CL100082180L1	./TestReads/44-1_r1.fastq.gz	./TestReads/44-1_r2.fastq.gz
test3	 PE	50	33	ILLUMINA	20	LIB3	CL100034574L1	SM2	CL100034574L1	./TestReads/32-1_r1.fastq.gz	./TestReads/32-1_r2.fastq.gz

NOTE: If several libraries are embedded in a single read file, the library-specific reads have to be separated into their own files before creating the inputReads.txt file. Conversely, the inputReads.txt file format can accommodate multiple library files for a single sample.

  • Prefix: Prefix for the resultant files from trimming.
  • PE/SE: Paired-End/Single-End reads as input.
  • MinLen: Minimum Length of reads after trimming.
  • PhredS: Phred encoding used by the sequencer (33 or 64).
  • Sequencer: Name of the sequencer.
  • PhredQ: Phred cut-off score used in trimming.
  • Library: Identifier for the library.
  • Read Group ID: Identifier for the read groups required by GATK (inputArgMaker tries to find this from FASTQ reads). Refer to (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups).
  • Sample: Identifier for the sample; used as the prefix for the resultant sample-specific files.
  • Platform Unit (optional): Information about flow cell, lane and sample. Helps GATK in recalibration (inputArgMaker copies Read Group ID here). Refer to (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups).
  • First pair of PE reads: Relative path to the forward pair of PE reads.
  • Second pair of PE reads: Relative path to the reverse pair of PE reads.

Create a file listing the reference genomes and configure the workflow-mapping.json file. An example reference file (references.txt) has been created for you; use it as a template to create your own. Ensure there is no whitespace at the end of a line, or the Cromwell engine will throw an error. Reads are mapped to these reference files, and the best-matching reference is selected for variant calling.

> cat references.txt
scf00001	./TestReferences/scf00001.fa
scf00013	./TestReferences/scf00013.fa

NOTE: the reference label (e.g. scf00001) must be a substring of the reference FASTA file name (e.g. scf00001.fa).
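
A minimal shell check for this, assuming the two-column format shown above (label, then path):

> while read -r label path; do case "$(basename "$path")" in *"$label"*) ;; *) echo "WARNING: label $label not found in $(basename "$path")" ;; esac; done < references.txt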

[Figure: flow of information and appearance of the labels (Prefix, Sample, Label) in file names, as defined in inputReads.txt and references.txt.]

workflow-mapping.json config file

Add the paths of your fastq and reference genome input files, change parameters as appropriate, and adjust the absolute paths for the Singularity image. If mapping_workflow.readQc is set to yes, reads are trimmed both for quality and for adapters. The adapters to trim are given in mapping_workflow.pe_filtering_workflow.trimmomatic_pe_task.truseq_pe_adapter. If you want to use custom adapters, copy them to the adapters directory and refer to your custom file instead of the default TruSeq3-PE.fa. If you do not want to trim adapters, use the empty.fa file instead. For BGISEQ adapters, refer to https://en.mgitech.cn/Download/download_file/id/71.

{
  "## CONFIG FILE": "WDL",
  "mapping_workflow.inputSampleFile": "./inputReads.txt",
  "mapping_workflow.inputReferenceFile": "./references.txt",

  "## Parameters for samtools read filtering": "-F 4 does filters unmapped reads from resultant files",
  "mapping_workflow.samtoolsParameters": "-F 4",
  
  "## Is read QC required": "yes or no",
  "mapping_workflow.readQc": "yes",
  "## What is the ploidy of given genome": "1 for haploid, 2 for diploid, etc.",
  "mapping_workflow.ploidy": 2,
  
  "## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
  "mapping_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
  "mapping_workflow.singularityBindPath": "/home/.../escalibur/",

  "## trimmomatic adapters": "",
  "mapping_workflow.pe_filtering_workflow.trimmomatic_pe_task.truseq_pe_adapter":"./adapters/TruSeq3-PE.fa",
  "mapping_workflow.pe_filtering_workflow.trimmomatic_se_task.truseq_se_adapter":"./adapters/TruSeq3-SE.fa",
  
  "## Indexing sub workflow task parameters": "Samtools index run time parameters",
  "mapping_workflow.index_sub_workflow.indexing_sam_task.IST_minutes": 300,
  "mapping_workflow.index_sub_workflow.indexing_sam_task.IST_threads": 16,
  "mapping_workflow.index_sub_workflow.indexing_sam_task.IST_mem": 30000,
  .
  .
  .
}

Run the mapping workflow.

> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-mapping.wdl -i workflow-mapping.json -o workflow-mapping.outputs.json > out.mapping 2> err.mapping

The resultant BAM files will be copied to the outputMapping directory.

Step 4 (optional): Cleaning

If you suspect 'host' contamination in your data, you can remove it using the cleaning workflow. Define a file describing the contamination: the first column gives the sample identifier, the second the resultant BAM file from the mapping workflow, and the third the putative contaminant genome assembly.

> cat cleanup.conf
SM1	/home/.../escalibur/outputMapping/SM1.scf00001.MarkDup.bam	/home/.../escalibur/Hosts/host1.fa
SM2	/home/.../escalibur/outputMapping/SM2.scf00001.MarkDup.bam	/home/.../escalibur/Hosts/host1.fa

NOTE: you have to use absolute paths both for the BAM files and for the contaminant reference genome (here host1.fa).
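
As a sketch only (assuming your BAM files follow the MarkDup naming shown above and a single contaminant genome in Hosts/host1.fa), a loop like the following can generate cleanup.conf with absolute paths:

> for bam in "$PWD"/outputMapping/*.MarkDup.bam; do printf '%s\t%s\t%s\n' "$(basename "$bam" | cut -d. -f1)" "$bam" "$PWD/Hosts/host1.fa"; done > cleanup.conf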

workflow-cleaning.json config file

Add the path of your cleaning configuration file (here cleanup.conf) and adjust the absolute paths for the Singularity image.

{
  "## CONFIG FILE": "WDL",
  "cleaning_workflow.inputContaminantFile": "./cleanup.conf",
  
  "## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
  "cleaning_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
  "cleaning_workflow.singularityBindPath": "/home/.../escalibur/",

  "cleaning_workflow.indexing_bwa_task.IBT_minutes": 60,
  "cleaning_workflow.indexing_bwa_task.IBT_threads": 1,
  "cleaning_workflow.indexing_bwa_task.IBT_mem": 16000,

  "######################################":"########################################",
  "CLEANING":"PARAMETERS",
  "######################################":"########################################",
  "cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_minutes": 600,
  "cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_threads": 4,
  "cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_mem": 32000,

  "cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_minutes": 300,
  "cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_threads": 4,
  "cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_mem": 32000,

  "cleaning_workflow.refsBySample.RBS_minutes": 5,
  "cleaning_workflow.refsBySample.RBS_threads": 1,
  "cleaning_workflow.refsBySample.RBS_mem": 4000
}

Run the cleaning workflow.

> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-cleaning.wdl -i workflow-cleaning.json -o workflow-cleaning.outputs.json > out.cleaning 2> err.cleaning

The resultant cleaned BAM files will be copied to the outputCleaning directory. You can repeat the workflow if you suspect more than one contaminant genome per sample. In that case, configure cleanup.conf to point to the BAM files from the previous cleaning round, and define a new output directory for each round in the workflow-cleaning.outputs.json file.

Step 5: Variant calling

Define a file listing the BAM files used for variant calling: the first column gives the sample identifier, and the second the resultant BAM file from either the mapping or the cleaning workflow.

> cat inputBams.txt
SM1	/home/.../escalibur/outputMapping/SM1.scf00001.MarkDup.bam
SM2	/home/.../escalibur/outputCleaning/SM2.scf00001.MarkDup.cleaned.bam

workflow-variants.json config file

Add the path of your file listing the locations of the BAM files (here inputBams.txt), and add the location of the selected reference genome (found in outputMapping/best.ref) and its label as defined in the references.txt file. Adjust the absolute paths for the Singularity image and adjust other parameters; in particular, decide whether you want to recalibrate the BAM files by setting "variants_workflow.call_type" to "independent".
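
You can check which reference the mapping workflow selected:

> cat outputMapping/best.ref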

{
  "## CONFIG FILE": "WDL",
  "variants_workflow.inputSampleFile": "./inputBams.txt",
  "variants_workflow.selectedRefFile": "TestReferences/scf00001.fa",
  "variants_workflow.selectedRefLabel": "scf00001",
  
  "## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
  "variants_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
  "variants_workflow.singularityBindPath": "/home/.../escalibur/",

  "## Which variant call workflow to use": "fast or independent",
  "variants_workflow.call_type": "fast",
  
  "## Variant filtering expressions": "For SNPs and INDELs",
  "variants_workflow.SNP_filt_exp": "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0",
  "variants_workflow.INDEL_filt_exp": "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0",

  "## Variant Filter params": "Variant filter, indel, snps, report making: Safe to leave as default",
  "variants_workflow.ploidy": 2,
  "variants_workflow.maxIndelSize": 60,
  "variants_workflow.scafNumLim": 95,
  "variants_workflow.scafNumCo": 2,
  "variants_workflow.scafLenCutOff": 0,
  "variants_workflow.ldWinSize": 10,
  "variants_workflow.ldWinStep": 5,
  "variants_workflow.ldCutOff": 0.3,
  "variants_workflow.snp_indel_var_filtering_workflow.indelFilterName": "Indel_filter",
  "variants_workflow.snp_indel_var_filtering_workflow.indelFilterExpression": "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0",
  "variants_workflow.snp_indel_var_filtering_workflow.snpFilterName": "Snp_filter",
  "variants_workflow.snp_indel_var_filtering_workflow.snpFilterExpression": "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0",
  "variants_workflow.snp_indel_var_filtering_workflow.vfindel_tk.selectType": "",
  "variants_workflow.snp_indel_var_filtering_workflow.vfsnp_tk.selectType": "",

  "## Build chromosome map":"map_def_scf_lim_task",
  "variants_workflow.snp_indel_var_filtering_workflow.map_def_scf_lim_task.scafLenCutOff": 1000000,
  "variants_workflow.snp_indel_var_filtering_workflow.map_def_scf_lim_task.scafNumCo": 3,

  "## Indexing sub workflow task parameters": "Samtools index run time parameters",
  "variants_workflow.ref_index.IST_minutes": 300,
  "variants_workflow.ref_index.IST_threads": 2,
  "variants_workflow.ref_index.IST_mem": 8000,
  .
  .
  .
}

Run the variant calling workflow.

> java -Dconfig.file=./workflow-runtime.local.config  -jar ./cromwell-50.jar run workflow-variants.wdl -i workflow-variants.json -o workflow-variants.outputs.json > out.variants 2> err.variants

The resultant files will be copied to the outputVariants directory. These include the filtered variant calls (full_genotype_output.vcf) and, if the independent call_type is selected, recalibrated BAM files.

Other considerations

Resource allocation in HPC environment

Wall time, memory usage and thread count (_minutes, _mem, _threads) given in the .json files for each workflow can vary substantially between datasets and may require adjustment in an HPC environment using Slurm. This can lead to frequent restarts of the workflow after each adjustment. We have automated this task with scripts that detect failed resource allocations and double them on each round. These scripts are located in the Automation directory and can be run as follows:

> cd Automation
> sh init.sh # Copies the content of ../tasks directory to tasksOrig directory
> sbatch runMapping.slurm # Runs runLoopMapping.sh in a worker node
> sbatch runCleaning.slurm # Runs runLoopCleaning.sh in a worker node
> sbatch runVariants.slurm # Runs runLoopVariants.sh in a worker node

The runLoop*.sh scripts copy resource allocations from the collective runtimes.json file to the files in the ../tasks directory, run the workflow, double any failed resource allocations in the ../tasks files, and rerun the workflow until it succeeds or ten rounds have passed. Copying resource allocations directly into the files in the ../tasks directory is necessary to guarantee proper functioning of call-caching.

NOTE: automated resource allocation adjustment is experimental; it should be monitored while running and may require modifications to the scripts to function properly.

Disk usage

Cromwell creates duplicate copies of files while running the workflows. It is therefore recommended to remove the cromwell-executions directory after each workflow run if disk space is getting scarce.

> rm -r cromwell-executions

In particular, hundreds of samples can sum up to terabytes of data, and disk space may become an issue if unused files are not removed.
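
To see how much space the intermediate files occupy:

> du -sh cromwell-executions/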

Troubleshooting

If the output text does not reveal the error, you can try to find an error message using the following commands:

> find cromwell-executions/ -name stderr -exec cat {} \; | grep -i fatal
> find cromwell-executions/ -name stderr -exec cat {} \; | less
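
Cromwell also writes a return-code file named rc for each task; listing rc files that do not contain 0 is one way to locate failed tasks (a sketch, assuming the default execution layout):

> find cromwell-executions/ -name rc -exec grep -L '^0$' {} \;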

Most commonly encountered error cases:

  • Singularity is not running correctly. Typically you will need help from your administrator to get Singularity properly installed.
  • The Singularity image escalibur.sif has not been downloaded.
  • Check that you are using the correct runtime configuration file (workflow-runtime.local.config or workflow-runtime.slurm.config) when calling cromwell-50.jar.
  • Absolute file paths for Singularity/Trimmomatic, input files or contaminant genomes have not been updated, or are wrong, in the workflow-*.json, inputBams.txt or cleanup.conf configuration files, respectively.
  • The run-time and memory requirements defined for some tasks in the .json configuration files are not sufficient to run the pipeline in an HPC environment.
  • If you want to run the pipeline in an HPC environment with a job scheduler other than Slurm, you have to create the related configuration file yourself.
  • The pipeline has not been tested in environments other than Linux, and we expect users to encounter challenges when trying to run it in, e.g., a Mac environment.

Version History

v0.3-beta (earliest), created 20th Apr 2022 by Pasi Korhonen; frozen at commit fa7d814. Test results are from a stand-alone server in an independent run.