ESCALIBUR
The Escalibur Population Genomic Analysis Pipeline explores key aspects of the population genetics of organisms. It automates three key bioinformatic components of population genomic analysis using the Workflow Description Language (WDL: https://openwdl.org/) together with customised R, Perl, Python and Unix shell scripts. The associated programs are packaged into a platform-independent Singularity image, for which the definition file is provided.
Analysis with Escalibur consists of three steps. Each step runs as a separate workflow in a sequential manner; step 2 is optional.
1. Trimming and mapping of the raw data, and selection of the best reference genome;
2. Removal of contamination from the mapped data;
3. Recalibration, variant calling and filtering.
This implementation runs both locally and in a distributed environment that uses the SLURM job scheduler.
Dependencies
The following software dependencies are required:
- Git
- SLURM scheduler, required for a distributed HPC environment (https://slurm.schedmd.com/documentation.html)
- Python 3.7 (https://www.python.org/)
- Perl 5.26.2 (https://www.perl.org/)
- Java 1.8
- Singularity 3.7.3 (https://sylabs.io/singularity/)
Step 1: Installation
Typically, the installation of Singularity requires root privileges, so contact your administrator to get it correctly installed. The minimum Linux kernel version is 3.8, though >= 3.18 is preferred (https://sylabs.io/guides/3.5/admin-guide/installation.html).
Clone the git repository to a directory on your cluster or stand-alone server.
> git clone --depth 1 -b v0.3-beta https://gitlab.unimelb.edu.au/bioscience/escalibur.git
> cd escalibur
Description of Files
workflow-runtime.local.config: main configuration file for the stand-alone server runtime environment
workflow-runtime.slurm.config: main configuration file for an HPC runtime environment that supports the Slurm job scheduler
workflow-mapping.json: defines the locations of input files, behavioural settings and resource allocations
workflow-cleaning.json: defines the locations of input files and resource allocations
workflow-variants.json: defines the locations of input files, behavioural settings and resource allocations
workflow-mapping.wdl: main workflow file to trim and map PE reads onto the genome
workflow-cleaning.wdl: main workflow file to clean contamination from mapped PE reads against genomes representing putative contamination
workflow-variants.wdl: main workflow file to call variants using the mapped and cleaned reads
workflow-mapping.outputs.json: defines the locations for the resultant outputs and logs of the mapping workflow
workflow-cleaning.outputs.json: defines the locations for the resultant outputs and logs of the cleaning workflow
workflow-variants.outputs.json: defines the locations for the resultant outputs and logs of the variants workflow
inputReads.txt: example input file of fastq read files for the mapping step
cleanup.conf: example configuration file of putative host contamination for the cleaning step
inputBams.txt: example input file of resultant BAM files for the variant calling step
references.txt: list of example reference genomes
perl_scripts: Perl scripts used by the pipeline
scripts: Python scripts used by the pipeline
R_scripts: R scripts used by the pipeline
sub_workflows: sub-workflows, one for each of the workflow steps
tasks: workflow tasks
cromwell-50.jar: Java archive required to run the workflows
Two runtime configuration files are provided: one for a stand-alone server (workflow-runtime.local.config) and one for an HPC environment that supports the Slurm scheduler (workflow-runtime.slurm.config).
These files have already been optimised. For the Slurm configuration you only need to define the HPC partition on line 35 ("String rt_queue"): change this to a partition you have access to in your HPC environment.
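To locate the partition setting quickly:
> grep -n "rt_queue" workflow-runtime.slurm.config
Then edit the quoted value on the reported line to a partition name you have access to, e.g. a hypothetical partition called physical.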
The files workflow-mapping.outputs.json, workflow-cleaning.outputs.json and workflow-variants.outputs.json define the directories the result files are copied to. Modify them if you want to change the default output directories outputMapping, outputCleaning and outputVariants. These output directories are created inside the escalibur directory.
NOTE: delete output directories from previous runs. If files are already there and a name matches during the copy, the workflow may fail.
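For example, to clear the default output directories before a rerun (adjust the names if you have changed the outputs .json files):
> rm -rf outputMapping outputCleaning outputVariants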
The Singularity directory contains the definition file for the software used in Escalibur. A pre-built Singularity image can be downloaded from library://pakorhon/workflows/escalibur:0.0.1-beta.
> singularity pull escalibur.sif library://pakorhon/workflows/escalibur:0.0.1-beta
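As a quick smoke test of the downloaded image, you can run one of the packaged tools; here we assume samtools is included, which the pipeline's samtools tasks suggest:
> singularity exec escalibur.sif samtools --version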
Step 2: Test run
To confirm the correct functioning of the workflows (mapping, cleaning and variant calling), fix the required absolute paths, marked by three dots (...), in workflow-mapping.json, workflow-cleaning.json and workflow-variants.json and in the configuration files cleanup.conf and inputBams.txt, then run the workflows with the provided test data, configuration files and parameter settings.
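The placeholder paths all follow the /home/.../escalibur pattern shown in the example configuration snippets below, so, assuming your clone lives in the current working directory, a single sed call such as the following can fill them in (verify each file afterwards):
> sed -i "s|/home/\.\.\./escalibur|$PWD|g" workflow-*.json cleanup.conf inputBams.txt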
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-mapping.wdl -i workflow-mapping.json -o workflow-mapping.outputs.json > out.mapping 2> err.mapping
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-cleaning.wdl -i workflow-cleaning.json -o workflow-cleaning.outputs.json > out.cleaning 2> err.cleaning
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-variants.wdl -i workflow-variants.json -o workflow-variants.outputs.json > out.variants 2> err.variants
Slurm file templates (runMapping.slurm, runCleaning.slurm and runVariants.slurm) are available for each workflow.
NOTE: the default settings for run times, memory usage and module loading in these files may require adjustment when running in an HPC environment with Slurm. The current settings should suffice for the test run.
After the runs are complete, the results will be in the output directories outputMapping, outputCleaning and outputVariants.
You can compare the resultant outputVariants/full_genotype_output.vcf to the pre-run TestResults/full_genotype_output.vcf.
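Because the VCF header lines (##) may embed run-specific metadata such as dates or command lines, it can be more informative to compare only the records, for example:
> diff <(grep -v "^##" outputVariants/full_genotype_output.vcf) <(grep -v "^##" TestResults/full_genotype_output.vcf)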
Step 3: Mapping
Make a directory for your fastq files, e.g. Reads, and copy your paired-end raw data there.
> mkdir Reads
It should look something like the test data below:
> ls TestReads/
1-1_r1.fastq.gz 32-1_r1.fastq.gz 44-1_r1.fastq.gz
1-1_r2.fastq.gz 32-1_r2.fastq.gz 44-1_r2.fastq.gz
Run the Python script to create a file listing your input samples, then edit the resulting file to match your sample identifiers and libraries.
> python3 scripts/inputArgMaker.py -d Reads/ -p -ps 33 -pq 20 -pl ILLUMINA -ml 50 -o inputReads.txt
The edited output file is shown below. The script automatically sorts the files by size.
> cat inputReads.txt
# Prefix PE/SE MinLen PhredS Sequencer PhredQ Library Read Group ID Sample Platform Unit First pair of PE reads Second pair of PE reads
test1 PE 50 33 ILLUMINA 28 LIB1 CL100082180L1 SM1 CL100082180L1 ./TestReads/1-1_r1.fastq.gz ./TestReads/1-1_r2.fastq.gz
test2 PE 50 33 ILLUMINA 20 LIB2 CL100082180L1 SM2 CL100082180L1 ./TestReads/44-1_r1.fastq.gz ./TestReads/44-1_r2.fastq.gz
test3 PE 50 33 ILLUMINA 20 LIB3 CL100034574L1 SM2 CL100034574L1 ./TestReads/32-1_r1.fastq.gz ./TestReads/32-1_r2.fastq.gz
NOTE: If several libraries are embedded in a single read file, the library-specific reads have to be separated into their own files before creating the inputReads.txt file. Conversely, the inputReads.txt format can accommodate multiple library files for a single sample.
Prefix: Prefix for the resultant files from trimming.
PE/SE: Paired-End/Single-End reads as input.
MinLen: Minimum length of reads after trimming.
PhredS: Phred encoding used by the sequencer (33 or 64).
Sequencer: Name of the sequencer.
PhredQ: Phred cut-off score used in trimming.
Library: Identifier for the library.
Read Group ID: Identifier for the read groups required by GATK (inputArgMaker tries to find this in the FASTQ reads). Refer to https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups.
Sample: Identifier for the sample. Defines the prefix for the resultant sample-specific files.
Platform Unit (optional): Information about the flow cell, lane and sample. Helps GATK in recalibration (inputArgMaker copies the Read Group ID here). Refer to https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups.
First pair of PE reads: Relative path to the forward reads of the PE pair.
Second pair of PE reads: Relative path to the reverse reads of the PE pair.
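Since inputArgMaker tries to derive the Read Group ID from the FASTQ headers, it can be useful to inspect a header yourself; the exact layout varies by sequencer:
> zcat TestReads/1-1_r1.fastq.gz | head -4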
Create a file listing the reference genomes and configure the workflow-mapping.json file.
An example reference file (references.txt) has been created for you. Use it as a template to create your own.
Ensure there is no whitespace at the end of each line, or the Cromwell engine will throw an error.
Reads are mapped to each of these references, and the best matching reference is selected for variant calling.
> cat references.txt
scf00001 ./TestReferences/scf00001.fa
scf00013 ./TestReferences/scf00013.fa
NOTE: the reference label (e.g. scf00001) must be a substring of the reference fasta file name (scf00001.fa).
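A minimal sketch for generating such a file, assuming each label equals the fasta file's basename (echo also guarantees no trailing whitespace):
> for f in TestReferences/*.fa; do echo "$(basename $f .fa) ./$f"; done > references.txt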
The figure below illustrates the flow of information and the appearance of the labels (Prefix, Sample, Label) in file names, as defined in inputReads.txt and references.txt.
workflow-mapping.json config file
Add the paths of your fastq and reference genome input files, change parameters as appropriate, and adjust the absolute paths for the Singularity image. If mapping_workflow.readQc is set to yes, reads are trimmed both for quality and for adapters. The adapters to trim are given in mapping_workflow.pe_filtering_workflow.trimmomatic_pe_task.truseq_pe_adapter. To use custom adapters, copy them to the adapters directory and refer to your custom file instead of the default TruSeq3-PE.fa. If you do not want to trim adapters, use the empty.fa file instead. For BGISEQ adapters, refer to https://en.mgitech.cn/Download/download_file/id/71.
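For instance, with a hypothetical custom adapter file my_adapters.fa, copy it into the adapters directory and point the truseq_pe_adapter key at "./adapters/my_adapters.fa":
> cp /path/to/my_adapters.fa adapters/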
{
"## CONFIG FILE": "WDL",
"mapping_workflow.inputSampleFile": "./inputReads.txt",
"mapping_workflow.inputReferenceFile": "./references.txt",
"## Parameters for samtools read filtering": "-F 4 does filters unmapped reads from resultant files",
"mapping_workflow.samtoolsParameters": "-F 4",
"## Is read QC required": "yes or no",
"mapping_workflow.readQc": "yes",
"## What is the ploidy of given genome": "1 for haploid, 2 for diploid, etc.",
"mapping_workflow.ploidy": 2,
"## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
"mapping_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
"mapping_workflow.singularityBindPath": "/home/.../escalibur/",
"## trimmomatic adapters": "",
"mapping_workflow.pe_filtering_workflow.trimmomatic_pe_task.truseq_pe_adapter":"./adapters/TruSeq3-PE.fa",
"mapping_workflow.pe_filtering_workflow.trimmomatic_se_task.truseq_se_adapter":"./adapters/TruSeq3-SE.fa",
"## Indexing sub workflow task parameters": "Samtools index run time parameters",
"mapping_workflow.index_sub_workflow.indexing_sam_task.IST_minutes": 300,
"mapping_workflow.index_sub_workflow.indexing_sam_task.IST_threads": 16,
"mapping_workflow.index_sub_workflow.indexing_sam_task.IST_mem": 30000,
.
.
.
}
Run the mapping workflow.
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-mapping.wdl -i workflow-mapping.json -o workflow-mapping.outputs.json > out.mapping 2> err.mapping
The resultant BAM files will be copied to the outputMapping directory.
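The best matching reference selected during mapping is recorded in outputMapping/best.ref (the variant calling step refers to it later), so you can check the selection at this point:
> cat outputMapping/best.ref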
Step 4 (optional): Cleaning
If you suspect 'host' contamination in your data, you can remove it using the cleaning workflow. Define a file describing the contamination: the first column gives the sample identifier, the second the resultant BAM file from the mapping workflow, and the third the putative contaminant genome assembly.
> cat cleanup.conf
SM1 /home/.../escalibur/outputMapping/SM1.scf00001.MarkDup.bam /home/.../escalibur/Hosts/host1.fa
SM2 /home/.../escalibur/outputMapping/SM2.scf00001.MarkDup.bam /home/.../escalibur/Hosts/host1.fa
NOTE: you have to use absolute paths both for the BAM files and for the contaminant reference genome (here host1.fa).
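A minimal sketch for generating cleanup.conf, assuming the sample identifier is the first dot-separated field of the BAM file name (as in SM1.scf00001.MarkDup.bam) and a single hypothetical contaminant genome at Hosts/host1.fa:
> for bam in $PWD/outputMapping/*.MarkDup.bam; do echo "$(basename $bam | cut -d. -f1) $bam $PWD/Hosts/host1.fa"; done > cleanup.conf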
workflow-cleaning.json config file
Add the path of your cleaning config file (here cleanup.conf) and adjust the absolute paths for the Singularity image.
{
"## CONFIG FILE": "WDL",
"cleaning_workflow.inputContaminantFile": "./cleanup.conf",
"## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
"cleaning_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
"cleaning_workflow.singularityBindPath": "/home/.../escalibur/",
"cleaning_workflow.indexing_bwa_task.IBT_minutes": 60,
"cleaning_workflow.indexing_bwa_task.IBT_threads": 1,
"cleaning_workflow.indexing_bwa_task.IBT_mem": 16000,
"######################################":"########################################",
"CLEANING":"PARAMETERS",
"######################################":"########################################",
"cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_minutes": 600,
"cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_threads": 4,
"cleaning_workflow.clean_bams_workflow.cleanBams_task.CLEAN_BAMS_mem": 32000,
"cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_minutes": 300,
"cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_threads": 4,
"cleaning_workflow.create_cleaned_bams_workflow.createCleanedBams_task.CREATE_CLEAN_BAMS_mem": 32000,
"cleaning_workflow.refsBySample.RBS_minutes": 5,
"cleaning_workflow.refsBySample.RBS_threads": 1,
"cleaning_workflow.refsBySample.RBS_mem": 4000
}
Run the cleaning workflow.
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-cleaning.wdl -i workflow-cleaning.json -o workflow-cleaning.outputs.json > out.cleaning 2> err.cleaning
The resultant cleaned BAM files will be copied to the outputCleaning directory. You can repeat the workflow if you suspect more than one contaminant genome per sample. In that case, make sure cleanup.conf is configured to describe the BAM files from the previous cleaning round, and define a new output directory for each round in the workflow-cleaning.outputs.json file.
Step 5: Variant calling
Define the file listing the BAM files used for variant calling. The first column gives the sample identifier and the second the resultant BAM file from either the mapping or the cleaning workflow.
> cat inputBams.txt
SM1 /home/.../escalibur/outputMapping/SM1.scf00001.MarkDup.bam
SM2 /home/.../escalibur/outputCleaning/SM2.scf00001.MarkDup.cleaned.bam
workflow-variants.json config file
Add the path of your file listing the locations of the BAM files (here inputBams.txt), and add the location of the selected reference genome (found in outputMapping/best.ref) and its label, as defined in the references.txt file. Adjust the absolute paths for the Singularity image and the other parameters as appropriate; in particular, decide whether you want to recalibrate the BAM files by setting "variants_workflow.call_type" to "independent".
{
"## CONFIG FILE": "WDL",
"variants_workflow.inputSampleFile": "./inputBams.txt",
"variants_workflow.selectedRefFile": "TestReferences/scf00001.fa",
"variants_workflow.selectedRefLabel": "scf00001",
"## Singularity parameters": "absolute paths to the container and the directory to bind visible inside singularity",
"variants_workflow.singularityContainerPath": "/home/.../escalibur/escalibur.sif",
"variants_workflow.singularityBindPath": "/home/.../escalibur/",
"## Which variant call workflow to use": "fast or independent",
"variants_workflow.call_type": "fast",
"## Variant filtering expressions": "For SNPs and INDELs",
"variants_workflow.SNP_filt_exp": "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0",
"variants_workflow.INDEL_filt_exp": "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0",
"## Variant Filter params": "Variant filter, indel, snps, report making: Safe to leave as default",
"variants_workflow.ploidy": 2,
"variants_workflow.maxIndelSize": 60,
"variants_workflow.scafNumLim": 95,
"variants_workflow.scafNumCo": 2,
"variants_workflow.scafLenCutOff": 0,
"variants_workflow.ldWinSize": 10,
"variants_workflow.ldWinStep": 5,
"variants_workflow.ldCutOff": 0.3,
"variants_workflow.snp_indel_var_filtering_workflow.indelFilterName": "Indel_filter",
"variants_workflow.snp_indel_var_filtering_workflow.indelFilterExpression": "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0",
"variants_workflow.snp_indel_var_filtering_workflow.snpFilterName": "Snp_filter",
"variants_workflow.snp_indel_var_filtering_workflow.snpFilterExpression": "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0",
"variants_workflow.snp_indel_var_filtering_workflow.vfindel_tk.selectType": "",
"variants_workflow.snp_indel_var_filtering_workflow.vfsnp_tk.selectType": "",
"## Build chromosome map":"map_def_scf_lim_task",
"variants_workflow.snp_indel_var_filtering_workflow.map_def_scf_lim_task.scafLenCutOff": 1000000,
"variants_workflow.snp_indel_var_filtering_workflow.map_def_scf_lim_task.scafNumCo": 3,
"## Indexing sub workflow task parameters": "Samtools index run time parameters",
"variants_workflow.ref_index.IST_minutes": 300,
"variants_workflow.ref_index.IST_threads": 2,
"variants_workflow.ref_index.IST_mem": 8000,
.
.
.
}
Run the variant calling workflow.
> java -Dconfig.file=./workflow-runtime.local.config -jar ./cromwell-50.jar run workflow-variants.wdl -i workflow-variants.json -o workflow-variants.outputs.json > out.variants 2> err.variants
The resultant files will be copied to the outputVariants directory. These include the filtered variant calls (full_genotype_output.vcf) and recalibrated BAM files (if the independent call_type is selected).
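As a quick sanity check, you can count the variant records in the filtered VCF:
> grep -vc "^#" outputVariants/full_genotype_output.vcf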
Other considerations
Resource allocation in HPC environment
The wall time, memory usage and thread count (_minutes, _mem, _threads) given in the .json files for each workflow can vary substantially and may require adjusting in an HPC environment with Slurm, which can lead to frequent restarting of the workflow after each adjustment. We have automated this task by providing scripts that automatically detect failed resource allocations and double them on each round. These scripts are located in the Automation directory and can be run as follows:
> cd Automation
> sh init.sh # Copies the content of ../tasks directory to tasksOrig directory
> sbatch runMapping.slurm # Runs runLoopMapping.sh in a worker node
> sbatch runCleaning.slurm # Runs runLoopCleaning.sh in a worker node
> sbatch runVariants.slurm # Runs runLoopVariants.sh in a worker node
The runLoop*.sh scripts copy the resource allocations from the collective runtimes.json file to the files in the ../tasks directory, run the workflow, double any failed resource allocations in the ../tasks files, and rerun the workflow until it succeeds or ten rounds have passed. Copying the resource allocations directly into the files in the ../tasks directory is necessary to guarantee the proper functioning of call-caching.
NOTE: automated resource allocation adjustment is experimental; it should be monitored while running and may require modifications to the scripts to function properly.
Disk usage
Cromwell creates duplicate copies of files while running the workflows. It is therefore recommended to remove the cromwell-executions directory after each workflow run if disk space is getting scarce.
> rm -r cromwell-executions
In particular, when there are hundreds of samples summing up to terabytes of data, disk space can become an issue if unused files are not removed.
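You can check how much space the working directory is consuming before deleting it:
> du -sh cromwell-executions/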
Troubleshooting
If the output text does not reveal the error, you can try to find an error message using the following commands:
> find cromwell-executions/ -name stderr -exec cat {} \; | grep -i fatal
> find cromwell-executions/ -name stderr -exec cat {} \; | less
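Cromwell also reports the final workflow status in its stdout log (the exact wording may vary between versions), so grepping the captured output can confirm success or failure:
> grep -i "finished with status" out.mapping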
Most commonly encountered error cases:
- Singularity is not running correctly. Typically you will need help from your administrator to get Singularity properly installed.
- The Singularity image escalibur.sif was not downloaded.
- The wrong runtime configuration file (workflow-runtime.local.config or workflow-runtime.slurm.config) is used when calling cromwell-50.jar.
- Absolute file paths for the Singularity image/Trimmomatic, input files or contaminant genomes are not updated, or are wrong, in the workflow-*.json, inputBams.txt or cleanup.conf configuration files, respectively.
- The run-time and memory requirements defined for some tasks in the .json configuration files are not sufficient to run the pipeline in an HPC environment.
- If you are using the Slurm job scheduler and want to run the pipeline in an HPC environment, you have to create the related configuration file yourself.
- The pipeline has not been tested in environments other than Linux, and we expect users to encounter challenges when trying to run it, e.g. in a Mac environment.
Version History
v0.3-beta (commit fa7d814), frozen. Created 20th Apr 2022 by Pasi Korhonen. Test results from a stand-alone server in an independent run.
