DNA-Seq pipeline

Here we provide the tools to perform paired-end or single-read DNA-Seq analysis, including raw data quality control, read mapping, variant calling and variant filtering.

Pipeline Workflow

All analysis steps are illustrated in the pipeline flowchart. For paired-end reads, the corresponding fastq files should be named with the suffixes .R1.fastq.gz and .R2.fastq.gz. Specify the desired analysis details for your data in the essential.vars.groovy file (see below) and run the pipeline dnaseq.pipeline.groovy as described here. After the pipeline has run, a markdown file variantreport.Rmd is generated in the output reports folder; it can then be converted to a final html report using the knitr R package.

GATK requires the chromosomes in bam files to be karyotypically ordered, so it is best to use an ordered genome fasta file as the reference for the pipeline (assigned in essential.vars.groovy, see below).
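The naming convention for paired-end reads can be checked before starting a run. The following is a minimal sketch, not part of the pipeline itself: the function name pair_fastq_files and its return format are hypothetical, and only the two suffixes are taken from the description above.

```python
from collections import defaultdict

# Suffixes expected for paired-end input, as stated in the description above.
R1_SUFFIX = ".R1.fastq.gz"
R2_SUFFIX = ".R2.fastq.gz"


def pair_fastq_files(filenames):
    """Group paired-end fastq files by sample prefix (hypothetical helper).

    Returns (pairs, unpaired): pairs maps sample prefix -> (R1 file, R2 file),
    unpaired is a sorted list of files without a mate or without a valid suffix.
    """
    mates = defaultdict(dict)
    unpaired = []
    for name in filenames:
        if name.endswith(R1_SUFFIX):
            mates[name[:-len(R1_SUFFIX)]]["R1"] = name
        elif name.endswith(R2_SUFFIX):
            mates[name[:-len(R2_SUFFIX)]]["R2"] = name
        else:
            unpaired.append(name)

    pairs = {}
    for sample, found in mates.items():
        if "R1" in found and "R2" in found:
            pairs[sample] = (found["R1"], found["R2"])
        else:
            # Only one mate is present for this sample.
            unpaired.extend(found.values())
    return pairs, sorted(unpaired)
```

Any file reported in the second return value would need to be renamed or completed with its mate before a paired-end run.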

The pipeline includes

Quality control of raw data with FastQC
Read mapping to the reference genome using BWA
Identification and removal of duplicate reads with Picard MarkDuplicates
Realignment of BAM files at indel positions using GATK
Recalibration of base qualities in BAM files using GATK
Variant calling using GATK UnifiedGenotyper and GATK HaplotypeCaller
Calculation of VQSLOD scores for further variant filtering using GATK VariantRecalibrator and ApplyRecalibration
Calculation of the basic properties of variants as triplets for "all", "known" and "novel" variants in comparison to dbSNP using GATK VariantEval
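To illustrate what "further filtering" on VQSLOD scores can look like downstream, here is a minimal sketch that keeps only VCF records whose VQSLOD annotation in the INFO column reaches a chosen cutoff. The function names and the cutoff are hypothetical. This is not how the pipeline filters: GATK ApplyRecalibration sets the FILTER column itself, based on tranche sensitivity.

```python
def get_vqslod(info_field):
    """Return the VQSLOD value from a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        if key == "VQSLOD":
            return float(value)
    return None


def filter_by_vqslod(vcf_lines, min_vqslod):
    """Keep header lines and records with VQSLOD >= min_vqslod.

    Records without a VQSLOD annotation are dropped. min_vqslod is a
    hypothetical, user-chosen cutoff.
    """
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        score = get_vqslod(fields[7])  # INFO is the 8th VCF column
        if score is not None and score >= min_vqslod:
            kept.append(line)
    return kept
```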

Pipeline parameter settings

essential.vars.groovy: essential parameters describing the experiment, including:
- ESSENTIAL_PROJECT: your project folder name
- ESSENTIAL_BWA_REF: path to the BWA-indexed reference genome
- ESSENTIAL_CALL_REGION: path to a bed file containing the regions to limit variant calling to (optional)
- ESSENTIAL_PAIRED: either paired-end ("yes") or single-end ("no") design
- ESSENTIAL_KNOWN_VARIANTS: dbSNP from the GATK resource bundle (crucial for the BaseQualityRecalibration step)
- ESSENTIAL_HAPMAP_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
- ESSENTIAL_OMNI_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
- ESSENTIAL_MILLS_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
- ESSENTIAL_THOUSAND_GENOMES_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
- ESSENTIAL_THREADS: number of threads for parallel tasks
Additional (more specialized) parameters can be given in the var.groovy files of the individual pipeline modules.
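A quick sanity check of the essential parameters before launching the pipeline might look like the sketch below. It assumes that essential.vars.groovy consists of simple NAME="value" or NAME=value assignments with optional // comments; that file layout is an assumption and is not confirmed by this description. The helper names are hypothetical. ESSENTIAL_CALL_REGION is not checked because it is optional.

```python
# Parameters listed above as needed by the pipeline (ESSENTIAL_CALL_REGION is optional).
REQUIRED_KEYS = [
    "ESSENTIAL_PROJECT",
    "ESSENTIAL_BWA_REF",
    "ESSENTIAL_PAIRED",
    "ESSENTIAL_KNOWN_VARIANTS",
    "ESSENTIAL_HAPMAP_VARIANTS",
    "ESSENTIAL_OMNI_VARIANTS",
    "ESSENTIAL_MILLS_VARIANTS",
    "ESSENTIAL_THOUSAND_GENOMES_VARIANTS",
    "ESSENTIAL_THREADS",
]


def parse_essential_vars(text):
    """Parse NAME="value" / NAME=value lines (assumed file layout)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("ESSENTIAL_") or "=" not in line:
            continue  # skip comments, blank lines and anything else
        key, _, value = line.partition("=")
        value = value.strip()
        if value.startswith('"'):
            value = value[1:].split('"', 1)[0]  # text up to the closing quote
        else:
            value = value.split()[0] if value else ""  # first bare token
        settings[key.strip()] = value
    return settings


def validate(settings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing: {key}" for key in REQUIRED_KEYS if not settings.get(key)]
    paired = settings.get("ESSENTIAL_PAIRED")
    if paired and paired not in ("yes", "no"):
        problems.append('ESSENTIAL_PAIRED must be "yes" or "no"')
    threads = settings.get("ESSENTIAL_THREADS")
    if threads and not (threads.isdigit() and int(threads) > 0):
        problems.append("ESSENTIAL_THREADS must be a positive integer")
    return problems
```

Calling validate(parse_essential_vars(open("essential.vars.groovy").read())) would then list missing or malformed entries; it does not check that the referenced files exist.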

Programs required

Bedtools
BWA
FastQC
GATK
Picard
Samtools
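Whether the required programs are reachable can be checked up front. The sketch below looks up one executable per program on the PATH. The executable names are assumptions: depending on the installation, GATK and Picard may instead be run as java -jar with a jar file, in which case this check does not apply to them.

```python
import shutil

# Program name -> assumed executable name (hypothetical; adjust to your setup).
ASSUMED_EXECUTABLES = {
    "Bedtools": "bedtools",
    "BWA": "bwa",
    "FastQC": "fastqc",
    "GATK": "gatk",
    "Picard": "picard",
    "Samtools": "samtools",
}


def find_missing_tools(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Return the sorted program names whose executable is not found on PATH."""
    return sorted(
        name for name, exe in executables.items() if which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All assumed executables were found.")
```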

Version History

Version 1 (earliest) Created 7th Oct 2020 at 08:43 by Sergi Sayols
