DNA-Seq pipeline
Here we provide the tools to perform paired-end or single-read DNA-Seq analysis, including raw data quality control, read mapping, variant calling and variant filtering.
Pipeline Workflow
All analysis steps are illustrated in the pipeline flowchart. For paired-end reads, the corresponding FASTQ files should be named using the .R1.fastq.gz and .R2.fastq.gz suffixes. Specify the desired analysis details for your data in the essential.vars.groovy file (see below) and run the pipeline dnaseq.pipeline.groovy as described here. After the pipeline has run, a markdown file variantreport.Rmd is generated in the output reports folder; it can then be converted to a final HTML report using the knitr R package. GATK requires the chromosomes in BAM files to be karyotypically ordered, so it is best to use a karyotypically ordered genome FASTA file as the reference for the pipeline (assigned in essential.vars.groovy, see below).
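The naming convention for paired-end input can be checked before a run. The following is a minimal sketch, not part of the pipeline: the function name `pair_fastq` and its return shape are assumptions made for illustration, and only the two suffixes come from the text above.

```python
R1_SUFFIX = ".R1.fastq.gz"
R2_SUFFIX = ".R2.fastq.gz"


def pair_fastq(filenames):
    """Group FASTQ file names into R1/R2 pairs by their shared prefix.

    Returns (pairs, unpaired): pairs maps the sample prefix to an
    (R1, R2) tuple; unpaired lists names that lack a mate or do not
    follow the suffix convention at all.
    """
    names = set(filenames)
    pairs = {}
    unpaired = []
    for name in sorted(names):
        if name.endswith(R1_SUFFIX):
            prefix = name[: -len(R1_SUFFIX)]
            mate = prefix + R2_SUFFIX
            if mate in names:
                pairs[prefix] = (name, mate)
            else:
                unpaired.append(name)
        elif name.endswith(R2_SUFFIX):
            prefix = name[: -len(R2_SUFFIX)]
            if prefix + R1_SUFFIX not in names:
                unpaired.append(name)
        else:
            # Name does not follow the .R1/.R2 convention.
            unpaired.append(name)
    return pairs, unpaired
```

Calling `pair_fastq` on the names of the files in your raw data folder and inspecting the second return value shows which files would need renaming for a paired-end run.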
The pipeline includes
- Quality control of raw data with FastQC
- Read mapping to the reference genome using BWA
- Identify and remove duplicate reads with Picard MarkDuplicates
- Realign BAM files at Indel positions using GATK
- Recalibrate Base Qualities in BAM files using GATK
- Variant calling using GATK UnifiedGenotyper and GATK HaplotypeCaller
- Calculate VQSLOD scores for further filtering variants using GATK VariantRecalibrator and ApplyRecalibration
- Calculate the basic properties of variants as triplets for "all", "known" and "novel" variants in comparison to dbSNP using GATK VariantEval
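Because the GATK steps above depend on karyotypically ordered chromosomes, it can help to inspect the reference before mapping. The sketch below reads contig names from the text of a FASTA index (the `.fai` file written by `samtools faidx`, whose first tab-separated column is the contig name) and tests their order. It rests on a simplifying assumption: only numbered chromosomes and X/Y are ranked (numeric ascending, then X, then Y), while mitochondrial, unplaced and decoy contigs are ignored because their position differs between genome builds. All function names are hypothetical.

```python
import re


def karyotype_rank(contig):
    """Rank numbered chromosomes and X/Y; return None for other contigs."""
    name = re.sub(r"^chr", "", contig, flags=re.IGNORECASE)
    if name.isascii() and name.isdigit():
        return int(name)
    if name.upper() == "X":
        return 10_000
    if name.upper() == "Y":
        return 10_001
    return None  # mitochondrial, unplaced and decoy contigs are not checked


def contigs_from_fai(fai_text):
    """Extract contig names (first column) from the text of a .fai file."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def is_karyotypically_ordered(contigs):
    """True if the ranked contigs appear in ascending karyotypic order."""
    ranks = [r for r in map(karyotype_rank, contigs) if r is not None]
    return ranks == sorted(ranks)
```

A reference sorted lexicographically (1, 10, 11, ..., 2) is the typical case this check flags.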
Pipeline parameter settings
- essential.vars.groovy: essential parameters describing the experiment, including:
  - ESSENTIAL_PROJECT: your project folder name
  - ESSENTIAL_BWA_REF: path to the BWA-indexed reference genome
  - ESSENTIAL_CALL_REGION: path to a BED file containing the regions to limit variant calling to (optional)
  - ESSENTIAL_PAIRED: either paired-end ("yes") or single-end ("no") design
  - ESSENTIAL_KNOWN_VARIANTS: dbSNP from the GATK resource bundle (crucial for the base quality recalibration step)
  - ESSENTIAL_HAPMAP_VARIANTS: HapMap variants provided by the GATK bundle (essential for Variant Score Recalibration)
  - ESSENTIAL_OMNI_VARIANTS: Omni variants provided by the GATK bundle (essential for Variant Score Recalibration)
  - ESSENTIAL_MILLS_VARIANTS: Mills variants provided by the GATK bundle (essential for Variant Score Recalibration)
  - ESSENTIAL_THOUSAND_GENOMES_VARIANTS: 1000 Genomes variants provided by the GATK bundle (essential for Variant Score Recalibration)
  - ESSENTIAL_THREADS: number of threads for parallel tasks
- additional (more specialized) parameters can be given in the var.groovy files of the individual pipeline modules
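The essential parameters can be sanity-checked before launching a run. The sketch below is an illustration only and does not read essential.vars.groovy itself: it assumes the values have been copied by hand into a Python dictionary keyed by the parameter names listed above. The function `check_settings` and the injectable `exists` callback are hypothetical.

```python
from pathlib import Path

# Parameters that should point at existing resource files.
REQUIRED_FILES = [
    "ESSENTIAL_KNOWN_VARIANTS",
    "ESSENTIAL_HAPMAP_VARIANTS",
    "ESSENTIAL_OMNI_VARIANTS",
    "ESSENTIAL_MILLS_VARIANTS",
    "ESSENTIAL_THOUSAND_GENOMES_VARIANTS",
]


def check_settings(settings, exists=lambda p: Path(p).is_file()):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in ("ESSENTIAL_PROJECT", "ESSENTIAL_BWA_REF"):
        if not settings.get(key):
            problems.append(f"{key} is not set")
    if settings.get("ESSENTIAL_PAIRED") not in ("yes", "no"):
        problems.append('ESSENTIAL_PAIRED must be "yes" or "no"')
    threads = settings.get("ESSENTIAL_THREADS")
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        problems.append("ESSENTIAL_THREADS must be a positive integer")
    for key in REQUIRED_FILES:
        value = settings.get(key)
        if not value:
            problems.append(f"{key} is not set")
        elif not exists(value):
            problems.append(f"{key}: file not found: {value}")
    # The call region is optional, so it is only checked when given.
    region = settings.get("ESSENTIAL_CALL_REGION")
    if region and not exists(region):
        problems.append(f"ESSENTIAL_CALL_REGION: file not found: {region}")
    return problems
```

ESSENTIAL_BWA_REF is only checked for presence here, since a BWA index consists of several files and how the path is interpreted is left to the pipeline.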
Programs required
- Bedtools
- BWA
- FastQC
- GATK
- Picard
- Samtools
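Whether the command-line tools are reachable can be tested with a few lines of Python. This is a sketch under assumptions: the executable names `bedtools`, `bwa`, `fastqc` and `samtools` are the usual command names of these programs, and GATK and Picard are left out because they are often distributed as Java jar files and this section does not state how the pipeline locates them. The function `missing_commands` is hypothetical.

```python
import shutil

# Usual executable names; GATK and Picard are not checked (see note above).
COMMANDS = ["bedtools", "bwa", "fastqc", "samtools"]


def missing_commands(commands=COMMANDS, which=shutil.which):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All checked commands are available.")
```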
Version History
Version 1 (earliest) Created 7th Oct 2020 at 08:43 by Sergi Sayols
Created: 7th Oct 2020 at 08:43
Last updated: 10th Jan 2022 at 15:19