DNA-seq
Version 1

Workflow Type: Bpipe
Stable

DNA-Seq pipeline

Here we provide the tools to perform paired-end or single-read DNA-Seq analysis, including raw data quality control, read mapping, variant calling and variant filtering.

Pipeline Workflow

All analysis steps are illustrated in the pipeline flowchart. For paired-end reads, the corresponding FASTQ files must be named with the suffixes .R1.fastq.gz and .R2.fastq.gz. Specify the analysis details for your data in the essential.vars.groovy file (see below) and run the pipeline dnaseq.pipeline.groovy as described here. After the pipeline has run, a markdown file variantreport.Rmd is generated in the output reports folder; it can then be converted to a final HTML report using the knitr R package.

GATK requires the chromosomes in BAM files to be karyotypically ordered. It is therefore best to use a karyotypically ordered genome FASTA file as the reference for the pipeline (assigned in essential.vars.groovy, see below).
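The paired-end naming convention described above can be checked before starting a run. The following is a minimal Python sketch, not part of the pipeline itself; the function name `pair_fastq_files` and the example paths are hypothetical, and only the two suffixes are taken from the text.

```python
from pathlib import Path

# Suffixes required by the pipeline for paired-end input (from the text above).
R1_SUFFIX = ".R1.fastq.gz"
R2_SUFFIX = ".R2.fastq.gz"


def pair_fastq_files(filenames):
    """Group FASTQ file names into (R1, R2) pairs by their shared sample prefix.

    Returns a dict {sample: (r1_path, r2_path)} and a sorted list of samples
    for which only one mate was found. Files with other suffixes are ignored.
    """
    r1, r2 = {}, {}
    for name in filenames:
        base = Path(name).name
        if base.endswith(R1_SUFFIX):
            r1[base[: -len(R1_SUFFIX)]] = name
        elif base.endswith(R2_SUFFIX):
            r2[base[: -len(R2_SUFFIX)]] = name
    # Samples present in both dicts are complete pairs.
    pairs = {s: (r1[s], r2[s]) for s in sorted(r1.keys() & r2.keys())}
    # Samples present in exactly one dict are missing a mate.
    unpaired = sorted(r1.keys() ^ r2.keys())
    return pairs, unpaired
```

A non-empty `unpaired` list would indicate a missing or misnamed mate file, which is worth fixing before the pipeline is started.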

The pipeline includes:

  • Quality control of raw data with FastQC
  • Read mapping to the reference genome using BWA
  • Identification and removal of duplicate reads with Picard MarkDuplicates
  • Realignment of BAM files at indel positions using GATK
  • Recalibration of base qualities in BAM files using GATK
  • Variant calling using GATK UnifiedGenotyper and GATK HaplotypeCaller
  • Calculation of VQSLOD scores for further variant filtering using GATK VariantRecalibrator and ApplyRecalibration
  • Calculation of basic variant properties as triplets for "all", "known" and "novel" variants in comparison to dbSNP using GATK VariantEval
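To illustrate what the "all", "known" and "novel" triplet in the last step means, here is a minimal Python sketch of the stratification idea. It is not GATK's implementation: VariantEval computes many metrics per stratum, and matching on exact (chromosome, position, ref, alt) tuples is an assumption made here for simplicity. The function name `split_by_dbsnp` is hypothetical.

```python
def split_by_dbsnp(variants, dbsnp_sites):
    """Stratify called variants by their presence in a dbSNP-like set.

    variants    : iterable of (chrom, pos, ref, alt) tuples
    dbsnp_sites : set of tuples in the same format (assumed representation)
    """
    all_variants = list(variants)
    # "known" variants are those already catalogued; the rest are "novel".
    known = [v for v in all_variants if v in dbsnp_sites]
    novel = [v for v in all_variants if v not in dbsnp_sites]
    return {"all": all_variants, "known": known, "novel": novel}
```

Any per-variant property (for example a count or a transition/transversion ratio) can then be reported once for each of the three groups.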

Pipeline parameter settings

  • essential.vars.groovy: essential parameters describing the experiment, including:
    • ESSENTIAL_PROJECT: your project folder name
    • ESSENTIAL_BWA_REF: path to the BWA-indexed reference genome
    • ESSENTIAL_CALL_REGION: path to a BED file containing the regions to limit variant calling to (optional)
    • ESSENTIAL_PAIRED: either paired-end ("yes") or single-end ("no") design
    • ESSENTIAL_KNOWN_VARIANTS: dbSNP from the GATK resource bundle (crucial for the BaseQualityRecalibration step)
    • ESSENTIAL_HAPMAP_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
    • ESSENTIAL_OMNI_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
    • ESSENTIAL_MILLS_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
    • ESSENTIAL_THOUSAND_GENOMES_VARIANTS: variants provided by the GATK bundle (essential for Variant Score Recalibration)
    • ESSENTIAL_THREADS: number of threads for parallel tasks
  • additional (more specialized) parameters can be given in the var.groovy files of the individual pipeline modules
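The parameters listed above can be sanity-checked before a run. The Python sketch below assumes that essential.vars.groovy consists of simple `NAME="value"` or `NAME=value` assignments, one per line; this file layout is an assumption, not something stated here, and the function names are hypothetical. Only the parameter names and the "yes"/"no" values come from the list above.

```python
import re

# Parameters described above; ESSENTIAL_CALL_REGION is documented as optional.
REQUIRED = [
    "ESSENTIAL_PROJECT",
    "ESSENTIAL_BWA_REF",
    "ESSENTIAL_PAIRED",
    "ESSENTIAL_KNOWN_VARIANTS",
    "ESSENTIAL_HAPMAP_VARIANTS",
    "ESSENTIAL_OMNI_VARIANTS",
    "ESSENTIAL_MILLS_VARIANTS",
    "ESSENTIAL_THOUSAND_GENOMES_VARIANTS",
    "ESSENTIAL_THREADS",
]

# Assumed line format: ESSENTIAL_NAME = "value"  (quotes optional).
ASSIGNMENT = re.compile(r'^\s*(ESSENTIAL_\w+)\s*=\s*(.+?)\s*$')


def parse_essential_vars(text):
    """Collect ESSENTIAL_* assignments from the file text into a dict of strings."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            # Strip surrounding single or double quotes from the value.
            values[match.group(1)] = match.group(2).strip("\"'")
    return values


def missing_required(values):
    """Return the required parameters that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


def paired_setting_is_valid(values):
    """ESSENTIAL_PAIRED is documented as either "yes" or "no"."""
    return values.get("ESSENTIAL_PAIRED") in ("yes", "no")
```

Reading the file with `open("essential.vars.groovy").read()` and passing the text to `parse_essential_vars` would give a quick report of unset parameters, provided the file follows the assumed layout.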

Programs required

  • Bedtools
  • BWA
  • FastQC
  • GATK
  • Picard
  • Samtools
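Whether the command-line tools are available can be checked up front. The following Python sketch looks up executables on the PATH; the executable names below are the commonly used ones and are an assumption about the local installation. GATK and Picard are left out because they are often distributed as Java jar files rather than PATH executables, so their location would have to be checked separately.

```python
import shutil

# Assumed executable names for the tools listed above (installation-dependent).
ASSUMED_EXECUTABLES = {
    "Bedtools": "bedtools",
    "BWA": "bwa",
    "FastQC": "fastqc",
    "Samtools": "samtools",
}


def find_tools(executables=ASSUMED_EXECUTABLES, which=shutil.which):
    """Map each tool name to the resolved path of its executable, or None if not found.

    The `which` argument defaults to shutil.which and can be replaced for testing.
    """
    return {tool: which(exe) for tool, exe in executables.items()}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```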

Version History

Version 1 (earliest) Created 7th Oct 2020 at 08:43 by Sergi Sayols

Last updated: 10th Jan 2022 at 15:19
