ChIP-Seq pipeline

Here we provide the tools to perform paired end or single read ChIP-Seq analysis, including raw data quality control, read mapping, peak calling, differential binding analysis and functional annotation. As input files you may use either gzipped FASTQ files (.fastq.gz) or mapped read data (.bam files). For paired end reads, the corresponding FASTQ files should be named using the .R1.fastq.gz and .R2.fastq.gz suffixes.
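The naming convention for paired end input can be checked before starting a run. The following is a minimal Python sketch, not part of the pipeline itself: the function name pair_fastq_files and the example file names are hypothetical, and only the .R1.fastq.gz / .R2.fastq.gz suffixes are taken from the description above.

```python
from collections import defaultdict

# Suffixes required for paired end input files.
R1_SUFFIX = ".R1.fastq.gz"
R2_SUFFIX = ".R2.fastq.gz"


def pair_fastq_files(filenames):
    """Group file names into {sample: (R1 file, R2 file)}.

    Files without one of the two suffixes are ignored.
    Raises ValueError if a sample has only one of its two mate files.
    """
    mates = defaultdict(dict)
    for name in filenames:
        if name.endswith(R1_SUFFIX):
            mates[name[: -len(R1_SUFFIX)]]["R1"] = name
        elif name.endswith(R2_SUFFIX):
            mates[name[: -len(R2_SUFFIX)]]["R2"] = name

    incomplete = sorted(s for s, m in mates.items() if len(m) != 2)
    if incomplete:
        raise ValueError("missing mate file for: " + ", ".join(incomplete))
    return {s: (m["R1"], m["R2"]) for s, m in sorted(mates.items())}


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = ["ko_rep1.R1.fastq.gz", "ko_rep1.R2.fastq.gz", "notes.txt"]
    for sample, (r1, r2) in pair_fastq_files(files).items():
        print(sample, r1, r2)
```

In practice the list of file names could come from os.listdir on the folder holding the raw data.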

Pipeline Workflow

All analysis steps are illustrated in the pipeline flowchart. Specify the desired analysis details for your data in the essential.vars.groovy file (see below) and run the pipeline chipseq.pipeline.groovy as described here. A markdown file ChIPreport.Rmd will be generated in the output reports folder after running the pipeline. Subsequently, the ChIPreport.Rmd file can be converted to a final html report using the knitr R-package.

The pipeline includes

raw data quality control with FastQC, BamQC and MultiQC
mapping reads or read pairs to the reference genome using bowtie2 (default) or bowtie1
filter out multimapping reads from bowtie2 output with samtools (optional)
identify and remove duplicate reads with Picard MarkDuplicates (optional)
generation of bigWig tracks for visualisation of alignment with deeptools bamCoverage. For single end design, reads are extended to the average fragment size
characterization of insert size using Picard CollectInsertSizeMetrics (for paired end libraries only)
characterize library complexity by PCR Bottleneck Coefficient using the GenomicAlignments R-package (for single read libraries only)
characterize phantom peaks by cross correlation analysis using the spp R-package (for single read libraries only)
peak calling of IP samples vs. corresponding input controls using MACS2
peak annotation using the ChIPseeker R-package (optional)
differential binding analysis using the DiffBind R-package (optional). For this, input peak files must be listed in NGSpipe2go/tools/diffbind/targets_diffbind.txt and the contrasts of interest in NGSpipe2go/tools/diffbind/contrasts_diffbind.txt (see below)

Pipeline-specific parameter settings

targets.txt: tab-separated txt-file giving information about the analysed samples. The following columns are required:
- IP: bam file name of IP sample
- IPname: IP sample name to be used in plots and tables
- INPUT: bam file name of corresponding input control sample
- INPUTname: input sample name to be used in plots and tables
- group: variable for sample grouping (e.g. by condition)
essential.vars.groovy: essential parameters describing the experiment, including:
- ESSENTIAL_PROJECT: your project folder name
- ESSENTIAL_BOWTIE_REF: full path to bowtie2 indexed reference genome (bowtie1 indexed reference genome if bowtie1 is selected as mapper)
- ESSENTIAL_BOWTIE_GENOME: full path to the reference genome FASTA file
- ESSENTIAL_BSGENOME: Bioconductor genome sequence annotation package
- ESSENTIAL_TXDB: Bioconductor transcript-related annotation package
- ESSENTIAL_ANNODB: Bioconductor genome annotation package
- ESSENTIAL_BLACKLIST: files with problematic 'blacklist regions' to be excluded from analysis (optional)
- ESSENTIAL_PAIRED: either paired end ("yes") or single read ("no") design
- ESSENTIAL_READLEN: read length of library
- ESSENTIAL_FRAGLEN: mean length of library inserts and also minimum peak size called by MACS2
- ESSENTIAL_THREADS: number of threads for parallel tasks
- ESSENTIAL_USE_BOWTIE1: if true use bowtie1 for read mapping, otherwise bowtie2 by default
Additional (more specialized) parameters can be given in the var.groovy-files of the individual pipeline modules.
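Because targets.txt is a plain tab-separated table, its required columns can be verified with a few lines of Python before the pipeline is launched. This is a sketch under assumptions: the helper read_targets and the sample rows are hypothetical, and only the five column names come from the list above.

```python
import csv
import io

# Columns that targets.txt must contain, as listed above.
REQUIRED_COLUMNS = ["IP", "IPname", "INPUT", "INPUTname", "group"]


def read_targets(handle):
    """Read a tab-separated targets table and check the required columns.

    Returns the rows as a list of dicts. Raises ValueError if a required
    column is absent or if a required field is empty in any row.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        raise ValueError("targets.txt lacks column(s): " + ", ".join(missing))

    rows = list(reader)
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        empty = [c for c in REQUIRED_COLUMNS if not (row[c] or "").strip()]
        if empty:
            raise ValueError(f"line {line_no}: empty field(s): {', '.join(empty)}")
    return rows


if __name__ == "__main__":
    # Hypothetical content, for illustration only.
    example = (
        "IP\tIPname\tINPUT\tINPUTname\tgroup\n"
        "ko_ip.bam\tKO_1\tko_input.bam\tKO_input_1\tKO\n"
        "wt_ip.bam\tWT_1\twt_input.bam\tWT_input_1\tWT\n"
    )
    for row in read_targets(io.StringIO(example)):
        print(row["IPname"], "vs", row["INPUTname"], "in group", row["group"])
```

For a real file, pass an open file handle instead, e.g. open("targets.txt", newline="").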

If differential binding analysis is selected, the following files are additionally required:

contrasts_diffbind.txt: indicate the intended group comparisons for differential binding analysis, e.g. KOvsWT=(KO-WT) if targets.txt contains the groups KO and WT. Give one contrast per line.
targets_diffbind.txt:
- SampleID: IP sample name (as IPname in targets.txt)
- Condition: variable for sample grouping (as group in targets.txt)
- Replicate: number of replicate
- bamReads: bam file name of IP sample (as IP in targets.txt but with path relative to project directory)
- ControlID: input sample name (as INPUTname in targets.txt)
- bamControl: bam file name of corresponding input control sample (as INPUT in targets.txt but with path relative to project directory)
- Peaks: peak file name obtained from the peak caller (path relative to project directory)
- PeakCaller: name of peak caller (e.g. macs)
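The contrast notation used in contrasts_diffbind.txt (name=(groupA-groupB), one per line) can be checked against the groups defined in targets.txt. The Python sketch below is illustrative only: parse_contrasts is a hypothetical helper, and it assumes that contrast and group names contain no whitespace, parentheses or hyphens, which the pipeline itself may not require.

```python
import re

# Matches e.g. "KOvsWT=(KO-WT)": a contrast name, then two group names.
# Assumption: group names contain no whitespace, parentheses or hyphens.
CONTRAST_RE = re.compile(
    r"^\s*(?P<name>[^=\s]+)\s*=\s*"
    r"\(\s*(?P<first>[^()\s-]+)\s*-\s*(?P<second>[^()\s-]+)\s*\)\s*$"
)


def parse_contrasts(lines, known_groups):
    """Parse contrast lines into (name, first_group, second_group) tuples.

    Blank lines are skipped. Raises ValueError for a malformed line or
    for a group that is not in known_groups.
    """
    contrasts = []
    for line in lines:
        if not line.strip():
            continue
        match = CONTRAST_RE.match(line)
        if match is None:
            raise ValueError(f"malformed contrast: {line.strip()!r}")
        first, second = match["first"], match["second"]
        unknown = [g for g in (first, second) if g not in known_groups]
        if unknown:
            raise ValueError(f"unknown group(s): {', '.join(unknown)}")
        contrasts.append((match["name"], first, second))
    return contrasts


if __name__ == "__main__":
    # Groups as they would appear in the 'group' column of targets.txt.
    groups = {"KO", "WT"}
    print(parse_contrasts(["KOvsWT=(KO-WT)"], groups))
```

Running such a check early catches a misspelled group name before the differential binding step is reached.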

Programs required

Bedtools
Bowtie2
deepTools
encodeChIPqc (provided by another project from imbforge)
FastQC
MACS2
MultiQC
Picard
R with packages ChIPseeker, DiffBind, GenomicAlignments, spp and genome annotation packages
Samtools
UCSC utilities

Version History

Version 1 (earliest) Created 7th Oct 2020 at 08:41 by Sergi Sayols
