Eukaryotic gene expression is controlled by cis-regulatory elements (CREs) including promoters and enhancers which are bound by transcription factors (TFs). Differential expression of TFs and their putative binding sites on CREs cause tissue and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and thus gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined ChIP-seq and RNA-seq data exist, they do not offer good usability, have limited support for large-scale data processing, and provide only minimal functionality for visual result interpretation.
We developed TF-Prioritizer, an automated java pipeline to prioritize condition-specific TFs derived from multi-modal data. TF-Prioritizer creates an interactive, feature-rich, and user-friendly web report of its results. To showcase the potential of TF-Prioritizer, we identified known active TFs (e.g., Stat5, Elf5, Nfib, Esr1), their target genes (e.g., milk proteins and cell-cycle genes), and newly classified lactating mammary gland TFs (e.g., Creb1, Arnt).
TF-Prioritizer accepts ChIP-seq and RNA-seq data, as input and suggests TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.
General overview of the TF-Prioritizer pipeline. TF-Prioritizer uses peaks from ChIP-seq and gene counts from RNA-seq. It then (1) calculates TF binding site affinities, (2) links candidate regions to potential target genes, (3) performs machine learning to find relationships between TFs and their target genes, (4) calculates background and TF distributions, (5) picks TFs which significantly differ from the background using the Mann-Whitney U test [18] and a comparison between the mean and the median of the background and TF distribution, (6) searches for tissue-specific TF ChIP-seq evaluation data in ChIP-ATLAS [17], (7) creates screenshots using the Integrative Genomics Viewer from predicted regions of interest [19–21], and (8) creates a feature-rich web application for researchers to share and evaluate their results.
TF-Prioritizer uses nf-core ChIP-seq and nf-core RNA-seq preprocessed data as input files (see GitHub repository for detailed formatting instructions). More specifically, broad peaks and gene counts. (1) Once started, the pipeline downloads necessary data (gene lengths, gene symbols, and short descriptions of the genes) from bioMart [22]. (2) The user can then decide to use a transcript per million (TPM) filter or a gene count filter to filter before DESeq2 usage. We also allow for batch correction in DESeq2. TF-Prioritizer uses a TPM filter of 1 as default. DESeq2 normalizes and calculates the log 2-fold change (log2fc) from raw gene count data [23]. In parallel, (3) TF-Prioritizer preprocesses the ChIP-seq broad peaks by filtering blacklisted regions [24]. We recommend using the sample combination option to combine similar broad peak samples into one peak file, as the total runtime of the pipeline is reduced significantly without losing the quality of the data. (4) Optionally, the user can decide to use TGene to predict links between target genes and regulatory elements combining distance and histone/expression correlation [25]. If the TGene option is not activated (deactivated per default), TEPIC, executed in the next step of the pipeline, uses a window-based approach (default: 50,000 bps) to link regulatory elements to target genes [26,27]. (5) TEPIC uses TRAP, an approach that quantifies transcription factor affinity scores based on a biophysical model for regulatory regions [28]. TEPIC “computes TF affinities and uses open-chromatin/HM signal intensity as quantitative measures of TF binding strength”. TEPIC uses “machine learning to find low-affinity binding sites to improve the ability to explain gene expression variability compared to the standard presence/absence classification of binding sites” [26]. The default TEPIC framework searches for footprints on top of broad peaks. As we intend to use histone modification ChIP-seq data we extended the TEPIC framework so that it can search for transcription factor binding sites (TFBS) between two peaks that have close (~500 bps) genomic positions. (6) The pipeline then executes DYNAMITE an approach that uses a “sparse logistic regression classifier to infer TFs related to gene expression changes between samples” [27]. (7) We added a distribution analysis to the pipeline to further prioritize TFs depending on their distribution compared to the global distribution using (8) a Mann-Whitney U test [11] and the comparison of the means and the medians (see Section Distribution Analysis). (9) We then use a discounted cumulative gain approach to retrieve a global list (overall histone modification data provided) of all prioritized TFs (see Section Discounted Cumulative Gain). (10) In the following, TF-Prioritizer generates condition-specific and histone-modification-specific heatmaps for prioritized TFs and their predicted target genes (we visualize the target genes that have the highest affinity value according to the biophysical model of TRAP). (11) We then automatically check if we can find publicly available tissue-specific TF ChIP-seq data from ChIP-Atlas [12] and (12) download the files. (13) Afterward, we take screenshots using the IGV [13–15]. (14) In the last step, we conclude all analysis and plots in the form of a user-friendly and feature-rich web application that could also be used as a webpage (see Suppl. Fig. 1 for a visualization of the technical workflow).
13 | Holland CH, Tanevski J, Perales-Patón J, Gleixner J, Kumar MP, Mereu E, et al. Robustness and applicability of transcription factor and pathway analysis tools on single-cell RNA-seq data. Genome Biol. 2020;21: 36. |
14 | Ferreira SS, Hotta CT, de Carli Poelking VG, Leite DCC, Buckeridge MS, Loureiro ME, et al. Co-expression network analysis reveals transcription factors associated to cell wall biosynthesis in sugarcane. Plant Molecular Biology. 2016. pp. 15–35. doi:10.1007/s11103-016-0434-2 |
15 | Mason MJ, Fan G, Plath K, Zhou Q, Horvath S. Signed weighted gene co-expression network analysis of transcriptional regulation in murine embryonic stem cells. BMC Genomics. 2009;10: 327. |
18 | Mann HB, Whitney DR. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics. 1947. pp. 50–60. doi:10.1214/aoms/1177730491 |
19 | Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G, et al. Integrative genomics viewer. Nature Biotechnology. 2011. pp. 24–26. doi:10.1038/nbt.1754 |
20 | Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14: 178–192. |
21 | Robinson JT, Thorvaldsdóttir H, Wenger AM, Zehir A, Mesirov JP. Variant Review with the Integrative Genomics Viewer. Cancer Res. 2017;77: e31–e34. |
22 | Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9: R137. |
23 | Schmidt F, Gasparoni N, Gasparoni G, Gianmoena K, Cadenas C, Polansky JK, et al. Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction. Nucleic Acids Res. 2017;45: 54–66. |
24 | Schmidt F, Kern F, Ebert P, Baumgarten N, Schulz MH. TEPIC 2—an extended framework for transcription factor binding prediction and integrative epigenomic analysis. Bioinformatics. 2018;35: 1608–1609. |
25 | Roider HG, Kanhere A, Manke T, Vingron M. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics. 2007;23: 134–141. |
26 | Lee HK, Willi M, Kuhns T, Liu C, Hennighausen L. Redundant and non-redundant cytokine-activated enhancers control Csn1s2b expression in the lactating mouse mammary gland. Nat Commun. 2021;12: 2239. |
27 | Patel H, Ewels P, Peltzer A, Hammarén R, Botvinnik O, Sturm G, et al. nf-core/rnaseq: nf-core/rnaseq v3.6 - Platinum Platypus. 2022. doi:10.5281/zenodo.6327553 |
28 | Patel H, Wang C, Ewels P, Silva TC, Peltzer A, Behrens D, et al. nf-core/chipseq: nf-core/chipseq v1.2.2 - Rusty Mole. 2021. doi:10.5281/zenodo.4711243 |