RNA-seq Scientific Workflow

Workflow for RNA sequencing using the Parallel Scripting Library - Parsl.

Reference: Cruz, L., Coelho, M., Terra, R., Carvalho, D., Gadelha, L., Osthoff, C., & Ocaña, K. (2021). Workflows Científicos de RNA-Seq em Ambientes Distribuídos de Alto Desempenho: Otimização de Desempenho e Análises de Dados de Expressão Diferencial de Genes. In Anais do XV Brazilian e-Science Workshop, p. 57-64. Porto Alegre: SBC. DOI: https://doi.org/10.5753/bresci.2021.15789

Requirements

To use the RNA-seq workflow, the following tools must be available:

Bowtie2

You can install Bowtie2 from the binary release archive below or, on yum-based systems, through the package manager:

bowtie2-2.3.5.1-linux-x86_64.zip

sudo yum install bowtie2-2.3.5-linux-x86_64

Samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data.

Picard

Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.

HTSeq

HTSeq is a native Python library that follows the conventions of many Python packages. You can install it by running:

pip install HTSeq

HTSeq uses NumPy, Pysam and matplotlib. Make sure these tools are installed.
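As a quick check before running the workflow, the following minimal sketch reports which of these Python modules cannot be found in the active environment. The import names in REQUIRED_MODULES are assumptions based on the package names above, and missing_modules is a hypothetical helper, not part of the workflow.

```python
import importlib.util

# Import names assumed from the packages listed above; adjust if yours differ.
REQUIRED_MODULES = ["HTSeq", "numpy", "pysam", "matplotlib"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing Python modules: " + ", ".join(missing))
else:
    print("All required Python modules were found.")
```

The check only looks the modules up and does not import them, so it stays fast even when the packages are large.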

To use the DESeq2 script, make sure the R language is also installed. You can install it by running:

sudo apt install r-base

Parsl - Parallel Scripting Library

The recommended way to install Parsl is the approach suggested in Parsl's documentation:

python3 -m pip install parsl

Python (version >= 3.5)

To use Parsl, you need Python 3.5 or above. HTSeq also runs on Python, so load a single Python version and use it for both.
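The requirements above can be verified with a short script. This is only a sketch: the executable names in REQUIRED_TOOLS are assumptions (Picard is often distributed as a .jar file, so java is checked in its place), and both helper functions are hypothetical.

```python
import shutil
import sys

# Executable names are assumptions; adjust them to match your installation.
# Picard is often shipped as a .jar file, so "java" is checked instead.
REQUIRED_TOOLS = ["bowtie2", "samtools", "htseq-count", "Rscript", "java"]


def missing_tools(tools):
    """Return the executables that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_supported(version_info=sys.version_info, minimum=(3, 5)):
    """Return True when the interpreter is at least the minimum version."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print("Python 3.5 or above is required.")
for tool in missing_tools(REQUIRED_TOOLS):
    print("Not found on PATH: {}".format(tool))
```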

Workflow invocation

First, create a Comma Separated Values (CSV) file. On the first line, type: sampleName,fileName,condition. There must be no spaces around the commas that separate the items. You can use the file "table.csv" in this repository as an example. Shown as a table, your CSV file will look like this:

sampleName	fileName	condition
tissue control 1	SRR5445794.merge.count	control
tissue control 2	SRR5445795.merge.count	control
tissue control 3	SRR5445796.merge.count	control
tissue wntup 1	SRR5445797.merge.count	wntup
tissue wntup 2	SRR5445798.merge.count	wntup
tissue wntup 3	SRR5445799.merge.count	wntup
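The table above can also be generated with Python's csv module, which guarantees that no stray spaces end up around the commas. This is a minimal sketch: sample_table_text is a hypothetical helper, and the rows simply repeat the example values shown above.

```python
import csv
import io

HEADER = ["sampleName", "fileName", "condition"]

# Rows copied from the example table; replace them with your own samples.
ROWS = [
    ("tissue control 1", "SRR5445794.merge.count", "control"),
    ("tissue control 2", "SRR5445795.merge.count", "control"),
    ("tissue control 3", "SRR5445796.merge.count", "control"),
    ("tissue wntup 1", "SRR5445797.merge.count", "wntup"),
    ("tissue wntup 2", "SRR5445798.merge.count", "wntup"),
    ("tissue wntup 3", "SRR5445799.merge.count", "wntup"),
]


def sample_table_text(rows):
    """Return the sample table as CSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


print(sample_table_text(ROWS), end="")
```

To save the result, write the returned text to a file of your choice.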

After the script's name, the command line arguments passed to the Python script must be, in this order:

The indexed genome;
The number of threads, which is used for the bowtie and sort tasks, as the number of split files in the split_picard task, and as the number of CPUs in the htseq task;
The path to the FASTQ reads, which is the directory of the input files;
The name of the directory where the output files must be placed;
The GTF file;
and, lastly, the DESeq script.
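The argument order above can be illustrated with argparse. This is a sketch only: the source does not show how rna-seq.py actually reads its arguments, and the argument names (index, threads, input_dir, output_dir, gtf, deseq_script) are hypothetical labels chosen for this example.

```python
import argparse


def build_parser():
    """Build a parser for the six positional arguments, in order."""
    parser = argparse.ArgumentParser(description="RNA-seq workflow arguments")
    parser.add_argument("index", help="indexed genome (Bowtie2 index prefix)")
    parser.add_argument("threads", type=int, help="threads, split files and CPUs")
    parser.add_argument("input_dir", help="directory with the FASTQ input files")
    parser.add_argument("output_dir", help="directory for the output files")
    parser.add_argument("gtf", help="GTF annotation file")
    parser.add_argument("deseq_script", help="DESeq R script")
    return parser


# Parse an explicit list here; a real script would call parse_args() with no
# arguments so that sys.argv is used.
args = build_parser().parse_args([
    "../mm9/mm9", "24", "../inputs/", "../outputs",
    "../Mus_musculus.NCBIM37.67.gtf", "../DESeq.R",
])
print(args.index, args.threads, args.output_dir)
```

Declaring threads with type=int makes argparse reject a non-numeric value before any task starts.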

Make sure all the files needed to run the workflow are in the same directory, with the FASTQ files in a dedicated folder that serves as the input directory. The command line will look like this:

python3 rna-seq.py ../mm9/mm9 24 ../inputs/ ../outputs ../Mus_musculus.NCBIM37.67.gtf ../DESeq.R

Remember to adjust the multithreading and multicore parameter to your computational environment. For example, if your machine has 8 cores, set the parameter to 8.
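One way to pick this value is to ask Python how many CPUs the machine reports. The sketch below assumes the parameter is the second argument of the command line shown above; build_command is a hypothetical helper that only assembles the argument list and does not run anything. Note that os.cpu_count() reports logical CPUs, which may be more than the number of physical cores.

```python
import os


def build_command(index, input_dir, output_dir, gtf, deseq_script, threads=None):
    """Assemble the workflow command line as a list of strings."""
    if threads is None:
        # Fall back to 1 when the CPU count cannot be determined.
        threads = os.cpu_count() or 1
    return [
        "python3", "rna-seq.py", index, str(threads),
        input_dir, output_dir, gtf, deseq_script,
    ]


command = build_command(
    "../mm9/mm9", "../inputs/", "../outputs",
    "../Mus_musculus.NCBIM37.67.gtf", "../DESeq.R",
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run if you want to launch the workflow from another script.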