RNA-seq Scientific Workflow
Workflow for RNA sequencing using the Parallel Scripting Library - Parsl.
Reference: Cruz, L., Coelho, M., Terra, R., Carvalho, D., Gadelha, L., Osthoff, C., & Ocaña, K. (2021). Workflows Científicos de RNA-Seq em Ambientes Distribuídos de Alto Desempenho: Otimização de Desempenho e Análises de Dados de Expressão Diferencial de Genes. In Anais do XV Brazilian e-Science Workshop, p. 57-64. Porto Alegre: SBC. DOI: https://doi.org/10.5753/bresci.2021.15789
To use the RNA-seq workflow, the following tools must be available: Bowtie2, Samtools, Picard, HTSeq, R (for the DESeq2 script), and Parsl.
You can install Bowtie2 by running:
sudo yum install bowtie2-2.3.5-linux-x86_64
Samtools is a suite of programs for interacting with high-throughput sequencing data.
Picard is a set of Java command line tools for manipulating high-throughput sequencing (HTS) data and formats.
HTSeq is a native Python library that follows the conventions of many Python packages. You can install it by running:
pip install HTSeq
To use the DESeq2 script, make sure the R language is also installed. You can install it by running:
sudo apt install r-base
The recommended way to install Parsl is the approach suggested in Parsl's documentation:
python3 -m pip install parsl
To use Parsl, you need Python 3.5 or above. HTSeq also runs on Python, so load a single Python version and use it for both.
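Before running the workflow, it can help to confirm that the tools listed above are reachable. The following is a minimal sketch, not part of the workflow itself. The executable names (`bowtie2`, `samtools`, `Rscript`) are assumptions about how the tools appear on your PATH. Picard is left out because it is usually distributed as a Java archive whose location depends on the installation.

```python
import importlib.util
import shutil
import sys

# Assumed executable names; adjust them if your installation differs.
REQUIRED_EXECUTABLES = ["bowtie2", "samtools", "Rscript"]
# Python packages named in this README.
REQUIRED_MODULES = ["HTSeq", "parsl"]
# Minimum Python version stated in this README.
MIN_PYTHON = (3, 5)


def missing_requirements(executables=REQUIRED_EXECUTABLES, modules=REQUIRED_MODULES):
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        problems.append("Python %d.%d or above is required" % MIN_PYTHON)
    for exe in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(exe) is None:
            problems.append("executable not found on PATH: " + exe)
    for mod in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(mod) is None:
            problems.append("Python module not importable: " + mod)
    return problems


for problem in missing_requirements():
    print(problem)
```

If the script prints nothing, all checked requirements were found in the Python environment that ran it.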
First, create a comma-separated values (CSV) file. Its first line must be the header:
sampleName,fileName,condition. There must be no spaces between items. You can use the file "table.csv" in this repository as an example. Your CSV file will look like this:
tissue control 1
tissue control 2
tissue control 3
tissue wntup 1
tissue wntup 2
tissue wntup 3
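The format rules above (a fixed header, three fields per row, no spaces around the commas) can be checked before the workflow starts. The sketch below is one possible check, not code from the repository. The file names and condition labels in the example rows are hypothetical placeholders; the real values are in "table.csv".

```python
import csv
import io

EXPECTED_HEADER = ["sampleName", "fileName", "condition"]


def read_sample_table(handle):
    """Parse the sample table and return its rows as a list of dicts."""
    reader = csv.reader(handle)
    header = next(reader, None)
    # A header with stray spaces or a different column order is rejected here.
    if header != EXPECTED_HEADER:
        raise ValueError("first line must be exactly: " + ",".join(EXPECTED_HEADER))
    rows = []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(EXPECTED_HEADER):
            raise ValueError("line %d: expected 3 fields" % line_number)
        if any(field != field.strip() for field in fields):
            raise ValueError("line %d: remove spaces around the commas" % line_number)
        rows.append(dict(zip(EXPECTED_HEADER, fields)))
    return rows


# Hypothetical example: file names and condition labels are placeholders.
example = io.StringIO(
    "sampleName,fileName,condition\n"
    "tissue control 1,control_1.counts,control\n"
    "tissue wntup 1,wntup_1.counts,wntup\n"
)
rows = read_sample_table(example)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `read_sample_table(open("table.csv", newline=""))`.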
The command-line arguments passed to the Python script, after the script's name, must be, in this order:
- The indexed genome;
- The number of threads, which is used for the bowtie and sort tasks, as the number of split files in the split_picard task, and as the number of CPUs in the htseq task;
- The path to the directory containing the input FASTQ files;
- The name of the directory where the output files will be placed;
- The GTF file;
- and, lastly, the DESeq script.
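The argument order above can be illustrated with a small parser. This is a sketch using Python's standard `argparse` module. It is not how `rna-seq.py` necessarily reads its arguments, and the argument names (`index`, `threads`, `input_dir`, `output_dir`, `gtf`, `deseq_script`) are hypothetical labels chosen for this example.

```python
import argparse


def build_parser():
    """Build a parser for the six positional arguments, in the documented order."""
    parser = argparse.ArgumentParser(description="RNA-seq workflow arguments (sketch)")
    parser.add_argument("index", help="indexed genome")
    parser.add_argument("threads", type=int,
                        help="threads for bowtie and sort, split count, and htseq CPUs")
    parser.add_argument("input_dir", help="directory containing the input read files")
    parser.add_argument("output_dir", help="directory where output files are placed")
    parser.add_argument("gtf", help="GTF annotation file")
    parser.add_argument("deseq_script", help="DESeq R script")
    return parser


# Example values taken from this README's sample invocation.
args = build_parser().parse_args([
    "../mm9/mm9", "24", "../inputs/", "../outputs",
    "../Mus_musculus.NCBIM37.67.gtf", "../DESeq.R",
])
print(args.index, args.threads, args.output_dir)
```

Declaring `threads` with `type=int` makes the parser reject a non-numeric value early, before any task is launched.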
Make sure all the files needed to run the workflow are in the same directory, with the FASTQ files in a dedicated folder that serves as the input directory. The command line will look like this:
python3 rna-seq.py ../mm9/mm9 24 ../inputs/ ../outputs ../Mus_musculus.NCBIM37.67.gtf ../DESeq.R
Remember to adjust the multithreading and multicore parameter to your computational environment. For example, if your machine has 8 cores, set the parameter to 8.
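One way to avoid requesting more threads than the machine offers is to cap the value at the detected core count. The sketch below shows that idea. The helper names `choose_threads` and `bowtie2_command`, the single-end read layout, and the file name `sample.fastq` are assumptions made for illustration, not details of the workflow's own code. The Bowtie2 options used (`-p`, `-x`, `-U`, `-S`) are standard Bowtie2 flags.

```python
import os


def choose_threads(requested, available=None):
    """Cap the requested thread count at the number of cores on this machine."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    if requested < 1:
        raise ValueError("thread count must be at least 1")
    return min(requested, available)


def bowtie2_command(index, reads, sam_out, threads):
    """Build a single-end Bowtie2 call; -p sets the number of alignment threads."""
    return ["bowtie2", "-p", str(threads), "-x", index, "-U", reads, "-S", sam_out]


# Asking for 24 threads on an 8-core machine yields 8.
threads = choose_threads(24, available=8)
# Hypothetical input and output file names.
command = bowtie2_command("../mm9/mm9", "../inputs/sample.fastq",
                          "../outputs/sample.sam", threads)
print(" ".join(command))
```

Omitting the `available` argument lets the function read the core count of the current machine through `os.cpu_count()`.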
Created: 6th Dec 2022 at 19:17