ARA (Automated Record Analysis)

ARA (Automated Record Analysis): an automatic pipeline for exploring SRA datasets using sequences as a query

Requirements

Docker
- Please check out the Docker installation guide.
  
  or
Mamba package manager
- Please check out the official mamba or micromamba installation guide.
- We prefer mamba over conda since it is faster and uses libsolv to resolve dependencies efficiently.
- conda can still be used to install the pipeline using the same commands as described in the installation section.
  Note: It is important to include the 'bioconda' channel in addition to the other channels as indicated in the official manual. Use the following commands in the given order to configure the channels (one-time setup).
```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
```

Installation

The user can install the pipeline with either Docker or Mamba by following the steps below.

First, click the green "Code" button, then select "Download ZIP" to download the contents of this repository. Once the download is complete, extract the zip file into the desired location before starting the setup. Please use the commands shown below to begin installing the pipeline.

Alternatively, the GitHub repo can be cloned using the options shown after clicking the "Code" button. After cloning, navigate into the folder with the cd ARA/ command before starting the setup.

Warning: Before starting any analysis with the pipeline, please make sure that the system has enough disk space available for the data you wish to retrieve and process from the SRA repository.
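The free space can be checked before a run. The snippet below is a minimal sketch, not part of ARA: the helper name has_free_space and the 2 GB threshold are illustrative assumptions, and the right threshold depends on the runs and the data fraction you plan to analyse.

```shell
# Hypothetical helper (not part of ARA): succeed only if DIR has at least
# MIN_KB kilobytes available. Relies on the POSIX output format of `df -Pk`.
has_free_space() {
    dir=$1
    min_kb=$2
    # Column 4 on the second line of `df -Pk` is the available space in KB.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$min_kb" ]
}

# Example: require roughly 2 GB (2 * 1024 * 1024 KB) in the current directory.
if has_free_space . 2097152; then
    echo "Enough free space to continue."
else
    echo "Less than 2 GB free; free some space or analyse a smaller fraction."
fi
```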

Using Docker
```
cd ARA-main/
docker build -t ara_img .
```

Using Mamba
```
cd ARA-main/
mamba env create --file requirements.yaml
mamba activate ara_env
perl setup.pl
```
Note: After installation, the virtual environment consumes approximately 1.5 GB of disk space. The installation was tested on "Ubuntu 20.04.4 LTS", "Ubuntu 22.04.1 LTS" and "Fedora 37" using the procedure mentioned above.

Please be patient, because downloading and configuring the tools/modules may take several minutes. Warning messages that appear during the installation of certain Perl modules can be ignored.

Optional: The user can also add the current directory to PATH for ease of use. Run chmod +x ara.pl followed by export PATH="$(pwd):$PATH". Alternatively, the user is free to create a symbolic link, copy the executable to /bin/, or use any other method depending on their operating system.

Refer to the 'Troubleshooting' section in case of any installation-related issues.

Example usage

Docker

docker run -it ara_img /home/ARA-main/ara.pl --input /home/ARA-main/example/SraRunInfo.csv --sequences /home/ARA-main/example/Arabidopsis_thaliana.TAIR10.ncrna.fa

Mamba environment

perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa

To get full usage info: perl ara.pl --help

Note: The user can delete the contents of the results/ directory after testing the tool with the example above.

Configuration file

The configuration file conf.txt is automatically generated during installation by the setup script. It contains certain default parameters as well as the locations of the executable binaries of the tools incorporated in the pipeline.

The user can modify the default parameters in conf.txt and pass it to the pipeline as an input. For example, the data_perc option sets the percentage of the dataset selected for analysis (default: 5%). The user can provide any integer value between 1 and 100 to specify the desired percentage of the dataset to be used.
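Before editing conf.txt, a candidate data_perc value can be checked against the allowed range. This is a minimal sketch under the assumption that only whole numbers from 1 to 100 are accepted, as stated above. The function name is_valid_data_perc is hypothetical and is not something the pipeline provides.

```shell
# Hypothetical check (not part of ARA): accept only integers from 1 to 100.
is_valid_data_perc() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}

# Try a few candidate values.
for value in 5 100 0 101 2.5; do
    if is_valid_data_perc "$value"; then
        echo "$value: ok"
    else
        echo "$value: rejected"
    fi
done
```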

Similarly, the user can choose between blastn and bowtie2 by setting the 'execute' flag of each tool to 0 (disabled) or 1 (enabled) in the configuration file, while leaving the rest of the parameters at their default values. By default, both tools are enabled, i.e. execute = 1.

The read_drop_perc_cutoff parameter in conf.txt is the cutoff for discarding a sample based on the percentage of reads dropped by Trimmomatic. By default, if more than 70% of the reads are dropped according to the Trimmomatic log, the sample fails the quality criteria and is not processed downstream. Please refer to the Trimmomatic documentation for more details about the parameters present in the config file.
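The read-drop rule can be illustrated with a small calculation. The sketch below is not the pipeline's own code: the function name exceeds_read_drop_cutoff is hypothetical, and the read counts are assumed to have been taken from the Trimmomatic log by hand.

```shell
# Hypothetical helper (not ARA's actual code): exit 0 when the percentage of
# dropped reads is above the cutoff, meaning the sample fails the criteria.
exceeds_read_drop_cutoff() {
    input_reads=$1
    dropped_reads=$2
    cutoff=${3:-70}   # default mirrors the 70% mentioned above
    awk -v n="$input_reads" -v d="$dropped_reads" -v c="$cutoff" \
        'BEGIN { exit !(n > 0 && 100 * d / n > c) }'
}

# 750 of 1000 reads dropped is 75%, which is above the 70% cutoff.
if exceeds_read_drop_cutoff 1000 750; then
    echo "sample fails the quality criteria"
fi
```

A sample with exactly 70% of its reads dropped is not rejected in this sketch, since the rule is stated as "more than 70%".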

Similarly, the minimum alignment rate is set by the alignment perc cutoff parameter under blastn and bowtie2 in the conf.txt configuration file (if the total alignment percentage is less than the threshold, the pipeline reports that the sample failed the quality criteria). More details about the parameters used in the conf.txt file can be found in the respective documentation of Blastn and Bowtie2.

By default, the pipeline uses a pre-built Kraken2 viral genomic database (release: 9/8/2022) from . Users can provide their own database by changing the kraken2_db_path parameter in the conf.txt file.

Note: If the user wishes to use a different installation than Bioconda, the user can manually install the required tools and specify the absolute path of the executable binaries in the configuration.

Pipeline parameters

--input (mandatory) The user can provide input in any one of the following ways:
- A single SRA run accession. e.g. perl ara.pl --input SRR12548227 --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
- A list of run accessions in a text file (one run accession per line). e.g. perl ara.pl --input example/list.txt --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
- The SRA RunInfo file exported directly from the NCBI SRA web portal. Go to the SRA homepage, search for the desired keyword, and export SraRunInfo.csv by clicking 'Send to' => File => RunInfo. e.g. perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
--sequences (mandatory) A fasta file containing the query sequences.
--output (optional) The output directory in which to store the results. By default, the output is stored in the results/ directory of the package. e.g. perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/
--mode (optional) Choose one of the three modes to run the pipeline.
- screen is the default mode. It downloads only a fraction of the dataset per SRA run accession and analyses it as per the given configuration.
- full downloads the complete fastq file per SRA run accession and runs the pipeline on it.
- both first screens a fraction of each dataset and then, for samples that meet the minimum alignment cutoff from either 'bowtie2' or 'blastn', automatically downloads the entire fastq file and performs the alignment. e.g. perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/ --mode screen
  
  Note: There is also a supporting summary mode that generates a unified alignment summary from the output files created by screen mode or full mode. Use it only to recreate the summary stats from pre-existing results: enter --mode summary along with the previously used command parameters.
--config (optional) Pipeline configuration. By default, the pipeline uses the conf.txt generated by the setup script. e.g. perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/ --mode screen --config conf.txt
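
A list file for --input can be prepared with a few lines of shell. The sketch below is illustrative only: the file name my_runs.txt and the helper is_run_accession are hypothetical, and the accession pattern (SRR, ERR or DRR followed by digits) is an assumption about run accessions, not a check taken from the pipeline.

```shell
# Assumed pattern for SRA run accessions: SRR, ERR or DRR followed by digits.
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# Write a list file with one run accession per line (hypothetical file name).
list_file=my_runs.txt
: > "$list_file"
for acc in SRR8392720 SRR7289585 SRR12548227; do
    if is_run_accession "$acc"; then
        printf '%s\n' "$acc" >> "$list_file"
    else
        echo "skipping malformed accession: $acc" >&2
    fi
done

# The file can then be passed to the pipeline, e.g.:
#   perl ara.pl --input my_runs.txt --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
```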

Output structure

The pipeline creates one folder per SRA run accession and generates results using the run accession as the prefix. The analysis of a screened fraction of the data is stored in the screening_results directory, whereas the analysis conducted on the whole dataset is stored in the full_analyis_results directory.

An outline of the directory structure containing the results is shown below:

results/
`-- test/ (name derived from the input fasta sequence file)
    |-- test.screening.analysis.stats.sorted.by.alignment.txt (combined metadata and analysis report generated after processing all the SRA run accessions, sorted in decreasing order of total alignment percentage)
    |-- metadata/
    |   |-- test.metadata.txt (Combined metadata downloaded from SRA)
    |   |-- test.metadata.screened.txt (List of SRA accessions which qualify the filter criteria specified in the config.)
    |   |-- SRA_RUN.run.metadata.txt (unprocessed metadata on a single SRA accession as retrieved from NCBI)
    |-- reference/
    |   |-- blastn_db/ (folder containing the blast database created from the input fasta sequence)
    |   |-- bowtie2_index/ (folder containing the bowtie2 index created from the input fasta sequence)
    |   |-- bowtie2_index.stdout.txt (stdout captured from bowtie2 index creation)
    |   `-- makeblastdb.stdout.txt (stdout captured from blastn database creation)
    `-- screening_results/ (similar structure for screening or full mode)
        |-- SRA_RUN/ (each SRA run accession will be processed into a separate folder)
        |   |-- blastn/
        |   |   |-- SRA_RUN.blast.results.txt (output from NCBI Blastn)
        |   |   `-- blast.stats.txt (blastn overall alignment stats)
        |   |-- bowtie2/
        |   |   |-- SRA_RUN.bam (output from bowtie2)
        |   |   |-- alignment.stats.txt (bowtie2 stdout)
        |   |   `-- alignment.txt (bowtie2 overall alignment summary)
        |   |-- fastQC/
        |   |   |-- 
        |   |   |-- 
        |   |-- kraken2/
        |   |   |-- SRA_RUN.kraken (kraken2 standard classification table)
        |   |   |-- SRA_RUN.report (kraken2 classification report)
        |   |   `-- SRA_RUN.stdout.txt (kraken2 stdout)
        |   |-- raw_fastq/
        |   |   |-- 
        |   |   |-- fastq_dump.stdout.txt
        |   |   |-- sra/
        |   |   `-- wget.full.sra.stdout.txt
        |   `-- trimmed_data/
        |       |-- 
        |       `-- SRA_RUN_trim_stdout_log.txt (trimmomatic stdout)
        `-- runlog.SRA_RUN.txt (Complete run log of the pipeline per SRA run accession)
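
When scripting around the results, the per-run summary files can be located from the outline above. This is a minimal sketch: the helper run_summary_paths is hypothetical, and the folder name (test here) must be supplied by the user, since this section does not spell out how it is derived from the fasta file.

```shell
# Hypothetical helper: print where the per-run bowtie2 and blastn summaries
# are expected, following the directory outline above (screening mode).
run_summary_paths() {
    results_dir=$1   # e.g. results
    name=$2          # folder derived from the fasta file, e.g. test
    run=$3           # SRA run accession
    base="$results_dir/$name/screening_results/$run"
    printf '%s\n' "$base/bowtie2/alignment.txt" "$base/blastn/blast.stats.txt"
}

run_summary_paths results test SRR12548227
```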

For a thorough understanding of the results of the third-party tools, take a look at the following documentations:

Disk usage using the input from the example

The table below provides a summary of the disk usage for different analyses conducted on varying dataset sizes. It demonstrates how disk usage can increase depending on the choice of the fraction of the dataset the user wishes to analyze.

| Run accession | 100% of dataset | 5% of dataset | 10% of dataset |
| --- | --- | --- | --- |
| SRR8392720 | 1.3G | 85M | 156M |
| SRR7289585 | 1.4G | 150M | 288M |
| SRR12548227 | 15M | 9.0M | 9.1M |

Sizes are given in megabytes (M) or gigabytes (G).
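
To estimate the total space needed for a batch of runs, the per-run figures can be added up. The sketch below sums the values from the table. The helper sum_sizes_mb is hypothetical, and it assumes that 1G equals 1024M, as in du -h output.

```shell
# Sum human-readable sizes (M or G suffix) and print the total in megabytes.
# Assumes 1G = 1024M, as in `du -h` output.
sum_sizes_mb() {
    printf '%s\n' "$@" | awk '
        /G$/ { total += substr($0, 1, length($0) - 1) * 1024; next }
        /M$/ { total += substr($0, 1, length($0) - 1); next }
        END  { printf "%.1fM\n", total }'
}

sum_sizes_mb 85M 150M 9.0M      # 5% of each dataset
sum_sizes_mb 156M 288M 9.1M     # 10% of each dataset
sum_sizes_mb 1.3G 1.4G 15M      # full datasets
```

For the three example runs this gives about 244M at 5%, 453M at 10%, and roughly 2.7G for the full datasets.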

Troubleshooting

Errors related to mamba/conda environment:

Since mamba is a drop-in replacement and uses the same commands and configuration options as conda, it's possible to swap almost all commands between conda & mamba.

Use the conda list command to verify that the packages listed in requirements.yaml were successfully installed into your environment.

Note: The requirements.yaml provided in this package was exported from mamba 0.25.0 installation running on Ubuntu 20.04.4 LTS.

In case of a missing tool or conflicting dependencies in the environment, the user can try the conda search or mamba repoquery search command to find a supported version of the tool, and then install it manually with conda install or mamba install inside the environment. Please refer to the official troubleshooting guide for further help.

Note: On macOS and Linux, the supported tools and their dependencies aren't always the same. Even when all of the requirements are completely aligned, the set of available versions isn't necessarily the same. Users may try setting up the environment using any of the supplementary requirements-*.txt files provided in the src/main/resources/ directory.

Error installing Perl modules:

Users must ensure that they have write permission to the /Users/*/.cpan/ or similar directory, and that CPAN is properly configured.

You might need to define the PERLLIB/PERL5LIB environment variable if you see an error similar to the following:
```
    Can't locate My/Module.pm in @INC (@INC contains:
    ...
    ...
    .).
    BEGIN failed--compilation aborted.
```
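
One way to define the variable is sketched below. The directory is an assumption: ~/perl5/lib/perl5 is the default location used by local::lib, and it should be replaced with the directory that actually contains the missing module on your system.

```shell
# Assumed location: ~/perl5/lib/perl5 is the default used by local::lib;
# replace it with the directory that actually contains the missing module.
module_dir="$HOME/perl5/lib/perl5"

# Prepend the directory, keeping any existing PERL5LIB entries.
export PERL5LIB="$module_dir${PERL5LIB:+:$PERL5LIB}"

# Show the resulting search path entries, one per line.
printf '%s\n' "$PERL5LIB" | tr ':' '\n'
```

The export only lasts for the current shell session. Adding the line to a shell startup file makes it persistent.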
Note about MAKE: 'make' is an essential tool for building Perl modules. Please make sure that you have 'make' installed in your system. The setup script provided in this package utilizes 'cpan' to build the required Perl modules automatically.

If the automatic setup provided in the package fails to install the required dependencies, you may need to install them manually using the cpan install command or by searching for the package on MetaCPAN.

Additionally, some Perl modules can also be installed through mamba (e.g. a compatible version of the Perl module Config::Simple can be found with mamba repoquery search perl-config-simple).

List of Perl modules and tools incorporated in the pipeline

Perl modules:
- Config::Simple
- Parallel::ForkManager
- Log::Log4perl
- Getopt::Long
- Text::CSV
- Text::Unidecode
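
Whether the Perl modules listed above are available to the active perl can be checked with a short loop. This is a minimal sketch. The helper missing_perl_modules is hypothetical and is not part of the setup script.

```shell
# Hypothetical helper: print each module that the current perl cannot load.
missing_perl_modules() {
    for module in "$@"; do
        # `perl -MName -e 1` exits non-zero when the module cannot be loaded.
        perl -M"$module" -e 1 2>/dev/null || printf '%s\n' "$module"
    done
}

missing_perl_modules Config::Simple Parallel::ForkManager Log::Log4perl \
    Getopt::Long Text::CSV Text::Unidecode
```

An empty output means every module loaded. Any names printed are candidates for manual installation as described in the Troubleshooting section.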

Tools:
