deepvariant-nextflow
main @ cb940b3

Workflow Type: Nextflow
Stable

Nextflow Pipeline for DeepVariant

This repository contains a Nextflow pipeline for Google’s DeepVariant, optimised for execution on NCI Gadi.

Quickstart Guide

  1. Edit the pipeline_params.yml file to include:

    • samples: a list of samples, where each entry specifies the sample name, the BAM file path (the corresponding .bai index must be in the same directory), the path to an optional regions-of-interest BED file (set to '' if not required), and the DeepVariant model type.
    • ref: path to the reference FASTA (the corresponding .fai index must be in the same directory).
    • output_dir: directory path in which to save output files.
    • nci_project, nci_storage: the NCI project code and storage specification to use when submitting jobs.
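
  A minimal pipeline_params.yml covering the fields above might look like the following. The key names and sample-field names (name, bam, regions, model) are illustrative assumptions, as is the model type shown — check the parameter template shipped with the repository for the exact schema:

    ```yaml
    samples:
      - name: HG002
        bam: /path/to/HG002.bam      # HG002.bam.bai must be in the same directory
        regions: ''                  # optional BED file; '' disables region filtering
        model: ONT_R104              # DeepVariant model type (illustrative value)
    ref: /path/to/reference.fasta    # reference.fasta.fai must be in the same directory
    output_dir: /path/to/outputs
    nci_project: ab12                # hypothetical NCI project code
    nci_storage: scratch/ab12        # hypothetical storage specification
    ```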
  2. Update nextflow.config to match the resource requirements of each pipeline stage. On NCI Gadi, you will typically only need to adjust the walltime and disk (i.e. jobfs) parameters to suit your dataset size (the default values were tested on a dataset of ~115 GB).
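
  As a sketch of what such tuning looks like, Nextflow allows per-process resource overrides via process selectors in nextflow.config. The process name MAKE_EXAMPLES below is a hypothetical label for illustration, not necessarily one used by this pipeline:

    ```groovy
    // Hypothetical per-process tuning (the process name is illustrative).
    process {
        withName: 'MAKE_EXAMPLES' {
            time = '4h'      // walltime: increase for larger BAMs
            disk = '100.GB'  // scratch disk: maps to PBS jobfs on Gadi
        }
    }
    ```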

  3. Load the Nextflow module and run the pipeline using the following commands:

    module load nextflow/24.04.1
    nextflow run main.nf -params-file pipeline_params.yml
    

    Note: Additional Nextflow options can be supplied (e.g., -resume to continue a previously interrupted run).

  4. For each sample, output files will be stored in the directory output_dir/sample_name.

Notes

  1. It is assumed that the user has access to NCI's if89 project (required for using DeepVariant via module load). If not, simply request access using this link.

Case Study

A case study was conducted using a ~115GB BAM alignment file from a HG002 ONT whole genome sequencing (WGS) dataset to evaluate the runtime and service unit (SU) efficiency of deepvariant-nextflow compared to the original DeepVariant running on a single node. The benchmarking results are summarised in the table below.

| Version | Gadi Resources | Runtime (hh:mm:ss) | SUs |
|---|---|---|---|
| Original DeepVariant | gpuvolta (24 CPUs, 2 GPUs, 192 GB memory) | 05:07:21 | 368.82 |
| Original DeepVariant | gpuvolta (48 CPUs, 4 GPUs, 384 GB memory) | 03:18:31 | 476.44 |
| deepvariant-nextflow | normal (48 CPUs, 192 GB memory) → gpuvolta (12 CPUs, 1 GPU, 96 GB memory) → normalbw (28 CPUs, 256 GB memory) | 03:21:01 | 237.33 |
| deepvariant-nextflow | normalsr (104 CPUs, 500 GB memory) → gpuvolta (12 CPUs, 1 GPU, 96 GB memory) → normalbw (28 CPUs, 256 GB memory) | 02:04:35 | 199.00 |
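
To put the two best-case rows in perspective, the SU and runtime figures above can be compared directly (values copied from the table; this is just arithmetic on the published numbers, not part of the pipeline):

```python
# Compare the fastest deepvariant-nextflow run (normalsr-based) against the
# baseline single-node DeepVariant run (24-CPU gpuvolta), using table values.
original_sus, pipeline_sus = 368.82, 199.00
original_hours = 5 + 7 / 60 + 21 / 3600   # 05:07:21
pipeline_hours = 2 + 4 / 60 + 35 / 3600   # 02:04:35

su_saving = 1 - pipeline_sus / original_sus
speedup = original_hours / pipeline_hours
print(f"SU saving: {su_saving:.0%}, speedup: {speedup:.2f}x")
# → SU saving: 46%, speedup: 2.47x
```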

Notes

  • Negligible runtime/SU values for the DRY_RUN stage (<1 minute/<1 SU) have been excluded from the results.
  • Total queueing times, which were similar across all cases, have been omitted.

Acknowledgments

The deepvariant-nextflow workflow was developed by Dr Kisaru Liyanage and Dr Matthew Downton (National Computational Infrastructure), with support from Australian BioCommons as part of the Workflow Commons project.

We thank Leah Kemp (Garvan Institute of Medical Research) for her collaboration in providing test datasets and assisting with pipeline testing.

Version History

main @ cb940b3 (earliest) Created 5th Dec 2024 at 01:16 by Kisaru Liyanage

add LICENSE
