EBP-Nor Genome Assembly pipeline
This repository contains the EBP-Nor genome assembly pipeline. This pipeline is implemented in snakemake. This pipeline is developed to create haplotype-resolved genome assemblies from PacBio HiFi reads and HiC reads, and is primarly designed for diploid eukaryotic organisms. The pipeline is designed to work on a linux cluster with slurm as workload manager.
Requirements & Setup
Some software need to be configured/installed before the pipeline can be run
Conda setup
Most required software, including snakemake itself, can be installed using conda.
Once conda is installed, you can create a new environment containing most necessary software from the provided asm_pipeline.yaml file as follows:
conda create -n asm_pipeline --file=worfklow/envs/asm_pipeline.yaml
Other software setup
The following software need to be installed manually:
- KMC v3.1.1 (https://github.com/tbenavi1/KMC)
- HiFiAdapterFilt (https://github.com/sheinasim/HiFiAdapterFilt)
- Oatk (https://github.com/c-zhou/oatk)
- OatkDB (https://github.com/c-zhou/OatkDB)
- NCBI FCS-Adaptor (https://github.com/ncbi/fcs/wiki/FCS-adaptor)
- NCBI FCS-GX (https://github.com/ncbi/fcs/wiki/FCS-GX)
Please refer to their respective installation instructions to properly install them. You will need to privide the installation paths of these software to the config file (see Parameter section).
BUSCO database setup
As in general, computing nodes are not connected to the internet, BUSCO lineage datasets need to be downloaded manually before running the pipeline. This can easily be done by running
busco --download eukaryota
You will need to specify the folder where you downloaded the busco lineages in the config file (see Parameter section).
Data
This pipeline is created for using PacBio HiFi reads together with paired-end Hi-C data. You will need to specify the absolute paths to these files in the config file (see Parameters section).
Parameters
The necessary config files for running the pipeline can be found in the config folder.
General snakemake and cluster submission parameters are defined in config/config.yaml
,
data- and software-specfic parameters are defined in config/asm_params.yaml
.
First, define the paths of the input files you want to use:
- pacbio: path to the location of the PacBio HiFi reads (
.fastq.gz
) - hicF and hicR: path to the forward and reverse HiC reads respectively
For software not installed by conda, the installation path needs to be provided to the Snakemake pipeline by editing following parameters in the config/asm_params.yaml
:
- Set the "adapterfilt_install_dir" parameter to the installation path of HiFiAdapterFilt
- Set the "KMC_path" parameter to the installation path of KMC
- Set the "oatk_dir" parameter to the installation path of oatk
- Set the "oatk_db" parameter to the directory where you downloaded the oatk_db files
- Set the "fcs_path" parameter to the location of the
run_fcsadaptor.sh
andfcs.py
scripts - Set the "fcs_adaptor_image" and "fcs_gx_image" parameters to the paths to the
fcs-adaptor.sif
andfcs-gx.sif
files respectively - Set the "fcs_gx_db" parameter to the path of the fcs-gx database
A couple of other parameters need to be verified as well in the config/asm_params.yaml file before running the pipeline:
- The location of the input data (
input_dir
) should be set to the folder containing the input data. - The location of the downloaded busco lineages (
busco_db_dir
) should be set to the folder containing the busco lineages files downloaded earlier - The required BUSCO lineage for running the BUSCO analysis needs to set (
busco_lineage
parameter). Runbusco --list-datasets
to get an overview of all available datasets. - The required oatk lineage for running organelle genome assembly (
oatk_lineage
parameter). Check https://github.com/c-zhou/OatkDB for an overview of available lineages. - A boolean value wether the species is plant (for plastid prediction) or not (
oatk_isPlant
; set to either True or False) - The NCBI taxid of your species, required for the decontamination step (
taxid
parameter)
Usage and run modes
Before running, make sure to activate the conda environment containing the necessary software: conda activate asm_assembly
.
To run the pipeline, run the following command:
snakemake --profile config/ --configfile config/asm_params.yaml --snakefile workflow/Snakefile {run_mode}
If you invoke the snakemake command in another directory than the one containing the workflow
and config
folders,
or if the config files (config.yaml
and asm_params.yaml
) are in another location, you need to specify their correct paths on the command line.
The workflow parameters can be modified in 3 ways:
- Directly modifying the
config/asm_parameters.yaml
file - Overriding the default parameters on the command line:
--config parameter=new_value
- Overriding the default parameters using a different yaml file:
--configfile path_to_parameters.yaml
The pipeline has different runing modes, and the run mode should always be the last argument on the command line:
- "all" (default): will run the full workflow including pre-assembly (genomescope & smudgeplot), assembly, scaffolding, decontamination, and organelle assembly
- "pre_assembly": will run only the pre-assembly steps (genomescope & smudgeplot)
- "assembly": will filter the HiFi reads and assemble them using hifiasm (also using the Hi-C reads), and run busco
- "scaffolding": will run all steps necessary for scaffolding (filtering, assembly, HiC filtering, scaffolding, busco), but without pre-assembly
- "decontamination": will run assembly, scaffolding, and decontamination, but without pre-assembly and busco analyses
- "organelles": will run only organnelle genome assembly
Output
All generated output will be present in the "results" directory, which will be created in the folder from where you invoke the snakemake command. This results directory contains different subdirectories related to the different steps in the assembly:
- results/pre_assembly: genomescope and smudgeplot output (each in its own subfolder)
- results/assembly: Hifiasm assembly output and corresponding busco results
- results/scaffolding: scaffolding output, separated in two folders:
- meryl: meryl databases used for filtering HiC reads
- yahs: scaffolding output, including final scaffolds and their corresponding busco results
- results/decontamination: decontamination output of the final scaffolded assembly
- results/organelles: assembled organellar genomes
Additionally, a text file containing all software versions will be created in the specified input directory.
The log files of the different steps in the workflow can be found in the logs
directory that will be created.
Click and drag the diagram to pan, double click or use the controls to zoom.
Version History
main @ 6d5595d (earliest) Created 13th Feb 2024 at 09:44 by Bram Danneels
Update Snakefile
Frozen
main
6d5595d
Creators
Not specifiedSubmitter
Views: 876 Downloads: 204
Created: 13th Feb 2024 at 09:44
This item has not yet been tagged.
None