Petrisnake: A secondary analysis pipeline for PETRI-seq data.
This is a Snakemake pipeline for the secondary computational analysis of single-cell RNA-seq data from the PETRI-seq protocol (https://www.nature.com/articles/s41564-020-0729-6 and https://www.nature.com/articles/s41586-024-08124-2). That is, from the input FASTQ files, this workflow constructs a gene count table showing the expression of each gene in each cell.
Dependencies
This pipeline is only tested under Linux running on x86-64 architecture, although it might work on macOS and other CPU architectures if some work is put into dependency management. The workflow will complain, but still work if the system dependency fuse2fs is not installed. For running the pipeline the following dependencies are required (tested versions):
The configuration tool GUI is an R Shiny application with the following dependencies:
- R (4.5.2)
- shiny (1.12.1)
- shinyBS (0.61.1)
- purrr (1.2.0)
- yaml (2.3.12)
For convenience, the file config/environment.yml describes the conda dependencies for the configuration GUI. In order to install it, run conda env create --file config/environment.yml. This should create a conda environment named petrisnake-config with the required dependencies.
The posDemux R package (https://github.com/yaccos/posDemux) is a keystone dependency of this workflow. This package is scheduled to be released as part of Bioconductor 3.23. For the pipeline itself, the package is available as a container image on Docker Hub (https://hub.docker.com/repository/docker/yaccos/posdemux/general) and is automatically used by Snakemake in the workflow provided that the argument --software-deployment-method apptainer is used.
When running the container, you may get the warning messages:
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
These warnings can be safely ignored, but if you still want to silence them, run export LC_ALL=C in your shell before running Snakemake.
For using the cutoff selection tool, the posDemux package must be run outside the workflow. At the time of writing Bioconductor release 3.23 is not out yet and the package requires a minimum R version of 4.6.0 as per Bioconductor's policies. This makes the package trickier to install, but there are still some options:
- Install the devel version of R, which is 4.6. You may use the rig version manager (https://github.com/r-lib/rig) for this as it allows you to install multiple versions of R and switch between them as needed.
- Download the sources from GitHub and manually edit the minimum version requirements to your installed R version (everything above 4.4 should work).
- Pull the container image with posDemux with Docker and use the bundled RStudio Server instance: docker run -e PASSWORD= -p 8787:8787 yaccos/posdemux:0.99.8. Then open http://localhost:8787/ in your web browser, type rstudio as the username and `` as the password.
For the rules inside the workflow that do not use R scripts, Snakemake automatically handles the conda environments, so no manual dependency management is required. The conda environment files are located in workflow/envs.
Configuration
The pipeline reads config/config.yaml and normalizes the values through workflow/rules/handle_config.smk. Every run must define the keys listed below; optional fields default to empty strings or can be overridden per sample when needed. The bundled Shiny helper (config/config_shiny.R) generates exactly this structure.
Global keys
- prefix: Base directory prepended to any relative path (for example resources/). Leave empty when using absolute paths. This option is special as it is not overridden by the key prefix inside the sample. Instead, the global and sample-specific prefixes are concatenated. In addition, this prefix relates to all file paths, not just the FASTQ files, but also the reference genome and reference annotation.
- reference_genome: FASTA file relative to prefix unless the value is absolute. The file must exist in a writable directory so bwa can index it.
- reference_annotation: Annotation GTF/GFF file relative to prefix unless absolute.
- suffix, forward_suffix, reverse_suffix: Optional strings appended to every FASTQ path. They allow you to express shared filename parts one time (for example _001.fastq.gz).
- streaming_chunk_size: Integer, number of reads to load in each chunk while streaming FASTQ pairs during demultiplexing.
- bc_cutoff: Optional global barcode cutoff. When omitted you can still run determine_bc_cutoff; the value becomes mandatory before running all.
- feature_tag, gene_id_attribute: Required feature and identifier attribute names extracted from the reference annotation (for example Coding_or_RNA and name).
- samples: Mapping of sample identifiers to per-sample configuration dictionaries (see below). The mapping itself is mandatory.
Sample entries
- prefix: Prepended to the FASTQ file names in addition to the global prefix. Useful for subdirectories such as random20000/random20000_S1_.
- suffix, forward_suffix, reverse_suffix: Optional overrides. When omitted, the global suffixes are used.
- reference_genome, reference_annotation, feature_tag, gene_id_attribute, bc_cutoff, streaming_chunk_size: Optional overrides. Provide only what differs from the global defaults.
- lanes: Dictionary mapping lane labels to the strings found in the filenames (for example L1: L001). Lane keys must match the regex ^[^_]+$ (underscores are rejected). The workflow derives each FASTQ pair as and.
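The lane key rule can be checked ahead of time. The following is a minimal sketch that applies the regex quoted above to the keys of a lanes dictionary; the helper name invalid_lane_keys is hypothetical and is not part of the workflow.

```python
import re

# Pattern quoted in the text: one or more characters, none of them an underscore.
LANE_KEY_PATTERN = re.compile(r"^[^_]+$")


def invalid_lane_keys(lanes):
    """Return the lane labels that the pattern rejects (hypothetical helper)."""
    return [key for key in lanes if not LANE_KEY_PATTERN.fullmatch(str(key))]


print(invalid_lane_keys({"L1": "L001", "L2": "L002"}))  # []
print(invalid_lane_keys({"lane_1": "L001"}))            # ['lane_1']
```

Only the keys are tested here; the values (such as L001) are the strings found in the filenames and are not constrained by this pattern.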
The minimal viable YAML therefore looks like:
prefix: resources/
reference_genome: U00096_JE2.fa
reference_annotation: U00096_JE2_rRNA.gff
feature_tag: Coding_or_RNA
gene_id_attribute: name
streaming_chunk_size: 4000
samples:
  random20000:
    prefix: random20000/random20000_S1_
    lanes:
      L1: L001
    forward_suffix: _R1
    reverse_suffix: _R2
    suffix: _001.fastq.gz
    bc_cutoff: 3000
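To illustrate how the prefixes and suffixes could combine into file paths, here is a rough sketch. It assumes plain string concatenation in the order global prefix, sample prefix, lane string, read-specific suffix, shared suffix; that order is an assumption made for illustration, the function fastq_pair is hypothetical, and the authoritative logic lives in workflow/rules/handle_config.smk.

```python
def fastq_pair(config, sample_id, lane_label):
    """Build (forward, reverse) FASTQ paths for one lane of one sample.

    ASSUMPTION: components are concatenated in the order shown below.
    This is an illustration, not the workflow's own implementation.
    """
    sample = config["samples"][sample_id]

    def setting(key):
        # A sample-level value wins; otherwise use the global one, then "".
        return sample.get(key, config.get(key, ""))

    # The prefix is special: global and sample prefixes are concatenated.
    stem = config.get("prefix", "") + sample.get("prefix", "") + sample["lanes"][lane_label]
    forward = stem + setting("forward_suffix") + setting("suffix")
    reverse = stem + setting("reverse_suffix") + setting("suffix")
    return forward, reverse


config = {
    "prefix": "resources/",
    "samples": {
        "random20000": {
            "prefix": "random20000/random20000_S1_",
            "lanes": {"L1": "L001"},
            "forward_suffix": "_R1",
            "reverse_suffix": "_R2",
            "suffix": "_001.fastq.gz",
        }
    },
}

for path in fastq_pair(config, "random20000", "L1"):
    print(path)
# resources/random20000/random20000_S1_L001_R1_001.fastq.gz
# resources/random20000/random20000_S1_L001_R2_001.fastq.gz
```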
Each sample entry is validated at runtime. Missing required fields raise descriptive ValueErrors (for example when a genome is neither defined globally nor for the sample). The lanes dictionary may contain multiple entries when samples are split across lanes. In this case, the workflow processes the forward-read FASTQ files of a sample separately until demultiplexing, where the files are streamed sequentially within a single rule execution. Likewise, the reverse-read FASTQ files of a sample are processed separately until the results are combined at the final step of the workflow.
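The validation described above could look roughly like the following sketch. The set of keys treated as required and the wording of the error message are assumptions for illustration; the function resolve_sample is hypothetical and does not reproduce the code in workflow/rules/handle_config.smk.

```python
# ASSUMPTION: the keys checked here are illustrative, not the workflow's exact list.
REQUIRED_KEYS = ("reference_genome", "reference_annotation", "feature_tag", "gene_id_attribute")


def resolve_sample(config, sample_id):
    """Merge global defaults with sample overrides and reject missing values."""
    sample = config["samples"][sample_id]
    resolved = {}
    for key in REQUIRED_KEYS:
        # The sample-level value overrides the global one when present.
        value = sample.get(key, config.get(key))
        if value is None or value == "":
            raise ValueError(
                f"Sample '{sample_id}': '{key}' is defined neither globally nor for the sample"
            )
        resolved[key] = value
    return resolved


config = {
    "reference_annotation": "U00096_JE2_rRNA.gff",
    "feature_tag": "Coding_or_RNA",
    "gene_id_attribute": "name",
    "samples": {"random20000": {"lanes": {"L1": "L001"}}},
}

try:
    resolve_sample(config, "random20000")
except ValueError as error:
    print(error)
# Sample 'random20000': 'reference_genome' is defined neither globally nor for the sample
```

Failing early with the sample identifier and key name in the message makes a misconfigured entry easier to locate than an error raised later inside a rule.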
To create or update the configuration interactively, activate the GUI environment (conda activate petrisnake-config) and launch Rscript config/config_shiny.R. The app mirrors all of the fields documented above and writes a ready-to-run YAML file.
Running the pipeline
Because Snakemake uses Apptainer and Conda to handle the dependencies of the rules, the flag --software-deployment-method apptainer conda must be provided when launching the pipeline.
When the barcode threshold to use is unknown, we first run the pipeline to the end of the demultiplexer. In that case, run
snakemake --software-deployment-method apptainer conda --cores determine_bc_cutoff.
This will perform the demultiplexing of the forward reads and produce the following files of interest:
- The barcode table at results//_barcode_table.txt, which contains the barcode and UMI assignments of each read.
- The frequency table at results//_frequency_table.txt, which lists the frequency of each barcode combination in descending order (most common ones on top). This table is used by posDemux::interactive_bc_cutoff() to determine the barcode cutoff, that is, how many of the top barcode combinations are kept.
- The log file of the demultiplexing at results//demultiplex.log. This file serves both as a progress indicator of the process and provides an informative summary of the results in the end.
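As a conceptual illustration of what a barcode cutoff means, the sketch below ranks barcode combinations by read count, keeps the top ones, and reports the fraction of reads retained. The barcode strings, counts, and the helper top_barcodes are made up for the example; it does not read the actual frequency table format and is not a replacement for posDemux::interactive_bc_cutoff().

```python
def top_barcodes(counts, bc_cutoff):
    """Keep the bc_cutoff most frequent barcode combinations (hypothetical helper)."""
    # Rank barcode combinations by read count, most common first.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:bc_cutoff]
    total = sum(counts.values())
    # Fraction of all reads that belong to the retained combinations.
    fraction = sum(n for _, n in kept) / total if total else 0.0
    return kept, fraction


# Made-up read counts per barcode combination.
counts = {
    "AAA-CCC-GGG": 500,
    "AAA-CCC-TTT": 300,
    "TTT-GGG-AAA": 150,
    "CCC-AAA-TTT": 40,
    "GGG-TTT-CCC": 10,
}

kept, fraction = top_barcodes(counts, bc_cutoff=3)
print([barcode for barcode, _ in kept])  # ['AAA-CCC-GGG', 'AAA-CCC-TTT', 'TTT-GGG-AAA']
print(f"{fraction:.2%}")                 # 95.00%
```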
After the barcode cutoff is determined and entered into the config file, we are ready to run the rest of the pipeline:
snakemake --software-deployment-method apptainer conda --cores all
The additional files of interest produced after the entire pipeline has finished are:
- The selected frequency table at results//_selected_frequency_table.txt. This is a truncated and renormalized version of the frequency table containing only the reads within the barcode cutoff.
- The figures of the knee plot and density of barcode frequencies located at results//_kneePlot.pdf and results//_ReadsPerBC.pdf. The dashed vertical line indicates the selected cutoff. Note that the density plot is scaled by the number of reads on the y-axis in order to make it easier to interpret. If you are unhappy with how the plots look, you can recreate them from the frequency table using posDemux::knee_plot() and posDemux::freq_plot().
- The gene count matrix at results//_gene_count_matrix.txt. This is a tab-separated file (remember to explicitly set \t as the delimiter) showing the number of detected UMIs per gene for all cells. The genes form the columns, whereas the cells form the rows.
- The UMI count table at results//_umi_count_table.txt, which can be seen as a linear form of the gene count matrix. Each row contains the cell barcode, the UMI, its associated gene and how many duplicates were found of this UMI.
- The log file from the gene counting at results//count_genes.log, which serves both as a progress indicator and provides a summary of the gene count table.
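The relationship between the UMI count table and the gene count matrix can be sketched as follows. The rows, the column name cell, and the exact header layout are invented for illustration and may not match the files the pipeline writes; the sketch assumes each row of the UMI table is one distinct UMI, so counting rows per cell and gene gives the number of UMIs per gene.

```python
import csv
import io
from collections import Counter

# Hypothetical UMI count rows: (cell barcode, UMI, gene, number of duplicates).
umi_rows = [
    ("cell_1", "ACGTACG", "geneA", 3),
    ("cell_1", "TTGCAAC", "geneA", 1),
    ("cell_1", "GGATCCA", "geneB", 2),
    ("cell_2", "ACGTACG", "geneB", 5),
]

# One row per distinct UMI, so counting rows gives UMIs per (cell, gene).
umi_counts = Counter((cell, gene) for cell, _umi, gene, _dups in umi_rows)
cells = sorted({row[0] for row in umi_rows})
genes = sorted({row[2] for row in umi_rows})

# Write the matrix as tab-separated text: cells as rows, genes as columns.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["cell"] + genes)
for cell in cells:
    writer.writerow([cell] + [umi_counts[(cell, gene)] for gene in genes])

# Read it back, setting the tab delimiter explicitly.
buffer.seek(0)
matrix = {
    row["cell"]: {gene: int(row[gene]) for gene in genes}
    for row in csv.DictReader(buffer, delimiter="\t")
}
print(matrix)
# {'cell_1': {'geneA': 2, 'geneB': 1}, 'cell_2': {'geneA': 0, 'geneB': 1}}
```

Note that the duplicate counts do not enter the matrix in this sketch: a UMI seen five times still counts as one detected molecule.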
This repository is already prepared with an example dataset random20000 and a corresponding pre-populated config file, meaning that the user can try the pipeline right out of the box.
Version History
v2.0 (earliest) Created 6th Feb 2026 at 11:26 by Jakob Peder Pettersen
Changed the license after obtaining consent from Columbia University
Frozen
v2.0
ba31614
https://orcid.org/0000-0002-3485-1634