PISAD - Phased Intraspecies Sample Anomalies Detection tool
Summary
We developed PISAD, a tool that detects anomalies in cohort samples without requiring reference information. It runs in two stages. Stage 1: we select low-error data from the cohort and perform reference-free SNP calling to construct a variant sketch. Stage 2: by comparing the k-mer counts of the other cohort data against the variant sketch, we infer the relationship between the target sample and each of the other samples in order to detect sample swaps.
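To make the Stage 2 idea more concrete, the toy sketch below counts how often two allele-specific k-mers occur in a handful of reads. It is only an illustration of k-mer counting against known variant k-mers and not PISAD's implementation or file format; the function names `kmers` and `count_kmer`, the k-mer size of 5, and all sequences are made up for the example.

```shell
#!/bin/sh
# Toy illustration only: NOT PISAD's algorithm or sketch format.

# Print every k-mer of a sequence, one per line.
# Arguments: sequence, k
kmers() {
  awk -v s="$1" -v k="$2" \
    'BEGIN { for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k) }'
}

# Count exact occurrences of one k-mer across a set of reads.
# Arguments: k, target k-mer, then the reads
count_kmer() {
  k=$1
  target=$2
  shift 2
  # grep -c exits non-zero when the count is 0; the count is still printed.
  for r in "$@"; do kmers "$r" "$k"; done | grep -c -x -F -- "$target" || true
}

# Hypothetical SNP site: one k-mer per allele (k = 5), plus three made-up reads.
ref_count=$(count_kmer 5 ACGTA TTACGTAGG GGACCTATT TTACGTACC)
alt_count=$(count_kmer 5 ACCTA TTACGTAGG GGACCTATT TTACGTACC)
echo "ref=$ref_count alt=$alt_count"
```

In this toy case both alleles are observed in the reads. How PISAD turns such counts into a relationship between samples is not described in this README.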
Dependencies
We recommend installing the dependencies with conda.
- GCC (tested on 8.5.0)
- gperftools (2.10)
- hdf5 (1.14.3)
- boost (1.85.0)
Installation
Clone the PISAD repository to your machine and enter its directory:
git clone https://github.com/ZhantianXu/PISAD.git
cd PISAD/
Compiling should be as easy as:
./configure && make
To install in a specified directory:
./configure --prefix=/PATH && make install
Usage
Stage 1: SNP calling:
First, we select a low-error-rate sequencing dataset as the target sample for rapid SNP calling. This step supports multi-threaded processing.
Example:
./run.sh -i /data/hg002.fastq.gz -m 0
Required parameters:
-i: Input files (*.fastq or *.fastq.gz)
-m: Heterozygosity parameter (0 for <1.2%, 1 otherwise)
Optional parameters:
-k: k-mer size (default: 21)
-t: Number of threads (default: 8)
-o: Output prefix (default: prefix of the first input file)
-d1: Directory for dsk files (default: current directory)
-d2: Directory for output plot (default: current directory)
-d3: Directory for SNP output (default: current directory)
-h: Show this help message
Advanced optional parameters:
-est: est_kmercov (default: estimated by the algorithm)
-cutoff: Cutoff threshold (default: 0.95)
-het: Initial heterozygosity (default: 0/0.12)
-rho: Initial rho value (default: 0.2)
-setleft: Left boundary of the heterozygous region (default: estimated by the algorithm)
-setright: Right boundary of the heterozygous region (default: estimated by the algorithm)
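As a small sketch of how the required parameters fit together, the helper below maps a heterozygosity estimate (in percent) to the documented -m value and prints the matching run.sh command line without executing it. The function names `het_mode` and `snp_call_cmd`, the 0.1 % estimate, and the thread count are assumptions made for illustration; only the -i, -m and -t flags come from the option list above.

```shell
#!/bin/sh
# Map an estimated heterozygosity (in percent) to the -m value:
# 0 for < 1.2 %, 1 otherwise, as documented in the parameter list.
het_mode() {
  awk -v h="$1" 'BEGIN { if (h + 0 < 1.2) print 0; else print 1 }'
}

# Print (do not execute) a run.sh command line for one FASTQ file.
# Arguments: input file, heterozygosity in percent, thread count
snp_call_cmd() {
  printf './run.sh -i %s -m %s -t %s\n' "$1" "$(het_mode "$2")" "$3"
}

# Hypothetical values: 0.1 % heterozygosity and 16 threads.
cmd=$(snp_call_cmd /data/hg002.fastq.gz 0.1 16)
echo "$cmd"
```

Printing the command first makes it easy to review the chosen mode before starting a long SNP-calling run.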
Stage 1: Construct the variant sketch:
Next, we convert the called SNPs into a variant sketch.
./create -i /snp/hg002_21_2_4_pairex.snp
Required parameters:
-i: Input file (.snp file)
Optional parameters:
-k: k-mer size (default: 21)
-l: Filtering threshold (default: 21)
-o: Output prefix (default: current directory)
Stage 2: Count the k-mers:
We compare the k-mer counts of other cohort samples to the variant sketch to infer the relationships between them. Input files may be gzipped, and multiple threads can be used.
./pisadCount -s /fa/hg002.fa /data/hg003.fastq.gz
Usage: ./pisadCount -s [FASTA] [OPTION]... [FILES...]
Required options:
-s: variant sketch (one or more)
Optional options:
-t: Number of threads to run (default: 1)
-m: k-mer coverage threshold for early termination (default: inf)
-i: Print extra debug information
-k: k-mer size used (default: 21)
-o: Evaluation file path (default: current directory)
-h: Display this dialog
Here, the -s option accepts multiple FA files for variant sketching, separated by commas, such as -s /fa/hg002.fa,/fa/hg001.fa.
If your input file has high coverage, you can also add the -m parameter to control the reading process and save time, such as -m 10.
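When several sketches are involved, the comma-separated -s argument can be assembled with a small helper, as in the sketch below. The function name `join_commas` and the thread count are assumptions for illustration; the paths and the -s, -t and -m flags are taken from the examples above. The command is only printed, not executed.

```shell
#!/bin/sh
# Join all arguments with commas: a.fa b.fa -> a.fa,b.fa
join_commas() {
  out=
  for item in "$@"; do
    out="${out:+$out,}$item"
  done
  printf '%s\n' "$out"
}

# Build the -s value from two sketch files and print the resulting command.
sketches=$(join_commas /fa/hg002.fa /fa/hg001.fa)
cmd="./pisadCount -s $sketches -t 8 -m 10 /data/hg003.fastq.gz"
echo "$cmd"
```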
Stage 2: Evaluate the samples:
Input the statistics of samples to calculate their relationship and detect sample swaps.
./pisadEval /homeb/xuzt/coverage/eval/hg002_hg003.txt > summary.tsv
Usage: ./pisadEval [OPTION]... [FILES...]
Optional options:
-t: Number of threads to run (default: 1)
-h: Display this dialog
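For a cohort with several samples, the two Stage 2 steps can be planned as a dry run before anything is executed. The sketch below prints one pisadCount command per sample and a final pisadEval command. It rests on two unverified assumptions: that -o may be given a directory, and that evaluation files are named <sketch>_<sample>.txt, which is inferred only from the example path hg002_hg003.txt above. The function name `stage2_plan`, the `eval` directory, and the hg004 sample are hypothetical.

```shell
#!/bin/sh
# Dry run: print the Stage 2 commands for a small cohort instead of running them.
# ASSUMPTION: -o accepts a directory and evaluation files are named
# <sketch>_<sample>.txt inside it (inferred from the example hg002_hg003.txt).
# Check the files your PISAD build actually writes before relying on this.
SKETCH=/fa/hg002.fa
EVAL_DIR=eval

# Arguments: one or more *.fastq.gz files from the cohort
stage2_plan() {
  sketch_name=$(basename "$SKETCH" .fa)
  eval_files=
  for fq in "$@"; do
    sample=$(basename "$fq" .fastq.gz)
    printf './pisadCount -s %s -o %s %s\n' "$SKETCH" "$EVAL_DIR" "$fq"
    eval_files="${eval_files:+$eval_files }$EVAL_DIR/${sketch_name}_${sample}.txt"
  done
  printf './pisadEval %s > summary.tsv\n' "$eval_files"
}

plan=$(stage2_plan /data/hg003.fastq.gz /data/hg004.fastq.gz)
echo "$plan"
```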
Version History
main @ 01f7a4a (earliest) Created 19th Mar 2025 at 02:17 by zhantian xu
Multi-platform support