plant2human workflow
main @ 11b46d8 (latest)

main @ 11b46d8 (latest)

main @ 044221e

main @ 76a6471

main @ 1aa2763

main @ b8c0b1d (earliest)

View on GitHub

Download RO-Crate

Workflow Type: Common Workflow Language

Work-in-progress

plant2human workflow

GitHub last commit (branch)

This analysis workflow is centered on foldseek, which enables fast structural similarity searches and supports the discovery of understudied genes by comparing plants, which are distantly related species, with humans, for which there is a wealth of information. Based on the list of genes you are interested in, you can easily create a scatter plot of “structural similarity vs. sequence similarity” by retrieving structural data from the AlphaFold protein structure database.

report

2025-02-02: fix foldseek easy-search process

Implementation Background

In recent years, with the AlphaFold protein structure database, it has become easier to obtain protein structure prediction data and perform structural similarity searches even for plant species such as rice. Against this background, searching for hits with “low sequence similarity and high structural similarity” for the gene groups being focused on has become possible. This approach may allow us to discover proteins that are conserved in distantly related species and to consider the characteristics of these proteins based on the wealth of knowledge we have about humans.

Analysis Environment

1. Using Dev Containers (Docker and VScode extension)

Most processes, such as Foldseek, use container (BioContainers), but some involve processing with jupyter notebook, which requires the preparation of some python libraries (e.g., polars.). If you want to experiment with a simple workflow, you can create an analysis environment easily using Dev Containers system, a VScode extension. Using this environment, the version of the python library is probably the one used during development, so errors are unlikely to occur (see Dockerfile for the package version).

Please check the official website for Dev Container details.

2. Executing with cwltool

This analysis workflow is tested using cwltool version 3.1.20250110105449.

Example 1 ( Oryza sativa vs Homo sapiens)

Here, we will explain how to use the list of 10 rice genes as an example.

1. Creation of a TSV file of gene and UniProt ID correspondences

First, you need the following gene list tsv file. (Please set the column name as "From")

From
Os01g0187600
Os12g0129300
Os12g0159500
Os02g0609000
Os05g0468600
Os05g0352750
Os06g0140700
Os04g0391500
Os01g0795250
Os01g0859200

The following TSV file is required to execute the following workflow.

From	UniProt Accession
Os01g0187600	A0A0P0UZ77
Os12g0129300	A0A0P0Y6G7
Os12g0129300	B9GBP4
Os12g0159500	A0A0P0Y794
Os12g0159500	A0A8J8YJ44
Os12g0159500	B9GBZ8
...

To do this, you need to follow the CWL workflow command below. This yaml file is the parameter file for the workflow, for example.

cwltool --debug --outdir ./test/oryza_sativa_test ./Tools/01_uniprot_idmapping.cwl ./job/uniprot_idmapping_job_example_os.yml

In this execution, mmcif files are also retrieved. The execution results are output with the jupyter notebook.

2. Creating and Preparing Indexes

I'm sorry, but the main workflow does not currently include the creation of an index process (both for foldseek index and BLAST index). Please perform the following processes in advance.

2-1. Creating a Foldseek Index

In this workflow, the target of the structural similarity search is specified as the AlphaFold database to perform comparisons across a broader range of species. Index creation using the foldseek databases command is through the following command.

Please select the database you want to use from Alphafold/UniProt, Alphafold/UniProt50-minimal, Alphafold/UniProt50, Alphafold/Proteome, Alphafold/Swiss-Prot. You can check the details of this database using the following command.

docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321h5021889_2 foldseek databases --help

For example, if you want to specify AlphaFold/Swiss-Prot as the index, you can do so with the following command,

# using docker container
docker run -u $(id -u):$(id -g) --rm -v $(pwd):/home -e HOME=/home --workdir /home quay.io/biocontainers/foldseek:9.427df8a--pl5321h5021889_2 foldseek databases Alphafold/Swiss-Prot swissprot tmp --threads 8

# making directory
mkdir ./index/index_swissprot

# moving index file
mv swissprot* ./index/index_swissprot/

Note: We have written a CWL file describing the above process, but we have confirmed an error and are correcting it.

2-2. Downloading a BLAST Index File

An index FASTA file must be downloaded to obtain the amino acid sequence using the blastdbcmd command from the UniProt database. Since it is a rice and human comparison, it can be downloaded as follows.

# Oryza sativa (all uniprot entries)
curl -o uniprotkb_39947_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A39947%29"

gzip -d uniprotkb_39947_all.fasta.gz

# Homo sapiens (all uniprot entries)
curl -o uniprotkb_9606_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A9606%29"

gzip -d uniprotkb_9606_all.fasta.gz

3. Execution of the Main Workflow

In this process, we perform a structural similarity search using the foldseek easy-search command and then perform a pairwise alignment of the amino acid sequences of the hit pairs using the needle and water commands. Finally, based on this information, we create a scatter plot and output a jupyter notebook as a report. Examples of commands are as follows.

cwltool --debug --outdir ./test/oryza_sativa_test ./Workflow/plant2human.cwl ./job/plant2human_job_example_os.yml

For example, you can visualize the results of structural similarity and global alignment, as shown below. In this case, the x-axis represents the global alignment similarity match (%), and the y-axis represents the LDDT score (an indicator of structural alignment).

The following scatter diagram can also be obtained from the test results of Zey Mays random 100 genes vs. Homo sapiens.

Click and drag the diagram to pan, double click or use the controls to zoom.

SEEK ID: https://workflowhub.eu/workflows/1206?version=5

DOI: 10.48546/workflowhub.workflow.1206.5

Inputs

ID	Name	Description	Type
INPUT_DIRECTORY	input protein structure file directory	query protein structure file (default: mmCIF) directory for foldseek easy-search input.	Directory
FILE_MATCH_PATTERN	file match pattern	file match pattern for listing input files. default: *.cif	string
FOLDSEEK_INDEX	foldseek index files	"foldseek index files for foldseek easy-search input. default: ../index/index_swissprot/swissprot Note: At this time (2025/02/02), the process of acquiring and indexing index files for execution has not been incorporated into the workflow. Therefore, we would like you to execute the following commands in advance. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 8` "	File
OUTPUT_FILE_NAME1	output file name (foldseek easy-search)	output file name for foldseek easy-search result. Currently, this workflow only supports TSV file output.	string
EVALUE	e-value (foldseek easy-search)	e-value threshold for foldseek easy-search. workflowdefault: 0.1	double
ALIGNMENT_TYPE	alignment type (foldseek easy-search)	alignment type for foldseek easy-search. default: 2 (3Di + AA: local alignment) for detailed information, see foldseek GitHub repository.	int
THREADS	threads (foldseek easy-search)	threads for foldseek easy-search. default: 16	int
SPLIT_MEMORY_LIMIT	split memory limit (foldseek easy-search)	split memory limit for foldseek easy-search. default: 120G	string
TAXONOMY_ID_LIST	taxonomy id list (foldseek easy-search)	taxonomy id list. separated by comma. Be sure to set “9606”. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon)	string
OUTPUT_FILE_NAME2	output file name (extract target species)	output file name for extract target species (default: human) python script.	string
WF_COLUMN_NUMBER_QUERY_SPECIES	column number of query species	column number of query species. default: 1 (UniProt ID list)	int
OUTPUT_FILE_NAME_QUERY_SPECIES	output file name (extract query species column)	output file name for extract query species column python script. default: foldseek_result_query_species.txt	string
WF_COLUMN_NUMBER_HIT_SPECIES	column number of hit species	column number of hit species. default: 2 (UniProt ID list)	int
OUTPUT_FILE_NAME_HIT_SPECIES	output file name (extract hit species column)	output file name for extract hit species column python script. default: foldseek_result_hit_species.txt	string
SW_INPUT_FASTA_FILE_QUERY_SPECIES	input fasta file (blastdbcmd process)	input fasta file for blastdbcmd. Retrieve files in advance. default: rice UniProt FASTA file	File
SW_INPUT_FASTA_FILE_HIT_SPECIES	input fasta file (blastdbcmd process)	input fasta file for blastdbcmd. Retrieve files in advance. default: human UniProt FASTA file	File
ROUTE_DATASET	route dataset (ID conversion using togoID)	route dataset for ID conversion. This operation selects the UniProt ID of the target species (human) for which cross-references exist (final destination is HGNC gene symbol). default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol	string
OUTPUT_FILE_NAME3	output file name (ID conversion using togoID)	output file name for ID conversion. default: foldseek_hit_species_togoid_convert.tsv	string
OUT_NOTEBOOK_NAME	output notebook name (papermill process)	output notebook name for papermill. After the analysis workflow is output, it can be freely customized such as changing the parameter values. default: plant2human_report.ipynb	string
QUERY_IDMAPPING_TSV	query idmapping tsv (papermill process)	query idmapping tsv file. Retrieve files in advance. default: rice UniProt ID mapping file	File
QUERY_GENE_LIST_TSV	query gene list tsv (papermill process)	query gene list tsv file. Retrieve files in advance. default: rice random gene list	File

Steps

ID	Name	Description
sub_workflow_foldseek_easy_search	foldseek easy-search sub-workflow process	"Execute foldseek easy-search using foldseek using BioContainers docker image. This workflow supports only TSV file output. Step 1: listing files Step 2: foldseek easy-search process"
extract_target_species	extract target species (human) process	Extract target species (human) from foldseek easy-search result. execute: ../Tools/12_extract_target_species.cwl
extract_query_species_column	extract query species column process	Extract query species column (UniProt ID list) from foldseek easy-search result. execute: ../Tools/13_extract_id.cwl
extract_hit_species_column	extract hit species column process	Extract hit species column (UniProt ID list) from foldseek easy-search result. execute: ../Tools/13_extract_id.cwl
sub_workflow_retrieve_sequence_query_species	retrieve sequence sub-workflow process using EMBOSS package	"Perform pairwise alignment of protein sequences for pairs identified by structural similarity search. Step 1: retrieve sequence from blastdbcmd result Step 2: makeblastdb: ../Tools/14_makeblastdb.cwl Step 3: blastdbcmd: ../Tools/15_blastdbcmd.cwl Step 4: seqretsplit: ../Tools/16_seqretsplit.cwl Step 5: needle (Global alignment): ../Tools/17_needle.cwl Step 6: water (Local alignment): ../Tools/17_water.cwl"
togoid_convert	togoid convert process	retrieve UniProt ID to HGNC gene symbol using togoID python script. execute: ../Tools/18_togoid_convert.cwl
papermill	papermill process	output notebook using papermill. This process allows you to create a scatter plot of structural similarity vs. sequence similarity. execute: ../Tools/19_papermill.cwl

Outputs

ID	Name	Description	Type
TSVFILE1	output file (foldseek easy-search result)	output file for foldseek easy-search all hit result.	File
TSVFILE2	output file (extract target species)	extract target species foldseek result file. (in this workflow, human result only)	File
IDLIST1	output file (extract query species column)	extract query species column UniProt ID list file.	File
IDLIST2	output file (extract hit species column)	extract hit species column UniProt ID list file.	File
INDEX_DIR1	index directory (query species)	index directory for query species.	Directory
INDEX_FILES1	index files (query species)	index files for query species.	File
INDEX_DIR2	index directory (hit species)	index directory for hit species.	Directory
INDEX_FILES2	index files (hit species)	index files for hit species.	File
BLASTDBCMD_RESULT1	blastdbcmd result (query species)	blastdbcmd result file for query species.	File
BLASTDBCMD_RESULT2	blastdbcmd result (hit species)	blastdbcmd result file for hit species.	File
LOGFILE1	logfile (blastdbcmd query species)	logfile for blastdbcmd query species.	File
LOGFILE2	logfile (blastdbcmd hit species)	logfile for blastdbcmd hit species.	File
DIR1	directory (seqretsplit query species)	directory for seqretsplit query species.	Directory
FASTA_FILES1	split fasta files (seqretsplit query species)	split fasta files using seqretsplit for pairwise sequence alignment.	File[]
DIR2	directory (seqretsplit hit species)	directory for seqretsplit hit species.	Directory
FASTA_FILES2	split fasta files (seqretsplit hit species)	split fasta files using seqretsplit for pairwise sequence alignment.	File[]
DIR3	needle result directory	needle (global alignment) result directory.	Directory
NEEDLE_RESULT_FILE	needle result file (.needle)	needle (global alignment) result files. suffix is .needle.	File[]
DIR4	water result directory	water (local alignment) result directory.	Directory
WATER_RESULT_FILE	water result file (.water)	water (local alignment) result files. suffix is .water.	File[]
TSVFILE3	output file (togoid convert)	output file for togoid convert.	File
REPORT_NOTEBOOK	output notebook (papermill)	output notebook using papermill. notebook name is `plant2human_report.ipynb`.	File