plant2human workflow
This analysis workflow is centered on foldseek, which enables fast structural similarity searches, and supports the discovery of understudied genes by comparing plants, which are distantly related species, with human, for which there is a wealth of information. Based on the list of genes you are interested in, you can easily create a scatter plot of “structural similarity vs. sequence similarity” by retrieving structural data from the AlphaFold protein structure database.
Analysis environment
You can create an analysis environment using the Dev Containers, which is one of VScode extensions. Please check the official website for details.
Example 1 ( Oryza sativa vs Homo sapiens)
Here, we will explain how to use the list of rice genes as an example.
1. Creation of a TSV file of gene and UniProt ID correspondences
First, you will need the following gene list tsv file. (Please set the column name as "From")
From
Os01g0187600
Os12g0129300
Os12g0159500
Os02g0609000
Os05g0468600
Os05g0352750
Os06g0140700
Os04g0391500
Os01g0795250
Os01g0859200
The following TSV file is required to execute the following workflow.
From UniProt Accession
Os01g0187600 A0A0P0UZ77
Os12g0129300 A0A0P0Y6G7
Os12g0129300 B9GBP4
Os12g0159500 A0A0P0Y794
Os12g0159500 A0A8J8YJ44
Os12g0159500 B9GBZ8
...
To do this, you need to follow the CWL workflow command below. This yaml file is the parameter file for the workflow for example.
cwltool --debug --outdir ./test/oryza_sativa_test ./Tools/01_uniprot_idmapping.cwl ./job/uniprot_idmapping_job_example_os.yml
In this execution, mmcif files are also retrieved. The actual execution results are output together with the jupyter notebook.
2. Creating and preparing indexes
I'm sorry, but the main workflow does not currently include the creation of an index (both for foldseek index and BLAST index). Please perform the following processes in advance.
2-1. Creating a foldseek index
In this workflow, the target of the structural similarity search is specified as the AlphaFold database in order to perform comparisons across a wider range of species.
Index creation using the foldseek databases
command is possible with CWL Command Line Tool file.
Please select the database you want to use from Alphafold/UniProt
, Alphafold/UniProt50-minimal
, Alphafold/UniProt50
, Alphafold/Proteome
, Alphafold/Swiss-Prot
.
You can check the details of this database using the following command.
docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases --help
For example, if you want to specify AlphaFold/Swiss-Prot as the index, you can do so with the following command,
# using docker container
docker run -u $(id -u):$(id -g) --rm -v $(pwd):/home -e HOME=/home --workdir /home quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases Alphafold/Swiss-Prot swissprot tmp --threads 8
# making directory
mkdir ./index/index_swissprot
# moving index file
mv swissprot* ./index/index_swissprot/
Note: We have written a CWL file describing above process, but we have confirmed that an error occurs and are in the process of correcting it.
2-2. Downloading a BLAST index file
An index FASTA file must be downloaded to obtain the amino acid sequence using the blastdbcmd
command from the UniProt database.
In this case, since it is a rice and human comparison, it can be downloaded as follows.
# Oryza sativa (all uniprot entries)
curl -o uniprotkb_39947_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A39947%29"
gzip -d uniprotkb_39947_all.fasta.gz
# Homo sapiens (all uniprot entries)
curl -o uniprotkb_9606_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A9606%29"
gzip -d uniprotkb_9606_all.fasta.gz
3. Execution of the main workflow
In this process, we perform a structural similarity search using foldseek easy-search
, and then perform a pairwise alignment of the amino acid sequences of the hit pairs using needle
and water
.
Finally, we create a scatter plot based on this information and output a jupyter notebook as a report.
Examples of commands are as follows.
cwltool --debug --outdir ./test/oryza_sativa_test ./Workflow/plant2human.cwl ./job/plant2human_job_example_os.yml
For example, you can visualize the results of structural similarity and global alignment, as shown below.
https://github.com/yonesora56/plant2human/blob/main/image/rice_test_scatter_plot.png
The following scatter diagram can also be obtained from the test results of Zey mays random 100 genes.
https://github.com/yonesora56/plant2human/blob/main/image/zey_mays_scatter_plot.png
Click and drag the diagram to pan, double click or use the controls to zoom.
Inputs
ID | Name | Description | Type |
---|---|---|---|
INPUT_DIRECTORY | input cif file directory | query protein structure cif file directory for foldseek easy-search input. |
|
FILE_MATCH_PATTERN | file match pattern | file match pattern for listing input cif files. |
|
FOLDSEEK_INDEX | foldseek index file | foldseek index file for foldseek easy-search input. This index file can be retrieved by executing the `foldseek databases` command. example: `foldseek databases Alphafold/Swiss-Prot index_swissprot/swissprot tmp --threads 16` |
|
OUTPUT_FILE_NAME1 | output file name (foldseek easy-search) | output file name for foldseek easy-search result. |
|
EVALUE | e-value (foldseek easy-search) | e-value threshold for foldseek easy-search. default: 0.1 |
|
THREADS | threads (foldseek easy-search) | threads for foldseek easy-search. default: 16 |
|
SPLIT_MEMORY_LIMIT | split memory limit (foldseek easy-search) | split memory limit for foldseek easy-search. default: 120G |
|
TAXONOMY_ID_LIST | taxonomy id list (foldseek easy-search) | taxonomy id list. separated by comma. Be sure to set “9606”. default: 9606 (human), 10090 (mouse), 3702 (Arabidopsis), 4577 (Zea mays), 4529 (Oryza rufipogon) |
|
OUTPUT_FILE_NAME2 | output file name (extract target species) | output file name for extract target species python script. |
|
WF_COLUMN_NUMBER_QUERY_SPECIES | column number of query species | column number of query species. default: 1 |
|
OUTPUT_FILE_NAME_QUERY_SPECIES | output file name (extract query species column) | output file name for extract query species column python script. |
|
WF_COLUMN_NUMBER_HIT_SPECIES | column number of hit species | column number of hit species. default: 2 |
|
OUTPUT_FILE_NAME_HIT_SPECIES | output file name (extract hit species column) | output file name for extract hit species column python script. |
|
SW_INPUT_FASTA_FILE_QUERY_SPECIES | input fasta file (for blastdbcmd) | input fasta file for blastdbcmd. Retrieve files in advance. |
|
SW_INPUT_FASTA_FILE_HIT_SPECIES | input fasta file (for blastdbcmd) | input fasta file for blastdbcmd. Retrieve files in advance. |
|
ROUTE_DATASET | route dataset (togoid convert) | route dataset for togoid convert. default: uniprot,ensembl_protein,ensembl_transcript,ensembl_gene,hgnc,hgnc_symbol |
|
OUTPUT_FILE_NAME3 | output file name (togoid convert) | output file name for togoid convert python script. |
|
OUT_NOTEBOOK_NAME | output notebook name (papermill) | output notebook name for papermill. |
|
QUERY_IDMAPPING_TSV | query idmapping tsv (papermill) | query idmapping tsv file. Retrieve files in advance. |
|
QUERY_GENE_LIST_TSV | query gene list tsv (papermill) | query gene list tsv file. Retrieve files in advance. |
|
Steps
ID | Name | Description |
---|---|---|
sub_workflow_foldseek_easy_search | n/a | n/a |
extract_target_species | n/a | n/a |
extract_query_species_column | n/a | n/a |
extract_hit_species_column | n/a | n/a |
sub_workflow_retrieve_sequence_query_species | n/a | " retrieve sequence from blastdbcmd result makeblastdb: ../Tools/14_makeblastdb.cwl blastdbcmd: ../Tools/15_blastdbcmd.cwl seqretsplit: ../Tools/16_seqretsplit.cwl needle (Global alignment): ../Tools/17_needle.cwl water (Local alignment): ../Tools/17_water.cwl " |
togoid_convert | n/a | n/a |
papermill | n/a | n/a |
Outputs
ID | Name | Description | Type |
---|---|---|---|
TSVFILE1 | output file (foldseek easy-search) | output file for foldseek easy-search all hit result. |
|
TSVFILE2 | output file (extract target species) | extract target species foldseek result file. (in this workflow, human result only) |
|
IDLIST1 | output file (extract query species column) | extract query species column UniProt ID list file. |
|
IDLIST2 | output file (extract hit species column) | extract hit species column UniProt ID list file. |
|
INDEX_DIR1 | index directory (query species) | index directory for query species. |
|
INDEX_FILES1 | index files (query species) | index files for query species. |
|
INDEX_DIR2 | index directory (hit species) | index directory for hit species. |
|
INDEX_FILES2 | index files (hit species) | index files for hit species. |
|
BLASTDBCMD_RESULT1 | blastdbcmd result (query species) | blastdbcmd result file for query species. |
|
BLASTDBCMD_RESULT2 | blastdbcmd result (hit species) | blastdbcmd result file for hit species. |
|
LOGFILE1 | logfile (blastdbcmd query species) | logfile for blastdbcmd query species. |
|
LOGFILE2 | logfile (blastdbcmd hit species) | logfile for blastdbcmd hit species. |
|
DIR1 | directory (seqretsplit query species) | directory for seqretsplit query species. |
|
FASTA_FILES1 | split fasta files (seqretsplit query species) | split fasta files for seqretsplit query species. |
|
DIR2 | directory (seqretsplit hit species) | directory for seqretsplit hit species. |
|
FASTA_FILES2 | split fasta files (seqretsplit hit species) | split fasta files for seqretsplit hit species. |
|
DIR3 | needle result directory | needle result directory. |
|
NEEDLE_RESULT_FILE | needle result file (.needle) | needle result files. suffix is .needle. |
|
DIR4 | water result directory | water result directory. |
|
WATER_RESULT_FILE | water result file (.water) | water result files. suffix is .water. |
|
TSVFILE3 | output file (togoid convert) | output file for togoid convert. |
|
REPORT_NOTEBOOK | output notebook (papermill) | output notebook using papermill. notebook name is `plant2human_report.ipynb`. |
|
Version History
main @ 76a6471 (latest) Created 20th Nov 2024 at 05:29 by Sora Yonezawa
update README & add description
Frozen
main
76a6471
main @ 1aa2763 Created 16th Nov 2024 at 10:34 by Sora Yonezawa
main workflow changed
Frozen
main
1aa2763
main @ b8c0b1d (earliest) Created 16th Nov 2024 at 04:56 by Sora Yonezawa
UPDATE README & zey mays test files
Frozen
main
b8c0b1d
Creator
Submitter
Views: 595 Downloads: 131
Created: 16th Nov 2024 at 04:56
Last updated: 3rd Dec 2024 at 02:23
None