Research Object Crate for plant2human workflow

Original URL: https://workflowhub.eu/workflows/1206/ro_crate?version=1

### plant2human workflow This analysis workflow is centered on [foldseek](https://github.com/steineggerlab/foldseek), which enables fast structural similarity searches, and supports the discovery of understudied genes by comparing plants, which are distantly related species, with human, for which there is a wealth of information. Based on the list of genes you are interested in, you can easily create a scatter plot of **“structural similarity vs. sequence similarity”** by retrieving structural data from the AlphaFold protein structure database.   ### Analysis environment You can create an analysis environment using the Dev Containers, which is one of VScode extensions. Please check the official website for details. - [VScode Dev Containers](https://code.visualstudio.com/docs/devcontainers/containers)   ### Example 1 ( *Oryza sativa* vs *Homo sapiens*) Here, we will explain how to use the list of rice genes as an example. #### **1. Creation of a TSV file of gene and UniProt ID correspondences** First, you will need the following gene list tsv file. (Please set the column name as "From") ```tsv From Os01g0187600 Os12g0129300 Os12g0159500 Os02g0609000 Os05g0468600 Os05g0352750 Os06g0140700 Os04g0391500 Os01g0795250 Os01g0859200 ``` The following TSV file is required to execute the following workflow. ```tsv From UniProt Accession Os01g0187600 A0A0P0UZ77 Os12g0129300 A0A0P0Y6G7 Os12g0129300 B9GBP4 Os12g0159500 A0A0P0Y794 Os12g0159500 A0A8J8YJ44 Os12g0159500 B9GBZ8 ... ``` To do this, you need to follow the CWL workflow command below. This yaml file is the parameter file for the workflow for example. ```bash cwltool --debug --outdir ./test/oryza_sativa_test ./Tools/01_uniprot_idmapping.cwl ./job/uniprot_idmapping_job_example_os.yml ``` In this execution, mmcif files are also retrieved. The actual execution results are output together with the jupyter notebook.   #### **2. Creating and preparing indexes** I'm sorry, but the main workflow does not currently include the creation of an index (both for foldseek index and BLAST index). Please perform the following processes in advance. #### 2-1. Creating a foldseek index In this workflow, the target of the structural similarity search is specified as the AlphaFold database in order to perform comparisons across a wider range of species. Index creation using the `foldseek databases` command is possible with CWL Command Line Tool file. Please select the database you want to use from `Alphafold/UniProt`, `Alphafold/UniProt50-minimal`, `Alphafold/UniProt50`, `Alphafold/Proteome`, `Alphafold/Swiss-Prot`. You can check the details of this database using the following command. ```bash docker run --rm quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases --help ``` For example, if you want to specify AlphaFold/Swiss-Prot as the index, you can do so with the following command, ```bash # using docker container docker run -u $(id -u):$(id -g) --rm -v $(pwd):/home -e HOME=/home --workdir /home quay.io/biocontainers/foldseek:9.427df8a--pl5321hb365157_1 foldseek databases Alphafold/Swiss-Prot swissprot tmp --threads 8 # making directory mkdir ./index/index_swissprot # moving index file mv swissprot* ./index/index_swissprot/ ``` Note: We have written a CWL file describing above process, but we have confirmed that an error occurs and are in the process of correcting it. #### 2-2. Downloading a BLAST index file An index FASTA file must be downloaded to obtain the amino acid sequence using the `blastdbcmd` command from the UniProt database. In this case, since it is a rice and human comparison, it can be downloaded as follows. ```bash # Oryza sativa (all uniprot entries) curl -o uniprotkb_39947_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A39947%29" gzip -d uniprotkb_39947_all.fasta.gz # Homo sapiens (all uniprot entries) curl -o uniprotkb_9606_all.fasta.gz "https://rest.uniprot.org/uniprotkb/stream?compressed=true&format=fasta&query=%28organism_id%3A9606%29" gzip -d uniprotkb_9606_all.fasta.gz ``` #### 3. Execution of the main workflow In this process, we perform a structural similarity search using `foldseek easy-search`, and then perform a pairwise alignment of the amino acid sequences of the hit pairs using `needle` and `water`. Finally, we create a scatter plot based on this information and output a jupyter notebookas a report. Examples of commands are as follows. ```bash cwltool --debug --outdir ./test/oryza_sativa_test ./Workflow/plant2human.cwl ./job/plant2human_job_example_os.yml ``` For example, you can visualize the results of structural similarity and global alignment, as shown below. https://github.com/yonesora56/plant2human/blob/main/image/rice_test_scatter_plot.png The following scatter diagram can also be obtained from the test results of Zey mays random 100 genes. https://github.com/yonesora56/plant2human/blob/main/image/zey_mays_scatter_plot.png

Author
Sora Yonezawa
License
MIT

Contents

Main Workflow: plant2human workflow
Size: 15519 bytes