CroMaSt: A workflow for assessing protein domain classification by cross-mapping of structural instances between domain databases and structural alignment
CroMaSt (Cross Mapper of domain Structural instances) is an automated iterative workflow that clarifies the assignment of protein domains to a given domain type of interest, based on their 3D structure and on cross-mapping of domain structural instances between domain databases. CroMaSt classifies all structural instances of a given domain type into four categories (Core, True, Domain-like, and Failed).
Requirements
- Conda or Miniconda
- Kpax
Download and install Conda (or Miniconda) and Kpax by following the instructions on their official sites.
Get it running
(Assuming the requirements are already met)
- Clone the repository and change the directory
git clone https://gitlab.inria.fr/capsid.public_codes/CroMaSt.git
cd CroMaSt
- Create the conda environment for the workflow
conda env create --file yml/environment.yml
conda activate CroMaSt
- Update the paths in the parameter file
sed -i 's/\/home\/hdhondge\/CroMaSt\//\/YOUR\/PATH\/TO_CroMaSt\//g' yml/CroMaSt_input.yml
- Create the directories to store files from PDB and SIFTS (if they do not already exist)
mkdir PDB_files SIFTS
- Download the source input data
cwl-runner Tools/download_data.cwl yml/download_data.yml
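The `sed -i` call above uses GNU syntax; BSD/macOS `sed` expects a backup suffix after `-i`. As a portable alternative, the same path substitution can be sketched in Python. The function name `update_param_file` is hypothetical and not part of CroMaSt; the sketch only assumes that the parameter file contains the old installation prefix as plain text.

```python
from pathlib import Path


def update_param_file(param_file: str, old_prefix: str, new_prefix: str) -> int:
    """Replace old_prefix with new_prefix in param_file, in place.

    Returns the number of occurrences that were replaced.
    (Hypothetical helper, not shipped with CroMaSt.)
    """
    path = Path(param_file)
    original = path.read_text()
    path.write_text(original.replace(old_prefix, new_prefix))
    return original.count(old_prefix)


# Example call, run from the cloned repository root:
# update_param_file("yml/CroMaSt_input.yml",
#                   "/home/hdhondge/CroMaSt/",
#                   str(Path.cwd()) + "/")
```

A return value of 0 suggests the file was already edited or uses a different prefix.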
Basic example
1. First, we will run the workflow for the RRM domain with family identifiers RRM_1 and RRM in Pfam and CATH, respectively.
Run the workflow -
cwl-runner --parallel --outdir=Results/ CroMaSt.cwl yml/CroMaSt_input.yml
2. Once the iteration is complete, check the new_param.yml file in the output directory (Results). If it lists any family identifier under either pfam or cath, run the next iteration with the following command, and repeat until the workflow explores no new families -
cwl-runner --parallel --outdir=Results/ CroMaSt.cwl Results/new_param.yml
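The stopping check described in step 2 can be sketched as a small script. This is a minimal sketch that assumes new_param.yml stores the identifiers in the same inline-list form as the input file (`pfam: [...]`, `cath: [...]`); the function names are hypothetical, and a full YAML parser would be more robust than the regular expression used here.

```python
import ast
import re


def pending_families(param_text: str) -> dict:
    """Return the family ids listed under the 'pfam' and 'cath' keys.

    Assumes inline lists such as: pfam: ['PF00076', 'PF08777']
    """
    found = {"pfam": [], "cath": []}
    for key in found:
        match = re.search(rf"^{key}:\s*(\[.*?\])\s*$", param_text, flags=re.MULTILINE)
        if match:
            found[key] = list(ast.literal_eval(match.group(1)))
    return found


def needs_another_iteration(param_text: str) -> bool:
    """True if at least one Pfam or CATH family is still listed."""
    families = pending_families(param_text)
    return bool(families["pfam"] or families["cath"])


# Example: decide whether to launch the next iteration.
# with open("Results/new_param.yml") as handle:
#     print(needs_another_iteration(handle.read()))
```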
Extra: Start the workflow with multiple families from one or both databases
To start the workflow with multiple families from one or both databases, separate the family identifiers with commas.
pfam: ['PF00076', 'PF08777']
cath: ['3.30.70.330']
- Pro Tip: Give a different path to the --outdir option each time you run the workflow, or at least move the results to another location after the first run.
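One way to keep runs from overwriting each other is to derive the output directory from the iteration number. The sketch below only builds the command line; the `Results_iter<N>` naming scheme and the `build_command` helper are assumptions made for illustration, while the `cwl-runner` options are the ones used in the commands above.

```python
def build_command(param_file: str, iteration: int, base_outdir: str = "Results") -> list:
    """Build a cwl-runner command that writes to a per-iteration directory.

    The directory naming scheme is hypothetical, not a CroMaSt convention.
    """
    outdir = f"{base_outdir}_iter{iteration}/"
    return ["cwl-runner", "--parallel", f"--outdir={outdir}", "CroMaSt.cwl", param_file]


# Example: pass the list to subprocess.run(..., check=True) to launch the run.
# build_command("yml/CroMaSt_input.yml", 0)
```

With this scheme, the parameter file for the following iteration would be read from the previous iteration's directory rather than from Results/.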
Run the workflow for the protein domain of your choice
1. To run the workflow for the domain of your choice, replace the following family identifiers (for pfam and cath) in the yml/CroMaSt_input.yml file with the identifiers of your choice.
pfam: ['PF00076']
cath: ['3.30.70.330']
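Before launching a run, it can help to check that the identifiers look well formed. The sketch below assumes Pfam accessions of the form `PF` followed by five digits and CATH superfamily identifiers made of four dot-separated numbers, as in the values above; whether the workflow accepts other identifier forms is not stated here, and `invalid_ids` is a hypothetical helper.

```python
import re

# Pfam accession: "PF" followed by five digits, e.g. PF00076.
PFAM_ACCESSION = re.compile(r"PF\d{5}")
# CATH superfamily: four dot-separated numbers, e.g. 3.30.70.330.
CATH_SUPERFAMILY = re.compile(r"\d+\.\d+\.\d+\.\d+")


def invalid_ids(pfam_ids, cath_ids):
    """Return the identifiers that do not match the expected patterns."""
    bad = [i for i in pfam_ids if not PFAM_ACCESSION.fullmatch(i)]
    bad += [i for i in cath_ids if not CATH_SUPERFAMILY.fullmatch(i)]
    return bad


# Example: an empty list means every identifier looks well formed.
# invalid_ids(["PF00076"], ["3.30.70.330"])
```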
Data files used in the current version are as follows:
Files in the Data directory can be downloaded as follows:
- File used from the Pfam database: pdbmap.gz
- File used from the CATH database: cath-domain-description-file.txt
- Obsolete entries from RCSB PDB: obsolete_PDB_entry_ids.txt
- CATH version: 4.3.0 (version date: 11-Sep-2019), FTP site
- Pfam version: 35.0 (version date: November-2021), FTP site
Reference
Poster -
1. Hrishikesh Dhondge, Isaure Chauvot de Beauchêne, Marie-Dominique Devignes. CroMaSt: A workflow for domain family curation through cross-mapping of structural instances between protein domain databases. 21st European Conference on Computational Biology, Sep 2022, Sitges, Spain. ⟨hal-03789541⟩
Acknowledgements
This project has received funding from the Marie Skłodowska-Curie Innovative Training Network (MSCA-ITN) RNAct supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 813239.
Inputs
ID | Name | Description | Type |
---|---|---|---|
pfam | Pfam family ids | n/a | |
cath | CATH family ids | n/a | |
iteration | Iteration number starting from 0 | n/a | |
filename | Filename to store family ids per iteration | n/a | |
true_domain_file | To store all the true domain StIs | n/a | |
siftsDir | Directory for storing all SIFTS files | n/a | |
paramfile | Parameter file for current iteration | n/a | |
db_for_core | Database to select to compute core average structure | n/a | |
core_domain_struct | Core domain structure (.pdb) | n/a | |
prev_crossMapped_pfam | Pfam cross-mapped domain StIs from previous iteration | n/a | |
prev_crossMapped_cath | CATH cross-mapped domain StIs from previous iteration | n/a | |
unmapped_analysis_file | Filename with alignment scores for unmapped instances | n/a | |
pdbDir | The directory for storing all PDB files | n/a | |
cath_resmap | Filename for residue-mapped CATH domain StIs | n/a | |
cath_lost | Obsolete and inconsistent CATH domain StIs | n/a | |
pfam_resmap | Filename for residue-mapped Pfam domain StIs | n/a | |
pfam_lost | Obsolete and inconsistent Pfam domain StIs | n/a | |
domain_like | To store all the domain-like StIs | n/a | |
failed_domain | To store all failed domain StIs | n/a | |
min_domain_length | Threshold for minimum domain length | n/a | |
alignment_score | Alignment score from Kpax to analyse structures | n/a | |
score_threshold | Score threshold for given alignment score from Kpax | n/a | |
unmap_pfam_pass | Filename to store unmapped but structurally well aligned instances from Pfam | n/a | |
unmap_pfam_fail | Filename to store unmapped and not properly aligned instances from Pfam | n/a | |
unmap_cath_pass | Filename to store unmapped but structurally well aligned instances from CATH | n/a | |
unmap_cath_fail | Filename to store unmapped and not properly aligned instances from CATH | n/a | |
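The `score_threshold` input separates structures that align well with the core structure from those that do not. A minimal sketch of that split follows; it assumes that a higher score means a better alignment and that a score equal to the threshold passes, neither of which is stated in the table above. The structure names and score values are made up for illustration.

```python
def split_by_threshold(scores: dict, threshold: float):
    """Split structure names into (passed, failed) by alignment score.

    Assumption: higher scores are better and the threshold is inclusive.
    """
    passed = sorted(name for name, score in scores.items() if score >= threshold)
    failed = sorted(name for name, score in scores.items() if score < threshold)
    return passed, failed


# Hypothetical scores for three averaged structures.
example_scores = {"struct_A": 0.8, "struct_B": 0.3, "struct_C": 0.6}
passed, failed = split_by_threshold(example_scores, 0.6)
print(passed, failed)
```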
Steps
ID | Name | Description |
---|---|---|
get_family_ids | Get domain family ids | Get domain family ids from CATH and Pfam databases from parameter file provided by user |
pfam_domain_instances | Produce a list of residue-mapped domain StIs from Pfam ids | Retrieve and process the PDB structures corresponding to the Pfam family ids resulting in a list of residue-mapped structural domain instances along with lost structural instances (requires Data/pdbmap downloaded from Pfam and uses SIFTS resource for UniProt to PDB residue Mapping) |
cath_domain_instances | Produce a list of residue-mapped domain StIs from CATH ids | Retrieve and process the PDB structures corresponding to the CATH superfamily ids resulting in a list of residue-mapped structural domain instances along with lost structural instances (requires Data/cath_domain_description_file.txt downloaded from CATH and uses SIFTS resource for PDB to UniProt residue Mapping) |
add_crossmapped_to_resmapped | Add cross-mapped to residue-mapped domain StIs | Add crossmapped domain instances from last iteration to current list of residue mapped domain instances. |
compare_instances_CATH_Pfam | Compare residue-mapped domain StIs | Find the intersection between residue-mapped domain StIs of Pfam and CATH lists. Allows variable domain boundaries in a certain range +/- 30aa. Produces three files: common domain instances, and unique domain instances to each Pfam and CATH. |
crossmapping_Pfam2CATH | Map unique Pfam domain StIs to CATH db | Maps the unique domain StIs from Pfam to the whole CATH database (using residue numbering from PDB allowing variable domain boundaries +/-30aa) |
crossmapping_CATH2Pfam | Map unique CATH domain StIs to Pfam db | Maps the unique domain StIs from CATH to the whole Pfam database (using residue numbering from UniProt allowing variable domain boundaries +/-30aa) |
format_core_list | Format core domain StIs list | Format the core domain instances list from the common instances list identified at first iteration; prepares input for average structure computation |
chop_and_avg_for_core | Compute average of average for core domain instances | Compute average structure for all averaged structures corresponding to core UniProt domain instances. First computes the average per UniProt domain instance and then averages all averaged structures. |
chop_and_avg_for_CATH2Pfam | Compute average of average per cross-mapped Pfam | Compute average structure for all averaged structures corresponding to UniProt domain instances cross-mapped from CATH to a Pfam family. First computes the average per UniProt domain instance and then averages all averaged structures per Pfam family. |
chop_and_avg_for_Pfam2CATH | Compute average of average per cross-mapped CATH | Compute average structure for all averaged structures corresponding to UniProt domain instances cross-mapped from Pfam to a CATH superfamily. First computes the average per UniProt domain instance and then averages all averaged structures per CATH superfamily. |
align_avg_structs_pairwise | Pairwise alignment with core average structure | Align cross-mapped averaged structures pairwise against the core average domain structure using Kpax. Outputs a csv file with all the scores from pairwise alignments. |
check_alignment_scores | Checks the alignment score for given threshold | Checks the alignment score for each aligned structure based on the given threshold. Outputs the structural instances passing and failing the threshold in separate files. |
unmapped_from_pfam | Averages and aligns the unmapped instances from Pfam | First computes the average per UniProt domain instance and then aligns all the average structures against the core average structure. Outputs the alignment results along with the structures passing and failing the threshold for the given Kpax score. |
unmapped_from_cath | Averages and aligns the unmapped instances from CATH | First computes the average per UniProt domain instance and then aligns all the average structures against the core average structure. Outputs the alignment results along with the structures passing and failing the threshold for the given Kpax score. |
gather_domain_like | Collects all domain-like structural instances | Collects all domain-like structural instances from Pfam and CATH. Outputs the list with all domain-like structural instances together. |
gather_failed_domains | Collects all failed domain instances | Collects all domain instances that failed to pass the criteria from both Pfam and CATH. Outputs the list with all failed domain instances together. |
create_new_parameters | Create parameter file for next iteration | Create parameter file for next iteration from previous parameter file. Filters the pairwise alignments to retrieve family ids passing the threshold for a given Kpax score type. |
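The comparison step allows domain boundaries to differ by up to 30 residues. The sketch below illustrates one possible reading of that rule: two instances match when they lie on the same PDB chain and both their start and end positions differ by at most the tolerance. The exact matching rule used by the workflow is not specified here, and the `Instance` type, the function names, and the example PDB ids are hypothetical.

```python
from typing import NamedTuple


class Instance(NamedTuple):
    """A domain structural instance on one PDB chain (hypothetical layout)."""
    pdb_id: str
    chain: str
    start: int
    end: int


def same_domain(a: Instance, b: Instance, tolerance: int = 30) -> bool:
    """True if both boundaries agree within the tolerance on the same chain."""
    return (a.pdb_id == b.pdb_id and a.chain == b.chain
            and abs(a.start - b.start) <= tolerance
            and abs(a.end - b.end) <= tolerance)


def compare_instances(pfam, cath, tolerance: int = 30):
    """Return (common pairs, instances unique to Pfam, instances unique to CATH)."""
    common = [(p, c) for p in pfam for c in cath if same_domain(p, c, tolerance)]
    matched_pfam = {p for p, _ in common}
    matched_cath = {c for _, c in common}
    unique_pfam = [p for p in pfam if p not in matched_pfam]
    unique_cath = [c for c in cath if c not in matched_cath]
    return common, unique_pfam, unique_cath
```

The three return values mirror the three files described for compare_instances_CATH_Pfam: the common instances and the instances unique to each database.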
Outputs
ID | Name | Description | Type |
---|---|---|---|
family_ids_x | Family ids per iteration | n/a | |
resmapped_pfam | All Pfam residue-mapped domain StIs with domain labels | n/a | |
reslost_pfam | Obsolete and inconsistent domain StIs from Pfam | n/a | |
resmapped_cath | All CATH residue-mapped domain StIs with domain labels | n/a | |
reslost_cath | Obsolete and inconsistent domain StIs from CATH | n/a | |
true_domains | True domain StIs per iteration | n/a | |
core_domains_list | Core domain StIs | n/a | |
core_structure | Core domain structure (.pdb) | n/a | |
all_domain_like | Domain-like StIs | n/a | |
all_failed_domains | Failed domain StIs | n/a | |
crossmapped_pfam_passed | Cross-mapped families with Pfam domain StIs passing the threshold | n/a | |
crossmapped_cath_passed | Cross-mapped families with CATH domain StIs passing the threshold | n/a | |
crossres_mappedpfam | Merged cross-mapped and residue-mapped domain StIs from Pfam | n/a | |
crossres_mappedcath | Merged cross-mapped and residue-mapped domain StIs from CATH | n/a | |
unmap_pfam | All Pfam un-mapped domain StIs | n/a | |
allmap_pfam | All Pfam domain StIs cross-mapped to CATH family-wise | n/a | |
unmap_cath | All un-mapped domain StIs from CATH | n/a | |
allmap_cath | All CATH cross-mapped domain StIs family-wise together | n/a | |
pfam_crossmap_cath_avg | Average structures per cross-mapped CATH family for Pfam StIs at family level | n/a | |
cath_crossmap_pfam_avg | Average structures per cross-mapped Pfam family for CATH StIs at family level | n/a | |
avg_alignment_result | Alignment results from Kpax for all cross-mapped families | n/a | |
next_parmfile | Parameter file for next iteration of the workflow | n/a | |
align_unmap_pfam | Alignment results for Pfam unmapped instances | n/a | |
unmap_pfam_passed | Domain-like StIs from Pfam | n/a | |
unmap_pfam_failed | Failed domain StIs from Pfam | n/a | |
align_unmap_cath | Alignment results for CATH unmapped instances | n/a | |
unmap_cath_passed | Domain-like StIs from CATH | n/a | |
unmap_cath_failed | Failed domain StIs from CATH | n/a | |
crossmap_pfam | Pfam domain StIs cross-mapped to CATH family-wise | n/a | |
crossmap_cath | CATH domain StIs cross-mapped to Pfam family-wise | n/a | |
Version History
- v1.1 (latest, frozen, commit b5a9d4b): created 20th Jun 2023 at 13:06 by Hrishikesh Dhondge. Pfam v35.0 and Results_archive for publication.
- main @ 9f38328 (earliest, frozen): created 28th Sep 2022 at 12:34 by Hrishikesh Dhondge. Updated input parameter file.
Created: 28th Sep 2022 at 12:34
Last updated: 20th Jun 2023 at 13:06