Protein domains can be viewed as building blocks that are essential for understanding structure-function relationships in proteins. However, each domain database classifies protein domains using its own methodology. As a result, the boundaries between domains or families often differ from one database to another, which raises the question of how a domain should be defined and how its instances should be enumerated. The answer to this question cannot be found in a single database. Rather, expert integration and curation of several databases are required to refine the contours of a domain of interest, in a domain-centric approach. Here, we illustrate the role of 3-D structure in clarifying domain definition with the help of CroMaSt (“Cross-Mapper for Structural Domains”), a fully automated workflow that classifies all structural instances of a given domain into three categories (core, true and domain-like). CroMaSt is developed in Common Workflow Language (CWL) and takes advantage of two well-known and widely used domain databases, Pfam (sequence-based) and CATH (structure-based). It uses the domain definitions from Pfam and CATH, together with the SIFTS resource, to cross-map structural instances between the two sources. Structural alignments generated by Kpax make it possible to identify false-positive instances from each domain database. We tested CroMaSt on the RNA Recognition Motif (RRM), the most prevalent and diverse RNA-binding domain. Starting from the PF00076 and 3.30.70.330 domain families from Pfam and CATH respectively, our workflow identifies 882 core, 966 true and 344 domain-like structural instances. The information generated by this method will play a crucial role in machine learning methods applied to domain-specific synthetic biology.
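The three-way classification described above can be illustrated with a minimal Python sketch. It is not the CroMaSt implementation (which is written in CWL), and the abstract does not spell out the exact criteria, so the sketch rests on labeled assumptions: an instance is treated as "core" when a Pfam instance and a CATH instance overlap on the same chain, and the remaining instances are split into "true" and "domain-like" by a structural similarity score. The names `Instance`, `overlaps`, `classify`, the `score` callable (a stand-in for a Kpax-derived score) and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instance:
    """One structural instance of a domain (hypothetical record layout)."""
    pdb_id: str
    chain: str
    start: int  # first residue of the domain, inclusive
    end: int    # last residue of the domain, inclusive


def overlaps(a: Instance, b: Instance, min_fraction: float = 0.5) -> bool:
    """True if a and b lie on the same chain and share enough residues.

    The 0.5 cut-off is an assumption made for this sketch.
    """
    if (a.pdb_id, a.chain) != (b.pdb_id, b.chain):
        return False
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    shorter = min(a.end - a.start, b.end - b.start) + 1
    return shared > 0 and shared / shorter >= min_fraction


def classify(
    pfam: List[Instance],
    cath: List[Instance],
    score: Callable[[Instance], float],
    threshold: float = 0.5,
) -> Dict[str, List[Instance]]:
    """Split instances into core, true and domain-like lists.

    `score` stands in for a structural-alignment score against the core
    set; how such a score is derived from Kpax output is not modeled here.
    """
    # Assumed rule: core = Pfam instances confirmed by an overlapping CATH instance.
    core = [p for p in pfam if any(overlaps(p, c) for c in cath)]
    # Instances reported by only one of the two databases.
    unmapped = [p for p in pfam if not any(overlaps(p, c) for c in cath)]
    unmapped += [c for c in cath if not any(overlaps(c, p) for p in pfam)]
    # Assumed rule: a high structural score promotes an unmapped instance to "true".
    true = [u for u in unmapped if score(u) >= threshold]
    domain_like = [u for u in unmapped if score(u) < threshold]
    return {"core": core, "true": true, "domain_like": domain_like}
```

Keeping the scoring step behind a callable keeps the cross-mapping logic independent of whichever structural aligner supplies the scores.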
Created: 3rd Oct 2022 at 11:29