Unsupervised protein family classification by Density Peak clustering / Russo, Elena Tea. - (2020 Dec 21).

Unsupervised protein family classification by Density Peak clustering

Russo, Elena Tea
2020-12-21

Abstract

As the UniProt database approaches the 200-million-entry mark, the vast majority of the proteins it contains lack any experimental validation of their function. In this context, the identification of homologous relationships between proteins remains the single most widely applicable tool for generating functional and structural hypotheses in silico. Although many databases classify proteins and protein domains into homologous families, large sections of the sequence space remain unassigned. In this thesis we introduce DPCfam, a new unsupervised procedure that uses sequence alignments and Density Peak Clustering to automatically classify homologous protein regions. After a proof-of-principle experiment based on the analysis of two clans from the Pfam protein family database, we present an all-to-all clustering of the UniRef50 database, which contains ~23,000,000 proteins. For both cases we describe the DPCfam implementation; in particular, we present the strategies adopted to write a parallel and optimized implementation of the algorithm, which is needed to cluster the massive number of sequences in UniRef50. We develop specific measures to assess the quality of DPCfam's clusters, in terms of both boundaries and homology, using the Pfam database as a reference. Our tests indicate that DPCfam's automatically generated clusters are generally evolutionarily accurate, corresponding to one or more Pfam families, and that they cover a significant fraction of known homologs. Moreover, we find possible candidates for new families (around 14,000 when clustering UniRef50). Overall, DPCfam shows potential both for assisting manual annotation efforts (domain discovery, detection of classification inconsistencies, improvement of family coverage and boosting of clan membership) and as a stand-alone tool for unsupervised classification of sparsely annotated protein datasets such as those from environmental metagenomics studies (domain discovery, analysis of domain diversity).
21 Dec 2020
Laio, Alessandro
Punta, Marco
Russo, Elena Tea
Files in this item:
File: ETR-thesis-17-12-2020.pdf
Access: Open Access since 2021-06-22
License: Not specified
Size: 27.27 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/116345