The intrinsic dimension of protein sequence evolution / Facco, E.; Pagnani, A.; Russo, Elena Tea; Laio, A. - In: PLOS COMPUTATIONAL BIOLOGY. - ISSN 1553-734X. - 15:4(2019), pp. 1-16. [10.1371/journal.pcbi.1006767]

The intrinsic dimension of protein sequence evolution

Facco, E.; Russo, Elena Tea; Laio, A.
2019-01-01

Abstract

It is well known that, to preserve its structure and function, a protein cannot change its sequence at random: mutations occur preferentially at specific locations. Here we quantify the amount of variability allowed in protein sequence evolution by computing the intrinsic dimension (ID) of the sequences belonging to a selection of protein families. The ID measures the number of independent directions that evolution can take starting from a given sequence. We find that the ID is practically constant for sequences belonging to the same family and, moreover, very similar across different families, with values ranging between 6 and 12. These values are significantly smaller than the raw number of amino acids, confirming the importance of correlations between mutations at different sites. However, we demonstrate that correlations alone are not sufficient to explain the small ID observed in protein families. Indeed, we show that the ID of a set of protein sequences generated by maximum entropy models, an approach that accounts for correlations, is typically significantly larger than the value observed in natural protein families. We further prove that taking the phylogeny of the sequences into account is a critical factor in reproducing the natural ID.
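As an illustration of what estimating an intrinsic dimension involves, the sketch below implements a maximum-likelihood variant of a two-nearest-neighbor ratio estimator. The abstract does not state which estimator or distance the authors use, so the choice of estimator, the function names `two_nn_id` and `demo`, and the toy data are all assumptions made for illustration. The estimator takes only a pairwise distance matrix, so in principle it could be fed distances between aligned sequences; here it is checked on synthetic points whose true dimension is known.

```python
import numpy as np


def two_nn_id(dist):
    """Estimate the intrinsic dimension from a square pairwise distance matrix.

    Illustrative sketch (not taken from the record above): for each point the
    ratio mu = r2 / r1 of the distances to its second and first nearest
    neighbors is assumed to follow a Pareto law with exponent equal to the
    intrinsic dimension d, whose maximum-likelihood estimate is
    d = N / sum(log(mu)).
    """
    d = np.array(dist, dtype=float)  # copy, so the caller's matrix is untouched
    np.fill_diagonal(d, np.inf)      # exclude each point's distance to itself
    # After partitioning at index 1, columns 0 and 1 hold the two smallest values.
    nearest = np.partition(d, 1, axis=1)[:, :2]
    r1, r2 = nearest[:, 0], nearest[:, 1]
    keep = r1 > 0                    # drop exact duplicates (ratio undefined)
    log_mu = np.log(r2[keep] / r1[keep])
    return keep.sum() / log_mu.sum()


def demo(n_points=1000, latent_dim=3, ambient_dim=20, seed=0):
    """Hypothetical toy data: a 3-dimensional cloud embedded in 20 coordinates."""
    rng = np.random.default_rng(seed)
    latent = rng.random((n_points, latent_dim))
    embed = rng.normal(size=(latent_dim, ambient_dim))
    x = latent @ embed
    # Pairwise Euclidean distances via the Gram matrix, clipped at zero
    # to absorb floating-point round-off.
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return two_nn_id(np.sqrt(d2))


if __name__ == "__main__":
    print(f"estimated ID: {demo():.2f} (data generated with 3 latent dimensions)")
```

Although each toy point has 20 coordinates, the estimate should come out close to 3, the number of independent directions used to generate the data. This mirrors the abstract's point that the ID can be far smaller than the raw number of coordinates.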
Year: 2019
Volume: 15
Issue: 4
First page: 1
Last page: 16
Article number: e1006767
https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006767
Facco, E.; Pagnani, A.; Russo, Elena Tea; Laio, A.
Files in this item:
File: journal.pcbi.1006767.pdf
Access: open access
Description: Open Access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 1.52 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/104316
Citations
  • PMC 7
  • Scopus 17
  • ISI 14