Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges / Sacco, Giuseppe; Bussi, Giovanni; Sanguinetti, Guido. - In: RNA. - ISSN 1355-8382. - 32:4(2026), pp. 443-456. [10.1261/rna.080840.125]

Machine Learning for RNA Secondary Structure Prediction: a review of current methods and challenges

Sacco, Giuseppe; Bussi, Giovanni; Sanguinetti, Guido
2026-01-01

Abstract

Predicting the secondary structure of RNA is a core challenge in computational biology, essential for understanding molecular function and designing novel therapeutics. The field has evolved from foundational but accuracy-limited thermodynamic approaches to a new data-driven paradigm dominated by machine learning and deep learning. These models learn folding patterns directly from data, leading to significant performance gains. This review surveys the modern landscape of these methods, covering single-sequence, evolutionary-based, and hybrid models that blend machine learning with biophysics. A central theme is the field's "generalization crisis," where powerful models were found to fail on new RNA families, prompting a community-wide shift to stricter, homology-aware benchmarking. In response to the underlying challenge of data scarcity, RNA foundation models have emerged, learning from massive, unlabeled sequence corpora to improve generalization. Finally, we look ahead to the next set of major hurdles, including the accurate prediction of complex motifs like pseudoknots, scaling to kilobase-length transcripts, incorporating the chemical diversity of modified nucleotides, and shifting the prediction target from static structures to the dynamic ensembles that better capture biological function. We also highlight the need for a standardized, prospective benchmarking system to ensure unbiased validation and accelerate progress.
Year: 2026
Journal: RNA
Volume: 32
Issue: 4
Pages: 443-456
DOI: 10.1261/rna.080840.125
arXiv: https://arxiv.org/abs/2511.02622
Files in this item:
File: RNA-2026-Sacco-rna.080840.125.pdf
Access: open access
Description: postprint
Type: post-print document
License: Creative Commons
Size: 899 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/149890