Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks / Macocco, Iuri. - (2023 Oct 26).

Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks

MACOCCO, IURI
2023-10-26

Abstract

Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are among the most important analysis tools. The Intrinsic Dimension provides a measure of the dimensionality of the hidden manifold from which data are sampled, even when the manifold is embedded in a space with a much higher number of features. The present Thesis tackles the still open problem of computing the Intrinsic Dimension (ID) of spaces characterised by non-Euclidean metrics. In particular, we focus on datasets where the distances between points are measured by means of Manhattan, Hamming or shortest-path metrics and can therefore only assume discrete values. This peculiarity has deep consequences for the way datapoints populate neighbourhoods and for the structure of the manifold. For this reason we develop a general-purpose, nearest-neighbours-based ID estimator with two distinctive features: the ability to select explicitly the scale at which the Intrinsic Dimension is computed, and a validation procedure that checks the reliability of the resulting estimate. We then specialise the estimator to lattice spaces, where volumes are measured by means of Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we apply it to genomic sequences and discover an unexpectedly low ID, suggesting that evolutionary pressure places strong constraints on the way nucleotide bases are allowed to mutate. The same framework is then employed to profile the scaling of the ID of unweighted networks. The diversity of the resulting ID profiles prompted us to use them as signatures to characterise the networks. Concretely, we employ the ID as a summary statistic within an Approximate Bayesian Computation framework in order to pinpoint the parameters of mechanistic network generative models of increasing complexity. We find that, by targeting the ID of a given network, other typical network properties are also fairly well recovered. As a last methodological development, we improve the ID estimator by adaptively selecting, for each datapoint, the largest neighbourhood with an approximately constant density. This offers a quantitative criterion to automatically select a meaningful scale at which the ID is computed and, at the same time, enforces the hypotheses of the method, leading to more reliable estimates.
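As an illustration of the lattice-volume idea, the following Python sketch is a simplified, hypothetical reconstruction and is not taken from the thesis: the Ehrhart polynomial of the Manhattan ball counts the points of Z^d within a given radius, and a scale-dependent ID can be read off by matching the pooled ratio of neighbour counts at two radii to the corresponding ratio of ball volumes. Function names are invented, and the moment-matching inversion stands in for the binomial-likelihood estimator and validation procedure summarised above.

import numpy as np
from math import comb
from scipy.optimize import brentq
from scipy.spatial.distance import cdist

def lattice_ball_volume(r, d):
    # Ehrhart polynomial of the L1 (Manhattan) ball: number of points of Z^d
    # within distance r of a centre.  The binomial coefficient in d is written
    # as a falling factorial so that the dimension can vary continuously.
    vol = 0.0
    for i in range(r + 1):
        binom_d = 1.0
        for j in range(i):
            binom_d *= (d - j) / (j + 1)
        vol += (2.0 ** i) * binom_d * comb(r, i)
    return vol

def estimate_id(X, r_inner, r_outer):
    # Scale-dependent ID from neighbour counts at two integer radii: pool the
    # counts over all points and find the dimension whose volume ratio matches
    # the observed ratio (the centre point is excluded, hence the "- 1.0").
    dist = cdist(X, X, metric="cityblock")
    np.fill_diagonal(dist, np.inf)
    ratio = np.sum(dist <= r_inner) / np.sum(dist <= r_outer)
    f = lambda d: (lattice_ball_volume(r_inner, d) - 1.0) / \
                  (lattice_ball_volume(r_outer, d) - 1.0) - ratio
    return brentq(f, 1e-3, 1e3)

# Toy check: 3-dimensional lattice data embedded in 20 ambient coordinates.
rng = np.random.default_rng(0)
X = np.zeros((2000, 20), dtype=int)
X[:, :3] = rng.integers(0, 30, size=(2000, 3))
print(estimate_id(X, r_inner=2, r_outer=4))   # close to the true dimension, 3

In the same spirit, the use of the ID as a summary statistic within Approximate Bayesian Computation can be sketched as a plain rejection-ABC loop. The toy example below fits the edge probability of an Erdős-Rényi model and, to keep it self-contained, uses the mean degree as a stand-in for the ID profile employed in the thesis; names, priors and tolerances are assumptions of this sketch.

import numpy as np
import networkx as nx

def rejection_abc(observed, prior_sampler, simulate, summary,
                  n_draws=3000, epsilon=0.3, seed=0):
    # Plain rejection ABC: keep the parameter draws whose simulated summary
    # statistic falls within epsilon of the observed one.
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        if abs(summary(simulate(theta, rng)) - observed) < epsilon:
            accepted.append(theta)
    return np.array(accepted)

# Target network and stand-in summary statistic (mean degree instead of the ID).
n_nodes = 200
graph_obs = nx.erdos_renyi_graph(n_nodes, 0.05, seed=1)
mean_degree = lambda g: np.mean([deg for _, deg in g.degree()])

posterior = rejection_abc(
    observed=mean_degree(graph_obs),
    prior_sampler=lambda rng: rng.uniform(0.0, 0.2),
    simulate=lambda p, rng: nx.erdos_renyi_graph(n_nodes, p, seed=int(rng.integers(1_000_000))),
    summary=mean_degree,
)
print(posterior.mean())   # concentrates around the true edge probability, 0.05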
Laio, Alessandro
Grilli, Jacopo
Macocco, Iuri
Files in this item:
File: Macocco_PhD_Thesis.pdf (under embargo until 31/05/2024)
Type: Thesis
Licence: Creative Commons
Size: 11.8 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/134630