Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks / Macocco, Iuri. - (2023 Oct 26).

Intrinsic Dimension Estimation for non-Euclidean manifolds: from metagenomics to unweighted networks

MACOCCO, IURI
2023-10-26

Abstract

Within the field of unsupervised manifold learning, Intrinsic Dimension estimators are among the most important analysis tools. The Intrinsic Dimension provides a measure of the dimensionality of the hidden manifold from which data are sampled, even when the manifold is embedded in a space with a much higher number of features. The present Thesis tackles the still open problem of computing the Intrinsic Dimension (ID) of spaces characterised by non-Euclidean metrics. In particular, we focus on datasets where the distances between points are measured by means of Manhattan, Hamming or shortest-path metrics and can therefore only assume discrete values. This peculiarity has deep consequences for the way datapoints populate neighbourhoods and for the structure of the manifold. For this reason we develop a general-purpose, nearest-neighbours-based ID estimator with two distinctive features: the ability to select explicitly the scale at which the Intrinsic Dimension is computed, and a validation procedure that checks the reliability of the resulting estimate. We then specialise the estimator to lattice spaces, where volumes are measured by means of Ehrhart polynomials. After testing the reliability of the estimator on artificial datasets, we apply it to genomic sequences and discover an unexpectedly low ID, suggesting that evolutionary pressure places strong constraints on the way nucleotide bases are allowed to mutate. The same framework is then employed to profile the scaling of the ID of unweighted networks. The diversity of the resulting ID profiles prompted us to use them as signatures to characterise the networks. Concretely, we employ the ID as a summary statistic within an Approximate Bayesian Computation framework in order to pinpoint the parameters of mechanistic network generative models of increasing complexity. We find that, by targeting the ID of a given network, other typical network properties are also fairly well recovered. As a last methodological development, we improve the ID estimator by adaptively selecting, for each datapoint, the largest neighbourhood with an approximately constant density. This offers a quantitative criterion to automatically select a meaningful scale at which the ID is computed and, at the same time, enforces the hypotheses of the method, leading to more reliable estimates.
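As an illustration of the lattice-volume idea, the following Python sketch is a simplified, hypothetical reconstruction and is not taken from the thesis: the Ehrhart polynomial of the Manhattan ball counts the points of Z^d within a given radius, and a scale-dependent ID can be read off by matching the pooled ratio of neighbour counts at two radii to the corresponding ratio of ball volumes. Function names are invented, and the moment-matching inversion stands in for the binomial-likelihood estimator and validation procedure summarised above.

import numpy as np
from math import comb
from scipy.optimize import brentq
from scipy.spatial.distance import cdist

def lattice_ball_volume(r, d):
    # Ehrhart polynomial of the L1 (Manhattan) ball: number of points of Z^d
    # within distance r of a centre.  The binomial coefficient in d is written
    # as a falling factorial so that the dimension can vary continuously.
    vol = 0.0
    for i in range(r + 1):
        binom_d = 1.0
        for j in range(i):
            binom_d *= (d - j) / (j + 1)
        vol += (2.0 ** i) * binom_d * comb(r, i)
    return vol

def estimate_id(X, r_inner, r_outer):
    # Scale-dependent ID from neighbour counts at two integer radii: pool the
    # counts over all points and find the dimension whose volume ratio matches
    # the observed ratio (the centre point is excluded, hence the "- 1.0").
    dist = cdist(X, X, metric="cityblock")
    np.fill_diagonal(dist, np.inf)
    ratio = np.sum(dist <= r_inner) / np.sum(dist <= r_outer)
    f = lambda d: (lattice_ball_volume(r_inner, d) - 1.0) / \
                  (lattice_ball_volume(r_outer, d) - 1.0) - ratio
    return brentq(f, 1e-3, 1e3)

# Toy check: 3-dimensional lattice data embedded in 20 ambient coordinates.
rng = np.random.default_rng(0)
X = np.zeros((2000, 20), dtype=int)
X[:, :3] = rng.integers(0, 30, size=(2000, 3))
print(estimate_id(X, r_inner=2, r_outer=4))   # close to the true dimension, 3

In the same spirit, the use of the ID as a summary statistic within Approximate Bayesian Computation can be sketched as a plain rejection-ABC loop. The toy example below fits the edge probability of an Erdős-Rényi model and, to keep it self-contained, uses the mean degree as a stand-in for the ID profile employed in the thesis; names, priors and tolerances are assumptions of this sketch.

import numpy as np
import networkx as nx

def rejection_abc(observed, prior_sampler, simulate, summary,
                  n_draws=3000, epsilon=0.3, seed=0):
    # Plain rejection ABC: keep the parameter draws whose simulated summary
    # statistic falls within epsilon of the observed one.
    rng = np.random.default_rng(seed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)
        if abs(summary(simulate(theta, rng)) - observed) < epsilon:
            accepted.append(theta)
    return np.array(accepted)

# Target network and stand-in summary statistic (mean degree instead of the ID).
n_nodes = 200
graph_obs = nx.erdos_renyi_graph(n_nodes, 0.05, seed=1)
mean_degree = lambda g: np.mean([deg for _, deg in g.degree()])

posterior = rejection_abc(
    observed=mean_degree(graph_obs),
    prior_sampler=lambda rng: rng.uniform(0.0, 0.2),
    simulate=lambda p, rng: nx.erdos_renyi_graph(n_nodes, p, seed=int(rng.integers(1_000_000))),
    summary=mean_degree,
)
print(posterior.mean())   # concentrates around the true edge probability, 0.05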
Laio, Alessandro
Grilli, Jacopo
Macocco, Iuri
Files in this item:
File: Macocco_PhD_Thesis.pdf (under embargo until 31/05/2024)
Type: Thesis
Licence: Creative Commons
Size: 11.8 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/134630