Ranking the information content of distance measures / Glielmo, Aldo; Zeni, Claudio; Cheng, Bingqing; Csányi, Gábor; Laio, Alessandro. - In: PNAS NEXUS. - ISSN 2752-6542. - 1:2(2022). [10.1093/pnasnexus/pgac039]

Ranking the information content of distance measures

Glielmo, Aldo; Zeni, Claudio; Cheng, Bingqing; Csányi, Gábor; Laio, Alessandro

2022-01-01

Abstract

Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and also units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Finding a small set of features that still retains sufficient information about the dataset is important for the successful application of many statistical learning approaches. We introduce a statistical test that can assess the relative information retained when using 2 different distance measures, and determine if they are equivalent, independent, or if one is more informative than the other. This ranking can in turn be used to identify the most informative distance measure and, therefore, the most informative set of features, out of a pool of candidates. To illustrate the general applicability of our approach, we show that it reproduces the known importance ranking of policy variables for Covid-19 control, and also identifies compact yet informative descriptors for atomic structures. We further provide initial evidence that the information asymmetry measured by the proposed test can be used to infer relationships of causality between the features of a dataset. The method is general and should be applicable to many branches of science.
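
As a rough illustration of what comparing two distance measures can look like in practice, the sketch below computes a nearest-neighbor rank statistic: for each point it finds the nearest neighbor under one distance and asks how highly that same neighbor is ranked under the other. This record does not give the definition of the test, so the statistic, the function names (`pairwise_distances`, `neighbor_ranks`, `rank_imbalance`), and the synthetic data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X (shape: n_points x n_features)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def neighbor_ranks(D):
    """ranks[i, j] = rank of point j among the neighbors of point i (1 = nearest).

    The point itself is pushed to the last rank by setting the diagonal to infinity.
    """
    D = D.copy()
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    n = D.shape[0]
    ranks = np.empty_like(order)
    rows = np.arange(n)[:, None]
    ranks[rows, order] = np.arange(1, n + 1)[None, :]
    return ranks


def rank_imbalance(D_a, D_b):
    """Hypothetical statistic: (2 / n) * mean rank under B of each point's nearest neighbor under A."""
    n = D_a.shape[0]
    D = D_a.copy()
    np.fill_diagonal(D, np.inf)
    nearest_a = np.argmin(D, axis=1)
    ranks_b = neighbor_ranks(D_b)
    return 2.0 / n * ranks_b[np.arange(n), nearest_a].mean()


# Synthetic data (an assumption for this sketch): two independent uniform features.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))

D_full = pairwise_distances(X)         # distance built from both features
D_first = pairwise_distances(X[:, :1])  # distance built from the first feature only

print("full -> first:", rank_imbalance(D_full, D_first))
print("first -> full:", rank_imbalance(D_first, D_full))
```

In this sketch, a value near 0 suggests that neighborhoods under the first distance largely predict neighborhoods under the second, while a value near 1 suggests they carry little information about each other. Comparing the two directions gives the kind of asymmetry the abstract refers to: here the distance built from both features should score lower when predicting the single-feature distance than the other way round.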

Year: 2022
Journal: PNAS NEXUS
Volume: 1
Issue: 2
Article number: pgac039
DOI: https://dx.doi.org/10.1093/pnasnexus/pgac039
Full text via DOI: 10.1093/pnasnexus/pgac039
All authors: Glielmo, Aldo; Zeni, Claudio; Cheng, Bingqing; Csányi, Gábor; Laio, Alessandro
Appears in types: 1.1 Journal article

Files in this record:

There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/131770
