Stratified learning: A general-purpose statistical method for improved learning under covariate shift / Autenrieth, Maximilian; van Dyk, David A.; Trotta, Roberto; Stenning, David C. - In: STATISTICAL ANALYSIS AND DATA MINING. - ISSN 1932-1864. - (2023), pp. 1-16. [10.1002/sam.11643]

Stratified learning: A general-purpose statistical method for improved learning under covariate shift

Maximilian Autenrieth; David A. van Dyk; Roberto Trotta
2023-01-01

Abstract

We propose a simple, statistically principled, and theoretically justified method to improve supervised learning when the training set is not representative, a situation known as covariate shift. We build upon a well-established methodology in causal inference, and show that the effects of covariate shift can be reduced or eliminated by conditioning on propensity scores. In practice, this is achieved by fitting learners within strata constructed by partitioning the data based on the estimated propensity scores, leading to approximately balanced covariates and much-improved target prediction. We demonstrate the effectiveness of our general-purpose method on two contemporary research questions in cosmology, outperforming state-of-the-art importance weighting methods. We obtain the best reported AUC (0.958) on the updated "Supernovae photometric classification challenge", and we improve upon existing conditional density estimation of galaxy redshift from Sloan Digital Sky Survey (SDSS) data.
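
The stratification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the one-dimensional logistic propensity model, the choice of five quantile-based strata, the per-stratum mean used as a placeholder "learner", the synthetic data, and all function names are assumptions made purely for illustration.

```python
import math
import random


def fit_logistic_1d(xs, labels, steps=500, lr=0.5):
    """Fit P(label=1 | x) = sigmoid(a + b*x) by plain gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, s in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - s
            grad_b += (p - s) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return lambda x: 1.0 / (1.0 + math.exp(-(a + b * x)))


def stratum_edges(scores, k):
    """Interior cut points splitting the scores into k quantile-based strata."""
    ordered = sorted(scores)
    return [ordered[len(ordered) * j // k] for j in range(1, k)]


def assign_stratum(score, edges):
    """Index (0..k-1) of the stratum that a propensity score falls into."""
    return sum(score >= e for e in edges)


def stratified_predict(x_source, y_source, x_target, k=5):
    # Step 1: estimate propensity scores e(x) = P(point is in source | x),
    # using labeled source data (1) and unlabeled target data (0).
    xs = x_source + x_target
    in_source = [1] * len(x_source) + [0] * len(x_target)
    propensity = fit_logistic_1d(xs, in_source)

    # Step 2: partition the pooled data into k strata by estimated score.
    edges = stratum_edges([propensity(x) for x in xs], k)
    src_strata = [assign_stratum(propensity(x), edges) for x in x_source]
    tgt_strata = [assign_stratum(propensity(x), edges) for x in x_target]

    # Step 3: fit one learner per stratum on source data only.
    # Assumption: the "learner" is just the stratum mean of y; an empty
    # stratum falls back to the overall source mean.
    global_mean = sum(y_source) / len(y_source)
    models = []
    for j in range(k):
        ys = [y for y, s in zip(y_source, src_strata) if s == j]
        models.append(sum(ys) / len(ys) if ys else global_mean)

    # Step 4: predict each target point with its own stratum's learner.
    return [models[s] for s in tgt_strata], tgt_strata


# Synthetic covariate shift: source centred at 0, target centred at 1.
random.seed(0)
x_source = [random.gauss(0.0, 1.0) for _ in range(200)]
y_source = [x * x + random.gauss(0.0, 0.1) for x in x_source]
x_target = [random.gauss(1.0, 1.0) for _ in range(200)]

predictions, strata = stratified_predict(x_source, y_source, x_target, k=5)
print(predictions[:3], strata[:3])
```

In a realistic setting the per-stratum mean would be replaced by whatever supervised learner is being used, and the propensity model would take the full covariate vector rather than a single feature.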
Year: 2023
Pages: 1-16
DOI: 10.1002/sam.11643
arXiv: http://arxiv.org/abs/2106.11211v2
Authors: Autenrieth, Maximilian; van Dyk, David A.; Trotta, Roberto; Stenning, David C.
Files in this item:
File: Statistical Analysis - 2023 - Autenrieth.pdf
Access: open access
Description: publisher's PDF
Type: Publisher's version (PDF)
License: Creative Commons
Size: 2.4 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.11767/134310