Please use this identifier to cite or link to this item: http://hdl.handle.net/2445/212913
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Lekadir, Karim, 1977- | -
dc.contributor.advisor | Camacho, Marina | -
dc.contributor.author | Brosten, Peter Hannagan | -
dc.date.accessioned | 2024-06-13T07:55:31Z | -
dc.date.available | 2024-06-13T07:55:31Z | -
dc.date.issued | 2023-06-30 | -
dc.identifier.uri | http://hdl.handle.net/2445/212913 | -
dc.description | Master's final projects, Master's in Fundamentals of Data Science, Faculty of Mathematics, Universitat de Barcelona. Academic year: 2022-2023. Advisors: Karim Lekadir and Marina Camacho | ca
dc.description.abstract | [en] Data drift is a problem in machine learning (ML) in which the characteristics of the input predictors change over time, leading to model degradation. However, the effects of data drift on ML models built from human exposome data have not yet been well described. This study aimed to investigate data drift in exposome data used in ML models of diabetes risk. Data from 7,521 UK Biobank participants with a diagnosis of diabetes, along with a proportional control group, from 2006 to 2010 were used to train several baseline ML models for diabetes prediction. A second cohort of 4,007 participants attending the follow-up assessment period from 2012 to 2013 was used to assess potential data drift over time. When evaluated on the second cohort, all baseline models showed significant performance degradation (i.e., average precision dropped by 15%, F1-score by 12%, recall by 15%, and precision by 10%). A suite of drift detection tests was run on the best-performing baseline models to identify possible signatures of three distinct kinds of data drift: covariate drift, label drift, and concept drift. Multivariate and univariate data-distribution-based detection methods identified covariate drift in features such as Birth Year, BMI, Frequency of Tiredness, and Lack of Education. A comparison of prevalence rates across time-ordered batches of the population found no severe label drift, although gradual label drift could not be ruled out. A model-aware concept drift detection method was employed, monitoring temporal changes in normalized Shapley contributions for the model’s input features. When predicting on the second cohort, this test detected drift, in the form of abnormal changes in feature contribution, for the Birth Year feature, and near-alerts for multiple other features. This study shows the potential for data drift to act as a driver of model degradation in exposome-based ML models and highlights the need for further research into the traceability of clinical AI/ML solutions. | ca
dc.format.extent | 59 p. | -
dc.format.mimetype | application/pdf | -
dc.language.iso | eng | ca
dc.rights | cc-by-nc-nd (c) Peter Hannagan Brosten, 2023 | -
dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/es/ | *
dc.source | Màster Oficial - Fonaments de la Ciència de Dades | -
dc.subject.classification | Aprenentatge automàtic | -
dc.subject.classification | Diabetis | -
dc.subject.classification | Dades massives | -
dc.subject.classification | Treballs de fi de màster | -
dc.subject.other | Machine learning | -
dc.subject.other | Diabetes | -
dc.subject.other | Big data | -
dc.subject.other | Master's thesis | -
dc.title | Exposome data drift: implications for machine learning based diabetes prediction | ca
dc.type | info:eu-repo/semantics/masterThesis | ca
dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca
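
The label drift check mentioned in the abstract (comparing prevalence rates across time-ordered batches) can be illustrated with a minimal sketch. The record does not state which statistical test the thesis used, so the two-proportion z-test below is an assumption chosen for illustration. The function names and the synthetic batches are hypothetical and are not taken from the thesis or from UK Biobank data.

```python
import math


def prevalence(labels):
    """Fraction of positive (1) labels in a batch of 0/1 labels."""
    return sum(labels) / len(labels)


def two_proportion_z_test(labels_a, labels_b):
    """Two-sided z-test for a difference in prevalence between two batches.

    Assumption: this test stands in for whatever comparison the thesis used.
    Returns (z, p_value).
    """
    n_a, n_b = len(labels_a), len(labels_b)
    p_a, p_b = prevalence(labels_a), prevalence(labels_b)
    # Pooled prevalence under the null hypothesis of no label drift.
    pooled = (sum(labels_a) + sum(labels_b)) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both batches are all-positive or all-negative: no difference to test.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Synthetic, hypothetical batches ordered in time (1 = diabetes, 0 = control).
early_batch = [1] * 50 + [0] * 50
late_batch = [1] * 52 + [0] * 48

z, p = two_proportion_z_test(early_batch, late_batch)
print(f"early prevalence={prevalence(early_batch):.2f}, "
      f"late prevalence={prevalence(late_batch):.2f}, z={z:.2f}, p={p:.3f}")
```

A small p-value would suggest that prevalence differs between the two batches. A test on two batches cannot by itself exclude slow, gradual drift across many batches, which is consistent with the caveat in the abstract.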
Appears in Collections: Màster Oficial - Fonaments de la Ciència de Dades

Files in This Item:
File | Description | Size | Format
tfm_brosten_peter_hannagan.pdf | Thesis report | 2.97 MB | Adobe PDF


This item is licensed under a Creative Commons License.