Missingness threshold for participants for PLS-DA and DIABLO

lizak · June 24, 2024, 2:01am

Question around participant data inclusion

I’m aware that the rough inclusion threshold for continuous variables (i.e. matrix columns) is at most 20% missingness. Is there a similar rough guide for missingness in participants (i.e. matrix rows)?
I’m working with clinical, biochemical and survey data, which is quite patchy, as some participants didn’t complete the survey questions or refused blood draws.
Is a cutoff of 50% missingness per participant acceptable?
For now I’m using this for PLS-DA and DIABLO with just these kinds of variables, but down the track we’ll be including other omics layers such as transcriptomics and rare and common variants.
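
To make the row versus column distinction concrete, here is a minimal sketch in plain Python of filtering a data matrix by missingness. The function names and the toy data are hypothetical, the 20% and 50% values are simply the figures mentioned in this post rather than recommendations, and filtering columns before rows is an assumption about the order of operations.

```python
import math

NA = math.nan  # stand-in for a missing measurement


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_fraction(values):
    """Fraction of missing entries in a sequence (0.0 if the sequence is empty)."""
    values = list(values)
    if not values:
        return 0.0
    return sum(is_missing(v) for v in values) / len(values)


def filter_by_missingness(matrix, max_col_missing=0.20, max_row_missing=0.50):
    """Drop variables (columns) first, then participants (rows).

    Assumption: columns are filtered before rows, so a participant is judged
    only on the variables that are actually kept.
    Returns (filtered_matrix, kept_row_indices, kept_col_indices).
    """
    n_cols = len(matrix[0])
    kept_cols = [
        j for j in range(n_cols)
        if missing_fraction(row[j] for row in matrix) <= max_col_missing
    ]
    filtered, kept_rows = [], []
    for i, row in enumerate(matrix):
        reduced = [row[j] for j in kept_cols]
        if missing_fraction(reduced) <= max_row_missing:
            filtered.append(reduced)
            kept_rows.append(i)
    return filtered, kept_rows, kept_cols


# Toy data: 5 participants (rows) x 4 variables (columns).
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, NA, 3.1, 4.1],
    [1.2, NA, 3.2, 4.2],
    [NA, NA, NA, 4.3],
    [1.4, 2.4, 3.4, 4.4],
]

filtered, kept_rows, kept_cols = filter_by_missingness(data)
print("kept variables:", kept_cols)
print("kept participants:", kept_rows)
```

In this toy example the second variable is 60% missing and is dropped, after which the fourth participant is missing two of the three remaining variables and is dropped as well.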

kimanh.lecao · June 27, 2024, 10:38pm

@lizak

You can use NIPALS to impute the missing values, but above 20% missingness might be pushing it (i.e. the imputed values may be wrong or biased because they are estimated from only a small number of samples).
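
For readers unfamiliar with the idea, the following is a rough pure-Python sketch of NIPALS-style imputation: components are estimated by alternating regressions that skip missing cells, and the missing cells are then filled from the low-rank reconstruction. It is not the mixOmics implementation; the function name `nipals_impute`, its parameters and the toy data are all hypothetical, and it assumes every variable has at least one observed value.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nipals_impute(X, n_components=2, max_iter=500, tol=1e-9):
    """Fill missing cells of X from a NIPALS low-rank reconstruction."""
    n, p = len(X), len(X[0])
    obs = [[not is_missing(X[i][j]) for j in range(p)] for i in range(n)]

    # Centre each column on the mean of its observed values.
    means = []
    for j in range(p):
        vals = [X[i][j] for i in range(n) if obs[i][j]]
        means.append(sum(vals) / len(vals))
    # Residual matrix; missing cells hold 0.0 and are never used or updated.
    R = [[X[i][j] - means[j] if obs[i][j] else 0.0 for j in range(p)]
         for i in range(n)]
    fitted = [[0.0] * p for _ in range(n)]

    for _ in range(n_components):
        # Start the scores from the column with the largest residual sum of squares.
        start = max(range(p), key=lambda j: sum(R[i][j] ** 2 for i in range(n)))
        t = [R[i][start] for i in range(n)]
        if sum(v * v for v in t) == 0.0:
            break  # nothing left to explain
        for _ in range(max_iter):
            # Loadings: regress each column on the scores, observed cells only.
            load = []
            for j in range(p):
                num = sum(R[i][j] * t[i] for i in range(n) if obs[i][j])
                den = sum(t[i] ** 2 for i in range(n) if obs[i][j])
                load.append(num / den if den > 0 else 0.0)
            norm = math.sqrt(sum(v * v for v in load))
            if norm == 0.0:
                break
            load = [v / norm for v in load]
            # Scores: regress each row on the loadings, observed cells only.
            new_t = []
            for i in range(n):
                num = sum(R[i][j] * load[j] for j in range(p) if obs[i][j])
                den = sum(load[j] ** 2 for j in range(p) if obs[i][j])
                new_t.append(num / den if den > 0 else 0.0)
            change = sum((a - b) ** 2 for a, b in zip(new_t, t))
            t = new_t
            if change < tol:
                break
        # Accumulate the reconstruction and deflate the observed residuals.
        for i in range(n):
            for j in range(p):
                fitted[i][j] += t[i] * load[j]
                if obs[i][j]:
                    R[i][j] -= t[i] * load[j]

    # Observed cells are returned untouched; only missing cells are filled.
    return [[X[i][j] if obs[i][j] else means[j] + fitted[i][j]
             for j in range(p)] for i in range(n)]


NA = math.nan
X = [
    [2.0, 4.1, 6.0],
    [1.0, 2.0, NA],
    [3.0, 6.2, 9.1],
    [4.0, NA, 12.0],
    [5.0, 9.9, 15.2],
]
X_imputed = nipals_impute(X, n_components=1)
```

Each loading and score in this sketch is estimated only from the cells that are observed, which is why heavy missingness leaves very little data behind each imputed value.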

For PLS-DA and DIABLO I think you won’t be able to tune the models with missing values, so either skip tuning or impute first. It is worth checking how the results change as you vary the NA threshold.
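
As a starting point for that kind of sensitivity check, here is a small sketch in plain Python that reports, for several participant-level NA thresholds, how many participants would be kept and what fraction of the remaining cells would still need imputing. The function name `threshold_summary`, the thresholds and the toy data are hypothetical; refitting the actual PLS-DA or DIABLO model at each threshold is not shown.

```python
import math

NA = math.nan  # stand-in for a missing measurement


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def threshold_summary(matrix, thresholds):
    """Summarise the effect of each row-missingness threshold."""
    summary = []
    for thr in thresholds:
        # Keep participants whose fraction of missing values is at most thr.
        kept = [
            row for row in matrix
            if sum(is_missing(v) for v in row) / len(row) <= thr
        ]
        n_cells = sum(len(row) for row in kept)
        n_na = sum(is_missing(v) for row in kept for v in row)
        summary.append({
            "threshold": thr,
            "participants_kept": len(kept),
            "fraction_to_impute": n_na / n_cells if n_cells else 0.0,
        })
    return summary


# Toy data: 4 participants with 0, 1, 2 and 3 missing values out of 4.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, NA, 3.1, 4.1],
    [NA, NA, 3.2, 4.2],
    [NA, NA, NA, 4.3],
]

results = threshold_summary(data, [0.2, 0.5, 0.8])
for entry in results:
    print(entry)
```

A looser threshold keeps more participants but raises the share of values that have to be imputed, which is the trade-off to watch when comparing model results across thresholds.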

Kim-Anh
