Missingness threshold for participants for PLS-DA and DIABLO

Question around participant data inclusion

I’m aware that the rough inclusion threshold for missingness in continuous variables (i.e. matrix columns) is 20%. Is there a similar rough guide for missingness in participants (i.e. matrix rows)?
I’m working with clinical, biochemical and survey data, which are quite patchy because some participants didn’t complete the survey questions or refused blood draws.
Is a cut-off of 50% missingness per participant acceptable?
For now I’m using this for PLS-DA and DIABLO with just these kinds of variables, but down the track we’ll be including other omics layers such as transcriptomics and rare and common variants.


You can use NIPALS to impute the missing values, but above 20% missingness might be pushing it (i.e. the imputed values may be wrong or biased, because each one is estimated from only a small number of samples).
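
To make the idea concrete, here is a minimal sketch of NIPALS-style imputation in Python with NumPy. It is only an illustration of the principle, not the mixOmics implementation: the function name `nipals_impute`, its parameters, and the choice to centre columns on their observed means are all assumptions made for this example. Each component is fitted using observed entries only, and the missing cells are then filled with the low-rank reconstruction.

```python
import numpy as np

def nipals_impute(X, ncomp=2, max_iter=500, tol=1e-9):
    """Fill NaNs with a rank-`ncomp` NIPALS reconstruction (illustrative only)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    w = observed.astype(float)  # 1 where a value was measured, 0 where missing
    # Assumption: centre each column on the mean of its observed values.
    col_means = np.nanmean(X, axis=0)
    residual = np.where(observed, X - col_means, 0.0)  # missing cells carry no weight
    reconstruction = np.zeros_like(residual)

    for _comp in range(ncomp):
        # Start from the column with the largest observed sum of squares.
        t = residual[:, np.argmax((residual ** 2).sum(axis=0))].copy()
        for _it in range(max_iter):
            # Loadings: regress each column on t, using observed rows only.
            p = (residual.T @ t) / np.maximum(w.T @ (t ** 2), 1e-12)
            p /= max(np.linalg.norm(p), 1e-12)
            # Scores: regress each row on p, using observed columns only.
            t_new = (residual @ p) / np.maximum(w @ (p ** 2), 1e-12)
            done = np.linalg.norm(t_new - t) < tol * max(np.linalg.norm(t_new), 1e-12)
            t = t_new
            if done:
                break
        component = np.outer(t, p)
        reconstruction += component
        # Deflate the observed entries only.
        residual = np.where(observed, residual - component, 0.0)

    # Observed values are kept untouched; only missing cells are replaced.
    return np.where(observed, X, reconstruction + col_means)
```

Note that each imputed cell depends on the few observed values in its row and column, which is why heavy missingness makes the filled-in values unreliable.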

For PLS-DA and DIABLO I think you won’t be able to tune the models with missing values, so either do not tune, or impute first. Try varying the NA threshold and see how the results change.
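
As a rough sketch of that sensitivity check, the Python/NumPy snippet below filters a data matrix by column and then row missingness, and reports how many participants survive at each row threshold. The function names, the default thresholds, and the order of filtering (variables first, then participants) are assumptions for illustration, not a recommendation from any package. You would then refit the model on each filtered matrix and compare the results.

```python
import numpy as np

def filter_by_missingness(X, max_col_na=0.2, max_row_na=0.5):
    """Drop columns, then rows, whose NaN fraction exceeds the given thresholds."""
    X = np.asarray(X, dtype=float)
    # Assumption: filter variables (columns) before participants (rows).
    keep_cols = np.isnan(X).mean(axis=0) <= max_col_na
    X_cols = X[:, keep_cols]
    keep_rows = np.isnan(X_cols).mean(axis=1) <= max_row_na
    return X_cols[keep_rows], keep_rows, keep_cols

def sweep_row_thresholds(X, thresholds=(0.1, 0.2, 0.3, 0.4, 0.5), max_col_na=0.2):
    """For each row threshold, report (threshold, participants kept, NaN fraction left)."""
    summary = []
    for thr in thresholds:
        kept, keep_rows, _ = filter_by_missingness(X, max_col_na, thr)
        na_left = float(np.isnan(kept).mean()) if kept.size else 0.0
        summary.append((thr, int(keep_rows.sum()), na_left))
    return summary
```

If the number of participants kept, or the downstream model output, changes sharply between two thresholds, that is a sign the result is sensitive to the cut-off.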
