Perf.pls with missing data

Hello,
I am running a simple PLS1 analysis where my X matrix contains a reasonable level of missing data. When I try to tune the number of components using perf, I get the error message “Error: missing data in ‘X’ and/or ‘Y’. Use ‘nipals’ for dealing with NAs.”

Is it a reasonable solution to first impute my X matrix using impute.nipals and then run perf? I’m a little concerned as the results in pls1 and pls2 objects (see below) have quite different components from dimension 2 upwards.

# PLS fitted directly on X with missing values
pls1 <- pls(CGdat[, var_names_trans], CGdat$timeGrazed, scale = TRUE, ncomp = 20, mode = "classic")
# impute the missing values in X with NIPALS, then refit
X.impute <- impute.nipals(X = CGdat[, var_names_trans], ncomp = 20)
pls2 <- pls(X.impute, CGdat$time, scale = TRUE, ncomp = 20, mode = "classic")
# tuning number of components
perf.pls <- perf(pls2, validation = 'Mfold',
                 folds = 10, nrepeat = 5)

plot(perf.pls, criterion = 'R2')

I also get some very strange results for criteria other than R2, with SDs exploding after about 2/3 components. I assume this must be related to the missingness, but any advice is welcome. Note that my data set is relatively small:
dim(CGdat[, var_names_trans])
[1] 112 32

Hi @kirsty.hassall,

I can’t see your outputs, but yes, you should impute with NIPALS first. Also check beforehand that fewer than 20% of the values in your dataset are missing, as NIPALS can only do so much. If more than that are missing, then you may have to remove variables with too many missing values.
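
For anyone wanting to run that check, here is a minimal base R sketch. It uses a simulated matrix as a stand-in for CGdat[, var_names_trans]. Applying the 20% figure per variable is an assumption made for illustration (the guideline above is stated for the dataset as a whole), and the names prop_na_total, prop_na_var, threshold and X_kept are hypothetical.

```r
# Simulated stand-in for CGdat[, var_names_trans]: 112 samples x 32 variables
set.seed(1)
X <- matrix(rnorm(112 * 32), nrow = 112, ncol = 32,
            dimnames = list(NULL, paste0("V", 1:32)))
X[sample(length(X), 300)] <- NA   # scatter some missing values
X[sample(112, 60), "V5"] <- NA    # make one variable heavily missing

# Overall proportion of missing values in the matrix
prop_na_total <- mean(is.na(X))

# Proportion of missing values in each variable (column)
prop_na_var <- colMeans(is.na(X))

# Assumed cut-off: reuse the 20% guideline at the variable level
threshold <- 0.2
X_kept <- X[, prop_na_var <= threshold, drop = FALSE]

round(prop_na_total, 3)                       # overall missingness
names(prop_na_var)[prop_na_var > threshold]   # variables that would be dropped
dim(X_kept)                                   # what remains for imputation
```

The reduced matrix (X_kept here) is what would then be passed to impute.nipals.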

Regarding perf, you are potentially including a large number of components, and you should aim for a small number (I don't know anything about your data, but probably up to 10 is enough).
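
As a rough sketch of that suggestion, the workflow from the question can be rerun with a smaller upper bound on the number of components. The data below are simulated and only stand in for CGdat, and max_comp, pls.fit and perf.fit are hypothetical names. The calls mirror the ones in the original post with 10 in place of 20, so treat it as an illustration and not a recommended setting for any particular dataset.

```r
library(mixOmics)

# Simulated stand-in data (not the poster's CGdat): 112 samples x 32 variables
set.seed(42)
X <- matrix(rnorm(112 * 32), nrow = 112, ncol = 32,
            dimnames = list(paste0("S", 1:112), paste0("V", 1:32)))
Y <- X[, 1] - 0.5 * X[, 2] + rnorm(112, sd = 0.5)
X[sample(length(X), 200)] <- NA   # roughly 5-6% missing values

# Assumed upper bound, following the "up to 10" suggestion above
max_comp <- 10

# Impute with NIPALS, fit PLS on the imputed matrix, then cross-validate
X.impute <- impute.nipals(X = X, ncomp = max_comp)
pls.fit <- pls(X.impute, Y, scale = TRUE, ncomp = max_comp, mode = "classic")
perf.fit <- perf(pls.fit, validation = "Mfold", folds = 10, nrepeat = 5)

plot(perf.fit, criterion = "R2")
```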

Kim-Anh