(s)PLS-DA parameter tuning - system is computationally singular

Hi folks!

Hope you are all well, thanks for being active in the forums.

Continuing the discussion from Perf function: error in solve.default(Sr): system is computationally singular:

I’m also having an issue with my datasets when running perf for (s)PLS-DA: it returns the error “system is computationally singular: reciprocal condition number =”.

I’ve read through the possible causes. I tried removing near-zero-variance variables with near.zero.var in the plsda function. As for too many components being the cause, it does run with ncomp <= 4 - but the plot looks like this:

Perhaps suggesting 1 component? I’m working with a low number of samples (6 overall) per dataset, and I have 3 different omics datasets that I’m investigating before attempting integration.

Here’s the code I’ve used for the perf:

MyResult.plsda <- plsda(data, factors, ncomp = 4, near.zero.var = TRUE)
MyPerf.plsda <- perf(MyResult.plsda, validation = "loo",
                     dist = c("max.dist", "centroids.dist"),
                     progressBar = FALSE, nrepeat = 1)
plot(MyPerf.plsda, sd = TRUE, legend.position = "horizontal")

The only thing I have not investigated is “potentially variables that are almost exactly the same (multicollinearity)”. I don’t have much of a clue how to deal with this - apologies, I’m just starting out.

Thanks for any assistance offered.

Sam

I have a pretty good idea what your issue is here.

My masters focused on PLS-DA and other supervised methods for small class sizes such as 6, so I’m quite familiar with the issues that pop up in this package with low sample sizes.

My guess is that the reciprocal condition number error appeared because the sample covariance matrix cannot be inverted when using the Mahalanobis distance. Try not specifying that distance in the function and, for example, only using the centroid distance.

But in all seriousness, using leave-one-out with a sample size this small is very impractical.

However, given you really want to use cross-validation, one option is to attempt a bootstrap cross-validation by writing the code manually. This may be tricky, depending on your skill level in R. I specifically recommend the bootstrap zero estimator, which, while biased, will at least somewhat work with a class size this small; see the sketch below.
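A rough sketch of what I mean, not a definitive recipe - it assumes mixOmics is loaded and that `data` (predictors) and `factors` (class labels) are the objects from your post; the number of resamples and components are arbitrary example values:

# Leave-one-out bootstrap ("zero") error estimate - a sketch, assuming `data`
# (samples x predictors) and `factors` (class labels) from the post above.
library(mixOmics)

set.seed(123)
n_boot <- 100                      # number of bootstrap resamples (arbitrary)
ncomp  <- 2                        # components to assess (arbitrary)
n      <- nrow(data)
err    <- rep(NA, n_boot)

for (b in seq_len(n_boot)) {
  idx     <- sample(n, replace = TRUE)             # bootstrap training indices
  out_bag <- setdiff(seq_len(n), unique(idx))      # held-out (out-of-bag) samples
  if (length(out_bag) == 0 || length(unique(factors[idx])) < 2) next

  fit  <- plsda(data[idx, , drop = FALSE], factors[idx],
                ncomp = ncomp, near.zero.var = TRUE)
  pred <- predict(fit, newdata = data[out_bag, , drop = FALSE],
                  dist = "centroids.dist")
  err[b] <- mean(pred$class$centroids.dist[, ncomp] != factors[out_bag])
}

mean(err, na.rm = TRUE)            # bootstrap zero estimate of the error rate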

You could also choose your components without cross-validation by using the full 6 samples and finding which component’s score vector t (a column of the score matrix T, i.e. X projected onto its loadings) has the highest correlation with your response Y —> cor(Y, t). A sketch of that idea is below.
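Something along these lines - again a sketch, assuming the `data` and `factors` objects from your post, and that (as in mixOmics fitted objects) the X scores sit in `$variates$X` and `unmap()` dummy-codes the class labels:

# Fit on all 6 samples and correlate each component's score vector with the
# dummy-coded response - a sketch, not a definitive recipe.
library(mixOmics)

fit      <- plsda(data, factors, ncomp = 4, near.zero.var = TRUE)
T_scores <- fit$variates$X         # n x ncomp matrix of X scores (t = Xw)
Y_dummy  <- unmap(factors)         # dummy-coded response matrix

# strongest absolute correlation between each score vector and the response columns
comp_cor <- apply(T_scores, 2, function(t) max(abs(cor(Y_dummy, t))))
round(comp_cor, 3)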

Hope this helps!


Hi,

Thanks for the help!

I’m also doing my masters project and this is the main focus, but it’s been tough getting to grips with all the concepts and getting stuff to work!

I did read about leaving out the Mahalanobis distance, and I think I’ve done that correctly (in the plot, for example, it only shows max.dist and centroids.dist), but it hasn’t really solved my problem.

That sounds very interesting, I’m quite new to R (and some of these concepts!) but hopefully I can try and figure it out!

That sounds like a great idea as well! Unfortunately, I’ve just discovered that the predictors in each of my datasets are highly correlated with one another, and I’m trying to find a workaround for this (for instance, the majority of predictor pairs have a correlation above 0.99).
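For reference, a quick base-R sketch for gauging how widespread that collinearity is (assuming `data` is the samples-by-predictors matrix from my earlier post):

# a sketch: proportion of predictor pairs with |correlation| above 0.99
cor_mat   <- cor(data)                          # pairwise predictor correlations
pair_cors <- abs(cor_mat[upper.tri(cor_mat)])   # unique off-diagonal values
mean(pair_cors > 0.99)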

Thank you very much for the response, it’s given me a lot to think about and hopefully I can get it solved really soon for my project! :joy: :sweat_smile:

Hi @studentScot,

Can you try feature selection using the splsda function and see if the issue still persists? Our feature selection algorithm is designed to efficiently deal with highly collinear variables.
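For example, something along these lines - a sketch using the objects from your first post; the keepX values are arbitrary and should be tuned (e.g. with tune.splsda) for your data:

# Sparse feature selection with splsda - keepX values are arbitrary examples.
library(mixOmics)

MyResult.splsda <- splsda(data, factors, ncomp = 2,
                          keepX = c(10, 10),       # variables kept per component
                          near.zero.var = TRUE)
selectVar(MyResult.splsda, comp = 1)$name          # variables selected on component 1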

Thanks

Al


Hi Al,

I gave it a go with splsda (code below) and I’ve got the same result for all three datasets I’m working with.

MyResult.splsda.trans2 <- splsda(t.trans_data, factors, ncomp = 4)
MyPerf.splsda.trans <- perf(MyResult.splsda.trans2, validation = "loo",
                            dist = c("max.dist", "centroids.dist"),
                            progressBar = FALSE, nrepeat = 1)
plot(MyPerf.splsda.trans, sd = TRUE, legend.position = "horizontal")

I’m having to give this problem less attention due to time constraints, but any further help would be much appreciated. If it turns out that performance tuning won’t work with my dataset, I’ll treat it as an exploratory analysis and note that in the report I’m writing.

Thanks for the assistance.