Perf function: error in solve.default(Sr): system is computationally singular

Dear mixOmics team,

Thank you for your hard work in creating this package.

I’m trying to integrate two -omics datasets and was attempting to use the performance function of the PLS-DA model to determine the number of components. I was rewarded with the “Error in solve.default(Sr) : system is computationally singular” message.

I saw previous posts related to this matter and I’m confident that there isn’t an issue with zeroes/missing values in the data.

Utilizing the “loo” validation method, rather than “Mfold,” still gave me the same error.

I suspect it may be related to my extremely low sample size (12 subjects across 3 classes, so effectively n = 4 per class) relative to the number of variables (~2000 genes). The FAQ notes:

“With a small n you can adopt an exploratory approach that does not require a performance assessment.”

In a scenario like this, are there any suggestions for a workflow that would still let me make a meaningful statement about the dataset and choose an appropriate number of variables and components?

Thank you so much for your assistance and thank you again for all the work that has been done.

I'm not sure which distance metric you are using, but if perf() is using the Mahalanobis distance during leave-one-out cross-validation, your covariance matrix may be failing to invert because of the small sample size. Try specifying only the centroids or max distance instead (dist = "centroids.dist" or dist = "max.dist" in perf()). I had the same issue once when doing loo with 6 subjects.
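To show why a small sample size can break the inversion, here is a minimal sketch in plain Python. It illustrates the general linear-algebra point only and is not mixOmics code: I am assuming nothing about how perf() builds its matrix internally. The helper names (covariance, matrix_rank) and the sample and dimension counts are made up for the example. A covariance matrix estimated from n samples has rank at most n - 1, so it cannot be inverted once the number of dimensions exceeds that.

```python
import random


def covariance(rows):
    """Sample covariance matrix (dimensions x dimensions) of a list of observations."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the largest remaining entry in this column as the pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot: this column adds no rank
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


random.seed(0)
n_dims = 6  # hypothetical number of dimensions

# 4 samples (like one small class) versus 20 samples, both in 6 dimensions.
small = [[random.gauss(0, 1) for _ in range(n_dims)] for _ in range(4)]
large = [[random.gauss(0, 1) for _ in range(n_dims)] for _ in range(20)]

small_rank = matrix_rank(covariance(small))
large_rank = matrix_rank(covariance(large))

print("rank with 4 samples: ", small_rank, "of", n_dims)
print("rank with 20 samples:", large_rank, "of", n_dims)
```

With only 4 samples the rank cannot exceed 3, so the 6 x 6 matrix is singular and any routine that tries to invert it will fail. Distances that do not need an inverse covariance matrix avoid the problem.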

Hope this helps!

Hi @hchen8, I think you may be using the wrong method. The PLS-DA model is not suitable for integrating two omics datasets. Its purpose is to discriminate between your response classes based on a single omics dataset. Depending on your question, I think you should use (s)pls or block.(s)plsda.

Christopher

Hello @christoa, thank you for answering me. I’m aware that PLS-DA is not an integration method, but I initially wanted to see whether I could discriminate the response classes in the individual datasets using PLS-DA, to compare with PCA, which is how I have traditionally looked at my data. I apologize if that wasn’t clear.

I’ve run DIABLO on my two datasets without error by following the tutorial. Thank you for checking that I was able to accomplish my stated aim!

@Neystale Thank you so much, I’ll look into that further!

Hi again @hchen8, so far so good! :)

Then it could be due to:

  • Too many missing or zero values. nearZeroVar() can be used to handle this, either by calling it on the data before fitting or by setting near.zero.var = TRUE in the pls function.
  • Too many components (empty residual matrices).
  • Variables that are highly correlated or nearly identical (multicollinearity).
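As a rough way to check the first and third points on your own data, here is a small sketch in plain Python. It is not part of mixOmics. The helper names (constant_columns, highly_correlated_pairs), the 0.99 threshold, and the toy data are all made up for illustration. It flags variables with no variance and pairs of variables that are almost perfectly correlated, since both make a covariance matrix singular or close to it.

```python
import math


def column(rows, j):
    """Values of variable j across all samples."""
    return [r[j] for r in rows]


def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def constant_columns(rows, tol=1e-12):
    """Indices of variables with (numerically) zero variance."""
    return [j for j in range(len(rows[0])) if sum_sq_dev(column(rows, j)) <= tol]


def pearson(x, y):
    """Pearson correlation of two non-constant variables."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / math.sqrt(sum_sq_dev(x) * sum_sq_dev(y))


def highly_correlated_pairs(rows, threshold=0.99):
    """Pairs (i, j) of non-constant variables with |correlation| >= threshold."""
    skip = set(constant_columns(rows))  # constant columns have no defined correlation
    keep = [j for j in range(len(rows[0])) if j not in skip]
    pairs = []
    for pos, i in enumerate(keep):
        for j in keep[pos + 1:]:
            if abs(pearson(column(rows, i), column(rows, j))) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy data: 4 samples x 4 variables.
# Variable 2 is exactly twice variable 0; variable 3 is constant.
rows = [
    [1.0, 5.0, 2.0, 7.0],
    [2.0, 3.0, 4.0, 7.0],
    [3.0, 8.0, 6.0, 7.0],
    [4.0, 1.0, 8.0, 7.0],
]

print("constant variables:      ", constant_columns(rows))
print("highly correlated pairs: ", highly_correlated_pairs(rows))
```

On a real dataset with ~2000 variables the pairwise loop is slow, so treat this as a diagnostic idea rather than a preprocessing step. Dropping one variable from each flagged pair, or all flagged constant variables, is one possible follow-up.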

Let me know if any of these solved the problem.

Christopher

Hi @christoa,

I was wondering if I could ask the same question, as I’m having the same issue with my data. I tried nearZeroVar(), which didn’t work. I then tried lowering the number of components, but this results in a plot with a classification error rate of 0 throughout. Since changing the number of components does not seem to help, I suspect the cause is highly correlated variables. Is there a solution to that?

It is possible that I’m doing something wrong in the script, or perhaps this is due to a small sample size on my end. Either way, I would greatly appreciate any input.

All the best,
Sam