Possible bug in block.splsda + various questions

Dear mixOmics community

Over the past month I’ve been working on integrating several metabolomics/lipidomics datasets with microbiome (16S) data. I recently encountered some strange behaviour in the block.splsda function. When I create a model using this list.keepX object:
[1] 17 6

[1] 14 5

[1] 20 12

[1] 6 5

The list indicates that my OTU (microbiome) dataset should have 6 and 5 variables selected on components 1 and 2, respectively. However, the resulting model has 51 and 5 variables for the OTU dataset! In another attempt with the same input, it had 52 and 6 variables!

Any idea on why this might be the case, or what is happening here?

While I’m at it, I have some other (not as important) questions:

  1. For all my datasets, in both multi-omic and single-omic analyses, the optimal number of components is always 1. Is this abnormal, or is it not a problem? For visualization purposes I always construct models with ncomp = 2. Also, the error rate of my models stays high even after tuning, so I guess my data are not suited for prediction but can still be used for pathway analysis or to search for biologically relevant correlations?

  2. Is there a way to disregard “within block correlations” when plotting the circosplot or networks?

Many thanks in advance if someone takes the time to read and answer this!

Kind regards
PhD Student @ Laboratory for Chemical Analysis, Faculty of Veterinary Medicine, Ghent University

Hi @pvgeende,

@aljabadi may want to ask you for more details (and data), in case this is related to a bug in the function. My intuition is that your OTU data are highly collinear and so you have ~ 51 variables considered as important (and potentially exactly the same values) on the first component. Have a look back at the data, the variables selected and let us know.
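As a rough way to run that check, the sketch below counts the non-zero loadings per component and looks for tied loadings and identical columns. It uses small made-up matrices in base R so that it runs on its own; with a fitted model the loadings would come from something like `model$loadings$otu`, where `model` and the block name `otu` are assumed names, not taken from your code.

```r
# Toy stand-in for the OTU loadings of a fitted model
# (rows = OTUs, columns = components). With a real block.splsda
# object this would be e.g. model$loadings$otu ("otu" is an assumed name).
loadings <- matrix(0, nrow = 8, ncol = 2,
                   dimnames = list(paste0("OTU", 1:8), c("comp1", "comp2")))
loadings[c(1, 2, 3), "comp1"] <- c(0.6, 0.6, -0.5)
loadings[c(4, 5), "comp2"] <- c(0.7, -0.7)

# Number of variables with a non-zero loading on each component.
n_selected <- colSums(loadings != 0)
print(n_selected)

# Selected variables on component 1 that share the same absolute loading.
sel <- loadings[loadings[, "comp1"] != 0, "comp1"]
tied <- sel[duplicated(abs(sel)) | duplicated(abs(sel), fromLast = TRUE)]
print(tied)

# Toy abundance table (rows = samples, columns = OTUs).
# Identical columns are one possible source of tied loadings.
otu <- cbind(OTU1 = c(1, 0, 2, 0),
             OTU2 = c(1, 0, 2, 0),
             OTU3 = c(0, 3, 1, 1))
is_dup <- duplicated(t(otu)) | duplicated(t(otu), fromLast = TRUE)
dup_cols <- colnames(otu)[is_dup]
print(dup_cols)
```

If many of the ~51 selected OTUs turn out to share the same loading or the same values across samples, that would point to collinearity rather than a bug.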

1 - It means that the discrimination only happens on the first component (as you will see on the plot), and after that you are only adding noise. The performance results indicate that yes, it is difficult to separate the groups. It might be better not to tune, and instead to choose a reasonable number of variables ad hoc that will allow for exploration and interpretation using pathway analysis etc. For visualisation, you can still use 2 components but focus your interpretation on the variables selected on component 1.

2 - No, but you can extract the similarity matrix from circosPlot (see post here) and use Cytoscape for customised plots.
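A minimal sketch of that post-processing, in base R: once you have the similarity matrix, set the within-block entries to zero and export the remaining pairs as an edge list. The matrix, the variable names, the block labels and the 0.7 cutoff below are all made up for illustration; in practice `sim` would be the matrix extracted from circosPlot and `block` would map each variable to its dataset.

```r
# Toy similarity matrix standing in for the one extracted from circosPlot.
vars <- c("met1", "met2", "otu1", "otu2")
sim <- matrix(c( 1.00,  0.90,  0.80, 0.20,
                 0.90,  1.00, -0.75, 0.10,
                 0.80, -0.75,  1.00, 0.85,
                 0.20,  0.10,  0.85, 1.00),
              nrow = 4, dimnames = list(vars, vars))

# Hypothetical lookup: which block each variable belongs to.
block <- c(met1 = "metabolites", met2 = "metabolites",
           otu1 = "otu", otu2 = "otu")

# TRUE where both variables of a pair come from the same block.
same_block <- outer(block[rownames(sim)], block[colnames(sim)], "==")

# Keep only between-block similarities.
between <- sim
between[same_block] <- 0

# Edge list of between-block pairs above an (arbitrary) cutoff,
# using the upper triangle so each pair appears once.
cutoff <- 0.7
idx <- which(abs(between) >= cutoff & upper.tri(between), arr.ind = TRUE)
edges <- data.frame(from = rownames(between)[idx[, "row"]],
                    to = colnames(between)[idx[, "col"]],
                    weight = between[idx])
print(edges)

# To import into Cytoscape as a network table:
# write.csv(edges, "between_block_edges.csv", row.names = FALSE)
```

Here the strong met1/met2 and otu1/otu2 pairs are dropped because they sit within a block, and only the metabolite-to-OTU pairs remain.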



Dear Kim Ahn

Thanks a lot for taking the time to respond to my questions.
I’ll take time to investigate my data / selected variables again and will post more info here later if needed.

Kind regards