PLS - choose X and Y dataset

Hello,
I’ve been working with mixOmics for a while now, and I still do not fully understand how to choose the datasets for an (s)PLS analysis.

In my case I have metabolomics, transcriptomics, and microbiome datasets, and I want to analyse them pairwise with PLS. How do I choose which one is dataset X and which one is dataset Y?

I want to use the full datasets, as in the “Example: PLS2 regression” shown in the vignette.

Thanks for any help!

hi @kathi_munk,

It depends on whether you want to use regression mode or canonical mode (similar to the canonical correlation framework). For PLS regression, Y is the dataset you are trying to predict; for PLS canonical mode, X and Y are interchangeable. If you are unsure what to predict, I would suggest you use PLS canonical mode, but note that you won’t have the performance estimation.
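
For example, here is a minimal sketch of the two modes. It uses the package’s `liver.toxicity` data as a stand-in for your own matched matrices (samples in rows, features in columns), and the `keepX`/`keepY` values are made up for illustration:

```r
library(mixOmics)

## stand-in for your own matched data sets
data(liver.toxicity)
X <- liver.toxicity$gene    # e.g. your transcriptomics matrix
Y <- liver.toxicity$clinic  # e.g. your metabolomics matrix

## regression mode: the roles matter -- Y is the dataset being predicted
spls.reg <- spls(X, Y, ncomp = 2,
                 keepX = c(50, 50), keepY = c(5, 5),
                 mode = "regression")

## canonical mode: X and Y play symmetric roles
spls.can <- spls(X, Y, ncomp = 2,
                 keepX = c(50, 50), keepY = c(5, 5),
                 mode = "canonical")
```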

If you have sample groups, then after the PLS analyses you should probably consider block.plsda() (DIABLO) to integrate all three datasets.
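
Something along these lines (a sketch only: `metab`, `transcr`, `micro` stand in for your three matrices with matched samples, and `groups` for a factor of sample groups):

```r
## three data sets with matched samples, integrated in one model
data.list <- list(metabolomics    = metab,
                  transcriptomics = transcr,
                  microbiome      = micro)

diablo.res <- block.plsda(X = data.list, Y = groups, ncomp = 2)
plotIndiv(diablo.res)  # one sample plot per data set, coloured by group
```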

Kim-Anh

Hi Kim-Anh,
thanks for your response, but I have some further questions.

I have now changed the mode of spls and tune.spls to canonical, but when going through the tuning steps with the perf and tune.spls functions, I still get different results when swapping the two datasets.
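
To make the comparison concrete, this is roughly my setup (with placeholder objects `metab` and `transcr`, and a made-up tuning grid):

```r
grid <- c(10, 20, 30, 40, 50)

## one ordering of the datasets ...
tune.AB <- tune.spls(X = metab, Y = transcr, ncomp = 2,
                     test.keepX = grid, test.keepY = grid,
                     validation = "Mfold", folds = 5, nrepeat = 10,
                     measure = "cor", mode = "canonical")

## ... and the swapped ordering
tune.BA <- tune.spls(X = transcr, Y = metab, ncomp = 2,
                     test.keepX = grid, test.keepY = grid,
                     validation = "Mfold", folds = 5, nrepeat = 10,
                     measure = "cor", mode = "canonical")
```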

As an example, here are the heatmaps after the tuning process:
Heatmap 1 - metabolomics dataset as X and transcriptomics as Y

Heatmap 2 - transcriptomics as X and metabolomics as Y

As you can see, the difference is not massive, but there is one, and I assumed there shouldn’t be. To put it another way, the only difference I expected was that the heatmap would simply be transposed (flipped).

A second way to see the difference was to look at the total Q2 score. With the first ordering of the datasets, the first component had a Q2 score of around 0.2; after switching the datasets, the score for comp 1 was below 0.0975, almost hitting 0.

So evidently there is a difference.

Can you maybe explain to me why the tuning process still produces different results?

Thanks for your help!

PS: I am doing a DIABLO analysis as the last step of my “workflow”, but first I wanted to have a look at my datasets with sPLS.

hi @kathi_munk,

Yes, I was not expecting this either! Assuming the results you are showing use the same parameters (comp), this can happen because we calculate the singular value decomposition of the cross-product matrix t(X) %*% Y (which, when you swap the datasets, is either p x q or q x p), and numerical approximations can lead to small differences in the resulting SVD. I would recommend you stick with the version that makes the most sense to you from those heatmaps (I assume the correlation circle plots would also look different?).
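
To illustrate the symmetry that holds in exact arithmetic (with a made-up matrix standing in for the cross-product t(X) %*% Y): swapping the datasets only transposes that matrix, which merely swaps the roles of the singular vectors. Any remaining discrepancy therefore comes from the numerical approximations mentioned above:

```r
set.seed(1)
## toy stand-in for the cross-product t(X) %*% Y (p x q)
M <- matrix(rnorm(200 * 30), nrow = 200)

s.pq <- svd(M)     # SVD with one ordering of the datasets
s.qp <- svd(t(M))  # SVD after the swap (q x p)

## singular values agree to machine precision ...
max(abs(s.pq$d - s.qp$d))
## ... and the left/right singular vectors simply swap roles (up to sign)
max(abs(abs(s.pq$u) - abs(s.qp$v)))
```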

For perf() we perform cross-validation, so I am not surprised that the results differ, unless you really boost the number of repeats (and even then, I would expect some differences, since the random fold assignments will not match across runs).
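
If it helps, you can at least make each run reproducible by fixing the seed before the cross-validation and increasing nrepeat (a sketch, assuming `spls.res` is your fitted canonical (s)PLS model):

```r
set.seed(123)  # fixes the random fold assignment for this run
perf.res <- perf(spls.res, validation = "Mfold", folds = 5,
                 nrepeat = 50, progressBar = FALSE)

plot(perf.res, criterion = "Q2.total")  # Q2 per component
```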

Kim-Anh