PLS - choose X and Y dataset

kathi_munk · June 27, 2023, 4:30pm

Hello,
I’ve been working with mixOmics for a while now and I do not fully understand how to choose the datasets for the (s)PLS analysis.

In my case I have a metabolomics, a transcriptomics and a microbiomics dataset, and I want to analyse them pairwise with PLS. How do I choose which one is dataset X and which one is dataset Y?

I want to use the whole datasets, as in the "Example: PLS2 regression" shown in the vignette.

Thanks for any help!

kimanh.lecao · June 29, 2023, 6:27am

hi @kathi_munk,

It depends on whether you want to use regression mode or canonical mode (the latter is similar to a canonical correlation framework). For PLS regression, Y is what you are trying to predict. For PLS canonical mode, X and Y are interchangeable. If you are unsure about what to predict, I would suggest you use PLS canonical mode, but you won’t have the performance estimation.

If you have sample groups, after you do the PLS analyses, you should probably then consider block.plsda() (DIABLO) to integrate all three data sets.

Kim-Anh

kathi_munk · July 16, 2023, 1:08pm

Hi Kim-Anh,
thanks for your response, but I have some further questions.

I now changed the mode of the spls and tune.spls to canonical, but when going through the tuning steps with the perf and tune.spls functions I still get different results when swapping the two datasets.

As an example, here are the heatmaps after the tuning process:
Heatmap 1 - metabolomics dataset as X and transcriptomics as Y

Heatmap 2 - transcriptomics as X and metabolomics as Y

As you can see, the difference is not massive, but there is a difference where I assumed there shouldn’t be one. To put it another way, the only difference I expected was that the heatmap would simply be flipped.

A second way to see the difference was to have a look at the total Q2 score. With the first ordering of the datasets, the first component had a Q2 score of around 0.2; after switching the datasets, the score of comp1 was below 0.0975, almost hitting 0.

So evidently, there seems to be a difference.

Can you maybe explain to me why the tuning process still produces different results?

Thanks for your help!

PS: I am doing a DIABLO analysis as the last step of my “workflow”, but first I wanted to have a look at my datasets with the sPLS.

kimanh.lecao · July 21, 2023, 12:44am

hi @kathi_munk,

Yes, I was not expecting this either! Assuming the results you are showing use the same parameters (comp), this might happen because we calculate the singular value decomposition of the cross-product of X and Y (X^T Y, which is either p x q or q x p depending on the order), and swapping can lead to some differences in the resulting SVD because of approximations. I would recommend you stick to the version that makes the most sense to you from those heatmaps (I assume the correlation circle plots would also look different?).
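To illustrate the swap described above, here is a minimal sketch in Python with NumPy. It is not mixOmics code: the data are simulated, the sizes n, p and q are hypothetical, and it only covers the plain SVD of the cross-product, not the sparse or iterative steps of sPLS. It compares the SVD of the p x q matrix X^T Y with the SVD of its q x p transpose Y^T X.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the sketch is repeatable
n, p, q = 20, 6, 4                  # hypothetical sizes: n samples, p and q variables
X = rng.normal(size=(n, p))         # simulated stand-in for one omics dataset
Y = rng.normal(size=(n, q))         # simulated stand-in for the other dataset
X -= X.mean(axis=0)                 # centre the columns
Y -= Y.mean(axis=0)

# SVD of the p x q cross-product, then of the q x p one (datasets swapped).
U_xy, s_xy, Vt_xy = np.linalg.svd(X.T @ Y, full_matrices=False)
U_yx, s_yx, Vt_yx = np.linalg.svd(Y.T @ X, full_matrices=False)

# The singular values should agree up to floating-point rounding.
same_singular_values = np.allclose(s_xy, s_yx)
max_gap = float(np.max(np.abs(s_xy - s_yx)))

# The first X-loading is a left singular vector in one order and a right
# singular vector in the other; its sign is arbitrary, so compare magnitudes.
same_first_x_loading = np.allclose(np.abs(U_xy[:, 0]), np.abs(Vt_yx[0]))

print("singular values match:", same_singular_values)
print("largest gap between singular values:", max_gap)
print("first X-loading matches up to sign:", same_first_x_loading)
```

In this plain setting the two orders carry the same information, with the left and right singular vectors trading places. Any larger gap seen in an actual sPLS run would have to come from numerical approximation or from later steps that this sketch does not reproduce.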

For perf() we perform cross-validation, so I am not surprised that the results may differ, unless you really boost the number of repeats (and even with that, I would expect some differences).
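The effect of random fold assignment can be sketched in Python with NumPy. This is an illustration under loud assumptions, not what perf() does internally: the data are simulated, the model is ordinary least squares rather than PLS, and the score is a simplified Q2-like quantity, 1 - PRESS / TSS. The helper names cv_q2 and repeated_cv_q2 are hypothetical.

```python
import numpy as np

def cv_q2(X, y, folds, rng):
    """Q2-like score (1 - PRESS / TSS) from one random K-fold split."""
    idx = rng.permutation(len(y))               # random assignment of samples to folds
    press = 0.0
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        mean_x, mean_y = X[train].mean(axis=0), y[train].mean()
        # Ordinary least squares on the centred training fold (stand-in for PLS).
        coef, *_ = np.linalg.lstsq(X[train] - mean_x, y[train] - mean_y, rcond=None)
        pred = (X[test] - mean_x) @ coef + mean_y
        press += np.sum((y[test] - pred) ** 2)  # prediction error on held-out samples
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def repeated_cv_q2(X, y, folds, nrepeat, seed):
    """Average the Q2-like score over nrepeat random splits."""
    rng = np.random.default_rng(seed)
    return float(np.mean([cv_q2(X, y, folds, rng) for _ in range(nrepeat)]))

# Simulated data: 40 samples, 5 predictors, a weak signal on the first one.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(40, 5))
y = 0.5 * X[:, 0] + data_rng.normal(size=40)

# Same data and same model; only the random fold assignment changes with the seed.
single = [repeated_cv_q2(X, y, folds=5, nrepeat=1, seed=s) for s in range(10)]
repeated = [repeated_cv_q2(X, y, folds=5, nrepeat=50, seed=s) for s in range(10)]

print("spread with 1 repeat  :", max(single) - min(single))
print("spread with 50 repeats:", max(repeated) - min(repeated))
```

Averaging over more repeats typically narrows the spread between runs, but it does not remove it entirely, which matches the expectation that some differences remain even with many repeats.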

Kim-Anh
