Goodness of fit measures for canonical (s)PLS

Hi all,

I’m working with two types of metabolomics data from the same samples (the intracellular and extracellular metabolomes), but each type was measured on a different NMR spectrometer, so they are treated as “different” data types. I have 25 samples, with around 150 variables for the intracellular block and 250 variables for the extracellular block.
I want to integrate the two data blocks to see which metabolites are correlated. Based on previous posts here, I started with a PCA on each block separately and got good discrimination, and I am now trying to integrate them using (s)PLS. I’m not a statistician, so based on the documentation I thought (s)PLS in “canonical” mode was the best solution, because I am not trying to predict the Y block from the X block; I just want to see which variables are correlated. However, from what I see in the literature, you need some kind of measure to determine whether your model is overfitted, and this doesn’t seem to be possible for PLS in canonical mode. From what I understood of the paper (Le Cao et al., BMC Bioinformatics (2009)), it also doesn’t seem to be necessary. So my questions are these:

- Is (s)PLS in canonical mode the best solution for what I’m trying to achieve, or should I try one of the other modes?
- If canonical mode is indeed the best solution, does some measure of goodness of fit need to be reported as well? And what would that measure be?

Many thanks,
Natalie

Dear @npayne,
I do think that sPLS in canonical mode would be the best approach for your analysis if you are using more than one component. (If you use only ncomp = 1 to identify correlated variables, then regression and canonical mode give the same result. You can of course still use 2 components for the plots!)
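
The ncomp = 1 equivalence can be checked directly. This is a minimal sketch, assuming the liver.toxicity example data shipped with mixOmics; the keepX and keepY values are arbitrary and chosen only for illustration:

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

# Same sparse model with a single component, fitted in both modes
# (keepX = 5 and keepY = 5 are arbitrary illustration values)
fit.reg <- spls(X, Y, ncomp = 1, keepX = 5, keepY = 5, mode = "regression")
fit.can <- spls(X, Y, ncomp = 1, keepX = 5, keepY = 5, mode = "canonical")

# Largest absolute difference between the loadings of the two fits
max.diff <- max(abs(fit.reg$loadings$X - fit.can$loadings$X),
                abs(fit.reg$loadings$Y - fit.can$loadings$Y))
max.diff
```

The two modes differ only in how Y is deflated after each component, so any difference should appear from the second component onwards.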

Unfortunately there is no goodness-of-fit measure for that model, as it is not regression based.

A basic metric would be the correlation between the latent components, but that would only assess ‘how similar the components from both data sets are’. Potentially you could compare this correlation coefficient to the one from a full PLS with no variable selection (but the correlation from sPLS might still be higher than from PLS anyway, simply because noise may also play a part in the correlation). You could also use graphical outputs to show the benefit of variable selection (here I am assuming you are interested in variable selection). Researchers who use this method often don’t report any metric, as this is still an exploratory approach; instead they focus on the interpretation of the results.

In terms of methods, you could have a look at rCCA, but that method does not perform variable selection.
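
For reference, a minimal rCCA sketch. It assumes the nutrimouse example data shipped with mixOmics (not your data) and the shrinkage method for estimating the regularisation parameters; both are illustrative choices rather than a recommendation:

```r
library(mixOmics)
data(nutrimouse)
X <- nutrimouse$lipid   # 40 mice x 21 lipids
Y <- nutrimouse$gene    # 40 mice x 120 genes

# Regularised CCA; method = "shrinkage" estimates the regularisation
# parameters from the data instead of tuning them by cross-validation
fit.rcc <- rcc(X, Y, ncomp = 2, method = "shrinkage")

# Canonical correlations between the paired components
can.cor <- fit.rcc$cor
can.cor
```

All variables contribute to each component here, so correlated metabolites would have to be read from plots such as plotVar() with a cutoff rather than from a selected subset.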

Kim-Anh

After discussing with @aljabadi, who masters the mixOmics code, there is also the stability output you can use (increase the number of repeats, nrepeat), as well as the correlation I mentioned above:

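library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# sPLS in canonical mode, keeping 5 variables per block on each of 5 components
liver.pls.can <- spls(X, Y, keepX = rep(5,5), keepY = rep(5,5), ncomp = 5, mode = 'canonical')
# repeated 5-fold cross-validation; increase nrepeat for steadier stability estimates
perf.res.can <- perf(liver.pls.can, folds = 5, nrepeat = 3)
perf.res.can
# selection stability of the X variables across the cross-validation runs
perf.res.can$features$stability.X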

I also attach a screenshot from our book (it is coming! in September).

Thank you so much for your help @kimanh.lecao and @aljabadi !

I think I’ll go with the variable stability option, but just in case: are the correlation coefficients you mentioned the values found in the following (using the example above)?

perf.res.can$measures$cor.tpred$summary
perf.res.can$measures$cor.upred$summary

And then I would compare these values for each component with those from a full PLS? Sorry if this is a basic question!

Thanks again, can’t wait to read the book!

hi @npayne,

perf.res.can$measures$cor.tpred$summary
perf.res.can$measures$cor.upred$summary

are used for tuning during the perf() calculation.

The correlation I was mentioning is simpler: extract the latent components (‘variates’) from a normal sPLS model, calculate the correlation between them, and see how high it is.
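
A minimal sketch of that calculation, together with the comparison against a full PLS. It assumes the liver.toxicity example data, two components, and arbitrary keepX/keepY values; `comp.cor` is a small helper defined here, not a mixOmics function:

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

ncomp <- 2
# Sparse model in canonical mode (keepX/keepY are arbitrary illustration values)
fit.spls <- spls(X, Y, ncomp = ncomp, keepX = rep(5, ncomp),
                 keepY = rep(5, ncomp), mode = "canonical")
# Full PLS with no variable selection, for comparison
fit.pls <- pls(X, Y, ncomp = ncomp, mode = "canonical")

# Helper (defined here): correlation between the X- and Y-variates,
# taken component by component
comp.cor <- function(fit) diag(cor(fit$variates$X, fit$variates$Y))

cor.spls <- comp.cor(fit.spls)
cor.pls <- comp.cor(fit.pls)
rbind(sPLS = cor.spls, PLS = cor.pls)
```

These correlations are computed on the same samples used to fit the models, so they describe how closely the two sets of components agree and should not be read as a cross-validated measure of fit.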

Kim-Anh