Hi all,

I’m working with two types of metabolomics data that come from the same samples (the intracellular and extracellular metabolomes). Each type was measured on a different NMR spectrometer, so I am treating them as “different” data types. I have 25 samples, with around 150 variables for the intracellular block and 250 variables for the extracellular block.

I want to integrate the two data blocks to see which metabolites are correlated. Based on previous posts here, I started with a PCA on each block separately, which gave good discrimination, and I am now trying to integrate the blocks using (s)PLS. I’m not a statistician, but from the documentation I concluded that (s)PLS in “canonical” mode was the best fit, because I am not trying to predict the Y block from the X block; I just want to see which variables are correlated.

However, from what I see in the literature, you need some kind of measure to determine whether your model is overfitted, and this does not seem to be possible for PLS in canonical mode. From what I understood of the paper (Lê Cao et al., BMC Bioinformatics (2009)), such a measure does not seem to be necessary either. So my questions are these:

- Is (s)PLS in canonical mode the best solution for what I’m trying to achieve, or should I try one of the other modes?

- If canonical mode is indeed the best solution, do I also need to report some measure of goodness of fit? If so, what would that measure be?
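
To make the goal concrete, here is a minimal sketch of the kind of analysis described above. It is not the software or the data used in this post: it is plain Python on simulated toy data, it has no sparsity (so it is PLS rather than sPLS), and it only extracts the first pair of components, where the blocks are treated symmetrically and no deflation is involved yet. All function names, the block sizes (6 and 8 variables instead of 150 and 250), and the simulated shared signal are assumptions made for illustration.

```python
import math
import random

rng = random.Random(0)
n = 25  # number of samples, as in the post; variable counts are shrunk for the toy


def standardize(block):
    """Center each column and scale it to unit variance."""
    columns = []
    for col in zip(*block):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        columns.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*columns)]


def unit(vec):
    """Rescale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def first_pls_component(X, Y, n_iter=200):
    """Weight vectors (a, b) maximizing cov(Xa, Yb), via power iteration on X'Y."""
    p, q = len(X[0]), len(Y[0])
    # Cross-product matrix M = X'Y (p x q)
    M = [[sum(x[j] * y[k] for x, y in zip(X, Y)) for k in range(q)] for j in range(p)]
    b = unit([1.0] * q)
    for _ in range(n_iter):
        a = unit([sum(M[j][k] * b[k] for k in range(q)) for j in range(p)])
        b = unit([sum(M[j][k] * a[j] for j in range(p)) for k in range(q)])
    return a, b


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((s - mu) * (t - mv) for s, t in zip(u, v))
    den = math.sqrt(sum((s - mu) ** 2 for s in u) * sum((t - mv) ** 2 for t in v))
    return num / den


def make_block(n_vars, n_signal, latent):
    """Toy block: the first n_signal variables follow the latent signal, the rest are noise."""
    block = []
    for value in latent:
        row = [value + 0.3 * rng.gauss(0, 1) for _ in range(n_signal)]
        row += [rng.gauss(0, 1) for _ in range(n_vars - n_signal)]
        block.append(row)
    return block


# Hypothetical shared biological signal seen by both blocks
z = [rng.gauss(0, 1) for _ in range(n)]
X = standardize(make_block(6, 2, z))  # stand-in for the intracellular block
Y = standardize(make_block(8, 2, z))  # stand-in for the extracellular block

a, b = first_pls_component(X, Y)
t = [sum(w * v for w, v in zip(a, row)) for row in X]  # X scores (first component)
u = [sum(w * v for w, v in zip(b, row)) for row in Y]  # Y scores (first component)

# Variables with the largest absolute weights drive the shared component
top_x = sorted(range(len(a)), key=lambda j: -abs(a[j]))[:2]
top_y = sorted(range(len(b)), key=lambda k: -abs(b[k]))[:2]
print("top X variables:", top_x)
print("top Y variables:", top_y)
print("correlation between scores:", round(pearson(t, u), 3))
```

In this toy setting, the variables carrying the shared signal receive the largest weights in both blocks, which is the "which variables are correlated" reading I am after. The sketch says nothing about overfitting, which is the part I am unsure how to assess.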

Many thanks,

Natalie