Proportion of explained variance in PLS vs sPLS models

Hi everyone,

Many thanks for making such an amazing package with so many capabilities, it has been mega helpful in analysing my data so far!

It’s an issue that would apply to sPLSDA analysis as well, but I’ll post my code with sPLS in mind. I am using sPLS to investigate the relationship between 232 plasma metabolites (X) and age (Y) in my study that has 11 samples.

My issue is that I am not sure how many metabolites were used for each component. :smiley:

I first ran tune(method = "pls") in order to tune the number of components:

tune_pls1_plasma <- mixOmics::tune(df_plasma, df_plasma_traits[, "age"],
                                   ncomp = 3, method = "pls",
                                   multilevel = NULL,
                                   mode = "regression", logratio = "none",
                                   validation = "Mfold",
                                   folds = 3, nrepeat = 100,
                                   center = TRUE, scale = TRUE,
                                   progressBar = TRUE, dist = "all",
                                   BPPARAM = BPPARAM,
                                   seed = 42)

This suggested using 1 component, but I needed 2 components for plotting reasons anyway, so I kept 2 components for the next step. Following this, I wished to check how many metabolites to keep, and I used the following code:

list.keepX <- seq(8, 232, by = 8)
tune_spls1_plasma_MSE <- mixOmics::tune(df_plasma, df_plasma_traits[, "age"],
                                        ncomp = 2, method = "spls",
                                        test.keepX = list.keepX,
                                        mode = "regression", logratio = "none",
                                        validation = "Mfold",
                                        folds = 3, nrepeat = 100,
                                        center = TRUE, scale = TRUE,
                                        measure = "MSE",
                                        progressBar = TRUE, dist = "all",
                                        BPPARAM = BPPARAM,
                                        seed = 42)

The suggested parameters were obtained by this:
tune_spls1_plasma_MSE$choice.ncomp$ncomp # 2 components
tune_spls1_plasma_MSE$choice.keepX # 232 and 88

Given the above and the associated tuning plot, I decided to go ahead with 2 components, and defined these as follows to “plug” them in my final model:
choice.ncomp <- tune_spls1_plasma_MSE$choice.ncomp$ncomp
choice.keepX <- tune_spls1_plasma_MSE$choice.keepX[1:choice.ncomp]

spls1_final_plasma <- spls(df_plasma, df_plasma_traits[, "age"],
                           ncomp = choice.ncomp, keepX = choice.keepX,
                           scale = TRUE, mode = "regression")

I then ran the following two lines, which confirmed that 232 and 88 metabolites were selected for components 1 and 2 respectively, and checked the correlations between the X and Y variates for each component:

comp1.choices <- selectVar(spls1_final_plasma, comp = 1)$X$name %>% as.data.frame()
comp2.choices <- selectVar(spls1_final_plasma, comp = 2)$X$name %>% as.data.frame()

print(cor(spls1_final_plasma$variates$X[,1], spls1_final_plasma$variates$Y[,1])) # 0.9002954
print(cor(spls1_final_plasma$variates$X[,2], spls1_final_plasma$variates$Y[,2])) # 0.8927105

Finally, I checked the model's performance:

perf_spls_plasma <- perf(spls1_final_plasma, dist = "all",
                         validation = "Mfold", folds = 3,
                         nrepeat = 100, seed = 42,
                         BPPARAM = BPPARAM, progressBar = TRUE)

Performance was not much improved for the spls1 model compared to the pls1 one, but it was not identical either. However, when I ran the following code, I got almost identical explained variance - this shouldn't be happening, though, as component 2 used only 88 metabolites.

spls1_final_plasma$prop_expl_var$X # 0.1667005 0.1213769
pls1_plasma$prop_expl_var$X # 0.1667005 0.1219007 0.1853511

Shouldn’t these be different since the pls1 model had all 232 metabolites for each component? What am I doing wrong? Thank you in advance for your help! :slight_smile:

P.S. I tried the Mfold and loo validation methods, but my tuning plots and keepX suggestions from loo were a bit weird, so I kept Mfold with folds = 3 given my small sample size.

Many thanks,
Evelyn

Hi @windsnowflake,

From what I can see of the code you shared you are performing all the correct steps for model creation and tuning (if you haven’t already I would check out this webpage for more details on mixOmics tuning functions).

Your question is around the proportion of variance explained in your spls and pls models.
The important thing to keep in mind is that, unlike PCA models, sPLS, PLS, sPLS-DA and PLS-DA models are not designed to maximise variance for each component, instead they are designed to maximise the covariance between your X and Y data. Therefore, the more useful metric to focus on is the performance of your spls and pls models which you can calculate using perf().
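
The covariance-vs-variance point can be illustrated with a small base-R toy (no mixOmics; all objects below are made up for the illustration). For single-response PLS (PLS1), the first weight vector is proportional to t(X) %*% y, so it maximises the covariance of the X component with y, whereas the first PCA direction maximises the variance of the X component alone:

```r
set.seed(1)
X <- scale(matrix(rnorm(30 * 5), nrow = 30))        # toy data: 30 samples, 5 variables
y <- scale(X[, 1] - X[, 2] + rnorm(30, sd = 0.5))   # toy response related to X

w_pls <- crossprod(X, y)
w_pls <- w_pls / sqrt(sum(w_pls^2))   # first PLS1 weight vector (unit length)
w_pca <- svd(X)$v[, 1]                # first PCA loading vector (unit length)

abs(cov(X %*% w_pls, y))   # covariance with y: at least as large as any other unit direction
abs(cov(X %*% w_pca, y))
var(X %*% w_pca)           # variance of the X component: maximal for the PCA direction
var(X %*% w_pls)
```

So two models can differ in how "useful" their components are for predicting Y even when the variance they capture in X looks similar, which is why perf() is the better yardstick here.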

In terms of your results:

spls1_final_plasma$prop_expl_var$X # 0.1667005 0.1213769
pls1_plasma$prop_expl_var$X # 0.1667005 0.1219007 0.1853511

To me these appear to make sense: for your pls1 model you used all 232 metabolites for all 3 components. For your spls1 model you used 232 metabolites for your first component (hence the identical explained variance) and 88 metabolites for your second component (resulting in a slightly lower explained variance).
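
To see why the component-1 values coincide: the proportion of explained variance is computed from the component scores (variates), so it only changes when keepX changes the scores themselves. Here is a base-R sketch of one common definition - the average squared correlation between each X variable and the scores (mixOmics's exact prop_expl_var formula may differ slightly):

```r
# Explained variance of one component, from the scores alone
expl_var_comp <- function(X, scores) {
  mean(cor(X, scores)^2)   # average squared correlation of each variable with the scores
}

# Sanity check with PCA: for standardised X, the per-component values sum to 1
set.seed(2)
X <- scale(matrix(rnorm(11 * 6), nrow = 11))   # toy data: 11 samples, 6 variables
scores <- prcomp(X)$x                          # full set of orthogonal PCA scores
per_comp <- sapply(seq_len(ncol(scores)), function(h) expl_var_comp(X, scores[, h]))
sum(per_comp)   # 1 (up to floating point) for standardised X
```

Applied to your models, something like expl_var_comp(scale(df_plasma), spls1_final_plasma$variates$X[, 1]) would give the same number for pls1 and spls1, because with keepX = 232 the component-1 scores are the same.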

Hope that helps!
Eva

Hi Eva,

This was very helpful, thank you very much! I spent some more time reading around the topic, and your reply cleared things up.

I have a few follow-up questions if that’s OK, please let me know if I should create a new post for one or more of them!

What is the right thing to do when the initial step, e.g. in my case tune_pls1_plasma <- mixOmics::tune(df_plasma, df_plasma_traits[, "age"]), suggests 1 component, but the next step to select metabolites suggests 2 components? Should I keep 1 or 2 in this case?

In terms of using perf() to assess model performance, how do I interpret MSEP/RMSEP/R2/Q2? I understand that lower MSEP/RMSEP indicates better predictive accuracy, a higher R2 suggests a better fit to the training data, and a higher Q2 suggests better predictive ability. However, what does getting an MSEP of 1.04 ± 0.23 mean? Also, as I have so few samples, my Q2 starts close to 0 or even negative, and adding a 2nd component as suggested by the tuning process always makes it smaller/more negative. Does this mean I am overfitting my data?

Finally, I have a more general question so I can understand which parts of the dataset to apply these techniques to. I have paired data, i.e. samples collected before and after an intervention, and for the subjects I have their sex and age.
Should I perform ALL of the following?

  1. PLS for age and PLSDA for sex in the baseline data
  2. PLS for age and PLSDA for sex in the post data
  3. PLS for age, PLSDA for sex and PLSDA for the intervention effect in the full data
    I was thinking that if I had a simpler study design that wasn’t metabolomics, for example if I had measured weight before and after a marathon race, I’d analyse if there were age or sex differences before the race and after the race, as well as analyse the weight change that occurred as a result of the race.

Thank you in advance for your help! :slight_smile:

Best wishes,
Evelyn

Hi @windsnowflake,

So your data includes:

  • metabolites (232 continuous variables)
  • before/after intervention (1 categorical variable)
  • age (1 continuous variable)
  • sex (1 categorical variable)

The type of model you will want to build depends on which of these variables you are interested in and which ones might be confounding. I imagine you are most interested in the effect of your intervention, in which case you should set before/after intervention as your Y variable and run a (s)PLS-DA model. If this is the case there are a couple of things I would consider:

  1. You have the same sample before/after treatment, so as you pointed out this is paired data, also called multilevel data. You can read more about multilevel data on this page, but essentially you need to account for this when you build any model in mixOmics, because we expect the difference between your individuals to be greater than the difference between before and after treatment. You can actually check if this is the case by doing a PCA plot and colouring your samples by individuals and making different shapes for before/after intervention.

  2. Age and sex are factors that you might not be primarily interested in, but they may also influence your metabolite data - these are called confounders/covariates. Again you can check what effect these have on your data with a simple PCA plot. Unfortunately mixOmics doesn’t currently have functionality to account for covariates, but there are other things you can do to get around this - see this related question and this one.
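
The multilevel point in (1) can be sketched in base R: the paired structure is handled by splitting X into a between-subject part (per-subject means) and a within-subject part (deviations from those means), and the within part is what carries the treatment signal once the large subject-to-subject differences are removed. All names below are made up for the illustration:

```r
set.seed(3)
subject <- rep(1:10, each = 2)                               # 10 individuals, before/after
X <- matrix(rnorm(20 * 4), nrow = 20) + rnorm(10)[subject]   # toy data with a strong subject effect

X_between <- apply(X, 2, function(x) ave(x, subject))        # per-subject means, replicated
X_within  <- X - X_between                                   # within-subject deviations

# Removing the subject effect shrinks the variance left in the data
var(as.vector(X))
var(as.vector(X_within))
```

This is essentially what setting the multilevel argument does for you inside mixOmics, so you don't need to do this decomposition by hand.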

Hope that helps!
Eva

PS in future please could you post different questions in different posts? It just helps others who might have similar questions to find what they need :slight_smile:


Hi @evahamrud - many thanks for your reply! I did as you suggested and split the questions to make it easier for people to find what they need in the future. :slight_smile:

The only follow-up I have is in terms of the validation I should use. I have 10 individuals (4 female and 6 male), measured before and after an intervention in 2 biofluids (232 and 125 metabolites). I therefore have 10 or 9 samples (one sampling point is missing in one biofluid) for my baseline and post-intervention analyses, and 20 or 18 for my multilevel analysis.

As noted above, I have been using Mfold with 3 folds and an nrepeat of 100, however given my small sample size (or other reasons) should I instead be using LOOCV?

Best wishes,
Evelyn