Hi everyone,
Many thanks for making such an amazing package with so many capabilities, it has been mega helpful in analysing my data so far!
It’s an issue that would apply to sPLSDA analysis as well, but I’ll post my code with sPLS in mind. I am using sPLS to investigate the relationship between 232 plasma metabolites (X) and age (Y) in my study that has 11 samples.
My issue is that I am not sure how many metabolites were used for each component.
I first ran tune(method = “pls”) in order to tune the number of components:
tune_pls1_plasma ← mixOmics::tune(df_plasma, df_plasma_traits[, “age”],
ncomp = 3, method = “pls”,
multilevel = NULL,
mode = “regression”, logratio = “none”,
validation = “Mfold”,
folds = 3, nrepeat = 100,
center = TRUE, scale = TRUE,
progressBar = TRUE, dist = “all”,
BPPARAM = BPPARAM,
seed = 42)
This suggested using 1 component, but I needed to use 2 components for plotting reasons anyway so I kept 1 component for the next step. Following this, I wished to check how many metabolites to keep, and I used the following code:
list.keepX ← seq(8, 232, by = 8)
tune_spls1_plasma_MSE ← mixOmics::tune(df_plasma, df_plasma_traits[, “age”],
ncomp = 2, method = “spls”,
test.keepX = list.keepX,
mode = “regression”, logratio = “none”,
validation = “Mfold”,
folds = 3, nrepeat = 100,
center = TRUE, scale = TRUE,
measure = “MSE”,
progressBar = TRUE, dist = “all”,
BPPARAM = BPPARAM,
seed = 42)
The suggested parameters were obtained by this:
tune_spls1_plasma_MSE$choice.ncomp$ncomp # 2 components
tune_spls1_plasma_MSE$choice.keepX # 232 and 88
Given the above and the associated tuning plot, I decided to go ahead with 2 components, and defined these as follows to “plug” them in my final model:
choice.ncomp ← tune_spls1_plasma_MSE$choice.ncomp$ncomp
choice.keepX ← tune_spls1_plasma_MSE$choice.keepX[1:choice.ncomp]
spls1_final_plasma ← spls(df_plasma, df_plasma_traits[, “age”],
ncomp = choice.ncomp, keepX = choice.keepX,
scale = TRUE, mode = “regression”)
I then run the following two lines, which confirm that 232 and 88 metabolites were used for each component, and checked the correlations between X and Y for each component:
comp1.choices ← selectVar(spls1_final_plasma, comp = 1)$X$name %>% as.data.frame()
comp2.choices ← selectVar(spls1_final_plasma, comp = 2)$X$name %>% as.data.frame()
print(cor(spls1_final_plasma$variates$X[,1], spls1_final_plasma$variates$Y[,1])) # 0.9002954
print(cor(spls1_final_plasma$variates$X[,2], spls1_final_plasma$variates$Y[,2])) # 0.8927105
Finally, I checked the model’s performance
perf_spls_plasma ← perf(spls1_final_plasma, dist = “all”,
validation = ‘Mfold’, folds = 3,
nrepeat = 100, seed = 42,
BPPARAM = BPPARAM, progressBar = TRUE)
Performance was not much improved for the spls1 model compared to the pls1 one, but it was not identical. However, when I run the following code, I get identical variance explained - this shouldn’t be happening though as component 2 used 88 metabolites only.
spls1_final_plasma$prop_expl_var$X # 0.1667005 0.1213769
pls1_plasma$prop_expl_var$X # 0.1667005 0.1219007 0.1853511
Shouldn’t these be different since the pls1 model had all 232 metabolites for each component? What am I doing wrong? Thank you in advance for your help!
P.S. I tried Mfold and loo validation methods, but my tuning plots and keepX suggestions from loo were a bit weird so I kept Mfold with fold = 3 given my small sample size.
Many thanks,
Evelyn