Different PLS-DA loadings using mixOmics and DiscriMiner

Hello,
I have a 16S microbiome dataset and have been using mixOmics to fit a PLS-DA and plot the loadings of my features (X) against my factor loadings (Y).
As the mixOmics plsda() does not directly provide model R2 and Q2 values, I also tried the DiscriMiner package.
But when comparing the results, I am having 2 issues:

  1. The R2 and Q2 values output by DiscriMiner are different from those obtained using mixOmics pls() + perf().
  2. The X and Y loadings are also quite different.

I am concerned because I don't know the cause of these differences between packages; maybe someone could shed some light on this…

I can provide code if necessary.
Thanks!

Hi ALG,
We have not benchmarked against DiscriMiner, so those differences are hard to explain and could come from various sources:

  • the cross-validation folds are different to start with
  • our PLS-DA centres and scales the variables (PLS-style); does DiscriMiner do this too?
  • the prediction formula / method can vary slightly

Can you quantify your statement 'X and Y loadings are also pretty much different', e.g. with a correlation coefficient?
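For example, something along these lines (a rough sketch only: mix_fit is a placeholder for your plsda() fit, and dm_Xload stands for the X loading matrix pulled out of the DiscriMiner result, whose exact slot name I have not checked):

# Sketch only: 'mix_fit' is a mixOmics plsda() result; 'dm_Xload' is a placeholder
# for the X loading matrix extracted from the DiscriMiner plsDA() result.
library(mixOmics)
mix_Xload <- mix_fit$loadings$X   # p x ncomp matrix of X loadings

# Pearson correlation between matched components (columns);
# values near +1 or -1 mean the loadings agree up to sign and scale
sapply(seq_len(ncol(mix_Xload)),
       function(k) cor(mix_Xload[, k], dm_Xload[, k]))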

And have you compared the prediction performance of the two methods? This is really what matters (classification error rate). We have not implemented R2 and Q2 because PLS-DA sits in a classification framework, whereas those criteria are for regression (the reason they can be calculated with PLS-DA is that PLS-DA is a special case of regression PLS, but we decided not to take that route for now).
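If you do want those criteria, a rough sketch of that regression-PLS route would look like the following (the perf() output element names $Q2.total and $R2 depend on the mixOmics version, so treat them as assumptions):

# Rough sketch, not an official mixOmics recipe: dummy-code the outcome and
# run a regression-mode PLS to get Q2/R2 from perf().
# 'X' is the feature matrix, 'fac' the class factor (placeholder names).
library(mixOmics)
Y_dummy  <- unmap(fac)                                  # dummy-coded outcome
pls_fit  <- pls(X, Y_dummy, ncomp = 3, mode = "regression")
pls_perf <- perf(pls_fit, validation = "loo")
pls_perf$Q2.total                                       # Q2 per component
pls_perf$R2                                             # R2 per Y column and component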

Kim-Anh

Thanks for your answer.
I will explain the points you ask about in more detail.
First, here are the function parameters I have used:

Using DiscriMiner:
library(DiscriMiner)
plsDA(X, Y, autosel = FALSE, comps = 3, cv = "LOO")

Using mixOmics:
library(mixOmics)
res_plsda <- plsda(X, Y, ncomp = 3, mode = "regression")
perf(res_plsda, validation = "loo")

I do not know whether DiscriMiner scales the variables; the package documentation is sparse. I can only link to the plsDA function for more info…

Regarding the X and Y loadings, I have calculated the Pearson correlation for each component:
[image: Pearson correlations between mixOmics and DiscriMiner loadings, per component]
Although the correlations are almost 1 (I don't understand why components 1 and 3 have a negative correlation), the Y loading values differ a lot, maybe because of scaling, but it makes a huge difference when plotting the X and Y loadings together in a loading plot.

Here are the Y loadings:
[image: Y loading values from both packages]

The classification error rate is difficult to compare between the two packages, as yours presents a complete table with overall and BER error rates for different distances, while DiscriMiner only outputs a single error rate:

With mixOmics:

$overall
          max.dist  centroids.dist  mahalanobis.dist
comp 1   0.7446809       0.7234043         0.7234043
comp 2   0.4680851       0.5319149         0.4468085
comp 3   0.4468085       0.4893617         0.4680851

$BER
          max.dist  centroids.dist  mahalanobis.dist
comp 1   0.8000000       0.7585470         0.7585470
comp 2   0.5185897       0.5365385         0.4514957
comp 3   0.4651709       0.4950855         0.4792735

With DiscriMiner:
error_rate = 0.1702128

Lastly, if R2 and Q2 are not available, is there any estimate of model prediction accuracy and of the predictive relevance of my variables that I could use?

Thanks a lot for your help.
Adrián López

Hi Adrián,

Ah, thank you for the details, that makes it easier for us to see what is going on.

  • Your correlations are fine and very good. You would not expect a correlation of exactly 1 because of small approximations here and there. The negative sign is nothing to worry about, as the methods can rotate the components one way or the other, and it does not affect the interpretation. So what is happening on the loading output is just a scaling issue, depending on how the loadings were scaled internally in each method. To reassure yourself, plot the DiscriMiner Y loadings against the mixOmics Y loadings and you should see a straight line (with 4 points…); see the sketch after this list. The components should be very similar between both approaches.

  • The LOOCV results with mixOmics indicate that the problem is not easy to classify; consider adding more components. The balanced error rate (BER), which takes unbalanced classes into account, is higher than the overall error rate, indicating that you do not have the same number of samples per class. As to why DiscriMiner gives a much lower error rate, it is hard to say. I would double check that it actually calculates the error rate based on LOOCV (and not on the whole data set, otherwise it would be overfitting).

  • The prediction performance / accuracy can be calculated as 1 - classification error rate. You could also use the auroc() function from mixOmics, but be aware that this is calculated on the whole data set (or you can also estimate it from perf() using cross-validation).
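Putting the last two bullets into a rough sketch (mix_fit, mix_perf and dm_Yload are placeholder names for your plsda() fit, its perf() output and the DiscriMiner Y loading matrix; the error-rate element names follow your output above but may differ between mixOmics versions):

# Sketch only: mix_fit, mix_perf and dm_Yload are placeholders.
library(mixOmics)

# 1) Y loadings from both packages plotted against each other: a straight line
#    (possibly with a negative slope) means they agree up to sign and scale.
plot(mix_fit$loadings$Y[, 1], dm_Yload[, 1],
     xlab = "mixOmics Y loadings (comp 1)",
     ylab = "DiscriMiner Y loadings (comp 1)")

# 2) Accuracy as 1 - classification error rate, from the perf() output
1 - mix_perf$error.rate$overall[, "max.dist"]

# 3) AUC computed on the whole data set (optimistic, no cross-validation)
auroc(mix_fit)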

Kim-Anh

Hi Andrián,
I have edited my answer as my initial interpretation was wrong :)

Thanks a lot for your explanations.
These days I have been trying other PLS software (SIMCA and the R package ropls), and I am still suspicious about the Y loading scaling mixOmics is doing, as these two other tools produce loading plots similar to the DiscriMiner one…

The loading plot with mixOmics:

The loading plot with DiscriMiner, ropls and SIMCA (it is pretty similar across the three pipelines):

Sorry for the insistence, but this is something I want to sort out, as the representation makes a large difference, and I presume the interpretation does too.
Is there any scaling parameter I can modify in the mixOmics functions? Would it be valid to apply a multiplicative scaling factor to reduce the magnitude of the Y loadings?

Thanks a lot for your help.
Adrián

Hi Adrian,
Could you share the code you use to produce these plots (especially how you overlay the 'Y' information)? And why would you be interested in using mixOmics if SIMCA does the job?
You would have more options with SIMCA if you are interested in PLS-DA. mixOmics is more specialised towards variable selection, and we have our own plots (with colours, ellipse plots, etc.) to emphasise those aspects, so I would like to clarify what your intentions are first.

Kim-Anh

I use mixOmics because I do not have a SIMCA licence available; I only downloaded the trial version to check the similarity with the R packages…
I tried to reproduce the loading plots that SIMCA produces, because I think they are pretty informative, using the mixOmics outputs and ggplot2. Here is the pipeline I implemented:

# 0) Libraries and Y variable (4-level factor)
library(mixOmics)
library(ggplot2)
Y <- unmap(var)

# 1) Run pls with optimal ncomp
pls1 <- pls(X, Y, ncomp = 3)
# Filter X matrix using a VIP cutoff threshold = 1
pls_vip <- vip(pls1)
pls_vip <- pls_vip[pls_vip[, 3] > 1, ]
X_vip <- X[, colnames(X) %in% row.names(pls_vip)]

# 2) Re-run pls with this VIP cutoff
pls2 <- pls(X_vip, Y, ncomp = 3)
pls2_tune <- perf(pls2, validation = "Mfold", folds = 5, nrepeat = 10)
# Trying to calculate model Q2 and R2 (I don't think it is well calculated)
sum(pls2_tune$Q2.total)
sum(apply(pls2_tune$R2, 2, mean))

# 3) PLS-DA with optimal ncomp and X = X_vip
plsda_fit  <- plsda(X_vip, var, ncomp = 3)
plsda_tune <- perf(plsda_fit, validation = "Mfold", folds = 5, nrepeat = 10)

## Loading plot: VIP cutoff = 1
plsda_loads  <- as.data.frame(plsda_fit$loadings$X)
plsda_groups <- as.data.frame(plsda_fit$loadings$Y)

ploads <- ggplot(plsda_loads, aes(x = `comp 1`, y = `comp 2`)) + 
  geom_hline(yintercept = 0, size = 0.5) + 
  geom_vline(xintercept = 0, size = 0.5) +
  geom_point(shape = 21) + 
  annotate("point", x = plsda_groups$`comp 1`, y = plsda_groups$`comp 2`, shape = 15, size = 6) +
  annotate("text", x = plsda_groups$`comp 1`, y = plsda_groups$`comp 2` - 0.05, label = row.names(plsda_groups), hjust = 0.5, size = 6)

The Y information is overlaid with ggplot2's annotate(), taking the Y loadings (plsda_fit$loadings$Y).

Adrian

Hi Adrian,
OK, I understand your plots now, but only partially. I do not know if SIMCA has released their algorithm (I would suspect not), but we use the one proposed originally by Wold (see attached), where the loadings are normed, and there are theoretical reasons for this in the case of a PLS method.
As I said earlier, for PLS-DA the method is considered a special case of PLS where the Y variables are dummy-coded, so yes, the Y loadings are normed too. The loading plot you show is quite peculiar; we are certainly not used to this. Instead, we prefer to project the variables onto the dimensions and represent them in a correlation circle plot (plotVar()). The Q2 and R2 are calculated according to the book of Tenenhaus et al., using cross-validation. Again, if SIMCA does not release their method, it is hard to assess the difference between the software packages.
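For example, a one-line sketch on your fitted object from the pipeline above (here plsda_fit; the cutoff value is arbitrary):

# Correlation circle plot of the X variables on components 1 and 2;
# cutoff hides variables with a low correlation to the components.
library(mixOmics)
plotVar(plsda_fit, comp = c(1, 2), var.names = TRUE, cutoff = 0.5)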

Lastly, we have sparse PLS and sparse PLS-DA methods that enable variable selection, and this is not based on VIP. Your code, as shown, seems to indicate some level of overfitting (the VIP filtering is done on the full data set before the performance assessment).
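A minimal sparse PLS-DA sketch, reusing your X and var objects (the keepX values are arbitrary here and would normally be tuned, e.g. with tune.splsda()):

# Minimal sparse PLS-DA sketch: select a fixed number of features per component
# instead of filtering on VIP. keepX values are arbitrary and should be tuned.
library(mixOmics)
splsda_fit  <- splsda(X, var, ncomp = 3, keepX = c(20, 20, 20))
splsda_perf <- perf(splsda_fit, validation = "Mfold", folds = 5, nrepeat = 10)
selectVar(splsda_fit, comp = 1)$name   # features selected on component 1
plotVar(splsda_fit, comp = c(1, 2))    # correlation circle of the selected variables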

Feel free to amend our PLS code to fit your needs, as I don't think we will be able to help further in your endeavours.

Kim-Anh
