Help deciding the number of components in PLS-DA

I have a lipidomics dataset with more than 1800 features and 30 samples and I want to apply PLSDA. So, I used the following code to do this:

HLGA_plsda ← plsda(Normalized_data, Groups, ncomp = 10, scale = FALSE)
perf_plsda ← perf(HLGA_plsda, validation = “Mfold”,
folds = 3, nrepeat = 100,
progressBar = TRUE, auc = TRUE)

plot(perf_plsda, col = color.mixo(5:7), sd = TRUE,
legend.position = “horizontal”)
perf_plsda$choice.ncomp

and I get this plot:

Untitled

Now, I get that only one component is the best with “$choice.ncomp”, although I feel it is not reasonable and I should choose 2. Also, how come I have a very low error rate, does it mean the model is overtrained? or am I doing something wrong?
I chose not to scale because I get a higher classification error rate with scaling.

Also, using RVAideMemoire package, cross validation with 2 components gives me a lower classification error rate than one component:

MVA.cv(Normalized_data, Groups, repet = 100, k = 3, ncomp = 1, scale = FALSE, model =“PLS-DA”)

Mean (standard error) classification error rate (%): 0.2 (0.08), but with 2 components I get 0% (0)

So, I am really confused how to decide the number of components, can anyone help? Both models with one or 2 components are significant upon permutation.

I am really new to this area so any advice will help!
Thank you

hi @IsraaHH

Based on this plot you show, it would make sense to use at least 2 components, but I suspect the sd for comp 1 and comp 2 are similar and overlapping, hence the suggested output of 1 component. But from a practical perspective 2 comp would make it easier for plotting etc.

I don’t know what is calculated in RVAideMemoire. In mixOmics/perf we use one sided t-test to define whether the decrease in error rate is significant or not as you add an extra component.

Re low error rate: it seems that your data are highly separable. (It may depend on whether you have pre-screened the 1,800 features and how).

Kim-Anh

Hi @ kimanh.lecao

We have two datasets that are generated with different cutoffs of VIP scores from 30 samples.

    1. Dataset 1: 869 differentially expressed proteins, 5509 differentially expressed genes and 2760 DMRs in exons.
    1. Dataset 2: 869 differentially expressed proteins, 5509 differentially expressed genes and 458 DMRs in exons.

We used the following code to generate the plot and to choose number of components.

Design matrix

design = matrix(0.1, ncol = length(data), nrow = length(data),
dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s
design
basic.diablo.model = block.splsda(X = data, Y = y, ncomp = 10, design = design)

Tuning the number of components

perf.diablo = perf(basic.diablo.model, validation = ‘Mfold’, folds = 10, nrepeat = 50)
plot(perf.diablo)

We obtained similar image irrespective of changing the number of features or the number of components.

Note: We tried by varying the features from ncomp =2 to ncomp = 15.

Can you please help us understand why we are getting overall error rate line at zero and how can we select the number of components and the “distance” method?

With Kind Regards
Priyanka Ramesh

hi @r.priyanka1802,

I believe this is because you are including the differentially expressed variables already in the model, and so in that case you are overfitting.

These approaches should be applied in an unbiased way, we pre filter only the top (say up to 5,000) most variable variables in each data set.

If you are interested only in visualisation, then I would not bother about tuning the number of components, and just show the first 2 components. It seems that you will obtain a perfect (albeit biased) separation of your groups.

Kim-Anh