Prediction results

Dear all,

I did a tuned sPLS-DA to discriminate between two groups. The group separation is clear and the per-class error rate is 0.01 for both classes. I am really happy with this, and it is consistent with the biological results for both groups and with other analyses.
However, when I try to predict the class of new data (3 samples per class), the classification correctly classifies one class but completely fails for the other (see below).
[image: prediction results]
I was so surprised by this that, as a trial, instead of passing new data as the test matrix I passed the same data I had used for training. The results were exactly the same.

Is this normal, or am I doing something wrong?

Thank you

Best regards

Fabien Filaire

Hi Fabien!

First of all, I am not part of the mixOmics team, so my answer may not be entirely accurate.

From what you describe, it is likely that you have built a classification model that is too tightly tuned to the training data (overfitting), so when you pass it the unlabelled test data, the model is unable to classify it well.

To begin with, are the groups balanced when you train the model? If they are not, I have used SMOTE to balance the groups, while keeping in mind the risks of balancing groups synthetically. I'll leave some material here in case you are interested in reading more about this.
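In case it helps to see the core idea: SMOTE creates synthetic minority samples by interpolating between a real minority sample and one of its nearest minority neighbours. mixOmics itself is an R package, so this is just a language-agnostic sketch in Python with k = 1 neighbour for brevity; the function name `smote_sketch` is made up for this example, and in practice you would use an existing implementation (e.g. `imbalanced-learn` in Python or `smotefamily` in R).

```python
import random
import math

def smote_sketch(X_min, n_new, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between a chosen sample and its nearest minority neighbour
    (the core SMOTE idea, simplified to k = 1 for brevity)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(X_min)
        # nearest neighbour of x within the minority class (excluding x itself)
        nn = min((p for p in X_min if p is not x),
                 key=lambda p: math.dist(x, p))
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + u * (ni - xi) for xi, ni in zip(x, nn)))
    return synthetic

# toy example: a minority class of 3 samples, upsampled with 3 synthetic points
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_sketch(minority, n_new=3)
print(len(minority) + len(new_points))  # minority class is now size 6
```

The synthetic points always lie on segments between existing minority samples, which is exactly why Marta's caveat matters: they are plausible but not real observations.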

Best regards,
Marta


Hi Marta,

Thank you for your answer.

Yes, I assumed that too, but I am really surprised that it didn't work better the second time.
My groups are perfectly balanced… but thanks for the tip, I'll need something like this soon :)

Fabien

hi @Fabien-Filaire

@Margonmon is correct about the overfitting issue. We don't know the sample size of your training set, but basically the model is not generalising well to new data (it seems to be biased towards that first Ctrl class).
Of note: when we perform cross-validation (in tune and predict) we subsample so as to respect the class imbalance, and this is why we recommend reporting the Balanced Error Rate (BER) rather than the overall error rate.
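To illustrate why BER is the safer metric when classes are unbalanced: the overall error rate (ER) weights every sample equally, so a large, well-classified class can mask a complete failure on a small class, whereas BER averages the per-class error rates. This is a generic sketch, not mixOmics code; the function name `error_rates` is made up for the example.

```python
def error_rates(y_true, y_pred):
    """Return (overall error rate, balanced error rate)."""
    classes = sorted(set(y_true))
    # ER: fraction of all samples misclassified
    er = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    # BER: mean of the per-class error rates
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_true[i] != y_pred[i] for i in idx) / len(idx))
    ber = sum(per_class) / len(per_class)
    return er, ber

# Unbalanced case: 9 "Ctrl" samples all correct, the single "Case" sample wrong
y_true = ["Ctrl"] * 9 + ["Case"]
y_pred = ["Ctrl"] * 10
print(error_rates(y_true, y_pred))  # (0.1, 0.5): BER exposes the failed class
```

When the classes are balanced, the two metrics coincide, which is why reporting BER is a safe default either way.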

Kim-Anh

Hi @kimanh.lecao ,

Thank you for your answer.
Actually, I changed my pre-processing and it now works better. I had variables that just added confusion without contributing anything.

I do not understand your point about BER. My understanding was to focus on BER when my initial groups were not balanced. Mine are (3 groups of 15 samples each).

Thank you again

Best regards

Fabien


Great @Fabien-Filaire,
I had no information about whether your groups were balanced or not, so it was a generic comment (BER and ER are the same if your classes are balanced anyway).

Kim-Anh