ROC analysis on a PLS-DA model built on only training data

Hi everyone!

I’m new to PLS-DA analysis so I have some questions about it and I hope somebody can help me.

I’ve used the package to analyse a dataset using PLS-DA. After plotting the results, it appears that there is clear discrimination between my samples.

Now I want to further analyse how good this discriminative power is using ROC analysis and calculating the AUC. There’s only one problem: I haven’t split the data into a training set and a validation/test set because I don’t have enough samples to do so. Is calculating the AUC on my PLS-DA model using only the training set of any added value? And if I calculate the AUC using only the training set and get an AUC of, let’s say, 0.9, does this mean that my PLS-DA model has a discriminative accuracy of 90%? I’m assuming this is not correct, because you would need a validation and test set in order to be able to evaluate the discriminative power of the model, right?

Best regards,

Gabby

Hello @Gabby,

There are a few points that I think I should clarify about the use of the ROC curve and its AUC:

Is calculating the AUC on my PLS-DA model using only trainingdata/trainingset of any added value?

The short answer is no. A model is going to be extremely good at classifying the samples it was trained on, as those samples drove the way the model was formed. While “training accuracy” is a metric that is assessed, it is almost always reported in conjunction with “testing accuracy” (sometimes referred to as “validation accuracy”), because this allows comparison between the two. Having a high training accuracy but a low testing accuracy is an easy-to-identify indicator of overfitting. Hence, if you only have the training accuracy, any report on the model’s performance will be extremely over-optimistic. At the end of the day, model building is all about generalisability, something which training accuracy does not measure at all.

if I calculate the AUC using only the training set and get an AUC of let’s say 0.9, does this mean that my PLS-DA model has a discriminative accuracy of 90%?

Again, the simple answer is no. The Receiver Operating Characteristic (ROC) curve depicts how the True Positive Rate (TPR) and the False Positive Rate (FPR) change as your classification threshold changes. Taking the Area Under the Curve (AUC) is a way of condensing that information into a single value. It is by no means a direct measure of the performance of the model; it is better used as a way to evaluate the relative performance of candidate models (e.g. models using different component counts). “Discriminative accuracy”, on the other hand, is simply the proportion of samples classified correctly, i.e. (TP + TN) / (TP + TN + FP + FN) - a different quantity from the AUC.
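For intuition, here is a small base-R sketch (toy data, nothing to do with mixOmics or your dataset) of how a ROC curve is traced out by sweeping the classification threshold over predicted scores:

set.seed(1)
labels <- rep(c(1, 0), each = 25)                      # 1 = positive class, 0 = negative class
scores <- c(rnorm(25, mean = 1), rnorm(25, mean = 0))  # higher score = more "positive"

# sweep a threshold from high to low and record TPR and FPR at each step
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE), -Inf)
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))  # true positive rate
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))  # false positive rate

plot(fpr, tpr, type = "l", xlab = "FPR", ylab = "TPR", main = "Toy ROC curve")
abline(0, 1, lty = 2)  # the diagonal corresponds to a random classifier (AUC = 0.5)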

This article was useful for my understanding, I’d encourage you to have a read.

I’m assuming this is not correct because you would need a validation and test set in order to be able to evaluate the dicriminative power of the model right?

This assumption is correct. Note, however, that the problem with the AUC described above is not due to the lack of validation data; the absence of validation data merely exacerbates it, as the AUROC is then measuring the performance of the model on the data it was trained on.

Lastly, I believe it’s important to address this comment:

I haven’t split the data into a trainingset and validation/test set because I don’t have enough samples to do so

If this is the case, discriminant analysis may not be appropriate, given that your sample size must be very small. You could try using “Leave One Out” cross-validation, such that you generate a model on all but one sample (meaning the remaining sample is your entire testing set), and then repeat this so that every sample is tested on once. However, as mentioned above, model building is all about generalisability, and having such a small sample size decreases the likelihood that your model can generalise at all.
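For illustration, here is a rough sketch of the Leave-One-Out idea, assuming X is your data matrix and Y your factor of class labels (check these names and the structure of the predict() output against your own objects; in practice the perf() function shown further down does this for you):

library(mixOmics)

n <- nrow(X)
predicted <- character(n)

for (i in seq_len(n)) {
  fit <- plsda(X[-i, , drop = FALSE], Y[-i], ncomp = 2)         # train on all but sample i
  pred <- predict(fit, X[i, , drop = FALSE], dist = "max.dist") # predict the held-out sample
  predicted[i] <- pred$class$max.dist[1, 2]                     # class call using 2 components
}

mean(predicted == as.character(Y))  # proportion of held-out samples classified correctly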

I hope this was of use for you. If you have any more questions feel free to ask.

Cheers,
Max.

Hi Max,

Thank you very much for your answer! It is really helpful! I was told PLS-DA would be the right fit for my data, but after your response I’m starting to hesitate. I was wondering if I could get your expertise as a second opinion.

So I have a dataset of 30 observations containing 20 variables each. These observations are labelled into two different classes: bacterial infection and viral infection (so 15 observations are of patients with a bacterial infection and 15 observations are of patients with a viral infection). I want to see whether, using these 20 variables, there is discrimination between the two classes (bacterial vs viral). I used plot.plsda and it showed that there was indeed discrimination.

-Is 30 observations too little to use for PLS-DA?
-Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and a viral infection, based on the 20 variables?

And in order to assess how good this discriminative power is / to assess the relative performance of the PLS-DA model, I want to, following your suggestion, use ‘‘Leave One Out’’ cross-validation. I’m thinking of using the following code for it; is this the right code?

perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)

And do you know if there’s a way to plot the ROC curve for this LOOCV? Because when I run the code I only get the AUC results but not the plot. And speaking of the AUC, I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?

Thanks in advance!

Best regards,

Gabby

So i have a dataset of 30 observations containing 20 variables each

With a sample size of 30, validation is definitely possible; I was under the impression you had significantly fewer samples than this. Using 5-10 samples as a test set would be appropriate. Also, I would again encourage you to explore the use of Leave-One-Out Cross-Validation (LOOCV) to make the most of your sample size when it comes to tuning and assessing your final model.
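As a hedged sketch of such a hold-out split (again assuming X and Y are your data matrix and class factor; adjust the test size, ncomp and the predict() output indexing to your own case):

set.seed(123)
test.idx <- sample(seq_len(nrow(X)), 6)          # hold out 6 of the 30 samples for testing
X.train <- X[-test.idx, ]; Y.train <- Y[-test.idx]
X.test  <- X[test.idx, ];  Y.test  <- Y[test.idx]

train.plsda <- plsda(X.train, Y.train, ncomp = 2)
test.pred   <- predict(train.plsda, X.test, dist = "max.dist")
table(predicted = test.pred$class$max.dist[, 2], truth = Y.test)  # confusion matrix on held-out samples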

I used plot.plsda and it showed that there was indeed discrimination

I’m going to assume you mean plotIndiv() on the plsda object as there is no function plot.plsda() within mixOmics.

Is 30 observations too little to use for PLS-DA?

Referring to Goodhue, D. L., Lewis, W., & Thompson, R. (2012). Does PLS Have Advantages for Small Sample Size or Non-Normal Data? MIS Quarterly, 36(3), 981–1001, which explored the effect of a small sample size on the PLS algorithm: in the case of n = 20, the false positive rate was acceptable, viable solutions were found in the vast majority of cases, and overall accuracy was comparable to that of models built on larger sample sizes. Its efficacy decreased as the models became more complex, but PLS was still deemed an appropriate method. Note that although this paper examined PLS, PLS-DA derives from that algorithm, hence I am extending the results to your case - but be aware that the two are not entirely equivalent.

Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and a viral infection, based on the 20 variables?

Without any context or seeing the plot, I cannot confirm this. I’d suggest looking at the ellipse parameter of plotIndiv(). If you see no overlap of the resulting 95% confidence ellipses, that is fairly good evidence that your model can discriminate between the two classes. However, just looking at plots is not an empirical measure of discriminative ability.
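For example (a sketch using myresult.plsda, the object name from your post):

plotIndiv(myresult.plsda,
          comp = c(1, 2),        # components to plot
          ellipse = TRUE,        # draw confidence ellipses around each class
          ellipse.level = 0.95,  # 95% confidence level
          legend = TRUE)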

perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)

That seems appropriate. The code I would use myself:

library(mixOmics) # load the mixOmics package

data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the tumour class labels as the outcome Y

initial.plsda <- plsda(X, Y, ncomp = 5)

# ---------------------------------------------------------------------------- #

plsda.perf <- perf(initial.plsda, validation = "loo", 
                   progressBar = FALSE, auc = TRUE)
# nrepeat is not needed here: LOO validation involves no random fold assignment
plot(plsda.perf) # look at the output of the perf function

optimal.ncomp <- plsda.perf$choice.ncomp["BER", "centroids.dist"]

final.plsda <- plsda(X, Y, ncomp = optimal.ncomp)
auroc(final.plsda, roc.comp = 1) # change roc.comp to look at models using different numbers of components

do you know if there’s a way to plot the ROC curve for this LOOCV?

As in the code above, to produce a ROC plot you must use the auroc() function on a plsda object (not on the output of perf()). The perf() function only returns the AUROC values; it does not produce the plot, unfortunately.

I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?

The $auc.all component provides the AUROC values for each component across each repeat, whereas the $auc component shows the average across all repeats. Hence $auc is more appropriate in general. However, in your case you are using validation = "loo", meaning that nrepeat = 1 (more repeats would yield exactly the same result, hence nrepeat > 1 is redundant). This means that $auc and $auc.all will be identical, so it doesn’t really matter which you use.
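To be explicit, both are components of the perf() output (using the plsda.perf object from the code above):

plsda.perf$auc      # AUROC per component, averaged over repeats
plsda.perf$auc.all  # AUROC per component for every individual repeat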

Hope this is all clear.

Cheers,
Max.

Hi Max,

Thanks once again for your quick response. Your help and expertise is greatly appreciated.
I’m going to try and implement your suggestions and hopefully it will work out fine :slight_smile:

Best regards,

Gabby


Hi Max!

I hope all is well with you. Your help on my previous problems really helped me to move forward, but I came across this problem that I can’t seem to solve myself.

When I conduct the LOOCV-analysis on a 2-class problem (for example bacterial vs viral infection) I get the following output:

As you can see, I get an AUC for comp 1 and 2, and the AUC lies approximately between 0.7 and 0.8 depending on the component. So the interpretation of these AUC values is quite straightforward.

When I conduct the LOOCV analysis on a 3-class problem (for example bacterial vs viral vs parasite infection) I get the following output:

I’m struggling to understand how to interpret the output of this analysis, since there are 3 AUC values given per component. I hope you can help me understand which of these values should be used for the AUC analysis of the LOOCV.

Thanks in advance!

Best regards,

Sorry for the slow reply.

In multi-class (more than two classes) problems, AUROC (and other metrics) use one of two overarching types of evaluation: ‘one-vs-one’ and ‘one-vs-others’. mixOmics uses the latter of these two. I like to think about it like this. Let’s take a two-class problem (classes A and B). Describing True/False Positives/Negatives is quite simple: we can specifically select class A or B to be our “positive” class and go from there, e.g. A is the positive class and the model classifies a sample as A. If this sample is in reality of class B, this is a False Positive - the model falsely selected the positive (A) class.

Let’s try to apply the same philosophy to a three-class problem with classes A, B and C. What is the positive class? The problem is no longer binary, so the binary nature of positive and negative no longer applies. Here’s what we do: we iterate over each class and condense the problem down to a binary one. In the first iteration, A is positive; therefore, any sample of class B or C is now described as negative - B and C are treated as essentially equivalent (for this iteration only). We can now calculate the True/False Positive/Negative metrics using A = positive and B/C = negative.

Now, we move on to the second iteration, where B = positive and A/C = negative, calculate our metrics, and then move on to the last iteration where C is the positive class. This means we end up with three AUROC values. This is why, in the output, each value is described as class vs Other(s) (see below).

[screenshot of the auroc output showing one AUROC value per “class vs Other(s)” comparison]

Going back to the two class problem and using this iterative methodology, the two AUROC values are complementary to one another and therefore essentially describe the same thing. In other words, it doesn’t matter which of class A or B is the positive class.

Hence, in a two class problem we report a single value and in an N-class problem we report N values.
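If it helps, here is a small base-R sketch (toy data, not mixOmics code) of the ‘one-vs-others’ idea, using the rank-based (Wilcoxon) formulation of the AUC:

set.seed(42)
Y <- factor(sample(c("A", "B", "C"), 30, replace = TRUE))  # toy class labels
score <- rnorm(30)                                         # some continuous predicted score

for (cls in levels(Y)) {
  y.bin <- as.integer(Y == cls)   # current class = positive, every other class = negative
  r <- rank(score)                # rank-based AUC (equivalent to the Wilcoxon statistic)
  n.pos <- sum(y.bin == 1); n.neg <- sum(y.bin == 0)
  auc <- (sum(r[y.bin == 1]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
  cat(cls, "vs Other(s): AUC =", round(auc, 3), "\n")
}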

Hope this clears things up

Dear Max,
How do we interpret the p-value of the AUC? If it is larger than 0.05, does it mean the model is no better than pure chance (AUC = 0.5)?
Thank you!
Stef

A p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. This is why 0.05 is a common threshold: a result with a p-value below it would be quite unlikely to arise by chance alone, so we treat it as a significant finding. Hence, the p-value attached to an AUC value describes whether the observed level of specificity/sensitivity differs significantly from that of a random classifier (AUC = 0.5).

If you have a p-value greater than 0.05, it means that the AUROC is not statistically different from that of a model which produces an AUROC of 0.5.
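For intuition only (this is not necessarily the exact test mixOmics performs): testing whether an AUC differs from 0.5 is equivalent to a Wilcoxon rank-sum test comparing the predicted scores of the two classes, e.g. on toy data:

set.seed(7)
labels <- rep(c("A", "B"), each = 15)
scores <- c(rnorm(15, mean = 0.5), rnorm(15, mean = 0))  # class A tends to score higher

wilcox.test(scores[labels == "A"], scores[labels == "B"])$p.value
# a small p-value indicates the scores separate the classes better than chance (AUC != 0.5)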

Hello,

I have a proteomic dataset with 12 samples, 2 classes and 2500 variables. I estimated that, due to my low sample size, it is better to use loo validation (please correct me if I’m wrong). The nrepeat cannot be set to anything higher than 1:

In perf.mixo_splsda(trial.splsda, validation = “loo”, progressBar = FALSE, :
Leave-One-Out validation does not need to be repeated: ‘nrepeat’ is set to ‘1’

Is there another way to repeat it (I assume nrepeat = no. samples)?

Another question I have is regarding error rates. After running perf and getting an optimal ncomp, the error rates don’t change, so they remain quite high:

$overall
max.dist centroids.dist mahalanobis.dist
comp1 0.33333333 0.33333333 0.33333333
comp2 0.25000000 0.25000000 0.25000000
comp3 0.16666667 0.16666667 0.16666667
comp4 0.08333333 0.08333333 0.08333333
comp5 0.08333333 0.08333333 0.08333333
$BER
max.dist centroids.dist mahalanobis.dist
comp1 0.33333333 0.33333333 0.33333333
comp2 0.25000000 0.25000000 0.25000000
comp3 0.16666667 0.16666667 0.16666667
comp4 0.08333333 0.08333333 0.08333333
comp5 0.08333333 0.08333333 0.08333333

AND

$overall
max.dist centroids.dist mahalanobis.dist
comp1 0.3333333 0.3333333 0.3333333
comp2 0.2500000 0.2500000 0.2500000
comp3 0.1666667 0.1666667 0.1666667
$BER
max.dist centroids.dist mahalanobis.dist
comp1 0.3333333 0.3333333 0.3333333
comp2 0.2500000 0.2500000 0.2500000
comp3 0.1666667 0.1666667 0.1666667

On top of it, the AUC and p value seem very suspicious:

$Comp1
AUC p-value
EDDHA vs Fe 1 0.003948
$Comp2
AUC p-value
EDDHA vs Fe 1 0.003948
$Comp3
AUC p-value
EDDHA vs Fe 1 0.003948
$Comp4
AUC p-value
EDDHA vs Fe 1 0.003948

Oddly enough, using another dataset with the same structure gives the exact same AUC and p value.

I have also tried Mfold validation with 3 folds. I either get a relatively good sPLSDA with error rates of around 0.15, or I get an optimal ncomp of 1 and cannot run sPLSDA. It’s different every time I run the code…

Am I missing something?

hi @Rina

Is there another way to repeat it (I assume nrepeat = no. samples)?

If you use loo you cannot repeat it: you remove one sample at a time, so there is only one way of doing this; it is not a random process like M-fold cross-validation.

Another question I have is regarding error rates. After running perf and getting an optimal ncomp, the error rates dont change so they remain quite high

Then it means there is no classification improvement when you add a component. I would use comp = 1 in the interpretation (but of course use 2 for plotting).

On top of it, the AUC and p value seem very suspicious

Yes, the AUC has always given somewhat optimistic results. This is why we don’t really recommend using it to make a decision on parameters. We cover this on our website and in the book.

Oddly enough, using another dataset with the same structure gives the exact same AUC and p value.

Probably just (bad) luck? :slight_smile:

I have also tried Mfold validation with 3 folds. I either get a relatively good sPLSDA with error rates of around 0.15, or I get an optimal ncomp of 1 and cannot run sPLSDA. It’s different every time I run the code…
Am I missing something?

With Mfold, the training samples are assigned randomly to each fold. This is where you need nrepeat = 10 or 50 in order to get more stable results.
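For example (a sketch using trial.splsda, the object name from your warning message):

perf.mfold <- perf(trial.splsda,
                   validation = "Mfold", folds = 3,
                   nrepeat = 50,          # average over 50 random fold assignments
                   progressBar = FALSE)
plot(perf.mfold)  # error rates are now much more stable across runs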

Kim-Anh