ROC analysis on a PLS-DA model built on only training data

Hi everyone!

I’m new to PLS-DA analysis so I have some questions about it and I hope somebody can help me.

I’ve used the package to analyse a dataset using PLS-DA. When I plot the results, it appears that there is clear discrimination between my samples.

Now I want to further analyse how good this discriminative power is using ROC analysis and calculating the AUC. There’s only one problem: I haven’t split the data into a training set and a validation/test set because I don’t have enough samples to do so. Is calculating the AUC on my PLS-DA model using only the training data/training set of any added value? And if I calculate the AUC using only the training set and get an AUC of, let’s say, 0.9, does this mean that my PLS-DA model has a discriminative accuracy of 90%? I’m assuming this is not correct, because you would need a validation and test set in order to be able to evaluate the discriminative power of the model, right?

Best regards,

Gabby

Hello @Gabby,

There are a few points that I think I should clarify about the usage of the AUROC curve:

Is calculating the AUC on my PLS-DA model using only the training data/training set of any added value?

The short answer is no. A model is going to be extremely good at classifying the samples it was taught on, as those samples drove the way the model was formed. While “training accuracy” is a metric that is assessed, it is almost always reported in conjunction with “testing accuracy” (sometimes referred to as “validation accuracy”) so that the two can be compared. Having a high training accuracy but a low testing accuracy is an easy-to-identify indicator of overfitting. Hence, if you only have the training accuracy, any report on the model’s performance will be extremely over-optimistic. At the end of the day, model building is all about generalisability, something which training accuracy does not measure at all.
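
To make this concrete, here is a minimal sketch of comparing training and testing error for a PLS-DA model in mixOmics. The data are simulated and all object names (X, Y, test.idx, train.model) are made up for this example rather than taken from your analysis:

library(mixOmics)

# simulated stand-in data, purely for illustration (no real signal in it)
set.seed(42)
X <- matrix(rnorm(40 * 20), nrow = 40, ncol = 20,
            dimnames = list(NULL, paste0("var", 1:20))) # 40 samples, 20 variables
Y <- factor(rep(c("classA", "classB"), each = 20))      # two balanced classes

# hold out ~25% of the samples as a test set
test.idx <- sample(seq_len(nrow(X)), size = 10)

train.model <- plsda(X[-test.idx, ], Y[-test.idx], ncomp = 2)

# predicted classes (max.dist rule, 2 components) on training vs held-out samples
train.pred <- predict(train.model, X[-test.idx, ])$class$max.dist[, 2]
test.pred  <- predict(train.model, X[ test.idx, ])$class$max.dist[, 2]

mean(train.pred != Y[-test.idx]) # training error: optimistic
mean(test.pred  != Y[ test.idx]) # testing error: reflects generalisability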

if I calculate the AUC using only the training set and get an AUC of let’s say 0.9, does this mean that my PLS-DA model has a discriminative accuracy of 90%?

Again, the simple answer is no. The Receiver Operating Characteristic (ROC) curve depicts how the True Positive Rate (TPR) and the False Positive Rate (FPR) change as your classification threshold changes. Taking the Area Under the Curve (AUC) is a way of condensing that information into a single value. It is by no means a direct measure of the performance of the model. It is better used as a way to evaluate the relative performance of candidate models (eg. models using different component counts). “Discriminative accuracy”, by contrast, is calculated at a single, fixed threshold as the proportion of samples classified correctly: accuracy = (TP + TN) / (TP + TN + FP + FN). So an AUC of 0.9 does not mean that 90% of samples are classified correctly.
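
To make the distinction concrete, here is a tiny hand-worked example (hypothetical predictions, not mixOmics output) of these quantities at one fixed threshold:

# hypothetical true and predicted labels for six samples
truth     <- factor(c("pos", "pos", "pos", "neg", "neg", "neg"), levels = c("pos", "neg"))
predicted <- factor(c("pos", "pos", "neg", "neg", "neg", "pos"), levels = c("pos", "neg"))

tab <- table(predicted, truth)
TP <- tab["pos", "pos"]; FP <- tab["pos", "neg"]
FN <- tab["neg", "pos"]; TN <- tab["neg", "neg"]

TP / (TP + FN)                  # True Positive Rate (sensitivity)
FP / (FP + TN)                  # False Positive Rate (1 - specificity)
(TP + TN) / (TP + TN + FP + FN) # accuracy at this single threshold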

This article was useful for my understanding, I’d encourage you to have a read.

I’m assuming this is not correct because you would need a validation and test set in order to be able to evaluate the discriminative power of the model, right?

This assumption is correct. However, it is not so much due to the lack of validation data, but for the reason described above. The absence of validation data merely exacerbates the issue, as the AUROC is then measuring the performance of the model on the data it was trained on.

Lastly, I believe it’s important to address this comment:

I haven’t split the data into a training set and validation/test set because I don’t have enough samples to do so

If this is the case, discriminant analysis may not be appropriate, given that your sample size must be very small. You could try using “Leave One Out” cross-validation, where you build a model on all but one sample (meaning the remaining sample is your entire testing set), and then repeat this so that every sample is tested once. However, as mentioned above, model building is all about generalisability, and having such a small sample size decreases the likelihood that your model can generalise at all.
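
If it helps to see the idea written out, here is a hedged sketch of a leave-one-out loop done by hand (mixOmics’ perf() does this for you; X and Y are the simulated objects from the earlier sketch):

# leave-one-out cross-validation "by hand"
n <- nrow(X)
loo.pred <- character(n)
for (i in seq_len(n)) {
  fit <- plsda(X[-i, , drop = FALSE], Y[-i], ncomp = 2) # train on all but sample i
  loo.pred[i] <- predict(fit, X[i, , drop = FALSE])$class$max.dist[, 2] # predict the left-out sample
}
mean(loo.pred != Y) # leave-one-out estimate of the classification error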

I hope this was of use for you. If you have any more questions feel free to ask.

Cheers,
Max.

Hi Max,

Thank you very much for your answer! It is really helpful! I was told PLS-DA would be the right fit for my data, but after your response I’m starting to hesitate. I was wondering if I could get your expertise as a second opinion.

So I have a dataset of 30 observations containing 20 variables each. These observations are labelled into two different classes: bacterial infection and viral infection (so 15 observations are from patients with a bacterial infection and 15 from patients with a viral infection). I want to see whether, using these 20 variables, there is discrimination between the two classes (bacterial vs viral). I used plot.plsda and it showed that there was indeed discrimination.

-Is 30 observations too little to use for PLS-DA?
-Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and viral infection, based on the 20 variables?

And in order to assess how good this discriminative power is / to assess the relative performance of the PLS-DA model, I want to, following your suggestion, use “Leave One Out” cross-validation. I’m thinking of using the following code for it; is this the right code?

-perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)

And do you know if there’s a way to plot the ROC curve for this LOOCV? Because when I run the code I only get the AUC results but not the plot. And speaking of the AUC, I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?

Thanks in advance!

Best regards,

Gabby

So I have a dataset of 30 observations containing 20 variables each

With a sample size of 30, validation is definitely possible. I was under the impression you had significantly fewer samples than this. Using 5-10 samples as a test set would be appropriate. Also, I would again encourage you to explore the use of Leave-One-Out Cross-Validation (LOOCV) to effectively boost your sample size when it comes to tuning and assessing your final model.

I used plot.plsda and it showed that there was indeed discrimination

I’m going to assume you mean plotIndiv() on the plsda object as there is no function plot.plsda() within mixOmics.

Is 30 observations too little to use for PLS-DA?

Referring to Goodhue, D. L., Lewis, W., & Thompson, R. (2012), “Does PLS Have Advantages for Small Sample Size or Non-Normal Data?”, MIS Quarterly, 36(3), 981–1001, https://doi.org/10.2307/41703490: this paper explored the effect of a small sample size on the PLS algorithm. In the case of n = 20, the false positive rate was appropriate, viable solutions were found in the vast majority of cases, and overall accuracy was comparable to that of larger-sample models. Its efficacy decreased as the models became more complex, but it was still deemed an appropriate method. Note that although this paper uses PLS, PLS-DA derives from the same algorithm. Hence I am extending the results to your case - but be aware that the two are not entirely equivalent.

Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and viral infection, based on the 20 variables?

Without any context or seeing the plot, I cannot confirm this. I’d suggest looking at the ellipse parameter of plotIndiv(). If you see no overlap of the resulting 95% confidence ellipses, that is fairly good evidence that your model can discriminate between the classes. However, just looking at plots is not an empirical measure of discriminative ability.
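
For example (reusing the object name from your own snippet, myresult.plsda, purely as an assumption):

# sample plot on components 1 and 2 with 95% confidence ellipses per class
plotIndiv(myresult.plsda, comp = c(1, 2),
          ellipse = TRUE, legend = TRUE,
          title = "PLS-DA: bacterial vs viral")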

perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)

That seems appropriate. The code I would use myself:

library(mixOmics) # load the mixOmics package

data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene  # use the gene expression data as the X matrix
Y <- srbct$class # use the class labels as the Y factor

initial.plsda <- plsda(X, Y, ncomp = 5)

# ---------------------------------------------------------------------------- #

plsda.perf <- perf(initial.plsda, validation = "loo",
                   progressBar = FALSE, auc = TRUE) # nrepeat is not needed with "loo"
plot(plsda.perf) # look at the output of the perf function

optimal.ncomp <- plsda.perf$choice.ncomp["BER", "centroids.dist"]

final.plsda <- plsda(X, Y, ncomp = optimal.ncomp)
auroc(final.plsda, roc.comp = 1) # change roc.comp to look at models using different numbers of components

do you know if there’s a way to plot the ROC curve for this LOOCV?

As in the above code, to get a ROC plot you must use the auroc() function on a plsda object (not on the output of perf()). The perf() function will only give you the AUROC values, unfortunately, not the plot.

I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?

The $auc.all component provides the AUROC values for each component, across each repeat, whereas the $auc component shows the average across all repeats. Hence $auc is more appropriate for general use. However, in your case you are using validation = "loo", meaning that nrepeat = 1 (more repeats would yield exactly the same results, hence nrepeat > 1 is redundant). This means that $auc and $auc.all will be identical, so it doesn’t really matter which you use.
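
For reference, reusing the plsda.perf object from the code block above, the two components can be inspected like this:

plsda.perf$auc     # average AUROC per component (across repeats)
plsda.perf$auc.all # AUROC per component for every individual repeat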

Hope this is all clear.

Cheers,
Max.

Hi Max,

Thanks once again for your quick response. Your help and expertise are greatly appreciated.
I’m going to try and implement your suggestions and hopefully it will work out fine 🙂

Best regards,

Gabby

Hi Max!

I hope all is well with you. Your help on my previous problems really helped me to move forward, but I came across this problem that I can’t seem to solve myself.

When I conduct the LOOCV-analysis on a 2-class problem (for example bacterial vs viral infection) I get the following output:

As you can see, I get an AUC for components 1 and 2, and the AUC lies approximately between 0.7 and 0.8 depending on the component. So the interpretation of these AUC values is quite straightforward.

When I conduct the LOOCV analysis on a 3-class problem (for example bacterial vs viral vs parasite infection) I get the following output:

I’m struggling to understand how to interpret the output of this analysis, since there are three AUC values given per component. I hope you can help me understand which of these values should be used for the AUC analysis of the LOOCV.

Thanks in advance!

Best regards,

Gabby

Sorry for the slow reply.

In multi-class (more than two classes) problems, AUROC (and other metrics) use one of two overarching types of evaluation: ‘one-vs-one’ and ‘one-vs-others’. mixOmics uses the latter of the two. I like to think about it like this. Let’s take a two-class problem (classes A and B). Describing True/False Positives/Negatives is quite simple: we can specifically select class A or B to be our “positive” class and go from there, i.e. A is the positive class and the model classifies a sample as A. If this sample is in reality of class B, this is a False Positive - the model falsely selected the positive (A) class.

Let’s try to apply the same philosophy to a three-class problem with classes A, B and C. What is the positive class? The problem is no longer binary, so the binary notions of positive and negative no longer apply directly. Here’s what we do: we iterate over each class and condense the problem down to a binary one. In our first iteration, A is positive. Therefore, any sample of class B or C is now described as negative - B and C are treated as essentially equivalent (for this iteration only). We can now calculate the True/False Positive/Negative metrics using A = positive and B/C = negative.

Now we move on to the second iteration, where B = positive and A/C = negative, calculate our metrics, and then move on to the last iteration, where C is the positive class. This means we end up with three AUROC values. This is why in the output each value is described as class vs Other(s) (see below).

[auroc() output showing one AUC per “class vs Other(s)” comparison]

Going back to the two class problem and using this iterative methodology, the two AUROC values are complementary to one another and therefore essentially describe the same thing. In other words, it doesn’t matter which of class A or B is the positive class.

Hence, in a two class problem we report a single value and in an N-class problem we report N values.
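
If you want to see the mechanics, here is a small self-contained sketch of the ‘one-vs-others’ idea on simulated scores (not mixOmics output), using the rank-sum identity for the AUC:

# simulated 3-class example, purely to illustrate 'one class vs the others'
set.seed(1)
truth  <- factor(rep(c("A", "B", "C"), each = 10))       # true classes
scores <- matrix(runif(30 * 3), ncol = 3,
                 dimnames = list(NULL, c("A", "B", "C"))) # fake per-class scores

# AUC of a binary problem via the rank-sum (Mann-Whitney) identity
auc_binary <- function(score, is.positive) {
  r <- rank(score)
  n.pos <- sum(is.positive)
  n.neg <- sum(!is.positive)
  (sum(r[is.positive]) - n.pos * (n.pos + 1) / 2) / (n.pos * n.neg)
}

# one AUC per class: that class vs Other(s)
sapply(levels(truth), function(cls) auc_binary(scores[, cls], truth == cls))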

Hope this clears things up

Dear Max,
How do we interpret the p-value of the AUC? If it is larger than 0.05, does it mean the model is not better than pure chance (AUC = 0.5)?
Thank you!
Stef

A p-value is the probability of observing a result at least as extreme as the one obtained if the null hypothesis were true. This is why 0.05 is a common threshold: a p-value below it means the observed result would be quite unlikely under the null hypothesis, so the finding is declared statistically significant. Hence, the p-value attached to an AUC describes whether the observed level of specificity/sensitivity is significantly different from that of a non-discriminating model (AUC = 0.5).

If you have a p-value greater than 0.05, it means that the AUROC is not statistically distinguishable from that of a model which produces an AUROC of 0.5 - i.e. you cannot claim your model discriminates better than chance.
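
As a rough illustration of where such a p-value can come from (a sketch only, not necessarily the exact test mixOmics runs): testing whether an AUC differs from 0.5 is equivalent to a Wilcoxon/Mann-Whitney test comparing the prediction scores of the two classes.

# hypothetical prediction scores for the two classes (made-up numbers)
set.seed(2)
score.classA <- rnorm(15, mean = 0.6, sd = 0.2)
score.classB <- rnorm(15, mean = 0.4, sd = 0.2)

# a small p-value is evidence that the AUC differs from 0.5 (better than chance)
wilcox.test(score.classA, score.classB)$p.value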