So I have a dataset of 30 observations containing 20 variables each

With a sample size of 30, validation is definitely possible. I was under the impression you had significantly fewer than this. Holding out 5-10 samples as a test set would be appropriate. Also, I would again encourage you to explore the use of Leave-One-Out Cross Validation (LOOCV) to effectively boost your sample size when it comes to tuning and assessing your final model.
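To make the hold-out idea concrete, here is a minimal sketch of such a split. The object names (`X`, `Y`, and the split sizes) are hypothetical stand-ins for your own 30-by-20 data matrix and class labels:

```r
library(mixOmics)

# Hypothetical objects: X is your 30 x 20 data matrix, Y your class labels
set.seed(42)                              # reproducible split
test.idx <- sample(seq_len(nrow(X)), 6)   # reserve 6 of the 30 samples for testing

X.train <- X[-test.idx, ]; Y.train <- Y[-test.idx]
X.test  <- X[test.idx, ];  Y.test  <- Y[test.idx]

train.plsda <- plsda(X.train, Y.train, ncomp = 2)
predict(train.plsda, X.test)              # assess performance on the held-out samples
```

With only 30 observations, LOOCV via `perf()` (shown further down) will give you a much less variable estimate than a single 6-sample test set, so treat the hold-out as a sanity check rather than your primary metric.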

I used plot.plsda and it indeed showed that there was discrimination

I’m going to assume you mean `plotIndiv()` on the `plsda` object, as there is no function `plot.plsda()` within `mixOmics`.

Is 30 observations too little to use for PLS-DA?

Referring to *Goodhue, D. L., Lewis, W., & Thompson, R. (2012). Does PLS Have Advantages for Small Sample Size or Non-Normal Data? MIS Quarterly, 36(3), 981–1001. https://doi.org/10.2307/41703490*, the effect of a small sample size on the PLS algorithm was explored. In the case of n = 20, the false positive rate was appropriate, viable solutions were found in the vast majority of cases, and overall accuracy was comparable to that of larger-sample models. Its efficacy decreased as models became more complex, but it was still deemed an appropriate method. Note that while this paper used PLS, PLS-DA derives from that algorithm; hence I am extending the results, but be aware that they are not entirely equivalent.

Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and viral infection, based on the 20 variables?

Without any context or seeing the plot, I cannot confirm this. I’d suggest looking at the `ellipse` parameter of `plotIndiv()`. If you see no overlap of the resulting 95% confidence ellipses, that is fairly good evidence that your model can discriminate between them. However, just looking at plots is not an empirical measure of discriminative ability.
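As a sketch, the call would look like the following, using the `myresult.plsda` object from your own code:

```r
library(mixOmics)

# Sample plot with per-class 95% confidence ellipses
plotIndiv(myresult.plsda,
          comp = c(1, 2),      # which components to plot
          ind.names = FALSE,   # hide individual sample labels
          ellipse = TRUE,      # draw 95% confidence ellipses per class
          legend = TRUE)
```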

```
perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)
```

That seems appropriate. The code I would use myself:

```
library(mixOmics) # load the package
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class data as the Y matrix
initial.plsda <- plsda(X, Y, ncomp = 5)
# ---------------------------------------------------------------------------- #
plsda.perf <- perf(initial.plsda, validation = "loo",
                   progressBar = FALSE, auc = TRUE)
plot(plsda.perf) # look at the output of the perf function
optimal.ncomp <- plsda.perf$choice.ncomp["BER", "centroids.dist"]
final.plsda <- plsda(X, Y, ncomp = optimal.ncomp)
auroc(final.plsda, roc.comp = 1) # change this value to look at models using diff dim
```

Do you know if there’s a way to plot the ROC curve for this LOOCV?

As in the above code, to achieve a ROC plot you must use the `auroc()` function on a `plsda` object (not on the output of `perf()`). Unfortunately, the `perf()` function only gives you the AUROC values, not the plot.

I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?

The `$auc.all` component provides the AUROC values for each component across each repeat, whereas the `$auc` component shows the average across all repeats. Hence `$auc` is more appropriate in general contexts. **However, in your case** you are using `validation = "loo"`, meaning that `nrepeat = 1` (more repeats would yield the exact same results, hence `nrepeat > 1` is redundant). This means that `$auc` and `$auc.all` will be the exact same, so it doesn’t really matter which you use.
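For reference, both components can be inspected directly on the `perf()` output (here using the `plsda.perf` object from the earlier code):

```r
# Inspect the two AUROC components of the perf() result
plsda.perf$auc      # mean AUROC per component, averaged across repeats
plsda.perf$auc.all  # AUROC per component for each individual repeat
```

Under LOOCV these print the same numbers, which is an easy way to confirm the point above for yourself.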

Hope this is all clear.

Cheers,

Max.