So I have a dataset of 30 observations containing 20 variables each.
With a sample size of 30, validation is definitely possible; I was under the impression you had significantly fewer than this. Using 5-10 samples as a test set would be appropriate. I would also again encourage you to explore Leave-One-Out Cross-Validation (LOOCV) to effectively boost your sample size when tuning and assessing your final model.
I used `plot.plsda` and it showed that there was indeed discrimination.
I'm going to assume you mean `plotIndiv()` on the `plsda` object, as there is no `plot.plsda` function.
Is 30 observations too little to use for PLS-DA?
Goodhue, Lewis, & Thompson (2012) explored the effect of small sample sizes on the PLS algorithm ("Does PLS Have Advantages for Small Sample Size or Non-Normal Data?", MIS Quarterly, 36(3), 981–1001, https://doi.org/10.2307/41703490). In the case of n = 20, the false positive rate was appropriate, viable solutions were found in the vast majority of cases, and overall accuracy was comparable to models built on larger samples. Its efficacy decreased as the models became more complex, but it was still deemed an appropriate integration method. Note that this paper examined PLS rather than PLS-DA; since PLS-DA derives from the PLS algorithm I am extending the results to your case, but be aware that the two are not entirely equivalent.
Can i draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and viral infection, based on the 20 variables?
Without any context or seeing the plot, I cannot confirm this. I'd suggest looking at the `ellipse` parameter of `plotIndiv()`. If you see no overlap of the resulting 95% confidence ellipses, that is fairly good evidence that your model can discriminate between the two classes. However, just looking at plots is not an empirical measure of discriminative ability.
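As a sketch of what I mean, using the `srbct` example data shipped with mixOmics (swap in your own bacterial/viral data and `plsda` object; the parameters shown are standard `plotIndiv()` arguments):

```r
library(mixOmics)

data(srbct)            # example data shipped with mixOmics
X <- srbct$gene        # predictor matrix
Y <- srbct$class       # class labels

example.plsda <- plsda(X, Y, ncomp = 2)

# Draw 95% confidence ellipses around each class; non-overlapping
# ellipses are informal (not empirical) evidence of discrimination
plotIndiv(example.plsda,
          comp = c(1, 2),
          ellipse = TRUE,   # 95% confidence level by default
          legend = TRUE,
          title = "PLS-DA: components 1 and 2")
```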
perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)
That seems appropriate. The code I would use myself:
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class data as the Y matrix
initial.plsda <- plsda(X, Y, ncomp = 5)
# ---------------------------------------------------------------------------- #
plsda.perf <- perf(initial.plsda, validation = "loo",
                   progressBar = FALSE, auc = TRUE)
# note: nrepeat is left at its default, as repeats are redundant with LOOCV
plot(plsda.perf) # look at the output of the perf function
optimal.ncomp <- plsda.perf$choice.ncomp["BER", "centroids.dist"]
final.plsda <- plsda(X, Y, ncomp = optimal.ncomp)
auroc(final.plsda, roc.comp = 1) # change roc.comp to look at models using different numbers of components
Do you know if there's a way to plot the ROC curve for this LOOCV?
As in the above code, to produce a ROC plot you must use the `auroc()` function on a `plsda` object (not on the output of `perf()`). Via the `perf()` function you can get the AUROC values, but not the plot, unfortunately.
I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?
The `$auc.all` component provides the AUROC values for each component across each repeat, whereas the `$auc` component shows the average across all repeats; hence `$auc` is more appropriate in general contexts. However, in your case you are using `validation = "loo"`, which means `nrepeat = 1` (more repeats would yield exactly the same results, so `nrepeat > 1` is redundant). As a result, `$auc` and `$auc.all` will be identical, so it doesn't really matter which you use.
Hope this is all clear.