So I have a dataset of 30 observations containing 20 variables each.
With a sample size of 30, validation is definitely possible. I was under the impression you had significantly less than this. Using 5-10 samples as testing samples would be appropriate. Also, I would again encourage you to explore the use of Leave-One-Out Cross Validation (LOOCV) to effectively boost your sample size when it comes to tuning and assessing your final model.
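For reference, LOOCV can be run directly through the `perf()` function in mixOmics. A minimal sketch, assuming `my.plsda` is a placeholder name for an already fitted `plsda` object:

```r
library(mixOmics)

# my.plsda is assumed to be an already fitted model, e.g.:
# my.plsda <- plsda(X, Y, ncomp = 2)

# leave-one-out cross validation: each sample is held out once
loocv.perf <- perf(my.plsda, validation = "loo",
                   progressBar = FALSE)

# classification error rates per component
loocv.perf$error.rate
```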
I used plot.plsda and it showed that there was indeed discrimination.
I’m going to assume you mean `plotIndiv()` on the `plsda` object, as there is no function `plot.plsda()` within mixOmics.
Is 30 observations too little to use for PLS-DA?
Referring to Goodhue, D. L., Lewis, W., & Thompson, R. (2012). Does PLS Have Advantages for Small Sample Size or Non-Normal Data? MIS Quarterly, 36(3), 981–1001: this paper explored the effect of a small sample size on the PLS algorithm. In the case of n = 20, the false positive rate was appropriate, viable solutions were found in the vast majority of cases, and overall accuracy was comparable to that of larger-sample models. Its efficacy decreased as the models became more complex, but it was still deemed an appropriate method. Note that this paper used PLS; PLS-DA derives from that algorithm, hence I am extending the results, but be aware that they are not entirely equivalent.
Can I draw the conclusion that, based on the PLS-DA plot, there is discrimination between the patients with a bacterial and a viral infection, based on the 20 variables?
Without any context or seeing the plot, I cannot confirm this. I’d suggest looking at the `ellipse` parameter of `plotIndiv()`. If you see no overlap of the resulting 95% confidence ellipses, that is fairly good evidence that your model can discriminate between them. Just looking at plots is not an empirical measure of discriminative ability, however.
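As a sketch of what I mean (the object name `my.plsda` is a placeholder for your fitted model):

```r
library(mixOmics)

# my.plsda is assumed to be a fitted plsda object
plotIndiv(my.plsda,
          ellipse = TRUE,   # draw 95% confidence ellipses per class
          legend = TRUE,
          title = "PLS-DA sample plot")
```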
perf(myresult.plsda, validation = "loo", progressBar = FALSE, auc = TRUE)
That seems appropriate. The code I would use myself:
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class data as the Y matrix
initial.plsda <- plsda(X, Y, ncomp = 5)
# ---------------------------------------------------------------------------- #
plsda.perf <- perf(initial.plsda, validation = "loo",
                   progressBar = FALSE, auc = TRUE)
plot(plsda.perf) # look at the output of the perf function
optimal.ncomp <- plsda.perf$choice.ncomp["BER", "centroids.dist"]
final.plsda <- plsda(X, Y, ncomp = optimal.ncomp)
auroc(final.plsda, roc.comp = 1) # change this value to inspect models of different dimensions
Do you know if there’s a way to plot the ROC curve for this LOOCV?
As in the above code, to produce a ROC plot you must use the `auroc()` function on a `plsda` object (not the output of `perf()`). Via the `perf()` function you can only obtain the AUROC values, not the plot, unfortunately.
I get two AUC results: a normal AUC and an AUC.ALL. Which one should I ideally use?
The `$auc.all` component provides the AUROC values for each component, across each repeat, whereas the `$auc` component shows the average across all repeats. Hence `$auc` is more appropriate in general contexts. However, in your case you are using `validation = "loo"`, meaning that `nrepeat = 1` (more repeats would yield exactly the same results, hence `nrepeat > 1` is redundant). This means that `auc` and `auc.all` will be identical, so it doesn’t really matter which you use.
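To make this concrete, both components can be pulled from the `perf()` output like so (assuming `plsda.perf` was produced with `auc = TRUE`, as in the code earlier):

```r
# plsda.perf is the output of perf() run with auc = TRUE
plsda.perf$auc      # AUROC averaged across repeats, per component
plsda.perf$auc.all  # AUROC for every individual repeat

# with validation = "loo", these contain identical values
```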
Hope this is all clear.
Cheers,
Max.