Problems with AUC for splsda object

Hello,

I have a metabolomics dataset with 2 observations from each participant, collected under different conditions. I used splsda() to perform sPLS-DA and had trouble getting the model to converge, so I set near.zero.var = TRUE and had to raise the tolerance to 0.2. When I run perf() with the option AUC = TRUE, I get the error message:

Error in cut.default(cases, thresholds) : ‘breaks’ are not unique

If I use AUC=FALSE, then perf() completes without errors. However, I want to report the AUC.

When I stratify the observations by condition, so that each participant has only 1 observation, there is no problem with the AUC. So although the sample size is relatively small (N=21), I don’t think that’s the issue.

Many thanks for any assistance.

hi @lpyle,

It is hard for us to diagnose what is going on without the details of the perf() call you ran. The error may come from the fact that the sPLS-DA model has very limited discriminative capacity, so the AUC/ROC cannot be calculated. You should first look at the classification error rate: this measure is more reflective of the performance of the model than the AUC, which is of limited use and over-optimistic for sPLS-DA (see the paper; I copy the relevant text below). Make sure you also use 3-fold or leave-one-out cross-validation.
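Something along these lines would report the error rate instead of the AUC (a sketch only: it uses the srbct example data shipped with mixOmics, so substitute your own metabolite matrix and condition factor, and your own ncomp/keepX values):

```r
library(mixOmics)

## Hypothetical example data -- replace X and Y with your own
data(srbct)
X <- srbct$gene   # predictor matrix (samples x variables)
Y <- srbct$class  # outcome factor

## Illustrative tuning values, not a recommendation for your data
model <- splsda(X, Y, ncomp = 2, keepX = c(50, 50))

## Repeated 3-fold CV with auc = FALSE; with a small N you can use
## validation = "loo" for leave-one-out instead
set.seed(42)
perf.model <- perf(model, validation = "Mfold", folds = 3,
                   nrepeat = 10, auc = FALSE, progressBar = FALSE)

## Overall and balanced (BER) classification error rates per component
perf.model$error.rate
```

The error rates are reported per component and per prediction distance, so you can also use this output to check how many components are worth keeping.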

Performance assessment.
Once the optimal parameters have been chosen (number of components and number of variables to select), the final model is run on the whole data set X, and the performance of the final model in terms of classification error rate is estimated using the perf function and repeated CV. Additional evaluation outputs include the receiver operating characteristic (ROC) curves and Area Under the Curve (AUC) averaged over the cross-validation process using one-vs-all comparison if K > 2. AUC is a commonly used measure to evaluate a classifier discriminative ability. It incorporates the measures of sensitivity and specificity for every possible cut-off of the predicted dummy variables. However, as presented in Section ‘Prediction distances’, our PLS-based models rely on prediction distances, which can be seen as a determined optimal cut-off. Therefore, the ROC and AUC criteria may not be particularly insightful in relation to the performance evaluation of our supervised multivariate methods, but can complement the statistical analysis.

Kim-Anh