P-value in sPLS-DA plot

Hi,

is it possible to have a p-value that denotes the goodness of separation between groups, as depicted in the sPLS-DA plot produced by plotIndiv?

Does the concept of p-values fit in here (hypothesis testing)?

Or is there some other statistical measure(s) which can denote how good the separation is?

Thank you in anticipation of your responses!

Best regards,
SKD

Hi SKD,

the concept of p-values does not fit here, but there are many other ways to assess the “goodness of separation”, all based on how well the model performs.

Ways to assess this include, among others (see the R sketch after this list):

  1. You can add 95% confidence ellipses to your score plot (ellipse = TRUE) to highlight the strength of the discrimination (do the ellipses overlap?). This is the simplest way.
  2. You can overlay the prediction background on the score plot to see how the samples fall within each class area (?background.predict).
  3. You can create AUC plots, although these might not always reflect the sPLS-DA performance (?auroc).
  4. If you have a large dataset in terms of observations/samples in each group, you can split it into a train and test set, and evaluate how well your model predicts the class of new samples (?predict).
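
To make these concrete, here is a minimal R sketch of all four options, using the SRBCT data shipped with mixOmics purely for illustration. Substitute your own data matrix X and class factor Y; the keepX values, seed and train/test split are arbitrary placeholders, not recommendations.

```r
library(mixOmics)

data(srbct)
X <- srbct$gene    # data matrix (samples x variables)
Y <- srbct$class   # class factor

splsda.res <- splsda(X, Y, ncomp = 2, keepX = c(50, 50))

## 1. Score plot with 95% confidence ellipses
plotIndiv(splsda.res, comp = 1:2, ellipse = TRUE, legend = TRUE)

## 2. Overlay the prediction background on the score plot
bg <- background.predict(splsda.res, comp.predicted = 2, dist = "max.dist")
plotIndiv(splsda.res, comp = 1:2, background = bg, legend = TRUE)

## 3. AUC plot (interpret with care, see ?auroc)
auroc(splsda.res, roc.comp = 2)

## 4. Train/test split: fit on the training set, predict the test set
set.seed(42)   # arbitrary seed, for reproducibility only
train <- sample(seq_len(nrow(X)), size = round(0.7 * nrow(X)))
splsda.train <- splsda(X[train, ], Y[train], ncomp = 2, keepX = c(50, 50))
pred <- predict(splsda.train, newdata = X[-train, ], dist = "max.dist")
table(predicted = pred$class$max.dist[, 2], truth = Y[-train])
```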

If you have not seen it yet, I suggest you have a look at the SRBCT case study (http://mixomics.org/case-studies/splsda-srbct/), where these approaches are explained in more detail.

Thank you very much for your prompt and clear reply.
Yes, I already use all the approaches you mentioned to check the “goodness of separation”.
Glad to know that I didn't miss out on some method I was unaware of.

hi @SKD, and thanks @christoa for chipping in (this is much appreciated!).

In addition to what @christoa has mentioned, you could also run a permutation test. I include the text I am currently writing about this, but note that RVAideMemoire only includes nested CV, so it may not work for large data sets. We are considering adding this feature to mixOmics at some stage, but that won't be immediate.


You can have a look at the reference here: Hervé, Maxime R., Florence Nicolè, and Kim-Anh Lê Cao. 2018. “Multivariate Analysis of Multiple Datasets: A Practical Guide for Chemical Ecology.” Journal of Chemical Ecology 44 (3): 215–34.
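
For reference, a minimal sketch of what that permutation test might look like with RVAideMemoire. The function name MVA.test and the arguments shown are my reading of the package documentation, so treat them as assumptions and check ?MVA.test in your installed version.

```r
library(RVAideMemoire)

## X: data matrix, Y: class factor (as above).
## cmv = TRUE requests the nested (cross model validation) procedure
## described in Hervé et al. (2018); this can be slow on large data sets.
## nperm = 999 is an arbitrary placeholder.
perm.res <- MVA.test(X, Y, cmv = TRUE, model = "PLS-DA", nperm = 999)
perm.res   # reports a permutational p-value for the X/Y association
```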

Kim-Anh

Dear Kim-Anh,

many thanks for your additional reply, and particularly for the notes and reference you have provided. I will look into these further. Thanks again!

I did think about permutation tests, in the sense that you permute the labels/end-point and check whether you still obtain a similar separation after breaking the structure, which would indicate whether the original result was a true relation or a spurious signal. However, in a two-class problem, swapping the labels randomly will still leave some samples with the same label as before, so one would also need a check of the overlap with the original labels. The permutation results would therefore have to be displayed on two axes for a series of permutations, so to speak: the overlap fraction of permuted labels with the original labels, and the signal value thus obtained. Lastly, swapping all labels in a two-class problem just calls A as B and B as A, but inherently maintains the same data structure.

Is my understanding of how you alluded to using permutation tests correct, and does the limitation I highlight hold for binary classification problems?

Thank you again in advance for your insights and valuable discussions!
SKD

hi @SKD,

I am not sure I understand your concerns correctly. Permutation means that you randomly reassign the class labels in Y whilst the row order of X remains the same, so this should not depend on the number of classes, because it is a random allocation as opposed to a literal swap of the labels. By doing this many times, you obtain the distribution of the output (here, the classification error rate) and can measure how far your actual PLS-DA result lies from this distribution.
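
As a rough illustration of this scheme, here is a minimal sketch using perf() from mixOmics, with X as your data matrix and Y the class factor. The number of permutations, folds, repeats and the way I index the perf() output are placeholders that may need adjusting for your data and package version.

```r
library(mixOmics)

set.seed(42)     # arbitrary seed
n.perm <- 100    # arbitrary number of permutations

## Observed balanced error rate (BER) from cross-validation
plsda.res <- plsda(X, Y, ncomp = 2)
perf.res  <- perf(plsda.res, validation = "Mfold", folds = 5,
                  nrepeat = 10, progressBar = FALSE)
obs.ber   <- perf.res$error.rate$BER[2, "max.dist"]   # component 2

## Null distribution: shuffle Y, keep the X rows fixed, refit, re-evaluate
perm.ber <- replicate(n.perm, {
  Y.perm   <- sample(Y)   # random reallocation of the class labels
  perm.fit <- plsda(X, Y.perm, ncomp = 2)
  perf(perm.fit, validation = "Mfold", folds = 5,
       nrepeat = 1, progressBar = FALSE)$error.rate$BER[2, "max.dist"]
})

## Empirical p-value: how often a random labelling does at least as well
p.value <- (sum(perm.ber <= obs.ber) + 1) / (n.perm + 1)
p.value
```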

You can have a look at the reference I mentioned and the associated R code in RVAideMemoire.

Kim-Anh