sPLS-DA hyperparameter tuning in an unbalanced dataset

Hi there,

First of all, thank you for developing such a comprehensive package and learning resources! I am currently working with a dataset of ~200 samples, 20 different exposures, and a binary outcome (1/0). I had some questions about how to use sPLS-DA with unbalanced datasets:

  1. Hyperparameter tuning:
    When tuning hyperparameters (ncomp and KeepX) for a dataset where the outcome is rare (e.g., only 10% of samples are in the positive class), what is the best performance metric to use?
  • Would AUC be preferred over overall error rate or Balanced Error Rate (BER) in this scenario for optimizing both ncomp and KeepX?
  • If error rate is recommended, which of the three prediction distances (max.dist, centroids.dist, mahalanobis.dist) would be best for optimizing ncomp?
  2. Extracting component scores for exposure “signature”:
    I’m hoping to extract the component scores as a “signature” score for the exposures I’m looking at.
  • What’s the correct way to extract component scores from an splsda model? Is it using splsda.model$variates$X ?
  • If optimized ncomp is > 1, how should I integrate the different component scores into downstream regression modeling? Should all components be included as predictors, or is there a way to combine all components into 1 score?
  • If I just want to use the component scores as my signature score, is it still best practice to split the data into training/testing sets? I worry that splitting may limit model performance, and that the reduced sample size could compromise the stability of the component loadings.
  • If I split my data into training and testing sets, can I use predict(splsda_model, X_test)$variates$X to obtain component scores for the test set, and then combine these scores with the training set component scores (from splsda_model$variates$X) to create a unified dataset for downstream analysis?

Thanks in advance for your help—really appreciate this community and the great tools I’m learning!

Hi @ys2004,

Thanks for using mixOmics!

  1. The Balanced Error Rate (BER) takes unbalanced class allocation into account, so it is the best measure for tuning your hyperparameters here. As for the three prediction distances, the right one will depend on your dataset: it is the one that gives the lowest BER when you run the tuning/performance functions (see our webpage for more details, and the first code sketch below this list).

  2. I’m not sure I’ve fully understood what you want to do, but I believe you are trying to use the components of the sPLS-DA model to predict your binary outcome. This can be done directly in mixOmics using the predict() function, as you identified, and if you want to assess the accuracy of your model with cross-validation you can use the perf() function.
    If you want to extract the components and use them as predictors in downstream regression modelling, you can do this with splsda.model$variates$X, as you said. You can either keep the components separate or combine them into a single score (e.g. using a weighted average, or just use the first component if adding more doesn’t improve accuracy much). If you want to assess the quality of your model you will have to split the data into a training and test set, or have an unseen test set with known outcomes (see the second sketch below).
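
To make point 1 concrete, here is a minimal sketch of that tuning workflow, assuming your exposures are in a matrix X and the binary outcome is a factor Y. The object names, keepX grid, fold/repeat numbers and the ncomp upper bound are all illustrative rather than prescriptive:

```r
library(mixOmics)

## Hypothetical inputs: X is a ~200 x 20 matrix of exposures,
## Y is a factor with the binary outcome (~10% in the positive class).
# X <- as.matrix(exposures)
# Y <- factor(outcome)

set.seed(123)   # reproducible cross-validation

## Step 1: fit a model with a generous ncomp and run perf() to compare
## the three prediction distances; pick the one with the lowest BER.
splsda.full <- splsda(X, Y, ncomp = 4)
perf.res <- perf(splsda.full, validation = "Mfold", folds = 5,
                 nrepeat = 50, progressBar = FALSE)
perf.res$error.rate$BER   # BER per component for each distance
plot(perf.res)

## Step 2: tune keepX (and confirm ncomp) using BER and the chosen distance.
list.keepX <- c(1:10, seq(12, 20, 2))   # candidate numbers of exposures to keep

tune.res <- tune.splsda(X, Y,
                        ncomp       = 4,           # upper bound; final ncomp comes from the results
                        validation  = "Mfold",
                        folds       = 5,           # small enough that every fold contains positives
                        nrepeat     = 50,          # repeats stabilise estimates in a small, unbalanced set
                        dist        = "max.dist",  # replace with the distance chosen in Step 1
                        measure     = "BER",       # Balanced Error Rate accounts for the imbalance
                        test.keepX  = list.keepX,
                        progressBar = FALSE)

tune.res$choice.ncomp   # suggested number of components
tune.res$choice.keepX   # suggested keepX per component
plot(tune.res)          # BER across the keepX grid
```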

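And a minimal sketch for point 2: extracting the component “signature” scores from the training fit and projecting a held-out test set with predict(). Again, the object names, split proportion and tuned values are hypothetical, and it is worth checking str() of the predict() output in your mixOmics version to confirm where the predicted components are stored:

```r
library(mixOmics)

## Hypothetical objects as above: X (exposures), Y (binary outcome factor).
## Assuming the second factor level is the rare positive class.

set.seed(123)

## Stratified train/test split so the rare class appears in both sets
pos <- which(Y == levels(Y)[2])
neg <- which(Y == levels(Y)[1])
test.id <- c(sample(pos, round(0.3 * length(pos))),
             sample(neg, round(0.3 * length(neg))))
X.train <- X[-test.id, , drop = FALSE]; Y.train <- Y[-test.id]
X.test  <- X[ test.id, , drop = FALSE]; Y.test  <- Y[ test.id]

## Fit sPLS-DA on the training data only, using the tuned ncomp/keepX
final.ncomp <- 2            # illustrative values; use your tuning results
final.keepX <- c(10, 5)
splsda.train <- splsda(X.train, Y.train, ncomp = final.ncomp, keepX = final.keepX)

## Component ("signature") scores for the training samples
train.scores <- splsda.train$variates$X
colnames(train.scores) <- paste0("comp", 1:final.ncomp)

## Projected scores for the held-out test samples; the predicted components
## should be in the $variates element of the predict() output
## (check str(pred) in your mixOmics version if this differs)
pred <- predict(splsda.train, newdata = X.test)
test.scores <- pred$variates
colnames(test.scores) <- paste0("comp", 1:final.ncomp)

## Combine into one data frame for downstream regression
scores <- rbind(data.frame(train.scores, outcome = Y.train, set = "train"),
                data.frame(test.scores,  outcome = Y.test,  set = "test"))

## e.g. logistic regression on the components (or just comp1 as a single score)
# glm(outcome ~ comp1 + comp2, family = binomial,
#     data = subset(scores, set == "train"))
```
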
Hope that helps!
Eva