Hi there,
First of all, thank you for developing such a comprehensive package and learning resources! I am currently working with a dataset of ~200 samples, 20 different exposures, and a binary outcome (1/0). I had some questions about how to use sPLSDA with unbalanced datasets:
- Hyperparameter tuning:
When tuning the hyperparameters (ncomp and keepX) for a dataset where the outcome is rare (e.g., only 10% of samples are in the positive class), what is the best performance metric to use?
- Would AUC be preferred over the overall error rate or the Balanced Error Rate (BER) in this scenario for optimizing both ncomp and keepX?
- If error rate is recommended, which of the three prediction distances (max.dist, centroids.dist, mahalanobis.dist) would be best for optimizing ncomp? (A sketch of the tuning workflow I have in mind follows this list.)
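To make the tuning questions concrete, here is a minimal sketch of the workflow I have in mind, assuming `X` is my 200 × 20 exposure matrix and `Y` is the binary outcome coded as a factor (all object names and the candidate keepX grid are placeholders):

```r
library(mixOmics)

# Step 1: fit with a generous ncomp, then use repeated cross-validation
# to compare per-component performance (BER vs overall error, plus AUC)
# across the three prediction distances.
splsda.full <- splsda(X, Y, ncomp = 5)
perf.res <- perf(splsda.full, validation = "Mfold", folds = 5,
                 nrepeat = 50, auc = TRUE, progressBar = FALSE)
plot(perf.res)      # error rates per component, measure, and distance
perf.res$auc        # cross-validated AUC per component

# Step 2: tune keepX per component; BER weights the rare class equally,
# which is why I am considering it over the overall error rate.
tune.res <- tune.splsda(X, Y, ncomp = 2,
                        test.keepX = c(1:10, 15, 20),
                        validation = "Mfold", folds = 5, nrepeat = 50,
                        dist = "max.dist", measure = "BER")
tune.res$choice.keepX
```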
- Extracting component scores for exposure “signature”:
I’m hoping to extract component scores as a “signature” score for the exposures I’m looking at.
- What’s the correct way to extract component scores from an `splsda` model? Is it using `splsda.model$variates$X`?
- If the optimized ncomp is > 1, how should I integrate the different component scores into downstream regression modeling? Should all components be included as predictors, or is there a way to combine them into a single score? (The first sketch after this list shows what I mean.)
- If I just want to use the component scores as my signature score, is it still best practice to split the data into training/testing sets? I worry that splitting may limit model performance and that the reduced sample size could compromise the stability of the component loadings.
- If I split my data into training and testing sets, can I use `predict(splsda.model, X_test)$variates` to obtain component scores for the test set, and then combine these with the training-set component scores (from `splsda.model$variates$X`) to create a unified dataset for downstream analysis? (The second sketch below shows the mechanics I'm picturing.)
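For the extraction and downstream-modeling questions, this is a minimal sketch of what I currently have in mind, assuming `splsda.model` is a fitted `splsda` object and `Y` is the outcome factor (names are placeholders):

```r
# Component scores ("signature" scores) for the fitted samples:
scores <- splsda.model$variates$X    # matrix: samples x ncomp

# With ncomp > 1, each component captures a different axis of exposure
# variation, so one option is to enter the components as separate
# predictors rather than collapsing them into a single score:
df  <- data.frame(scores, outcome = Y)
fit <- glm(outcome ~ ., data = df, family = binomial)
summary(fit)
```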
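And for the train/test question, the mechanics I am picturing, assuming the model is fitted on the training split only (`X_train`, `Y_train`, `X_test`, and the keepX values are placeholders) and that the predicted variates come back as a matrix under `$variates`:

```r
# Fit on the training split only, then project the held-out samples
# using the loadings estimated from the training data:
splsda.model <- splsda(X_train, Y_train, ncomp = 2, keepX = c(5, 5))

train.scores <- splsda.model$variates$X
test.scores  <- predict(splsda.model, newdata = X_test)$variates

# Stacking the two gives one score matrix for downstream analysis:
all.scores <- rbind(train.scores, test.scores)
```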
Thanks in advance for your help. I really appreciate this community and the great tools I'm learning to use!