Hello! I use mixOmics for research with 40 blood cell variables. Thank you for this very useful package! I would like to prove that analyzing data for men and women separately yields better model performance. I have a binary outcome variable and use sPLS-DA. Any suggestions on how I can best substantiate this statistically (not just by eyeballing, but with something more statistically valid)? The error rate for women analyzed separately is lower than for the pooled data; can I also provide statistical support for this? Thank you very much in advance! Best regards, Malin
A brief note: you state that you "would like to prove that…". Be very careful about trying to actively “prove” something, rather than assessing whether your hypothesis is correct or incorrect. Especially with tools like those in mixOmics, if you approach your analysis with this mindset, you are likely to find the results you want to find rather than those that are really there.
The only way I can think of to statistically assess whether one model is significantly better is the following process:
- Randomly split the data into training and testing sets, stratified by the gender of the samples.
- Using the training samples only, generate three models: one with all samples, one with just male samples and one with just female samples.
- Assess the predictive performance of each of these models on the corresponding testing samples and extract the error rate.
- Repeat this process many times; I would suggest a minimum of 100 repeats.
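The steps above can be sketched roughly as follows. This is only an illustration in plain Python, not mixOmics code: the nearest-centroid classifier is a placeholder standing in for sPLS-DA (in R you would presumably fit with `splsda()` and score with `predict()` instead), the toy data are randomly generated, and all function and field names are hypothetical. One assumption of mine beyond the steps above: the split is stratified by outcome as well as by sex, so that every subset contains both classes.

```python
import random
from statistics import mean

def make_toy_data(n_per_group=30, n_features=5, seed=0):
    """Synthetic samples for illustration only (not real blood cell data)."""
    rng = random.Random(seed)
    data = []
    for sex in ("female", "male"):
        for y in (0, 1):
            for _ in range(n_per_group):
                x = [rng.gauss(float(y), 1.0) for _ in range(n_features)]
                data.append({"sex": sex, "y": y, "x": x})
    return data

def stratified_split(samples, key, test_frac, rng):
    """Split into train/test while keeping the proportion of each key value."""
    groups = {}
    for s in samples:
        groups.setdefault(key(s), []).append(s)
    train, test = [], []
    for members in groups.values():
        members = members[:]
        rng.shuffle(members)
        n_test = max(1, round(test_frac * len(members)))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def fit_centroids(train):
    """Placeholder model (NOT sPLS-DA): per-class mean of every feature."""
    by_class = {}
    for s in train:
        by_class.setdefault(s["y"], []).append(s["x"])
    return {c: [mean(col) for col in zip(*rows)] for c, rows in by_class.items()}

def predict_class(centroids, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def error_rate(centroids, test):
    wrong = sum(predict_class(centroids, s["x"]) != s["y"] for s in test)
    return wrong / len(test)

def run_repeats(samples, n_repeats=100, test_frac=0.3, seed=1):
    """Collect one error rate per repeat for the pooled, female and male models."""
    rng = random.Random(seed)
    errors = {"pooled": [], "female": [], "male": []}
    for _ in range(n_repeats):
        # Assumption: stratify by sex AND outcome so each subset has both classes.
        train, test = stratified_split(
            samples, lambda s: (s["sex"], s["y"]), test_frac, rng)
        errors["pooled"].append(error_rate(fit_centroids(train), test))
        for sex in ("female", "male"):
            tr = [s for s in train if s["sex"] == sex]
            te = [s for s in test if s["sex"] == sex]
            errors[sex].append(error_rate(fit_centroids(tr), te))
    return errors

errors = run_repeats(make_toy_data(), n_repeats=100)
for name, values in errors.items():
    print(name, round(mean(values), 3))
```

Because all three models are fitted and scored on the same split within each repeat, the error rates come out paired by repeat. To compare like with like, it may also be worth scoring the pooled model on the female-only and male-only testing samples.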
From here, you will have three distributions of error rates: one for all samples, one for male samples and one for female samples. You can then apply a t-test or some sort of ANOVA to determine whether the difference in mean error rate between these model types is significant.
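As a rough sketch of such a comparison, here is a paired sign-flip permutation test in plain Python. Note that this is a swapped-in technique, not the t-test or ANOVA mentioned above: I use it because it needs only the standard library and makes fewer distributional assumptions. It assumes the two lists of error rates are paired, i.e. entry i of each list comes from the same random split. The function name and the example numbers are made up purely for illustration.

```python
import random
from statistics import mean

def paired_permutation_test(a, b, n_perm=10000, seed=1):
    """Two-sided sign-flip permutation test on the mean of a[i] - b[i].

    Returns the observed mean difference and an approximate p-value.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired and of equal length")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = mean(diffs)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_perm + 1)

# Made-up error rates for illustration only (one value per repeated split).
female_err = [0.18, 0.22, 0.20, 0.17, 0.21, 0.19, 0.23, 0.16]
pooled_err = [0.25, 0.24, 0.27, 0.22, 0.26, 0.21, 0.28, 0.23]

diff, p_value = paired_permutation_test(female_err, pooled_err)
print("mean difference:", diff, "approx. p-value:", p_value)
```

Keep in mind that repeated splits reuse the same samples, so the repeats are not independent of each other and any p-value obtained this way should be read as approximate rather than exact.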
This introduces a host of assumptions and isn’t exactly the most rigorous procedure, but I can’t think of anything else. I would also explore different classification models, unless you are specifically interested in the sPLS-DA algorithm.
Hi Max, thank you very much for your quick response! You are completely right, the phrasing was incorrect. Based on different classification models, a better model performance with sex-stratified data emerged, therefore I was already a little further along in my thought process/phrasing :-). For now I prefer sPLS-DA because of the lasso integration, variable selection and visualization capabilities. And I will proceed to do the error rate analysis, thank you for the suggestion! Malin