I understand that the perf function evaluates overfitting by assessing model classification performance. If I want to do a permutation test, what command should I run? I did not see an option for a permutation test in the mixOmics package.
Hi @Lorengol,
We don’t have a function in mixOmics to do a permutation test, but you can do this simply by randomly shuffling the Y labels of your dataset and comparing the performance of the randomly shuffled data with your real data. I’ve written a quick bit of code below that does this for a PLS-DA model built on the breast.tumors test dataset, which you can modify for your own use.
library(mixOmics)
# Fit model
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- breast.tumors$sample$treatment
model_real <- plsda(X, Y, ncomp = 2)
perf_real <- perf(model_real, validation = "Mfold", folds = 5, nrepeat = 10)
real_error <- perf_real$error.rate$overall[2, "max.dist"] # overall error rate of the real model on component 2 (max.dist)
# Run permutation test to get random error rate from model
n_permutations <- 100
perf_permutations <- numeric(n_permutations)
set.seed(123) # for reproducibility
for (i in 1:n_permutations) {
  Y_perm <- sample(Y) # randomly shuffle the Y labels
  model_perm <- plsda(X, Y_perm, ncomp = 2)
  perf_perm <- perf(model_perm, validation = "Mfold", folds = 5, nrepeat = 10)
  perf_permutations[i] <- perf_perm$error.rate$overall[2, "max.dist"] # overall error rate of the permuted model on component 2
}
# Calculate p-value: proportion of permuted models performing at least as well as the real model
p_value <- mean(perf_permutations <= real_error)
p_value
Cheers,
Eva
Eva! Thank you very much!!! I will try it
Eva! In my case, I run a block sPLS-DA (DIABLO), so I have multiple data sets. In this case, should I extract real_error for each data set?
I think in terms of the permutation test you can just run the code above as is, but swap the plsda model for your DIABLO model: you still have a single categorical Y outcome, so you can randomly shuffle its labels and assess the classification error rate in the same way.
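To make that concrete, here is a rough sketch using the breast.TCGA multi-omics example data that ship with mixOmics as a stand-in for your own blocks. The block names, design value (0.1), ncomp, folds, and the slot I pull the overall error rate from (WeightedVote.error.rate with the centroids distance) are just assumptions you would adapt to your own model; check str(perf_real) if the slot names differ in your mixOmics version.

library(mixOmics)
# Example multi-omics data (stand-in for your own blocks)
data(breast.TCGA)
X_blocks <- list(mirna = breast.TCGA$data.train$mirna,
                 mrna  = breast.TCGA$data.train$mrna)
Y <- breast.TCGA$data.train$subtype
# Simple design matrix linking the blocks (0.1 is just an example value)
design <- matrix(0.1, nrow = length(X_blocks), ncol = length(X_blocks),
                 dimnames = list(names(X_blocks), names(X_blocks)))
diag(design) <- 0
# Real DIABLO model and its cross-validated error rate
model_real <- block.splsda(X_blocks, Y, ncomp = 2, design = design)
perf_real <- perf(model_real, validation = "Mfold", folds = 3, nrepeat = 10)
# Overall weighted-vote error rate on component 2 (inspect str(perf_real) if this slot is named differently in your version)
real_error <- perf_real$WeightedVote.error.rate$centroids.dist["Overall.ER", 2]
# Permutation loop: shuffle Y, refit DIABLO, store the same error rate
set.seed(123)
n_permutations <- 100 # reduce this if the loop is too slow for your data
perf_permutations <- numeric(n_permutations)
for (i in 1:n_permutations) {
  Y_perm <- sample(Y)
  model_perm <- block.splsda(X_blocks, Y_perm, ncomp = 2, design = design)
  perf_perm <- perf(model_perm, validation = "Mfold", folds = 3, nrepeat = 10)
  perf_permutations[i] <- perf_perm$WeightedVote.error.rate$centroids.dist["Overall.ER", 2]
}
p_value <- mean(perf_permutations <= real_error)
p_value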
Eva, thank you very much!!! I ran it. Now I have a doubt: the p-value I obtain is 0! It seems too good to be true.
I can think of two possibilities:
- There is a strong signal in our multi-omics data that robustly differentiates the four groups. Even when the group labels are shuffled, the underlying structure in the data does not yield models with classification error rates as low as the actual model's.
- The small n of one group (n = 3) might limit the ability to detect overfitting with the cross-validation (perf function) prediction methods. Because of this group size, we had to reduce the number of folds to 3 (the other three groups have n = 6). I believe that forming permuted groups of n = 3 from classes that actually have 6 samples makes it harder to achieve a model with a low error rate.
Do you know what the problem may be?
I think having a group with n = 3 is definitely too small to make any reliable conclusions about how well your multi-omics data can distinguish that group. Instead of permutation tests and p-values, I would recommend using the perf() function we have in mixOmics to assess model classification performance with error rates, and using visualisations like sample plots (plotIndiv()) to explore how your groups separate across your data.
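For example, with a DIABLO model object like model_real from the sketch above (names are illustrative), the same caveat applying to the error-rate slot:

# Cross-validated classification performance of the real model
perf_real <- perf(model_real, validation = "Mfold", folds = 3, nrepeat = 10)
perf_real$WeightedVote.error.rate # per-class and overall error rates per component
# Sample plots per block, coloured by group, to see how the groups separate
plotIndiv(model_real, ind.names = FALSE, legend = TRUE,
          title = "DIABLO sample plots")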
If you would really like a reliable p-value, perhaps you could exclude your small n = 3 group and re-run the permutation test with your remaining larger groups.
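As a rough sketch, assuming object names as in the code above and a hypothetical label "GroupD" for your small group:

# Drop the small group before re-running the permutation test
small_group <- "GroupD" # hypothetical label - replace with your own group name
keep <- Y != small_group
Y_sub <- droplevels(factor(Y[keep]))
X_blocks_sub <- lapply(X_blocks, function(x) x[keep, , drop = FALSE])
# Re-run the model fitting and permutation loop above with X_blocks_sub / Y_sub,
# then compute the p-value. Counting the observed model in the null distribution
# avoids reporting a p-value of exactly 0:
# p_value <- (1 + sum(perf_permutations <= real_error)) / (1 + n_permutations)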
Cheers,
Eva