Group-wise cross validation

keunbae · November 13, 2024, 8:01am

I'm attempting to implement group-wise cross-validation with the perf() function. Instead of the default Mfold cross-validation, which only takes a number of folds, I would like to use a custom fold setup that respects a predefined group structure. This should give a more realistic measure of model performance, because the samples in a held-out group (one level of a random factor) have not been seen by the model during training.
#######################################

perf.diablo = perf(basic.diablo.model,
                   validation = "Mfold",
                   folds = my_list)
#########################################
Here is the structure of my_list:

my_list
$`1`
[1] 1 2 3 4 5 6
$`2`
[1] 7 8 9 10 11 12
$`3`
[1] 13 14 15 16 17 18
…
#######################################
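A fold list of this shape can be derived from a grouping vector rather than typed by hand. A minimal sketch in base R, assuming a hypothetical factor `group` with one entry per sample (the labels and group sizes are placeholders that mirror the printed structure above, not the actual data):

```r
# Hypothetical grouping variable: one label per sample, in sample order.
# Three groups of six samples each (placeholder values).
group <- factor(rep(c("1", "2", "3"), each = 6))

# split() returns a named list with one integer vector of sample indices
# per group, so each fold holds out exactly one group.
my_list <- split(seq_along(group), group)

str(my_list)
```

Each list element contains the row indices of one group, so using the elements as test sets in turn corresponds to leave-one-group-out cross-validation.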

I had a look at the source of the perf function. I assumed it was designed to handle an integer folds value for basic k-fold cross-validation, and it also seemed to accept a custom list of group indices. However, this is apparently not working. Could you please help me adjust the function so that it accepts group-wise cross-validation?

keunbae · November 14, 2024, 11:23am

I have been looking into this based on the error message: "Error in repeat_cv_perf.diablo(nrep) : Invalid number of folds."

The perf function seems to be linked to repeat_cv_perf.diablo, an internal function inside perf.diablo, and that function does not appear to accept a list for folds. If that is correct, how can I modify the code to do what I want? I would also like to understand how a call to perf ends up in the perf.diablo code.

All the best,
Keunbae

evahamrud · November 15, 2024, 4:52am

Hi keunbae,

The perf() function allows for two types of validation:

- Leave-one-out cross-validation, in which either one sample or one study (in the case of MINT models) is used as test data in turn
- Mfold cross-validation, in which the data is randomly partitioned into a defined number of folds and each fold in turn is used as the test data

We currently do not have functionality for user-defined folds when using Mfold cross-validation, and this is not something we are looking at adding in the near future. If you are worried that your data is unbalanced (i.e. the number of samples per category is not equal), note that perf() uses stratified subsampling to partition the data into folds, which ensures that the class proportions in each fold are similar to those in the full data.
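To illustrate what a stratified partition means here, the following is a generic base R sketch. It is not the mixOmics implementation; the helper name `stratified_folds` and the class labels are hypothetical.

```r
# Assign each sample to one of k folds so that every class is spread
# as evenly as possible across folds (a simple stratified partition).
stratified_folds <- function(y, k) {
  y <- as.factor(y)
  fold_id <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    # Shuffle within the class, then deal samples out to folds 1..k in turn.
    idx <- idx[sample.int(length(idx))]
    fold_id[idx] <- rep_len(seq_len(k), length(idx))
  }
  split(seq_along(y), fold_id)
}

set.seed(1)
y <- rep(c("A", "B"), times = c(12, 6))  # placeholder labels, unbalanced classes
folds <- stratified_folds(y, k = 3)

# Class counts per fold: rows are classes, columns are folds.
sapply(folds, function(i) table(factor(y[i], levels = c("A", "B"))))
```

Note that stratifying by class balances the outcome across folds but does nothing to keep samples from the same group together, which is why it does not replace group-wise folds.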

If you would still like to use specific folds yourself you can edit the source code for the internal function repeat_cv_perf.diablo which is defined inside the perf.diablo.R script. In terms of the relationship between perf() and perf.diablo(), perf() is a generic function that detects the type of object you pass it and will call the appropriate function, i.e. if you pass it a DIABLO object it will automatically call perf.sgccda().
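The dispatch described above is R's standard S3 mechanism. A toy sketch with a made-up generic `evaluate()` (not a mixOmics function) shows how a call to the generic ends up in a class-specific method; the class vector given to the stand-in object is an assumption for illustration only.

```r
# Toy generic, unrelated to mixOmics: dispatch is on class(object).
evaluate <- function(object, ...) UseMethod("evaluate")

# Method picked when "sgccda" appears in the object's class vector.
evaluate.sgccda <- function(object, ...) "called evaluate.sgccda"

# Fallback used for any class without its own method.
evaluate.default <- function(object, ...) "called evaluate.default"

# Hypothetical stand-in for a fitted model object.
toy_model <- structure(list(), class = c("block.splsda", "sgccda"))

evaluate(toy_model)  # no evaluate.block.splsda exists, so evaluate.sgccda runs
evaluate(1:3)        # no integer method exists, so evaluate.default runs
```

In the same way, the user only ever types perf(), and R selects the method whose suffix matches the class of the model object.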

General steps for editing functions:

1. Copy the Source Code

Open the source code for the mixOmics function you want to modify (in this case perf.sgccda() and its internal function repeat_cv_perf.diablo, both found in the perf.diablo.R script). Copy it into a new R script, giving the function a unique name (e.g., modified_perf_sgccda) to avoid overwriting the original function.

2. Make Your Edits

Edit the function as needed to incorporate your changes or additional features.

3. Load the Edited Function

To use your modified function, source the edited R script to load it into your environment. If your function relies on other internal mixOmics functions, make them accessible by adding this line after sourcing your function:

environment(modified_perf_sgccda) <- asNamespace("mixOmics")

This step allows your modified function to access any hidden mixOmics functions it may depend on.

I hope this answers your question!
Cheers,
Eva

keunbae · November 19, 2024, 9:33pm

Hello Eva,

Thank you for the detailed explanation and guidance.

Yes, this approach works well and the code runs successfully. However, I am now having trouble selecting the number of components, because the result does not include a standard deviation for a single repeat, just as with LOOCV.

From my understanding, group-wise cross-validation should yield both a mean and a standard deviation across the customized folds (in my case there are 13 folds, so ideally these should produce 13 evaluation matrices that can be used to determine the optimal number of components).
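One way to obtain such a summary, sketched in base R: assuming the per-fold error rates have already been collected from the modified function into a matrix (here a hypothetical `fold_error`, folds in rows and components in columns), the mean and standard deviation per component follow directly. The values below are simulated placeholders, not real results, and how to extract the per-fold errors depends on the edits made to repeat_cv_perf.diablo.

```r
# Placeholder per-fold error rates: 13 folds (rows) x 3 components (columns).
# Simulated for illustration only; replace with the errors from your own folds.
set.seed(42)
fold_error <- matrix(
  runif(13 * 3, min = 0.1, max = 0.5),
  nrow = 13,
  dimnames = list(paste0("fold", 1:13), paste0("comp", 1:3))
)

# Mean error and its spread across folds, one row per component.
summary_by_comp <- data.frame(
  mean = colMeans(fold_error),
  sd   = apply(fold_error, 2, sd)
)
summary_by_comp
```

With unequal group sizes, a plain mean across folds weights every group equally regardless of its number of samples; whether that is appropriate depends on the study design.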

I understand that this is not something you are planning to add. However, I believe that in many cases, especially in ecological studies, it is very common for the experimental design to include random effects. This makes group-wise cross-validation a key component when adapting such methods. Do you think it might be worthwhile to consider implementing this functionality for future studies?

I am currently going through the whole code to adapt it to my needs, but this is somewhat challenging given my coding skills and understanding. If you are able to contribute to this effort, I would be deeply grateful.

All the best,
Keunbae
