Train and test set division of data

Dear All,

Firstly, thank you so much for developing such an interesting algorithm. I would like to use DIABLO to integrate proteomics and methylation data. However, prior to doing that, I have a few questions on which I would like to get your expert opinion.

I have preprocessed the datasets individually and kept around the 5000 most abundant proteins, and M-values for the methylation data, for 122 samples. Would you consider 122 a small sample size? I previously read a thread where you mentioned a few things to keep in mind when integrating small-sample-size data. If 122 is not small, I would like to know whether I should combine the data and then divide it into train and test sets, or first divide it into train and test and then combine the training data.

Thank you so much in advance and I hope to hear back from you,

Shweta

Hi @shweta,

It depends on how many groups you have in the dataset. If you have 2 groups with 61 samples in each, then you have a “fairly large” dataset. On the other hand, if you have 10 groups with 12 samples in each, then I would consider it a small dataset.

I am not sure I understand the question about the test-train split. What exactly do you mean by combining the data: combining the proteomics and methylation data, or combining a training and validation cohort? Also, keep in mind that you don’t actually need a test-train split in order to create a DIABLO model or use the perf() and tune() functions, since these cross-validate internally. A split is only relevant if you want to apply the predict() function to validate your model.
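
For illustration, a minimal sketch of tuning without any manual split (data, Y, and design are placeholders for the block list, outcome factor, and design matrix that come up later in this thread, and the test.keepX grids are arbitrary):

library(mixOmics)

# tune.block.splsda() cross-validates internally, so no manual
# train/test split is needed; all object names below are placeholders
tune.res <- tune.block.splsda(
  X = data, Y = Y, ncomp = 2, design = design,
  test.keepX = list(Proteomics  = c(5, 10, 20),
                    Methylation = c(5, 10, 20)),
  validation = "Mfold", folds = 5, nrepeat = 10
)
tune.res$choice.keepX  # number of selected features per block and component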

Using the 5000 most variable proteins across all samples might be more relevant than using the 5000 most abundant proteins.
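
For example, one way to keep the most variable features (a sketch assuming a samples-by-features matrix, as mixOmics expects; top_variable is a hypothetical helper):

# keep the n features with the highest variance across samples
top_variable <- function(X, n = 5000) {
  v <- apply(X, 2, var, na.rm = TRUE)  # per-feature variance
  X[, order(v, decreasing = TRUE)[seq_len(min(n, ncol(X)))], drop = FALSE]
}

Proteomics_X <- top_variable(Proteomics_X, n = 5000)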

  • Christopher

Dear Christopher,

Thank you so much for your suggestions. By combining I indeed meant combining the methylation and proteomics data. So, if I just want to estimate the correlation between these datasets and identify which features are correlated, I do not need to split the data into train and test sets, if I understood you correctly?

Many thanks once again for your quick response and time,

I look forward to using mixOmics!

Shweta

Hi @shweta,

Yes, this is correct. Also, the cross-validation steps (perf and tune) will by definition split the data into training and test sets internally. Of course you still need to combine the data in the sense of making a list, data = list(Proteomics = Proteomics_X, Methylation = Methylation_X), but besides that you don’t have to combine or split anything. You just have to make sure that the subjects are matched across all datasets:

head(cbind(rownames(Proteomics_X), rownames(Methylation_X), rownames(Y)))  # sample names should line up across blocks
lapply(data, dim)  # every block should have the same number of rows (samples)
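
Putting it together, a minimal sketch (the 0.1 off-diagonal design and ncomp = 2 are illustrative starting values, not recommendations):

library(mixOmics)

# blocks must contain the same samples in the same row order
data <- list(Proteomics = Proteomics_X, Methylation = Methylation_X)
stopifnot(identical(rownames(data$Proteomics), rownames(data$Methylation)))

# illustrative design matrix: 0.1 between blocks, 0 on the diagonal
design <- matrix(0.1, nrow = length(data), ncol = length(data),
                 dimnames = list(names(data), names(data)))
diag(design) <- 0

basic.diablo.model <- block.splsda(X = data, Y = Y, ncomp = 2, design = design)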
  • Christopher

Dear Christopher,

Thanks a million for your help.

Regards,
Shweta

You are welcome. Please let me know if you have more questions :slight_smile:

Dear Christopher,

I am back with a small issue. Following your suggestions, I finished setting up my design matrix and was trying to tune the number of components required for the final DIABLO model. However, as you can see in the image attached below, I do not get any output for choice.ncomp. I used leave-one-out CV because I have 6 groups and some of them contain only a few samples.

Looking forward to your response,

Thank you so much in advance.

Best,
Shweta

Hi @shweta,

This is because the cross-validation has to be repeated at least 3 times in order to assess whether there is a significant improvement, and leave-one-out cross-validation can by definition only be run once. You can simply look at the perf plot and choose ncomp manually. Alternatively, you can give M-fold cross-validation a try (5 folds, 50 repeats should be fine). In that case, remember to use the overall error rate instead of the BER if your groups are unbalanced.
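
As a sketch (diablo.model stands in for your fitted block.splsda object, trained with, e.g., ncomp = 5):

# repeated M-fold CV so that choice.ncomp can actually be assessed
perf.res <- perf(diablo.model, validation = "Mfold", folds = 5, nrepeat = 50)
plot(perf.res)                       # inspect error rates per component
perf.res$choice.ncomp$WeightedVote   # recommended ncomp per error rate and distance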

  • Christopher

Dear Christopher,

I have a question about the selection of components. As you can see from the attached images, and following your previous suggestion, I looked at the overall error rate since my groups are not balanced. The weighted vote suggests 5 components for both the centroid and maximum distances, so I would like your advice on which distance to use, because the downstream results differ between them.

looking forward to your reply,

Thank you very much in advance

Shweta


[Attached image: Capture_2]

Hi @shweta, sorry for the late reply, I have been on sick leave due to covid. Given that you have 6 groups, it makes a lot of sense to test 5 components when tuning the model. There is no definitive answer, but in this case I would probably go with the centroid distance, since it results in more accurate predictions for N-integration problems.


You can find the full-text in the supplemental information of this paper
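
If you later validate on held-out samples, the centroid distance enters at prediction time along these lines (final.diablo.model and X.test are placeholders for a tuned model and a matched list of test blocks):

pred <- predict(final.diablo.model, newdata = X.test)
pred$WeightedVote$centroids.dist  # class calls per component under the centroid distance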

Hope it helps

  • Christopher

Dear Christopher,

Oh no, I hope you feel better soon.

And thanks a lot for your response, I will go ahead with the centroid distance then :slight_smile:

Thanks so much once again,

Shweta

Thank you! :slight_smile:

You are welcome. Feel free to reach out if you have any questions.

  • Christopher