Train and test set division of data

Dear All,

Firstly, thank you so much for developing such an interesting algorithm. I would like to use DIABLO to integrate proteomics and methylation data. However, prior to doing that, I have a few questions on which I would like to get your expert opinion.

I have preprocessed the datasets individually and kept around the 5000 most abundant proteins, and M-values for the methylation data, for 122 samples. Would you consider 122 a small sample size? I previously read a thread where you mentioned a few things to keep in mind when integrating small-sample-size data. If 122 is not small, I would like to know whether I should combine the data and then divide it into train and test sets, or first divide it into train and test and then combine the training data.

Thank you so much in advance and I hope to hear back from you,

Shweta

Hi @shweta,

It depends on how many groups you have in the dataset. If you have 2 groups with 61 samples in each, then you have a “fairly large” dataset. On the other hand, if you have 10 groups with 12 samples in each, then I would consider it a small dataset.

I am not sure I understand the question about the test-train split. What exactly do you mean by combining the data: combining the proteomics and methylation data, or combining a training and validation cohort? Also, keep in mind that you don’t actually need a test-train split in order to create a DIABLO model or use the perf() and tune() functions, since these cross-validate internally. A split is only relevant if you want to apply the predict() function to validate your model.
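
For illustration, a minimal sketch of tuning without any manual split (data, Y, and design are placeholders for the block list, outcome factor, and design matrix that come up later in this thread, and the test.keepX grids are arbitrary):

library(mixOmics)

# tune.block.splsda() cross-validates internally, so no manual
# train/test split is needed; all object names below are placeholders
tune.res <- tune.block.splsda(
  X = data, Y = Y, ncomp = 2, design = design,
  test.keepX = list(Proteomics  = c(5, 10, 20),
                    Methylation = c(5, 10, 20)),
  validation = "Mfold", folds = 5, nrepeat = 10
)
tune.res$choice.keepX  # number of selected features per block and component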

Using the 5000 most variable proteins across all samples might be more relevant than using the 5000 most abundant proteins.
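
For example, one way to keep the most variable features (a sketch assuming a samples-by-features matrix, as mixOmics expects; top_variable is a hypothetical helper):

# keep the n features with the highest variance across samples
top_variable <- function(X, n = 5000) {
  v <- apply(X, 2, var, na.rm = TRUE)  # per-feature variance
  X[, order(v, decreasing = TRUE)[seq_len(min(n, ncol(X)))], drop = FALSE]
}

Proteomics_X <- top_variable(Proteomics_X, n = 5000)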

  • Christopher

Dear Christopher,

Thank you so much for your suggestions. By combining I indeed meant combining the methylation and proteomics data. So, if I just want to estimate the correlation between these datasets and identify which features are correlated, I do not need to split the data into train and test sets, if I understood you correctly?

Many thanks once again for your quick response and time,

I look forward to using mixOmics!

Shweta

Hi @shweta,

Yes, this is correct. Also, the cross-validation steps (perf and tune) will by definition split the data into training and test sets internally. Of course you still need to combine the data in the sense of making a list, data = list(Proteomics = Proteomics_X, Methylation = Methylation_X), but besides that you don’t have to combine or split anything. You just have to make sure that the subjects are matched across all datasets:

head(cbind(rownames(Proteomics_X), rownames(Methylation_X), rownames(Y)))  # sample names should line up across blocks
lapply(data, dim)  # every block should have the same number of rows (samples)
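
Putting it together, a minimal sketch (the 0.1 off-diagonal design and ncomp = 2 are illustrative starting values, not recommendations):

library(mixOmics)

# blocks must contain the same samples in the same row order
data <- list(Proteomics = Proteomics_X, Methylation = Methylation_X)
stopifnot(identical(rownames(data$Proteomics), rownames(data$Methylation)))

# illustrative design matrix: 0.1 between blocks, 0 on the diagonal
design <- matrix(0.1, nrow = length(data), ncol = length(data),
                 dimnames = list(names(data), names(data)))
diag(design) <- 0

basic.diablo.model <- block.splsda(X = data, Y = Y, ncomp = 2, design = design)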
  • Christopher

Dear Christopher,

Thanks a million for your help.

Regards,
Shweta

You are welcome. Please let me know if you have more questions :slight_smile:

Dear Christopher,

I am back with a small issue. Following your suggestions, I finished setting up my design matrix and was trying to tune the number of components required for the final DIABLO model. However, as you can see in the image attached below, I do not get any output for choice.ncomp. I used leave-one-out CV because I have 6 groups and some of them contain only a few samples.

Looking forward to your response,

Thank you so much in advance.

Best,
Shweta

Hi @shweta,

This is because the cross-validation has to be repeated at least 3 times in order to assess whether there is a significant improvement, and leave-one-out cross-validation can by definition only be run once. You can simply look at the perf plot and choose ncomp manually. Alternatively, you can give M-fold cross-validation a try (5 folds, 50 repeats should be fine). In that case, remember to use the overall error rate instead of the BER if your groups are unbalanced.
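
As a sketch (diablo.model stands in for your fitted block.splsda object, trained with, e.g., ncomp = 5):

# repeated M-fold CV so that choice.ncomp can actually be assessed
perf.res <- perf(diablo.model, validation = "Mfold", folds = 5, nrepeat = 50)
plot(perf.res)                       # inspect error rates per component
perf.res$choice.ncomp$WeightedVote   # recommended ncomp per error rate and distance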

  • Christopher

Dear Christopher,

I have a question about the selection of components. As you can see from the attached images, and following your previous suggestion, I looked at the overall error rate since my groups are not balanced. The weighted vote suggests 5 components for both the centroid and maximum distances, so I would like your advice on which distance to use, because the downstream results differ between them.

looking forward to your reply,

Thank you very much in advance

Shweta


[Attached image: Capture_2]

Hi @shweta, sorry for the late reply, I have been on sick leave due to covid. Given that you have 6 groups, it makes a lot of sense to test 5 components when tuning the model. There is no definitive answer, but in this case I would probably go with the centroid distance, since it results in more accurate predictions for N-integration problems.


You can find the full-text in the supplemental information of this paper
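
If you later validate on held-out samples, the centroid distance enters at prediction time along these lines (final.diablo.model and X.test are placeholders for a tuned model and a matched list of test blocks):

pred <- predict(final.diablo.model, newdata = X.test)
pred$WeightedVote$centroids.dist  # class calls per component under the centroid distance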

Hope it helps

  • Christopher

Dear Christopher,

Oh no, I hope you feel better soon.

And thanks a lot for your response, I will go ahead with the centroid distance then :slight_smile:

Thanks so much once again,

Shweta

Thank you! :slight_smile:

You are welcome. Feel free to reach out if you have any questions.

  • Christopher