Supervised binary classification of two distinct datasets sharing only a small number of common samples

Dear mixOmics community,

First of all, congratulations on the great resource and materials! While navigating the book vignette https://mixomicsteam.github.io/mixOmics-Vignette/, I thought I would ask for your broader feedback and expert opinion. In particular, regarding an ongoing pilot dataset:

  • I have two distinct omic layers: microRNA expression and proteomics measurements.

  • Both datasets have a limited number of selected features (~6 each), obtained after independent feature selection based on biological prior knowledge and experimental data.

  • In addition, there is an extra complication: only a proportion of the samples are shared between the two technologies. In particular, only ~35 samples are common, while there are ~60 additional observations with microRNA measurements only. Hence, my critical questions are the following:

  1. As my ultimate goal is to build a model for classification of a binary categorical class (Disease 1 vs Disease 2), would a multiblock PLS-DA (i.e. DIABLO) work in this context, even if it can only be trained on the subset of samples common to both assays, and without an independent test dataset comprising both omics?

  2. As another requirement is to retain the full set of preselected features in the trained model, is there a way to avoid removing features (i.e. not select an even smaller subset), for example something akin to ridge regression?

  3. Finally, would you suggest any other solutions or approaches within mixOmics that could be beneficial for my task? The ultimate goal is to investigate the discriminative accuracy of these features and to build a corresponding model, as I may also obtain more samples in the least represented assay in the near future.

PS1: Of course I acknowledge the limitations of having a small number of common samples, but I wanted to investigate possible ways of exploiting the putative utility of these features in separating the studied diseases.

PS2: In the most optimal scenario of acquiring more than 120 common samples in both assays, I could run DIABLO again, but first split the complete dataset into train and test sets, keeping ~20 samples as an independent test set in both assays?

Thank you in advance !

Efstathios

Hi @mbgventer,

Thanks for your questions.
I would do the following:

  • sPLS-DA on each data set individually, where you will maximise the number of features. This step will already be quite insightful to understand the discriminative power of each data set and to check other things (a PCA to spot outliers would be a very first step).
  • In terms of filtering, we usually keep the first 5,000 or so variables with the highest variance across all samples. I would not recommend using a classification or differential analysis method beforehand, as this will lead to severe overfitting down the track.
  • You can train your multiblock PLS-DA on the overlapping samples, and then test on the samples that you had to leave out in each omics. This is what is featured in the vignette and in the TCGA example on the website.
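The workflow above could be sketched roughly as follows. This is only an illustrative outline, not a definitive analysis: the object names (`X.mirna`, `X.prot`, `Y.mirna`, `Y.common`, `common.ids`) are hypothetical placeholders for your own data, and it assumes sample IDs are stored as row names in both matrices.

```r
library(mixOmics)

# 1. PCA first on each omic to spot outliers
plotIndiv(pca(X.mirna, ncomp = 2), group = Y.mirna)

# 2. Individual sPLS-DA per omic, using all samples available for that omic
res.mirna <- splsda(X.mirna, Y.mirna, ncomp = 2)

# 3. DIABLO trained on the overlapping samples only
X <- list(mirna = X.mirna[common.ids, ],
          prot  = X.prot[common.ids, ])
diablo <- block.splsda(X, Y.common, ncomp = 2)

# 4. Predict on the microRNA-only samples; as in the TCGA vignette example,
#    predict() can handle a test set where one block is absent
extra.ids <- setdiff(rownames(X.mirna), common.ids)
pred <- predict(diablo, newdata = list(mirna = X.mirna[extra.ids, ]))
```

Note that the left-out samples here were measured on only one omic, so this "test" assesses prediction from the microRNA block alone rather than from the full multi-omic signature.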

I hope that helps, good luck with your analyses

Kim-Anh

Dear Kim-Anh,

thank you very much for your feedback and important comments. Hopefully without causing any disturbance, a few quick but very critical additional points:

  1. I have performed PLS-DA on both datasets individually; the discriminative capacity is overall not “great” (average BER ~31% for microRNAs and ~35% for proteomics), and as you anticipated, there are specific common samples that behave as “outliers” in both assays.

  2. I totally agree with the importance of starting with the full set of features and proceeding as you suggested. However, due to other constraints and the broader project design based on the experimental data, there is an unmet need to proceed with this limited number of features, i.e. fewer than 10 per assay. Despite this limitation, would you still consider it interesting and valuable to inspect whether the discriminative accuracy improves when both assays are combined with DIABLO?

  3. Finally, as I am also interested in an interpretable model: if the integration yields decent classification power based on the multi-omic signature, are there any metrics I could extract from the DIABLO analysis to quantify the contribution of each selected feature to the final signature? I am keen to build a score based on the capacity of the multi-omics features.

Thanks a gazillion for your help and consideration!

Best,

Efstathios

Hi @mbgventer,

  1. I have performed PLS-DA on both datasets individually; the discriminative capacity is overall not “great” (average BER ~31% for microRNAs and ~35% for proteomics), and as you anticipated, there are specific common samples that behave as “outliers” in both assays.

At least you gained some good insights from this step. The classification performance is unlikely to improve much when you move on to DIABLO (perhaps a little, since here you did not perform variable selection).

  2. I totally agree with the importance of starting with the full set of features and proceeding as you suggested. However, due to other constraints and the broader project design based on the experimental data, there is an unmet need to proceed with this limited number of features, i.e. fewer than 10 per assay. Despite this limitation, would you still consider it interesting and valuable to inspect whether the discriminative accuracy improves when both assays are combined with DIABLO?

You will seriously overfit if you start with DE features. My suggestion is to filter to the ~5,000 most variable features, then ask DIABLO to select the top 10 features. You could then compare the features selected with those selected in point 1 with sparse PLS-DA. You may gain additional insights from the features that do or do not overlap between the two analyses.
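A minimal sketch of this suggestion, assuming `X.mirna` and `X.prot` are samples-by-features matrices restricted to the common samples and `Y` is the disease factor (all names hypothetical):

```r
library(mixOmics)

# Unsupervised filter: keep the ~5,000 most variable features per block
top.var <- function(X, n = 5000) {
  v <- apply(X, 2, var)
  X[, order(v, decreasing = TRUE)[seq_len(min(n, ncol(X)))]]
}
X <- list(mirna = top.var(X.mirna), prot = top.var(X.prot))

# Let DIABLO select ~10 features per block on each component
keepX  <- list(mirna = c(10, 10), prot = c(10, 10))
diablo <- block.splsda(X, Y, ncomp = 2, keepX = keepX)

# Compare with the single-omic sPLS-DA selection from point 1
sel.diablo <- selectVar(diablo, comp = 1)$mirna$name
sel.single <- selectVar(splsda(X$mirna, Y, ncomp = 2, keepX = c(10, 10)),
                        comp = 1)$name
intersect(sel.diablo, sel.single)   # overlapping features
```

The variance filter is deliberately unsupervised (it never looks at `Y`), which is what avoids the overfitting that a DE-based prefilter would introduce.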

  3. Finally, as I am also interested in an interpretable model: if the integration yields decent classification power based on the multi-omic signature, are there any metrics I could extract from the DIABLO analysis to quantify the contribution of each selected feature to the final signature? I am keen to build a score based on the capacity of the multi-omics features.

Yes, have a look at our DIABLO examples using the perf() function. When you print your perf result object, it will give you a list of metrics based on classification performance (see also the ?perf help file).
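For example, assuming `diablo` is a fitted block.splsda model (the fold and repeat counts below are illustrative choices given ~35 samples, not prescriptions):

```r
library(mixOmics)

# Repeated cross-validation of the DIABLO model
set.seed(42)
perf.diablo <- perf(diablo, validation = "Mfold", folds = 5, nrepeat = 10)
print(perf.diablo)   # overall and per-class error rates, including BER

# Feature-level contributions: the loading weights on each component
# quantify how much each selected feature drives the discrimination
selectVar(diablo, comp = 1)                     # features + loading values
plotLoadings(diablo, comp = 1, contrib = "max") # coloured by the class each
                                                # feature contributes to most
```

The loading values from selectVar() are one natural starting point for building a feature-based score, since each component is a weighted sum of the selected features.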

Kim-Anh