Appropriate mixOmics methodology for multi-omics integration with time course measurements

Dear mixOmics group,

Good afternoon, and I hope my message finds you well! I would like to ask for your opinion and feedback regarding an upcoming multi-omics dataset we will produce, and whether any methodology implemented in the mixOmics framework would be particularly suitable for our biological scenario:

Briefly, we plan to generate multi-omics layers (i.e. transcriptome, genome, epigenome) for ~50 to 100 patients. The main point of interest is that there will be two samples per patient: one prior to therapy and one after administration of therapy. Thus, we would actually have 2 time points/samples per patient (which of course would be “confounded” with biological condition: T1 is always before therapy, T2 is always after).

In addition, we will have further distinct measurements such as histology features and drug screens. Our main aim is to study the heterogeneity of these cancer patients (all belonging to the same cancer entity) and to identify the molecular circuits that differ or are significantly perturbed “before vs after therapy”. Of course, intra-patient variation will be present, but we aim to unravel the biological sources of variation that might categorize these patients into at least these “two” distinct groups.

On this premise, and based on the experimental design, would you suggest DIABLO for a direct “supervised” implementation, for example treating the specimens from the same patient at different time points as “distinct samples”? In addition, if my understanding is correct, would one create a binary categorical variable with two levels, such as “Before_Treat” and “After_Treat”?

Or, because these patients will have “paired” samples, would supervised models like this fail to appropriately model time course measurements from the same patient? Each patient has paired (nested) samples, which complicates the analysis.

Alternatively, is there another approach (even unsupervised) that would also be beneficial for disentangling the most important biological features and molecular sources of variation “before vs after treatment”?

Any suggestion, feedback or idea would be greatly appreciated.

Kind regards,

Efstathios

hi @estefaniatn,

Our multilevel approach tries to take into account (and remove) the individual variation in the paired design so that might work in your case.

I would try the following (close to what you suggest):

  • a simple PCA would tell you whether each patient's two time points tend to cluster together. If that is the case, try multilevel PCA and see if the results are better.
    here is an example:
  • based on what you found in the previous step, either continue your analysis considering the time points/patients independently, or perform a multilevel decomposition with the withinVariation() function and plug the result, instead of your original data, into DIABLO/PLS/PLSDA
  • I would also try an unsupervised approach first, to inspect the common sources of variation in the data (e.g. PLS, block.pls), just as a follow-up to PCA.
  • you should also complement your results with more classical univariate tools that can take this design into account (e.g. linear mixed models)
  • don't forget to filter your data first: keep the top 5,000ish most variable features in each data set (or more if you feel like it, but remember that if you are interested in variable selection, your aim is really to select the few hundred most relevant features in the end).
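
To make the filtering step in the last bullet concrete, here is a minimal sketch of a "keep the k most variable features" filter. It is written in plain Python purely as an illustration and is not mixOmics code: the helper name `top_variable_features`, the toy values, and the samples x features list layout are all assumptions.

```python
from statistics import pvariance

def top_variable_features(rows, k):
    """Return the column indices of the k features with the largest variance.

    rows: list of samples, each a list of feature values (samples x features).
    """
    n_features = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_features)]
    # Rank feature indices by decreasing variance (ties broken by index).
    ranked = sorted(range(n_features), key=lambda j: (-variances[j], j))
    # Return the kept indices in their original column order.
    return sorted(ranked[:k])

# Hypothetical toy data: 4 samples x 3 features.
toy = [
    [1.0, 10.0, 5.0],
    [1.1, 20.0, 5.0],
    [0.9, 30.0, 6.0],
    [1.0, 40.0, 4.0],
]
kept = top_variable_features(toy, k=2)
filtered = [[row[j] for j in kept] for row in toy]
```

In a real analysis `k` would be in the thousands, and the filter would be applied to each data set separately.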

Kim-Anh

Dear @kimanh.lecao,

thank you very much for your kind response and useful suggestions! Apologies for any “naive” further comments from my side, but just to be certain that I have fully understood your points:

  1. Thank you for your suggestion of unsupervised approaches to inspect the trend of patients, based on the available multi-omics layers; is there a suitable unsupervised method within the mixOmics framework that I could use to project & cluster the patient samples, based on at least two available omics layers (gene expression, epigenomics, etc)?

  2. Regarding the supervised approaches:

A) If I have understood correctly, DIABLO differs from multiblock PLS? If so, in which respect exactly?

B) In conjunction with part A: as I will have 2 paired samples for each patient, the most straightforward solution would be to create a categorical vector for DIABLO that is essentially a “two-level” variable, i.e. with before and after treatment levels? And in this direction, is there any possibility to incorporate into the model that I have paired samples (i.e. which samples belong to each patient)?

C) Moreover, if my understanding is correct, DIABLO will help disentangle which omics layers correlate the most and drive the variation between the 2 “conditions” above, correct? If so, what would be the “minimal” requirement for sample size? For example, would 100 samples in total (that is, 50 for each level) suffice?

  3. Finally, regarding feature selection and clinicopathological parameters:
    A) Could somatic mutations/CNAs in the form of binary numbers also be incorporated into the training of the model? For example, 0 for absence, 1 for presence?

B) Is there also a possibility to correlate the output of the model to various clinical parameters, both qualitative (for example tumor status) or even quantitative/continuous (like PET/histopathology features)?

Thank you in advance :slight_smile:

Efstathios

hi @JasonMbg

  1. Thank you for your suggestion of unsupervised approaches to inspect the trend of patients, based on the available multi-omics layers; is there a suitable unsupervised method within the mixOmics framework that I could use to project & cluster the patient samples, based on at least two available omics layers (gene expression, epigenomics, etc)?

Yes, PLS for example. You can then extract the components and apply further clustering (e.g K-means).
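
As an illustration of the "extract the components, then cluster" idea, here is a minimal sketch of K-means applied to per-sample component scores. It assumes the scores have already been extracted from a fitted model (e.g. two components per sample); the numbers are hypothetical, and the hand-written `kmeans` helper is a plain Lloyd's algorithm with a deterministic start, not a mixOmics function.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical component scores: 6 samples x 2 components.
scores = [
    (-2.0, -1.9), (2.1, 2.0), (-1.8, -2.2),
    (1.9, 2.2), (-2.1, -2.0), (2.0, 1.8),
]
labels, centroids = kmeans(scores, k=2)
```

With paired samples, it can be informative to check whether the two samples of the same patient end up in the same cluster.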

  2. Regarding the supervised approaches:

A) If I have understood correctly, DIABLO differs from multiblock PLS? If so, in which respect exactly?

DIABLO is supervised in the classification sense (Y is a categorical outcome), whereas multiblock PLS expects Y to be either a continuous vector (regression style) or a matrix of continuous values (e.g. transcriptomics, multivariate regression style).
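
The practical difference shows up in how Y is represented. The sketch below contrasts the two cases, assuming the usual PLS-DA convention of recoding a categorical outcome as a dummy (indicator) matrix. The helper `dummy_code` and all values are hypothetical illustrations in plain Python, not the mixOmics implementation.

```python
def dummy_code(labels):
    """Recode class labels as a samples x levels indicator matrix."""
    levels = sorted(set(labels))
    matrix = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, matrix

# Classification-style outcome (DIABLO / PLS-DA setting): one label per sample.
y_class = ["Before_Treat", "After_Treat", "Before_Treat", "After_Treat"]
levels, Y_dummy = dummy_code(y_class)

# Regression-style outcome (multiblock PLS setting): continuous values per sample.
# Hypothetical drug-response scores, used as-is without recoding.
y_continuous = [0.8, 2.4, 1.1, 3.0]
```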

B) In conjunction with part A: as I will have 2 paired samples for each patient, the most straightforward solution would be to create a categorical vector for DIABLO that is essentially a “two-level” variable, i.e. with before and after treatment levels? And in this direction, is there any possibility to incorporate into the model that I have paired samples (i.e. which samples belong to each patient)?

That would be the multilevel approach we discussed earlier: once you have inspected the multilevel PCA and you think that the multilevel approach would help, you can use the withinVariation() function to extract the within-subject variation for all your X data sets and use those as input to DIABLO. You can read more about multilevel here with the reference at the bottom of that page.
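
For intuition, the idea behind a within-subject decomposition in a paired design can be sketched as subtracting each patient's own mean from that patient's samples, so that only the before/after deviations remain. This is a plain-Python illustration under that assumption, with hypothetical data and a hypothetical helper name; it does not reproduce the implementation of withinVariation().

```python
from collections import defaultdict

def within_subject_deviation(rows, subjects):
    """Subtract each subject's own feature means from that subject's samples."""
    n_features = len(rows[0])
    sums = defaultdict(lambda: [0.0] * n_features)
    counts = defaultdict(int)
    # Accumulate per-subject feature sums and sample counts.
    for row, subject in zip(rows, subjects):
        counts[subject] += 1
        for j, value in enumerate(row):
            sums[subject][j] += value
    # Centre every sample on the mean of the subject it belongs to.
    return [
        [value - sums[subject][j] / counts[subject] for j, value in enumerate(row)]
        for row, subject in zip(rows, subjects)
    ]

# Hypothetical toy data: 2 patients x 2 time points, 2 features.
X = [
    [10.0, 1.0],   # patient P1, before
    [12.0, 3.0],   # patient P1, after
    [100.0, 5.0],  # patient P2, before
    [104.0, 5.0],  # patient P2, after
]
patient = ["P1", "P1", "P2", "P2"]
Xw = within_subject_deviation(X, patient)
```

In the toy data the large baseline difference between the two patients in the first feature disappears, while the direction of the treatment effect within each patient is kept.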

C) Moreover, if my understanding is correct, DIABLO will help disentangle which omics layers correlate the most and drive the variation between the 2 “conditions” above, correct? If so, what would be the “minimal” requirement for sample size? For example, would 100 samples in total (that is, 50 for each level) suffice?

I do not have a straight answer for that, as it depends on the disease, organism, degree of variability etc! 50 sounds fine, but if you have pilot data to assess this a priori, that would be better. You can also use the multiPower package, but our experience is that it requires pilot data and that the estimated sample size is often quite large (there is an option to input the parameters directly for the sample size calculation, but it is a bit fiddly).

  3. Finally, regarding feature selection and clinicopathological parameters:
    A) Could somatic mutations/CNAs in the form of binary numbers also be incorporated into the training of the model? For example, 0 for absence, 1 for presence?

Yes, you can code them as 0/1, but my experience is that this type of data does not work very well (mostly because those variables are often perfectly correlated and also lack ‘resolution’ as they are binary, so you may not get much insight). But some people have included them.

B) Is there also a possibility to correlate the output of the model to various clinical parameters, both qualitative (for example tumor status) or even quantitative/continuous (like PET/histopathology features)?

Yes! We often talk about ‘omics’, but the methods apply to other types of continuous data too. The qualitative side is a bit more difficult, and it might be best to combine such variables with your outcome Y of interest (e.g. before/after AND tumor status). Note that this would probably require a larger sample size than 50, as you want to make sure you have enough samples per group.
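
To see why combining the outcome with a qualitative variable increases the sample size requirement, here is a minimal sketch that builds the combined factor and counts samples per group. The labels and group names are hypothetical, and the code is plain Python for illustration only.

```python
from collections import Counter

# Hypothetical labels for 6 samples (3 patients x 2 time points).
treatment = ["Before", "After", "Before", "After", "Before", "After"]
tumor_status = ["T1", "T1", "T2", "T2", "T1", "T1"]

# Combined outcome: one level per (treatment, tumor status) pair.
combined = [f"{t}_{s}" for t, s in zip(treatment, tumor_status)]

# Number of samples available in each combined group.
group_sizes = Counter(combined)
smallest = min(group_sizes.values())
```

Two levels become four, and the smallest group here has a single sample, which is the kind of imbalance worth checking before fitting a model on the combined outcome.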

Kim-Anh