N-integration with smaller datasets (few predictors)

I’m writing to congratulate you on the brilliant paper.

I would be grateful if you could help me clarify some aspects of applying N-integration (DIABLO) that are still unclear to me after reading the paper.

Could N-integration also be applied to smaller datasets (datasets with a small number of predictors)?

I ask because the mixOmics paper states: “Here we applied our multivariate frameworks to transcriptomics, proteomics and miRNA data. However, other types of biological data can be analysed, as well as data beyond the realm of ‘omics as long as they are expressed as continuous values.”

For my master’s degree, I am trying to combine markers measured with different methodologies in the same samples: transcriptome (9997 predictors); plasma markers such as cytokines (about 19 predictors) and heart damage molecules (4 predictors), measured by Luminex and ELISA respectively; antibody reactivity against a panel of 15 antigens specific for T. cruzi (the parasite responsible for Chagas disease); and clinical data (7 variables).

Our aim is to find out whether a combination of markers from these different methodologies is more efficient than a single methodology, such as clinical data, at discriminating the Chagas disease patient groups: indeterminate form and cardiac form, with and without severe left ventricle dysfunction.

I’m trying to apply your DIABLO methodology to these data. Do you think I can do such an evaluation with the package, that is, compare the cross-validated prediction performance of PLS-DA or sPLS-DA on each dataset with the prediction performance of N-integration on all or some of these datasets combined?

Thank you in advance for your attention. I look forward to hearing from you.

If you need further information, please contact me.

Hi Natália,

Thank you for using mixOmics.

It is certainly possible to use N-integration for the problem you described, as long as the measurements can appropriately be input as continuous variables, and samples are the same, even if you miss some data from some of them in certain datasets.
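The requirement that samples are the same across datasets can be checked before any model is built. The following is a minimal Python sketch rather than mixOmics code: the block names, sample IDs and values are invented for illustration, and it assumes each dataset is keyed by a sample identifier. It keeps only the samples present in every block and puts them in one common row order.

```python
# Hypothetical blocks: sample ID -> list of measurements (all values invented).
blocks = {
    "transcriptome": {"S1": [5.1, 2.3], "S2": [4.8, 2.9], "S3": [6.0, 1.7]},
    "cytokines": {"S2": [12.0, 0.4], "S3": [30.5, 1.1], "S4": [8.2, 0.9]},
    "clinical": {"S1": [54.0], "S2": [61.0], "S3": [47.0]},
}


def common_samples(blocks):
    """Return the sorted sample IDs that appear in every block."""
    id_sets = [set(samples) for samples in blocks.values()]
    return sorted(set.intersection(*id_sets))


def align_blocks(blocks):
    """Restrict every block to the shared samples, in one common row order."""
    ids = common_samples(blocks)
    aligned = {
        name: [samples[s] for s in ids] for name, samples in blocks.items()
    }
    return ids, aligned


ids, aligned = align_blocks(blocks)
print(ids)
```

Running a single-dataset analysis on the same shared samples is one way to keep a later comparison between models on equal footing.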

I see you mentioned that your patient groups comprise indeterminate form and cardiac form, with and without severe left ventricle dysfunction. Does that mean the groupings are nested? - i.e. a patient has a so-to-speak superclass and a subclass. Because you can only use sPLSDA if you define distinct classes (such as cardiac form AND with … dysfunction) and the model will not produce nested-form markers, although you might find common ones across similar super-/sub-classes.
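Defining distinct classes from a nested grouping amounts to collapsing the superclass and subclass into one flat label per patient. Below is a minimal Python sketch under the assumption that only the cardiac form is subdivided by dysfunction status; the function name `flat_class` and the label strings are hypothetical.

```python
def flat_class(form, dysfunction=None):
    """Combine clinical form and dysfunction status into one distinct label."""
    if form == "indeterminate":
        return "indeterminate"
    if form == "cardiac":
        if dysfunction is None:
            raise ValueError("cardiac patients need a dysfunction status")
        suffix = "with_dysfunction" if dysfunction else "without_dysfunction"
        return "cardiac_" + suffix
    raise ValueError(f"unknown form: {form!r}")


# Invented example patients: (form, has severe dysfunction?)
patients = [("indeterminate", None), ("cardiac", True), ("cardiac", False)]
labels = [flat_class(form, dys) for form, dys in patients]
print(labels)
```

The resulting label vector has one entry per sample and no hierarchy, which is the form a discriminant analysis outcome needs.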

Best,

Al

Dear Natalia,

A small number of predictors in some of the datasets is not an issue: you can choose to include all predictors for a specific dataset. The tricky bits for you when performing N-integration are the following:

  • identify whether the datasets include common information: for this we advise our users to do a single-omics analysis first (a simple PCA, one dataset at a time) to explore and identify the sources of variation in each dataset
  • identify whether pairs of datasets agree: with either rCCA or sPLS in canonical mode
  • decide what your outcome of interest is (it is not clear in this case whether your outcomes are nested, see Al’s answer) and run sPLS-DA on each dataset separately; this will also give you a range for how many variables to select per dataset (tune.splsda), which reduces the computational burden of tuning the N-integration

After this you can start building up your N-integration model; plotDiablo will also be useful to identify the ‘weak’ datasets. And as you mention, you can compare the cross-validation performance of the different models (use the nrepeat argument in perf()). Depending on the complexity of your data, it may happen that sPLS-DA > DIABLO in terms of classification performance; however, DIABLO would be superior for extracting correlated features. You may also want to tweak the design in DIABLO to find a compromise between classification and correlation (as we discuss in the simulation results in the DIABLO paper).
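To make the design trade-off concrete, the sketch below builds a square design with one entry per pair of blocks: zero on the diagonal and a single weight between different blocks. It is plain Python for illustration, not the mixOmics interface; the helper `make_design`, the block names and the two weights are assumptions. Smaller off-diagonal weights put more emphasis on classification, larger ones on correlation between blocks.

```python
def make_design(block_names, weight):
    """Square design: `weight` between different blocks, 0 on the diagonal."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return {
        a: {b: (0.0 if a == b else weight) for b in block_names}
        for a in block_names
    }


# Hypothetical block names; the weights 0.1 and 1.0 are illustrative only.
names = ["transcriptome", "cytokines", "antibodies", "clinical"]
design_low = make_design(names, 0.1)   # leans towards classification
design_full = make_design(names, 1.0)  # leans towards between-block correlation

for row in names:
    print(row, [design_low[row][col] for col in names])
```

Comparing cross-validated performance for a few such weights is one way to see where the compromise sits for a given dataset.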

Good luck, keep us updated

Kim-Anh

Thank you for your answers.
Mr. aljabadi,
When you say:
“and samples are the same, even if you miss some data from some of them in certain datasets.”

  1. The number of samples is not the same in each dataset, because each one was measured at a different time, in distinct projects. (Now I’m trying to integrate them to see whether combining markers from different datasets with DIABLO classifies samples better than a multivariate sPLS-DA analysis of a single dataset.)
    As a result, each individual dataset has more samples than the set of samples with information available in most of the datasets for N-integration. Is that a problem for comparing PLS-DA and DIABLO performance?

  2. Another question: how does the package handle missing data? If a sample has some NAs, does it disregard all the information for that sample, or only the predictor with the NA value?

  3. Cytokines: I didn’t find an appropriate way to normalize the data. I’m using the raw concentration values determined by Luminex from the standard curve, but the markers have very distinct ranges… Do you think this is right? Or should I try a log or z-score transformation before inputting the data into mixOmics?

  4. “Does that mean the groupings are nested? - i.e. a patient has a so-to-speak superclass and a subclass. Because you can only use sPLSDA if you define distinct classes (such as cardiac form AND with … dysfunction) and the model will not produce nested-form markers, although you might find common ones across similar super-/sub-classes.”

The groups are:

  • indeterminate form
  • cardiac form with ventricle dysfunction
  • cardiac form without ventricle dysfunction
    But the main classes are cardiac and indeterminate form; ventricle dysfunction is a way to subdivide the cardiac form according to disease severity.
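On question 3, the two transformations mentioned there, log and z-score, can be sketched as follows. This is a minimal Python illustration of what each one does to markers with very different ranges, not a recommendation of either and not mixOmics code; the marker names, the concentrations and the `offset` of 1.0 are invented for the example.

```python
import math
import statistics


def log_transform(values, offset=1.0):
    """log(x + offset); the offset (an assumption here) keeps zeros finite."""
    return [math.log(v + offset) for v in values]


def z_score(values):
    """Centre to mean 0 and scale to sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("a constant marker cannot be scaled")
    return [(v - mean) / sd for v in values]


# Invented concentrations for two markers on very different scales.
markers = {"marker_A": [2.0, 4.0, 6.0], "marker_B": [200.0, 400.0, 600.0]}
scaled = {name: z_score(vals) for name, vals in markers.items()}
logged = {name: log_transform(vals) for name, vals in markers.items()}
print(scaled)
```

After z-scoring, both invented markers end up on the same scale even though their raw ranges differ by a factor of 100; the log transform instead compresses large values while keeping their order.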

Thank you so much for your attention.