DIABLO inputs and optimal number of components

Hi DIABLO team,
I have two questions about the input data and the optimal number of components. Can you help me to answer them? Thank you so much!

  1. What is the optimal number of candidate variables for each omics dataset before integrating them with DIABLO? For example, I have host mRNA (20,000 transcripts), host miRNA (1,200 miRNAs) and taxonomy data (350 species). Do I need to reduce the 20,000 mRNA transcripts to somewhere around 1,000 before running DIABLO?
  2. May I ask how I should interpret the performance plot when tuning the number of components? My interpretation is that, in general, the lower the line the better (i.e., a smaller classification error rate), and that the error rate should decrease when going from 1 to 2 components. Also, is centroids.dist generally better than max.dist? Are these interpretations correct?

Thank you and best,
ZZ

Hi @hellofuture,

It is definitely recommended to filter the transcripts, but there is no precise answer on how many variables to keep. You could try keeping the 10-20% most variable transcripts, or filter out transcripts that are not present in at least 70% of samples within the groups of interest.
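
As a rough illustration of those two filters, here is a minimal sketch in plain Python (not mixOmics code). The function names, the `keep_fraction` and `min_fraction` parameters, and the reading of "present" as a value greater than zero are all assumptions made for the example. It also assumes a transcript is kept if it passes the 70% threshold in at least one group, which is only one possible reading of the rule.

```python
from statistics import variance


def top_variable_features(matrix, feature_names, keep_fraction=0.2):
    """Return the names of the most variable features.

    matrix: list of samples, each a list of values (one per feature).
    keep_fraction: hypothetical parameter, e.g. 0.1 to 0.2 for 10-20%.
    """
    n_features = len(feature_names)
    variances = [variance([row[j] for row in matrix]) for j in range(n_features)]
    n_keep = max(1, int(keep_fraction * n_features))
    # Rank feature indices from highest to lowest variance.
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore the original column order
    return [feature_names[j] for j in kept]


def prevalent_features(matrix, groups, feature_names, min_fraction=0.7):
    """Return features present (value > 0) in at least min_fraction of the
    samples of at least one group (assumed reading of the 70% rule)."""
    kept = []
    for j, name in enumerate(feature_names):
        for g in set(groups):
            values = [row[j] for row, grp in zip(matrix, groups) if grp == g]
            present = sum(1 for v in values if v > 0)
            if present / len(values) >= min_fraction:
                kept.append(name)
                break  # one qualifying group is enough
    return kept
```

Either function returns a list of feature names that can be used to subset the mRNA block before it goes into DIABLO; the smaller miRNA and taxonomy blocks may need little or no filtering.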

Yes, this is correct. If the error rate increases when adding more components, it’s a sign of increasing noise levels.
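
To make that reading of the plot concrete, here is a small sketch in plain Python (again, not mixOmics code). The function `choose_ncomp` and its `tolerance` parameter are hypothetical; the sketch assumes you have copied the cross-validated error rates for one distance (for example centroids.dist) into a list, one value per number of components.

```python
def choose_ncomp(error_rates, tolerance=0.0):
    """Pick the smallest number of components whose error rate is within
    `tolerance` of the lowest error rate.

    error_rates[i] is the error rate obtained with i + 1 components.
    tolerance is a hypothetical knob: 0.0 means "take the minimum".
    """
    best = min(error_rates)
    for i, err in enumerate(error_rates):
        if err <= best + tolerance:
            return i + 1  # components are counted from 1
    return len(error_rates)  # not reached; kept for safety
```

With made-up error rates such as `[0.40, 0.25, 0.27, 0.31]`, this returns 2: the error drops from 1 to 2 components and then rises again, which matches the idea that further components mostly add noise.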

Yes, centroids.dist and especially mahalanobis.dist seem to be more accurate for N-integration. You can read more about it in the supplementary material (Section 1.3) of this paper.

Best
Christopher

Thank you so much, Christopher!

Hi Christopher, I have generated a performance plot, but I cannot upload it to this website to show you. I would like to know whether I should choose the centroid or the Mahalanobis distance based on this plot. Do you have an email address where I could send you the plot? Thank you!

Zhaohzong

Hi @hellofuture, yes of course. My email is cabo@hst.aau.dk

-Christopher