Integration with DIABLO for N-integration with low sample size

Hi,
I intend to perform an N-integration analysis using mixOmics. My design is as below:
mRNA normalized data: ~20K genes across 10 samples (5 control vs 5 diseased)
Proteomics: ~1,200 analytes across the same 10 samples
Metabolites: ~400 analytes across the same 10 samples
Phenotype data: ~18 predictors across the same 10 samples

Is it possible to use DIABLO to integrate all 4 layers (transcriptome, proteome, metabolome, and phenotype) together to extract features that better define markers for the phenotype? Or would you suggest using sPLS or rCCA to find the features most correlated with each other across pairs of layers, and then feeding those into DIABLO?

What would be your suggestion for the above design?

Is there also a vignette I can look at for integrating data with such a low sample size?

Thanks,

VD


Hi VD,

Yes, you can integrate all these data types, despite the low sample size. What you need to be careful about is the parameter tuning (use leave-one-out cross-validation, or just adopt a more exploratory approach where you arbitrarily decide on a small number of biomarkers to select). You can use the usual DIABLO vignette, but basically remove the tune() and perf() functions from your analyses (those are the ones using cross-validation).

We would also recommend you filter your genes. For example, only keep the 5,000 most highly variable genes across all samples (low variance means those genes probably don't change much across the conditions).
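As a minimal sketch of such a variance filter in base R (the 5,000 cutoff and the samples-in-rows orientation used by mixOmics are the only assumptions; `filter_most_variable` is a hypothetical helper name):

```r
# Keep the n_keep most variable features (columns) of a
# samples-x-features matrix -- an unsupervised variance filter.
filter_most_variable <- function(x, n_keep = 5000) {
  v <- apply(x, 2, var)  # per-feature variance across samples
  keep <- order(v, decreasing = TRUE)[seq_len(min(n_keep, ncol(x)))]
  x[, keep, drop = FALSE]
}

# Toy example: 10 samples x 200 "genes", keep the top 50
set.seed(1)
toy <- matrix(rnorm(10 * 200), nrow = 10,
              dimnames = list(NULL, paste0("gene", 1:200)))
filtered <- filter_most_variable(toy, n_keep = 50)
dim(filtered)  # 10 50
```

Because the filter is unsupervised (it never looks at the control/diseased labels), it does not bias the downstream feature selection.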

We also advise that you start with sPLS to get more insight into pairs of data before you jump into N-integration!
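For reference, a hedged sketch of such a pairwise sPLS run in canonical (unsupervised) mode; the matrices here are random stand-ins for two matched blocks, and the keepX/keepY values are arbitrary small selections in the exploratory spirit described above:

```r
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)
  set.seed(1)
  # Stand-ins for two omics blocks measured on the same 10 samples
  X <- matrix(rnorm(10 * 100), nrow = 10,
              dimnames = list(NULL, paste0("gene", 1:100)))
  Y <- matrix(rnorm(10 * 40), nrow = 10,
              dimnames = list(NULL, paste0("prot", 1:40)))

  # Canonical-mode sPLS: no outcome, symmetric exploratory integration
  spls.res <- spls(X, Y, ncomp = 2, mode = "canonical",
                   keepX = c(25, 25), keepY = c(10, 10))

  # Features selected on component 1 (output structure per mixOmics docs)
  sel <- selectVar(spls.res, comp = 1)$X$name
}
```

On real data, plotVar(spls.res, cutoff = 0.5) would then show the correlation circle for the selected features of both blocks.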

Let us know if you have further questions.

Kim-Anh


Hi Kim-Anh,

Many thanks for the reply. I had actually thought of using sPLS first as an exploratory analysis, to look at paired correlations for mRNA~protein and mRNA~metabolite only, but I also wanted to see whether I could directly feed all 3 omics layers into DIABLO.

I have now done the following, as you suggested:

  1. Reduced the mRNA data to the highly variable genes across all samples, filtering on the 1st quantile of variance; currently ~5,500 genes remain.
  2. Took all 3 layers and followed the DIABLO tutorial. I have some queries, listed below.

My dataset dims are as below (I am not using the phenotype layer):
mRNA normalized data: ~5.5K genes across 10 samples (5 control vs 5 diseased)
Proteomics: ~1,200 analytes across the same 10 samples
Metabolites: ~400 analytes across the same 10 samples

Code:

X <- list(mRNA = mRNA.D, protein = protein.D, metabolite = metabol.D)
Y <- as.factor(dietGroup)
summary(Y)
## Not sure if this is correct for the keepX
# keepX: one value per component (here the default ncomp = 2) for each block
list.keepX <- list(mRNA = c(16, 17), protein = c(18, 5), metabolite = c(5, 5))
MyResult.diablo <- block.splsda(X, Y, ncomp = 2, keepX = list.keepX)
plotIndiv(MyResult.diablo, 
          ind.names = FALSE, 
          legend=TRUE, cex=c(1,2),
          title = 'Results with DIABLO')
plotVar(MyResult.diablo, var.names = c(FALSE, FALSE, TRUE),
        legend=TRUE, pch=c(16,16,1))
plotDiablo(MyResult.diablo, ncomp = 1)
plotDiablo(MyResult.diablo, ncomp = 2)
circosPlot(MyResult.diablo, cutoff=0.9)
plotLoadings(MyResult.diablo, comp = 1, contrib = "max")
plotLoadings(MyResult.diablo, comp = 2, contrib = "max")
plotLoadings(MyResult.diablo, comp = 1, contrib = "min")
plotLoadings(MyResult.diablo, comp = 2, contrib = "min")

The rest of the images are at the link, as I am unable to post multiple images.
test2

Queries:

  1. How should the keepX parameter be tuned, specifically given the low overall sample size in my case?
  2. How do I carry out the parameter tuning (and where does leave-one-out cross-validation fit in here)? Any links from the tutorial to look into?
  3. I am not able to understand the above figure. Why are the metabolite IDs (metID), in grey, showing up under block protein? Should they not show under block metabolite?
  4. Also, when plotting the correlations, it seems that the pairwise correlation of mRNA with the two other layers on component 1 is over 0.9, so I plotted the circos plot with a 0.9 cutoff for components 1 and 2. Is this the correct approach?
  5. Is it advisable to pull out the results of the circos plot and take them as the most highly correlated features across all omics layers? How can I do that? I could not find a way to extract that information.
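To make queries 2 and 5 concrete, here is what I tried on simulated stand-in data (I am assuming `selectVar`'s output structure from the docs, and that `circosPlot` invisibly returns the similarity matrix it draws in recent versions; please correct me if not):

```r
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)
  set.seed(1)
  # Stand-ins for my three blocks (10 samples each)
  data <- list(
    mRNA       = matrix(rnorm(10 * 60), nrow = 10,
                        dimnames = list(NULL, paste0("gene", 1:60))),
    protein    = matrix(rnorm(10 * 30), nrow = 10,
                        dimnames = list(NULL, paste0("prot", 1:30))),
    metabolite = matrix(rnorm(10 * 20), nrow = 10,
                        dimnames = list(NULL, paste0("met", 1:20))))
  Y <- factor(rep(c("control", "diseased"), each = 5))
  fit <- block.splsda(data, Y, ncomp = 2,
                      keepX = list(mRNA = c(5, 5), protein = c(5, 5),
                                   metabolite = c(5, 5)))

  # Query 5: selected features (and loadings) per block and component
  sel.mRNA <- selectVar(fit, block = "mRNA", comp = 1)$mRNA$name

  # circosPlot's similarity matrix, captured and filtered by a cutoff
  cor.mat <- circosPlot(fit, cutoff = 0.5)

  # Query 2: leave-one-out CV instead of M-fold, for small n
  perf.loo <- perf(fit, validation = "loo")
}
```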

I will try sPLS on pairs of data soon. I did that earlier to familiarize myself with mixOmics, and I should certainly do it on this data at its current dimensions as well, but it would be helpful if you could provide your valuable comments.

Kind regards,

VD

Hi Kim-Anh,
I went back to the DIABLO paper, downloaded the Supervised Multivariate Analyses with mixOmics note, and ran tune() and perf() to find the ncomp and keepX for my data. Below is the code. I used 5-fold CV and just one CPU, with a full weighted design.
Here is the code:

data = list(mRNA = mRNA.D,protein = protein.D,metabolite = metabol.D)
# check dimension
lapply(data, dim)
$mRNA
[1]   10 5559

$protein
[1]   10 1365

$metabolite
[1]  10 471

design = matrix(0.1, ncol = length(data), nrow = length(data),
                dimnames = list(names(data), names(data)))
diag(design) = 0
design
           mRNA protein metabolite
mRNA        0.0     0.1        0.1
protein     0.1     0.0        0.1
metabolite  0.1     0.1        0.0

sgccda.res = block.splsda(X = data, Y = Y, ncomp = 5,design = design)
set.seed(123) # for reproducibility, only when the `cpus' argument is not used
t1 = proc.time()
perf.diablo = perf(sgccda.res, validation = 'Mfold', folds = 5, nrepeat = 5)
t2 = proc.time()
running_time = t2 - t1; running_time
plot(perf.diablo)
perf.diablo$choice.ncomp$WeightedVote
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

test.keepX = list(mRNA       = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  protein    = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  metabolite = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))
t1 = proc.time()

tune.BBM = tune.block.splsda(X = data, Y = Y, ncomp = ncomp,
                             test.keepX = test.keepX, design = design,
                             validation = 'Mfold', folds = 5, nrepeat = 1,
                             dist = "centroids.dist", cpus = 1)
t2 = proc.time()
running_time = t2 - t1; running_time
list.keepX = tune.BBM$choice.keepX
list.keepX

I got the warning:
The SGCCA algorithm did not converge
The list.keepX output is below. My chosen ncomp was also 1.
$mRNA
[1] 5

$protein
[1] 5

$metabolite
[1] 8
Do you think the above is a better way to obtain list.keepX for my data, to then feed into the supervised DIABLO model? Any thoughts? My intention now is to understand the optimal list.keepX for my data and use it in supervised DIABLO for N-integration, rather than feeding an arbitrary list.keepX into the model; I reckon my initial analysis was very arbitrary. Any advice?

(attached image: BER plot)

Kind regards,

VD

Hi VD,
Note that our time is limited, so we may not be able to follow up in depth on your analyses.
Here are some answers below.

Regarding your second message: this is exactly how you should do the tuning and run your final DIABLO model. See the example given below; you can follow very similar steps.
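For example, a minimal sketch of that final step on simulated stand-in data (the tuned values here play the role of your ncomp and list.keepX from tune()):

```r
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)
  set.seed(42)
  # Stand-ins for your `data`, `Y`, and `design` objects
  data <- list(
    mRNA       = matrix(rnorm(10 * 50), nrow = 10,
                        dimnames = list(NULL, paste0("gene", 1:50))),
    protein    = matrix(rnorm(10 * 30), nrow = 10,
                        dimnames = list(NULL, paste0("prot", 1:30))),
    metabolite = matrix(rnorm(10 * 20), nrow = 10,
                        dimnames = list(NULL, paste0("met", 1:20))))
  Y <- factor(rep(c("control", "diseased"), each = 5))
  design <- matrix(0.1, 3, 3, dimnames = list(names(data), names(data)))
  diag(design) <- 0
  list.keepX <- list(mRNA = 5, protein = 5, metabolite = 8)  # from tune()

  # Final DIABLO model with the tuned ncomp and keepX
  final.diablo <- block.splsda(X = data, Y = Y, ncomp = 1,
                               keepX = list.keepX, design = design)

  # The selected variables form your candidate multi-omics signature
  markers <- list(
    mRNA       = selectVar(final.diablo, block = "mRNA",
                           comp = 1)$mRNA$name,
    protein    = selectVar(final.diablo, block = "protein",
                           comp = 1)$protein$name,
    metabolite = selectVar(final.diablo, block = "metabolite",
                           comp = 1)$metabolite$name)
}
```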

Good luck!
Kim-Anh


Dear Kim-Anh,

Many thanks for your input; I very much appreciate you finding time in your busy schedule.

I have now been able to extract the selected variables using the tune operations with 5-fold CV. I have also explored various design matrices to understand which design better predicts my outcome of interest, and how meaningful the selected variables are in the context of the biology we are addressing post N-integration. Thanks for the blogs and the various materials that helped me dive deeper into the mixOmics modules and finally fit the parameters in the final DIABLO model for our multi-omics integration.

Kind regards,

VD

Dear Kim-Anh,

I would like to reopen this thread for a very similar question. I have a rather small dataset (n = 19) and two modalities (mRNA, microbiome) that I want to use to predict a severity score in my metadata. I am not sure how to interpret the error rate vs. component plot I got. In theory, I should keep only one component, as it shows the lowest BER. But that seems counterintuitive, as one cannot plot the results of the final model with only one component. Am I misunderstanding something?

Hi @tkapello,

Your question was missing from this post, but the answer is:
It seems that your data are not easy to separate, so you could adopt an exploratory approach (no tuning): inspect the graphical outputs and see if they make sense, probably considering only components 1 and 2.

Kim-Anh
