Integration with DIABLO for N-ingretaion with low sample size

I intended to perform an N-integration analysis using mixOmics. My design is as below:
mRNA normalized data: ~20K genes across 10 samples( 5 control vs 5 Diseased)
Proteomics: ~1200 analytes across same 10 samples
metabolites: ~400 analytes across same 10 samples
Phenotype data: ~18 predictors across the same 10 samples

Is it possible to use DIABLO to integrate all 4 layers of (transcriptome, proteome, metabolomics and phenotype) data together to extract features that can more define markers for the phenotype? Or do you suggest to use sPLS or rCCA to find more correlated features across same layer to each other and then feed them to DIABLO?

What will be your suggestion for the above design?

Is there also a vignette that I can also take a look into for integrating such low samples?



Hi VD,

Yes, you can integrate all these data types, despite the low sample size. What you probably need to be careful about is during the parameters tuning (use leave one out cross validation, or just adopt a more exploratory approach, where you arbitrarily decide a small numbers of biomarkers to be selected). You can use the usual DIABLO vignette, but basically remove the tune() and perf() functions from your analyses (those are the ones using cross validation).

We would also recommend you filter out your genes. For example only keep the most 5000 highly variable genes across all samples (low variance means those genes probably don’t change much across the conditions).

We also advise you start with sPLS to get more insight into pairs of data before you jump into N-integration!

Let us know if you have further questions.


1 Like

Hi Kim-Anh,

Many thanks for the reply. I have actually thought of using sPLS first as an exploratory analysis to only make paired canonical correlation across mRNA~Protein and mRNA~metabolite but wanted to also see if I can directly feed all 3 layers of omics into DIABLO.

Currently, I did something as you suggested:

  1. Reduced the mRNA genes to highly variable genes across all samples based on 1st quantile, currently, it is around ~5500 highly variable genes
  2. Took all 3 layers and followed the DIABLO tutorial. I have some queries as below

My dataset dims are as below(I am not using the phenotypic layer):
mRNA normalized data: ~5K genes across 10 samples( 5 control vs 5 Diseased)
Proteomics: ~1200 analytes across same 10 samples
metabolites: ~400 analytes across same 10 samples


X <- list(mRNA = mRNA.D, protein = protein.D, metabolite = metabol.D)
Y <- as.factor(dietGroup)
##Not sure if this is correct for the keepX
list.keepX <- list(mRNA = c(16, 17), protein = c(18,5), metabolite = c(5, 5))
MyResult.diablo <- block.splsda(X, Y, keepX=list.keepX)
          ind.names = FALSE, 
          legend=TRUE, cex=c(1,2),
          title = 'Results with DIABLO')
plotVar(MyResult.diablo, var.names = c(FALSE, FALSE, TRUE),
        legend=TRUE, pch=c(16,16,1))
plotDiablo(MyResult.diablo, ncomp = 1)
plotDiablo(MyResult.diablo, ncomp = 2)
circosPlot(MyResult.diablo, cutoff=0.9)
plotLoadings(MyResult.diablo, comp = 1, contrib = "max")
plotLoadings(MyResult.diablo, comp = 2, contrib = "max")
plotLoadings(MyResult.diablo, comp = 1, contrib = "min")
plotLoadings(MyResult.diablo, comp = 2, contrib = "min")

Rest images are in the link as I am unable to put mulitple.


  1. How should be the keepX parameter tuned specifically with low overall samples in my case?
  2. How do I select the parameters tuning(also where does the leave one out cross-validation falls here). Any links from the tutorial to look into?
  3. I am actually not able to understand the above figure. Why is the metabolic ID as metID in grey and showing up as block protein? Should not they be showing as block metabolite?
  4. Also plotting the correlation it seems that pairwise correlation of mRNA with two other layers based on ncomp=1 is over 90% so I plotted the circos for over 90% for comp(1,2). Is this the correct analogy?
  5. Is it advisable to pull out the results of the circos and take them as the highest correlated features across all omics layers? How can I do that? I could not find a way to extract that information.

I will try the sPLS using pairs soon. I did that earlier to familiarize myself mixOmics but certainly, I should do that as well in this data with the current dimension, but it would be helpful if you can provide your valuable comments.

Kind regards,


Hi Kim-Anh,
I went bak to the DIABLO paper and downloaded the Supervised Multivariate Analyses with mixOmics note and ran the tune() and perf() to find the ncomp and keepX for my data. Below is the code. I used a 5 fold CV and just one CPU. Design is Full weighted design.
Here is the code:

data = list(mRNA = mRNA.D,protein = protein.D,metabolite = metabol.D)
# check dimension
lapply(data, dim)
[1]   10 5559

[1]   10 1365

[1]  10 471

design = matrix(0.1, ncol = length(data), nrow = length(data),dimnames = list(names(data), names(data)))
diag(design) = 0
           mRNA protein metabolite
mRNA        0.0     0.1        0.1
protein     0.1     0.0        0.1
metabolite  0.1     0.1        0.0

sgccda.res = block.splsda(X = data, Y = Y, ncomp = 5,design = design)
set.seed(123) # for reproducibility, only when the `cpus' argument is not used
t1 = proc.time()
perf.diablo = perf(sgccda.res, validation = 'Mfold', folds = 5, nrepeat = 5)
t2 = proc.time()
running_time = t2 - t1; running_time
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

test.keepX = list (mRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)), protein = c(5:9, seq(10, 18, 2), seq(20,30,5)),metabolite = c(5:9, seq(10, 18, 2), seq(20,30,5)))
t1 = proc.time()

tune.BBM = tune.block.splsda(X = data, Y = Y, ncomp = ncomp,test.keepX = test.keepX, design = design, validation = 'Mfold', folds = 5, nrepeat = 1,dist = "centroids.dist", cpus = 1)
t2 = proc.time()
running_time = t2 - t1; running_time
list.keepX = tune.BBM$choice.keepX

I got the warning
: The SGCCA algorithm did not converge
list.keepX output is below. My ncomp was also 1.
[1] 5

[1] 5

[1] 8
Do you think the above is a better way to pull out the list.keepX for my data and then feed this to the supervised DIABLO? Any thoughts? My intention is to now understand what is the optimal list.keepX for my data and use that for supervised DIABLO for N-integration. I am not intending to put anything like arbitrary list.keepX to the model for N-integration. I reckon my initial analysis is very arbitrary. Any advice?


Kind regards,