Biological replicates and Diablo

Hi there,

I’m new to MixOmics and I’m having some trouble understanding how I should integrate my biological replicates into my analysis. I have 16 strains and for each, 4 or 5 biological replicates. For each replicate I have RNA Seq and metabolite data (no missing data). My strains belong to two main categories. I want to compare these two broad categories and identify RNA and metabolites present/over-expressed in one of the two groups versus the other, but taking into account the replicates. I was thinking of using Diablo to do this. I consulted the forum and found these topics which seemed to me the most similar to my case: withinVariation() on part of dataset and DIABLO analysis with time points

My questions are the following:
Can I use withinVariation on each of the 16 strains and then concatenate the 16 submatrices with rbind?
Can I do that on both datasets (RNA and metabolites)?
Is the use of diablo appropriate with my data and with the withinVariation function?
Do I necessarily have to use a linear mixed model to account for variation within strains?

I’ve looked in the tutorials (multilevel and multilevel sPLS pages) and in the forum, as well as in the related publications, but I feel like each time it doesn’t quite fit (for example, repeated measurements could have been appropriate, but I don’t have the same strains in my two main categories), so I’d rather ask the question here. i’m sorry if I missed the explanation somewhere.

Thanks a lot

I just want to clarify my understanding of your experimental design. Correct me if I’m wrong. So, you have 4 to 5 repeated measurements for 16 different strains - so you have between 64 - 80 total samples? Is each sample independent or from the same individual at different time points? Each of these strains (and therefore each sample) falls into one of two overarching classes and you’re concerned with the integrated differences between these two classes?

Can I use withinVariation on each of the 16 strains and then concatenate the 16 submatrices with rbind?

The answer depends on what you want to do with it. However, I can’t think why you’d do this. withinVariation() works to decompose the input data - controlling for your repeated measurements of each strain. By doing them separately, the method won’t be able to control for this properly. I may be missing something here though.

Can I do that on both datasets (RNA and metabolites)?

I would recommend applying the withinVariation() function to both datasets in their entirety.

Is the use of diablo appropriate with my data and with the withinVariation function?

I would say so if you use the withinVariation() on the entire datasets rather than each strain

1 Like

Hi MaxBladen,
Thank you for your answer. I will try to clarify my design:

Yes exactly, I have 78 samples

Each of the five sample of a strain originally came from one unique strain. but they were each transplanted into a different bottle for growth before the experiment. They are of the same genotype but they are independent samples (each was measured only once during experiment)

yes, for example all replicates of strains 1 to 9 (~45 samples) are in the first category and all replicates of strains 10 to 16 are in the other category. I would like to identify genes and metabolites that distinguish the two categories, but I would like to take into account the “non-independence” of the replicates in each category so as not to overestimate links between genes and metabolites for example.

Sorry for that, I tried to adapt this answer to my case withinVariation() on part of dataset - #2 by MaxBladen, but I understand now that it is not adapted, thank you :slight_smile:
so taking into account your answers, I arrived at something like this:

rna<-t(read.table("rna",header=T,row.names=1)) #dim=78 600
metabolites<-t(read.table("metabolites",header=T,row.names=1)) #dim=78 400
samples_in_strains<-c(rep(1:4,each=5),rep(5,each=4),rep(6:15,each=5),rep(16,each=4))
design=data.frame(sp=samples_in_strains,row.names=rownames(rna))
categories<-c(rep("A",each=44),rep("B",each=34))
Y=as.factor(categories)
Xw_rna<-withinVariation(rna,design=design)
Xw_metabo<-withinVariation(metabolites,design=design)
X<-list(mRNA=Xw_rna,metabolite=Xw_metabo)
list.keepX<-list(mRNA=c(60,60),metabolite=c(30,30))
test<-block.splsda(X,Y,keepX=list.keepX)

However, on the PlotIndiv, my samples are not separated into two categories at all, everything is mixed up, do you think this is from the code or from my data?

Thank you !

Yes unfortunately, the method described in withinVariation() on part of dataset - #2 by MaxBladen would not be valid here. Also, as I say in that post, it’s a risky procedure and could invalidate your method.

Everything in your code looks appropriate and the way I’d do it. My main question is, are you tuning the values you used to form list.keepX? They seem fairly arbitrarily chosen. If so, that will significantly reduce block.splsda() from discriminating your two classes

Hi,
you are right, these were arbitrary values for the example. When I use tune.splsda on an “Xw” RNA array in which I used withinVariation, or with multilevel, I get very high error rates (~0.6). And the difference between my two categories disappears (when I consider replicates as independent individuals, there is a separation between the two categories). I think this is because I don’t have enough samples. What do you think?
Interestingly, I find differences in expression when I use the Limma package. I don’t understand why I can’t find them with Mixomics, I must be missing something!
Here is the code I used:

res<-splsda(X, Y, ncomp=6, keepX=c(100, 100), multilevel=design)

#Splitting the variation for 1 level factor.
#Warning messages:
#1: The SGCCA algorithm did not converge 
#2: The SGCCA algorithm did not converge 
#3: The SGCCA algorithm did not converge

perf.splsda<-perf(res, validation="Mfold", folds=2, nrepeat=50)

list.keepX<-c(5:10, seq(20, 100, 10))
t<-tune.splsda(X, Y, ncomp=2, validation="Mfold", folds=2, dist="max.dist", multilevel=design,
               measure="BER", test.keepX=list.keepX, nrepeat=50)
t$choice.keepX
#comp1 comp2 
#   6    50 

And the two plots (with and without multilevel)
splsda_rna_multilevel
splsda_rna

thanks a lot for your help!

What an interesting result. Increasing your sample size will always be of benefit - but I can’t be sure that is the only cause here. I’m really not sure to be honest.

I had a brief read through of this paper. I didn’t find the answer but it may be found in there.

I’ll ask around a bit and do some thinking. If I come up with anything I’ll be sure to let you know

Hi,
I’ll look at this paper and try to find an answer to my problem.
Thanks again for your help!