Hi @MaxBladen ,
Thank you for your email! This is super helpful. My responses are below in bold:
By the sounds of experimental design, sPLS and multilevel sPLS are going to be your best bet. You could also explore rCCA (read more here and here).
Good to hear I am on the right track!
Out of curiosity, what are the dimensions of your datasets? How many missing values are there?
So I am lucky to have quite a lot of samples: 48 sites across the Great Barrier Reef, each with 4 replicates, which leaves 183 samples after filtering some out. Sampling was done at a single time point per site, across 4 individual sampling trips, and I am seeing the block effect of these trips, which I would like to remove in the sPLS analysis, as I wrote already. For each site we collected environmental data on water quality and benthic cover at the same time, so our omic data is perfectly synchronised with the environmental data.

In terms of missing values, I have quite a lot, but this is expected when working with microbial datasets. I dealt with these by introducing pseudocounts (raw counts + 1) and applying a centred log-ratio transformation, as recommended in the package. For the missing values in the environmental metadata (there were not too many; values were missing only for some of the sites), I imputed them using the NIPALS algorithm – not sure how appropriate this is, though.
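In case it helps to see it concretely, the pre-processing I describe above can be sketched in a few lines of mixOmics (a minimal sketch on made-up toy matrices; I am assuming `logratio.transfo` and `impute.nipals` are the right functions for this, as in the mixMC workflow):

```r
library(mixOmics)
set.seed(42)

# Toy count matrix standing in for the microbial data (samples x taxa)
counts <- matrix(rpois(20 * 50, lambda = 3), nrow = 20)

# Pseudocount + centred log-ratio transform, as recommended in mixMC;
# logratio.transfo() applies the offset internally, so raw counts go in directly
clr_counts <- logratio.transfo(counts, logratio = "CLR", offset = 1)

# NIPALS-based imputation of the (few) missing environmental values
env <- matrix(rnorm(20 * 6), nrow = 20)
env[sample(length(env), 5)] <- NA            # sprinkle in some NAs
env_imputed <- impute.nipals(env, ncomp = 3)
```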
Essentially, I only have 3 groups of microbes that are abundant; all of the other microbes in the seawater communities are below 0.1% abundance! To give you a better idea, filtering out microbes with less than 0.00001 relative abundance halved my dataset.
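For reference, the filtering step I mean is just a threshold on pooled relative abundance (a sketch on toy data; the 1e-5 cut-off matches the value above, and the variable names are hypothetical):

```r
set.seed(1)
# counts: samples x taxa raw count matrix (toy example)
counts <- matrix(rpois(20 * 100, lambda = 2), nrow = 20)

# Overall relative abundance of each taxon, pooled across all samples
rel_abund <- colSums(counts) / sum(counts)

# Keep only taxa above the 1e-5 relative-abundance cut-off
counts_filtered <- counts[, rel_abund > 1e-5]
```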
I would encourage you to explore the explained variance of each component and the loadings (feature contributions) to each component to determine whether this model is valid.
That’s something I have been trying to do in the last few weeks. In the sPLS-DA analysis (where I want to discriminate between our 4 sampling trips), the optimal number of components to retain came out as 8 for my taxonomy dataset and 9 for the functional (gene) dataset (I have metagenomics data). The first two components always discriminate between my winter and peak-of-summer samples, even when I keep only a few variables on each of the 2 retained components. Keeping 8 or 9 components would be needed to also discriminate the reef sites sampled during the wet season (early summer, not the peak!), but that is not something I am particularly interested in: with the sPLS-DA analysis I just want to show the main differences in microbial profiles between summer and winter. So when plotting (I use heatmaps here), I always keep only the strongest indicator taxa/genes that discriminate on the first 2 components – and that’s enough.
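In case it is useful for comparing notes, this is roughly how I run the tuning with `tune.splsda` (a sketch on the `srbct` data bundled with mixOmics; the folds, `nrepeat` and `test.keepX` grid here are illustrative, not the values I actually used):

```r
library(mixOmics)
data(srbct)
X <- srbct$gene
Y <- srbct$class

# Tune the number of components and the number of variables kept per
# component, using repeated M-fold cross-validation and balanced error rate
tune_res <- tune.splsda(X, Y, ncomp = 4,
                        validation = "Mfold", folds = 5, nrepeat = 10,
                        test.keepX = c(5, 10, 20, 50),
                        dist = "max.dist", measure = "BER")

tune_res$choice.ncomp$ncomp   # suggested number of components
tune_res$choice.keepX         # suggested keepX per component
```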
I do think I would need more components for the multilevel sPLS analysis, though. The tune.splslevel function seems to have a bug; I will repeat the multilevel sPLS analysis and let you know exactly which error I was getting.
I was also wondering if you could share some tutorials on sPLS-based prediction? I am hypothesising that functional information will be a better predictor of environmental parameters than taxonomy. My idea is to do sPLS-based prediction of a continuous response (water quality data) with multiple subsets: training the model on 10%, 20%, 30%… 90% of the data, and validating prediction performance on the remaining samples. I can do this for both taxa and function. Ideally, I would expect taxonomy-based prediction to be accurate only when the model is trained on more samples, while accurate function-based predictions may already be possible with a smaller training dataset. So instead of just saying ‘function is a better predictor of the environment than taxonomy’, this way I would quantify it and say, for example, 'by training the model on 60% of the data, prediction performance improved for function compared to taxonomy by x %'. I hope I’m explaining this well – would you say this approach makes sense?
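To make the idea concrete, one training-fraction iteration of what I have in mind might look like this (a rough sketch on simulated data, assuming `spls` in regression mode and its `predict` method; the 60% split, dimensions and `keepX` values are all made up):

```r
library(mixOmics)
set.seed(1)

# Simulated stand-ins: X = functional (or taxonomic) profiles, Y = water quality
X <- matrix(rnorm(100 * 40), nrow = 100)
Y <- X[, 1:3] %*% matrix(runif(9), 3, 3) + matrix(rnorm(100 * 3, sd = 0.5), 100, 3)

# One train/test split at a 60% training fraction
train <- sample(nrow(X), size = round(0.6 * nrow(X)))
fit <- spls(X[train, ], Y[train, ], ncomp = 2, mode = "regression",
            keepX = c(10, 10))

# Predict the held-out samples using 2 components, then score with RMSE
pred <- predict(fit, newdata = X[-train, ])$predict[, , 2]
rmse <- sqrt(colMeans((pred - Y[-train, ])^2))
```

Looping this over training fractions (and repeating each split several times) would then give the taxonomy-vs-function comparison I described.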
If you want me to go more in depth regarding this (tuning the parameters), just say the word
That would be amazing, thank you Max! I also think it is hard to explain what I want to do without showing you the actual code and the results. What is the easiest way to do this – should I prepare an R Markdown html report and share this with you?
I’m surprised I haven’t seen someone asking about this before - this seems like a useful feature. I’ll add it to my list!
Thank you!
You can set symkey to FALSE (see ?cim), which will not force the key to be symmetric.
Sounds good – I will test this. Thank you!
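For my own reference, I expect the call would then be something like this (a sketch on the `nutrimouse` data bundled with mixOmics; the `keepX`/`keepY` values are arbitrary):

```r
library(mixOmics)
data(nutrimouse)

spls_fit <- spls(nutrimouse$gene, nutrimouse$lipid, ncomp = 2,
                 keepX = c(25, 25), keepY = c(10, 10))

# Clustered image map with a colour key that is no longer
# forced to be symmetric around zero
cim(spls_fit, symkey = FALSE)
```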
I’ve just realised that, interestingly, only the sPLS-DA methods apply stratified sampling using the classes of each sample.
Good to know! I am also testing the sPLS-DA analysis.
Stratifying the subsamples by the repeated measurements would be handy. I’ll also look into an implementation of this for multilevel sPLS.
Sounds great.
Pretty much everything I can speak to is contained on the site. This includes the three primary steps of pre-processing covered here
I applied all these!
Having a bit of a look online, I found this page which may be of use. I also found a python repository and accompanying paper which you could reverse engineer to some extent.
I will read these as well – although I am fairly certain the data is transformed properly now; my results make quite a lot of sense after processing them as suggested in mixMC.
Regarding the internships, I’ll ask A/Prof Lê Cao as she’ll know - I wasn’t even aware we did!
Thank you Max, that would be amazing. mixOmics will be useful for pretty much every chapter in my PhD – I want to analyse my single-omics data in mixOmics (which I am doing at the moment), but also apply both N- and P-integration methods in the later chapters. I am also happy to contact A/Prof Lê Cao myself – please let me know what you think is better.
Hope this info dump is helpful
Very helpful! Thank you again, @MaxBladen - I am sure I will have more questions soon!
Best wishes,
Marko