Is it possible to do a multi level sPLS analysis?

Hi there, I am a new mixOmics user - so hopefully this will not be a silly question! :slight_smile:

My dataset - I am applying a read-based metagenomic analysis on seawater communities retrieved from 48 sites across the Great Barrier Reef, together with environmental data on water quality (e.g concentration of phosphate, ammonia, nitrate, and etc…). The sampling was done only once in timepoint, and in 4 sampling trips - essentially 3 sampling events were done during the wet season (summer), and one during the dry season (winter).

I am interested in identifying indicator microbes to high nutrient loads - to do this, I am applying an sPLS method as a way to find correlation patterns between 2 continuous datasets - microbial counts and water quality data. And I did manage to identify certain taxa that correlate to high nutrient loads.

As the next step, I would like to do a multi level analysis and ‘subtract’ the effect of sampling trip - however I am unsure how this can be done for sPLS analysis. Reasons to apply a multilevel approach - some measurements are higher for all of my winter samples, like phosphate - I think this may be overemphasizing the role of phosphate in structuring the microbial communities, and that taxa which correlate to winter conditions in the reef (low nutrient and temperature) seem to correlate to phosphate. Do you have any advice on how this can be done? Or is there an analysis that would be better suited for my particular question?

Thank you & I look forward to hearing from you!
Marko

Hi there,

If you have a brief look on the website, you’ll see we have pages which answer your exact question. Please read the Multilevel page and Multilevel sPLS Case Study page. If you have questions after reading these, feel free to ask

Hi @MaxBladen,

Thank you for your answer! Somehow I missed this tutorial, apologies for the basic question.

I will get back in touch with you in case of additional doubts!

Very best wishes,
Marko

Hi @MaxBladen / MixOmics community,

I have additional questions about the Multilevel analysis. Sorry in advance for the long message!

I am aiming to identify which microbial taxa/genes correlate to water quality measurements - is sPLS the best option analysis to use? I do not have 2 omics datasets, but 2 continuous datasets – microbial counts (metagenomics data) and water quality data (nutrient values expressed in conc.). Guidi et al (2016) applied the same method and had a similar dataset, but I wonder if there is a more appropriate one;

When tuning the parameters in the sPLS analysis, I am always getting that 2PCs is the optimal number to retain – I do think this makes sense in terms of biology, as we do not have a strong environmental gradient, and seawater communities are generally quite similar between different sites. Would you agree?

How can I fix the coloring **scale when making the heatmap, using the cim command? I am getting the results where genes (function) correlates better (higher correlation coefficient) to env. data compared to taxonomy, and I would like to emphasize this by fixing the color scale;

Multilevel analysis – I am trying to subtract the effect of sampling trip (4 trips, categorical data) when correlating microbial abundances to env. measurements (continuous data). I am not sure if I am doing everything right, because once I apply it – I get a lot more ‘indicator’ taxa, while environmental measurements still appear correlated… One thing that I wanted to apply is to use a stratified CV approach – how can I apply this CV method when tuning the parameters in multi level sPLS? I do think it’s relevant in my case because the number of samples differs across the 4 sampling trips. Is there any other recommendation from your end;

From what I read, the main recommendation when working with microbial datasets is to CLR transform the data – which I did already (and the patterns make a lot more sense, so thank you for this advice :slight_smile: ). Is there any additional advice when working with microbial data in particular?

So far the mixOmics package seems really useful, and I will attend your course in November (can’t wait!). I saw on the web page that you also offer internship opportunities, and I wanted to see if that would be a possibility. If you have any information to share about internships, please let me know;

Thank you for your time and help, I look forward to hearing from you!
Marko

Hello again @MarkoTerzin!

is sPLS the best option analysis to use?

By the sounds of experimental design, sPLS and multilevel sPLS are going to be your best bet. You could also explore rCCA (read more here and here).

Out of curiosity, what are the dimensions of your datasets? How many missing values are there?

I am always getting that 2PCs is the optimal number to retain

Regarding this comment and your interpretation: I’ve seen many cases where 2 components was more than sufficient for a given analysis. However, without your results its hard to say. I would encourage you to explore the explained variance of each component and the loadings (feature contributions) to each component to determine whether this model is valid. If you want me to go more in depth regarding this, just say the word

How can I fix the coloring scale when making the heatmap

I’m surprised I haven’t seen someone asking about this before - this seems like a useful feature. I’ll add it to my list! You can set symkey to FALSE (see ?cim) which will not force the key to be symmetric. This may help with the colouring slightly. Otherwise, I’ll keep you updated regarding the implementation of a way to properly control the scale.

stratified CV approach – how can I apply this CV method

I’ve just realised that interestingly, only the SPLS-DA methods apply stratified sampling using the classes of each sample. Stratifying the subsamples by the repeated measurements would be handy. I’ll also look into a implementation of this for multilevel sPLS.

Is there any additional advice when working with microbial data in particular?

I’m not an expert when it comes to microbial data. Pretty much everything I can speak to is contained on the site. This includes the three primary steps of pre-processing covered here.

Having a bit of a look online, I found this page which may be of use. I also found a python repository and accompanying paper which you could reverse engineer to some extent.

I will attend your course in November

Fantastic to hear! Regarding the internships, I’ll ask A/Prof Lê Cao as she’ll know - I wasn’t even aware we did!

Hope this info dump is helpful

Cheers,
Max.

Hi @MaxBladen ,

Thank you for your email! This is super helpful. My response is below in bold:

By the sounds of experimental design, sPLS and multilevel sPLS are going to be your best bet. You could also explore rCCA (read more here and here).

Good to hear I am on the right track!

Out of curiosity, what are the dimensions of your datasets? How many missing values are there?

So I am lucky to have quite a lot of samples – 48 sites across the Great Barrier Reef, each with 4 replicates. After filtering some out, I end up with 183 samples. Sampling for these sites was done once in time, across 4 individual sampling trips – I am seeing the block effect of these trips which I would like to remove in the sPLS analysis, as I wrote already. For each of the sites we collected environmental data on water quality and benthic cover at the same time – so our omic data is perfectly synchronised with environmental data. In terms of missing values, I have quite a lot – but this is what is expected when working on microbial datasets. I dealt with these missing vals by introducing pseudocounts (raw counts + 1), and doing a center log ratio transformation, as recommended in the package. For the missing vals in the environmental metadata (there were not too many, they lacked only for some of the sites), I imputed the vals using the NIPALS algorithm – not sure how appropriate this is though.

Essentially I only have 3 groups of microbes that are abundant, all of the other microbes in the seawater communities are less than 0.1 percent abundant! To give you a better idea, I filtered out those microbes with less than 0.00001 relative abundance, and this halved my dataset.

I would encourage you to explore the explained variance of each component and the loadings (feature contributions) to each component to determine whether this model is valid.

That’s something I have been trying to do in the last few weeks. Essentially in the sPLS-DA analysis (when I wish to discriminate between our 4 sampling trips) I got that the optimal number of PCs to retain is 8 for my taxonomy dataset, and 9 for the functional (genes) dataset (I have metagenomics data). It is always that the first two components discriminate between my winter and peak of summer samples – even when I keep only a few variables for each of the 2 retained PCs. Keeping 8 or 9 PCs would be needed to discriminate between reef sites that were also taken during the wet season (but early summer, not peak!), but that is not something I am essentially interested in because with the sPLS-DA analysis I just want to show the main differences in microbial profiles between summer and winter. So when plotting (I use heatmaps here), I always keep only the strongest indicator taxa/genes that discriminate on the first 2 PCs – and that’s enough.

I do think I would need more PCs for the multi-level sPLS analysis though. The tune.splslevel function seems to have a bug though, but I will repeat the multilevel sPLS analysis and let you know which error exactly I was getting.

I was also wondering if you could share some tutorials on sPLS-based prediction? I am hypothesising that functional information will act as a better predictor of environmental parameters compared to taxonomy. My idea is to do the sPLS-based prediction of a continuous response (water quality data) with multiple subsets: training the model on 10%, 20%, 30%… 90% of data, and validating the prediction performance on the remaining samples. I can do this for both taxa and function - ideally I would expect to see that taxonomy-based prediction will be accurate when training the model on more samples, while accurate predictions based on function will maybe be possible on a smaller training dataset. So instead of just saying ‘function is a better predictor of the environment compared to taxonomy’, this way I would quantify it and say for example 'by training the model on 60% of the data, prediction performance improved for function compared to taxonomy by x %'. I hope I’m explaining well – would you say this approach makes sense?

If you want me to go more in depth regarding this (tuning the parameters), just say the word

That would be amazing, thank you Max! I also think it is hard to explain what I want to do without showing you the actual code and the results. What is the easiest way to do this – should I prepare an R Markdown html report and share this with you?

I’m surprised I haven’t seen someone asking about this before - this seems like a useful feature. I’ll add it to my list!

Thank you!

You can set symkey to FALSE (see ?cim ) which will not force the key to be symmetric.

Sounds good – I will test this. Thank you!

I’ve just realised that interestingly, only the SPLS-DA methods apply stratified sampling using the classes of each sample.

Good to know! I am also testing the sPLS-DA analysis.

Stratifying the subsamples by the repeated measurements would be handy. I’ll also look into a implementation of this for multilevel sPLS.

Sounds great.

Pretty much everything I can speak to is contained on the site. This includes the three primary steps of pre-processing covered here

I applied all these!

Having a bit of a look online, I found this page which may be of use. I also found a python repository and accompanying paper which you could reverse engineer to some extent.

I will read these as well – although I am certain that the data is transformed properly now, my results make quite a lot of sense after processing them as suggested in mixMC.

Regarding the internships, I’ll ask A/Prof Lê Cao as she’ll know - I wasn’t even aware we did!

Thank you Max, that would be amazing. MixOmics will be useful for pretty much every chapter in my PhD – I want to analyse my single omics data in mixOmics (which I am doing at the moment), but also apply both N and P integration methods in the latter chapters. I am also happy to contact A/Prof Lê Cao myself – please let me know what you think is better.

Hope this info dump is helpful

Very helpful! Thank you again, @MaxBladen - I am sure I will have more questions soon!

Best wishes,

Marko

hi @MarkoTerzin,

Just to complement what @MaxBladen has said, the multilevel approach will not remove the effect per say but account for it (see this review that list the different approaches would could potentially use, noting that it is currently impossible for the integration of data sets: https://academic.oup.com/bib/article-abstract/21/6/1954/5643537).

You can contact me via email directly regarding a potential scientific visit, after the online course would probably best - if this is still relevant.

Kim-Anh

Dear @kimanh.lecao ,

Thank you very much for your message and clarification. I will read the review paper you shared!

I am sure that it will be relevant after a course - mixOmics proved to be very useful so far. But I agree, it is probably best that we discuss a potential visit after the course. I look forward to the course!

Very best wishes,
Marko

Dear @mixomics ,

Thank you again for developing such a useful tool.

I have a few questions that are linked to a predictive sPLS framework (ideally doing a multilevel analysis), that I hope you can help me with:

  • What would you say the best metric is for prediction performance in sPLS?

I have metagenomic data on seawater microbial communities from the GBR, and I am trying to prove that Microbial Function is a better predictor of environmental changes (water quality data) compared to Microbial Taxonomy. To do this, I extracted the VIP scores for all variables (taxa/genes) from PC1 of a multilevel sPLS analysis, and the VIP scores indeed seem to be higher for function. I then checked for the Stability of the signal as well - frequency of occurrence of variables on PC1 is also higher for genes compared to taxonomy. Would you say this is enough for me to say that microbial function may be more informative/better predictor of env. changes compared to taxonomy?
(please see the boxplot below)

  • Is there any tutorial to you could share on sPLS based prediction?

I did try to follow the tutorial from the textbook - 10.6 Take a detour: PLS2 regression for prediction (page 163). Yet I don’t have an external dataset, so instead (1) I split my data into training (using 75% of my data) and validation datasets, then (2) predicted the water quality values for the sites in the validation dataset from microbial taxonomy and functional profiles, and lastly (3) computed the Mean Squared Error (MSE) and R2 values between actual and predicted water quality values for the validation dataset. I was expecting to see higher R2 and lower MSE values for function compared to taxonomy.

Yet I am getting very different results every time I run the analysis - I believe this may be because of random subsampling of samples, and there are 2 ways I thought of addressing this. So my questions are: is there a way to do a stratified sPLS analysis, so that I take a representative number of samples with each group (in my case winter and summer samples) every time I subset a random subsample? This would be important as my winter sites are outliers, so I am guessing what happens is: if the random subset in the training datasets is mainly composed of summer sites, the model will be very bad at predicting WQ parameters for the winter period (and vice versa);

The other thing I tried is to do a multilevel sPLS (for my data it is important to remove the effect of sampling trip, as again my winter reef sites are very different compared to summer sites) - but I noticed that if I do the multilevel sPLS, the predict() function does not seem to work - the predicted values I am getting for wq parameters (e.g temperature, nutrient concentrations) do not make any sense and are even negative. It almost looks as if the function is trying to predict the microbial taxa/gene abundance values (CLR transformed), and not WQ data, but I am unsure;

Please let me know if you have any advice on how I can move ahead.

Thank you for your time and assistance, I look forward to hearing from you! :slight_smile:

Very best wishes,
Marko

What would you say the best metric is for prediction performance in sPLS?

This is one of those questions where I can’t give a definite answer. It is so data and case specific. I would say about your method, that the VIP measure might not be the best way to go. It is a measure of the importance of a given feature to the predictors, not the outcome. Hence, its quite risky to compare VIP scores from different predictor datasets. I’d recommend looking at the loading values instead, as these measure the contribution of each predictor to the response. As you are using the same response variable, these values are comparable and a little more meaningful. To regard your question as to whether that is enough, I’d say probably not. The best way to determine the relative efficacy of your datasets is to compare their predictive performance (which brings us to your second point).

Is there any tutorial to you could share on sPLS based prediction?

We haven’t really put one together for this in particular. I’d definitely recommend exploring the Q2 score (read the last section of this page.

is there a way to do a stratified sPLS analysis

This is a feature I have seen requested by a few other people. Unfortunately, I haven’t had the time to implement it. You could do this externally to the spls() function manually.

if I do the multilevel sPLS, the predict() function does not seem to work

I’d recommend doing something similar to what I suggested in the other post I replied to you a few minutes ago. Use the withinVariation() function externally, then just use the standard spls() function. From this, the predict() function should operate as normal. If it still gives you weird results, let me know and we can discuss further options.

Cheers,
Max.