Pre-filtering data prior to sPLS

I have a quick question about running mixOmics sPLS on my datasets - I have been going back and forth on this. In some papers, filtering the data down to only the important taxa/genes seems useful, but their datasets still remain fairly large.

In my case, I have about 800 taxa and 20,400 genes in my expression table. However, a separate analysis identified 32 genes as important DEGs and 11 taxa as key players.

Is it alright if I first preprocess my metagenomic data (CLR transformation on all 800 taxa), then subset to the top 11 taxa, then filter the normalized gene expression table down to the 32 genes (both tables share 100 matching samples), and then run sPLS on those subsets?
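For reference, my code looks roughly like this (a simplified sketch; `taxa_counts`, `gene_expr`, `top_taxa` and `top_genes` stand in for my real objects):

    library(mixOmics)

    # CLR-transform the full taxa table first (offset for zero counts),
    # then subset to the 11 key taxa and the 32 DEGs
    taxa_clr <- logratio.transfo(as.matrix(taxa_counts), logratio = "CLR", offset = 1)
    Y <- taxa_clr[, top_taxa]                # 100 samples x 11 taxa
    X <- as.matrix(gene_expr)[, top_genes]   # 100 samples x 32 genes

    # sPLS with no keepX/keepY, so all supplied features are retained
    spls_small <- spls(X, Y, ncomp = 2, mode = "regression")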

I have done it this way and I get some correlation values (I run with only 2 components and no keepX or keepY, as I want all of the features to be used). When I run sPLS on the entire datasets instead (no filtering, but still no keepX or keepY so I can get correlations for all features), the correlation values for my top taxa and top genes are drastically different when I look them up manually.
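To compare, I fit the unfiltered model the same way and pull out the feature-component correlations manually (again a sketch, using the objects from above):

    # Fit on the full, unfiltered tables
    spls_full <- spls(as.matrix(gene_expr), taxa_clr, ncomp = 2, mode = "regression")

    # Correlation of each feature of interest with the components
    # (the same quantities plotVar() displays on the correlation circle)
    cor(as.matrix(gene_expr)[, top_genes], spls_full$variates$X)
    cor(taxa_clr[, top_taxa], spls_full$variates$Y)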

I’m just wondering what approach makes the most sense when going from a relatively large dataset to a small focus group. My goal is to create a network of the correlations between the most important genes/taxa in my dataset and to see how those correlations change across experimental conditions.

Both the filtered and unfiltered processes you’ve described are fine! It really depends on what your goals are. For instance, if your work is just looking to verify previous literature’s claims about the importance of these genes and taxa, then running a model with just these features makes sense. However, the drawback of this procedure is that you can’t examine your genes and taxa of interest in the context of all the other features.

When you do look at these variables of interest with all the other features present, the correlation values are bound to differ from those of a model run on just these features. In (s)PLS, each component is constructed to maximise the covariance between it and its corresponding component in the other block. Also remember: your features of interest are just a handful among the ~21,200 you have supplied!
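Concretely, on each dimension, (s)PLS chooses loading vectors a and b to solve

    max cov(X a, Y b)   subject to   ||a|| = ||b|| = 1

(sPLS adds a lasso penalty on a and b, which is what keepX/keepY control), so every feature you supply competes for weight in those loading vectors.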

Think about this example: one of your GOIs has a very high correlation in the smaller model but a low correlation in the full model (all taxa and genes). Suppose this GOI has some non-negligible correlation with some of the other genes (not of interest). When we construct a component using all of the features, the GOI carries “redundant information”, as it relates to those other genes’ expression. Its loading on this component is therefore likely to be smaller, since the model can recover some of the GOI’s information by weighting the other genes more heavily. Hence, the resulting feature-component correlation will be different.
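To see this mechanism in isolation, here is a small toy simulation (entirely synthetic data, not your study design): a gene of interest shares a latent signal with the taxa block, and adding 50 redundant genes carrying the same signal shrinks its component-1 loading.

    library(mixOmics)
    set.seed(42)

    n      <- 100
    signal <- rnorm(n)   # latent signal linking the two blocks

    # Taxa block driven by the same signal
    taxa <- matrix(signal + rnorm(n * 5, sd = 0.5), n, 5)
    colnames(taxa) <- paste0("taxon", 1:5)

    # Gene of interest (GOI) plus unrelated noise genes
    goi         <- signal + rnorm(n, sd = 0.3)
    noise_genes <- matrix(rnorm(n * 10), n, 10)
    colnames(noise_genes) <- paste0("noise", 1:10)
    X_small <- cbind(GOI = goi, noise_genes)

    # Larger block: add 50 redundant genes carrying the same signal as the GOI
    redundant <- matrix(signal + rnorm(n * 50, sd = 0.3), n, 50)
    colnames(redundant) <- paste0("redundant", 1:50)
    X_large <- cbind(X_small, redundant)

    fit_small <- spls(X_small, taxa, ncomp = 1, mode = "regression")
    fit_large <- spls(X_large, taxa, ncomp = 1, mode = "regression")

    # The GOI's loading shrinks once redundant genes can supply its information
    abs(fit_small$loadings$X["GOI", 1])
    abs(fit_large$loadings$X["GOI", 1])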

That is just one example that could explain the discrepancy. You are likely to see some large differences when you change the number of supplied features by ~20,000.

So, to address your explicit goal of creating a network of correlations: I would say the filtered approach will be of most interest to you. To complement it, however, I have a few recommendations (sketched in code after this list):

  • Run spls() using all features, with no keepX or keepY.
  • Run spls() using all features, with keepX and keepY set to 32 and 11 respectively.
  • Run tune.spls() to determine the optimal keepX and keepY for your data, then generate a model (via spls()) using these values and all features.
  • Compare the correlations, loadings and accuracy of all of the above models. This should help you gauge the importance of your features of interest, both relative to one another and relative to your whole dataset. With the second model, you can also check whether your features of interest are the ones selected.
  • Explore the cim() and network() functions in mixOmics. These will be a massive boon to your work in generating correlation networks.
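The comparison could look something like the following. This is a rough sketch, not a prescription: `X_all` / `Y_all` stand for your full gene and CLR-transformed taxa tables, `top_genes` / `top_taxa` for your features of interest, and the tuning grids and cutoff are arbitrary placeholders to adapt to your data.

    # 1) All features, no selection
    full_model <- spls(X_all, Y_all, ncomp = 2, mode = "regression")

    # 2) All features, forcing 32 genes and 11 taxa to be selected per component
    forced_model <- spls(X_all, Y_all, ncomp = 2,
                         keepX = c(32, 32), keepY = c(11, 11),
                         mode = "regression")

    # Does the forced model select your features of interest on component 1?
    intersect(selectVar(forced_model, comp = 1)$X$name, top_genes)
    intersect(selectVar(forced_model, comp = 1)$Y$name, top_taxa)

    # 3) Cross-validated tuning of keepX/keepY, then a final model
    tuned <- tune.spls(X_all, Y_all, ncomp = 2,
                       test.keepX = c(10, 25, 50, 100),
                       test.keepY = c(5, 11, 25),
                       validation = "Mfold", folds = 5, nrepeat = 10,
                       measure = "cor")
    tuned_model <- spls(X_all, Y_all, ncomp = 2,
                        keepX = tuned$choice.keepX,
                        keepY = tuned$choice.keepY,
                        mode = "regression")

    # Correlation network and clustered image map for the model you settle on
    network(tuned_model, comp = 1:2, cutoff = 0.6)
    cim(tuned_model, comp = 1:2)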