sPLS with X and Y matrix

Hello there,
I’m new to mixOmics and would like to ask a question. First, a brief description of my data: I’m a river ecologist and I have collected biofilm and planktonic samples across an entire river network. At each site we also sampled several water chemistry variables and performed extracellular enzymatic activity assays. After the fieldwork we calculated the upstream catchment area for each site. In the end I have the following data:

  1. X matrix, composed of 115 samples (i.e. sites) and 1562 ASVs (amplicon sequence variants)
  2. Y matrix, with environmental data (catchment area for each site & conductivity) as well as data from the enzyme assays (8 different enzymes)

The problem is that the enzyme data contain NAs. I could in theory split the data into two data sets, one containing all the biofilm data and the other all the planktonic data. But my initial goal was to analyze them together, since I am interested in how different communities produce different local functioning (enzyme activities).

In the biofilm data set I have nearly no NAs, so there I could just remove the sites that have NAs from the X and the Y matrix. For the planktonic data, however, I have a lot of NAs, so removing that many sites from both the biofilm and the planktonic data sets would be a huge loss of information.

Hence my question: given this, do you think it is reasonable to separate the data sets, accept a different number of sites in each, and run the sPLS separately for the biofilm and the planktonic data? In the end it comes down to whether the results of these two analyses will still be comparable when they are not based on the same number of sites.

I’d appreciate your help.

All the best,

Hi @lukastb,

We have a few tricks to handle missing values, but you will need to proceed with caution.

  • Include the NAs as is, provided a PCA plot of the eigenvalues (proportion of explained variance per component) shows values that decrease from one component to the next. Note, however, that some of the predict / perf / tune functions in PLS / PLS-DA may not work with missing values.
  • First estimate the missing values with NIPALS, if you don't have too many of them. See http://mixomics.org/methods/missing-values/
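
To make the second option more concrete, here is a minimal sketch of the idea behind NIPALS imputation, written in plain Python so it can be read without R. It is not the mixOmics implementation: the function name `nipals_impute`, its parameters and the toy matrix are all made up for illustration. It fits a few components using only the observed cells, fills each missing cell from the low-rank reconstruction, and also returns the share of the observed sum of squares removed by each component, which is the kind of quantity you would want to see decreasing in the first option.

```python
import math

def nipals_impute(X, ncomp=2, center=True, max_iter=500, tol=1e-9):
    """Fill None entries of X (a list of rows) from a NIPALS low-rank fit.

    Hypothetical helper for illustration only, not the mixOmics API.
    Returns (filled matrix, per-component share of observed sum of squares).
    """
    n, p = len(X), len(X[0])
    obs = [[X[i][j] is not None for j in range(p)] for i in range(n)]

    # Column means over observed values only (0.0 when centring is off).
    means = []
    for j in range(p):
        vals = [X[i][j] for i in range(n) if obs[i][j]]
        means.append(sum(vals) / len(vals) if center and vals else 0.0)

    # Residual matrix; missing cells hold 0.0 and are always skipped via `obs`.
    R = [[X[i][j] - means[j] if obs[i][j] else 0.0 for j in range(p)]
         for i in range(n)]
    total_ss = sum(v * v for row in R for v in row)

    scores, loadings, explained = [], [], []
    for _comp in range(ncomp):
        ss_before = sum(v * v for row in R for v in row)
        # Start from the residual column with the largest sum of squares.
        j0 = max(range(p), key=lambda j: sum(R[i][j] ** 2 for i in range(n)))
        t = [R[i][j0] for i in range(n)]
        load, norm = [0.0] * p, 0.0
        for _it in range(max_iter):
            # Loadings: regress each column on t, observed rows only.
            for j in range(p):
                num = sum(R[i][j] * t[i] for i in range(n) if obs[i][j])
                den = sum(t[i] ** 2 for i in range(n) if obs[i][j])
                load[j] = num / den if den > 0 else 0.0
            norm = math.sqrt(sum(v * v for v in load))
            if norm == 0:
                break  # nothing left to fit
            load = [v / norm for v in load]
            # Scores: regress each row on the loadings, observed columns only.
            t_new = []
            for i in range(n):
                num = sum(R[i][j] * load[j] for j in range(p) if obs[i][j])
                den = sum(load[j] ** 2 for j in range(p) if obs[i][j])
                t_new.append(num / den if den > 0 else 0.0)
            diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if diff < tol:
                break
        if norm == 0:
            break
        # Deflate the observed cells by the fitted component.
        for i in range(n):
            for j in range(p):
                if obs[i][j]:
                    R[i][j] -= t[i] * load[j]
        ss_after = sum(v * v for row in R for v in row)
        scores.append(t)
        loadings.append(load)
        explained.append((ss_before - ss_after) / total_ss if total_ss > 0 else 0.0)

    # Keep observed values; fill missing ones from mean + sum of components.
    filled = [[X[i][j] if obs[i][j] else
               means[j] + sum(s[i] * l[j] for s, l in zip(scores, loadings))
               for j in range(p)] for i in range(n)]
    return filled, explained

# Toy data (made up): 4 "sites" x 3 "variables" with two missing cells.
toy = [[2.0, -1.0, 0.5],
       [4.0, -2.0, None],
       [6.0, -3.0, 1.5],
       [None, -4.0, 2.0]]
toy_filled, toy_explained = nipals_impute(toy, ncomp=1, center=False)
print(toy_filled)
print(toy_explained)
```

With real data you would use the functions described on the linked mixOmics page rather than this sketch. The point is only that the estimate for a missing cell borrows from the correlation structure of the observed cells, which is why it becomes unreliable when too many values are missing.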

As you highlighted, it would be a loss to separate those two data sets.