Graphical interpretation of multilevel PCA & sPLS2

Dear community,

i have a microbial data set (16S RNA) in which each participant gets measured twice. Therefore, I want to implement a multilevel analysis, using your workflow Multilevel Liver Toxicity Case Study | mixOmics which has been very helpful.

I just have a question regarding basic understanding of the output - to me, it looks like the datapoints appear mirrored instead of clustered in a unique way (baseline timepoints = circle, follow-up timepoints = triangle). Is this to be expected since I only have 2 time points? Does this impact the interpretation of the results in any way?

I apologize if this is a rather basic question, I just want to make sure I interpret the output correctly.

Looking forward to hearing from you!

Best,
Luise

hi @l.bel,

I dont know what your colors or symbols correspond so I can’t comment :sweat_smile:

If you are able to show your code here, then I can check if the multilevel has been used properly, you are not expecting a mirror version usually unless there is something awkward in the data or the way the design = multilevel has been set up.

(ps: no rush, I can only answer next Friday morning)

Kim-Anh

Dear @kimanh.lecao - Apologies!

my data set looks like this: patients were allocated to 2 diet groups, within each of those groups, an additional treatment may be administered - thus, i have four groups (diet 1 with and without treatment, diet 2 with or without treatment). All participants were sampled for fecal microbial analysis (16S rRNA Seq) at baseline and after intervention (d0 = baseline sample, d14 = follow-up sample).

I wanted to apply multilevel PCA for clustering according to timepoints and groups.

Here is the code I used:

#transform phyloseq obj. to df for NIPALS
phylo.as.df <- data.frame(pls_phylo@sam_data)

#remove samples with too high NAs (> 25% of data entries)
remove <- colnames(phylo.as.df[colSums(is.na(phylo.as.df)) > nrow(phylo.as.df)/4])
phylo.as.df <- phylo.as.df %>% 
  dplyr::select(-all_of(remove))

#NIPALS
numeric.X <- phylo.as.df %>% 
  dplyr::select_if(is.numeric) 
numeric.X <- impute.nipals(numeric.X, ncomp = 10)

#prep df for merging with phylo df
numeric.X <- numeric.X %>% 
  data.frame() %>% 
  tibble::rownames_to_column(var = "sample_ID")

#merge with phylo df
phylo.as.df <- phylo.as.df %>%
  dplyr::select(-colnames(numeric.X[,2:ncol(numeric.X)]))  %>% 
  dplyr::left_join(numeric.X, by = join_by(sample_ID)) %>% 
  tibble::column_to_rownames(var = "sample_ID")

#reduce near zero variance
nzv <- mixOmics::nearZeroVar(phylo.as.df, uniqueCut = 15)

# Subset using an if function (in case column length is 0)
if(length(nzv$Position) > 0) phylo.as.df <- phylo.as.df[, -nzv$Position]

#define data sets
X <- phylo.as.df %>% 
  dplyr::select(Bacteroides:ncol(.))
Y <- phylo.as.df %>% 
  dplyr::select(age,body_mass_index, HOMA_IR, FPG, Matsuda,
                Stress_Scale, Severity_Index, Stress_Scale_subcat)

#remove too sparse taxa
i <- colSums(X == 0, na.rm=TRUE) < nrow(phylo.as.df) - nrow(phylo.as.df)/5
X <- X[i]

# for later analysis, reintroduce sample_ID
phylo.as.df <- phylo.as.df %>% 
  tibble::rownames_to_column(var = "sample_ID")

#make sure every id is there 2x
id.2x <- phylo.as.df %>% 
  dplyr::group_by(subject_id) %>% 
  dplyr::filter( n() == 2) %>%
  dplyr::ungroup() %>% 
  dplyr::pull(sample_ID)

X <- X[id.2x,]
Y <- Y[id.2x,]

#choose parameter for multilevel decomposition 
multi.level.param <- factor(subset(phylo.as.df, sample_ID %in% id.2x)$subject_id)

#make design matrix for sPLS model
design <- data.frame(sample = multi.level.param)

#define colors
colors <- c("#F1D6D6", "lightcoral", "#E7A3B8", "maroon", "#E2F0FE", "#BCDCFE", "#B2D0E4",  "skyblue3")

#run multilevel PCA
pca.result <- pca(X, multilevel = multi.level.param, scale = TRUE)

#plot multilevel PCA
plotIndiv(pca.result, 
          pch = subset(phylo.as.df, sample_ID %in% id.2x)$timepoint, 
          group = factor(subset(phylo.as.df, sample_ID %in% id.2x)$group), 
          col.per.group = colors,
          legend = TRUE, 
          legend.title = "Treatment", 
          legend.title.pch = "Timepoint",
          title = 'Multilevel PCA on taxa', 
          ellipse = F)

The code results in this plot:

image

To me, it looks like the data points are somewhat “mirrored”. For example: the pink triangle (patient A, timepoint 2) in the upper left corner is exactly in a mirrored position seen as the pink circle (patient A, timepoint 1) in the lower right corner.

As the plot is too crowded for ellipses, I plotted the PCA again, this time only according to diet and time:

The same mirror effect can be seen here.
Is this effect to be expected as I only have 2 timepoints? Although the diet clusters cannot be distinguished from each other at either timepoint, the cluster pattern still looks the same at timepoint 1 vs. timepoint 2 - as there is a longitudinal effect to be seen, is it plausible that the clustering pattern does not change at all? Do I have a mistake in my code?

I really appreciate your help and thank you in advance for taking the time to addressing my problem.

Best,
Luise

hi @lacorsan,

I think this is because you have a two-factor analysis (see A novel approach for biomarker selection and the integration of repeated measures experiments from two assays | BMC Bioinformatics | Full Text)

You could try this trick:

  • use the withinVariation() function to decompose your data, where design should be a data frame with 3 columns: the information ID of the individuals, time (d0 or d14) and treatment (ctl or interv). We removed the examples for this function for the two level decomposition it was not used much and was confusing to use so I hope this will work!

  • then you should obtain a new matrix, which you can include into PCA.

Avoid using any tune function for this, as it requires to perform the cross-validation stratifying by patients. We have not implemented this option.

Kim-Anh

Hi @kimanh.lecao,

thanks for getting back to me! I have tried your suggestion, but I am not sure whether I have implemented it correctly.

#make design matrix for PCA 
design <- data.frame(sample = factor(subset(phylo.as.df, JMF_sample_ID %in% id.2x)$host_subject_id), 
                     Timepoint = factor(subset(phylo.as.df, JMF_sample_ID %in% id.2x)$Timepoint), 
                     Treatment = factor(subset(phylo.as.df, JMF_sample_ID %in% id.2x)$diet_cpi))

rownames(design) <- rownames(X)

Xw <- withinVariation(X = X, design = design)

pca.result <- pca(Xw, ncomp = 2, scale = TRUE)

plotIndiv(pca.result, 
          pch = factor(subset(phylo.as.df, JMF_sample_ID %in% id.2x)$intervention), 
          group = factor(subset(phylo.as.df, JMF_sample_ID %in% id.2x)$Diet), 
          col.per.group = colors,
          legend = TRUE, 
          legend.title = "diet", 
          legend.title.pch = "intervention",
          title = 'Multilevel PCA on taxa', 
          ellipse = T, 
          style = "ggplot2")

But now the plot looks like this :sweat_smile:

This is the structure of my design matrix:

> head(design)
                sample     Timepoint          Treatment
JMF-2006-B-0003      2    before       diet1, control
JMF-2006-B-0004      2    after          diet1, control
JMF-2006-B-0005      9    before       diet2, control
JMF-2006-B-0006      9    after          diet2, control
JMF-2006-B-0007     10   before       diet1, intervention
JMF-2006-B-0008     10   after          diet1, intervention

Did I construct the design matrix correctly?

And responding to your last comment, I also want to implement a multilevel sPLS2 model - how should I choose my parameters without tuning?

Thanks a lot in advance.
Best,
Luise

hi @l.bel,

There must be a confounder between Time, Diet and Sample, but I won’t be able to work this out unless you send me your .RData (I’ll only do this Friday next, so no rush).

For the tuning for sPLS2 we have a few functions for sPLS2 but I honestly do not think they are very ‘on point’. You can try them, as explained in 4 PLS on the liver toxicity study | mixOmics vignette

If you choose the parameters arbitrarily, I’d suggest you inspect the output plots closely, e.g. plotIndiv() and plotVar(), cim(), you will get a feeling of what is too much or too little.

Kim-Anh

Hi @kimanh.lecao ,

thank you very much for your help! I will send the .RData files to you :slight_smile:

1 Like

hi @l.bel,

Thanks for sharing this with me … and I think it raises an interesting issue.
The multilevel paper explains how we calculate the within variance matrix that is then input into PCA or PLSDA. It is basically a sort of rescaling, and a special case is when you have 2 repeated measures (in your case before / after) and a perfectly balanced design.
In that case it is equivalent to calculating the deviation of a single individual, before and after these two measurements. This shows up on a plot as you mentioned, a mirror of the 2 measurements with respect to the origin.

Example below showing the individual ID

pca.result <- pca(X, multilevel = multi.level.param, scale = TRUE)
plotIndiv(pca.result, ind.names = multi.level.param)

I dont think there is anything wrong with this calculation, it is just a very special case that we did not encounter before (I observed similar behaviour on another set of data). I looked into a PLSDA multilevel, and the separation between your groups is not great, so I would not recommend going down that path. Instead, I wonder if you should calculate the after - before values for each individual. You final data would include 3 samples per group, a bit borderline but worth trying? (it is basically similar to the idea of doing a paired t-test where you consider the differences between the 2 time points, but here in a multivariate case).

PS: a lot of your taxa have the same identical values, so I think this dataset is challenging to analyse too!

Kim-Anh

Dear @kimanh.lecao,

thank you so much for your help, I really appreciate it! I’m looking forward to further working with your amazing package.

Best,
Luise

I have the same findings in 2 (treatment) x 2( pre, post). Namely, the pca plot show subjects as mirror image, which makes sense given the scores are deviation before and after. I wanted to perform multilevel PCA on this longitudinal data to reduce the dimensionality of the data and use the PC scores for down association analyses. Would the using PC1 and PC2 scores at each time points still be valid for association analyses?

hi @jlabus,

I think so (I never thought about it this way!). What people have done in the past (e.g Dynamic molecular changes during the first week of human life follow a robust developmental trajectory | Nature Communications) is instead to directly input in down stream analysis the withinVariation() matrix (which is internally called in multilevel PCA).

If you use the multilevel PCs, just remember that your interpretation of the results should be based on the deviation w.r.t before / after.

Kim-Anh