Dear @kimanh.lecao - Apologies!
my data set looks like this: patients were allocated to 2 diet groups, within each of those groups, an additional treatment may be administered - thus, i have four groups (diet 1 with and without treatment, diet 2 with or without treatment). All participants were sampled for fecal microbial analysis (16S rRNA Seq) at baseline and after intervention (d0 = baseline sample, d14 = follow-up sample).
I wanted to apply multilevel PCA for clustering according to timepoints and groups.
Here is the code I used:
#transform phyloseq obj. to df for NIPALS
phylo.as.df <- data.frame(pls_phylo@sam_data)
#remove samples with too high NAs (> 25% of data entries)
remove <- colnames(phylo.as.df[colSums(is.na(phylo.as.df)) > nrow(phylo.as.df)/4])
phylo.as.df <- phylo.as.df %>%
dplyr::select(-all_of(remove))
#NIPALS
numeric.X <- phylo.as.df %>%
dplyr::select_if(is.numeric)
numeric.X <- impute.nipals(numeric.X, ncomp = 10)
#prep df for merging with phylo df
numeric.X <- numeric.X %>%
data.frame() %>%
tibble::rownames_to_column(var = "sample_ID")
#merge with phylo df
phylo.as.df <- phylo.as.df %>%
dplyr::select(-colnames(numeric.X[,2:ncol(numeric.X)])) %>%
dplyr::left_join(numeric.X, by = join_by(sample_ID)) %>%
tibble::column_to_rownames(var = "sample_ID")
#reduce near zero variance
nzv <- mixOmics::nearZeroVar(phylo.as.df, uniqueCut = 15)
# Subset using an if function (in case column length is 0)
if(length(nzv$Position) > 0) phylo.as.df <- phylo.as.df[, -nzv$Position]
#define data sets
X <- phylo.as.df %>%
dplyr::select(Bacteroides:ncol(.))
Y <- phylo.as.df %>%
dplyr::select(age,body_mass_index, HOMA_IR, FPG, Matsuda,
Stress_Scale, Severity_Index, Stress_Scale_subcat)
#remove too sparse taxa
i <- colSums(X == 0, na.rm=TRUE) < nrow(phylo.as.df) - nrow(phylo.as.df)/5
X <- X[i]
# for later analysis, reintroduce sample_ID
phylo.as.df <- phylo.as.df %>%
tibble::rownames_to_column(var = "sample_ID")
#make sure every id is there 2x
id.2x <- phylo.as.df %>%
dplyr::group_by(subject_id) %>%
dplyr::filter( n() == 2) %>%
dplyr::ungroup() %>%
dplyr::pull(sample_ID)
X <- X[id.2x,]
Y <- Y[id.2x,]
#choose parameter for multilevel decomposition
multi.level.param <- factor(subset(phylo.as.df, sample_ID %in% id.2x)$subject_id)
#make design matrix for sPLS model
design <- data.frame(sample = multi.level.param)
#define colors
colors <- c("#F1D6D6", "lightcoral", "#E7A3B8", "maroon", "#E2F0FE", "#BCDCFE", "#B2D0E4", "skyblue3")
#run multilevel PCA
pca.result <- pca(X, multilevel = multi.level.param, scale = TRUE)
#plot multilevel PCA
plotIndiv(pca.result,
pch = subset(phylo.as.df, sample_ID %in% id.2x)$timepoint,
group = factor(subset(phylo.as.df, sample_ID %in% id.2x)$group),
col.per.group = colors,
legend = TRUE,
legend.title = "Treatment",
legend.title.pch = "Timepoint",
title = 'Multilevel PCA on taxa',
ellipse = F)
The code results in this plot:
To me, it looks like the data points are somewhat “mirrored”. For example: the pink triangle (patient A, timepoint 2) in the upper left corner is exactly in a mirrored position seen as the pink circle (patient A, timepoint 1) in the lower right corner.
As the plot is too crowded for ellipses, I plotted the PCA again, this time only according to diet and time:
The same mirror effect can be seen here.
Is this effect to be expected as I only have 2 timepoints? Although the diet clusters cannot be distinguished from each other at either timepoint, the cluster pattern still looks the same at timepoint 1 vs. timepoint 2 - as there is a longitudinal effect to be seen, is it plausible that the clustering pattern does not change at all? Do I have a mistake in my code?
I really appreciate your help and thank you in advance for taking the time to addressing my problem.
Best,
Luise