Help with data analysis and perf function

Hi there,

I have a couple of omics data (transcriptomics and lipidomics) that I would like to integrate and recently came across your R package, I’m very new to this and was hoping if you could provide some guidance/help regarding my analysis.

Experimental design
I have two different cell lines, A and B, with or without gene X knockdown. Each sample group consists of 6 replicates = total 24 samples. each omics data is obtained from separate batches of samples

I’m interested in the effects of gene X knockdown on gene expression and lipid changes and how these are correlated with each other. I went through the tutorials, and it was suggested to first analyse individual datasets prior to integration. I first analysed the transcriptomics dataset, as follows:

GeneX.data = list(gene = mrna.data, celltype = CellType, treatment = Treatment)
mrna.data is normalised lcpm values of ~13,000 genes
celltype is celltype A or B
treatment is control or geneX knockdown
X = GeneX.data$gene
Y = GeneX.data$treatment

MyResult.pca ← pca(X, scale = TRUE, ncomp = 10)
plot(MyResult.pca)
plotIndiv(MyResult.pca,
group = GeneX.data$celltype,
ind.names = FALSE,
pch = GeneX.data$treatment,
legend.title = ‘Cell Type’,
legend.title.pch = ‘Treatment’,
legend = TRUE, title = ‘PCa Genes, PCA comp 1 - 2’)


I read about the multi-level approach and decided to try and use that to cluster my samples to treatments, instead of cell lines (as seen above).

MyResult.pca.ml ← pca(X, multilevel = GeneX.data$celltype, scale = TRUE)
plotIndiv(MyResult.pca.ml,
ind.names = GeneX.data$celltype,
group = GeneX.data$treatment,
legend = TRUE,
legend.title = “Treatment”,
title = ‘Multilevel PCA on GeneX RNAseq data’)

Then, I wanted to tune and optimise parameters for the sPLS-DA model:

Initial sPLS-DA model

gene.splsda ← splsda(X, Y, ncomp = 4, multilevel = design)
plotIndiv(gene.splsda , comp = 1:2,
group = Y, ind.names = FALSE,
ellipse = TRUE,
legend = TRUE, title = ‘PLSDA with confidence ellipses’)

To select ncomp parameter,

gene.splsda ← splsda(X, Y, ncomp = 4, multilevel = design, near.zero.var = TRUE)
perf.splsda.gene ← perf(gene.splsda,
validation = “Mfold”,
folds = 3,
nrepeat = 50,
progressBar = FALSE,
auc = TRUE)
Error in MCVfold.spls(X, Y, multilevel = multilevel, validation = validation, :
Invalid number of folds.

I read about this in a few other topics on the forum and tried to solve this mainly via:
(1) reducing number of components - but didn’t work
(2) reducing number of folds to 2, which didn’t give me an error, but I’m not sure how appropriate it is and the plot looks like the one below


(3) using validation = “loo”

perf.splsda.gene ← perf(gene.splsda,
validation = “loo”,
dist = c(“max.dist”, “centroids.dist”), #as suggested in one of the forums
progressBar = FALSE,
auc = TRUE)

They all returned with classification rate of 0, the plot looks like this:

plot(perf.splsda.gene, sd = TRUE, legend.position = “horizontal”)

I also read that it could be due to multi-collinearity, which I’m not too sure how to solve this as I’m just starting out (sorry, but any help in this matter would be very much appreciated!)
So I tried to proceed with feature selection, hoping that it would show me something? And here’s the results:

list.keepX ← c(1:10, seq(20, 300, 10))
tune.splsda.GeneX<- tune.splsda(X, Y, ncomp = 3, # calculate for first 4 components
validation = ‘Mfold’,
folds = 3,
nrepeat = 50, # use repeated cross-validation
dist = ‘max.dist’, # use max.dist measure
measure = “BER”, # use balanced error rate of dist measure
test.keepX = list.keepX,
cpus = 2)
plot(tune.splsda.GeneX, col = color.jet(3))

I’m not sure how I should proceed at this point, I would really appreciate your help!!

Thank you.

Kind regards,
Mah

First things first: you’re following the right procedure and your results from the PCA are encouraging! It seems your data can be represented adequately in a lower dimension space.

Onto the Invalid number of folds issue. Using two folds (as you identified) is not valid. Reducing the number of components won’t solve the issue unfortunately. If you go to this line of the MCV.spls() function you’ll see the cases where this error is raised. In your case, folds is not NA, it is numeric and is not less than 2.

A few lines above, we can see than n is defined as the number of unique entries in multilevel[,1]. Hence, my guess is that maybe your design object which you passed in as the input to the multilevel parameter has a bit of an issue with it. If you could show me what it looks like, and maybe the head() of your X and Y dataframe I can confirm this or not for you.

Regarding the potential of it being a result of multi collinearity, I would go through and assess the pairwise correlations of each of your predictor features. A heatmap would be helpful. Using these, remove the features which are highly correlated with other features.

Another thing, don’t use the cpus argument - it’s depreciated. I’ll make note to ensure that this is communicated to users in the future. You’ll want to read up on the BPPARAM parameter.

Regarding the plots produced by tune.splsda() and perf(): without of an understanding of your data its difficult to diagnose the issue. Obviously something is going wrong if the model is always 100% accurate - irrespective of the number of features/components. I would guess something to do with the aforementioned design object may be causing this. It could also potentially be due to a high similarity between your samples (which would add credence to the idea of multi-collinearity).

What is the outcome of you running the nearZeroVar() function? A bit more exploration should yield the solution to your issue.