Hi there,
I have a couple of omics data (transcriptomics and lipidomics) that I would like to integrate and recently came across your R package, I’m very new to this and was hoping if you could provide some guidance/help regarding my analysis.
Experimental design
I have two different cell lines, A and B, with or without gene X knockdown. Each sample group consists of 6 replicates = total 24 samples. each omics data is obtained from separate batches of samples
I’m interested in the effects of gene X knockdown on gene expression and lipid changes and how these are correlated with each other. I went through the tutorials, and it was suggested to first analyse individual datasets prior to integration. I first analysed the transcriptomics dataset, as follows:
GeneX.data = list(gene = mrna.data, celltype = CellType, treatment = Treatment)
mrna.data is normalised lcpm values of ~13,000 genes
celltype is celltype A or B
treatment is control or geneX knockdown
X = GeneX.data$gene
Y = GeneX.data$treatment
MyResult.pca ← pca(X, scale = TRUE, ncomp = 10)
plot(MyResult.pca)
plotIndiv(MyResult.pca,
group = GeneX.data$celltype,
ind.names = FALSE,
pch = GeneX.data$treatment,
legend.title = ‘Cell Type’,
legend.title.pch = ‘Treatment’,
legend = TRUE, title = ‘PCa Genes, PCA comp 1 - 2’)
I read about the multi-level approach and decided to try and use that to cluster my samples to treatments, instead of cell lines (as seen above).
MyResult.pca.ml ← pca(X, multilevel = GeneX.data$celltype, scale = TRUE)
plotIndiv(MyResult.pca.ml,
ind.names = GeneX.data$celltype,
group = GeneX.data$treatment,
legend = TRUE,
legend.title = “Treatment”,
title = ‘Multilevel PCA on GeneX RNAseq data’)
Then, I wanted to tune and optimise parameters for the sPLS-DA model:
Initial sPLS-DA model
gene.splsda ← splsda(X, Y, ncomp = 4, multilevel = design)
plotIndiv(gene.splsda , comp = 1:2,
group = Y, ind.names = FALSE,
ellipse = TRUE,
legend = TRUE, title = ‘PLSDA with confidence ellipses’)
To select ncomp parameter,
gene.splsda ← splsda(X, Y, ncomp = 4, multilevel = design, near.zero.var = TRUE)
perf.splsda.gene ← perf(gene.splsda,
validation = “Mfold”,
folds = 3,
nrepeat = 50,
progressBar = FALSE,
auc = TRUE)
Error in MCVfold.spls(X, Y, multilevel = multilevel, validation = validation, :
Invalid number of folds.
I read about this in a few other topics on the forum and tried to solve this mainly via:
(1) reducing number of components - but didn’t work
(2) reducing number of folds to 2, which didn’t give me an error, but I’m not sure how appropriate it is and the plot looks like the one below
(3) using validation = “loo”
perf.splsda.gene ← perf(gene.splsda,
validation = “loo”,
dist = c(“max.dist”, “centroids.dist”), #as suggested in one of the forums
progressBar = FALSE,
auc = TRUE)
They all returned with classification rate of 0, the plot looks like this:
plot(perf.splsda.gene, sd = TRUE, legend.position = “horizontal”)
I also read that it could be due to multi-collinearity, which I’m not too sure how to solve this as I’m just starting out (sorry, but any help in this matter would be very much appreciated!)
So I tried to proceed with feature selection, hoping that it would show me something? And here’s the results:
list.keepX ← c(1:10, seq(20, 300, 10))
tune.splsda.GeneX<- tune.splsda(X, Y, ncomp = 3, # calculate for first 4 components
validation = ‘Mfold’,
folds = 3,
nrepeat = 50, # use repeated cross-validation
dist = ‘max.dist’, # use max.dist measure
measure = “BER”, # use balanced error rate of dist measure
test.keepX = list.keepX,
cpus = 2)
plot(tune.splsda.GeneX, col = color.jet(3))
I’m not sure how I should proceed at this point, I would really appreciate your help!!
Thank you.
Kind regards,
Mah