DIABLO perf & tuning

When I assess the performance of my block.splsda model with the perf function, the classification error rate with the mahalanobis distance increases when adding components 2 and 3, but decreases when adding the 4th component. However, the classification error rate with the max and centroids distances decreases consistently when adding components 2 and 3. Is this simply because the max and centroids distances give much better performance in this particular case, or is it because I am overfitting? I have 4 classes, 39 samples and 7 folds (n/k = 5.5), and nothing changes when I use leave-one-out, increase nrepeat, or use fewer or more folds.

[Screenshot: classification error rate per component for each prediction distance]
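For context, here is a minimal sketch of the assessment I am running (names are placeholders: X stands for my named list of data blocks, Y for the factor of class labels of the 39 samples, and design for the DIABLO design matrix):

```r
library(mixOmics)

# Placeholder objects: X is a named list of data blocks, Y the factor of
# class labels for the 39 samples, design the DIABLO design matrix
diablo.model <- block.splsda(X, Y, ncomp = 4, design = design)

# Repeated 7-fold cross-validation; dist = "all" reports the error rate
# for max.dist, centroids.dist and mahalanobis.dist side by side
perf.diablo <- perf(diablo.model, validation = "Mfold", folds = 7,
                    nrepeat = 50, dist = "all")

plot(perf.diablo)         # error rate per component, per distance
perf.diablo$choice.ncomp  # suggested ncomp under each distance
```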

I also have some questions/considerations on how to set the parameters for tune.block.splsda. For my large proteomics dataset, I would normally test c(5:9, seq(10, 150, 5)). However, lately I have been using keepX = c(5:150) with nrepeat = 80, because I believe this results in a more thorough tuning. Could there be any statistical/mathematical advantages or drawbacks (apart from the heavier computation) to doing it this way? Computational time is not a problem, since I am running it on a big server. The two strategies are sketched below.
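Concretely (again with placeholder names, X, Y and design as above, and the fine grid applied to every block):

```r
library(mixOmics)

# Coarse grid I would normally test, vs the fine grid I use now
coarse.grid <- c(5:9, seq(10, 150, 5))
fine.grid   <- 5:150

# tune.block.splsda expects one grid per block, as a named list
test.keepX <- lapply(X, function(block) fine.grid)

tune.res <- tune.block.splsda(X, Y, ncomp = 4, design = design,
                              test.keepX = test.keepX,
                              validation = "Mfold", folds = 7,
                              nrepeat = 80, dist = "centroids.dist")
tune.res$choice.keepX  # selected number of variables per block and component
```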

hi @christoa,

Regarding the choice of distances: have a look at the Supplemental material in https://journals.plos.org/ploscompbiol/article/related?id=10.1371/journal.pcbi.1005752 (I have screenshotted the relevant text here). I don't think it is an overfitting problem.

I would favour the centroids distance in your case. Your error bars indicate high variability, so increasing the number of repeats is appropriate: it will make the estimation more reliable and ‘reproducible’ (to an extent), and this is how it should be done.

Kim-Anh

Hi @kimanh.lecao,

Thanks very much for the help. It is sincerely appreciated.

What about the fact that I test many keepX values (keepX = c(5:150) instead of seq(5, 150, 5))? Do you have an opinion on this? I achieve a slightly better correlation between the two datasets (0.97 vs 0.95) when doing this, and so I am wondering why people don't do it this way. Is it just a matter of personal preference (benefit vs. computational time)?

hi @christoa,
it is definitely a computational issue that explains the coarse grid seq(5, 150, 5): there are so many combinations of keepX values across all data sets to test! But if you have the compute power and need a very precise answer, then a fine grid is the way to go.
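To give a rough idea of the scale (a back-of-the-envelope sketch, assuming hypothetically that two blocks are tuned jointly):

```r
coarse <- c(5:9, seq(10, 150, 5))  # 34 candidate keepX values
fine   <- 5:150                    # 146 candidate keepX values

# One keepX value is chosen per block, so the grids multiply across blocks;
# each combination is then refitted folds x nrepeat times, per component
n.blocks <- 2                      # hypothetical: two omics blocks
length(coarse)^n.blocks            # 1156 combinations per component
length(fine)^n.blocks              # 21316 combinations per component
```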

Kim-Anh

Again, thank you for your help. The reason I do it is that it improves the clustering coefficient in the subsequent functional network analysis.