I have a question about my model. I have integrated three datasets with ca. 600 variables each, and I have two categories (healthy/not healthy) in my small cohort. When tuning the model, the lowest error rate for comp1 = 0.13, for comp2 = 0.05, and for comp3 = 0.0. I am a bit suspicious about the 0.0, and when looking at the cimDiablo plot attached here, there is no 100% separation of cases and controls; but shouldn’t there be when the model is “perfect”? I must have misunderstood something important here and hope you can explain.
Thank you very much!
All the best,
Please note that, as mentioned in the documentation, DIABLO uses one of the
("max.dist", "centroids.dist", "mahalanobis.dist") distances to assess the model's performance, while
cimDiablo uses the Euclidean distance to perform hierarchical clustering.
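To illustrate why these two views can disagree, here is a minimal, hypothetical Python sketch (not mixOmics code; the toy data and naive clustering function are my own): hierarchical clustering on Euclidean distances, as in the cimDiablo heatmap, need not recover the class labels even when the classes are perfectly separable along one direction.

```python
import numpy as np

def complete_linkage_two_clusters(X):
    """Naive complete-linkage agglomerative clustering on Euclidean
    distances, merged down to two clusters (illustration only)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 2:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Toy data: the two classes differ only along axis 0 (small gap),
# while axis 1 carries large, class-unrelated variation.
rng = np.random.default_rng(1)
n = 10
X = np.vstack([
    np.column_stack([rng.normal(0.0, 0.2, n), rng.normal(0.0, 3.0, n)]),
    np.column_stack([rng.normal(1.0, 0.2, n), rng.normal(0.0, 3.0, n)]),
])
y = np.repeat([0, 1], n)

clusters = complete_linkage_two_clusters(X)
labels = np.zeros(len(X), dtype=int)
labels[clusters[1]] = 1
# Agreement between the unsupervised clusters and the true classes,
# taking the better of the two possible label assignments:
agreement = max((labels == y).mean(), (labels != y).mean())
print(f"cluster/class agreement: {agreement:.2f}")
```

A supervised model that uses only axis 0 would separate these classes perfectly, yet the Euclidean clustering is dominated by the nuisance variation on axis 1, so the heatmap dendrogram can still mix cases and controls.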
Thank you, Al. I have been trying to read up on max.dist, centroids.dist and mahalanobis.dist but cannot find the explanations in the documentation. How are these calculated?
And how would you interpret a “perfect” model based on so few samples? Is there any way to test for overfitting when we do not have any additional samples?
You can read about the distances here (Section 1.3): https://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1005752.s001
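In brief, the three prediction distances work roughly as follows (a hedged Python sketch of the ideas in that supplement; the function names and shapes here are my own, not the mixOmics internals):

```python
import numpy as np

def max_dist(pred_scores):
    """max.dist: predict the class whose predicted dummy-matrix score is largest."""
    return int(np.argmax(pred_scores))

def centroids_dist(sample, centroids):
    """centroids.dist: predict the class whose training centroid is closest
    to the sample in Euclidean distance (in the component space)."""
    d = [np.linalg.norm(sample - c) for c in centroids]
    return int(np.argmin(d))

def mahalanobis_dist(sample, centroids, cov):
    """mahalanobis.dist: like centroids.dist, but the distance is scaled by
    the covariance of the components, so correlated components are not
    double-counted."""
    inv = np.linalg.inv(cov)
    d = [float((sample - c) @ inv @ (sample - c)) for c in centroids]
    return int(np.argmin(d))

# Toy usage: two classes with centroids at (0, 0) and (3, 0).
centroids = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
print(max_dist(np.array([0.2, 0.8])))                                # class 1
print(centroids_dist(np.array([1.0, 0.0]), centroids))               # class 0
print(mahalanobis_dist(np.array([1.0, 0.0]), centroids, np.eye(2)))  # class 0
```

max.dist only looks at the predicted scores, while the two centroid-based distances take the geometry of the component space into account, which is why they can give different error rates for the same model.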
I am not sure about this one, but I don’t think there is a way to interpret the “perfect” model or test for overfitting in this case (@aljabadi please correct me if I am wrong).
In this case I would try to experiment with the design matrix depending on what I wanted to see (near 0 to prioritize discrimination between groups, or near 1 to prioritize the correlation between components). Also, assuming that LOO-CV was used, I would try to tune it with folds = 2 and nrepeat = 200 instead and see if it changes anything.
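On the overfitting point, one thing you can look at even without extra samples is the spread of the error rate across many CV repeats, rather than a single estimate. A rough Python sketch of the idea (the nearest-centroid classifier here is just a hypothetical stand-in for the fitted model):

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Hypothetical stand-in classifier; any fitted model could go here."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == k].mean(axis=0) for k in classes])
    d = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def repeated_two_fold_error(X, y, nrepeat=200, seed=0):
    """Repeat 2-fold CV nrepeat times and return the mean and standard
    deviation of the error rate; a wide spread around a 'perfect' 0.0
    suggests the single estimate is unstable on a small cohort."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(nrepeat):
        idx = rng.permutation(n)
        wrong = 0
        for test_idx in (idx[: n // 2], idx[n // 2:]):
            train_idx = np.setdiff1d(idx, test_idx)
            pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
            wrong += int((pred != y[test_idx]).sum())
        errors.append(wrong / n)
    return float(np.mean(errors)), float(np.std(errors))

# Toy cohort of 20 samples, two well-separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
y = np.repeat([0, 1], 10)
mean_err, sd_err = repeated_two_fold_error(X, y)
print(f"error rate: {mean_err:.3f} +/- {sd_err:.3f}")
```

If the standard deviation across repeats is large relative to the mean, a single 0.0 from LOO-CV is likely optimistic rather than evidence of a truly perfect model.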