I have a question about my model. I have integrated three datasets with ca. 600 variables each, and I have two categories (healthy/not healthy) in my small cohort. When tuning the model, the lowest error rate for comp1 = 0.13, for comp2 = 0.05, and for comp3 = 0.0. I am a bit suspicious about the 0.0, and when looking at the cimDiablo plot attached here, there is no 100% separation of cases and controls; but shouldn’t there be when the model is “perfect”? I must have misunderstood something important here and hope you can explain.
Thank you very much!
All the best,
Please note that, as mentioned in the documentation, DIABLO uses one of the
("max.dist", "centroids.dist", "mahalanobis.dist") distances to assess the model's performance, while
cimDiablo uses the Euclidean distance to perform hierarchical clustering.
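To illustrate why these two views can disagree, here is a minimal, hypothetical Python sketch (not mixOmics code; the toy data and naive clustering function are my own): hierarchical clustering on Euclidean distances, as in the cimDiablo heatmap, need not recover the class labels even when the classes are perfectly separable along one direction.

```python
import numpy as np

def complete_linkage_two_clusters(X):
    """Naive complete-linkage agglomerative clustering on Euclidean
    distances, merged down to two clusters (illustration only)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 2:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].max()
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

# Toy data: the two classes differ only along axis 0 (small gap),
# while axis 1 carries large, class-unrelated variation.
rng = np.random.default_rng(1)
n = 10
X = np.vstack([
    np.column_stack([rng.normal(0.0, 0.2, n), rng.normal(0.0, 3.0, n)]),
    np.column_stack([rng.normal(1.0, 0.2, n), rng.normal(0.0, 3.0, n)]),
])
y = np.repeat([0, 1], n)

clusters = complete_linkage_two_clusters(X)
labels = np.zeros(len(X), dtype=int)
labels[clusters[1]] = 1
# Agreement between the unsupervised clusters and the true classes,
# taking the better of the two possible label assignments:
agreement = max((labels == y).mean(), (labels != y).mean())
print(f"cluster/class agreement: {agreement:.2f}")
```

A supervised model that uses only axis 0 would separate these classes perfectly, yet the Euclidean clustering is dominated by the nuisance variation on axis 1, so the heatmap dendrogram can still mix cases and controls.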
Thank you, Al. I have been trying to read up on max.dist, centroids.dist and mahalanobis.dist but cannot find the explanations in the documentation. How are these calculated?
And how would you interpret a “perfect” model based on so few samples? Is there any way to test for overfitting when we do not have any additional samples?
You can read about the distances here (Section 1.3): https://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1005752.s001
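In brief, the three prediction distances work roughly as follows (a hedged Python sketch of the ideas in that supplement; the function names and shapes here are my own, not the mixOmics internals):

```python
import numpy as np

def max_dist(pred_scores):
    """max.dist: predict the class whose predicted dummy-matrix score is largest."""
    return int(np.argmax(pred_scores))

def centroids_dist(sample, centroids):
    """centroids.dist: predict the class whose training centroid is closest
    to the sample in Euclidean distance (in the component space)."""
    d = [np.linalg.norm(sample - c) for c in centroids]
    return int(np.argmin(d))

def mahalanobis_dist(sample, centroids, cov):
    """mahalanobis.dist: like centroids.dist, but the distance is scaled by
    the covariance of the components, so correlated components are not
    double-counted."""
    inv = np.linalg.inv(cov)
    d = [float((sample - c) @ inv @ (sample - c)) for c in centroids]
    return int(np.argmin(d))

# Toy usage: two classes with centroids at (0, 0) and (3, 0).
centroids = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
print(max_dist(np.array([0.2, 0.8])))                                # class 1
print(centroids_dist(np.array([1.0, 0.0]), centroids))               # class 0
print(mahalanobis_dist(np.array([1.0, 0.0]), centroids, np.eye(2)))  # class 0
```

max.dist only looks at the predicted scores, while the two centroid-based distances take the geometry of the component space into account, which is why they can give different error rates for the same model.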
I am not sure about this one, but I don’t think there is a way to interpret the “perfect” model or test for overfitting in this case (@aljabadi please correct me if I am wrong).
In this case I would try to experiment with the design matrix depending on what I wanted to see (near 0 to prioritize discrimination between groups, or near 1 to prioritize the correlation between components). Also, assuming that LOO-CV was used, I would try to tune it with folds = 2 and nrepeat = 200 instead and see if it changes anything.
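On the overfitting point, one thing you can look at even without extra samples is the spread of the error rate across many CV repeats, rather than a single estimate. A rough Python sketch of the idea (the nearest-centroid classifier here is just a hypothetical stand-in for the fitted model):

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test):
    """Hypothetical stand-in classifier; any fitted model could go here."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == k].mean(axis=0) for k in classes])
    d = ((X_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def repeated_two_fold_error(X, y, nrepeat=200, seed=0):
    """Repeat 2-fold CV nrepeat times and return the mean and standard
    deviation of the error rate; a wide spread around a 'perfect' 0.0
    suggests the single estimate is unstable on a small cohort."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(nrepeat):
        idx = rng.permutation(n)
        wrong = 0
        for test_idx in (idx[: n // 2], idx[n // 2:]):
            train_idx = np.setdiff1d(idx, test_idx)
            pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
            wrong += int((pred != y[test_idx]).sum())
        errors.append(wrong / n)
    return float(np.mean(errors)), float(np.std(errors))

# Toy cohort of 20 samples, two well-separated groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])
y = np.repeat([0, 1], 10)
mean_err, sd_err = repeated_two_fold_error(X, y)
print(f"error rate: {mean_err:.3f} +/- {sd_err:.3f}")
```

If the standard deviation across repeats is large relative to the mean, a single 0.0 from LOO-CV is likely optimistic rather than evidence of a truly perfect model.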