Estimating performance of classifiers from dataset properties
|Author:||Mgr. Michal Todt|
|Year:||2018 - summer|
|Supervisor:||Mgr. Petr Polák MSc. Ph.D.|
|Work type:||Economic Theory|
|Awards and prizes:|
|Abstract:||This thesis explores the impact of a dataset's distributional properties on classification performance. We use Gaussian copulas to generate 1000 artificial datasets and train classifiers on them: Generalized Linear Models, Distributed Random Forest, Extremely Randomized Trees, and Gradient Boosting Machines, all via the H2O.ai machine learning platform accessed from R. Classification performance on these datasets is evaluated, and empirical observations on the influence of these properties are presented. Second, we use the real Australian credit dataset and predict which classifier is likely to work best. The predicted performance for each individual method is based on penalizing the differences between the Australian dataset and the artificial datasets on which the method performed comparatively better; however, this approach failed to predict correctly.|