by Marianne Stecklina, Tuan Phan Minh, Tim Sabsch, Cornelius Styp von Rekowski, Daniel Kottke, Georg Krempl, Matthias Deliano, Myra Spiliopoulou.
In multi-class classification, datasets often contain both classes that can be easily separated from the others and classes that require many training instances to learn an expressive decision boundary. In active class selection (ACS), the main challenge is to determine the most beneficial sampling proportions for the classes with few training instances. This semi-supervised learning problem is closely related to active learning. In contrast to active learning, where class labels are acquired for instances from a pool of unlabeled data, ACS methods repeatedly choose a class for which a new instance is generated, until the generation budget is exhausted.
Our probabilistic active class selection algorithm converts the ACS problem into an active learning problem by introducing pseudo-instances: for each class, we simulate the generation of an instance multiple times and estimate its expected impact on classification performance. For this estimation, we use our Probabilistic Active Learning (PAL) model to determine the performance gain induced by each pseudo-instance. This model takes the expectation over the true posterior probability and directly optimizes accuracy. To make it applicable to ACS, we extend the approach to multiple classes. The final decision-theoretic equation takes the expectation over all pseudo-instances and over the vector of true posteriors on an evaluation set. Our experimental evaluation shows beneficial results compared to state-of-the-art methods in terms of final error and convergence of the sampling proportions.
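The gain estimation behind this idea can be sketched in code. The snippet below is a simplified, binary illustration of the probabilistic-gain principle, not the exact multi-class derivation from the talk: given local label statistics (n observed labels with empirical posterior p̂), it takes the expectation over the true posterior p ~ Beta(n·p̂+1, n·(1−p̂)+1) and over the next label of the resulting accuracy improvement. The `select_class` helper, whose name and interface are assumptions for this sketch, then picks the class whose pseudo-instances promise the largest average gain.

```python
import numpy as np

def beta_pdf(p, a, b):
    # Unnormalized Beta(a, b) density; normalized over the grid below.
    return p ** (a - 1) * (1 - p) ** (b - 1)

def expected_gain(n, p_hat, grid=np.linspace(1e-6, 1 - 1e-6, 1001)):
    """Expected accuracy gain of acquiring one more label, given local
    label statistics (n, p_hat), via grid integration over the Beta
    posterior of the true class posterior p."""
    w = beta_pdf(grid, n * p_hat + 1, n * (1 - p_hat) + 1)
    w /= w.sum()
    # Accuracy of the current majority decision, as a function of the true p.
    acc_now = np.where(p_hat >= 0.5, grid, 1 - grid)
    # Updated empirical posteriors after observing label y = 1 or y = 0.
    k = n * p_hat
    p_after_1 = (k + 1) / (n + 1)
    p_after_0 = k / (n + 1)
    acc_1 = np.where(p_after_1 >= 0.5, grid, 1 - grid)
    acc_0 = np.where(p_after_0 >= 0.5, grid, 1 - grid)
    # y = 1 occurs with probability p, y = 0 with probability 1 - p.
    gain = grid * acc_1 + (1 - grid) * acc_0 - acc_now
    return float((w * gain).sum())

def select_class(stats_per_class):
    """ACS step (sketch): stats_per_class[c] is a list of (n, p_hat)
    label statistics at the pseudo-instances of class c; return the
    class with the largest average expected gain."""
    scores = [np.mean([expected_gain(n, p) for n, p in stats])
              for stats in stats_per_class]
    return int(np.argmax(scores))
```

As expected, regions with few labels and high class overlap receive the largest gain, so the selection favors classes whose boundary is still poorly learned, while a class whose pseudo-instances lie in confidently labeled regions (p̂ near 0 or 1) yields a gain near zero.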
Presented at the Tagung der Deutschen Arbeitsgemeinschaft Statistik (DAGSTAT), 2016, Göttingen.
General Information about PAL: http://kmd.cs.ovgu.de/res/pal/