Knowledge Management and Discovery Lab
Otto-von-Guericke-University Magdeburg, Germany
daniel.kottke@ovgu.de
www.daniel.kottke.eu/talks/2016_ECAI
\(\Rightarrow\) Active Learning
Selects label candidates at random.
Each label candidate is equally likely to be selected.
Prefers label candidates near the decision boundary.
Candidates in the darker green area (near the decision boundary) are preferred.
Simulates acquiring each candidate's label for each possible class and estimates the expected error reduction on the data.
Candidates in the darker green area (determined by a computationally expensive calculation) are preferred.
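For illustration (a sketch of my own, not code from the poster), the first two strategies can be contrasted in a few lines; the candidate posteriors below are hypothetical:

```python
import numpy as np

def random_sampling(n_candidates, rng):
    """Pick a label candidate uniformly at random (baseline strategy)."""
    return int(rng.integers(n_candidates))

def uncertainty_sampling(posteriors):
    """Pick the candidate whose prediction is least certain,
    i.e. whose maximal class posterior is smallest
    (closest to the decision boundary)."""
    posteriors = np.asarray(posteriors)
    return int(np.argmin(posteriors.max(axis=1)))

# Hypothetical posterior estimates for four candidates (binary task).
post = [[0.95, 0.05], [0.60, 0.40], [0.51, 0.49], [0.80, 0.20]]
print(uncertainty_sampling(post))  # -> 2, the candidate nearest the boundary
```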
Basic idea and main tools
[1] G. Krempl, D. Kottke, V. Lemaire: "Optimised probabilistic active learning". Machine Learning, 2014.
Given the true posterior, the observed labels are Multinomially distributed.
The normalized likelihood function is:
\[P(\vec{p} \mid \vec{k}) = \frac{ \Gamma\left(\sum_i (k_i + 1)\right) }{ \prod_i \Gamma(k_i+1) } \cdot \prod_i p_i^{k_i}\]
Example: In the binary case, the normalized likelihood is a Beta-distribution (see figure).
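As a numerical sketch of this binary case (my own illustration): the normalized likelihood is the Beta(\(k_1+1, k_2+1\)) density, and we can check that it integrates to one.

```python
import math

def normalized_likelihood(p, k):
    """P(p | k) for a binary label count vector k = (k1, k2):
    the Beta(k1 + 1, k2 + 1) density."""
    k1, k2 = k
    norm = math.gamma(k1 + k2 + 2) / (math.gamma(k1 + 1) * math.gamma(k2 + 1))
    return norm * p**k1 * (1 - p)**k2

# Numerical check that the density integrates to 1 over [0, 1].
k = (3, 1)
n = 10_000
integral = sum(normalized_likelihood(i / n, k) for i in range(n + 1)) / n
print(round(integral, 3))  # approximately 1.0
```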
We describe all possible combinations of labels that could appear in that region, given that \(m\) labels are acquired.
Three-Class Example (\(m=2\)): \[\vec{l} \in \{(2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1)\} \]
This labeling vector is Multinomially distributed:\[P(\vec{l} \mid \vec{p}) = \textrm{Multinomial}_{\vec{p}}(\vec{l}) = \frac{ \Gamma\big((\sum_i l_i) + 1\big) }{ \prod_i \Gamma(l_i+1) } \cdot \prod_i p_i^{l_i}\]
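The six label vectors from the three-class example and their probabilities can be enumerated directly; a minimal sketch (helper names are my own):

```python
import math
from itertools import product

def label_vectors(n_classes, m):
    """All label count vectors l with sum(l) == m."""
    return [l for l in product(range(m + 1), repeat=n_classes) if sum(l) == m]

def multinomial_pmf(l, p):
    """P(l | p) = Gamma(sum(l)+1) / prod(Gamma(l_i+1)) * prod(p_i^l_i)."""
    coef = math.gamma(sum(l) + 1)
    for li in l:
        coef /= math.gamma(li + 1)
    return coef * math.prod(pi**li for pi, li in zip(p, l))

vecs = label_vectors(3, 2)
print(len(vecs))  # -> 6, matching the three-class example above
p = (0.5, 0.3, 0.2)
print(sum(multinomial_pmf(l, p) for l in vecs))  # probabilities sum to 1.0
```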
The main contribution is to have a fast closed-form solution for the calculation of a candidate's usefulness.
\[\begin{aligned} \operatorname{expPerf}(\vec{k}, m) &= \mathbb{E}_{\vec{p}}\big[ \mathbb{E}_{\vec{l}}\big[ \operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p}) \big]\big] \\ &= \sum_{\vec{l}} \left(\prod_{j = \sum(k_i + 1)}^{\big(\sum(k_i + l_i + d_i + 1) \big) - 1} \frac{1}{j} \right) \cdot \prod_i \left( \prod_{j=k_i+1}^{k_i + l_i + d_i} j \right) \cdot \frac{ \Gamma\big((\sum_i l_i) + 1\big) }{ \prod_i \Gamma(l_i+1) } \end{aligned}\]
(details in the paper)
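As a plausibility sketch of the double expectation (my own construction, not the paper's closed form; it assumes \(\operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p})\) is the true posterior probability of the class predicted from the updated counts), the quantity can be approximated by Monte Carlo:

```python
import numpy as np

def exp_perf_mc(k, m, n_samples=20_000, seed=0):
    """Monte Carlo estimate of E_p[ E_l[ acc(k + l | p) ] ]:
    draw the true posterior p from Dirichlet(k + 1), draw m future
    labels l ~ Multinomial(m, p), then score the accuracy of
    predicting the majority class of the updated counts k + l."""
    rng = np.random.default_rng(seed)
    k = np.asarray(k, dtype=float)
    ps = rng.dirichlet(k + 1, size=n_samples)            # p | k
    ls = np.array([rng.multinomial(m, p) for p in ps])   # l | p
    pred = np.argmax(k + ls, axis=1)                     # predicted class
    return float(ps[np.arange(n_samples), pred].mean())

est = exp_perf_mc(k=(3, 1), m=2)
print(round(est, 2))  # expected accuracy after acquiring 2 more labels
```

The closed form above computes this sum exactly and cheaply, which is what makes the per-candidate usefulness fast to evaluate.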
McPAL combines the well-known advantages of these strategies by using local statistics.
McPAL outperforms competing strategies in our experimental evaluation.
Code is available: https://kmd.cs.ovgu.de/res/mcpal/
(ready to use in practice; collaborations welcome)
Slides, Paper, Bibtex:
www.daniel.kottke.eu/talks/2016_ECAI
Supplemental material:
kmd.cs.ovgu.de/res/mcpal/
Workshop at iKNOW (Oct 18, 2016):
Active Learning: Applications, Foundations and Emerging Trends
http://i-know.tugraz.at/
Multi-Class Probabilistic Active Learning
Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, Myra Spiliopoulou
European Conference on Artificial Intelligence (ECAI)
The Hague, Netherlands, 2016.