Probabilistic Active Learning in Datastreams
Daniel Kottke, Georg Krempl, Myra Spiliopoulou
Knowledge Management and Discovery Lab Otto-von-Guericke-University Magdeburg, Germany
daniel.kottke@ovgu.de
www.daniel.kottke.eu/talks/2015_IDA
Pool vs. Stream Learning
The spatial component is extended with temporal information
Classification model might change (Drift)
Fast, endless instance generation
Efficient (on-line) algorithms required
Applications:
Data from Twitter, sensors, bank transactions
Acquiring the next label?
Spatial component: where in feature space to buy (which feature vectors)?
Temporal component: when to buy?
Budgets in streams
Pools: absolute number (e.g. stop after 40 labels)
Streams: relative definition necessary (e.g. buy 10%)
Exact budget fit would require sequential acquisitions
Better: spend the budget in expectation and control the variance through a tolerance window (see the sketch below)
Here: the budget is constant over time, which yields constant labeling capacities, permanently updated models, and a fair evaluation
Streams usually do not end \(\rightarrow\) a relative budget definition is necessary
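For intuition, a tiny bookkeeping sketch in Python (all numbers and variable names are hypothetical) of how a relative budget with a tolerance window can be tracked:

```python
# Hypothetical bookkeeping for a relative budget with a tolerance window.
b, w_tol = 0.10, 50            # target budget: 10 %, tolerance window: 50 labels
seen, acquired = 1000, 97      # instances processed, labels bought so far

acq_left = b * seen - acquired     # acquisitions still "owed" to the target
print(acq_left)                    # 3.0 -> slightly behind the target budget
assert abs(acq_left) <= w_tol      # variance stays within the tolerance window
```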
Divide and Conquer
Measure spatial usefulness from Probabilistic Active Learning
Choose the most useful instances based on this value
Spatial Selection (Recap): Probabilistic Active Learning
Evaluates each labeling candidate in its neighborhood using its label statistics:
Posterior probability (\(\hat{p}\))
Number of labels (\(n\))
Image: "Probabilistic Active Learning: Towards Combining Versatility, Optimality and Efficiency" by G. Krempl, D. Kottke, M. Spiliopoulou, Proc. of the 17th Int. Conf. on Discovery Science, Bled, 2014. Springer.
Spatial Selection (Recap): Probabilistic Active Learning
Probabilistic description of expected performance gain using the expectation over:
the possible label outcomes (\(y\)) of the candidate
the true posterior probability (\(p\))
\[
\mathrm{pgain}\big((n, \hat{p})\big) = \mathrm{E}_{p}\Big[ \mathrm{E}_{y} \big[ \mathrm{gain}_{p}((n, \hat{p}), y) \big] \Big]
\\
= \int_0^1 \mathrm{Beta}_{n\hat{p}+1,\, n(1-\hat{p})+1}(p) \cdot \sum_{y \in \{0,1\}} \mathrm{Ber}_{p}(y) \cdot \mathrm{gain}_{p}\big((n, \hat{p}), y\big) \, \mathrm{d}p
\]
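As a rough illustration, the following Python sketch evaluates this integral numerically. The gain term is an assumption here: following the PAL paper's idea, it is taken as the change in expected accuracy after updating the label statistics with \(y\); the slides themselves do not spell out \(\mathrm{gain}_p\).

```python
# Numerical sketch of pgain; the accuracy-based gain term is an assumption.
from scipy import integrate
from scipy.stats import beta

def acc(p, p_hat):
    """Expected accuracy of predicting the majority class of p_hat,
    given the true posterior p of the positive class."""
    return p if p_hat >= 0.5 else 1.0 - p

def gain(n, p_hat, y, p):
    """Assumed gain: accuracy change after adding label y to the statistics."""
    p_hat_new = (n * p_hat + y) / (n + 1)
    return acc(p, p_hat_new) - acc(p, p_hat)

def pgain(n, p_hat):
    """E_p[E_y[gain]] with p ~ Beta(n*p_hat + 1, n*(1 - p_hat) + 1), y ~ Ber(p)."""
    def integrand(p):
        prior = beta.pdf(p, n * p_hat + 1, n * (1 - p_hat) + 1)
        exp_y = p * gain(n, p_hat, 1, p) + (1 - p) * gain(n, p_hat, 0, p)
        return prior * exp_y
    value, _ = integrate.quad(integrand, 0.0, 1.0)
    return value

print(pgain(n=2, p_hat=0.5))    # few labels, uncertain region -> higher pgain
print(pgain(n=40, p_hat=0.9))   # many labels, confident region -> (near) zero
```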
Temporal selection: Problem Specification
Input: Single-value stream of spatial usefulness values
Arbitrary range
Arbitrary distribution
Changing distribution (dependent on previous decisions)
Task: Select the best b % of this value stream immediately (online)
Subtask: The difference between the current and the target budget must not exceed a given tolerance window
Temporal selection: Main Idea
Store the most recent usefulness values in a list
When a new value appears:
Determine its relative rank in that list
If it belongs to the better b % of that list, acquire the corresponding label
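A minimal sketch of this windowed rank test (the Incremental Quantile Filter named on the next slide); class and parameter names are illustrative, and the paper's implementation may differ in detail:

```python
from collections import deque

class IQF:
    """Sketch: keep the w most recent usefulness values and acquire a label
    if the newest value ranks among the best b fraction of that window."""
    def __init__(self, w=100, budget=0.1):
        self.window = deque(maxlen=w)   # most recent usefulness values
        self.budget = budget            # target acquisition fraction b

    def acquire(self, value):
        self.window.append(value)
        # relative rank: fraction of stored values the new value exceeds
        rank = sum(v < value for v in self.window) / len(self.window)
        return rank >= 1.0 - self.budget
```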
Temporal selection: Incremental Quantile Filter (IQF) - Properties
Selects the best b % of a single value stream in expectation
Only works for value streams coming from a single distribution
But: spatial usefulness probably decreases as classification performance increases over time \(\rightarrow\) fewer labels would be acquired
Idea: a threshold correction that ensures the target budget is met w.r.t. a tolerance window
Temporal selection: Balancing
Tolerance window (\(w_\textrm{tol}\)):
maximal difference between the current and the target budget
If there are label acquisitions left (\(acq_\textrm{left} > 0\))
\(\rightarrow\) decrease the threshold \(\theta\) (and vice versa)
\[
\theta_\textrm{bal} = \theta - f \cdot acq_\textrm{left}
\]
How to define \(f\)?
Temporal selection: Balancing
Defining the normalization factor \(f\):
Scales with the range of the data (\(\Delta\))
The larger the tolerance window (\(w_\textrm{tol}\)), the smaller the correction
\[
\theta_\textrm{bal} = \theta - \Delta \cdot \frac{acq_\textrm{left}}{w_\textrm{tol}}
\]
\(\theta_\textrm{bal}\) - Balanced threshold
\(\theta\) - IQF acquisition threshold
\(\Delta\) - Data range of IQF window
\(w_\textrm{tol}\) - Tolerance window size
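A sketch of this correction in Python; computing \(\theta\) as the empirical \((1-b)\)-quantile of the IQF window and the bookkeeping of \(acq_\textrm{left}\) are illustrative assumptions around the slide's formula:

```python
import numpy as np

def balanced_threshold(window, budget, acq_left, w_tol):
    """theta_bal = theta - Delta * acq_left / w_tol (formula from the slide)."""
    vals = np.sort(np.asarray(window, dtype=float))
    theta = np.quantile(vals, 1.0 - budget)   # IQF acquisition threshold
    delta = vals[-1] - vals[0]                # data range of the IQF window
    return theta - delta * acq_left / w_tol

# Example: being 3 acquisitions behind the target lowers the threshold
window = [0.05, 0.10, 0.20, 0.40, 0.45]
print(balanced_threshold(window, budget=0.2, acq_left=3, w_tol=50))
```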
Experimental Settings
Datasets: 5 synthetic, 2 real
Algorithms:
PAL + BIQF
Random, Variable Uncertainty¹ (VarUncer), Split¹
Split + BIQF, VarUncer + BIQF
100 random test and train streams
Different budgets \(b \in \{0.02, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7\}\)
BIQF parameter: \(w = 100, w_{\mathrm{tol}} = 50\)
¹ I. Zliobaite, A. Bifet, B. Pfahringer, G. Holmes: Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 2014.
Evaluation of Active Learning
Conclusion
An active learning algorithm for data streams that
is able to reach a defined budget within a given tolerance window
separates the spatial and temporal usefulness
uses Probabilistic Active Learning for spatial usefulness
uses the new Balanced Incremental Quantile Filter (BIQF) to determine the temporal usefulness of instances
is superior to state-of-the-art methods (especially at the beginning of the stream)
Thank you for your attention!
Slides, Paper, Bibtex:
www.daniel.kottke.eu/talks/2015_IDA
Supplemental material:
kmd.cs.ovgu.de/res/pal/
Probabilistic Active Learning in Datastreams
Daniel Kottke, Georg Krempl, Myra Spiliopoulou
Fourteenth International Symposium on Intelligent Data Analysis (IDA)
Saint-Etienne, France, 2015. Springer.