
Imbalanced-learn

Leverage knowledge from under-represented classes in machine learning

Guillaume Lemaitre
EuroSciPy 2017
1 / 18

The curse of imbalanced datasets

2 / 18

The curse of imbalanced datasets

Applications

  • Bioinformatics
  • Medical imaging: diseased versus healthy subjects
  • Social sciences: prediction of academic dropout
  • Web services: Service Level Agreement violation prediction
  • Security services: fraud detection
3 / 18

What does scikit-learn offer?

4 / 18

Transformers in scikit-learn

Reduce the number of features in X: extraction and selection

X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]

X_red = [[ 0.33, 0.63],
         [ 0.29, 0.33],
         [ 0.28, 0.79],
         [ 0.11, 0.57],
         [ 0.43, 0.48]]

Transform X by scaling, normalizing, etc.

X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]

X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
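
A minimal sketch of these two use cases; PCA and StandardScaler are only illustrative choices here, since the slide does not name the transformers that produced the arrays above:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = [[0.14, 0.05, 0.33, 0.63],
     [0.48, 0.10, 0.29, 0.33],
     [0.63, 0.25, 0.28, 0.79],
     [0.35, 0.95, 0.11, 0.57],
     [0.13, 0.33, 0.43, 0.48]]

# Feature extraction: 4 columns -> 2 columns
X_red = PCA(n_components=2).fit_transform(X)

# Scaling: zero mean, unit variance per column
X_sc = StandardScaler().fit_transform(X)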
5 / 18

Transformers in scikit-learn

Transform y by encoding or binarizing the labels

y = ['A', 'B', 'B', 'A', 'C', 'C']
y_enc = [0, 1, 1, 0, 2, 2]
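
A minimal sketch of this encoding with LabelEncoder:

from sklearn.preprocessing import LabelEncoder

y = ['A', 'B', 'B', 'A', 'C', 'C']
y_enc = LabelEncoder().fit_transform(y)  # array([0, 1, 1, 0, 2, 2])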

Limitation

  • Impossible to reduce the number of samples
  • Only sample_weight and class_weight as workarounds (see the sketch below)
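
A minimal sketch of the class_weight workaround, assuming X and y are already defined:

from sklearn.linear_model import LogisticRegression

# Re-weight classes inversely to their frequency instead of resampling;
# per-sample control is also possible via the sample_weight argument of fit().
clf = LogisticRegression(class_weight='balanced').fit(X, y)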

Need

  • Implementation of specific samplers
6 / 18

What is scikit-learn-contrib?


Requirements

  • Compatible with scikit-learn estimators -> check_estimator()
  • API and User Guide documentation
  • PEP8, unit tests and CI

Advantages

  • Implementation of bleeding edge algorithms
  • Experimental feature and API testing
  • Speed and performance benchmarks
7 / 18

imbalanced-learn API

fit, sample, fit_sample

datasets

pipeline

metrics

classifier

8 / 18

Samplers

fit, sample, fit_sample

>>> from collections import Counter
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(ratio='auto')
>>> X_resampled, y_resampled = smote.fit_sample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})

ratio parameter --> new in 0.3.0

  • str: 'auto', 'all', 'minority', 'majority', 'not_minority'
  • dict: key -> class; value -> # samples
  • func: a callable returning a dict of the form above
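
For instance, a sketch of the dict form; the target counts below are only illustrative (here they reproduce the 'auto' behaviour of the example above):

# Request an explicit number of samples per class after resampling
# (reuses the SMOTE import and the X, y from the example above).
smote = SMOTE(ratio={-1: 5809, 1: 5809})
X_resampled, y_resampled = smote.fit_sample(X, y)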
9 / 18

Over-sampling

10 / 18

Under-sampling

  • Prototype generation
  • Prototype selection
10 / 18

Under-sampling

  • Cleaning under-sampling
10 / 18

Combined over-sampling and under-sampling methods

10 / 18

Over-sampling

  • Random over-sampling / SMOTE / ADASYN

Controlled under-sampling

  • Random under-sampling / NearMiss

Cleaning under-sampling

  • Condensed nearest-neighbors and variants
  • Edited nearest neighbors and variants
  • Instance hardness threshold

Combine over- and under-sampling

  • SMOTE + ENN / SMOTE + Tomek
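
A minimal sketch of two of the methods listed above (all samplers share the same fit_sample API); RandomUnderSampler and SMOTEENN are used here as illustrations:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

# Controlled under-sampling: randomly drop majority-class samples.
X_rus, y_rus = RandomUnderSampler().fit_sample(X, y)

# Combined method: SMOTE over-sampling followed by ENN cleaning.
X_res, y_res = SMOTEENN().fit_sample(X, y)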
11 / 18

datasets module

fetch_datasets

>>> from imblearn.datasets import fetch_datasets
>>> dataset = fetch_datasets(filter_data=['satimage'])
>>> dataset['satimage']
{'DESCR': 'satimage',
 'data': array([[  92.,  115.,  120., ...,  107.,  113.,   87.],
        [  84.,  102.,  106., ...,   99.,  104.,   79.],
        [  84.,  102.,  102., ...,   99.,  104.,   79.],
        ...,
        [  56.,   68.,   91., ...,   83.,   92.,   74.],
        [  56.,   68.,   87., ...,   83.,   92.,   70.],
        [  60.,   71.,   91., ...,   79.,  108.,   92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}

make_imbalance

>>> from collections import Counter
>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_imb, y_imb = make_imbalance(X, y, ratio={0: 50, 1: 20, 2: 25})
>>> Counter(y_imb)
Counter({0: 50, 1: 20, 2: 25})
12 / 18

Pipeline and make_pipeline

>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

scikit-learn

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)

imbalanced-learn

>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
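
Because the sampler lives inside the imbalanced-learn pipeline, it is applied only when fitting, never at predict or score time; a minimal sketch of cross-validating the pipeline built above:

from sklearn.model_selection import cross_val_score

# Resampling happens on the training folds only; the test folds keep
# their original class distribution.
scores = cross_val_score(pipeline_smote, X_train, y_train, cv=5)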
13 / 18

metrics module

>>> print('Score pipeline:', pipeline.score(X_test, y_test))
Score pipeline: 0.9808
>>> print('Score pipeline:', pipeline_smote.score(X_test, y_test))
Score pipeline: 0.9752
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
                   pre   rec   spe    f1   geo   iba   sup

          0       0.83  0.94  1.00  0.88  0.91  0.82    16
          1       0.67  0.91  0.98  0.77  0.82  0.65    54
          2       1.00  0.98  0.94  0.99  0.85  0.74  1180

avg / total       0.98  0.98  0.95  0.98  0.85  0.74  1250
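
The geo column is the geometric mean of per-class recall, which the metrics module also exposes as a standalone score; a minimal sketch:

from imblearn.metrics import geometric_mean_score

print('Geometric mean:', geometric_mean_score(y_test, y_pred))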
14 / 18

classifier

>>> from sklearn.ensemble import BaggingClassifier
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bagging = BaggingClassifier(random_state=0)
>>> balanced_bagging = BalancedBaggingClassifier(random_state=0)
>>> bagging.fit(X_train, y_train)
>>> balanced_bagging.fit(X_train, y_train)
>>> y_pred_bagging = bagging.predict(X_test)
>>> y_pred_balanced_bagging = balanced_bagging.predict(X_test)
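
A possible follow-up, comparing the two ensembles with the imbalance-aware report from the previous slide rather than plain accuracy:

from imblearn.metrics import classification_report_imbalanced

print(classification_report_imbalanced(y_test, y_pred_bagging))
print(classification_report_imbalanced(y_test, y_pred_balanced_bagging))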
15 / 18

What else in 0.3.0?

  • Support for multi-class problems in all algorithms
  • Support for sparse matrices
  • New user guide
  • Migration from nose to pytest

How to install it ...

Via pip

$ pip install imbalanced-learn

Via conda channel

$ conda install -c glemaitre imbalanced-learn
16 / 18

Future work

17 / 18

Perspectives

  • Support categorical features
  • Quantitative benchmark
  • Fast Nearest-Neighbors algorithm
  • Create a keras submodule specifically for sampling in deep learning

Want to discuss? Come to the scikit-learn sprint on Friday ...

18 / 18
