class: center, middle

## Introduction to imbalanced-learn

#### Leverage knowledge from under-represented classes in machine learning

##### **Guillaume Lemaitre** and Christos Aridas

###### PyParis 2017

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---

### The curse of imbalanced data sets
![:scale 70%](img/imbalanced.png)
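---
count: false

### The curse of imbalanced data sets

A minimal sketch (not from the original deck) of why plain accuracy misleads here: always predicting the majority class already looks excellent.

```python
# With a 99:1 imbalance, the majority-vote baseline scores 99% accuracy
# while never detecting a single minority sample.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)
y = np.r_[np.zeros(990), np.ones(10)]    # 990 majority vs 10 minority samples

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print(baseline.score(X, y))              # 0.99 -- high accuracy, useless model
```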
---

### The curse of imbalanced data sets

Applications

* Bioinformatics
* Medical imaging: diseased *versus* healthy
* Social sciences: prediction of academic dropout
* Web services: Service Level Agreement violation prediction
* Security services: fraud detection

---
class: center, middle

# `scikit-learn` offers

---

### `TransformerMixin` in `scikit-learn`

Reduce the number of features in `X`: extraction and selection

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_red = [[ 0.33, 0.63],
         [ 0.29, 0.33],
         [ 0.28, 0.79],
         [ 0.11, 0.57],
         [ 0.43, 0.48]]
```
]

Transform `X` by scaling, normalizing, etc.

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
```
]

---

### `TransformerMixin` in `scikit-learn`

Transform `y` by encoding or binarizing

.pull-left[
```python
y = ['A', 'B', 'B', 'A', 'C', 'C']
```
]
.pull-right[
```python
y_enc = [0, 1, 1, 0, 2, 2]
```
]

Limitation

* Impossible to reduce the number of samples
* `sample_weight` and `class_weight` are only partial workarounds

Need

* **Implementation of a `SamplerMixin`**

---

### The `scikit-learn` dilemma

Explosion of machine learning algorithms, from pre-processing to classifiers/regressors

--
count: false
![:scale 40%](img/swiss_army.jpg)
--
count: false

### Entry barrier in `scikit-learn`

* 3 years since publication
* 200+ citations
* **clear-cut improvement**

---

### Quality in `scikit-learn`

For each PR, a large amount of time is spent on:

* consistency of the API
* ensuring both good speed and good performance

--
count: false
![:scale 60%](img/pr_merge.png)
---

### Emergence of `scikit-learn-contrib`
### Requirements

* Compatibility with `scikit-learn` estimators -> `check_estimator()`
* API and User Guide documentation
* PEP8, unit tests and CI
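A sketch of how that compatibility is checked (the exact call varies across `scikit-learn` versions: older releases take the class, newer ones an instance):

```python
# Run scikit-learn's API-compliance test suite on an estimator;
# it raises an AssertionError as soon as one check fails.
from sklearn.utils.estimator_checks import check_estimator
from sklearn.linear_model import LogisticRegression

check_estimator(LogisticRegression())
```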
---
count: false

### Emergence of `scikit-learn-contrib`

### Advantages

#### Programming POV

* implementation of bleeding-edge algorithms
* experimental feature and API testing
* speed and performance benchmarks

#### Developer POV

* empower contributors

---
class: center, middle

# `imbalanced-learn` API

`datasets`

`fit`, `sample`, `fit_sample`

`Pipeline`

`metrics`

---

### `datasets` module

```python
>>> from imblearn.datasets import fetch_datasets
>>> fetch_datasets
>>> fetch_datasets?

+--+--------------+-------------------------------+-------+---------+-----+
|ID|Name          | Repository & Target           | Ratio | #S      | #F  |
+==+==============+===============================+=======+=========+=====+
|1 |ecoli         | UCI, target: imU              | 8.6:1 | 336     | 7   |
+--+--------------+-------------------------------+-------+---------+-----+
|2 |optical_digits| UCI, target: 8                | 9.1:1 | 5,620   | 64  |
+--+--------------+-------------------------------+-------+---------+-----+
|3 |satimage      | UCI, target: 4                | 9.3:1 | 6,435   | 36  |
+--+--------------+-------------------------------+-------+---------+-----+
|4 |pen_digits    | UCI, target: 5                | 9.4:1 | 10,992  | 16  |
+--+--------------+-------------------------------+-------+---------+-----+
|5 |abalone       | UCI, target: 7                | 9.7:1 | 4,177   | 10  |
+--+--------------+-------------------------------+-------+---------+-----+
|6 |sick_euthyroid| UCI, target: sick euthyroid   | 9.8:1 | 3,163   | 42  |
```

---
count: false

### `datasets` module

```python
>>> dataset = fetch_datasets()['satimage']
>>> dataset
{'DESCR': 'satimage',
 'data': array([[  92.,  115.,  120., ...,  107.,  113.,   87.],
                [  84.,  102.,  106., ...,   99.,  104.,   79.],
                [  84.,  102.,  102., ...,   99.,  104.,   79.],
                ...,
                [  56.,   68.,   91., ...,   83.,   92.,   74.],
                [  56.,   68.,   87., ...,   83.,   92.,   70.],
                [  60.,   71.,   91., ...,   79.,  108.,   92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}
```

```python
>>> from collections import Counter
>>> X = dataset.data
>>> y = dataset.target
>>> Counter(y)
Counter({-1: 5809, 1: 626})
```

---

### `fit`, `sample`, `fit_sample`

```python
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(ratio='auto')
>>> X_resampled, y_resampled = smote.fit_sample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})
```

### `ratio` parameter

* `str`: `'auto'`, `'all'`, `'minority'`, `'majority'`, `'not_minority'`
* `dict`: key -> class; value -> # samples (see the sketch after the toy dataset below)
* `func`: a function returning such a `dict`

---

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.04, 0.95],
                               class_sep=0.8, random_state=0)
```
![:scale 60%](img/toy.png)
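---
count: false

### `fit`, `sample`, `fit_sample`

A sketch of the `dict` form of `ratio`, applied to the toy dataset above (the target counts here are hypothetical):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

print(Counter(y))                        # heavily skewed toward class 2
# key -> class; value -> desired number of samples after resampling
smote = SMOTE(ratio={0: 1000, 1: 1000})
X_resampled, y_resampled = smote.fit_sample(X, y)
print(Counter(y_resampled))              # classes 0 and 1 raised to 1000 each
```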
---
count: false

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```

#### `scikit-learn`

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)
```

#### `imbalanced-learn`

```python
>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
```

---

### `metrics`

```python
>>> print('Score pipeline:', pipeline.score(X_test, y_test))
Score pipeline: 0.9808
>>> print('Score SMOTE pipeline:', pipeline_smote.score(X_test, y_test))
Score SMOTE pipeline: 0.9752
```

```python
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.94      1.00      0.88      0.91      0.82        16
          1       0.67      0.91      0.98      0.77      0.82      0.65        54
          2       1.00      0.98      0.94      0.99      0.85      0.74      1180

avg / total       0.98      0.98      0.95      0.98      0.85      0.74      1250
```

.pull-left[
![:scale 90%](img/cm1.png)
]
.pull-right[
![:scale 90%](img/cm2.png)
]

---
class: center, middle

# Algorithm specificities

---

### Over-sampling algorithms
![:scale 75%](img/over_sampling.png)
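A sketch (reusing `X`, `y` from the toy dataset) of the main over-samplers; all share the same `fit_sample` interface:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE

# random repetition, SMOTE interpolation, and ADASYN density-driven generation
for sampler in (RandomOverSampler(), SMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(sampler.__class__.__name__, Counter(y_res))
```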
---

### Under-sampling algorithms
![:scale 55%](img/under_sampling.png)
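A sketch (same toy `X`, `y`) of some of the available under-samplers; here `fit_sample` shrinks the majority class instead:

```python
from collections import Counter

from imblearn.under_sampling import (ClusterCentroids, NearMiss,
                                     RandomUnderSampler)

# random selection, distance-based selection, and K-means centroid synthesis
for sampler in (RandomUnderSampler(), NearMiss(), ClusterCentroids()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(sampler.__class__.__name__, Counter(y_res))
```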
---

### Cleaning algorithms
![:scale 75%](img/cleaning.png)
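A sketch (same toy `X`, `y`) of two cleaning methods; unlike the samplers above, they remove noisy or borderline samples rather than enforce a given ratio:

```python
from collections import Counter

from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks

# drop Tomek links / samples misclassified by their nearest neighbours
for cleaner in (TomekLinks(), EditedNearestNeighbours()):
    X_res, y_res = cleaner.fit_sample(X, y)
    print(cleaner.__class__.__name__, Counter(y_res))
```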
---
class: center, middle

# Applications

---

### Melanoma classification
![:scale 100%](img/melanoma_framework.png)
![:scale 100%](img/melanoma_results.png)
---

### Prostate cancer detection
![:scale 60%](img/prostate_framework.png)
![:scale 50%](img/prostate_results.png)
---
class: center, middle

# Future work

---

### Perspectives

* Support sparse matrices
* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm

---
count: false

### Perspectives

* Support sparse matrices
* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm

### We are hiring (unpaid ...)

* We want an [MRG+2] policy, but there are only 2 core contributors