class: center, middle

## Imbalanced-learn

#### Leverage knowledge from under-represented classes in machine learning

##### Guillaume Lemaitre

###### EuroSciPy 2017

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---

### The curse of imbalanced datasets
![:scale 85%](img/sphx_glr_plot_comparison_over_sampling_001.png)
---

### The curse of imbalanced datasets

Applications

* Bioinformatics
* Medical imaging: diseased *versus* healthy
* Social sciences: prediction of academic dropout
* Web services: Service Level Agreement violation prediction
* Security services: fraud detection

---

class: center, middle

# What does scikit-learn offer?

---

### Transformers in scikit-learn

Reduce the number of features in `X`: extraction and selection

.pull-left[
```python
X = [[ 0.14,  0.05,  0.33,  0.63],
     [ 0.48,  0.1 ,  0.29,  0.33],
     [ 0.63,  0.25,  0.28,  0.79],
     [ 0.35,  0.95,  0.11,  0.57],
     [ 0.13,  0.33,  0.43,  0.48]]
```
]
.pull-right[
```python
X_red = [[ 0.33,  0.63],
         [ 0.29,  0.33],
         [ 0.28,  0.79],
         [ 0.11,  0.57],
         [ 0.43,  0.48]]
```
]

Transform `X` by scaling, normalizing, etc.

.pull-left[
```python
X = [[ 0.14,  0.05,  0.33,  0.63],
     [ 0.48,  0.1 ,  0.29,  0.33],
     [ 0.63,  0.25,  0.28,  0.79],
     [ 0.35,  0.95,  0.11,  0.57],
     [ 0.13,  0.33,  0.43,  0.48]]
```
]
.pull-right[
```python
X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
```
]

---

### Transformers in scikit-learn

Transform `y` by encoding, binarizing, etc.

.pull-left[
```python
y = ['A', 'B', 'B', 'A', 'C', 'C']
```
]
.pull-right[
```python
y_enc = [0, 1, 1, 0, 2, 2]
```
]

Limitations

* Impossible to reduce the number of samples
* `sample_weight` and `class_weight` are only partial workarounds

Need

* **Implementation of specific samplers**

---

### What is scikit-learn-contrib?
### Requirements

* Compatible with scikit-learn estimators -> `check_estimator()`
* API and User Guide documentation
* PEP8, unit tests and CI

### Advantages

* Implementation of bleeding-edge algorithms
* Experimental feature and API testing
* Speed and performance benchmarks

---

class: center, middle

# imbalanced-learn API

`fit`, `sample`, `fit_sample`

datasets

pipeline

metrics

classifier

---

## Samplers

### `fit`, `sample`, `fit_sample`

```python
>>> from collections import Counter
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(ratio='auto')
>>> X_resampled, y_resampled = smote.fit_sample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})
```

### `ratio` parameter --> new in 0.3.0

* `str`: `'auto'`, `'all'`, `'minority'`, `'majority'`, `'not_minority'`
* `dict`: key -> class; value -> # samples (a `dict` example follows in the over-sampling sketch)
* `func`: function returning a `dict` like the one above

---

### Over-sampling
![:scale 40%](img/original.png)
![:scale 70%](img/oversampling.png)
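
---
count: false

### Over-sampling

All three over-samplers follow the `fit_sample` API shown earlier. A minimal sketch, assuming `X` and `y` hold the imbalanced two-class data from the SMOTE example (the dict value reuses the class counts from that slide):

```python
>>> from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
>>> # random repetition of minority samples
>>> X_ros, y_ros = RandomOverSampler().fit_sample(X, y)
>>> # synthetic samples interpolated between minority neighbors
>>> X_sm, y_sm = SMOTE().fit_sample(X, y)
>>> # like SMOTE, focusing on regions that are harder to learn
>>> X_ada, y_ada = ADASYN().fit_sample(X, y)
>>> # `ratio` as a dict requests a given number of samples per class
>>> X_sm2, y_sm2 = SMOTE(ratio={1: 5809}).fit_sample(X, y)
```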
---
count: false

### Under-sampling

* Prototype generation
![:scale 90%](img/sphx_glr_plot_comparison_under_sampling_001.png)
* Prototype selection
![:scale 90%](img/sphx_glr_plot_comparison_under_sampling_002.png)
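
---
count: false

### Under-sampling

A minimal sketch of both families, with the same `X` and `y` as before:

```python
>>> # prototype generation: new samples summarize the majority class
>>> from imblearn.under_sampling import ClusterCentroids
>>> X_cc, y_cc = ClusterCentroids().fit_sample(X, y)
>>> # prototype selection: keep a subset of the original samples
>>> from imblearn.under_sampling import RandomUnderSampler, NearMiss
>>> X_rus, y_rus = RandomUnderSampler().fit_sample(X, y)
>>> X_nm, y_nm = NearMiss(version=1).fit_sample(X, y)
```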
---
count: false

### Under-sampling

* Cleaning under-sampling
![:scale 90%](img/cleaning.png)
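
---
count: false

### Under-sampling

Cleaning methods remove noisy or borderline samples rather than targeting a fixed number of samples. A minimal sketch, with `X` and `y` as before:

```python
>>> from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
>>> # remove majority samples that form a Tomek link with a minority sample
>>> X_tl, y_tl = TomekLinks().fit_sample(X, y)
>>> # remove samples whose class disagrees with their nearest neighbors
>>> X_enn, y_enn = EditedNearestNeighbours().fit_sample(X, y)
```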
---
count: false

### Combined over-sampling and under-sampling methods
![:scale 90%](img/combine.png)
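
---
count: false

### Combined over-sampling and under-sampling methods

SMOTE can generate noisy synthetic samples; the combined methods clean them afterwards. A minimal sketch:

```python
>>> from imblearn.combine import SMOTEENN, SMOTETomek
>>> # SMOTE over-sampling followed by edited nearest-neighbours cleaning
>>> X_se, y_se = SMOTEENN().fit_sample(X, y)
>>> # SMOTE over-sampling followed by Tomek-links cleaning
>>> X_st, y_st = SMOTETomek().fit_sample(X, y)
```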
---

### Over-sampling

* Random over-sampling / SMOTE / ADASYN

### Controlled under-sampling

* Random under-sampling / NearMiss

### Cleaning under-sampling

* Condensed nearest neighbors and variants
* Edited nearest neighbors and variants
* Instance hardness threshold

### Combined over- and under-sampling

* SMOTE + ENN / SMOTE + Tomek

---

### datasets module

#### `fetch_datasets`

```python
>>> from imblearn.datasets import fetch_datasets
>>> dataset = fetch_datasets(filter_data=['satimage'])
>>> dataset
{'DESCR': 'satimage',
 'data': array([[  92.,  115.,  120., ...,  107.,  113.,   87.],
                [  84.,  102.,  106., ...,   99.,  104.,   79.],
                [  84.,  102.,  102., ...,   99.,  104.,   79.],
                ...,
                [  56.,   68.,   91., ...,   83.,   92.,   74.],
                [  56.,   68.,   87., ...,   83.,   92.,   70.],
                [  60.,   71.,   91., ...,   79.,  108.,   92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}
```

#### `make_imbalance`

```python
>>> from collections import Counter
>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_imb, y_imb = make_imbalance(X, y, ratio={0: 50, 1: 20, 2: 25})
>>> Counter(y_imb)
Counter({0: 50, 1: 20, 2: 25})
```

---

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```

#### scikit-learn

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)
```

#### imbalanced-learn

```python
>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
```

---

### metrics module

```python
>>> print('Score pipeline:', pipeline.score(X_test, y_test))
Score pipeline: 0.9808
>>> print('Score pipeline:', pipeline_smote.score(X_test, y_test))
Score pipeline: 0.9752
```

```python
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
               pre    rec    spe    f1     geo    iba    sup

          0    0.83   0.94   1.00   0.88   0.91   0.82     16
          1    0.67   0.91   0.98   0.77   0.82   0.65     54
          2    1.00   0.98   0.94   0.99   0.85   0.74   1180

avg / total    0.98   0.98   0.95   0.98   0.85   0.74   1250
```

.pull-left[
![:scale 90%](img/cm1.png)
]
.pull-right[
![:scale 90%](img/cm2.png)
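]

---
count: false

### metrics module

The aggregated scores in the report are also exposed as individual functions. A minimal sketch, reusing `y_test` and `y_pred` from the previous slide:

```python
>>> from imblearn.metrics import (geometric_mean_score,
...                               sensitivity_specificity_support)
>>> geometric_mean_score(y_test, y_pred)
>>> sensitivity_specificity_support(y_test, y_pred)
```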
---

### classifier

```python
>>> from sklearn.ensemble import BaggingClassifier
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bagging = BaggingClassifier(random_state=0)
>>> balanced_bagging = BalancedBaggingClassifier(random_state=0)
>>> bagging.fit(X_train, y_train)
>>> balanced_bagging.fit(X_train, y_train)
>>> y_pred_bagging = bagging.predict(X_test)
>>> y_pred_balanced_bagging = balanced_bagging.predict(X_test)
```

.pull-left[
![:scale 90%](img/sphx_glr_plot_comparison_bagging_classifier_001.png)
]
.pull-right[
![:scale 90%](img/sphx_glr_plot_comparison_bagging_classifier_002.png)
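]

---
count: false

### classifier

`BalancedBaggingClassifier` roughly amounts to bagging in which each bootstrap sample is randomly under-sampled before fitting the base estimator. A hand-rolled sketch of the idea, not the library's actual implementation (the names `resampled_tree` and `bagging_like` are ours):

```python
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from imblearn.pipeline import make_pipeline
>>> from imblearn.under_sampling import RandomUnderSampler
>>> # under-sample, then fit a tree: used as the base estimator of bagging
>>> resampled_tree = make_pipeline(RandomUnderSampler(),
...                                DecisionTreeClassifier())
>>> bagging_like = BaggingClassifier(base_estimator=resampled_tree,
...                                  random_state=0)
>>> bagging_like.fit(X_train, y_train)
```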
---

### What else in 0.3.0?

* Support for multi-class in all algorithms
* Support for sparse matrices
* New user guide
* Migration from nose to pytest

### How to install it ...

#### Via `pip`

```
$ pip install imbalanced-learn
```

#### Via `conda` channel

```
$ conda install -c glemaitre imbalanced-learn
```

---

class: center, middle

# Future work

---

### Perspectives

* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm
* Create a `keras` submodule specifically for sampling in deep learning

### Want to discuss? Come to the scikit-learn sprint on Friday ...