class: center, middle

## Imbalanced-learn

#### A scikit-learn-contrib to tackle learning from imbalanced data sets

##### **Guillaume Lemaitre**, Christos Aridas, and contributors

###### EuroSciPy 2018

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---
class: center, middle

# Imbalanced data set

# -

# What does it look like?

---
class: center, middle

### Typical use cases

#### Cancer segmentation
![:scale 50%](img/new_img/prostate.png)
=> Ratio between healthy and cancerous tissue - 20:1

---
count: false
class: center, middle

### Typical use cases

#### Solar wind classification
![:scale 55%](img/new_img/solar_wind.jpeg)
=> Ratio between all records and solar wind records - 26:1

---
count: false
class: center, middle

### Typical use cases

#### Car insurance claim
![:scale 80%](img/new_img/porto_seguro.png)
=> Ratio between no-claim and claim records - 26:1

---

### Learning becomes difficult

```python
for weight_minority in [0.05, 0.1, 0.2, 0.5]:
    X, y = make_classification(weights=[weight_minority, 1 - weight_minority])
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```
![:scale 60%](img/new_img/imbalanced_ratio.png)
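---
count: false

### Learning becomes difficult

A minimal sketch of the class counts behind the figure above (assuming `n_samples=1000` and a fixed `random_state`, which the original snippet leaves at their defaults):

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

for weight_minority in [0.05, 0.1, 0.2, 0.5]:
    X, y = make_classification(n_samples=1000,
                               weights=[weight_minority, 1 - weight_minority],
                               random_state=42)
    # stratify=y preserves the same imbalance in both splits
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    print(weight_minority, Counter(y_train))
```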
---

### Learning becomes difficult

```python
for weight_minority in [0.05, 0.1, 0.2, 0.5]:
    X, y = make_classification(weights=[weight_minority, 1 - weight_minority])
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    svc = SVC().fit(X_train, y_train)
    roc_auc_score(y_test, svc.predict(X_test))
```
![:scale 60%](img/new_img/imbalanced_svm.png)
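---
count: false

### Learning becomes difficult

Accuracy alone hides the problem. A sketch (same synthetic setup as above, with an assumed `n_samples` and `random_state`) contrasting accuracy with ROC-AUC on the most imbalanced configuration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.05, 0.95],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
svc = SVC().fit(X_train, y_train)
y_pred = svc.predict(X_test)
# accuracy looks excellent: predicting the majority class is almost enough
print(accuracy_score(y_test, y_pred))
# ROC-AUC reveals how poorly the minority class is separated
print(roc_auc_score(y_test, y_pred))
```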
---
class: center, middle

# How can you tackle the issue?

---
class: center, middle

# It depends ...

# -

# 3 different perspectives

---
class: center, middle

# Unsupervised learning

# -

# Outlier detection

---

### Unsupervised learning - Outlier detection

`X` contains outliers. They are **far** from the others.

=> Fit the regions where the data are concentrated.

```python
outliers_fraction = 0.05
anomaly_algorithms = [
    ("Robust covariance",
     EllipticEnvelope(contamination=outliers_fraction)),
    ("One-Class SVM",
     OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    ("Isolation Forest",
     IsolationForest(behaviour='new', contamination=outliers_fraction,
                     random_state=42)),
    ("Local Outlier Factor",
     LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction))]

for name, algorithm in anomaly_algorithms:
    # LocalOutlierFactor only exposes fit_predict
    if name == "Local Outlier Factor":
        y_pred = algorithm.fit_predict(X)
    else:
        y_pred = algorithm.fit(X).predict(X)
    roc_auc_score(y, y_pred)
```

---
count: false

### Unsupervised learning - Outlier detection

`X` contains outliers. They are **far** from the others.

=> Fit the regions where the data are concentrated.
![:scale 60%](img/new_img/outlier_detection.png)
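---
count: false

### Unsupervised learning - Outlier detection

The snippet above assumes a toy `X`, `y` already exist. A minimal sketch of such data, using the scikit-learn outlier convention (`1` = inlier, `-1` = outlier) so that `roc_auc_score(y, y_pred)` can consume the raw predictions:

```python
import numpy as np

rng = np.random.RandomState(42)
n_inliers, n_outliers = 950, 50

# inliers: a Gaussian blob; outliers: uniform noise over the feature space
X = np.vstack([rng.randn(n_inliers, 2),
               rng.uniform(low=-6, high=6, size=(n_outliers, 2))])
y = np.hstack([np.ones(n_inliers), -np.ones(n_outliers)])
```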
---
class: center, middle

# Semi-supervised learning

# -

# Novelty detection

---

### Semi-supervised learning - Novelty detection

`X` contains outliers. They are **far** from the others.

=> At predict time, differentiate **new** observations which do not follow the distribution.

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
samples_majority_idx = y_train == 0

svm = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.2)
svm.fit(X_train[samples_majority_idx], y_train[samples_majority_idx])
roc_auc_score(y_test, svm.predict(X_test))
```
![:scale 40%](img/new_img/novelty_detection.png)
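---
count: false

### Semi-supervised learning - Novelty detection

A possible refinement of the snippet above (same split and `OneClassSVM` parameters): scoring with `decision_function` rather than hard predictions, since ROC-AUC is better estimated from a continuous score.

```python
svm = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.2)
svm.fit(X_train[samples_majority_idx])
# decision_function is larger for inliers; negate it so that a higher
# score means "more abnormal" if the positive label marks the outliers
scores = -svm.decision_function(X_test).ravel()
roc_auc_score(y_test, scores)
```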
---
class: center, middle

# Supervised learning

# -

# Resampling

---

### Supervised learning - Resampling

=> Resample the data set before training the classifier.

```python
for sampling_strategy in [0.3, 0.5, 0.7, 1]:
    sampler_method = [
        ('RandomUnderSampler', RandomUnderSampler(sampling_strategy)),
        ('ClusterCentroids', ClusterCentroids(sampling_strategy)),
        ('RandomOverSampler', RandomOverSampler(sampling_strategy)),
        ('SMOTE', SMOTE(sampling_strategy))]
    for name, sampler in sampler_method:
        pipe = make_pipeline(sampler, SVC(C=0.1))
        pipe.fit(X_train, y_train)
        roc_auc_score(y_test, pipe.predict(X_test))
```

---
count: false

### Supervised learning - Resampling
![:scale 85%](img/new_img/resampling.png)
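---
count: false

### Supervised learning - Resampling

What a sampler actually does to the training set, as a minimal sketch (the 1000/100 counts are a hypothetical `y_train`):

```python
>>> from collections import Counter
>>> from imblearn.under_sampling import RandomUnderSampler
>>> Counter(y_train)
Counter({0: 1000, 1: 100})
>>> rus = RandomUnderSampler(sampling_strategy=0.5)
>>> X_res, y_res = rus.fit_resample(X_train, y_train)
>>> Counter(y_res)  # majority reduced until |minority| / |majority| = 0.5
Counter({0: 200, 1: 100})
```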
---
class: center, middle

# What is scikit-learn missing for sampling?

---

### Transformers in scikit-learn

Reduce the number of features in `X`: extraction and selection

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_red = [[ 0.33, 0.63],
         [ 0.29, 0.33],
         [ 0.28, 0.79],
         [ 0.11, 0.57],
         [ 0.43, 0.48]]
```
]

Transform `X` by scaling, normalizing, etc.

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
```
]

---

### Transformers in scikit-learn

Transform `y` by encoding or binarizing

.pull-left[
```python
y = ['A', 'B', 'B', 'A', 'C', 'C']
```
]
.pull-right[
```python
y_enc = [0, 1, 1, 0, 2, 2]
```
]

Limitations

* Impossible to reduce the number of samples
* `sample_weight` and `class_weight` are only partial solutions

Need

* **Implementation of specific samplers**

---

### What is scikit-learn-contrib?
### Requirements

* Compatible with scikit-learn estimators -> `check_estimators()`
* API and User Guide documentation
* PEP8, unit tests and CI

### Advantages

* Implementation of bleeding-edge algorithms
* Experimental feature and API testing
* Speed and performance benchmarks

---
class: center, middle

# imbalanced-learn API

---

## Samplers

### Single method `fit_resample`

```python
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(sampling_strategy='auto')
>>> X_resampled, y_resampled = smote.fit_resample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})
```

### `sampling_strategy` parameter

* `str`: `'auto'`, `'all'`, `'minority'`, `'majority'`, `'not_minority'`
* `float`: desired ratio |minority| / |majority|
* `dict`: key -> class; value -> # samples
* `func`: callable returning such a `dict`

---

### Over-sampling

* Random over-sampling / SMOTE / ADASYN

### Controlled under-sampling

* Random under-sampling / NearMiss

### Cleaning under-sampling

* Condensed nearest-neighbors and variants
* Edited nearest-neighbors and variants
* Instance hardness threshold

### Combine over- and under-sampling

* SMOTE + ENN / SMOTE + Tomek

---

## How to use samplers in practice?

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```

#### scikit-learn

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)
```

#### imbalanced-learn

```python
>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
```

---

## Additional helper modules

### Metrics module

```python
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
                  pre    rec    spe     f1    geo    iba    sup

          0      0.83   0.94   1.00   0.88   0.91   0.82     16
          1      0.67   0.91   0.98   0.77   0.82   0.65     54
          2      1.00   0.98   0.94   0.99   0.85   0.74   1180

avg / total      0.98   0.98   0.95   0.98   0.85   0.74   1250
```

Complementary metrics specific to imbalanced data set evaluation

```python
>>> from imblearn.metrics import sensitivity_specificity_support
>>> from imblearn.metrics import geometric_mean_score
>>> from imblearn.metrics import make_index_balanced_accuracy
```

---

## Additional helper modules

### Datasets module

#### `fetch_datasets`

```python
>>> from imblearn.datasets import fetch_datasets
>>> dataset = fetch_datasets(filter_data=['satimage'])
>>> dataset['satimage']
{'DESCR': 'satimage',
 'data': array([[ 92., 115., 120., ..., 107., 113.,  87.],
                [ 84., 102., 106., ...,  99., 104.,  79.],
                ...,
                [ 56.,  68.,  87., ...,  83.,  92.,  70.],
                [ 60.,  71.,  91., ...,  79., 108.,  92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}
```

#### `make_imbalance`

```python
>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_imb, y_imb = make_imbalance(X, y, ratio={0: 50, 1: 20, 2: 25})
>>> Counter(y_imb)
Counter({0: 50, 1: 20, 2: 25})
```
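---
count: false

### Putting it together

A minimal end-to-end sketch combining the pieces above (`fetch_datasets`, an imblearn pipeline, `classification_report_imbalanced`); the choice of `RandomUnderSampler` and `LogisticRegression` is illustrative only.

```python
from imblearn.datasets import fetch_datasets
from imblearn.metrics import classification_report_imbalanced
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

satimage = fetch_datasets(filter_data=['satimage'])['satimage']
X, y = satimage.data, satimage.target

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
# the sampler only resamples the training folds, never the test data
pipe = make_pipeline(RandomUnderSampler(), LogisticRegression())
pipe.fit(X_train, y_train)
print(classification_report_imbalanced(y_test, pipe.predict(X_test)))
```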
---
class: center, middle

# What's new in 0.4?

---

Adding new methods:

* `BalancedBaggingClassifier` / `BalancedRandomForestClassifier`
* `EasyEnsembleClassifier` / `BalanceCascadeClassifier`
* `FunctionSampler` -> [Example](http://contrib.scikit-learn.org/imbalanced-learn/dev/auto_examples/plot_outlier_rejections.html#sphx-glr-auto-examples-plot-outlier-rejections-py)
* Keras and TensorFlow mini-batch sampling -> [Example](http://contrib.scikit-learn.org/imbalanced-learn/dev/auto_examples/applications/porto_seguro_keras_under_sampling.html#sphx-glr-auto-examples-applications-porto-seguro-keras-under-sampling-py)

```python
from imblearn.keras import BalancedBatchGenerator

def fit_predict_balanced_model(X_train, y_train, X_test, y_test):
    # make_model builds a compiled Keras network (defined in the linked example)
    model = make_model(X_train.shape[1])
    # each mini-batch is resampled so that both classes are equally represented
    training_generator = BalancedBatchGenerator(X_train, y_train,
                                                batch_size=1000,
                                                random_state=42)
    model.fit_generator(generator=training_generator, epochs=5, verbose=1)
    y_pred = model.predict_proba(X_test, batch_size=1000)
    return roc_auc_score(y_test, y_pred)
```

---

### How to install it ...

#### Via `pip`

```
$ pip install imbalanced-learn
```

#### Via `conda`

```
$ conda install -c conda-forge imbalanced-learn
```

### Perspectives

#### Support for categorical features

#### Cython criterion which could be plugged into scikit-learn decision trees

#### Implementation of SMOTE variants

#### Integration into scikit-learn