class: center, middle

## Imbalanced-learn

#### Leverage knowledge from under-represented classes in machine learning

##### Guillaume Lemaitre

###### EuroSciPy 2017

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---

### The curse of imbalanced datasets
![:scale 85%](img/sphx_glr_plot_comparison_over_sampling_001.png)
---

### The curse of imbalanced datasets

Applications

* Bioinformatics
* Medical imaging: diseased *versus* healthy
* Social sciences: prediction of academic dropout
* Web services: Service Level Agreement violation prediction
* Security services: fraud detection

---

class: center, middle

# What does scikit-learn offer?

---

### Transformers in scikit-learn

Reduce the number of features in `X`: extraction and selection

.pull-left[
```python
X = [[ 0.14,  0.05,  0.33,  0.63],
     [ 0.48,  0.1 ,  0.29,  0.33],
     [ 0.63,  0.25,  0.28,  0.79],
     [ 0.35,  0.95,  0.11,  0.57],
     [ 0.13,  0.33,  0.43,  0.48]]
```
]
.pull-right[
```python
X_red = [[ 0.33,  0.63],
         [ 0.29,  0.33],
         [ 0.28,  0.79],
         [ 0.11,  0.57],
         [ 0.43,  0.48]]
```
]

Transform `X` by scaling, normalizing, etc.

.pull-left[
```python
X = [[ 0.14,  0.05,  0.33,  0.63],
     [ 0.48,  0.1 ,  0.29,  0.33],
     [ 0.63,  0.25,  0.28,  0.79],
     [ 0.35,  0.95,  0.11,  0.57],
     [ 0.13,  0.33,  0.43,  0.48]]
```
]
.pull-right[
```python
X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
```
]

---

### Transformers in scikit-learn

Transform `y` by encoding, binarizing, etc.

.pull-left[
```python
y = ['A', 'B', 'B', 'A', 'C', 'C']
```
]
.pull-right[
```python
y_enc = [0, 1, 1, 0, 2, 2]
```
]

Limitations

* Impossible to reduce the number of samples
* `sample_weight` and `class_weight` are only partial workarounds

Need

* **Implementation of specific samplers**

---

### What is scikit-learn-contrib?
### Requirements

* Compatible with scikit-learn estimators -> `check_estimator()`
* API and User Guide documentation
* PEP8, unit tests and CI

### Advantages

* Implementation of bleeding-edge algorithms
* Experimental feature and API testing
* Speed and performance benchmarks

---

class: center, middle

# imbalanced-learn API

`fit`, `sample`, `fit_sample`

datasets

pipeline

metrics

classifier

---

## Samplers

### `fit`, `sample`, `fit_sample`

```python
>>> from collections import Counter
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(ratio='auto')
>>> X_resampled, y_resampled = smote.fit_sample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})
```

### `ratio` parameter --> new in 0.3.0

* `str`: `'auto'`, `'all'`, `'minority'`, `'majority'`, `'not_minority'`
* `dict`: key -> class; value -> # samples (a `dict` example follows in the over-sampling sketch)
* `func`: function returning a `dict` like the one above

---

### Over-sampling
![:scale 40%](img/original.png)
![:scale 70%](img/oversampling.png)
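
---
count: false

### Over-sampling

All three over-samplers follow the `fit_sample` API shown earlier. A minimal sketch, assuming `X` and `y` hold the imbalanced two-class data from the SMOTE example (the dict value reuses the class counts from that slide):

```python
>>> from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
>>> # random repetition of minority samples
>>> X_ros, y_ros = RandomOverSampler().fit_sample(X, y)
>>> # synthetic samples interpolated between minority neighbors
>>> X_sm, y_sm = SMOTE().fit_sample(X, y)
>>> # like SMOTE, focusing on regions that are harder to learn
>>> X_ada, y_ada = ADASYN().fit_sample(X, y)
>>> # `ratio` as a dict requests a given number of samples per class
>>> X_sm2, y_sm2 = SMOTE(ratio={1: 5809}).fit_sample(X, y)
```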
---
count: false

### Under-sampling

* Prototype generation
![:scale 90%](img/sphx_glr_plot_comparison_under_sampling_001.png)
* Prototype selection
![:scale 90%](img/sphx_glr_plot_comparison_under_sampling_002.png)
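
---
count: false

### Under-sampling

A minimal sketch of both families, with the same `X` and `y` as before:

```python
>>> # prototype generation: new samples summarize the majority class
>>> from imblearn.under_sampling import ClusterCentroids
>>> X_cc, y_cc = ClusterCentroids().fit_sample(X, y)
>>> # prototype selection: keep a subset of the original samples
>>> from imblearn.under_sampling import RandomUnderSampler, NearMiss
>>> X_rus, y_rus = RandomUnderSampler().fit_sample(X, y)
>>> X_nm, y_nm = NearMiss(version=1).fit_sample(X, y)
```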
---
count: false

### Under-sampling

* Cleaning under-sampling
![:scale 90%](img/cleaning.png)
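
---
count: false

### Under-sampling

Cleaning methods remove noisy or borderline samples rather than targeting a fixed number of samples. A minimal sketch, with `X` and `y` as before:

```python
>>> from imblearn.under_sampling import TomekLinks, EditedNearestNeighbours
>>> # remove majority samples that form a Tomek link with a minority sample
>>> X_tl, y_tl = TomekLinks().fit_sample(X, y)
>>> # remove samples whose class disagrees with their nearest neighbors
>>> X_enn, y_enn = EditedNearestNeighbours().fit_sample(X, y)
```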
---
count: false

### Combined over-sampling and under-sampling methods
![:scale 90%](img/combine.png)
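
---
count: false

### Combined over-sampling and under-sampling methods

SMOTE can generate noisy synthetic samples; the combined methods clean them afterwards. A minimal sketch:

```python
>>> from imblearn.combine import SMOTEENN, SMOTETomek
>>> # SMOTE over-sampling followed by edited nearest-neighbours cleaning
>>> X_se, y_se = SMOTEENN().fit_sample(X, y)
>>> # SMOTE over-sampling followed by Tomek-links cleaning
>>> X_st, y_st = SMOTETomek().fit_sample(X, y)
```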
---

### Over-sampling

* Random over-sampling / SMOTE / ADASYN

### Controlled under-sampling

* Random under-sampling / NearMiss

### Cleaning under-sampling

* Condensed nearest neighbors and variants
* Edited nearest neighbors and variants
* Instance hardness threshold

### Combined over- and under-sampling

* SMOTE + ENN / SMOTE + Tomek

---

### datasets module

#### `fetch_datasets`

```python
>>> from imblearn.datasets import fetch_datasets
>>> dataset = fetch_datasets(filter_data=['satimage'])
>>> dataset
{'DESCR': 'satimage',
 'data': array([[  92.,  115.,  120., ...,  107.,  113.,   87.],
                [  84.,  102.,  106., ...,   99.,  104.,   79.],
                [  84.,  102.,  102., ...,   99.,  104.,   79.],
                ...,
                [  56.,   68.,   91., ...,   83.,   92.,   74.],
                [  56.,   68.,   87., ...,   83.,   92.,   70.],
                [  60.,   71.,   91., ...,   79.,  108.,   92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}
```

#### `make_imbalance`

```python
>>> from collections import Counter
>>> from sklearn.datasets import load_iris
>>> from imblearn.datasets import make_imbalance
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_imb, y_imb = make_imbalance(X, y, ratio={0: 50, 1: 20, 2: 25})
>>> Counter(y_imb)
Counter({0: 50, 1: 20, 2: 25})
```

---

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```

#### scikit-learn

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)
```

#### imbalanced-learn

```python
>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
```

---

### metrics module

```python
>>> print('Score pipeline:', pipeline.score(X_test, y_test))
Score pipeline: 0.9808
>>> print('Score pipeline:', pipeline_smote.score(X_test, y_test))
Score pipeline: 0.9752
```

```python
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
               pre    rec    spe    f1     geo    iba    sup

          0    0.83   0.94   1.00   0.88   0.91   0.82     16
          1    0.67   0.91   0.98   0.77   0.82   0.65     54
          2    1.00   0.98   0.94   0.99   0.85   0.74   1180

avg / total    0.98   0.98   0.95   0.98   0.85   0.74   1250
```

.pull-left[
![:scale 90%](img/cm1.png)
]
.pull-right[
![:scale 90%](img/cm2.png)
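]

---
count: false

### metrics module

The aggregated scores in the report are also exposed as individual functions. A minimal sketch, reusing `y_test` and `y_pred` from the previous slide:

```python
>>> from imblearn.metrics import (geometric_mean_score,
...                               sensitivity_specificity_support)
>>> geometric_mean_score(y_test, y_pred)
>>> sensitivity_specificity_support(y_test, y_pred)
```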
---

### classifier

```python
>>> from sklearn.ensemble import BaggingClassifier
>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bagging = BaggingClassifier(random_state=0)
>>> balanced_bagging = BalancedBaggingClassifier(random_state=0)
>>> bagging.fit(X_train, y_train)
>>> balanced_bagging.fit(X_train, y_train)
>>> y_pred_bagging = bagging.predict(X_test)
>>> y_pred_balanced_bagging = balanced_bagging.predict(X_test)
```

.pull-left[
![:scale 90%](img/sphx_glr_plot_comparison_bagging_classifier_001.png)
]
.pull-right[
![:scale 90%](img/sphx_glr_plot_comparison_bagging_classifier_002.png)
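]

---
count: false

### classifier

`BalancedBaggingClassifier` roughly amounts to bagging in which each bootstrap sample is randomly under-sampled before fitting the base estimator. A hand-rolled sketch of the idea, not the library's actual implementation (the names `resampled_tree` and `bagging_like` are ours):

```python
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> from imblearn.pipeline import make_pipeline
>>> from imblearn.under_sampling import RandomUnderSampler
>>> # under-sample, then fit a tree: used as the base estimator of bagging
>>> resampled_tree = make_pipeline(RandomUnderSampler(),
...                                DecisionTreeClassifier())
>>> bagging_like = BaggingClassifier(base_estimator=resampled_tree,
...                                  random_state=0)
>>> bagging_like.fit(X_train, y_train)
```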
---

### What else in 0.3.0?

* Support for multi-class in all algorithms
* Support for sparse matrices
* New user guide
* Migration from nose to pytest

### How to install it ...

#### Via `pip`

```
$ pip install imbalanced-learn
```

#### Via `conda` channel

```
$ conda install -c glemaitre imbalanced-learn
```

---

class: center, middle

# Future work

---

### Perspectives

* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm
* Create a `keras` submodule specifically for sampling in deep learning

### Want to discuss? Come to the scikit-learn sprint on Friday ...