class: center, middle

## Introduction to imbalanced-learn

#### Leverage knowledge from under-represented classes in machine learning

##### **Guillaume Lemaitre** and Christos Aridas

###### PyParis 2017

.affiliations[
  ![:scale 65%](img/logoUPSayPlusCDS_990.png)
  ![:scale 25%](img/inria-logo.png)
]

---

### The curse of imbalanced data sets
![:scale 70%](img/imbalanced.png)
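---
count: false

### The curse of imbalanced data sets

A minimal sketch (not from the original deck) of why plain accuracy misleads here: always predicting the majority class already looks excellent.

```python
# With a 99:1 imbalance, the majority-vote baseline scores 99% accuracy
# while never detecting a single minority sample.
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.RandomState(0)
X = rng.randn(1000, 2)
y = np.r_[np.zeros(990), np.ones(10)]    # 990 majority vs 10 minority samples

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print(baseline.score(X, y))              # 0.99 -- high accuracy, useless model
```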
---

### The curse of imbalanced data sets

Applications

* Bioinformatics
* Medical imaging: diseased *versus* healthy
* Social sciences: prediction of academic dropout
* Web services: Service Level Agreement violation prediction
* Security services: fraud detection

---
class: center, middle

# `scikit-learn` offers

---

### `TransformerMixin` in `scikit-learn`

Reduce the number of features in `X`: extraction and selection

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_red = [[ 0.33, 0.63],
         [ 0.29, 0.33],
         [ 0.28, 0.79],
         [ 0.11, 0.57],
         [ 0.43, 0.48]]
```
]

Transform `X` by scaling, normalizing, etc.

.pull-left[
```python
X = [[ 0.14, 0.05, 0.33, 0.63],
     [ 0.48, 0.1 , 0.29, 0.33],
     [ 0.63, 0.25, 0.28, 0.79],
     [ 0.35, 0.95, 0.11, 0.57],
     [ 0.13, 0.33, 0.43, 0.48]]
```
]
.pull-right[
```python
X_sc = [[-1.05, -0.9 ,  0.41,  0.47],
        [ 0.68, -0.73,  0.03, -1.5 ],
        [ 1.48, -0.26, -0.08,  1.5 ],
        [ 0.01,  1.9 , -1.72,  0.05],
        [-1.12, -0.01,  1.36, -0.52]]
```
]

---

### `TransformerMixin` in `scikit-learn`

Transform `y` by encoding or binarizing

.pull-left[
```python
y = ['A', 'B', 'B', 'A', 'C', 'C']
```
]
.pull-right[
```python
y_enc = [0, 1, 1, 0, 2, 2]
```
]

Limitation

* Impossible to reduce the number of samples
* `sample_weight` and `class_weight` are only partial workarounds

Need

* **Implementation of a `SamplerMixin`**

---

### The `scikit-learn` dilemma

Explosion of machine learning algorithms, from pre-processing to classifiers/regressors

--
count: false
![:scale 40%](img/swiss_army.jpg)
--
count: false

### Entry barrier in `scikit-learn`

* 3 years since publication
* 200+ citations
* **clear-cut improvement**

---

### Quality in `scikit-learn`

For each PR, a large amount of time is spent on:

* consistency of the API
* ensuring both good speed and good performance

--
count: false
![:scale 60%](img/pr_merge.png)
---

### Emergence of `scikit-learn-contrib`
### Requirements

* Compatibility with `scikit-learn` estimators -> `check_estimator()`
* API and User Guide documentation
* PEP8, unit tests and CI
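A sketch of how that compatibility is checked (the exact call varies across `scikit-learn` versions: older releases take the class, newer ones an instance):

```python
# Run scikit-learn's API-compliance test suite on an estimator;
# it raises an AssertionError as soon as one check fails.
from sklearn.utils.estimator_checks import check_estimator
from sklearn.linear_model import LogisticRegression

check_estimator(LogisticRegression())
```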
---
count: false

### Emergence of `scikit-learn-contrib`

### Advantages

#### Programming POV

* implementation of bleeding-edge algorithms
* experimental feature and API testing
* speed and performance benchmarks

#### Developer POV

* empower contributors

---
class: center, middle

# `imbalanced-learn` API

`datasets`

`fit`, `sample`, `fit_sample`

`Pipeline`

`metrics`

---

### `datasets` module

```python
>>> from imblearn.datasets import fetch_datasets
>>> fetch_datasets
>>> fetch_datasets?

+--+--------------+-------------------------------+-------+---------+-----+
|ID|Name          | Repository & Target           | Ratio | #S      | #F  |
+==+==============+===============================+=======+=========+=====+
|1 |ecoli         | UCI, target: imU              | 8.6:1 | 336     | 7   |
+--+--------------+-------------------------------+-------+---------+-----+
|2 |optical_digits| UCI, target: 8                | 9.1:1 | 5,620   | 64  |
+--+--------------+-------------------------------+-------+---------+-----+
|3 |satimage      | UCI, target: 4                | 9.3:1 | 6,435   | 36  |
+--+--------------+-------------------------------+-------+---------+-----+
|4 |pen_digits    | UCI, target: 5                | 9.4:1 | 10,992  | 16  |
+--+--------------+-------------------------------+-------+---------+-----+
|5 |abalone       | UCI, target: 7                | 9.7:1 | 4,177   | 10  |
+--+--------------+-------------------------------+-------+---------+-----+
|6 |sick_euthyroid| UCI, target: sick euthyroid   | 9.8:1 | 3,163   | 42  |
```

---
count: false

### `datasets` module

```python
>>> dataset = fetch_datasets()['satimage']
>>> dataset
{'DESCR': 'satimage',
 'data': array([[  92.,  115.,  120., ...,  107.,  113.,   87.],
                [  84.,  102.,  106., ...,   99.,  104.,   79.],
                [  84.,  102.,  102., ...,   99.,  104.,   79.],
                ...,
                [  56.,   68.,   91., ...,   83.,   92.,   74.],
                [  56.,   68.,   87., ...,   83.,   92.,   70.],
                [  60.,   71.,   91., ...,   79.,  108.,   92.]]),
 'target': array([-1, -1, -1, ..., -1, -1, -1])}
```

```python
>>> from collections import Counter
>>> X = dataset.data
>>> y = dataset.target
>>> Counter(y)
Counter({-1: 5809, 1: 626})
```

---

### `fit`, `sample`, `fit_sample`

```python
>>> from imblearn.over_sampling import SMOTE
>>> smote = SMOTE(ratio='auto')
>>> X_resampled, y_resampled = smote.fit_sample(X, y)
>>> Counter(y_resampled)
Counter({-1: 5809, 1: 5809})
```

### `ratio` parameter

* `str`: `'auto'`, `'all'`, `'minority'`, `'majority'`, `'not_minority'`
* `dict`: key -> class; value -> # samples (see the sketch after the toy dataset below)
* `func`: a function returning such a `dict`

---

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
                               n_redundant=0, n_repeated=0, n_classes=3,
                               n_clusters_per_class=1,
                               weights=[0.01, 0.04, 0.95],
                               class_sep=0.8, random_state=0)
```
![:scale 60%](img/toy.png)
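---
count: false

### `fit`, `sample`, `fit_sample`

A sketch of the `dict` form of `ratio`, applied to the toy dataset above (the target counts here are hypothetical):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

print(Counter(y))                        # heavily skewed toward class 2
# key -> class; value -> desired number of samples after resampling
smote = SMOTE(ratio={0: 1000, 1: 1000})
X_resampled, y_resampled = smote.fit_sample(X, y)
print(Counter(y_resampled))              # classes 0 and 1 raised to 1000 each
```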
---
count: false

### `Pipeline` and `make_pipeline`

```python
>>> from sklearn.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```

#### `scikit-learn`

```python
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegressionCV
>>> pipeline = make_pipeline(PCA(), LogisticRegressionCV())
>>> pipeline.fit(X_train, y_train)
>>> pipeline.predict(X_test)
```

#### `imbalanced-learn`

```python
>>> from imblearn.pipeline import make_pipeline
>>> pipeline_smote = make_pipeline(SMOTE(), PCA(), LogisticRegressionCV())
>>> pipeline_smote.fit(X_train, y_train)
>>> pipeline_smote.predict(X_test)
```

---

### `metrics`

```python
>>> print('Score pipeline:', pipeline.score(X_test, y_test))
Score pipeline: 0.9808
>>> print('Score SMOTE pipeline:', pipeline_smote.score(X_test, y_test))
Score SMOTE pipeline: 0.9752
```

```python
>>> from imblearn.metrics import classification_report_imbalanced
>>> y_pred = pipeline_smote.predict(X_test)
>>> print(classification_report_imbalanced(y_test, y_pred))
                   pre       rec       spe        f1       geo       iba       sup

          0       0.83      0.94      1.00      0.88      0.91      0.82        16
          1       0.67      0.91      0.98      0.77      0.82      0.65        54
          2       1.00      0.98      0.94      0.99      0.85      0.74      1180

avg / total       0.98      0.98      0.95      0.98      0.85      0.74      1250
```

.pull-left[
![:scale 90%](img/cm1.png)
]
.pull-right[
![:scale 90%](img/cm2.png)
]

---
class: center, middle

# Algorithm specificities

---

### Over-sampling algorithms
![:scale 75%](img/over_sampling.png)
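A sketch (reusing `X`, `y` from the toy dataset) of the main over-samplers; all share the same `fit_sample` interface:

```python
from collections import Counter

from imblearn.over_sampling import ADASYN, RandomOverSampler, SMOTE

# random repetition, SMOTE interpolation, and ADASYN density-driven generation
for sampler in (RandomOverSampler(), SMOTE(), ADASYN()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(sampler.__class__.__name__, Counter(y_res))
```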
---

### Under-sampling algorithms
![:scale 55%](img/under_sampling.png)
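A sketch (same toy `X`, `y`) of some of the available under-samplers; here `fit_sample` shrinks the majority class instead:

```python
from collections import Counter

from imblearn.under_sampling import (ClusterCentroids, NearMiss,
                                     RandomUnderSampler)

# random selection, distance-based selection, and K-means centroid synthesis
for sampler in (RandomUnderSampler(), NearMiss(), ClusterCentroids()):
    X_res, y_res = sampler.fit_sample(X, y)
    print(sampler.__class__.__name__, Counter(y_res))
```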
---

### Cleaning algorithms
![:scale 75%](img/cleaning.png)
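A sketch (same toy `X`, `y`) of two cleaning methods; unlike the samplers above, they remove noisy or borderline samples rather than enforce a given ratio:

```python
from collections import Counter

from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks

# drop Tomek links / samples misclassified by their nearest neighbours
for cleaner in (TomekLinks(), EditedNearestNeighbours()):
    X_res, y_res = cleaner.fit_sample(X, y)
    print(cleaner.__class__.__name__, Counter(y_res))
```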
---
class: center, middle

# Applications

---

### Melanoma classification
![:scale 100%](img/melanoma_framework.png)
![:scale 100%](img/melanoma_results.png)
---

### Prostate cancer detection
![:scale 60%](img/prostate_framework.png)
![:scale 50%](img/prostate_results.png)
---
class: center, middle

# Future work

---

### Perspectives

* Support sparse matrices
* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm

---
count: false

### Perspectives

* Support sparse matrices
* Support categorical features
* Quantitative benchmark
* Fast nearest-neighbors algorithm

### We are hiring (unpaid ...)

* We want an [MRG+2] policy, but there are only 2 core contributors