imblearn.combine.SMOTEENN

class imblearn.combine.SMOTEENN(ratio='auto', random_state=None, smote=None, enn=None, k=None, m=None, out_step=None, kind_smote=None, size_ngh=None, n_neighbors=None, kind_enn=None, n_jobs=None)[source][source]

Class to perform over-sampling using SMOTE and cleaning using ENN.

Combine over- and under-sampling using SMOTE and Edited Nearest Neighbours.

Parameters:

ratio : str, dict, or callable, optional (default=’auto’)

Ratio to use for resampling the data set.

  • If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class, (iii) 'not minority': resample all classes apart of the minority class, (iv) 'all': resample all classes, and (v) 'auto': correspond to 'all' with for over-sampling methods and 'not minority' for under-sampling methods. The classes targeted will be over-sampled or under-sampled to achieve an equal number of sample with the majority or minority class.
  • If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
  • If callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

smote : object, optional (default=SMOTE())

The imblearn.over_sampling.SMOTE object to use. If not given, a imblearn.over_sampling.SMOTE object with default parameters will be given.

enn : object, optional (default=EditedNearestNeighbours())

The imblearn.under_sampling.EditedNearestNeighbours object to use. If not given, an imblearn.under_sampling.EditedNearestNeighbours object with default parameters will be given.

k : int, optional (default=None)

Number of nearest neighbours to used to construct synthetic samples.

Deprecated since version 0.2: k is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.over_sampling.SMOTE object.

m : int, optional (default=None)

Number of nearest neighbours to use to determine if a minority sample is in danger.

Deprecated since version 0.2: m is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.over_sampling.SMOTE object.

out_step : float, optional (default=None)

Step size when extrapolating.

Deprecated since version 0.2: out_step is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.over_sampling.SMOTE object.

kind_smote : str, optional (default=None)

The type of SMOTE algorithm to use one of the following options: 'regular', 'borderline1', 'borderline2', 'svm'.

Deprecated since version 0.2: kind_smote is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.over_sampling.SMOTE object.

size_ngh : int, optional (default=None)

Size of the neighbourhood to consider to compute the average distance to the minority point samples.

Deprecated since version 0.2: size_ngh is deprecated from 0.2 and will be replaced in 0.4 Use n_neighbors instead.

n_neighbors : int, optional (default=None)

Size of the neighbourhood to consider to compute the average distance to the minority point samples.

Deprecated since version 0.2: n_neighbors is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.under_sampling.EditedNearestNeighbours object.

kind_sel : str, optional (default=None)

Strategy to use in order to exclude samples.

  • If 'all', all neighbours will have to agree with the samples of interest to not be excluded.
  • If 'mode', the majority vote of the neighbours will be used in order to exclude a sample.

Deprecated since version 0.2: kind_sel is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.under_sampling.EditedNearestNeighbours object.

n_jobs : int, optional (default=None)

The number of threads to open if possible.

Deprecated since version 0.2: n_jobs is deprecated from 0.2 and will be replaced in 0.4 Give directly a imblearn.over_sampling.SMOTE and imblearn.under_sampling.EditedNearestNeighbours object.

Notes

The method is presented in [R22].

Supports mutli-class resampling.

References

[R22](1, 2) G. Batista, R. C. Prati, M. C. Monard. “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter 6 (1), 20-29, 2004.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.combine import SMOTEENN 
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape {}'.format(Counter(y)))
Original dataset shape Counter({1: 900, 0: 100})
>>> sme = SMOTEENN(random_state=42)
>>> X_res, y_res = sme.fit_sample(X, y)
>>> print('Resampled dataset shape {}'.format(Counter(y_res)))
Resampled dataset shape Counter({0: 900, 1: 881})

Methods

fit(X, y) Find the classes statistics before to perform sampling.
fit_sample(X, y) Fit the statistics and resample the data directly.
get_params([deep]) Get parameters for this estimator.
sample(X, y) Resample the dataset.
set_params(**params) Set the parameters of this estimator.
__init__(ratio='auto', random_state=None, smote=None, enn=None, k=None, m=None, out_step=None, kind_smote=None, size_ngh=None, n_neighbors=None, kind_enn=None, n_jobs=None)[source][source]

Methods

__init__([ratio, random_state, smote, enn, ...])
fit(X, y) Find the classes statistics before to perform sampling.
fit_sample(X, y) Fit the statistics and resample the data directly.
get_params([deep]) Get parameters for this estimator.
sample(X, y) Resample the dataset.
set_params(**params) Set the parameters of this estimator.
fit(X, y)[source][source]

Find the classes statistics before to perform sampling.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

self : object,

Return self.

fit_sample(X, y)[source]

Fit the statistics and resample the data directly.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

X_resampled : ndarray, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : ndarray, shape (n_samples_new)

The corresponding label of X_resampled

get_params(deep=True)[source]

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

sample(X, y)[source]

Resample the dataset.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

X_resampled : ndarray, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : ndarray, shape (n_samples_new)

The corresponding label of X_resampled

set_params(**params)[source]

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:self