imblearn.over_sampling.SMOTE
class imblearn.over_sampling.SMOTE(ratio='auto', random_state=None, k=None, k_neighbors=5, m=None, m_neighbors=10, out_step=0.5, kind='regular', svm_estimator=None, n_jobs=1)
Class to perform over-sampling using SMOTE.
This object is an implementation of SMOTE - Synthetic Minority Over-sampling Technique, and the variants Borderline SMOTE 1, 2 and SVM-SMOTE.
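As a rough illustration of the regular SMOTE idea described in [R31], the sketch below builds each synthetic point by interpolating between a minority sample and one of its nearest minority neighbours. The function smote_sketch and its arguments are hypothetical and written in pure Python; this is not the imblearn implementation, and the borderline and SVM variants are not covered.

```python
import math
import random


def smote_sketch(minority, n_new, k_neighbors=5, seed=None):
    """Create n_new synthetic points from a list of minority samples.

    Hypothetical helper for illustration only, not the imblearn code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k_neighbors])
        # Interpolate: the new point lies on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=5, k_neighbors=2, seed=0)
```

Because every new point is a convex combination of two existing minority samples, the synthetic data stays inside the region already spanned by the minority class.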
Parameters: ratio : str, dict, or callable, optional (default='auto')
Ratio to use for resampling the data set.
- If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes apart from the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for over-sampling methods and 'not minority' for under-sampling methods. The targeted classes will be over-sampled or under-sampled to reach the same number of samples as the majority or minority class, respectively.
- If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
- If callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.
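The three accepted forms of ratio can be pictured as all resolving to a mapping from class label to desired sample count. The sketch below is a simplified reading of the description above for the over-sampling case; resolve_over_sampling_ratio is a hypothetical helper, not a function of the library, and the library itself may validate or reject some combinations that this sketch accepts.

```python
from collections import Counter


def resolve_over_sampling_ratio(ratio, y):
    """Return {class_label: desired_count} (hypothetical, illustrative helper)."""
    counts = Counter(y)
    n_majority = max(counts.values())
    majority = max(counts, key=counts.get)
    minority = min(counts, key=counts.get)
    if callable(ratio):
        # A callable receives y and returns the dict itself.
        return dict(ratio(y))
    if isinstance(ratio, dict):
        # A dict already maps targeted classes to desired counts.
        return dict(ratio)
    if ratio == 'auto':
        # For over-sampling methods, 'auto' corresponds to 'all'.
        ratio = 'all'
    targeted = {
        'minority': [minority],
        'majority': [majority],
        'not minority': [c for c in counts if c != minority],
        'all': list(counts),
    }[ratio]
    # Simplification: every targeted class is raised to the majority count.
    return {label: n_majority for label in targeted}


y = [0] * 100 + [1] * 900
auto_targets = resolve_over_sampling_ratio('auto', y)
dict_targets = resolve_over_sampling_ratio({0: 500}, y)
callable_targets = resolve_over_sampling_ratio(lambda labels: {0: 300}, y)
```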
random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
k : int, optional (default=None)
Number of nearest neighbours used to construct synthetic samples.
Deprecated since version 0.2: k is deprecated from 0.2 and will be replaced in 0.4. Use k_neighbors instead.
k_neighbors : int or object, optional (default=5)
If int, number of nearest neighbours used to construct synthetic samples. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the k_neighbors.
m : int, optional (default=None)
Number of nearest neighbours to use to determine if a minority sample is in danger. Used with kind={'borderline1', 'borderline2', 'svm'}.
Deprecated since version 0.2: m is deprecated from 0.2 and will be replaced in 0.4. Use m_neighbors instead.
m_neighbors : int or object, optional (default=10)
If int, number of nearest neighbours to use to determine if a minority sample is in danger. Used with kind={'borderline1', 'borderline2', 'svm'}. If object, an estimator that inherits from sklearn.neighbors.base.KNeighborsMixin that will be used to find the m_neighbors.
out_step : float, optional (default=0.5)
Step size when extrapolating. Used with kind='svm'.
kind : str, optional (default='regular')
The type of SMOTE algorithm to use. One of the following options: 'regular', 'borderline1', 'borderline2', 'svm'.
svm_estimator : object, optional (default=SVC())
If kind='svm', a parametrized sklearn.svm.SVC classifier can be passed.
n_jobs : int, optional (default=1)
The number of threads to open if possible.
Notes
See the original papers: [R31], [R32], [R33] for more details.
Supports multi-class resampling.
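The notion of a minority sample being "in danger", which m_neighbors controls, comes from the Borderline-SMOTE paper [R32]. A minimal sketch of that rule follows, assuming Euclidean distance and neighbours taken from the whole data set: a minority sample is treated as noise when all of its m nearest neighbours belong to other classes, as in danger when at least half (but not all) do, and as safe otherwise. The helper classify_minority_sample is hypothetical and is not part of the library.

```python
import math


def classify_minority_sample(index, X, y, minority_label, m_neighbors=10):
    """Label one minority sample as 'noise', 'danger' or 'safe' (illustrative)."""
    # Distances from the sample to every other sample, whatever its class.
    distances = sorted(
        (math.dist(X[index], X[j]), j) for j in range(len(X)) if j != index
    )
    nearest = [j for _, j in distances[:m_neighbors]]
    # Count how many of the nearest neighbours are not minority samples.
    n_majority = sum(1 for j in nearest if y[j] != minority_label)
    m = len(nearest)
    if n_majority == m:
        return 'noise'
    if n_majority >= m / 2:
        return 'danger'
    return 'safe'


# One-dimensional toy data: label 1 is the minority class.
X = [(0.0,), (1.0,), (2.5,), (6.0,), (30.0,),
     (7.0,), (9.0,), (10.0,), (11.0,), (29.0,), (31.0,), (32.0,)]
y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
labels = [
    classify_minority_sample(i, X, y, minority_label=1, m_neighbors=3)
    for i in range(5)
]
```

In this toy data the three clustered minority points are safe, the point at 6.0 sits on the class border and is in danger, and the isolated point at 30.0 is surrounded by the other class and counts as noise.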
References
[R31] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: synthetic minority over-sampling technique,” Journal of artificial intelligence research, 321-357, 2002.
[R32] H. Han, W. Wen-Yuan, M. Bing-Huan, “Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning,” Advances in intelligent computing, 878-887, 2005.
[R33] H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), pp.4-21, 2011.
Examples
>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.over_sampling import SMOTE
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape {}'.format(Counter(y)))
Original dataset shape Counter({1: 900, 0: 100})
>>> sm = SMOTE(random_state=42)
>>> X_res, y_res = sm.fit_sample(X, y)
>>> print('Resampled dataset shape {}'.format(Counter(y_res)))
Resampled dataset shape Counter({0: 900, 1: 900})
Methods
fit(X, y)    Find the class statistics before performing sampling.
fit_sample(X, y)    Fit the statistics and resample the data directly.
get_params([deep])    Get parameters for this estimator.
sample(X, y)    Resample the dataset.
set_params(**params)    Set the parameters of this estimator.
__init__(ratio='auto', random_state=None, k=None, k_neighbors=5, m=None, m_neighbors=10, out_step=0.5, kind='regular', svm_estimator=None, n_jobs=1)
fit(X, y)
Find the class statistics before performing sampling.
Parameters: X : ndarray, shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : ndarray, shape (n_samples, )
Corresponding label for each sample in X.
Returns: self : object
Return self.
fit_sample(X, y)
Fit the statistics and resample the data directly.
Parameters: X : ndarray, shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : ndarray, shape (n_samples, )
Corresponding label for each sample in X.
Returns: X_resampled : ndarray, shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : ndarray, shape (n_samples_new)
The corresponding label of X_resampled
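The relationship between fit, sample and fit_sample can be pictured with a toy class: fit only records class statistics, sample uses them to produce the resampled data, and fit_sample chains the two. ToyOverSampler below is a hypothetical stand-in that duplicates existing samples at random instead of synthesising new ones, so it illustrates the call pattern only, not SMOTE itself.

```python
import random
from collections import Counter


class ToyOverSampler:
    """Hypothetical random over-sampler mimicking fit / sample / fit_sample."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        # Record the class statistics only; the data are left untouched.
        self.stats_ = Counter(y)
        return self

    def sample(self, X, y):
        # Duplicate random samples of each smaller class until it
        # reaches the size of the largest class.
        rng = random.Random(self.random_state)
        target = max(self.stats_.values())
        X_res, y_res = list(X), list(y)
        for label, count in self.stats_.items():
            pool = [x for x, lab in zip(X, y) if lab == label]
            for _ in range(target - count):
                X_res.append(rng.choice(pool))
                y_res.append(label)
        return X_res, y_res

    def fit_sample(self, X, y):
        # Equivalent to calling fit and then sample on the same data.
        return self.fit(X, y).sample(X, y)


X = [[i] for i in range(10)]
y = [0, 0] + [1] * 8
X_res, y_res = ToyOverSampler(random_state=0).fit_sample(X, y)

# The two-step form gives the same result as the one-step form.
two_step = ToyOverSampler(random_state=0)
two_step.fit(X, y)
X_res2, y_res2 = two_step.sample(X, y)
```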
get_params(deep=True)
Get parameters for this estimator.
Parameters: deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params : mapping of string to any
Parameter names mapped to their values.
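To make the deep option concrete, the sketch below mimics the scikit-learn convention in which parameters of a contained estimator are exposed under names of the form parent__child. ParamsMixin, ToyNeighbors and ToySampler are hypothetical classes written for this illustration; they only assume that each constructor argument is stored as an attribute of the same name.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin sketching get_params / set_params."""

    def get_params(self, deep=True):
        params = {}
        # Parameter names are read from the constructor signature.
        for name in inspect.signature(self.__init__).parameters:
            value = getattr(self, name)
            params[name] = value
            # With deep=True, also expose parameters of nested estimators.
            if deep and hasattr(value, 'get_params'):
                for sub_name, sub_value in value.get_params().items():
                    params[name + '__' + sub_name] = sub_value
        return params

    def set_params(self, **params):
        valid = self.get_params(deep=False)
        for key, value in params.items():
            name, _, sub_name = key.partition('__')
            if name not in valid:
                raise ValueError('Invalid parameter %r' % name)
            if sub_name:
                # Forward 'parent__child' to the nested estimator.
                getattr(self, name).set_params(**{sub_name: value})
            else:
                setattr(self, name, value)
        return self


class ToyNeighbors(ParamsMixin):
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors


class ToySampler(ParamsMixin):
    def __init__(self, k_neighbors=5, kind='regular'):
        self.k_neighbors = k_neighbors
        self.kind = kind


sampler = ToySampler(k_neighbors=ToyNeighbors(n_neighbors=3))
shallow = sampler.get_params(deep=False)
deep = sampler.get_params(deep=True)
sampler.set_params(kind='svm', k_neighbors__n_neighbors=7)
```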
sample(X, y)
Resample the dataset.
Parameters: X : ndarray, shape (n_samples, n_features)
Matrix containing the data which have to be sampled.
y : ndarray, shape (n_samples, )
Corresponding label for each sample in X.
Returns: X_resampled : ndarray, shape (n_samples_new, n_features)
The array containing the resampled data.
y_resampled : ndarray, shape (n_samples_new)
The corresponding label of X_resampled