imbalanced-learn
0.3.0.dev0

imblearn.under_sampling.TomekLinks

class imblearn.under_sampling.TomekLinks(ratio='auto', return_indices=False, random_state=None, n_jobs=1)

Class to perform under-sampling by removing Tomek’s links.
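For orientation, a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below illustrates that definition in plain Python on a toy 2-D data set; the helper names `nearest_neighbour` and `find_tomek_links` are hypothetical and are not part of imbalanced-learn, whose actual implementation may differ.

```python
import math


def nearest_neighbour(points, i):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def find_tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different labels (illustrative helper)."""
    nn = [nearest_neighbour(points, i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


# Toy data: three tight pairs of points; only the first pair mixes classes.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0), (5.1, 5.0)]
labels = [0, 1, 1, 1, 0, 0]

links = find_tomek_links(points, labels)
print(links)
```

Under-sampling by Tomek links then amounts to dropping, from each detected pair, the sample that belongs to a class targeted for resampling.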

Parameters:

ratio : str, dict, or callable, optional (default=’auto’)

Ratio to use for resampling the data set.

  • If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes except the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for over-sampling methods and 'not minority' for under-sampling methods. The targeted classes will be over-sampled or under-sampled to achieve an equal number of samples with the majority or minority class.
  • If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
  • If callable, a function taking y and returning a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.
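The string options above can be read as a rule for choosing which classes get resampled. The following is a minimal sketch of that rule only; `targeted_classes` and its `sampling_type` argument are hypothetical names used for illustration and do not reproduce the library's internal validation code.

```python
from collections import Counter


def targeted_classes(y, ratio, sampling_type="under-sampling"):
    """Return the set of class labels selected by a string `ratio`
    (illustrative helper, not part of imbalanced-learn)."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)

    # 'auto' depends on the kind of sampler, as described in the list above.
    if ratio == "auto":
        ratio = "not minority" if sampling_type == "under-sampling" else "all"

    if ratio == "minority":
        return {minority}
    if ratio == "majority":
        return {majority}
    if ratio == "not minority":
        return set(counts) - {minority}
    if ratio == "all":
        return set(counts)
    raise ValueError("Unknown ratio string: {!r}".format(ratio))


# Three classes with 10, 50 and 40 samples respectively.
y = [0] * 10 + [1] * 50 + [2] * 40

under_auto = targeted_classes(y, "auto")
over_auto = targeted_classes(y, "auto", sampling_type="over-sampling")
print(under_auto, over_auto)
```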

return_indices : bool, optional (default=False)

Whether or not to also return the indices of the samples kept after resampling.

random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

n_jobs : int, optional (default=1)

The number of threads to open if possible.

Notes

This method is based on [R42].

Supports multi-class resampling.

References

[R42] I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.

Examples

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> from imblearn.under_sampling import TomekLinks 
>>> X, y = make_classification(n_classes=2, class_sep=2,
... weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
... n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
>>> print('Original dataset shape {}'.format(Counter(y)))
Original dataset shape Counter({1: 900, 0: 100})
>>> tl = TomekLinks(random_state=42)
>>> X_res, y_res = tl.fit_sample(X, y)
>>> print('Resampled dataset shape {}'.format(Counter(y_res)))
Resampled dataset shape Counter({1: 897, 0: 100})

Methods

fit(X, y) Find the class statistics before performing sampling.
fit_sample(X, y) Fit the statistics and resample the data directly.
get_params([deep]) Get parameters for this estimator.
is_tomek(y, nn_index, class_type) Use the target vector and the first neighbour of every sample point to look for Tomek pairs.
sample(X, y) Resample the dataset.
set_params(**params) Set the parameters of this estimator.
fit(X, y)

Find the class statistics before performing sampling.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

self : object

Return self.

fit_sample(X, y)

Fit the statistics and resample the data directly.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

X_resampled : ndarray, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : ndarray, shape (n_samples_new)

The corresponding labels of X_resampled.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

static is_tomek(y, nn_index, class_type)

Use the target vector and the first neighbour of every sample point to look for Tomek pairs. Return a boolean vector with True for majority samples that are part of a Tomek link.

Parameters:

y : ndarray, shape (n_samples, )

Target vector of the data set, needed to keep track of whether a sample belongs to the minority class or not.

nn_index : ndarray, shape (len(y), )

The index of the closest neighbour of each sample point.

class_type : int or str

The label of the minority class.

Returns:

is_tomek : ndarray, shape (len(y), )

Boolean vector of length n_samples, with True for majority samples that are Tomek links.
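As a rough illustration of the behaviour documented above, the sketch below flags a non-minority sample when its nearest neighbour belongs to the minority class and that neighbour's nearest neighbour is the sample itself. The function name `is_tomek_sketch` is hypothetical; this is plain Python written from the description here, not the library's source.

```python
def is_tomek_sketch(y, nn_index, class_type):
    """Return a list of booleans, True for majority samples that form
    a Tomek link with a minority sample (illustrative only)."""
    links = [False] * len(y)
    for i, label in enumerate(y):
        if label == class_type:
            # Minority samples are never flagged for removal.
            continue
        j = nn_index[i]
        # Tomek link: the neighbour is a minority sample and the
        # nearest-neighbour relation is mutual.
        if y[j] == class_type and nn_index[j] == i:
            links[i] = True
    return links


# Samples 0 and 1 are mutual neighbours with different labels;
# samples 2 and 3 are mutual neighbours with the same label.
y = [0, 1, 1, 1]
nn_index = [1, 0, 3, 2]

mask = is_tomek_sketch(y, nn_index, class_type=0)
print(mask)
```

Only sample 1 is flagged: it is a majority sample paired with the minority sample 0, whereas the minority sample itself is left untouched.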

sample(X, y)

Resample the dataset.

Parameters:

X : ndarray, shape (n_samples, n_features)

Matrix containing the data which have to be sampled.

y : ndarray, shape (n_samples, )

Corresponding label for each sample in X.

Returns:

X_resampled : ndarray, shape (n_samples_new, n_features)

The array containing the resampled data.

y_resampled : ndarray, shape (n_samples_new)

The corresponding labels of X_resampled.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:

self
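To make the <component>__<parameter> naming concrete, the sketch below splits a flat parameter dict into the estimator's own parameters and those addressed to nested components. The helper `split_nested_params` and the component names `tomek` and `clf` are hypothetical and only illustrate the naming convention; they do not reproduce how set_params is implemented.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' ones
    (illustrative helper for the naming convention only)."""
    own, nested = {}, {}
    for key, value in params.items():
        component, sep, sub_key = key.partition("__")
        if sep:
            # 'tomek__ratio' -> component 'tomek', parameter 'ratio'
            nested.setdefault(component, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"n_jobs": 2, "tomek__ratio": "majority", "clf__C": 0.5}
)
print(own)
print(nested)
```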

© Copyright 2016, G. Lemaitre, F. Nogueira, D. Oliveira, C. Aridas.
