imblearn.datasets.fetch_datasets

imblearn.datasets.fetch_datasets(data_home=None, filter_data=None, download_if_missing=True, random_state=None, shuffle=False)

Load the benchmark datasets from Zenodo, downloading them if necessary.

Parameters:

data_home : string, optional (default=None)

Specify another download and cache folder for the datasets. By default, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

filter_data : tuple of str/int or None, optional (default=None)

A tuple containing the IDs or the names of the datasets to be returned. Refer to the table in the Notes section for the ID and name of each dataset.

download_if_missing : boolean, optional (default=True)

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

random_state : int, RandomState instance or None, optional (default=None)

Random state for shuffling the dataset. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

shuffle : bool, optional (default=False)

Whether to shuffle the dataset.
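As a rough illustration of what a fixed random_state gives you when shuffle=True, the sketch below shuffles a toy data/target pair with NumPy. The helper shuffle_rows and the toy arrays are hypothetical and are not the library's implementation; the sketch only shows that an integer seed yields the same row order on every call and keeps each sample paired with its label.

```python
import numpy as np


def shuffle_rows(data, target, seed):
    """Hypothetical helper: shuffle the rows of data and target together."""
    rng = np.random.RandomState(seed)      # an int seeds a fresh generator
    order = rng.permutation(len(target))   # one permutation shared by both arrays
    return data[order], target[order]


# Toy arrays, made up for illustration only.
data = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
target = np.arange(5)                # label i belongs to row i

first_data, first_target = shuffle_rows(data, target, seed=0)
second_data, second_target = shuffle_rows(data, target, seed=0)

# Same seed, same order; rows and labels stay aligned.
print(np.array_equal(first_data, second_data))
print(np.array_equal(first_data[:, 0], 2 * first_target))
```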

Returns:

datasets : OrderedDict of Bunch objects

The ordering is defined by filter_data. Each Bunch object, referred to as dataset, has the following attributes:

dataset.data : ndarray, shape (n_samples, n_features)

dataset.target : ndarray, shape (n_samples, )

dataset.DESCR : string

Description of the dataset.
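The returned mapping can be inspected with a few lines of code. The following is a minimal sketch, not part of the library: summarize and fetch_and_summarize are hypothetical helper names, and the toy mapping stands in for a real result so that the summary logic can be tried without downloading anything. It assumes the returned dictionary is keyed by dataset name.

```python
from types import SimpleNamespace

import numpy as np


def summarize(datasets):
    """Return {name: (data_shape, class_counts)} for a mapping of Bunch-like objects."""
    summary = {}
    for name, dataset in datasets.items():
        labels, counts = np.unique(dataset.target, return_counts=True)
        summary[name] = (dataset.data.shape, dict(zip(labels.tolist(), counts.tolist())))
    return summary


def fetch_and_summarize(names):
    """Fetch the named datasets (this may download them) and summarize them."""
    # Imported lazily so that summarize() can be used without imbalanced-learn.
    from imblearn.datasets import fetch_datasets

    return summarize(fetch_datasets(filter_data=tuple(names)))


# Toy stand-in with the same attributes as a Bunch (data, target); values are made up.
toy = {"toy": SimpleNamespace(data=np.zeros((4, 2)), target=np.array([1, -1, -1, -1]))}
toy_summary = summarize(toy)
print(toy_summary)
```

Calling fetch_and_summarize(("ecoli", "abalone_19")) would apply the same summary to real datasets; it is not invoked here because it may trigger a download.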

Notes

This collection of datasets was proposed in [R24]. The characteristics of the available datasets are presented in the table below, where #S is the number of samples and #F is the number of features.

ID  Name            Repository & Target            Ratio  #S       #F
1   ecoli           UCI, target: imU               8.6:1  336      7
2   optical_digits  UCI, target: 8                 9.1:1  5,620    64
3   satimage        UCI, target: 4                 9.3:1  6,435    36
4   pen_digits      UCI, target: 5                 9.4:1  10,992   16
5   abalone         UCI, target: 7                 9.7:1  4,177    10
6   sick_euthyroid  UCI, target: sick euthyroid    9.8:1  3,163    42
7   spectrometer    UCI, target: >=44              11:1   531      93
8   car_eval_34     UCI, target: good, v good      12:1   1,728    21
9   isolet          UCI, target: A, B              12:1   7,797    617
10  us_crime        UCI, target: >0.65             12:1   1,994    100
11  yeast_ml8       LIBSVM, target: 8              13:1   2,417    103
12  scene           LIBSVM, target: >one label     13:1   2,407    294
13  libras_move     UCI, target: 1                 14:1   360      90
14  thyroid_sick    UCI, target: sick              15:1   3,772    52
15  coil_2000       KDD, CoIL, target: minority    16:1   9,822    85
16  arrhythmia      UCI, target: 06                17:1   452      278
17  solar_flare_m0  UCI, target: M->0              19:1   1,389    32
18  oil             UCI, target: minority          22:1   937      49
19  car_eval_4      UCI, target: vgood             26:1   1,728    21
20  wine_quality    UCI, wine, target: <=4         26:1   4,898    11
21  letter_img      UCI, target: Z                 26:1   20,000   16
22  yeast_me2       UCI, target: ME2               28:1   1,484    8
23  webpage         LIBSVM, w7a, target: minority  33:1   34,780   300
24  ozone_level     UCI, ozone, data               34:1   2,536    72
25  mammography     UCI, target: minority          42:1   11,183   6
26  protein_homo    KDD CUP 2004, minority         111:1  145,751  74
27  abalone_19      UCI, target: 19                130:1  4,177    10
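To make the Ratio column concrete, here is a small sketch of how such a figure can be computed from a label vector, assuming the column reports the majority-to-minority class count ratio. The helper imbalance_ratio is a hypothetical name, and the labels are synthetic: the 301/35 split is an assumption chosen to match the ecoli row (336 samples at about 8.6:1), not data read from the dataset.

```python
from collections import Counter


def imbalance_ratio(target):
    """Return the majority-to-minority class count ratio of a label sequence."""
    counts = Counter(target)
    return max(counts.values()) / min(counts.values())


# Synthetic labels: 301 majority and 35 minority samples (assumed split, 336 in total).
target = [-1] * 301 + [1] * 35
ratio = imbalance_ratio(target)
print(f"{ratio:.1f}:1")  # prints 8.6:1
```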

References

[R24] Ding, Zejin, “Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics.” Dissertation, Georgia State University (2011).