imblearn.datasets.fetch_datasets

imblearn.datasets.fetch_datasets(data_home=None, filter_data=None, download_if_missing=True, random_state=None, shuffle=False)

Load the benchmark datasets from Zenodo, downloading them if necessary.

Parameters:

data_home : string, optional (default=None)

Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

filter_data : tuple of str/int or None, optional (default=None)

A tuple containing the IDs or the names of the datasets to be returned. Refer to the table in the Notes section to get the ID and name of each dataset.

download_if_missing : boolean, optional (default=True)

If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

random_state : int, RandomState instance or None, optional (default=None)

Random state used for shuffling the dataset. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

shuffle : bool, optional (default=False)

Whether to shuffle the dataset.

Returns:

datasets : OrderedDict of Bunch objects

The order is defined by filter_data. Each Bunch object, referred to as dataset, has the following attributes:

dataset.data : ndarray, shape (n_samples, n_features)

dataset.target : ndarray, shape (n_samples, )

dataset.DESCR : string

Description of each dataset.
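A minimal usage sketch, assuming imbalanced-learn is installed and the Zenodo archive is reachable or already cached. The helper names `summarize` and `fetch_and_summarize` are hypothetical and introduced only for illustration; only `fetch_datasets` and its `filter_data` parameter come from the description above. The sketch also assumes that the keys of the returned mapping are the dataset names.

```python
def summarize(datasets):
    """Return one (name, n_samples, n_features) tuple per dataset.

    `datasets` is any mapping of name -> object exposing a 2-D `data`
    array, such as the OrderedDict of Bunch objects described above.
    """
    rows = []
    for name, dataset in datasets.items():
        # data has shape (n_samples, n_features)
        n_samples, n_features = dataset.data.shape
        rows.append((name, n_samples, n_features))
    return rows


def fetch_and_summarize(names=("ecoli", "abalone_19")):
    """Fetch the named datasets and summarize them (may download data)."""
    # Imported lazily so summarize() stays usable without imbalanced-learn.
    from imblearn.datasets import fetch_datasets

    # The returned mapping follows the order given in filter_data.
    datasets = fetch_datasets(filter_data=names)
    return summarize(datasets)
```

If the table in the Notes section is accurate, calling `fetch_and_summarize()` should report 336 samples and 7 features for ecoli, followed by 4,177 samples and 10 features for abalone_19.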

Notes

This collection of datasets was proposed in [R24]. The characteristics of the available datasets are presented in the table below.

ID	Name	Repository & Target	Ratio	#Samples	#Features
1	ecoli	UCI, target: imU	8.6:1	336	7
2	optical_digits	UCI, target: 8	9.1:1	5,620	64
3	satimage	UCI, target: 4	9.3:1	6,435	36
4	pen_digits	UCI, target: 5	9.4:1	10,992	16
5	abalone	UCI, target: 7	9.7:1	4,177	10
6	sick_euthyroid	UCI, target: sick euthyroid	9.8:1	3,163	42
7	spectrometer	UCI, target: >=44	11:1	531	93
8	car_eval_34	UCI, target: good, v good	12:1	1,728	21
9	isolet	UCI, target: A, B	12:1	7,797	617
10	us_crime	UCI, target: >0.65	12:1	1,994	100
11	yeast_ml8	LIBSVM, target: 8	13:1	2,417	103
12	scene	LIBSVM, target: >one label	13:1	2,407	294
13	libras_move	UCI, target: 1	14:1	360	90
14	thyroid_sick	UCI, target: sick	15:1	3,772	52
15	coil_2000	KDD, CoIL, target: minority	16:1	9,822	85
16	arrhythmia	UCI, target: 06	17:1	452	278
17	solar_flare_m0	UCI, target: M->0	19:1	1,389	32
18	oil	UCI, target: minority	22:1	937	49
19	car_eval_4	UCI, target: vgood	26:1	1,728	21
20	wine_quality	UCI, wine, target: <=4	26:1	4,898	11
21	letter_img	UCI, target: Z	26:1	20,000	16
22	yeast_me2	UCI, target: ME2	28:1	1,484	8
23	webpage	LIBSVM, w7a, target: minority	33:1	34,780	300
24	ozone_level	UCI, ozone, data	34:1	2,536	72
25	mammography	UCI, target: minority	42:1	11,183	6
26	protein_homo	KDD CUP 2004, minority	11:1	145,751	74
27	abalone_19	UCI, target: 19	130:1	4,177	10
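The Ratio column can be reproduced from a target vector with a few lines of code. The sketch below assumes the ratio is the majority class count divided by the minority class count; `imbalance_ratio` is a hypothetical helper, and the label values and class counts are synthetic, chosen only so that the total (336) and the ratio match the ecoli row.

```python
from collections import Counter


def imbalance_ratio(target):
    """Return the majority-to-minority class count ratio of a label sequence."""
    counts = Counter(target)
    if len(counts) < 2:
        raise ValueError("at least two classes are required")
    return max(counts.values()) / min(counts.values())


# Synthetic labels (an assumption, not the real data):
# 301 majority samples and 35 minority samples, 336 in total.
labels = [-1] * 301 + [1] * 35
print(f"{imbalance_ratio(labels):.1f}:1")  # prints 8.6:1
```

The same helper can be applied to a dataset's `target` attribute after converting it to a plain list, for example with `dataset.target.tolist()`.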

References

[R24]

Ding, Zejin, “Diversified Ensemble Classifiers for Highly Imbalanced Data Learning and their Application in Bioinformatics.” Dissertation, Georgia State University, (2011).