Datasets

skfair.datasets.fetch_adult(data_home=None, give_pandas=False, download_if_missing=True, return_X_y=False)[source]

Load the Adult Income dataset, downloading it from GitHub if necessary.

Parameters
  • data_home – Optional (default: None). An alternative download and cache folder for the datasets. By default, all scikit-learn data is stored in ‘~/scikit_learn_data’ subfolders.

  • give_pandas – If True, return a pandas DataFrame instead of X, y matrices (default=False).

  • download_if_missing – If False, raise an IOError when the data is not available locally instead of trying to download it from the source site (default=True).

  • return_X_y – If True, returns (data, target) instead of a dictionary. See below for more information about the data and target object.

Example

>>> from skfair.datasets import fetch_adult
>>> X, y = fetch_adult(return_X_y=True)
>>> X.shape
(32561, 14)
>>> y.shape
(32561,)
>>> fetch_adult(give_pandas=True).columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education.num',
       'marital.status', 'occupation', 'relationship', 'race', 'sex',
       'capital.gain', 'capital.loss', 'hours.per.week', 'native.country',
       'income'],
      dtype='object')
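
The data_home and download_if_missing parameters can be combined to prefer a local cache and only download when the cache is empty. The following is a minimal sketch, not part of skfair: load_offline_first and its fetch argument are hypothetical names introduced here for illustration. It assumes, as documented above, that an IOError is raised when download_if_missing=False and the data is not cached.

```python
def load_offline_first(cache_dir, fetch=None):
    """Return (X, y), using the local cache first and downloading only on a miss.

    `fetch` is a hypothetical injection point so the logic can be exercised
    without network access; by default it is skfair's fetch_adult.
    """
    if fetch is None:
        # Imported lazily so this sketch can be read and tested without skfair.
        from skfair.datasets import fetch_adult as fetch
    try:
        # First attempt: never touch the network.
        return fetch(data_home=cache_dir, download_if_missing=False, return_X_y=True)
    except IOError:
        # Cache miss: allow the download and populate cache_dir.
        return fetch(data_home=cache_dir, download_if_missing=True, return_X_y=True)
```

Calling load_offline_first with a folder of your choice would then only reach the source site on the first run, provided the assumption about IOError holds.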

skfair.datasets.load_arrests(return_X_y=False, give_pandas=False)[source]

Loads the arrests dataset, which can serve as a benchmark for fairness. It contains data on the police treatment of individuals arrested in Toronto for simple possession of small quantities of marijuana. The goal is to predict whether or not the arrestee was released with a summons while maintaining a degree of fairness.

Parameters
  • return_X_y – If True, returns (data, target) instead of a dict object.

  • give_pandas – If True, return a pandas DataFrame instead of X, y matrices (default=False).

Example

>>> from skfair.datasets import load_arrests
>>> X, y = load_arrests(return_X_y=True)
>>> X.shape
(5226, 7)
>>> y.shape
(5226,)
>>> load_arrests(give_pandas=True).columns
Index(['released', 'colour', 'year', 'age', 'sex', 'employed', 'citizen',
       'checks'],
      dtype='object')
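
One way to make "a degree of fairness" concrete is to compare the rate of the favourable outcome across groups, for example the release rate per value of the colour column. The sketch below is only one possible measure (often called the demographic parity gap) and is not something skfair prescribes here. The helper names and the toy values are hypothetical; they are not taken from the arrests data.

```python
from collections import defaultdict


def positive_rate_by_group(groups, outcomes, positive):
    """Return {group: share of rows in that group whose outcome equals `positive`}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        if outcome == positive:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def parity_gap(rates):
    """Difference between the highest and lowest group rate (0 means equal rates)."""
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): group labels and a binary outcome per row.
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

rates = positive_rate_by_group(toy_groups, toy_outcomes, positive=1)
gap = parity_gap(rates)
print(rates, gap)
```

With the real data, the groups could be the colour column and the outcomes either the released column or a model's predictions of it.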

The dataset was copied from the carData R package and can originally be found in:

  • Personal communication from Michael Friendly, York University.

The documentation page of the dataset from the package can be viewed here: http://vincentarelbundock.github.io/Rdatasets/doc/carData/Arrests.html

skfair.datasets.load_boston(return_X_y=False, give_pandas=False)[source]

Loads the Boston housing dataset, which can serve as a benchmark for fairness. scikit-learn deprecated this dataset in version 1.0 and removed it in version 1.2 because of serious problems with it. In particular, one column (named b) refers to the skin color of inhabitants.

You can read all about it here:

Parameters
  • return_X_y – If True, returns (data, target) instead of a dict object.

  • give_pandas – If True, return a pandas DataFrame instead of X, y matrices (default=False).

Example

>>> from skfair.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> X.shape
(506, 13)
>>> y.shape
(506,)
>>> load_boston(give_pandas=True).columns
Index(['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax',
       'ptratio', 'b', 'lstat', 'price'],
      dtype='object')