Data preprocessing#

This notebook explores preprocessing requirements for linear models.

Numerical features#

Linear models are sensitive to data scale. While we did not preprocess data in the previous notebook, we should understand this sensitivity.

Let’s examine a simple example.

# When using JupyterLite, uncomment and run the following line to install the `skrub` package.
# %pip install skrub
import matplotlib.pyplot as plt
import skrub

skrub.patch_display()  # make nice display for pandas tables
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)
data

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(data, target)
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
LogisticRegression()

The model raises a ConvergenceWarning: the solver reached its iteration limit before finding weights that minimize the loss function.

EXERCISE

  1. LogisticRegression uses an LBFGS solver that iterates to find a solution. Check how many iterations it took and compare with the default in the documentation.

  2. Increase the number of iterations. Find the minimum number needed to avoid the convergence warning.

  3. Instead of increasing iterations, scale the data with StandardScaler before fitting. Note the new iteration count.

# Write your code here.

We did not split the data into training and testing sets in the previous exercise. Let's do that now:

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

Scikit-learn’s Pipeline is a powerful tool that chains transformations and a final estimator. We can connect a StandardScaler and LogisticRegression like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(
    steps=[("scaler", StandardScaler()), ("logistic_regression", LogisticRegression())]
)
model.fit(data_train, target_train)
Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic_regression', LogisticRegression())])
print(f"Feature mean on training set: {model[0].mean_}")
print(f"Feature standard deviation on training set: {model[0].scale_}")
Feature mean on training set: [5.83035714 3.04017857 3.80714286 1.21428571]
Feature standard deviation on training set: [0.81545772 0.43516414 1.72754545 0.74460646]
print(f"Number of iterations: {model[-1].n_iter_}")
Number of iterations: [15]

EXERCISE

The output shows that StandardScaler computed feature means and standard deviations from the training set. It uses these statistics to center and scale data before passing it to LogisticRegression.

How do you think StandardScaler behaves on the test set? Consider this code:

from sklearn.metrics import accuracy_score

predicted_target = model.predict(data_test)

print(f"Accuracy on testing set: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy on testing set: 1.000
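
To convince yourself, you can apply the fitted scaler to the test set by hand (a minimal check, not part of the exercise). Because the transform reuses the mean and scale learned on the training set, the scaled test features have a mean close to, but not exactly, zero.

# Apply the scaler fitted on the training set to the test set.
# The training statistics are reused, so the scaled test mean is only
# approximately zero.
scaled_test = model[0].transform(data_test)
print(f"Scaled test set feature means: {scaled_test.mean(axis=0)}")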

Categorical features#

We’ve shown how linear models benefit from feature scaling. Now let’s examine categorical features using the penguins dataset.

import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")
penguins

Categorical features take discrete values. Here’s an example from the penguins dataset:

penguins["Sex"]
0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...  
339      MALE
340    FEMALE
341      MALE
342      MALE
343    FEMALE
Name: Sex, Length: 344, dtype: object
penguins["Sex"].value_counts()
Sex
MALE      168
FEMALE    165
.           1
Name: count, dtype: int64

These categories use non-numeric values. Models cannot process them directly, so we must convert categories to numbers.

We can use two main strategies:

  • Ordinal encoding: assigns a distinct integer to each category

  • One-hot encoding: creates one binary feature per category

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
encoder.fit_transform(penguins[["Sex"]])
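Since the rendered table is not shown here, we can inspect the generated columns directly. With this column, you should see one binary feature per category, including one for the missing values (a quick check):

# Each category observed during `fit` (including the missing-value marker)
# becomes its own binary column.
print(encoder.get_feature_names_out())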

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().set_output(transform="pandas")
encoder.fit_transform(penguins[["Sex"]])
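Here as well, the learned mapping can be inspected: `categories_` lists the categories in the order of the integers assigned to them (a quick check):

# The category at position i in this list is encoded as the value i.
print(encoder.categories_)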

EXERCISE

  1. List advantages and disadvantages of both encoding strategies

  2. Create a Pipeline that chains an encoder with a LogisticRegression model

  3. Use cross-validation to evaluate model performance

# Write your code here.

Combine numerical and categorical features#

Scikit-learn’s ColumnTransformer helps us handle both numerical and categorical features. Let’s prepare our dataset:

categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
data

Our data contains missing values. For now, we simply drop the rows that contain missing values and select the matching rows of the target. We'll address this topic more thoroughly in the next section.

data = data.dropna()
target = target.loc[data.index]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), numerical_features),
        ("categorical", OneHotEncoder(), categorical_features),
    ]
)
preprocessor
ColumnTransformer(transformers=[('numerical', StandardScaler(),
                                 ['Culmen Length (mm)', 'Culmen Depth (mm)']),
                                ('categorical', OneHotEncoder(),
                                 ['Island', 'Sex'])])

The ColumnTransformer dispatches each subset of columns to its corresponding transformer and concatenates the transformed outputs side by side.
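
To check what the preprocessor produces, we can fit it on its own and list its output feature names: the two scaled numerical columns followed by the one-hot encoded categories. This is a quick sanity check and is not required for the pipeline below.

# Fit the preprocessor alone to inspect the columns it generates.
preprocessor.fit(data_train)
print(preprocessor.get_feature_names_out())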

We can chain it with LogisticRegression:

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logistic_regression", LogisticRegression()),
    ]
)
model
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numerical', StandardScaler(),
                                                  ['Culmen Length (mm)',
                                                   'Culmen Depth (mm)']),
                                                 ('categorical',
                                                  OneHotEncoder(),
                                                  ['Island', 'Sex'])])),
                ('logistic_regression', LogisticRegression())])
model.fit(data_train, target_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numerical', StandardScaler(),
                                                  ['Culmen Length (mm)',
                                                   'Culmen Depth (mm)']),
                                                 ('categorical',
                                                  OneHotEncoder(),
                                                  ['Island', 'Sex'])])),
                ('logistic_regression', LogisticRegression())])
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.976

Dealing with missing values#

Let’s reload our dataset with missing values intact:

categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
data

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

Try fitting the previous model again. What happens?

# Write your code here.

Models that don't handle missing values natively require imputation: replacing each missing value with a value computed from the data.
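
As a minimal illustration, assuming the mean strategy on the numerical columns (other strategies such as "median" or "most_frequent" also exist):

from sklearn.impute import SimpleImputer

# Replace each missing numerical value with the mean of its column,
# computed from the data passed to `fit`.
imputer = SimpleImputer(strategy="mean")
imputer.fit_transform(data_train[numerical_features])
print(f"Statistics used for imputation: {imputer.statistics_}")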

EXERCISE

Build a model that chains ColumnTransformer, SimpleImputer, and LogisticRegression.

# Write your code here.

skrub to help you out#

The skrub library offers utilities for baseline preprocessing. Use tabular_learner to quickly build a pipeline:

model = skrub.tabular_learner(estimator=LogisticRegression())
model
Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
model.fit(data_train, target_train)
Pipeline(steps=[('tablevectorizer', TableVectorizer()),
                ('simpleimputer', SimpleImputer(add_indicator=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression())])
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.988