Data preprocessing

This notebook explores preprocessing requirements for linear models.

Numerical features

Linear models are sensitive to the scale of their input features. We did not preprocess the data in the previous notebook, so it is worth seeing what this sensitivity looks like in practice.

Let’s examine a simple example.

# When using JupyterLite, you will need to uncomment and install the `skrub` package.
%pip install skrub
import matplotlib.pyplot as plt
import skrub

skrub.patch_display()  # make nice display for pandas tables
from sklearn.datasets import load_iris

data, target = load_iris(return_X_y=True, as_frame=True)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(data, target)
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/linear_model/ ConvergenceWarning: lbfgs failed to converge (status=1):

Increase the number of iterations (max_iter) or scale the data as shown in:
Please also refer to the documentation for alternative solver options:
  n_iter_i = _check_optimize_result(
The model raises a ConvergenceWarning: the solver reached its iteration limit before converging, so the fitted weights may not minimize the loss function.


  1. LogisticRegression uses an LBFGS solver that iterates to find a solution. Check how many iterations it took and compare with the default in the documentation.

  2. Increase the number of iterations. Find the minimum number needed to avoid the convergence warning.

  3. Instead of increasing iterations, scale the data with StandardScaler before fitting. Note the new iteration count.

# Write your code here.
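
One possible way to work through the steps above, offered as a minimal sketch rather than a reference solution. It raises `max_iter` well past the default of 100 (the value 10_000 is an arbitrary choice made here), then reads `n_iter_` with and without a `StandardScaler`. `make_pipeline` is used only for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data, target = load_iris(return_X_y=True, as_frame=True)

# Raise the iteration cap far above the default (100) so the solver can finish.
unscaled = LogisticRegression(max_iter=10_000)
unscaled.fit(data, target)

# Standardize the features first, then fit with the same cap.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))
scaled.fit(data, target)

# n_iter_ reports how many iterations the solver actually used.
print(f"Iterations without scaling: {unscaled.n_iter_}")
print(f"Iterations with scaling:    {scaled[-1].n_iter_}")
```

The iteration count reported for the unscaled model gives a starting point for the smallest `max_iter` that avoids the warning.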

In the previous exercise, we did not split the data into training and testing sets. Let us do that now:

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

Scikit-learn’s Pipeline is a powerful tool that chains transformations and a final estimator. We can connect a StandardScaler and LogisticRegression like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(
    steps=[("scaler", StandardScaler()), ("logistic_regression", LogisticRegression())]
)
model.fit(data_train, target_train)
Pipeline(steps=[('scaler', StandardScaler()),
                ('logistic_regression', LogisticRegression())])
print(f"Feature mean on training set: {model[0].mean_}")
print(f"Feature standard deviation on training set: {model[0].scale_}")
Feature mean on training set: [5.83035714 3.04017857 3.80714286 1.21428571]
Feature standard deviation on training set: [0.81545772 0.43516414 1.72754545 0.74460646]
print(f"Number of iterations: {model[-1].n_iter_}")
Number of iterations: [15]


The output shows that StandardScaler computed feature means and standard deviations from the training set. It uses these statistics to center and scale data before passing it to LogisticRegression.

How do you think StandardScaler behaves on the test set? Consider this code:

from sklearn.metrics import accuracy_score

predicted_target = model.predict(data_test)

print(f"Accuracy on testing set: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy on testing set: 1.000
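
To make the question about the test set concrete, the sketch below checks that `transform` reuses the statistics learned on the training set instead of recomputing them on the test set. It rebuilds the same iris split so it can run on its own. The name `manual_test` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data, target = load_iris(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

# fit() learns the mean and standard deviation from the training set only.
scaler = StandardScaler().fit(data_train)

# transform() applies those stored training statistics to the test set.
scaled_test = scaler.transform(data_test)
manual_test = (data_test.to_numpy() - scaler.mean_) / scaler.scale_

print(np.allclose(scaled_test, manual_test))
# The scaled test features are generally not exactly zero-mean,
# because the centering used the training mean.
print(scaled_test.mean(axis=0))
```

When the scaler sits inside a Pipeline, `predict` applies the same `transform` step, so test data never influences the scaling statistics.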

Categorical features

We’ve shown how linear models benefit from feature scaling. Now let’s examine categorical features using the penguins dataset.

import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")
Categorical features take discrete values. Here’s an example from the penguins dataset:

penguins["Sex"]
0        MALE
1      FEMALE
2      FEMALE
3         NaN
4      FEMALE
        ...
339      MALE
340    FEMALE
341      MALE
342      MALE
343    FEMALE
Name: Sex, Length: 344, dtype: object
penguins["Sex"].value_counts()
Sex
MALE      168
FEMALE    165
.           1
Name: count, dtype: int64

These categories use non-numeric values. Models cannot process them directly, so we must convert categories to numbers.

We can use two main strategies:

  • Ordinal encoding: Assigns a numeric value to each category

  • One-hot encoding: Creates binary features for each category

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder().set_output(transform="pandas")
  1. List advantages and disadvantages of both encoding strategies

  2. Create a Pipeline that chains an encoder with a LogisticRegression model

  3. Use cross-validation to evaluate model performance

# Write your code here.
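
A possible shape for steps 2 and 3, sketched on hypothetical toy data in which the label is fully determined by the category. That is far easier than the penguins problem, so the scores say nothing about the real dataset. All variable names and values below are assumptions of this example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 30 rows, three categories, three labels.
toy_data = pd.DataFrame({"Island": ["Biscoe", "Dream", "Torgersen"] * 10})
toy_target = pd.Series(["a", "b", "c"] * 10)

# handle_unknown="ignore" keeps predict from failing if a validation fold
# contains a category that was absent from the training fold.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"), LogisticRegression())

# For a classifier, an integer cv uses stratified folds by default.
scores = cross_val_score(model, toy_data, toy_target, cv=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the encoder lives inside the pipeline, it is refit on each training fold, so the validation fold never leaks into the encoding.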

Combine numerical and categorical features

Scikit-learn’s ColumnTransformer helps us handle both numerical and categorical features. Let’s prepare our dataset:

categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
Our data contains missing values. For now, we’ll simply drop rows with missing values in both data and target. We’ll address this topic more thoroughly in the next section.

data = data.dropna()
target = target.loc[data.index]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), numerical_features),
        ("categorical", OneHotEncoder(), categorical_features),
    ]
)
The ColumnTransformer splits columns and sends each subset to its appropriate transformer.

We can chain it with LogisticRegression:

model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logistic_regression", LogisticRegression()),
    ]
)
model.fit(data_train, target_train)
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.976

Dealing with missing values

Let’s reload our dataset with missing values intact:

categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

Try fitting the previous model again. What happens?

# Write your code here.

Models that cannot handle missing values need imputation: replacing each missing value with a value computed from the data, such as the column mean.
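
As a minimal illustration of imputation, the sketch below fills one missing entry in a hypothetical toy column with the mean of the observed entries. The array values are made up for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value.
values = np.array([[1.0], [2.0], [np.nan], [6.0]])

# strategy="mean" learns the column mean from the non-missing entries during fit.
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(values)

print(imputer.statistics_)  # learned fill value per column
print(imputed.ravel())
```

For categorical columns a mean is not defined. `strategy="most_frequent"` is one option that also works on string values.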


Build a model that chains ColumnTransformer, SimpleImputer, and LogisticRegression.

# Write your code here.

skrub to help you out

The skrub library offers utilities for baseline preprocessing. Use tabular_learner to quickly build a pipeline:

model = skrub.tabular_learner(estimator=LogisticRegression())
model.fit(data_train, target_train)
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.988