Data preprocessing
This notebook explores preprocessing requirements for linear models.
Numerical features
Linear models are sensitive to data scale. While we did not preprocess data in the previous notebook, we should understand this sensitivity.
Let’s examine a simple example.
# When using JupyterLite, you will need to install the `skrub` package first.
%pip install skrub
import matplotlib.pyplot as plt
import skrub
skrub.patch_display() # make nice display for pandas tables
from sklearn.datasets import load_iris
data, target = load_iris(return_X_y=True, as_frame=True)
data
| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 |
146 | 6.3 | 2.5 | 5.0 | 1.9 |
147 | 6.5 | 3.0 | 5.2 | 2.0 |
148 | 6.2 | 3.4 | 5.4 | 2.3 |
149 | 5.9 | 3.0 | 5.1 | 1.8 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | sepal length (cm) | Float64DType | 0 (0.0%) | 35 (23.3%) | 5.84 | 0.828 | 4.30 | 5.80 | 7.90 |
1 | sepal width (cm) | Float64DType | 0 (0.0%) | 23 (15.3%) | 3.06 | 0.436 | 2.00 | 3.00 | 4.40 |
2 | petal length (cm) | Float64DType | 0 (0.0%) | 43 (28.7%) | 3.76 | 1.77 | 1.00 | 4.30 | 6.90 |
3 | petal width (cm) | Float64DType | 0 (0.0%) | 22 (14.7%) | 1.20 | 0.762 | 0.100 | 1.30 | 2.50 |
Column 1 | Column 2 | Cramér's V |
---|---|---|
sepal length (cm) | petal length (cm) | 0.557 |
petal length (cm) | petal width (cm) | 0.512 |
sepal length (cm) | petal width (cm) | 0.395 |
sepal width (cm) | petal length (cm) | 0.330 |
sepal length (cm) | sepal width (cm) | 0.320 |
sepal width (cm) | petal width (cm) | 0.308 |
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(data, target)
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
LogisticRegression()
The model raises a ConvergenceWarning. This indicates that the solver reached its iteration limit before converging, so the weights it returned may not minimize the loss function.
EXERCISE
LogisticRegression uses an LBFGS solver that iterates to find a solution. Check how many iterations it took and compare with the default in the documentation.
Increase the number of iterations. Find the minimum number needed to avoid the convergence warning.
Instead of increasing iterations, scale the data with StandardScaler before fitting. Note the new iteration count.
# Write your code here.
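One possible way to approach the exercise is sketched below. It is not the notebook's reference solution: the iteration budget `max_iter = 10_000` is an arbitrary assumption chosen to be large enough for the solver to converge on unscaled data, and the variable names are illustrative.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data, target = load_iris(return_X_y=True, as_frame=True)

# Assumed, generous iteration budget so the solver can report how many
# iterations it actually needs instead of stopping at the default limit.
max_iter = 10_000

# Fit on the raw features; silence the warning in case the budget is hit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    raw_model = LogisticRegression(max_iter=max_iter).fit(data, target)

# Fit on standardized features (zero mean, unit variance per column).
data_scaled = StandardScaler().fit_transform(data)
scaled_model = LogisticRegression(max_iter=max_iter).fit(data_scaled, target)

# n_iter_ is an array; take its maximum to get a single number.
n_iter_raw = int(raw_model.n_iter_.max())
n_iter_scaled = int(scaled_model.n_iter_.max())
print(f"Iterations without scaling: {n_iter_raw}")
print(f"Iterations with scaling: {n_iter_scaled}")
```

Comparing the two printed counts shows how much the conditioning of the problem, and not only the iteration limit, drives the convergence of the solver.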
We did not split the data into training and testing sets in the previous exercise. Let’s do so now:
from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
Scikit-learn’s Pipeline is a powerful tool that chains transformations and a final estimator. We can connect a StandardScaler and LogisticRegression like this:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
model = Pipeline(
    steps=[("scaler", StandardScaler()), ("logistic_regression", LogisticRegression())]
)
model.fit(data_train, target_train)
Pipeline(steps=[('scaler', StandardScaler()), ('logistic_regression', LogisticRegression())])
print(f"Feature mean on training set: {model[0].mean_}")
print(f"Feature standard deviation on training set: {model[0].scale_}")
Feature mean on training set: [5.83035714 3.04017857 3.80714286 1.21428571]
Feature standard deviation on training set: [0.81545772 0.43516414 1.72754545 0.74460646]
print(f"Number of iterations: {model[-1].n_iter_}")
Number of iterations: [15]
EXERCISE
The output shows that StandardScaler computed feature means and standard deviations from the training set. It uses these statistics to center and scale data before passing it to LogisticRegression.
How do you think StandardScaler behaves on the test set? Consider this code:
from sklearn.metrics import accuracy_score
predicted_target = model.predict(data_test)
print(f"Accuracy on testing set: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy on testing set: 1.000
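One way to check the answer is sketched below, assuming the pipeline is rebuilt from scratch on the same iris split. If the pipeline reuses the training statistics at prediction time, then scaling the test set by hand with `mean_` and `scale_` and calling the final estimator directly should give the same predictions as `model.predict`. The names `scaler`, `classifier`, and `manual_prediction` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data, target = load_iris(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

model = Pipeline(
    steps=[("scaler", StandardScaler()), ("logistic_regression", LogisticRegression())]
)
model.fit(data_train, target_train)
scaler, classifier = model[0], model[-1]

# Scale the test set by hand with the statistics learned on the training set.
manual_scaling = (data_test.to_numpy() - scaler.mean_) / scaler.scale_

# `transform` (not `fit_transform`) applies the same training statistics.
data_test_scaled = scaler.transform(data_test)
same_scaling = np.allclose(manual_scaling, data_test_scaled)

# Predicting step by step should match predicting through the pipeline.
manual_prediction = classifier.predict(data_test_scaled)
pipeline_prediction = model.predict(data_test)
same_prediction = np.array_equal(manual_prediction, pipeline_prediction)

print(f"Same scaling: {same_scaling}, same predictions: {same_prediction}")
print(f"Mean of the scaled test features: {data_test_scaled.mean(axis=0)}")
```

The scaled test features are generally not exactly centered on zero, because the statistics come from the training set only. This is intended: the test set must not influence any fitted quantity.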
Categorical features
We’ve shown how linear models benefit from feature scaling. Now let’s examine categorical features using the penguins dataset.
import pandas as pd
penguins = pd.read_csv("../datasets/penguins.csv")
penguins
| studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 2007-11-11 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | | | Not enough blood for isotopes. |
1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 2007-11-11 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | |
2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 2007-11-16 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | |
3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 2007-11-16 | | | | | | | | Adult not sampled. |
4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 2007-11-16 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
339 | PAL0910 | 64 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N98A2 | Yes | 2009-11-19 | 55.8 | 19.8 | 207.0 | 4000.0 | MALE | 9.70465 | -24.53494 | |
340 | PAL0910 | 65 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N99A1 | No | 2009-11-21 | 43.5 | 18.1 | 202.0 | 3400.0 | FEMALE | 9.37608 | -24.40753 | Nest never observed with full clutch. |
341 | PAL0910 | 66 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N99A2 | No | 2009-11-21 | 49.6 | 18.2 | 193.0 | 3775.0 | MALE | 9.4618 | -24.70615 | Nest never observed with full clutch. |
342 | PAL0910 | 67 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N100A1 | Yes | 2009-11-21 | 50.8 | 19.0 | 210.0 | 4100.0 | MALE | 9.98044 | -24.68741 | |
343 | PAL0910 | 68 | Chinstrap penguin (Pygoscelis antarctica) | Anvers | Dream | Adult, 1 Egg Stage | N100A2 | Yes | 2009-11-21 | 50.2 | 18.7 | 198.0 | 3775.0 | FEMALE | 9.39305 | -24.25255 | |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | studyName | ObjectDType | 0 (0.0%) | 3 (0.9%) | |||||
1 | Sample Number | Int64DType | 0 (0.0%) | 152 (44.2%) | 63.2 | 40.4 | 1 | 58 | 152 |
2 | Species | ObjectDType | 0 (0.0%) | 3 (0.9%) | |||||
3 | Region | ObjectDType | 0 (0.0%) | 1 (0.3%) | |||||
4 | Island | ObjectDType | 0 (0.0%) | 3 (0.9%) | |||||
5 | Stage | ObjectDType | 0 (0.0%) | 1 (0.3%) | |||||
6 | Individual ID | ObjectDType | 0 (0.0%) | 190 (55.2%) | |||||
7 | Clutch Completion | ObjectDType | 0 (0.0%) | 2 (0.6%) | |||||
8 | Date Egg | ObjectDType | 0 (0.0%) | 50 (14.5%) | |||||
9 | Culmen Length (mm) | Float64DType | 2 (0.6%) | 164 (47.7%) | 43.9 | 5.46 | 32.1 | 44.4 | 59.6 |
10 | Culmen Depth (mm) | Float64DType | 2 (0.6%) | 80 (23.3%) | 17.2 | 1.97 | 13.1 | 17.3 | 21.5 |
11 | Flipper Length (mm) | Float64DType | 2 (0.6%) | 55 (16.0%) | 201. | 14.1 | 172. | 197. | 231. |
12 | Body Mass (g) | Float64DType | 2 (0.6%) | 94 (27.3%) | 4.20e+03 | 802. | 2.70e+03 | 4.05e+03 | 6.30e+03 |
13 | Sex | ObjectDType | 10 (2.9%) | 3 (0.9%) | |||||
14 | Delta 15 N (o/oo) | Float64DType | 14 (4.1%) | 330 (95.9%) | 8.73 | 0.552 | 7.63 | 8.65 | 10.0 |
15 | Delta 13 C (o/oo) | Float64DType | 13 (3.8%) | 331 (96.2%) | -25.7 | 0.794 | -27.0 | -25.8 | -23.8 |
16 | Comments | ObjectDType | 290 (84.3%) | 10 (2.9%) |
Column 1 | Column 2 | Cramér's V |
---|---|---|
Clutch Completion | Comments | 0.992 |
Delta 13 C (o/oo) | Comments | 0.818 |
studyName | Sample Number | 0.781 |
Flipper Length (mm) | Body Mass (g) | 0.757 |
Culmen Depth (mm) | Flipper Length (mm) | 0.741 |
Delta 15 N (o/oo) | Delta 13 C (o/oo) | 0.692 |
Culmen Length (mm) | Flipper Length (mm) | 0.670 |
Species | Flipper Length (mm) | 0.669 |
Species | Island | 0.660 |
Culmen Depth (mm) | Body Mass (g) | 0.651 |
Sex | Comments | 0.629 |
Flipper Length (mm) | Comments | 0.626 |
Species | Body Mass (g) | 0.612 |
Species | Culmen Depth (mm) | 0.605 |
studyName | Date Egg | 0.599 |
Delta 15 N (o/oo) | Comments | 0.588 |
Species | Culmen Length (mm) | 0.579 |
Culmen Length (mm) | Culmen Depth (mm) | 0.559 |
Culmen Length (mm) | Body Mass (g) | 0.543 |
Sample Number | Island | 0.525 |
Categorical features take discrete values. Here’s an example from the penguins dataset:
penguins["Sex"]
0 MALE
1 FEMALE
2 FEMALE
3 NaN
4 FEMALE
...
339 MALE
340 FEMALE
341 MALE
342 MALE
343 FEMALE
Name: Sex, Length: 344, dtype: object
penguins["Sex"].value_counts()
Sex
MALE 168
FEMALE 165
. 1
Name: count, dtype: int64
These categories are non-numeric strings. Most scikit-learn estimators expect numeric input and cannot process such values directly, so we must convert the categories to numbers.
We can use two main strategies:
Ordinal encoding: Assigns a numeric value to each category
One-hot encoding: Creates binary features for each category
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
encoder.fit_transform(penguins[["Sex"]])
| Sex_. | Sex_FEMALE | Sex_MALE | Sex_nan |
---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... |
339 | 0.0 | 0.0 | 1.0 | 0.0 |
340 | 0.0 | 1.0 | 0.0 | 0.0 |
341 | 0.0 | 0.0 | 1.0 | 0.0 |
342 | 0.0 | 0.0 | 1.0 | 0.0 |
343 | 0.0 | 1.0 | 0.0 | 0.0 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | Sex_. | Float64DType | 0 (0.0%) | 2 (0.6%) | 0.00291 | 0.0539 | 0.00 | 0.00 | 1.00 |
1 | Sex_FEMALE | Float64DType | 0 (0.0%) | 2 (0.6%) | 0.480 | 0.500 | 0.00 | 0.00 | 1.00 |
2 | Sex_MALE | Float64DType | 0 (0.0%) | 2 (0.6%) | 0.488 | 0.501 | 0.00 | 0.00 | 1.00 |
3 | Sex_nan | Float64DType | 0 (0.0%) | 2 (0.6%) | 0.0291 | 0.168 | 0.00 | 0.00 | 1.00 |
Column 1 | Column 2 | Cramér's V |
---|---|---|
Sex_FEMALE | Sex_MALE | 0.938 |
Sex_MALE | Sex_nan | 0.169 |
Sex_FEMALE | Sex_nan | 0.166 |
Sex_. | Sex_MALE | 0.0528 |
Sex_. | Sex_FEMALE | 0.0518 |
Sex_. | Sex_nan | 0.00934 |
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder().set_output(transform="pandas")
encoder.fit_transform(penguins[["Sex"]])
| Sex |
---|---|
0 | 2.0 |
1 | 1.0 |
2 | 1.0 |
3 | |
4 | 1.0 |
... | ... |
339 | 2.0 |
340 | 1.0 |
341 | 2.0 |
342 | 2.0 |
343 | 1.0 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | Sex | Float64DType | 10 (2.9%) | 3 (0.9%) | 1.50 | 0.507 | 0.00 | 2.00 | 2.00 |
EXERCISE
List advantages and disadvantages of both encoding strategies.
Create a Pipeline that chains an encoder with a LogisticRegression model.
Use cross-validation to evaluate model performance.
# Write your code here.
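A possible shape for the last two items of the exercise is sketched below. To stay self-contained it does not read the penguins file: it uses a hypothetical synthetic dataset with a single categorical feature (`colour`) and a noisy binary target, so the numbers it prints say nothing about penguins. `cross_val_score` with `cv=5` is one choice among several for the evaluation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical synthetic data: the target mostly depends on colour == "red".
rng = np.random.default_rng(0)
n_samples = 200
colour = rng.choice(["red", "green", "blue"], size=n_samples)
clean_target = (colour == "red").astype(int)
flip = rng.random(n_samples) < 0.1  # flip about 10% of the labels as noise
toy_target = np.where(flip, 1 - clean_target, clean_target)
toy_data = pd.DataFrame({"colour": colour})

# The encoder lives inside the pipeline, so it is refit on each training fold.
model = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("logistic_regression", LogisticRegression()),
    ]
)

scores = cross_val_score(model, toy_data, toy_target, cv=5)
print(f"Accuracy per fold: {scores}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

Keeping the encoder inside the pipeline matters: fitting it on the full dataset before cross-validation would let information from the validation folds leak into the preprocessing.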
Combine numerical and categorical features
Scikit-learn’s ColumnTransformer helps us handle both numerical and categorical features. Let’s prepare our dataset:
categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
data
| Island | Sex | Culmen Length (mm) | Culmen Depth (mm) |
---|---|---|---|---|
0 | Torgersen | MALE | 39.1 | 18.7 |
1 | Torgersen | FEMALE | 39.5 | 17.4 |
2 | Torgersen | FEMALE | 40.3 | 18.0 |
3 | Torgersen | | | |
4 | Torgersen | FEMALE | 36.7 | 19.3 |
... | ... | ... | ... | ... |
339 | Dream | MALE | 55.8 | 19.8 |
340 | Dream | FEMALE | 43.5 | 18.1 |
341 | Dream | MALE | 49.6 | 18.2 |
342 | Dream | MALE | 50.8 | 19.0 |
343 | Dream | FEMALE | 50.2 | 18.7 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | Island | ObjectDType | 0 (0.0%) | 3 (0.9%) | |||||
1 | Sex | ObjectDType | 10 (2.9%) | 3 (0.9%) | |||||
2 | Culmen Length (mm) | Float64DType | 2 (0.6%) | 164 (47.7%) | 43.9 | 5.46 | 32.1 | 44.4 | 59.6 |
3 | Culmen Depth (mm) | Float64DType | 2 (0.6%) | 80 (23.3%) | 17.2 | 1.97 | 13.1 | 17.3 | 21.5 |
Column 1 | Column 2 | Cramér's V |
---|---|---|
Culmen Length (mm) | Culmen Depth (mm) | 0.559 |
Island | Culmen Depth (mm) | 0.470 |
Sex | Culmen Depth (mm) | 0.353 |
Sex | Culmen Length (mm) | 0.323 |
Island | Culmen Length (mm) | 0.299 |
Island | Sex | 0.129 |
Our data contains missing values. For now, we’ll simply drop rows with missing values in both data and target. We’ll address this topic more thoroughly in the next section.
data = data.dropna()
target = target.loc[data.index]
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), numerical_features),
        ("categorical", OneHotEncoder(), categorical_features),
    ]
)
preprocessor
ColumnTransformer(transformers=[('numerical', StandardScaler(), ['Culmen Length (mm)', 'Culmen Depth (mm)']), ('categorical', OneHotEncoder(), ['Island', 'Sex'])])
The ColumnTransformer splits the columns, sends each subset to its appropriate transformer, and concatenates the transformed outputs.
We can chain it with LogisticRegression:
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logistic_regression", LogisticRegression()),
    ]
)
model
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', StandardScaler(), ['Culmen Length (mm)', 'Culmen Depth (mm)']), ('categorical', OneHotEncoder(), ['Island', 'Sex'])])), ('logistic_regression', LogisticRegression())])
model.fit(data_train, target_train)
Pipeline(steps=[('preprocessor', ColumnTransformer(transformers=[('numerical', StandardScaler(), ['Culmen Length (mm)', 'Culmen Depth (mm)']), ('categorical', OneHotEncoder(), ['Island', 'Sex'])])), ('logistic_regression', LogisticRegression())])
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.976
Dealing with missing values
Let’s reload our dataset with missing values intact:
categorical_features = ["Island", "Sex"]
numerical_features = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_name = "Species"
data = penguins[categorical_features + numerical_features]
target = penguins[target_name]
data
| Island | Sex | Culmen Length (mm) | Culmen Depth (mm) |
---|---|---|---|---|
0 | Torgersen | MALE | 39.1 | 18.7 |
1 | Torgersen | FEMALE | 39.5 | 17.4 |
2 | Torgersen | FEMALE | 40.3 | 18.0 |
3 | Torgersen | | | |
4 | Torgersen | FEMALE | 36.7 | 19.3 |
... | ... | ... | ... | ... |
339 | Dream | MALE | 55.8 | 19.8 |
340 | Dream | FEMALE | 43.5 | 18.1 |
341 | Dream | MALE | 49.6 | 18.2 |
342 | Dream | MALE | 50.8 | 19.0 |
343 | Dream | FEMALE | 50.2 | 18.7 |
Column | Column name | dtype | Null values | Unique values | Mean | Std | Min | Median | Max |
---|---|---|---|---|---|---|---|---|---|
0 | Island | ObjectDType | 0 (0.0%) | 3 (0.9%) | |||||
1 | Sex | ObjectDType | 10 (2.9%) | 3 (0.9%) | |||||
2 | Culmen Length (mm) | Float64DType | 2 (0.6%) | 164 (47.7%) | 43.9 | 5.46 | 32.1 | 44.4 | 59.6 |
3 | Culmen Depth (mm) | Float64DType | 2 (0.6%) | 80 (23.3%) | 17.2 | 1.97 | 13.1 | 17.3 | 21.5 |
Column 1 | Column 2 | Cramér's V |
---|---|---|
Culmen Length (mm) | Culmen Depth (mm) | 0.559 |
Island | Culmen Depth (mm) | 0.470 |
Sex | Culmen Depth (mm) | 0.353 |
Sex | Culmen Length (mm) | 0.323 |
Island | Culmen Length (mm) | 0.299 |
Island | Sex | 0.129 |
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
Try fitting the previous model again. What happens?
# Write your code here.
Models that don’t handle missing values need imputation: replacing missing values with values computed from the data, such as a column’s mean or its most frequent value.
EXERCISE
Build a model that chains ColumnTransformer, SimpleImputer, and LogisticRegression.
# Write your code here.
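One possible structure for such a model is sketched below. It is an assumption-laden illustration and not the notebook's reference solution: the data is a hypothetical synthetic frame (`length`, `colour`) with values removed on purpose, the imputation strategies (`"mean"` and `"most_frequent"`) are just two of the options SimpleImputer offers, and the score is computed on the training data only to show that the pipeline runs.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data with one numerical and one categorical column.
rng = np.random.default_rng(0)
n_samples = 120
toy = pd.DataFrame(
    {
        "length": rng.normal(loc=40.0, scale=5.0, size=n_samples),
        "colour": pd.Series(
            rng.choice(["red", "green", "blue"], size=n_samples), dtype=object
        ),
    }
)
toy_target = (toy["length"] > 40.0).astype(int)

# Remove some values on purpose to mimic missing measurements.
toy.iloc[::10, 0] = np.nan
toy.iloc[5::15, 1] = np.nan

# Impute first, then scale or encode, separately for each column type.
numerical_pipeline = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, ["length"]),
        ("categorical", categorical_pipeline, ["colour"]),
    ]
)
model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logistic_regression", LogisticRegression()),
    ]
)

model.fit(toy, toy_target)
train_accuracy = model.score(toy, toy_target)
print(f"Accuracy on the (synthetic) training data: {train_accuracy:.3f}")
```

Because the imputers sit inside the pipeline, the imputation values are learned on the data passed to `fit` and reused unchanged at prediction time, in the same way as the scaling statistics.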
skrub to help you out
The skrub library offers utilities for baseline preprocessing. Use tabular_learner to quickly build a pipeline:
model = skrub.tabular_learner(estimator=LogisticRegression())
model
Pipeline(steps=[('tablevectorizer', TableVectorizer()), ('simpleimputer', SimpleImputer(add_indicator=True)), ('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
TableVectorizer()
    PassThrough()
    DatetimeEncoder()
    OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore', sparse_output=False)
    GapEncoder(n_components=30)
SimpleImputer(add_indicator=True)
StandardScaler()
LogisticRegression()
model.fit(data_train, target_train)
Pipeline(steps=[('tablevectorizer', TableVectorizer()), ('simpleimputer', SimpleImputer(add_indicator=True)), ('standardscaler', StandardScaler()), ('logisticregression', LogisticRegression())])
TableVectorizer()
    ['Culmen Length (mm)', 'Culmen Depth (mm)'] -> PassThrough()
    [] -> DatetimeEncoder()
    ['Island', 'Sex'] -> OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore', sparse_output=False)
    [] -> GapEncoder(n_components=30)
SimpleImputer(add_indicator=True)
StandardScaler()
LogisticRegression()
predicted_target = model.predict(data_test)
print(f"Accuracy: {accuracy_score(target_test, predicted_target):.3f}")
Accuracy: 0.988