Regularization

This notebook explores regularization in linear models.

Introductory example

We demonstrate a common issue with correlated features when fitting linear models.

We use the penguins dataset to illustrate this issue.

# When using JupyterLite, run the next line to install the `skrub` package.
%pip install skrub
import matplotlib.pyplot as plt
import skrub

skrub.patch_display()  # makes nice display for pandas tables
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")
penguins

We select features to predict penguin body mass. We remove rows with missing target values.

features = [
    "Island",
    "Clutch Completion",
    "Flipper Length (mm)",
    "Culmen Length (mm)",
    "Culmen Depth (mm)",
    "Species",
    "Sex",
]
target = "Body Mass (g)"
data, target = penguins[features], penguins[target]
target = target.dropna()
data = data.loc[target.index]
data
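The row selection above relies on pandas index labels: `dropna()` keeps the labels of the surviving rows, and `.loc` then picks the same labels from the feature table. A minimal sketch on a hypothetical toy table (the column names and values here are made up for illustration, not taken from the penguins data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: two rows have a missing target value.
toy = pd.DataFrame(
    {
        "flipper": [181.0, 186.0, 195.0, 193.0],
        "mass": [3750.0, np.nan, 3250.0, np.nan],
    }
)

# dropna() removes the rows with a missing target but keeps the original labels.
toy_target = toy["mass"].dropna()
# .loc selects the feature rows carrying those same labels.
toy_data = toy[["flipper"]].loc[toy_target.index]

print(toy_data.index.tolist())
print(len(toy_data) == len(toy_target))
```

After this step, features and target have the same length and the same index, which is what the estimators expect.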

Let’s evaluate a simple linear model using skrub’s preprocessing.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

model = skrub.tabular_learner(estimator=LinearRegression())
model.set_output(transform="pandas")

cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    model, data, target, cv=cv, return_estimator=True, return_train_score=True
)
pd.DataFrame(cv_results)[["train_score", "test_score"]]

The test scores look good overall, but the model performs poorly on one fold. Let’s examine the coefficient values to understand why.

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
coefs = pd.DataFrame(coefs, columns=cv_results["estimator"][0][-1].feature_names_in_)
coefs.plot.box(whis=[0, 100], vert=False)
plt.show()
../../_images/67d92bbaa96c7fecca4362dbe1cab80fabad53a6c27b621f662fda26107f749e.png

EXERCISE

What do you observe? What causes this behavior? Apply the preprocessing chain and check skrub’s statistics on the resulting data to understand these coefficients.

# Write your code here.

Ridge regressor - L2 regularization

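We saw that coefficients can grow arbitrarily large when features correlate. Ridge regression counters this by adding a penalty on the squared L2 norm of the coefficients to the least-squares loss:

\[ loss = \|y - X \beta\|_2^2 + \alpha \|\beta\|_2^2 \]

L2 regularization shrinks the weights toward zero. The parameter \(\alpha\) controls the strength of this shrinkage: the larger \(\alpha\), the smaller the weights. Scikit-learn implements this as the Ridge model. Let’s fit it and examine its effect on weights.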

from sklearn.linear_model import Ridge

model = skrub.tabular_learner(estimator=Ridge(alpha=1)).set_output(transform="pandas")

cv_results = cross_validate(
    model, data, target, cv=cv, return_estimator=True, return_train_score=True
)
pd.DataFrame(cv_results)[["train_score", "test_score"]]

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
coefs = pd.DataFrame(coefs, columns=cv_results["estimator"][0][-1].feature_names_in_)
coefs.plot.box(whis=[0, 100], vert=False)
plt.show()
../../_images/d5a6570b6db9a5eaa91035eff021b292af7e0bf00ff77e66638538537cd62f6d.png

A small amount of regularization solves the weight problem, and we recover the original relationship.
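The same effect can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the notebook's pipeline: it builds two nearly identical features (hypothetical `x1` and `x2`), omits the intercept, and solves the penalized least-squares problem through its normal equations, \((X^T X + \alpha I) \beta = X^T y\). Setting `alpha=0` gives the unregularized solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two almost perfectly correlated features: x2 is x1 plus tiny noise.
x1 = rng.normal(size=n_samples)
x2 = x1 + 1e-3 * rng.normal(size=n_samples)
X = np.column_stack([x1, x2])
# The target only depends on x1 (true effect: 3), plus noise.
y = 3 * x1 + rng.normal(scale=0.5, size=n_samples)


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


beta_ols = ridge_closed_form(X, y, alpha=0.0)    # no regularization
beta_ridge = ridge_closed_form(X, y, alpha=1.0)  # small L2 penalty

print("no penalty :", beta_ols)
print("alpha = 1  :", beta_ridge)
```

Without a penalty, the two coefficients can take large values of opposite sign because only their sum is well determined by the data. With the penalty, the effect is shared between the two features and the coefficient norm is smaller.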

EXERCISE

Try different \(\alpha\) values and examine how they affect the weights.

# Write your code here.

Lasso regressor - L1 regularization

L1 regularization is another option. It penalizes the sum of the absolute values of the coefficients:

\[ loss = \|y - X \beta\|_2^2 + \alpha \|\beta\|_1 \]

Scikit-learn implements this as the Lasso regressor.
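To build intuition for how the two penalties differ, consider a simplified one-coefficient version of each loss (an assumption made for illustration; it corresponds to orthonormal features, where the problem decouples per coefficient). Minimizing \((z - b)^2 + \alpha |b|\) gives soft-thresholding at \(\alpha / 2\), while minimizing \((z - b)^2 + \alpha b^2\) gives a uniform rescaling by \(1 / (1 + \alpha)\). The helper names below are hypothetical.

```python
import numpy as np


def l1_shrink(z, alpha):
    """Minimizer of (z - b)**2 + alpha * |b|: soft-thresholding at alpha / 2."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha / 2, 0.0)


def l2_shrink(z, alpha):
    """Minimizer of (z - b)**2 + alpha * b**2: rescaling by 1 / (1 + alpha)."""
    return z / (1.0 + alpha)


# Hypothetical unpenalized coefficients.
z = np.array([3.0, -0.5, 0.2, -2.0])
alpha = 2.0

print("L1:", l1_shrink(z, alpha))
print("L2:", l2_shrink(z, alpha))
```

The L1 rule sets the small coefficients exactly to zero, whereas the L2 rule shrinks every coefficient but leaves all of them non-zero. This is why the Lasso can act as a feature selector.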

EXERCISE

Repeat the previous experiment with different \(\alpha\) values and examine how they affect the weights \(\beta\).

# Write your code here.

Elastic net - Combining L2 and L1 regularization

Combining L2 and L1 regularization offers unique benefits: it identifies important features while preventing non-zero coefficients from growing too large.

from sklearn.linear_model import ElasticNet

model = skrub.tabular_learner(estimator=ElasticNet(alpha=10, l1_ratio=0.95))
model.set_output(transform="pandas")

cv_results = cross_validate(
    model, data, target, cv=cv, return_estimator=True, return_train_score=True
)
pd.DataFrame(cv_results)[["train_score", "test_score"]]

coefs = [est[-1].coef_ for est in cv_results["estimator"]]
coefs = pd.DataFrame(coefs, columns=cv_results["estimator"][0][-1].feature_names_in_)
coefs.plot.box(whis=[0, 100], vert=False)
plt.show()
../../_images/8a482b913e1230b334d0e6f13987d62584b0fe968a5871f9b610c961bbcc3406.png
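It can help to see how `alpha` and `l1_ratio` split the penalty. In scikit-learn's `ElasticNet`, the penalty term is `alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2**2`. The small helper below (a hypothetical function written for this illustration) computes the two weights for the setting used above.

```python
def elastic_net_penalty_weights(alpha, l1_ratio):
    """Return the weights applied to the L1 and squared-L2 terms.

    Follows scikit-learn's ElasticNet parametrization:
    alpha * l1_ratio * ||b||_1 + 0.5 * alpha * (1 - l1_ratio) * ||b||_2**2
    """
    l1_weight = alpha * l1_ratio
    l2_weight = 0.5 * alpha * (1.0 - l1_ratio)
    return l1_weight, l2_weight


l1_weight, l2_weight = elastic_net_penalty_weights(alpha=10, l1_ratio=0.95)
print(f"L1 weight: {l1_weight:.2f}, L2 weight: {l2_weight:.2f}")
```

With `alpha=10` and `l1_ratio=0.95`, the L1 term gets a weight of about 9.5 and the squared-L2 term about 0.25, so this model behaves mostly like a Lasso with a mild Ridge component. Setting `l1_ratio=1` recovers the pure L1 penalty.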

Hyperparameter tuning

How do we choose the regularization parameter? A validation curve helps analyze the effect of a single parameter: it plots the training and test scores against the values of that parameter.

Let’s use ValidationCurveDisplay to analyze how the alpha parameter affects Ridge regression.

model = skrub.tabular_learner(estimator=Ridge()).set_output(transform="pandas")

We need to find the parameter name for alpha in the model.

model.get_params()
{'memory': None,
 'steps': [('tablevectorizer', TableVectorizer()),
  ('simpleimputer', SimpleImputer(add_indicator=True)),
  ('standardscaler', StandardScaler()),
  ('ridge', Ridge())],
 'verbose': False,
 'tablevectorizer': TableVectorizer(),
 'simpleimputer': SimpleImputer(add_indicator=True),
 'standardscaler': StandardScaler(),
 'ridge': Ridge(),
 'tablevectorizer__cardinality_threshold': 40,
 'tablevectorizer__datetime__add_total_seconds': True,
 'tablevectorizer__datetime__add_weekday': False,
 'tablevectorizer__datetime__resolution': 'hour',
 'tablevectorizer__datetime': DatetimeEncoder(),
 'tablevectorizer__drop_null_fraction': 1.0,
 'tablevectorizer__high_cardinality__add_words': False,
 'tablevectorizer__high_cardinality__analyzer': 'char',
 'tablevectorizer__high_cardinality__batch_size': 1024,
 'tablevectorizer__high_cardinality__gamma_scale_prior': 1.0,
 'tablevectorizer__high_cardinality__gamma_shape_prior': 1.1,
 'tablevectorizer__high_cardinality__hashing': False,
 'tablevectorizer__high_cardinality__hashing_n_features': 4096,
 'tablevectorizer__high_cardinality__init': 'k-means++',
 'tablevectorizer__high_cardinality__max_iter': 5,
 'tablevectorizer__high_cardinality__max_iter_e_step': 1,
 'tablevectorizer__high_cardinality__max_no_improvement': 5,
 'tablevectorizer__high_cardinality__n_components': 30,
 'tablevectorizer__high_cardinality__ngram_range': (2, 4),
 'tablevectorizer__high_cardinality__random_state': None,
 'tablevectorizer__high_cardinality__rescale_W': True,
 'tablevectorizer__high_cardinality__rescale_rho': False,
 'tablevectorizer__high_cardinality__rho': 0.95,
 'tablevectorizer__high_cardinality__verbose': 0,
 'tablevectorizer__high_cardinality': GapEncoder(n_components=30),
 'tablevectorizer__low_cardinality__categories': 'auto',
 'tablevectorizer__low_cardinality__drop': 'if_binary',
 'tablevectorizer__low_cardinality__dtype': 'float32',
 'tablevectorizer__low_cardinality__feature_name_combiner': 'concat',
 'tablevectorizer__low_cardinality__handle_unknown': 'ignore',
 'tablevectorizer__low_cardinality__max_categories': None,
 'tablevectorizer__low_cardinality__min_frequency': None,
 'tablevectorizer__low_cardinality__sparse_output': False,
 'tablevectorizer__low_cardinality': OneHotEncoder(drop='if_binary', dtype='float32', handle_unknown='ignore',
               sparse_output=False),
 'tablevectorizer__n_jobs': None,
 'tablevectorizer__numeric': PassThrough(),
 'tablevectorizer__specific_transformers': (),
 'simpleimputer__add_indicator': True,
 'simpleimputer__copy': True,
 'simpleimputer__fill_value': None,
 'simpleimputer__keep_empty_features': False,
 'simpleimputer__missing_values': nan,
 'simpleimputer__strategy': 'mean',
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'ridge__alpha': 1.0,
 'ridge__copy_X': True,
 'ridge__fit_intercept': True,
 'ridge__max_iter': None,
 'ridge__positive': False,
 'ridge__random_state': None,
 'ridge__solver': 'auto',
 'ridge__tol': 0.0001}
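Scanning the full dictionary by eye is tedious. One option is to filter the parameter names programmatically. The sketch below uses a smaller stand-in pipeline built with `make_pipeline` (an assumption for self-containment; the notebook's `model` has more steps), but the same list comprehension should apply to any scikit-learn pipeline.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline; step names are the lowercased class names.
pipe = make_pipeline(StandardScaler(), Ridge())

# Nested parameters follow the "<step name>__<parameter name>" convention.
alpha_params = [name for name in pipe.get_params() if name.endswith("__alpha")]
print(alpha_params)
```

The double underscore separates the step name from the parameter name, which is why the Ridge penalty is addressed as `ridge__alpha`.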
import numpy as np
from sklearn.model_selection import ValidationCurveDisplay

disp = ValidationCurveDisplay.from_estimator(
    model,
    data,
    target,
    cv=cv,
    std_display_style="errorbar",
    param_name="ridge__alpha",
    param_range=np.logspace(-3, 3, num=20),
    n_jobs=2,
)
plt.show()
../../_images/2e5fef8c997242ee755a2327c209353405d69b84bdd1162358d53d88d56836db.png

Too much regularization degrades model performance.
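To read the numbers behind such a plot, the `validation_curve` function returns the raw per-fold scores. The sketch below is a self-contained illustration on synthetic data from `make_regression` with a simplified stand-in pipeline (both are assumptions, not the penguins setup).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the real data.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
alphas = np.logspace(-3, 3, num=7)

# Each returned array has one row per alpha and one column per CV fold.
train_scores, test_scores = validation_curve(
    pipe,
    X,
    y,
    param_name="ridge__alpha",
    param_range=alphas,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)

mean_test = test_scores.mean(axis=1)
for alpha, score in zip(alphas, mean_test):
    print(f"alpha={alpha:g}: mean test R2={score:.3f}")
```

On this synthetic problem, the mean test score for the largest `alpha` falls below the score for the smallest one, which matches the degradation seen on the right-hand side of the curve.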

EXERCISE

Try a very small alpha (e.g. 1e-16) and observe its effect on the validation curve.

# Write your code here.

In practice, we often use grid or random search instead of validation curves to choose regularization parameters. These methods run internal cross-validation to select the best-performing model. Let’s demonstrate random search.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {"ridge__alpha": loguniform(1e-3, 1e3)}
search = RandomizedSearchCV(model, param_distributions, n_iter=10, cv=cv)
search.fit(data_train, target_train)
RandomizedSearchCV(cv=KFold(n_splits=5, random_state=42, shuffle=True),
                   estimator=Pipeline(steps=[('tablevectorizer',
                                              TableVectorizer()),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=True)),
                                             ('standardscaler',
                                              StandardScaler()),
                                             ('ridge', Ridge())]),
                   param_distributions={'ridge__alpha': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7faac8202570>})
search.best_params_
{'ridge__alpha': np.float64(1.8686339471165179)}
pd.DataFrame(search.cv_results_)
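The train/test split made earlier leaves a held-out set that the search never sees. By default the search refits the best model on the whole training set, so calling `score` on the held-out set estimates how the tuned model generalizes. A minimal sketch on synthetic data (the dataset, the stand-in pipeline, and the names `toy_search`, `X_test`, `y_test` are assumptions for illustration):

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the penguins table.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

toy_search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    {"ridge__alpha": loguniform(1e-3, 1e3)},
    n_iter=10,
    cv=5,
    random_state=0,  # makes the sampled alphas reproducible
)
toy_search.fit(X_train, y_train)

# refit=True (the default) means the best pipeline was refit on X_train,
# so score() evaluates that tuned model on unseen data.
test_r2 = toy_search.score(X_test, y_test)
print(toy_search.best_params_)
print(f"held-out R2: {test_r2:.3f}")
```

Passing `random_state` to the search is a design choice that fixes which `alpha` values are sampled, making runs comparable.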

Because the search object is itself an estimator, we can pass it to cross_validate to perform nested cross-validation. The inner loop selects parameters while the outer loop evaluates model performance.

cv_results = cross_validate(
    search, data, target, cv=cv, return_estimator=True, return_train_score=True
)
pd.DataFrame(cv_results)[["train_score", "test_score"]]
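Since `return_estimator=True` keeps the fitted search object of each outer fold, we can also check which `alpha` the inner loop selected in each fold. The sketch below illustrates this on synthetic data with a simplified stand-in pipeline and a smaller inner search (all assumptions chosen to keep the example short and fast).

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Inner loop: selects alpha on each outer training set.
inner_search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), Ridge()),
    {"ridge__alpha": loguniform(1e-3, 1e3)},
    n_iter=5,
    cv=3,
    random_state=0,
)

# Outer loop: evaluates the whole "search then fit" procedure.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
nested = cross_validate(inner_search, X, y, cv=outer_cv, return_estimator=True)

# One fitted search object per outer fold, each with its own best alpha.
selected_alphas = [est.best_params_["ridge__alpha"] for est in nested["estimator"]]
print(selected_alphas)
print(nested["test_score"])
```

If the selected values vary widely between folds, the test score is probably not very sensitive to `alpha` in that range, or the data are too few to pin it down.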
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(
/home/runner/work/traces-sklearn/traces-sklearn/.pixi/envs/docs/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:242: UserWarning: Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros
  warnings.warn(

Some scikit-learn estimators tune their own hyperparameters efficiently through an internal cross-validation. Estimators with “CV” in their name, like RidgeCV, automatically select the regularization parameter that scores best among the candidate values they are given.

from sklearn.linear_model import RidgeCV

model = skrub.tabular_learner(estimator=RidgeCV(alphas=np.logspace(-3, 3, num=100)))
model.set_output(transform="pandas")

cv_results = cross_validate(
    model, data, target, cv=cv, return_estimator=True, return_train_score=True
)
pd.DataFrame(cv_results)[["train_score", "test_score"]]

alphas = [est[-1].alpha_ for est in cv_results["estimator"]]
alphas
[np.float64(1.4174741629268048),
 np.float64(2.4770763559917115),
 np.float64(3.2745491628777286),
 np.float64(2.4770763559917115),
 np.float64(2.848035868435802)]
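
To see in isolation what a “CV” estimator stores after fitting, here is a minimal sketch on synthetic data rather than on the penguins dataset. The data generated by `make_regression` and the names `X_demo`, `y_demo`, `candidate_alphas` and `ridge` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression data, used only for illustration.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=0
)

# Same grid of candidate regularization strengths as in the cell above.
candidate_alphas = np.logspace(-3, 3, num=100)
ridge = RidgeCV(alphas=candidate_alphas).fit(X_demo, y_demo)

# `alpha_` is the candidate with the best internal cross-validation score.
print(f"Selected alpha: {ridge.alpha_:.4f}")
```

The selected value is always one of the candidates, so a selection that sits on the edge of the grid suggests widening the range.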

What about classification?#

Classification handles regularization differently. Instead of requiring a separate estimator, regularization is a parameter of the model itself. The two main linear classifiers are LogisticRegression and LinearSVC. Both expose the penalty and C parameters, where C is the inverse of the regularization strength that regression models call alpha.

We’ll explore parameter C with LogisticRegression. First, let’s load classification data to predict penguin species from culmen measurements.

data = pd.read_csv("../datasets/penguins_classification.csv")
data = data[data["Species"].isin(["Adelie", "Chinstrap"])]
data["Species"] = data["Species"].astype("category")
data.head()

X, y = data[["Culmen Length (mm)", "Culmen Depth (mm)"]], data["Species"]
import matplotlib.pyplot as plt

data.plot.scatter(
    x="Culmen Length (mm)",
    y="Culmen Depth (mm)",
    c="Species",
    edgecolor="black",
    s=50,
)
plt.show()
[Figure: scatter plot of Culmen Depth (mm) against Culmen Length (mm), colored by Species]

QUESTION

What regularization does LogisticRegression use by default? Check the documentation.

Let’s fit a model and visualize its decision boundary.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
LogisticRegression()
from sklearn.inspection import DecisionBoundaryDisplay

display = DecisionBoundaryDisplay.from_estimator(
    model,
    X,
    response_method="decision_function",
    cmap=plt.cm.RdBu,
    plot_method="pcolormesh",
    shading="auto",
)
data.plot.scatter(
    x="Culmen Length (mm)",
    y="Culmen Depth (mm)",
    c="Species",
    edgecolor="black",
    s=50,
    ax=display.ax_,
)
plt.show()
[Figure: decision function of the fitted LogisticRegression with the data points overlaid]
coef = pd.Series(model.coef_[0], index=X.columns)
coef.plot.barh()
plt.show()
[Figure: horizontal bar plot of the LogisticRegression coefficients for the two culmen features]

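This example establishes a baseline for studying the effect of the parameter C. With labels \(y_i \in \{-1, 1\}\), weights \(w\) and intercept \(c\), the penalized logistic regression loss function is:

\[ loss = \frac{1 - \rho}{2} w^T w + \rho \|w\|_1 + C \sum_{i=1}^{n} \log ( \exp (- y_i (X_i^T w + c)) + 1) \]

Here \(\rho\) sets the mix between the L1 and L2 penalties.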
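
As a rough numerical check of this kind of objective, the sketch below evaluates an elastic-net penalized logistic loss with NumPy. It assumes labels coded as -1 and +1 and a data term of the form log(1 + exp(-y_i (x_i^T w + c))) summed over samples. The function `penalized_logistic_loss` and the toy arrays are hypothetical helpers, not part of scikit-learn.

```python
import numpy as np


def penalized_logistic_loss(w, c, X, y, C=1.0, rho=0.0):
    """Elastic-net penalized logistic loss for labels y in {-1, +1}."""
    margins = y * (X @ w + c)
    # log(1 + exp(-margin)), computed in a numerically stable way.
    data_term = np.logaddexp(0.0, -margins).sum()
    # rho = 0 gives a pure L2 penalty, rho = 1 a pure L1 penalty.
    penalty = 0.5 * (1.0 - rho) * (w @ w) + rho * np.abs(w).sum()
    return penalty + C * data_term


# Toy data, used only for illustration.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(6, 2))
y_toy = np.array([-1, 1, 1, -1, 1, -1])

# With zero weights the penalty vanishes and each sample contributes log(2).
loss_at_zero = penalized_logistic_loss(np.zeros(2), 0.0, X_toy, y_toy, C=2.0)
print(loss_at_zero)
```

Because C scales only the data term, doubling C doubles the cost of every misfit while leaving the penalty on the weights unchanged.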

EXERCISE

Fit models with different C values and examine how they affect coefficients and decision boundaries.

# Write your code here.

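The loss formula shows that C multiplies the data term (the error between true and predicted targets). In regression, alpha multiplies the penalty on the weights instead. This explains why C acts as the inverse of alpha: a small C means strong regularization, while a large C means weak regularization.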
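
The effect can be sketched on synthetic data, leaving the penguins exercise above open. The dataset from `make_classification`, the chosen C values and the names `X_demo`, `y_demo` and `coef_norms` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Noisy two-feature synthetic problem, used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    flip_y=0.1,
    random_state=0,
)

coef_norms = {}
for C in [0.001, 0.1, 10.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # A small C means strong regularization, so the weights shrink toward zero.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C:<6} ||w|| = {coef_norms[C]:.3f}")
```

The coefficient norm should grow as C increases, which mirrors how Ridge coefficients grow as alpha decreases.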