Conference talks

TRACES winter school 2024

Introduction to machine learning in Python

Tutorial: This tutorial introduces how to use scikit-learn to build predictive models with machine learning. It covers the basics of model evaluation, gives insights into linear and tree-based models, discusses hyperparameter tuning, and finally touches briefly on confidence intervals for predictions.
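
By way of illustration only (the tutorial's actual material is in the linked repository), a minimal sketch of the last topic, prediction intervals via quantile regression in scikit-learn:

```python
# Hedged sketch of prediction intervals via quantile regression; the data
# and estimator choices are illustrative, not the tutorial's material.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=4, noise=10.0, random_state=0)

# One model per quantile yields an approximate 80% prediction interval.
low = GradientBoostingRegressor(loss="quantile", alpha=0.1, random_state=0).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0).fit(X, y)
print(low.predict(X[:3]))   # lower bounds
print(high.predict(X[:3]))  # upper bounds
```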

Tutorials repository Static course

Sample space podcast 2024

Imbalanced-learn: regrets and onwards

Abstract: Imbalanced-learn is one of the most popular scikit-learn-contrib projects out there. It supports resampling techniques that have historically been used for imbalanced classification use cases. However, now that we are a few years down the line, it may be time to start rethinking the library. As it turns out, other techniques may be preferable.

Videos

Practical AI podcast 2024

scikit-learn & data science you own

Abstract: We are at GenAI saturation, so let’s talk about scikit-learn, a long time favorite for data scientists building classifiers, time series analyzers, dimensionality reducers, and more! Scikit-learn is deployed across industry and driving a significant portion of the “AI” that is actually in production. :probabl is a new kind of company that is stewarding this project along with a variety of other open source projects. Yann Lechelle and Guillaume Lemaitre share some of the vision behind the company and talk about the future of scikit-learn!

Podcast

ENGIE 2024

scikit-learn: community insights & latest features

Abstract: Insights regarding the scikit-learn community and some new features available in the latest versions.

Slides

PyData Paris 2024

An update on the latest scikit-learn features

Abstract: In this talk, we provide an update on the latest scikit-learn features that have been implemented in versions 1.4 and 1.5. We will particularly discuss the following features:

  • the metadata routing API, which allows metadata to be passed around estimators (see the sketch below);
  • the TunedThresholdClassifierCV, which allows tuning the operational decision threshold with a custom metric;
  • better support for categorical features and missing values;
  • improved interoperability with array and dataframe libraries.
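
As a minimal sketch of the metadata routing feature from the first bullet (synthetic data, scikit-learn >= 1.4 assumed; illustrative rather than the talk's material):

```python
# Hedged sketch: route `sample_weight` through cross-validation using the
# metadata routing API (scikit-learn >= 1.4); the data below is synthetic.
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

sklearn.set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = rng.randint(0, 2, size=100)
sample_weight = rng.uniform(size=100)

# The estimator declares that it requests `sample_weight` during `fit`;
# `cross_validate` then routes the metadata passed via `params` to it.
model = LogisticRegression().set_fit_request(sample_weight=True)
cv_results = cross_validate(model, X, y, params={"sample_weight": sample_weight})
print(cv_results["test_score"])
```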

Slides Videos Tutorials repository

EuroSciPy 2024

Probabilistic classification and cost-sensitive learning with scikit-learn

Tutorial: Data scientists are repeatedly told that it is absolutely critical to align their model training methodology with a specific business objective. While this is rather good advice, it usually falls short on the details of how to achieve this in practice.

This hands-on tutorial aims to introduce helpful theoretical concepts and concrete software tools to help bridge this gap. The approach will be illustrated on a worked practical use case: optimizing the operations of a fraud detection system for a payment processing platform.

More specifically, we will introduce the concept of calibrated probabilistic classifiers, how to evaluate them, and how to fix common causes of mis-calibration. In the second part, we will explore how to turn probabilistic classifiers into optimal business decision makers.
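
As a hedged illustration of the first concept (synthetic data; not the tutorial's actual material), calibration can be checked with scikit-learn's calibration_curve:

```python
# Hedged sketch: assess the calibration of a probabilistic classifier on
# illustrative synthetic data by comparing predicted probabilities to
# observed frequencies per bin; a calibrated model stays near the diagonal.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]

frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)
print(list(zip(mean_pred.round(2), frac_pos.round(2))))
```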

Slides Tutorials repository

DeepLabCut AI Residency 2024

scikit-learn: An OSS community-driven development

Abstract: This talk provides insights regarding the scikit-learn community and the development of the library.

Slides

Sacl-AI 2024

Introduction to machine learning in Python

Tutorial: This tutorial provides an introduction to machine learning in Python, notably using scikit-learn.

Tutorials repository

DataTalksClub podcast 2024

Insights regarding the scikit-learn project

Abstract: This podcast discusses some insights regarding the scikit-learn and imbalanced-learn projects.

Videos

PyCon Italia 2024

A Retrieval Augmented Generation system to query the scikit-learn documentation

Abstract: Rubber ducks have been helping Pythonistas in their everyday quests for many years. At scikit-learn, we’ve elevated ducky to another level: come and meet the scikit-learn Ragger Duck, a RAG system designed to answer all your scikit-learn questions – at least as effectively as a duck can.

Slides Blog post Tutorials repository

PyConDE & PyData Berlin 2024

A Retrieval Augmented Generation system to query the scikit-learn documentation

Abstract: The scikit-learn website currently employs an “exact” search engine based on the Sphinx Python package, but it has limitations: it cannot handle spelling mistakes or natural-language queries. To address these constraints, we experimented with large language models (LLMs) and opted for a retrieval augmented generation (RAG) system due to resource constraints.

This talk introduces our experimental RAG system for querying scikit-learn documentation. We focus on an open-source software stack and open-weight models. The talk presents the different stages of the RAG pipeline. We provide documentation scraping strategies that we designed based on numpydoc and sphinx-gallery, which are used to build vector indices for the lexical and semantic searches. We compare our RAG approach with an LLM-only approach to demonstrate the advantage of providing context. The source code for this experiment is available on GitHub: https://github.com/glemaitre/sklearn-ragger-duck.

Finally, we discuss the gains and challenges of integrating such a system into an open-source project, including hosting and cost considerations, comparing it with alternative approaches.
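
For illustration, here is a generic sketch of the two retrieval legs described above (lexical and semantic). The actual ragger-duck stack in the linked repository differs; the rank_bm25 and sentence-transformers packages and the tiny document set are assumptions made for the sketch:

```python
# Hedged, generic sketch of hybrid retrieval for a RAG pipeline: a lexical
# BM25 index plus a semantic embedding index over the same documents.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "RandomForestClassifier fits a number of decision tree classifiers.",
    "StandardScaler standardizes features by removing the mean.",
    "GridSearchCV performs an exhaustive search over parameter values.",
]
query = "how do I scale my features?"

# Lexical leg: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
lexical_scores = bm25.get_scores(query.lower().split())

# Semantic leg: cosine similarity between embedded query and documents.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
semantic_scores = doc_emb @ query_emb

# A retriever would merge both rankings (e.g. reciprocal rank fusion) and
# pass the top documents as context to the LLM.
print("lexical:", lexical_scores)
print("best semantic match:", docs[int(np.argmax(semantic_scores))])
```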

Slides Blog post Videos Tutorials repository

CDiscount 2024

Get the best from your scikit-learn classifier

Abstract: When operating a classifier in a production setting (i.e. the predictive phase), practitioners are potentially interested in two different outputs: a “hard” decision used to drive a business decision and/or a “soft” decision giving a confidence score linked to each potential decision (e.g. usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from “soft” to “hard” predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier so that its confidence scores are close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate “hard” predictions with this heuristic. Conversely, training a classifier for optimum “hard” prediction accuracy (with the cut-off constrained at 0.5) does not guarantee a calibrated classifier.

In this talk, we present a new scikit-learn meta-estimator giving us the best of both worlds: a calibrated classifier providing optimum “hard” predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding how to obtain accurate probabilities and predictions, and illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.
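
The pull request referenced above was ultimately released in scikit-learn 1.5 as TunedThresholdClassifierCV; a minimal sketch follows, with a purely illustrative gain matrix and synthetic data:

```python
# Hedged sketch: tune the decision threshold of a classifier against a
# business-oriented metric; the gains/costs below are hypothetical.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import TunedThresholdClassifierCV

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

def business_gain(y_true, y_pred):
    # Hypothetical values: catching a fraud is worth 50, a false alarm costs 5.
    true_positive = ((y_true == 1) & (y_pred == 1)).sum()
    false_positive = ((y_true == 0) & (y_pred == 1)).sum()
    return 50 * true_positive - 5 * false_positive

model = TunedThresholdClassifierCV(
    LogisticRegression(), scoring=make_scorer(business_gain)
)
model.fit(X, y)
print(model.best_threshold_)  # cut-off selected instead of the default 0.5
```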

Slides

PyData Paris Meetup 2024

Get the best from your scikit-learn classifier

Abstract: When operating a classifier in a production setting (i.e. the predictive phase), practitioners are potentially interested in two different outputs: a “hard” decision used to drive a business decision and/or a “soft” decision giving a confidence score linked to each potential decision (e.g. usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from “soft” to “hard” predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier so that its confidence scores are close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate “hard” predictions with this heuristic. Conversely, training a classifier for optimum “hard” prediction accuracy (with the cut-off constrained at 0.5) does not guarantee a calibrated classifier.

In this talk, we present a new scikit-learn meta-estimator giving us the best of both worlds: a calibrated classifier providing optimum “hard” predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding how to obtain accurate probabilities and predictions, and illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.

Slides

PyData Global 2023

Get the best from your scikit-learn classifier

Abstract: When operating a classifier in a production setting (i.e. the predictive phase), practitioners are potentially interested in two different outputs: a “hard” decision used to drive a business decision and/or a “soft” decision giving a confidence score linked to each potential decision (e.g. usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from “soft” to “hard” predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier so that its confidence scores are close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate “hard” predictions with this heuristic. Conversely, training a classifier for optimum “hard” prediction accuracy (with the cut-off constrained at 0.5) does not guarantee a calibrated classifier.

In this talk, we present a new scikit-learn meta-estimator giving us the best of both worlds: a calibrated classifier providing optimum “hard” predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding how to obtain accurate probabilities and predictions, and illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.

Slides Videos

EuroSciPy 2023

Get the best from your scikit-learn classifier

Abstract: When operating a classifier in a production setting (i.e. the predictive phase), practitioners are potentially interested in two different outputs: a “hard” decision used to drive a business decision and/or a “soft” decision giving a confidence score linked to each potential decision (e.g. usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from “soft” to “hard” predictions: it uses a cut-off point at a confidence score of 0.5 (or 0 when using decision_function) to get class labels. However, optimizing a classifier so that its confidence scores are close to the true probabilities (i.e. a calibrated classifier) does not guarantee accurate “hard” predictions with this heuristic. Conversely, training a classifier for optimum “hard” prediction accuracy (with the cut-off constrained at 0.5) does not guarantee a calibrated classifier.

In this talk, we present a new scikit-learn meta-estimator giving us the best of both worlds: a calibrated classifier providing optimum “hard” predictions. This meta-estimator will land in a future version of scikit-learn: https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding how to obtain accurate probabilities and predictions, and illustrate how to use this model in practice on different use cases: cost-sensitive problems and imbalanced classification problems.

Slides Videos

PyConDE & PyData Berlin 2022

Inspect and try to interpret your scikit-learn machine-learning models

Abstract: This tutorial is subdivided into three parts. First, we focus on the family of linear models and present the common pitfalls to be aware of when interpreting the coefficients of such models. Then, we look at a larger range of models (e.g. gradient boosting) and put into practice the inspection techniques available in scikit-learn. Finally, we present other tools to interpret models (e.g. shap) that are not currently available in scikit-learn but are widely used in practice.
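
A minimal sketch of two of the scikit-learn inspection techniques covered, on synthetic data (illustrative, not the tutorial's material):

```python
# Hedged sketch: permutation importance and partial dependence, both from
# sklearn.inspection, applied to a gradient-boosting model on toy data.
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1_000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = HistGradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Permutation importance: the drop in score when one feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)

# Partial dependence of the prediction on the first feature (needs matplotlib).
PartialDependenceDisplay.from_estimator(model, X_test, features=[0])
```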

Slides Videos Tutorials repository

PyLadies Paris 2022

Inspecting your predictive model in Python

Abstract: This presentation covers the tools available for inspecting predictive models in Python. We will first quickly define what we mean by a predictive model and what it implies when one wants to explain the decisions of such a model. We will then provide a quick taxonomy of the current methods intended to explain predictive models. Finally, we will give an overview of the available tools in scikit-learn and shap.
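
For illustration, a short sketch of shap's unified Explainer API on a toy model (the shap package is assumed to be installed; this is not the presentation's material):

```python
# Hedged sketch: compute Shapley values for a tree-based classifier using
# shap's unified Explainer API; the model and data are illustrative.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = shap.Explainer(model)  # dispatches to TreeExplainer for tree models
shap_values = explainer(X[:10])    # Shapley values for the first 10 samples
print(shap_values.values.shape)
```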

Slides

Euler Hermes 2019

Learning from imbalanced datasets: state of the art

Abstract: This presentation gives an overview of the state of the art of predictive modelling with imbalanced datasets.

Slides

EuroSciPy 2019

Rapid Analytics & Model Prototyping (RAMP)

Abstract: We will give an overview of the RAMP framework, which provides a platform to organize reproducible and transparent data challenges. RAMP workflow is a Python package used to define and formalize the data science problem to be solved; it can be used as a standalone package and allows a user to prototype different solutions. On top of RAMP workflow, a set of packages has been developed to share and collaborate around the developed solutions: RAMP database provides a database structure to store the solutions of different users and the performance of these solutions; RAMP engine runs the user solutions (possibly on the cloud) and populates the database; and RAMP frontend is the web frontend where users upload their solutions and which shows the leaderboard of the challenge. The project is open-source and can be deployed on any local server. The framework has been used at the Paris-Saclay Center for Data Science to set up and solve about twenty scientific problems, to organize collaborative data challenges, to organize scientific sub-communities around these events, and to train novice data scientists.

Slides RAMP board RAMP workflow

Introduction to scikit-learn: from model fitting to model interpretation

Abstract: Our introduction to scikit-learn will be subdivided into two parts. We will give a general introduction to scikit-learn, presenting basic concepts around cross-validation, pipeline estimators, and hyperparameter search. Then, we will focus on model interpretation, presenting the challenges and the available tools to understand a trained machine-learning model: partial dependence plots, feature importance, LIME, Shapley values, etc.
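
As a hedged illustration of the basic concepts from the first part (not the tutorial's actual material), a pipeline estimator combined with a cross-validated hyperparameter search:

```python
# Hedged sketch: a pipeline estimator tuned with GridSearchCV on a toy
# dataset; data and the parameter grid are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 5-fold cross-validated search over the regularization strength.
search = GridSearchCV(model, param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```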

Slides Tutorials repository

EuroSciPy 2018

Imbalanced-learn: A scikit-learn-contrib to tackle learning from imbalanced data sets

Abstract: The curse of imbalanced data sets refers to data sets in which the number of samples in one class is lower than in the others. This issue is often encountered in real-world data sets such as medical imaging applications (e.g. cancer detection), fraud detection, etc. Under such conditions, machine learning algorithms learn sub-optimal models which generally favor the class with the largest number of samples. In this talk, we review the different strategies available to learn a statistical model under these specific conditions. Then, we present the imbalanced-learn package and the new features that will be released in version 0.4.

Slides Package

CDS Pitching Day 2017

RAMP on predicting autism from resting-state functional MRI and anatomical MRI

Abstract: This talk will present the ongoing preparation of a RAMP aiming to distinguish subjects with Autism Spectrum Disorder (ASD) from typical control subjects. This analysis will use the Autism Brain Imaging Data Exchange (ABIDE I & II) database and data from the Robert Debre Hospital, based on R-fMRI and anatomical MRI. We will particularly focus on presenting the problem, the typical pipeline used to address it, and the current status of this RAMP. This work is in collaboration with the Pasteur Institute (Neuroanatomy group of the Unit of Human Genetics and Cognitive Functions).

Slides

EuroSciPy 2017

Leverage knowledge from under-represented classes in machine learning: imbalanced-learn release 0.3.0

Abstract: The curse of imbalanced data sets refers to data sets in which the number of samples in one class is lower than in the others. This issue is often encountered in real-world data sets such as medical imaging applications (e.g. cancer detection), fraud detection, etc. Under such conditions, machine learning algorithms learn sub-optimal models which generally favor the class with the largest number of samples. In this talk, we present the new features available in release 0.3.0.

Slides Package

PyParis 2017

Leverage knowledge from under-represented classes in machine learning: an introduction to imbalanced-learn

Abstract: The curse of imbalanced data sets refers to data sets in which the number of samples in one class is lower than in the others. This issue is often encountered in real-world data sets such as medical imaging applications (e.g. cancer detection), fraud detection, etc. Under such conditions, machine learning algorithms learn sub-optimal models which generally favor the class with the largest number of samples. In this talk, we will present the imbalanced-learn package, which implements some of the state-of-the-art algorithms tackling the class imbalance problem.
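
A minimal sketch of the package's core pattern, on assumed toy data: placing the sampler inside an imblearn pipeline so that resampling happens only on the training folds during cross-validation:

```python
# Hedged sketch: over-sample the minority class with SMOTE inside an
# imblearn pipeline; the synthetic data below is illustrative.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=5_000, weights=[0.95], random_state=0)

# The sampler is applied to the training folds only, never to the test folds.
model = make_pipeline(SMOTE(random_state=0), LogisticRegression())
cv_results = cross_validate(model, X, y, scoring="balanced_accuracy")
print(cv_results["test_score"].mean())
```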

Slides Package