scikit-learn Cheatsheet – TheCoatlessProfessor

scikit-learn is built around one object, the estimator, and three verbs: .fit(X, y) learns from data, .predict(X) applies what was learned, and .transform(X) reshapes features. Everything else, including pipelines, cross-validation, and grid search, composes those same objects. X is a 2-D NumPy array or pandas DataFrame of shape (n_samples, n_features), and y is the 1-D target. The daily supervised-learning loop is simple: load and split the data, preprocess, pick and fit an estimator, and bundle everything into a Pipeline so each preprocessing step is fit only on the training folds. Cross-validation then gives an honest performance estimate, and GridSearchCV picks the hyperparameters. This cheatsheet walks that loop end to end across eight panels.

Data: load & split

scikit-learn speaks NumPy arrays and pandas DataFrames shaped (n_samples, n_features) for X and a 1-D y. Bundled loaders, the OpenML fetcher, and make_* generators get you data fast. Split off a test set immediately with train_test_split (use stratify=y for classification) and do not look at it again until final evaluation.

scikit-learn data panel: load_iris, fetch_openml, make_classification, train_test_split with an 80/20 stratified split.

Get arrays in, hold out a test set you never touch while modeling.

from sklearn.datasets import load_iris, fetch_openml, make_classification
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)                       # toy dataset as a DataFrame
X, y = fetch_openml("titanic", version=1, return_X_y=True, as_frame=True)  # real dataset by name
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=3)   # synthetic data for demos
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)                  # stratified 80/20 hold-out

See dataset loading utilities and train_test_split.
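
To make the effect of stratify=y concrete, here is a minimal sketch on synthetic data. The 90/10 class weights, the sample count, and the helper name class_share are assumptions chosen for illustration, not part of the cheatsheet panel.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced data (roughly 90% / 10%); values are illustrative.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Stratified 80/20 hold-out: each split keeps (almost exactly) the same class mix.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def class_share(labels):
    """Fraction of samples that falls in each class (hypothetical helper)."""
    return np.bincount(labels) / len(labels)


print("full :", class_share(y))
print("train:", class_share(y_tr))
print("test :", class_share(y_te))
```

The three printed rows should agree to within rounding. Without stratify=y, a small test set can end up with noticeably fewer minority-class samples than the full data.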

Preprocess: scale, encode, impute

Transformers learn statistics from the training data in .fit, then apply them with .transform (fit_transform does both). This is what keeps test data from leaking into training. Scale numeric features, one-hot encode categoricals, and impute missing values, then call set_output(transform="pandas") to keep column names.

scikit-learn preprocessing panel: StandardScaler, OneHotEncoder, SimpleImputer, get_feature_names_out, set_output pandas.

Transformers learn statistics from training data, then reshape any data.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

StandardScaler().fit_transform(X_tr)                          # standardize -> mean 0, std 1
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)   # one-hot encode categoricals
SimpleImputer(strategy="median")                              # fill missing values
enc.fit(X_tr[cat_cols]).get_feature_names_out()               # names of the generated columns (cat_cols = your categorical column names)
StandardScaler().set_output(transform="pandas")               # return a DataFrame, not an array

See preprocessing data, imputation, and set_output.

Estimator: fit, predict, score

Every model is an estimator with the same interface: .fit(X, y) learns, .predict(X) applies, classifiers add .predict_proba, and .score returns a sensible default (accuracy for classifiers, R^2 for regressors). Because the interface is uniform, swapping algorithms is a one-line change.

scikit-learn estimator panel: RandomForestClassifier fit, predict, predict_proba, score, swapping in SVC.

One interface for every model: fit learns, predict applies, score reports.

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)  # pick & train
y_pred = clf.predict(X_te)            # predict labels on new data
proba = clf.predict_proba(X_te)      # predict class probabilities (rows sum to 1.0)
clf.score(X_te, y_te)                # default score (accuracy for clf, R^2 for reg)
SVC(kernel="rbf")                    # swap in another algorithm, same fit/predict

See supervised learning and the glossary.
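
One way to see the uniform interface at work is to loop over several classifiers with identical code. This is a sketch only: the choice of models, the dictionary keys, and the iris data are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Three different algorithms, one shared fit/score interface.
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc": SVC(kernel="rbf"),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                      # same call for every estimator
    accuracy[name] = model.score(X_te, y_te)   # default score: accuracy
    print(f"{name}: {accuracy[name]:.3f}")
```

Swapping an algorithm in or out only changes the dictionary; the training and scoring loop stays untouched.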

Pipeline: chain steps safely

A Pipeline glues transformers and a final estimator into one object so the transforms are fit only on training folds. That prevents a common form of data leakage: preprocessing statistics computed on data the model is later evaluated on. ColumnTransformer (with make_column_selector) sends numeric and categorical columns down separate branches before they merge.

scikit-learn pipeline panel: make_pipeline, ColumnTransformer, make_column_selector, pipe.fit, named_steps.

Wrap preprocessing and a model in one estimator that travels together.

from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))  # chain a scaler + model
ColumnTransformer([("num", StandardScaler(), num_cols),
                   ("cat", OneHotEncoder(), cat_cols)])   # route numeric vs categorical columns
make_column_selector(dtype_include="number")             # auto-select columns by dtype
pipe.fit(X_tr, y_tr)                                      # fit the whole chain at once
pipe.named_steps["logisticregression"].coef_             # reach into a named step

See pipelines and composite estimators.

Validate & tune

Cross-validation rotates which fold is held out and averages the scores, giving a more honest estimate than a single split. Pass the whole pipeline so preprocessing is re-fit inside every fold. GridSearchCV and RandomizedSearchCV wrap that loop, search the hyperparameter space (use the step__param double-underscore naming), and refit the winner.

scikit-learn validation panel: cross_val_score, cross_validate, GridSearchCV, best_params_, RandomizedSearchCV.

Rotate the held-out fold for honest scores, then search and refit the best config.

from sklearn.model_selection import (cross_val_score, cross_validate,
                                     GridSearchCV, RandomizedSearchCV)

scores = cross_val_score(pipe, X_tr, y_tr, cv=5, scoring="accuracy")   # K-fold CV on the training split -> mean +/- std
cross_validate(pipe, X_tr, y_tr, cv=5, scoring=["accuracy", "f1_macro"],
               return_train_score=True)                                # multiple metrics + train scores
gs = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]},
                  cv=5).fit(X_tr, y_tr)                                 # grid search over a pipeline; test set stays held out
gs.best_params_, gs.best_score_                                   # read the winner, e.g. {"logisticregression__C": 10}, 0.97
RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5)   # randomized search (big spaces)

See cross-validation and tuning hyper-parameters.
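
The RandomizedSearchCV line above leaves param_distributions undefined. A minimal sketch of one way to fill it in follows, assuming a log-uniform prior over C from scipy.stats; the range 1e-3 to 1e2 and the iris data are illustrative choices, not recommendations.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Baseline: the scaler is re-fit inside each of the 5 folds.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"baseline: {scores.mean():.3f} +/- {scores.std():.3f}")

# step__param naming: "logisticregression" is the step, "C" is its parameter.
param_distributions = {"logisticregression__C": loguniform(1e-3, 1e2)}
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=20, cv=5, random_state=0)
search.fit(X, y)   # samples 20 candidates, then refits the best on all of X, y

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

After fitting, search behaves like the refit best pipeline, so search.predict(...) can be called directly.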

Metrics & evaluation

Choose the metric for the question: accuracy can mislead on imbalanced classes, so lean on classification_report, the confusion matrix, and ROC AUC for classification. For regression use root_mean_squared_error and r2_score. Wrap any metric in make_scorer to drive model selection with it.

scikit-learn metrics panel: classification_report, ConfusionMatrixDisplay, roc_auc_score, root_mean_squared_error, r2_score, make_scorer.

Pick the metric for the task: reports, confusion matrices, AUC, RMSE, and R^2.

from sklearn.metrics import (classification_report, ConfusionMatrixDisplay,
                             roc_auc_score, root_mean_squared_error, r2_score,
                             make_scorer, f1_score)

print(classification_report(y_te, y_pred))               # per-class precision / recall / f1 / support
ConfusionMatrixDisplay.from_estimator(clf, X_te, y_te)   # plotted confusion matrix
roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])       # ranking quality, binary case (e.g. AUC = 0.98)
root_mean_squared_error(y_te, y_pred)                    # regression error, in target units
r2_score(y_te, y_pred)                                   # regression fit (e.g. R^2 = 0.91)
make_scorer(f1_score, average="macro")                   # custom scorer for search

See metrics and scoring.
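
To illustrate why accuracy can mislead on imbalanced classes, the sketch below scores a model that always predicts the majority class. The 95/5 class weights and the use of DummyClassifier as a stand-in model are assumptions made for the demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with a rare positive class (about 5%); values are illustrative.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# A "model" that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
y_pred = baseline.predict(X_te)

acc = accuracy_score(y_te, y_pred)
macro_f1 = f1_score(y_te, y_pred, average="macro", zero_division=0)
print(f"accuracy: {acc:.3f}")       # looks strong
print(f"macro F1: {macro_f1:.3f}")  # exposes the ignored minority class

# The same metric wrapped as a scorer, ready for cross-validation or search.
macro_f1_scorer = make_scorer(f1_score, average="macro", zero_division=0)
cv_f1 = cross_val_score(baseline, X, y, cv=5, scoring=macro_f1_scorer)
print(cv_f1.mean())
```

The scorer object can be passed as scoring= to GridSearchCV or RandomizedSearchCV in the same way.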

Persist: save & load

Serialize the entire fitted pipeline with joblib so a separate process can load and predict without the training data or code path. The serialized format is not guaranteed across versions, so pin and record them via sklearn.show_versions(). Only unpickle models you trust.

scikit-learn persistence panel: joblib.dump, joblib.load, load-and-predict in a fresh process, show_versions.

Train once, serve many times: dump the fitted pipeline, load it elsewhere.

import joblib
import sklearn

joblib.dump(pipe, "model.joblib")                  # save a fitted pipeline to disk
pipe = joblib.load("model.joblib")                 # load it back later
joblib.load("model.joblib").predict(X_new)         # predict in a fresh process, no training data
sklearn.show_versions()                            # check installed versions -> pin these

See model persistence.
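
A quick way to sanity-check persistence is a dump-and-load round trip in the same session. The sketch below writes to a temporary directory so nothing is left on disk; the file name and the iris pipeline are illustrative.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Round trip through a temporary file; the directory is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored object is a separate copy that predicts identically.
same = np.array_equal(pipe.predict(X), restored.predict(X))
print(same)
```

This only confirms the round trip within one environment. Loading in a different process still depends on matching library versions, as noted above.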

Inspect & select

Open the box: tree models expose feature_importances_, while permutation_importance works for any fitted estimator, and SelectKBest, SelectFromModel, and PCA shrink the feature matrix. set_config(display="diagram") renders any pipeline as an inspectable HTML diagram.

scikit-learn inspection panel: feature_importances_, permutation_importance, SelectKBest, SelectFromModel, PCA, set_config diagram.

Understand what the model uses and trim the inputs.

from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.decomposition import PCA
from sklearn import set_config

clf.feature_importances_                              # tree feature importances (bar chart)
permutation_importance(clf, X_te, y_te, n_repeats=10)  # model-agnostic importance
SelectKBest(f_classif, k=10).fit_transform(X, y)     # keep the best K features
SelectFromModel(clf, threshold="median")             # select via a model's importances or coefficients
PCA(n_components=2).fit_transform(X)                  # compress with PCA (var 0.93, 0.05)
set_config(display="diagram")                        # then display `pipe` as an HTML diagram

See inspection, feature selection, and decomposition / PCA.
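
The inspection calls return structured results rather than plots. The following sketch, run on the iris data as an assumed example, shows which attributes hold the numbers.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come from the fitted trees and sum to 1.
print(dict(zip(X.columns, clf.feature_importances_.round(3))))

# Permutation importance: the drop in score when one column is shuffled,
# measured on held-out data and repeated n_repeats times per feature.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
for name, mean, std in zip(X.columns, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")

# PCA reports the share of variance captured by each kept component.
pca = PCA(n_components=2).fit(X_tr)
print(pca.explained_variance_ratio_)
```

Permutation importance is computed on the test split here so that it reflects what the model relies on for unseen data, not what it memorized.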

Quick Reference

The estimator interface, the universal verbs.
Method	Used by	Does
`.fit(X, y)`	all estimators	Learn parameters from data
`.predict(X)`	classifiers, regressors, clusterers	Apply the model to new data
`.predict_proba(X)`	most classifiers	Per-class probabilities
`.transform(X)`	transformers	Reshape features
`.fit_transform(X)`	transformers	Fit then transform in one call
`.score(X, y)`	supervised estimators	Default metric (accuracy / R^2)

Pick a model, daily-use defaults.
Task	Start here	Then try
Classification	`LogisticRegression`	`RandomForestClassifier`, `HistGradientBoostingClassifier`, `SVC`
Regression	`Ridge`	`RandomForestRegressor`, `HistGradientBoostingRegressor`
Clustering	`KMeans`	`DBSCAN`, `AgglomerativeClustering`
Dimensionality reduction	`PCA`	`TruncatedSVD`, `UMAP` (external)

Common constructor params.
Param	Meaning
`random_state=0`	Reproducible results
`n_jobs=-1`	Use all CPU cores
`class_weight="balanced"`	Reweight for imbalanced classes
`max_iter`	Iterations for iterative solvers
`cv=5`	Number of cross-validation folds
`scoring="f1_macro"`	Metric driving CV / search

Appendix: Sample Code

The whole workflow in one screen (the canonical pattern)

import joblib
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Data + held-out test set
X, y = fetch_openml("titanic", version=1, return_X_y=True, as_frame=True)
X = X.drop(columns=["name", "ticket", "cabin", "boat", "body", "home.dest"])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2. Preprocess numeric vs categorical columns
pre = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
        selector(dtype_include="number")),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
        OneHotEncoder(handle_unknown="ignore")),
        selector(dtype_include=["object", "category"])),
])

# 3+4. Pipeline = preprocessing + model
pipe = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])

# 5. Tune with cross-validated grid search
grid = {"clf__n_estimators": [200, 400], "clf__max_depth": [None, 6, 12]}
search = GridSearchCV(pipe, grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)

# 6. Evaluate on the untouched test set
print(classification_report(y_te, search.predict(X_te)))

# 7. Persist the fitted pipeline
joblib.dump(search.best_estimator_, "titanic_model.joblib")

Reload and serve (separate process)

import joblib
import pandas as pd

model = joblib.load("titanic_model.joblib")
new = pd.DataFrame([{ "pclass": 1, "sex": "female", "age": 29, "sibsp": 0,
                      "parch": 0, "fare": 211.3, "embarked": "S" }])
print(model.predict(new))          # e.g. ['1']  (survived)
print(model.predict_proba(new))    # class probabilities

Reproducible environment header

import sklearn
sklearn.show_versions()
# Pin in pyproject / requirements, e.g.:
#   scikit-learn==1.9.0
#   numpy==2.4.6
#   scipy==1.17.1
#   joblib==1.5.3

References

scikit-learn documentation

scikit-learn documentation home, getting started, and the API reference
Dataset loading utilities and train_test_split
Preprocessing data, imputation, and set_output
Supervised learning and the glossary
Pipelines and composite estimators
Cross-validation and tuning hyper-parameters
Metrics and scoring
Model persistence
Inspection, feature selection, and decomposition / PCA