scikit-learn is built around one object, the estimator, and three verbs: .fit(X, y) learns from data, .predict(X) applies what was learned, and .transform(X) reshapes features. Everything else, including pipelines, cross-validation, and grid search, composes those same objects. The library is built on top of NumPy: X is a 2-D array or pandas DataFrame of shape (n_samples, n_features), and y is the 1-D target. The daily supervised-learning loop is simple: get and split data, preprocess, pick and fit an estimator, bundle everything into a Pipeline so each preprocessing step is fit only on the training folds, then let cross-validation estimate honest performance and GridSearchCV pick the hyperparameters. This cheatsheet walks that loop end to end across eight panels.
Data: load & split
scikit-learn speaks NumPy arrays and pandas DataFrames shaped (n_samples, n_features) for X and a 1-D y. Bundled loaders, the OpenML fetcher, and make_* generators get you data fast. Split off a test set immediately with train_test_split (use stratify=y for classification) and do not look at it again until final evaluation.
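To make the effect of stratify=y concrete, here is a minimal sketch on synthetic data. The 90/10 class weights, sample count, and variable names are assumptions chosen for illustration, not part of the cheatsheet's own datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced data (roughly 90% / 10%); illustrative only.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

# Stratified 80/20 hold-out: each split keeps about the same class mix as y.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Share of the minority class in the full data and in each split.
full_rate = np.mean(y == 1)
train_rate = np.mean(y_tr == 1)
test_rate = np.mean(y_te == 1)
print(f"minority share: full={full_rate:.3f} "
      f"train={train_rate:.3f} test={test_rate:.3f}")
```

The three printed shares should agree closely. Without stratify=y, a small test set can end up with noticeably more or fewer minority samples than the training set.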
from sklearn.datasets import load_iris, fetch_openml, make_classification
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True, as_frame=True) # toy dataset as a DataFrame
X, y = fetch_openml("titanic", version=1, return_X_y=True, as_frame=True) # real dataset by name
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=3) # synthetic data for demos
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0) # stratified 80/20 hold-out
Preprocess: scale, encode, impute
Transformers learn statistics from the training data in .fit, then apply them with .transform (fit_transform does both). This is what keeps test data from leaking into training. Scale numeric features, one-hot encode categoricals, and impute missing values, then call set_output(transform="pandas") to keep column names.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
StandardScaler().fit_transform(X_tr) # standardize -> mean 0, std 1
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False) # one-hot encode categoricals
SimpleImputer(strategy="median") # fill missing values
enc.get_feature_names_out() # output column names (call after enc.fit)
StandardScaler().set_output(transform="pandas") # return a DataFrame, not an array
See preprocessing data, imputation, and set_output.
Estimator: fit, predict, score
Every model is an estimator with the same interface: .fit(X, y) learns, .predict(X) applies, classifiers add .predict_proba, and .score returns a sensible default (accuracy for classifiers, R^2 for regressors). Because the interface is uniform, swapping algorithms is a one-line change.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr) # pick & train
y_pred = clf.predict(X_te) # predict labels on new data
proba = clf.predict_proba(X_te) # predict class probabilities (rows sum to 1.0)
clf.score(X_te, y_te) # default score (accuracy for clf, R^2 for reg)
SVC(kernel="rbf") # swap in another algorithm, same fit/predict
See supervised learning and the glossary.
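Because the interface is uniform, comparing algorithms can be a short loop. The sketch below assumes the bundled iris data and three arbitrarily chosen classifiers; the dictionary keys and hyperparameters are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Three different algorithms behind the same fit / predict / score interface.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svc_rbf": SVC(kernel="rbf"),
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                      # learn from the training split
    y_pred = model.predict(X_te)               # predict labels for held-out rows
    results[name] = model.score(X_te, y_te)    # default score: accuracy
    print(f"{name}: accuracy={results[name]:.3f}, n_predictions={len(y_pred)}")
```

Only the constructor line differs between models; the training and evaluation code is shared.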
Pipeline: chain steps safely
A Pipeline glues transformers and a final estimator into one object so the transforms are fit only on training folds, which prevents the most common form of data leakage. ColumnTransformer (with make_column_selector) sends numeric and categorical columns down separate branches before they merge.
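As a sketch of the two-branch routing, the example below builds a tiny DataFrame and sends numeric and non-numeric columns through separate transformers. The column names, values, and target are hypothetical and exist only for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [22, 35, 58, 41, 19, 63, 30, 47],
    "income": [28.0, 52.0, 91.0, 60.0, 21.0, 75.0, 44.0, 66.0],
    "city": ["A", "B", "A", "C", "B", "C", "A", "B"],
})
target = [0, 0, 1, 1, 0, 1, 0, 1]

pre = ColumnTransformer([
    # Numeric columns are standardized.
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
    # Everything else is one-hot encoded; unseen categories become all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude="number")),
])
pipe = make_pipeline(pre, LogisticRegression(max_iter=1000))
pipe.fit(df, target)

# 2 scaled numeric columns + 3 one-hot columns (cities A, B, C) = 5 features.
n_features_out = pipe.named_steps["columntransformer"].transform(df).shape[1]
print(n_features_out)

# A city never seen during fit is encoded as zeros instead of raising an error.
new = pd.DataFrame({"age": [33], "income": [50.0], "city": ["Z"]})
new_pred = pipe.predict(new)
print(new_pred)
```

make_pipeline names each step after its lowercased class name, which is why the preprocessing step is reachable as named_steps["columntransformer"].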
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)) # chain a scaler + model
ColumnTransformer([("num", StandardScaler(), num_cols),
                   ("cat", OneHotEncoder(), cat_cols)]) # route numeric vs categorical columns
make_column_selector(dtype_include="number") # auto-select columns by dtype
pipe.fit(X_tr, y_tr) # fit the whole chain at once
pipe.named_steps["logisticregression"].coef_ # reach into a named step
Validate & tune
Cross-validation rotates which fold is held out and averages the scores, giving a more honest estimate than a single split. Pass the whole pipeline so preprocessing is re-fit inside every fold. GridSearchCV and RandomizedSearchCV wrap that loop, search the hyperparameter space (use the step__param double-underscore naming), and refit the winner.
from sklearn.model_selection import (cross_val_score, cross_validate,
                                     GridSearchCV, RandomizedSearchCV)
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy") # K-fold CV score -> mean +/- std
cross_validate(pipe, X, y, cv=5, scoring=["accuracy", "f1_macro"],
               return_train_score=True) # multiple metrics + train scores
gs = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]},
                  cv=5).fit(X, y) # grid search over a pipeline
gs.best_params_, gs.best_score_ # read the winner, e.g. {"logisticregression__C": 10}, 0.97
RandomizedSearchCV(pipe, param_distributions, n_iter=20, cv=5) # randomized search (big spaces)
See cross-validation and tuning hyper-parameters.
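The following sketch runs the validate-and-tune loop end to end on the bundled iris data. The search ranges, n_iter, and the use of scipy.stats.loguniform for param_distributions are assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# The scaler is re-fit inside every fold because the whole pipeline is passed.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Grid search: "<step name>__<parameter>" addresses a parameter inside a step.
gs = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=5).fit(X, y)
print(gs.best_params_, round(gs.best_score_, 3))

# Randomized search: sample C from a log-uniform distribution instead of a grid.
param_distributions = {"logisticregression__C": loguniform(1e-2, 1e2)}
rs = RandomizedSearchCV(pipe, param_distributions, n_iter=8, cv=5,
                        random_state=0).fit(X, y)
print(rs.best_params_, round(rs.best_score_, 3))

# With the default refit=True, best_estimator_ is already retrained on all of X, y.
best_C = gs.best_estimator_.named_steps["logisticregression"].C
```

A distribution is usually a better fit than a grid when a hyperparameter spans several orders of magnitude, since the same budget of fits covers the range more evenly.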
Metrics & evaluation
Choose the metric for the question: accuracy can mislead on imbalanced classes, so lean on classification_report, the confusion matrix, and ROC AUC for classification. For regression use root_mean_squared_error and r2_score. Wrap any metric in make_scorer to drive model selection with it.
from sklearn.metrics import (classification_report, ConfusionMatrixDisplay,
                             roc_auc_score, root_mean_squared_error, r2_score,
                             make_scorer, f1_score)
print(classification_report(y_te, y_pred)) # per-class precision / recall / f1 / support
ConfusionMatrixDisplay.from_estimator(clf, X_te, y_te) # plotted confusion matrix
roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]) # ranking quality for a binary target (AUC = 0.98)
root_mean_squared_error(y_te, y_pred), r2_score(y_te, y_pred) # regression error (R^2 = 0.91)
make_scorer(f1_score, average="macro") # custom scorer for search
See metrics and scoring.
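To illustrate why accuracy can mislead on imbalanced classes, the sketch below compares a majority-class baseline with a real model. The synthetic 95/5 data and the choice of DummyClassifier as the baseline are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly 5% positives; illustrative only.
X, y = make_classification(n_samples=2000, n_features=10, n_clusters_per_class=1,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

estimators = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # always predicts class 0
    "logreg": LogisticRegression(max_iter=1000),
}

results = {}
for name, est in estimators.items():
    est.fit(X_tr, y_tr)
    y_pred = est.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_pred),
        "f1_macro": f1_score(y_te, y_pred, average="macro", zero_division=0),
        "roc_auc": roc_auc_score(y_te, est.predict_proba(X_te)[:, 1]),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

The baseline never finds a positive yet still scores high accuracy, while macro F1 and ROC AUC expose it as uninformative.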
Persist: save & load
Serialize the entire fitted pipeline with joblib so a separate process can load it and predict without the training data or the training code. The serialized format is not guaranteed to be compatible across versions, so record the versions reported by sklearn.show_versions() and pin them. Only unpickle models you trust.
import joblib
import sklearn
joblib.dump(pipe, "model.joblib") # save a fitted pipeline to disk
pipe = joblib.load("model.joblib") # load it back later
joblib.load("model.joblib").predict(X_new) # predict in a fresh process, no training data
sklearn.show_versions() # check installed versions -> pin these
See model persistence.
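A save-and-reload round trip might look like the sketch below. Storing the scikit-learn version next to the model in a dictionary is one possible convention assumed here, not something joblib or scikit-learn requires; the "model" and "sklearn_version" keys are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Bundle the fitted pipeline with the version it was trained under (assumed layout).
bundle = {"model": pipe, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(bundle, path)       # write to disk
    restored = joblib.load(path)    # read back; only load files you trust

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(pipe.predict(X), restored["model"].predict(X))
print(same, restored["sklearn_version"])
```

At load time, comparing the stored version with sklearn.__version__ gives an early warning before a version mismatch turns into a subtle prediction difference.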
Inspect & select
Open the box: tree models expose feature_importances_, while permutation_importance works for any fitted estimator, and SelectKBest, SelectFromModel, and PCA shrink the feature matrix. set_config(display="diagram") renders any pipeline as an inspectable HTML diagram.
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.decomposition import PCA
from sklearn import set_config
clf.feature_importances_ # tree feature importances (bar chart)
permutation_importance(clf, X_te, y_te, n_repeats=10) # model-agnostic importance
SelectKBest(f_classif, k=10).fit_transform(X, y) # keep the best K features
SelectFromModel(estimator, threshold="median") # select via a model's weights
PCA(n_components=2).fit_transform(X) # compress with PCA (var 0.93, 0.05)
set_config(display="diagram") # then display `pipe` as an HTML diagram
See inspection, feature selection, and decomposition / PCA.
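As an illustration of model-agnostic inspection, the sketch below builds synthetic data whose informative columns are known in advance and checks what permutation_importance and SelectKBest report. The dataset shape and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2;
# columns 3-7 are pure noise.
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]  # most important first
print("permutation ranking:", ranked.tolist())

# Univariate selection: keep the 3 columns with the highest ANOVA F-score.
kbest = SelectKBest(f_classif, k=3).fit(X_tr, y_tr)
X_te_small = kbest.transform(X_te)
print("SelectKBest kept:", kbest.get_support(indices=True).tolist())
```

Computing permutation importance on the test split rather than the training split measures what the model relies on to generalize, not what it memorized.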
Quick Reference
| Method | Used by | Does |
|---|---|---|
| .fit(X, y) | all estimators | Learn parameters from data |
| .predict(X) | classifiers, regressors, clusterers | Apply the model to new data |
| .predict_proba(X) | most classifiers | Per-class probabilities |
| .transform(X) | transformers | Reshape features |
| .fit_transform(X) | transformers | Fit then transform in one call |
| .score(X, y) | supervised estimators | Default metric (accuracy / R^2) |

| Task | Start here | Then try |
|---|---|---|
| Classification | LogisticRegression | RandomForestClassifier, HistGradientBoostingClassifier, SVC |
| Regression | Ridge | RandomForestRegressor, HistGradientBoostingRegressor |
| Clustering | KMeans | DBSCAN, AgglomerativeClustering |
| Dimensionality reduction | PCA | TruncatedSVD, UMAP (external) |

| Param | Meaning |
|---|---|
| random_state=0 | Reproducible results |
| n_jobs=-1 | Use all CPU cores |
| class_weight="balanced" | Reweight for imbalanced classes |
| max_iter | Iterations for iterative solvers |
| cv=5 | Number of cross-validation folds |
| scoring="f1_macro" | Metric driving CV / search |
Appendix: Sample Code
The whole workflow in one screen (the canonical pattern)
import joblib
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# 1. Data + held-out test set
X, y = fetch_openml("titanic", version=1, return_X_y=True, as_frame=True)
X = X.drop(columns=["name", "ticket", "cabin", "boat", "body", "home.dest"])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# 2. Preprocess numeric vs categorical columns
pre = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     selector(dtype_include="number")),
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     selector(dtype_include=["object", "category"])),
])
# 3+4. Pipeline = preprocessing + model
pipe = Pipeline([("pre", pre), ("clf", RandomForestClassifier(random_state=0))])
# 5. Tune with cross-validated grid search
grid = {"clf__n_estimators": [200, 400], "clf__max_depth": [None, 6, 12]}
search = GridSearchCV(pipe, grid, cv=5, scoring="f1_macro", n_jobs=-1)
search.fit(X_tr, y_tr)
print("best params:", search.best_params_)
# 6. Evaluate on the untouched test set
print(classification_report(y_te, search.predict(X_te)))
# 7. Persist the fitted pipeline
joblib.dump(search.best_estimator_, "titanic_model.joblib")
Reload and serve (separate process)
import joblib
import pandas as pd
model = joblib.load("titanic_model.joblib")
new = pd.DataFrame([{"pclass": 1, "sex": "female", "age": 29, "sibsp": 0,
                     "parch": 0, "fare": 211.3, "embarked": "S"}])
print(model.predict(new)) # predicted label, e.g. ['1'] (survived)
print(model.predict_proba(new)) # class probabilities
Reproducible environment header
import sklearn
sklearn.show_versions()
# Pin in pyproject / requirements, e.g.:
# scikit-learn==1.9.0
# numpy==2.4.6
# scipy==1.17.1
# joblib==1.5.3
References
scikit-learn documentation
- scikit-learn documentation home, getting started, and the API reference
- Dataset loading utilities and train_test_split
- Preprocessing data, imputation, and set_output
- Supervised learning and the glossary
- Pipelines and composite estimators
- Cross-validation and tuning hyper-parameters
- Metrics and scoring
- Model persistence
- Inspection, feature selection, and decomposition / PCA