statsmodels Cheatsheet – TheCoatlessProfessor

statsmodels is the library for statistical inference in Python. Where scikit-learn is about prediction, fit, predict, score, cross-validate, statsmodels answers a different question: “is this coefficient different from zero, and by how much, with what uncertainty?” Where sklearn hands you .coef_ as a bare array, statsmodels hands you a full .summary() table with standard errors, t-statistics, p-values, confidence intervals, R^2, AIC, and BIC, plus the residual diagnostics and hypothesis tests that scipy and sklearn leave out. If you know lm(), glm(), and summary() from R, this is the same workflow in Python, formula strings and all. The recurring mental model in this sheet is one picture: a DataFrame plus a formula string ("y ~ x1 + x2") flow into a smf.*(...) spec, you call .fit() to get a green fitted results object, and every later panel is a different thing you read off that one box: .summary(), .params, .pvalues, .conf_int(), .resid, .predict(), and .get_prediction(). statsmodels has two front doors and you use both: import statsmodels.api as sm for the array classes, families, datasets, and graphics, and import statsmodels.formula.api as smf for the R-style formula functions (smf.ols, smf.glm, smf.logit). Everything here is verified against statsmodels 0.14.6 on CPython 3.12.

Download the full cheatsheet

All eight panels as one SVG (light or dark), or a print-ready multi-page PDF.

Light SVG Dark SVG Print PDF

Fit a Model with the Formula API

You describe a model the way you would in R, as a string like "y ~ x1 + x2" where ~ separates the outcome from the predictors, then pass it and a DataFrame to smf.ols(...).fit(). The formula does real work: C(grp) expands a column into dummy variables, x1 * x2 adds an interaction, and np.log(x1) or I(x2**2) transforms a variable in place, so you rarely touch the design matrix by hand. There is also an arrays-in API, sm.OLS(y, X), but then you must add the intercept yourself with sm.add_constant.

statsmodels formula panel: fit OLS from a formula, add a categorical factor with C(), add an interaction with star, transform variables inline, robust standard errors with cov_type, the arrays API with add_constant.

A formula string plus a DataFrame, then .fit().

import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.ols("y ~ x1 + x2", data=df).fit()         # ordinary least squares
smf.ols("y ~ x1 + C(grp)", data=df).fit()             # categorical factor -> dummies
smf.ols("y ~ x1 * x2", data=df).fit()                 # x1 + x2 + x1:x2 (interaction)
smf.ols("y ~ np.log(x1) + I(x2**2)", data=df).fit()   # transform inside the formula
smf.ols("y ~ x1", data=df).fit(cov_type="HC3")        # heteroskedasticity-robust SE

sm.OLS(y, sm.add_constant(X)).fit()                   # arrays API: add intercept yourself

See Fitting models using R-style formulas. The formula API adds the intercept automatically; the array API does not.

Read the Summary Table

The payoff of statsmodels is print(model.summary()), one printed report with three blocks: the fit statistics (R^2, AIC, F-statistic), the coefficient table (estimate, standard error, t-value, p-value, confidence interval for every term), and a diagnostics footer. Pull any piece programmatically with model.params, model.pvalues, model.conf_int(), and model.rsquared, or get the coefficient block as a tidy DataFrame via model.summary2().tables[1].

statsmodels summary panel: print the full report, coefficient estimates with params, p-values, confidence intervals as a forest plot, goodness-of-fit gauges, the coefficient table as a tidy DataFrame.

One .summary() gives coefficients, p-values, R-squared, AIC.

print(model.summary())            # full three-block printed report

model.params                      # coefficient estimates
model.pvalues                     # p-values (green < 0.05)
model.conf_int()                  # 95% confidence intervals
model.rsquared, model.rsquared_adj   # goodness of fit
model.aic, model.bic              # information criteria
model.summary2().tables[1]        # coefficient table as a tidy DataFrame

See OLSResults. Standard errors live in model.bse, t-statistics in model.tvalues.

Residual Diagnostics

A p-value is only trustworthy if the model’s assumptions hold, so check them on the residuals: a residuals-vs-fitted plot should look like a flat patternless band, a Q-Q plot should hug the 45-degree line, and tests like Jarque-Bera (normality) and Breusch-Pagan (constant variance) put numbers on that. Watch two more things: variance inflation factors (VIF) above 5 to 10 signal multicollinearity, and Cook’s distance from model.get_influence() flags individual points that bend the fit.

statsmodels diagnostics panel: residuals-vs-fitted plot, normal Q-Q plot, Jarque-Bera normality test, Breusch-Pagan heteroskedasticity test, variance inflation factors, Cook's distance for influential points.

Check the assumptions behind the p-values.

from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

sm.graphics.plot_regress_exog(model, "x1")    # residuals-vs-fitted + partial plots
sm.qqplot(model.resid, line="45")             # normal Q-Q of residuals

jarque_bera(model.resid)                       # normality: want p > 0.05
het_breuschpagan(model.resid, model.model.exog)   # constant variance: want p > 0.05
variance_inflation_factor(X.values, i)         # multicollinearity per predictor
model.get_influence().cooks_distance[0]        # influential points (Cook's distance)

See Regression diagnostics. Note the doubled .model in model.model.exog (the design matrix).

Logistic Regression & GLM

When the outcome is not continuous and normal, reach for a generalized linear model: smf.logit(...) for a binary outcome, or smf.glm(..., family=...) for the general case, where the family= argument (Binomial, Poisson, Gamma, Gaussian) picks the distribution and link. Logistic coefficients are log-odds, so exponentiate them with np.exp(model.params) to read odds ratios, and use get_margeff() to express effects on the probability scale instead.

statsmodels logit and GLM panel: logistic regression with smf.logit, coefficients as odds ratios via exp, the same model as a GLM with a Binomial family, count data with a Poisson family, the classification table, marginal effects on probability.

Non-normal outcomes: binary, counts, rates.

logit = smf.logit("y ~ x1 + x2", data=df).fit()    # binary outcome, sigmoid fit
np.exp(logit.params)                                # log-odds -> odds ratios

smf.glm("y ~ x1 + x2", data=df,
        family=sm.families.Binomial()).fit()        # logit is GLM + Binomial
smf.glm("count ~ x1", data=df,
        family=sm.families.Poisson()).fit()         # count data (Poisson)

logit.pred_table()                                  # 2x2 confusion table at p = 0.5
logit.get_margeff().summary()                       # effect on probability, not log-odds

See Generalized linear models. The family= socket picks the distribution and link function.

Time Series

Start by decomposing a series into trend, seasonal, and residual components with seasonal_decompose or the more robust STL, and test stationarity with the Augmented Dickey-Fuller test (adfuller) so you know how much to difference. Then fit ARIMA(y, order=(p,d,q)) for non-seasonal data or SARIMAX(y, order=..., seasonal_order=(P,D,Q,s)) when there is a seasonal cycle, and call get_forecast(steps=...) to project forward with confidence bands that widen into the future.

statsmodels time series panel: seasonal decomposition into trend, season, and residual, robust STL decomposition, the augmented Dickey-Fuller stationarity test, fitting ARIMA, the seasonal SARIMAX model, and forecasting with widening confidence bands.

Decompose, fit ARIMA / SARIMAX, forecast.

from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX

seasonal_decompose(y, model="additive", period=12)   # trend / season / residual
STL(y, period=12).fit()                              # robust decomposition
adfuller(y)                                          # stationarity: difference if p > 0.05

ARIMA(y, order=(1, 1, 1)).fit()                      # non-seasonal model
model = SARIMAX(y, order=(1, 1, 1),
                seasonal_order=(1, 1, 0, 12)).fit()  # seasonal model
fc = model.get_forecast(steps=12)
fc.conf_int()                                        # widening forecast band

See Time series analysis. Use the new tsa.arima.model.ARIMA; the old tsa.arima_model is removed.

Hypothesis Tests

Beyond regression, statsmodels carries the classical tests scipy and sklearn leave out or scatter around: two-sample ttest_ind, proportions_ztest, one-way ANOVA via anova_lm, and Wald / F-tests on linear restrictions of a fitted model (model.f_test("x1 = 0")). When you run many tests, control the false-discovery rate with multipletests(..., method="fdr_bh"), and follow a significant ANOVA with pairwise_tukeyhsd to see which groups actually differ.

statsmodels tests panel: two-sample t-test, two-proportion z-test, one-way ANOVA from a model, an F-test on a coefficient restriction, multiple-comparison correction with fdr_bh, post-hoc Tukey group comparisons.

t-tests, ANOVA, proportions, and tests on coefficients.

from statsmodels.stats.weightstats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd

ttest_ind(a, b)                                  # two-sample t-test -> (t, p, df)
proportions_ztest([40, 30], [100, 100])          # two-proportion z-test -> (z, p)
anova_lm(model, typ=2)                           # one-way ANOVA table (F, PR(>F))
model.f_test("x1 = 0")                           # test a coefficient restriction
multipletests(pvals, method="fdr_bh")            # correct for many comparisons
pairwise_tukeyhsd(y, groups)                      # post-hoc group comparisons

See Statistics (statsmodels.stats). f_test and t_test accept R-style restriction strings.

Predict with Confidence Intervals

model.predict(new_df) gives point predictions, but the inference-first move is model.get_prediction(new_df), which returns an object whose summary_frame() reports two intervals: the narrow confidence interval for the mean and the wider prediction interval for a new observation. The alpha argument sets the width (alpha=0.05 for 95%), and for a GLM the prediction comes back on the response scale (probabilities, rates), not the link scale.

statsmodels predict panel: point predictions on new data, richer prediction with get_prediction, mean versus observation intervals in summary_frame, tighter or looser intervals via alpha, in-sample fitted values, predicting from a GLM on the response scale.

Point predictions plus mean and observation intervals.

model.predict(new_df)                       # point predictions only

pred = model.get_prediction(new_df)         # a PredictionResults object
pred.summary_frame(alpha=0.05)              # mean_ci_* (narrow), obs_ci_* (wide)
pred.summary_frame(alpha=0.10)              # 90% intervals: alpha sets the width

model.fittedvalues                          # in-sample fitted values
glm.get_prediction(new_df).summary_frame()  # GLM: response scale, not link scale

See OLSResults.get_prediction. The mean interval is narrower than the observation interval.

Which API Door?

statsmodels has two import surfaces and using both is normal: statsmodels.formula.api as smf holds the lowercase, R-style formula functions (ols, glm, logit, poisson) that take a string and a DataFrame, while statsmodels.api as sm holds the uppercase array classes (OLS, GLM), the GLM families, sample datasets, add_constant, and the graphics helpers. They fit the same models; pick smf when you have a DataFrame and want formulas, and sm when you already have NumPy design matrices or need a family, a dataset, or a diagnostic plot.

statsmodels API map panel: the formula door smf with lowercase functions, the array door sm with uppercase classes, two spellings of the same model converging, picking a GLM family, loading a sample dataset, and built-in diagnostic graphics.

smf for formulas, sm for arrays and families.

import statsmodels.formula.api as smf   # formula + DataFrame: ols, glm, logit, poisson
import statsmodels.api as sm            # arrays + extras: OLS, GLM, families, graphics

smf.ols("y ~ x", df)                    # auto-adds intercept
sm.OLS(y, sm.add_constant(X))           # same model; you add the intercept

sm.families.Binomial()                  # swappable GLM families
sm.families.Poisson(); sm.families.Gaussian(); sm.families.Gamma()

sm.datasets.get_rdataset("mtcars").data   # R datasets, ready to model
sm.graphics.influence_plot(model)         # diagnostic graphics off a fitted model

See API reference. Formula functions are lowercase (smf.ols); array classes are uppercase (sm.OLS).

Quick Reference

Choosing a statsmodels model.
Outcome looks like	Use	Example
Continuous, roughly normal	`smf.ols`	`smf.ols("y ~ x1 + x2", df).fit()`
Binary (0/1, yes/no)	`smf.logit`	`smf.logit("y ~ x1", df).fit()`
Counts (0, 1, 2, …)	`smf.poisson` or `smf.glm(... Poisson())`	`smf.glm("n ~ x", df, family=sm.families.Poisson()).fit()`
Any GLM (custom family)	`smf.glm`	`smf.glm("y ~ x", df, family=sm.families.Gamma()).fit()`
One time series	`ARIMA` / `SARIMAX`	`ARIMA(y, order=(1,1,1)).fit()`

What a results object exposes.
Attribute / call	Gives you
`model.summary()`	The full printed report
`model.params`	Coefficient estimates
`model.bse`	Standard errors
`model.tvalues` / `model.pvalues`	t-statistics and p-values
`model.conf_int()`	Confidence intervals
`model.rsquared` / `rsquared_adj`	R-squared (OLS)
`model.aic` / `model.bic`	Information criteria
`model.resid` / `model.fittedvalues`	Residuals and fitted values
`model.predict(new)`	Point predictions
`model.get_prediction(new).summary_frame()`	Predictions with mean and obs intervals

patsy formula operators.
Syntax	Meaning
`y ~ x1 + x2`	Outcome `y` on predictors `x1`, `x2` (intercept added by default)
`y ~ x1 - 1`	Drop the intercept
`C(grp)`	Treat `grp` as a categorical factor (dummy-coded)
`x1 : x2`	Interaction term only
`x1 * x2`	Main effects plus interaction (`x1 + x2 + x1:x2`)
`np.log(x1)`, `I(x2**2)`	Transform a variable inside the formula

Common statsmodels diagnostics and tests.
Question	Call
Are residuals normal?	`jarque_bera(model.resid)`
Is variance constant?	`het_breuschpagan(model.resid, model.model.exog)`
Autocorrelated residuals?	`durbin_watson(model.resid)`
Multicollinearity?	`variance_inflation_factor(X.values, i)`
Influential points?	`model.get_influence().cooks_distance`
Is the series stationary?	`adfuller(y)`
Do two means differ?	`ttest_ind(a, b)`
Do group means differ?	`anova_lm(model, typ=2)`

Appendix: Sample Code

The formula to fit to summary loop

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# A toy dataset
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["grp"] = rng.choice(["a", "b", "c"], size=n)
df["y"] = 2 + 1.5 * df.x1 - 0.7 * df.x2 + (df.grp == "b") + rng.normal(scale=0.5, size=n)

model = smf.ols("y ~ x1 + x2 + C(grp)", data=df).fit()
print(model.summary())

model.params          # coefficient estimates
model.pvalues         # p-values
model.conf_int()      # 95% confidence intervals
model.rsquared        # 0.916
model.aic, model.bic  # information criteria

Logistic regression with odds ratios

import numpy as np
import statsmodels.formula.api as smf

logit = smf.logit("y ~ x1 + x2", data=df).fit()
print(logit.summary())

# Coefficients are log-odds; exponentiate for odds ratios
odds_ratios = np.exp(logit.params)

# How well does it classify?
logit.pred_table()            # 2x2 confusion table at p = 0.5
logit.get_margeff().summary() # average marginal effects (on probability)

Residual diagnostics in one place

import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera, durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor

model = smf.ols("y ~ x1 + x2", data=df).fit()

# Normality of residuals (Jarque-Bera): want p > 0.05
jb_stat, jb_p, skew, kurt = jarque_bera(model.resid)

# Constant variance (Breusch-Pagan): want p > 0.05
lm, lm_p, f, f_p = het_breuschpagan(model.resid, model.model.exog)

# Independence of residuals (~2.0 is ideal)
dw = durbin_watson(model.resid)

# Multicollinearity: VIF per predictor (watch for > 5 to 10)
X = sm.add_constant(df[["x1", "x2"]])
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}

# Influential points
cooks_d, _ = model.get_influence().cooks_distance

Time series: decompose, fit, forecast

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

# A monthly series with trend + seasonality
idx = pd.date_range("2018-01-01", periods=120, freq="MS")
y = pd.Series(
    np.linspace(10, 30, 120) + 5 * np.sin(2 * np.pi * np.arange(120) / 12)
    + np.random.default_rng(0).normal(size=120),
    index=idx,
)

stl = STL(y, period=12).fit()       # stl.trend, stl.seasonal, stl.resid
adf_stat, adf_p, *_ = adfuller(y)   # stationarity test

model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12)).fit(disp=0)
fc = model.get_forecast(steps=12)
fc.predicted_mean                   # 12 future points
fc.conf_int()                       # lower / upper confidence band

Predicting with confidence and prediction intervals

import pandas as pd
import statsmodels.formula.api as smf

model = smf.ols("y ~ x1 + x2", data=df).fit()

new = pd.DataFrame({"x1": [0.5, -1.0], "x2": [0.0, 1.0]})

model.predict(new)                       # point predictions only

pred = model.get_prediction(new)
frame = pred.summary_frame(alpha=0.05)   # 95% intervals
# columns: mean, mean_se,
#          mean_ci_lower, mean_ci_upper   (interval for the mean)
#          obs_ci_lower,  obs_ci_upper    (wider interval for a new observation)

Behavior notes

Mixing the case is the most common beginner error. The formula API functions are lowercase (smf.ols, smf.glm, smf.logit); the array classes are uppercase (sm.OLS, sm.GLM, sm.Logit).
The formula API adds the intercept automatically; the array API does not. Wrap X in sm.add_constant(X) or your intercept is silently dropped.
het_breuschpagan takes the design matrix, not the model object. Pass model.model.exog (note the doubled .model) alongside the residuals.
Use the new ARIMA. Import from statsmodels.tsa.arima.model; the old statsmodels.tsa.arima_model.ARIMA (and ARMA) are removed and raise NotImplementedError. For seasonal models use SARIMAX, not the legacy path.
Logistic coefficients are log-odds. Exponentiate with np.exp(model.params) for odds ratios, or use get_margeff() to read effects on the probability scale.
get_prediction beats predict for inference. predict returns point estimates; get_prediction returns an object whose summary_frame() carries both the mean interval and the wider observation interval.
patsy today, formulaic tomorrow. statsmodels installs patsy now and is migrating the formula engine to formulaic; the "y ~ x1 + C(grp)" string syntax shown here is stable across both.

References

statsmodels documentation (stable)

Models and topics

Project and related

statsmodels on PyPI and on GitHub
patsy formula language (the ~ / C() / I() syntax)