statsmodels is the library for statistical inference in Python. Where scikit-learn is about prediction, fit, predict, score, cross-validate, statsmodels answers a different question: “is this coefficient different from zero, and by how much, with what uncertainty?” Where sklearn hands you .coef_ as a bare array, statsmodels hands you a full .summary() table with standard errors, t-statistics, p-values, confidence intervals, R^2, AIC, and BIC, plus the residual diagnostics and hypothesis tests that scipy and sklearn leave out. If you know lm(), glm(), and summary() from R, this is the same workflow in Python, formula strings and all. The recurring mental model in this sheet is one picture: a DataFrame plus a formula string ("y ~ x1 + x2") flow into a smf.*(...) spec, you call .fit() to get a green fitted results object, and every later panel is a different thing you read off that one box: .summary(), .params, .pvalues, .conf_int(), .resid, .predict(), and .get_prediction(). statsmodels has two front doors and you use both: import statsmodels.api as sm for the array classes, families, datasets, and graphics, and import statsmodels.formula.api as smf for the R-style formula functions (smf.ols, smf.glm, smf.logit). Everything here is verified against statsmodels 0.14.6 on CPython 3.12.
Fit a Model with the Formula API
You describe a model the way you would in R, as a string like "y ~ x1 + x2" where ~ separates the outcome from the predictors, then pass it and a DataFrame to smf.ols(...).fit(). The formula does real work: C(grp) expands a column into dummy variables, x1 * x2 adds an interaction, and np.log(x1) or I(x2**2) transforms a variable in place, so you rarely touch the design matrix by hand. There is also an arrays-in API, sm.OLS(y, X), but then you must add the intercept yourself with sm.add_constant.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
model = smf.ols("y ~ x1 + x2", data=df).fit() # ordinary least squares
smf.ols("y ~ x1 + C(grp)", data=df).fit() # categorical factor -> dummies
smf.ols("y ~ x1 * x2", data=df).fit() # x1 + x2 + x1:x2 (interaction)
smf.ols("y ~ np.log(x1) + I(x2**2)", data=df).fit() # transform inside the formula
smf.ols("y ~ x1", data=df).fit(cov_type="HC3") # heteroskedasticity-robust SE
sm.OLS(y, sm.add_constant(X)).fit() # arrays API: add intercept yourselfSee Fitting models using R-style formulas. The formula API adds the intercept automatically; the array API does not.
Read the Summary Table
The payoff of statsmodels is print(model.summary()), one printed report with three blocks: the fit statistics (R^2, AIC, F-statistic), the coefficient table (estimate, standard error, t-value, p-value, confidence interval for every term), and a diagnostics footer. Pull any piece programmatically with model.params, model.pvalues, model.conf_int(), and model.rsquared, or get the coefficient block as a tidy DataFrame via model.summary2().tables[1].
print(model.summary()) # full three-block printed report
model.params # coefficient estimates
model.pvalues # p-values (green < 0.05)
model.conf_int() # 95% confidence intervals
model.rsquared, model.rsquared_adj # goodness of fit
model.aic, model.bic # information criteria
model.summary2().tables[1] # coefficient table as a tidy DataFrameSee OLSResults. Standard errors live in model.bse, t-statistics in model.tvalues.
Residual Diagnostics
A p-value is only trustworthy if the model’s assumptions hold, so check them on the residuals: a residuals-vs-fitted plot should look like a flat patternless band, a Q-Q plot should hug the 45-degree line, and tests like Jarque-Bera (normality) and Breusch-Pagan (constant variance) put numbers on that. Watch two more things: variance inflation factors (VIF) above 5 to 10 signal multicollinearity, and Cook’s distance from model.get_influence() flags individual points that bend the fit.
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
sm.graphics.plot_regress_exog(model, "x1") # residuals-vs-fitted + partial plots
sm.qqplot(model.resid, line="45") # normal Q-Q of residuals
jarque_bera(model.resid) # normality: want p > 0.05
het_breuschpagan(model.resid, model.model.exog) # constant variance: want p > 0.05
variance_inflation_factor(X.values, i) # multicollinearity per predictor
model.get_influence().cooks_distance[0] # influential points (Cook's distance)See Regression diagnostics. Note the doubled .model in model.model.exog (the design matrix).
Logistic Regression & GLM
When the outcome is not continuous and normal, reach for a generalized linear model: smf.logit(...) for a binary outcome, or smf.glm(..., family=...) for the general case, where the family= argument (Binomial, Poisson, Gamma, Gaussian) picks the distribution and link. Logistic coefficients are log-odds, so exponentiate them with np.exp(model.params) to read odds ratios, and use get_margeff() to express effects on the probability scale instead.
logit = smf.logit("y ~ x1 + x2", data=df).fit() # binary outcome, sigmoid fit
np.exp(logit.params) # log-odds -> odds ratios
smf.glm("y ~ x1 + x2", data=df,
family=sm.families.Binomial()).fit() # logit is GLM + Binomial
smf.glm("count ~ x1", data=df,
family=sm.families.Poisson()).fit() # count data (Poisson)
logit.pred_table() # 2x2 confusion table at p = 0.5
logit.get_margeff().summary() # effect on probability, not log-oddsSee Generalized linear models. The family= socket picks the distribution and link function.
Time Series
Start by decomposing a series into trend, seasonal, and residual components with seasonal_decompose or the more robust STL, and test stationarity with the Augmented Dickey-Fuller test (adfuller) so you know how much to difference. Then fit ARIMA(y, order=(p,d,q)) for non-seasonal data or SARIMAX(y, order=..., seasonal_order=(P,D,Q,s)) when there is a seasonal cycle, and call get_forecast(steps=...) to project forward with confidence bands that widen into the future.
from statsmodels.tsa.seasonal import seasonal_decompose, STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
seasonal_decompose(y, model="additive", period=12) # trend / season / residual
STL(y, period=12).fit() # robust decomposition
adfuller(y) # stationarity: difference if p > 0.05
ARIMA(y, order=(1, 1, 1)).fit() # non-seasonal model
model = SARIMAX(y, order=(1, 1, 1),
seasonal_order=(1, 1, 0, 12)).fit() # seasonal model
fc = model.get_forecast(steps=12)
fc.conf_int() # widening forecast bandSee Time series analysis. Use the new tsa.arima.model.ARIMA; the old tsa.arima_model is removed.
Hypothesis Tests
Beyond regression, statsmodels carries the classical tests scipy and sklearn leave out or scatter around: two-sample ttest_ind, proportions_ztest, one-way ANOVA via anova_lm, and Wald / F-tests on linear restrictions of a fitted model (model.f_test("x1 = 0")). When you run many tests, control the false-discovery rate with multipletests(..., method="fdr_bh"), and follow a significant ANOVA with pairwise_tukeyhsd to see which groups actually differ.
from statsmodels.stats.weightstats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multitest import multipletests
from statsmodels.stats.multicomp import pairwise_tukeyhsd
ttest_ind(a, b) # two-sample t-test -> (t, p, df)
proportions_ztest([40, 30], [100, 100]) # two-proportion z-test -> (z, p)
anova_lm(model, typ=2) # one-way ANOVA table (F, PR(>F))
model.f_test("x1 = 0") # test a coefficient restriction
multipletests(pvals, method="fdr_bh") # correct for many comparisons
pairwise_tukeyhsd(y, groups) # post-hoc group comparisonsSee Statistics (statsmodels.stats). f_test and t_test accept R-style restriction strings.
Predict with Confidence Intervals
model.predict(new_df) gives point predictions, but the inference-first move is model.get_prediction(new_df), which returns an object whose summary_frame() reports two intervals: the narrow confidence interval for the mean and the wider prediction interval for a new observation. The alpha argument sets the width (alpha=0.05 for 95%), and for a GLM the prediction comes back on the response scale (probabilities, rates), not the link scale.
model.predict(new_df) # point predictions only
pred = model.get_prediction(new_df) # a PredictionResults object
pred.summary_frame(alpha=0.05) # mean_ci_* (narrow), obs_ci_* (wide)
pred.summary_frame(alpha=0.10) # 90% intervals: alpha sets the width
model.fittedvalues # in-sample fitted values
glm.get_prediction(new_df).summary_frame() # GLM: response scale, not link scaleSee OLSResults.get_prediction. The mean interval is narrower than the observation interval.
Which API Door?
statsmodels has two import surfaces and using both is normal: statsmodels.formula.api as smf holds the lowercase, R-style formula functions (ols, glm, logit, poisson) that take a string and a DataFrame, while statsmodels.api as sm holds the uppercase array classes (OLS, GLM), the GLM families, sample datasets, add_constant, and the graphics helpers. They fit the same models; pick smf when you have a DataFrame and want formulas, and sm when you already have NumPy design matrices or need a family, a dataset, or a diagnostic plot.
import statsmodels.formula.api as smf # formula + DataFrame: ols, glm, logit, poisson
import statsmodels.api as sm # arrays + extras: OLS, GLM, families, graphics
smf.ols("y ~ x", df) # auto-adds intercept
sm.OLS(y, sm.add_constant(X)) # same model; you add the intercept
sm.families.Binomial() # swappable GLM families
sm.families.Poisson(); sm.families.Gaussian(); sm.families.Gamma()
sm.datasets.get_rdataset("mtcars").data # R datasets, ready to model
sm.graphics.influence_plot(model) # diagnostic graphics off a fitted modelSee API reference. Formula functions are lowercase (smf.ols); array classes are uppercase (sm.OLS).
Quick Reference
| Outcome looks like | Use | Example |
|---|---|---|
| Continuous, roughly normal | smf.ols |
smf.ols("y ~ x1 + x2", df).fit() |
| Binary (0/1, yes/no) | smf.logit |
smf.logit("y ~ x1", df).fit() |
| Counts (0, 1, 2, …) | smf.poisson or smf.glm(... Poisson()) |
smf.glm("n ~ x", df, family=sm.families.Poisson()).fit() |
| Any GLM (custom family) | smf.glm |
smf.glm("y ~ x", df, family=sm.families.Gamma()).fit() |
| One time series | ARIMA / SARIMAX |
ARIMA(y, order=(1,1,1)).fit() |
| Attribute / call | Gives you |
|---|---|
model.summary() |
The full printed report |
model.params |
Coefficient estimates |
model.bse |
Standard errors |
model.tvalues / model.pvalues |
t-statistics and p-values |
model.conf_int() |
Confidence intervals |
model.rsquared / rsquared_adj |
R-squared (OLS) |
model.aic / model.bic |
Information criteria |
model.resid / model.fittedvalues |
Residuals and fitted values |
model.predict(new) |
Point predictions |
model.get_prediction(new).summary_frame() |
Predictions with mean and obs intervals |
| Syntax | Meaning |
|---|---|
y ~ x1 + x2 |
Outcome y on predictors x1, x2 (intercept added by default) |
y ~ x1 - 1 |
Drop the intercept |
C(grp) |
Treat grp as a categorical factor (dummy-coded) |
x1 : x2 |
Interaction term only |
x1 * x2 |
Main effects plus interaction (x1 + x2 + x1:x2) |
np.log(x1), I(x2**2) |
Transform a variable inside the formula |
| Question | Call |
|---|---|
| Are residuals normal? | jarque_bera(model.resid) |
| Is variance constant? | het_breuschpagan(model.resid, model.model.exog) |
| Autocorrelated residuals? | durbin_watson(model.resid) |
| Multicollinearity? | variance_inflation_factor(X.values, i) |
| Influential points? | model.get_influence().cooks_distance |
| Is the series stationary? | adfuller(y) |
| Do two means differ? | ttest_ind(a, b) |
| Do group means differ? | anova_lm(model, typ=2) |
Appendix: Sample Code
The formula to fit to summary loop
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
# A toy dataset
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["grp"] = rng.choice(["a", "b", "c"], size=n)
df["y"] = 2 + 1.5 * df.x1 - 0.7 * df.x2 + (df.grp == "b") + rng.normal(scale=0.5, size=n)
model = smf.ols("y ~ x1 + x2 + C(grp)", data=df).fit()
print(model.summary())
model.params # coefficient estimates
model.pvalues # p-values
model.conf_int() # 95% confidence intervals
model.rsquared # 0.916
model.aic, model.bic # information criteriaLogistic regression with odds ratios
import numpy as np
import statsmodels.formula.api as smf
logit = smf.logit("y ~ x1 + x2", data=df).fit()
print(logit.summary())
# Coefficients are log-odds; exponentiate for odds ratios
odds_ratios = np.exp(logit.params)
# How well does it classify?
logit.pred_table() # 2x2 confusion table at p = 0.5
logit.get_margeff().summary() # average marginal effects (on probability)Residual diagnostics in one place
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera, durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
model = smf.ols("y ~ x1 + x2", data=df).fit()
# Normality of residuals (Jarque-Bera): want p > 0.05
jb_stat, jb_p, skew, kurt = jarque_bera(model.resid)
# Constant variance (Breusch-Pagan): want p > 0.05
lm, lm_p, f, f_p = het_breuschpagan(model.resid, model.model.exog)
# Independence of residuals (~2.0 is ideal)
dw = durbin_watson(model.resid)
# Multicollinearity: VIF per predictor (watch for > 5 to 10)
X = sm.add_constant(df[["x1", "x2"]])
vifs = {col: variance_inflation_factor(X.values, i)
for i, col in enumerate(X.columns)}
# Influential points
cooks_d, _ = model.get_influence().cooks_distanceTime series: decompose, fit, forecast
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX
# A monthly series with trend + seasonality
idx = pd.date_range("2018-01-01", periods=120, freq="MS")
y = pd.Series(
np.linspace(10, 30, 120) + 5 * np.sin(2 * np.pi * np.arange(120) / 12)
+ np.random.default_rng(0).normal(size=120),
index=idx,
)
stl = STL(y, period=12).fit() # stl.trend, stl.seasonal, stl.resid
adf_stat, adf_p, *_ = adfuller(y) # stationarity test
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12)).fit(disp=0)
fc = model.get_forecast(steps=12)
fc.predicted_mean # 12 future points
fc.conf_int() # lower / upper confidence bandPredicting with confidence and prediction intervals
import pandas as pd
import statsmodels.formula.api as smf
model = smf.ols("y ~ x1 + x2", data=df).fit()
new = pd.DataFrame({"x1": [0.5, -1.0], "x2": [0.0, 1.0]})
model.predict(new) # point predictions only
pred = model.get_prediction(new)
frame = pred.summary_frame(alpha=0.05) # 95% intervals
# columns: mean, mean_se,
# mean_ci_lower, mean_ci_upper (interval for the mean)
# obs_ci_lower, obs_ci_upper (wider interval for a new observation)Behavior notes
- Mixing the case is the most common beginner error. The formula API functions are lowercase (
smf.ols,smf.glm,smf.logit); the array classes are uppercase (sm.OLS,sm.GLM,sm.Logit). - The formula API adds the intercept automatically; the array API does not. Wrap
Xinsm.add_constant(X)or your intercept is silently dropped. het_breuschpagantakes the design matrix, not the model object. Passmodel.model.exog(note the doubled.model) alongside the residuals.- Use the new ARIMA. Import from
statsmodels.tsa.arima.model; the oldstatsmodels.tsa.arima_model.ARIMA(andARMA) are removed and raiseNotImplementedError. For seasonal models useSARIMAX, not the legacy path. - Logistic coefficients are log-odds. Exponentiate with
np.exp(model.params)for odds ratios, or useget_margeff()to read effects on the probability scale. get_predictionbeatspredictfor inference.predictreturns point estimates;get_predictionreturns an object whosesummary_frame()carries both the mean interval and the wider observation interval.- patsy today, formulaic tomorrow. statsmodels installs patsy now and is migrating the formula engine to formulaic; the
"y ~ x1 + C(grp)"string syntax shown here is stable across both.
References
statsmodels documentation (stable)
- Documentation home and Getting started
- Fitting models using R-style formulas, API reference, Install
- Regression diagnostics, Statistics (statsmodels.stats)
Models and topics
- Linear regression (OLS), OLSResults, get_prediction
- Generalized linear models, Discrete choice (Logit, Poisson)
- Time series analysis, ARIMA, SARIMAX, Seasonal decomposition / STL
- ANOVA, Multiple testing
Project and related
- statsmodels on PyPI and on GitHub
- patsy formula language (the
~/C()/I()syntax)