pandas Cheatsheet – TheCoatlessProfessor

pandas is the standard library for tabular data in Python. A DataFrame is a dict of aligned, named columns (Series) sharing one row index, and almost every call returns a new frame rather than mutating in place, so you read pandas as a pipeline. The conventional import is import pandas as pd, and every command below assumes a DataFrame named df. This cheatsheet groups the daily workflow into eight areas, each paired with a small diagram. To install pandas, see the pip, uv, or conda sheets; for a faster columnar alternative, see Polars.

Download the full cheatsheet

All eight panels as one SVG (light or dark), or a print-ready multi-page PDF.

Light SVG Dark SVG Print PDF

Build a Frame

A DataFrame is just a set of named, equal-length columns (Series) sharing one row index, so you can build it from a dict of columns, a list of row records, or by reading a file. read_csv and read_parquet are the workhorses: CSV is universal and human-readable, Parquet is typed and far faster for anything large.

pandas build panel: from a dict, a Series, row records, read_csv, read_parquet, and an empty frame.

From dicts, lists, or files into a DataFrame.

pandas build panel: from a dict, a Series, row records, read_csv, read_parquet, and an empty frame.

From dicts, lists, or files into a DataFrame.

pd.DataFrame({"city": cities, "pop": pops})   # from a dict of columns
pd.Series([8.4, 3.9, 0.9], name="pop")        # one labeled column
pd.DataFrame.from_records(rows)               # from row records
pd.read_csv("cities.csv")                     # read a CSV
pd.read_parquet("cities.parquet")             # read Parquet (typed, fast)
pd.DataFrame(columns=["city", "pop"])         # empty frame with set columns

See intro to data structures and read_csv.

Look at It

Before you transform anything, look: head/tail peek at the edges, shape gives (rows, cols), and info plus dtypes tell you the schema and where nulls hide. describe summarizes the numbers and value_counts shows how a categorical column is distributed.

pandas look panel: head/tail, shape, info, dtypes, describe, and value_counts.

Shape, types, summary, a peek at the data.

pandas look panel: head/tail, shape, info, dtypes, describe, and value_counts.

Shape, types, summary, a peek at the data.

df.head(5)              # first n rows  (df.tail(5) for last n)
df.shape                # (rows, cols)
df.info()               # schema, dtypes, non-null counts
df.dtypes               # per-column dtypes
df.describe()           # numeric summary stats
df["region"].value_counts()   # count distinct values

See essential basic functionality.

Pick Rows and Columns

Selection has two doors. Use .loc for labels (column names, index values, boolean masks) and .iloc for integer positions. A boolean mask like df[df["pop"] > 5] is the everyday filter, and query is the same idea written as a readable string.

pandas select panel: select columns, .loc, .iloc, boolean mask, query, and isin.

Labels with .loc, positions with .iloc, masks for filtering.

pandas select panel: select columns, .loc, .iloc, boolean mask, query, and isin.

Labels with .loc, positions with .iloc, masks for filtering.

df[["city", "pop"]]                       # select columns by name
df.loc[df["pop"] > 5, "city"]             # by label (rows, cols)
df.iloc[0:5, 0:2]                         # by integer position
df[df["pop"] > 5]                         # filter with a boolean mask
df.query("pop > 5 and region == 'E'")     # filter with a query string
df[df["region"].isin(["E", "W"])]         # keep rows where value in a set

See indexing and selecting data.

Make New Columns

pandas favors returning a new frame over mutating in place, so assign adds columns as part of a chain, rename and astype adjust labels and types, and the .str and .dt accessors plus apply/map transform values without writing explicit loops.

pandas columns panel: assign, rename, astype, .str transform, apply, and drop.

Add, rename, retype, and transform values.

pandas columns panel: assign, rename, astype, .str transform, apply, and drop.

Add, rename, retype, and transform values.

df.assign(pop_k=df["pop"] * 1000)         # add or replace columns
df.rename(columns={"pop": "population"})   # rename columns
df.astype({"pop": "int64"})               # change a column's dtype
df["city"].str.upper()                    # elementwise string transform
df["pop"].apply(round)                    # map / apply a function
df.drop(columns=["notes"])                # drop columns

See assign and the accessors in the API reference.

Fix Missing & Messy Data

Missing values are NaN/NA; isna().sum() finds them, then dropna or fillna/ffill resolve them depending on whether the row is worth keeping. drop_duplicates removes repeated rows and sort_values orders by a key, both returning fresh frames.

pandas missing panel: isna().sum(), dropna, fillna, ffill, drop_duplicates, and sort_values.

Nulls, duplicates, and ordering.

pandas missing panel: isna().sum(), dropna, fillna, ffill, drop_duplicates, and sort_values.

Nulls, duplicates, and ordering.

df.isna().sum()                           # count nulls per column
df.dropna()                               # drop rows with any null
df.fillna(0)                              # fill nulls with a value
df.ffill()                                # forward-fill nulls
df.drop_duplicates()                      # remove duplicate rows
df.sort_values("pop", ascending=False)     # sort by column values

See working with missing data.

Group and Summarize

groupby follows a split-apply-combine model: split rows into groups by a key, apply an aggregation per group, and combine the results back into one table. Use agg with named outputs for tidy summaries, transform when you need the group stat broadcast back to every original row, and pivot_table/crosstab for spreadsheet-style summaries. The group key lands in the result’s index, so chain .reset_index() when you want it back as a plain column.

pandas groupby panel: groupby aggregate, named agg, size, transform, pivot_table, and crosstab.

Split into groups, aggregate, summarize into tables.

pandas groupby panel: groupby aggregate, named agg, size, transform, pivot_table, and crosstab.

Split into groups, aggregate, summarize into tables.

df.groupby("region")["pop"].sum()         # group then aggregate one column
df.groupby("region").agg(total=("pop", "sum"), n=("pop", "size"))   # named aggregations
df.groupby("region").size()               # count rows per group
df.groupby("region")["pop"].transform("mean")   # broadcast a group stat back
df.pivot_table(index="region", values="pop", aggfunc="mean")   # pivot summary
pd.crosstab(df["region"], df["size"])     # frequency cross-tab

See group by: split-apply-combine.

Reshape and Combine

merge joins frames on a shared key (think SQL joins via how=), while concat simply stacks frames by rows or columns. melt (wide to long) and pivot (long to wide) are inverses that switch between a tidy “one observation per row” layout and a human-friendly matrix.

pandas reshape panel: merge, concat rows, concat columns, melt, pivot, and explode.

Join frames, stack rows, wide-to-long and back.

pandas reshape panel: merge, concat rows, concat columns, melt, pivot, and explode.

Join frames, stack rows, wide-to-long and back.

pd.merge(left, right, on="id", how="left")    # join on a key column
pd.concat([df_jan, df_feb])                   # stack rows (same columns)
pd.concat([df_a, df_b], axis=1)               # glue columns side by side
df.melt(id_vars="city", var_name="year", value_name="pop")   # wide to long
df.pivot(index="city", columns="year", values="pop")         # long to wide
df.explode("tags")                            # expand list-valued cells

See reshaping and merge, join, concatenate.

Read, Write & Dates

Round-trip with to_csv/read_csv for portability and to_parquet/read_parquet when types and speed matter. Once a column is real datetimes via to_datetime, the .dt accessor exposes calendar parts and resample lets you re-bin a time series to a coarser frequency such as month-end ("ME").

pandas IO panel: to_csv, to_parquet, to_datetime, .dt accessor, date_range, and resample.

File I/O round-trips and datetime handling.

pandas IO panel: to_csv, to_parquet, to_datetime, .dt accessor, date_range, and resample.

File I/O round-trips and datetime handling.

df.to_csv("out.csv", index=False)         # write a CSV (no index col)
df.to_parquet("out.parquet")              # write Parquet (typed, fast)
pd.to_datetime(df["date"])                # parse strings to datetimes
df["date"].dt.year                        # pull date parts (.dt.month_name(), ...)
pd.date_range("2024-01-01", periods=12, freq="ME")   # a date range index
df.resample("ME").sum()                   # downsample a time series

See IO tools and time series.

Quick Reference

Selection cheats.
Goal	Call
Column by name	`df["pop"]` or `df[["city", "pop"]]`
Row/col by label	`df.loc[mask, "city"]`
Row/col by position	`df.iloc[0:5, 0:2]`
Single cell, fast	`df.at[0, "city"]` / `df.iat[0, 0]`
Filter rows	`df[df["pop"] > 5]` / `df.query("pop > 5")`

Common dtypes (pandas 3.0).
dtype	Holds
`int64` / `float64`	Whole numbers / decimals
`str`	Text (the default string dtype in pandas 3.0)
`bool`	True / False
`datetime64[us]`	Timestamps (microsecond resolution in 3.0)
`category`	Repeated labels stored compactly

Datetime offset aliases (pandas 3.0).
Alias	Meaning
`D`	Calendar day
`h`	Hour (lowercase in 3.0)
`ME`	Month end (replaces deprecated `M`)
`MS`	Month start
`YE`	Year end (replaces deprecated `Y`/`A`)

Appendix: Sample Files

The toy data behind the diagrams, plus a pipeline that shows the chaining mental model.

`cities.csv`

city,region,pop,date
NY,E,8.4,2024-01-01
LA,W,3.9,2024-02-01
SF,W,0.9,2024-03-01
BO,E,0.7,2024-04-01

A daily-workflow pipeline

Because each call returns a new frame, a whole analysis reads as one chain:

import pandas as pd

summary = (
    pd.read_csv("cities.csv")
    .assign(date=lambda d: pd.to_datetime(d["date"]))   # text -> datetime
    .query("pop > 0.5")                                 # filter rows
    .groupby("region")                                  # split
    .agg(total_pop=("pop", "sum"), n=("city", "size"))  # apply + combine
    .sort_values("total_pop", ascending=False)          # order
)

summary.to_parquet("region_summary.parquet")            # typed, fast output

Wide vs long

# wide: one row per city, a column per year
wide = pd.DataFrame({"city": ["NY"], "y2020": [8.3], "y2021": [8.4]})

# long: tidy, one observation per row
long = wide.melt(id_vars="city", var_name="year", value_name="pop")

# round-trip back to wide
back = long.pivot(index="city", columns="year", values="pop")

On pandas 3.0, Copy-on-Write is always on: chained assignment like df[mask]["col"] = x silently does nothing, so assign with df.loc[mask, "col"] = x instead. Text columns default to the str dtype, and to_parquet/read_parquet need a separately-installed engine such as pyarrow.

References

pandas documentation

Related tools

Polars, a fast columnar DataFrame library
NumPy, the array library pandas is built on