PyTorch Cheatsheet – TheCoatlessProfessor

PyTorch is the deep-learning framework built around one data type, the tensor, an n-dimensional array (much like a NumPy array) that can also track its own computation history for automatic differentiation and move onto a GPU. The whole library speaks tensors, and a typical day with PyTorch traces a single loop: you make tensors, reshape and index them, ask autograd for gradients, build a model out of nn.Module layers, feed it batches from a DataLoader, run the five-line training step, save the state_dict, and finally move everything to an accelerator to run inference. This cheatsheet walks those eight stages in order, with version-verified commands at every step.

Create tensors

A tensor is PyTorch’s n-dimensional array and the only data type the whole library speaks. You build them from Python lists, from constant fills, from ranges, or from random noise, and you can bridge a NumPy array in (it even shares the same memory). Dtype and shape are inferred at creation, so a list of ints gives you an int64 tensor while a list of floats gives float32.

pytorch tensors panel: torch.tensor, zeros/ones, arange, randn, from_numpy, zeros_like.

Spin up tensors from data, ranges, and random noise.

import torch

torch.tensor([[1, 2], [3, 4]])     # from a Python list (dtype inferred -> int64)
torch.zeros(2, 3); torch.ones(2, 3)  # constant-filled tensors of a given shape
torch.arange(0, 10, 2)             # evenly spaced range -> [0, 2, 4, 6, 8]
torch.randn(2, 2)                  # random normal noise, N(0, 1)
torch.from_numpy(arr)              # bridge a NumPy array (shares memory)
torch.zeros_like(x)               # match shape/dtype/device of an existing tensor

See tensor creation ops.
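
The inference and memory-sharing rules above are easy to check directly. The following is a minimal sketch, assuming NumPy is installed alongside PyTorch; the variable names (`ints`, `floats`, `arr`, `shared`, `copied`) are illustrative only.

```python
import numpy as np
import torch

# dtype is inferred from the Python values
ints = torch.tensor([[1, 2], [3, 4]])      # Python ints   -> torch.int64
floats = torch.tensor([1.0, 2.0])          # Python floats -> torch.float32

# from_numpy shares memory: writes to the array are visible through the tensor
arr = np.zeros(3, dtype=np.float32)
shared = torch.from_numpy(arr)
arr[0] = 7.0

# torch.tensor copies instead, so later writes to the array do not reach it
copied = torch.tensor(arr)
arr[1] = 9.0

print(ints.dtype, floats.dtype)            # torch.int64 torch.float32
print(shared.tolist())                     # [7.0, 9.0, 0.0]
print(copied.tolist())                     # [7.0, 0.0, 0.0]
```

If you need a tensor that is independent of the source array, `torch.tensor(arr)` (a copy) is the safer choice; `torch.from_numpy(arr)` avoids the copy at the cost of aliasing.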

Shapes, indexing, and ops

Most of deep learning is bookkeeping on shapes. view reinterprets the same data without copying, and reshape does the same whenever the memory layout allows (otherwise it copies). unsqueeze/squeeze add and drop size-1 axes (the trick behind batching and broadcasting), and permute/transpose reorder axes. NumPy-style slices are views onto the original tensor, boolean masks return a copy when read but write in place when assigned to, and @ does the matrix multiply at the heart of every layer.

pytorch ops panel: shape/dtype/ndim, reshape/view, unsqueeze/squeeze, permute, indexing and masks, cat/stack, matmul.

Reshape, slice, and combine tensors without copying when you can.

x.shape; x.dtype; x.ndim           # inspect metadata -> torch.Size([3, 4]), torch.float32, 2
x.reshape(2, 6); x.view(-1)        # reinterpret same data; view(-1) flattens to (12,)
x.unsqueeze(0); x.squeeze()        # add a leading dim -> (1,3,4); drop all size-1 dims
x.permute(1, 0)                    # reorder axes -> (4,3), swap dim0 and dim1
x[:, 1]; x[x > 5]                  # column slice (a view); boolean mask of values > 5 (a copy)
torch.cat([a, b], dim=0)           # join along an existing dim -> (6,4)
a @ b.T                            # matrix multiply, (3,4) @ (4,3) -> (3,3)

See torch.Tensor.
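
A short sketch of where views and copies differ, using a (3, 4) tensor like the one in the panel above. The variable names are illustrative, and the `try`/`except` is only there to show that `view` is refused on a non-contiguous tensor.

```python
import torch

x = torch.arange(12.0).reshape(3, 4)       # values 0..11, shape (3, 4)

# view shares storage with x: a write through the view changes x
flat_view = x.view(-1)
flat_view[0] = -1.0                        # x[0, 0] is now -1.0

# permute returns a non-contiguous view, so view(-1) is refused there
t = x.permute(1, 0)                        # shape (4, 3)
try:
    t.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True
flat_copy = t.reshape(-1)                  # reshape falls back to a copy

# reading through a boolean mask copies; assigning through one writes in place
picked = x[x > 5]                          # the six values 6..11, as a copy
picked[0] = 0.0                            # does not touch x
x_kept_six = x[1, 2].item()                # still 6.0
x[x > 5] = 0.0                             # every value above 5 in x becomes 0

# unsqueeze adds a size-1 axis that broadcasting can stretch
col = torch.arange(3.0).unsqueeze(1)       # shape (3, 1)
row = torch.arange(4.0)                    # shape (4,)
grid = col + row                           # shape (3, 4)
```

When in doubt, `reshape` is the forgiving option; reach for `view` when you want an error rather than a silent copy.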

Autograd (gradients)

Set requires_grad=True and PyTorch silently records every operation into a graph. Calling .backward() on a scalar walks that graph in reverse and deposits a gradient in each leaf’s .grad. When you only want predictions, wrap the math in torch.no_grad() (or inference_mode()) so no graph is built, which is faster and uses less memory.

pytorch autograd panel: requires_grad, build a computation, backward, read a grad, no_grad, detach.

Tensors that track history; .backward() fills in the gradients.

x = torch.tensor([3.])                      # plain input, no gradient tracked
w = torch.tensor([2.], requires_grad=True)  # track gradients on this leaf
b = torch.tensor([1.], requires_grad=True)  # a second tracked leaf
y = w * x + b                      # build a computation (history is recorded) -> 7
y.backward()                       # walk the graph backward -> w.grad=3, b.grad=1
w.grad                             # read the gradient at a leaf -> tensor([3.])
with torch.no_grad():              # stop tracking: no graph built (eval / infer)
    ...
z.detach()                         # cut a branch off the graph (no grad)

See autograd mechanics.
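
The w=2, x=3, b=1 case can be worked end to end. This is a minimal sketch using 0-dimensional tensors; it also shows that a second backward pass adds to `.grad`, and that `no_grad` and `detach` both produce tensors that are not tracked.

```python
import torch

x = torch.tensor(3.0)                          # plain input, not tracked
w = torch.tensor(2.0, requires_grad=True)      # leaf: gradient lands in w.grad
b = torch.tensor(1.0, requires_grad=True)      # leaf: gradient lands in b.grad

y = w * x + b                                  # 2 * 3 + 1 = 7
y.backward()                                   # dy/dw = x = 3, dy/db = 1
first_w_grad = w.grad.item()                   # 3.0
first_b_grad = b.grad.item()                   # 1.0

# A second backward pass adds to .grad rather than replacing it
(w * x + b).backward()
accumulated_w_grad = w.grad.item()             # 3.0 + 3.0 = 6.0

# Reset by hand here; optimizer.zero_grad() clears parameter gradients for you
w.grad.zero_()
b.grad.zero_()

with torch.no_grad():                          # nothing inside is recorded
    z = w * x
d = (w * x).detach()                           # same value, cut off from the graph

print(first_w_grad, first_b_grad, accumulated_w_grad)   # 3.0 1.0 6.0
print(z.requires_grad, d.requires_grad)                  # False False
```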

Build a model (`nn.Module`)

Every model subclasses nn.Module: layers are created in __init__ and the data path is spelled out in forward. The module auto-registers each layer’s weights as parameters, so the optimizer can find them later. nn.Sequential is the quick way to chain layers when the forward pass is just “run them in order”.

pytorch nn panel: nn.Linear, nn.Sequential, custom nn.Module, F.relu, CrossEntropyLoss, count parameters.

Layers are parameters with a forward; stack them into a module.

import torch.nn as nn
import torch.nn.functional as F

nn.Linear(10, 20)                  # a linear (dense) layer -> 220 params
nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))  # stack layers in order

class Net(nn.Module):              # subclass for a custom forward
    def forward(self, x):
        ...

F.relu(x)                          # apply an activation (negatives clipped to 0)
nn.CrossEntropyLoss()              # a loss: logits + target -> scalar loss
sum(p.numel() for p in m.parameters())  # count parameters -> 262 for the two-layer Net

See torch.nn and torch.nn.functional.
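
To see the auto-registration at work, the sketch below builds the two-layer `Net` (10 -> 20 -> 2), lists the parameters the module registered, and counts them. The batch of five random inputs and the hand-written targets are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)   # 10*20 weights + 20 biases = 220
        self.fc2 = nn.Linear(20, 2)    # 20*2 weights  + 2 biases  = 42

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

m = Net()

# Assigning a layer to self.<name> registers its weights as parameters
names = [name for name, _ in m.named_parameters()]
n_params = sum(p.numel() for p in m.parameters())

# Forward pass on a made-up batch, then a scalar loss from logits + class indices
logits = m(torch.randn(5, 10))                     # shape (5, 2)
targets = torch.tensor([0, 1, 0, 1, 1])
loss = nn.CrossEntropyLoss()(logits, targets)      # 0-dim tensor

print(names)      # ['fc1.weight', 'fc1.bias', 'fc2.weight', 'fc2.bias']
print(n_params)   # 262
```

Note that `nn.CrossEntropyLoss` expects raw logits, so no softmax is applied in `forward`.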

Data loading

A Dataset answers two questions, “how many samples?” (__len__) and “give me sample i” (__getitem__). A DataLoader wraps it to batch, shuffle, and (optionally) parallelize loading, yielding ready-to-train (inputs, targets) tuples. next(iter(loader)) is the fastest way to sanity-check one batch’s shapes before you train.

pytorch data panel: TensorDataset, custom Dataset, DataLoader, iterate batches, grab one batch, random_split.

A Dataset indexes samples; a DataLoader batches and shuffles them.

from torch.utils.data import TensorDataset, DataLoader, Dataset, random_split

TensorDataset(X, y)                # wrap tensors, zipped row-by-row into (x, y) samples

class DS(Dataset):                 # custom dataset (two required methods)
    def __len__(self): ...         # how many samples
    def __getitem__(self, i): ...  # give me sample i

DataLoader(ds, batch_size=16, shuffle=True)  # batch + shuffle
for xb, yb in loader:              # iterate batches one (xb, yb) at a time
    ...
xb, yb = next(iter(loader))        # grab one batch (debug) -> (16,4) and (16,)
random_split(ds, [80, 20])         # split into train / val

See Dataset / DataLoader.
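
A quick sanity check of the split and the batch shapes, as a sketch on made-up data (100 random samples with 4 features, mirroring the shapes quoted above). The seeded `torch.Generator` is an optional addition here so the split is repeatable.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

X = torch.randn(100, 4)                    # 100 samples, 4 features
y = torch.randint(0, 2, (100,))            # binary labels
ds = TensorDataset(X, y)

# Lengths must sum to len(ds); the generator makes the split repeatable
split_rng = torch.Generator().manual_seed(0)
train_ds, val_ds = random_split(ds, [80, 20], generator=split_rng)

loader = DataLoader(train_ds, batch_size=16, shuffle=True)

# Grab one batch to check shapes before training
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)                  # torch.Size([16, 4]) torch.Size([16])

# 80 samples / 16 per batch = 5 full batches per epoch
batch_sizes = [len(inputs) for inputs, _ in loader]
print(len(loader), batch_sizes)            # 5 [16, 16, 16, 16, 16]
```

When the dataset size is not a multiple of `batch_size`, the last batch is smaller unless you pass `drop_last=True`.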

The training loop

The loop is always the same five beats: zero_grad (clear last step’s gradients), forward, compute loss, backward (fill gradients), step (update weights). Gradients accumulate by default, so skipping zero_grad silently adds each new batch’s gradients to the previous ones. A learning-rate scheduler optionally decays the rate as training progresses.

pytorch train panel: AdamW optimizer, zero_grad, forward and loss, backward, step, scheduler.

Zero grads, forward, loss, backward, step. Repeat each batch.

import torch.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=1e-3)  # pick an optimizer
optimizer.zero_grad()              # clear old gradients (they accumulate by default)
loss = loss_fn(model(xb), yb)      # forward + loss
loss.backward()                    # backpropagate: fill every parameter's .grad
optimizer.step()                   # update the weights
scheduler.step()                   # anneal the learning rate (optional)

See torch.optim.
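
The panel uses `scheduler` without defining it. As one possible choice (an assumption, not something the cheatsheet prescribes), the sketch below pairs AdamW with `StepLR`, which multiplies the learning rate by `gamma` every `step_size` scheduler steps. The single fixed batch and the tiny `nn.Linear` model are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Assumed scheduler: halve the learning rate after every scheduler step
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

xb = torch.randn(16, 4)                    # one made-up batch
yb = torch.randint(0, 2, (16,))

lrs = []
for epoch in range(3):
    optimizer.zero_grad()                  # 1. clear old gradients
    loss = loss_fn(model(xb), yb)          # 2-3. forward + loss
    loss.backward()                        # 4. fill .grad
    optimizer.step()                       # 5. update weights
    scheduler.step()                       # then adjust the learning rate
    lrs.append(scheduler.get_last_lr()[0])

print(lrs)                                 # roughly [5e-4, 2.5e-4, 1.25e-4]
```

Call `scheduler.step()` after `optimizer.step()`; how often to call it (per epoch or per batch) depends on the scheduler you pick.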

Save and load

Save the state_dict (a plain dict of tensors), not the Python object, so your checkpoint survives refactors. Reload it into a freshly constructed model of the same architecture, and pass weights_only=True (the default since PyTorch 2.6) so torch.load refuses to unpickle arbitrary objects, which could otherwise execute code. Toggle model.eval() before inference so dropout and batch-norm behave correctly, and back to model.train() to resume.

pytorch save panel: save state_dict, load_state_dict, full checkpoint, eval/train modes, save a tensor, torch.export.

Save the state_dict, not the object; reload weights_only=True.

torch.save(model.state_dict(), "model.pth")   # save just the weights (recommended)
model.load_state_dict(torch.load("model.pth", weights_only=True))  # load weights back (safe flag)
torch.save({"model": ..., "opt": ..., "epoch": e}, "ckpt.pth")     # save a full checkpoint
model.eval(); model.train()        # set the right mode (inference vs training)
torch.save(t, "t.pt")              # save a plain tensor
torch.export.export(model, (example,))   # export a portable graph (ExportedProgram)

See saving and loading and torch.export.
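
A round trip makes the "same architecture" requirement concrete. This is a sketch under stated assumptions: `make_model` is a hypothetical helper, the file lives in a temporary directory rather than a real checkpoint path, and the `Dropout` layer is included only so that `eval()` has something to switch off.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_model():
    # Hypothetical factory: both sides of the round trip must build the same layers
    return nn.Sequential(
        nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.5), nn.Linear(16, 2)
    )

model = make_model()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pth")
    torch.save(model.state_dict(), path)               # weights only, no class pickled

    restored = make_model()                            # fresh, randomly initialised
    state = torch.load(path, weights_only=True)        # restricted, safer unpickling
    restored.load_state_dict(state)

# eval() turns dropout off, so both models are deterministic and comparable
model.eval()
restored.eval()
x = torch.randn(3, 4)
with torch.inference_mode():
    same = torch.equal(model(x), restored(x))

print(same)                                            # True
```

If the rebuilt architecture differs (a renamed layer, a different width), `load_state_dict` raises an error about missing or unexpected keys or mismatched shapes rather than loading silently.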

Devices and inference

Tensors and models live on a device, and an operation raises an error unless its operands are on the same one, so you .to(device) both the model and each batch. torch.accelerator (added in PyTorch 2.6) reports whichever accelerator is present, such as CUDA or Apple MPS, so you can fall back to CPU without hard-coding a backend. For predictions, run under inference_mode(), then turn raw logits into a class with argmax or into probabilities with softmax.

pytorch device panel: pick the accelerator, move model and data, inference_mode, argmax, softmax, manual_seed.

Move model and data to the accelerator; run with no gradients.

device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else "cpu"  # cuda / mps / cpu
model.to(device); xb = xb.to(device)  # move model + data; Tensor.to returns a new tensor, so reassign
with torch.inference_mode():       # run inference safely (no autograd)
    out = model(xb)
out.argmax(dim=1)                  # logits -> predicted class index -> 2
torch.softmax(out, dim=1)          # logits -> probabilities summing to 1.0
torch.manual_seed(0)               # make a run reproducible

See torch.accelerator and Apple MPS backend.
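
Putting the device steps together, here is a minimal sketch that also runs on PyTorch builds without `torch.accelerator` by falling back to CPU (the `hasattr` guard is an addition for illustration). The three-class `nn.Linear` model and the random batch are made up.

```python
import torch
import torch.nn as nn

# Use the accelerator when the API exists and one is present, else CPU
if hasattr(torch, "accelerator") and torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()
else:
    device = torch.device("cpu")

model = nn.Linear(4, 3).to(device)         # modules are moved in place
xb = torch.randn(5, 4)
xb = xb.to(device)                         # tensors are not: keep the returned copy

model.eval()
with torch.inference_mode():               # no graph is recorded
    logits = model(xb)                     # shape (5, 3)

probs = torch.softmax(logits, dim=1)       # each row sums to 1
pred = logits.argmax(dim=1)                # one class index per row, in {0, 1, 2}

print(device, probs.shape, pred.shape)
```

Forgetting the reassignment in `xb = xb.to(device)` is a common source of device-mismatch errors, because `xb.to(device)` on its own leaves `xb` where it was.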

Quick Reference

Common tensor creation.
Call	Makes
`torch.tensor(data)`	Tensor from a list/array (dtype inferred)
`torch.zeros(s)` · `torch.ones(s)`	Constant-filled tensor of shape `s`
`torch.full(s, v)`	Tensor of shape `s` filled with `v`
`torch.arange(a, b, step)`	1-D range `[a, b)`
`torch.linspace(a, b, n)`	`n` evenly spaced points in `[a, b]`
`torch.randn(s)` · `torch.rand(s)`	Normal / uniform random
`torch.randint(lo, hi, s)`	Random integers
`torch.eye(n)`	n x n identity
`torch.from_numpy(arr)`	Tensor sharing memory with a NumPy array
`torch.zeros_like(x)`	Same shape/dtype/device as `x`

Shape and op cheatsheet.
Call	Does
`x.shape` · `x.dtype` · `x.device`	Inspect metadata
`x.reshape(s)` · `x.view(s)`	Reinterpret to shape `s`
`x.flatten()`	Collapse to 1-D
`x.unsqueeze(d)` · `x.squeeze(d)`	Add / drop a size-1 dim at `d`
`x.permute(*dims)` · `x.transpose(a, b)`	Reorder / swap axes
`torch.cat([a, b], dim)`	Join along an existing dim
`torch.stack([a, b], dim)`	Join along a new dim
`a @ b` · `a.matmul(b)`	Matrix multiply
`x.sum()` · `x.mean()` · `x.max(dim)`	Reductions
`x.argmax(dim)` · `torch.topk(x, k)`	Index of the largest value / top-`k` values with their indices
`x.to(device)` · `x.to(dtype)`	Move / cast

The training loop in five lines.
Step	Call
1. Clear gradients	`optimizer.zero_grad()`
2. Forward	`out = model(xb)`
3. Loss	`loss = loss_fn(out, yb)`
4. Backward	`loss.backward()`
5. Update	`optimizer.step()`

Artifacts and modes.
Thing	Meaning
`state_dict`	Plain dict of tensors; the thing you save
`model.pth` / `.pt`	Convention for a saved checkpoint/tensor
`model.train()`	Dropout + BatchNorm active (training)
`model.eval()`	Dropout off, BatchNorm uses running stats (inference)
`requires_grad=True`	Tensor tracks history for autograd
`torch.no_grad()` / `inference_mode()`	Disable history tracking for speed
`weights_only=True`	Safe load flag for `torch.load`

Appendix: Sample Code

Minimal end-to-end training loop

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Data: 100 samples, 4 features, 2 classes
X = torch.randn(100, 4)
y = torch.randint(0, 2, (100,))
loader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

# Model
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

# Loss + optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Train
model.train()
for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")

Custom `nn.Module` and a custom `Dataset`

import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 2)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

class MyDataset(Dataset):
    def __init__(self, X, y):
        self.X, self.y = X, y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, i):
        return self.X[i], self.y[i]

Save, load, and run on the best device

import torch

# Save just the weights (recommended)
torch.save(model.state_dict(), "model.pth")

# Rebuild the same architecture first (here, the nn.Sequential from the first sample), then load
model.load_state_dict(torch.load("model.pth", weights_only=True))

# Pick the accelerator (CUDA, Apple MPS, or CPU)
device = torch.accelerator.current_accelerator() if torch.accelerator.is_available() else torch.device("cpu")
model = model.to(device)

# Inference
model.eval()
with torch.inference_mode():
    x = torch.randn(1, 4).to(device)
    logits = model(x)
    probs = torch.softmax(logits, dim=1)
    pred = logits.argmax(dim=1)

References

PyTorch documentation

PyTorch documentation home, tensor creation ops, and torch.Tensor
torch.from_numpy, autograd mechanics, and torch.autograd / backward
torch.no_grad and inference_mode
torch.nn, torch.nn.functional, and Dataset / DataLoader
torch.optim, saving and loading, torch.save, and torch.load
torch.accelerator, Apple MPS backend, and torch.export

More

TorchScript (legacy), torch.compile, and the 60-Minute Blitz tutorial