sentence-transformers Cheatsheet

A visual guide to sentence-transformers covering model loading, encoding text to vectors, cosine similarity, semantic search, building and querying an embedding index, reranking with a CrossEncoder, batch encoding at scale, and the retrieve-and-rerank (RAG) flow.

Tags: python, sentence-transformers, cheatsheet

Author: James Balamuta

Published: July 14, 2026

sentence-transformers (SBERT) is the standard library for turning text into dense embedding vectors and running semantic search, similarity, and retrieve-and-rerank. You load a pretrained model by Hub name, call model.encode(...) to map any string into a fixed-length vector, and then compare vectors by cosine similarity to measure how close two pieces of text are in meaning. The recurring mental model in this sheet is one picture: text on the left flows along a gray arrow through an accent model box into a vector (a short row of numbered cells tagged with its dimension, e.g. 384), and two vectors flow into a green score chip (cosine in [-1, 1], bright when high). A bi-encoder embeds each text once and compares vectors; a cross-encoder reads a (query, doc) pair together and emits one relevance score. The conventional import is from sentence_transformers import SentenceTransformer, CrossEncoder, util, and everything here is sentence-transformers v5 (5.x renamed encode’s first argument from sentences to inputs, so each section flags the current spelling).

Figure: the complete sentence-transformers cheatsheet in eight panels (loading a model, encoding text to vectors, cosine similarity between sentences, semantic search over a corpus, building and querying an embedding index, reranking with a CrossEncoder, batch encoding at scale, and the retrieve-and-rerank RAG flow), available in light and dark mode as a single printable SVG.

Load a Model

A SentenceTransformer wraps a pretrained transformer plus a pooling layer that turns any text into one fixed-length vector, and you load it by Hub name in a single line. Start with all-MiniLM-L6-v2 (fast, 384 dims) for prototyping and move to all-mpnet-base-v2 (slower, 768 dims, stronger) when quality matters. Pass device="cuda" to run on a GPU, and use get_sentence_embedding_dimension() and max_seq_length to inspect the output size and the input length cap.

sentence-transformers load panel: load a small default model, check the embedding dimension, see the input length limit, pick a device, pick a stronger but slower model.

One line pulls a pretrained embedding model from the Hub.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # fast default
model.get_sentence_embedding_dimension()        # 384  (vector size)
model.max_seq_length                            # 256  (token cap; longer is truncated)

SentenceTransformer("all-mpnet-base-v2", device="cuda")   # run on a GPU
SentenceTransformer("all-mpnet-base-v2")        # stronger, 768 dims, slower

See Quickstart. Browse the pretrained bi-encoders for other models.

Encode Text to Vectors

model.encode(...) is the workhorse: give it a string or a list of strings and get back a NumPy array of shape (n_texts, dim), or a torch tensor with convert_to_tensor=True. Set normalize_embeddings=True so cosine similarity reduces to a dot product, and use encode_query / encode_document for asymmetric search so the model applies its built-in query and document prompts. Turn on show_progress_bar=True to watch long jobs run batch by batch.

sentence-transformers encode panel: encode a list of sentences, get a torch tensor, normalize to unit length, encode a query versus a document, show a progress bar.

model.encode turns strings into dense embedding vectors.

emb = model.encode(["a cat on a mat", "a dog in a park"])  # (2, 384) float32 ndarray
model.encode(texts, convert_to_tensor=True)     # torch.Tensor (for GPU math)
model.encode(texts, normalize_embeddings=True)  # unit length: cosine becomes a dot product

model.encode_query(q)                           # query prompt applied
model.encode_document(doc)                       # document prompt applied
model.encode(texts, show_progress_bar=True)     # per-batch progress bar

See Computing embeddings. The first positional argument is now inputs (was sentences).
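To see why unit-length vectors let cosine similarity reduce to a dot product, here is a minimal pure-Python sketch. The 4-dimensional toy vectors and the helper names (l2_normalize, dot, cosine) are assumptions made for illustration and are not part of sentence-transformers; real embeddings would come from model.encode(...).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for this sketch)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed the long way, with both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy "embeddings"; a real model would return 384 or 768 dims.
a = [0.2, 0.9, -0.4, 0.1]
b = [0.1, 0.8, -0.5, 0.3]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization the norms are 1, so the division in cosine() is a no-op.
same = math.isclose(cosine(a, b), dot(a_unit, b_unit), rel_tol=1e-9)
print("cosine equals dot product of unit vectors:", same)
```

Normalizing once at encode time moves the norm computation out of every later comparison, which is the practical reason to set it when building an index.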

Cosine Similarity Between Sentences

Once two texts are vectors, their meaning-closeness is the cosine of the angle between them, a number in [-1, 1] where higher is more similar. Call model.similarity(a, b) for the full pairwise matrix or model.similarity_pairwise(a, b) to compare rows one-to-one; both honor the model’s configured metric, which is cosine by default. When you do not have a model handle, util.cos_sim computes the same score directly from two vectors.

sentence-transformers similarity panel: score every pair as a matrix, read one pair, pairwise row-to-row scoring, cosine without a model handle, switch the similarity metric.

Compare embedding vectors; cosine in [-1, 1], higher means closer.

from sentence_transformers.util import cos_sim

sim = model.similarity(emb, emb)        # emb: 3 encoded sentences -> full 3 x 3 cosine matrix
float(sim[0, 1])                        # 0.67  (close);  float(sim[0, 2]) -> 0.08 (far)
model.similarity_pairwise(a, b)         # compare matching rows one-to-one

cos_sim(emb[0], emb[1])                 # 0.67  cosine without a model handle
SentenceTransformer(name, similarity_fn_name="dot")   # cosine | dot | euclidean | manhattan

See Semantic textual similarity. Prefer util.cos_sim over the legacy util.pytorch_cos_sim.
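The difference between the full pairwise matrix and row-to-row scoring is mostly about output shape. The sketch below reproduces both shapes by hand in pure Python; the function names similarity_matrix and similarity_rowwise and the 2-dimensional toy vectors are assumptions for illustration, not the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def similarity_matrix(xs, ys):
    """Every row of xs against every row of ys: len(xs) x len(ys) scores."""
    return [[cosine(x, y) for y in ys] for x in xs]

def similarity_rowwise(xs, ys):
    """Row i of xs against row i of ys only: one score per row."""
    if len(xs) != len(ys):
        raise ValueError("row-wise scoring needs the same number of rows")
    return [cosine(x, y) for x, y in zip(xs, ys)]

# Three toy vectors: the second is near the first, the third points the opposite way.
vecs = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]

matrix = similarity_matrix(vecs, vecs)    # 3 x 3, diagonal is self-similarity
rowwise = similarity_rowwise(vecs, vecs)  # 3 scores, each a vector against itself

print(len(matrix), "x", len(matrix[0]), "matrix;", len(rowwise), "row-wise scores")
```

The matrix form suits "compare everything with everything" tasks such as deduplication, while the row-wise form suits aligned lists such as (question, answer) pairs.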

Build and Query an Embedding Index

The expensive step is encoding, so encode the corpus once with encode_document, keep the resulting tensor as your index, and reuse it for every incoming query embedded with encode_query. Persist the tensor with torch.save / torch.load to skip re-encoding between runs, and append new document embeddings with torch.cat as your corpus grows. Each query then becomes a cheap top-k lookup over the saved matrix.

sentence-transformers index panel: build the index with encode_document, save it to disk, load it back later, embed the query side, query the index and read hits, keep adding documents.

Encode the corpus once, reuse it for every query.

import torch
from sentence_transformers.util import semantic_search

corpus_emb = model.encode_document(corpus, convert_to_tensor=True)  # build the index
torch.save(corpus_emb, "corpus.pt")             # encode once, save
corpus_emb = torch.load("corpus.pt")            # load it back later

q_emb = model.encode_query("how to reset password", convert_to_tensor=True)
for h in semantic_search(q_emb, corpus_emb, top_k=5)[0]:
    print(corpus[h["corpus_id"]], h["score"])   # ranked hits, top first
new_emb = model.encode_document(new, convert_to_tensor=True)   # torch.cat needs tensors, not ndarrays
corpus_emb = torch.cat([corpus_emb, new_emb])   # grow the index

See semantic_search reference. encode_document applies the model’s document prompt.
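As a rough picture of what a top-k lookup over a saved matrix involves, here is a pure-Python sketch that scores a query against every stored vector and keeps the best k. The function top_k_search and the 2-dimensional toy index are hypothetical; only the hit format (a dict with corpus_id and score) mirrors what the panel above reads from semantic_search.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def top_k_search(query_vec, index, top_k=5):
    """Score the query against every row of the index, return the best top_k hits."""
    scored = (
        {"corpus_id": i, "score": cosine(query_vec, row)}
        for i, row in enumerate(index)
    )
    # nlargest returns hits sorted from highest to lowest score.
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])

# Toy index standing in for the saved corpus embeddings.
index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.1]

hits = top_k_search(query, index, top_k=2)
print([h["corpus_id"] for h in hits])

# Growing the index is just appending rows; existing ids stay valid.
index.append([0.9, 0.1])
hits_after = top_k_search(query, index, top_k=2)
print([h["corpus_id"] for h in hits_after])
```

Because new rows are appended at the end, corpus_id values already handed out keep pointing at the same documents, which is why appending is a safe way to grow the index.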

Rerank with a CrossEncoder

A bi-encoder embeds query and document separately, which is fast and scalable but a little blurry; a CrossEncoder instead reads the (query, document) pair together and outputs a single, sharper relevance score. Cross-encoders are too slow to score a whole corpus, so you run them only on the top-k candidates from retrieval, using ce.predict(pairs) for raw scores or ce.rank(query, docs) for a sorted result. Pass activation_fn=torch.nn.Sigmoid() to map raw logits into 0..1.

sentence-transformers rerank panel: load a reranker model, score query-document pairs, rank candidates directly, get 0-to-1 probabilities, bi-encoder versus cross-encoder.

A cross-encoder reads (query, doc) together for sharper scores.

import torch
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")     # a reranker
scores = ce.predict([(query, d) for d in docs])  # raw logits: [8.7, -4.3, -11.1]
ranks = ce.rank(query, docs, top_k=3)             # sorted [{corpus_id, score}, ...]
ce.predict(pairs, activation_fn=torch.nn.Sigmoid())   # squash logits into 0..1
# bi-encoder: two vectors then compare (fast); cross-encoder: one joined pass (accurate)

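See CrossEncoder usage. With this MS MARCO reranker, predict returns raw logits, not probabilities; the default activation is model-dependent, so check before thresholding.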
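What the sigmoid does to raw reranker logits can be checked by hand. The sketch below applies the standard logistic function to the three example logits from the panel; the sigmoid helper is written out here for illustration and is not a library call.

```python
import math

def sigmoid(x):
    """Standard logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Example logits copied from the panel above.
logits = [8.7, -4.3, -11.1]
probs = [sigmoid(x) for x in logits]

# Sigmoid is strictly increasing, so the ranking is unchanged.
order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

for logit, prob in zip(logits, probs):
    print(f"logit {logit:>6.1f} -> score {prob:.6f}")
print("same ranking:", order_by_logit == order_by_prob)
```

Since the ordering never changes, the activation only matters when you need a threshold or a score that is comparable across queries; for pure reranking the raw logits are enough.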

Batch Encode at Scale

For large corpora, raise batch_size to use the hardware fully and turn on show_progress_bar; on a GPU, model.half() roughly doubles throughput for a tiny accuracy cost. Shrink the index with precision="int8" quantization or truncate_dim (Matryoshka) embeddings, and on multi-core CPUs use start_multi_process_pool() plus encode(..., pool=pool) instead of the deprecated standalone encode_multi_process.

sentence-transformers batch panel: encode in tuned batches, half precision for GPU speed, quantize embeddings smaller, truncate dimensions with Matryoshka, parallel multi-process pool.

Tune batch size, precision, and parallelism for big corpora.

model.encode(texts, batch_size=256, show_progress_bar=True)  # tune throughput
model.half()                                     # fp32 -> fp16, ~2x faster on GPU
model.encode(texts, precision="int8")            # quantize: 4x smaller index
model.encode(texts, truncate_dim=128)            # Matryoshka: keep the first 128 dims

pool = model.start_multi_process_pool()          # parallel CPU workers
model.encode(texts, pool=pool, chunk_size=2_000)  # replaces deprecated encode_multi_process

See Efficiency. The standalone encode_multi_process is deprecated in favor of pool=.
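To make the two shrinking tricks concrete, here is a simplified pure-Python sketch of scalar int8 quantization and of dimension truncation. It assumes a per-dimension min/max bucketing scheme and re-normalizes after truncation; both choices, the helper names, and the toy vectors are assumptions for this sketch and should not be read as the library's exact algorithm.

```python
import math

def quantize_int8(vectors):
    """Map each dimension's observed [min, max] range onto the 256 int8 buckets."""
    dims = len(vectors[0])
    lows = [min(v[d] for v in vectors) for d in range(dims)]
    highs = [max(v[d] for v in vectors) for d in range(dims)]
    quantized = []
    for v in vectors:
        row = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0   # avoid dividing by zero on a constant dim
            row.append(round((v[d] - lows[d]) / span * 255) - 128)
        quantized.append(row)
    return quantized

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values, then rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy float embeddings (3 texts, 4 dims).
vectors = [
    [0.12, -0.40, 0.33, 0.05],
    [0.50, 0.10, -0.20, 0.00],
    [-0.30, 0.25, 0.10, -0.15],
]

quantized = quantize_int8(vectors)
short = truncate_and_renormalize(vectors[0], 2)

print(quantized)   # every value fits in one signed byte
print(short)       # 2 dims instead of 4
```

Truncation only preserves quality when the model was trained so that its leading dimensions carry the most information (the Matryoshka setup the section mentions); on other models the first N dims are not special.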

The Retrieve-and-Rerank (RAG) Flow

Production semantic search and RAG retrieval use two stages: a cheap bi-encoder retrieves the top-k candidates from millions of documents (high recall), then an expensive cross-encoder reranks just those candidates (high precision). The reranked top passages become the context you hand to a language model, so this pattern is the standard “retriever” half of a retrieval-augmented pipeline. Wide cheap retrieve, then narrow expensive rerank: recall first, precision second.

sentence-transformers RAG panel: retrieve top-k fast with the bi-encoder, rerank the candidates with the cross-encoder, take the best passages, build the prompt context, why two stages recall then precision.

Bi-encoder retrieves top-k, cross-encoder reranks, then you answer.

from sentence_transformers.util import semantic_search

# 1. Retrieve top-k fast with the bi-encoder (high recall)
hits = semantic_search(model.encode_query(q), corpus_emb, top_k=50)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]
# 2. Rerank just those candidates with the cross-encoder (high precision)
ranks = ce.rank(q, candidates)          # corpus_id now indexes into candidates, not corpus
# 3. Take the best passages and build the prompt context for an LLM
top = [candidates[r["corpus_id"]] for r in ranks[:5]]
context = "\n\n".join(top)              # wide cheap retrieve -> narrow expensive rerank

See Retrieve and rerank. Two stages: recall over millions, precision over dozens.
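The two-stage flow has one easy-to-miss detail: the ids returned by the rerank stage index into the candidate list, not the original corpus. The pure-Python sketch below shows this with stand-in scorers. The word-overlap and bigram scorers, the function names, and the five-document corpus are all hypothetical substitutes for a bi-encoder and a cross-encoder; only the hit format follows the panels above.

```python
def cheap_score(query, doc):
    """Stage-1 stand-in: share of query words that appear in the doc."""
    q_words = set(query.lower().split())
    return len(q_words & set(doc.lower().split())) / len(q_words)

def bigrams(text):
    words = text.lower().split()
    return set(zip(words, words[1:]))

def expensive_score(query, doc):
    """Stage-2 stand-in: shared adjacent word pairs, so word order matters."""
    return len(bigrams(query) & bigrams(doc))

def retrieve(query, corpus, top_k):
    hits = [{"corpus_id": i, "score": cheap_score(query, d)} for i, d in enumerate(corpus)]
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]

def rerank(query, docs):
    # corpus_id here is the position inside `docs`, i.e. inside the candidates.
    ranks = [{"corpus_id": i, "score": expensive_score(query, d)} for i, d in enumerate(docs)]
    return sorted(ranks, key=lambda r: r["score"], reverse=True)

corpus = [
    "to reset password open the login page",
    "the weather is sunny today",
    "password rules require twelve characters",
    "how password reset works and how to contact support",
    "bake bread at home",
]
query = "how to reset password"

hits = retrieve(query, corpus, top_k=3)                 # wide, cheap
candidates = [corpus[h["corpus_id"]] for h in hits]
ranks = rerank(query, candidates)                       # narrow, expensive
top = [candidates[r["corpus_id"]] for r in ranks[:2]]   # index candidates, not corpus

print(top)
```

Indexing corpus with the rerank ids would silently return the wrong passages here (the best rerank id is 1, which is the weather sentence in corpus), so keeping the candidates list around is the safer habit.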

Quick Reference

Key sentence-transformers calls.

| Command | What it does | Area |
| --- | --- | --- |
| SentenceTransformer("all-MiniLM-L6-v2") | Load a bi-encoder model | Load |
| model.encode(texts) | Text to embedding array (n, dim) | Encode |
| model.encode(texts, convert_to_tensor=True) | Encode to a torch tensor | Encode |
| model.encode_query(q) / encode_document(d) | Asymmetric query / document encoding | Encode |
| model.similarity(a, b) | Pairwise cosine matrix | Similarity |
| util.cos_sim(a, b) | Cosine without a model handle | Similarity |
| util.semantic_search(q, corpus, top_k=k) | Top-k corpus search | Search |
| util.dot_score | Dot-product scorer (normalized vectors) | Search |
| CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2") | Load a reranker | Rerank |
| ce.predict(pairs) / ce.rank(query, docs) | Score / sort (query, doc) pairs | Rerank |
| model.encode(texts, batch_size=256) | Tune throughput on big corpora | Scale |
| model.encode(texts, precision="int8") | Quantize embeddings (smaller index) | Scale |
| model.encode(texts, truncate_dim=128) | Matryoshka truncated dims | Scale |
| model.start_multi_process_pool() | Parallel multi-process encoding | Scale |

Common encode() keywords.

| Argument | Meaning |
| --- | --- |
| inputs (positional) | The text(s) to encode (renamed from sentences) |
| batch_size=32 | Texts per forward pass; raise it on a GPU |
| convert_to_tensor=False | Return a torch tensor instead of a NumPy array |
| normalize_embeddings=False | Scale each vector to unit length |
| show_progress_bar | Show a per-batch progress bar |
| precision="float32" | "int8" / "uint8" / "binary" to quantize |
| truncate_dim | Keep only the first N dims (Matryoshka) |
| prompt_name / prompt | Apply a named or literal instruction prompt |
| pool / chunk_size | Multi-process encoding across workers |

Bi-encoder vs cross-encoder.

| | Bi-encoder (SentenceTransformer) | Cross-encoder (CrossEncoder) |
| --- | --- | --- |
| Input | Each text alone | A (query, doc) pair together |
| Output | One reusable vector per text | One relevance score per pair |
| Speed | Fast, embed once and reuse | Slow, recompute per pair |
| Scales to | Millions of documents | Tens of candidates |
| Use for | Retrieval / first-stage search | Reranking the top-k |

Model starting points.

| Need | Start here | Notes |
| --- | --- | --- |
| Fast general embeddings | all-MiniLM-L6-v2 | 384 dims, great default |
| Higher-quality embeddings | all-mpnet-base-v2 | 768 dims, slower |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages |
| Asymmetric QA retrieval | multi-qa-mpnet-base-dot-v1 | Short query, long doc |
| Reranking | cross-encoder/ms-marco-MiniLM-L6-v2 | Cross-encoder, not a bi-encoder |

Appendix: Sample Code

The text to vector to score mental model

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

emb = model.encode(["a cat on a mat", "a kitten on a rug", "stock prices fell"])
emb.shape                    # (3, 384)  -> three vectors, 384 dims each

sim = model.similarity(emb, emb)   # 3 x 3 cosine matrix
float(sim[0, 1])             # ~0.67  -> cat / kitten are close
float(sim[0, 2])             # ~0.08  -> cat / stocks are far

Semantic search over a corpus

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A cheetah is running behind its prey.",
]
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)

q_emb = model.encode_query("Someone eats bread.", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=3)

for h in hits[0]:
    print(round(h["score"], 3), corpus[h["corpus_id"]])
# 0.567 A man is eating food.
# 0.553 A man is eating pasta.
# ...

Persist the index, reuse it across runs

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Build once and save:
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)
torch.save(corpus_emb, "corpus.pt")

# Later, in a fresh process:
corpus_emb = torch.load("corpus.pt")
q_emb = model.encode_query("how to reset my password", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=5)

Retrieve-and-rerank (the RAG retriever)

This is the pattern to copy for production semantic search: a cheap bi-encoder retrieves many candidates, an expensive cross-encoder reranks the few that matter.

from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import semantic_search

bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

corpus_emb = bi.encode_document(corpus, convert_to_tensor=True)
query = "How many people live in Berlin?"

# 1. Retrieve top-k fast with the bi-encoder
hits = semantic_search(bi.encode_query(query), corpus_emb, top_k=50)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]

# 2. Rerank just those candidates with the cross-encoder
ranks = ce.rank(query, candidates, top_k=5)

# 3. Take the best passages as context for an LLM
context = "\n\n".join(candidates[r["corpus_id"]] for r in ranks)
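A back-of-the-envelope count of model forward passes shows why the split into two stages pays off. The sketch below only counts passes through a model and ignores vector-comparison cost and batching effects; the function names and the workload numbers (one million documents, one thousand queries, top_k of 50) are assumptions chosen for illustration.

```python
def cross_encoder_only_passes(n_docs, n_queries):
    """Scoring every (query, doc) pair with a cross-encoder."""
    return n_docs * n_queries

def two_stage_passes(n_docs, n_queries, top_k):
    """Bi-encoder index built once, then one query embed plus top_k reranks per query."""
    index_build = n_docs                    # one-time cost, reused for every query
    per_query = 1 + top_k                   # embed the query, rerank top_k pairs
    return index_build, n_queries * per_query

n_docs, n_queries, top_k = 1_000_000, 1_000, 50

brute_force = cross_encoder_only_passes(n_docs, n_queries)
index_build, query_time = two_stage_passes(n_docs, n_queries, top_k)

print(f"cross-encoder on everything: {brute_force:,} passes")
print(f"two-stage: {index_build:,} one-time + {query_time:,} at query time")
```

The one-time indexing cost is paid once and amortized over all future queries, while the per-query cost stays fixed at 1 + top_k passes no matter how large the corpus grows.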

Batch encode at scale

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [f"document number {i}" for i in range(100_000)]

# Tune throughput; quantize the index to shrink it 4x.
emb = model.encode(
    texts,
    batch_size=256,
    show_progress_bar=True,
    precision="int8",        # -> int8 vectors, much smaller on disk
    normalize_embeddings=True,
)

# CPU parallelism (replaces the deprecated encode_multi_process); in a script, guard this with if __name__ == "__main__":
pool = model.start_multi_process_pool()
emb = model.encode(texts, pool=pool, chunk_size=2_000)
model.stop_multi_process_pool(pool)
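For intuition about what batch_size controls, the sketch below splits the same 100,000 texts into fixed-size batches and counts them. The batches generator is a hypothetical helper written for this illustration; it shows the arithmetic only and does not reproduce how the library orders or pads inputs internally.

```python
import math

def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"document number {i}" for i in range(100_000)]

sizes = [len(batch) for batch in batches(texts, 256)]

print("forward passes:", len(sizes))      # equals ceil(100_000 / 256)
print("last batch size:", sizes[-1])      # the remainder batch is smaller
print("ceil check:", math.ceil(len(texts) / 256))
```

Doubling batch_size halves the number of forward passes but raises peak memory per pass, so the practical ceiling is whatever fits on the device.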

Behavior notes

  • encode’s first argument is now inputs. Passing sentences= still works but emits a deprecation warning; pass the text positionally or as inputs=.
  • Prefer util.cos_sim over util.pytorch_cos_sim. The legacy alias is still present for back-compat, but cos_sim is the current name; reach for model.similarity(...) when you already have a model handle.
  • Use encode_query / encode_document for asymmetric search. They apply the model’s built-in query and document prompts; plain encode does not add them unless you pass prompt_name.
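  • CrossEncoder.predict returns raw logits for the MS MARCO reranker used here, not probabilities. The default activation depends on the model's configuration, so pass activation_fn=torch.nn.Sigmoid() explicitly to map scores into 0..1 before thresholding.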
  • encode_multi_process is deprecated. Build a pool with start_multi_process_pool() and pass pool= and chunk_size= straight to encode(...) instead.
  • Quantize and truncate to shrink the index. precision="int8" gives a 4x smaller index and truncate_dim (Matryoshka) keeps only the most important dims for huge corpora.
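The storage effect of the last note is simple arithmetic: index size is documents times dimensions times bytes per value. The sketch below works it through for a hypothetical corpus of one million documents; the corpus size and the index_bytes helper are assumptions, while the 384 and 128 dimension counts and the float32 (4 bytes) versus int8 (1 byte) widths come from the sections above.

```python
def index_bytes(n_docs, dims, bytes_per_value):
    """Raw size of a dense embedding matrix, ignoring file-format overhead."""
    return n_docs * dims * bytes_per_value

n_docs = 1_000_000   # hypothetical corpus size

full = index_bytes(n_docs, 384, 4)    # float32, all dims: 1,536,000,000 bytes
int8 = index_bytes(n_docs, 384, 1)    # int8, all dims:      384,000,000 bytes
short = index_bytes(n_docs, 128, 4)   # float32, 128 dims:   512,000,000 bytes
both = index_bytes(n_docs, 128, 1)    # int8, 128 dims:      128,000,000 bytes

for label, size in [("float32 x 384", full), ("int8 x 384", int8),
                    ("float32 x 128", short), ("int8 x 128", both)]:
    print(f"{label:<14} {size / 1e6:>6.0f} MB  ({full // size}x smaller)")
```

The two reductions multiply, so combining int8 with 128 truncated dims gives a 12x smaller index than the float32 baseline in this example; whether retrieval quality survives that is something to measure on your own data.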

References

sentence-transformers documentation (latest)