sentence-transformers Cheatsheet – TheCoatlessProfessor

sentence-transformers (SBERT) is the standard library for turning text into dense embedding vectors and running semantic search, similarity, and retrieve-and-rerank. You load a pretrained model by Hub name, call model.encode(...) to map any string into a fixed-length vector, and then compare vectors by cosine similarity to measure how close two pieces of text are in meaning. The recurring mental model in this sheet is one picture: text on the left flows along a gray arrow through an accent model box into a vector (a short row of numbered cells tagged with its dimension, e.g. 384), and two vectors flow into a green score chip (cosine in [-1, 1], bright when high). A bi-encoder embeds each text once and compares vectors; a cross-encoder reads a (query, doc) pair together and emits one relevance score. The conventional import is from sentence_transformers import SentenceTransformer, CrossEncoder, util, and everything here is sentence-transformers v5 (5.x renamed encode’s first argument from sentences to inputs, so each section flags the current spelling).

Load a Model

A SentenceTransformer wraps a pretrained transformer plus a pooling layer that turns any text into one fixed-length vector, and you load it by Hub name in a single line. Start with all-MiniLM-L6-v2 (fast, 384 dims) for prototyping and move to all-mpnet-base-v2 (slower, 768 dims, stronger) when quality matters. Pass device="cuda" to run on a GPU, and use get_sentence_embedding_dimension() and max_seq_length to inspect the output size and the input length cap.

sentence-transformers load panel: load a small default model, check the embedding dimension, see the input length limit, pick a device, pick a stronger but slower model.

One line pulls a pretrained embedding model from the Hub.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # fast default
model.get_sentence_embedding_dimension()        # 384  (vector size)
model.max_seq_length                            # 256  (token cap; longer is truncated)

SentenceTransformer("all-mpnet-base-v2", device="cuda")   # run on a GPU
SentenceTransformer("all-mpnet-base-v2")        # stronger, 768 dims, slower

See Quickstart. Browse the pretrained bi-encoders for other models.

Encode Text to Vectors

model.encode(...) is the workhorse: give it a list of strings and get back a NumPy array of shape (n_texts, dim), or a torch tensor with convert_to_tensor=True; a single string returns one 1-D vector of shape (dim,). Set normalize_embeddings=True so cosine similarity reduces to a dot product, and use encode_query / encode_document for asymmetric search so the model applies its built-in query and document prompts (when the model defines them). Turn on show_progress_bar=True to watch long jobs run batch by batch.

sentence-transformers encode panel: encode a list of sentences, get a torch tensor, normalize to unit length, encode a query versus a document, show a progress bar.

model.encode turns strings into dense embedding vectors.

emb = model.encode(["a cat on a mat", "a dog in a park"])  # (2, 384) float32 ndarray
model.encode(texts, convert_to_tensor=True)     # torch.Tensor (for GPU math)
model.encode(texts, normalize_embeddings=True)  # unit length: cosine becomes a dot product

model.encode_query(q)                           # query prompt applied
model.encode_document(doc)                       # document prompt applied
model.encode(texts, show_progress_bar=True)     # per-batch progress bar

See Computing embeddings. The first positional argument is now inputs (was sentences).
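
The claim that normalizing lets cosine similarity reduce to a dot product can be checked by hand. The sketch below uses two made-up 4-dim vectors in place of real embeddings and plain NumPy in place of the library; the helper names cosine and normalize are illustrative only, not part of sentence-transformers.

```python
import numpy as np

# Two made-up 4-dim "embeddings" (real ones would come from model.encode).
a = np.array([3.0, 0.0, 4.0, 0.0])
b = np.array([1.0, 2.0, 2.0, 0.0])

def cosine(u, v):
    """Cosine similarity: dot product divided by both vector lengths."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def normalize(u):
    """Scale a vector to unit length (the effect normalize_embeddings=True asks for)."""
    return u / np.linalg.norm(u)

a_unit, b_unit = normalize(a), normalize(b)

raw_cosine = cosine(a, b)            # needs two norms and a division
unit_dot = float(a_unit @ b_unit)    # plain dot product on unit vectors

print(raw_cosine, unit_dot)          # the two values agree
```

Because the division by the norms has already happened once at encode time, every later comparison is a single dot product, which is why normalized indexes pair well with dot_score.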

Cosine Similarity Between Sentences

Once two texts are vectors, their meaning-closeness is the cosine of the angle between them, a number in [-1, 1] where higher is more similar. Call model.similarity(a, b) for the full pairwise matrix or model.similarity_pairwise(a, b) to compare rows one-to-one; both honor the model’s configured metric, which is cosine by default. When you do not have a model handle, util.cos_sim computes the same score directly from two vectors.

sentence-transformers similarity panel: score every pair as a matrix, read one pair, pairwise row-to-row scoring, cosine without a model handle, switch the similarity metric.

Compare embedding vectors; cosine in [-1, 1], higher means closer.

from sentence_transformers.util import cos_sim

sim = model.similarity(emb, emb)        # full 3 x 3 cosine matrix
float(sim[0, 1])                        # 0.67  (close);  float(sim[0, 2]) -> 0.08 (far)
model.similarity_pairwise(a, b)         # compare matching rows one-to-one

cos_sim(emb[0], emb[1])                 # tensor([[0.67]]): cosine without a model handle
SentenceTransformer(name, similarity_fn_name="dot")   # cosine | dot | euclidean | manhattan

See Semantic textual similarity. Prefer util.cos_sim over the legacy util.pytorch_cos_sim.
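
To make the difference between the full matrix and the row-to-row form concrete, here is a minimal NumPy sketch. The functions cos_sim_matrix and cos_sim_pairwise are hypothetical stand-ins written for this illustration, not the library's implementations, and the 2-dim vectors are invented.

```python
import numpy as np

def cos_sim_matrix(a, b):
    """All-pairs cosine: row i of a against row j of b -> shape (len(a), len(b))."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def cos_sim_pairwise(a, b):
    """Row-to-row cosine: row i of a against row i of b -> shape (len(a),)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a_unit * b_unit, axis=1)

a = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([[2.0, 0.0], [3.0, 0.0], [-1.0, -1.0]])

matrix = cos_sim_matrix(a, b)      # 3 x 3: every row of a against every row of b
pairwise = cos_sim_pairwise(a, b)  # 3 values: same direction, orthogonal, opposite

print(matrix.shape, pairwise)
```

The pairwise result is the diagonal of the full matrix, so the row-to-row form is the cheaper choice when only matched pairs matter.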

Semantic Search

To rank a corpus against a query, embed the corpus once into a matrix, embed the query, and call util.semantic_search, which returns one list per query holding the top-k matches as {corpus_id, score} dicts sorted best first. Under the hood it is util.cos_sim (or util.dot_score) plus a top-k sort, so you can also compute the score matrix yourself with cos_sim when you need custom filtering. Swap in score_function=dot_score for speed when your vectors are already normalized.

Rank a whole corpus against a query by vector similarity.

from sentence_transformers.util import semantic_search, cos_sim, dot_score

corpus_emb = model.encode(corpus, convert_to_tensor=True)   # N x 384 matrix
q_emb = model.encode(query, convert_to_tensor=True)         # (384,) vector for one query string

hits = semantic_search(q_emb, corpus_emb, top_k=3)          # one list per query: [[{corpus_id, score}, ...]]
cos_sim(q_emb, corpus_emb)                                  # 1 x N score strip, then argsort
semantic_search(q_emb, corpus_emb, score_function=dot_score)  # dot (needs normalized vectors)

See Semantic search. The hits are {corpus_id, score} dicts sorted best first.
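
The "score matrix plus top-k sort" description can be sketched without the library. The function top_k_hits below is a hypothetical helper for a single query, and the 3-dim corpus vectors are invented; it only mirrors the shape of the hits, not the library's batching or chunking.

```python
import numpy as np

def top_k_hits(query_vec, corpus_matrix, top_k=3):
    """Return [{corpus_id, score}, ...] sorted best first, using cosine scores."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_matrix / np.linalg.norm(corpus_matrix, axis=1, keepdims=True)
    scores = c @ q                                  # one score per corpus row
    order = np.argsort(-scores)[:top_k]             # indices of the highest scores
    return [{"corpus_id": int(i), "score": float(scores[i])} for i in order]

corpus_matrix = np.array([
    [1.0, 0.0, 0.0],   # id 0
    [0.0, 1.0, 0.0],   # id 1
    [1.0, 1.0, 0.0],   # id 2
    [0.0, 0.0, 1.0],   # id 3
])
query_vec = np.array([1.0, 0.1, 0.0])

hits = top_k_hits(query_vec, corpus_matrix, top_k=2)
for h in hits:
    print(h["corpus_id"], round(h["score"], 3))
```

Computing the scores yourself like this is the natural place to add custom filtering, for example masking out rows before the argsort.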

Build and Query an Embedding Index

The expensive step is encoding, so encode the corpus once with encode_document, keep the resulting tensor as your index, and reuse it for every incoming query embedded with encode_query. Persist the tensor with torch.save / torch.load to skip re-encoding between runs, and append new document embeddings with torch.cat as your corpus grows. Each query then becomes a cheap top-k lookup over the saved matrix.

sentence-transformers index panel: build the index with encode_document, save it to disk, load it back later, embed the query side, query the index and read hits, keep adding documents.

Encode the corpus once, reuse it for every query.

import torch
from sentence_transformers.util import semantic_search

corpus_emb = model.encode_document(corpus, convert_to_tensor=True)  # build the index
torch.save(corpus_emb, "corpus.pt")             # encode once, save
corpus_emb = torch.load("corpus.pt")            # load it back later

q_emb = model.encode_query("how to reset password", convert_to_tensor=True)
for h in semantic_search(q_emb, corpus_emb, top_k=5)[0]:
    print(corpus[h["corpus_id"]], h["score"])   # ranked hits, top first
new_emb = model.encode_document(new, convert_to_tensor=True)   # tensor, so torch.cat accepts it
corpus_emb = torch.cat([corpus_emb, new_emb])   # grow the index

See semantic_search reference. encode_document applies the model’s document prompt.
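
The same save, load, and append cycle also works on plain NumPy arrays, which is what encode returns when convert_to_tensor is left off. The sketch below swaps torch.save / torch.load / torch.cat for np.save / np.load / np.vstack and uses random vectors as stand-ins for encoded documents; the file name and sizes are arbitrary.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(5, 8)).astype(np.float32)   # stand-in for 5 encoded docs

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.npy")
    np.save(path, corpus_emb)          # encode once, save
    loaded = np.load(path)             # later run: load instead of re-encoding

new_emb = rng.normal(size=(2, 8)).astype(np.float32)       # stand-in for 2 new docs
index = np.vstack([loaded, new_emb])   # grow the index; existing rows keep their ids

print(index.shape)
```

Whichever container you use, append the new texts to the corpus list in the same order as their vectors, so that row i of the index still points at document i.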

Rerank with a CrossEncoder

A bi-encoder embeds query and document separately, which is fast and scalable but a little blurry; a CrossEncoder instead reads the (query, document) pair together and outputs a single, sharper relevance score. Cross-encoders are too slow to score a whole corpus, so you run them only on the top-k candidates from retrieval, using ce.predict(pairs) for raw scores or ce.rank(query, docs) for a sorted result. Pass activation_fn=torch.nn.Sigmoid() to map raw logits into 0..1.

sentence-transformers rerank panel: load a reranker model, score query-document pairs, rank candidates directly, get 0-to-1 probabilities, bi-encoder versus cross-encoder.

A cross-encoder reads (query, doc) together for sharper scores.

import torch
from sentence_transformers import CrossEncoder

ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")     # a reranker
scores = ce.predict([(query, d) for d in docs])  # raw logits: [8.7, -4.3, -11.1]
ranks = ce.rank(query, docs, top_k=3)             # sorted [{corpus_id, score}, ...]
ce.predict(pairs, activation_fn=torch.nn.Sigmoid())   # squash logits into 0..1
# bi-encoder: two vectors then compare (fast); cross-encoder: one joined pass (accurate)

See CrossEncoder usage. For this ms-marco reranker, predict returns raw logits, not probabilities; other cross-encoders may apply a sigmoid by default.
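
What the sigmoid does to raw logits can be shown with the standard library alone. This sketch reuses the three example logits from the snippet above and a hand-written sigmoid; it does not call the cross-encoder.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [8.7, -4.3, -11.1]               # example raw scores for three documents
probs = [sigmoid(x) for x in logits]

# Sorting by logit and sorting by probability give the same order,
# because the sigmoid is strictly increasing.
order_by_logit = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
order_by_prob = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

print(probs, order_by_logit == order_by_prob)
```

Since the ranking is unchanged, the activation only matters when you need a threshold or a score that reads as a probability; for pure reranking the raw logits are enough.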

Batch Encode at Scale

For large corpora, raise batch_size to use the hardware fully and turn on show_progress_bar; on a GPU, model.half() roughly doubles throughput for a tiny accuracy cost. Shrink the index with precision="int8" quantization or truncate_dim (Matryoshka) embeddings, and on multi-core CPUs use start_multi_process_pool() plus encode(..., pool=pool) instead of the deprecated standalone encode_multi_process.

sentence-transformers batch panel: encode in tuned batches, half precision for GPU speed, quantize embeddings smaller, truncate dimensions with Matryoshka, parallel multi-process pool.

Tune batch size, precision, and parallelism for big corpora.

model.encode(texts, batch_size=256, show_progress_bar=True)  # tune throughput
model.half()                                     # fp32 -> fp16, ~2x faster on GPU
model.encode(texts, precision="int8")            # quantize: 4x smaller index
model.encode(texts, truncate_dim=128)            # Matryoshka: keep the first 128 dims

pool = model.start_multi_process_pool()          # parallel CPU workers
model.encode(texts, pool=pool, chunk_size=2_000)  # replaces deprecated encode_multi_process

See Efficiency. The standalone encode_multi_process is deprecated in favor of pool=.
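
The storage arithmetic behind quantization and truncation can be sketched with NumPy. The min/max bucketing below is a hand-rolled illustration under the assumption of a simple per-dimension range, not the library's exact quantization scheme, and the random matrix stands in for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384)).astype(np.float32)      # stand-in for encoded texts

# Hand-rolled int8 quantization: map each dimension's observed range onto 256 buckets.
lo, hi = emb.min(axis=0), emb.max(axis=0)
steps = (hi - lo) / 255.0
emb_int8 = np.clip(np.round((emb - lo) / steps - 128), -128, 127).astype(np.int8)

# Matryoshka-style truncation: keep the first 128 dims, then re-normalize.
emb_128 = emb[:, :128]
emb_128 = emb_128 / np.linalg.norm(emb_128, axis=1, keepdims=True)

size_ratio = emb.nbytes / emb_int8.nbytes     # float32 (4 bytes) -> int8 (1 byte)
dim_ratio = emb.shape[1] / emb_128.shape[1]   # 384 -> 128 dims

print(size_ratio, dim_ratio)
```

Truncation only preserves quality when the model was trained so that the leading dimensions carry most of the signal; on random vectors like these it demonstrates the size saving and nothing more.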

The Retrieve-and-Rerank (RAG) Flow

Production semantic search and RAG retrieval use two stages: a cheap bi-encoder retrieves the top-k candidates from millions of documents (high recall), then an expensive cross-encoder reranks just those candidates (high precision). The reranked top passages become the context you hand to a language model, so this pattern is the standard “retriever” half of a retrieval-augmented pipeline. Wide cheap retrieve, then narrow expensive rerank: recall first, precision second.

sentence-transformers RAG panel: retrieve top-k fast with the bi-encoder, rerank the candidates with the cross-encoder, take the best passages, build the prompt context, why two stages recall then precision.

Bi-encoder retrieves top-k, cross-encoder reranks, then you answer.

from sentence_transformers.util import semantic_search

# 1. Retrieve top-k fast with the bi-encoder (high recall)
hits = semantic_search(model.encode_query(q), corpus_emb, top_k=50)
# 2. Rerank just those candidates with the cross-encoder (high precision)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]
ranks = ce.rank(q, candidates)
# 3. Take the best passages and build the prompt context for an LLM
top = [candidates[r["corpus_id"]] for r in ranks[:5]]   # rank ids index candidates, not corpus
context = "\n\n".join(top)              # wide cheap retrieve -> narrow expensive rerank

See Retrieve and rerank. Two stages: recall over millions, precision over dozens.
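
One detail of the two-stage flow is easy to get wrong: the ids coming out of the rerank stage are positions in the candidate list, not in the original corpus. The toy sketch below shows that bookkeeping with invented 2-dim vectors and a word-overlap scorer standing in for the cross-encoder; retrieve and rerank_score are hypothetical helpers, not library functions.

```python
import numpy as np

corpus = ["horse riding", "berlin weather", "paris population", "berlin population"]
# Hypothetical 2-dim "embeddings"; a real pipeline would get these from the bi-encoder.
corpus_emb = np.array([[-0.5, 0.2], [1.0, 0.5], [0.3, 0.9], [0.9, 0.4]])
query = "population of berlin"
query_emb = np.array([1.0, 0.5])

def retrieve(query_emb, corpus_emb, top_k):
    """Stage 1: cheap cosine top-k over the whole corpus; returns corpus ids."""
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return [int(i) for i in np.argsort(-(c @ q))[:top_k]]

def rerank_score(query, doc):
    """Stage 2 stand-in: word overlap. A real pipeline would call the cross-encoder."""
    return len(set(query.split()) & set(doc.split()))

candidate_ids = retrieve(query_emb, corpus_emb, top_k=3)
candidates = [corpus[i] for i in candidate_ids]

# Rerank positions refer to `candidates`, so map them back through candidate_ids.
order = sorted(range(len(candidates)),
               key=lambda j: rerank_score(query, candidates[j]), reverse=True)
best_corpus_ids = [candidate_ids[j] for j in order]
context = "\n\n".join(corpus[i] for i in best_corpus_ids[:2])

print(candidate_ids, best_corpus_ids)
```

Here the retriever ranks "berlin weather" first and the reranker promotes "berlin population"; indexing the corpus directly with the rerank positions would instead have pulled in "horse riding".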

Quick Reference

Key sentence-transformers calls.
Command	What it does	Area
`SentenceTransformer("all-MiniLM-L6-v2")`	Load a bi-encoder model	Load
`model.encode(texts)`	Text to embedding array `(n, dim)`	Encode
`model.encode(texts, convert_to_tensor=True)`	Encode to a torch tensor	Encode
`model.encode_query(q)` / `encode_document(d)`	Asymmetric query / document encoding	Encode
`model.similarity(a, b)`	Pairwise cosine matrix	Similarity
`util.cos_sim(a, b)`	Cosine without a model handle	Similarity
`util.semantic_search(q, corpus, top_k=k)`	Top-k corpus search	Search
`util.dot_score`	Dot-product scorer (normalized vectors)	Search
`CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")`	Load a reranker	Rerank
`ce.predict(pairs)` / `ce.rank(query, docs)`	Score / sort `(query, doc)` pairs	Rerank
`model.encode(texts, batch_size=256)`	Tune throughput on big corpora	Scale
`model.encode(texts, precision="int8")`	Quantize embeddings (smaller index)	Scale
`model.encode(texts, truncate_dim=128)`	Matryoshka truncated dims	Scale
`model.start_multi_process_pool()`	Parallel multi-process encoding	Scale

Common `encode()` keywords.
Argument	Meaning
`inputs` (positional)	The text(s) to encode (renamed from `sentences`)
`batch_size=32`	Texts per forward pass; raise it on a GPU
`convert_to_tensor=False`	Return a torch tensor instead of a NumPy array
`normalize_embeddings=False`	Scale each vector to unit length
`show_progress_bar`	Show a per-batch progress bar
`precision="float32"`	`"int8"` / `"uint8"` / `"binary"` to quantize
`truncate_dim`	Keep only the first N dims (Matryoshka)
`prompt_name` / `prompt`	Apply a named or literal instruction prompt
`pool` / `chunk_size`	Multi-process encoding across workers

Bi-encoder vs cross-encoder.
	Bi-encoder (`SentenceTransformer`)	Cross-encoder (`CrossEncoder`)
Input	Each text alone	A `(query, doc)` pair together
Output	One reusable vector per text	One relevance score per pair
Speed	Fast, embed once and reuse	Slow, recompute per pair
Scales to	Millions of documents	Tens of candidates
Use for	Retrieval / first-stage search	Reranking the top-k

Model starting points.
Need	Start here	Notes
Fast general embeddings	`all-MiniLM-L6-v2`	384 dims, great default
Higher-quality embeddings	`all-mpnet-base-v2`	768 dims, slower
Multilingual	`paraphrase-multilingual-MiniLM-L12-v2`	50+ languages
Asymmetric QA retrieval	`multi-qa-mpnet-base-dot-v1`	Short query, long doc
Reranking	`cross-encoder/ms-marco-MiniLM-L6-v2`	Cross-encoder, not a bi-encoder

Appendix: Sample Code

The text to vector to score mental model

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

emb = model.encode(["a cat on a mat", "a kitten on a rug", "stock prices fell"])
emb.shape                    # (3, 384)  -> three vectors, 384 dims each

sim = model.similarity(emb, emb)   # 3 x 3 cosine matrix
float(sim[0, 1])             # ~0.67  -> cat / kitten are close
float(sim[0, 2])             # ~0.08  -> cat / stocks are far

Semantic search over a corpus

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A cheetah is running behind its prey.",
]
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)

q_emb = model.encode_query("Someone eats bread.", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=3)

for h in hits[0]:
    print(round(h["score"], 3), corpus[h["corpus_id"]])
# 0.567 A man is eating food.
# 0.553 A man is eating pasta.
# ...

Persist the index, reuse it across runs

import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Build once and save:
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)
torch.save(corpus_emb, "corpus.pt")

# Later, in a fresh process:
corpus_emb = torch.load("corpus.pt")
q_emb = model.encode_query("how to reset my password", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=5)

Retrieve-and-rerank (the RAG retriever)

This is the pattern to copy for production semantic search: a cheap bi-encoder retrieves many candidates, an expensive cross-encoder reranks the few that matter.

from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import semantic_search

bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")

corpus_emb = bi.encode_document(corpus, convert_to_tensor=True)
query = "How many people live in Berlin?"

# 1. Retrieve top-k fast with the bi-encoder
hits = semantic_search(bi.encode_query(query), corpus_emb, top_k=50)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]

# 2. Rerank just those candidates with the cross-encoder
ranks = ce.rank(query, candidates, top_k=5)

# 3. Take the best passages as context for an LLM
context = "\n\n".join(candidates[r["corpus_id"]] for r in ranks)

Batch encode at scale

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [f"document number {i}" for i in range(100_000)]

# Tune throughput; quantize the index to shrink it 4x.
emb = model.encode(
    texts,
    batch_size=256,
    show_progress_bar=True,
    precision="int8",        # -> int8 vectors, much smaller on disk
    normalize_embeddings=True,
)

# CPU parallelism (replaces the deprecated encode_multi_process):
pool = model.start_multi_process_pool()
emb = model.encode(texts, pool=pool, chunk_size=2_000)
model.stop_multi_process_pool(pool)

Behavior notes

encode’s first argument is now inputs. Passing sentences= still works but emits a deprecation warning; pass the text positionally or as inputs=.
Prefer util.cos_sim over util.pytorch_cos_sim. The legacy alias is still present for back-compat, but cos_sim is the current name; reach for model.similarity(...) when you already have a model handle.
Use encode_query / encode_document for asymmetric search. They apply the model’s built-in query and document prompts; plain encode does not add them unless you pass prompt_name.
CrossEncoder.predict returns raw logits for the ms-marco rerankers, not probabilities. Pass activation_fn=torch.nn.Sigmoid() to map them into 0..1 before thresholding; other cross-encoders may already apply a sigmoid by default.
encode_multi_process is deprecated. Build a pool with start_multi_process_pool() and pass pool= and chunk_size= straight to encode(...) instead.
Quantize and truncate to shrink the index. precision="int8" stores one byte per dimension instead of four, giving a 4x smaller index, and truncate_dim keeps only the first N dims, which works best with models trained with a Matryoshka objective.

References

sentence-transformers documentation (latest)

Project and related