sentence-transformers (SBERT) is the standard library for turning text into dense embedding vectors and running semantic search, similarity, and retrieve-and-rerank. You load a pretrained model by Hub name, call model.encode(...) to map any string into a fixed-length vector, and then compare vectors by cosine similarity to measure how close two pieces of text are in meaning. The recurring mental model in this sheet is one picture: text on the left flows along a gray arrow through an accent model box into a vector (a short row of numbered cells tagged with its dimension, e.g. 384), and two vectors flow into a green score chip (cosine in [-1, 1], bright when high). A bi-encoder embeds each text once and compares vectors; a cross-encoder reads a (query, doc) pair together and emits one relevance score. The conventional import is from sentence_transformers import SentenceTransformer, CrossEncoder, util, and everything here is sentence-transformers v5 (5.x renamed encode’s first argument from sentences to inputs, so each section flags the current spelling).
Load a Model
A SentenceTransformer wraps a pretrained transformer plus a pooling layer that turns any text into one fixed-length vector, and you load it by Hub name in a single line. Start with all-MiniLM-L6-v2 (fast, 384 dims) for prototyping and move to all-mpnet-base-v2 (slower, 768 dims, stronger) when quality matters. Pass device="cuda" to run on a GPU, and use get_sentence_embedding_dimension() and max_seq_length to inspect the output size and the input length cap.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") # fast default
model.get_sentence_embedding_dimension() # 384 (vector size)
model.max_seq_length # 256 (token cap; longer is truncated)
SentenceTransformer("all-mpnet-base-v2", device="cuda") # run on a GPU
SentenceTransformer("all-mpnet-base-v2") # stronger, 768 dims, slower
See Quickstart. Browse the pretrained bi-encoders for other models.
Encode Text to Vectors
model.encode(...) is the workhorse: give it a string or a list of strings and get back a NumPy array of shape (n_texts, dim), or a torch tensor with convert_to_tensor=True. Set normalize_embeddings=True so cosine similarity reduces to a dot product, and use encode_query / encode_document for asymmetric search so the model applies its built-in query and document prompts. Turn on show_progress_bar=True to watch long jobs run batch by batch.
emb = model.encode(["a cat on a mat", "a dog in a park"]) # (2, 384) float32 ndarray
model.encode(texts, convert_to_tensor=True) # torch.Tensor (for GPU math)
model.encode(texts, normalize_embeddings=True) # unit length: cosine becomes a dot product
model.encode_query(q) # query prompt applied
model.encode_document(doc) # document prompt applied
model.encode(texts, show_progress_bar=True) # per-batch progress bar
See Computing embeddings. The first positional argument is now inputs (was sentences).
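Why normalizing helps can be checked by hand. The sketch below uses only the standard library and made-up 3-dim vectors instead of real model output; the helper names dot, normalize, and cosine are illustrative, not part of sentence-transformers.

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dim vectors with made-up values, standing in for real embeddings.
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

a_unit, b_unit = normalize(a), normalize(b)

print(cosine(a, b))         # cosine on the raw vectors
print(dot(a_unit, b_unit))  # plain dot product on the unit vectors
```

Both prints give the same value (up to floating-point rounding), which is why a dot-product scorer is safe once the vectors have unit length.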
Cosine Similarity Between Sentences
Once two texts are vectors, their meaning-closeness is the cosine of the angle between them, a number in [-1, 1] where higher is more similar. Call model.similarity(a, b) for the full pairwise matrix or model.similarity_pairwise(a, b) to compare rows one-to-one; both honor the model’s configured metric, which is cosine by default. When you do not have a model handle, util.cos_sim computes the same score directly from two vectors.
from sentence_transformers.util import cos_sim
emb = model.encode(["a cat on a mat", "a kitten on a rug", "stock prices fell"]) # (3, 384)
sim = model.similarity(emb, emb) # full 3 x 3 cosine matrix
float(sim[0, 1]) # 0.67 (close); float(sim[0, 2]) -> 0.08 (far)
model.similarity_pairwise(a, b) # compare matching rows one-to-one
cos_sim(emb[0], emb[1]) # 0.67 cosine without a model handle
SentenceTransformer(name, similarity_fn_name="dot") # cosine | dot | euclidean | manhattan
See Semantic textual similarity. Prefer util.cos_sim over the legacy util.pytorch_cos_sim.
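The difference between the full matrix and the one-to-one comparison is mostly a question of output shape. Here is a minimal pure-Python sketch on toy 2-dim vectors; all_pairs and row_pairs are hypothetical helpers that only mimic the shapes described above, not the library's implementation.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def all_pairs(rows_a, rows_b):
    """Full matrix: result[i][j] compares rows_a[i] with rows_b[j]."""
    return [[cosine(a, b) for b in rows_b] for a in rows_a]

def row_pairs(rows_a, rows_b):
    """One-to-one: result[i] compares rows_a[i] with rows_b[i] only."""
    return [cosine(a, b) for a, b in zip(rows_a, rows_b)]

# Toy vectors with made-up values.
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

matrix = all_pairs(left, left)    # 3 x 3: every row against every row
aligned = row_pairs(left, right)  # length 3: row i against row i

print(len(matrix), len(matrix[0]))  # 3 3
print(len(aligned))                 # 3
```

Use the matrix form when every text must be compared with every other, and the one-to-one form when the two inputs are already aligned pairs.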
Semantic Search
To rank a corpus against a query, embed the corpus once into a matrix, embed the query, and call util.semantic_search, which returns the top-k matches as {corpus_id, score} dicts sorted best first. Under the hood it is util.cos_sim (or util.dot_score) plus a top-k sort, so you can also compute the score matrix yourself with cos_sim when you need custom filtering. Swap in score_function=dot_score for speed when your vectors are already normalized.
from sentence_transformers.util import semantic_search, cos_sim, dot_score
corpus_emb = model.encode(corpus, convert_to_tensor=True) # N x 384 matrix
q_emb = model.encode(query, convert_to_tensor=True) # (384,) vector for a single string
hits = semantic_search(q_emb, corpus_emb, top_k=3) # [{corpus_id, score}, ...]
cos_sim(q_emb, corpus_emb) # 1 x N score strip, then argsort
semantic_search(q_emb, corpus_emb, score_function=dot_score) # dot (needs normalized vectors)
See Semantic search. The hits are {corpus_id, score} dicts sorted best first.
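The "score, then top-k sort" idea can be sketched without the library. The function below is a hypothetical stand-in for a single query over toy 2-dim vectors. It returns the same {corpus_id, score} shape, but it is not how util.semantic_search is implemented (the real function batches queries and returns one list of hits per query).

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k_search(query_vec, corpus_vecs, top_k=3):
    """Score every corpus vector against the query, then keep the best top_k."""
    scores = [cosine(query_vec, doc_vec) for doc_vec in corpus_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [{"corpus_id": i, "score": scores[i]} for i in ranked[:top_k]]

# Toy corpus "embeddings" with made-up values.
corpus_vecs = [
    [1.0, 0.0],   # id 0
    [0.0, 1.0],   # id 1
    [0.7, 0.7],   # id 2
    [-1.0, 0.0],  # id 3
]
query_vec = [0.9, 0.1]

hits = top_k_search(query_vec, corpus_vecs, top_k=2)
for h in hits:
    print(h["corpus_id"], round(h["score"], 3))
```

Custom filtering (for example, dropping hits below a score threshold) fits between the scoring line and the sort.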
Build and Query an Embedding Index
The expensive step is encoding, so encode the corpus once with encode_document, keep the resulting tensor as your index, and reuse it for every incoming query embedded with encode_query. Persist the tensor with torch.save / torch.load to skip re-encoding between runs, and append new document embeddings with torch.cat as your corpus grows. Each query then becomes a cheap top-k lookup over the saved matrix.
import torch
from sentence_transformers.util import semantic_search
corpus_emb = model.encode_document(corpus, convert_to_tensor=True) # build the index
torch.save(corpus_emb, "corpus.pt") # encode once, save
corpus_emb = torch.load("corpus.pt") # load it back later
q_emb = model.encode_query("how to reset password")
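for h in semantic_search(q_emb, corpus_emb, top_k=5)[0]:
    print(corpus[h["corpus_id"]], h["score"]) # ranked hits, top first
new_emb = model.encode_document(new, convert_to_tensor=True) # a tensor, so torch.cat accepts it
corpus_emb = torch.cat([corpus_emb, new_emb.to(corpus_emb.device)]) # grow the index
corpus.extend(new) # keep corpus_id aligned with the index rows
See semantic_search reference. encode_document applies the model’s document prompt.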
Rerank with a CrossEncoder
A bi-encoder embeds query and document separately, which is fast and scalable but a little blurry; a CrossEncoder instead reads the (query, document) pair together and outputs a single, sharper relevance score. Cross-encoders are too slow to score a whole corpus, so you run them only on the top-k candidates from retrieval, using ce.predict(pairs) for raw scores or ce.rank(query, docs) for a sorted result. Pass activation_fn=torch.nn.Sigmoid() to map raw logits into 0..1.
import torch
from sentence_transformers import CrossEncoder
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2") # a reranker
scores = ce.predict([(query, d) for d in docs]) # raw logits: [8.7, -4.3, -11.1]
ranks = ce.rank(query, docs, top_k=3) # sorted [{corpus_id, score}, ...]
ce.predict(pairs, activation_fn=torch.nn.Sigmoid()) # squash logits into 0..1
# bi-encoder: two vectors then compare (fast); cross-encoder: one joined pass (accurate)
See CrossEncoder usage. With this ms-marco model, predict returns raw logits, not probabilities.
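To see what the sigmoid does to raw scores, the sketch below applies it by hand to the example logits quoted in the snippet above. It uses only the standard library; sigmoid here is a plain helper, not the torch module.

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

raw_logits = [8.7, -4.3, -11.1]  # example raw scores from the snippet above
probs = [sigmoid(x) for x in raw_logits]

for logit, p in zip(raw_logits, probs):
    print(f"{logit:6.1f} -> {p:.5f}")
```

The mapping is monotonic, so it never changes the ranking. It only rescales the scores so that a fixed threshold such as 0.5 becomes meaningful.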
Batch Encode at Scale
For large corpora, raise batch_size to use the hardware fully and turn on show_progress_bar; on a GPU, model.half() roughly doubles throughput for a tiny accuracy cost. Shrink the index with precision="int8" quantization or truncate_dim (Matryoshka) embeddings, and on multi-core CPUs use start_multi_process_pool() plus encode(..., pool=pool) instead of the deprecated standalone encode_multi_process.
model.encode(texts, batch_size=256, show_progress_bar=True) # tune throughput
model.half() # fp32 -> fp16, ~2x faster on GPU
model.encode(texts, precision="int8") # quantize: 4x smaller index
model.encode(texts, truncate_dim=128) # Matryoshka: keep the first 128 dims
pool = model.start_multi_process_pool() # parallel CPU workers
model.encode(texts, pool=pool, chunk_size=2_000) # replaces deprecated encode_multi_process
See Efficiency. The standalone encode_multi_process is deprecated in favor of pool=.
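The two index-shrinking ideas can be illustrated on a toy vector. This is a rough sketch under stated assumptions, not the library's quantization code: quantize_int8 assumes every value lies in a fixed range of -1 to 1 (the library derives ranges from data), and truncate_and_renormalize is a hypothetical helper showing what keeping the first N dims means.

```python
import math
from array import array

def quantize_int8(vec, lo=-1.0, hi=1.0):
    """Map floats in [lo, hi] onto the 256 integer levels -128..127.

    Assumption: a fixed value range; values outside it are clipped.
    """
    scale = 255.0 / (hi - lo)
    levels = (round((x - lo) * scale) - 128 for x in vec)
    return array("b", (max(-128, min(127, q)) for q in levels))

def truncate_and_renormalize(vec, dim):
    """Keep the first `dim` values, then rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim float32 "embedding" with made-up values.
full = array("f", [0.5, -0.25, 0.1, 0.8, -0.6, 0.05, 0.3, -0.9])

quantized = quantize_int8(full)
short = truncate_and_renormalize(full, 4)

print(full.itemsize * len(full))            # bytes as float32
print(quantized.itemsize * len(quantized))  # bytes as int8
print(len(short))                           # dims after truncation
```

The 4x saving comes purely from storing one byte per dimension instead of four. Renormalizing after truncation only matters if you score with a dot product rather than cosine.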
The Retrieve-and-Rerank (RAG) Flow
Production semantic search and RAG retrieval use two stages: a cheap bi-encoder retrieves the top-k candidates from millions of documents (high recall), then an expensive cross-encoder reranks just those candidates (high precision). The reranked top passages become the context you hand to a language model, so this pattern is the standard “retriever” half of a retrieval-augmented pipeline. Wide cheap retrieve, then narrow expensive rerank: recall first, precision second.
from sentence_transformers.util import semantic_search
# 1. Retrieve top-k fast with the bi-encoder (high recall)
hits = semantic_search(model.encode_query(q), corpus_emb, top_k=50)
# 2. Rerank just those candidates with the cross-encoder (high precision)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]
ranks = ce.rank(q, candidates)
# 3. Take the best passages and build the prompt context for an LLM
top = [candidates[r["corpus_id"]] for r in ranks[:5]] # corpus_id indexes candidates, not corpus
context = "\n\n".join(top) # wide cheap retrieve -> narrow expensive rerank
See Retrieve and rerank. Two stages: recall over millions, precision over dozens.
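The index bookkeeping between the two stages is easy to get wrong, so here is a model-free sketch of it. Both scorers are stand-ins invented for illustration (word overlap for the retriever, Jaccard overlap for the reranker), and retrieve and rerank are hypothetical helpers. The point is that the second stage's corpus_id refers to a position in the candidate list, not in the original corpus.

```python
def words(text):
    """Lowercased word set with trailing periods removed."""
    return set(text.lower().replace(".", "").split())

def retrieve(query, corpus, top_k):
    """Stage 1 stand-in: cheap word-overlap count over the whole corpus."""
    q = words(query)
    scored = [(len(q & words(doc)), i) for i, doc in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"corpus_id": i, "score": s} for s, i in scored[:top_k]]

def rerank(query, candidates, top_k):
    """Stage 2 stand-in: Jaccard overlap, run only on the candidates."""
    q = words(query)
    scores = [len(q & words(d)) / len(q | words(d)) for d in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    # corpus_id here is a position in `candidates`, not in the full corpus.
    return [{"corpus_id": i, "score": scores[i]} for i in order[:top_k]]

corpus = [
    "The girl is carrying a baby.",
    "A cheetah is running behind its prey.",
    "A man is eating food.",
    "A man is riding a horse.",
    "A man is eating pasta.",
]
query = "a man is eating bread"

hits = retrieve(query, corpus, top_k=3)              # wide, cheap
candidates = [corpus[h["corpus_id"]] for h in hits]  # map ids back to text
ranks = rerank(query, candidates, top_k=2)           # narrow, expensive
top = [candidates[r["corpus_id"]] for r in ranks]    # index candidates, not corpus
context = "\n\n".join(top)
print(context)
```

Indexing corpus with the reranker's ids would silently return the wrong passages whenever the retrieval order differs from the corpus order, as it does in this toy data.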
Quick Reference
| Command | What it does | Area |
|---|---|---|
| SentenceTransformer("all-MiniLM-L6-v2") | Load a bi-encoder model | Load |
| model.encode(texts) | Text to embedding array (n, dim) | Encode |
| model.encode(texts, convert_to_tensor=True) | Encode to a torch tensor | Encode |
| model.encode_query(q) / encode_document(d) | Asymmetric query / document encoding | Encode |
| model.similarity(a, b) | Pairwise cosine matrix | Similarity |
| util.cos_sim(a, b) | Cosine without a model handle | Similarity |
| util.semantic_search(q, corpus, top_k=k) | Top-k corpus search | Search |
| util.dot_score | Dot-product scorer (normalized vectors) | Search |
| CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2") | Load a reranker | Rerank |
| ce.predict(pairs) / ce.rank(query, docs) | Score / sort (query, doc) pairs | Rerank |
| model.encode(texts, batch_size=256) | Tune throughput on big corpora | Scale |
| model.encode(texts, precision="int8") | Quantize embeddings (smaller index) | Scale |
| model.encode(texts, truncate_dim=128) | Matryoshka truncated dims | Scale |
| model.start_multi_process_pool() | Parallel multi-process encoding | Scale |
| Argument | Meaning |
|---|---|
| inputs (positional) | The text(s) to encode (renamed from sentences) |
| batch_size=32 | Texts per forward pass; raise it on a GPU |
| convert_to_tensor=False | Return a torch tensor instead of a NumPy array |
| normalize_embeddings=False | Scale each vector to unit length |
| show_progress_bar | Show a per-batch progress bar |
| precision="float32" | "int8" / "uint8" / "binary" to quantize |
| truncate_dim | Keep only the first N dims (Matryoshka) |
| prompt_name / prompt | Apply a named or literal instruction prompt |
| pool / chunk_size | Multi-process encoding across workers |
| | Bi-encoder (SentenceTransformer) | Cross-encoder (CrossEncoder) |
|---|---|---|
| Input | Each text alone | A (query, doc) pair together |
| Output | One reusable vector per text | One relevance score per pair |
| Speed | Fast, embed once and reuse | Slow, recompute per pair |
| Scales to | Millions of documents | Tens of candidates |
| Use for | Retrieval / first-stage search | Reranking the top-k |
| Need | Start here | Notes |
|---|---|---|
| Fast general embeddings | all-MiniLM-L6-v2 | 384 dims, great default |
| Higher-quality embeddings | all-mpnet-base-v2 | 768 dims, slower |
| Multilingual | paraphrase-multilingual-MiniLM-L12-v2 | 50+ languages |
| Asymmetric QA retrieval | multi-qa-mpnet-base-dot-v1 | Short query, long doc |
| Reranking | cross-encoder/ms-marco-MiniLM-L6-v2 | Cross-encoder, not a bi-encoder |
Appendix: Sample Code
The text to vector to score mental model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(["a cat on a mat", "a kitten on a rug", "stock prices fell"])
emb.shape # (3, 384) -> three vectors, 384 dims each
sim = model.similarity(emb, emb) # 3 x 3 cosine matrix
float(sim[0, 1]) # ~0.67 -> cat / kitten are close
float(sim[0, 2]) # ~0.08 -> cat / stocks are far
Semantic search over a corpus
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [
    "A man is eating food.",
    "A man is eating pasta.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A cheetah is running behind its prey.",
]
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)
q_emb = model.encode_query("Someone eats bread.", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=3)
for h in hits[0]:
    print(round(h["score"], 3), corpus[h["corpus_id"]])
# 0.567 A man is eating food.
# 0.553 A man is eating pasta.
# ...
Persist the index, reuse it across runs
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Build once and save:
corpus_emb = model.encode_document(corpus, convert_to_tensor=True)
torch.save(corpus_emb, "corpus.pt")
# Later, in a fresh process:
corpus_emb = torch.load("corpus.pt")
q_emb = model.encode_query("how to reset my password", convert_to_tensor=True)
hits = semantic_search(q_emb, corpus_emb, top_k=5)
Retrieve-and-rerank (the RAG retriever)
This is the pattern to copy for production semantic search: a cheap bi-encoder retrieves many candidates, an expensive cross-encoder reranks the few that matter.
from sentence_transformers import SentenceTransformer, CrossEncoder
from sentence_transformers.util import semantic_search
bi = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
corpus_emb = bi.encode_document(corpus, convert_to_tensor=True)
query = "How many people live in Berlin?"
# 1. Retrieve top-k fast with the bi-encoder
hits = semantic_search(bi.encode_query(query), corpus_emb, top_k=50)
candidates = [corpus[h["corpus_id"]] for h in hits[0]]
# 2. Rerank just those candidates with the cross-encoder
ranks = ce.rank(query, candidates, top_k=5)
# 3. Take the best passages as context for an LLM
context = "\n\n".join(candidates[r["corpus_id"]] for r in ranks)
Batch encode at scale
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = [f"document number {i}" for i in range(100_000)]
# Tune throughput; quantize the index to shrink it 4x.
emb = model.encode(
    texts,
    batch_size=256,
    show_progress_bar=True,
    precision="int8", # -> int8 vectors, much smaller on disk
    normalize_embeddings=True,
)
# CPU parallelism (replaces the deprecated encode_multi_process):
pool = model.start_multi_process_pool()
emb = model.encode(texts, pool=pool, chunk_size=2_000)
model.stop_multi_process_pool(pool)
Behavior notes
- encode’s first argument is now inputs. Passing sentences= still works but emits a deprecation warning; pass the text positionally or as inputs=.
- Prefer util.cos_sim over util.pytorch_cos_sim. The legacy alias is still present for back-compat, but cos_sim is the current name; reach for model.similarity(...) when you already have a model handle.
- Use encode_query / encode_document for asymmetric search. They apply the model’s built-in query and document prompts; plain encode does not add them unless you pass prompt_name.
- CrossEncoder.predict returns raw logits, not probabilities. Pass activation_fn=torch.nn.Sigmoid() to map them into 0..1 before thresholding.
- encode_multi_process is deprecated. Build a pool with start_multi_process_pool() and pass pool= and chunk_size= straight to encode(...) instead.
- Quantize and truncate to shrink the index. precision="int8" gives a 4x smaller index and truncate_dim (Matryoshka) keeps only the most important dims for huge corpora.
References
sentence-transformers documentation (latest)
- Documentation home and the Quickstart
- Computing embeddings, Semantic textual similarity, Semantic search
- CrossEncoder usage, Efficiency, Retrieve and rerank
- Pretrained bi-encoders, pretrained cross-encoders, the util reference
Project and related