spaCy Cheatsheet

A visual guide to spaCy covering loading a pipeline, tokens and linguistic features, named entities, the dependency parse, spans and the Matcher, vectors and similarity, pipeline components, and processing at scale with nlp.pipe.

Tags: python, spacy, nlp, cheatsheet
Author: James Balamuta
Published: August 1, 2026

spaCy is the industrial-strength library for turning raw text into structured linguistic objects. It is built around one idea: a trained pipeline (conventionally called nlp) takes a string and returns a Doc, a container of tokens already annotated with parts of speech, lemmas, dependency arcs, and named entities. The recurring mental model in this sheet is one picture: a raw text chip on the left flows along a gray arrow into the spaCy-blue nlp pipeline capsule (a row of component boxes) and emerges as a green Doc on the right, a ribbon of token boxes that later panels annotate (POS tags below, entity bands above, dependency arcs over the top). Where this looks like the scikit-learn sheet, the contrast is the point: scikit-learn fits estimators on feature matrices and returns arrays, while spaCy runs a pre-trained linguistic pipeline on text and returns typed annotation objects (Token, Span, Doc.ents). The conventional import is import spacy, with from spacy.matcher import Matcher, PhraseMatcher and from spacy import displacy where needed, and everything here is spaCy v3.

Complete spaCy cheatsheet: eight panels covering loading a pipeline into a Doc, reading token-level linguistic features, named entities, the dependency parse, spans and the Matcher, similarity and word vectors, pipeline components, and processing at scale with nlp.pipe.

Load a Pipeline, Get a Doc

spaCy is built around one idea: a trained pipeline object (conventionally nlp) takes a string and returns a Doc, a container of tokens already annotated with parts of speech, lemmas, dependencies, and entities. You load a pipeline package with spacy.load("en_core_web_sm"), call nlp(text) to process a single document, and then index the Doc like a sequence: doc[0] is a Token, doc[1:3] is a Span, and doc.sents are sentences.

spaCy pipeline panel: load a trained pipeline, process text into a Doc, index the Doc as a sequence, split into sentences, round-trip the original text, start blank with no model.

A trained pipeline maps text to an annotated Doc.

import spacy

nlp = spacy.load("en_core_web_sm")        # load a trained pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")   # text -> Doc

len(doc)                                  # number of tokens
doc[0], doc[1:3]                          # a Token, then a Span
list(doc.sents)                           # sentences as Spans
doc.text                                  # round-trips to the original string
blank_nlp = spacy.blank("en")             # tokenizer only, no trained components

See spaCy 101. Use the modern spacy.load(...); nlp(text) returns the annotated Doc.
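
The sequence behavior described above can be sketched without downloading a trained package. This is a minimal example that assumes spacy.blank("en") plus the rule-based sentencizer as a stand-in for a trained pipeline; the sample text and variable names are illustrative only. A trained package such as en_core_web_sm exposes the same indexing API but sets sentence boundaries from its parser.

```python
import spacy
from spacy.tokens import Span, Token

# Assumption: a blank English pipeline (tokenizer only) with the rule-based
# sentencizer added, so the sketch runs without a downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "This is a sentence. This is another one."
doc = nlp(text)

first_token = doc[0]      # integer index -> Token
first_span = doc[1:3]     # slice -> Span
sentences = [sent.text for sent in doc.sents]

print(len(doc))                        # number of tokens, punctuation included
print(type(first_token).__name__)      # Token
print(type(first_span).__name__)       # Span
print(sentences)
print(doc.text == text)                # tokenization is non-destructive
```

Indexing with an integer yields a Token and slicing yields a Span, which is why later panels can treat entities, noun chunks, and sentences uniformly as Spans.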

Tokens and Linguistic Features

Each Token exposes its linguistic annotations as attributes: token.pos_ is the coarse part of speech, token.tag_ the fine-grained tag, token.lemma_ the dictionary form, and token.morph the morphological features. The convention is that string-valued attributes end in an underscore (pos_, lemma_, dep_), while the bare name returns an integer hash; spacy.explain("VBG") turns any tag into a human-readable description.

spaCy tokens panel: part-of-speech tags, lemma base form, token shape and boolean flags, morphological features, explain a tag in words, iterate tokens and their syntactic heads.

Each Token carries POS, lemma, dependency, and morphology.

token.pos_, token.tag_                    # coarse POS, fine-grained tag
token.lemma_                              # dictionary base form
token.shape_                              # orthographic shape, e.g. 'Xxxxx'
token.is_stop, token.is_alpha, token.like_num   # boolean lexical flags
token.morph.to_dict()                     # {'Tense': 'Pres', 'Number': 'Sing', ...}
spacy.explain("VBG")                      # 'verb, gerund or present participle'
[(t.text, t.pos_, t.head.text) for t in doc]   # token, POS, syntactic head

See POS tagging. String-valued attributes end in _; the bare name is an integer hash.
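
The underscore convention can be checked directly. The sketch below assumes a blank English pipeline, so only lexical attributes (text, shape, flags) are filled in; pos_, tag_, and lemma_ would need a trained package. The sentence and variable names are illustrative.

```python
import spacy

# Assumption: blank pipeline, so only tokenizer-level (lexical) attributes are set.
nlp = spacy.blank("en")
doc = nlp("Apple is buying 10 startups")
apple = doc[0]

orth_id = apple.orth                      # bare name: an integer ID
orth_str = apple.orth_                    # underscore name: the string
resolved = nlp.vocab.strings[orth_id]     # the StringStore maps the ID back to text

# Lexical flags are available even without trained components.
flags = [(t.text, t.shape_, t.is_alpha, t.is_stop, t.like_num) for t in doc]

print(orth_id, orth_str, resolved)
for row in flags:
    print(row)
```

The same pairing holds for pos/pos_, lemma/lemma_, and dep/dep_ once a trained pipeline has assigned them.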

Named Entities (doc.ents)

The named-entity recognizer labels real-world spans, so doc.ents is a tuple of Span objects, each with ent.text, ent.label_ (such as ORG, GPE, MONEY), and character offsets. Per token, the same information appears as BIO tags (token.ent_iob_, token.ent_type_), and displacy.render(doc, style="ent") draws the entities as colored highlights.

spaCy entities panel: list the entities, read one entity's text label and character offsets, decode a label with explain, per-token BIO tags, visualize entities as colored highlights with displacy.

Real-world spans the model labels (ORG, GPE, MONEY, …).

from spacy import displacy

doc.ents                                  # (Apple/ORG, U.K./GPE, $1 billion/MONEY, ...)
ent.text, ent.label_, ent.start_char, ent.end_char   # one entity's fields
spacy.explain("GPE")                      # 'Countries, cities, states'
token.ent_iob_, token.ent_type_          # per-token BIO tag and entity label
displacy.render(doc, style="ent")         # colored highlight marks behind entities

See Named entities. doc.ents holds Spans; per token use ent_iob_ / ent_type_.
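
To see how doc.ents and the per-token BIO tags line up, here is a minimal sketch that assumes rule-based entities from an entity_ruler on a blank pipeline rather than a statistical model's predictions; the two patterns are hypothetical and exist only to make the output deterministic.

```python
import spacy

# Assumption: entities come from hand-written entity_ruler patterns,
# not from a trained NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "GPE", "pattern": "San Francisco"},
])

doc = nlp("Apple is in San Francisco.")

# Span-level view: text, label, and character offsets into doc.text
entities = [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]

# Token-level view: B = begins an entity, I = inside one
bio = [(t.text, t.ent_iob_, t.ent_type_) for t in doc]

print(entities)
for row in bio:
    print(row)
```

The character offsets index into doc.text, so doc.text[e.start_char:e.end_char] recovers the entity string.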

The Dependency Parse

spaCy parses every sentence into a tree where each token has exactly one syntactic head and a typed relation token.dep_ (such as nsubj, dobj, prep); the one token whose head is itself is the ROOT. Walk token.children and token.head to navigate the tree, read doc.noun_chunks for base noun phrases, and call displacy.render(doc, style="dep") for the classic arc diagram.

spaCy parse panel: find the sentence ROOT, read a token's relation and head, walk a token's children, extract base noun chunks, decode a relation label, visualize the parse as a dependency arc diagram.

Every token has one head; arcs are typed syntactic relations.

from spacy import displacy

[t for t in doc if t.dep_ == "ROOT"]      # the sentence root (its head is itself)
token.dep_, token.head                    # relation to head, and the head Token
list(token.children)                      # tokens that depend on this one
list(doc.noun_chunks)                     # base noun phrases as Spans
spacy.explain("nsubj")                    # 'nominal subject'
displacy.render(doc, style="dep")         # the classic curved-arc diagram (SVG)

See Dependency parse. Each token has one head; the ROOT is its own head.
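
Tree navigation can be tried without a trained parser. The sketch below assumes a hand-annotated parse passed to the Doc constructor (heads are absolute token indices in spaCy v3); the sentence and its labels are illustrative, not model output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Assumption: a hand-written parse instead of a predicted one.
words = ["She", "ate", "the", "pizza"]
heads = [1, 1, 3, 1]                      # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

root = [t for t in doc if t.dep_ == "ROOT"][0]
arcs = [(t.text, t.dep_, t.head.text) for t in doc]
root_children = [child.text for child in root.children]

print(root.text, root.head.i == root.i)   # the root is its own head
print(arcs)
print(root_children)
```

With a trained pipeline the same attributes are filled in by the parser, and the loop over doc reads identically.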

Spans and the Matcher

A Span is a slice of a Doc (doc[2:4], or doc.char_span(0, 5) from character offsets) and is the unit for entities, noun chunks, and your own annotations stored in doc.spans. The Matcher finds sequences by matching token-attribute patterns (lists of dicts like {"LOWER": "looking"} or {"POS": "VERB"}), while the PhraseMatcher matches exact phrases quickly by pre-processing them into Doc objects.

spaCy matcher panel: slice tokens into a Span, build a span from character offsets, token-attribute patterns with the Matcher, run the matcher, match exact phrases with PhraseMatcher, store named span groups.

Slice the Doc into spans; find token patterns with the Matcher.

from spacy.matcher import Matcher, PhraseMatcher

span = doc[2:4]                           # a Span slice of the Doc
doc.char_span(0, 5, label="ORG")          # a Span from character offsets

m = Matcher(nlp.vocab)                     # token-attribute pattern matcher
m.add("LOOK", [[{"LOWER": "looking"}, {"LOWER": "at"}, {"POS": "VERB"}]])
for mid, start, end in m(doc):
    doc[start:end]                         # the matched Span, e.g. "looking at buying"

pm = PhraseMatcher(nlp.vocab)              # fast exact-phrase matcher
pm.add("CO", [nlp.make_doc("Apple")])
doc.spans["sc"] = [doc[0:1], doc[5:6]]    # named span groups (overlap allowed)

See Rule-based matching. Matcher matches token attributes; PhraseMatcher matches phrases.
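
A self-contained sketch of the three span tools follows. It assumes a blank pipeline, so the token pattern uses IS_ALPHA where the panel uses POS (a POS constraint needs a trained tagger); the pattern names and the "sc" span key are illustrative.

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup.")

# Spans from character offsets: returns None when the offsets
# do not line up with token boundaries.
apple = doc.char_span(0, 5, label="ORG")
misaligned = doc.char_span(0, 3)

# Token-attribute pattern (assumption: IS_ALPHA stands in for POS here).
matcher = Matcher(nlp.vocab)
matcher.add("LOOK_AT", [[{"LOWER": "looking"}, {"LOWER": "at"}, {"IS_ALPHA": True}]])
token_matches = [(nlp.vocab.strings[match_id], doc[start:end].text)
                 for match_id, start, end in matcher(doc)]

# Exact phrases: patterns are Doc objects built with the tokenizer only.
phrase_matcher = PhraseMatcher(nlp.vocab)
phrase_matcher.add("COMPANY", [nlp.make_doc("Apple")])
phrase_matches = [doc[start:end].text for _id, start, end in phrase_matcher(doc)]

# Named span group: unlike doc.ents, spans here may overlap.
doc.spans["sc"] = [doc[0:1], doc[2:5]]

print(apple.text, apple.label_, misaligned)
print(token_matches)
print(phrase_matches)
print([span.text for span in doc.spans["sc"]])
```

Checking the result of char_span for None before using it avoids errors when offsets come from an external annotation tool.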

Similarity and Word Vectors

Pipelines that ship word vectors give every Token, Span, and Doc a .vector, and .similarity() returns the cosine similarity between two of them. The important caveat: the small packages (names ending in sm) do not ship word vectors, only context-sensitive tensors, so for meaningful similarity load a medium (md) or large (lg) package, and remember that similarity reflects the training data, not ground truth.

spaCy vectors panel: read a token's vector and shape, check for real vectors, compare two docs or tokens with cosine similarity, load a vectors pipeline, the small model warning, manage expectations about similarity.

Compare meaning with vectors. Use md/lg, not sm, for real vectors.

token.vector, token.vector.shape          # a word vector, shape (300,) for md/lg
token.has_vector, token.is_oov            # real vector? out of vocabulary?
doc1.similarity(doc2)                      # cosine similarity (needs vectors)
token1.similarity(token2)                  # same, token to token

nlp = spacy.load("en_core_web_md")        # md/lg ship genuine 300-dim vectors
# en_core_web_sm has tensors, NOT vectors: similarity there is unreliable.
# similarity reflects the training data, not ground truth.

See Vectors and similarity. sm has tensors only; use md or lg for real vectors.
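
To make the cosine-similarity claim concrete without downloading en_core_web_md, the sketch below assumes three-dimensional toy vectors assigned by hand with Vocab.set_vector; the numbers are invented for illustration and carry no linguistic meaning.

```python
import numpy as np
import spacy

nlp = spacy.blank("en")

# Assumption: hypothetical toy vectors, not trained embeddings.
nlp.vocab.set_vector("dog", np.array([1.0, 0.0, 0.0], dtype="float32"))
nlp.vocab.set_vector("cat", np.array([1.0, 1.0, 0.0], dtype="float32"))

doc = nlp("dog cat banana")
dog, cat, banana = doc[0], doc[1], doc[2]

similarity = float(dog.similarity(cat))

# The same quantity computed by hand: cosine of the angle between the vectors.
manual = float(
    np.dot(dog.vector, cat.vector)
    / (np.linalg.norm(dog.vector) * np.linalg.norm(cat.vector))
)

print(dog.has_vector, cat.has_vector, banana.has_vector)
print(round(similarity, 4), round(manual, 4))
```

Tokens without a vector, like "banana" here, report has_vector as False, which is the check to run before trusting a similarity score.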

Pipeline Components

The nlp object is an ordered sequence of named components (nlp.pipe_names lists them, for example tok2vec, tagger, parser, lemmatizer, ner), and you can edit that sequence: nlp.add_pipe("entity_ruler", before="ner") inserts a built-in component, @Language.component registers your own, and nlp.select_pipes(disable=[...]) temporarily turns components off. nlp.analyze_pipes() reports what each component requires and assigns so you can reason about the order.

spaCy components panel: list the components, add a built-in component, add rule-based entity patterns, run without some components, register a custom component, inspect what each requires and assigns.

nlp is a sequence of named components you can inspect and edit.

from spacy.language import Language

nlp.pipe_names                            # ['tok2vec', 'tagger', 'parser', ..., 'ner']
ruler = nlp.add_pipe("entity_ruler", before="ner")   # insert a built-in component
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])   # rule-based entities

with nlp.select_pipes(disable=["ner"]):   # temporarily turn components off
    nlp(text)

@Language.component("my_comp")            # register a custom component
def my_comp(doc):
    return doc
nlp.add_pipe("my_comp", last=True)
nlp.analyze_pipes(pretty=True)            # per component: requires / assigns

See Processing pipelines. Edit with add_pipe; inspect with analyze_pipes.
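
The edit-and-disable cycle can be sketched end to end on a blank pipeline. The component name token_counter and the user_data key are hypothetical, chosen only for this illustration.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")      # hypothetical component name
def token_counter(doc):
    # A component receives a Doc, may annotate it, and must return it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # built-in, rule-based component
nlp.add_pipe("token_counter", last=True)  # the custom component registered above
names_before = list(nlp.pipe_names)

doc = nlp("Hello world. Bye.")

# Inside the block the component is skipped; it is restored on exit.
with nlp.select_pipes(disable=["token_counter"]):
    names_inside = list(nlp.pipe_names)
    skipped = nlp("Hello world.")
names_after = list(nlp.pipe_names)

print(names_before, names_inside, names_after)
print(doc.user_data, skipped.user_data)
```

Because select_pipes is used as a context manager here, the pipeline returns to its full component list automatically after the block.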

Process at Scale with nlp.pipe

When you have many texts, never loop nlp(text) one at a time; pass an iterable to nlp.pipe(texts), which batches them for a large speedup and yields Doc objects as a stream. Tune batch_size, set n_process for multiprocessing, pass as_tuples=True to carry per-text metadata through, and disable=[...] to skip components you do not need for the task.

spaCy pipe panel: stream many texts, tune the batch size, run components in parallel with n_process, keep metadata with as_tuples, skip unused components, process then collect results.

Stream many texts efficiently in batches.

docs = nlp.pipe(texts)                     # stream + batch many texts (a generator)
nlp.pipe(texts, batch_size=50)             # tune the batch size for throughput
nlp.pipe(texts, n_process=4)               # multiprocessing across workers
for doc, ctx in nlp.pipe(data, as_tuples=True):   # carry per-text metadata
    ctx["id"]                              # the context dict rides through unchanged
nlp.pipe(texts, disable=["parser", "ner"])   # skip components you do not need
for doc in nlp.pipe(texts):
    doc.ents                               # collect results from the Doc stream

See Processing text. Prefer nlp.pipe(texts) over a loop of nlp(text).
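
A minimal sketch of streaming with metadata, assuming a blank pipeline so it runs without a trained package; the texts and the "id" key are illustrative. The point is that each context object comes back paired with the Doc built from its text, in input order.

```python
import spacy

nlp = spacy.blank("en")

texts = ["First document.", "Second document.", "Third document."]
data = [(text, {"id": i}) for i, text in enumerate(texts)]

# as_tuples=True: pass (text, context) pairs in, get (doc, context) pairs out.
results = []
for doc, ctx in nlp.pipe(data, as_tuples=True, batch_size=2):
    results.append((ctx["id"], doc.text, len(doc)))

for row in results:
    print(row)
```

Since nlp.pipe yields lazily, collecting into a list as above is only appropriate when the whole result set fits in memory; for large corpora, process each Doc inside the loop instead.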

Quick Reference

Key spaCy calls.

Command | What it does | Area
nlp = spacy.load("en_core_web_sm") | Load a trained pipeline | Pipeline
doc = nlp(text) | Process text into a Doc | Pipeline
doc.sents | Iterate sentences (Spans) | Pipeline
token.pos_ / token.tag_ | Coarse / fine part of speech | Tokens
token.lemma_ | Dictionary base form | Tokens
token.morph | Morphological features | Tokens
doc.ents | Named-entity Spans | Entities
ent.label_ | Entity type (ORG, GPE, …) | Entities
token.dep_ / token.head | Dependency relation / head | Parse
doc.noun_chunks | Base noun phrases | Parse
doc[i:j] / doc.char_span(a, b) | Make a Span | Spans
Matcher(nlp.vocab) | Token-pattern matcher | Matcher
PhraseMatcher(nlp.vocab) | Exact-phrase matcher | Matcher
doc1.similarity(doc2) | Cosine similarity (needs vectors) | Vectors
nlp.pipe_names | List pipeline components | Components
nlp.add_pipe("entity_ruler") | Insert a component | Components
nlp.select_pipes(disable=[...]) | Temporarily turn components off | Components
nlp.pipe(texts) | Stream/batch many texts | Scale
spacy.explain("nsubj") | Decode a tag/label | Helpers

The core spaCy objects.

Object | What it is | You get it from
Language (nlp) | The trained pipeline | spacy.load(...) / spacy.blank(...)
Doc | A processed document of tokens | nlp(text) / nlp.pipe(texts)
Token | One token with its annotations | doc[i], iterating doc
Span | A slice of a Doc | doc[i:j], doc.ents, doc.sents
Lexeme | A context-free vocab entry | nlp.vocab[string]
Vocab | Shared strings and vectors | nlp.vocab

Frequently used Token attributes.

Attribute | Type | Meaning
token.text | str | The verbatim token text
token.lemma_ | str | Base form
token.pos_ | str | Coarse part of speech (UPOS)
token.tag_ | str | Fine-grained tag
token.dep_ | str | Dependency relation to head
token.head | Token | Syntactic parent
token.morph | MorphAnalysis | Morphological features
token.ent_type_ | str | Entity label if part of an entity
token.is_stop | bool | Is a stop word
token.is_alpha | bool | All alphabetic characters
token.like_num | bool | Resembles a number
token.vector | ndarray | Word vector (md/lg pipelines)

Selected doc.ents labels (use spacy.explain for the rest).

Label | Meaning
PERSON | People, including fictional
ORG | Companies, agencies, institutions
GPE | Countries, cities, states
LOC | Non-GPE locations (mountains, water)
DATE / TIME | Absolute or relative dates and times
MONEY | Monetary values, with unit
PERCENT | Percentages
PRODUCT | Objects, vehicles, foods (not services)

English pipeline packages.

Package | Vectors? | Use it for
en_core_web_sm | No (tensors only) | Fast tagging, parsing, NER; small footprint
en_core_web_md | Yes (300-dim) | Adds word vectors for similarity
en_core_web_lg | Yes (300-dim, larger) | Best vector coverage
en_core_web_trf | No (transformer) | Highest accuracy, transformer-based

Appendix: Sample Code

The text to Doc mental model

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion in San Francisco.")

len(doc)              # 15  (tokens, not words)
doc[0].text           # 'Apple'
doc[1:3].text         # 'is looking'  (a Span)
[s.text for s in doc.sents]   # one sentence here
doc.text              # round-trips to the original string

Reading token-level linguistic features

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a startup.")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text)
# Apple Apple PROPN NNP nsubj looking
# is    be    AUX   VBZ aux   looking
# ...

doc[1].morph.to_dict()        # {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', ...}
spacy.explain("VBG")          # 'verb, gerund or present participle'

Named entities and rule-based entities

import spacy

nlp = spacy.load("en_core_web_sm")

# Model-predicted entities
doc = nlp("Apple is in San Francisco.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
# Apple ORG 0 5
# San Francisco GPE 12 25

# Add rule-based entities with an EntityRuler before the statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
doc = nlp("I use spaCy every day.")
print([(e.text, e.label_) for e in doc.ents])   # [('spaCy', 'ORG')]

Matching token patterns

import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")

# Token-attribute pattern: "looking at <VERB>"
matcher = Matcher(nlp.vocab)
matcher.add("LOOK_AT_VERB", [[{"LOWER": "looking"}, {"LOWER": "at"}, {"POS": "VERB"}]])
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
# LOOK_AT_VERB -> looking at buying

# Exact phrases, fast
phrase = PhraseMatcher(nlp.vocab)
phrase.add("COMPANY", [nlp.make_doc("Apple")])
print([doc[s:e].text for _id, s, e in phrase(doc)])   # ['Apple']

Similarity (load a vectors pipeline first)

import spacy

# en_core_web_sm has NO word vectors (tensors only); use md or lg for similarity.
# python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

doc1.similarity(doc2)                 # e.g. 0.69  (cosine similarity)
nlp("dog")[0].similarity(nlp("cat")[0])   # vectors must exist (has_vector == True)

Editing the pipeline and processing at scale

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

# Inspect what each component requires/assigns
nlp.analyze_pipes(pretty=True)

texts = ["First document.", "Second document.", "Third document."]

# Stream many texts efficiently; carry metadata with as_tuples
data = [(t, {"id": i}) for i, t in enumerate(texts)]
for doc, ctx in nlp.pipe(data, as_tuples=True, batch_size=2):
    print(ctx["id"], [(e.text, e.label_) for e in doc.ents])

# Skip components you do not need for the task (faster)
for doc in nlp.pipe(texts, disable=["parser", "ner"]):
    print([t.pos_ for t in doc])

Behavior notes

  • One pipeline, one Doc. spaCy’s whole model is nlp(text) -> Doc; every later object (Token, Span, doc.ents, dependency arcs) is read off that single annotated Doc.
  • String attributes end in an underscore. token.pos_, token.lemma_, token.dep_, and ent.label_ return strings; the underscore-free name returns an integer hash, not a string.
  • The small pipeline has no word vectors. en_core_web_sm ships context-sensitive tensors but not word vectors, so .similarity() is unreliable there; load en_core_web_md or en_core_web_lg for genuine 300-dim word vectors.
  • Use the modern factory form. Add components with nlp.add_pipe("component") and disable them with nlp.select_pipes(disable=[...]); avoid the spaCy 2.x spellings nlp.create_pipe(...) and nlp.disable_pipes(...).
  • Prefer nlp.pipe for many texts. It batches internally for a large speedup and supports batch_size, n_process, as_tuples, and disable; do not loop nlp(text) one at a time.

References

spaCy documentation (v3)
