spaCy is the industrial-strength library for turning raw text into structured linguistic objects. It is built around one idea: a trained pipeline (conventionally called nlp) takes a string and returns a Doc, a container of tokens already annotated with parts of speech, lemmas, dependency arcs, and named entities. The recurring mental model in this sheet is one picture: a raw text chip on the left flows along a gray arrow into the spaCy-blue nlp pipeline capsule (a row of component boxes) and emerges as a green Doc on the right, a ribbon of token boxes that later panels annotate (POS tags below, entity bands above, dependency arcs over the top). Where this looks like the scikit-learn sheet, the contrast is the point: scikit-learn fits estimators on feature matrices and returns arrays, while spaCy runs a pre-trained linguistic pipeline on text and returns typed annotation objects (Token, Span, Doc.ents). The conventional import is import spacy, with from spacy.matcher import Matcher, PhraseMatcher and from spacy import displacy where needed, and everything here is spaCy v3.
Load a Pipeline, Get a Doc
Load a pipeline package with spacy.load("en_core_web_sm"), call nlp(text) to process a single document, and then index the Doc like a sequence: doc[0] is a Token, doc[1:3] is a Span, and doc.sents yields the sentences as Spans.
import spacy
nlp = spacy.load("en_core_web_sm") # load a trained pipeline
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.") # text -> Doc
len(doc) # number of tokens
doc[0], doc[1:3] # a Token, then a Span
list(doc.sents) # sentences as Spans
doc.text # round-trips to the original string
nlp = spacy.blank("en") # tokenizer only, no components
See spaCy 101. Use the modern spacy.load(...); nlp(text) returns the annotated Doc.
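The round-trip claim can be checked without downloading a trained package. The sketch below assumes a blank English pipeline (tokenizer only) and rebuilds the text from each token's text_with_ws; the sample sentence is only an illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp = spacy.blank("en")
text = "Apple is looking at buying U.K. startup."
doc = nlp(text)

print(nlp.pipe_names)                 # a blank pipeline has no trained components
print([t.text for t in doc])          # the tokenizer still splits the text
print(doc[0].text, "|", doc[1:3].text)  # a Token, then a Span

# Each token remembers its trailing whitespace, so the text can be rebuilt exactly.
rebuilt = "".join(t.text_with_ws for t in doc)
print(rebuilt == text == doc.text)
```

Because the blank pipeline has no parser, doc.sents is not available here; load a trained package for sentences, tags, and entities.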
Tokens and Linguistic Features
Each Token exposes its linguistic annotations as attributes: token.pos_ is the coarse part of speech, token.tag_ the fine-grained tag, token.lemma_ the dictionary form, and token.morph the morphological features. The convention is that string-valued attributes end in an underscore (pos_, lemma_, dep_), while the bare name returns an integer hash; spacy.explain("VBG") turns any tag into a human-readable description.
token.pos_, token.tag_ # coarse POS, fine-grained tag
token.lemma_ # dictionary base form
token.is_stop, token.is_alpha, token.like_num # boolean lexical flags
token.morph.to_dict() # {'Tense': 'Pres', 'Number': 'Sing', ...}
spacy.explain("VBG") # 'verb, gerund or present participle'
[(t.text, t.pos_, t.head.text) for t in doc] # token, POS, syntactic head
See POS tagging. String-valued attributes end in _; the bare name is an integer hash.
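A small sketch of the underscore convention, assuming a blank pipeline so that nothing needs to be downloaded. Since a blank pipeline has no tagger, it uses the verbatim-text attribute pair orth_ / orth instead of pos_ / pos; the lookup through nlp.vocab.strings shows how the integer and the string map to each other.

```python
import spacy

# Assumption: tokenizer-only pipeline, so pos_ and lemma_ would be empty here.
nlp = spacy.blank("en")
doc = nlp("Apple is looking")
token = doc[0]

as_text = token.orth_   # str: the verbatim token text
as_hash = token.orth    # int: the hash of that string

print(as_text, as_hash)
# The vocab's StringStore maps in both directions.
print(nlp.vocab.strings[as_hash] == as_text)
print(nlp.vocab.strings[as_text] == as_hash)
```

The same pairing holds for pos_ / pos, lemma_ / lemma, and dep_ / dep once a trained pipeline has set them.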
Named Entities (doc.ents)
The named-entity recognizer labels real-world spans, so doc.ents is a tuple of Span objects, each with ent.text, ent.label_ (such as ORG, GPE, MONEY), and character offsets. Per token, the same information appears as BIO tags (token.ent_iob_, token.ent_type_), and displacy.render(doc, style="ent") draws the entities as colored highlights.
from spacy import displacy
doc.ents # (Apple/ORG, U.K./GPE, $1 billion/MONEY, ...)
ent.text, ent.label_, ent.start_char, ent.end_char # one entity's fields
spacy.explain("GPE") # 'Countries, cities, states'
token.ent_iob_, token.ent_type_ # per-token BIO tag and entity label
displacy.render(doc, style="ent") # colored highlight marks behind entities
See Named entities. doc.ents holds Spans; per token use ent_iob_ / ent_type_.
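To see how the per-token tags and the entity spans carry the same information, here is a plain-Python sketch that groups (ent_iob_, ent_type_) pairs back into spans. The helper name spans_from_iob and the hand-written tag list are hypothetical; this is not how spaCy builds doc.ents internally.

```python
def spans_from_iob(tags):
    """Group per-token (iob, label) pairs into (start, end, label) spans.

    `tags` mimics [(t.ent_iob_, t.ent_type_) for t in doc]; `end` is exclusive.
    """
    spans = []
    start = label = None
    for i, (iob, ent_type) in enumerate(tags):
        # Anything other than "I" closes the entity that is currently open.
        if start is not None and iob != "I":
            spans.append((start, i, label))
            start = label = None
        # "B" opens a new entity at this token.
        if iob == "B":
            start, label = i, ent_type
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hand-written tags for "Apple is in San Francisco ." (not model output).
tags = [("B", "ORG"), ("O", ""), ("O", ""), ("B", "GPE"), ("I", "GPE"), ("O", "")]
print(spans_from_iob(tags))  # [(0, 1, 'ORG'), (3, 5, 'GPE')]
```

The (start, end) pairs are token indices, so doc[start:end] would give the matching Span.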
The Dependency Parse
spaCy parses every sentence into a tree where each token has exactly one syntactic head and a typed relation token.dep_ (such as nsubj, dobj, prep); the one token whose head is itself is the ROOT. Walk token.children and token.head to navigate the tree, read doc.noun_chunks for base noun phrases, and call displacy.render(doc, style="dep") for the classic arc diagram.
from spacy import displacy
[t for t in doc if t.dep_ == "ROOT"] # the sentence root (its head is itself)
token.dep_, token.head # relation to head, and the head Token
list(token.children) # tokens that depend on this one
list(doc.noun_chunks) # base noun phrases as Spans
spacy.explain("nsubj") # 'nominal subject'
displacy.render(doc, style="dep") # the classic curved-arc diagram (SVG)
See Dependency parse. Each token has one head; the ROOT is its own head.
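Tree navigation can be tried without a trained parser by building a Doc with hand-written heads and labels. In the sketch below the words, heads, and deps are assumptions made up for illustration, not parser output; the heads are absolute token indices, as the spaCy v3 Doc constructor expects.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # only the vocab is used; no parser is run

# Assumption: heads and labels are written by hand for illustration.
words = ["Apple", "is", "looking", "at", "startups"]
heads = [2, 2, 2, 2, 3]  # absolute index of each token's head
deps = ["nsubj", "aux", "ROOT", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

root = [t for t in doc if t.dep_ == "ROOT"][0]
print(root.text, root.head.i == root.i)     # the root is its own head
print([c.text for c in root.children])      # direct dependents of the root
print(doc[4].text, "->", doc[4].head.text)  # one step up the tree
```

With a trained pipeline the same attributes are filled in by the parser, so the navigation code stays the same.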
Spans and the Matcher
A Span is a slice of a Doc (doc[2:4], or doc.char_span(0, 5) from character offsets) and is the unit for entities, noun chunks, and your own annotations stored in doc.spans. The Matcher finds sequences by matching token-attribute patterns (lists of dicts like {"LOWER": "looking"} or {"POS": "VERB"}), while the PhraseMatcher matches exact phrases quickly by pre-processing them into Doc objects.
from spacy.matcher import Matcher, PhraseMatcher
span = doc[2:4] # a Span slice of the Doc
doc.char_span(0, 5, label="ORG") # a Span from character offsets
m = Matcher(nlp.vocab) # token-attribute pattern matcher
m.add("LOOK", [[{"LOWER": "looking"}, {"LOWER": "at"}, {"POS": "VERB"}]])
for mid, start, end in m(doc):
    doc[start:end] # the matched Span, e.g. "looking at buying"
pm = PhraseMatcher(nlp.vocab) # fast exact-phrase matcher
pm.add("CO", [nlp.make_doc("Apple")])
doc.spans["sc"] = [doc[0:1], doc[5:6]] # named span groups (overlap allowed)
See Rule-based matching. Matcher matches token attributes; PhraseMatcher matches phrases.
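One detail worth seeing in code: doc.char_span only returns a Span when the character offsets line up with token boundaries, and returns None otherwise. The sketch assumes a blank pipeline and also uses the alignment_mode argument, which is part of the spaCy v3 API but is not mentioned elsewhere on this sheet.

```python
import spacy

nlp = spacy.blank("en")  # assumption: a tokenizer-only pipeline is enough for spans
doc = nlp("Apple is looking at buying a startup.")

aligned = doc.char_span(0, 5, label="ORG")  # characters 0-5 cover the token "Apple"
misaligned = doc.char_span(0, 4)            # "Appl" ends inside a token
expanded = doc.char_span(0, 4, alignment_mode="expand")  # grow to whole tokens

print(aligned.text, aligned.label_)
print(misaligned)
print(expanded.text)
```

Checking for None before using the result is a reasonable habit when the offsets come from an external annotation tool.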
Similarity and Word Vectors
Pipelines that ship word vectors give every Token, Span, and Doc a .vector, and .similarity() returns the cosine similarity between two of them. The important caveat: the small packages (names ending in sm) do not ship word vectors, only context-sensitive tensors, so for meaningful similarity load a medium (md) or large (lg) package, and remember that similarity reflects the training data, not ground truth.
token.vector, token.vector.shape # a word vector, shape (300,) for md/lg
token.has_vector, token.is_oov # real vector? out of vocabulary?
doc1.similarity(doc2) # cosine similarity (needs vectors)
token1.similarity(token2) # same, token to token
nlp = spacy.load("en_core_web_md") # md/lg ship genuine 300-dim vectors
# en_core_web_sm has tensors, NOT vectors: similarity there is unreliable.
# similarity reflects the training data, not ground truth.
See Vectors and similarity. sm has tensors only; use md or lg for real vectors.
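For intuition about the number that .similarity() reports, here is cosine similarity written out in plain Python. The helper cosine and the three-dimensional toy vectors are made up for illustration; returning 0.0 for an all-zero vector is a choice made in this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration (real md/lg vectors have 300 dimensions).
dog, cat, car = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [0.0, 0.0, 3.0]
print(cosine(dog, cat))  # same direction, so 1.0 up to float rounding
print(cosine(dog, car))  # orthogonal, so 0.0
```

Only the direction of the vectors matters, not their length, which is why dog and cat score as identical here.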
Pipeline Components
The nlp object is an ordered sequence of named components (nlp.pipe_names lists them, for example tok2vec, tagger, parser, lemmatizer, ner), and you can edit that sequence: nlp.add_pipe("entity_ruler", before="ner") inserts a built-in component, @Language.component registers your own, and nlp.select_pipes(disable=[...]) temporarily turns components off. nlp.analyze_pipes() reports what each component requires and assigns so you can reason about the order.
from spacy.language import Language
nlp.pipe_names # ['tok2vec', 'tagger', 'parser', ..., 'ner']
ruler = nlp.add_pipe("entity_ruler", before="ner") # insert a built-in component
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}]) # rule-based entities
with nlp.select_pipes(disable=["ner"]): # temporarily turn components off
    nlp(text)
@Language.component("my_comp") # register a custom component
def my_comp(doc):
    return doc
nlp.add_pipe("my_comp", last=True)
nlp.analyze_pipes(pretty=True) # per component: requires / assigns
See Processing pipelines. Edit with add_pipe; inspect with analyze_pipes.
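A custom component that does something observable makes the add and disable steps easier to follow. This is a minimal sketch on a blank pipeline; the component name token_counter and the user_data key n_tokens are hypothetical choices for this example.

```python
import spacy
from spacy.language import Language


# Hypothetical component name and user_data key, chosen for this sketch.
@Language.component("token_counter")
def token_counter(doc):
    doc.user_data["n_tokens"] = len(doc)  # stash a value on the Doc
    return doc                            # a component must return the Doc


nlp = spacy.blank("en")  # assumption: blank pipeline, no trained components
nlp.add_pipe("token_counter", last=True)
print(nlp.pipe_names)

doc_on = nlp("Apple is looking")
print(doc_on.user_data["n_tokens"])

with nlp.select_pipes(disable=["token_counter"]):
    doc_off = nlp("Apple is looking")  # the component is skipped inside the block
print("n_tokens" in doc_off.user_data)
print(nlp.pipe_names)                  # restored once the block exits
```

For values you want to read as attributes, spaCy's custom extension attributes are another option; user_data is used here only to keep the sketch short.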
Process at Scale with nlp.pipe
When you have many texts, never loop nlp(text) one at a time; pass an iterable to nlp.pipe(texts), which batches them for a large speedup and yields Doc objects as a stream. Tune batch_size, set n_process for multiprocessing, pass as_tuples=True to carry per-text metadata through, and disable=[...] to skip components you do not need for the task.
docs = nlp.pipe(texts) # stream + batch many texts (a generator)
nlp.pipe(texts, batch_size=50) # tune the batch size for throughput
nlp.pipe(texts, n_process=4) # multiprocessing across workers
for doc, ctx in nlp.pipe(data, as_tuples=True): # carry per-text metadata
    ctx["id"] # the context dict rides through unchanged
nlp.pipe(texts, disable=["parser", "ner"]) # skip components you do not need
for doc in nlp.pipe(texts):
    doc.ents # collect results from the Doc stream
See Processing text. Prefer nlp.pipe(texts) over a loop of nlp(text).
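The stream-plus-batch idea behind nlp.pipe can be pictured with a few lines of plain Python. The helper batched below is hypothetical and only illustrates the pattern of pulling fixed-size chunks from a lazy iterable; it is not spaCy's internal implementation.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of up to `size` items from any iterable, lazily."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# A generator of texts: only one batch is held in memory at a time.
texts = (f"Document number {i}." for i in range(5))
batches = list(batched(texts, 2))
print([len(b) for b in batches])  # [2, 2, 1]
```

This is why nlp.pipe accepts a generator and returns one: the input can be larger than memory as long as the results are consumed as they arrive.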
Quick Reference
| Command | What it does | Area |
|---|---|---|
| nlp = spacy.load("en_core_web_sm") | Load a trained pipeline | Pipeline |
| doc = nlp(text) | Process text into a Doc | Pipeline |
| doc.sents | Iterate sentences (Spans) | Pipeline |
| token.pos_ / token.tag_ | Coarse / fine part of speech | Tokens |
| token.lemma_ | Dictionary base form | Tokens |
| token.morph | Morphological features | Tokens |
| doc.ents | Named-entity Spans | Entities |
| ent.label_ | Entity type (ORG, GPE, …) | Entities |
| token.dep_ / token.head | Dependency relation / head | Parse |
| doc.noun_chunks | Base noun phrases | Parse |
| doc[i:j] / doc.char_span(a, b) | Make a Span | Spans |
| Matcher(nlp.vocab) | Token-pattern matcher | Matcher |
| PhraseMatcher(nlp.vocab) | Exact-phrase matcher | Matcher |
| doc1.similarity(doc2) | Cosine similarity (needs vectors) | Vectors |
| nlp.pipe_names | List pipeline components | Components |
| nlp.add_pipe("entity_ruler") | Insert a component | Components |
| nlp.select_pipes(disable=[...]) | Temporarily turn components off | Components |
| nlp.pipe(texts) | Stream/batch many texts | Scale |
| spacy.explain("nsubj") | Decode a tag/label | Helpers |

| Object | What it is | You get it from |
|---|---|---|
| Language (nlp) | The trained pipeline | spacy.load(...) / spacy.blank(...) |
| Doc | A processed document of tokens | nlp(text) / nlp.pipe(texts) |
| Token | One token with its annotations | doc[i], iterating doc |
| Span | A slice of a Doc | doc[i:j], doc.ents, doc.sents |
| Lexeme | A context-free vocab entry | nlp.vocab[string] |
| Vocab | Shared strings and vectors | nlp.vocab |

| Attribute | Type | Meaning |
|---|---|---|
| token.text | str | The verbatim token text |
| token.lemma_ | str | Base form |
| token.pos_ | str | Coarse part of speech (UPOS) |
| token.tag_ | str | Fine-grained tag |
| token.dep_ | str | Dependency relation to head |
| token.head | Token | Syntactic parent |
| token.morph | MorphAnalysis | Morphological features |
| token.ent_type_ | str | Entity label if part of an entity |
| token.is_stop | bool | Is a stop word |
| token.is_alpha | bool | All alphabetic characters |
| token.like_num | bool | Resembles a number |
| token.vector | ndarray | Word vector (md/lg pipelines) |

| Label | Meaning |
|---|---|
| PERSON | People, including fictional |
| ORG | Companies, agencies, institutions |
| GPE | Countries, cities, states |
| LOC | Non-GPE locations (mountains, water) |
| DATE / TIME | Absolute or relative dates and times |
| MONEY | Monetary values, with unit |
| PERCENT | Percentages |
| PRODUCT | Objects, vehicles, foods (not services) |

| Package | Vectors? | Use it for |
|---|---|---|
| en_core_web_sm | No (tensors only) | Fast tagging, parsing, NER; small footprint |
| en_core_web_md | Yes (300-dim) | Adds word vectors for similarity |
| en_core_web_lg | Yes (300-dim, larger) | Best vector coverage |
| en_core_web_trf | No (transformer) | Highest accuracy, transformer-based |
Appendix: Sample Code
The text to Doc mental model
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion in San Francisco.")
len(doc) # 15 (tokens, not words)
doc[0].text # 'Apple'
doc[1:3].text # 'is looking' (a Span)
[s.text for s in doc.sents] # one sentence here
doc.text # round-trips to the original string
Reading token-level linguistic features
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a startup.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.text)
# Apple Apple PROPN NNP nsubj looking
# is be AUX VBZ aux looking
# ...
doc[1].morph.to_dict() # {'Mood': 'Ind', 'Number': 'Sing', 'Person': '3', ...}
spacy.explain("VBG") # 'verb, gerund or present participle'
Named entities and rule-based entities
import spacy
nlp = spacy.load("en_core_web_sm")
# Model-predicted entities
doc = nlp("Apple is in San Francisco.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
# Apple ORG 0 5
# San Francisco GPE 12 25
# Add rule-based entities with an EntityRuler before the statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "ORG", "pattern": "spaCy"}])
doc = nlp("I use spaCy every day.")
print([(e.text, e.label_) for e in doc.ents]) # [('spaCy', 'ORG')]
Matching token patterns
import spacy
from spacy.matcher import Matcher, PhraseMatcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")
# Token-attribute pattern: "looking at <VERB>"
matcher = Matcher(nlp.vocab)
matcher.add("LOOK_AT_VERB", [[{"LOWER": "looking"}, {"LOWER": "at"}, {"POS": "VERB"}]])
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[start:end].text)
# LOOK_AT_VERB -> looking at buying
# Exact phrases, fast
phrase = PhraseMatcher(nlp.vocab)
phrase.add("COMPANY", [nlp.make_doc("Apple")])
print([doc[s:e].text for _id, s, e in phrase(doc)]) # ['Apple']
Similarity (load a vectors pipeline first)
import spacy
# en_core_web_sm has NO word vectors (tensors only); use md or lg for similarity.
# python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
doc1.similarity(doc2) # e.g. 0.69 (cosine similarity)
nlp("dog")[0].similarity(nlp("cat")[0]) # vectors must exist (has_vector == True)
Editing the pipeline and processing at scale
import spacy
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
# Inspect what each component requires/assigns
nlp.analyze_pipes(pretty=True)
texts = ["First document.", "Second document.", "Third document."]
# Stream many texts efficiently; carry metadata with as_tuples
data = [(t, {"id": i}) for i, t in enumerate(texts)]
for doc, ctx in nlp.pipe(data, as_tuples=True, batch_size=2):
    print(ctx["id"], [(e.text, e.label_) for e in doc.ents])
# Skip components you do not need for the task (faster)
for doc in nlp.pipe(texts, disable=["parser", "ner"]):
    print([t.pos_ for t in doc])
Behavior notes
- One pipeline, one Doc. spaCy's whole model is nlp(text) -> Doc; every later object (Token, Span, doc.ents, dependency arcs) is read off that single annotated Doc.
- String attributes end in an underscore. token.pos_, token.lemma_, token.dep_, and ent.label_ return strings; the underscore-free name returns an integer hash, not a string.
- The small pipeline has no word vectors. en_core_web_sm ships context-sensitive tensors but not word vectors, so .similarity() is unreliable there; load en_core_web_md or en_core_web_lg for genuine 300-dim word vectors.
- Use the modern factory form. Add components with nlp.add_pipe("component") and disable them with nlp.select_pipes(disable=[...]); avoid the spaCy 2.x spellings nlp.create_pipe(...) and nlp.disable_pipes(...).
- Prefer nlp.pipe for many texts. It batches internally for a large speedup and supports batch_size, n_process, as_tuples, and disable; do not loop nlp(text) one at a time.
References
spaCy documentation (v3)
- Documentation home and spaCy 101
- Linguistic features: POS tagging, named entities, dependency parse, vectors and similarity
- Rule-based matching, processing pipelines, the visualizers
- API reference: Language, Doc, Token, Span, Matcher
Project and related
- spaCy on PyPI and on GitHub
- Trained pipeline packages (sm/md/lg) and the displaCy live demo