Beautiful Soup Cheatsheet – TheCoatlessProfessor

Beautiful Soup (the bs4 package) turns messy HTML or XML into a tree of Python objects you can search and navigate. Parsing a document gives you three kinds of nodes: the whole BeautifulSoup document, each Tag (an element like <a> or <li>), and NavigableString text. Almost everything you do falls into three verbs: searching the tree (find, find_all, select), navigating it (.parent, .children, siblings), or extracting strings and attributes from the Tag objects you land on. The conventional import is from bs4 import BeautifulSoup, and a Tag behaves like both a dict (for attributes, tag["href"]) and a node you can descend into (tag.span). Every command below assumes a parsed document named soup, built from the sample HTML in the Appendix.

Parse a Document

BeautifulSoup(markup, parser) reads a string of HTML or XML and builds a tree of Python objects you can search and navigate. The second argument picks the engine: html.parser is built in and needs no install, lxml is faster and more forgiving, and html5lib parses exactly like a browser (auto-adding <html>, <head>, <body>). Always name the parser explicitly so a missing optional dependency does not silently change your results.

Beautiful Soup parse panel: html.parser, lxml, html5lib, file object, prettify, xml.

Turn an HTML or XML string into a navigable tree.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")        # built-in parser, no install needed
BeautifulSoup(html, "lxml")                       # fast and lenient; needs pip install lxml
BeautifulSoup(html, "html5lib")                   # browser-style; wraps fragments in html/head/body
BeautifulSoup(open("page.html"), "html.parser")  # parse a file object (prefer a with block so it gets closed)
print(soup.prettify())                            # pretty-print the nested tree
BeautifulSoup(xml, "xml")                         # XML mode; needs lxml

See Making the soup and Installing a parser.
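
As a small illustration of why the parser name matters, the sketch below feeds one deliberately broken fragment to each parser and records what comes back. The fragment and the results dictionary are made up for this example; lxml and html5lib are optional installs, so the loop records None for any parser that raises FeatureNotFound.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical broken fragment: the <b> tag is never closed.
fragment = "<p>Hello <b>world</p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        # Naming the parser makes a missing dependency fail loudly.
        results[parser] = str(BeautifulSoup(fragment, parser))
    except FeatureNotFound:
        results[parser] = None  # optional parser is not installed here

for parser, rendered in results.items():
    print(f"{parser:12} -> {rendered}")
```

On a machine with only the built-in parser, just the html.parser entry is filled in. The installed parsers may each repair the fragment differently, which is the reason to pick one and name it.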

Read Tags and Attributes

Each element is a Tag that doubles as a dictionary. Descend to the first match with dotted access like soup.h1, read its .name, and pull attributes with tag["href"] (raises if missing) or tag.get("href") (returns None if missing). Use tag.attrs to see the whole attribute dict at once. Note that class comes back as a list, because an element can carry several classes.

Beautiful Soup tags panel: dotted access, name, dict-style attribute, safe get, attrs, has_attr.

A Tag is a dict of attributes plus a node you can descend into.

soup.h1   soup.title              # reach the first matching tag by dotted access
tag.name                          # the tag's name, e.g. "h1"
tag["href"]   tag["class"]        # one attribute, dict-style (class -> a list)
tag.get("href")   tag.get("x", default)   # safe lookup, missing key -> None
tag.attrs                         # all attributes as a dict
tag.has_attr("class")             # test whether an attribute is present

See Kinds of objects.
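
A minimal sketch of the dict-like side of a Tag, using one list item adapted from the Appendix sample. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '<li class="product on-sale" data-id="2"><a href="/p/2" class="name">Nut</a></li>'
soup = BeautifulSoup(html, "html.parser")

li = soup.li                      # dotted access: first <li> in the document
name = li.name                    # "li"
classes = li["class"]             # multi-valued attribute comes back as a list
data_id = li["data-id"]           # ordinary attribute comes back as a string
missing = li.get("href")          # safe lookup: None, because <li> has no href
fallback = li.get("href", "#")    # safe lookup with a default
link_href = li.a["href"]          # descend to the nested <a>, then read href

try:
    li["href"]                    # dict-style access raises for a missing key
    raised = False
except KeyError:
    raised = True

print(name, classes, data_id, missing, fallback, link_href, raised)
```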

Search with find / find_all

find returns the first matching Tag or None; find_all returns a list of every match (empty if there is none). Filter by tag name, by CSS class via the class_= keyword (the trailing underscore avoids the Python keyword class), by any attribute through attrs={...}, or by text with string=. Each of these values can be a plain string, a compiled regex, or a custom function. The limit= keyword caps how many matches you collect.

Beautiful Soup find panel: find, find_all, class_, attrs, regex href, limit.

One match or a list, filtered by name, attrs, or text.

soup.find("a")                                  # first match (or None)
soup.find_all("li")                             # all matches as a list
soup.find_all("a", class_="name")               # filter by CSS class (note class_)
soup.find_all("li", attrs={"data-id": "2"})     # filter by any attribute
soup.find_all("a", href=re.compile(r"^/p/"))    # match an attribute by regex
soup.find_all("li", limit=2)                    # cap the number of results

See Searching the tree.
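
The calls above can be tried end to end on a trimmed copy of the Appendix sample. This is a sketch for illustration; the variable names are not part of any API.

```python
import re
from bs4 import BeautifulSoup

html = """
<ul class="products">
  <li class="product" data-id="1"><a href="/p/1" class="name">Bolt</a></li>
  <li class="product on-sale" data-id="2"><a href="/p/2" class="name">Nut</a></li>
  <li class="product" data-id="3"><a href="https://ex.com/p/3" class="name">Washer</a></li>
</ul>
<p>Contact <a href="mailto:a@b.com">us</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                      # first <a> in document order
no_table = soup.find("table")                    # no match: None, not an error
names = [a.get_text() for a in soup.find_all("a", class_="name")]
on_sale = soup.find_all("li", class_="on-sale")  # class_ matches any one class
item2 = soup.find("li", attrs={"data-id": "2"})  # attrs= handles hyphenated names
internal = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/p/"))]
first_two = soup.find_all("li", limit=2)         # stop after two matches

print(first_link.get_text(), no_table, names, internal)
```

Because find can return None, check the result before chaining another call onto it.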

Search with CSS Selectors

select and select_one accept the same selectors you would type in a browser console, powered by the Soup Sieve library (they are convenient shortcuts for soup.css.select(...) and soup.css.select_one(...), reached through the .css property). They shine for nested conditions: descendant chains (ul.products .price), direct children (ul > li), attribute selectors (a[href]), grouping (h1, h2), and positional pseudo-classes (li:nth-of-type(2)).

Beautiful Soup select panel: select, select_one, descendant, attribute selector, child/grouping, nth-of-type.

The same selectors you would use in a browser.

soup.select("li.product")                # all matches for a selector -> list
soup.select_one("a.name")                # first selector match
soup.select("ul.products .price")        # descendant combinator
soup.select('li[data-id="2"]')           # attribute selector
soup.select("ul > li")   soup.select("h1, h2")   # direct child; grouping
soup.select("li:nth-of-type(2)")         # positional pseudo-class

See CSS selectors and the Soup Sieve selector reference.
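
Here is a short sketch that runs each selector pattern against a copy of the Appendix markup, so the result of every pattern is visible side by side. Variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h1 class="title">Widgets</h1>
  <ul class="products">
    <li class="product" data-id="1"><a href="/p/1" class="name">Bolt</a><span class="price">$3.00</span></li>
    <li class="product on-sale" data-id="2"><a href="/p/2" class="name">Nut</a><span class="price">$1.50</span></li>
    <li class="product" data-id="3"><a href="https://ex.com/p/3" class="name">Washer</a><span class="price">$0.50</span></li>
  </ul>
  <p>Contact <a href="mailto:a@b.com">us</a>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

products = soup.select("li.product")                           # list of Tags
first_name = soup.select_one("a.name").get_text()              # first match only
prices = [s.get_text() for s in soup.select("ul.products .price")]  # descendant
nut = soup.select_one('li[data-id="2"] a.name').get_text()     # attribute + descendant
direct = soup.select("ul > li")                                # direct children
grouped = [t.name for t in soup.select("h1, p")]               # grouping
second_id = soup.select_one("li:nth-of-type(2)")["data-id"]    # positional
rooted = [a["href"] for a in soup.select('a[href^="/"]')]      # href starts with "/"
nothing = soup.select_one("table")                             # None when no match

print(len(products), first_name, prices, nut, grouped, second_id, rooted, nothing)
```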

Walk the Tree

Once you have a tag, move around it: .children and .contents give the immediate nodes, .descendants reaches everything below, .parent and .parents climb upward, and the sibling helpers step sideways. Prefer find_next_sibling over raw .next_sibling when whitespace text nodes would otherwise get in your way, and find_parent over chained .parent lookups when you want the nearest ancestor that matches a filter.

Beautiful Soup walk panel: children/contents, descendants, parent/parents, sibling, ancestor, find_next.

Move up to parents, down to children, sideways to siblings.

list(tag.children)   tag.contents      # direct children (tags + strings)
tag.descendants                        # all descendants, deep
tag.parent   tag.parents               # up to the parent; chain upward
tag.find_next_sibling("li")            # next sibling tag, skipping whitespace
tag.find_parent("li")                  # nearest ancestor matching a filter
tag.find_next("span")                  # next match anywhere in document order

See Navigating the tree.
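
To make the whitespace caveat concrete, the sketch below walks a two-item list that is indented the way hand-written HTML usually is. The markup is a cut-down version of the Appendix sample and the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = """<ul class="products">
  <li data-id="1"><a href="/p/1">Bolt</a><span class="price">$3.00</span></li>
  <li data-id="2"><a href="/p/2">Nut</a><span class="price">$1.50</span></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

ul = soup.ul
all_nodes = ul.contents                               # tags AND whitespace strings
only_tags = [c for c in ul.children if isinstance(c, Tag)]

first = soup.find("li")
raw_sibling = first.next_sibling                      # the newline and indent text
tag_sibling = first.find_next_sibling("li")           # skips straight to the next <li>

link = soup.find("a")
ancestors = [p.name for p in link.parents]            # climbs up to the document
nearest_ul = link.find_parent("ul")                   # nearest ancestor named ul
next_span = link.find_next("span")                    # next <span> in document order
below_first = list(first.descendants)                 # <a>, "Bolt", <span>, "$3.00"

print(len(all_nodes), len(only_tags), repr(raw_sibling), tag_sibling["data-id"])
```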

Pull the Data Out

Extraction is the payoff. get_text() flattens a subtree into one string (add separator= and strip=True to clean it up), .string grabs a single tag’s text, stripped_strings yields the cleaned pieces, and a list comprehension over a search result collects an attribute or text from many nodes at once.

Beautiful Soup extract panel: get_text, separator/strip, string, stripped_strings, collect attribute, collect text.

Get text and attributes, clean and collected.

tag.get_text()                                   # all visible text, joined -> "Bolt$3.00"
tag.get_text(separator=" | ", strip=True)        # text with a separator, trimmed
tag.string                                       # one tag's own string (None if mixed)
list(tag.stripped_strings)                        # cleaned text pieces -> ["Bolt", "$3.00"]
[a["href"] for a in soup.select("a[href]")]       # collect an attribute across matches
[li.get_text(strip=True) for li in soup.select("li")]   # collect text across matches

See get_text.
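
The difference between these accessors is easiest to see on one element with two children. The sketch below uses the first product from the Appendix sample; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<ul><li class="product"><a href="/p/1" class="name">Bolt</a>'
    '<span class="price">$3.00</span></li>'
    '<li class="product"><a href="/p/2" class="name">Nut</a>'
    '<span class="price">$1.50</span></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

joined = li.get_text()                              # pieces run together
separated = li.get_text(separator=" | ", strip=True)
own_string = li.string                              # None: <li> has two children
link_string = li.a.string                           # "Bolt": <a> has one string child
pieces = list(li.stripped_strings)                  # each text piece, trimmed

hrefs = [a["href"] for a in soup.select("a[href]")]             # attribute per match
texts = [item.get_text(" ", strip=True) for item in soup.select("li")]  # text per match

print(joined, separated, own_string, link_string, pieces, hrefs, texts)
```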

Edit the Tree

Beautiful Soup edits the tree in place. Assign to tag.string or tag["attr"] to change content and attributes, build fresh nodes with soup.new_tag(...) and place them using append, insert_before, or insert_after, swap a node with replace_with, and remove one with decompose() (destroy it) or extract() (detach and return it for reuse).

Beautiful Soup edit panel: change text, set attribute, new_tag/append, insert beside, replace_with, decompose/extract.

Add, change, replace, and remove nodes in place.

tag.string = "Gadgets"                           # change a tag's text
tag["href"] = "/changed"   tag["class"] = "x"     # set or change an attribute
t = soup.new_tag("span"); parent.append(t)        # build and append a new node
tag.insert_before(node)   tag.insert_after(node)  # insert next to a node
tag.replace_with(new)                            # replace one node with another
tag.decompose()   tag.extract()                  # remove (destroy) / remove (return it)

See Modifying the tree.
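
A small sketch of the edit calls in sequence, focusing on the difference between detaching a node with extract() and reusing it afterwards. The two-item list is invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Bolt</li><li>Nut</li></ul>", "html.parser")
first, second = soup.find_all("li")

first.string = "Gadget"            # replace the text of the first item
first["class"] = "featured"        # add an attribute

new_li = soup.new_tag("li")        # build a fresh node owned by this soup
new_li.string = "Washer"
second.insert_after(new_li)        # place it right after the second item

removed = second.extract()         # detach "Nut" but keep the node
after_extract = str(soup)

soup.ul.append(removed)            # an extracted node can be re-attached
after_append = str(soup)

print(after_extract)
print(after_append)
```

decompose() would have removed the node just as well, but the node is destroyed and should not be used again, so reach for extract() whenever you plan to move it.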

Serialize Back to a String

When you are done, turn the tree back into text. str(soup) (or str(tag)) renders HTML, prettify() renders it indented, and decode_contents() gives just the inner markup. encode() produces bytes with an explicit encoding, and get_text() on the whole document strips every tag down to plain text.

Beautiful Soup output panel: str, prettify, decode_contents, encode, get_text, write to file.

Turn the tree (or a piece of it) back into HTML text.

str(soup)   str(tag)                     # whole document or one tag back to HTML
soup.prettify()                          # indented, human-readable HTML
tag.decode_contents()                    # inner HTML only (no wrapper)
soup.encode("utf-8")                     # bytes with an encoding -> b"..."
soup.get_text()                          # strip all tags, keep text
Path("out.html").write_text(str(soup))   # save to a file

See Output.
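
The sketch below renders one small tree in each of these ways and writes it to a temporary file. The markup and the temporary path are assumptions made for the example; passing encoding="utf-8" explicitly keeps non-ASCII text intact regardless of the platform default.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="main"><h1>Widgets</h1><p>Café</p></div>', "html.parser"
)

whole = str(soup)                          # compact HTML for the whole tree
inner = soup.div.decode_contents()         # children of <div>, without the wrapper
raw = soup.encode("utf-8")                 # bytes rather than str
text = soup.get_text(" ", strip=True)      # tags stripped, pieces joined by a space
pretty = soup.prettify()                   # one node per line, indented

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "out.html"      # hypothetical output location
    out_path.write_text(whole, encoding="utf-8")
    round_trip = out_path.read_text(encoding="utf-8")

print(whole)
print(inner)
print(text)
```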

Quick Reference

Parsers (pass as the second argument).
Parser	Call	Notes
Built-in	`BeautifulSoup(html, "html.parser")`	No install, decent, pure Python
lxml (HTML)	`BeautifulSoup(html, "lxml")`	Fastest, lenient; needs `pip install lxml`
lxml (XML)	`BeautifulSoup(xml, "xml")`	XML mode; needs `lxml`
html5lib	`BeautifulSoup(html, "html5lib")`	Browser-grade, slowest; needs `pip install html5lib`

find / find_all filters.
Goal	Call
By tag name	`soup.find_all("a")`
By several names	`soup.find_all(["h1", "h2"])`
By CSS class	`soup.find_all("a", class_="name")`
By id	`soup.find(id="main")`
By any attribute	`soup.find_all("li", attrs={"data-id": "2"})`
By attribute regex	`soup.find_all("a", href=re.compile(r"^/p/"))`
By visible text	`soup.find_all(string="Bolt")`
By custom function	`soup.find_all(lambda t: t.has_attr("data-id"))`
Every tag	`soup.find_all(True)`
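
The less common filters in the table above (a list of names, id=, string=, a function, and True) can be sketched on a compact document. The markup is a shortened form of the Appendix sample and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="main"><h1>Widgets</h1><ul>'
    '<li data-id="1"><a href="/p/1">Bolt</a></li>'
    '<li data-id="2"><a href="/p/2">Nut</a></li>'
    "</ul></div>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find(id="main")                                   # keyword = attribute
several = [t.name for t in soup.find_all(["h1", "a"])]         # any of these names
by_text = [str(s) for s in soup.find_all(string="Bolt")]       # returns strings, not tags
with_data = soup.find_all(lambda t: t.has_attr("data-id"))     # function gets each Tag
every = [t.name for t in soup.find_all(True)]                  # True matches every tag

print(by_id.name, several, by_text, len(with_data), every)
```

Note that string= on its own returns the matching text nodes; call .parent on a result to get back to the enclosing tag.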

CSS selector patterns (`select` / `select_one`).
Pattern	Matches
`li.product`	`<li>` with class `product`
`#main`	element with `id="main"`
`ul.products .price`	any `.price` inside `ul.products`
`ul > li`	direct `<li>` children of `<ul>`
`a[href]` · `a[href^="/"]`	links with an `href` / starting with `/`
`h1, h2`	either tag (grouping)
`li:nth-of-type(2)`	the second `<li>` among its siblings

Get text and attributes.
Goal	Call
All text joined	`tag.get_text()`
Text, cleaned	`tag.get_text(separator=" ", strip=True)`
Single tag’s text	`tag.string`
Cleaned text pieces	`list(tag.stripped_strings)`
One attribute (raises if missing)	`tag["href"]`
One attribute (safe)	`tag.get("href")`
All attributes	`tag.attrs`

Edit and output.
Goal	Call
Make a new tag	`soup.new_tag("span", attrs={"class": "x"})`
Add as last child	`parent.append(node)`
Insert beside	`tag.insert_before(node)` · `tag.insert_after(node)`
Replace a node	`tag.replace_with(new)`
Remove (destroy)	`tag.decompose()`
Remove (return it)	`tag.extract()`
Empty a node	`tag.clear()`
Back to HTML	`str(soup)` · `soup.prettify()`
Inner HTML only	`tag.decode_contents()`

Appendix: Sample Code

`page.html` (sample input file)

<!doctype html>
<html>
<head><title>Widgets Store</title></head>
<body>
  <div id="main" class="container">
    <h1 class="title">Widgets</h1>
    <ul class="products">
      <li class="product" data-id="1"><a href="/p/1" class="name">Bolt</a><span class="price">$3.00</span></li>
      <li class="product on-sale" data-id="2"><a href="/p/2" class="name">Nut</a><span class="price">$1.50</span></li>
      <li class="product" data-id="3"><a href="https://ex.com/p/3" class="name">Washer</a><span class="price">$0.50</span></li>
    </ul>
    <p>Contact <a href="mailto:a@b.com">us</a>.</p>
  </div>
</body>
</html>

Building the `soup` used across panels

from bs4 import BeautifulSoup
from pathlib import Path

html = Path("page.html").read_text()
soup = BeautifulSoup(html, "html.parser")   # name the parser for reproducibility

A daily scraping pipeline (the search-then-extract mental model)

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# 1. find the list, 2. iterate items, 3. pull text + attributes per item
products = []
for li in soup.select("ul.products > li.product"):     # CSS search
    products.append(
        {
            "id": li["data-id"],                        # attribute
            "name": li.select_one("a.name").get_text(strip=True),  # nested text
            "url": li.select_one("a.name")["href"],     # nested attribute
            "price": li.select_one(".price").get_text(strip=True),
        }
    )

# products == [
#   {"id": "1", "name": "Bolt",   "url": "/p/1", "price": "$3.00"},
#   {"id": "2", "name": "Nut",    "url": "/p/2", "price": "$1.50"},
#   {"id": "3", "name": "Washer", "url": "https://ex.com/p/3", "price": "$0.50"},
# ]

# collect all internal links with a regex attribute filter
internal = [a["href"] for a in soup.find_all("a", href=re.compile(r"^/p/"))]
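
Real pages are rarely as regular as the sample: select_one returns None when an item lacks a node, and calling get_text on None raises AttributeError. One possible defensive variant is sketched below; text_or_none is a hypothetical helper written for this example, not part of bs4, and the second item deliberately has no price.

```python
from bs4 import BeautifulSoup

html = """
<ul class="products">
  <li class="product" data-id="1"><a href="/p/1" class="name">Bolt</a><span class="price">$3.00</span></li>
  <li class="product" data-id="2"><a href="/p/2" class="name">Nut</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


rows = [
    {
        "id": li.get("data-id"),                 # .get avoids KeyError if absent
        "name": text_or_none(li, "a.name"),
        "price": text_or_none(li, ".price"),     # None for the second item
    }
    for li in soup.select("ul.products > li.product")
]

print(rows)
```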

Editing then serializing

soup = BeautifulSoup(html, "html.parser")

soup.h1.string = "Gadgets"                       # change text
badge = soup.new_tag("span", attrs={"class": "badge"})
badge.string = "NEW"
soup.find("li").append(badge)                    # add a child node
soup.find("p").decompose()                       # remove a node

out = soup.prettify()                            # back to indented HTML

Behavior notes

class_ not class. class is a Python keyword, so the search keyword is class_=. A tag’s class attribute comes back as a list (tag["class"] returns ["product", "on-sale"]) because HTML allows multiple classes.
string=, not text=. The keyword to match by visible text is string=. The old text= argument still works, but recent bs4 4.x releases emit a DeprecationWarning for it.
find_all, not findAll. CamelCase method names (findAll, findAllNext) are deprecated aliases kept for bs4 3 compatibility; use the snake_case forms.
Whitespace is text. .contents, .children, and .next_sibling include NavigableString whitespace between tags. Use find_next_sibling and find_all (which skip strings) when you only want elements.
Parsers can disagree. html.parser, lxml, and html5lib handle broken markup differently; html5lib wraps fragments in <html><head><body>. Always pass the parser name explicitly so a missing optional dependency does not silently change results.
select needs Soup Sieve. CSS selectors are handled by the soupsieve package, which pip installs alongside bs4 as a dependency; very exotic pseudo-classes may be unsupported, but everything in the selectors panel works.
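
The first note, on multi-valued classes, has a consequence worth seeing in code: class_= matches any single class, or the exact attribute string, but not a reordered string, whereas a CSS selector is order-independent. A minimal sketch on two invented list items:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="product">Bolt</li><li class="product on-sale">Nut</li></ul>'
soup = BeautifulSoup(html, "html.parser")

any_product = soup.find_all("li", class_="product")            # both items
any_on_sale = soup.find_all("li", class_="on-sale")            # one item
exact = soup.find_all("li", class_="product on-sale")          # exact string: one item
reordered = soup.find_all("li", class_="on-sale product")      # different order: none
css_both = soup.select("li.on-sale.product")                   # order does not matter

print(len(any_product), len(any_on_sale), len(exact), len(reordered), len(css_both))
```

When an element must carry several classes at once, the selector form is usually the safer choice.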

References

Beautiful Soup documentation

Selectors and project