How to Use Beautiful Soup to Parse a Web Page for Scraping: Web Scraping Tutorial #2

Note: This tutorial is part of our “Python Web Scraping Tutorial for Beginners” series.

What is Beautiful Soup?

Beautiful Soup is a Python library for parsing and navigating HTML and XML documents, which makes it useful for web scraping.

In this tutorial, we will look in depth at how to use Beautiful Soup to parse web pages for scraping.

HTML Structure of a Web Page

To understand Beautiful Soup, you first need to understand the DOM, or simply the HTML structure of a web page.

The HTML of a web page is written in a tree-like structure. See the image below to understand it.

This is just a basic example to tell you about the tree structure of an HTML document.
Let me explain more.

In HTML, the root tag is html, and it has two child tags: head and body. The head and body tags have child tags of their own. For example, the head tag can have child tags such as meta, link, and title.
The body tag can have child tags such as div, p, ul, li, span, img, and the heading tags.
In this way, the whole HTML document forms a tree in which any tag can have multiple child tags.
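
To make the tree idea concrete, here is a minimal sketch that parses a tiny document and lists the direct children of a couple of tags. The HTML string and the variable names are made up for illustration; they are not part of the example used later in this tutorial.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only to illustrate the tree structure.
html_doc = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hi</h1><p>Text</p></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True, recursive=False) returns only the direct child tags.
html_children = [child.name for child in soup.html.find_all(True, recursive=False)]
body_children = [child.name for child in soup.body.find_all(True, recursive=False)]

print(html_children)  # ['head', 'body']
print(body_children)  # ['h1', 'p']

# Every tag also knows its parent, so you can walk up the tree as well.
title_parent = soup.title.parent.name
print(title_parent)  # head
```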

So, now I hope you understand the tree structure of HTML.

How does Beautiful Soup work?

The primary purpose of Beautiful Soup is to parse an HTML document so that you can select specific tags or elements within it and retrieve data from them.

You give Beautiful Soup a selector, and it uses that selector to find the matching tag in the HTML document.
Don’t worry if this is unclear; we will work through it in practice later.
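
As a quick preview, the parse, select, and retrieve steps might look like the sketch below. The one-line HTML string and the "greeting" class are assumptions chosen only to keep the example short.

```python
from bs4 import BeautifulSoup

# Made-up HTML, just enough to show the three steps.
html_doc = '<div><p class="greeting">Hello, world!</p></div>'

# Step 1: parse the HTML into a tree of tags.
soup = BeautifulSoup(html_doc, "html.parser")

# Step 2: select the tag you are interested in.
greeting_tag = soup.find("p", class_="greeting")

# Step 3: retrieve the data (here, the inner text) from the selected tag.
greeting_text = greeting_tag.get_text()
print(greeting_text)  # Hello, world!
```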

How can we select a tag from the HTML document?

There are several ways to select a tag. The most common are the following.

By CSS classes
By ID
By other tag attributes
By CSS selectors

Let’s see their details.

By CSS classes:

In HTML, CSS classes are a way to apply styling rules to one or more HTML elements. Classes are assigned with the class attribute inside an HTML tag, and an element can have one or more class names. You can select elements by their CSS class name.
Consider the following example:

First, see this HTML code:

<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
    <title>HTML Classes Example</title>
</head>
<body>
    <header>
        <h1 class="header-title">Welcome to Our Website</h1>
        <p class="header-description" id="slogan">Learn and Explore</p>
    </header>

    <nav>
        <ul data-test-id="nav-bar">
            <li><a href="#" class="nav-link">Home</a></li>
            <li><a href="#" class="nav-link">About</a></li>
            <li><a href="#" class="nav-link">Services</a></li>
            <li><a href="#" class="nav-link">Contact</a></li>
        </ul>
    </nav>

    <main>
        <section>
            <h2 class="section-title">About Us</h2>
            <p class="section-content">Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
        </section>

        <section>
            <h2 class="section-title">Our Services</h2>
            <ul>
                <li class="service-item">Service 1</li>
                <li class="service-item">Service 2</li>
                <li class="service-item">Service 3</li>
            </ul>
        </section>
    </main>

    <footer>
        <p class="footer-text">&copy; 2023 Our Website</p>
    </footer>
</body>
</html>

Suppose you want to get all the service names (Service 1, Service 2, and Service 3).
You can use their class name "service-item" to select those elements.

Here is how you can do it in Beautiful Soup:

from bs4 import BeautifulSoup

# Assuming you have the above HTML document in a variable called 'html_content'
soup = BeautifulSoup(html_content, 'html.parser')
# Find all the <li> elements with the class name "service-item"
service_items_tags = soup.find_all('li', class_="service-item")
for item in service_items_tags:
    service_name = item.get_text()  # get the inner text of the tag
    print(service_name)
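
Because an element can carry more than one class, it is worth seeing how class_ behaves in that case. The sketch below assumes a shortened, made-up version of the service list in which one item also has a hypothetical "featured" class; class_ matches an element if any one of its classes equals the name you pass.

```python
from bs4 import BeautifulSoup

# Made-up variant of the service list; the "featured" class is hypothetical.
html_doc = """
<ul>
  <li class="service-item featured">Service 1</li>
  <li class="service-item">Service 2</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Matches both items, because each has "service-item" among its classes.
all_items = soup.find_all("li", class_="service-item")
names = [item.get_text(strip=True) for item in all_items]

# Matches only the first item, which also has the "featured" class.
featured = soup.find_all("li", class_="featured")

print(names)          # ['Service 1', 'Service 2']
print(len(featured))  # 1
```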

Let’s see one more example:

If you want to get the text of the header title (Welcome to Our Website), you can do it this way:

from bs4 import BeautifulSoup
# Assuming you have the above HTML document in a variable called 'html_content'
soup = BeautifulSoup(html_content, 'html.parser')
# Find the h1 element with the class name "header-title"
header_title_tag = soup.find('h1', class_="header-title")
header_title = header_title_tag.get_text()  # get the inner text of the tag
print(header_title)
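
One thing to watch for: find() returns None when nothing matches, so calling get_text() on the result would raise an error. A minimal sketch of one way to guard against that is shown below; text_or_default is a hypothetical helper written for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html_doc = '<h1 class="header-title">Welcome to Our Website</h1>'
soup = BeautifulSoup(html_doc, "html.parser")


def text_or_default(tag, default=""):
    """Hypothetical helper: return the tag's text, or a default if tag is None."""
    return tag.get_text(strip=True) if tag is not None else default


# The first lookup matches; the second does not, so find() returns None.
title = text_or_default(soup.find("h1", class_="header-title"))
missing = text_or_default(soup.find("h1", class_="no-such-class"), default="N/A")

# find_all() never returns None; it returns an empty list when nothing matches.
no_items = soup.find_all("li", class_="service-item")

print(title)     # Welcome to Our Website
print(missing)   # N/A
print(no_items)  # []
```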

By ID:

In HTML, the id attribute is used to give a unique identifier to an individual HTML element on a web page. This identifier must be unique within the entire HTML document, meaning that no two elements can have the same id value within the same page. The id attribute is typically used for JavaScript, CSS, or other purposes where a unique reference to an element is required.

You can use the id to select a unique element within the HTML document.
For example, in the above HTML document, there is a <p> element with the id “slogan”. You can get the slogan text by using the id of this element:

from bs4 import BeautifulSoup
# Assuming you have the above HTML document in a variable called 'html_content'
soup = BeautifulSoup(html_content, 'html.parser')
# Find the p element with the id 'slogan'
slogan_element = soup.find("p", {"id": "slogan"})
slogan = slogan_element.get_text()  # get the inner text of the tag
print(slogan)
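
Beautiful Soup also accepts the id as a keyword argument, which some people find easier to read than the dictionary form. The sketch below uses only the slogan line from the sample document and checks that both forms find the same tag.

```python
from bs4 import BeautifulSoup

# Only the slogan element from the sample document, to keep the example short.
html_doc = '<p class="header-description" id="slogan">Learn and Explore</p>'
soup = BeautifulSoup(html_doc, "html.parser")

# Dictionary form, as used above.
by_dict = soup.find("p", {"id": "slogan"})

# Keyword form; the tag name is optional because an id should be unique.
by_keyword = soup.find(id="slogan")

same_tag = by_dict is by_keyword
slogan_text = by_keyword.get_text()

print(same_tag)     # True
print(slogan_text)  # Learn and Explore
```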

By other tag attributes:

Class and id are two of the tag attributes in an HTML document. You can also use tag attributes other than class and id to select elements from HTML content.

Again consider the above HTML document. In this document, there is a <ul> element having a “data-test-id” attribute. You can use this attribute to select this element.

from bs4 import BeautifulSoup
# Assuming you have the above HTML document in a variable called 'html_content'
soup = BeautifulSoup(html_content, 'html.parser')
# Find the ul element with data-test-id 'nav-bar'
# (variable names use underscores because hyphens are not valid in Python identifiers)
nav_list_element = soup.find("ul", {"data-test-id": "nav-bar"})
nav_list_element_text = nav_list_element.get_text()  # get the inner text of the tag
print(nav_list_element_text)
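
Besides selecting by an attribute, you will often want to read attribute values, such as the href of a link. The sketch below assumes a shortened navigation list with made-up href values ("/home" and "/about"); the sample document above uses "#" for every link.

```python
from bs4 import BeautifulSoup

# Shortened nav list; the href values are made up for illustration.
html_doc = """
<ul data-test-id="nav-bar">
  <li><a href="/home" class="nav-link">Home</a></li>
  <li><a href="/about" class="nav-link">About</a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A name like data-test-id cannot be a Python keyword argument (it contains
# hyphens), so it is passed through the attrs dictionary instead.
nav_list = soup.find("ul", attrs={"data-test-id": "nav-bar"})

# tag["name"] reads an attribute and raises KeyError if it is absent.
test_id = nav_list["data-test-id"]
links = [(a.get_text(), a["href"]) for a in nav_list.find_all("a")]

# tag.get("name") returns None instead of raising when the attribute is absent.
missing_id = nav_list.get("id")

print(test_id)     # nav-bar
print(links)       # [('Home', '/home'), ('About', '/about')]
print(missing_id)  # None
```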

By CSS selectors:

CSS selectors are a powerful and flexible way to target and select HTML elements based on their attributes, tag names, and relationships with other elements.

CSS selectors are versatile and can be used to select elements based on various criteria. You can select elements by tag name (div, p, a), class (.classname), ID (#idname), attribute ([attribute=value]), and more.

I personally like CSS selectors due to their flexibility.

In Beautiful Soup, you can use CSS selectors through the select() function, which returns all elements matching your criteria, or the select_one() function, which returns only the first matching element.
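
The difference between the two functions is easiest to see side by side. This is a minimal sketch using a shortened, made-up copy of the service list; ".missing" is a class that deliberately does not exist in it.

```python
from bs4 import BeautifulSoup

html_doc = """
<ul>
  <li class="service-item">Service 1</li>
  <li class="service-item">Service 2</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() always returns a list, even when there is one match or none.
items = soup.select(".service-item")
no_matches = soup.select(".missing")

# select_one() returns the first matching tag, or None if nothing matches.
first_item = soup.select_one(".service-item")
nothing = soup.select_one(".missing")

print(len(items))             # 2
print(no_matches)             # []
print(first_item.get_text())  # Service 1
print(nothing)                # None
```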

Let’s dive into examples:

from bs4 import BeautifulSoup
# Assuming you have the above HTML document in a variable called 'html_content'
soup = BeautifulSoup(html_content, 'html.parser')

Universal Selector (*): Select all elements on the page.

all_elements = soup.select('*')

Type Selector (e.g., p, div, a): Select elements by their HTML tag name.

paragraphs = soup.select('p')
divs = soup.select('div')
links = soup.select('a')

Class Selector (e.g., .classname): Selects elements with a specific class attribute.

header_titles = soup.select('.header-title')
nav_links = soup.select('.nav-link')
service_items = soup.select('.service-item')

ID Selector (e.g., #idname): Selects the element with a specific ID attribute. Because select() always returns a list, select_one() is used here to get the tag itself.

slogan = soup.select_one('#slogan')

Descendant Selector (e.g., div p): Selects elements that are descendants of a specified element, that is, elements nested anywhere inside it.

section_content = soup.select('section p')

# Select all elements with the class "section-content" that are descendants of <section> elements.
section_content = soup.select('section .section-content')

Child Selector (e.g., div > p): Selects elements that are direct children of a specified element.

main_sections = soup.select('main > section')
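
The difference between the descendant and child selectors can be subtle, so here is a small sketch that runs both against the same made-up snippet. The snippet is an assumption for illustration: it puts one paragraph directly inside main and another inside a nested section.

```python
from bs4 import BeautifulSoup

# Made-up snippet: one <p> directly under <main>, one nested in a <section>.
html_doc = """
<main>
  <p>Direct child of main</p>
  <section>
    <p>Nested inside a section</p>
  </section>
</main>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Descendant selector: any <p> anywhere inside <main>.
descendants = [p.get_text() for p in soup.select("main p")]

# Child selector: only <p> tags whose parent is <main> itself.
children = [p.get_text() for p in soup.select("main > p")]

print(descendants)  # ['Direct child of main', 'Nested inside a section']
print(children)     # ['Direct child of main']
```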

Adjacent Sibling Selector (e.g., h2 + p): Selects an element that immediately follows a specified sibling element.

# Select each <p> that comes directly after an <h2>
paragraphs_after_h2 = soup.select('h2 + p')

Attribute Selector (e.g., [attribute=value]): Selects elements with a specific attribute and value.

footer_text = soup.select('[class=footer-text]')
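
Note that [class=value] compares against the whole class attribute, while .classname matches any one of the element's classes. The sketch below shows the difference on a made-up footer; the second paragraph's "small" class and the example.com link are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Made-up footer; the "small" class and the link are hypothetical additions.
html_doc = """
<footer>
  <p class="footer-text">First line</p>
  <p class="footer-text small">Second line</p>
  <a href="https://example.com/terms">Terms</a>
</footer>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Exact match on the full attribute value: only the first <p> qualifies.
exact = soup.select('[class="footer-text"]')

# Class selector: both <p> tags have "footer-text" among their classes.
by_class = soup.select(".footer-text")

# ^= matches attribute values that start with the given text.
https_links = soup.select('a[href^="https://"]')

print(len(exact))        # 1
print(len(by_class))     # 2
print(len(https_links))  # 1
```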

You can use multiple selectors together to create more specific and complex queries for selecting elements:

Let’s see examples:

Select all <a> elements with the class “nav-link”.

nav_links = soup.select('a.nav-link')

Select the <p> element with the ID “slogan”. select_one() is used because only one element is expected.

slogan = soup.select_one('p#slogan')

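Select all elements with the class “footer-text” that also have the attribute id="copyright". (The sample document above has no element with this id, so the result is an empty list there.)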

copyright_elements = soup.select('[id=copyright].footer-text')

Select all <a> elements that are descendants of a <nav> element and have the class “nav-link”.

nav_links = soup.select('nav a.nav-link')
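
To check how combined selectors behave, the sketch below runs three of them against a trimmed copy of the sample document. Only the parts the selectors touch are kept, and the variable names are chosen for this example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document (header slogan, two nav links, footer).
html_doc = """
<header><p class="header-description" id="slogan">Learn and Explore</p></header>
<nav>
  <ul>
    <li><a href="#" class="nav-link">Home</a></li>
    <li><a href="#" class="nav-link">About</a></li>
  </ul>
</nav>
<footer><p class="footer-text">&copy; 2023 Our Website</p></footer>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Tag + class + descendant combined: <a class="nav-link"> inside <nav>.
nav_link_texts = [a.get_text() for a in soup.select("nav a.nav-link")]

# Tag + id combined; select_one() returns the tag itself.
slogan_tag = soup.select_one("p#slogan")

# No element here has id="copyright", so this returns an empty list.
copyright_elements = soup.select("[id=copyright].footer-text")

print(nav_link_texts)         # ['Home', 'About']
print(slogan_tag.get_text())  # Learn and Explore
print(copyright_elements)     # []
```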

Summary:

In this tutorial we have covered:

What is Beautiful Soup, and what is it used for?
What is the tree structure of HTML?
How can you select elements from a web page using Beautiful Soup and get data from them?
How to use different CSS selectors to select elements?
How to write complex queries for element selection?
