Exploring BeautifulSoup Methods
In this tutorial we will learn different ways to access HTML tags using the methods of the BeautifulSoup module. For a basic introduction to the BeautifulSoup module, start from the previous tutorial.
BeautifulSoup: Accessing HTML Tags
The methods covered in this section traverse HTML tags by treating the HTML document as a tree. Note that most of them, such as contents and parent, are accessed like attributes, without parentheses.
Create a file sample_webpage.html and copy the following HTML code in it:
<!DOCTYPE html>
<html>
<head>
<title> Sample HTML Page</title>
<style>
* {
margin: 0;
padding: 0;
}
div {
width: 95%;
height: 75px;
margin: 10px 2.5%;
border: 1px dotted grey;
text-align: center;
}
p {
font-family: sans-serif;
font-size: 18px;
color: #000;
line-height: 75px;
}
a {
position: relative;
top: 25px;
}
</style>
</head>
<body>
<div id="first-div">
<p class="first">First Paragraph</p>
</div>
<div id="second-div">
<p class="second">Second Paragraph</p>
</div>
<div id="third-div">
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
</div>
<div id="fourth-div">
<p class="fourth">Fourth Paragraph</p>
</div>
<div id="fifth-div">
<p class="fifth">Fifth Paragraph</p>
</div>
</body>
</html>
Now, to read the content of the above HTML file, use the following Python code to store the content in a variable:
## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
Now we will use different methods of the BeautifulSoup module and see how they work.
For a warm-up, let's start with the prettify method.
import bs4
## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")
## prettify is a method, so it must be called with parentheses
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title> Sample HTML Page</title>
<style>
* {
margin: 0;
padding: 0;
}
div {
width: 95%;
height: 75px;
margin: 10px 2.5%;
border: 1px dotted grey;
text-align: center;
}
p {
font-family: sans-serif;
font-size: 18px;
color: #000;
line-height: 75px;
}
a {
position: relative;
top: 25px;
}
</style>
</head>
<body>
<div id="first-div">
<p class="first">First Paragraph</p>
</div>
<div id="second-div">
<p class="second">Second Paragraph</p>
</div>
<div id="third-div">
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
</div>
<div id="fourth-div">
<p class="fourth">Fourth Paragraph</p>
</div>
<div id="fifth-div">
<p class="fifth">Fifth Paragraph</p>
</div>
</body>
</html>
BeautifulSoup: Accessing HTML Tag Attributes
We can retrieve the attributes of any HTML tag using the following syntax:
TagName["AttributeName"]
Let's extract the href attribute from the anchor tag in our HTML code.
import bs4
## reading content from the file
with open("sample_webpage.html") as html_file:
    html = html_file.read()
## creating a BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")
## getting anchor tag
link = soup.a
## printing the 'href' attribute of anchor tag
print(link["href"])
https://www.studytonight.com
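Square-bracket access raises a KeyError when the requested attribute is missing. As a minimal sketch, assuming a small inline HTML string in place of sample_webpage.html, the snippet below shows two other ways to read attributes: the attrs dictionary and the get method. The variable names are illustrative only.

```python
import bs4

# Inline HTML used for illustration instead of sample_webpage.html
html = '<div id="third-div"><a href="https://www.studytonight.com">Studytonight</a></div>'
soup = bs4.BeautifulSoup(html, "html.parser")
link = soup.a

# attrs holds every attribute of the tag as a dictionary
all_attrs = link.attrs

# get() returns None, or a supplied default, when the attribute is absent
href = link.get("href")
missing = link.get("title")
fallback = link.get("title", "no title")

print(all_attrs)
print(href, missing, fallback)
```

Using get is handy when scraping pages where some anchor tags may lack the attribute you are looking for.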
BeautifulSoup: contents method
The contents method returns a list of all the direct children of a tag, including text nodes such as the line breaks between tags. Let's list all the child HTML tags of the body tag using the contents method.
body = soup.body
## getting all the children of 'body' using 'contents'
content_list = body.contents
## printing all the children using for loop
for tag in content_list:
    ## skipping the line breaks between the tags
    if tag != "\n":
        print(tag)
        print("\n")
<div id="first-div">
<p class="first">First Paragraph</p>
</div>
<div id="second-div">
<p class="second">Second Paragraph</p>
</div>
<div id="third-div">
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
</div>
<div id="fourth-div">
<p class="fourth">Fourth Paragraph</p>
</div>
<div id="fifth-div">
<p class="fifth">Fifth Paragraph</p>
</div>
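Comparing each child with "\n" only skips children that are exactly a single line break. A sketch of a slightly more general filter, assuming an inline HTML string for brevity, keeps only the children that are Tag objects:

```python
import bs4
from bs4.element import Tag

# Inline HTML used for illustration instead of sample_webpage.html
html = (
    '<body>\n'
    '<div id="first-div"><p>First Paragraph</p></div>\n'
    '<div id="second-div"><p>Second Paragraph</p></div>\n'
    '</body>'
)
soup = bs4.BeautifulSoup(html, "html.parser")

# contents includes the line breaks between the div tags
children = soup.body.contents

# keep only real tags, dropping every text node
tags_only = [child for child in children if isinstance(child, Tag)]
tag_ids = [tag["id"] for tag in tags_only]

print(len(children))
print(tag_ids)
```

The isinstance check also drops text nodes that contain spaces or tabs, which a plain comparison with "\n" would let through.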
BeautifulSoup: children method
The children method is similar to the contents method, but it returns an iterator over the direct children, while the contents method returns them as a list. Let's see an example:
body = soup.body
## we can also convert iterator into list using the 'list(iterator)'
for tag in body.children:
    if tag != "\n":
        print(tag)
        print("\n")
<div id="first-div">
<p class="first">First Paragraph</p>
</div>
<div id="second-div">
<p class="second">Second Paragraph</p>
</div>
<div id="third-div">
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
</div>
<div id="fourth-div">
<p class="fourth">Fourth Paragraph</p>
</div>
<div id="fifth-div">
<p class="fifth">Fifth Paragraph</p>
</div>
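The practical difference between the two shows up when you index or count the children. The following sketch, which assumes a tiny inline HTML string rather than the sample file, illustrates it:

```python
import bs4

# Inline HTML used for illustration instead of sample_webpage.html
html = '<body><div id="first-div"></div><div id="second-div"></div></body>'
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

# contents is a list, so it supports len() and indexing
count = len(body.contents)
first_id = body.contents[0]["id"]

# children is an iterator: consume it with a loop, next(), or list()
children = body.children
first_child = next(children)
remaining = list(children)

print(count, first_id)
print(first_child["id"], [tag["id"] for tag in remaining])
```

Once an iterator has been consumed it is exhausted, so ask for body.children again if you need a second pass.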
BeautifulSoup: descendants method
The descendants method retrieves all the tags and text nested inside a tag, not just its direct children. This is what sets it apart from the contents and children methods, which stop at the first level. For example, if we use it on the body tag, it yields the first div tag, then the child of that div tag, then that child's own children, and so on until nothing is left. Only then does it move on to the next div tag.
This method returns a generator. Let's see an example:
body = soup.body
## getting all nested tags and text of 'body' using 'descendants'
for tag in body.descendants:
    if tag != "\n":
        print(tag)
        print("\n")
<div id="first-div">
<p class="first">First Paragraph</p>
</div>
<p class="first">First Paragraph</p>
First Paragraph
<div id="second-div">
<p class="second">Second Paragraph</p>
</div>
<p class="second">Second Paragraph</p>
Second Paragraph
<div id="third-div">
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
</div>
<a href="https://www.studytonight.com">Studytonight</a>
Studytonight
<p class="third">Third Paragraph</p>
Third Paragraph
<div id="fourth-div">
<p class="fourth">Fourth Paragraph</p>
</div>
<p class="fourth">Fourth Paragraph</p>
Fourth Paragraph
<div id="fifth-div">
<p class="fifth">Fifth Paragraph</p>
</div>
<p class="fifth">Fifth Paragraph</p>
Fifth Paragraph
As you can see in the output above, the descendants method keeps going deeper into each tag it reads until it reaches the end, and then it moves on to the next HTML tag.
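To make the depth-first order explicit, here is a small sketch, assuming an inline HTML string without line breaks, that lists the direct children and the descendants of the same tag side by side:

```python
import bs4
from bs4.element import Tag

# Inline HTML used for illustration instead of sample_webpage.html
html = (
    '<body>'
    '<div id="first-div"><p>First Paragraph</p></div>'
    '<div id="second-div"><p>Second Paragraph</p></div>'
    '</body>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
body = soup.body

# direct children only: the two div tags
direct = [child.name for child in body.contents]

# every nested node: tag names for tags, the text itself for text nodes
nested = [
    node.name if isinstance(node, Tag) else str(node)
    for node in body.descendants
]

print(direct)
print(nested)
```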
BeautifulSoup: parent method
The parent method is used to get the parent tag of a child tag. Let's see an example:
body = soup.body
## getting parent of 'body'
body_parent = body.parent
## 'name' gives the name of a tag
## printing the name of the parent using 'name'
print(body_parent.name)
html
BeautifulSoup: parents method
The parents method is used to get all the ancestors of a tag: its parent, the parent of that parent, and so on up to the document itself. It returns a generator. Let's see an example:
body = soup.body
## getting parents of 'body'
body_parents = body.parents
## printing the name of every ancestor, from the closest one up to the document
for parent in body_parents:
    print(parent.name)
    print("\n")
html
[document]
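A tag that sits deeper in the tree has a longer chain of ancestors. As a sketch, assuming an inline HTML string that mirrors the third div of the sample page, the ancestor names of the anchor tag can be joined into a path from the document root:

```python
import bs4

# Inline HTML used for illustration instead of sample_webpage.html
html = (
    '<html><body><div id="third-div">'
    '<a href="https://www.studytonight.com">Studytonight</a>'
    '</div></body></html>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
anchor = soup.a

# parents yields the closest ancestor first and the document last
ancestors = [parent.name for parent in anchor.parents]

# reverse the list to read the path from the root down to the tag
path = " > ".join(list(reversed(ancestors)) + [anchor.name])

print(ancestors)
print(path)
```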
BeautifulSoup: next_sibling method
The next_sibling method is used to get the element that comes right after the specified tag under the same parent. Now let's print the sibling tag of the anchor tag in our HTML code:
anchor_tag = soup.a
print(anchor_tag)
## getting the third paragraph using the anchor tag
## 'next_sibling' is used twice because there is a line break
## between the two tags: anchor_tag.next_sibling gives the line break,
## and the sibling after the line break is the third paragraph
third_para = anchor_tag.next_sibling.next_sibling
print(third_para)
<a href="https://www.studytonight.com">Studytonight</a>
<p class="third">Third Paragraph</p>
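Chaining next_sibling twice works only when exactly one text node sits between the two tags. A hedged alternative, sketched here with an inline HTML string, is the find_next_sibling method, which searches the following siblings for the first matching tag:

```python
import bs4

# Inline HTML used for illustration instead of sample_webpage.html
html = (
    '<div id="third-div">\n'
    '<a href="https://www.studytonight.com">Studytonight</a>\n'
    '<p class="third">Third Paragraph</p>\n'
    '</div>'
)
soup = bs4.BeautifulSoup(html, "html.parser")
anchor = soup.a

# the immediate next sibling is the line break, not the paragraph
immediate = anchor.next_sibling

# find_next_sibling skips ahead to the first following 'p' tag
third_para = anchor.find_next_sibling("p")

print(repr(str(immediate)))
print(third_para)
```

This version keeps working if the whitespace between the tags changes or disappears.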
BeautifulSoup: previous_sibling method
The previous_sibling method is similar to the next_sibling method, but it returns the element that comes right before the specified tag instead of the one after it. Let's see an example (this continues the code snippet above):
## getting anchor tag from the third_para
print(third_para.previous_sibling.previous_sibling)
<a href="https://www.studytonight.com">Studytonight</a>
BeautifulSoup: next_siblings method
The next_siblings method returns a generator that yields all the siblings that follow a tag, including text nodes such as line breaks. Let's see an example (this continues the code snippet above):
## using anchor_tag variable here
a_siblings = anchor_tag.next_siblings
print(list(a_siblings))
['\n', <p class="third">Third Paragraph</p>, '\n']
BeautifulSoup: previous_siblings method
The previous_siblings method returns a generator that yields all the siblings that come before a tag, starting with the closest one. Let's see an example (this continues the code snippet above):
## using third_para variable here
p_siblings = third_para.previous_siblings
print(list(p_siblings))
['\n', <a href="https://www.studytonight.com">Studytonight</a>, '\n']
Now you are familiar with most of the methods used to navigate the HTML tree while web scraping. In the following tutorial, we will learn how to find a specific tag from a bunch of similar tags.