Introduction to BeautifulSoup Module
In this tutorial we will learn how to use the BeautifulSoup module of Python to parse the source code of a webpage (which we can get using the requests module) and extract useful information from it, such as all the HTML table headings or all the links on the page.
BeautifulSoup can search for and return all occurrences of an HTML tag, provided we give it enough information about that tag.
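As a rough preview of what that looks like, the sketch below parses a small hard-coded HTML string and collects every table heading and every link using find_all, one of BeautifulSoup's search methods. The sample markup and the example.com URLs are assumptions made so the snippet runs offline; they are not the page used in this tutorial.

```python
import bs4

# Hypothetical sample page, hard-coded so the sketch runs without a network call.
SAMPLE_HTML = """
<html>
  <body>
    <table>
      <tr><th>Name</th><th>Price</th></tr>
      <tr><td>Tea</td><td>3</td></tr>
    </table>
    <a href="https://example.com/a">First</a>
    <a href="https://example.com/b">Second</a>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every occurrence of the given tag name, in document order.
headings = [th.text for th in soup.find_all("th")]
links = [a["href"] for a in soup.find_all("a")]

print(headings)
print(links)
```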
Before we jump into searching HTML tags and accessing information from a webpage, let's see how we can format the HTTP response content received to make it more readable.
BeautifulSoup: Prettify Content
The prettify method available in the BeautifulSoup module can be used to format the HTTP response received using the requests module.
Below is a code example, extending the example from the last tutorial:
## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")
## using 'prettify' method to print the content
print(soup.prettify())
In the code above we did the following:
- Imported the modules: requests, fake_useragent and bs4 (fake_useragent is imported but not used in this snippet).
- Fetched the response from a URL (you can use any URL you like).
- Created a BeautifulSoup object using the BeautifulSoup class.
- Printed the formatted response by calling the prettify method on the BeautifulSoup object.
If you are coming here after reading the previous tutorial, you have already seen what the response from a GET request made using the requests module looks like.
When we format that response using the prettify method, each tag and each string is placed on its own line and indented according to its nesting level, which makes the source much easier to read.
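To see the effect without making a network request, here is a minimal sketch that runs prettify on a one-line HTML string. The string is a made-up stand-in for a raw response body, not the actual Google page.

```python
import bs4

# Hypothetical one-line HTML, standing in for a raw HTTP response body.
raw_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

soup = bs4.BeautifulSoup(raw_html, "html.parser")

# prettify returns a string; it does not modify the soup object.
pretty_html = soup.prettify()

print("Lines before:", len(raw_html.splitlines()))
print("Lines after: ", len(pretty_html.splitlines()))
print(pretty_html)
```

The single input line is spread over many indented lines, which is what makes a large page source practical to scan by eye.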
Now that the response is formatted, let's learn how we can use BeautifulSoup to access various HTML tags and related information from the HTTP response (source code).
BeautifulSoup: Accessing HTML Tags
Using the BeautifulSoup module we can easily find and access the content of various HTML tags like head, title, div, p, h1 etc. Let's see a simple example where we will print the title tag of the webpage.
## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")
## getting 'title' tag from the google BeautifulSoup -> 'soup'
title_tag = soup.title
print(title_tag)
<title>Google</title>
We can also get only the text enclosed within the opening and closing title tag:
## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")
## getting only the text inside the 'title' tag from the BeautifulSoup object -> 'soup'
title_text = soup.title.text
print(title_text)
Google
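One thing to watch for: if a page has no title tag, soup.title is None, and soup.title.text then raises an AttributeError. The sketch below shows one possible way to guard against that. The hard-coded HTML strings and the helper name get_title_text are hypothetical and exist only for illustration.

```python
import bs4


def get_title_text(html):
    """Return the page title text, or None when the page has no <title> tag."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    title_tag = soup.title  # None if the tag is missing
    if title_tag is None:
        return None
    return title_tag.text


# Hypothetical pages: one with a title tag, one without.
with_title = "<html><head><title>Google</title></head><body></body></html>"
without_title = "<html><head></head><body><p>No title here</p></body></html>"

print(get_title_text(with_title))
print(get_title_text(without_title))
```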
Attribute-style access works the same way for every HTML tag. For example, to get the head tag we can use soup.head, like this:
## import modules
import requests
from fake_useragent import UserAgent
## importing the beautifulsoup module
import bs4
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## creating BeautifulSoup object
soup = bs4.BeautifulSoup(response.content, "html.parser")
## getting 'head' tag from the google BeautifulSoup -> 'soup'
print(soup.head)
This will return the complete head tag from the page's source code.
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"/>
<title>Google</title>
<script nonce="GWwjLi7M0YGkyNTLDmVPsQ==">
...
<style> ... </style>
...
</head>
We have not included the complete output as it is huge. But as you can see, the title tag is inside the head tag, and there is a style tag in there too.
We can also get the title tag content via the head tag:
## getting the text of the 'title' tag via the 'head' tag
print(soup.head.title.text)
Google
This shows that, because BeautifulSoup parses the HTML code into a tree, we can also access tags by following their hierarchy.
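The same idea works in both directions. The following sketch, which uses a small hard-coded HTML string as a stand-in for a real page, walks down the tree with chained attributes and back up with the parent attribute.

```python
import bs4

# Hypothetical page, hard-coded so the sketch runs offline.
html = (
    "<html><head><title>Google</title><style>p{color:red}</style></head>"
    "<body><p>Hi</p></body></html>"
)
soup = bs4.BeautifulSoup(html, "html.parser")

# Walk down the tree: html -> head -> title.
title_tag = soup.head.title
print(title_tag.name)

# Walk back up: the parent of <title> is <head>.
print(title_tag.parent.name)

# Chained access works at any depth.
print(soup.html.body.p.text)

# Both routes reach the same tag object in the tree.
print(soup.head.title is soup.title)
```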
Similarly, let's access the style tag:
## getting the text of the 'style' tag via the 'head' tag
print(soup.head.style.text)
#gbar,#guser{font-size:13px;padding-top:1px !important;}#gbar{height:22px}#guser{padding-bottom:7px !important;text-align:right}.gbh,.gbd{border-top:1px solid #c9d7f1;font-size:1px}.gbh{height:0;position:absolute;top:24px;width:100%}@media all{.gb1{height:22px;margin-right:.5em;vertical-align:top}#gbar{float:left}}a.gb1,a.gb4{text-decoration:underline !important}a.gb1,a.gb4{color:#00c !important}.gbi .gb4{color:#dd8e27 !important}.gbf .gb4{color:#900 !important}
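Note that attribute-style access such as soup.head.style returns only the first matching tag. The sketch below uses a hard-coded HTML string with two style tags (an assumption for illustration, with CSS shortened from the output above) to contrast that shortcut with find_all. It reads the CSS through the string attribute, which returns a tag's single text child.

```python
import bs4

# Hypothetical page with two <style> tags inside <head>.
html = """
<html>
  <head>
    <style>#gbar{height:22px}</style>
    <style>.gbh{height:0}</style>
  </head>
  <body></body>
</html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute access gives only the first <style> tag.
first_style = soup.head.style

# find_all gives every <style> tag under <head>.
all_styles = soup.head.find_all("style")

print(first_style.string)
print(len(all_styles))
print([style.string for style in all_styles])
```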
Up until now we have covered basic HTML parsing and accessing the tags. In the next tutorial we will see some more methods of the BeautifulSoup module and some more ways of navigating through the HTML source code of any webpage to collect useful data.