PUBLISHED ON: FEBRUARY 23, 2021
Decode HTML entities into Python String
In this article, we will learn to decode HTML entities into a Python string. We will use one built-in module and two third-party libraries.
HTML entities are short ASCII sequences, such as &pound; or &#163;, that stand in for special characters in HTML markup. Decoding them into a Python string makes the text more readable, even for a programmer who is not familiar with HTML. Each of the three methods below replaces the entities in a string with the special characters they represent.
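To make the idea of an entity concrete, the short sketch below looks up the entity name used in &pound; in the lookup tables of the standard library's html.entities module. It is only an illustration, and the variable names are chosen for this example.

```python
from html.entities import name2codepoint, html5

# "pound" is the entity name used in &pound;
code_point = name2codepoint["pound"]   # Unicode code point as an integer
character = chr(code_point)            # the character for that code point

# The html5 table maps names (with the trailing semicolon) straight to characters
same_character = html5["pound;"]

print(code_point, character, same_character)
```

Both tables lead to the same pound sign, which is the character a decoder substitutes for the entity.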
Example: Use HTML Parser to decode HTML Entities
This example imports Python's built-in html module. Its html.unescape() function converts the named and numeric character references in a string (for example &pound; or &#163;) to the corresponding Unicode characters and returns a Python string.
import html
print(html.unescape('&pound;682m'))
print(html.unescape('&copy; 2010'))
£682m
© 2010
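As a further sketch, html.unescape() accepts named, decimal, and hexadecimal references alike, and html.escape() performs the reverse conversion for the characters &, < and >. The sample strings below are made up for illustration.

```python
import html

# The same pound sign written as a named, decimal, and hexadecimal reference
named = html.unescape("&pound;682m")
decimal = html.unescape("&#163;682m")
hexadecimal = html.unescape("&#xA3;682m")
print(named, decimal, hexadecimal)

# html.escape() goes the other way for &, < and >
original = "<b>Tom & Jerry</b>"
escaped = html.escape(original)
restored = html.unescape(escaped)
print(escaped)
print(restored)

# unescape() makes a single pass, so double-encoded text is decoded one level
once = html.unescape("&amp;pound;")
twice = html.unescape(once)
print(once, twice)
```

If the input may have been escaped more than once, calling html.unescape() again on the result decodes the next level.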
Example: Use Beautiful Soup to decode HTML Entities
This example uses BeautifulSoup to decode HTML entities. The code is written for Beautiful Soup 4, which works with Python 3.x. The older Beautiful Soup 3 runs only on Python 2.x and needs the convertEntities argument passed to the BeautifulSoup constructor, whereas Beautiful Soup 4 decodes entities automatically. The string "html.parser" is passed along with the HTML text to select Python's built-in parser, which does not add extraneous HTML that wasn't part of the original string (i.e. <html> and <body>).
# Beautiful Soup 4
from bs4 import BeautifulSoup
print(BeautifulSoup("&pound;682m", "html.parser"))
£682m
Example: Use w3lib.html Library to decode HTML Entities
This method uses the w3lib.html module. To avoid a "ModuleNotFoundError", install w3lib with pip using the command given below. The module provides the replace_entities() function, which replaces the HTML entities in a string with the corresponding characters and returns a Python string.
pip install w3lib
from w3lib.html import replace_entities
print(replace_entities("&pound;682m"))
£682m
Conclusion
In this article, we learned to decode HTML entities into a Python string using three libraries: the built-in html module and the third-party packages w3lib.html and BeautifulSoup. We saw how the ASCII entity sequences are replaced with the characters they represent. If you get a "ModuleNotFoundError", check that the third-party packages are installed correctly.