What is Web Scraping?
Web scraping is a technique for extracting information from websites. It mostly focuses on transforming unstructured data on the web (HTML) into structured data.
We will use a Python library named BeautifulSoup for this purpose. Do not worry about the details right now; we will have program examples in the next tutorial. Our web scraper program will use this library to parse a website's HTML and extract the data. We will also use the Requests library to open the URL, download the HTML and pass it to BeautifulSoup.
For a detailed walkthrough, visit our Web Scraping with Python using BeautifulSoup tutorial.
NOTE: Many websites do not allow web scraping, and it might get you into legal trouble. Hence, we advise you to use this only for learning purposes and not to steal or copy data from websites.
Installing the modules
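To install the required Python modules, follow the instructions below. Note that the current BeautifulSoup release is published on PyPI as beautifulsoup4; the older package named BeautifulSoup is version 3 and does not work with Python 3.
- For Linux/macOS users:
$ sudo pip install beautifulsoup4
$ sudo pip install requests
- For Windows users:
$ pip install beautifulsoup4
$ pip install requests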
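To confirm that the installation worked, a short check along the following lines may help. It is only a sketch: the helper name check_modules is our own, and it assumes the current BeautifulSoup release (beautifulsoup4), which is imported under the name bs4, and Requests, which is imported as requests.

```python
import importlib.util

# Import names to look for. Assumption: BeautifulSoup 4 is imported as "bs4"
# and the Requests library as "requests".
REQUIRED_MODULES = ["bs4", "requests"]


def check_modules(names):
    """Return a dict mapping each import name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print one status line per module without actually importing it.
    for name, available in check_modules(REQUIRED_MODULES).items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

If a module is reported as missing, re-run the matching pip command above and make sure pip belongs to the same Python interpreter you use to run your scripts.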
Basics of Web Scraping
If you have ever visited a website and looked at its source code (Right Click → View Page Source), you must have seen a lot of cluttered, hard-to-read markup there. Unless we can turn it into something understandable and well structured, it is of little use. So, to scrape data from a website (say we want to get the prices of all the products on a particular page of an e-commerce website), we first need to uniquely identify the HTML tags that hold the data on the page. The question is how?
If you know the basics of HTML (click on the link to learn HTML using our Interactive Course), you will already be familiar with HTML tags and attributes. This is the trick: we use HTML tags, attributes, or both to uniquely identify any piece of data on a website. Let's see an example.
To uniquely identify the tag that holds the price on the page:
- Right click on the price displayed.
- Click on Inspect Element.
Here you can see that the price 449.00 can be identified uniquely by:
<span id="priceblock_saleprice" class="a-size-medium a-color-price">
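As a rough illustration of how such a tag can be located in code, the sketch below parses a small HTML fragment with BeautifulSoup and looks up the span by its id. The fragment in SAMPLE_HTML and the helper name extract_price are our own assumptions for the example; a real product page will contain far more markup, and its ids may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment modelled on the span shown above.
# A real page would be downloaded first (for example with Requests).
SAMPLE_HTML = """
<div id="price">
  <span id="priceblock_saleprice" class="a-size-medium a-color-price">449.00</span>
</div>
"""


def extract_price(html):
    """Return the text of the span with id 'priceblock_saleprice', or None."""
    # "html.parser" is the parser bundled with Python's standard library.
    soup = BeautifulSoup(html, "html.parser")
    # The tag name plus the id attribute identify the element uniquely.
    tag = soup.find("span", id="priceblock_saleprice")
    if tag is None:
        return None
    # strip=True removes surrounding whitespace from the extracted text.
    return tag.get_text(strip=True)


if __name__ == "__main__":
    print(extract_price(SAMPLE_HTML))
```

Returning None when the tag is absent lets the caller decide how to handle pages whose layout does not match, instead of failing with an attribute error.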
Now, let's say we want to collect this data and compare it with, or store it alongside, data gathered from other websites. This is where scraping comes into play: it can be used to mine data from multiple websites. The technique is used widely nowadays. Many websites, such as Trivago, which compares prices for the same product across different platforms, use the same technique.