Introduction to Web Scraping
The Internet is an ocean of information spread across countless websites, where it is categorized, interlinked, and mostly freely available to everyone.
For example, if you want to know the market price of a particular product, you can visit the market in person and ask shopkeepers, or you can search for the product on online stores such as Amazon or eBay. But what if you want a list of all the products in a particular category within a certain price range? You will obviously prefer online stores, because the shopkeepers near your home are unlikely to hand you a list of all their products.
But getting such data from the Internet is not always easy. In such situations, we use programs to fetch pages from a website and parse them. This technique of extracting large amounts of data from websites by parsing their HTML code is known as Web Scraping.
And we are here to learn how to do it. So without further delay, let's begin.
Prerequisites
To follow this tutorial, you should know a few things beforehand:
- Python programming language. If you don't know Python, take the Studytonight Python course.
- Basics of HTML. Knowing HTML will boost your progress. You can start learning HTML on Studytonight itself.
Considering that you know Python and HTML, let's set up all the tools/packages that are required for Web Scraping work.
If you don't have Python installed on your PC, go to Python's official website and download it. Once downloaded, follow the tutorial Getting started with Python for a stepwise guide.
We also need the bs4 and requests modules to work on web scraping. Both are third-party packages that do not ship with Python, so we are now going to install them.
Installing bs4 module
To install the bs4 module, run the following command in the command line:
pip install bs4
After installing, check whether it is installed or not by running the following code.
import bs4
If the above statement executes without any errors, then you have successfully installed the bs4 module on your computer.
Installing requests module
Run the following command in the command line to install the requests module.
pip install requests
Again, after installing, check whether it is installed or not by running the following code.
import requests
If the above statement executes without any errors, then you have successfully installed the requests module on your computer.
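If you would rather check both installations in one go, the two import checks above can be sketched as a small script. This is only a minimal sketch: the helper names module_available and report are made up for this example, and it relies on the standard library's importlib.util.find_spec to ask Python whether a module can be located without actually importing it.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if Python can locate a top-level module called `name`."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(name) is not None


def report(modules=("bs4", "requests")) -> dict:
    """Map each module name to whether it can be found (hypothetical helper)."""
    return {name: module_available(name) for name in modules}


if __name__ == "__main__":
    for name, found in report().items():
        if found:
            print(f"{name}: installed")
        else:
            print(f"{name}: missing, try: pip install {name}")
```

Any module reported as missing can be installed with the matching pip command shown earlier.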
And with this, we are all set for web scraping.
Rules for Scraping
Following are some of the things that you should keep in mind while scraping data from any website:
- Go through the Terms & Conditions of the website you want to scrape. Some websites do not allow scraped data to be used commercially, while others do, so we recommend reading the terms and conditions first.
- Do not request data from the website too aggressively while running your web scraping program, because that might slow down the website.
- Once written, your script might stop working if the website changes its interface, so check for changes in the website's layout before running your web scraping script.
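One simple way to avoid requesting pages too aggressively is to enforce a pause between consecutive requests. The sketch below is one possible approach, not a standard recipe: the Throttle class, its parameters, and the two-second interval in the usage comment are all assumptions made for illustration. How long to wait depends on the website, so treat the number as a placeholder.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to keep calls min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Hypothetical usage (the URLs are placeholders, not real pages):
# throttle = Throttle(min_interval=2.0)
# for url in ["https://example.com/page1", "https://example.com/page2"]:
#     throttle.wait()
#     response = requests.get(url)
```

The first call to wait() returns immediately, and every later call sleeps only for whatever part of the interval has not already passed.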
Inspecting any Webpage
This is always the first step of the web scraping process, or we can say that this is step 0.
We can inspect the user interface and the related HTML code of any website using browser tools such as Chrome's developer tools.
For example, suppose you want to get all the topic names from the left sidebar of a tutorial on Studytonight's website. If you are using the Chrome browser on Windows, press the F12 key to open the developer tools (on macOS, press Command + Option + I).
Then click on the button in the top-left corner, as shown in the picture above. Once you click that button, all you have to do is hover your mouse pointer over any element on the webpage, and you will see its HTML code in the developer tools' Elements view.
This is how we initially search and find the elements to start with web scraping.
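Once inspection has told you which element holds the data, the next step is to target that element in code. The sketch below illustrates the idea on a made-up HTML snippet: the id "sidebar", the link texts, and the names SidebarLinks and topic_names are all assumptions for this example and are not Studytonight's real markup. To keep it runnable without any extra setup, it uses the standard library's html.parser as a stand-in rather than bs4.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, similar to what the Elements view might show.
SAMPLE_HTML = """
<div id="sidebar">
  <ul>
    <li><a href="/intro">Introduction</a></li>
    <li><a href="/install">Installing Modules</a></li>
    <li><div class="item"><a href="/rules">Rules for Scraping</a></div></li>
  </ul>
</div>
<div id="content"><a href="/elsewhere">Not a topic</a></div>
"""


class SidebarLinks(HTMLParser):
    """Collect the text of every <a> inside the <div> with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self._div_depth = 0    # greater than 0 while inside the target <div>
        self._in_link = False  # True while inside an <a> we care about
        self.topics = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1  # nested <div> inside the target
            elif dict(attrs).get("id") == self.container_id:
                self._div_depth = 1   # entered the target <div>
        elif tag == "a" and self._div_depth > 0:
            self._in_link = True
            self.topics.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.topics[-1] += data


def topic_names(html, container_id="sidebar"):
    """Return the link texts found inside the element with `container_id`."""
    parser = SidebarLinks(container_id)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.topics if text.strip()]


if __name__ == "__main__":
    print(topic_names(SAMPLE_HTML))
```

The link inside the "content" block is skipped because it sits outside the container we identified, which is exactly the kind of filtering that inspecting the page first makes possible.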