
Introduction to Web Scraping

The Internet is an ocean of information spread across various websites, where it is categorized, interlinked and mostly freely available to everyone.

For example, if you want to know the market price of a particular product, you can go to the market and ask shopkeepers, or you can search for the product on online stores like Amazon, eBay, etc. But what if you want a list of all the products of a particular category in a certain price range? You will obviously prefer online stores, because the shopkeepers around your home will not hand you a list of all their products.

But getting such data from the internet is not always easy. In such situations, we use programs to fetch and parse data from a website. This technique of extracting large amounts of data from websites by parsing their HTML code is known as Web Scraping.

And we are here to learn how to do it. So without further delay, let's begin.


Prerequisites

To follow this tutorial well, you should know a few things beforehand:

  1. The Python programming language. If you don't know Python, take the Studytonight Python course.
  2. Basics of HTML. Knowing HTML will boost your progress. You can start learning HTML on Studytonight itself.

Assuming that you know Python and HTML, let's set up the tools and packages required for web scraping.

If you don't have Python installed on your PC, go to Python's official website and download it. Once it is downloaded, follow the tutorial Getting started with Python for a stepwise guide.

We also need the bs4 and requests modules for web scraping. These are third-party packages, not part of Python's standard library, so let's install them now.


Installing bs4 module

To install the bs4 module, run the following command in the command line:

pip install bs4

After installing, check that the installation worked by running the following code in Python.

import bs4

If the above statement executes without any errors, then you have successfully installed the bs4 module on your computer.


Installing requests module

Run the following command in the command line to install the requests module.

pip install requests

Again, after installing, check that the installation worked by running the following code in Python.

import requests

If the above statement executes without any errors, then you have successfully installed the requests module on your computer.
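The two import checks above can also be combined into one small script. This is only a sketch: the helper names is_installed and missing_modules are made up for this tutorial, and the script uses Python's standard importlib module to look for each package without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None


def missing_modules(required=("bs4", "requests")):
    """Return the names from `required` that Python cannot find."""
    return [name for name in required if not is_installed(name)]


# An empty list means both modules were found.
print("Missing modules:", missing_modules())
```

If the printed list is not empty, run the pip install command for each module named in it.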

And with this, we are all set for web scraping.


Rules for Scraping

Here are some things to keep in mind while scraping data from any website:

  1. Go through the Terms & Conditions of the website from which you want to scrape data. Some websites do not allow scraped data to be used commercially, while others do, so we recommend reading the terms and conditions first.
  2. Do not request data from the website too aggressively while running your web scraping program, because that might slow down the website.
  3. Once written, your script might stop working if the website changes its interface, so check for changes in the website's layout before running your web scraping script.
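The second rule can be sketched in code as a pause between requests. The example below also checks a robots.txt file, which is not covered in the rules above but is a common convention websites use to tell programs which paths to avoid. Everything named here is an assumption for illustration: the robots.txt content, the example.com URLs, the user agent name and the helper functions are all hypothetical. The parsing itself uses urllib.robotparser from Python's standard library, and no network request is made.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "tutorial-scraper"  # hypothetical name for our script


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without a network call."""
    rules = robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def pause_seconds(rules, default=1.0):
    """Use the site's Crawl-delay if it declares one, else a default pause."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl_politely(urls, rules, fetch, pause):
    """Call fetch(url) for each permitted URL, sleeping between requests."""
    results = {}
    for url in urls:
        if not rules.can_fetch(USER_AGENT, url):
            continue  # the site asked programs to stay away from this path
        results[url] = fetch(url)
        time.sleep(pause)  # rule 2: do not hit the website too aggressively
    return results
```

In a real script, fetch could be a small function that calls requests.get(url) and returns the response text, and pause would come from pause_seconds(rules).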

Inspecting any Webpage

This is always the first step of the web scraping process; we can call it step 0.

We can inspect the user interface and the related HTML code of any website using browser tools such as Chrome's developer tools.

For example, suppose you want to get all the topic names from the left sidebar of a tutorial on Studytonight's website. If you are using the Chrome browser on Windows, press the F12 key to open the developer tools (on macOS, press Command + Option + I).

Inspect element for web scraping

Then click on the button in the top-left corner of the developer tools, as shown in the picture above. Once you have clicked that button, hover your mouse pointer over any element of the webpage and you will see its HTML code in the developer tools' Elements view.

Inspect element for web scraping

This is how we initially search and find the elements to start with web scraping.
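Once you have spotted the tag and class of the elements you want, the same selection can be expressed in code. The sketch below is only an illustration: the HTML snippet, the class name "topic" and the topic titles are invented, not copied from Studytonight's pages. It also uses Python's built-in html.parser module instead of bs4, purely so that it runs without any installed packages.

```python
from html.parser import HTMLParser

# Hypothetical sidebar markup, similar to what the Elements view might show.
SAMPLE_HTML = """
<div class="sidebar">
  <ul>
    <li><a class="topic" href="/intro">Introduction to Web Scraping</a></li>
    <li><a class="topic" href="/setup">Installing the Modules</a></li>
  </ul>
</div>
<a href="/login">Sign In</a>
"""


class TopicCollector(HTMLParser):
    """Collect the text of every <a> tag that has the class "topic"."""

    def __init__(self):
        super().__init__()
        self.topics = []
        self._in_topic = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "topic" in classes:
            self._in_topic = True
            self.topics.append("")  # start a new, empty topic name

    def handle_data(self, data):
        if self._in_topic:
            self.topics[-1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_topic:
            self._in_topic = False
            self.topics[-1] = self.topics[-1].strip()


def extract_topics(html):
    """Return the topic names found in the given HTML string."""
    collector = TopicCollector()
    collector.feed(html)
    collector.close()
    return collector.topics


print(extract_topics(SAMPLE_HTML))
```

The "Sign In" link is skipped because it does not have the class we looked for, which is exactly the kind of detail that inspecting the page first helps you find.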