Scraping Product Names from ConsumerReports Website

NOTE: This tutorial is for educational purposes only. We request that readers not use this code to harm the website in any way.

In this tutorial we will learn how to scrape data from a real website. The website we will get the data from is the ConsumerReports website. We will request the page at the URL used in the code below and then collect the list of product names from it.

Let the scraping begin...

## importing bs4, requests and fake_useragent modules
import bs4
import requests
from fake_useragent import UserAgent

## initializing the UserAgent object
user_agent = UserAgent()
url = "https://www.consumerreports.org/cro/a-to-z-index/products/index.htm"

## getting the response from the page using get method of requests module
page = requests.get(url, headers={"user-agent": user_agent.chrome})

## storing the content of the page in a variable
html = page.content

By this step, we have the complete source code of the webpage stored in our variable html. Now let's create a BeautifulSoup object. You can also try running its prettify method to view the HTML in an indented, readable form.

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")
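To get a feel for what prettify does without hitting the website, here is a minimal sketch that runs it on a tiny hand-written HTML string. The sample markup and the names sample_html, sample_soup and pretty are assumptions made for illustration; the real page is much larger.

```python
import bs4

## a tiny hand-written HTML sample (an assumption, not the real page source)
sample_html = (
    '<div class="crux-body-copy">'
    '<a href="https://www.consumerreports.org/cro/air-conditioners.htm">'
    "Air conditioners</a></div>"
)

## parsing the sample with the same parser used in the tutorial
sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")

## prettify() returns the parsed document as an indented string
pretty = sample_soup.prettify()
print(pretty)
```

Printing the prettified string is only a reading aid; the soup object itself is what we search in the next steps.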

We have created the BeautifulSoup object, now what? How do we know which tag to find and extract from the HTML code? Should we read through the whole HTML source to find it? No way!

Remember that in the first tutorial of this series, when we introduced the term web scraping, we shared a technique for using the Chrome browser's Developer Tools to find the HTML code for any webpage element. (Other browsers, such as Firefox, have their own developer tools, which can also be used.)

Open the Developer Tools (in the Chrome browser) by pressing the F12 key if you are using Windows, or Option + Command + I if you are a Mac user.

Click on the top-left corner button:

And then hover your mouse cursor on the Product list entries to find their HTML tags:

As we can see, each anchor tag, which holds the URL of the product report page and the name of the product, is enclosed within a div tag with the class attribute value crux-body-copy. That is where we start: we will fetch all the div tags whose class attribute value is crux-body-copy:

## div tags with crux-body-copy class
div_class = "crux-body-copy"

## getting all the divs with class 'crux-body-copy'
## note: pass the variable div_class, not the string "div_class"
div_tags = soup.find_all("div", class_=div_class)

## we will see all the div tags 
## enclosing the anchor tags with the required info
for tag in div_tags:
    print(tag)

<div class="crux-body-copy"> <a class="products-a-z__results__item" href="https://www.consumerreports.org/cro/air-conditioners.htm"> Air conditioners </a> </div>
...
...
...
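If you want to experiment with the class filter offline, the following sketch runs find_all on a small hand-written snippet. The sample HTML (including the extra div with class "other") and the names sample_html, sample_soup, by_find_all and by_selector are assumptions for illustration. It also shows select, BeautifulSoup's CSS selector method, as an alternative way to express the same filter.

```python
import bs4

## hand-written sample modelled on the output above (an assumption)
sample_html = """
<div class="crux-body-copy">
  <a class="products-a-z__results__item" href="https://www.consumerreports.org/cro/air-conditioners.htm"> Air conditioners </a>
</div>
<div class="other"> <a href="#"> Not a product </a> </div>
<div class="crux-body-copy">
  <a class="products-a-z__results__item" href="https://www.consumerreports.org/cro/air-filters.htm"> Air Filters </a>
</div>
"""

sample_soup = bs4.BeautifulSoup(sample_html, "html.parser")
div_class = "crux-body-copy"

## filtering by class with find_all; class_ receives the variable's value
by_find_all = sample_soup.find_all("div", class_=div_class)

## the same filter written as a CSS selector
by_selector = sample_soup.select("div." + div_class)

## both approaches skip the div with class "other"
print(len(by_find_all), len(by_selector))
```

Which of the two you use is mostly a matter of taste; find_all is what the rest of this tutorial uses.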

The complete output of the above code is too long to show here, so we have stored it in a file (download the file to see it).

The next step is to extract the product name and the link to each product's webpage from the enclosing div tag.

## extracting the names and links from the div tags
for tag in div_tags:
    name = tag.a.text.strip()
    link = tag.a['href']
    
    print("{} ---- {}".format(name, link))

Air conditioners ---- https://www.consumerreports.org/cro/air-conditioners.htm
Air Filters ---- https://www.consumerreports.org/cro/air-filters.htm
Air Fryers ---- https://www.consumerreports.org/cro/air-fryers.htm
Air Mattresses ---- https://www.consumerreports.org/cro/air-mattresses.htm
...
...
...
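The loop above assumes that every matching div contains an anchor tag. If some div had no link, tag.a would be None and the loop would raise an error. Below is a minimal sketch of a more defensive version, run on a hand-written sample. The function name extract_products and the sample HTML are hypothetical and not part of the original code.

```python
import bs4

## hand-written sample; the second div has no anchor tag (an assumption)
sample_html = """
<div class="crux-body-copy"><a href="https://www.consumerreports.org/cro/air-fryers.htm"> Air Fryers </a></div>
<div class="crux-body-copy">No link in this one</div>
"""

def extract_products(soup, div_class="crux-body-copy"):
    """Return (name, link) pairs, skipping divs that have no usable link."""
    products = []
    for tag in soup.find_all("div", class_=div_class):
        ## find the first anchor that actually has an href attribute
        anchor = tag.find("a", href=True)
        if anchor is None:
            continue
        products.append((anchor.get_text(strip=True), anchor["href"]))
    return products

products = extract_products(bs4.BeautifulSoup(sample_html, "html.parser"))
print(products)
```

Returning a list instead of printing inside the loop also makes the results easier to reuse later.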

See how simple that was? With this, we have successfully scraped data from a website.
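Once the names and links are extracted, you may want to keep them rather than only print them. As one possible follow-up, here is a small sketch that turns (name, link) pairs into CSV text using Python's built-in csv module. The helper name products_to_csv and the column headers are assumptions; the two sample rows are copied from the output shown earlier.

```python
import csv
import io

## sample rows copied from the output shown earlier
products = [
    ("Air conditioners", "https://www.consumerreports.org/cro/air-conditioners.htm"),
    ("Air Filters", "https://www.consumerreports.org/cro/air-filters.htm"),
]

def products_to_csv(rows):
    """Return CSV text with a header row followed by one line per product."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "link"])
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = products_to_csv(products)
print(csv_text)
```

To save the result to disk, you could write csv_text to a file opened in text mode.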

Here is the complete code for your reference:

## importing bs4, requests and fake_useragent modules
import bs4
import requests
from fake_useragent import UserAgent

## initializing the UserAgent object
user_agent = UserAgent()
url = "https://www.consumerreports.org/cro/a-to-z-index/products/index.htm"

## getting the response from the page using get method of requests module
page = requests.get(url, headers={"user-agent": user_agent.chrome})

## storing the content of the page in a variable
html = page.content

## creating BeautifulSoup object
soup = bs4.BeautifulSoup(html, "html.parser")

## div tags with crux-body-copy class
div_class = "crux-body-copy"

## getting all the divs with class 'crux-body-copy'
## note: pass the variable div_class, not the string "div_class"
div_tags = soup.find_all("div", class_=div_class)

## extracting the names and links from the div tags
for tag in div_tags:
    name = tag.a.text.strip()
    link = tag.a['href']
    
    print("{} ---- {}".format(name, link))

Try running this code on your machine, and if you face any issues you can post your questions here: Studytonight Q & A Forum
