Introduction to the requests Module
The requests module is used to send an HTTP request to any server and receive the HTTP response back. We will use the requests module to send an HTTP request to a website's URL and get the response in return, and then we will use the Beautiful Soup module to extract the useful data/information/content of the website from that response.
So let's learn how to send an HTTP request and receive the response from the server using the requests module.
Some Useful requests Module Methods
Following are some of the commonly used methods available in the requests module for making HTTP requests.
- requests.get()
- requests.post()
- requests.put()
- requests.delete()
- requests.head()
- requests.options()
In this tutorial, we will be using the requests.get() and requests.post() methods to make HTTP requests for web scraping.
If you are new to HTTP requests and wondering what GET and POST requests are, here is a simple explanation (a short code example follows the list):
- GET: used to retrieve information (a webpage) from a URL.
- POST: used to send information to a URL.
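To make this concrete, here is a minimal sketch. It uses httpbin.org (a public HTTP testing service, used here purely as a stand-in URL) to show that a GET request carries information in the URL while a POST request carries it in the request body:
## import requests module
import requests
## GET: extra information travels in the URL as query parameters
r = requests.get("https://httpbin.org/get", params={'q': 'python'})
print(r.url)    ## https://httpbin.org/get?q=python
## POST: information travels in the request body instead of the URL
r = requests.post("https://httpbin.org/post", data={'q': 'python'})
print(r.status_code)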
Making a Request using requests.get()
The requests.get(URL) method is used to send an HTTP GET request and receive the data back as a response. It takes the URL of a website or any API.
The content of that response is stored in response.content, an attribute of the response object returned by the get() method.
Let's take an example:
## import requests module
import requests
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## printing the response
print(response.content)
The output for the above script would be the entire page source (or source code) for the specified URL, which is far too long to reproduce here.
You must be wondering how we can read anything from this, as it looks so complicated. Well, to make the response content readable we use the Beautiful Soup module, which we will cover in the coming tutorials.
We can print the header information sent by the website in the response using the response.headers attribute.
For newbies, header information contains general meta information about the HTTP connection along with some connection properties.
Let's print the headers for the above GET request:
## import requests module
import requests
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## headers of the website
print(response.headers)
{'Date': 'Wed, 07 Nov 2018 08:56:29 GMT', 'Expires': '-1', 'Cache-Control': 'private, max-age=0', 'Content-Type': 'text/html; charset=ISO-8859-1', 'P3P': 'CP="This is not a P3P policy! See g.co/p3phelp for more info."', 'Content-Encoding': 'gzip', 'Server': 'gws', 'X-XSS-Protection': '1; mode=block', 'X-Frame-Options': 'SAMEORIGIN', 'Set-Cookie': '1P_JAR=2018-11-07-08; expires=Fri, 07-Dec-2018 08:56:29 GMT; path=/; domain=.google.com, NID=144=cPBAw4RAx5TZoBZ3WtDNN54qgUt198oVTvdyWYx0iFIPo-MLX_qcQ8DjZXQNkO7WqRD4KOGnXShYh9TFmmZKtOZ0OoNBu-9Nlw50ocpoGMxvt9SNRZgXPUJgMv0D5A7URfeSV0BLihLp24UPNWhOQjMO5sbZNndc0Dvd3DHVR5s; expires=Thu, 09-May-2019 08:56:29 GMT; path=/; domain=.google.com; HttpOnly', 'Alt-Svc': 'quic=":443"; ma=2592000; v="44,43,39,35"', 'Transfer-Encoding': 'chunked'}
To print the values in a more readable format, we can access each key-value pair separately using the response.headers.items() method and then use a for loop to print each one.
## import requests module
import requests
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## headers of the website
for key, value in response.headers.items():
    print(key, '\t\t', value)
Date Wed, 07 Nov 2018 08:56:29 GMT
Expires -1
Cache-Control private, max-age=0
Content-Type text/html; charset=ISO-8859-1
P3P CP="This is not a P3P policy! See g.co/p3phelp for more info."
Content-Encoding gzip
Server gws
X-XSS-Protection 1; mode=block
X-Frame-Options SAMEORIGIN
Set-Cookie 1P_JAR=2018-11-07-08; expires=Fri, 07-Dec-2018 08:56:29 GMT; path=/; domain=.google.com, NID=144=cPBAw4RAx5TZoBZ3WtDNN54qgUt198oVTvdyWYx0iFIPo-MLX_qcQ8DjZXQNkO7WqRD4KOGnXShYh9TFmmZKtOZ0OoNBu-9Nlw50ocpoGMxvt9SNRZgXPUJgMv0D5A7URfeSV0BLihLp24UPNWhOQjMO5sbZNndc0Dvd3DHVR5s; expires=Thu, 09-May-2019 08:56:29 GMT; path=/; domain=.google.com; HttpOnly
Alt-Svc quic=":443"; ma=2592000; v="44,43,39,35"
Transfer-Encoding chunked
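If you only need one header, response.headers also behaves like a dictionary with case-insensitive keys, so a single value can be looked up directly. For example:
## import requests module
import requests
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## look up a single header by name (keys are case-insensitive)
print(response.headers['Content-Type'])
## use .get() to avoid a KeyError when a header may be absent
print(response.headers.get('Content-Encoding'))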
Status of Request
When we make a GET request using the requests.get() method, the request might complete successfully, get redirected to some other URL, or fail on the client side or the server side.
To know the status of the request, we can check the status code of the response received.
This can be done using the response.status_code attribute. It's very simple:
## import requests module
import requests
## send a request and receive the information from https://www.google.com
response = requests.get("https://www.google.com")
## status of request
print(response.status_code)
200
Following are the different status values that you may get in the response:
| Status Code | Description |
| --- | --- |
| 1XX | Informational |
| 2XX | Success |
| 3XX | Redirection |
| 4XX | Client Error |
| 5XX | Server Error |
For example, the 200 status code means success, whereas the 201 status code means created (returned when we send a request to create some resource), and so on.
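In code, we can branch on the status code ourselves, or ask requests to raise an exception for error responses. A small sketch:
## import requests module
import requests
response = requests.get("https://www.google.com")
## branch on the status code: 200 means the request succeeded
if response.status_code == 200:
    print("Request succeeded")
## or let requests raise an HTTPError for any 4XX/5XX response
response.raise_for_status()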
Just like a GET request, we can make a POST request using the requests.post(URL) method, and handling the response works the same way.
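As a quick illustration (again using httpbin.org/post, a test endpoint that simply echoes back what it receives, with a made-up sample payload):
## import requests module
import requests
## some sample form data (purely illustrative)
payload = {'username': 'test_user'}
## send the data to the URL using a POST request
response = requests.post("https://httpbin.org/post", data=payload)
## the response is handled exactly like a GET response
print(response.status_code)
print(response.content)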
For web scraping, we will mostly use GET requests.
Setting Up a User Agent
When we try to access a website using a program, some websites don't allow it for security reasons, because that leaves the website exposed to unnecessary requests generated by programs, which in extreme cases can even burden the website's server with a large number of requests.
To overcome this, we will use the fake_useragent module, which makes a request look to the server as if it was initiated by a real browser user and not a program.
To install the module fake_useragent, run the following command:
pip install fake_useragent
Once it is installed, we can use it to generate a fake user request like this:
## import requests module and UserAgent from the fake_useragent module
import requests
from fake_useragent import UserAgent
## create an instance of the 'UserAgent' class
obj = UserAgent()
## create a dictionary with key 'user-agent' and value 'obj.chrome'
header = {'user-agent': obj.chrome}
## send request by passing 'header' to the 'headers' parameter in 'get' method
r = requests.get('https://google.com', headers=header)
print(r.content)
The output for this request will be the source code of the webpage https://google.com, as if it was opened by a user in the Chrome browser.
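To double-check what was actually sent, we can inspect the request attached to the response (response.request holds the prepared request, including its headers):
## import modules
import requests
from fake_useragent import UserAgent
## build the fake user-agent header and send the request, as above
header = {'user-agent': UserAgent().chrome}
r = requests.get('https://google.com', headers=header)
## print the User-Agent string that was actually sent to the server
print(r.request.headers['user-agent'])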
So now we know how to send an HTTP request to any URL and receive the response using the requests module. In the next tutorial, we will learn how to extract the real, useful content from the HTTP response using the Beautiful Soup module.