LAST UPDATED: JUNE 28, 2023

Extract Text from PDF in Python - PyPDF2 Module

    In this simple tutorial, we will learn how to extract text from a PDF file in Python. The PDF can have multiple pages; we will extract the text from every page.

    We will be using the PyPDF2 module for extracting text from PDF files.

    Extract Text from PDF in Python using the PyPDF2 module

    To install the PyPDF2 module, run the following pip command:

    pip install PyPDF2

    Once PyPDF2 is installed, we can write code that opens the PDF file, reads its text, and either prints it on the console or writes it to a separate text file.

    Using the PyPDF2 module

    To extract text from a PDF file, we will use the PdfReader class. (This class was named PdfFileReader in older releases; PyPDF2 3.0 removed the old name.) Its constructor takes a stream parameter, to which we pass the file stream of the PDF file.

    Now let's see how we can use the PyPDF2 module to read PDF files:

    from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0
    
    # open the PDF file in binary mode
    pdfFile = open('mypdf.pdf', 'rb')
    
    # create a PdfReader object to read the file
    pdfReader = PdfReader(pdfFile)
    
    # metadata replaces the older getDocumentInfo() method
    print("Printing the document info: " + str(pdfReader.metadata))
    print("- - - - - - - - - - - - - - - - - - - -")
    # len(pdfReader.pages) replaces the older getNumPages() method
    print("Number of Pages: " + str(len(pdfReader.pages)))
    
    # close the PDF file object
    pdfFile.close()

    In the code above, we first call Python's built-in open() function to open the file for reading in binary mode ('rb'), and then use this file object to initialize the reader object.

    Once the reader object is ready, we can use its metadata attribute to get the file information, and len() of its pages attribute to get the total number of pages in the PDF file. (In PyPDF2 versions before 3.0, these were the getDocumentInfo() and getNumPages() methods.)

    Then we have the pages attribute, which returns a page from the PDF file by its index (starting from 0), and finally the extract_text() method of a page, which extracts the text from that page. (In PyPDF2 versions before 3.0, these were the getPage() and extractText() methods.)

    from PyPDF2 import PdfReader  # PdfFileReader was removed in PyPDF2 3.0
    
    # open the PDF file in binary mode
    pdfFile = open('mypdf.pdf', 'rb')
    
    # create a PdfReader object to read the file
    pdfReader = PdfReader(pdfFile)
    
    # metadata replaces getDocumentInfo(); it is None if the PDF has no document info
    docInfo = pdfReader.metadata
    print("PDF File name: " + str(docInfo.title if docInfo else None))
    print("PDF File created by: " + str(docInfo.creator if docInfo else None))
    print("- - - - - - - - - - - - - - - - - - - -")
    
    # len(pdfReader.pages) replaces getNumPages()
    numOfPages = len(pdfReader.pages)
    
    for i in range(numOfPages):
        print("Page Number: " + str(i))
        print("- - - - - - - - - - - - - - - - - - - -")
        # pages[i] replaces getPage(i); extract_text() replaces extractText()
        pageObj = pdfReader.pages[i]
        print(pageObj.extract_text())
        print("- - - - - - - - - - - - - - - - - - - -")
    
    # close the PDF file object
    pdfFile.close()

    In the code above, we print the title and the creator (the application that produced the original document) of the PDF file mypdf.pdf. Change this to your own file name, and provide the full path if the file is in a different directory. Both values are attributes of the document information object returned by the reader.
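
    The document information can be missing entirely, and individual fields such as the title are often empty. As a minimal sketch, assuming PyPDF2 3.0 or later (where the reader exposes metadata and pages), the helpers below collect the same details without failing on such files. The names describe_reader and describe_pdf are our own, chosen for illustration, and are not part of PyPDF2.

```python
def describe_reader(reader):
    """Summarize a PdfReader-like object as a plain dictionary."""
    info = reader.metadata  # None when the PDF carries no document information
    return {
        "pages": len(reader.pages),
        "title": info.title if info else None,
        "creator": info.creator if info else None,
    }


def describe_pdf(path):
    """Open the PDF at `path` and summarize it (hypothetical convenience wrapper)."""
    # imported here so that describe_reader itself has no third-party dependency
    from PyPDF2 import PdfReader

    # the with block closes the file even if reading fails
    with open(path, "rb") as pdf_file:
        return describe_reader(PdfReader(pdf_file))
```

    Calling describe_pdf('mypdf.pdf') would then return a dictionary with the keys pages, title, and creator.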

    Then we use a Python for loop to print the text of every page of the PDF. Once we are done, we call the close() method on the file object to release the file resource.
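
    Instead of printing to the console, the extracted text can also be written to a separate text file. The following is a minimal sketch of that idea, assuming PyPDF2 3.0 or later. The function names and the "--- Page N ---" separator are our own choices, not something PyPDF2 prescribes.

```python
def write_pages_to_text(reader, out_path):
    """Write the text of every page in `reader` to `out_path`; return the page count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out_file:
        for number, page in enumerate(reader.pages, start=1):
            out_file.write(f"--- Page {number} ---\n")
            # extract_text() can yield an empty string for image-only pages
            out_file.write((page.extract_text() or "") + "\n")
            count = number
    return count


def pdf_to_text_file(pdf_path, out_path):
    """Extract all text from `pdf_path` into `out_path` (hypothetical wrapper)."""
    from PyPDF2 import PdfReader  # imported here; only this wrapper needs PyPDF2

    # keep the PDF open while its pages are being read
    with open(pdf_path, "rb") as pdf_file:
        return write_pages_to_text(PdfReader(pdf_file), out_path)
```

    For example, pdf_to_text_file('mypdf.pdf', 'mypdf.txt') would create mypdf.txt next to the script and return the number of pages written.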

    Other Applications of PyPDF2 Module

    The PyPDF2 module can be used to perform many operations on PDF files, such as:

    1. Reading the text of a PDF file, which we just did above.

    2. Rotating a PDF page in multiples of 90 degrees.

    3. Merging two or more PDF files at a defined page number.

    4. Appending two or more PDF files, one after another.

    5. Finding the meta information of a PDF file, such as the creator, author, and date of creation.

    6. We can even create a new PDF file using the text coming from some text file.
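
    As an illustration of the page rotation mentioned in the list above, here is a minimal sketch, assuming PyPDF2 3.0 or later, where each page has a rotate() method and PdfWriter provides add_page() and write(). PyPDF2 accepts only multiples of 90 degrees for rotation. The helper names rotate_pages and rotate_pdf are hypothetical.

```python
def rotate_pages(reader, writer, angle=90):
    """Rotate every page of `reader` clockwise by `angle` and add it to `writer`."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    for page in reader.pages:
        page.rotate(angle)  # changes the rotation of the page object itself
        writer.add_page(page)
    return writer


def rotate_pdf(src_path, dst_path, angle=90):
    """Write a rotated copy of `src_path` to `dst_path` (hypothetical wrapper)."""
    from PyPDF2 import PdfReader, PdfWriter  # imported here; only needed for real files

    # the source stays open until the rotated copy has been written
    with open(src_path, "rb") as src:
        writer = rotate_pages(PdfReader(src), PdfWriter(), angle)
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```

    For example, rotate_pdf('mypdf.pdf', 'rotated.pdf', 180) would save an upside-down copy of the file.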

    Conclusion

    In this tutorial, we covered how we can extract text from a PDF file. This is a great use case if you are working on a project where you want to convert PDF documents to text which can be stored in a database for data collection. Note that a scanned PDF must contain a text layer (for example, one added by OCR software) for PyPDF2 to find any text in it.

    Similarly, there can be many different use cases, like reading text from candidate resumes for analysis, or reading text from invoices.

    If you have a special use case, do share it with us in the comment section below. Also, if you face any issues while running the Python script, do share the error with us by posting in the comments and we will definitely help you.

    Frequently Asked Questions (FAQs)

    1. How can I extract text from a PDF file using PyPDF2 in Python?

    PyPDF2 provides a simple and intuitive API to extract text from PDF files. You can open a PDF, iterate over its pages, and use the extract_text() method to retrieve the text content.

    2. Does PyPDF2 handle scanned or image-based PDFs?

    No, PyPDF2 is primarily designed for extracting text from text-based PDFs. It may not work well with scanned or image-based PDFs that lack textual content.
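
    One rough way to spot such pages is to check which pages return no text at all. The sketch below assumes PyPDF2 3.0 or later. The function name pages_without_text is our own, and an empty result is only a hint that a page is image-based, not proof.

```python
def pages_without_text(reader):
    """Return the zero-based indexes of pages that yield no extractable text."""
    empty = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if not text.strip():  # only whitespace, or nothing at all
            empty.append(index)
    return empty
```

    With a PdfReader object named pdfReader, as in the examples above, pages_without_text(pdfReader) lists the pages that would probably need OCR before their text can be read.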

    3. Can PyPDF2 preserve the original formatting and layout of the extracted text?

    PyPDF2 focuses on extracting the textual content from PDF files rather than preserving the original formatting or layout. The extracted text is returned as a plain string.

    4. Are there any limitations or considerations when using PyPDF2 for text extraction?

    PyPDF2 relies on the structure and encoding of PDF files. If a PDF file has complex formatting, unusual encoding, or encrypted content, PyPDF2's text extraction may encounter limitations or difficulties.
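
    For encrypted files, the reader has to be decrypted before any page can be read. The following is a minimal sketch, assuming PyPDF2 3.0 or later, where the reader has an is_encrypted property and a decrypt() method. The helper name ensure_readable is hypothetical.

```python
def ensure_readable(reader, password=None):
    """Return True if the pages of `reader` can be read, decrypting first if needed."""
    if not reader.is_encrypted:
        return True  # nothing to unlock
    if password is None:
        return False  # encrypted, and no password was supplied
    # decrypt() returns a falsy value (0) when the password does not match
    return bool(reader.decrypt(password))
```

    A script could call ensure_readable(pdfReader, 'secret') before looping over the pages, and skip the file when it returns False.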

    5. Are there alternative libraries for extracting text from PDFs in Python?

    Yes, there are alternative libraries like PDFMiner, PyMuPDF, and pdftotext that can be used for text extraction from PDFs in Python. These libraries offer different features and capabilities, so it's worth exploring them to find the best fit for your specific requirements.

    I like writing content about C/C++, DBMS, Java, Docker, general How-tos, Linux, PHP, Go lang, Cloud, and Web development. I have 10 years of diverse experience in software development. Founder @ Studytonight