
Extracting Metadata from PDF Files

This tutorial falls under the category of cyber forensics. The example we are going to discuss is a real-life incident in which a member of the hacker group Anonymous was arrested after the group released a PDF file (as a press release) with information about the group and the online attacks it had conducted. The evidence against him was the metadata extracted from the released PDF file. You can download the PDF from here.

In this tutorial we will extract the metadata from that PDF file and see what information it contains. But first, let's understand what metadata is.

Metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient.

In simple words, metadata holds information about the data. For a PDF file, the metadata can include the date the PDF was created, the name of the author, the software that was used to create it and, in some cases, even the MAC address of the computer on which it was created.
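To make this concrete, the sketch below lists the standard entries of a PDF document information dictionary, which is where this kind of metadata is usually stored. The key names are the standard ones; the descriptions are paraphrased, and the describe helper is a hypothetical name used only for illustration.

```python
# Standard keys of a PDF document information dictionary.
# The descriptions are paraphrased for illustration.
INFO_KEYS = {
    "/Title": "title of the document",
    "/Author": "name of the person who created the document",
    "/Subject": "subject of the document",
    "/Keywords": "keywords associated with the document",
    "/Creator": "application that created the original document",
    "/Producer": "application that converted it to PDF",
    "/CreationDate": "date and time the document was created",
    "/ModDate": "date and time the document was last modified",
}


def describe(key):
    """Return a short description for a metadata key (hypothetical helper)."""
    return INFO_KEYS.get(key, "non-standard or custom entry")


if __name__ == "__main__":
    for key in sorted(INFO_KEYS):
        print("%-14s %s" % (key, describe(key)))
```

A given PDF may contain only some of these entries, and tools are free to add custom ones, so any code that reads metadata should not assume a particular key is present.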

There is one more thing to do before we can start extracting metadata from a PDF file: we need to install a Python module called pyPdf. To install it, follow these steps:

  • Download pyPdf tar.gz file from here.
  • Extract the tar.gz file using the following command: tar -xvzf 'filename'
  • Now change your directory to the freshly extracted folder.
  • Install the package by running the command: python setup.py install

Program to extract metadata from PDF file

Below is the program to extract the metadata from a PDF file:

meta_extract.py

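#!/usr/bin/env python
# This program displays the metadata of a PDF file.
# Note: pyPdf targets Python 2.

import pyPdf


def main():
    # Enter the location of 'ANONOPS_The_Press_Release.pdf'
    # Download the PDF if you haven't already
    filename = '<LOCATION_OF_THE_PDF>'

    # Open in binary mode and keep the file open while pyPdf reads from it
    with open(filename, 'rb') as pdf:
        pdfFile = pyPdf.PdfFileReader(pdf)
        data = pdfFile.getDocumentInfo()

        print("----Metadata of the file----")

        # getDocumentInfo() may return None when the PDF has no info dictionary
        for metadata in (data or {}):
            print("%s: %s" % (metadata, data[metadata]))


if __name__ == '__main__':
    main()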

Output:

[Image: Extracting metadata from PDF files using pyPdf]
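Date entries in PDF metadata, when present, use the PDF date format (for example D:20101210031827+02'00'), which is hard to read at a glance. As a minimal sketch, the helper below converts such a string into a Python datetime plus a UTC offset. The function name parse_pdf_date and the sample value are illustrative assumptions and are not taken from the press release PDF.

```python
import re
from datetime import datetime, timedelta

# Pattern for PDF date strings such as "D:20101210031827+02'00'".
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Return (local datetime, UTC offset as timedelta or None)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError("not a PDF date string: %r" % value)
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    local = datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
    )
    if sign is None:
        return local, None  # no time zone information recorded
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    return local, offset


if __name__ == "__main__":
    # Hypothetical sample value, for illustration only.
    print(parse_pdf_date("D:20101210031827+02'00'"))
```

In a forensic context the UTC offset can be as interesting as the timestamp itself, since it hints at the time zone configured on the machine that produced the file.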

So now, whenever you receive a suspicious PDF file, you can easily inspect its metadata to find clues about its origin. Cool, isn't it?
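If pyPdf is not available on the machine you are working on, a rough cross-check is still possible with the standard library alone. The sketch below is not a PDF parser: it simply scans the raw bytes for a few standard metadata keys followed by a simple literal string. It will miss hex-encoded or UTF-16 strings, entries stored in compressed object streams, and encrypted files. The name scan_info_entries and the sample bytes are assumptions made for illustration.

```python
import re

# Matches entries such as "/Author (Jane Doe)" for a few standard keys.
# Only simple literal strings are handled; escaped characters are kept as-is.
_ENTRY = re.compile(
    br"/(Title|Author|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    br"\s*\(((?:\\.|[^\\()])*)\)"
)


def scan_info_entries(raw):
    """Return a dict of simple metadata entries found in raw PDF bytes."""
    found = {}
    for key, value in _ENTRY.findall(raw):
        # Keep the first occurrence of each key.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


if __name__ == "__main__":
    # Hypothetical fragment of a PDF file, for illustration only.
    sample = (
        b"1 0 obj\n<< /Author (Jane Doe) /Producer (ExampleWriter 1.0) "
        b"/CreationDate (D:20101210031827+02'00') >>\nendobj"
    )
    for name, text in sorted(scan_info_entries(sample).items()):
        print("%s: %s" % (name, text))
```

To try it on a real file, read the file in binary mode (open(path, 'rb').read()) and pass the bytes to scan_info_entries. Treat the result as a hint only and prefer a proper PDF library whenever one is available.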