Extracting Meta Data from PDF Files

This tutorial falls under the category of cyber forensics. The example we are going to discuss is a real-life incident in which a member of the hacker group Anonymous was arrested after the group released a PDF file (as a press release) with information about the group and the online attacks it had conducted. The evidence against him was the metadata extracted from the released PDF file. You can download the PDF from here.

In this tutorial we will also extract the metadata from that PDF file and see what information it contains. But first, let's understand what metadata is.

Metadata is data about data. Metadata is structured information that explains, describes, or locates the original primary data, or that otherwise makes using the original primary data more efficient.

In simple words, metadata holds information about the data. For a PDF file, the metadata may include the date on which the PDF was created, the name of the author, the software used to create it and, in some cases, even the MAC address of the computer on which it was created.
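To make the "date of creation" concrete: in a PDF, this value is normally stored under the /CreationDate key as a string in the form D:YYYYMMDDHHmmSS followed by an optional timezone offset. The following is a minimal sketch (Python 3) of turning such a string into a datetime object. The function name parse_pdf_date and the sample date are made up for illustration and are not taken from the press release.

```python
import re
from datetime import datetime, timedelta, timezone

# Pattern for PDF date strings such as "D:20101210031827+02'00'".
# Every field after the year is optional.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(text):
    """Convert a PDF date string to a datetime; return None if it is malformed."""
    match = _PDF_DATE.match(text.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    try:
        # A missing month or day defaults to 1, missing time fields to 0.
        moment = datetime(int(year), int(month or 1), int(day or 1),
                          int(hour or 0), int(minute or 0), int(second or 0))
    except ValueError:
        return None  # e.g. month 13 or day 32
    if sign is None:
        return moment  # no timezone information: naive datetime
    offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
    if sign == '-':
        offset = -offset
    return moment.replace(tzinfo=timezone(offset))


if __name__ == '__main__':
    # Hypothetical value, only to show the format.
    print(parse_pdf_date("D:20101210031827+02'00'"))
```

A timezone offset in the creation date can be a useful hint in its own right, since it suggests the region in which the author's computer was configured.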

There is one more thing to do before we can start extracting metadata from a PDF file: we need to install another Python module, pyPdf. To install it, follow these steps:

1. Download the pyPdf tar.gz file from here.
2. Extract the tar.gz file using the following command: tar -xvzf 'filename'
3. Change your directory to the freshly extracted folder.
4. Install the package by running the command: python setup.py install
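Once the steps above are done, a quick way to confirm that the module is visible to your interpreter is to try importing it. The helper below is a small sketch; module_available is a hypothetical name chosen for this example, and the check works for any module name, not just pyPdf.

```python
def module_available(name):
    """Return True if the module `name` can be imported by this interpreter."""
    try:
        __import__(name)
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    # 'pyPdf' is the module name used in this tutorial.
    if module_available('pyPdf'):
        print("pyPdf is installed")
    else:
        print("pyPdf is not installed; repeat the installation steps")
```

Run it with the same python command you plan to use for the extraction program, because each interpreter has its own set of installed packages.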

Program to extract metadata from PDF file

Below is the program to extract the metadata from a PDF file:

meta_extract.py

#!/usr/bin/env python
# This program displays the metadata of a PDF file (Python 2, pyPdf)

import pyPdf

def main():
    # Enter the location of 'ANONOPS_The_Press_Release.pdf'
    # Download the PDF if you haven't already
    filename = '<LOCATION_OF_THE_PDF>'

    # Open the file in binary mode; a PDF is not a plain text file
    pdfFile = pyPdf.PdfFileReader(open(filename, 'rb'))
    data = pdfFile.getDocumentInfo()

    print "----Metadata of the file----"

    for metadata in data:
        # Format explicitly, since a value is not always a plain string
        print "%s: %s" % (metadata, data[metadata])

if __name__ == '__main__':
    main()

Output:

So now, whenever you receive a suspicious PDF file, you can easily access its metadata to find out more about its origin. Cool, isn't it?
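Not every PDF carries a complete set of metadata: a value may be empty, and the reader may return nothing at all when the file has no document information. The sketch below shows one way to print metadata more defensively. It works on a plain dictionary that stands in for the object returned by getDocumentInfo(); the function name format_metadata and the sample values are made up for illustration.

```python
def format_metadata(info):
    """Return a sorted list of 'key: value' lines for a mapping of PDF metadata."""
    if not info:
        return []  # no metadata at all (None or an empty mapping)
    lines = []
    for key in sorted(info):
        value = info[key]
        # Keys usually look like '/Author'; drop the leading slash for display.
        label = key[1:] if key.startswith('/') else key
        shown = value if value is not None else "(empty)"
        lines.append("%s: %s" % (label, shown))
    return lines


if __name__ == '__main__':
    # Hypothetical values, not the contents of the press release.
    sample = {
        '/Author': 'Jane Doe',
        '/Producer': 'ExampleWriter 1.0',
        '/CreationDate': None,
    }
    for line in format_metadata(sample):
        print(line)
```

Sorting the keys keeps the output stable between runs, which helps when comparing the metadata of several files side by side.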
