PDF Metadata Extractor

Overview

This document explains the development of a custom Python-based PDF Metadata Extractor created for educational purposes and authorized document analysis.

The goal of this project was to build a command-line application capable of automatically extracting embedded metadata from PDF files and exporting the results into a structured CSV file.

Many PDF documents contain hidden metadata that is not directly visible when reading the file. This information may reveal authorship, software used, creation dates, modification timestamps, internal workflow details, and other useful intelligence.

The script was designed to support both:

  • single PDF file analysis
  • bulk scanning of directories containing multiple PDF files

The project demonstrates how Python can be used for practical metadata intelligence gathering, forensic triage, privacy reviews, and automated document auditing.


Objectives

The main goals of this project were:

  • develop a custom PDF metadata extraction tool in Python
  • support single-file and multi-file analysis
  • combine multiple PDF libraries for improved accuracy
  • export structured results to CSV format
  • normalize inconsistent metadata values
  • convert raw PDF timestamps into readable dates
  • improve terminal usability and output design
  • understand metadata analysis in cybersecurity workflows

Why Build a Custom Tool?

Although commercial forensic suites and professional document tools exist, building the logic manually in Python provides major learning benefits:

  • deeper understanding of PDF internals
  • hands-on Python scripting experience
  • metadata field handling
  • file automation workflows
  • CSV reporting skills
  • command-line interface development
  • investigative methodology

Creating the tool manually also helps explain how automated forensic and OSINT utilities work behind the scenes.


Why Metadata Matters

Metadata can expose information that the visible content of a document does not show.

Examples include:

  • author names
  • usernames
  • company departments
  • document titles
  • internal software versions
  • printers or PDF generators
  • creation timestamps
  • modification history
  • keywords and tags

This makes metadata highly relevant for:

  • OSINT investigations
  • digital forensics
  • privacy audits
  • leak analysis
  • internal security reviews
  • red team reconnaissance

Technologies Used

Technology   | Purpose
Python 3     | Main programming language
PyPDF2       | Primary PDF metadata extraction
pdfminer.six | Secondary / fallback metadata parsing
csv          | Export data into spreadsheet format
argparse     | Command-line argument parsing
pathlib      | Modern file path handling

Why Two Libraries Were Used

Different PDF files store metadata in inconsistent ways.

Some files are handled well by PyPDF2, while others reveal additional fields through pdfminer.six.

Using both libraries increases coverage.

Library      | Strength
PyPDF2       | Fast, simple, reliable for common PDFs
pdfminer.six | Better fallback for unusual structures

This dual-parser approach improves extraction quality.
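
The fill-in step of the dual-parser approach can be sketched without either library. This is a minimal illustration, not the script's actual code: it assumes each parser's output has already been mapped to a plain dictionary keyed by field name, and the helper name `merge_metadata` and the sample values are hypothetical.

```python
# Values treated as "missing" (an assumption for this sketch).
EMPTY = (None, "")


def merge_metadata(primary, fallback):
    """Return a copy of primary with empty fields filled from fallback."""
    merged = dict(primary)
    for field, value in fallback.items():
        # Only fill gaps; never overwrite a value the primary parser found.
        if merged.get(field) in EMPTY and value not in EMPTY:
            merged[field] = value
    return merged


# Hypothetical results from the primary and the fallback parser.
from_primary = {"Title": "Quarterly Report", "Author": "", "Producer": None}
from_fallback = {"Title": "Other Title", "Author": "John Doe",
                 "Producer": "Example PDF Engine"}

merged = merge_metadata(from_primary, from_fallback)
print(merged)
```

Giving the primary parser precedence keeps results stable for common PDFs, while the fallback only contributes fields that would otherwise stay empty.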


Program Features

Feature               | Description
Single File Mode      | Analyze one PDF document
Directory Mode        | Scan multiple PDFs automatically
CSV Export            | Save all findings to CSV
Metadata Cleanup      | Remove broken whitespace / control chars
Timestamp Conversion  | Convert raw PDF dates
Page Counter          | Detect number of pages
PDF Version Detection | Extract PDF standard version
Styled CLI Output     | Professional terminal workflow
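
The Metadata Cleanup feature could look roughly like the following. This is a sketch under assumptions: the function name `clean_value` is hypothetical, and the choices to decode bytes as UTF-8 and to turn control characters into spaces are illustrative rather than taken from the original script.

```python
import re

# Control characters: the C0 range, DEL, and the C1 range.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")


def clean_value(value):
    """Normalize one metadata value to a trimmed, single-line string."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        # Assumption: undecodable bytes are replaced instead of raising.
        value = value.decode("utf-8", errors="replace")
    # Replace control characters (including tabs and newlines) with spaces.
    text = _CONTROL_CHARS.sub(" ", str(value))
    # Collapse the resulting whitespace runs and trim both ends.
    return " ".join(text.split())


print(clean_value("  John\x00 Doe\r\n"))
print(clean_value(b"Quarterly\tReport"))
```

Keeping every value on a single line also avoids multi-line cells in the CSV export.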

Supported Metadata Fields

Field       | Description
Filename    | Name of PDF file
Filepath    | Full location of file
Title       | Embedded document title
Author      | Author name
Creator     | Application that created file
Created     | Creation timestamp
Modified    | Last modification time
Subject     | Subject field
Keywords    | Search tags
Description | Reserved field
Producer    | PDF producing engine
PDF Version | File standard version
Pages       | Total page count
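
One way to turn these fields into CSV columns is the standard library's `csv.DictWriter`. The sketch below assumes the column order matches the table above; the helper name `write_rows` and the sample row are hypothetical.

```python
import csv
import io

# Column order follows the field table above.
FIELDNAMES = [
    "Filename", "Filepath", "Title", "Author", "Creator", "Created",
    "Modified", "Subject", "Keywords", "Description", "Producer",
    "PDF Version", "Pages",
]


def write_rows(rows, stream):
    """Write a header plus one CSV row per PDF to an open text stream."""
    writer = csv.DictWriter(
        stream,
        fieldnames=FIELDNAMES,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # unknown keys are dropped silently
    )
    writer.writeheader()
    writer.writerows(rows)


# Demonstration with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_rows([{"Filename": "report.pdf", "Author": "John Doe", "Pages": 12}], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to disk, opening the file with `open(path, "w", newline="", encoding="utf-8")` lets the `csv` module control line endings itself.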

How the Program Works

The workflow of the script is:

  1. Read command-line arguments
  2. Choose single-file or directory mode
  3. Open PDF file(s)
  4. Extract metadata using PyPDF2
  5. Fill missing values using pdfminer.six
  6. Normalize extracted values
  7. Convert PDF date strings
  8. Store results in memory
  9. Export findings to CSV file

This mirrors the extract, normalize, and export sequence that many document intelligence tools follow.
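
Step 7, converting PDF date strings, can be illustrated as follows. Raw PDF dates usually look like `D:20240311092200+01'00'`. The sketch assumes the readable format shown in the example CSV output (`YYYY-MM-DD HH:MM:SS`), ignores the timezone suffix, and returns the input unchanged when it cannot be parsed; the function name `pdf_date_to_text` is hypothetical.

```python
import re
from datetime import datetime

# D:YYYYMMDDHHmmSS with every part after the year optional.
# Any timezone suffix (Z, +HH'mm', -HH'mm') is ignored in this sketch.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def pdf_date_to_text(raw):
    """Return 'YYYY-MM-DD HH:MM:SS', or raw unchanged if it cannot be parsed."""
    match = _PDF_DATE.match(raw.strip()) if raw else None
    if not match:
        return raw
    year, month, day, hour, minute, second = match.groups()
    try:
        parsed = datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
        )
    except ValueError:
        # Out-of-range components, for example month 13.
        return raw
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(pdf_date_to_text("D:20240311092200+01'00'"))
```

Returning the raw string on failure keeps malformed dates visible in the CSV instead of silently dropping them.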


Installation

Installing the dependencies inside a Python virtual environment keeps them isolated from system packages:

python3 -m venv venv
source venv/bin/activate
python -m pip install PyPDF2==3.0.1 pdfminer.six==20221105

Command Line Usage

General syntax:

python pdf_metadata_extractor.py [mode] [target] -n output.csv

Parameters

Parameter | Meaning
-f        | Analyze one PDF file
-d        | Analyze directory of PDFs
-n        | Name of CSV output file
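
These three parameters map naturally onto `argparse`. The following is a sketch, not the script's actual parser: treating `-f` and `-d` as mutually exclusive and required is an assumption, the `dest` names are hypothetical, and the default output name `pdf_metadata.csv` is the one this document gives for runs without `-n`.

```python
import argparse


def build_parser():
    """Build a parser for the -f / -d / -n options described above."""
    parser = argparse.ArgumentParser(
        prog="pdf_metadata_extractor.py",
        description="Extract PDF metadata and export it to CSV.",
    )
    # Assumption: exactly one of the two modes must be chosen.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("-f", dest="file", metavar="PDF",
                      help="analyze one PDF file")
    mode.add_argument("-d", dest="directory", metavar="DIR",
                      help="analyze a directory of PDFs")
    parser.add_argument("-n", dest="output", metavar="CSV",
                        default="pdf_metadata.csv",
                        help="name of the CSV output file")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-f", "report.pdf"])
print(args.file, args.directory, args.output)
```

With a mutually exclusive group, passing both `-f` and `-d` makes `argparse` exit with a usage error instead of leaving the mode ambiguous.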

Example Commands

Single PDF

python pdf_metadata_extractor.py -f document.pdf -n metadata.csv

Multiple PDFs

python pdf_metadata_extractor.py -d ./pdfs -n all_metadata.csv
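
Directory mode needs a list of PDF files to process. A minimal sketch with `pathlib` is shown below; whether the original script searches subdirectories or matches extensions case-insensitively is not stated, so the non-recursive, case-insensitive behavior here is an assumption, as is the helper name `find_pdfs`.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside directory, sorted by path."""
    return sorted(
        path for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    found = [path.name for path in find_pdfs(tmp)]

print(found)
```

Sorting the paths makes the row order in the CSV reproducible between runs.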

Default Output Name

python pdf_metadata_extractor.py -f report.pdf

Creates:

pdf_metadata.csv

Example CSV Output

Filename   | Author   | Title            | Created             | Pages
report.pdf | John Doe | Quarterly Report | 2024-03-11 09:22:00 | 12

Why This Tool Is Valuable

This project combines several practical cybersecurity skills:

  • Python development
  • file automation
  • metadata intelligence gathering
  • CSV reporting
  • document analysis
  • command-line tool creation
  • forensic triage methodology

It is therefore a strong beginner/intermediate security portfolio project.


Security Lessons Learned

Hidden Information Exists in Documents

Even harmless-looking PDFs may reveal sensitive internal data.

Metadata Can Leak Identities

Author names and usernames may identify internal users, and software names may reveal the tools they work with.

Bulk Analysis Saves Time

Scanning many PDFs manually would be slow and inefficient.

Automation Is Powerful

Simple scripts can transform repetitive manual work into fast intelligence gathering.


Use Cases

Scenario        | Benefit
OSINT           | Collect public document intelligence
Forensics       | Review origins of files
Corporate Audit | Detect metadata leaks
Privacy Review  | Identify removable sensitive data
Red Team        | Passive reconnaissance

Conclusion

The custom Python PDF Metadata Extractor successfully demonstrates how hidden metadata can be collected automatically from PDF files.

It supports:

  • single or bulk analysis
  • dual-library extraction
  • CSV reporting
  • readable timestamps
  • clean CLI workflow

This project provided valuable insight into metadata-based intelligence gathering and practical document security analysis.


This documentation is for educational purposes only.

Only analyze files you own or are explicitly authorized to inspect.