Why PDF Metadata Is The Most-Overlooked Privacy Leak

A PDF is a finished-looking thing. The text is locked into glyphs. The layout doesn't reflow. Selecting and copying text is sometimes deliberately difficult. So it feels like a sealed envelope.

Underneath that surface, a PDF you send can disclose:

Your name (the user account name on the machine that created it)
Your company name (if the document was created from a corporate template)
Your software stack (Word 2024, InDesign 2025, the exact Acrobat version)
File paths of linked or embedded files as they stood at save time (these can expose network drive and folder names)
Creation date and modification date down to the second
A unique document identifier that follows the file across revisions
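
To make "down to the second" concrete, here is a minimal sketch of reading the date syntax used by the CreationDate and ModDate fields, which looks like D:20240115103045+01'00'. The names PdfDate, PDF_DATE_PATTERN, and parsePdfDate are illustrative, not taken from any library, and the sketch only covers the common forms of the format.

```typescript
// Parsed form of a PDF date string such as D:20240115103045+01'00'.
interface PdfDate {
  year: number;
  month: number;
  day: number;
  hour: number;
  minute: number;
  second: number;
  utcOffset: string; // "Z", "+01:00", "-05:00", or "" when the writer omitted it
}

// D: prefix, 4-digit year, then optional 2-digit month, day, hour, minute,
// second, then an optional offset written as Z or as +HH'mm' / -HH'mm'.
const PDF_DATE_PATTERN =
  /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$/;

// Hypothetical helper: returns null when the string does not look like a PDF date.
function parsePdfDate(raw: string): PdfDate | null {
  const m = PDF_DATE_PATTERN.exec(raw.trim());
  if (m === null) return null;

  // Missing optional fields fall back to the defaults passed in below.
  const field = (text: string | undefined, fallback: number): number =>
    text === undefined ? fallback : parseInt(text, 10);

  let utcOffset = "";
  if (m[7] !== undefined) {
    utcOffset = "Z";
  } else if (m[8] !== undefined) {
    utcOffset = m[8] + m[9] + ":" + (m[10] !== undefined ? m[10] : "00");
  }

  return {
    year: field(m[1], 1),
    month: field(m[2], 1),
    day: field(m[3], 1),
    hour: field(m[4], 0),
    minute: field(m[5], 0),
    second: field(m[6], 0),
    utcOffset: utcOffset,
  };
}
```

Called on D:20240115103045+01'00', this returns 15 January 2024 at 10:30:45 with an offset of +01:00, which is enough to place the author in a time zone as well as at a moment.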

The two metadata stores

Many PDF cleaners scrub the Info dictionary and call it a day. But a PDF can carry two parallel metadata stores:

The Info dictionary — the original PDF 1.0 way to store metadata. Title, Author, Subject, Keywords, Creator (the app you authored in), Producer (the app that wrote the PDF), CreationDate, ModDate.
The XMP packet — an XML metadata stream that mirrors and extends the Info dictionary. Added in PDF 1.4 and now the default in most modern PDF writers. It can contain everything in the Info dictionary plus additional fields like document/instance IDs, derivation history (when a PDF was generated from another document), and tool-specific metadata.

Wipe one and leave the other, and most of the same fields are still there for an attacker to read.
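
A rough way to check both stores at once is to scan the raw file for each of them. The sketch below is a heuristic, not a PDF parser. It assumes the file bytes have already been decoded into a string (for example as Latin-1), and the names MetadataStores, STANDARD_INFO_KEYS, and findMetadataStores are made up for illustration. It will miss metadata stored inside compressed object streams, and it can report false positives, since bookmarks also use a /Title key.

```typescript
// What the scan found in the raw file text.
interface MetadataStores {
  infoKeys: string[];    // standard Info dictionary keys that appear with a string value
  hasXmpPacket: boolean; // true if an XMP packet header appears anywhere in the file
}

// The standard Info dictionary keys listed above.
const STANDARD_INFO_KEYS: string[] = [
  "Title", "Author", "Subject", "Keywords",
  "Creator", "Producer", "CreationDate", "ModDate",
];

// Hypothetical helper: heuristic scan of undecoded PDF text for both metadata stores.
function findMetadataStores(pdfText: string): MetadataStores {
  const infoKeys: string[] = [];
  for (const key of STANDARD_INFO_KEYS) {
    // Match the key name followed by the start of a literal "(" or hex "<" string.
    const pattern = new RegExp("/" + key + "\\s*[(<]");
    if (pattern.test(pdfText)) infoKeys.push(key);
  }
  // XMP packets begin with an <?xpacket begin=...?> processing instruction.
  const hasXmpPacket = pdfText.indexOf("<?xpacket begin=") !== -1;
  return { infoKeys: infoKeys, hasXmpPacket: hasXmpPacket };
}
```

If a "cleaned" file comes back with an empty infoKeys list but hasXmpPacket still true, the cleaner only handled one of the two stores.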

The famous failures

Government agencies have repeatedly released "redacted" PDFs where the redactions cover visible text but leave the metadata intact. Journalists can then read that metadata to identify who authored or last edited the document. Corporate legal departments make the same mistake with contracts.

The pattern is always similar: the lawyers redact the visible content (block out names, dollar amounts, location names), the redaction tool draws black rectangles over those regions, and the PDF is exported. Nobody checks whether the document's Info dictionary still says Author: Jane Smith, Senior Counsel.

What our PDF cleaner removes

For every PDF, we strip:

The complete Info dictionary (every key, including custom ones)
The XMP packet from the document catalog
Document IDs and instance IDs
(With the Maximum Privacy preset) EXIF data from images embedded in the PDF

The cleaned PDF is then re-saved in full via pdf-lib, so the old metadata cannot be recovered through incremental update inspection, a forensic technique that recovers earlier states of a PDF from the bytes preserved between revisions.
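
To illustrate why incremental updates matter, here is a minimal sketch of the inspection technique itself. It is not our cleaner's code. An incremental save appends a new section to the end of the file, and each section ends with its own %%EOF marker, so cutting the file at an earlier marker gives back an earlier state. The function names findRevisionEnds and revisionSnapshot are hypothetical, and the sketch assumes the file has been decoded into a string. Treat the count as a hint only: some files, such as linearized ones, can contain more than one marker without having been incrementally updated.

```typescript
// Return the offset just past each "%%EOF" marker in the file text.
function findRevisionEnds(pdfText: string): number[] {
  const marker = "%%EOF";
  const ends: number[] = [];
  let from = 0;
  while (true) {
    const at = pdfText.indexOf(marker, from);
    if (at === -1) break;
    ends.push(at + marker.length);
    from = at + marker.length;
  }
  return ends;
}

// Return the file as it stood after revision `index` (0 is the first save),
// or null if the file has no such revision.
function revisionSnapshot(pdfText: string, index: number): string | null {
  const ends = findRevisionEnds(pdfText);
  if (index < 0 || index >= ends.length) return null;
  return pdfText.slice(0, ends[index]);
}
```

If a tool "removes" the Author field by appending an update, revisionSnapshot(text, 0) still contains the original Info dictionary. Writing the whole file out fresh leaves no earlier section to cut back to.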

Limits to what cleaning fixes

Removing metadata does not remove visible content. If sensitive information is written into the body of the document, only redacting (or deleting and re-saving) that content will remove it. Some redaction tools draw black rectangles that look like they cover the text but merely sit on top of it: the text underneath is still selectable, copyable, and searchable. Always do a final inspection in a different PDF viewer before relying on a redacted PDF for high-stakes purposes.

Use our PDF metadata removal tool to strip metadata from your PDFs entirely in your browser — no upload, no signup.
