What PDF metadata actually contains
A PDF can carry two parallel metadata stores: the legacy Info dictionary (Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate) and the modern XMP packet, an XML block that mirrors and extends the Info dictionary. Most PDF cleaners only wipe one. We wipe both.
The Author field in a PDF is set by the application that created the file. If you created the PDF in Word, it is usually the user name stored in your Word profile. If you exported from Photoshop or InDesign, it may be the name associated with your Adobe account or entered in the document's File Info. This is one of the most common sources of accidental identity exposure in shared documents.
What our tool removes from PDFs
- Info dictionary: Title, Author, Subject, Keywords, Creator, Producer, CreationDate, ModDate
- XMP packet: the parallel XML-based metadata stream
- Document and Instance IDs: tracking identifiers that follow the file across revisions
- Embedded image EXIF: photos placed inside the PDF retain their own metadata; we scrub them too (when using the Maximum Privacy preset)
What our tool preserves
We do not touch the visible content of the PDF. Text, images, fonts, page layout, form fields, and hyperlinks all remain intact. Signature fields stay in place too, although rewriting the file invalidates any existing digital signature. The PDF will open and render exactly as before, just without the identifying metadata.
If you need to redact visible content (names, account numbers, locations written into the document), that requires a separate tool. Metadata removal only handles the hidden descriptive data.
How to remove PDF metadata with this tool
- Drop your PDF in the box above. It stays on your device — your browser parses it locally.
- Review the inspector. We list every Info dictionary field and XMP property currently in the file. You see exactly what's about to be removed.
- Pick a preset. "Privacy" wipes all personal and date data. "Web Publishing" keeps copyright. "Legal Filing" preserves Title and Subject while wiping authorship. "Maximum Privacy" also strips embedded image EXIF.
- Download the cleaned PDF. The file is re-serialized cleanly so no metadata can be recovered via incremental update inspection.
- (Optional) Download the audit report. An HTML report listing every field that was removed, with SHA-256 hashes of the before and after files.
How a PDF stores metadata, structurally
To understand what we remove, it helps to understand how a PDF is built. A PDF file is
a collection of numbered objects — dictionaries, streams, arrays, and
primitives — tied together by a cross-reference table (the xref) that records
the byte offset of every object. At the end of the file, a trailer points
to two key objects: the document catalog (the /Root) and the information
dictionary (the /Info).
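As a rough illustration of that layout, the sketch below pulls the /Root and /Info references out of a trailer. It assumes a classic `trailer << ... >>` block with no nested dictionaries; files that keep their cross-reference data in an xref stream need a real PDF parser. The function name `trailer_refs` and the sample bytes are made up for this example.

```python
import re

# Matches an indirect reference such as "/Info 1 0 R" inside a classic trailer.
_REF = re.compile(rb"/(Root|Info)\s+(\d+)\s+(\d+)\s+R")

def trailer_refs(raw: bytes) -> dict[str, tuple[int, int]]:
    """Return {'Root': (obj, gen), 'Info': (obj, gen)} from the last trailer.

    Assumption: the file has a plain 'trailer << ... >>' block. This is a
    byte-level heuristic, not a PDF parser.
    """
    start = raw.rfind(b"trailer")
    if start == -1:
        return {}
    end = raw.find(b">>", start)
    block = raw[start:end if end != -1 else len(raw)]
    return {
        m.group(1).decode("ascii"): (int(m.group(2)), int(m.group(3)))
        for m in _REF.finditer(block)
    }

# Hypothetical tail of a small PDF file.
sample = b"""xref
0 3
trailer
<< /Size 3 /Root 2 0 R /Info 1 0 R >>
startxref
1234
%%EOF
"""
print(trailer_refs(sample))
```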
The Info dictionary is the classic metadata store. In raw PDF syntax it looks like this:
1 0 obj
<< /Title (Q3 Financial Review)
/Author (jane.doe)
/Creator (Microsoft Word)
/Producer (Acrobat Distiller 23.0)
/CreationDate (D:20250914103205+01'00')
/ModDate (D:20250914170412+01'00') >>
endobj
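To show how readable this object is, here is a minimal sketch that extracts the key/value pairs from the raw text above. It assumes every value is a plain literal string with no nested or escaped parentheses; real files may also use hex strings or UTF-16 text, which this does not handle. `parse_info_dict` is a name chosen for the example.

```python
import re

# Simplifying assumption: values are literal strings "(...)" that contain
# no nested or escaped parentheses.
_ENTRY = re.compile(r"/(\w+)\s*\(([^()]*)\)")

def parse_info_dict(obj_text: str) -> dict[str, str]:
    """Map each Info dictionary key to its literal string value."""
    return {key: value for key, value in _ENTRY.findall(obj_text)}

raw_obj = """1 0 obj
<< /Title (Q3 Financial Review)
/Author (jane.doe)
/Creator (Microsoft Word)
/Producer (Acrobat Distiller 23.0)
/CreationDate (D:20250914103205+01'00')
/ModDate (D:20250914170412+01'00') >>
endobj"""

info = parse_info_dict(raw_obj)
print(info["Author"])
print(sorted(info))
```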
The modern parallel is the XMP metadata stream, an XML packet stored as
its own object and referenced from the catalog under /Metadata. It uses the
RDF/Dublin Core vocabulary and can duplicate everything in the Info dictionary plus
derivation history, document IDs, and tool-specific extensions. Because the two stores
exist in parallel, a cleaner that wipes only one leaves the other fully readable. Our tool
clears both: it empties every Info dictionary key and removes the /Metadata
reference from the catalog so the XMP packet is no longer part of the document.
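The sketch below illustrates why clearing only the Info dictionary is not enough: it looks for XMP packets in the raw bytes and reads the first dc:creator entry from each. It assumes the XMP stream is stored uncompressed, which is common but not guaranteed, and it uses regular expressions rather than a proper XML parser. The function name and sample packet are illustrative.

```python
import re

# XMP packets are wrapped in <?xpacket begin=... ?> ... <?xpacket end=... ?>.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)
# First <rdf:li> inside <dc:creator>, the Dublin Core author property.
_CREATOR = re.compile(rb"<dc:creator>.*?<rdf:li[^>]*>(.*?)</rdf:li>", re.DOTALL)

def find_xmp_authors(raw: bytes) -> list[str]:
    """Return the first dc:creator value of every XMP packet found in raw."""
    authors = []
    for packet in _XPACKET.findall(raw):
        authors += [m.decode("utf-8", "replace") for m in _CREATOR.findall(packet)]
    return authors

# Hypothetical file whose Info dictionary was emptied but whose XMP was not.
sample = (
    b"1 0 obj\n<< /Author () >>\nendobj\n"
    b"<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?>"
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'><rdf:RDF><rdf:Description>"
    b"<dc:creator><rdf:Seq><rdf:li>jane.doe</rdf:li></rdf:Seq></dc:creator>"
    b"</rdf:Description></rdf:RDF></x:xmpmeta>"
    b"<?xpacket end='w'?>"
)
print(find_xmp_authors(sample))
```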
The byte-accounting: why the file size changes
When metadata is removed, the file shrinks by roughly the combined size of the cleared fields plus the structural overhead that referenced them. You can express the cleaned size as:
cleaned_size = original_size − Σ(field_bytesᵢ) − xref_overhead + rebuild_padding
Here Σ(field_bytesᵢ) is the sum of the byte lengths of every removed
metadata field, xref_overhead is the size of the cross-reference entries that pointed to
now-deleted objects, and rebuild_padding is a small signed correction: a
re-serialized PDF re-numbers and re-packs its object table, which can add or remove a few
bytes. Because the document is rewritten from a clean object graph rather than
appended to, old metadata cannot be recovered by reading earlier file revisions —
a recovery technique that works against tools which merely append an "update" that hides
the old values without deleting them.
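The formula can be checked with a few lines of arithmetic. All numbers below are hypothetical and chosen only to show the calculation; they are not measurements from the tool.

```python
def cleaned_size(original_size: int, field_bytes: list[int],
                 xref_overhead: int, rebuild_padding: int = 0) -> int:
    """Apply the size formula above. rebuild_padding may be negative."""
    return original_size - sum(field_bytes) - xref_overhead + rebuild_padding

# Hypothetical inputs: three removed string values, 40 bytes of
# cross-reference entries, and 12 bytes added by re-packing.
fields = [len(b"Q3 Financial Review"), len(b"jane.doe"), len(b"Microsoft Word")]
print(cleaned_size(250_000, fields, xref_overhead=40, rebuild_padding=12))
```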
How the integrity check works (SHA-256)
Every cleaned file in our tool is fingerprinted with the SHA-256 cryptographic hash function, both before and after cleaning. SHA-256 maps an input of any length to a fixed 256-bit (32-byte) output, conventionally written as 64 hexadecimal characters. It has two properties that make it ideal for an audit trail:
- Determinism: the same input always produces the same hash, so anyone can re-compute the hash of your cleaned file and confirm it matches the value in the audit report.
- Avalanche effect: changing a single bit of the input changes roughly half the output bits, so the "before" and "after" hashes will look completely unrelated — visible proof that the file genuinely changed.
The chance that two specific, different files share a SHA-256 hash is
about 1 in 2²⁵⁶, and even a brute-force search for any colliding pair takes on the order of
2¹²⁸ hash evaluations, an effort treated as computationally infeasible. That is why the hash
functions as a tamper-evident seal: if the report's hash matches your file, the file is
exactly the one that was cleaned.
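Both properties are easy to observe with Python's standard `hashlib` module. This is a small demonstration, not the tool's own code: the two inputs differ by a single bit ("e" and "a" differ only in one bit of their ASCII codes), yet roughly half of the 256 output bits change.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the two SHA-256 digests differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")

before = b"/Author (jane.doe)"
after = b"/Author (jane.doa)"  # final 'e' changed to 'a': a one-bit change

print(sha256_hex(before))
print(sha256_hex(after))
print(differing_bits(before, after), "of 256 bits differ")
```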
When metadata removal is not enough
Removing metadata addresses the hidden, descriptive layer of a document. It does not touch the visible content. Three common situations call for more than metadata removal:
- Visible identifying text. If a name, address, or account number is written into the body of the PDF, it remains after metadata removal. You must redact or delete that text separately.
- Faux-redaction. Some tools draw black rectangles over text without removing the underlying characters. The text stays selectable and copyable beneath the box. True redaction deletes the content, not just hides it.
- Embedded files and attachments. A PDF can carry attached files, each with its own metadata. Our tool focuses on the document's own metadata; attached files should be cleaned individually.
A worked example: before and after
The clearest way to see what the tool does is to look at the raw Info dictionary of a real-world PDF (say, a quarterly report exported from Word) and the same object after cleaning. The first block is what an attacker reads with a text editor; the second is what remains after the Privacy preset runs.
Before — exposed
1 0 obj
<< /Title (Q3 Board Pack v7 FINAL)
/Author (m.okafor)
/Subject (Internal — do not circulate)
/Keywords (layoffs, restructure, 2025)
/Creator (Microsoft Word 2024)
/Producer (Acrobat Distiller 23.0)
/CreationDate (D:20250903081544+01'00')
/ModDate (D:20250914170412+01'00') >>
endobj
After — cleaned
1 0 obj
<< /Title ()
/Author ()
/Subject ()
/Keywords ()
/Creator ()
/Producer ()
/CreationDate (D:19700101000000Z)
/ModDate (D:19700101000000Z) >>
endobj
Notice how much that "before" block gives away that has nothing to do with the visible
document: the author's username, an internal classification note, keywords naming a
sensitive project, the exact authoring software, and a precise timeline of when the file
was created and last touched. The /Metadata XMP stream (not shown) duplicates
most of this in XML and is removed from the catalog at the same time. After cleaning, every
field is emptied and the dates are zeroed to the Unix epoch, so nothing identifying
remains.
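The date strings in these blocks follow the PDF date format (D:YYYYMMDDHHmmSS plus a time-zone offset). As a sketch of how much timeline they expose, the function below parses the full-precision form into a Python `datetime`. It does not handle the shorter forms the format allows, and treating a missing zone as UTC is a simplification made for this example.

```python
import re
from datetime import datetime, timedelta, timezone

_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?"
)

def parse_pdf_date(value: str) -> datetime:
    """Parse a full-precision PDF date such as D:20250903081544+01'00'."""
    m = _PDF_DATE.fullmatch(value)
    if m is None:
        raise ValueError(f"unsupported PDF date: {value!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    if m.group(8):  # explicit +HH'mm' or -HH'mm' offset
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10)))
        if m.group(8) == "-":
            offset = -offset
        tz = timezone(offset)
    else:  # 'Z' or no zone given: treated as UTC in this sketch
        tz = timezone.utc
    return datetime(year, month, day, hour, minute, second, tzinfo=tz)

created = parse_pdf_date("D:20250903081544+01'00'")
modified = parse_pdf_date("D:20250914170412+01'00'")
print("editing window:", modified - created)
print("after cleaning:", parse_pdf_date("D:19700101000000Z").isoformat())
```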
Complete PDF metadata field reference
This table lists every field the tool inspects in a PDF, what each one can reveal about you or your organization, and how the Privacy, Legal Filing, and Maximum Privacy presets treat it. "Removed" means the field is emptied (dates are reset to the Unix epoch, and the XMP stream is detached from the catalog); "Kept" means it is preserved because removing it would break rendering or because a preset deliberately retains it.
| Field | What it reveals | Privacy | Legal Filing | Max Privacy |
|---|---|---|---|---|
| /Author | The OS username or full name of whoever created the file | Removed | Removed | Removed |
| /Title | Document title, often an internal working name | Removed | Kept | Removed |
| /Subject | Description or classification note | Removed | Kept | Removed |
| /Keywords | Tags, often naming projects or clients | Removed | Removed | Removed |
| /Creator | The application that authored the content (e.g. Word) | Removed | Removed | Removed |
| /Producer | The library that wrote the PDF (e.g. Distiller) | Removed | Removed | Removed |
| /CreationDate | Exact timestamp the file was created | Removed | Removed | Removed |
| /ModDate | Exact timestamp of last modification | Removed | Removed | Removed |
| /Metadata (XMP) | XML packet duplicating the above plus edit history and document IDs | Removed | Removed | Removed |
| Document / Instance ID | Unique identifiers that link file revisions together | Removed | Removed | Removed |
| Embedded image EXIF | GPS and camera data inside pictures placed in the PDF | Kept | Kept | Removed |
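The table can be restated as a small lookup structure. The sketch below is only one way to encode it, with names (`ALL_FIELDS`, `KEPT`, `fields_to_remove`) invented for the example; it is not how the tool is implemented, and it covers only the three presets shown in the table.

```python
# Every row of the table above, in order.
ALL_FIELDS = [
    "/Author", "/Title", "/Subject", "/Keywords", "/Creator", "/Producer",
    "/CreationDate", "/ModDate", "/Metadata (XMP)",
    "Document / Instance ID", "Embedded image EXIF",
]

# Fields each preset keeps; everything else is removed.
KEPT = {
    "Privacy": {"Embedded image EXIF"},
    "Legal Filing": {"/Title", "/Subject", "Embedded image EXIF"},
    "Max Privacy": set(),
}

def fields_to_remove(preset: str) -> list[str]:
    """Return the fields a preset removes, in table order."""
    kept = KEPT[preset]
    return [field for field in ALL_FIELDS if field not in kept]

for name in KEPT:
    print(name, "removes", len(fields_to_remove(name)), "of", len(ALL_FIELDS))
```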
How this has actually burned people
The "redacted" report that wasn't
Government agencies have repeatedly published PDFs where the visible text was blacked out but the metadata was left intact. In several well-documented cases, journalists opened the file properties and found the author's name, the originating department, and revision timestamps that contradicted the official account of when a document existed.
The lesson is that redaction and metadata removal are two separate jobs. Covering text on the page does nothing to the Info dictionary or XMP packet sitting in the file's structure.
The whistleblower unmasked by /Author
Someone submits a sensitive document anonymously, having carefully removed their name
from the body text. But the PDF was exported from their personal copy of Word, so the
/Author field still carries their account name and the /Producer
field narrows down their software environment. A single glance at document properties
undoes all the care taken with the visible content.
Anyone sharing a document where authorship must stay private should treat metadata removal as mandatory, not optional.
The proposal that revealed the whole timeline
A vendor sends a polished proposal PDF. The recipient checks the metadata and sees a
/CreationDate from the morning of the deadline and a /ModDate
fifteen minutes before sending — revealing the proposal was rushed. In other cases,
keywords and titles have exposed that the same document was reused across multiple
competing clients.
How to verify the file is clean
You do not have to take any tool's word for it. After cleaning, you can confirm the metadata is gone using software you already have:
- On any OS: open the cleaned PDF in your PDF reader and check File → Properties (or Document Properties). The Author, Title, and other fields should be blank.
- Adobe Acrobat: File → Properties → Description tab shows the Info dictionary; the Additional Metadata button reveals the XMP packet.
- Command line (if you have ExifTool installed): run the command exiftool cleaned.pdf, then confirm only structural fields remain.
- Quick text check: opening the raw PDF in a plain-text editor and searching for your name or username should return nothing.
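The quick text check can also be scripted. The sketch below searches the raw bytes for each name in both UTF-8 and UTF-16BE, since PDF text strings may use either. It cannot see text inside compressed streams, so an empty result is a useful signal rather than proof. `find_leaks` is a hypothetical helper name.

```python
def find_leaks(raw: bytes, needles: list[str]) -> dict[str, list[int]]:
    """Return the byte offsets where each needle appears in the raw file.

    Checks UTF-8 and UTF-16BE encodings. Limitation: text stored inside
    compressed streams is not visible to this scan.
    """
    hits: dict[str, list[int]] = {}
    for needle in needles:
        offsets = []
        for encoded in (needle.encode("utf-8"), needle.encode("utf-16-be")):
            pos = raw.find(encoded)
            while pos != -1:
                offsets.append(pos)
                pos = raw.find(encoded, pos + 1)
        if offsets:
            hits[needle] = sorted(offsets)
    return hits

# Hypothetical fragments standing in for a file before and after cleaning.
dirty = b"<< /Author (jane.doe) >>"
clean = b"<< /Author () >>"
print(find_leaks(dirty, ["jane.doe"]))
print(find_leaks(clean, ["jane.doe"]))
```

With a real file, the bytes would come from something like `open("cleaned.pdf", "rb").read()`.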
The audit report the tool generates also records a SHA-256 hash of the cleaned file, so you can prove the file you are sharing is exactly the one that was cleaned.
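Checking a file against the reported hash takes only the standard library. The sketch below assumes you have copied the 64-character hash out of the audit report by hand; how the report lays that value out is not covered here, and the function names are illustrative.

```python
import hashlib
import hmac

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_report(path: str, reported_hex: str) -> bool:
    """Compare the file's hash with the value copied from the report."""
    expected = reported_hex.strip().lower()
    return hmac.compare_digest(file_sha256(path), expected)

# Example call (file name and hash are placeholders):
# matches_report("cleaned.pdf", "<hash copied from the audit report>")
```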
Frequently asked questions
Will removing metadata break my PDF or change how it looks?
No. The page content, fonts, images, form fields, and layout are untouched. Only metadata is cleared: the Info dictionary, the XMP packet, the document IDs and, on the Maximum Privacy preset, embedded image EXIF. The document opens and renders identically.
Does this work on password-protected or encrypted PDFs?
If a PDF is encrypted, you will generally need to supply the password (or remove the encryption) before metadata can be rewritten, because the metadata objects themselves are encrypted. For owner-password-protected files that still open without a password, results vary by how the file was secured.
Can the original author be recovered after cleaning?
Because the file is rebuilt from a clean object graph rather than appended to, the cleared Info and XMP values are not retained anywhere in the output. There is no earlier "revision" inside the file to recover them from.
Is there a file-size limit?
Processing happens in your browser using your device's memory, so the practical limit is your available RAM rather than a server cap. Files up to several hundred megabytes process comfortably on a typical laptop.
Does it remove metadata from images embedded inside the PDF?
On the Maximum Privacy preset, yes — EXIF and GPS data inside pictures placed in the PDF are stripped. On the default Privacy preset, embedded image data is left alone so the operation stays fast; switch presets if you need it removed.
What is the difference between the Info dictionary and XMP?
They are two parallel metadata stores. The Info dictionary is the original PDF mechanism (simple key-value pairs); XMP is a newer XML-based packet that can hold the same data plus edit history and identifiers. A cleaner that wipes only one leaves the other readable, which is why this tool removes both.
Does cleaning remove digital signatures?
If a PDF is digitally signed, any change to the file — including metadata removal — will invalidate the signature, because the signature covers the file's bytes. If you need the signature intact, clean the document before it is signed, not after.
Will it strip text I can see on the page?
No. This tool only removes hidden metadata. Visible content — including any names, addresses, or numbers written into the document body — remains. Removing visible content is a separate task called redaction.
Is anything uploaded to a server?
No. The entire process runs in your browser using JavaScript and the pdf-lib library. Your file never leaves your device, which is the whole point of the tool.
Can I clean several PDFs at once?
Yes. Drop multiple files and each is processed locally, then returned individually or as a single ZIP archive accompanied by one audit report covering the whole batch.