Why Word documents leak so much
A .docx file isn't really one file — it's a ZIP archive containing dozens of XML documents and any embedded images. Microsoft's "Inspect Document" feature operates at the Word application layer, which means it cleans some of the well-known fields and leaves quite a lot of others behind.
Our tool unzips the archive, parses the XML directly, and scrubs metadata at every layer where it hides — including the layers Word doesn't expose to its own inspector.
What our tool removes from DOCX files
- Core properties (docProps/core.xml): creator, lastModifiedBy, lastPrinted, created, modified, revision, title, subject, description, keywords, category
- App properties (docProps/app.xml): Company, Manager, Template, TotalTime, Application, AppVersion, page/word/character counts
- Custom properties (docProps/custom.xml): CRM, DMS, and template-injected custom XML properties (the most overlooked metadata category in enterprise documents)
- Tracked changes: w:ins (insertions accepted), w:del (deletions removed), w:moveFrom / w:moveTo, w:rPrChange, w:pPrChange
- Comments: word/comments.xml, commentsExtended.xml, commentsIds.xml, and threaded discussion files
- Embedded image EXIF: every JPEG inside word/media/ gets its EXIF, IPTC, and XMP stripped
What Document Inspector misses
Microsoft Word's built-in Document Inspector is a good first pass but has known gaps:
- Embedded image metadata: Document Inspector does not strip EXIF from images inside the document. Our tool does.
- Custom XML parts: many CRM systems and document management platforms inject custom XML that survives Inspector.
- Template paths: the Template field in app.xml often contains a full filesystem path revealing your organization's network structure.
- Comment threads: Inspector usually catches comments, but in some configurations the extended comment files (containing reply threads) are left behind.
How to use this tool
- Drop your .docx file in the box above
- Review the inspector — see every field that's about to be removed
- Pick a preset (Privacy is the default; Maximum Privacy also wipes embedded EXIF)
- Download the cleaned file
- Optionally download the audit report with SHA-256 verification
Important: This tool removes metadata. It does not redact visible text. If sensitive information is written into the body of the document, you'll need to remove or redact it separately before sharing.
The anatomy of a .docx file
A modern Word document is not a single binary blob — it is a ZIP archive (the Office
Open XML, or OOXML, format) containing a small filesystem of XML parts. If you rename a
.docx to .zip and open it, you will find a structure like this:
my-document.docx
├── [Content_Types].xml
├── _rels/
├── docProps/
│   ├── core.xml      ← author, dates, revision, title
│   ├── app.xml       ← company, template, editing time
│   └── custom.xml    ← CRM / DMS injected properties
└── word/
    ├── document.xml  ← the actual text + tracked changes
    ├── comments.xml  ← reviewer comments
    └── media/        ← embedded images (with their own EXIF)
Metadata is spread across several of these parts, which is exactly why a single
"remove properties" action in Word does not catch all of it. Our cleaner unzips the
archive in your browser, rewrites the relevant XML parts, scrubs the document body of
revision markup, strips EXIF from any images in word/media/, then repacks the
archive.
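The layout above can be inspected with a few lines of code. The following is a minimal Python sketch using only the standard library; it is not the cleaner's own implementation, which runs in the browser. The helper names `build_sample_docx` and `list_docx_parts` are hypothetical, and the sample archive is a stand-in with the same layout rather than a valid Word file.

```python
import io
import zipfile

# Path prefixes that, per the structure above, tend to carry metadata.
METADATA_PREFIXES = ("docProps/", "word/media/", "word/comments")


def list_docx_parts(docx_bytes: bytes) -> list[str]:
    """Return the name of every part stored in a .docx (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return sorted(archive.namelist())


def build_sample_docx() -> bytes:
    """Build a tiny stand-in archive with a .docx-like layout (illustrative only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("docProps/core.xml", "<coreProperties/>")
        archive.writestr("docProps/app.xml", "<Properties/>")
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("word/media/image1.jpeg", b"\xff\xd8\xff\xd9")
    return buffer.getvalue()


sample = build_sample_docx()
for name in list_docx_parts(sample):
    note = "metadata-bearing" if name.startswith(METADATA_PREFIXES) else ""
    print(f"{name:28} {note}")
```

To try it on a real document, read the file with `open(path, "rb").read()` and pass the bytes to `list_docx_parts`.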
How tracked changes are stored — and why "Accept All" is not enough
When change tracking is on, Word does not simply edit the text. It wraps every change in
markup. An inserted phrase is stored inside a <w:ins> element and a
deleted phrase inside a <w:del> element, each carrying an author name and
a timestamp:
<w:ins w:author="jane.doe" w:date="2025-09-14T10:32:00Z">
  <w:r><w:t>confidential figure</w:t></w:r>
</w:ins>
Clicking "Accept All Changes" resolves the visible text, but depending on configuration
the document can retain related revision metadata — formatting-change records
(<w:rPrChange>, <w:pPrChange>), move operations, and
the author/date attributes themselves. Our cleaner explicitly walks
word/document.xml and removes every revision element: it keeps the content of
insertions (treating them as accepted), discards deletions, and deletes the change-tracking
attributes entirely so no author or timestamp survives.
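The walk described above can be sketched in Python with the standard library's ElementTree. This is an illustrative approximation, not the cleaner's actual code: the function name `accept_revisions` is hypothetical, treating w:moveTo like an insertion and w:moveFrom like a deletion is an assumption, and the sketch ignores range markers and other revision elements a production tool would also need to handle.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)  # keep the familiar "w:" prefix on output


def _tag(local_name: str) -> str:
    """Build the namespace-qualified tag name ElementTree uses."""
    return f"{{{W}}}{local_name}"


# Wrappers whose children are kept (treated as accepted).
UNWRAP = {_tag("ins"), _tag("moveTo")}
# Elements discarded together with everything inside them.
DROP = {_tag("del"), _tag("moveFrom"), _tag("rPrChange"), _tag("pPrChange")}


def _clean(parent: ET.Element) -> None:
    """Recursively rebuild the child list without revision markup."""
    kept = []
    for child in list(parent):
        _clean(child)  # clean nested content first
        if child.tag in DROP:
            continue  # deletion or formatting history: discard
        if child.tag in UNWRAP:
            kept.extend(list(child))  # keep the content, lose author/date
        else:
            kept.append(child)
    parent[:] = kept


def accept_revisions(xml_text: str) -> str:
    """Return the XML with tracked-change elements resolved and removed."""
    root = ET.fromstring(xml_text)
    _clean(root)
    return ET.tostring(root, encoding="unicode")


sample = (
    f'<w:p xmlns:w="{W}">'
    '<w:ins w:author="jane.doe" w:date="2025-09-14T10:32:00Z">'
    "<w:r><w:t>confidential figure</w:t></w:r></w:ins>"
    '<w:del w:author="jane.doe" w:date="2025-09-14T10:33:00Z">'
    "<w:r><w:delText>old figure</w:delText></w:r></w:del>"
    '<w:r><w:rPr><w:b/><w:rPrChange w:author="sam.lee" w:date="2025-09-14T10:34:00Z">'
    "<w:rPr/></w:rPrChange></w:rPr><w:t>kept text</w:t></w:r>"
    "</w:p>"
)
cleaned = accept_revisions(sample)
print(cleaned)
```

Because the author and date live as attributes on the revision elements themselves, removing or unwrapping those elements also removes the attributes.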
The verification math behind the audit report
After cleaning, the tool computes a SHA-256 hash of both the original and the cleaned
file. SHA-256 reduces a file of any size to a fixed 256-bit fingerprint, written as 64
hexadecimal characters. Because the function exhibits the avalanche effect (flipping
one input bit flips about half the output bits), the before and after hashes look
entirely unrelated even when only a few metadata bytes changed. For any two distinct
files, the chance that they share a hash is about 1 / 2²⁵⁶, and even a deliberate search
for a colliding pair needs on the order of 2¹²⁸ hash computations, which is treated as
computationally infeasible, so a matching hash reliably identifies a specific file.
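The fingerprint length and the avalanche effect are easy to observe directly. The sketch below uses Python's standard `hashlib`; the input bytes are arbitrary, and the helper names `sha256_hex` and `differing_bits` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


def differing_bits(a_hex: str, b_hex: str) -> int:
    """Count the bit positions at which two hex digests differ."""
    return bin(int(a_hex, 16) ^ int(b_hex, 16)).count("1")


original = b"Settlement draft"
# Flip the lowest bit of the first byte; everything else is identical.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

h1 = sha256_hex(original)
h2 = sha256_hex(flipped)
print(h1)
print(h2)
print(differing_bits(h1, h2), "of 256 bits differ")
```

The count printed on the last line should land near 128, which is the "about half the output bits" behaviour described above.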
A worked example: before and after
A .docx stores its core metadata in docProps/core.xml. Here is
what that file looks like in a typical document straight out of Word, and what remains after
the Privacy preset runs. The left side is what anyone can read by unzipping your document;
the right side is what survives.
Before — core.xml exposed
<cp:coreProperties>
  <dc:creator>Ama Mensah</dc:creator>
  <cp:lastModifiedBy>legal-review</cp:lastModifiedBy>
  <dcterms:created>2025-08-30T09:14:00Z</dcterms:created>
  <dcterms:modified>2025-09-14T17:04:00Z</dcterms:modified>
  <cp:revision>47</cp:revision>
  <dc:title>Settlement draft — confidential</dc:title>
</cp:coreProperties>
After — cleaned
<cp:coreProperties>
  <dc:creator></dc:creator>
  <cp:lastModifiedBy></cp:lastModifiedBy>
  <dcterms:created>2000-01-01T00:00:00Z</dcterms:created>
  <dcterms:modified>2000-01-01T00:00:00Z</dcterms:modified>
</cp:coreProperties>
The "before" version names the original author, the account that performed the legal review, the full creation and modification timeline, a revision count revealing the document went through 47 saves, and a title marking it confidential. The revision count alone can be telling: 47 revisions on a one-page letter signals it was heavily negotiated. After cleaning, the creator and editor are blank, the timestamps are reset to a neutral placeholder date, and the revision count and title are gone.
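The before/after transformation can be approximated in a short Python sketch. It is an assumption-laden illustration rather than the cleaner's code: the function name `scrub_core_properties` and the split into blanked, reset, and removed fields are inferred from the example above, and the sample adds the namespace declarations that a real core.xml carries so that it parses.

```python
import xml.etree.ElementTree as ET

NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # preserve prefixes when serializing

PLACEHOLDER = "2000-01-01T00:00:00Z"
BLANK = ("dc:creator", "cp:lastModifiedBy")        # element kept, text emptied
RESET = ("dcterms:created", "dcterms:modified")    # text set to the placeholder
REMOVE = ("cp:revision", "dc:title", "dc:subject", "dc:description",
          "cp:keywords", "cp:category", "cp:lastPrinted")


def scrub_core_properties(xml_text: str) -> str:
    """Blank, reset, or remove identifying fields in a core.xml document."""
    root = ET.fromstring(xml_text)
    for path in BLANK:
        for element in root.findall(path, NS):
            element.text = None
    for path in RESET:
        for element in root.findall(path, NS):
            element.text = PLACEHOLDER
    for path in REMOVE:
        for element in root.findall(path, NS):
            root.remove(element)
    return ET.tostring(root, encoding="unicode")


sample = f"""<cp:coreProperties xmlns:cp="{NS['cp']}" xmlns:dc="{NS['dc']}" xmlns:dcterms="{NS['dcterms']}">
  <dc:creator>Ama Mensah</dc:creator>
  <cp:lastModifiedBy>legal-review</cp:lastModifiedBy>
  <dcterms:created>2025-08-30T09:14:00Z</dcterms:created>
  <dcterms:modified>2025-09-14T17:04:00Z</dcterms:modified>
  <cp:revision>47</cp:revision>
  <dc:title>Settlement draft (confidential)</dc:title>
</cp:coreProperties>"""

cleaned = scrub_core_properties(sample)
print(cleaned)
```

ElementTree writes an emptied element in self-closing form (`<dc:creator />`), which is equivalent XML to the empty open/close pair shown in the example.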
Complete Word metadata field reference
Word documents scatter metadata across several parts inside the ZIP. This table covers each location, what it exposes, and how the cleaner treats it.
| Location / field | What it reveals | Action |
|---|---|---|
| core.xml · creator | Original author's name or username | Removed |
| core.xml · lastModifiedBy | Who last saved the file | Removed |
| core.xml · created / modified | Exact creation and edit timestamps | Reset |
| core.xml · revision | Number of times the document was saved | Removed |
| core.xml · title / subject / keywords | Internal naming and tags | Removed |
| app.xml · Company | Organization name from the Office license | Removed |
| app.xml · Manager | Manager name if set in template | Removed |
| app.xml · Template | Path to the template, often a network share | Removed |
| app.xml · TotalTime | Cumulative minutes spent editing | Removed |
| custom.xml | CRM / DMS injected properties (matter IDs, client codes) | Deleted |
| document.xml · w:ins / w:del | Tracked insertions and deletions with author + date | Removed |
| document.xml · rPrChange / pPrChange | Formatting-change history | Removed |
| comments.xml | Reviewer comments with names and timestamps | Deleted |
| word/media/ | EXIF/GPS inside embedded photos | Stripped |
| Document body text | Visible content you typed | Kept |
How this has actually burned people
The settlement offer that revealed the floor
A law firm sends a counterparty a Word document with "Accept All Changes" applied to the visible text. But the file still contains tracked-change history showing earlier, lower settlement figures that were edited upward before sending. Opposing counsel recovers the deleted numbers from the XML and learns exactly how much room there is to negotiate.
This is the single most common way Word metadata causes real damage: "Accept All" makes the page look clean while leaving the negotiation history inside the file structure.
The agency name in the client's report
A consultancy delivers a strategy document under the client's logo. The
Company field in app.xml, populated from the consultancy's Office
installation, still carries the consultancy's name. When the client forwards the
document to their board, the board sees who really wrote it.
The leaked memo traced to one laptop
An internal memo is leaked to the press. Investigators unzip the document and read the
creator and lastModifiedBy fields, plus the Template
path pointing to a specific department's network folder. The metadata narrows the source
to a handful of people even though the visible text gives nothing away.
How Word's own "Inspect Document" compares
Microsoft Word includes a built-in tool at File → Info → Check for Issues → Inspect Document. It is genuinely useful and worth running, but it has real gaps that catch people out:
- It operates at the application layer, so it removes what Word's interface knows about, but it does not reliably strip EXIF from images embedded in word/media/.
- It can miss custom XML parts injected by third-party systems, which is exactly where enterprise tools store client and matter identifiers.
- It requires you to remember to run it every time, on every document, before every send: a manual step that is easy to skip under deadline pressure.
- The Template path in app.xml, which can expose your network structure, is not always cleared.
Our cleaner works on the file directly rather than through Word, so it catches the parts the application-level inspector leaves behind, and it does so the same way every time.
How to verify the file is clean
- In Word: open the cleaned file, go to File → Info, and confirm the Author, Last Modified By, and Company fields under Properties are blank.
- By unzipping: rename a copy of the file to .zip, open it, and inspect docProps/core.xml and docProps/app.xml in a text editor; the creator and company elements should be empty.
- Check for custom.xml: confirm docProps/custom.xml is no longer present in the archive.
- The audit report records a SHA-256 hash of the cleaned file as tamper-evident proof.
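The unzip-and-inspect checks above can also be scripted. The following Python sketch uses only the standard library and is meant as an illustration under stated assumptions: `find_leftovers` and `make_archive` are hypothetical helpers, only three fields are checked, and the two sample archives are minimal stand-ins rather than complete Word files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CP = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC = "http://purl.org/dc/elements/1.1/"
EP = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"

# (part name, qualified element name) pairs expected to be empty after cleaning.
MUST_BE_EMPTY = [
    ("docProps/core.xml", f"{{{DC}}}creator"),
    ("docProps/core.xml", f"{{{CP}}}lastModifiedBy"),
    ("docProps/app.xml", f"{{{EP}}}Company"),
]


def find_leftovers(docx_bytes: bytes) -> list[str]:
    """Return a description of each identifying field still present."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        names = set(archive.namelist())
        if "docProps/custom.xml" in names:
            problems.append("docProps/custom.xml is still present")
        for part, qname in MUST_BE_EMPTY:
            if part not in names:
                continue
            element = ET.fromstring(archive.read(part)).find(qname)
            if element is not None and (element.text or "").strip():
                local = qname.split("}")[1]
                problems.append(f"{part}: {local} still contains {element.text!r}")
    return problems


def make_archive(parts: dict[str, str]) -> bytes:
    """Pack named text parts into an in-memory ZIP (sample data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, body in parts.items():
            archive.writestr(name, body)
    return buffer.getvalue()


root_open = f'<cp:coreProperties xmlns:cp="{CP}" xmlns:dc="{DC}">'
dirty = make_archive({
    "docProps/core.xml": root_open + "<dc:creator>Ama Mensah</dc:creator></cp:coreProperties>",
    "docProps/custom.xml": "<Properties/>",
})
clean = make_archive({
    "docProps/core.xml": root_open + "<dc:creator></dc:creator></cp:coreProperties>",
})

print(find_leftovers(dirty))
print(find_leftovers(clean))
```

An empty list means none of the checked fields survived; it does not cover tracked changes, comments, or image metadata, which need their own checks.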
Frequently asked questions
Will cleaning change my document's formatting or content?
No. Text, styles, fonts, tables, and images are preserved. Only metadata, comments, and tracked-change records are removed. Accepted edits remain in the final text.
Does this remove comments as well as tracked changes?
Yes. The comment parts (comments.xml and the extended comment files that
store reply threads) are deleted, and the comment reference markers in the document body
are removed with them.
What is in custom.xml and why does it matter?
Enterprise systems — document management platforms, CRMs, contract tools — often
inject custom properties into docProps/custom.xml. These can include
internal matter numbers, client IDs, and workflow states. Word's own inspector does not
always remove them; our cleaner deletes the part entirely.
Are images inside the document cleaned too?
Yes. Photos pasted into a Word document keep their own EXIF, including any GPS data.
The cleaner strips EXIF from JPEGs stored in word/media/.
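For readers curious what stripping involves: a JPEG keeps EXIF and XMP in APP1 segments and IPTC in an APP13 segment, all stored ahead of the compressed image data. The Python sketch below drops those segments and copies everything else unchanged. It is a simplified illustration, not the cleaner's implementation: `strip_jpeg_metadata` is a hypothetical name, the parser assumes a well-formed file without padding bytes between segments, and the sample bytes are a synthetic stand-in rather than a viewable photo.

```python
import struct

SOI = b"\xff\xd8"             # start-of-image marker
SOS = 0xDA                    # start-of-scan: compressed image data follows
STRIP_MARKERS = {0xE1, 0xED}  # APP1 (EXIF, XMP) and APP13 (IPTC)


def strip_jpeg_metadata(data: bytes) -> bytes:
    """Return a copy of a JPEG with its APP1 and APP13 segments removed."""
    if data[:2] != SOI:
        raise ValueError("not a JPEG file")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("unexpected byte where a segment marker should be")
        marker = data[pos + 1]
        if marker == SOS:
            out += data[pos:]  # copy the scan data and trailer untouched
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        if marker not in STRIP_MARKERS:
            out += data[pos:end]
        pos = end
    raise ValueError("no start-of-scan marker found")


def _segment(marker: int, payload: bytes) -> bytes:
    """Build one marker segment (used only to make the sample below)."""
    return b"\xff" + bytes([marker]) + struct.pack(">H", len(payload) + 2) + payload


sample = (
    SOI
    + _segment(0xE0, b"JFIF\x00")                # APP0: kept
    + _segment(0xE1, b"Exif\x00\x00GPS-DATA")    # APP1: removed
    + _segment(0xDA, b"")                        # SOS header
    + b"scan-bytes"
    + b"\xff\xd9"                                # end-of-image marker
)
cleaned = strip_jpeg_metadata(sample)
print(len(sample), "->", len(cleaned), "bytes")
```

The image data after the start-of-scan marker is never decoded or recompressed, so removing metadata this way does not alter picture quality.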
Does "Accept All Changes" in Word do the same thing?
No. Accepting changes resolves the visible text but can leave revision metadata, formatting-change records, and author/date attributes inside the document XML. Our cleaner removes the revision markup itself, not just its visible effect.
Will the document still open normally in Word and Google Docs?
Yes. The cleaner rewrites the metadata XML and repackages the archive using standard
ZIP compression, producing a fully valid .docx that opens in Word, Google
Docs, LibreOffice, and Pages.
Does it work on .doc (the old format) too?
The old binary .doc format is handled by the legacy Office path, which
blanks the summary-information streams. For the best results, the modern
.docx format is recommended.
Are hidden text or fields removed?
Hidden text is content rather than metadata, so it is preserved. If your document contains hidden text you do not want to share, reveal and delete it in Word before cleaning.
Is anything uploaded to a server?
No. The document is unzipped, scrubbed, and repackaged entirely in your browser using the JSZip library. It never leaves your device.
Can I clean a batch of documents at once?
Yes. Drop multiple files and each is processed locally, then returned individually or as a single ZIP with one audit report for the whole set.