How Metadata Removal Works: The Algorithm for Each File Format

Removing metadata sounds simple — "delete the hidden data" — but the difficulty is that every file format hides that data in a different place and structure. A JPEG keeps it in labelled segments; a PNG in typed chunks; a HEIC in a separate item-location table that also points to the image itself. A safe cleaner has to understand each container well enough to remove the metadata without breaking the parts that make the file render or play.

This guide documents the actual algorithm used for each supported format. The recurring theme you will notice is structural editing, not re-encoding: in almost every case the right approach is to parse the container, drop or zero the metadata regions, fix up any size or offset fields, and write the file back — leaving the image or media payload byte-for-byte identical. That is what makes the operations lossless and fast.

The shared foundation: format detection by magic bytes

Before any format-specific logic runs, the engine identifies the format by reading the first several bytes — the "magic number" — rather than trusting the file extension, which can be wrong or spoofed. For example, a JPEG begins with the bytes FF D8 FF, a PNG with 89 50 4E 47, a PDF with %PDF, and a ZIP-based Office file with 50 4B 03 04. ISO-media files (MP4, MOV, HEIC) are identified by an ftyp box near the start, and the specific brand inside it distinguishes HEIC from MP4. Once the true format is known, the matching algorithm below is dispatched.

detectFormat(bytes):
  if bytes start with FF D8 FF        -> jpeg
  if bytes start with 89 50 4E 47     -> png
  if "RIFF"...."WEBP"                 -> webp
  if "RIFF"...."WAVE"                 -> wav
  if bytes start with "%PDF"          -> pdf
  if bytes start with 50 4B 03 04     -> zip-container (docx/xlsx/...)
  if "ftyp" box present               -> inspect brand: heic / mp4 / mov / m4a
  if "ID3" or MPEG frame sync         -> mp3
  ... and so on
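
As a rough illustration of the dispatch above, here is a minimal Python sketch. It is not the engine's own code: the function name, the returned labels, and the short list of ftyp brands are assumptions chosen for the example, and it covers only the formats listed in the pseudocode.

```python
def detect_format(data: bytes) -> str:
    """Guess the container format from leading magic bytes (illustrative subset)."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG"):
        return "png"
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data.startswith(b"%PDF"):
        return "pdf"
    if data.startswith(b"PK\x03\x04"):
        return "zip-container"
    if data[4:8] == b"ftyp":
        # The major brand follows the box type; this brand list is partial.
        brand = data[8:12]
        if brand in (b"heic", b"heix"):
            return "heic"
        if brand == b"qt  ":
            return "mov"
        if brand == b"M4A ":
            return "m4a"
        return "mp4"
    # MPEG audio frame sync: eleven set bits at the start of a frame header.
    if data.startswith(b"ID3") or (
        len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    ):
        return "mp3"
    return "unknown"
```

The JPEG test runs before the MP3 frame-sync test on purpose: both begin with FF, so the more specific signature has to win.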

JPEG — drop the metadata segments

A JPEG is a sequence of segments, each introduced by a marker byte FF followed by a marker type. Metadata lives in specific application segments: EXIF and XMP in APP1 (FF E1), IPTC/Photoshop in APP13 (FF ED), the ICC colour profile in APP2 (FF E2), and free-text comments in COM (FF FE). The algorithm walks segment by segment, copies through the structural ones, and discards the metadata ones until it reaches the Start-of-Scan marker, after which the compressed image data is copied verbatim.

cleanJPEG(bytes):
  output = [SOI marker]
  pos = 2
  while pos < length:
    marker = bytes[pos+1]
    if marker == SOS (FF DA): copy rest of file; break
    segLen = read 2-byte length
    if marker is APP1 and payload starts "Exif" or "http:"  -> drop (EXIF/XMP)
    elif marker is APP13                                    -> drop (IPTC)
    elif marker is COM                                      -> drop (comment)
    elif marker is APP2 and preset == max-privacy           -> drop (ICC)
    else                                                    -> copy segment
    pos += 2 + segLen
  return concatenate(output)

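The colour profile is preserved by default because removing it can shift how colours display; only the Maximum Privacy preset strips it. Everything structural (the quantization tables, Huffman tables, frame headers, and the scan data itself) is untouched, so the decoded pixels are identical. One side effect to be aware of: the EXIF segment also carries the Orientation tag, so a photo that relied on it may display rotated once EXIF is gone.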
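
The segment walk can be sketched in Python as follows. This is a simplified illustration rather than the engine's implementation: the function name and the max_privacy flag are assumptions, and it expects every marker before Start-of-Scan to be a length-prefixed segment.

```python
import struct

SOS, COM, APP1, APP2, APP13 = 0xDA, 0xFE, 0xE1, 0xE2, 0xED

def clean_jpeg(data: bytes, max_privacy: bool = False) -> bytes:
    """Copy a JPEG, leaving out EXIF/XMP, IPTC and comment segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG")
    out = [data[:2]]  # SOI marker
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError(f"expected a marker at offset {pos}")
        marker = data[pos + 1]
        if marker == SOS:
            out.append(data[pos:])  # scan data and the rest of the file, verbatim
            break
        # The 2-byte big-endian length counts itself but not the marker.
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + seg_len]
        drop = (
            (marker == APP1 and payload.startswith((b"Exif", b"http:")))
            or marker == APP13
            or marker == COM
            or (marker == APP2 and max_privacy)
        )
        if not drop:
            out.append(data[pos:pos + 2 + seg_len])
        pos += 2 + seg_len
    return b"".join(out)
```

Because kept segments are copied as whole byte slices, nothing in the quantization tables, Huffman tables, or scan data is rewritten.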

PNG — filter the chunk stream by type

A PNG is an 8-byte signature followed by a series of chunks, each with a length, a four-character type, the data, and a CRC. Because every chunk is self-describing and length-prefixed, removing one is clean: you simply do not copy it. The metadata-bearing chunk types are the text chunks (tEXt, iTXt, zTXt), the EXIF chunk (eXIf), the modification-time chunk (tIME), and the C2PA Content Credentials chunk (caBX).

cleanPNG(bytes):
  output = [8-byte PNG signature]
  strip = { tEXt, iTXt, zTXt, eXIf, tIME, caBX }
  if preset == max-privacy: strip += { iCCP, pHYs }
  pos = 8
  while pos < length:
    len   = 4-byte big-endian length at pos
    type  = 4 chars at pos + 4
    chunk = bytes[pos .. pos + 12 + len]   // length + type + data + CRC
    if type in strip: skip it
    else:             copy chunk
    pos += 12 + len
    if type == "IEND": break
  return concatenate(output)

The critical chunks (IHDR, PLTE, IDAT, IEND) are always copied, so the image data and palette survive intact. No CRC recalculation is needed because the chunks that remain keep their original CRCs.
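
A minimal Python sketch of the chunk filter is shown below. The function name and the max_privacy flag are assumptions for illustration; the set of chunk types mirrors the pseudocode above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
STRIP = {b"tEXt", b"iTXt", b"zTXt", b"eXIf", b"tIME", b"caBX"}

def clean_png(data: bytes, max_privacy: bool = False) -> bytes:
    """Copy a PNG, omitting metadata chunks; kept chunks keep their CRCs."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    strip = STRIP | ({b"iCCP", b"pHYs"} if max_privacy else set())
    out = [PNG_SIGNATURE]
    pos = 8
    while pos + 12 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length  # 4 length + 4 type + data + 4 CRC
        if chunk_type not in strip:
            out.append(data[pos:end])
        pos = end
        if chunk_type == b"IEND":
            break
    return b"".join(out)
```

Each chunk is copied or skipped as a unit, which is why the stored CRCs of the surviving chunks remain valid.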

WebP — drop RIFF chunks and fix the file size

WebP is a RIFF container: a 12-byte header (RIFF, a 4-byte little-endian size, WEBP) followed by chunks. Metadata sits in the EXIF and XMP chunks (and the ICCP colour profile). The twist relative to PNG is that the RIFF header records the size of everything after its first 8 bytes, so after removing chunks the algorithm must recalculate and rewrite that field, and it must respect RIFF's rule that chunks are padded to even byte boundaries. In extended-format files, the VP8X chunk also carries flag bits announcing EXIF, XMP, and ICC data; those should be cleared for any chunk type that was removed.

cleanWebP(bytes):
  output = [12-byte RIFF/WEBP header]
  strip = { "EXIF", "XMP " }; if max-privacy: + "ICCP"
  pos = 12
  while pos < length:
    fourcc = 4 chars at pos
    size   = 4-byte little-endian length at pos + 4
    padded = size + (size mod 2)        // word alignment
    chunk  = bytes[pos .. pos + 8 + padded]
    if fourcc in strip: skip
    else:               copy
    pos += 8 + padded
  rewrite RIFF size field = 4 + (total bytes of kept chunks, headers and padding included)
  return concatenate(output)
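
The following Python sketch shows one way the RIFF walk and size fix-up could look. The function name and flag are assumptions. Clearing the VP8X flag bits is an addition beyond the pseudocode, based on the WebP container layout, where EXIF, XMP, and ICC presence are signalled by bits 0x08, 0x04, and 0x20 of the first VP8X payload byte.

```python
import struct

VP8X_FLAGS = {b"EXIF": 0x08, b"XMP ": 0x04, b"ICCP": 0x20}

def clean_webp(data: bytes, max_privacy: bool = False) -> bytes:
    """Copy a WebP without its metadata chunks and recompute the RIFF size."""
    if data[:4] != b"RIFF" or data[8:12] != b"WEBP":
        raise ValueError("not a WebP")
    strip = {b"EXIF", b"XMP "} | ({b"ICCP"} if max_privacy else set())
    kept = []
    pos = 12
    while pos + 8 <= len(data):
        fourcc = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        end = pos + 8 + size + (size & 1)  # odd-sized payloads carry a pad byte
        if fourcc not in strip:
            chunk = bytearray(data[pos:end])
            if fourcc == b"VP8X" and size >= 1:
                for name in strip:  # stop advertising chunks that are being removed
                    chunk[8] &= ~VP8X_FLAGS[name] & 0xFF
            kept.append(bytes(chunk))
        pos = end
    body = b"WEBP" + b"".join(kept)
    # The RIFF size counts everything after the 8-byte "RIFF" + size prefix.
    return b"RIFF" + struct.pack("<I", len(body)) + body
```

Building the body first and measuring it avoids having to track the size arithmetic by hand.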

GIF — skip extension blocks

A GIF is a header, a logical screen descriptor, an optional global colour table, then a stream of blocks terminated by a trailer. Metadata lives in two extension blocks introduced by 0x21: the Comment Extension (0x21 FE) and the Application Extension (0x21 FF), the latter being where XMP is stored. The algorithm walks the block stream, copies image descriptors and graphic-control extensions (which carry timing, not metadata), and skips the comment and application extensions along with their sub-blocks. One caveat for animations: the loop count is also stored in an Application Extension (identifier NETSCAPE2.0), so a cleaner that wants to preserve looping has to check the 11-byte application identifier and keep that block.

cleanGIF(bytes):
  copy header + screen descriptor + global colour table
  loop:
    b = next byte
    if b == 0x3B (trailer): copy; break
    if b == 0x21 (extension):
      label = next byte
      if label in { 0xFE comment, 0xFF application }: skip block + sub-blocks
      else: copy (e.g. graphic control, plain text)
    elif b == 0x2C (image descriptor): copy descriptor + local colour table (if flagged) + LZW code-size byte + image sub-blocks
  return concatenate(output)
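
Here is a hedged Python sketch of the block walk. The names are illustrative. It departs from the pseudocode in one labelled respect: it keeps the NETSCAPE2.0 application extension so that animated GIFs still loop, which is an assumption about desired behaviour rather than something the description above states.

```python
def skip_sub_blocks(data: bytes, pos: int) -> int:
    """Return the offset just past a chain of length-prefixed sub-blocks."""
    while data[pos] != 0:
        pos += 1 + data[pos]
    return pos + 1  # step over the zero-length terminator

def clean_gif(data: bytes) -> bytes:
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF")
    pos = 13  # 6-byte header + 7-byte logical screen descriptor
    flags = data[10]
    if flags & 0x80:  # global colour table: 3 * 2^(N+1) bytes
        pos += 3 * (2 << (flags & 0x07))
    out = [data[:pos]]
    while pos < len(data):
        start, block = pos, data[pos]
        if block == 0x3B:  # trailer
            out.append(b";")
            break
        if block == 0x21:  # extension: label byte, then sub-blocks
            label = data[pos + 1]
            pos = skip_sub_blocks(data, pos + 2)
            # Assumption: keep the loop-count block; drop other application data.
            is_loop = label == 0xFF and data[start + 3:start + 14] == b"NETSCAPE2.0"
            if label not in (0xFE, 0xFF) or is_loop:
                out.append(data[start:pos])
        elif block == 0x2C:  # image descriptor (10 bytes including the 0x2C)
            local = data[pos + 9]
            pos += 10
            if local & 0x80:  # local colour table
                pos += 3 * (2 << (local & 0x07))
            pos = skip_sub_blocks(data, pos + 1)  # +1 for the LZW code-size byte
            out.append(data[start:pos])
        else:
            raise ValueError(f"unexpected block 0x{block:02x} at offset {pos}")
    return b"".join(out)
```

All extension types share the same sub-block framing, so a single helper is enough to find where any of them ends.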

TIFF and camera RAW — zero the metadata tag values

A TIFF (and most RAW formats, which are TIFF-based: DNG, NEF, ARW, CR2) stores everything in Image File Directories (IFDs) — tables of tags, where each tag has an ID, a type, a count, and either an inline value or an offset to the value elsewhere in the file. Some tags are structural (they describe where the image strips live and how to decode them); others are metadata (Make, Model, Software, Artist, GPS, EXIF sub-IFD pointers).

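Rewriting a TIFF's whole IFD and re-packing every offset is risky, so the algorithm takes a conservative route: it keeps a whitelist of structural tags needed to render the image, and for every other tag it zeroes the value bytes in place, either the inline value or the data the offset points to. The image strips are never moved or altered. For pointer tags such as the EXIF and GPS sub-IFDs, the directory they point to has to be zeroed as well; clearing only the 4-byte pointer would leave that data in the file, merely unreferenced.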

cleanTIFF(bytes):
  read byte order (II or MM), locate IFD0
  KEEP = { ImageWidth, ImageLength, BitsPerSample, Compression,
           StripOffsets, RowsPerStrip, StripByteCounts, ... }   // structural
  for each tag entry in IFD0:
    if tag in KEEP: leave untouched
    else:
      compute value byte-length from (type x count)
      if length > 4: zero the bytes at the value offset
      else:          zero the 4 inline value bytes
  return modified bytes
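
A compact Python sketch of the in-place zeroing follows. It handles classic (non-BigTIFF) files and IFD0 only, exactly as the pseudocode does, so it does not descend into sub-IFDs. The KEEP set is an illustrative subset written as numeric tag IDs, not the engine's real whitelist.

```python
import struct

# Byte size of each TIFF field type (1=BYTE, 2=ASCII, 3=SHORT, 4=LONG, 5=RATIONAL, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}
# ImageWidth, ImageLength, BitsPerSample, Compression, PhotometricInterpretation,
# StripOffsets, SamplesPerPixel, RowsPerStrip, StripByteCounts, PlanarConfiguration
KEEP = {256, 257, 258, 259, 262, 273, 277, 278, 279, 284}

def clean_tiff(data: bytes) -> bytes:
    """Zero the values of non-whitelisted IFD0 tags without moving any bytes."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF")
    magic, ifd0 = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF")
    buf = bytearray(data)
    (count,) = struct.unpack(order + "H", data[ifd0:ifd0 + 2])
    for i in range(count):
        entry = ifd0 + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, field_type, n = struct.unpack(order + "HHI", data[entry:entry + 8])
        if tag in KEEP:
            continue
        size = TYPE_SIZES.get(field_type, 1) * n
        if size > 4:
            (offset,) = struct.unpack(order + "I", data[entry + 8:entry + 12])
            if offset + size <= len(buf):  # never write past the end of the file
                buf[offset:offset + size] = bytes(size)
        else:
            buf[entry + 8:entry + 12] = bytes(4)  # inline value
    return bytes(buf)
```

The output always has the same length as the input, which is what keeps every strip offset valid.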

HEIC / HEIF — item-location surgery

HEIC is the hardest case and the reason many tools refuse it. It uses the ISO Base Media box structure (like MP4), and the metadata is not inline. A top-level meta box contains an item-information list (iinf) describing each stored item by ID and type, and an item-location list (iloc) giving the byte offset and length of each item's data inside the mdat region. The catch: the primary image is itself an item described by that same meta box, so you cannot just delete meta.

The algorithm is therefore surgical. It parses iinf to find items of type Exif or mime (XMP), looks those item IDs up in iloc to get their exact byte ranges, zeroes only those ranges in mdat, and renames the items' type entries to free so a reader ignores them. The image item's bytes are never touched — making the clean lossless.

cleanHEIC(bytes):
  find top-level "meta" box
  parse "iinf": build map itemID -> itemType
  targets = itemIDs where type in { "Exif", "mime" }
  parse "iloc": for each target itemID, read (offset, length) extents
  for each extent: zero bytes[offset .. offset+length] in mdat
  rename each target's "infe" type to "free"
  return modified bytes        // image item untouched -> lossless
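
The iloc lookup is the least obvious step, so here is a Python sketch of just that part, followed by the zeroing. It assumes the caller has already located the iloc payload (the box content after its 8-byte header) and already knows the target item IDs from iinf; finding the boxes and renaming the infe entries are left out. The field layout follows the ISO Base Media item-location box; the function names are hypothetical.

```python
def parse_iloc(payload: bytes) -> dict[int, list[tuple[int, int]]]:
    """Map item ID -> [(file offset, length)] for items stored at file offsets."""
    version = payload[0]  # bytes 1..3 are the FullBox flags
    offset_size, length_size = payload[4] >> 4, payload[4] & 0x0F
    base_size, index_size = payload[5] >> 4, payload[5] & 0x0F
    pos = 6

    def read(n: int) -> int:
        nonlocal pos
        value = int.from_bytes(payload[pos:pos + n], "big")
        pos += n
        return value

    id_size = 4 if version == 2 else 2  # version 2 widens item counts and IDs
    items = {}
    for _ in range(read(id_size)):  # item_count
        item_id = read(id_size)
        method = read(2) & 0x0F if version in (1, 2) else 0
        read(2)  # data_reference_index
        base = read(base_size)
        extents = []
        for _ in range(read(2)):  # extent_count
            if version in (1, 2) and index_size:
                read(index_size)  # extent_index
            offset = read(offset_size)
            length = read(length_size)
            extents.append((base + offset, length))
        if method == 0:  # only method 0 means plain file offsets
            items[item_id] = extents
    return items

def zero_items(data: bytes, iloc_payload: bytes, target_ids: set[int]) -> bytes:
    """Overwrite the byte ranges of the target items with zeros."""
    buf = bytearray(data)
    for item_id, extents in parse_iloc(iloc_payload).items():
        if item_id in target_ids:
            for offset, length in extents:
                if offset + length <= len(buf):
                    buf[offset:offset + length] = bytes(length)
    return bytes(buf)
```

Items whose construction method is not 0 are skipped here because their offsets are not relative to the start of the file.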

MP4 / MOV / M4A — drop atoms from the box tree

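ISO-media video and audio files are a tree of atoms (boxes), each with a 4-byte size and a 4-byte type (a size of 1 means a 64-bit size follows the type). Descriptive metadata, including the GPS location on phone recordings, lives in the udta (user data) and meta atoms, typically nested inside the moov movie box. The algorithm walks the tree recursively, drops udta and meta wholesale, and rewrites the size fields of every parent box up the chain so the container stays valid. The media samples in mdat are never touched. One consequence needs care: when moov sits before mdat, removing bytes from moov shifts the sample data towards the start of the file, so the absolute chunk offsets in stco/co64 must be reduced by the number of bytes removed (or the dropped atoms replaced by free boxes of the same size).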

cleanISOBMFF(bytes):
  function rebuild(start, end):
    pieces = []
    for each atom in [start, end):
      if type in { "udta", "meta" }:  drop it
      elif type is a container (moov, trak, mdia, minf, stbl):
        inner = rebuild(children)
        rewrite this atom's size = header + len(inner)
        pieces += [header, inner]
      else: copy atom unchanged
    return concatenate(pieces)
  return rebuild(0, length)
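
The recursion can be sketched in Python like this. It is a simplified illustration under stated assumptions: the names are hypothetical, rebuilt containers are written with 32-bit size fields, and the stco/co64 offset adjustment needed when moov precedes mdat is not included.

```python
import struct

CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}
DROP = {b"udta", b"meta"}

def rebuild(data: bytes, start: int, end: int) -> bytes:
    """Re-emit the boxes in data[start:end], omitting udta and meta."""
    pieces = []
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if size == 1:  # 64-bit largesize follows the type
            (size,) = struct.unpack(">Q", data[pos + 8:pos + 16])
            header = 16
        elif size == 0:  # box runs to the end of the enclosing range
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError(f"bad box size at offset {pos}")
        if kind in DROP:
            pass  # metadata: leave it out
        elif kind in CONTAINERS:
            inner = rebuild(data, pos + header, pos + size)
            # Parent size is recomputed from what actually survived inside it.
            pieces.append(struct.pack(">I4s", 8 + len(inner), kind) + inner)
        else:
            pieces.append(data[pos:pos + size])
        pos += size
    return b"".join(pieces)

def clean_isobmff(data: bytes) -> bytes:
    # Note: does not patch stco/co64; safe as-is only when mdat precedes moov.
    return rebuild(data, 0, len(data))
```

Only the listed container types are descended into, so a box such as mdat is copied as an opaque slice.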

MP3 — trim the ID3 tags off the ends

MP3 metadata sits in ID3 tags, conveniently located at the file's extremities. ID3v2 is a block at the very front whose header encodes its own length as a "synchsafe" integer (seven bits per byte). ID3v1, if present, is exactly the last 128 bytes and begins with the ASCII TAG. The algorithm computes the front tag's length, then checks the tail, and returns only the audio frames in between.

cleanMP3(bytes):
  start = 0; end = length
  if bytes start with "ID3":
    size = decode synchsafe length from header
    start = 10 + size              // skip ID3v2 header + body
    if header flags have the footer bit (0x10): start += 10   // ID3v2.4 footer
  if last 128 bytes start with "TAG":
    end = length - 128             // drop ID3v1
  return bytes[start .. end]        // pure audio frames
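
In Python, the trimming could look like the sketch below. The names are illustrative. It also accounts for the optional 10-byte ID3v2.4 footer, which the tag size does not include; other trailing tag formats are not handled.

```python
def synchsafe(raw: bytes) -> int:
    """Decode a 4-byte synchsafe integer (7 significant bits per byte)."""
    return (raw[0] << 21) | (raw[1] << 14) | (raw[2] << 7) | raw[3]

def clean_mp3(data: bytes) -> bytes:
    """Return only the bytes between a leading ID3v2 tag and a trailing ID3v1 tag."""
    start, end = 0, len(data)
    if len(data) >= 10 and data[:3] == b"ID3":
        start = 10 + synchsafe(data[6:10])  # 10-byte header + tag body
        if data[5] & 0x10:  # footer flag: another 10 bytes follow the body
            start += 10
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128  # ID3v1 is always exactly the last 128 bytes
    return data[start:end]
```

For example, the size bytes 00 00 01 48 decode to (1 << 7) + 0x48 = 200, so the audio starts at offset 210.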

FLAC — remove the comment and picture blocks

A FLAC file is the marker fLaC followed by a chain of metadata blocks, each flagged with a type and a "last block" bit, then the audio frames. Block type 4 is the Vorbis comment (tags) and type 6 is an embedded picture. The algorithm keeps the mandatory STREAMINFO block (type 0) and any others needed to decode, drops types 4 and 6, and fixes the "last block" flag on whatever metadata block ends up last.

cleanFLAC(bytes):
  output = ["fLaC"]
  for each metadata block:
    if type == 4 (VORBIS_COMMENT): drop
    elif type == 6 (PICTURE):      drop
    else: keep
  set "last-block" flag correctly on the final kept block
  append audio frames unchanged
  return concatenate(output)
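
A short Python sketch of the block filter, with an illustrative function name. Each metadata block header is one byte (the last-block bit plus a 7-bit type) followed by a 3-byte big-endian length.

```python
def clean_flac(data: bytes) -> bytes:
    """Copy a FLAC stream without VORBIS_COMMENT (4) and PICTURE (6) blocks."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    kept = []
    pos = 4
    last = False
    while not last:
        header = data[pos]
        last, block_type = bool(header & 0x80), header & 0x7F
        length = int.from_bytes(data[pos + 1:pos + 4], "big")
        body = data[pos + 4:pos + 4 + length]
        if block_type not in (4, 6):
            kept.append((block_type, body))
        pos += 4 + length
    out = [b"fLaC"]
    for i, (block_type, body) in enumerate(kept):
        # Re-derive the last-block bit from the block's new position.
        flag = 0x80 if i == len(kept) - 1 else 0x00
        out.append(bytes([flag | block_type]) + len(body).to_bytes(3, "big") + body)
    out.append(data[pos:])  # audio frames, unchanged
    return b"".join(out)
```

Rewriting every kept header, rather than patching just one, means the flag is correct whichever block ends up last.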

OGG and WAV — in-place field blanking and chunk removal

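OGG stores Vorbis comments inside bitstream pages, each with its own segment table and checksum, so removing the comments outright would mean re-paginating the stream. The conservative approach blanks the comment field text in place, which leaves every page size and segment table unchanged. Because a page's CRC covers its whole contents, the checksum of each page that was touched still has to be recomputed, but no other page is affected. WAV is a RIFF container like WebP, so its metadata chunks (LIST/INFO, embedded id3, and the broadcast bext chunk) are removed and the RIFF size is recalculated.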

cleanWAV(bytes):
  output = [12-byte RIFF/WAVE header]
  strip = { "LIST", "id3 ", "bext" }
  for each chunk: if id in strip skip, else copy (respect even-byte padding)
  rewrite RIFF size field
  return concatenate(output)
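
A Python sketch of the WAV case is below. The function name is an assumption. It narrows the pseudocode slightly by dropping a LIST chunk only when its list type is INFO, matching the LIST/INFO wording in the text, and it accepts both upper- and lower-case spellings of the id3 chunk ID.

```python
import struct

def clean_wav(data: bytes) -> bytes:
    """Copy a WAV without LIST/INFO, id3 and bext chunks; fix the RIFF size."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a WAV")
    kept = []
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        (size,) = struct.unpack("<I", data[pos + 4:pos + 8])
        end = pos + 8 + size + (size & 1)  # odd-sized payloads carry a pad byte
        is_info = chunk_id == b"LIST" and data[pos + 8:pos + 12] == b"INFO"
        is_id3 = chunk_id.lower() == b"id3 "
        if not (is_info or is_id3 or chunk_id == b"bext"):
            kept.append(data[pos:end])
        pos = end
    body = b"WAVE" + b"".join(kept)
    # The RIFF size counts everything after the 8-byte "RIFF" + size prefix.
    return b"RIFF" + struct.pack("<I", len(body)) + body
```

The fmt and data chunks are copied as whole slices, so the audio samples are not modified.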

PDF — rebuild from a clean object graph

Rather than byte-editing a PDF's complex cross-reference structure, the algorithm parses the document with a PDF library, clears the Info dictionary fields (Title, Author, Subject, Keywords, Creator, Producer, and the dates), removes the XMP /Metadata reference from the document catalog, and re-saves. Because the document is re-serialized from a clean object graph rather than appended to, the old values are not preserved as a recoverable earlier revision.

cleanPDF(bytes):
  doc = parse(bytes)
  for field in [Title, Author, Subject, Keywords, Creator, Producer]:
    record old value (for the audit report); set to empty
  set CreationDate and ModDate to epoch
  if catalog has /Metadata: delete it          // drop XMP stream
  return doc.save()                            // re-serialized, clean

DOCX / XLSX / PPTX — rewrite the XML parts

Office Open XML files are ZIP archives of XML parts. The algorithm unzips the archive in memory, replaces docProps/core.xml with a blanked version (empty creator and lastModifiedBy, reset dates), blanks the company and related fields in docProps/app.xml, deletes docProps/custom.xml, removes comment parts, and — for Word — strips tracked-change markup from word/document.xml by accepting insertions and dropping deletions. It also runs the JPEG cleaner over any images in the media/ folder, then repackages the archive.

cleanOOXML(file):
  zip = unzip(file)
  replace docProps/core.xml with blank template
  blank Company / Manager / Template / TotalTime in docProps/app.xml
  delete docProps/custom.xml
  delete comment parts (comments.xml, ppt/comments/, ...)
  if Word:
    in document.xml:
      <w:ins>X</w:ins>     -> X      (accept insertion)
      <w:del>...</w:del>   -> (remove)
      remove rPrChange / pPrChange / move markup
  for each JPEG in (word|xl|ppt)/media/: run cleanJPEG
  return rezip(zip)
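
The document-properties part of this can be sketched with Python's standard zipfile module. This covers only the core.xml replacement and custom.xml removal; comments, tracked changes, and embedded images are left out, and a complete cleaner would also remove the entries for custom.xml from [Content_Types].xml and _rels/.rels. The function name and the blank template are assumptions for illustration.

```python
import io
import zipfile

BLANK_CORE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<cp:coreProperties '
    'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:creator></dc:creator><cp:lastModifiedBy></cp:lastModifiedBy>'
    '</cp:coreProperties>'
)

def clean_ooxml_props(data: bytes) -> bytes:
    """Rewrite an OOXML archive with blank core properties and no custom.xml."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():
            if info.filename == "docProps/custom.xml":
                continue
            payload = src.read(info.filename)
            if info.filename == "docProps/core.xml":
                payload = BLANK_CORE.encode("utf-8")
            # A fresh ZipInfo carries the default 1980-01-01 timestamp, so the
            # new archive does not record when the cleaning took place.
            dst.writestr(
                zipfile.ZipInfo(info.filename),
                payload,
                compress_type=zipfile.ZIP_DEFLATED,
            )
    return out.getvalue()
```

Writing into a new archive, rather than editing in place, also avoids leaving the old parts behind as unreferenced bytes.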

The common principle

Across all of these, the same design philosophy holds: understand the container, remove only the metadata regions, preserve everything structural, and fix up any sizes or offsets the removal disturbed. Nowhere is the media re-compressed. That is why the operations are lossless and typically take milliseconds on ordinary-sized files. And because the engine is pure JavaScript and these are all in-memory transforms, it can run entirely in your browser without uploading your file anywhere.

You can see these algorithms in action — and inspect exactly what each one removed — using the image tool on our home page and the dedicated format pages. For the privacy reasoning behind why this matters, start with our pre-publishing checklist.
