
Why Does ZIP Compression Ratio Vary Greatly Depending on the File?

Category: File Format

Have you ever noticed that a file barely shrank after ZIP compression, or that a text file became dramatically smaller? ZIP compression ratios vary greatly depending on the file type. This article explains why, and describes how each file format behaves.

[Diagram: ZIP-compressed size by file type, as a percentage of the original size (smaller = better): TXT / CSV ~15%; JSON / XML ~20%; HTML / CSS / JS ~25%; DOCX / XLSX ~90%; PDF ~95%; JPEG / PNG / MP3 ~99%; MP4 / ZIP ~100%]

Compression Algorithm Used in ZIP: Deflate

The ZIP format (.zip) primarily uses the <strong>Deflate</strong> algorithm. Deflate is a combination of the following two techniques.

  • <strong>LZ77 (Lempel-Ziv 1977)</strong>: Replaces repeated byte sequences with references to their earlier occurrences
  • <strong>Huffman coding</strong>: Represents frequently occurring symbols with shorter bit sequences

In other words, <strong>data with many repetitive patterns compresses well</strong>, while random data or already-compressed data can barely be compressed at all.
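To make this concrete, here is a minimal sketch that runs Deflate (through Python's standard `zlib` module) over a repetitive input and a random input of the same length. The helper name `deflate_ratio` and the sample data are illustrative assumptions, not part of any ZIP tool.

```python
import os
import zlib

def deflate_ratio(data: bytes) -> float:
    """Return the compressed size as a fraction of the original size."""
    # zlib.compress emits a Deflate stream plus a small header and checksum.
    return len(zlib.compress(data, 9)) / len(data)

# Illustrative inputs: one line repeated many times vs. random bytes.
repetitive = b"timestamp,level,message\n" * 4000
random_bytes = os.urandom(len(repetitive))

print(f"repetitive: {deflate_ratio(repetitive):.1%} of original")
print(f"random:     {deflate_ratio(random_bytes):.1%} of original")
```

The repetitive input should shrink to a small fraction of its size, while the random input should stay at roughly 100% or grow by a few bytes.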

Compression Rate by File Format

| File format | Size reduction (approximate) | Reason |
| --- | --- | --- |
| Text (.txt) | 60–80% | Many repeated characters and words |
| CSV | 70–85% | Delimiters and recurring patterns repeat |
| HTML / XML / JSON | 65–85% | Tags and key names repeat frequently |
| Log file | 70–90% | Timestamp formats repeat frequently |
| BMP (uncompressed image) | 50–80% | Many consecutive pixels of the same color |
| PDF | 5–20% | Often already compressed internally with zlib |
| PNG | 0–5% | Already compressed with Deflate |
| JPEG | 0–5% | Already compressed with DCT + Huffman coding |
| MP3 / AAC | 0–3% | Already compressed with lossy compression |
| MP4 / H.264 | 0–3% | Already highly compressed |
| ZIP / GZ / 7z | 0–2% (may grow in some cases) | Re-compressing already-compressed data is largely ineffective |

When compressed files become even larger

When compressing already-compressed files like JPEG or MP4 with ZIP, the file size may <strong>increase slightly</strong> due to ZIP headers (file metadata). This is because the ZIP format includes a local file header (30 bytes or more) for each file and a central directory for the entire archive.

JPEG file (1.00 MB)
 └── After ZIP compression: 1.00 MB + headers (roughly 100 B or more) = slightly larger
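The container overhead can be measured directly. The following sketch stores a payload without compression so that the payload bytes are unchanged, then subtracts the payload size from the archive size. The helper `zip_overhead`, the file name `image.jpg`, and the random payload standing in for JPEG data are assumptions for illustration. The constants are the fixed-size parts of the three ZIP records involved.

```python
import io
import os
import zipfile

def zip_overhead(name: str, payload: bytes) -> int:
    """Return the archive bytes beyond the payload for a single stored entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return len(buf.getvalue()) - len(payload)

name = "image.jpg"  # hypothetical file name
# Fixed-size parts: local file header (30 bytes), central directory entry
# (46 bytes), end-of-central-directory record (22 bytes). The file name is
# stored twice: once in the local header and once in the central directory.
minimum = 30 + 46 + 22 + 2 * len(name)

overhead = zip_overhead(name, os.urandom(1024))
print(f"overhead: {overhead} bytes (format minimum for this name: {minimum})")
```

Tools may add extra fields such as timestamps or ZIP64 data, so real archives can carry somewhat more than this minimum per file.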

How Store mode differs

ZIP has a <strong>Store</strong> mode that stores files without compression. When bundling multiple already-compressed files (such as JPEG or MP4), Store mode avoids the CPU cost of compression while producing an archive of essentially the same size.

# Specify the compression level with the zip command
zip -0 archive.zip image.jpg video.mp4   # Store (no compression)
zip -9 archive.zip data.csv report.txt   # Maximum compression

# Specify the compression level in Python (compresslevel requires Python 3.7+)
import zipfile
with zipfile.ZipFile('archive.zip', 'w', zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
    zf.write('data.csv')
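When an archive mixes compressible and already-compressed files, the method can also be chosen per entry rather than per archive. The sketch below is one possible approach: the extension list `PRECOMPRESSED` and the helpers `pick_method` and `add_bytes` are hypothetical names, and the list is an assumption you would adjust for your own data.

```python
import io
import zipfile
from pathlib import PurePath

# Assumed set of extensions whose contents are already compressed.
PRECOMPRESSED = {".jpg", ".jpeg", ".png", ".mp3", ".aac", ".mp4", ".zip", ".gz", ".7z"}

def pick_method(filename: str) -> int:
    """Return ZIP_STORED for already-compressed formats, else ZIP_DEFLATED."""
    suffix = PurePath(filename).suffix.lower()
    return zipfile.ZIP_STORED if suffix in PRECOMPRESSED else zipfile.ZIP_DEFLATED

def add_bytes(zf: zipfile.ZipFile, filename: str, data: bytes) -> None:
    """Add one entry, choosing the compression method from its extension."""
    zf.writestr(filename, data, compress_type=pick_method(filename))

# Usage with placeholder contents, written to an in-memory archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    add_bytes(zf, "photo.JPG", b"\xff\xd8 placeholder bytes")
    add_bytes(zf, "data.csv", b"id,name\n" * 1000)
    methods = {info.filename: info.compress_type for info in zf.infolist()}

print(methods)
```

Choosing by extension is only a heuristic. It skips wasted CPU work on media files while still compressing text-like entries.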

Characteristics of test ZIP files

DevLab's test ZIP files contain <strong>random data (pseudo-random byte sequences)</strong> to precisely control file size. Since random data has maximum entropy, Deflate compression is nearly ineffective. Therefore, a "10MB ZIP file" remains approximately "10MB after decompression."
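The entropy claim can be checked with a short calculation. This sketch computes the byte-level Shannon entropy, where 8 bits per byte is the maximum. The function name `shannon_entropy` and the sample inputs are illustrative assumptions. Note that this measure only looks at byte frequencies, so it does not capture the repeated sequences that LZ77 also exploits.

```python
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the byte-level Shannon entropy in bits per byte (0.0 to 8.0)."""
    total = len(data)
    counts = Counter(data)  # frequency of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

random_entropy = shannon_entropy(os.urandom(1_000_000))
text_entropy = shannon_entropy(b"the quick brown fox jumps over the lazy dog\n" * 1000)

print(f"random bytes: {random_entropy:.3f} bits/byte")
print(f"English text: {text_entropy:.3f} bits/byte")
```

Random bytes should land very close to 8 bits per byte, which leaves Huffman coding almost nothing to shorten.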

If you need a ZIP file that reaches a specific size after extraction, you can create a test file using a method like the one below.

# Create a ZIP that extracts to exactly 100 MB (zero-filled, highly compressible)
dd if=/dev/zero bs=1M count=100 | zip -9 zero-100mb.zip -

# Create a ZIP that extracts to exactly 100 MB (random data, almost incompressible)
dd if=/dev/urandom bs=1M count=100 | zip -0 random-100mb.zip -
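If the `zip` command is not available, a similar test archive can be built in Python. This is a minimal sketch rather than the method used for DevLab's files: the helper `make_test_zip` and the entry name `payload.bin` are hypothetical, and the payload is 1 MiB instead of 100 MB to keep the demo fast.

```python
import io
import os
import zipfile

def make_test_zip(size: int, random_fill: bool) -> bytes:
    """Build an in-memory ZIP holding one entry of exactly `size` bytes."""
    # Random payloads are stored as-is; zero-filled payloads are deflated.
    payload = os.urandom(size) if random_fill else bytes(size)
    method = zipfile.ZIP_STORED if random_fill else zipfile.ZIP_DEFLATED
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", method) as zf:
        zf.writestr("payload.bin", payload)  # hypothetical entry name
    return buf.getvalue()

size = 1024 * 1024  # 1 MiB
zero_zip = make_test_zip(size, random_fill=False)
random_zip = make_test_zip(size, random_fill=True)

print(f"zero-filled archive: {len(zero_zip)} bytes")
print(f"random archive:      {len(random_zip)} bytes")
```

Both archives extract to the same number of bytes, but the zero-filled one is tiny on disk while the random one is slightly larger than its payload.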

Summary

  • The ZIP compression ratio is determined by how many <strong>repeating data patterns</strong> the file contains
  • Text, CSV, and XML can be reduced by <strong>60–85%</strong>
  • JPEG, MP4, and other pre-compressed files <strong>cannot be compressed much</strong> (and may even grow slightly)
  • When combining already-compressed files, save CPU with <strong>Store mode (-0)</strong>
  • DevLab's test ZIP files use random data, so the size remains nearly identical before and after decompression

→ <a href="/ja/files/zip/">Download test ZIP files here</a>

Test files for this article

  • → <a href="/ja/files/zip/" class="text-primary-600 dark:text-primary-400 hover:underline">ZIP test file list</a>
  • → <a href="/ja/files/csv/" class="text-primary-600 dark:text-primary-400 hover:underline">CSV test file list</a>

Related articles

  • → <a href="/ja/blog/file-format-quick-reference/" class="text-primary-600 dark:text-primary-400 hover:underline">File Format Quick Reference for Developers</a>
  • → <a href="/ja/blog/file-validation-checklist/" class="text-primary-600 dark:text-primary-400 hover:underline">Web Form File Validation Implementation Checklist</a>