HN Debrief

Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

  • AI
  • Infrastructure
  • Open Source
  • Developer Tools

The post shows a file-specific lossless compressor: train a 900 KB transformer to overfit a single file, use its byte predictions as probabilities for arithmetic coding, and ship both the arithmetic-coded stream and the model weights. On a 100 MB NYC taxi CSV, the author reports about 7 MB output. On a 100 MB slice of enwik9, the result is about 21 MB including the model. Runtime is the obvious catch: training takes 20 to 30 minutes, and compression and decompression each take about 45 minutes on a consumer GPU.
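
To make the mechanism concrete, here is a minimal sketch of how predicted probabilities turn into compressed size. It is not the author's code. A tiny adaptive byte-frequency model stands in for the transformer (an assumption made purely for illustration), and the function names are hypothetical. An ideal arithmetic coder spends about -log2(p) bits on a symbol the model predicted with probability p, so better predictions mean fewer bits.

```python
import math


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal arithmetic coder would spend on `data`.

    Stand-in model (assumption): adaptive order-0 byte counts with
    add-one smoothing. The post uses a trained transformer instead.
    """
    counts = [1] * 256  # every byte value starts with a pseudo-count of 1
    total = 256
    bits = 0.0
    for byte in data:
        p = counts[byte] / total   # model's probability for the byte that occurred
        bits += -math.log2(p)      # cost of coding that byte
        counts[byte] += 1          # a decoder can repeat this update exactly
        total += 1
    return bits


def total_archive_bytes(coded_bits: float, model_bytes: int) -> int:
    """Shipped size: coded stream plus any model weights that must travel with it."""
    return math.ceil(coded_bits / 8) + model_bytes


if __name__ == "__main__":
    sample = b"vendor_id,fare\n" * 1000
    bits = ideal_code_length_bits(sample)
    # The adaptive stand-in ships no weights; a trained model would add its size here.
    print(len(sample), "bytes in,", total_archive_bytes(bits, model_bytes=0), "bytes out")
```

Swapping in a stronger predictor changes only the line that computes p. The coding cost formula stays the same.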

Treat this as a research curiosity, not a deployable codec. If you work on compression, the practical benchmark is still a comparison against strong general-purpose codecs like ZPAQ and LZMA, on published datasets, with fixed settings, and with total size counted including model weights.
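
A comparison along those lines can be sketched with Python's standard library alone. The helper names below are hypothetical. ZPAQ and zstd are not in the standard library, so LZMA at its extreme preset is used here as the strongest built-in stand-in, and a real benchmark would add ZPAQ as an external tool. The neural entry is reported as coded payload plus model weights, never payload alone.

```python
import bz2
import lzma
import zlib


def baseline_sizes(data: bytes) -> dict[str, int]:
    """Compressed size in bytes for each standard-library codec at a fixed, maximal setting."""
    return {
        "zlib-9": len(zlib.compress(data, 9)),
        "bz2-9": len(bz2.compress(data, 9)),
        "lzma-9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }


def compare(data: bytes, neural_payload_bytes: int, model_bytes: int) -> dict[str, int]:
    """Rank the baselines against a neural result, smallest total first."""
    report = baseline_sizes(data)
    report["neural+model"] = neural_payload_bytes + model_bytes
    return dict(sorted(report.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    sample = b"2024-01-01,1,12.50,JFK\n" * 5000
    # The neural numbers here are placeholders, not measurements.
    for name, size in compare(sample, neural_payload_bytes=400, model_bytes=900_000).items():
        print(f"{name:>14}: {size} bytes")
```

On a small sample like this one the fixed model weights dominate the total, so the neural entry lands last regardless of how well the payload compresses.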

Discussion mood

Mostly impressed by the experiment, but skeptical of the presentation until stronger baselines and dataset links were added. People liked the curiosity and the result on structured CSV data, yet kept coming back to novelty claims, reproducibility, and the huge speed cost.

Key insights

  1. 01

    ZPAQ is the baseline that clarifies this result

    Running ZPAQ at max effort put the numbers in perspective better than ZIP, zstd, or bzip2 alone. It reached 20.46 MB on the enwik9 slice and 9.57 MB on the taxi CSV. Against the reported 21 MB and 7 MB, that puts the transformer approach roughly on par with, though slightly behind, a top-end classical compressor on enwik9 and clearly ahead on the more repetitive CSV. That turns the project from a vague AI-compression demo into a narrower claim about where file-specific neural models may actually win.

    When evaluating neural compression, always include ZPAQ or similarly strong long-range codecs, not just mainstream tools tuned for speed. If your data looks more like logs, tables, or schema-heavy text than free-form prose, test there first.

      Attribution:
    • wildstrawberry #1
    • spidy__ #1 #2
  2. 02

    The learned file model is doing the real work

    Arithmetic coding is just the final packing step. The compression gain lives in the probability model, whether that model is a hand-built predictor, a context mixer, or an overfit transformer. That framing matters because it stops the project from sounding like a new coder and puts it where it belongs, as an experiment in whether a neural predictor can memorize one file better than classical models can explain it.

    Compare models, not just archive sizes. If you build on this idea, spend your effort on better prediction and account for model transmission explicitly.

      Attribution:
    • jmspring #1
  3. 03

    enwik9 slice choice can skew the benchmark

    Different parts of enwik9 are not equally predictable, so quoting results on an unspecified 100 MB slice weakens the comparison. Matt Mahoney’s benchmark notes show that compressibility varies across the file, which means a good result on one slice may say more about the slice than the method. For a public claim, enwik8 or a precisely named enwik9 segment is the safer target.

    If you publish compression numbers, pin the exact byte range or use a standard benchmark file. Otherwise nobody can tell whether the method improved or the sample got easier.

      Attribution:
    • atiedebee #1
  4. 04

    Specialized reformulations beat generic compression

    The game-state tangent landed on the same core lesson from another angle. Big wins often come from changing what must be represented, not from squeezing the same representation harder. In multiplayer netcode, sending player intents or state transitions instead of full world snapshots can dwarf any gain from a better codec. That sharpens the interpretation here too: overfitting one file is powerful because it bakes in structure that a general compressor has to rediscover every time.

    Before chasing a better compressor, ask whether your format or protocol can encode the underlying state more directly. Domain-specific representation changes can buy orders of magnitude more than a fancier backend codec.

      Attribution:
    • andai #1
    • purple-leafy #1
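
The third insight's advice to pin the exact byte range can be sketched as follows. The function names and report format are hypothetical. The idea is to publish the offset, the length, and a SHA-256 digest of the slice, so anyone can confirm they are compressing the same bytes.

```python
import hashlib


def pinned_slice(path: str, start: int, length: int) -> tuple[bytes, str]:
    """Read exactly `length` bytes at offset `start` and return them with their SHA-256."""
    with open(path, "rb") as handle:
        handle.seek(start)
        chunk = handle.read(length)
    if len(chunk) != length:
        raise ValueError(f"wanted {length} bytes at offset {start}, got {len(chunk)}")
    return chunk, hashlib.sha256(chunk).hexdigest()


def describe_slice(path: str, start: int, length: int) -> str:
    """One line suitable for a benchmark write-up."""
    _, digest = pinned_slice(path, start, length)
    return f"{path} bytes [{start}, {start + length}) sha256={digest}"
```

Calling describe_slice with a local copy of the benchmark file, a start offset, and a length yields a single line that makes a reported number reproducible.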

Against the grain

  1. 01

    Novelty claims were overstated

    Pointing to Fabrice Bellard’s earlier neural compression work cut against the implied sense of invention. The pushback was not that reimplementing the idea is bad, but that a Show HN pitch should acknowledge prior art up front if the concept is already known. That changes the read from “new technique” to “fresh implementation with new measurements.”

    If you present experimental work publicly, cite the closest predecessor in the intro. That heads off avoidable skepticism and lets readers focus on what is actually new in your version.

      Attribution:
    • userbinator #1
    • pentaphobe #1
  2. 02

    Model overhead is not negligible by default

    The author treated a 900 KB model as basically free for a 100 MB file, but that only holds at larger sizes and for especially compressible inputs. Once files get smaller, or once the model must grow toward the source size to keep improving compression, the fixed payload starts eating the gains fast. The same issue appears again if the target shifts from 100 MB experiments to a 1 GB file with a much larger model.

    Plot compression ratio against input size with model bytes included. Without a crossover curve, it is hard to know where this approach starts helping instead of hurting.

      Attribution:
    • spidy__ #1 #2
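
The crossover suggested in the second point can be estimated with a few lines of arithmetic. This sketch assumes the payload ratio and the baseline ratio stay constant as the input grows. That is a simplification, since the discussion itself notes the model may need to grow with the file. The function names are hypothetical.

```python
import math
from typing import Optional


def effective_ratio(input_bytes: int, payload_ratio: float, model_bytes: int) -> float:
    """Total output divided by input size, with the fixed model weights included."""
    return payload_ratio + model_bytes / input_bytes


def break_even_input_bytes(payload_ratio: float, baseline_ratio: float,
                           model_bytes: int) -> Optional[int]:
    """Smallest input size at which payload plus model is no larger than the baseline.

    Returns None when the payload alone does not beat the baseline,
    because then no input size can recover the model overhead.
    """
    gain = baseline_ratio - payload_ratio  # fraction of input saved per byte
    if gain <= 0:
        return None
    return math.ceil(model_bytes / gain)


if __name__ == "__main__":
    # Illustrative ratios only, not measurements from the post.
    for size in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
        print(size, round(effective_ratio(size, payload_ratio=0.06, model_bytes=900_000), 4))
```

Printing effective_ratio across a range of input sizes gives the crossover curve directly. break_even_input_bytes gives the single point where the fixed payload stops hurting under the constant-ratio assumption.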

In plain English

Arithmetic coding
A lossless compression method that encodes data using probability estimates, usually achieving better packing than simple symbol-by-symbol codes.
bzip2
A lossless compression tool based on Burrows-Wheeler transform and Huffman coding, known for decent compression and slower speed than gzip.
Context mixer
A compression model that combines predictions from multiple contexts or submodels to estimate the next symbol more accurately.
CSV
Comma-separated values, a plain text tabular file format where each row contains fields separated by commas.
enwik9
A 1 gigabyte Wikipedia text benchmark used in data compression research, with enwik8 referring to the first 100 megabytes of that corpus.
GPU
Graphics Processing Unit, a processor often used to accelerate machine learning workloads.
LZMA2
Lempel-Ziv-Markov chain Algorithm 2, a high-compression lossless algorithm used in formats like 7z.
Transformer
A neural network architecture commonly used for language and sequence modeling that predicts the next element in a sequence from prior context.
ZPAQ
A high-compression archival format and tool that uses advanced context modeling and can be very slow at maximum settings.
zstd
Zstandard, a modern general-purpose lossless compression algorithm designed for strong speed and good compression.

Reference links

Adjacent systems example

  • de-pessimized netcode video
    Shared to illustrate how changing representation in multiplayer networking can beat brute-force compression.