Show HN: Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB


The post shows a file-specific lossless compressor: train a 900 KB transformer to overfit a single file, use its byte predictions as probabilities for arithmetic coding, and ship both the coded residual and the model weights. On a 100 MB NYC taxi CSV, the author reports about 7 MB of output. On a 100 MB slice of enwik9, the result is about 21 MB including the model. Runtime is the obvious catch: training takes 20 to 30 minutes, and compression and decompression each take about 45 minutes on a consumer GPU.

Treat this as a research curiosity, not a deployable codec. If you work on compression, the practical benchmark is still a comparison against strong general-purpose codecs like ZPAQ and LZMA, using published datasets, fixed settings, and total size including model weights.

June 26, 2026
news.ycombinator.com

Key insights

ZPAQ is the baseline that clarifies this result

Running ZPAQ at max effort put the numbers in perspective better than ZIP, zstd, or bzip2 alone. It reached 20.46 MB on the enwik9 slice and 9.57 MB on the taxi CSV. Against the reported transformer results of about 21 MB and about 7 MB on the same inputs, the approach is roughly on par with a top-end classical compressor on enwik9 and ahead on the more repetitive CSV. That turns the project from a vague AI-compression demo into a narrower claim about where file-specific neural models may actually win.

When evaluating neural compression, always include ZPAQ or similarly strong long-range codecs, not just mainstream tools tuned for speed. If your data looks more like logs, tables, or schema-heavy text than free-form prose, test there first.
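A baseline sweep along these lines can be sketched with Python's standard library. This is a minimal sketch, not the author's benchmark script: the function name `baseline_sizes` and the synthetic sample rows are hypothetical, and ZPAQ is not included because it has no standard-library binding and would have to be run as a separate tool.

```python
import bz2
import lzma
import zlib


def baseline_sizes(data: bytes) -> dict[str, int]:
    """Return the compressed size in bytes for several stdlib codecs at high effort."""
    return {
        "zlib-9": len(zlib.compress(data, 9)),
        "bz2-9": len(bz2.compress(data, 9)),
        "lzma-9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }


if __name__ == "__main__":
    # Hypothetical stand-in for a schema-heavy file; swap in your own bytes.
    sample = b"".join(
        b"%d,2024-01-01 00:%02d:00,%d.50\n" % (i, i % 60, i % 40)
        for i in range(20000)
    )
    for name, size in baseline_sizes(sample).items():
        print(f"{name:8s} {size:8d} bytes  ratio {size / len(sample):.4f}")
```

Reporting every codec on the same bytes, with the settings spelled out, is what makes a neural result comparable to the numbers other people get.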

Attribution:

wildstrawberry #1
spidy__ #1 #2

The learned file model is doing the real work

Arithmetic coding is just the final packing step. The compression gain lives in the probability model, whether that model is a hand-built predictor, a context mixer, or an overfit transformer. That framing matters because it stops the project from sounding like a new coder and puts it where it belongs, as an experiment in whether a neural predictor can memorize one file better than classical models can explain it.

Compare models, not just archive sizes. If you build on this idea, spend your effort on better prediction and account for model transmission explicitly.
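To make the point concrete, the sketch below measures the ideal code length of a byte string under two simple adaptive count models. It is an illustration under stated assumptions, not the project's transformer: `ideal_bits` is a hypothetical helper, the models are plain order-k context counts with add-one smoothing, and the sample rows are synthetic. An arithmetic coder driven by the same probabilities would produce output very close to this total, so any difference between the two numbers comes from the predictor alone.

```python
import math
from collections import defaultdict


def ideal_bits(data: bytes, order: int) -> float:
    """Sum of -log2 p(byte | previous `order` bytes) under an adaptive count model."""
    counts = defaultdict(lambda: [1] * 256)  # add-one smoothed counts per context
    totals = defaultdict(lambda: 256)        # sum of the counts for each context
    bits = 0.0
    for i, byte in enumerate(data):
        ctx = data[max(0, i - order):i]      # up to `order` preceding bytes
        bits -= math.log2(counts[ctx][byte] / totals[ctx])
        counts[ctx][byte] += 1               # update after coding so a decoder can mirror it
        totals[ctx] += 1
    return bits


# Synthetic two-column rows; the order-2 model can exploit the row structure.
rows = b"".join(b"%d,%d\n" % (i % 7, i % 3) for i in range(20000))
for order in (0, 2):
    print(f"order {order}: about {ideal_bits(rows, order) / 8:,.0f} bytes")
```

The coder is identical in both cases; only the probability model changes, which is why effort spent on prediction is what moves the archive size.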

Attribution:

jmspring #1

enwik9 slice choice can skew the benchmark

Different parts of enwik9 are not equally predictable, so quoting results on an unspecified 100 MB slice weakens the comparison. Matt Mahoney’s benchmark notes show that compressibility varies across the file, which means a good result on one slice may say more about the slice than the method. For a public claim, enwik8 or a precisely named enwik9 segment is the safer target.

If you publish compression numbers, pin the exact byte range or use a standard benchmark file. Otherwise nobody can tell whether the method improved or the sample got easier.
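One way to pin a slice is to publish the byte offsets together with a hash of the exact bytes. The helper below is a minimal sketch; `slice_fingerprint` is a hypothetical name, and the in-memory demo stands in for a real file handle such as `open("enwik9", "rb")`.

```python
import hashlib
import io
from typing import BinaryIO


def slice_fingerprint(f: BinaryIO, start: int, length: int) -> str:
    """Describe an exact byte range of a binary stream with its SHA-256 digest."""
    f.seek(start)
    chunk = f.read(length)
    if len(chunk) != length:
        raise ValueError(f"stream too short: wanted {length} bytes at offset {start}")
    digest = hashlib.sha256(chunk).hexdigest()
    return f"bytes [{start}, {start + length}) sha256={digest}"


# Demo on an in-memory stand-in for a benchmark file.
demo = io.BytesIO(b"0123456789")
print(slice_fingerprint(demo, 2, 4))
```

Anyone rerunning the benchmark can recompute the digest and confirm they are compressing the same bytes.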

Attribution:

atiedebee #1

Specialized reformulations beat generic compression

The game-state tangent landed on the same core lesson from another angle. Big wins often come from changing what must be represented, not from squeezing the same representation harder. In multiplayer netcode, sending player intents or state transitions instead of full world snapshots can dwarf any gain from a better codec. That sharpens the interpretation here too: overfitting one file is powerful because it bakes in structure that a general compressor has to rediscover every time.

Before chasing a better compressor, ask whether your format or protocol can encode the underlying state more directly. Domain-specific representation changes can buy orders of magnitude more than a fancier backend codec.
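A small illustration of the same idea, under assumptions: the column of one-per-minute timestamps is hypothetical, and `encode_deltas` and `decode_deltas` are illustrative helpers, not part of the project. Storing differences instead of absolute values changes what the backend codec has to represent, while the codec itself stays the same.

```python
import zlib


def encode_deltas(values: list[int]) -> bytes:
    """Store the first value, then successive differences, one per line (non-empty input)."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    return "\n".join(map(str, deltas)).encode()


def decode_deltas(blob: bytes) -> list[int]:
    """Invert encode_deltas with a running sum."""
    out, total = [], 0
    for line in blob.decode().split("\n"):
        total += int(line)
        out.append(total)
    return out


# Hypothetical column: one event per minute, as Unix timestamps.
timestamps = [1_700_000_000 + 60 * i for i in range(10_000)]
raw = "\n".join(map(str, timestamps)).encode()
delta = encode_deltas(timestamps)

assert decode_deltas(delta) == timestamps  # the transform is lossless
print(len(zlib.compress(raw, 9)), "bytes for zlib on the raw column")
print(len(zlib.compress(delta, 9)), "bytes for zlib on the delta column")
```

The delta column is almost entirely the same short line repeated, so the generic codec has far less to describe than it does with the raw digits.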

Attribution:

andai #1
purple-leafy #1

Against the grain

Novelty claims were overstated

Pointing to Fabrice Bellard’s earlier neural compression work cut against the implied sense of invention. The pushback was not that reimplementing the idea is bad, but that a Show HN pitch should acknowledge prior art up front if the concept is already known. That changes the read from “new technique” to “fresh implementation with new measurements.”

If you present experimental work publicly, cite the closest predecessor in the intro. That heads off avoidable skepticism and lets readers focus on what is actually new in your version.

Attribution:

userbinator #1
pentaphobe #1

Model overhead is not negligible by default

The author treated a 900 KB model as basically free for a 100 MB file, but that only holds at larger sizes and for especially compressible inputs. Once files get smaller, or once the model must grow toward the source size to keep improving compression, the fixed payload starts eating the gains fast. The same issue appears again if the target shifts from 100 MB experiments to a 1 GB file with a much larger model.

Plot compression ratio against input size with model bytes included. Without a crossover curve, it is hard to know where this approach starts helping instead of hurting.
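A rough version of that crossover can be worked out before measuring anything. The sketch below assumes, unrealistically, that both compressors achieve a constant bytes-out per byte-in rate at every input size; in practice the rates shift with file size and content, so a real curve has to be measured. The rate constants are assumptions derived from the figures quoted in this summary, including the guess that the 7 MB CSV result already counts the 900 KB model.

```python
def total_ratio(input_bytes: int, payload_rate: float, model_bytes: int) -> float:
    """Archive size (coded payload plus model) divided by input size."""
    return (payload_rate * input_bytes + model_bytes) / input_bytes


def break_even_bytes(model_bytes: int, payload_rate: float, baseline_rate: float) -> float:
    """Input size at which the model-based archive matches the baseline archive."""
    if payload_rate >= baseline_rate:
        return float("inf")  # the model never pays for itself
    return model_bytes / (baseline_rate - payload_rate)


MODEL = 900_000      # model size quoted in the post
NEURAL_RATE = 0.061  # assumption: 7 MB total minus the model, over 100 MB
ZPAQ_RATE = 0.0957   # 9.57 MB over 100 MB, the ZPAQ figure for the taxi CSV

for size in (1_000_000, 10_000_000, 100_000_000):
    print(f"{size:>11,d} bytes: neural {total_ratio(size, NEURAL_RATE, MODEL):.3f}"
          f" vs baseline {ZPAQ_RATE:.3f}")
print(f"break-even near {break_even_bytes(MODEL, NEURAL_RATE, ZPAQ_RATE):,.0f} bytes")
```

Under these assumptions the fixed model payload dominates for small inputs and only stops mattering in the tens of megabytes, which is the kind of threshold a measured curve would confirm or correct.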

Attribution:

spidy__ #1 #2

In plain English

Arithmetic coding

A lossless compression method that encodes data using probability estimates, usually achieving better packing than simple symbol-by-symbol codes.

bzip2

A lossless compression tool based on the Burrows-Wheeler transform and Huffman coding, known for decent compression and slower speed than gzip.

Context mixer

A compression model that combines predictions from multiple contexts or submodels to estimate the next symbol more accurately.

CSV

Comma-separated values, a plain text tabular file format where each row contains fields separated by commas.

enwik9

A 1 gigabyte Wikipedia text benchmark used in data compression research, with enwik8 referring to the first 100 megabytes of that corpus.

GPU

Graphics Processing Unit, a processor often used to accelerate machine learning workloads.

LZMA2

Lempel-Ziv-Markov chain Algorithm 2, a high-compression lossless algorithm used in formats like 7z.

Transformer

A neural network architecture commonly used for language and sequence modeling that predicts the next element in a sequence from prior context.

ZPAQ

A high-compression archival format and tool that uses advanced context modeling and can be very slow at maximum settings.

zstd

Zstandard, a modern general-purpose lossless compression algorithm designed for high speed with good compression ratios.

Reference links

Compression benchmarks and references

Matt Mahoney data compression reference
Shared as a reference for conventional compression algorithms and benchmarks.
Hutter Prize benchmark
Referenced as the canonical benchmark for large text compression.
Matt Mahoney text benchmark data notes
Used to support the point that different enwik9 slices vary in predictability.
Lossless data compression comparison writeup
Recommended as a practical comparison of compression tools across speed and ratio tradeoffs.

Prior art and related experiments

Earlier Hacker News discussion of Fabrice Bellard neural compression work
Cited to show this broad idea had earlier precedent.
Stavros image compression with Stable Diffusion
Mentioned as a satirical but thought-provoking related experiment in model-based compression.
Speculative Speculative Decoding paper
Linked in response to the idea of using one model to help generate or compress another model.

Project resources

pym-particles repository
The project repo for the transformer-based compression experiment.
README benchmark results section
Added by the author to provide dataset links and benchmark details after reproducibility questions.
Project Nayuki arithmetic coder file in repo
Pointed out as the arithmetic coding implementation used in the project.

Adjacent systems example

de-pessimized netcode video
Shared to illustrate how changing representation in multiplayer networking can beat brute-force compression.
