HN Debrief

F3

F3 is a proposed columnar storage format in the same category as Parquet, ORC, Lance, Nimble, and Vortex. Its pitch is that today’s analytic formats are hard to evolve because readers only support a narrow, old-compatible subset of encodings. F3 tries to break that deadlock by storing metadata plus embedded Wasm decoders in the file, so a reader that does not understand a new encoding can still decode it through a sandboxed fallback path. The practical target is not general file interchange like PDFs or video. It is large tabular data, especially workloads that mix full scans with point lookups or ML-style access patterns.
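
The fallback idea in the paragraph above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not F3's actual API or file layout: `ColumnChunk`, `NATIVE_DECODERS`, and `decode_chunk` are hypothetical names, the toy encodings are invented for the example, and the embedded Wasm decoder is represented by a plain Python callable rather than a real Wasm module running in a sandbox.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Decoder = Callable[[bytes], List[int]]


def decode_plain_u8(data: bytes) -> List[int]:
    """Toy native decoder: each byte is one value."""
    return list(data)


def decode_rle_u8(data: bytes) -> List[int]:
    """Toy native decoder: (run_length, value) byte pairs; a trailing odd byte is ignored."""
    out: List[int] = []
    for i in range(0, len(data) - 1, 2):
        out.extend([data[i + 1]] * data[i])
    return out


# Encodings this hypothetical reader build understands natively.
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain_u8": decode_plain_u8,
    "rle_u8": decode_rle_u8,
}


@dataclass
class ColumnChunk:
    encoding: str
    payload: bytes
    # Assumption: stands in for a decoder shipped inside the file.
    # A real reader would load Wasm bytes into a sandboxed runtime instead.
    embedded_decoder: Optional[Decoder] = None


def decode_chunk(chunk: ColumnChunk, allow_embedded: bool = True) -> List[int]:
    """Prefer a native decoder; fall back to the embedded one only if allowed."""
    native = NATIVE_DECODERS.get(chunk.encoding)
    if native is not None:
        return native(chunk.payload)
    if allow_embedded and chunk.embedded_decoder is not None:
        return chunk.embedded_decoder(chunk.payload)
    raise ValueError(f"no decoder available for encoding {chunk.encoding!r}")


def decode_delta_u8(data: bytes) -> List[int]:
    """A 'new' encoding the reader does not know: running sum of the bytes."""
    out: List[int] = []
    current = 0
    for b in data:
        current += b
        out.append(current)
    return out


# The reader has never heard of "delta_u8", but the file carries its decoder.
new_chunk = ColumnChunk("delta_u8", bytes([10, 1, 1, 2]), embedded_decoder=decode_delta_u8)
print(decode_chunk(new_chunk))  # [10, 11, 12, 14]
```

Calling `decode_chunk(new_chunk, allow_embedded=False)` raises `ValueError`, which mirrors a reader that only trusts decoders it ships with.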

Treat F3 as a research direction for mixed analytics and ML workloads, not something ready to displace Parquet in production. If you build data infrastructure, watch the broader trend behind it: formats are moving toward better random access and extensibility, but adoption will hinge on query engine support and a trusted decoder story.

Discussion mood

Curious but mostly skeptical. People liked the ambition and agreed Parquet has real evolutionary limits, but the repo’s weak explanation, the embedded Wasm security model, and the sheer advantage of Parquet compatibility made F3 look more like an interesting research prototype than a near-term production format.

Key insights

  1.

    Parquet’s limits show up outside batch analytics

    Commenters put the pressure point in the right place. Parquet still works well for straightforward analytics, but mixed workloads have changed the target. Data science and ML pipelines often need both long scans and cheap random access against the same data. That is why newer formats like F3, Lance, and Vortex exist at all. The bottleneck is not cosmetic dissatisfaction with Parquet. It is that old reader compatibility makes it hard to add new encodings and hard to optimize for newer access patterns without forking the ecosystem.

    If your workload now mixes warehouse scans with retrieval, feature serving, or model-centric access patterns, benchmark newer formats instead of assuming Parquet is the ceiling. If your job is still mostly Spark-style batch processing, the migration case is much weaker.

      Attribution:
    • aduffy #1 #2
    • sanderjd #1
  2.

    Embedded Wasm is about codec evolution

    The sharpest defense of the Wasm design was that it solves a standards problem, not a language binding problem. Readers already know how to call C or Rust libraries when they choose to. The harder problem is shipping a file with a brand new encoding and having old readers still decode it. In that framing, Wasm is a fallback path that keeps files readable while the ecosystem catches up, rather than the primary execution path for every query.

    Evaluate F3-like ideas on whether they reduce rollout friction for new encodings. If you maintain a data platform, ask how often feature deployment is blocked by slow reader upgrades across teams and tools.

      Attribution:
    • yung_lean #1 #2
    • hahahacorn #1
  3.

    The safety model is mostly resource control

    The credible pro-Wasm case was not "Wasm is magically safe." It was that the main risks are controllable in boring ways. Inline Wasm can be disabled when you only trust known decoders. Wasm runtimes already support memory caps, traps on allocation growth, and instruction metering or timeouts. That shifts the problem from arbitrary native execution to policy and limits. It does not remove the burden, but it makes it operationally legible.

    If you ever consider embedded-decoder formats, insist on a runtime policy layer from day one. You need allowlists, memory ceilings, instruction budgets, and an easy way to refuse fallback execution on untrusted input.

      Attribution:
    • Omega359 #1
    • computomatic #1
    • johncolanduoni #1
    • titzer #1 #2
  4.

    Optimizers still need typed outputs

    The practical decoding contract is narrower than some readers feared. The Wasm path is not meant to return arbitrary application objects. It decodes bytes into typed Arrow buffers or primitive arrays. That keeps the format anchored to columnar data systems. Even so, the performance concern remains real. Once a query engine has to hand control to an opaque decoder, it loses some of the structural visibility that powers SIMD-friendly scans, predicate pushdown, and other read-time tricks.

    Do not treat "decodes to Arrow" as proof that engines will optimize it well. For production adoption, look for evidence that query planners can still skip work instead of paying an opaque decode cost up front.

      Attribution:
    • amluto #1
    • mort96 #1
    • gavinray #1
    • mmaunder #1
  5.

    Archival is the most plausible first niche

    The most believable near-term use is not replacing Parquet as the default operational format. It is long-lived interchange or archival where self-description and bundled decode logic are worth some overhead. People pointed to older systems that bundled procedures with data and to archive formats like RAR that already embed VM bytecode for decoding. In that niche, guaranteed readability can matter more than peak scan speed.

    If you manage cold storage, regulated data exchange, or research datasets that outlive their original stack, self-describing formats deserve a closer look. Just separate that requirement from your hot-path analytics format instead of forcing one format to do both.

      Attribution:
    • bijowo1676 #1
    • nine_k #1
    • jauntywundrkind #1
    • Qerub #1
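
The concern raised under "Optimizers still need typed outputs" can be made concrete with a small Python sketch. It assumes a common columnar technique, per-chunk min/max statistics stored as plain metadata, and compares a filtered scan with and without them. Nothing here is taken from F3 itself: `Chunk`, `scan_greater_than`, and `opaque_decode` are hypothetical names, and the decoder is a trivial stand-in for an opaque decode step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Chunk:
    payload: bytes
    decode: Callable[[bytes], List[int]]
    # (min, max) of the decoded values if the writer stored them as plain
    # metadata; None if the reader can only learn them by decoding.
    stats: Optional[Tuple[int, int]] = None


def scan_greater_than(chunks: List[Chunk], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and how many chunks had to be decoded."""
    matches: List[int] = []
    decoded = 0
    for chunk in chunks:
        # Predicate pushdown: if max <= threshold, nothing in the chunk matches.
        if chunk.stats is not None and chunk.stats[1] <= threshold:
            continue
        decoded += 1
        matches.extend(v for v in chunk.decode(chunk.payload) if v > threshold)
    return matches, decoded


def opaque_decode(data: bytes) -> List[int]:
    """Stand-in for a decoder the engine cannot see inside."""
    return list(data)


def make_chunks(with_stats: bool) -> List[Chunk]:
    raw = [bytes([1, 2, 3]), bytes([10, 20, 30]), bytes([4, 5, 6])]
    return [
        Chunk(p, opaque_decode, (min(p), max(p)) if with_stats else None)
        for p in raw
    ]


print(scan_greater_than(make_chunks(with_stats=True), 9))   # ([10, 20, 30], 1)
print(scan_greater_than(make_chunks(with_stats=False), 9))  # ([10, 20, 30], 3)
```

Both scans return the same rows, but the second one pays the decode cost for every chunk. The question the discussion leaves open is how much of this kind of metadata stays visible to planners when the encoding itself is only understood by an embedded decoder.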

Against the grain

  1.

    Compatibility beats elegance in real deployments

    The most grounded pushback was that file formats win through tooling support, not because they are conceptually cleaner. Parquet’s biggest advantage is not perfection. It is that every engine already speaks it, often through the oldest common subset. From that angle, a new format without a compatibility bridge is dead on arrival no matter how clever its internals are.

    Before adopting any new storage format, map the full reader and writer surface you depend on. If the format does not plug cleanly into that graph, assume migration friction will swamp theoretical performance gains.

      Attribution:
    • vouwfietsman #1 #2
    • chatmasta #1
  2.

    Sandboxed code still expands attack surface

    The hardest security objection was not confusion about what Wasm can access. It was that asking a parser to execute attacker-supplied bytecode creates a whole new class of failure modes. Even if I/O is absent and remote code execution is harder than with native plugins, you still hand attackers a programmable environment for denial of service and a new engine to target. The track record of document and font formats that embed executable code is not reassuring here.

    If your ingestion path accepts third-party data, default to refusing embedded decoders unless there is a strong business reason not to. Security review has to cover the Wasm runtime itself, not just the file format spec.

      Attribution:
    • gavinray #1
    • jasonjayr #1
    • Retr0id #1
    • Kiboneu #1
    • bguebert #1
  3.

    The repo does not clear the trust bar

    A lot of skepticism came from the project surface, not the research idea. The README barely says what F3 is for, examples are thin, development looks quiet, and the benchmark story lives in a paper rather than in an obvious getting-started path. For infrastructure software, that presentation reads as lab work, not something teams can responsibly standardize on.

    Treat maturity signals as part of the technology decision. For a format that wants to sit under petabytes of data, documentation quality, active maintenance, and concrete examples are not polish. They are the product.

      Attribution:
    • adammarples #1
    • yung_lean #1
    • mmaunder #1
    • Arainach #1
    • sph #1
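
The runtime-policy advice from the security threads (allowlists, memory ceilings, instruction budgets, and an easy way to refuse fallback execution on untrusted input) can be sketched as a small decision function in Python. This is a sketch under assumptions, not a reference design: it only makes the allow or refuse decision and does not invoke any Wasm runtime, all names and default limits are hypothetical, and identifying a decoder by the SHA-256 hash of its module bytes is one possible allowlist scheme rather than anything F3 specifies.

```python
import hashlib
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class DecoderPolicy:
    allow_embedded: bool = False               # refuse embedded decoders by default
    trusted_source_only: bool = True           # never run them on untrusted input
    allowed_sha256: FrozenSet[str] = frozenset()  # allowlist of reviewed decoder modules
    max_memory_bytes: int = 64 * 1024 * 1024   # hypothetical memory ceiling
    max_instructions: int = 50_000_000         # hypothetical instruction budget


@dataclass(frozen=True)
class SandboxLimits:
    max_memory_bytes: int
    max_instructions: int


def authorize(
    policy: DecoderPolicy, module_bytes: bytes, source_trusted: bool
) -> Optional[SandboxLimits]:
    """Return the limits to run the decoder under, or None to refuse it."""
    if not policy.allow_embedded:
        return None
    if policy.trusted_source_only and not source_trusted:
        return None
    digest = hashlib.sha256(module_bytes).hexdigest()
    if digest not in policy.allowed_sha256:
        return None
    return SandboxLimits(policy.max_memory_bytes, policy.max_instructions)


module = b"\x00asm placeholder decoder bytes"  # placeholder, not a valid module
reviewed = frozenset({hashlib.sha256(module).hexdigest()})
policy = DecoderPolicy(allow_embedded=True, allowed_sha256=reviewed)

print(authorize(DecoderPolicy(), module, source_trusted=True))  # None
print(authorize(policy, module, source_trusted=False))          # None
print(authorize(policy, module, source_trusted=True) is not None)  # True
```

The returned limits would then be passed to whichever Wasm runtime the reader embeds. How they map onto a specific runtime's memory and metering settings is deliberately left out here.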

In plain English

Arrow buffers
The raw memory blocks used by Apache Arrow to represent typed columnar arrays.
columnar storage
A way of storing tabular data by columns instead of rows, which often speeds up analytics queries that read only some fields.
DuckDB
An in-process analytical database often used to query local files such as Parquet directly.
F3
A proposed columnar data file format designed as an alternative to formats like Parquet.
I/O
Input and output, meaning interaction with files, networks, devices, or other external systems.
Lance
A columnar data format and data system project aimed at analytics and machine learning workloads.
ML
Machine learning, a class of systems that train or run models on data.
Nimble
A newer data format project mentioned as a competitor in columnar analytics.
ORC
Optimized Row Columnar, another columnar file format used for large-scale analytics.
Parquet
A widely used open columnar file format for analytic data processing.
predicate pushdown
A query optimization where filters are applied as early as possible, often while reading data, to skip unnecessary work.
SIMD
Single Instruction, Multiple Data, a CPU feature that applies one operation to many values at once for speed.
Spark
Apache Spark, a distributed data processing engine commonly used for large-scale batch analytics.
vectorized reads
Reading and processing batches of values at a time rather than one record at a time, usually for better performance.
Vortex
A newer columnar data format project mentioned as another attempt to improve on Parquet.
Wasm
WebAssembly, a portable bytecode format that runs inside a virtual machine and is often used with sandboxing.