HN Debrief

How TimescaleDB compresses time-series data

  • Databases
  • Infrastructure
  • Open Source
  • Developer Tools
  • Hardware

The post explains TimescaleDB's newer compressed storage path for time-series workloads. Data is reorganized into a more columnar layout, then encoded with schemes matched to each column's data type, including Gorilla-style compression for timestamps and floating-point values. The headline promise is a storage reduction of up to 98% inside PostgreSQL, aimed at telemetry and IoT datasets where values repeat, move slowly, or can be represented as small deltas.

If you use PostgreSQL for telemetry or IoT, treat compression as a query-engine design choice, not a storage checkbox. Also verify your TimescaleDB package actually includes compression features before you plan around them, especially on distro builds.
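To make "small deltas" concrete, here is a minimal sketch of the two transformations commonly associated with Gorilla-style encoding: delta-of-delta for timestamps and XOR against the previous value for floats. It stops before the bit-packing step that produces the real savings, and it is not TimescaleDB's implementation. The function names and sample readings are hypothetical.

```python
import struct


def encode_timestamps(timestamps):
    """Split timestamps into (first, first_delta, delta_of_deltas).

    Assumes at least two samples. Regularly spaced samples yield
    delta-of-delta values of zero, which a bit-level encoder can
    store in very few bits.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods


def decode_timestamps(first, first_delta, dods):
    """Rebuild the original timestamps; the transformation is lossless."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out


def float_bits(value):
    """Reinterpret a float as its 64-bit IEEE 754 integer pattern."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def xor_stream(values):
    """Keep the first bit pattern, then XOR each value with its predecessor.

    Repeated readings XOR to zero, and slowly moving readings leave
    only a few set bits.
    """
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]


# Hypothetical sensor samples: roughly 10-second spacing, slowly moving value.
ts = [1000, 1010, 1020, 1030, 1041]
first, first_delta, dods = encode_timestamps(ts)
xors = xor_stream([21.5, 21.5, 21.5, 21.75])
```

On this sample the delta-of-delta stream is mostly zeros and the repeated readings XOR to zero, which is the kind of data shape the summary describes. Irregular timestamps or noisy values would produce larger residuals and smaller savings.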

Discussion mood

Mostly positive about the underlying techniques and their practicality for telemetry and analytics. The skepticism was aimed at the marketing-style title and at operational gotchas like licensing and weak JSONB compression, not at the core idea of columnar compression itself.

Key insights

  1. 01

    Compression only wins if queries get cheaper

    The key test is not the compression ratio. It is whether the encoding lets the database skip reads, turn expensive string work into integer filters, or run deterministic functions like UPPER() once on a dictionary instead of once per row. That framing shifts the story from storage savings to execution plan quality, which is where these systems actually earn their keep.

    When evaluating compressed storage, benchmark filtered and aggregated queries, not just disk usage. Ask specifically whether the engine can prune data, operate on dictionary codes, and avoid row-by-row decompression.

      Attribution:
    • gopalv #1
  2. 02

    Metadata and layout do most of the work

    Per-segment stats like min, max, distinct counts, and Bloom filters can answer or prune many analytic queries before any payload is decompressed. The comment also points out that disk layout, top-N early stopping, filter pushdown, and parallel execution are first-order design choices. Compression is only one layer in a much larger optimization stack.

    If you build or buy a time-series store, inspect segment metadata, pruning behavior, and execution strategy. A flashy codec will not rescue a poor on-disk layout or weak pushdown path.

      Attribution:
    • tudorg #1
  3. 03

    Modern compression weakens the case for lossy historians

    For IoT and industrial telemetry, commenters argued that older historian-era compromises came from expensive storage and weak compression for floating-point streams. With Gorilla-style lossless encoding and cheap storage, keeping every sample is often affordable. That changes the default architecture from specialized lossy historians toward general databases and file formats like Parquet-backed Delta tables.

    If you still rely on lossy ingest rules mainly to save space, rerun the math with current codecs and storage costs. You may be able to keep raw signals and simplify downstream analysis and auditing.

      Attribution:
    • heliosAtwork #1
    • lkanwoqwp #1
    • niltecedu #1
  4. 04

    JSONB is still a bad fit

    TimescaleDB has a long-standing issue around JSONB compression, and a commenter with production use says medium-sized JSON blobs remain a pain point. Another comment points to Iceberg Variant encoding proposals from Databricks and Snowflake as a more promising direction. The important idea is to break semi-structured payloads into typed, columnar chunks so the engine can prune and filter them like ordinary columns.

    If your telemetry schema hides most value inside JSONB, do not assume time-series compression will save you. Model hot fields explicitly or watch emerging typed semi-structured formats before committing to large JSON-heavy tables.

      Attribution:
    • PaulWaldman #1
    • kevinob11 #1
    • gopalv #1
  5. 05

    Packaging and licensing can block the feature

    Compression is not just a technical capability. It may be absent from the TimescaleDB package your Linux distribution ships because those builds only include Apache-licensed parts. That creates an easy failure mode where teams think they are evaluating TimescaleDB, but they are really evaluating a stripped-down build.

    Check the exact package and license terms in your environment before designing around compression. Validate feature availability in staging with the same install path you plan to use in production.

      Attribution:
    • self_awareness #1
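The first two insights can be illustrated together. The sketch below is a toy model, not any engine's actual design: it dictionary-encodes a string column so that UPPER() runs once per distinct value, and it keeps per-segment min and max timestamps so that segments outside the time range are skipped before any row is examined. The Segment layout, function names, and device names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One batch of rows plus summary metadata (hypothetical layout)."""
    min_ts: int
    max_ts: int
    device_codes: list  # one dictionary code per row
    timestamps: list


def dictionary_encode(values):
    """Map each distinct string to a small integer code."""
    dictionary, index, codes = [], {}, []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def build_segments(timestamps, codes, rows_per_segment):
    """Cut the columns into fixed-size segments and record min/max per segment."""
    segments = []
    for start in range(0, len(timestamps), rows_per_segment):
        ts = timestamps[start:start + rows_per_segment]
        seg_codes = codes[start:start + rows_per_segment]
        segments.append(Segment(min(ts), max(ts), seg_codes, ts))
    return segments


def count_matching(segments, dictionary, device_upper, ts_from, ts_to):
    """Count rows where UPPER(device) = device_upper and ts is in range.

    Returns (matching_rows, segments_scanned).
    """
    # UPPER() is evaluated once per dictionary entry, not once per row.
    wanted = {code for code, name in enumerate(dictionary)
              if name.upper() == device_upper}
    matches = scanned = 0
    for seg in segments:
        # Metadata-only pruning: skip segments that cannot overlap the range.
        if seg.max_ts < ts_from or seg.min_ts > ts_to:
            continue
        scanned += 1
        for code, ts in zip(seg.device_codes, seg.timestamps):
            # The string predicate has become an integer membership test.
            if code in wanted and ts_from <= ts <= ts_to:
                matches += 1
    return matches, scanned


timestamps = list(range(12))
devices = ["pump-a", "pump-b", "Pump-A"] * 4
dictionary, codes = dictionary_encode(devices)
segments = build_segments(timestamps, codes, rows_per_segment=4)
result = count_matching(segments, dictionary, "PUMP-A", ts_from=4, ts_to=7)
```

Here the query touches one of three segments and calls upper() three times instead of twelve. When benchmarking a real system, the equivalent numbers to look for are segments or chunks skipped and rows actually decompressed.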

Against the grain

  1. 01

    The headline ratio reads like marketing

    The criticism is that 'up to 98%' is the kind of claim that hides workload dependence and says little about what a real deployment should expect. The reply says the number came from an actual MQTT-backed database, but that still leaves the central problem: without query benchmarks and data-shape context, the ratio is hard to interpret.

    Treat headline compression numbers as anecdotal until you see your own data distribution and query mix. Ask for before-and-after results on latency, CPU, and scan volume, not just storage charts.

      Attribution:
    • robocat #1
    • lkanwoqwp #1
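A small experiment shows why a headline ratio says little without data-shape context. For reference, a 98% reduction corresponds to a ratio of 50:1, since savings = 1 - 1/ratio. The sketch below delta-encodes two synthetic integer series and compresses them with zlib, which is only a stand-in codec and not what TimescaleDB uses. Both series are made up for illustration.

```python
import random
import struct
import zlib


def compression_ratio(values):
    """Delta-encode 64-bit integers, deflate them, and return raw/compressed size."""
    deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
    raw = struct.pack(f"<{len(values)}q", *values)
    encoded = struct.pack(f"<{len(deltas)}q", *deltas)
    return len(raw) / len(zlib.compress(encoded))


def savings_percent(ratio):
    """Convert a compression ratio such as 50.0 into percent storage saved."""
    return 100.0 * (1.0 - 1.0 / ratio)


# Series 1: perfectly regular timestamps, the best case for delta encoding.
steady = [1_700_000_000 + 10 * i for i in range(10_000)]

# Series 2: unrelated random values, close to the worst case.
rng = random.Random(0)
noisy = [rng.randrange(2**62) for _ in range(10_000)]

steady_ratio = compression_ratio(steady)
noisy_ratio = compression_ratio(noisy)
```

The same pipeline yields a very high ratio on the regular series and roughly no gain on the random one, so a single number such as "up to 98%" mostly describes the dataset it was measured on.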

In plain English

Apache
Here, the Apache License, a permissive open source software license.
Databricks
A data platform company that builds analytics and AI tools, closely associated with Apache Spark and Delta Lake.
Delta tables
Tables managed with the Delta Lake format, which adds metadata and transactional features on top of file-based data storage.
filter pushdown
Applying query filters as early as possible, often inside the storage or decompression layer, to avoid reading unnecessary data.
Gorilla
A compression scheme for time-series data, introduced by Facebook, that stores timestamps as delta-of-delta values and floating-point values as XORs against the previous value, both packed with bit-level encoding.
Iceberg
Apache Iceberg, an open table format for large analytic datasets stored in files like Parquet.
IoT
Internet of Things, meaning networks of connected physical devices such as sensors and machines that collect and transmit data.
JSONB
PostgreSQL’s binary storage format for JSON data, designed for querying and indexing semi-structured documents.
MQTT
Message Queuing Telemetry Transport, a lightweight messaging protocol often used by sensors and IoT devices.
OT
Operational Technology, meaning industrial systems and control environments such as factories, plants, or field equipment.
Parquet
A columnar file format commonly used for analytic data processing.
partition pruning
Skipping whole partitions of data during a query because metadata shows they cannot match the filter.
PostgreSQL
A widely used open source relational database system, often shortened to Postgres or PG.
Snowflake
A cloud data warehouse platform for analytics and large-scale SQL workloads.
swinging-door compression
A lossy time-series compression method used in industrial historians that drops points while keeping values within an error bound.
time-series
Data recorded over time, usually as timestamped measurements or events.
TimescaleDB
A PostgreSQL extension built for time-series data such as metrics, events, and sensor readings.
Variant encoding
A proposed way to store semi-structured data in typed, more query-friendly form inside columnar systems.

Reference links

Related time-series engine work

  • xataio/deltax
    A PostgreSQL extension project mentioned by a commenter as another attempt to optimize time-series analytics and ClickBench performance.
