HN Debrief

Launch HN: Expanse (YC P26) – Unlock Wasted GPU Capacity

  • AI
  • Infrastructure
  • Developer Tools
  • Startups

Expanse says GPU and HPC clusters waste huge amounts of capacity because users request far more walltime, memory, and compute than their jobs usually need. The product installs alongside SLURM or Kubernetes, reads submission scripts, source code, and live node telemetry, then predicts resource needs, likely failures, and code-level fixes before the job runs. The company's key claim is that this is not an LLM wrapper but a cluster-specific multimodal model that learns how a particular environment behaves, because the same workload can perform very differently across hardware topologies.

If you operate expensive shared compute, the biggest near-term win may be better prediction and visibility around memory, walltime, and bursty job phases rather than smarter scheduling alone. If you build in this market, security posture and deployment model need to be legible up front or buyers will dismiss you before they get to the technical value.

Discussion mood

The mood was mostly positive about the technical problem and skeptical about the business and deployment details. People who use clusters recognized over-allocation as real and painful, but they pressed on whether the product can meet security expectations, fit how HPC users actually behave, and deliver value when incentives to optimize are weak.

Key insights

  1.

    Burst-shaped jobs create hidden waste

    A lot of wasted capacity sits inside individual jobs, not in obviously idle machines. Real workloads like genomics pipelines swing between CPU-heavy, memory-heavy, and IO-bound phases, but schedulers usually force users to reserve the peak profile for the whole run. That makes a job look fully allocated even when the hardware is only lightly used for large chunks of its walltime.

    Look for waste inside long-running jobs before assuming the main problem is cluster-wide placement. Profiling resource usage over time is likely to unlock more capacity than static per-job averages.

      Attribution:
    • mbreese #1 #2
  2.

    User incentives fight utilization gains

    Researchers usually care about shortest time to result and least operational hassle. If the cluster does not bill them directly for waste, a sloppy bash pipeline that runs today often beats a carefully decomposed workflow that uses fewer resources. That is why over-allocation persists even when everybody knows it is inefficient.

    Products in this space should reduce tuning work for users instead of expecting behavior change. If you run a shared cluster, pair recommendations with policy or pricing levers if you want utilization to actually move.

      Attribution:
    • mbreese #1
  3.

    Security posture must be obvious immediately

    The sharpest commercial feedback came from someone who read the homepage and docs and still assumed risky telemetry egress and SaaS dependency. Expanse replied that deployments are air-gapped, data stays in the customer environment, and the daemon is not required for jobs to run. The gap between those two readings is the important signal. Enterprise buyers will reject the product on first impression if the architecture is not unmistakable.

    For infrastructure sold into sensitive environments, lead with deployment boundaries, data flow, and failure modes before the optimization story. Put the architecture diagram where buyers can see it in the first minute.

      Attribution:
    • mike_d #1
    • ismaeel_bashir #1
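
The arithmetic behind the first insight can be sketched in a few lines of Python. This is an illustration only, not Expanse's method or anything posted in the thread: the `Phase` type, the phase names, and every number are made-up assumptions. It compares what a job uses in each phase with a reservation sized to the peak of every phase for the whole run.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    """One phase of a job; all figures are made-up illustrative values."""
    name: str
    hours: float
    cpu_cores_used: float
    mem_gb_used: float


def reserved_vs_used(phases, reserved_cores, reserved_mem_gb):
    """Return the fractions of reserved core-hours and GB-hours actually used."""
    total_hours = sum(p.hours for p in phases)
    core_hours_used = sum(p.hours * p.cpu_cores_used for p in phases)
    mem_gb_hours_used = sum(p.hours * p.mem_gb_used for p in phases)
    return (
        core_hours_used / (reserved_cores * total_hours),
        mem_gb_hours_used / (reserved_mem_gb * total_hours),
    )


# Hypothetical three-phase pipeline: CPU-heavy, memory-heavy, then IO-bound.
phases = [
    Phase("align", hours=4, cpu_cores_used=32, mem_gb_used=16),
    Phase("sort", hours=2, cpu_cores_used=4, mem_gb_used=128),
    Phase("write", hours=2, cpu_cores_used=2, mem_gb_used=8),
]

# The request has to cover the peak of every phase for the entire run.
reserved_cores = max(p.cpu_cores_used for p in phases)
reserved_mem_gb = max(p.mem_gb_used for p in phases)

cpu_fraction, mem_fraction = reserved_vs_used(phases, reserved_cores, reserved_mem_gb)
print(f"CPU used: {cpu_fraction:.0%} of reserved core-hours")
print(f"Memory used: {mem_fraction:.0%} of reserved GB-hours")
```

With these assumed numbers the job uses about 55% of its reserved core-hours and 33% of its reserved GB-hours, even though the scheduler counts the reservation as fully allocated for all eight hours.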

Against the grain

  1.

    Cloud providers already do placement optimization

    Large clouds and newer GPU providers are not ignoring this problem. They already use oversubscription and smarter placement to squeeze more out of fleets, so the easy wins may be gone in environments that own the full stack. That shifts the strongest use case toward clusters where user-level job requests and on-prem workflow habits create waste the provider cannot automatically smooth away.

    Do not assume every low-utilization compute environment needs a new prediction layer. Separate managed cloud fleets from research and enterprise clusters where scheduling input quality is the real bottleneck.

      Attribution:
    • nostrebored #1
    • aleksiy123 #1
  2.

    Low utilization is not always waste

    Some spare capacity is intentional. Operators may hold back headroom for disaster recovery, failover, or future demand spikes. Expanse said its measurements target waste inside already allocated user jobs rather than reserved idle capacity, which is an important distinction because headline utilization numbers can otherwise overstate the addressable problem.

    When evaluating utilization products, ask whether they reduce over-requesting inside jobs or just count capacity that was intentionally reserved. Those are different problems with different buyers.

      Attribution:
    • flounder3 #1
    • ismaeel_bashir #1
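
The distinction in the second point, between deliberate headroom and over-requesting inside jobs, comes down to simple accounting. The sketch below is a hedged illustration with made-up weekly figures for a hypothetical cluster. The function name and bucket labels are assumptions, not terms used by Expanse or the commenters.

```python
def idle_breakdown(total, held_back, allocated, used):
    """Split unused GPU-hours into three buckets (all inputs in GPU-hours)."""
    if not (0 <= used <= allocated and allocated + held_back <= total):
        raise ValueError("inconsistent capacity figures")
    return {
        "intentional_headroom": held_back,
        "unallocated": total - held_back - allocated,
        "over_requested_in_jobs": allocated - used,
    }


# Made-up weekly figures for a hypothetical cluster, in GPU-hours.
total, held_back, allocated, used = 1000, 200, 700, 350

breakdown = idle_breakdown(total, held_back, allocated, used)
headline_utilization = used / total
in_job_efficiency = used / allocated

print(f"Headline utilization: {headline_utilization:.0%}")
print(f"Efficiency inside allocated jobs: {in_job_efficiency:.0%}")
for bucket, gpu_hours in breakdown.items():
    print(f"{bucket}: {gpu_hours} GPU-hours")
```

With these assumed figures, 650 GPU-hours go unused, but only 350 of them are over-requests inside allocated jobs. The 200 GPU-hours of deliberate headroom are not a problem that a request-prediction layer addresses.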

In plain English

GPU
Graphics Processing Unit, a processor that is often used for parallel math workloads like machine learning.
HPC
High-performance computing, which means large shared systems used for heavy scientific, engineering, or data processing workloads.
IO
Input and output, meaning data reads and writes to storage or network devices.
Kubernetes
An orchestration system that automates deployment and scheduling of software across clusters of machines.
LLM
Large language model, a machine learning system trained on large amounts of text that can generate and analyze language and code.
multimodal
Able to work with more than one kind of data, such as text, images, audio, or video in the same model.
SLURM
Originally an acronym for Simple Linux Utility for Resource Management, a common job scheduler for HPC clusters that queues and allocates compute resources.
telemetry
Operational data collected from systems while they run, such as memory use, GPU utilization, or network activity.
VPC
Virtual private cloud, a logically isolated network environment inside a cloud provider.
