HN Debrief

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

  • AI
  • Developer Tools
  • Open Source
  • Infrastructure
  • Economics

Xiaomi says its new MiMo-v2.5-Pro-UltraSpeed mode pushes a roughly 1-trillion-parameter mixture-of-experts coding model past 1,000 tokens per second, using a stack of software and model-side optimizations rather than exotic hardware. The post highlights FP4 quantization applied mainly to the experts, multi-token prediction, and its TileRT runtime. Pricing is only about 3x the regular tier, which is why the headline landed. Readers saw the announcement as another sign that Chinese providers are forcing down both price and latency for models that are now close enough in quality for many coding tasks.

If you build agent workflows, start measuring end-to-end cycle time instead of model speed alone. Inference is now fast enough to change UX and orchestration, but the next bottlenecks are test latency, review, and the cost of verifying much more output.
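
A minimal Python sketch of what measuring end-to-end cycle time could look like. The `LoopProfiler` class and the stage names are hypothetical and do not come from the post or the thread; the sketch only accumulates wall-clock time per stage of an agent iteration so you can see which stage dominates.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LoopProfiler:
    """Accumulates wall-clock time per named stage of an agent loop."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the profiler can be tested deterministically.
        self._clock = clock
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        """Time the enclosed block and add it to the total for `name`."""
        start = self._clock()
        try:
            yield
        finally:
            self.totals[name] += self._clock() - start

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}

    def bottleneck(self):
        """Return the stage with the most accumulated time, or None."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)
```

Wrapping each phase of your own loop, for example `with profiler.stage("inference"): ...` and `with profiler.stage("tests"): ...`, and then reading `profiler.shares()` after a few iterations shows whether inference is still the dominant cost or whether tests and tool calls have taken over.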

Discussion mood

Mostly excited, with a practical edge. People liked the prospect of near-instant coding and voice workflows, but the enthusiasm was tempered by two recurring concerns: model speed is not the real bottleneck in many software loops, and faster generation just shifts the pain to verification, review, and costs.

Key insights

  1. Fast inference shifts pain to build loops

    Once the model is quick enough to feel interactive, the waiting moves to compiles, CI, tests, MCP calls, and other external tooling. That changes what to optimize next. Teams that want the benefit of ultrafast models need to treat build and test latency as first-class product work, not a background annoyance.

    Profile your agent loop end to end, including tests and tool calls. If inference is no longer dominant, spend engineering effort on parallel tests, better caches, and tighter validation harnesses before paying more for even faster models.

      Attribution:
    • skybrian #1
    • switchbak #1
    • erikus #1
    • efromvt #1
  2. Tokens per second is the wrong buying metric

    Throughput looks impressive, but it does not tell you how much work gets done per dollar or per minute. Some models finish tasks quickly because they are terse and reliable. Others burn huge token budgets, struggle with agent harnesses, or only look good in one-shot benchmarks. The more useful comparison is completed task cost and elapsed time under your own workflow.

    Evaluate models on a fixed set of real tasks with your own harness. Track wall-clock time, token consumption, retries, and manual cleanup instead of relying on provider TPS claims or headline benchmarks.

      Attribution:
    • sarjann #1
    • SwellJoe #1
    • gertlabs #1
    • _pdp_ #1
  3. Speed makes guardrails mandatory

    The faster the model, the less time a human has to catch it going off the rails. That pushes teams toward a staged workflow where one model plans, another executes, and tests and spec documents narrow the space before any write access happens. A second model can cheaply review for disagreements and surface only the real ambiguities.

    Separate planning from execution in your tooling. Require specs, tests, and review checkpoints before giving a fast model broad edit or API privileges.

      Attribution:
    • coderbants #1
    • bendangelo #1
    • noncoml #1
    • petesergeant #1
  4. Voice and live UI are the first obvious winners

    Coding is not the only use case where latency matters. Voice assistants have a brutally tight delay budget, and live prototyping gets much better when you can say "make the font bigger" and watch it update immediately. Speed unlocks categories where a slow frontier model feels unusable even if its answers are better.

    If you work on voice, support, or interactive design tools, revisit product ideas you previously dropped because of latency. Some now become viable even before model quality takes another leap.

      Attribution:
    • prplfsh #1
    • philipkglass #1
    • eli #1
  5. Fast models improve flow only if used deliberately

    The better framing was not worker replacement but workflow shape. Slow agents encourage tab hoarding and context switching. Faster ones let people stay on one task, use AI for deeper exploration, and add polish, tests, and docs instead of just spraying prompts at a slot machine. The value comes from using speed to stay in flow, not to cram more churn into the day.

    Design team practices around fewer parallel threads and tighter human feedback cycles. If people are using speed only to produce more output, you are capturing the wrong benefit.

      Attribution:
    • dakiol #1
    • powerapple #1
    • dilyevsky #1
    • enraged_camel #1
  6. Chinese pricing is becoming strategic pressure

    Readers increasingly see Chinese labs as changing the market, not just offering bargain alternatives. Lower prices, open weights, and fewer fears of vendor lock-in make them attractive to companies already frustrated by rising prices and product churn from US providers. Even if the very top US models still lead, the economic moat looks weaker when "good enough" gets this cheap and fast.

    Avoid assuming the closed US labs will remain safe default vendors. Build a multi-model stack and keep an exit path to open-weight or lower-cost providers if pricing or product terms move against you.

      Attribution:
    • amunozo #1
    • hobofan #1
    • kypro #1
    • MangoCoffee #1
  7. Selective access could become a competitive choke point

    Several people were less worried about the benchmark than about who gets access to these speeds. If near-frontier models at 1,000 TPS open new classes of products, gated "ultra-speed" tiers turn compute allocation into a market power lever. The gap may matter even if the underlying model quality gap is small.

    Watch access terms as closely as model quality. A vendor with limited fast capacity can shape which startups can ship low-latency products and which are stuck waiting.

      Attribution:
    • h14h #1
    • __natty__ #1
    • OtomotO #1
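
The buying-metric point above comes down to comparing models by completed work rather than raw throughput. One way to sketch that in Python, assuming you already log every attempt from your own harness: `TaskRun`, `summarize`, and the price parameters are illustrative names, and no real provider prices are implied.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at a task from a fixed evaluation set (hypothetical schema)."""
    succeeded: bool
    wall_seconds: float
    input_tokens: int
    output_tokens: int
    retries: int = 0


def summarize(runs, usd_per_m_input, usd_per_m_output):
    """Aggregate runs into per-completed-task cost and elapsed time.

    Failed runs still count toward cost and time, because you pay for
    them whether or not they produce usable output.
    """
    if not runs:
        raise ValueError("need at least one run")
    cost = sum(
        r.input_tokens / 1e6 * usd_per_m_input
        + r.output_tokens / 1e6 * usd_per_m_output
        for r in runs
    )
    seconds = sum(r.wall_seconds for r in runs)
    done = sum(1 for r in runs if r.succeeded)
    return {
        "success_rate": done / len(runs),
        "usd_per_completed_task": cost / done if done else float("inf"),
        "seconds_per_completed_task": seconds / done if done else float("inf"),
        "retries_per_task": sum(r.retries for r in runs) / len(runs),
    }
```

Running the same task set through two models and comparing `usd_per_completed_task` and `seconds_per_completed_task` side by side captures the terse-but-reliable versus fast-but-wasteful distinction that a tokens-per-second figure hides. Manual cleanup time is not modeled here and would need its own field.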

Against the grain

  1. The technical novelty may be overstated

    Some readers were not convinced Xiaomi unveiled a breakthrough so much as a polished bundle of known ideas. FP4 quantization for MoE models, persistent kernels, and pipelined GPU scheduling are not new on their own. The hard part is integration at this scale, but the post did not make it easy to see which piece is genuinely novel versus good engineering and good marketing.

    Do not assume a durable moat from one flashy systems post. Treat these announcements as evidence that optimization techniques are diffusing fast across the market.

      Attribution:
    • jbellis #1
    • lostmsu #1
    • buildbot #1
  2. Most teams will not pay much for extra speed

    A few commenters argued that coding output is already fast enough for mainstream software work. If your real constraints are product decisions, review, testing, and deployment, paying a premium for 10x faster text generation will not produce a proportional business gain. The strongest demand may come from niche interactive use cases rather than ordinary app development.

    Before upgrading to premium low-latency tiers, quantify whether model delay is really hurting delivery. Many teams will get more from fixing process and validation than from buying another speed tier.

      Attribution:
    • HarHarVeryFunny #1 #2
    • PhunkyPhil #1
    • harel #1
  3. Verification still dominates nontrivial software

    Skeptics pushed back on the idea that instant generation changes the essence of software work. For real products, the hard part is knowing what to build, checking edge cases, and trusting the result in production. Faster generation can even worsen the problem by flooding teams with output they are less able to evaluate than before.

    Use fast models where acceptance criteria are crisp and easy to test. For ambiguous product work, keep human ownership on requirements and validation rather than assuming speed will close the gap.

      Attribution:
    • overgard #1
    • DenisM #1
    • unglaublich #1
    • oulipo2 #1
  4. Higher output can make the work less satisfying

    Not everyone saw speed as progress. Some developers said the more the chatbot does, the less pride they feel in the result. The issue was not productivity but authorship. If the machine produced the artifact, the emotional reward of making something yourself can disappear even when the shipped outcome is better.

    If you lead a team, do not measure success only in throughput. People who care about craft may disengage if their role becomes prompt, review, and cleanup all day.

      Attribution:
    • fullstop #1
    • dd8601fn #1 #2
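
The "already fast enough" argument above is essentially Amdahl's law: speeding up one stage helps only in proportion to that stage's share of total cycle time. A small Python sketch of the arithmetic follows; the function name is illustrative, and any inference share you plug in should come from your own measurements rather than from the thread.

```python
def overall_speedup(inference_fraction, inference_speedup):
    """Amdahl's law: end-to-end speedup when only inference gets faster.

    inference_fraction: share of total cycle time spent waiting on the model (0..1).
    inference_speedup: how many times faster the new model tier generates.
    """
    if not 0.0 <= inference_fraction <= 1.0:
        raise ValueError("inference_fraction must be between 0 and 1")
    if inference_speedup <= 0:
        raise ValueError("inference_speedup must be positive")
    # Time that is unaffected, plus the shrunken inference time.
    remaining = (1.0 - inference_fraction) + inference_fraction / inference_speedup
    return 1.0 / remaining
```

As an assumed example, if inference is 20% of your cycle time, `overall_speedup(0.2, 10)` gives roughly 1.22, so a 10x faster model makes the whole loop only about 22% faster. The premium tier pays off only when the measured inference share is large.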

In plain English

CI
Continuous Integration, the automated process that runs builds and tests when code changes are submitted.
FP4
4-bit floating point, a very low-precision number format used to shrink model memory use and serving cost.
GPU
Graphics Processing Unit, a processor that is often used for parallel math workloads like machine learning.
MCP
Model Context Protocol, a way for AI tools to connect to external tools, data sources, or services.
MoE
Mixture of Experts, a model architecture that routes each input through only some of its expert components.
open weights
A model release where the trained parameters are published so others can run or adapt the model themselves.
ROI
Return on Investment, a measure of whether the savings or gains from an expense justify the upfront cost.
TileRT
Xiaomi’s runtime system for serving these models, intended to improve inference speed on GPUs.
tokens per second
A measure of how quickly a language model generates text, counted in small chunks of words or characters called tokens.
TPS
Tokens per second, a common measure of model output speed.

Reference links

Fast inference hardware and demos

  • Cerebras Kimi K2.6 speed post
    Shared to compare Xiaomi’s claimed speed with Cerebras’ own throughput claims
  • Taalas
    Mentioned as another company pushing extreme inference speed through hardware specialization
  • ChatJimmy
    Linked as a public demo of very high-speed model inference