MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

AI
Developer Tools
Open Source
Infrastructure
Economics

Xiaomi says its new MiMo-v2.5-Pro-UltraSpeed mode pushes a roughly 1-trillion-parameter mixture-of-experts coding model past 1,000 tokens per second, using a stack of software and model-side optimizations rather than exotic hardware. The post highlights FP4 quantization applied mainly to the experts, multi-token prediction, and its TileRT runtime. Pricing is only about 3x the regular tier, which is why the headline landed. Readers took the announcement as another sign that Chinese providers are forcing down both price and latency for models that are now close enough in quality for many coding tasks.

The useful conclusion was not "faster is always better." It was that sub-second or near-instant responses change how people actually work. Several people using DeepSeek Flash, Gemini Flash, Cerebras, and Groq said fast models keep them in a single-task flow instead of opening three tabs and context-switching while agents grind away. That makes coding agents feel less like batch jobs and more like an interactive tool. But once inference gets this fast, the bottleneck moves immediately to compiles, CI, flaky tests, tool calls, and human review. A lot of current "agent time" is really waiting on those loops.

People were also clear that raw tokens per second is a misleading metric by itself. A model that burns far more tokens to finish a task can feel slower and costlier than a nominally slower model. Benchmarks and anecdotes in the comments pointed in different directions depending on provider load, harness design, and whether the task was one-shot coding or iterative agent work. That fed a broader point that the field still lacks a good ROI metric. Buyers want to know cost per completed task, not benchmark scores or a flashy throughput number.

On capability, the consensus was that speed amplifies both upside and failure modes. Fast models are great for cheap validation passes, repeated attempts, live UI iteration, and harnesses that loop through tests until something passes. They are also great at making bad changes before you can stop them. That pushes teams toward more structure around agents. People described planning modes, written specs, test-first guardrails, and even a second model reviewing the first. The shared instinct was that if generation becomes cheap and instant, verification becomes the real product.

The business angle got almost as much attention as the engineering. Many readers think US labs are losing their pricing power as Chinese open-weight or cheaper hosted models get close enough on coding work. Complaints about rising Copilot, Gemini, Claude, and GPT costs showed up repeatedly, along with frustration at closed APIs and provider churn. Even commenters who doubted Xiaomi’s exact numbers still read the announcement as pressure on incumbents. Competing on throughput is now a real axis, not a side show.

If you build agent workflows, start measuring end-to-end cycle time instead of model speed alone. Inference is now fast enough to change UX and orchestration, but the next bottlenecks are test latency, review, and the cost of verifying much more output.

June 8, 2026
mimo.xiaomi.com
Discuss on HN

Discussion mood

Mostly excited, with a practical edge. People liked the prospect of near-instant coding and voice workflows, but the enthusiasm was tempered by two recurring concerns: model speed is not the real bottleneck in many software loops, and faster generation just shifts the pain to verification, review, and costs.

Key insights

Fast inference shifts pain to build loops

Once the model is quick enough to feel interactive, the waiting moves to compiles, CI, tests, MCP calls, and other external tooling. That changes what to optimize next. Teams that want the benefit of ultrafast models need to treat build and test latency as first-class product work, not as a background annoyance.

Profile your agent loop end to end, including tests and tool calls. If inference is no longer dominant, spend engineering effort on parallel tests, better caches, and tighter validation harnesses before paying more for even faster models.
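
As a minimal sketch of that profiling step, the Python below accumulates wall-clock time per phase of an agent loop and reports which phase dominates. The class name LoopProfiler and the phase labels are hypothetical, chosen for illustration; nothing here comes from Xiaomi's post or from any specific agent framework.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class LoopProfiler:
    """Accumulates wall-clock seconds per named phase of an agent loop."""

    def __init__(self):
        # Maps a phase label (hypothetical, e.g. "inference") to total seconds.
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        """Time the enclosed block and add it to the named phase."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}

    def dominant(self):
        """Return the phase with the most accumulated time, or None."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)
```

Wrapping each step in a block such as `with profiler.phase("inference"):` or `with profiler.phase("tests"):` and then checking `profiler.dominant()` shows where the loop actually waits. If the answer is not inference, a faster model tier is unlikely to be the best next purchase.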

Attribution:

skybrian #1
switchbak #1
erikus #1
efromvt #1

Tokens per second is the wrong buying metric

Throughput looks impressive, but it does not tell you how much work gets done per dollar or per minute. Some models finish tasks quickly because they are terse and reliable. Others burn huge token budgets, struggle with agent harnesses, or only look good in one-shot benchmarks. The more useful comparison is completed task cost and elapsed time under your own workflow.

Evaluate models on a fixed set of real tasks with your own harness. Track wall-clock time, token consumption, retries, and manual cleanup instead of relying on provider TPS claims or headline benchmarks.
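
One way to turn that tracking into a single comparable number is cost and elapsed time per completed task. The sketch below assumes you log each attempt yourself; the TaskRun fields and the per-million-token prices are hypothetical inputs, not figures from any provider.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One attempt at a task in your own harness (hypothetical record)."""
    completed: bool
    wall_seconds: float
    input_tokens: int
    output_tokens: int
    retries: int = 0


def run_cost(run, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one run, given assumed prices per million tokens."""
    return (
        run.input_tokens * usd_per_m_input
        + run.output_tokens * usd_per_m_output
    ) / 1_000_000


def summarize(runs, usd_per_m_input, usd_per_m_output):
    """Aggregate runs into per-completed-task cost and time.

    Failed runs still count toward total cost and time, because you
    pay for them even though they deliver nothing.
    """
    completed = sum(1 for r in runs if r.completed)
    total_cost = sum(run_cost(r, usd_per_m_input, usd_per_m_output) for r in runs)
    total_seconds = sum(r.wall_seconds for r in runs)
    return {
        "attempted": len(runs),
        "completed": completed,
        "total_retries": sum(r.retries for r in runs),
        "cost_per_completed_task": total_cost / completed if completed else None,
        "seconds_per_completed_task": total_seconds / completed if completed else None,
    }
```

Running the same task set through two models and comparing the two summaries makes token-hungry or retry-prone behavior visible in a way a tokens-per-second figure does not.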

Attribution:

sarjann #1
SwellJoe #1
gertlabs #1
_pdp_ #1

Speed makes guardrails mandatory

The faster the model, the less time a human has to catch it going off the rails. That pushes teams toward a staged workflow where one model plans, another executes, and tests and spec documents narrow the space before any write access happens. A second model can cheaply review for disagreements and surface only the real ambiguities.

Separate planning from execution in your tooling. Require specs, tests, and review checkpoints before giving a fast model broad edit or API privileges.
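
A checkpoint gate of this kind can be very small. The sketch below is one possible shape, assuming your tooling can report whether a spec, tests, and a plan review exist for a change; the ChangeRequest fields and the Access levels are hypothetical names, not part of any particular agent product.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"


@dataclass
class ChangeRequest:
    """State of a proposed change before the executing model runs."""
    spec_written: bool
    tests_defined: bool
    plan_reviewed: bool


def missing_checkpoints(req):
    """List the checkpoints that have not been satisfied yet."""
    checks = {
        "spec": req.spec_written,
        "tests": req.tests_defined,
        "plan review": req.plan_reviewed,
    }
    return [name for name, ok in checks.items() if not ok]


def allowed_access(req):
    """Grant write access only when every checkpoint has passed."""
    if missing_checkpoints(req):
        return Access.READ_ONLY
    return Access.WRITE
```

The design choice is that the default is read-only: a fast model can explore and plan freely, but edits stay blocked until the cheaper human-checkable artifacts exist.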

Attribution:

coderbants #1
bendangelo #1
noncoml #1
petesergeant #1

Voice and live UI are the first obvious winners

Coding is not the only use case where latency matters. Voice assistants have a brutally tight delay budget, and live prototyping gets much better when you can say "make the font bigger" and watch it update immediately. Speed unlocks categories where a slow frontier model feels unusable even if its answers are better.

If you work on voice, support, or interactive design tools, revisit product ideas you previously dropped because of latency. Some now become viable even before model quality takes another leap.

Attribution:

prplfsh #1
philipkglass #1
eli #1

Fast models improve flow only if used deliberately

The better framing was not worker replacement but workflow shape. Slow agents encourage tab hoarding and context switching. Faster ones let people stay on one task, use AI for deeper exploration, and add polish, tests, and docs instead of just spraying prompts at a slot machine. The value comes from using speed to stay in flow, not to cram more churn into the day.

Design team practices around fewer parallel threads and tighter human feedback cycles. If people are using speed only to produce more output, you are capturing the wrong benefit.

Attribution:

dakiol #1
powerapple #1
dilyevsky #1
enraged_camel #1

Chinese pricing is becoming strategic pressure

Readers increasingly see Chinese labs as changing the market, not just offering bargain alternatives. Lower prices, open weights, and fewer fears of vendor lock-in make them attractive to companies already frustrated by rising prices and product churn from US providers. Even if the very top US models still lead, the economic moat looks weaker when "good enough" gets this cheap and fast.

Avoid assuming the closed US labs will remain safe default vendors. Build a multi-model stack and keep an exit path to open-weight or lower-cost providers if pricing or product terms move against you.
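
At the code level, an exit path can start as a thin wrapper that treats every provider as an interchangeable callable. This is only a sketch under that assumption: the function name is hypothetical, and each provider is represented as a plain function from prompt to text rather than any real vendor SDK.

```python
def complete_with_fallback(prompt, providers):
    """Try each provider in order and return the first successful result.

    `providers` is a list of (name, call) pairs, where `call` is any
    function that takes a prompt string and returns a completion string.
    Returns a (provider_name, completion) tuple.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Keeping application code dependent on this one function, rather than on a vendor client, means that reordering or replacing providers is a configuration change instead of a rewrite.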

Attribution:

amunozo #1
hobofan #1
kypro #1
MangoCoffee #1

Selective access could become a competitive choke point

Several people were less worried about the benchmark than about who gets access to these speeds. If near-frontier models at 1,000 TPS open new classes of products, gated "ultra-speed" tiers turn compute allocation into a market power lever. The gap may matter even if the underlying model quality gap is small.

Watch access terms as closely as model quality. A vendor with limited fast capacity can shape which startups can ship low-latency products and which are stuck waiting.

Attribution:

h14h #1
__natty__ #1
OtomotO #1

Against the grain

The technical novelty may be overstated

Some readers were not convinced Xiaomi unveiled a breakthrough so much as a polished bundle of known ideas. FP4 quantization for MoE models, persistent kernels, and pipelined GPU scheduling are not new on their own. The hard part is integration at this scale, but the post did not make it easy to see which piece is genuinely novel versus good engineering and good marketing.

Do not assume a durable moat from one flashy systems post. Treat these announcements as evidence that optimization techniques are diffusing fast across the market.

Attribution:

jbellis #1
lostmsu #1
buildbot #1

Most teams will not pay much for extra speed

A few commenters argued that coding output is already fast enough for mainstream software work. If your real constraints are product decisions, review, testing, and deployment, paying a premium for 10x faster text generation will not produce a proportional business gain. The strongest demand may come from niche interactive use cases rather than ordinary app development.

Before upgrading to premium low-latency tiers, quantify whether model delay is really hurting delivery. Many teams will get more from fixing process and validation than from buying another speed tier.
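
A quick way to quantify this is Amdahl's law: if inference is only a fraction of each delivery cycle, speeding it up improves the whole cycle by much less than the headline factor. The sketch below assumes you have already measured that fraction; the example numbers are illustrative, not data from the discussion.

```python
def end_to_end_speedup(inference_fraction, inference_speedup):
    """Overall cycle speedup when only the inference share gets faster.

    inference_fraction: share of current cycle time spent waiting on the
        model, between 0.0 and 1.0 (measured in your own workflow).
    inference_speedup: how many times faster the new tier generates.
    """
    if not 0.0 <= inference_fraction <= 1.0:
        raise ValueError("inference_fraction must be between 0 and 1")
    if inference_speedup <= 0:
        raise ValueError("inference_speedup must be positive")
    # Time left after the upgrade, as a fraction of the original cycle.
    remaining = (1.0 - inference_fraction) + inference_fraction / inference_speedup
    return 1.0 / remaining


# Illustrative assumption: the model accounts for 20% of cycle time.
print(round(end_to_end_speedup(0.2, 10), 2))  # about 1.22
```

Under that assumed 20% share, a 10x faster tier shortens the full cycle by roughly 18%, which is the number to weigh against the price premium.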

Attribution:

HarHarVeryFunny #1 #2
PhunkyPhil #1
harel #1

Verification still dominates nontrivial software

Skeptics pushed back on the idea that instant generation changes the essence of software work. For real products, the hard part is knowing what to build, checking edge cases, and trusting the result in production. Faster generation can even worsen the problem by flooding teams with output they are less able to evaluate than before.

Use fast models where acceptance criteria are crisp and easy to test. For ambiguous product work, keep human ownership on requirements and validation rather than assuming speed will close the gap.

Attribution:

overgard #1
DenisM #1
unglaublich #1
oulipo2 #1

Higher output can make the work less satisfying

Not everyone saw speed as progress. Some developers said the more the chatbot does, the less pride they feel in the result. The issue was not productivity but authorship. If the machine produced the artifact, the emotional reward of making something yourself can disappear even when the shipped outcome is better.

If you lead a team, do not measure success only in throughput. People who care about craft may disengage if their role becomes prompt, review, and cleanup all day.

Attribution:

fullstop #1
dd8601fn #1 #2

In plain English

CI

Continuous Integration, an automated process that runs tests and checks when code changes are submitted.

FP4

A 4-bit floating-point representation used to compress model computations or weights.

GPU

Graphics processing unit, a chip originally designed for graphics that is now widely used to train and run AI models.

MCP

Model Context Protocol, a way for AI assistants or other tools to connect to software tools and structured capabilities.

MoE

Mixture of Experts, a model design that routes work to a subset of specialized internal components instead of using the whole model every time.

open weights

AI models whose learned parameters are made available so others can run or adapt them.

ROI

Return on investment, the value gained from time or money spent.

TileRT

Xiaomi’s runtime system for serving these models, intended to improve inference speed on GPUs.

tokens per second

A speed measure for language models showing how many text tokens they generate each second.

TPS

Tokens per second, a speed measure for how quickly a model generates text or code.

Reference links

Benchmarks and rankings

Will it Mythos benchmark post
Used as an anecdotal benchmark source for model speed comparisons
Bench report final
Shared as a larger benchmark run comparing 21 models on speed
Gertlabs rankings
Cited to support the claim that MiMo v2.5 Pro performs strongly on agentic coding benchmarks
Gertlabs one-shot coding rankings
Used to show DeepSeek v4 Pro performs better in one-shot coding than in custom harness tests

Model and infrastructure references

OpenRouter DeepSeek v4 Pro throughput page
Referenced to question whether DeepSeek Pro is really that fast in practice
TileRT GitHub repository
Pointed to as the runtime Xiaomi says it uses for its speedups
vLLM recipe for Xiaomi MiMo v2.5 Pro
Used to clarify the model’s attention and expert configuration
MiMo v2.5 Pro FP4 DFlash model card
Referenced while debating the exact architecture tradeoffs behind the speedup

Fast inference hardware and demos

Cerebras Kimi K2.6 speed post
Shared to compare Xiaomi’s claimed speed with Cerebras’ own throughput claims
Taalas
Mentioned as another company pushing extreme inference speed through hardware specialization
ChatJimmy
Linked as a public demo of very high-speed model inference

Costs, policy, and market structure

Reuters on China's 'involution' competition trend
Used to explain aggressive Chinese pricing pressure
Anthropic 2028 AI leadership
Cited as an example of US lab framing around leadership and model economics
Reuters on OpenAI accusing DeepSeek of distillation
Referenced in the argument that lower-cost Chinese models may benefit from distillation
Wikipedia on solar power in China
Shared to support the claim that China’s energy buildout may help lower inference costs

Workflow and reliability reading

Addy Osmani on cognitive surrender
Recommended as a framing for the risk of over-delegating thinking to AI tools
In-context regression paper
Linked to argue that models may genuinely perform regression-like estimation in context