Xiaomi says its new MiMo-v2.5-Pro-UltraSpeed mode pushes a roughly 1 trillion parameter mixture-of-experts coding model past 1,000 tokens per second, using a stack of software and model-side optimizations rather than exotic hardware. The post highlights FP4 quantization applied mainly to the experts, multi-token prediction, and its TileRT runtime. Pricing is only about 3x the regular tier, which is why the headline landed. Readers saw that as another sign that Chinese providers are forcing down both price and latency for models that are now close enough in quality for many coding tasks.
The useful conclusion was not "faster is always better." It was that sub-second or near-instant responses change how people actually work. Several people using DeepSeek Flash, Gemini Flash, Cerebras, and Groq said fast models keep them in a single-task flow instead of opening three tabs and context-switching while agents grind away. That makes coding agents feel less like batch jobs and more like an interactive tool. But once inference gets this fast, the bottleneck moves immediately to compiles,
CI, flaky tests, tool calls, and human review. A lot of current "agent time" is really waiting on those loops.
People were also clear that raw tokens per second is a misleading metric by itself. A model that burns far more tokens to finish a task can feel slower and costlier than a nominally slower model. Benchmarks and anecdotes in the comments pointed in different directions depending on provider load, harness design, and whether the task was one-shot coding or iterative agent work. That fed a broader point that the field still lacks a good
ROI metric. Buyers want to know cost per completed task, not benchmark scores or a flashy throughput number.
On capability, the consensus was that speed amplifies both upside and failure modes. Fast models are great for cheap validation passes, repeated attempts, live UI iteration, and harnesses that loop through tests until something passes. They are also great at making bad changes before you can stop them. That pushes teams toward more structure around agents. People described planning modes, written specs, test-first guardrails, and even a second model reviewing the first. The shared instinct was that if generation becomes cheap and instant, verification becomes the real product.
The business angle got almost as much attention as the engineering. Many readers think US labs are losing their pricing power as Chinese open-weight or cheaper hosted models get close enough on coding work. Complaints about rising Copilot, Gemini, Claude, and GPT costs showed up repeatedly, along with frustration at closed APIs and provider churn. Even commenters who doubted Xiaomi’s exact numbers still read the announcement as pressure on incumbents. Competing on throughput is now a real axis, not a side show.