The linked post compared DeepSeek V4 Pro and GPT-5.5 Pro on four ad hoc tasks, with another model acting as judge, then declared DeepSeek more precise at instruction following, schema compliance, and edge cases. That claim did not survive contact with technically literate readers. The dominant reaction was that this is not a serious benchmark. Four single-run tasks, vague scoring, no real test cases, and an LLM judge are nowhere near enough to support a headline like this. Several people also pointed out likely judging errors in the regex example and noted that the article itself reads like AI-generated SEO copy.
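The objections about missing test cases and single runs point at a simple alternative: score outputs against explicit cases, and average over several samples per model instead of trusting one run and an LLM judge. The sketch below is a minimal illustration, not a reconstruction of the post's regex task. The date-matching cases and both candidate patterns are hypothetical, standing in for whatever each model would have produced.

```python
import re
from typing import Iterable, List, Tuple

# (input string, should the pattern match the whole string?)
Case = Tuple[str, bool]

# Hypothetical test cases for a "match YYYY-MM-DD with a valid month" task.
DATE_CASES: List[Case] = [
    ("2024-01-31", True),
    ("2024-1-31", False),   # month not zero-padded
    ("2024-13-01", False),  # month out of range
    ("abcd-01-01", False),  # non-numeric year
]


def pass_rate(pattern: str, cases: List[Case]) -> float:
    """Fraction of cases where the pattern's full-match result equals the expectation."""
    compiled = re.compile(pattern)
    hits = sum(
        (compiled.fullmatch(text) is not None) == expected
        for text, expected in cases
    )
    return hits / len(cases)


def mean_pass_rate(patterns: Iterable[str], cases: List[Case]) -> float:
    """Average pass rate over several sampled outputs from the same model."""
    rates = [pass_rate(p, cases) for p in patterns]
    return sum(rates) / len(rates)
```

Scoring this way makes a disputed verdict checkable: anyone can rerun the cases and see which inputs a pattern got wrong, which is exactly what the judged comparison did not allow.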
What did land was the broader practical point behind the bad article. A lot of firsthand users said DeepSeek V4 Pro and even Flash are now close enough to GPT and Claude on many coding tasks that the economics matter more than tiny ranking differences. The recurring pattern was not “DeepSeek is best.” It was “DeepSeek is cheap enough and competent enough that you should route a lot more work to it.” People using coding agents, review loops, and test-driven workflows said one-shot comparisons miss how these tools are actually used. In real setups, you write a plan, constrain the task, run tests, add an adversarial reviewer, and let the model iterate. Under those conditions, weaker but far cheaper models can be the rational choice for evals, batch jobs, security scans, code generation from clear specs, and any workflow where failures are easy to detect.
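The loop those users describe (generate from a spec, run tests, pass the result to an adversarial reviewer, feed failures back) can be sketched roughly as follows. This is a minimal outline under stated assumptions, not any particular agent framework: `generate`, `run_tests`, and `review` are hypothetical callables the caller supplies, wrapping a model call, a test runner, and a reviewer model respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    code: str
    accepted: bool
    feedback: str  # test log or reviewer objection; empty when accepted


def iterate_until_accepted(
    generate: Callable[[str, str], str],           # (spec, feedback) -> candidate code
    run_tests: Callable[[str], Tuple[bool, str]],  # candidate -> (passed, log)
    review: Callable[[str], Optional[str]],        # candidate -> objection, or None
    spec: str,
    max_rounds: int = 3,
) -> List[Attempt]:
    """Regenerate until tests pass and the reviewer has no objection, or rounds run out."""
    history: List[Attempt] = []
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(spec, feedback)
        passed, log = run_tests(candidate)
        # The reviewer only sees candidates that already pass the tests.
        objection = review(candidate) if passed else None
        if passed and objection is None:
            history.append(Attempt(candidate, True, ""))
            break
        # Feed the failing test log or the reviewer's objection into the next round.
        feedback = log if not passed else objection
        history.append(Attempt(candidate, False, feedback))
    return history
```

The design point is that acceptance is decided by the tests and the reviewer, not by the generating model, so a weaker generator costs extra rounds rather than silent errors.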
The practical consensus was sharper than the headline. GPT-5.5 and Claude still have an edge on ambiguous, open-ended, high-stakes work where the model must infer unstated requirements or recover from unclear prompts. DeepSeek often needs tighter specs and can loop, hallucinate, or go off-pattern more readily. But once the task is well framed and correctness is machine-checkable, price dominates. Multiple users reported that DeepSeek’s caching and token pricing make large-scale API use dramatically cheaper than OpenAI or Anthropic, to the point that frontier models are hard to justify except as escalation paths for the hardest cases. That turned the thread from “is this benchmark fake” into “there may be no durable API moat if open-weight or cheaper Chinese models stay this close.”
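The "escalation path" argument reduces to a small routing rule plus one line of arithmetic. The sketch below assumes a machine-checkable `is_correct` function and a list of tiers ordered from cheapest to most expensive. All names are hypothetical, and the costs are arbitrary units chosen for illustration, not real prices from any provider.

```python
from typing import Callable, List, Optional, Tuple

# (label, cost per call in arbitrary units, function that calls the model)
Tier = Tuple[str, float, Callable[[str], str]]


def route_with_escalation(
    task: str,
    tiers: List[Tier],
    is_correct: Callable[[str], bool],
) -> Tuple[Optional[str], Optional[str], float]:
    """Try tiers in order; return (label, answer, total spent) for the first checked answer."""
    spent = 0.0
    for label, cost, call in tiers:
        spent += cost
        answer = call(task)
        if is_correct(answer):
            return label, answer, spent
    # No tier produced an answer that passed the check.
    return None, None, spent


def expected_cost(cheap_cost: float, frontier_cost: float, cheap_pass_rate: float) -> float:
    """Expected per-task cost when the frontier model is called only after a cheap failure."""
    return cheap_cost + (1.0 - cheap_pass_rate) * frontier_cost
```

With placeholder values of 1 and 20 cost units and an 80% first-pass rate for the cheap tier, the expected cost is 5 units per task, against 20 for sending everything to the frontier tier. The rule only holds where the check is reliable, which is why the thread limited it to workloads where failures are easy to detect.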
The thread also exposed the other constraint that matters now: trust. Some people are comfortable sending work to DeepSeek because they distrust US labs just as much. Others will not send sensitive code to a Chinese-hosted endpoint regardless of quality or price. That keeps self-hosting, third-party providers, and regional compliance options central to adoption. So the useful takeaway was not that DeepSeek definitively beat GPT-5.5 Pro on precision. It was that enough builders now see DeepSeek as a viable default for a large share of coding workloads that the cost curve, not the benchmark chart, is the real story.