Computer use in Gemini 3.5 Flash

AI
Developer Tools
Enterprise Software
Automation

Google’s post introduces computer use for Gemini 3.5 Flash, meaning the model can observe a screen, decide what to click, type into apps or websites, and complete multi-step tasks through a UI instead of a purpose-built API. The pitch is broad compatibility. If software already works for a human, an LLM can in principle operate it too. Google frames 3.5 Flash as fast and cheap enough for this kind of agentic work.

The reaction was not excitement about a breakthrough. It was frustration that Google still seems behind on the surrounding product. People kept coming back to missing or weak pieces in the Gemini ecosystem: no MCP support in the main app, confusion around Gemini CLI versus Antigravity, weak repo-level coding workflows compared with Codex and Claude Code, and an app experience that several people called throttled, forgetful, or just bad. That shaped how the announcement landed. Many saw it less as a new platform and more as Google checking a box its competitors already checked.

On the substance, the consensus was that computer use is a brute-force interface, not an elegant one. It is slow, costly, and easy to derail. But plenty of people still think it will be useful because the world runs on brittle UIs, proprietary internal tools, SSO-gated portals, and software with no usable API. In that setting, screenshot-and-click automation is ugly but real. The sharper framing was that computer use is a fallback for the long tail. If you control the stack, you should expose tools, APIs, accessibility hooks, or direct shell access instead. If you do not control the stack, computer use may be the only thing that works this week.

People were also skeptical of Google’s benchmark presentation. The cited chart showed Gemini close to leading models on an OSWorld-style workload, but not actually ahead of the top Claude and OpenAI entries. Several readers said the more relevant story is price and latency. Flash may make sense as a front-line model where failures are tolerable and retries are cheap. That fit a broader pattern in the comments. Even users unimpressed by Gemini’s quality said they reach for Flash because it is fast and inexpensive.

Reliability remained the biggest drag. Multiple firsthand reports described Gemini as hit-or-miss, quick to regress, prone to giving up, or oddly over-guardrailed. Others said they see the opposite and get solid results, especially through the API rather than the consumer app. The practical conclusion was not that Gemini is unusable. It was that Google still has a consistency problem across interfaces, regions, and task types. For computer use, that is a serious weakness because UI automation amplifies every small mistake into wasted time or broken state.

Treat computer-use agents as a fallback layer for messy legacy workflows, not your main integration strategy. If you are choosing a vendor today, app quality, tool support, and instruction-following still matter more than a benchmark bar chart.

June 24, 2026
blog.google
Discuss on HN

Key insights

The weak point is the Gemini product layer

What kept surfacing was not model capability in isolation but the missing interaction layer around it. No MCP support in the main app, poor native workflow support, and a rough consumer experience make Gemini harder to use than Claude or OpenAI even when the underlying model is competitive enough. That shifts Gemini toward being just another API choice instead of a sticky user-facing platform.

If you are evaluating Gemini for end-user workflows, test the actual app and tool surface, not just the model. A better model does not help much if users need a separate frontend or custom glue to unlock it.

Attribution:

satvikpendem #1
anticorporate #1
tonyrice #1
solarkraft #1
mitchell_h #1

Computer use is a fallback for software you cannot integrate

The useful framing was not that screenshot control is the future of software interfaces. It is that enterprises are full of locked-down systems, broken accessibility, and internal tools with no practical API path. In those cases, computer use wins by default because it can act on the only interface available. Where you do have leverage, structured tools or accessibility APIs are a better substrate for LLMs.

Use computer use for the stubborn edge cases. For anything you own or can influence, invest in direct tool interfaces, shell access, or accessibility hooks so you are not paying screenshot tax forever.
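One way to make the fallback framing concrete is a small decision helper that only reaches for screenshot control when nothing structured is available. This is a minimal sketch, not anything from Google's post or the comments: the `Target` fields, the `Path` names, and the preference order among API, shell, and accessibility hooks are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    """Hypothetical integration paths, listed from most to least structured."""
    DIRECT_API = "direct_api"
    SHELL = "shell"
    ACCESSIBILITY = "accessibility"
    COMPUTER_USE = "computer_use"


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a system an agent has to operate."""
    name: str
    has_api: bool = False
    has_shell: bool = False
    has_accessibility_tree: bool = False


def choose_path(target: Target) -> Path:
    """Prefer structured interfaces; fall back to screenshot control last.

    The ordering of the first three checks is an assumption made for
    illustration. The only point taken from the discussion is that
    computer use sits at the bottom of the list.
    """
    if target.has_api:
        return Path.DIRECT_API
    if target.has_shell:
        return Path.SHELL
    if target.has_accessibility_tree:
        return Path.ACCESSIBILITY
    return Path.COMPUTER_USE
```

For example, `choose_path(Target("sso-gated vendor portal"))` returns `Path.COMPUTER_USE`, while a target you own and have given an API never reaches that branch.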

Attribution:

airstrike #1
thorum #1
chatmasta #1
Chu4eeno #1
gdudeman #1

Direct APIs beat visual clicking when available

Several builders argued that mature automations should graduate away from DOM and screenshot loops. Reverse engineering the network calls, extracting a stable JSON endpoint, or building a lightweight browser extension cuts cost dramatically and removes a lot of brittle perception work. The computer-use path is valuable for discovery and one-off tasks, but it is an expensive steady-state architecture.

If an agent repeatedly touches the same site or workflow, inspect the underlying requests and replace UI driving with a direct integration. Keep the visual agent for bootstrapping, not for production volume.
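As a rough illustration of what graduating away from UI driving can look like, the sketch below reads the same data straight from a JSON endpoint instead of clicking through a page. Everything specific here is assumed: `ORDERS_URL`, the `{"orders": [{"id": ...}]}` response shape, and the function names are hypothetical stand-ins for whatever you find in the browser's network tab.

```python
import json
from typing import Callable
from urllib.request import Request, urlopen

# Hypothetical endpoint, standing in for a request observed in the network tab.
ORDERS_URL = "https://portal.example.com/api/orders?status=open"


def http_get_json(url: str, timeout: float = 10.0) -> dict:
    """Fetch a URL and decode its JSON body using only the standard library."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def open_order_ids(get_json: Callable[[str], dict] = http_get_json) -> list[str]:
    """Return open order IDs from the assumed payload shape.

    The fetcher is injectable so the parsing logic can be exercised
    without touching the network.
    """
    payload = get_json(ORDERS_URL)
    return [str(order["id"]) for order in payload["orders"]]
```

Passing a stub such as `lambda url: {"orders": [{"id": 7}]}` exercises the parsing without a live site. A real integration would also need the site's authentication and error handling, which this sketch leaves out.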

Attribution:

arjunchint #1
Rebelgecko #1
revolvingthrow #1

Flash is being judged as a speed-price trade

People were willing to forgive a lot because Flash is fast and cheap. That makes it attractive for search-adjacent help, lightweight agents, or customer-facing flows where retries are acceptable. The catch is obvious. Once failure costs rise, low latency and low price stop being a bargain and turn into churn.

Model selection here should be task-tiered. Put Flash on high-volume, low-risk flows and route expensive or stateful work to stronger models before cheap mistakes become operator time.
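Task-tiered selection can be as simple as a routing function in front of the model call. The sketch below is one possible shape under stated assumptions: the model names are placeholders rather than real identifiers, and `mutates_state`, `failure_cost`, and the threshold are hypothetical knobs you would have to calibrate yourself.

```python
from dataclasses import dataclass

# Placeholder tier names, not real model identifiers.
FAST_MODEL = "fast-cheap-model"
STRONG_MODEL = "strong-model"


@dataclass(frozen=True)
class Task:
    """Hypothetical task description used only for routing."""
    name: str
    mutates_state: bool   # does a wrong action leave something broken?
    failure_cost: float   # estimated cost of one failed run, arbitrary units


def route(task: Task, max_cheap_failure_cost: float = 1.0) -> str:
    """Send stateful or expensive-to-fail work to the stronger tier.

    Everything else goes to the fast tier, where a retry is assumed
    to be cheaper than the occasional failure.
    """
    if task.mutates_state or task.failure_cost > max_cheap_failure_cost:
        return STRONG_MODEL
    return FAST_MODEL
```

With the default threshold, a read-only lookup with a failure cost of 0.1 goes to the fast tier, while anything that writes to a system of record goes to the strong tier regardless of its estimated cost.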

Attribution:

smallstepforman #1
anigbrowl #1
staticman2 #1
ai_fry_ur_brain #1
SoMomentary #1

UI automation gets better when the app exposes test hooks

Comments about TUI and GUI building pointed to a more promising direction than raw screenshot control. Desktop apps built on frameworks like Qt already expose testing machinery, and tools like Squish can turn that into a cleaner interface for automation. The less the model has to infer from pixels, the more reliable iteration becomes.

If your product has a GUI, expose structured test surfaces early. You will get better AI automation from instrumented widgets than from hoping a vision model can reliably poke pixels.
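To show what a structured test surface buys you, here is a toolkit-free sketch in which widgets register under stable names and an agent acts on names instead of pixel coordinates. It does not use Qt or Squish APIs. `Widget`, `AutomationSurface`, and their methods are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Widget:
    """Hypothetical widget with a stable name, a role, and an optional action."""
    name: str
    role: str
    activate: Callable[[], None] = lambda: None


@dataclass
class AutomationSurface:
    """Registry an app could expose so agents address widgets by name."""
    widgets: dict[str, Widget] = field(default_factory=dict)

    def register(self, widget: Widget) -> None:
        self.widgets[widget.name] = widget

    def describe(self) -> list[dict[str, str]]:
        """Structured listing a model can read instead of a screenshot."""
        return [{"name": w.name, "role": w.role} for w in self.widgets.values()]

    def activate(self, name: str) -> None:
        """Trigger a widget by name; raises KeyError if the name is unknown."""
        self.widgets[name].activate()
```

An agent working against this surface would read `describe()` and call `activate("save_button")`. A misnamed target fails loudly with a `KeyError` instead of silently clicking the wrong pixels.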

Attribution:

fridder #1
Chu4eeno #1
IncreasePosts #1

Against the grain

Gemini can outperform rivals on table extraction

One firsthand report cut against the broader complaints about reliability. For screenshot-to-CSV work on PDF tables, Gemini reportedly beat ChatGPT consistently across several examples. That suggests some of the criticism is task-dependent, and vision-heavy extraction may be one of the areas where Gemini is stronger than its reputation suggests.

Do not write Gemini off based on coding chatter alone. If your workload is OCR, tables, or document extraction, run your own bake-off before standardizing on another model.

Attribution:

hashta #1

Some users are not seeing the refusal problem

Despite repeated complaints about overtuned guardrails, others said Gemini answered the cited prompts normally and that Antigravity worked fine for them. That does not erase the refusal reports, but it does suggest the behavior may vary by surface, rollout, account state, or prompt phrasing rather than reflecting a single blanket policy.

When teams report that Gemini is unusably restrictive, reproduce the exact prompt in the same product surface and region before making a platform call. The variance itself is a risk, but it is different from a universal hard limit.

Attribution:

sva_ #1
kordlessagain #1
nout #1

In plain English

API

Application Programming Interface, a way for software to call another service programmatically.

CLI

Command-line interface, a text-based way to use software tools from a terminal.

DOM

Document Object Model, the structured representation of a web page that code can inspect and manipulate.

JSON

JavaScript Object Notation, a common text format for structured data exchanged between systems.

MCP

Model Context Protocol, a way for AI models to connect to external tools and data sources.

OSWorld

A benchmark for testing whether AI models can complete tasks by operating computers and software interfaces.

Qt

A software framework for building graphical desktop applications.

Squish

A commercial GUI test automation tool that can interact with desktop and web applications.

SSO

Single sign-on, a system that lets users access multiple services with one identity provider.

TUI

Text user interface, software operated through text-based screens rather than graphical windows and buttons.

Reference links

Google announcement and evals

Introducing computer use in Gemini 3.5 Flash
The main announcement describing Gemini 3.5 Flash computer use.
Gemini 3.5 Flash evals methodology
Google’s methodology page for the benchmark results cited in the post and comments.

Google tools and product references

Antigravity CLI
Google’s CLI product that commenters cited as the current command-line path for Gemini-style coding workflows.
