HN Debrief

Computer use in Gemini 3.5 Flash

  • AI
  • Developer Tools
  • Enterprise Software
  • Automation

Google’s post introduces computer use for Gemini 3.5 Flash, meaning the model can observe a screen, decide what to click, type into apps or websites, and complete multi-step tasks through a UI instead of a purpose-built API. The pitch is broad compatibility: if software already works for a human, an LLM can in principle operate it too. Google frames 3.5 Flash as fast and cheap enough for this kind of agentic work.

Treat computer-use agents as a fallback layer for messy legacy workflows, not your main integration strategy. If you are choosing a vendor today, app quality, tool support, and instruction-following still matter more than a benchmark bar chart.

Discussion mood

Mostly skeptical and mildly annoyed. People see the feature as potentially useful for ugly real-world software, but they do not trust Gemini’s reliability or Google’s product execution enough to treat it as a leader here.

Key insights

  1. 01

    The weak point is the Gemini product layer

    What kept surfacing was not model capability in isolation but the missing interaction layer around it. No MCP support in the main app, poor native workflow support, and a rough consumer experience make Gemini harder to use than Claude or OpenAI even when the underlying model is competitive enough. That shifts Gemini toward being just another API choice instead of a sticky user-facing platform.

    If you are evaluating Gemini for end-user workflows, test the actual app and tool surface, not just the model. A better model does not help much if users need a separate frontend or custom glue to unlock it.

      Attribution:
    • satvikpendem #1
    • anticorporate #1
    • tonyrice #1
    • solarkraft #1
    • mitchell_h #1
  2. 02

    Computer use is a fallback for software you cannot integrate

    The useful framing was not that screenshot control is the future of software interfaces. It is that enterprises are full of locked-down systems, broken accessibility, and internal tools with no practical API path. In those cases, computer use wins by default because it can act on the only interface available. Where you do have leverage, structured tools or accessibility APIs are a better substrate for LLMs.

    Use computer use for the stubborn edge cases. For anything you own or can influence, invest in direct tool interfaces, shell access, or accessibility hooks so you are not paying screenshot tax forever.

      Attribution:
    • airstrike #1
    • thorum #1
    • chatmasta #1
    • Chu4eeno #1
    • gdudeman #1
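The preference order in this insight can be written down as a small decision function. This is a minimal sketch, not anything from the thread or from Google's post: the `Target` fields and the three interface labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """A system an agent needs to operate (hypothetical descriptor)."""
    name: str
    has_api: bool = False                 # a structured tool or API exists
    has_accessibility_tree: bool = False  # accessibility hooks are usable


def choose_interface(target: Target) -> str:
    """Prefer structured surfaces; fall back to screenshots last."""
    if target.has_api:
        return "structured_tool"
    if target.has_accessibility_tree:
        return "accessibility_api"
    # Only interface left is the screen itself.
    return "computer_use"
```

A locked-down internal tool with neither flag set falls through to `computer_use`, which matches the "wins by default" framing above.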
  3. 03

    Direct APIs beat visual clicking when available

    Several builders argued that mature automations should graduate away from DOM and screenshot loops. Reverse engineering the network calls, extracting a stable JSON endpoint, or building a lightweight browser extension cuts cost dramatically and removes a lot of brittle perception work. The computer-use path is valuable for discovery and one-off tasks, but it is an expensive steady-state architecture.

    If an agent repeatedly touches the same site or workflow, inspect the underlying requests and replace UI driving with a direct integration. Keep the visual agent for bootstrapping, not for production volume.

      Attribution:
    • arjunchint #1
    • Rebelgecko #1
    • revolvingthrow #1
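As a rough illustration of graduating from UI driving to a direct integration, the sketch below reads a value from a JSON endpoint instead of clicking through a page. The URL, the `status` field, and the function names are all assumptions made up for this example. The HTTP call is injectable so the parsing logic can be exercised without a network.

```python
import json
from typing import Callable
from urllib.request import Request, urlopen


def http_get(url: str) -> bytes:
    """Plain GET that asks for JSON; used when no fake is injected."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=10) as resp:
        return resp.read()


def fetch_order_status(order_id: str,
                       get: Callable[[str], bytes] = http_get) -> str:
    """Read the status from a (hypothetical) JSON endpoint found by
    inspecting the requests the web UI makes."""
    url = f"https://example.invalid/api/orders/{order_id}"  # assumed URL
    payload = json.loads(get(url))
    return payload["status"]  # assumed response field
```

One request and one JSON parse replace a loop of screenshots, model calls, and clicks, which is where the cost and brittleness savings described above would come from.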
  4. 04

    Flash is being judged as a speed-price trade

    People were willing to forgive a lot because Flash is fast and cheap. That makes it attractive for search-adjacent help, lightweight agents, or customer-facing flows where retries are acceptable. The catch is obvious. Once failure costs rise, the speed and price advantage stops being a bargain and turns into retries and operator churn.

    Model selection here should be task-tiered. Put Flash on high-volume, low-risk flows and route expensive or stateful work to stronger models before cheap mistakes become operator time.

      Attribution:
    • smallstepforman #1
    • anigbrowl #1
    • staticman2 #1
    • ai_fry_ur_brain #1
    • SoMomentary #1
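Task-tiered selection can be sketched as a simple router. The model labels, the 1 to 5 `failure_cost` scale, and the threshold are hypothetical placeholders rather than real model IDs or a recommended policy.

```python
from dataclasses import dataclass

FAST_MODEL = "fast-cheap-model"  # placeholder label, not a real model ID
STRONG_MODEL = "strong-model"    # placeholder label, not a real model ID


@dataclass(frozen=True)
class Task:
    description: str
    failure_cost: int      # assumed scale: 1 (retry is free) to 5 (operator cleanup)
    stateful: bool = False  # touches state that is hard to undo


def route(task: Task, cost_threshold: int = 3) -> str:
    """Send risky or stateful work to the stronger model."""
    if task.stateful or task.failure_cost >= cost_threshold:
        return STRONG_MODEL
    return FAST_MODEL
```

The threshold is the knob worth revisiting as real failure data comes in, since it decides where cheap mistakes start costing operator time.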
  5. 05

    UI automation gets better when the app exposes test hooks

    Comments about TUI and GUI building pointed to a more promising direction than raw screenshot control. Desktop apps built on frameworks like Qt already expose testing machinery, and tools like Squish can turn that into a cleaner interface for automation. The less the model has to infer from pixels, the more reliable iteration becomes.

    If your product has a GUI, expose structured test surfaces early. You will get better AI automation from instrumented widgets than from hoping a vision model can reliably poke pixels.

      Attribution:
    • fridder #1
    • Chu4eeno #1
    • IncreasePosts #1

Against the grain

  1. 01

    Gemini can outperform rivals on table extraction

    One firsthand report cut against the broader complaints about reliability. For screenshot-to-CSV work on PDF tables, Gemini reportedly beat ChatGPT consistently across several examples. That suggests some of the criticism is task-dependent, and vision-heavy extraction may be one of the areas where Gemini is stronger than its reputation suggests.

    Do not write Gemini off based on coding chatter alone. If your workload is OCR, tables, or document extraction, run your own bake-off before standardizing on another model.

      Attribution:
    • hashta #1
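A bake-off for table extraction needs some way to score each model's CSV against a hand-checked reference. The sketch below uses exact cell matching by position, which is one assumed metric among several reasonable ones; the function names are illustrative.

```python
import csv
import io


def parse_csv(text: str) -> list[list[str]]:
    """Parse CSV text into rows of whitespace-trimmed cells."""
    reader = csv.reader(io.StringIO(text.strip()))
    return [[cell.strip() for cell in row] for row in reader]


def cell_accuracy(expected_csv: str, actual_csv: str) -> float:
    """Fraction of reference cells reproduced exactly in the same position."""
    expected = parse_csv(expected_csv)
    actual = parse_csv(actual_csv)
    total = sum(len(row) for row in expected)
    if total == 0:
        return 0.0
    matches = 0
    for r, row in enumerate(expected):
        for c, cell in enumerate(row):
            if r < len(actual) and c < len(actual[r]) and actual[r][c] == cell:
                matches += 1
    return matches / total
```

Running this over the same set of screenshots for each candidate model gives a comparable number per model. Positional matching penalizes a dropped row heavily, so an alignment-aware metric may suit messier tables better.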
  2. 02

    Some users are not seeing the refusal problem

    Despite repeated complaints about overtuned guardrails, others said Gemini answered the cited prompts normally and that Antigravity worked fine for them. That does not erase the refusal reports, but it does suggest the behavior may vary by surface, rollout, account state, or prompt phrasing rather than reflecting a single blanket policy.

    When teams report that Gemini is unusably restrictive, reproduce the exact prompt in the same product surface and region before making a platform call. The variance itself is a risk, but it is different from a universal hard limit.

      Attribution:
    • sva_ #1
    • kordlessagain #1
    • nout #1

In plain English

API
Application Programming Interface, a way for software to call another service programmatically.
CLI
Command-line interface, a text-based way to use software tools from a terminal.
DOM
Document Object Model, the structured representation of a web page that code can inspect and manipulate.
JSON
JavaScript Object Notation, a common text format for structured data exchanged between systems.
MCP
Model Context Protocol, a way for AI models to connect to external tools and data sources.
OSWorld
A benchmark for testing whether AI models can complete tasks by operating computers and software interfaces.
Qt
A software framework for building graphical desktop applications.
Squish
A commercial GUI test automation tool that can interact with desktop and web applications.
SSO
Single sign-on, a system that lets users access multiple services with one identity provider.
TUI
Text user interface, software operated through text-based screens rather than graphical windows and buttons.

Reference links

Google tools and product references

  • Antigravity CLI
    Google’s CLI product that commenters cited as the current command-line path for Gemini-style coding workflows.